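In the Java and Python implementations that convert Avro to Parquet, the schema is always defined in Avro. What we want to try here is defining the Parquet schema directly, then filling in data against it and generating a Parquet file.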
I. Simple field definitions
1. Define the schema and generate a Parquet file
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Define the schema
    schema = pa.schema([
        ('id', pa.int32()),
        ('email', pa.string())
    ])

    # Prepare the data (the e-mail values are placeholders)
    ids = pa.array([1, 2], type=pa.int32())
    emails = pa.array(['user1@example.com', 'user2@example.com'], type=pa.string())

    # Build the Parquet data
    batch = pa.RecordBatch.from_arrays(
        [ids, emails],
        schema=schema
    )
    table = pa.Table.from_batches([batch])

    # Write the Parquet file plain.parquet
    pq.write_table(table, 'plain.parquet')
2. Verify the Parquet data file
We can use the parquet-tools utility to inspect the data and schema of the plain.parquet file.
    $ parquet-tools schema plain.parquet
    message schema {
      optional int32 id;
      optional binary email (STRING);
    }

    $ parquet-tools cat --json plain.parquet
    {"id":1,"email":"[email protected]"}
    {"id":2,"email":"[email protected]"}
No problems: this matches what we expected. We can also use pyarrow code to read back the schema and the data.
    schema = pq.read_schema('plain.parquet')
    print(schema)

    df = pd.read_parquet('plain.parquet')
    print(df.to_json())
The output lists the schema fields id: int32 and email: string, followed by the same two rows as a JSON object keyed by column name and then by row index.
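II. Definitions with nested fields
The schema below adds a nested object: address contains email_address and post_address. The schema definition and the code that generates the Parquet file are as follows.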
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Inner fields
    address_fields = [
        ('email_address', pa.string()),
        ('post_address', pa.string()),
    ]

    # Define the Parquet schema; address nests address_fields
    schema = pa.schema([
        ('id', pa.int32()),
        ('address', pa.struct(address_fields)),
    ])

    # Prepare the data (the e-mail values are placeholders)
    ids = pa.array([1, 2], type=pa.int32())
    addresses = pa.array(
        [
            {'email_address': 'user1@example.com', 'post_address': 'city1'},
            {'email_address': 'user2@example.com', 'post_address': 'city2'},
        ],
        type=pa.struct(address_fields)
    )

    # Build the Parquet data
    batch = pa.RecordBatch.from_arrays(
        [ids, addresses],
        schema=schema
    )
    table = pa.Table.from_batches([batch])

    # Write the Parquet data to a file
    pq.write_table(table, 'nested.parquet')
1. Verify the Parquet data file
Again, use parquet-tools to take a look at the nested.parquet file.
    $ parquet-tools schema nested.parquet
    message schema {
      optional int32 id;
      optional group address {
        optional binary email_address (STRING);
        optional binary post_address (STRING);
      }
    }

    $ parquet-tools cat --json nested.parquet
    {"id":1,"address":{"email_address":"[email protected]","post_address":"city1"}}
    {"id":2,"address":{"email_address":"[email protected]","post_address":"city2"}}
The schema shown by parquet-tools does not contain the word struct, but it does reflect the nesting between address and its child fields.
Now let's see what the schema and data of the nested.parquet file look like when read with pyarrow code.
    schema = pq.read_schema("nested.parquet")
    print(schema)

    df = pd.read_parquet('nested.parquet')
    print(df.to_json())
Output:
    id: int32
      -- field metadata --
      PARQUET:field_id: '1'
    address: struct<email_address: string, post_address: string>
      child 0, email_address: string
        -- field metadata --
        PARQUET:field_id: '3'
      child 1, post_address: string
        -- field metadata --
        PARQUET:field_id: '4'
      -- field metadata --
      PARQUET:field_id: '2'
    {"id":{"0":1,"1":2},"address":{"0":{"email_address":"[email protected]","post_address":"city1"},"1":{"email_address":"[email protected]","post_address":"city2"}}}
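The data is of course the same. The slight difference is that in the displayed schema, address is marked as struct<email_address: string, post_address: string>, which states explicitly that it is a struct type rather than only showing the nesting level.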
Original article: https://www.tuicool.com/articles/mEfMZrM