hive - Avro, Parquet and SequenceFile formats: their position in the Hadoop ecosystem and their utility -


I have seen different file formats being used while importing data and storing it in HDFS, and data processing engines use these formats while performing their own procedures. What kind of difference do these file formats make, and how is the choice made for different use cases? Being a newbie, this creates confusion for me. Kindly clarify.

The choice depends on the use case you are facing: the type of data you have, compatibility with your processing tools, schema evolution, file size, the type of queries, and read performance.

In general:

  • Avro is more suitable for event data whose schema can change over time
  • SequenceFile is more suitable for intermediate datasets sharded between MapReduce jobs
  • Parquet is more suitable for analytics due to its columnar format (see the sketch after this list)
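
To make the contrast concrete, here is a minimal sketch (assuming a Spark installation with the spark-avro package on the classpath; the paths, table contents and column names are made up for illustration) that writes the same small dataset in each of the three formats:

```python
# A rough sketch: writing the same data in the three formats with PySpark.
# Assumes Spark was started with the spark-avro package, e.g.
#   --packages org.apache.spark:spark-avro_2.12:3.5.0
# All paths and column names below are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "click", "2020-01-01"), (2, "view", "2020-01-02")],
    ["id", "event", "day"],
)

# Avro: row-oriented, schema stored in the file, good for evolving event data
df.write.format("avro").mode("overwrite").save("/tmp/events_avro")

# Parquet: columnar, best for analytical queries that touch few columns
df.write.mode("overwrite").parquet("/tmp/events_parquet")

# SequenceFile: key/value pairs, handy as intermediate output between MR jobs
df.rdd.map(lambda r: (r.id, r.event)).saveAsSequenceFile("/tmp/events_seq")
```

The DataFrame writer covers Avro and Parquet directly, while the SequenceFile write goes through the pair-RDD API, because a SequenceFile is a key/value container rather than a table format.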

Here are some keys that can help you:

Writing performance (more + means faster):

  • SequenceFile: +++
  • Avro: ++
  • Parquet: +

Reading performance (more + means faster):

  • SequenceFile: +
  • Avro: +++
  • Parquet: +++++

File sizes (more + means smaller files):

  • SequenceFile: +
  • Avro: ++
  • Parquet: +++

And here are some facts about each file type:

Avro:

  • Better at schema evolution than the other formats (see the sketch after this list)
  • Is a row-oriented binary format
  • Has a schema; the file contains the schema in addition to the data
  • Can be compressed
  • Is a compact, fast binary format
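
As a small illustration of that schema evolution, here is a sketch using the fastavro Python library (the record name, field names and file path are hypothetical): data written with an old schema stays readable with a newer schema that adds a field with a default value.

```python
# A minimal sketch of Avro schema evolution with fastavro.
import fastavro

schema_v1 = {
    "type": "record", "name": "Event",
    "fields": [{"name": "id", "type": "long"},
               {"name": "event", "type": "string"}],
}

# Version 2 adds a field with a default, so files written with v1 stay readable.
schema_v2 = {
    "type": "record", "name": "Event",
    "fields": [{"name": "id", "type": "long"},
               {"name": "event", "type": "string"},
               {"name": "source", "type": "string", "default": "unknown"}],
}

with open("/tmp/events.avro", "wb") as out:
    fastavro.writer(out, fastavro.parse_schema(schema_v1),
                    [{"id": 1, "event": "click"}])

# Read v1 data with the newer v2 schema; the missing "source" field
# is filled in from its default.
with open("/tmp/events.avro", "rb") as fo:
    for record in fastavro.reader(fo, reader_schema=fastavro.parse_schema(schema_v2)):
        print(record)  # {'id': 1, 'event': 'click', 'source': 'unknown'}
```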

Parquet:

  • Slower to write, faster to read
  • Is a column-oriented binary format
  • Supports compression
  • Optimized and efficient in terms of disk I/O when only specific columns need to be queried (see the sketch below)
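
A quick sketch of that column-pruning benefit, using the pyarrow library (the file path and column names are again made up): only the requested column is read back, while the other columns are skipped.

```python
# A small sketch of Parquet's columnar advantage with pyarrow.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3],
                  "event": ["click", "view", "click"],
                  "payload": ["...", "...", "..."]})
pq.write_table(table, "/tmp/events.parquet", compression="snappy")

# Only the "event" column is read; "id" and "payload" are not scanned.
events_only = pq.read_table("/tmp/events.parquet", columns=["event"])
print(events_only.to_pydict())
```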

SequenceFile:

  • Is a row-oriented format
  • Supports splitting even when the data is compressed
  • Can be used to pack small files in Hadoop (see the sketch below)
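
For the small-files use case, here is a rough PySpark sketch (directory paths are hypothetical) that packs a directory of tiny files into one SequenceFile of (filename, content) pairs instead of keeping many small HDFS files:

```python
# A sketch of packing small files into a single SequenceFile with PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pack-small-files").getOrCreate()
sc = spark.sparkContext

# wholeTextFiles yields (path, file_content) pairs, one per small file
small_files = sc.wholeTextFiles("hdfs:///data/many_small_files/")
small_files.saveAsSequenceFile("hdfs:///data/packed_seq")

# Later jobs read the packed data back as key/value pairs
packed = sc.sequenceFile("hdfs:///data/packed_seq")
print(packed.count())
```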

I hope this answers your question.

