hive - Avro, Parquet and SequenceFile formats: their position in the Hadoop ecosystem and their utility -


I have seen different file formats being used while importing data and storing it in HDFS, and data processing engines use these formats while performing their own procedures. What kind of difference do these file formats make, and how is the choice made for different use cases? Being a newbie, this creates confusion for me. Kindly clarify.

The choice depends on the use case you are facing: the type of data you have, compatibility with your processing tools, schema evolution, file size, the type of queries, and read performance.

In general:

  • Avro is more suitable for event data whose schema can change over time
  • SequenceFile is more suitable for intermediate datasets sharded between MapReduce jobs
  • Parquet is more suitable for analytics due to its columnar format (see the sketch after this list)
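
To make the contrast concrete, here is a minimal sketch (assuming a Spark installation with the spark-avro package on the classpath; the paths, table contents and column names are made up for illustration) that writes the same small dataset in each of the three formats:

```python
# A rough sketch: writing the same data in the three formats with PySpark.
# Assumes Spark was started with the spark-avro package, e.g.
#   --packages org.apache.spark:spark-avro_2.12:3.5.0
# All paths and column names below are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "click", "2020-01-01"), (2, "view", "2020-01-02")],
    ["id", "event", "day"],
)

# Avro: row-oriented, schema stored in the file, good for evolving event data
df.write.format("avro").mode("overwrite").save("/tmp/events_avro")

# Parquet: columnar, best for analytical queries that touch few columns
df.write.mode("overwrite").parquet("/tmp/events_parquet")

# SequenceFile: key/value pairs, handy as intermediate output between MR jobs
df.rdd.map(lambda r: (r.id, r.event)).saveAsSequenceFile("/tmp/events_seq")
```

The DataFrame writer covers Avro and Parquet directly, while the SequenceFile write goes through the pair-RDD API, because a SequenceFile is a key/value container rather than a table format.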

Here are some keys that can help you:

Writing performance (more + means faster):

  • SequenceFile: +++
  • Avro: ++
  • Parquet: +

Reading performance (more + means faster):

  • SequenceFile: +
  • Avro: +++
  • Parquet: +++++

File sizes (more + means smaller files):

  • SequenceFile: +
  • Avro: ++
  • Parquet: +++

And here are some facts about each file type:

Avro:

  • Better at schema evolution than the other formats (see the sketch after this list)
  • Is a row-oriented binary format
  • Has a schema; the file contains the schema in addition to the data
  • Can be compressed
  • Is a compact, fast binary format
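
As a small illustration of that schema evolution, here is a sketch using the fastavro Python library (the record name, field names and file path are hypothetical): data written with an old schema stays readable with a newer schema that adds a field with a default value.

```python
# A minimal sketch of Avro schema evolution with fastavro.
import fastavro

schema_v1 = {
    "type": "record", "name": "Event",
    "fields": [{"name": "id", "type": "long"},
               {"name": "event", "type": "string"}],
}

# Version 2 adds a field with a default, so files written with v1 stay readable.
schema_v2 = {
    "type": "record", "name": "Event",
    "fields": [{"name": "id", "type": "long"},
               {"name": "event", "type": "string"},
               {"name": "source", "type": "string", "default": "unknown"}],
}

with open("/tmp/events.avro", "wb") as out:
    fastavro.writer(out, fastavro.parse_schema(schema_v1),
                    [{"id": 1, "event": "click"}])

# Read v1 data with the newer v2 schema; the missing "source" field
# is filled in from its default.
with open("/tmp/events.avro", "rb") as fo:
    for record in fastavro.reader(fo, reader_schema=fastavro.parse_schema(schema_v2)):
        print(record)  # {'id': 1, 'event': 'click', 'source': 'unknown'}
```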

Parquet:

  • Slower to write, faster to read
  • Is a column-oriented binary format
  • Supports compression
  • Optimized and efficient in terms of disk I/O when only specific columns need to be queried (see the sketch below)
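
A quick sketch of that column-pruning benefit, using the pyarrow library (the file path and column names are again made up): only the requested column is read back, while the other columns are skipped.

```python
# A small sketch of Parquet's columnar advantage with pyarrow.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3],
                  "event": ["click", "view", "click"],
                  "payload": ["...", "...", "..."]})
pq.write_table(table, "/tmp/events.parquet", compression="snappy")

# Only the "event" column is read; "id" and "payload" are not scanned.
events_only = pq.read_table("/tmp/events.parquet", columns=["event"])
print(events_only.to_pydict())
```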

SequenceFile:

  • Is a row-oriented format
  • Supports splitting even when the data is compressed
  • Can be used to pack small files in Hadoop (see the sketch below)
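
For the small-files use case, here is a rough PySpark sketch (directory paths are hypothetical) that packs a directory of tiny files into one SequenceFile of (filename, content) pairs instead of keeping many small HDFS files:

```python
# A sketch of packing small files into a single SequenceFile with PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pack-small-files").getOrCreate()
sc = spark.sparkContext

# wholeTextFiles yields (path, file_content) pairs, one per small file
small_files = sc.wholeTextFiles("hdfs:///data/many_small_files/")
small_files.saveAsSequenceFile("hdfs:///data/packed_seq")

# Later jobs read the packed data back as key/value pairs
packed = sc.sequenceFile("hdfs:///data/packed_seq")
print(packed.count())
```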

I hope this answers your question.

