RCF and ORC

  • RCFile (Record Columnar File) is the older format and is rarely used anymore
  • ORC (Optimized Row Columnar) typically achieves the highest compression ratio of the common Hadoop file formats
  • RCFile was developed at Facebook; ORC was later created by Hortonworks as its successor. Both store data in columnar form (see the sketch below)
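
As a rough illustration of working with ORC, here is a minimal PySpark sketch (the path /tmp/users_orc and the toy columns are made up for the example) that writes a small DataFrame to ORC and reads a single column back:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-demo").getOrCreate()

# Toy DataFrame; in practice this would be real table data.
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# ORC lays the data out column by column, which is what enables its
# aggressive compression and column pruning on read.
df.write.mode("overwrite").orc("/tmp/users_orc")

# Selecting a single column only needs that column's data on disk.
spark.read.orc("/tmp/users_orc").select("name").show()
```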

Parquet

  • Parquet was developed by Cloudera and Twitter. It is used more widely in industry, and decompression/reads are generally faster than with ORC. Spark uses Parquet as its default format (see the sketch after this list)
  • There is little wasted space in these columnar formats: they allocate no space for NULL values and encode duplicate values efficiently (e.g. dictionary and run-length encoding)
  • They are faster to read from than plain text files
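
A minimal PySpark sketch, again with a hypothetical path and toy columns, showing that a plain save() falls back to Parquet because spark.sql.sources.default is "parquet" out of the box:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "alice", None), (2, "bob", 42)],  # NULL score in the first row
    ["id", "name", "score"],
)

# save() with no explicit format uses spark.sql.sources.default,
# which is "parquet" unless it has been reconfigured.
df.write.mode("overwrite").save("/tmp/users_parquet")

# NULLs occupy no data space, and repeated values are dictionary /
# run-length encoded by the Parquet writer.
spark.read.parquet("/tmp/users_parquet").show()
```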

Avro

  • Avro is a data serialization system native to Hadoop; it is also a language-independent format
  • .avsc (Avro schema file), .avro (Avro data file)
  • The schema is defined in JSON, and records are stored internally as field/value pairs in a compact binary encoding
  • Uses RPC for data movement (see the sketch below)
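
A small sketch using the third-party fastavro library (one of several Avro bindings; the schema, namespace, and file name below are made up) showing the JSON schema that would live in an .avsc file and the binary data that ends up in the .avro file:

```python
from fastavro import parse_schema, reader, writer

# The dict below is exactly what an .avsc schema file would contain (as JSON).
schema = parse_schema({
    "type": "record",
    "name": "User",
    "namespace": "example",  # hypothetical namespace
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
})

records = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]

# Writing produces a compact binary .avro file with the schema in its header.
with open("users.avro", "wb") as out:
    writer(out, schema, records)

# The reader recovers the records using the schema embedded in the file.
with open("users.avro", "rb") as inp:
    for record in reader(inp):
        print(record)
```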

Sequence File

SequenceFile is a Hadoop flat file format that stores data as binary key-value pairs (see the sketch below)
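
A minimal PySpark sketch (with a hypothetical path) writing an RDD of key-value pairs out as a SequenceFile and reading it back:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("seqfile-demo").getOrCreate()
sc = spark.sparkContext

# Key-value pairs; keys and values are serialized as Hadoop Writables.
pairs = sc.parallelize([(1, "alice"), (2, "bob")])
pairs.saveAsSequenceFile("/tmp/users_seq")

# sequenceFile() converts the Writables back into Python objects.
print(sc.sequenceFile("/tmp/users_seq").collect())
```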

Hadoop File Formats, when and what to use?

  • New in Hadoop: Various File Format in Hadoop | Towards Data Science