In today’s day and age we are completely surrounded by data, whether in the form of video, text, images, tables, or something else, and we want to store this data somewhere in the form of files.
While storing it, we want to make sure the chosen file format doesn’t occupy a lot of disk space, for two simple reasons:
- Have more resources left for other storage on disk.
- Transferring the file via any network would be faster if the file size is small.
Now, speaking from the perspective of a data scientist or a data engineer, optimized storage is ‘Yay!!’, but we also want to run our code against these files. That requires us to take into consideration the time taken to read and write these files to and from disk.
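The three quantities discussed here (write time, read time, and size on disk) can be measured with a small helper. The following is only a minimal sketch, not the code behind the results in this post: `benchmark_format`, `write_csv`, `read_csv`, and the tiny stand-in table are hypothetical names and data chosen for illustration, and the standard-library `csv` module stands in for whichever library you actually use.

```python
import csv
import os
import tempfile
import time


def benchmark_format(write_fn, read_fn, path):
    """Time one write and one read of `path` and report its size on disk."""
    start = time.perf_counter()
    write_fn(path)
    write_seconds = time.perf_counter() - start

    start = time.perf_counter()
    read_fn(path)
    read_seconds = time.perf_counter() - start

    return {
        "write_seconds": write_seconds,
        "read_seconds": read_seconds,
        "size_mb": os.path.getsize(path) / (1024 * 1024),
    }


# Hypothetical stand-in data: 1,000 rows x 10 integer columns.
rows = [[r * c for c in range(10)] for r in range(1000)]


def write_csv(path):
    # newline="" is what the csv module expects for files it writes.
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)


def read_csv(path):
    with open(path, newline="") as handle:
        return list(csv.reader(handle))


with tempfile.TemporaryDirectory() as tmp_dir:
    result = benchmark_format(write_csv, read_csv, os.path.join(tmp_dir, "data.csv"))

print(result)
```

Any other format can be plugged in by passing a different pair of write and read callables. A single run like this is noisy, so repeating it and averaging would give steadier numbers.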
For a while now we have all been using the CSV file format to store our tabular data on disk, but let us see how we can be more efficient and still store our tabular data on disk. The results below compare the CSV, Feather, Parquet, and Jay file formats.
Two libraries were used in benchmarking the file formats.
The reason for two libraries is that datatable doesn’t support the Parquet and Feather file formats but does support CSV and Jay, and CSV reads and writes are very fast using datatable.
Each data file comprises 100 columns and 500K rows.
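A table of this shape could be generated along the following lines, one column type at a time (integer, float, string, boolean) or as an even mix. This is a rough sketch under assumptions: `make_column` and `make_table` are hypothetical helpers, and the value ranges, the 8-character strings, and the fixed seed are illustrative choices, not the ones used for the results in this post. The row count is kept small so the example runs quickly.

```python
import random
import string


def make_column(dtype, n_rows, rng):
    """Return a list of `n_rows` random values of the requested type."""
    if dtype == "integer":
        return [rng.randint(0, 1_000_000) for _ in range(n_rows)]
    if dtype == "float":
        return [rng.random() for _ in range(n_rows)]
    if dtype == "string":
        return ["".join(rng.choices(string.ascii_lowercase, k=8)) for _ in range(n_rows)]
    if dtype == "boolean":
        return [rng.random() < 0.5 for _ in range(n_rows)]
    raise ValueError(f"unknown dtype: {dtype}")


def make_table(dtypes, n_rows, n_cols, seed=0):
    """Build a column-name -> values mapping, cycling through `dtypes`."""
    rng = random.Random(seed)
    return {
        f"col_{i}": make_column(dtypes[i % len(dtypes)], n_rows, rng)
        for i in range(n_cols)
    }


# Cycling four types over 100 columns gives 25 columns of each type.
mixed = make_table(["integer", "float", "string", "boolean"], n_rows=1_000, n_cols=100)
floats_only = make_table(["float"], n_rows=1_000, n_cols=100)

print(len(mixed), len(mixed["col_0"]))
```

Passing `n_rows=500_000` would match the size described above, though building that many values in pure Python takes a while.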
Now if we talk about the data, we also need to keep in mind the datatype of the data that we are storing. I have taken 4 datatypes into consideration, along with a mix of all four, and show results for each:
Integer
Read fastest: Jay (0.01s), closely followed by Feather (0.04s)
Write fastest: Feather (0.33s), closely followed by Jay (0.39s)
Least size on disk: Parquet with gzip compression (156.76MB)
Float
Read fastest: Jay (0.01s)
Write fastest: Jay (0.75s), closely followed by Feather (0.87s)
Least size on disk: Feather and Jay, neck and neck with each other (381.48MB)
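The 381.48MB figure is close to what one would expect if every value were stored as an uncompressed 8-byte number. The arithmetic below is a sketch that rests on two assumptions the post does not state: 64-bit values, and MB meaning 2^20 bytes.

```python
n_rows = 500_000
n_cols = 100
bytes_per_value = 8  # assumption: each value is stored as a 64-bit number

raw_bytes = n_rows * n_cols * bytes_per_value
raw_mib = raw_bytes / (1024 * 1024)  # assumption: MB here means 2**20 bytes

print(f"{raw_bytes:,} bytes = {raw_mib:.2f} MiB")  # 400,000,000 bytes = 381.47 MiB
```

Under those assumptions, the small remaining difference would be file metadata, which suggests these two files carry little or no compression for this data.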
String
Read fastest: Jay (0.02s)
Write fastest: CSV (2.21s), closely followed by Jay (2.29s)
Least size on disk: Parquet with brotli compression (413.04MB)
Boolean
Read fastest: Jay (0.01s)
Write fastest: Jay (0.69s)
Least size on disk: Parquet with brotli compression (6.76MB)
Mixed (25% Integer, 25% Float, 25% String, 25% Boolean)
Read fastest: Jay (0.01s)
Write fastest: Jay (0.62s)
Least size on disk: Parquet with brotli compression (241.51MB)
While Jay is super-fast in a lot of cases, it ends up taking more space than even CSV for the boolean and string datatypes, though it is comparable to Parquet and Feather for the other datatypes.
CSV seems to be very fast using the datatable library but ends up occupying a lot more space than the other file formats. The read and write operations are so fast because the library has optimized its IO using multi-threading, which makes performance heavily dependent on the machine that hosts your code. Of course the library has done an excellent job there, but is the same enhancement available outside this library, and would it optimize reads and writes across languages?
Feather is a file format that sometimes outperforms even Parquet, but it is really not the file format to use for saving boolean data.
This also raises the question of what your use case is: does your application perform more reads or more writes? In some cases a particular file format takes comparatively more time to write than to read, so if your application writes more than it reads, you would use the file format that is better optimized for writes than for reads on your datasets.
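One way to make that trade-off concrete is to weight each format's read and write times by how often your application does each. The sketch below uses the Jay and Feather timings from the first set of results above (Jay: 0.01s read, 0.39s write; Feather: 0.04s read, 0.33s write). The helper names and the two workloads are hypothetical, and the model ignores everything except raw read and write time.

```python
def total_seconds(read_s, write_s, n_reads, n_writes):
    """Total IO time for a workload, given per-operation timings."""
    return read_s * n_reads + write_s * n_writes


# (read seconds, write seconds), taken from the first set of results above.
timings = {"Jay": (0.01, 0.39), "Feather": (0.04, 0.33)}


def best_format(n_reads, n_writes):
    """Return the format with the lowest total IO time for this workload."""
    return min(
        timings,
        key=lambda name: total_seconds(*timings[name], n_reads, n_writes),
    )


print(best_format(n_reads=10, n_writes=1))  # Jay (0.49s vs 0.73s)
print(best_format(n_reads=1, n_writes=10))  # Feather (3.34s vs 3.91s)
```

With these numbers the read-heavy workload favors Jay and the write-heavy one favors Feather, which is the kind of flip described above.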
All that said, this is a very isolated result. The benchmark would differ from dataset to dataset with varying datatypes, and it is best to see what performs best for you and your dataset.
You can run your own benchmarks using this code to compare various file formats.