Energy File Optimisation
Comparative analysis of the reading/writing performance (time and size) of CSV, Feather, Parquet, and Pickle formats.

Experimental Plan, Performance Analysis
Project: Alstom Transport Company
This project consisted of an in-depth comparative study of common data file formats (CSV, Feather, Parquet, Pickle) to identify the most efficient format in terms of read/write time and storage size for Alstom's specific energy data. The goal was to recommend an optimized format to enhance the efficiency of data processing pipelines and reduce infrastructure costs, with a clear presentation of results via Power BI.
Context:
As part of the management of energy data at Alstom, the performance of input/output (I/O) operations is a critical factor. Large volumes of data are generated and consumed, and the choice of file format has a direct impact on processing speed, resource utilization, and ultimately the operational efficiency of the systems involved.
Issue:
What is the most suitable file format for Alstom's energy data, considering performance constraints (read/write time) and storage optimization (file size and compression level)? How can the results of this benchmark be effectively visualized and communicated to technical and business teams?
Solution:
The solution involved establishing a rigorous experimental plan in Python. I designed and developed scripts to generate datasets representative of Alstom's energy data, and then to measure read and write times as well as file sizes and their compression levels for each format (CSV, Feather, Parquet, Pickle). Performance data was then collected, transformed, and exported to be visualized interactively on a Power BI dashboard, allowing for easy analysis and informed decision-making.
Achievements:
Design of the Experimental Plan and Benchmark: I defined a detailed experimental plan and a benchmarking methodology to fairly compare file formats based on read/write performance criteria and compression level, using datasets simulating the structure and volume of Alstom's energy data.
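To make the experimental plan concrete, a dataset simulating energy telemetry can be generated as below. This is an illustrative sketch only: the column names (`timestamp`, `train_id`, `voltage_v`, `current_a`, `power_kw`) and distributions are assumptions, not Alstom's actual schema.

```python
import numpy as np
import pandas as pd

def make_energy_data(n_rows: int = 100_000, seed: int = 0) -> pd.DataFrame:
    """Build a synthetic frame with an assumed energy-telemetry layout:
    a timestamp, a categorical train identifier, and numeric sensor readings."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "timestamp": pd.date_range("2023-01-01", periods=n_rows, freq="s"),
        "train_id":  rng.integers(1, 50, size=n_rows),
        "voltage_v": rng.normal(750.0, 15.0, size=n_rows),
        "current_a": rng.normal(300.0, 40.0, size=n_rows),
        "power_kw":  rng.normal(220.0, 30.0, size=n_rows),
    })
```

Varying `n_rows` lets the benchmark cover several dataset volumes, so results reflect how each format scales rather than a single file size.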
Development of Python Scripts: I developed robust Python scripts (using Pandas and format-specific libraries such as pyarrow for Feather/Parquet) to automate performance testing: generating synthetic data, writing in each format, reading it back, and accurately measuring times, sizes, and compression levels.
Data Analysis and Recommendations: I analyzed the raw benchmark results to identify trends and format-specific performance, leading to clear, evidence-based recommendations on the optimal format for the energy data.
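The measurement loop behind such scripts can be sketched as follows. This is a minimal, self-contained version, not the project's actual code: it times write and read for each format with `time.perf_counter` and records the on-disk size, skipping Feather/Parquet gracefully when no engine (pyarrow/fastparquet) is installed.

```python
import os
import tempfile
import time

import pandas as pd

# Writer/reader pairs for each benchmarked format. Feather and Parquet
# need an engine (pyarrow or fastparquet); pandas raises ImportError
# when none is available, and that format is then skipped.
FORMATS = {
    "csv":     (lambda df, p: df.to_csv(p, index=False), pd.read_csv),
    "feather": (lambda df, p: df.to_feather(p),          pd.read_feather),
    "parquet": (lambda df, p: df.to_parquet(p),          pd.read_parquet),
    "pickle":  (lambda df, p: df.to_pickle(p),           pd.read_pickle),
}

def benchmark(df: pd.DataFrame) -> pd.DataFrame:
    """Time write/read and record file size (KB) for each format."""
    rows = []
    with tempfile.TemporaryDirectory() as tmp:
        for name, (write, read) in FORMATS.items():
            path = os.path.join(tmp, f"data.{name}")
            try:
                t0 = time.perf_counter()
                write(df, path)
                t1 = time.perf_counter()
                read(path)
                t2 = time.perf_counter()
            except ImportError:
                continue  # optional engine not installed
            rows.append({
                "format": name,
                "write_s": t1 - t0,
                "read_s": t2 - t1,
                "size_kb": os.path.getsize(path) / 1024,
            })
    return pd.DataFrame(rows)
```

The resulting frame can be exported (e.g. with `to_csv`) to feed the Power BI dashboard described below; in practice each measurement would be repeated several times and averaged to smooth out I/O jitter.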
Creation of Power BI Dashboard: I designed and implemented an interactive dashboard on Power BI. This dashboard visualizes read/write times, file sizes, and compression levels for each format, offering filters and segments for dynamic exploration of results and effective communication to stakeholders.
Technical Stack
Languages: Python
Python Libraries: Pandas, pyarrow (for Feather/Parquet), fastparquet, pickle
File Formats: CSV, Feather, Parquet, Pickle
Data Visualization: Microsoft Power BI
Tools: Jupyter Notebook (for experimentation and analysis)
Tags
Experimental Design, Data Engineering, Data Visualization, Performance Analysis