Energy File Optimisation
Comparative analysis of the reading/writing performance (time and size) of CSV, Feather, Parquet, and Pickle formats.

Experimental Plan, Performance Analysis
Project: Alstom Transport Company
This project consisted of an in-depth comparative study of common data file formats (CSV, Feather, Parquet, Pickle) to identify the most efficient format in terms of read/write time and storage size for Alstom's specific energy data. The goal was to recommend an optimized format to enhance the efficiency of data processing pipelines and reduce infrastructure costs, with a clear presentation of results via Power BI.
Context:
As part of the management of energy data at Alstom, the performance of input/output (I/O) operations is a critical factor. Large volumes of data are generated and consumed, and the choice of file format has a direct impact on processing speed, resource utilization, and ultimately the operational efficiency of the systems involved.
Issue:
What is the most suitable file format for Alstom's energy data, considering performance constraints (read/write time) and storage optimization (file size and compression level)? How can the results of this benchmark be effectively visualized and communicated to technical and business teams?
Solution:
The solution involved establishing a rigorous experimental plan in Python. I designed and developed scripts to generate datasets representative of Alstom's energy data, and then to measure read and write times as well as file sizes and their compression levels for each format (CSV, Feather, Parquet, Pickle). Performance data was then collected, transformed, and exported to be visualized interactively on a Power BI dashboard, allowing for easy analysis and informed decision-making.
Achievements:
Design of the Experimental Plan and Benchmark: I defined a detailed experimental plan and a benchmarking methodology to fairly compare file formats based on read/write performance criteria and compression level, using datasets simulating the structure and volume of Alstom's energy data.
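To make the experimental plan concrete, a dataset simulating energy telemetry can be generated as below. This is an illustrative sketch only: the column names (`timestamp`, `train_id`, `voltage_v`, `current_a`, `power_kw`) and distributions are assumptions, not Alstom's actual schema.

```python
import numpy as np
import pandas as pd

def make_energy_data(n_rows: int = 100_000, seed: int = 0) -> pd.DataFrame:
    """Build a synthetic frame with an assumed energy-telemetry layout:
    a timestamp, a categorical train identifier, and numeric sensor readings."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "timestamp": pd.date_range("2023-01-01", periods=n_rows, freq="s"),
        "train_id":  rng.integers(1, 50, size=n_rows),
        "voltage_v": rng.normal(750.0, 15.0, size=n_rows),
        "current_a": rng.normal(300.0, 40.0, size=n_rows),
        "power_kw":  rng.normal(220.0, 30.0, size=n_rows),
    })
```

Varying `n_rows` lets the benchmark cover several dataset volumes, so results reflect how each format scales rather than a single file size.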
Development of Python Scripts: I developed robust Python scripts (using Pandas and format-specific libraries such as pyarrow for Feather/Parquet) to automate performance testing: generating synthetic data, writing in each format, reading it back, and accurately measuring times, sizes, and compression levels.
Data Analysis and Recommendations: I analyzed the raw benchmark results to identify trends and format-specific performance, leading to clear, evidence-based recommendations on the optimal format for the energy data.
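The measurement loop behind such scripts can be sketched as follows. This is a minimal, self-contained version, not the project's actual code: it times write and read for each format with `time.perf_counter` and records the on-disk size, skipping Feather/Parquet gracefully when no engine (pyarrow/fastparquet) is installed.

```python
import os
import tempfile
import time

import pandas as pd

# Writer/reader pairs for each benchmarked format. Feather and Parquet
# need an engine (pyarrow or fastparquet); pandas raises ImportError
# when none is available, and that format is then skipped.
FORMATS = {
    "csv":     (lambda df, p: df.to_csv(p, index=False), pd.read_csv),
    "feather": (lambda df, p: df.to_feather(p),          pd.read_feather),
    "parquet": (lambda df, p: df.to_parquet(p),          pd.read_parquet),
    "pickle":  (lambda df, p: df.to_pickle(p),           pd.read_pickle),
}

def benchmark(df: pd.DataFrame) -> pd.DataFrame:
    """Time write/read and record file size (KB) for each format."""
    rows = []
    with tempfile.TemporaryDirectory() as tmp:
        for name, (write, read) in FORMATS.items():
            path = os.path.join(tmp, f"data.{name}")
            try:
                t0 = time.perf_counter()
                write(df, path)
                t1 = time.perf_counter()
                read(path)
                t2 = time.perf_counter()
            except ImportError:
                continue  # optional engine not installed
            rows.append({
                "format": name,
                "write_s": t1 - t0,
                "read_s": t2 - t1,
                "size_kb": os.path.getsize(path) / 1024,
            })
    return pd.DataFrame(rows)
```

The resulting frame can be exported (e.g. with `to_csv`) to feed the Power BI dashboard described below; in practice each measurement would be repeated several times and averaged to smooth out I/O jitter.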
Creation of Power BI Dashboard: I designed and implemented an interactive dashboard on Power BI. This dashboard visualizes read/write times, file sizes, and compression levels for each format, offering filters and segments for dynamic exploration of results and effective communication to stakeholders.
Technical Stack
Languages: Python
Python Libraries: Pandas, pyarrow (for Feather/Parquet), fastparquet, pickle
File Formats: CSV, Feather, Parquet, Pickle
Data Visualization: Microsoft Power BI
Tools: Jupyter Notebook (for experimentation and analysis)
Tags
Experimental Design, Data Engineering, Data Visualization, Performance Analysis