Python library for data engineering

Design and development of a Python library (alstompy) within a team of 5 people to interact with the data from Alstom's HealthHub predictive maintenance platform, improving access to and analysis of railway data.

Software Engineering, Transport, Data

Project: Alstom Transport Company

Jan. 2022 - Aug. 2022

Duration: 8 months

Presentation:

The alstompy project covered the design and development, within a team of 5 people, of a Python library for Alstom's employees and customers. Its main goal was to facilitate interaction with the large datasets captured by the HealthHub platform, an intelligent predictive maintenance solution for railway systems (trains, infrastructure, signalling). The library made this complex data accessible to a wider range of users and easier to use for in-depth analysis.

Context:

In the industrial and transport sectors, predictive maintenance is crucial to the reliability and safety of railway operations. Alstom's HealthHub platform collects large volumes of data (events, KPIs, incidents, time series) from railway systems. However, accessing and manipulating this data for custom analyses or third-party integrations was often difficult for non-expert users.

Problem:

How can Alstom's employees and customers interact simply, effectively, and securely with the raw and processed data from the HealthHub platform, without needing in-depth expertise in database manipulation or complex API calls? The goal was to create a standardized tool for data access, integration with other systems, and automation of data engineering tasks.

Solution:

The solution was to design and implement the alstompy Python library. It encapsulates the complexity of interacting with the HealthHub platform and its data sources (notably Amazon S3 and APIs) behind a set of intuitive Python functions for retrieving, processing, and exporting data. Dedicated connectors were developed to import data from Amazon S3 and to create data sources for Tableau Software, so that the library fits into the existing analysis ecosystem.
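
As a rough illustration of what an S3 import connector of this kind might look like, the sketch below downloads a CSV object and parses it into rows. It is not the alstompy API: the function name read_csv_from_s3, the FakeS3Client helper, and the bucket and key names are all hypothetical. Only the get_object(Bucket=..., Key=...) call mirrors the real boto3 S3 client. The client is passed in as a parameter so the function can be tried without AWS credentials.

```python
import csv
import io


def read_csv_from_s3(client, bucket, key, encoding="utf-8"):
    """Download one CSV object from S3 and return its rows as dictionaries.

    `client` is expected to behave like a boto3 S3 client, i.e. to expose
    get_object(Bucket=..., Key=...) returning a dict with a readable "Body".
    The function name and parameters are illustrative, not alstompy's API.
    """
    response = client.get_object(Bucket=bucket, Key=key)
    text = response["Body"].read().decode(encoding)
    return list(csv.DictReader(io.StringIO(text)))


class FakeS3Client:
    """In-memory stand-in for a boto3 client (hypothetical test helper)."""

    def __init__(self, objects):
        self._objects = objects  # maps (bucket, key) -> bytes

    def get_object(self, Bucket, Key):
        return {"Body": io.BytesIO(self._objects[(Bucket, Key)])}


if __name__ == "__main__":
    fake = FakeS3Client(
        {("demo-bucket", "events.csv"): b"id,kind\n1,alert\n2,incident\n"}
    )
    print(read_csv_from_s3(fake, "demo-bucket", "events.csv"))
```

With real credentials, a client created with boto3.client("s3") could be passed in place of the fake one. Injecting the client keeps the parsing logic testable offline.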

Achievements:

Gathering and Integrating Needs: I collaborated closely with the main stakeholders (technical, business, and managerial teams) to gather, integrate, and refine the functional and technical requirements of the library, ensuring its alignment with strategic objectives.
Development of Key Python Functions: I participated in the design and implementation of robust Python functions to effectively interact with the data from the HealthHub platform, enabling queries, filters, and complex data transformations.
Implementation of Data Connectors: I contributed to the development of Python connectors for importing data from Amazon S3 and for the automated creation of data sources for Tableau Software, thereby simplifying data access for visualization.
Design of Complementary Services: I participated in the design and planning of additional services, including the extension of functionalities related to Tableau Server and the enhancement of the Task Scheduler component for orchestrating and scheduling Python scripts.
Testing and Code Quality: I contributed to setting up and running unit and integration tests (using Pytest) to verify the robustness, reliability, and compliance of the library's functions. I also helped write the library's complete technical and user documentation, and I identified and reported bugs, which supported the maintainability of the code.
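
To make the filtering and testing work described above more concrete, here is a minimal sketch of a time-window filter over event records, together with a pytest-style test. The Event record, its fields, and filter_events are assumptions made for illustration; the actual alstompy functions and the HealthHub data schema are not shown in this description.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """Hypothetical event record; the real HealthHub schema may differ."""
    timestamp: datetime
    kind: str       # e.g. "alert", "kpi", "incident"
    asset_id: str


def filter_events(events, start, end, kinds=None):
    """Return events with start <= timestamp < end, optionally limited to `kinds`."""
    wanted = set(kinds) if kinds is not None else None
    return [
        e for e in events
        if start <= e.timestamp < end and (wanted is None or e.kind in wanted)
    ]


def test_filter_events_by_window_and_kind():
    """Pytest collects this function by its test_ prefix; plain asserts suffice."""
    events = [
        Event(datetime(2022, 1, 10), "alert", "train-1"),
        Event(datetime(2022, 2, 1), "incident", "train-2"),
        Event(datetime(2022, 3, 5), "alert", "train-1"),
    ]
    result = filter_events(
        events, datetime(2022, 1, 1), datetime(2022, 3, 1), kinds=["alert"]
    )
    assert result == [events[0]]
```

The end of the window is exclusive here, so consecutive windows can be queried without counting an event twice. That is a design choice of this sketch, not a documented behaviour of the library.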

Technical Stack:

Language: Python (Pytest, Pandas, Logger, Boto, Requests, BeautifulSoup)
Cloud: Amazon S3
Data Visualization: Tableau Software
File Formats: Feather, CSV, JSON, XML, YAML, RDATA
Version Control: Git, GitHub
APIs: Interaction with various APIs
Data: Time series, Events (alerts, KPIs, incidents)
Development Environments: PyCharm, Jupyter Notebook, Postman