Intern
- Developed a Python-based data engineering tool to automate data processing, transformation, and compression, enabling efficient ingestion from SQL and CSV sources into Parquet format
- Gained experience in enterprise data systems and industry best practices, adapting to corporate workflows and high-volume data environments
- Achieved a 76.54x performance improvement by parallelizing Parquet processing across worker processes with Python's multiprocessing module (sketch 1 below)
- Increased data processing efficiency by roughly 10x by using Ray for parallel processing, running multiple concurrent jobs over large-scale data workloads (sketch 2 below)
- Enhanced data pipeline efficiency by integrating DuckDB, Pandas, and PyArrow, improving query performance and reducing memory overhead (sketch 3 below)
- Gained exposure to containerization and orchestration using Docker and Kubernetes
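
Sketch 1: a minimal illustration of the kind of multiprocessing-based Parquet conversion described above, assuming one CSV file per worker task; the directory layout and the `csv_to_parquet` helper are hypothetical placeholders, not the actual internship code.

```python
# Sketch only: parallel CSV-to-Parquet conversion using multiprocessing.
# The input directory and compression choice are hypothetical placeholders.
from multiprocessing import Pool
from pathlib import Path

import pandas as pd


def csv_to_parquet(csv_path: str) -> str:
    """Read one CSV file and rewrite it as compressed Parquet."""
    df = pd.read_csv(csv_path)
    out_path = Path(csv_path).with_suffix(".parquet")
    df.to_parquet(out_path, compression="snappy", index=False)  # requires pyarrow
    return str(out_path)


if __name__ == "__main__":
    csv_files = [str(p) for p in Path("data/raw").glob("*.csv")]
    # One worker process per CPU core; each file converts independently.
    with Pool() as pool:
        written = pool.map(csv_to_parquet, csv_files)
    print(f"Wrote {len(written)} Parquet files")
```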
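
Sketch 2: a minimal illustration of fanning out concurrent jobs with Ray; the partition paths and the `process_partition` task are hypothetical stand-ins for the real transformation logic.

```python
# Sketch only: running independent processing jobs concurrently with Ray.
# The partition paths and the clean-up step are hypothetical placeholders.
import pandas as pd
import ray


@ray.remote
def process_partition(parquet_path: str) -> int:
    """Load one Parquet partition, apply a transformation, and write it back."""
    df = pd.read_parquet(parquet_path)
    df = df.dropna()  # stand-in for the real transformation logic
    df.to_parquet(parquet_path.replace(".parquet", ".clean.parquet"), index=False)
    return len(df)


if __name__ == "__main__":
    ray.init()  # starts a local Ray runtime (or connects to an existing cluster)
    partitions = [f"data/part-{i:04d}.parquet" for i in range(8)]
    # Submit all jobs at once; Ray schedules them across available workers.
    futures = [process_partition.remote(p) for p in partitions]
    row_counts = ray.get(futures)
    print(f"Processed {sum(row_counts)} rows across {len(partitions)} partitions")
```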
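
Sketch 3: a minimal illustration of DuckDB querying Parquet directly and handing results to PyArrow and Pandas; the file path and column names are hypothetical.

```python
# Sketch only: querying Parquet with DuckDB and exchanging results via PyArrow/Pandas.
# The file path and column names are hypothetical placeholders.
import duckdb

con = duckdb.connect()  # in-memory database

# DuckDB scans the Parquet file directly and pushes down the projection and
# aggregation, so the full file never has to be loaded into memory.
query = """
    SELECT customer_id, SUM(amount) AS total_amount
    FROM read_parquet('data/transactions.parquet')
    GROUP BY customer_id
"""

arrow_table = con.execute(query).arrow()  # results as a PyArrow Table
df = arrow_table.to_pandas()              # convert to Pandas only at the end
print(df.head())
```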