As a Data Engineer, I am passionate about building efficient, well-designed data solutions and keeping up with emerging technologies. I enjoy working on scalable projects and act as a key coordinator within my team, helping drive its success and contributing to both my organization and the wider industry.
NAB (Capgemini Ltd) [Client] [Banking Domain]
✓ Data Ingestion / Data Transformation
➢ Description: NAB is an Australia-based MNC that deals with banking-domain data. We fetch data from multiple external sources, cleanse it, build the data pipeline, and pass it downstream so end users can access the data according to their needs.
➢ Role: Amazon S3 serves as our primary data source, housing files in a variety of formats, including Parquet, CSV, and XML. To move this data into our target, Redshift, we built a data pipeline orchestrated by Airflow, which schedules Python scripts at regular intervals.
Within this pipeline, a Step Function encapsulates multiple Glue jobs responsible for the data transformations, with AWS Lambda functions handling data hand-off between steps. The transformed data is finally loaded into the target database; a minimal sketch of the scheduling piece follows the technology line below.
In tandem with this migration effort, we are setting up CI/CD pipelines as part of the broader goal of migrating on-premises processes to the cloud. By replicating and optimizing the existing workflow in the cloud, we aim to reduce manual dependencies and improve operational efficiency.
➢ Technology: Python, PySpark, SQL, AWS services
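A minimal sketch of how such an Airflow DAG could trigger the Step Function on a schedule; the DAG name, task name, state machine ARN, and schedule are hypothetical placeholders, not the actual project configuration:

from datetime import datetime
import json
import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical ARN of the Step Function that wraps the Glue transformation jobs
STATE_MACHINE_ARN = "arn:aws:states:ap-southeast-2:123456789012:stateMachine:nab-ingest"

def start_state_machine(**context):
    """Kick off the Step Function that runs the Glue jobs for this batch."""
    sfn = boto3.client("stepfunctions")
    sfn.start_execution(
        stateMachineArn=STATE_MACHINE_ARN,
        input=json.dumps({"run_date": context["ds"]}),  # pass the logical date downstream
    )

with DAG(
    dag_id="nab_s3_to_redshift",      # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",       # run at regular intervals
    catchup=False,
) as dag:
    PythonOperator(
        task_id="trigger_step_function",
        python_callable=start_state_machine,
    )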
New York Life Insurance (Virtusa Polaris) [Finance Domain]
✓ Data Ingestion / Data Transformation
➢ Description: New York Life Insurance is a US-based MNC that deals with insurance-related data. The project fetches data from external sources and shapes it so end users can access the data according to their needs.
➢ Role: We are establishing a data flow that fetches data from our source AWS S3 repository, which houses a range of file formats such as CSV, JSON, fixed-length, and Parquet, and routes it to the designated targets, either Redshift or PostgreSQL databases.
This data flow is orchestrated through a Step Function encompassing multiple Glue jobs designed for data transformation, with AWS Lambda functions handling intermediate data transfer between stages. The final step loads the refined data into the target database; a minimal sketch of one such transformation follows the technology line below.
In terms of data governance and security, we ensure data integrity by employing AWS Lake Formation and Glue functionalities. Our ongoing initiative involves migrating an existing on-premises data flow to the cloud environment, driven by the pursuit of faster computation and the elimination of redundant intermediate steps.
➢ Technology: Python, PySpark, SQL, AWS services
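A minimal PySpark sketch of the kind of transformation one of these Glue jobs might perform; the bucket names, paths, and column names are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("nyl_policy_transform").getOrCreate()

# Read raw CSV policy data from the source S3 bucket (hypothetical path)
raw = spark.read.option("header", True).csv("s3://nyl-raw-zone/policies/")

# Basic cleansing: drop fully-null rows, standardise dates, cast amounts
cleaned = (
    raw.dropna(how="all")
       .withColumn("effective_date", F.to_date("effective_date", "yyyy-MM-dd"))
       .withColumn("premium_amount", F.col("premium_amount").cast("double"))
)

# Write curated Parquet back to S3, partitioned by date, ready for the Redshift load step
cleaned.write.mode("overwrite").partitionBy("effective_date").parquet(
    "s3://nyl-curated-zone/policies/"
)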
Empower Analytics (Accenture) [Finance Domain]
✓ Data Ingestion / Data Transformation
➢ Description: Empower Analytics is a US-based MNC that deals with retirement-fund (PF) related data and is responsible for providing retirement plans to US citizens.
➢ Role: We orchestrate a data flow that retrieves diverse data formats, including CSV, JSON, fixed-length, and Parquet, from a source repository in AWS S3. Data transformations are written in Python, PySpark, and SQL and tied together with Step Functions and DynamoDB; the workflows run on an EMR cluster, and the refined data is delivered to Redshift, where it is readily available for in-depth analysis. A minimal sketch of submitting such a job to EMR follows the technology line below.
➢ Technology: Python, PySpark, SQL, AWS services
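As an illustration, submitting one of these PySpark transformation jobs as an EMR step might look roughly like this; the cluster ID, step name, and script path are hypothetical:

import boto3

emr = boto3.client("emr")

# Submit a spark-submit step to an already-running EMR cluster (hypothetical IDs and paths)
response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[
        {
            "Name": "empower-transform-daily",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "s3://empower-code/jobs/transform_retirement_data.py",
                ],
            },
        }
    ],
)
print(response["StepIds"])  # step IDs can be recorded (e.g. in DynamoDB) for tracking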
Maxis Pvt Ltd (Amdocs) [Telecom Domain]
✓ ETL Scripting/On-Prem to Cloud Migration
➢ Description: Maxis is a Malaysian telecommunications company; the project was responsible for generating bills based on customer usage of broadband, mobile, and other services.
➢ Role: We extract data from multiple vendor sources stored in AWS S3 and apply SQL transformations to enrich and refine it so that it is readily usable by our end customers.
To keep this efficient, we use Python and Bash scripts to automate the data retrieval and transformation steps, streamlining the workflow and ensuring consistent data delivery; a minimal sketch of the retrieval automation follows the technology line below.
➢ Technology: Python, SQL
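A minimal sketch of the kind of retrieval automation involved; the bucket, prefix, local paths, and wrapper script name are hypothetical:

import subprocess
import boto3

s3 = boto3.client("s3")
BUCKET = "maxis-vendor-feeds"   # hypothetical bucket
PREFIX = "billing/usage/"       # hypothetical vendor feed prefix

# Pull down the latest vendor usage files before the SQL transformation runs
objects = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", [])
for obj in objects:
    local_path = "/data/incoming/" + obj["Key"].split("/")[-1]
    s3.download_file(BUCKET, obj["Key"], local_path)

# Hand off to the SQL transformation step (hypothetical wrapper script)
subprocess.run(["bash", "/opt/etl/run_usage_transform.sh"], check=True)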
Bank of America Pvt Ltd (Infosys) [Banking Domain]
✓ SQL/ETL
➢ Description: Bank of America deals with banking-related data. Mortgage and loan data needs to be cleansed and standardized before being sent to the client.
➢ Role: We use SQL scripts to transform the data so it is readily accessible and usable by our end customers. This involves several cleansing steps, including systematically removing null values and updating data types.
Once cleansed, the data is loaded back into our tables, ready for analysis. This ensures the data is not only accurate but also optimized for our customers' analytical needs; a minimal sketch of the kind of cleansing involved follows the technology line below.
➢ Technology: SQL, PL/SQL
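A minimal sketch of the kind of cleansing SQL involved, shown here wrapped in a Python runner for illustration; the connection details, table, and column names are hypothetical and assume an Oracle-backed mortgage schema:

import cx_Oracle  # Oracle driver, assumed for this sketch

# Hypothetical connection details
conn = cx_Oracle.connect("etl_user", "****", "bofa-db:1521/MORTGAGE")
cur = conn.cursor()

# Remove rows where key fields are null, then standardise the amount column's data type
cur.execute("DELETE FROM mortgage_staging WHERE loan_id IS NULL OR balance IS NULL")
cur.execute("ALTER TABLE mortgage_staging MODIFY (balance NUMBER(18, 2))")

conn.commit()
cur.close()
conn.close()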
✓ Programming Languages: Python, PySpark, SQL
✓ Scripting Languages: Python, Bash
✓ Tools/Utilities: JIRA, Jenkins, PyCharm, Alteryx, Databricks, Airflow
✓ Repositories: Git, Perforce
✓ Databases: MySQL, Oracle, SQL, PL/SQL, Redshift, Athena
✓ AWS Services: Step Functions, Glue, Lambda, S3, EMR, DynamoDB, DMS, Redshift, Athena; basics of CI/CD
✓ Received the Best Scrum Team of the Year award
✓ Received client appreciation for completing work ahead of schedule