Nathan Bennett

PENSHURST

Summary

Data Engineer specializing in efficient data pipelines, robust architectures, and scalable solutions. I evaluate a range of tools to find the best fit for each project, and I am committed to continuous learning and keeping up with industry trends. Alongside a successful track record in data engineering, I build and deploy machine learning models, adding valuable expertise to data-driven teams.

Overview

6 years of professional experience

Work History

Data Engineer

Swipejobs
02.2018 - Current
  • Trained and validated ML models with H2O and MLflow and deployed them to production: Spark Streaming processed events from Kafka, H2O computed predictions, and results were published back to Kafka, improving model accuracy and ensuring smooth integration into the production environment.
  • Developed and maintained production ETL in Spark/Scala, extracting data from sources including MongoDB collections, Logstash, and Postgres tables and storing it in S3 as Parquet files for efficient retrieval, streamlining the data pipeline and improving data quality.
  • Orchestrated all data warehouse tasks (ETL, reporting, and ML models running in production) with Apache Airflow, making the overall process more efficient.
  • Migrated Spark jobs from EC2 instances to Kubernetes, cutting daily ETL latency by 33%; wrote a custom Spark-submit operator in Airflow that creates new pods in Kubernetes and executes Spark jobs, enabling faster data processing and more accurate reporting.
  • Built and deployed a machine learning model as a Java microservice using H2O.ai and NLP techniques to parse resumes and extract work history, skills, and education, making profile creation 27% faster for applicants.
  • Integrated Apache Superset into the internal service desk to display metrics and charts; over 1,600 users now rely on these metrics to make key business decisions.
  • Implemented Trino (formerly Presto) with a Hive metastore to aggregate terabytes of Parquet files stored in S3, enabling complex queries over large datasets in a timely manner.
  • Implemented and maintained Deequ checks in ETL tasks to enforce data quality, catching and correcting issues in the pipeline for more accurate data and reporting.
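As an illustration of the constraint-checking idea behind the Deequ bullet above: Deequ itself runs on Spark/Scala (via its `VerificationSuite`), so the snippet below is only a hypothetical plain-Python analogue over in-memory records, not the production code.

```python
# Hypothetical, simplified analogue of Deequ-style data-quality checks:
# verify completeness and uniqueness of a column across a batch of records.

def check_completeness(records, column):
    """Fraction of records with a non-null value for `column`."""
    if not records:
        return 1.0
    present = sum(1 for r in records if r.get(column) is not None)
    return present / len(records)

def check_uniqueness(records, column):
    """True if every non-null value of `column` appears exactly once."""
    values = [r.get(column) for r in records if r.get(column) is not None]
    return len(values) == len(set(values))

def run_checks(records):
    """Return a pass/fail report; a real pipeline would fail the ETL task on violation."""
    return {
        "id_complete": check_completeness(records, "id") == 1.0,
        "id_unique": check_uniqueness(records, "id"),
    }

batch = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": None},
    {"id": 2, "name": "c"},  # duplicate id: uniqueness violation
]
report = run_checks(batch)  # {'id_complete': True, 'id_unique': False}
```

In the production setup described above, the equivalent checks would run as Deequ constraints inside the Spark ETL job rather than in Python.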

Education

Bachelor of Science (Honours) - Mathematics

University of Technology Sydney
Sydney, NSW
07.2017

Skills

  • Programming Languages: Scala, Python, Java
  • Big Data Technologies: Apache Spark, Trino, Apache Hive, Apache Kafka
  • Workflow Management: Apache Airflow
  • Database Management: PostgreSQL, MongoDB
  • Data Visualization: Apache Superset
  • Cloud Services: AWS
  • Containers: Docker
  • Machine Learning & NLP: H2O.ai, PyTorch, Scikit-learn
