John is a Senior Big Data Engineer with 4 years of experience designing and developing big data pipelines from data sources including CSV and JSON files, RESTful APIs, SQL and NoSQL databases, and big data lake environments using Hadoop, Databricks, and, more recently, AWS EMR with S3 storage. He has extensive skills in Python and in writing and optimizing SQL queries that support data ingestion processes feeding data analytics platforms. John has 3 years of experience writing data transformation processes in PySpark. He is an AWS Certified Cloud Practitioner. He has 12 years of experience writing SQL and troubleshooting SQL performance in relational databases, and holds an Oracle SQL Expert certification from Oracle. Additionally, he has 10 years of experience writing Bash shell scripts to support and deploy data automation maintenance tasks from development to production.
Current Project: Data suppression ETL processing for health reporting system
Canadian Government company Data Privacy enhancement reporting solution – Full life cycle end to end
Project: Bank Global Risk Management ETL-ML migration from on-premises PySpark to Azure Databricks
Implemented data pipeline ingestion jobs in PySpark using an in-house PySpark data ingestion framework running on a Databricks data lake with AWS cloud services, including S3 storage. Performed Python programming for data manipulation supporting ETL transformations in PySpark.
Project: Sobeys data pipeline implementation supporting daily data loads of around 1 GB; the pipeline updates multiple fact tables averaging 15–20 billion rows.