Knowledge Domains | Data Engineering & Platform Systems across cloud, hybrid, and regulated data environments
Technical Quality | Focused on production-grade pipelines: from raw ingestion -> trusted datasets -> analytics-ready products
Engineering Practice | Strong emphasis on reliability, reproducibility, auditability, and systems that age well
Broad Curriculum | Experience spanning ETL/ELT, distributed processing, cloud/HPC, data quality, and ML-adjacent systems
I would like to know more...
Hello, and welcome to my profile. My name is Eduardo - grab a cup of coffee and allow me to introduce myself.
I build data platforms and production-grade data systems where messy real-world inputs are transformed into reliable, auditable, and analysis-ready products.
In practical terms, my work lives somewhere between:
- Data ingestion - where reality arrives poorly formatted and with opinions.
- Transformation layers - where Python, SQL, Spark, and modeling discipline try to restore civilization.
- Data quality - because a pipeline that runs and a pipeline that is correct are not the same animal.
- Delivery - where analytics, BI, ML, and internal users need datasets that are stable enough to trust and boring enough to maintain.
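The gap between "a pipeline that runs" and "a pipeline that is correct" can be made concrete with an explicit validation gate. A minimal sketch, assuming a hypothetical batch shape and check names (none of this is a real schema from my projects):

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str


def validate_batch(rows: list[dict]) -> list[CheckResult]:
    """Run explicit correctness checks on an ingested batch.

    Column names and checks here are illustrative placeholders.
    """
    results = []

    # Completeness: the batch must not be silently empty.
    results.append(CheckResult("non_empty", len(rows) > 0, f"{len(rows)} rows"))

    # Required fields: every row needs a primary key.
    missing_id = sum(1 for r in rows if not r.get("patient_id"))
    results.append(CheckResult(
        "required_id", missing_id == 0, f"{missing_id} rows missing patient_id"))

    # Uniqueness: primary keys must not collide.
    ids = [r["patient_id"] for r in rows if r.get("patient_id")]
    dupes = len(ids) - len(set(ids))
    results.append(CheckResult("unique_id", dupes == 0, f"{dupes} duplicate ids"))

    return results


batch = [{"patient_id": "a1"}, {"patient_id": "a1"}, {"patient_id": None}]
report = validate_batch(batch)
failed = [c.name for c in report if not c.passed]
# Any failing check blocks promotion to the trusted layer,
# instead of letting bad data flow downstream.
```

The point of the sketch is the contract, not the checks themselves: a batch that cannot name which check it failed is a batch you cannot debug at 3 a.m.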
I specialize in Python, SQL, PySpark/Spark, cloud and hybrid data infrastructure, ETL/ELT pipelines, data modeling, validation frameworks, and reproducible platform workflows. I have worked across environments involving AWS, GCP, Azure, HPC clusters, Docker/Kubernetes, CI/CD, and large heterogeneous datasets that rarely introduce themselves politely.
My engineering bias is simple: build systems that are clear, testable, observable, and maintainable after the original excitement has left the room.
This usually translates to:
- Production data pipelines with strong reproducibility, monitoring, and failure handling
- Scalable batch and distributed processing for high-volume analytical workloads
- Data modeling layers supporting analytics, BI, reporting, and ML workflows
- Validation and governance practices for environments where correctness is not decorative
- Cloud, hybrid, and HPC workflows designed to survive both scale and human memory
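"Failure handling" and reproducibility above often reduce to one pattern: idempotent writes keyed by batch content, so a blind retry after a crash can never duplicate data. A minimal sketch, with a plain dict standing in for any key-addressable sink (object-store prefix, staging table, etc.):

```python
import hashlib
import json


def batch_key(batch: list[dict]) -> str:
    """Deterministic content hash: the same batch always maps to the same key."""
    canonical = json.dumps(batch, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()


def write_idempotent(store: dict, batch: list[dict]) -> bool:
    """Write a batch keyed by its content hash.

    Returns True if the batch was written, False if it had already landed,
    so retrying after a suspected failure is always safe.
    """
    key = batch_key(batch)
    if key in store:
        return False  # already landed; the retry is a no-op
    store[key] = batch
    return True


sink: dict = {}
payload = [{"id": 1, "value": 42}]
first = write_idempotent(sink, payload)   # normal run
retry = write_idempotent(sink, payload)   # retry after a suspected failure
# first is True, retry is False, and the sink holds exactly one copy.
```

The same idea scales from this toy to partitioned Parquet landings: derive the key from content, check before writing, and retries become boring.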
My background includes complex healthcare and enterprise data environments, but my current professional identity is straightforward: Senior Data Engineer / Data Platform Engineer. Machine Learning is still part of the toolbox, but the main job is now the plumbing, contracts, orchestration, and reliability that make downstream intelligence possible.
I value clean design, explicit trade-offs, and systems that are understandable by humans - not just machines with suspicious confidence.
Ethics, reproducibility, and long-term sustainability are not optional; they are part of the job.
Availability | Currently open to remote, hybrid, relocation-friendly, and long-term Data Engineering / Data Platform roles. See contact details. Relocation and onboarding take planning - good systems (and good moves) benefit from doing things properly.
2025 | Committer | Awarded Apache Spark Committer Status | The Apache Software Foundation (ASF) | Finland & Brazil
2022 | Senior Transition | Senior Data Engineer | Turku Biosciences & Brazilian Ministry of Health | Finland & Brazil
2020 | Outreach | Award-Winning COVID-19 Outreach Campaign | Göttingen University Medical School | Germany
2017 | Patent | LAG3-Targeting Cancer Therapy | Current owner: Bristol Myers Squibb | USA
2016 | Industry Transition | Data Engineer | Dana-Farber Cancer Institute | USA
2013 | Research | Computational Biology Researcher | RWTH University Medical School | Germany
I would like to know more...
name: "Eduardo Gusmao"
role: "Senior Data Engineer | Data Platform Engineer"
contact: "Recife, Brazil | eduardogade@gmail.com | github.com/eggduzao | linkedin.com/in/eduardogade"
languages: "English fluent | Portuguese native | Spanish B2 | German A2 | Finnish A1"
education: "2x PhD in Biomedical Informatics and Data Engineering / Computational Life Sciences; BSc + MSc in Computer Science"
summary: "Data Platform Engineer with 8+ years designing scalable data platforms, distributed systems, and production-grade pipelines across healthcare, life sciences, and enterprise environments. Strong Python, SQL, PySpark/Spark, AWS, Docker/Kubernetes, CI/CD, data quality, and analytics/BI platform experience."
professional_engagements:
current_role:
company: "Turku Biosciences / Brazilian Ministry of Health"
title: "Senior Data Engineer"
location: "Finland / Brazil"
date: "Sep 2022 - Present"
scope:
- "Lead a national-scale precision-medicine data platform integrating genomic, phenotypic, and clinical EHR data for 65,000+ individuals."
- "Build scalable Python, SQL, PySpark/Spark, Databricks-adjacent, HPC/SLURM, and API-driven ingestion and transformation workflows."
- "Deliver regulated ingestion, validation, governance, PII-compliant processing, observability, idempotency, and data quality controls."
- "Support analytics, BI, and ML workloads through reusable integration layers, backend data services, and optimized Parquet-based processing."
development_environment:
infrastructure: "AWS | Azure | GCP | HPC/SLURM | Docker | Kubernetes | Terraform | GitHub Actions"
languages: "Python | SQL | PySpark | Bash/Shell | Scala | Java | C/C++ | YAML | HCL"
data_stack: "PySpark | Spark | Pandas | Polars | NumPy | BigQuery | PostgreSQL | Parquet | JSON | dbt | dimensional modeling"
platform_engineering: "ETL/ELT | ingestion frameworks | transformation layers | platform APIs | CI/CD | observability | validation | data quality"
ml_ai_stack: "ML pipelines | MLOps | feature engineering | LLM APIs | embeddings | RAG | Hugging Face"
collaboration: "Agile/Scrum | stakeholder enablement | analytics teams | data scientists | engineers | product | infrastructure | security"
I would like to know more...
name: "Eduardo Gusmao"
role: "Senior Data Engineer | Data Platform Engineer | Cloud Data Engineer"
location: "Recife, Brazil"
contact: "eduardogade@gmail.com | github.com/eggduzao | linkedin.com/in/eduardogade"
languages: "English fluent | Portuguese native | Spanish B2 | German A2 | Finnish A1"
summary: "Data Platform Engineer with 8+ years of experience designing and operating scalable data platforms, distributed data systems, and production-grade pipelines across healthcare, life sciences, and enterprise environments. Strong expertise in Python and SQL, with hands-on experience in PySpark/Spark, Kafka-style event-driven workflows, Airflow, AWS, Docker/Kubernetes, Terraform, and CI/CD to build reliable data infrastructure and platform services."
core_expertise:
- "Data Engineering"
- "Data Platform Engineering"
- "Cloud and Hybrid Data Infrastructure"
- "Distributed Data Systems"
- "ETL/ELT Pipelines"
- "Data Modeling and Analytics Engineering"
- "Data Quality and Governance"
- "Healthcare and Life Sciences Data"
- "Machine Learning Data Pipelines"
- "Production Reliability and Observability"
career_profile:
- "8+ years building scalable data platforms and distributed data systems"
- "Production-grade ingestion, transformation, validation, observability, and internal tooling"
- "Strong Python, SQL, PySpark/Spark, cloud, CI/CD, Docker/Kubernetes, and data quality background"
- "Experience supporting analytics, BI, ML workflows, and mission-critical data products"
- "Comfortable translating complex stakeholder requirements into maintainable platform capabilities"
professional_engagements:
current:
company: "Turku Biosciences / Brazilian Ministry of Health"
title: "Senior Data Engineer"
location: "Finland / Brazil"
date: "Sep 2022 - Present"
scope:
- "Lead the design and delivery of a national-scale data platform for precision medicine, integrating multi-modal genomic, phenotypic, and clinical EHR data for 65,000+ individuals."
- "Build scalable Python-based pipelines, distributed systems, PySpark/Spark workflows, Databricks-adjacent processing, HPC/SLURM execution, and API-driven ingestion workflows."
- "Architect production-grade data platform services for regulated ingestion, validation, transformation, governance, PII-compliant processing, reproducible workflows, and quality/reliability controls."
- "Develop high-performance processing and modeling layers using Python, SQL, PySpark, partitioning strategies, Parquet formats, and distributed query tuning."
- "Design reusable integration layers and backend data services connecting heterogeneous clinical, genomic, and ERP/SAP data sources."
- "Enable event-driven workflows, orchestration patterns, and batch/streaming-adjacent pipelines supporting analytics, BI, and ML systems."
- "Collaborate with product, analytics, engineering, and infrastructure stakeholders to deliver platform capabilities, CI/CD, Docker/Kubernetes workloads, observability, logging, alerting, and performance tuning."
outcomes:
- "Integrated precision-medicine datasets for 65,000+ individuals."
- "Improved pipeline efficiency by approximately 25%."
- "Reduced storage costs by approximately 80%."
- "Enabled more than 40% faster data delivery for downstream analytics, BI, and ML systems."
previous_mid:
company: "Göttingen General Hospital"
title: "Data Engineer II"
location: "Germany"
date: "Mar 2019 - Sep 2022"
scope:
- "Designed and implemented scalable data platform services using Python and SQL on cloud and hybrid environments."
- "Enabled reliable ingestion, transformation, and low-latency access for downstream analytics, BI, and application workloads."
- "Developed and optimized high-performance ETL/ELT pipelines with Python and PySpark, leveraging distributed processing, batch workflows, and orchestration patterns."
- "Refactored legacy systems into modular, production-grade platform services with CI/CD, automated testing, monitoring/logging, idempotency, retries, and robust error handling."
- "Built reusable data processing frameworks and integration layers for large-scale heterogeneous datasets."
- "Applied data modeling, validation, lifecycle standards, and governance across 12 cross-functional teams in a distributed environment."
outcomes:
- "Improved data availability and system responsiveness by approximately 33%."
- "Improved reliability, maintainability, and operational efficiency by approximately 50-60%."
- "Supported consistent data lifecycle practices across 12 cross-functional teams."
previous_old:
company: "Dana-Farber Cancer Institute"
title: "Data Engineer I"
location: "USA"
date: "Jan 2016 - Mar 2019"
scope:
- "Developed cloud-native data platform services supporting large-scale drug discovery."
- "Built Python-based ETL/ELT pipelines and API-driven integration layers for heterogeneous biomedical, operational, and financial datasets."
- "Implemented end-to-end data processing pipelines using Python, SQL, and PySpark on Apache Spark distributed systems."
- "Enabled scalable ingestion, transformation, validation, and batch workflows for analytics and ML-driven applications."
- "Collaborated with product, analytics, and research stakeholders to define KPIs and translate requirements into data models, backend data logic, and reusable platform components."
- "Contributed to production-grade data engineering practices including Git version control, validation checks, documentation, maintainable system design, reliability, and reproducibility."
outcomes:
- "Improved data accessibility and reduced operational costs by more than 25%."
- "Supported analytics and ML-driven applications through reusable data platform components."
- "Established reliable, reproducible lifecycle standards for heterogeneous biomedical and operational data."
education:
phd_biomedical_informatics:
degree: "Ph.D. in Biomedical Informatics"
institution: "Harvard Medical School"
location: "Boston / Cambridge, USA"
date: "2013 - 2017"
phd_computational_life_sciences:
degree: "Ph.D. Dr. rer. nat. in Data Engineering and Computational Life Sciences"
institution: "RWTH Aachen University"
location: "Aachen, Germany"
date: "2011 - 2015"
bachelor_master_computer_science:
degree: "B.Sc. and M.Sc. in Computer Science"
institution: "Federal University of Pernambuco"
location: "Recife, Brazil"
date: "2008 - 2011"
technical_strengths:
programming:
primary: ["Python", "SQL", "PySpark", "Spark SQL", "Bash/Shell"]
secondary: ["Scala", "Java", "C/C++", "YAML", "HCL"]
capabilities: ["REST APIs", "Async programming", "Data serialization", "Production-grade software engineering", "Parquet", "JSON"]
data_platform_engineering:
capabilities: ["Scalable data platforms", "Distributed data systems", "Internal data tooling", "Reusable ingestion frameworks", "Transformation layers", "Platform APIs", "Developer-facing abstractions", "Self-service data capabilities", "Analytics enablement", "ML workflow support", "BI workload support"]
distributed_data_systems:
tools: ["PySpark", "Apache Spark", "Pandas", "Polars", "NumPy"]
capabilities: ["Large-scale processing", "Distributed compute", "Performance tuning", "Partitioning", "Query optimization", "Resource efficiency", "Batch pipelines", "Streaming-adjacent pipelines", "Kafka", "Spark Streaming patterns"]
cloud_hybrid_infrastructure:
cloud: ["AWS", "Azure", "GCP"]
infrastructure: ["HPC/SLURM", "Docker", "Kubernetes", "Terraform", "GitHub Actions", "GitLab CI"]
capabilities: ["Cloud-native data infrastructure", "Hybrid data infrastructure", "Infrastructure-aware engineering", "Containerized workloads", "Deployment environments", "Scalable platform operations"]
hadoop_on_prem_ecosystems:
technologies: ["HDFS", "YARN", "Hive", "Kerberos"]
capabilities: ["Distributed storage patterns", "Distributed compute patterns", "Legacy-to-modern platform evolution", "Secure access-controlled data environments"]
data_modeling_tooling:
capabilities: ["Dimensional modeling", "Semantic modeling", "Schema design", "Metadata management", "Transformation layers", "Data contracts", "Lineage", "Modeling standards", "Analytics enablement", "Platform consistency"]
tools: ["dbt", "BigQuery", "PostgreSQL", "Parquet", "JSON"]
software_engineering_devops_reliability:
tools: ["Git", "GitHub", "GitHub Actions", "GitLab CI", "Docker", "Kubernetes", "Terraform"]
practices: ["CI/CD pipelines", "Automated testing", "Deployment automation", "Monitoring", "Logging", "Alerting", "Observability", "Incident response", "Idempotency", "Retries", "SLA/SLO thinking", "Fault-tolerant design"]
machine_learning_data_pipelines:
capabilities: ["ML pipelines", "MLOps", "Feature engineering", "Data preparation", "Personalization workflows", "AI-enabled data workflows", "Production-oriented ML data support"]
ai_llm: ["LLM APIs", "Embedding pipelines", "RAG", "Hugging Face"]
data_security_governance_quality:
capabilities: ["Data privacy", "PII-aware processing", "Compliance-aware pipelines", "Access control", "Validation strategies", "Auditability", "Data quality checks", "Governance practices", "Secure data lifecycle management", "Reliability controls", "Consistency checks"]
processes_collaboration:
practices: ["High-ownership engineering mindset", "Agile/Scrum", "Cross-functional collaboration", "Stakeholder enablement", "Requirements translation", "Technical documentation", "Platform capability delivery"]
collaborators: ["Analysts", "Data scientists", "Engineers", "Product teams", "Infrastructure teams", "Security teams"]
development_environment:
hardware: ["Apple Silicon", "ARM", "Intel", "NVIDIA GPU environments", "HPC clusters"]
operating_systems: ["macOS", "Ubuntu", "Debian", "Fedora", "Windows"]
infrastructure:
cloud_computing: ["AWS", "Azure", "GCP"]
hpc: ["SLURM", "OpenPBS", "Distributed compute environments"]
containers: ["Docker", "Kubernetes", "Singularity"]
infrastructure_as_code: ["Terraform", "HCL", "Cloud deployment"]
languages:
data_engineering: ["Python", "SQL", "PySpark", "Spark SQL", "Bash/Shell"]
systems_and_general: ["C/C++", "Java", "Scala"]
markup_and_config: ["YAML", "Markdown", "LaTeX", "HTML/CSS", "HCL"]
data_stack:
distributed_processing: ["Apache Spark", "PySpark", "Spark SQL", "Pandas", "Polars", "NumPy"]
storage_formats: ["Parquet", "JSON", "CSV", "HDF5"]
databases_and_warehouses: ["BigQuery", "PostgreSQL", "MongoDB", "DynamoDB", "Relational databases", "NoSQL databases"]
modeling_and_quality: ["Dimensional modeling", "Semantic modeling", "Schema design", "Data contracts", "Lineage", "Validation checks", "Data quality checks", "dbt"]
ml_ai_stack:
frameworks_and_tools: ["PyTorch", "TensorFlow", "Keras", "Scikit-Learn", "Hugging Face", "NLTK"]
workflows: ["ML pipelines", "MLOps", "Feature engineering", "Embedding pipelines", "RAG", "LLM APIs"]
systems_tooling:
version_control: ["Git", "GitHub"]
packaging_and_environments: ["pip", "poetry", "micromamba", "mamba", "conda", "npm"]
ci_cd: ["GitHub Actions", "GitLab CI"]
observability: ["Logging", "Monitoring", "Alerting", "Observability", "Prometheus", "Grafana"]
github_positioning:
short_pitch: "I build reliable data platforms, distributed pipelines, and production-ready data systems for analytics, BI, ML, and healthcare/life-sciences workloads."
engineering_style:
- "Clean, maintainable, typed Python"
- "Data quality and reliability first"
- "Production-aware platform design"
- "Reproducible workflows"
- "Strong documentation"
- "Pragmatic cloud and hybrid infrastructure"
LinkedIn | https://www.linkedin.com/in/eduardogade
Location | Recife, Brazil | Remote-friendly
Status | Open to Data Engineering roles
I would like to know more...
LinkedIn: https://www.linkedin.com/in/eduardogade/
Website & Blog: https://www.gusmaolab.org
One-Page Resume: https://www.gusmaolab.org/Gusmao-EG-CV.pdf
Stack Overflow: https://stackoverflow.com/users/32223943/eduardo-gusmao
Medium: https://medium.com/@eduardogade
Preferred contact: Email | LinkedIn
Response time: 1-2 business days
Open to remote, hybrid, or relocation
See [availability & engagement details](#availability)
Placeholder.
I would like to know more...
Placeholder.
Placeholder.
I would like to know more...
Placeholder.
Placeholder.
I would like to know more...
Placeholder.
Placeholder.
Placeholder.
I would like to know more...
Placeholder.
Machine & Deep Learning | Repository | Publication
Variational Inference | Repository | Publication
Precision Medicine | Repository | Publication
Regulatory Genomics | Repository | Publication
I would like to know more...
Global age-sex-specific all-cause mortality and life expectancy estimates for 204 countries and territories and 660 subnational locations, 1950-2023: a demographic analysis for the Global Burden of Disease Study 2023
The Lancet · Oct 18, 2025
Contributions:
- Responsible for orchestrating the LATAM branch, with 45+ PIs and 200+ researchers.
- Horizontal meetings for data and experience sharing proved highly successful, with ~380% higher per-capita efficiency than the next most efficient branch.
- Resolved pharmacological conflicts of interest through cross-deployment and a blind-genotype/blind-phenotype strategy, which showed 17% higher accuracy than North America (first in COI per capita) and 5% higher than Asia (second in COI per capita).
Cell Reports · Nov 26, 2024
Contributions:
- The tool Bloom improved the analysis by offering multiple views into the regulatory spatial configuration, resulting in ~50% wet-lab equipment cost reduction and resolving a stalled case.
- Provided personal guidance on architecture and Hi-C methodology, saving 15% of overall lab time.
- Overall, this was the first non-trivial non-intermediary-distance (>1Gbp) lncRNA interference in a region unknown to be a regulatory enhancer.
Global, regional, and national burden of diabetes from 1990 to 2021, with projections of prevalence to 2050: a systematic analysis for the Global Burden of Disease Study 2021
The Lancet · Jul 15, 2023
Contributions:
- Responsible for orchestrating a team of 3 Brazilian PIs and 5 independent investigators.
- Used Scrum coupled with CRISP-DM, delivering net gains (profitability converted back) through network revenue savings and wet/dry-lab material cost reduction.
- Developed Fabric (Phenoteka module), a national-scale genotype/phenotype QC pipeline used across 20+ institutes.
The New England Journal of Medicine · Oct 11, 2022
Contributions:
- Developed Blacksmith, which, coupled with Bloom, improved operating margin by over 15%.
- Freed at least 15 engineering hours per week by coupling Blacksmith with Apollo.
- To lower our carbon footprint, we adopted a trademarked database 'bit-brushing' methodology (currently owned by Databricks Inc.).
Molecular Systems Biology · Jun 24, 2021
Contributions:
- Analyzed spatial chromatin biology and RNA-seq data to identify, for the first time, HMGB1 as a 'rheostat' factor.
- Reduced cloud compute costs by 40% using Apollo's mathematical features and Bloom to analyze chromatin conformation.
- Following this project's results, we earned ESG compliance through rigorous waste management and safety handling.
Redundant and specific roles of cohesin STAG subunits in chromatin looping and transcriptional control
Genome Research · Apr 6, 2020
Contributions:
- Analyzed the most omics assays in a single project: ChIP-seq, degron-X, RNA-seq, Hi-C, STORM, DNase-seq, ATAC-seq, MS/MS, and MS-based microscopy.
- Developed Musique, shortening development cycles by ~9 weeks.
- Musique saved 300 GPU-hours per month by applying simple heuristics that generalize to any dataset.
Spatial chromosome folding and active transcription drive DNA fragility and formation of oncogenic MLL translocations
Molecular Cell · Jul 25, 2019
Contributions:
- Patented a technique for BLISS-seq data processing, earning ~25% extra funds for the laboratory.
- Lowered wet-lab costs by ~30% (estimated for this project) using dry-lab tools, achieving reproducible and insightful results on MLL fusions.
- Created the triple-correlation method, translating category theory into a real-world phenomenon.
HMGB2 loss upon senescence entry disrupts genomic organisation and induces CTCF clustering across cell types
Molecular Cell · May 17, 2018
Contributions:
- Developed Bloom and Apollo, which reduced processing time by at least 3 months.
- Applied a highly agile methodology with short, multi-cycle iterations, leading to novel discoveries and decreasing overall time-to-delivery.
- Reduced local infrastructure storage footprint by ~100 TB with Bloom and Apollo.
Nature · Jan 23, 2017
Contributions:
- Devised bioinformatics pipelines with collaborators and created the Gaussian-as-DPMM clustering method, increasing speed by at least ~100x.
- The clustering identified 3+ unique subtypes never previously reported.
- Created a deep regulatory network, notably involving SHKBP1, ERBB3, and TGFBR2, which captured 98% of the variability in cancer mortality information.
Nature Methods · Feb 22, 2016
Contributions:
- Landmark study comparing 12+ footprinting methods; it was featured on the cover of Nature Methods.
- Without any wet-lab experiment, we identified the limits of sequencing technologies and proposed results exceeding known methods by ~5% AUPR.
- Our method, Olympus (published in 2023), offers a ~7x more complete analysis of regulatory genomics than any other tool.
Nucleic Acids Research · Oct 17, 2015
Contributions:
- First use of Faun, a motif enrichment analysis that uses hypergeometric distributions to assess the sensitivity and specificity of TF occupancy in a given genomic region.
- Proposed the use of Cytoscape, reducing meeting preparation time by ~25%.
- Proposed using fewer histone modification assays by recreating chromatin states in silico, reducing project costs by ~30%.
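The hypergeometric enrichment idea behind this kind of motif analysis can be sketched in a few lines. The counts below are toy numbers for illustration, not values from the paper, and this is a generic tail test, not Faun's actual implementation:

```python
from math import comb


def hypergeom_pvalue(k: int, K: int, n: int, N: int) -> float:
    """P(X >= k): probability of seeing at least k motif hits among n
    sampled regions, when K of the N genome-wide regions carry the motif.
    This is the upper-tail hypergeometric enrichment test.
    """
    return sum(
        comb(K, i) * comb(N - K, n - i) / comb(N, n)
        for i in range(k, min(K, n) + 1)
    )


# Toy example: 8 of 20 sampled regions hit the motif, versus a genome-wide
# background of 50 motif-carrying regions out of 1000. Expected hits under
# the null are 20 * 50 / 1000 = 1, so 8 hits is strong enrichment.
p = hypergeom_pvalue(k=8, K=50, n=20, N=1000)
```

A tiny p-value here says the observed TF occupancy in the region is far denser than background, which is exactly the enrichment signal such tools report.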
Placeholder.
I would like to know more...
Placeholder.
Placeholder.
I would like to know more...
Placeholder.
Placeholder.
I would like to know more...
Placeholder.
2018 | PhD | Biomedical Informatics | Harvard Medical School
2015 | PhD | Life Sciences | RWTH Aachen University
2011 | MSc | Machine & Deep Learning | Federal University of Pernambuco
2010 | BSc | Computer Science | Federal University of Pernambuco
I would like to know more...
Placeholder.
Placeholder.
I would like to know more...
Placeholder.
Flagship: 🏳️⚧️ | 🏳️🌈 | 🇺🇳
Placeholder.
I would like to know more...
Placeholder.
Placeholder.
I would like to know more...
Placeholder.
Placeholder.
I would like to know more...
Placeholder.




