Job Description
Role Overview
Lead the design and build of scalable, secure, high-performance data platforms with a software engineering mindset: treating pipelines as products built in factory mode, inner-sourced for reuse, and automated end-to-end. Drive metadata-driven development and put data quality and observability at the core, across batch and streaming.
Key Responsibilities
- Engineer reusable pipeline frameworks (batch & streaming) with standard scaffolding, templates, and golden paths that teams can adopt and extend.
- Model data for analytics and interoperability (dimensional/star & snowflake, Data Vault 2.0, SCD types) with clear conventions and documentation.
- Optimize cloud data warehouses (e.g., BigQuery/Snowflake/Redshift/Synapse/Databricks SQL) for performance and cost using partitioning, clustering, caching, statistics, and workload management.
- Build and operate streaming dataflows (Kafka, Pub/Sub, or Kinesis with Spark or Flink) with exactly-once processing, replay, and robust SLAs/SLOs.
- Embed quality at the core: define data contracts, DQ rules/tests, anomaly detection, reconciliation, and CI/CD quality gates.
- Make it metadata-driven: automate capture/propagation of schema, lineage, ownership, sensitivity/PII tags, KPI/metric definitions, and business glossary links.
- Establish BI & semantic layers: publish conformed dimensions, metric logic, and consumable views/models to power dashboards and self-serve analytics.
- Lay AI-ready foundations: curate feature-friendly datasets; design for knowledge layers (semantic models, ontologies, knowledge graphs) and future vector/embedding use.
- Ensure observability & FinOps: lineage, logging, metrics, and tracing; query/job profiling; capacity and cost guardrails.
- Uplift engineering excellence: Git-based workflows, code reviews, automated testing, IaC, containerization, security by design, and mentoring of engineers.
Required Skills
- Programming & data processing: Advanced SQL and Python, plus Scala/Java for Spark/Flink. Go is a plus.
- Cloud data platforms: Hands-on experience with one or more of BigQuery, Snowflake, Redshift, Synapse, or Databricks SQL; deep understanding of cloud DW vs. traditional MPP trade-offs.
- Data modelling: Dimensional (star/snowflake), Data Vault 2.0, SCD implementations, and schema versioning/evolution.
- Streaming: Kafka, Pub/Sub, or Kinesis with Spark Structured Streaming or Flink; event schemas (Avro/Protobuf), idempotency, back-pressure, replay.
- Orchestration & ELT: Airflow/Composer/Managed Workflows and/or dbt (or equivalents) for transformations, testing, and documentation.
- CI/CD & platform engineering: Git workflows (trunk/PR), automated build/test/deploy, artifact versioning, Terraform/CloudFormation, Docker/Kubernetes.
- Data quality & governance: Data contracts, testing frameworks (e.g., Great Expectations/dbt tests), catalogue/lineage tooling, access policies.
- BI & semantics: Experience shaping semantic layers, KPI/metric logic, and consumption models; familiarity with enterprise BI tools and metric stores.
- AI readiness: Understanding of feature engineering, data for ML/GenAI, knowledge graphs/ontologies, and patterns that enable future knowledge layers.
- Security & compliance: IAM design, encryption, key management, masking/tokenization, and auditability in regulated environments.