Senior DevOps Engineer (DWH/ML Platform)
Senior DevOps Engineer wanted to build a scalable data platform using AWS, Kubernetes, and IaC. Responsibilities include developing infrastructure for Trino, Spark, and ML models. Requires expert-level Kubernetes, IaC, GitLab CI, and AWS experience.
We are looking for a DevOps Engineer who not only "keeps production running" but also builds a scalable data platform. You will develop the infrastructure on which Trino, Spark, and ML models run, following IaC and Kubernetes best practices.

Technology Stack
- Core Infra: AWS (EKS, VPC, IAM), Kubernetes, Terragrunt
- CI/CD: GitLab CI
- Compute & Query: Trino, Apache Spark
- Storage: S3 (Data Lake, Apache Iceberg), ClickHouse, ScyllaDB
- Orchestration: Apache Airflow (Kubernetes Executor)
- Observability: Prometheus, Grafana, ELK
- Deployment: Helm

What you will do

Kubernetes platform (EKS)
- Writing and supporting complex Helm charts for stateful applications (Trino, ClickHouse, Solr, ScyllaDB).
- Resource management and autoscaling (HPA/VPA, Cluster Autoscaler, Karpenter).
- Configuring network policies, Ingress, and a service mesh where necessary.

Infrastructure as Code
- Describing the entire infrastructure with Terragrunt: EKS clusters, VPC, IAM, S3, RDS, etc.
- Keeping the code DRY; managing Terraform state in AWS S3.
- Structuring code for multiple environments (dev/stage/prod).

CI/CD pipelines
- Building code and data delivery pipelines in GitLab CI.
- Configuring GitLab Runners (including on Kubernetes) and artifact caching.
- Automating the testing of infrastructure, Helm charts, and Terraform modules.

Observability
- Configuring metrics collection via Prometheus (ServiceMonitors, PodMonitors, exporters).
- Building Grafana dashboards for data components and infrastructure.
- Configuring alerting for critical scenarios: replication lag, Spark/Airflow task queues, resource utilization.
- Centralized log collection.

Data Ops
- Supporting Trino and Spark clusters and their integration with S3 (Iceberg) and databases (Solr, ScyllaDB).
- Configuring Airflow with the Kubernetes Executor; helping the DE team with infrastructure issues.
- Participating in incident response.

ML Support
- Keeping ML services (Solr, ScyllaDB, Redis) stable in production.
- Supporting infrastructure for MLflow, Feast, and inference services.
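To give a flavor of the observability work above: metrics collection for a component like Trino is typically wired up with a Prometheus Operator ServiceMonitor. The sketch below is purely illustrative; all names, namespaces, and labels are placeholders, not part of this team's actual setup.

```yaml
# Hypothetical ServiceMonitor scraping a Trino coordinator.
# Namespace, labels, and port name are illustrative placeholders.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: trino-coordinator
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: trino
      component: coordinator
  namespaceSelector:
    matchNames:
      - data-platform          # placeholder namespace
  endpoints:
    - port: metrics            # named port on the Service
      interval: 30s
      path: /metrics
```

Alerting rules for scenarios such as replication lag or a growing Airflow task queue would then be defined on top of the metrics this scrape collects.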
What we want to see

If you haven't worked with the tools below, it will unfortunately be difficult for us to work together:
- Kubernetes (expert level): you don't just use ready-made manifests; you understand EKS internals, can write your own Helm charts from scratch, and can debug complex problems (OOMKilled, Pending pods, PVC issues, networking).
- IaC: ability to structure code for multiple environments (dev/stage/prod).
- GitLab CI: deep understanding of .gitlab-ci.yml and experience building pipelines with complex logic.
- AWS: understanding of network interaction and permissions management (IAM Policies/Roles, IRSA).
- Experience operating a Big Data stack: e.g., Trino (Presto), Spark, Airflow.
- Experience with wide-column NoSQL databases (ScyllaDB/Cassandra).
- Experience with search engines (Solr or Elasticsearch).
- Understanding of MLOps processes and experience with ML infrastructure (MLflow, Feast, KServe).
- Experience with GitOps (ArgoCD, Flux).
- Experience with Apache Iceberg and Data Lake architecture.
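As an illustration of the "pipelines with complex logic" expectation, a .gitlab-ci.yml often gates Terragrunt changes with `rules`: plan on merge requests, manual apply on the default branch. This fragment is a sketch under assumed conventions; the stage names, image, and job names are hypothetical.

```yaml
# Hypothetical .gitlab-ci.yml fragment for IaC delivery.
# Image, stages, and job names are illustrative, not prescriptive.
stages:
  - plan
  - apply

terragrunt-plan:
  stage: plan
  image: alpine/terragrunt   # assumed runner image
  script:
    - terragrunt run-all plan --terragrunt-non-interactive
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'

terragrunt-apply:
  stage: apply
  image: alpine/terragrunt   # assumed runner image
  script:
    - terragrunt run-all apply --terragrunt-non-interactive
  rules:
    - if: '$CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH'
      when: manual           # human gate before touching prod state
```

The `rules:` keyword is what carries the branching logic here; caching, artifact passing between plan and apply, and per-environment child pipelines are the usual next layer of complexity.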