GitLab, 08.04.2026

Senior Site Reliability Engineer, Tenant Services: Geo

Remote

Responsibilities

01 Execute Dedicated Geo migrations and cutovers end-to-end, including planning, pre-cutover validation, execution, and post-cutover verification and cleanup
02 Join the team's shift and weekend coverage rotation for Dedicated cutovers across EMEA and US hours
03 Participate in the SaaS Site Reliability Engineering (SRE) on-call rotation to respond to incidents that impact GitLab.com availability
04 Operate and improve the Geo operational surface for Dedicated, including environment preparation and data hygiene checks prior to migrations
05 Execute replication, validation, and cutover procedures
06 Handle Geo-related escalations from Support and internal partners
07 Design, build, and maintain automation, tooling, and runbooks that make migrations, cutovers, and Geo escalations repeatable
08 Run infrastructure with tools such as Ansible, Chef, Terraform, GitLab CI/CD, and Kubernetes
09 Build and maintain monitoring, alerting, and dashboards that detect symptoms early and track migration success metrics
10 Collaborate with the core Geo team on improving Geo features and operability
11 Collaborate with Dedicated migrations and Support on migration planning and customer communications
12 Contribute to readiness reviews, incident reviews, and root cause analyses
13 Document every action, including runbooks, architecture decisions, and post-incident reviews
14 Proactively identify and reduce toil by automating repetitive operational work

Requirements

01 Experience operating highly available distributed systems at scale, ideally in a SaaS environment with customer-facing SLAs
02 Hands-on experience with at least one major cloud provider (e.g., Google Cloud Platform or Amazon Web Services), including networking, storage, and managed services
03 Experience with Kubernetes and its ecosystem (e.g., Helm), including deploying and troubleshooting workloads
04 Experience with infrastructure as code and configuration management tools such as Terraform, Ansible, or Chef
05 Strong programming skills in at least one general-purpose language (preferably Go or Ruby) and proficiency with scripting (e.g., Shell, Python)
06 Experience with observability systems (e.g., Prometheus, Grafana, logging stacks) and using metrics and logs to troubleshoot performance and reliability issues
07 Practical exposure to data replication, backup/restore, or migration scenarios where data integrity and downtime risk must be carefully managed
08 Comfort participating in an on-call rotation, investigating incidents across the stack, and driving follow-through on corrective actions
09 Ability to engage directly with enterprise customers during migrations and incidents, including on live calls and through clear written communication

01 Join the team's shift and weekend coverage rotation for Dedicated cutovers across EMEA and US hours
02 Participate in the SaaS Site Reliability Engineering (SRE) on-call rotation