GitLab, 08.04.2026
Senior Site Reliability Engineer, Tenant Services: Geo
Remote
Responsibilities
- Execute Dedicated Geo migrations and cutovers end-to-end, including planning, pre-cutover validation, execution, and post-cutover verification and cleanup
- Join the team's shift and weekend coverage rotation for Dedicated cutovers across EMEA and US hours
- Participate in the SaaS Site Reliability Engineering (SRE) on-call rotation to respond to incidents that impact GitLab.com availability
- Operate and improve the Geo operational surface for Dedicated, including environment preparation and data hygiene checks prior to migrations
- Execute replication, validation, and cutover procedures
- Handle Geo-related escalations from Support and internal partners
- Design, build, and maintain automation, tooling, and runbooks that make migrations, cutovers, and Geo escalations repeatable
- Run infrastructure with tools such as Ansible, Chef, Terraform, GitLab CI/CD, and Kubernetes
- Build and maintain monitoring, alerting, and dashboards that detect symptoms early and track migration success metrics
- Collaborate with the core Geo team on improving Geo features and operability
- Collaborate with Dedicated migrations and Support on migration planning and customer communications
- Contribute to readiness reviews, incident reviews, and root cause analyses
- Document every action, including runbooks, architecture decisions, and post-incident reviews
- Proactively identify and reduce toil by automating repetitive operational work
Requirements
- Experience operating highly available distributed systems at scale, ideally in a SaaS environment with customer-facing SLAs
- Hands-on experience with at least one major cloud provider (e.g., Google Cloud Platform or Amazon Web Services), including networking, storage, and managed services
- Experience with Kubernetes and its ecosystem (e.g., Helm), including deploying and troubleshooting workloads
- Experience with infrastructure as code and configuration management tools such as Terraform, Ansible, or Chef
- Strong programming skills in at least one general-purpose language (preferably Go or Ruby) and proficiency with scripting (e.g., Shell, Python)
- Experience with observability systems (e.g., Prometheus, Grafana, logging stacks) and using metrics and logs to troubleshoot performance and reliability issues
- Practical exposure to data replication, backup/restore, or migration scenarios where data integrity and downtime risk must be carefully managed
- Comfort participating in an on-call rotation, investigating incidents across the stack, and driving follow-through on corrective actions
- Ability to engage directly with enterprise customers during migrations and incidents, including on live calls and through clear written communication
Conditions
- Join the team's shift and weekend coverage rotation for Dedicated cutovers across EMEA and US hours
- Participate in the SaaS Site Reliability Engineering (SRE) on-call rotation