GitLab, 08.04.2026

Senior Site Reliability Engineer, Tenant Services: Geo

Remote

Responsibilities

  • Execute Dedicated Geo migrations and cutovers end-to-end, including planning, pre-cutover validation, execution, and post-cutover verification and cleanup
  • Join the team's shift and weekend coverage rotation for Dedicated cutovers across EMEA and US hours
  • Participate in the SaaS Site Reliability Engineering (SRE) on-call rotation to respond to incidents that impact GitLab.com availability
  • Operate and improve the Geo operational surface for Dedicated, including environment preparation and data hygiene checks prior to migrations
  • Execute replication, validation, and cutover procedures
  • Handle Geo-related escalations from Support and internal partners
  • Design, build, and maintain automation, tooling, and runbooks that make migrations, cutovers, and Geo escalations repeatable
  • Run infrastructure with tools such as Ansible, Chef, Terraform, GitLab CI/CD, and Kubernetes
  • Build and maintain monitoring, alerting, and dashboards that detect symptoms early and track migration success metrics
  • Collaborate with the core Geo team on improving Geo features and operability
  • Collaborate with the Dedicated migrations and Support teams on migration planning and customer communications
  • Contribute to readiness reviews, incident reviews, and root cause analyses
  • Document all work, including runbooks, architecture decisions, and post-incident reviews
  • Proactively identify and reduce toil by automating repetitive operational work

Requirements

  • Experience operating highly available distributed systems at scale, ideally in a SaaS environment with customer-facing SLAs
  • Hands-on experience with at least one major cloud provider (e.g., Google Cloud Platform or Amazon Web Services), including networking, storage, and managed services
  • Experience with Kubernetes and its ecosystem (e.g., Helm), including deploying and troubleshooting workloads
  • Experience with infrastructure as code and configuration management tools such as Terraform, Ansible, or Chef
  • Strong programming skills in at least one general-purpose language (preferably Go or Ruby) and proficiency with scripting (e.g., Shell, Python)
  • Experience with observability systems (e.g., Prometheus, Grafana, logging stacks) and using metrics and logs to troubleshoot performance and reliability issues
  • Practical exposure to data replication, backup/restore, or migration scenarios where data integrity and downtime risk must be carefully managed
  • Comfort participating in an on-call rotation, investigating incidents across the stack, and driving follow-through on corrective actions
  • Ability to engage directly with enterprise customers during migrations and incidents, including on live calls and through clear written communication

Working conditions

  • Join the team's shift and weekend coverage rotation for Dedicated cutovers across EMEA and US hours
  • Participate in the SaaS Site Reliability Engineering (SRE) on-call rotation