Grafana Labs (posted 3 days ago)

Staff Software Engineer - Databases SRE | Ireland | Remote

Republic of Ireland (Remote)

Responsibilities

  • Partner closely with product engineering squads (embedded model)
  • Own production reliability for high-SLA and complex customer environments
  • Design and implement automation to scale our reliability practices
  • Ensure our customers meet our SLO targets
  • Define and evolve per-tenant SLOs and reliability models
  • Proactively reduce SLO burn to prevent repeat incidents
  • Serve as a primary escalation point and on-call for relevant incidents
  • Lead customer-impacting incident response and post-incident reviews
  • Contribute to design docs and code reviews
  • Influence feature design to ensure production scalability and operability
  • Build automation to eliminate toil where needed
  • Improve alert quality and reduce noisy escalations
  • Review and create SLOs, and proactively investigate ways to reduce error budget burn
  • Improve observability of customers within their environments
  • Design and implement solutions to ensure reliability and scalability
  • Develop fault-tolerant design patterns
  • Collaborate with Engineering Leaders to help define and influence product strategy
  • Participate in PR reviews and collaborate with other engineers
  • Teach others about Site Reliability Engineering and communicate best practices
  • Participate in Incident Response when applicable

Requirements

  • 8+ years of engineering experience, including 4+ in SRE/CRE/production engineering
  • Strong preference for those with formal customer reliability engineering experience
  • Strong Kubernetes experience in AWS, GCP, or Azure
  • Familiarity with infrastructure-as-code tooling (Helm, Terraform, Jsonnet, etc.)
  • Strong experience with technical leadership, leading a team through projects
  • Experience mentoring other engineers
  • Experience operating multi-tenant systems in production
  • Strong experience designing and implementing SLOs
  • Experience with one or more programming languages (e.g. Go, Python, Java)
  • Experience with Linux operating system internals
  • Knowledge of networking, cloud storage, and scaling
  • Excellent problem-solving and troubleshooting skills
  • Experience participating calmly and actively in blame-free Incident Response
  • Experience following up on actions and writing high-quality post-incident reviews (PIRs)
  • Ability to reason about performance, scaling, and failure modes
  • Comfortable working within an engineering team with a strong sense of autonomy
  • Ability to partner deeply with product engineering teams
  • Intellectually curious
  • Default to transparency
  • Strong bias towards action
  • Kind personality

Conditions

  • Remote opportunity
  • Looking for candidates from the UK, Sweden, Spain, Germany, or Ireland
  • Company-funded usage budget for AI coding assistants
  • Access to frontier models (GPT-Codex 5/3, Claude Opus 4.6, Gemini 3 Pro)
  • Regular 1:1s with manager and colleagues
  • 100% remote company
  • Team members across 40+ countries