Grafana Labs · Posted 6 days ago
Staff Software Engineer - Databases SRE | Sweden | Remote
Sweden (Remote)
Responsibilities
- Partner closely with product engineering squads (embedded model)
- Own production reliability for high-SLA and complex customer environments
- Design and implement automation to scale our reliability practices
- Ensure our customers meet our SLO targets
- Define and evolve per-tenant SLOs and reliability models
- Proactively reduce SLO burn to prevent repeat incidents
- Serve as a primary escalation point and on-call for relevant incidents
- Lead customer-impacting incident response and post-incident reviews
- Contribute to design docs and code reviews
- Influence feature design to ensure production scalability and operability
- Build automation to eliminate toil where needed
- Improve alert quality and reduce noisy escalations
- Review and create SLOs, proactively investigating ways to reduce budget burn
- Improve observability of customers within their environments
- Design and implement solutions to ensure reliability and scalability of environments
- Develop fault-tolerant design patterns ensuring reliability at all stages of the service lifecycle
- Collaborate with engineering leaders to define and influence product strategy, roadmaps, and technical designs
- Participate in PR review and collaborate with other engineers on their design docs
- Teach others about Site Reliability Engineering and communicate best practices
- Participate in incident response when applicable, including investigation
Requirements
- 8+ years of engineering experience
- 4+ years in SRE/CRE/production engineering
- Strong preference for those with formal customer reliability engineering experience
- Strong Kubernetes experience in AWS, GCP, or Azure
- Familiarity with infrastructure-as-code tooling (Helm, Terraform, Jsonnet, etc.)
- Strong experience with technical leadership, leading a team through projects
- Experience mentoring other engineers on the team
- Experience operating multi-tenant systems in production
- Strong experience designing and implementing SLOs
- Experience with one or more programming languages (e.g. Go, Python, or Java)
- Experience with Linux operating system internals
- Knowledge of networking, cloud storage, and scaling
- Excellent problem-solving and troubleshooting skills
- Experience participating calmly and actively in blame-free incident response
- Ability to write high-quality post-incident reviews (PIRs)
- Ability to reason about performance, scaling, and failure modes
- Comfortable working within an engineering team with strong autonomy and self-direction
- Ability to partner deeply with product engineering teams
- Intellectual curiosity, transparency, high bias towards action, and kindness
Conditions
- Remote opportunity
- Open to candidates from the UK, Sweden, Spain, or Germany
- Access to modern AI coding assistants (within security guidelines) with company-funded usage budget
- Access to frontier models (e.g., GPT-Codex 5/3, Claude Opus 4.6, Gemini 3 Pro)
- 100% remote company with global team members
- Backed by leading investors