Grafana Labs, posted 6 days ago
Staff Software Engineer - Databases SRE | Spain | Remote
Spain (Remote)
Responsibilities
- Partner closely with product engineering squads (embedded model)
- Own production reliability for high-SLA and complex customer environments
- Design and implement automation to scale reliability practices
- Ensure customers meet SLO targets
- Define and evolve per-tenant SLOs and reliability models
- Proactively reduce SLO burn to prevent repeat incidents
- Serve as a primary escalation point and on-call for relevant incidents
- Lead customer-impacting incident response and post-incident reviews
- Contribute to design docs and code reviews
- Influence feature design to ensure production scalability and operability
- Build automation to eliminate toil where needed
- Improve alert quality and reduce noisy escalations
- Hold regular 1:1s with manager and colleagues
- Review and create SLOs, and investigate ways to reduce error budget burn through improvements to monitoring, automation, self-healing, and auto-scaling
- Improve observability of customers within their environments
- Design and implement solutions to ensure reliability and scalability for rapidly increasing demands
- Develop fault-tolerant design patterns that consider reliability at all stages of the service lifecycle
- Collaborate with Engineering Leaders to define and influence product strategy, roadmaps, and technical designs
- Participate in PR review and collaborate with other engineers on their Design Docs
- Teach others about Site Reliability Engineering and communicate best practices for early feature development
- Participate in Incident Response when applicable, including investigation
Requirements
- 8+ years of engineering experience, including 4+ in SRE/CRE/production engineering
- Strong preference for formal customer reliability engineering experience
- Strong Kubernetes experience in AWS, GCP, or Azure
- Familiarity with infrastructure-as-code tooling (Helm, Terraform, Jsonnet, etc.)
- Strong experience with technical leadership: leading projects, mentoring engineers, and serving as a force-multiplier
- Experience operating multi-tenant systems in production
- Strong experience designing and implementing SLOs
- Experience with one or more programming languages (e.g., Go, Python, Java)
- Experience with Linux operating system internals, and knowledge of networking, cloud storage, and scaling
- Excellent problem-solving and troubleshooting skills
- Experience participating in blame-free Incident Response, following up on actions, and writing high-quality PIRs
- Ability to reason about performance, scaling, and failure modes
- Comfortable working in an engineering team that encourages autonomy and self-direction
- Ability to partner deeply with product engineering teams
- Intellectually curious and kind, with a default to transparency and a high bias towards action
Conditions
- Fully remote position
- Eligible candidates from UK, Sweden, Spain, or Germany
- 100% remote company with 1,600+ team members across 40+ countries
- Company-funded budget for modern AI coding assistants
- Access to frontier models (e.g., GPT-Codex 5/3, Claude Opus 4.6, Gemini 3 Pro)