Baseten
12.05.2026
SRE
Full-time, Remote
Responsibilities
- Own the reliability of Baseten's multi-cloud Kubernetes infrastructure, including incident response, post-mortems, and remediation tracking
- Build and maintain observability infrastructure (metrics, logging, dashboards, and alerting) as code
- Author, validate, and improve runbooks for recurring failure patterns, ensuring they're structured for low-context, safe execution
- Identify high-frequency failure patterns and convert them into automated mitigations or self-healing automations
- Diagnose and resolve runtime issues related to latency, memory behavior, GPU utilization, concurrency, and model lifecycle management
- Define and instrument SLOs and SLIs across customer workloads and internal services
- Navigate ambiguity, make principled tradeoffs, and avoid unnecessary complexity in the systems you build and the processes you define
Requirements
- Extensive hands-on experience with Kubernetes (multi-cloud experience across EKS, GKE, or similar is a strong plus)
- Experience in building and maintaining scalable infrastructure
- Strong foundation in observability tooling: metrics (VictoriaMetrics, Prometheus), logging (Loki, ELK), dashboards (Grafana), and alerting pipelines. Observability-as-code experience is a plus
- Experience with infrastructure-as-code (Terraform, Helm) and GitOps workflows (Flux CD, ArgoCD)
- Experience writing and improving runbooks, leading incident response, and doing post-mortem analysis
- Comfort working at the intersection of engineering and operations: you write code, but you also think deeply about process, escalation paths, and operational leverage
- Familiarity with incident management platforms (incident.io or similar) is a plus
- No prior ML experience required, but curiosity about how ML models are deployed and served at scale will serve you well
Benefits
- Competitive compensation, including meaningful equity
- 100% coverage of medical, dental, and vision insurance for employees and dependents
- Flexible PTO policy, including a company-wide Winter Break (offices closed from Christmas Eve to New Year's Day)
- Paid parental leave
- Fertility and family-building stipend through Carrot
- Company-facilitated 401(k)
- Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities