Baseten, 12.05.2026

SRE

Full-time, Remote

Responsibilities

  • Own the reliability of Baseten's multi-cloud Kubernetes infrastructure, including incident response, post-mortems, and remediation tracking
  • Build and maintain observability infrastructure (metrics, logging, dashboards, and alerting) as code
  • Author, validate, and improve runbooks for recurring failure patterns, ensuring they're structured for low-context, safe execution
  • Identify high-frequency failure patterns and convert them into automated mitigations or self-healing automations
  • Diagnose and resolve runtime issues related to latency, memory behavior, GPU utilization, concurrency, and model lifecycle management
  • Define and instrument SLOs and SLIs across customer workloads and internal services
  • Navigate ambiguity, make principled tradeoffs, and avoid unnecessary complexity in the systems you build and the processes you define

Requirements

  • Extensive hands-on experience with Kubernetes (multi-cloud experience across EKS, GKE, or similar is a strong plus)
  • Experience building and maintaining scalable infrastructure
  • Strong foundation in observability tooling: metrics (VictoriaMetrics, Prometheus), logging (Loki, ELK), dashboards (Grafana), and alerting pipelines. Observability-as-code experience is a plus
  • Experience with infrastructure-as-code (Terraform, Helm) and GitOps workflows (Flux CD, ArgoCD)
  • Experience writing and improving runbooks, leading incident response, and doing post-mortem analysis
  • Comfort working at the intersection of engineering and operations: you write code, but you also think deeply about process, escalation paths, and operational leverage
  • Familiarity with incident management platforms (incident.io or similar) is a plus
  • No prior ML experience required, but curiosity about how ML models are deployed and served at scale will serve you well

Benefits

  • Competitive compensation, including meaningful equity
  • 100% coverage of medical, dental, and vision insurance for employees and dependents
  • Flexible PTO policy, including a company-wide Winter Break (offices closed from Christmas Eve to New Year's Day)
  • Paid parental leave
  • Fertility and family-building stipend through Carrot
  • Company-facilitated 401(k)
  • Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities