Baseten · 18.05.2026

Manager, Cloud Platform & Site Reliability

Full-time · Remote

Responsibilities

  • Lead, grow, and develop team leads across the Cloud Platform and Site Reliability Engineering orgs, building a culture of ownership, technical excellence, and continuous improvement
  • Set the technical direction and roadmap for infrastructure, reliability, and platform engineering at the org level, balancing near-term operational needs with long-term strategic investments
  • Own the reliability posture of the platform end-to-end, establishing and enforcing org-wide standards for SLOs/SLIs, incident response, observability-as-code, runbooks, and post-incident reviews
  • Drive cross-functional collaboration with product, engineering, and customer-facing teams to ensure infrastructure capabilities and reliability investments align with product goals and enterprise customer requirements
  • Oversee incident management and escalation processes for high-severity production issues, ensuring clear communication, rapid resolution, and systemic follow-through
  • Translate recurring operational pain points and customer feedback into roadmap priorities, product improvements, and runbook enhancements across both teams
  • Ensure best practices for CI/CD, infrastructure-as-code, GitOps, Kubernetes, and cloud resource management are consistently adopted and maintained across the org
  • Partner with forward-deployed and customer success teams to support enterprise accounts with strict SLAs and complex infrastructure requirements
  • Navigate ambiguity and make sound architectural and organizational tradeoffs, avoiding unnecessary complexity while enabling your teams to move fast
  • Demonstrate accountability, pride of ownership, and high standards, and expect the same from your leads and their teams

Requirements

  • Bachelor's, Master's, or Ph.D. degree in Computer Science, Engineering, Mathematics, or a related field
  • Proven experience managing managers and leading multiple high-performing infrastructure, platform, or SRE teams in a fast-paced, high-growth environment
  • Deep technical expertise in Kubernetes (multi-cloud across EKS, GKE, or similar), cloud infrastructure, and distributed systems, with the ability to engage credibly in architectural and operational decisions
  • Hands-on background with infrastructure-as-code (e.g., Terraform, Pulumi) and CI/CD tooling (e.g., GitHub Actions, GitLab CI, Jenkins); familiarity with GitOps workflows (e.g., Flux CD, ArgoCD, Helm)
  • Strong foundation in observability tooling: metrics (Prometheus, VictoriaMetrics), logging (Loki, ELK), dashboards (Grafana), and tracing (OpenTelemetry), plus a track record of raising reliability standards through SLOs, SLIs, and observability-as-code
  • Experience owning incident management and enterprise SLAs at scale, including executive-level communication during high-severity incidents and rigorous post-incident follow-through
  • Demonstrated ability to lead complex, multi-stakeholder technical initiatives from scoping through execution, balancing engineering excellence with pragmatic delivery
  • Strong communication skills with executive presence, capable of representing technical work clearly to both technical and non-technical audiences
  • No prior machine learning experience is required, but candidates should be open to learning about ML infrastructure and model serving

Benefits

  • Competitive compensation, including meaningful equity
  • 100% coverage of medical, dental, and vision insurance