Baseten · 18.05.2026
Manager, Cloud Platform & Site Reliability
Full-time · Remote
Responsibilities
- Lead, grow, and develop team leads across the Cloud Platform and Site Reliability Engineering orgs, building a culture of ownership, technical excellence, and continuous improvement
- Set the technical direction and roadmap for infrastructure, reliability, and platform engineering at the org level, balancing near-term operational needs with long-term strategic investments
- Own the reliability posture of the platform end-to-end, establishing and enforcing org-wide standards for SLOs/SLIs, incident response, observability-as-code, runbooks, and post-incident reviews
- Drive cross-functional collaboration with product, engineering, and customer-facing teams to ensure infrastructure capabilities and reliability investments align with product goals and enterprise customer requirements
- Oversee incident management and escalation processes for high-severity production issues, ensuring clear communication, rapid resolution, and systemic follow-through
- Translate recurring operational pain points and customer feedback into roadmap priorities, product improvements, and runbook enhancements across both teams
- Ensure best practices for CI/CD, infrastructure-as-code, GitOps, Kubernetes, and cloud resource management are consistently adopted and maintained across the org
- Partner with forward-deployed and customer success teams to support enterprise accounts with strict SLAs and complex infrastructure requirements
- Navigate ambiguity and make sound architectural and organizational tradeoffs, avoiding unnecessary complexity while enabling your teams to move fast
- Demonstrate accountability, pride of ownership, and high standards, and expect the same from your leads and their teams
Requirements
- Bachelor's, Master's, or Ph.D. degree in Computer Science, Engineering, Mathematics, or a related field
- Proven experience managing managers and leading multiple high-performing infrastructure, platform, or SRE teams in a fast-paced, high-growth environment
- Deep technical expertise in Kubernetes (multi-cloud across EKS, GKE, or similar), cloud infrastructure, and distributed systems, with the ability to engage credibly in architectural and operational decisions
- Hands-on background with infrastructure-as-code (e.g., Terraform, Pulumi) and CI/CD tooling (e.g., GitHub Actions, GitLab CI, Jenkins); familiarity with GitOps workflows (e.g., Flux CD, ArgoCD, Helm)
- Strong foundation in observability tooling, including metrics (Prometheus, VictoriaMetrics), logging (Loki, ELK), dashboards (Grafana), and tracing (OpenTelemetry), plus a track record of raising reliability standards through SLOs, SLIs, and observability-as-code
- Experience owning incident management and enterprise SLAs at scale, including executive-level communication during high-severity incidents and rigorous post-incident follow-through
- Demonstrated ability to lead complex, multi-stakeholder technical initiatives from scoping through execution, balancing engineering excellence with pragmatic delivery
- Strong communication skills with executive presence, capable of representing technical work clearly to both technical and non-technical audiences
- No prior machine learning experience is required, but candidates should be open to learning about ML infrastructure and model serving
Benefits
- Competitive compensation, including meaningful equity
- 100% coverage of medical, dental, and vision insurance