Databricks, 03.03.2026
Sr. Staff Technical Program Manager - Reliability
Bellevue
Responsibilities
- Lead the strategy, execution, and continuous improvement of critical Reliability initiatives across infrastructure and product engineering teams
- Lead cross-company programs to enhance reliability, performance, and operational excellence of multi-cloud infrastructure
- Partner with senior engineering leaders to define Reliability strategy and set long-term goals
- Execute multi-quarter programs to build the most reliable cloud platform
- Anticipate risks, shape technical direction, and deliver complex programs across product, engineering, SRE, and cloud partner teams
- Lead Reliability Strategy and Multi-Quarter Roadmaps with senior engineering leadership
- Ensure clarity and alignment on priorities across engineering teams (Platform Engineering, Compute Fleet Management, SRE, Security, Cloud Partnerships)
- Own program execution end-to-end: planning, risk management, dependency mapping, trade-off decisions, status reporting, and delivery
- Identify gaps in process or architecture and work with TLs to drive organizational or technical improvements
- Partner deeply with engineering to influence technical direction and facilitate alignment between cross-functional teams
- Bring systems thinking to diagnose reliability bottlenecks and drive improvements to scalability, fault tolerance, automation, and operational tooling
- Drive adoption of reliability best practices across engineering teams (error budgets, incident reviews, design-for-resilience patterns, operational readiness)
- Define and implement program governance, repeatable processes, metrics, and documentation to scale reliability efforts
- Evangelize reliability expectations and engineer-empowering processes to reduce operational load and improve incident preparedness
Requirements
- 10+ years of experience managing and delivering large-scale technical programs in cloud infrastructure, distributed systems, SRE, or platform engineering environments
- Experience developing infrastructure at two or more hyperscale cloud providers (AWS, Azure, GCP), with knowledge of cloud primitives, multi-AZ/region architecture, and control plane/data plane patterns
- Demonstrated success leading Reliability Programs at scale (availability, failover, operational excellence, incident reduction, dependency hardening)
- Strong understanding of infrastructure, distributed systems, or SRE practices; previous engineering or SRE experience is highly preferred
- Experience partnering directly with senior engineering leadership to define strategy and drive large, multi-team initiatives
- Ability to translate ambiguous goals into actionable program plans with clear milestones, KPIs, and success metrics
- Demonstrated ability to manage complex cross-organizational dependencies, technical risks, and multi-quarter timelines
- Experience delivering programs across multiple clouds and/or large-scale cloud-native services
- Experience building and scaling engineering processes, operational frameworks, and stakeholder alignment mechanisms
- Background in distributed systems engineering, SRE, platform infrastructure, or cloud services
- Experience with large-scale compute fleets, container orchestration, autoscaling, or control-plane architecture
- Familiarity with reliability methodologies (SLOs, error budgets, chaos engineering, failure mode analysis, incident management frameworks)
- Expertise using Jira or equivalent tools for program tracking and execution
- Bachelor’s degree in Computer Science, Engineering, or related technical field; advanced degree preferred
Conditions
- Pay range transparency with expected salary range for non-commissionable roles or on-target earnings for commissionable roles
- Total compensation package may include eligibility for annual performance bonus