Databricks, 03.03.2026

Sr. Staff Technical Program Manager - Reliability

Bellevue

Responsibilities

  • Lead the strategy, execution, and continuous improvement of critical Reliability initiatives across infrastructure and product engineering teams
  • Lead cross-company programs to enhance reliability, performance, and operational excellence of multi-cloud infrastructure
  • Partner with senior engineering leaders to define Reliability strategy and set long-term goals
  • Execute multi-quarter programs to build the most reliable cloud platform
  • Anticipate risks, shape technical direction, and deliver complex programs across product, engineering, SRE, and cloud partner teams
  • Lead Reliability strategy and multi-quarter roadmaps with senior engineering leadership
  • Ensure clarity and alignment on priorities across engineering teams (Platform Engineering, Compute Fleet Management, SRE, Security, Cloud Partnerships)
  • Own program execution end-to-end: planning, risk management, dependency mapping, trade-off decisions, status reporting, and delivery
  • Identify gaps in process or architecture and work with TLs to drive organizational or technical improvements
  • Partner deeply with engineering to influence technical direction and facilitate alignment between cross-functional teams
  • Bring systems thinking to diagnose reliability bottlenecks and drive improvements to scalability, fault tolerance, automation, and operational tooling
  • Drive adoption of reliability best practices across engineering teams (error budgets, incident reviews, design-for-resilience patterns, operational readiness)
  • Define and implement program governance, repeatable processes, metrics, and documentation to scale reliability efforts
  • Evangelize reliability expectations and engineer-empowering processes to reduce operational load and improve incident preparedness

Requirements

  • 10+ years of experience managing and delivering large-scale technical programs in cloud infrastructure, distributed systems, SRE, or platform engineering environments
  • Experience developing infrastructure at two or more hyperscale cloud providers (AWS, Azure, GCP), with knowledge of cloud primitives, multi-AZ/region architecture, and control plane/data plane patterns
  • Demonstrated success leading Reliability Programs at scale (availability, failover, operational excellence, incident reduction, dependency hardening)
  • Strong understanding of infrastructure, distributed systems, or SRE practices; previous engineering or SRE experience is highly preferred
  • Experience partnering directly with senior engineering leadership to define strategy and drive large, multi-team initiatives
  • Ability to translate ambiguous goals into actionable program plans with clear milestones, KPIs, and success metrics
  • Demonstrated ability to manage complex cross-organizational dependencies, technical risks, and multi-quarter timelines
  • Experience delivering programs across multiple clouds and/or large-scale cloud-native services
  • Experience building and scaling engineering processes, operational frameworks, and stakeholder alignment mechanisms
  • Background in distributed systems engineering, SRE, platform infrastructure, or cloud services
  • Experience with large-scale compute fleets, container orchestration, autoscaling, or control-plane architecture
  • Familiarity with reliability methodologies (SLOs, error budgets, chaos engineering, failure mode analysis, incident management frameworks)
  • Expertise using Jira or equivalent tools for program tracking and execution
  • Bachelor’s degree in Computer Science, Engineering, or related technical field; advanced degree preferred

Conditions

  • Pay range transparency with expected salary range for non-commissionable roles or on-target earnings for commissionable roles
  • Total compensation package may include eligibility for annual performance bonus