Robinhood, 23.04.2026
Senior Software Engineer - Robinhood Command Center
Menlo Park
Responsibilities
- Serve as a senior technical leader driving the long-term reliability and observability strategy across Robinhood’s infrastructure
- Partner closely with engineers across many disciplines to raise the bar for operational excellence and incident response
- Lead incident mitigation efforts by coordinating service owners, facilitating time-sensitive decisions such as rollbacks and traffic shifts, and maintaining a clear source of truth during active incidents
- Develop and maintain incident management processes and procedures to ensure timely resolution and minimize customer impact
- Own incident discovery at the company level by defining and maintaining global dashboards and alerts tied to critical user journeys (CUJs), availability, and business-impact metrics
- Own and evolve incident response tooling and processes, including education, adoption, and measurement of MTTD/MTTR improvements
- Drive post-incident governance and learning, defining standards for postmortems, SEV reviews, and follow-up tracking to ensure durable reliability improvements
- Design and implement next-generation failure mitigation strategies that avoid full-region or full-datacenter failovers
- Define and build frameworks to improve monitoring, alerting, and observability across hundreds of services and systems
- Define and own the roadmap for bringing observability to critical user journeys across Robinhood’s products
- Deliver key insights and executive-level reporting to enable better business decisions around service quality and reliability
- Act as a force multiplier through mentoring, technical influence, and contributions to hiring and engineering culture
Requirements
- 5+ years of software engineering experience, including significant experience operating production systems
- 2+ years focused on reliability engineering, infrastructure, distributed systems, or production operations
- Hands-on experience serving in incident leadership roles (e.g., IMOC, incident commander, primary oncall)
- Strong communication and cross-functional collaboration skills, especially during high-severity incidents
- Deep knowledge of systems reliability, observability frameworks, and fault-tolerant architecture design
- Experience with multi-region or multi-cluster architectures, capacity planning, and failover strategies
- Familiarity with modern observability stacks (e.g., OpenTelemetry, Prometheus, Grafana)
- Demonstrated ability to drive measurable improvements in MTTD, MTTR, availability, or customer impact
Conditions
- Role is based in the Menlo Park, California office, with in-person attendance expected at least 3 days per week
- Performance-driven compensation with multipliers for outsized impact, bonus programs, equity ownership, and 401(k) matching
- 100% paid health insurance for employees with 90% coverage for dependents
- Lifestyle wallet - a highly flexible benefits spending account for wellness, learning, and more
- Employer-paid life & disability insurance, fertility benefits, and mental health benefits
- Time off to recharge including company holidays, paid time off, sick time, parental leave, and more
- Exceptional office experience with catered meals, events, and comfortable workspaces
- Base Pay Range: Zone 1 $196,000 - $230,000 USD
- Base Pay Range: Zone 2 $172,000 - $202,000 USD
- Base Pay Range: Zone 3 $153,000 - $179,000 USD