Vapi · 04.06.2026

Member of Technical Staff, Site Reliability Engineer

Full-time · Remote

Responsibilities

  • 01 Join the oncall rotation and analyze 15 stability-gap incidents to create a prioritized reliability backlog
  • 02 Define the first set of SLOs for the call-completion path
  • 03 Stand up error budgets and SLO-based alerting in Chronosphere/Prometheus for the highest-impact services
  • 04 Run the first proper load test against provider rate limits and per-org concurrency
  • 05 Tune autoscaling for wscaler / workerpool-cron-scaler
  • 06 Ship a real platform service (capacity forecaster, auto-remediation, or oncall tooling) in Go or TypeScript
  • 07 Own the postmortem process and drive a measurable improvement in p99 call completion or MTTR

Requirements

  • 01 Experience running incident command and maintaining postmortem discipline at scale on a real oncall rotation
  • 02 Experience operating SLOs and error budgets in Chronosphere, Prometheus, Grafana, or Datadog
  • 03 Experience in capacity planning and load testing for production systems with real users
  • 04 Fluency in Kubernetes production ops (pod crash diagnosis, HPA/VPA tuning, PodDisruptionBudgets, graceful shutdown)
  • 05 Knowledge of backpressure and autoscaling patterns (KEDA, custom metrics scaling)
  • 06 Ability to ship code, not just scripts, and build platform services in Go or TypeScript
  • 07 Background in a real-time, latency-sensitive product where "degraded" means a dropped call, not a slow dashboard

Benefits

  • 01 Competitive salary and excellent equity ownership
  • 02 Comprehensive health coverage (medical, dental, vision)
  • 03 Team love and quarterly off-sites
  • 04 Flexible time off
  • 05 Catered meals, transportation, gym, and $10k annual L&D budget