Vapi, 04.06.2026
Member of Technical Staff, Site Reliability Engineer
Full-time, Remote
Responsibilities
- 01 Join the oncall rotation and analyze 15 stability-gap incidents to create a prioritized reliability backlog
- 02 Define the first set of SLOs for the call-completion path
- 03 Stand up error budgets and SLO-based alerting in Chronosphere/Prometheus for the highest-impact services
- 04 Run the first proper load test against provider rate limits and per-org concurrency
- 05 Tune autoscaling for wscaler / workerpool-cron-scaler
- 06 Ship a real platform service (capacity forecaster, auto-remediation, or oncall tooling) in Go or TypeScript
- 07 Own the postmortem process and drive a measurable improvement in p99 call completion or MTTR
Requirements
- 01 Experience running incident command and postmortem discipline at scale on a real oncall rotation
- 02 Experience operating SLOs and error budgets in Chronosphere, Prometheus, Grafana, or Datadog
- 03 Experience in capacity planning and load testing for production systems with real users
- 04 Fluency in Kubernetes production ops (pod crash diagnosis, HPA/VPA tuning, PodDisruptionBudgets, graceful shutdown)
- 05 Knowledge of backpressure and autoscaling patterns (KEDA, custom metrics scaling)
- 06 Ability to ship code, not just scripts, and build platform services in Go or TypeScript
- 07 Background in a real-time / latency-sensitive product where "degraded" means a dropped call, not a slow dashboard
Benefits
- 01 Competitive salary and excellent equity ownership
- 02 Comprehensive health coverage (medical, dental, vision)
- 03 Team love and quarterly off-sites
- 04 Flexible time off
- 05 Catered meals, transportation, gym, and $10k annual L&D budget