Vapi · 04.06.2026

Member of Technical Staff, Site Reliability Engineer

Full-time · Remote

Responsibilities

01 Join the oncall rotation and analyze 15 stability-gap incidents to create a prioritized reliability backlog
02 Define the first set of SLOs for the call-completion path
03 Stand up error budgets and SLO-based alerting in Chronosphere/Prometheus for the highest-impact services
04 Run the first proper load test against provider rate limits and per-org concurrency
05 Tune autoscaling for wscaler / workerpool-cron-scaler
06 Ship a real platform service (capacity forecaster, auto-remediation, or oncall tooling) in Go or TypeScript
07 Own the postmortem process and drive a measurable improvement in p99 call completion or MTTR

Requirements

01 Experience running incident command and postmortem discipline at scale on a real oncall rotation
02 Experience operating SLOs and error budgets in Chronosphere, Prometheus, Grafana, or Datadog
03 Experience in capacity planning and load testing for production systems with real users
04 Fluency in Kubernetes production ops (pod crash diagnosis, HPA/VPA tuning, PodDisruptionBudgets, graceful shutdown)
05 Knowledge of backpressure and autoscaling patterns (KEDA, custom metrics scaling)
06 Ability to ship code, not just scripts, and build platform services in Go or TypeScript
07 Background in real-time / latency-sensitive products, where "degraded" means a dropped call, not a slow dashboard