Synthesia · 16 days ago

Senior Site Reliability Engineer

Full-time · Remote

Responsibilities

  • 01 Incident management & operational excellence: take ownership of the incident process, including on-call quality, response, and post-mortems, and drive down incident count, time-to-detect, and time-to-resolve
  • 02 Automation & reliability engineering: automate low-frequency, high-consequence operations (the certificate-renewal class of problem: rare, easy to forget, and outage-causing when missed), not just high-frequency toil
  • 03 A platform domain: over time, take deep ownership of a domain such as Temporal, observability, or Kubernetes operations, partnering with the engineers building in it
  • 04 Vendor & third-party management: own key external relationships and integrations (e.g. LLM API providers and other third-party services), which are managed manually and informally today
  • 05 FinOps: own cloud and platform cost visibility and efficiency, and the mechanics of how usage maps to billing

Requirements

  • 01 Strong production operations experience on AWS and Kubernetes
  • 02 Comfortable with MongoDB and with scripting/automation in Python
  • 03 An operations-and-reliability mindset: you take pride in systems that run quietly
  • 04 Sound judgement on incidents and risk; calm and clear under pressure
  • 05 Influences through relationships and evidence, not escalation; comfortable owning a domain and partnering across teams

Conditions

  • 01 Remote (US East Coast preferred, for timezone coverage)