Okta, 12.03.2026

Staff Site Reliability Engineer - Observability GCP

Bellevue

Responsibilities

01 Design, build, and maintain scalable observability infrastructure using tools such as Terraform
02 Optimize the collection, processing, and storage of observability data to ensure high reliability and low latency for our Splunk and Grafana services
03 Participate in on-call rotations and lead post-incident reviews to drive systemic improvements and observability-driven development
04 Eliminate toil by automating the deployment and scaling of observability agents and collectors

Requirements

01 5+ years of experience scaling and managing observability on Google Cloud Platform (GCP)
02 Expertise in creating intuitive, actionable Splunk or Grafana dashboards that correlate data across multiple sources
03 3+ years of experience in an SRE, DevOps, or Systems Engineering role with a focus on high-availability systems
04 Strong coding skills in Python and/or Go for building internal tools and automating workflows
05 Deep understanding of Linux internals, networking (TCP/IP, DNS, load balancing), and container orchestration (Kubernetes/GKE)
06 A data-driven approach to debugging complex, cross-service performance bottlenecks
07 Ability to access federal environments and/or protected federal data
08 U.S. Person status (e.g., U.S. Citizen, U.S. National, Lawful Permanent Resident, Refugee, or Asylee)