Crusoe · 05.05.2026
Senior Production Engineer, Operational Excellence
Full-time · On-site
Responsibilities
- Collaborate with cross-functional teams to define and evolve availability metrics for Crusoe’s cloud platform, including establishing, measuring, and improving SLIs and SLOs
- Participate in production incident response, diagnosing and resolving service disruptions while contributing to post-incident reviews and root cause analysis
- Build, operate, and improve observability across Crusoe’s infrastructure using tools such as Prometheus, Grafana, Alertmanager, and OpenTelemetry
- Identify reliability risks, performance bottlenecks, and early indicators of potential production issues across distributed systems
- Develop automation and tooling that reduces operational toil, improves recovery times, and enables self-healing infrastructure
- Partner with compute, networking, storage, and platform teams to strengthen service resilience and disaster recovery capabilities
- Contribute to improving operational processes, knowledge sharing, and reliability best practices across the engineering organization
- Continue growing technical depth through mentorship, training, and hands-on work operating large-scale AI infrastructure
Requirements
- 5+ years of experience in Production Engineering, SRE, or large-scale infrastructure operations
- Experience supporting GPU workloads, HPC environments, or latency/throughput-sensitive distributed systems
- Strong knowledge of Linux/Unix systems, including debugging complex issues across kernel and user space
- Previous experience in infrastructure roles building or managing compute, storage, or networking platforms
- Understanding of modern cloud infrastructure fundamentals including Kubernetes, distributed systems, virtualization, and cloud platforms (AWS/GCP)
- Familiarity with incident management practices and reliability frameworks (SRE, ITIL, or similar)
- Experience with monitoring and observability tools such as Prometheus and Grafana, or a strong desire to deepen expertise in this area
- Familiarity with infrastructure-as-code and configuration management tools such as Terraform or Ansible
- Scripting or programming experience with languages such as Go, Python, C, or C++
- Strong communication skills and the ability to collaborate across engineering teams
- Ability to remain calm and effective while troubleshooting complex issues in high-impact production environments
- A growth mindset and strong interest in reliability engineering, automation, and operational excellence
Compensation and benefits
- Compensation range: $172,000 – $209,000 + Bonus
- Restricted Stock Units included in all offers
- Health insurance package options (HDHP and PPO, vision, dental)
- Employer contributions to HSA accounts
- Paid Parental Leave
- Paid life insurance, short-term and long-term disability
- Teladoc
- 401(k) with 100% match up to 4% of salary
- Generous paid time off and holiday schedule
- Cell phone reimbursement
- Tuition reimbursement
- Subscription to the Calm app
- MetLife Legal
- Company-paid commuter benefit: $300 per month