Crusoe · 05.05.2026

Senior Production Engineer, Operational Excellence

Full-time · On-site

Responsibilities

01 Collaborate with cross-functional teams to define and evolve availability metrics for Crusoe’s cloud platform, including establishing, measuring, and improving SLIs and SLOs
02 Participate in production incident response, diagnosing and resolving service disruptions while contributing to post-incident reviews and root cause analysis
03 Build, operate, and improve observability across Crusoe’s infrastructure using tools such as Prometheus, Grafana, Alertmanager, and OpenTelemetry
04 Identify reliability risks, performance bottlenecks, and early indicators of potential production issues across distributed systems
05 Develop automation and tooling that reduces operational toil, improves recovery times, and enables self-healing infrastructure
06 Partner with compute, networking, storage, and platform teams to strengthen service resilience and disaster recovery capabilities
07 Contribute to improving operational processes, knowledge sharing, and reliability best practices across the engineering organization
08 Continue growing technical depth through mentorship, training, and hands-on work operating large-scale AI infrastructure

Requirements

01 5+ years of experience in Production Engineering, SRE, or large-scale infrastructure operations
02 Experience supporting GPU workloads, HPC environments, or latency/throughput-sensitive distributed systems
03 Strong knowledge of Linux/Unix systems, including debugging complex issues across kernel and user space
04 Previous experience in infrastructure roles building or managing compute, storage, or networking platforms
05 Understanding of modern cloud infrastructure fundamentals, including Kubernetes, distributed systems, virtualization, and cloud platforms (AWS/GCP)
06 Familiarity with incident management practices and reliability frameworks (SRE, ITIL, or similar)
07 Experience with monitoring and observability tools such as Prometheus and Grafana, or a strong desire to deepen expertise in this area
08 Familiarity with infrastructure-as-code and configuration management tools such as Terraform or Ansible
09 Scripting or programming experience with languages such as Go, Python, C, or C++
10 Strong communication skills and the ability to collaborate across engineering teams
11 Ability to remain calm and effective while troubleshooting complex issues in high-impact production environments
12 A growth mindset and strong interest in reliability engineering, automation, and operational excellence