Crusoe, 05.05.2026

Senior Production Engineer, Operational Excellence

Full-time, On-site

Responsibilities

  • Collaborate with cross-functional teams to define and evolve availability metrics for Crusoe’s cloud platform, including establishing, measuring, and improving SLIs and SLOs
  • Participate in production incident response, diagnosing and resolving service disruptions while contributing to post-incident reviews and root cause analysis
  • Build, operate, and improve observability across Crusoe’s infrastructure using tools such as Prometheus, Grafana, Alertmanager, and OpenTelemetry
  • Identify reliability risks, performance bottlenecks, and early indicators of potential production issues across distributed systems
  • Develop automation and tooling that reduces operational toil, improves recovery times, and enables self-healing infrastructure
  • Partner with compute, networking, storage, and platform teams to strengthen service resilience and disaster recovery capabilities
  • Contribute to improving operational processes, knowledge sharing, and reliability best practices across the engineering organization
  • Continue growing technical depth through mentorship, training, and hands-on work operating large-scale AI infrastructure

Requirements

  • 5+ years of experience in Production Engineering, SRE, or large-scale infrastructure operations
  • Experience supporting GPU workloads, HPC environments, or latency/throughput-sensitive distributed systems
  • Strong knowledge of Linux/Unix systems, including debugging complex issues across kernel and user space
  • Previous experience in infrastructure roles building or managing compute, storage, or networking platforms
  • Understanding of modern cloud infrastructure fundamentals including Kubernetes, distributed systems, virtualization, and cloud platforms (AWS/GCP)
  • Familiarity with incident management practices and reliability frameworks (SRE, ITIL, or similar)
  • Experience with monitoring and observability tools such as Prometheus and Grafana, or a strong desire to deepen expertise in this area
  • Familiarity with infrastructure-as-code and configuration management tools such as Terraform or Ansible
  • Scripting or programming experience with languages such as Go, Python, C, or C++
  • Strong communication skills and the ability to collaborate across engineering teams
  • Ability to remain calm and effective while troubleshooting complex issues in high-impact production environments
  • A growth mindset and strong interest in reliability engineering, automation, and operational excellence

Compensation and Benefits

  • Compensation range: $172,000 – $209,000 + Bonus
  • Restricted Stock Units included in all offers
  • Health insurance package options (HDHP and PPO, vision, dental)
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Subscription to the Calm app
  • MetLife Legal
  • Company-paid commuter benefit: $300 per month