Crusoe · 24.03.2026

Principal Production Engineer

Full-time · On-site

Responsibilities

  • Own the reliability and scalability of Crusoe's cloud infrastructure (compute, storage, and networking): defining SLOs, leading incident response, and driving systemic improvements that reduce toil and raise the bar across the platform
  • Build and mature the observability and tooling layer, from network fabric telemetry and storage health to control plane instrumentation and on-call tooling, so the team can detect, diagnose, and resolve issues faster than customers notice them
  • Drive platform reliability improvements across the full cloud stack, partnering closely with software, hardware, and network engineering teams to influence architecture decisions early, before they become operational debt
  • Act as a trusted advisor to senior leadership, bringing perspective on observability trends and advocating for the right long-term technology investments
  • Set the technical standards for how Crusoe's production engineering organization builds, operates, and scales: defining on-call culture, incident frameworks, and reliability practices that grow with the company
  • Mentor senior and staff engineers, elevate the team's collective technical depth, and be the person others seek out when the problem is genuinely hard

Requirements

  • 15+ years of experience in infrastructure, networking, or production engineering, with meaningful time at companies operating at internet scale (cloud providers, CDNs, large-scale social/media platforms, or similar)
  • Strong systems fundamentals: Linux, distributed systems, storage, compute scheduling; you understand the full stack from hardware up
  • Hands-on data center experience: you've done physical infra, understand power and thermal constraints, and can reason about reliability at the facility level, not just the server level
  • The ability to write code: not necessarily full-time, but enough to automate what shouldn't be manual, instrument what isn't observable, and build tooling your team will actually use
  • Excellent analytical and problem-solving skills, including the ability to synthesize ambiguous customer and system signals into clear problem statements and designs
  • Strong incident command: you lead calmly under pressure, communicate clearly during outages, and run blameless retrospectives that actually improve systems

Benefits

  • Compensation: $261,000 - $326,000 + Bonus
  • Restricted Stock Units in a fast-growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Subscription to the Calm app
  • MetLife Legal
  • Company-paid commuter benefit: $300 per month