Crusoe, 24.03.2026

Principal Production Engineer

Full-time, Office

Responsibilities

01 Own the reliability and scalability of Crusoe's cloud infrastructure (compute, storage, and networking), defining SLOs, leading incident response, and driving systemic improvements that reduce toil and raise the bar across the platform
02 Build and mature the observability and tooling layer, from network fabric telemetry and storage health to control plane instrumentation and on-call tooling, so the team can detect, diagnose, and resolve issues faster than customers notice them
03 Drive platform reliability improvements across the full cloud stack, partnering closely with software, hardware, and network engineering teams to influence architecture decisions early, before they become operational debt
04 Act as a trusted advisor to senior leadership, bringing perspective on observability trends and advocating for the right long-term technology investments
05 Set the technical standards for how Crusoe's production engineering organization builds, operates, and scales, defining on-call culture, incident frameworks, and reliability practices that grow with the company
06 Mentor senior and staff engineers, elevate the team's collective technical depth, and be the person others seek out when the problem is genuinely hard

Requirements

01 15+ years of experience in infrastructure, networking, or production engineering, with meaningful time at companies operating at internet scale (cloud providers, CDNs, large-scale social/media platforms, or similar)
02 Strong systems fundamentals: Linux, distributed systems, storage, compute scheduling; you understand the full stack from hardware up
03 Hands-on data center experience: you've done physical infra, understand power and thermal constraints, and can reason about reliability at the facility level, not just the server level
04 The ability to write code, not necessarily full-time, but enough to automate what shouldn't be manual, instrument what isn't observable, and build tooling your team will actually use
05 Excellent analytical and problem-solving skills, including the ability to synthesize ambiguous customer and system signals into clear problem statements and designs
06 Strong incident command: you lead calmly under pressure, communicate clearly during outages, and run blameless retrospectives that actually improve systems

Benefits

01 Compensation: $261,000 - $326,000 + Bonus
02 Restricted Stock Units in a fast-growing, well-funded technology company
03 Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
04 Employer contributions to HSA
05 Paid Parental Leave
06 Paid life insurance, short-term and long-term disability
07 Teladoc
08 401(k) with a 100% match up to 4% of salary
09 Generous paid time off and holiday schedule
10 Cell phone reimbursement
11 Tuition reimbursement
12 Subscription to the Calm app
13 MetLife Legal
14 Company-paid commuter benefit: $300 per month