Crusoe, 24.03.2026
Principal Production Engineer
Full-time, on-site
Responsibilities
- Own the reliability and scalability of Crusoe's cloud infrastructure (compute, storage, and networking), defining SLOs, leading incident response, and driving systemic improvements that reduce toil and raise the bar across the platform
- Build and mature the observability and tooling layer, from network fabric telemetry and storage health to control plane instrumentation and on-call tooling, so the team can detect, diagnose, and resolve issues faster than customers notice them
- Drive platform reliability improvements across the full cloud stack, partnering closely with software, hardware, and network engineering teams to influence architecture decisions early, before they become operational debt
- Act as a trusted advisor to senior leadership, bringing perspective on observability trends and advocating for the right long-term technology investments
- Set the technical standards for how Crusoe's production engineering organization builds, operates, and scales, defining on-call culture, incident frameworks, and reliability practices that grow with the company
- Mentor senior and staff engineers, elevate the team's collective technical depth, and be the person others seek out when the problem is genuinely hard
Requirements
- 15+ years of experience in infrastructure, networking, or production engineering, with meaningful time at companies operating at internet scale (cloud providers, CDNs, large-scale social/media platforms, or similar)
- Strong systems fundamentals: Linux, distributed systems, storage, compute scheduling; you understand the full stack from hardware up
- Hands-on data center experience: you've done physical infra, understand power and thermal constraints, and can reason about reliability at the facility level, not just the server level
- The ability to write code: not necessarily full-time, but enough to automate what shouldn't be manual, instrument what isn't observable, and build tooling your team will actually use
- Excellent analytical and problem-solving skills, including the ability to synthesize ambiguous customer and system signals into clear problem statements and designs
- Strong incident command: you lead calmly under pressure, communicate clearly during outages, and run blameless retrospectives that actually improve systems
Compensation and benefits
- Compensation: $261,000 to $326,000 + bonus
- Restricted Stock Units in a fast-growing, well-funded technology company
- Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
- Employer contributions to HSA accounts
- Paid parental leave
- Paid life insurance, short-term and long-term disability
- Teladoc
- 401(k) with a 100% match up to 4% of salary
- Generous paid time off and holiday schedule
- Cell phone reimbursement
- Tuition reimbursement
- Subscription to the Calm app
- MetLife Legal
- Company-paid commuter benefit: $300 per month