Crusoe · Posted 3 days ago
Manager, Data Center Operations
Full-time · On-site
Skills
GPU compute hardware troubleshooting, SuperMicro hardware, Jira, ServiceNow, DCIM, NetBox, AMD GPU clusters, MI300X, NVIDIA GPU platforms, H100, H200, B200, RoCE fabric topology, UPS systems, PDUs, generators, CRAC systems, CRAH systems
Responsibilities
- Own the daily operation, health, and availability of the OH5C data center
- Lead troubleshooting and repair of GPU compute hardware, including GPU trays, DIMMs, drives, cabling, and server nodes
- Drive rapid triage and repair while maintaining MTTR and uptime targets
- Coordinate RMAs and hardware support with OEM vendors, primarily SuperMicro
- Maintain spare-parts inventory and ensure critical hardware is available when needed
- Partner with Fleet Operations, SRE, networking, and infrastructure teams on escalations
- Lead, coach, and develop the on-site data center technician team
- Set clear expectations for safety, quality, responsiveness, and accountability
- Conduct regular one-on-ones, performance reviews, and development planning
- Support technician hiring, onboarding, training, and workforce planning
- Build a culture of technical precision, ownership, and continuous improvement
- Track and report site KPIs, including uptime, MTTR, SLA compliance, deployment velocity, and ticket aging
- Use operational data to identify recurring issues and improve reliability
- Maintain accurate break-fix workflows in Jira or a comparable ticketing system
- Provide clear operational updates, incident summaries, and corrective-action plans to senior leadership
- Serve as the primary on-site liaison with the colocation provider
- Hold facility partners accountable to SLAs related to power, cooling, security, and availability
- Maintain working knowledge of UPS systems, PDUs, generators, CRAC and CRAH systems, and supporting infrastructure
- Escalate and track facility issues through resolution
- Coordinate planned maintenance to minimize risk to production systems
- Maintain site runbooks, SOPs, emergency procedures, and hardware documentation
- Ensure work is completed in accordance with safety, security, and change-management standards
- Contribute to fleet-wide operating standards and knowledge sharing
- Maintain accurate asset, inventory, and configuration records
Requirements
- 5+ years of data center operations leadership experience in a production environment
- Experience managing and developing technical teams
- Hands-on experience troubleshooting enterprise server hardware, including GPU nodes, DIMMs, drives, cabling, and rack-level infrastructure
- Strong familiarity with SuperMicro hardware, diagnostics, event logs, and RMA processes
- Experience working in colocation environments and managing provider SLAs
- Working knowledge of data center electrical and mechanical systems
- Experience with Jira, ServiceNow, or a similar ticketing platform
- Strong understanding of incident management, root-cause analysis, and operational risk
- Clear written and verbal communication skills, including the ability to present technical and operational information to senior leaders
- Ability to work on-site in Springfield, Ohio, and support critical incidents as needed
Benefits and conditions
- Competitive compensation and equity
- Restricted Stock Units
- Paid time off, holidays, and leave programs
- Medical, dental, and vision insurance
- Employer HSA contributions
- Paid parental leave
- Life, short-term disability, and long-term disability insurance
- Professional development and tuition reimbursement
- Mental health and wellness
- Role is based on-site at Crusoe’s OH5C facility in Springfield, Ohio
- Periodic travel to other Crusoe sites may be required