Crusoe (posted 3 days ago)

Manager, Data Center Operations

Full-time · On-site

Skills

GPU compute hardware troubleshooting, SuperMicro hardware, Jira, ServiceNow, DCIM, NetBox, AMD GPU clusters, MI300X, NVIDIA GPU platforms, H100, H200, B200, RoCE fabric topology, UPS systems, PDUs, generators, CRAC systems, CRAH systems

Responsibilities

01. Own the daily operation, health, and availability of the OH5C data center
02. Lead troubleshooting and repair of GPU compute hardware, including GPU trays, DIMMs, drives, cabling, and server nodes
03. Drive rapid triage and repair while meeting MTTR and uptime targets
04. Coordinate RMAs and hardware support with OEM vendors, primarily SuperMicro
05. Maintain spare-parts inventory and ensure critical hardware is available when needed
06. Partner with Fleet Operations, SRE, networking, and infrastructure teams on escalations
07. Lead, coach, and develop the on-site data center technician team
08. Set clear expectations for safety, quality, responsiveness, and accountability
09. Conduct regular one-on-ones, performance reviews, and development planning
10. Support technician hiring, onboarding, training, and workforce planning
11. Build a culture of technical precision, ownership, and continuous improvement
12. Track and report site KPIs, including uptime, MTTR, SLA compliance, deployment velocity, and ticket aging
13. Use operational data to identify recurring issues and improve reliability
14. Maintain accurate break-fix workflows in Jira or a comparable ticketing system
15. Provide clear operational updates, incident summaries, and corrective-action plans to senior leadership
16. Serve as the primary on-site liaison with the colocation provider
17. Hold facility partners accountable to SLAs for power, cooling, security, and availability
18. Maintain working knowledge of UPS systems, PDUs, generators, CRAC and CRAH systems, and supporting infrastructure
19. Escalate and track facility issues through to resolution
20. Coordinate planned maintenance to minimize risk to production systems
21. Maintain site runbooks, SOPs, emergency procedures, and hardware documentation
22. Ensure work is completed in accordance with safety, security, and change-management standards
23. Contribute to fleet-wide operating standards and knowledge sharing
24. Maintain accurate asset, inventory, and configuration records

Requirements

01. 5+ years of data center operations leadership experience in a production environment
02. Experience managing and developing technical teams
03. Hands-on experience troubleshooting enterprise server hardware, including GPU nodes, DIMMs, drives, cabling, and rack-level infrastructure
04. Strong familiarity with SuperMicro hardware, diagnostics, event logs, and RMA processes
05. Experience working in colocation environments and managing provider SLAs
06. Working knowledge of data center electrical and mechanical systems
07. Experience with Jira, ServiceNow, or a similar ticketing platform
08. Strong understanding of incident management, root-cause analysis, and operational risk
09. Clear written and verbal communication skills, including the ability to present technical and operational information to senior leaders
10. Ability to work on-site in Springfield, Ohio, and to support critical incidents as needed

Benefits and Terms

01. Competitive compensation and equity
02. Restricted Stock Units
03. Paid time off, holidays, and leave programs
04. Medical, dental, and vision insurance
05. Employer HSA contributions
06. Paid parental leave
07. Life, short-term disability, and long-term disability insurance
08. Professional development and tuition reimbursement
09. Mental health and wellness
10. Role is based on-site at Crusoe’s OH5C facility in Springfield, Ohio
11. Periodic travel to other Crusoe sites may be required

