Crusoe (posted 3 days ago)

Manager, Data Center Operations

Full-time, on-site

Skills

GPU compute hardware troubleshooting, SuperMicro hardware, Jira, ServiceNow, DCIM, NetBox, AMD GPU clusters, MI300X, NVIDIA GPU platforms, H100, H200, B200, RoCE fabric topology, UPS systems, PDUs, generators, CRAC systems, CRAH systems

Responsibilities

  • 01 Own the daily operation, health, and availability of the OH5C data center
  • 02 Lead troubleshooting and repair of GPU compute hardware, including GPU trays, DIMMs, drives, cabling, and server nodes
  • 03 Drive rapid triage and repair while maintaining MTTR and uptime targets
  • 04 Coordinate RMAs and hardware support with OEM vendors, primarily SuperMicro
  • 05 Maintain spare-parts inventory and ensure critical hardware is available when needed
  • 06 Partner with Fleet Operations, SRE, networking, and infrastructure teams on escalations
  • 07 Lead, coach, and develop the on-site data center technician team
  • 08 Set clear expectations for safety, quality, responsiveness, and accountability
  • 09 Conduct regular one-on-ones, performance reviews, and development planning
  • 10 Support technician hiring, onboarding, training, and workforce planning
  • 11 Build a culture of technical precision, ownership, and continuous improvement
  • 12 Track and report site KPIs, including uptime, MTTR, SLA compliance, deployment velocity, and ticket aging
  • 13 Use operational data to identify recurring issues and improve reliability
  • 14 Maintain accurate break-fix workflows in Jira or a comparable ticketing system
  • 15 Provide clear operational updates, incident summaries, and corrective-action plans to senior leadership
  • 16 Serve as the primary on-site liaison with the colocation provider
  • 17 Hold facility partners accountable to SLAs related to power, cooling, security, and availability
  • 18 Maintain working knowledge of UPS systems, PDUs, generators, CRAC and CRAH systems, and supporting infrastructure
  • 19 Escalate and track facility issues through resolution
  • 20 Coordinate planned maintenance to minimize risk to production systems
  • 21 Maintain site runbooks, SOPs, emergency procedures, and hardware documentation
  • 22 Ensure work is completed in accordance with safety, security, and change-management standards
  • 23 Contribute to fleet-wide operating standards and knowledge sharing
  • 24 Maintain accurate asset, inventory, and configuration records

Requirements

  • 01 At least 5 years of data center operations leadership experience in a production environment
  • 02 Experience managing and developing technical teams
  • 03 Hands-on experience troubleshooting enterprise server hardware, including GPU nodes, DIMMs, drives, cabling, and rack-level infrastructure
  • 04 Strong familiarity with SuperMicro hardware, diagnostics, event logs, and RMA processes
  • 05 Experience working in colocation environments and managing provider SLAs
  • 06 Working knowledge of data center electrical and mechanical systems
  • 07 Experience with Jira, ServiceNow, or a similar ticketing platform
  • 08 Strong understanding of incident management, root-cause analysis, and operational risk
  • 09 Clear written and verbal communication skills, including the ability to present technical and operational information to senior leaders
  • 10 Ability to work on-site in Springfield, Ohio, and support critical incidents as needed

Benefits and conditions

  • 01 Competitive compensation and equity
  • 02 Restricted Stock Units
  • 03 Paid time off, holidays, and leave programs
  • 04 Medical, dental, and vision insurance
  • 05 Employer HSA contributions
  • 06 Paid parental leave
  • 07 Life, short-term disability, and long-term disability insurance
  • 08 Professional development and tuition reimbursement
  • 09 Mental health and wellness
  • 10 Role is based on-site at Crusoe’s OH5C facility in Springfield, Ohio
  • 11 Periodic travel to other Crusoe sites may be required