Perplexity, 14.04.2026

Member of Technical Staff (AI Infrastructure Engineer)

Full-time, Remote

Responsibilities

  • Design, deploy, and maintain scalable Kubernetes clusters for AI model inference and training workloads
  • Manage and optimize Slurm-based HPC environments for distributed training of large language models
  • Develop robust APIs and orchestration systems for both training pipelines and inference services
  • Implement resource scheduling and job management systems across heterogeneous compute environments
  • Benchmark system performance, diagnose bottlenecks, and implement improvements across both training and inference infrastructure
  • Build monitoring, alerting, and observability solutions tailored to ML workloads running on Kubernetes and Slurm
  • Respond swiftly to system outages and collaborate across teams to maintain high uptime for critical training runs and inference services
  • Optimize cluster utilization and implement autoscaling strategies for dynamic workload demands

Requirements

  • Strong expertise in Kubernetes administration, including custom resource definitions, operators, and cluster management
  • Hands-on experience with Slurm workload management, including job scheduling, resource allocation, and cluster optimization
  • Experience with deploying and managing distributed training systems at scale
  • Deep understanding of container orchestration and distributed systems architecture
  • High-level familiarity with LLM architecture and training processes (Multi-Head Attention, Multi-Query/Grouped-Query Attention, distributed training strategies)
  • Experience managing GPU clusters and optimizing compute resource utilization
  • Expert-level Kubernetes administration and YAML configuration management
  • Proficiency with Slurm job scheduling, resource management, and cluster configuration
  • Python and C++ programming with focus on systems and infrastructure automation
  • Hands-on experience with ML frameworks such as PyTorch in distributed training contexts
  • Strong understanding of networking, storage, and compute resource management for ML workloads
  • Experience developing APIs and managing distributed systems for both batch and real-time workloads
  • Solid debugging and monitoring skills with expertise in observability tools for containerized environments
  • Demonstrated experience managing large-scale Kubernetes deployments in production environments
  • Proven track record with Slurm cluster administration and HPC workload management
  • Previous roles in SRE, DevOps, or Platform Engineering with focus on ML infrastructure
  • Experience supporting both long-running training jobs and high-availability inference services
  • Ideally, 3-5 years of relevant experience in ML systems deployment with specific focus on cluster orchestration and resource management