Anthropic, 06.05.2026

Senior Staff+ Software Engineer, Kubernetes Platform

San Francisco

Responsibilities

  • Own, operate, and extend the Kubernetes scheduler for Anthropic's accelerator fleets, including custom scheduling plugins and policies for gang scheduling, topology awareness, and preemption
  • Scale the Kubernetes control plane (apiserver, etcd, controller-manager) to support clusters far beyond typical limits, and find the next bottleneck before it finds us
  • Design, build, and operate core cluster services such as service discovery that every workload in the fleet depends on
  • Build and maintain custom controllers, operators, and CRDs
  • Partner with research, training, and inference to understand workload shapes and turn their requirements into platform capabilities
  • Collaborate with cloud providers on required features and escalations
  • Participate in on-call, lead incident response, and design processes (postmortems, runbooks, SLOs) that help the team avoid repeating failures

Requirements

  • Significant software engineering experience building and operating production distributed systems
  • Proficiency in at least one systems-appropriate language (e.g., Go, Python, Rust, or C++)
  • Deep, hands-on Kubernetes experience well beyond that of a user: work on the scheduler, controllers, or apiserver, or operating large multi-tenant clusters
  • Demonstrated ability to debug complex issues across the stack, from API behavior down to node and network-level root causes
  • A track record of designing for reliability, correctness, and clear failure semantics in systems other engineers depend on
  • Strong written and verbal communication; comfort building consensus with internal stakeholders
  • Experience with Kubernetes internals or contributions: kube-scheduler / scheduling framework, apiserver, etcd, client-go, controller-runtime, or similar
  • Experience building or operating cluster schedulers or batch systems (e.g., Kueue, Volcano, Slurm, or in-house equivalents)
  • Background scaling control planes or coordination systems (etcd, ZooKeeper, Consul, or large DNS/service-mesh deployments)
  • Familiarity with ML infrastructure: GPUs, TPUs, or Trainium; gang scheduling; topology-aware placement; collective networking such as NCCL
  • Experience with GCP and/or AWS, including GKE/EKS internals and Infrastructure as Code
  • Low-level systems experience such as Linux kernel tuning, cgroups, or eBPF
  • 8+ years of relevant industry experience, including time leading large, ambiguous infrastructure projects

Conditions

  • Annual salary: $320,000 to $405,000 USD
  • Minimum education: Bachelor’s degree or an equivalent combination of education, training, and/or experience
  • Required field of study: A field relevant to the role as demonstrated through coursework, training, or professional experience
  • Location-based hybrid policy: in the office at least 25% of the time
  • Visa sponsorship: available (with reasonable efforts made to obtain a visa)