Anthropic, 06.05.2026

Senior Staff+ Software Engineer, Kubernetes Platform

San Francisco

Responsibilities

01 Own, operate, and extend the Kubernetes scheduler for Anthropic's accelerator fleets, including custom scheduling plugins and policies for gang scheduling, topology awareness, and preemption
02 Scale the Kubernetes control plane (apiserver, etcd, controller-manager) to support clusters far beyond typical limits, and find the next bottleneck before it finds us
03 Design, build, and operate core cluster services such as service discovery that every workload in the fleet depends on
04 Build and maintain custom controllers, operators, and CRDs
05 Partner with research, training, and inference to understand workload shapes and turn their requirements into platform capabilities
06 Collaborate with cloud providers on required features and escalations
07 Participate in on-call, lead incident response, and design processes (postmortems, runbooks, SLOs) that help the team avoid repeating failures

01 Significant software engineering experience building and operating production distributed systems
02 Proficiency in at least one systems-appropriate language (e.g., Go, Python, Rust, or C++)
03 Deep, hands-on Kubernetes experience (well beyond "user of"), extending into the scheduler, controllers, or apiserver, or into operating large multi-tenant clusters
04 Demonstrated ability to debug complex issues across the stack, from API behavior down to node and network-level root causes
05 A track record of designing for reliability, correctness, and clear failure semantics in systems other engineers depend on
06 Strong written and verbal communication; comfort building consensus with internal stakeholders
07 Experience with Kubernetes internals or contributions: kube-scheduler / scheduling framework, apiserver, etcd, client-go, controller-runtime, or similar
08 Experience building or operating cluster schedulers or batch systems (e.g., Kueue, Volcano, Slurm, or in-house equivalents)
09 Background scaling control planes or coordination systems (etcd, ZooKeeper, Consul, or large DNS/service-mesh deployments)
10 Familiarity with ML infrastructure: GPUs, TPUs, or Trainium; gang scheduling; topology-aware placement; collective networking such as NCCL
11 Experience with GCP and/or AWS, including GKE/EKS internals and Infrastructure as Code
12 Low-level systems experience such as Linux kernel tuning, cgroups, or eBPF
13 8+ years of relevant industry experience, including time leading large, ambiguous infrastructure projects

01 Annual Salary: $320,000 to $405,000 USD
02 Minimum education: Bachelor’s degree or an equivalent combination of education, training, and/or experience
03 Required field of study: A field relevant to the role as demonstrated through coursework, training, or professional experience
04 Location-based hybrid policy: office at least 25% of the time
05 Visa sponsorship: available (with reasonable efforts to obtain visa)