Perplexity, 14.04.2026

Member of Technical Staff (AI Inference Engineer)

Full-time, San Francisco

Responsibilities

01 Support transformer-based retrieval, text-generation, and multimodal models in our inference infrastructure, from weight loading, request scheduling, and KV-cache management through to API Gateway support
02 Port our in-house CUDA kernels to NVIDIA's CuTe DSL so they run on GB200 today and are portable to Vera Rubin racks tomorrow
03 Develop our internal Rust-based inference server to solve all Python pains and keep up with rapidly growing traffic
04 Profile and fix bottlenecks from network ingress through continuous batching and GPU kernel interleaving
05 Build dashboards, alerts, and automated remediation so we catch regressions before users do
06 Respond to and learn from production incidents

Requirements

01 Deep experience with GPU programming and performance work (CUDA, Triton, CUTLASS, or similar)
02 Understanding of modern LLM architectures and ability to bring them up reliably in a production environment
03 Experience building and operating production distributed systems under real load, ideally performance-critical ones
04 Comfortable working across languages and layers: Rust for the serving runtime, Python for model code, CUDA/CuTe DSL for kernels
05 Ability to own problems end-to-end
06 Self-directed work in fast-moving environments
07 3+ years of professional software engineering experience with meaningful work on ML inference or high-performance systems
08 Familiarity with at least one deep learning framework (PyTorch, JAX, TensorFlow)
09 Understanding of GPU architectures (memory hierarchy, warp scheduling, tensor cores)
10 Understanding of common LLM architectures and inference optimization techniques (e.g. quantization, speculative decoding, prefill-decode disaggregation)