Perplexity, 14.04.2026

Member of Technical Staff (AI Inference Engineer)

Full-time, London

Responsibilities

  • 01 Support transformer-based retrieval, text-generation, and multimodal models in our inference infrastructure, from weight loading, request scheduling, and KV-cache management to support in the API Gateway
  • 02 Port our in-house CUDA kernels to NVIDIA's CuTe DSL so they run on GB200 today and are portable to Vera Rubin racks tomorrow
  • 03 Develop our internal Rust-based inference server to eliminate our Python pain points and keep up with rapidly growing traffic
  • 04 Profile and fix bottlenecks across the stack, from network ingress through continuous batching to GPU kernel interleaving
  • 05 Build dashboards, alerts, and automated remediation so we catch regressions before users do. Respond to and learn from production incidents

Requirements

  • 01 Deep experience with GPU programming and performance work (CUDA, Triton, CUTLASS, or similar)
  • 02 Understanding of modern LLM architectures and the ability to bring them up reliably in a production environment
  • 03 Experience building and operating production distributed systems under real load, ideally performance-critical ones
  • 04 Comfortable working across languages and layers: Rust for the serving runtime, Python for model code, CUDA/CuTe DSL for kernels
  • 05 Ability to own problems end-to-end
  • 06 Self-directed work in fast-moving environments
  • 07 3+ years of professional software engineering experience with meaningful work on ML inference or high-performance systems
  • 08 Familiarity with at least one deep learning framework (PyTorch, JAX, TensorFlow)
  • 09 Understanding of GPU architectures (memory hierarchy, warp scheduling, tensor cores)
  • 10 Understanding of common LLM architectures and inference optimization techniques (e.g. quantization, speculative decoding, prefill-decode disaggregation)

Compensation

  • 01 Equity may be part of the total compensation package
  • 02 Final offer amounts are determined by multiple factors, including experience and expertise