April 2026 · 18 min read

How to find ML systems engineers in 2026: C++, Rust, and inference infrastructure

The people who make LLMs run fast in production are not the same people who train them. ML systems engineers write C++ and Rust for inference engines, optimize GPU memory layouts, and build the infrastructure that turns a research model into a product. There are maybe 50,000 of them globally, and every AI company on the planet wants to hire them.

The AI hiring market in 2026 has split into two camps that most job descriptions blur together. One camp is ML researchers and data scientists who train models in Python, tune hyperparameters, and evaluate results in Jupyter notebooks. The other is systems engineers who take those models and make them serve 10,000 requests per second at 50ms latency on a GPU cluster. The first group is large and growing. The second is small, in extreme demand, and mostly invisible to traditional recruiting.

ML systems engineers write C++ for PyTorch internals. They write Rust for new inference frameworks like Hugging Face's candle. They contribute to llama.cpp, which has been the backbone of local LLM inference since Georgi Gerganov's first commit in March 2023. They optimize model quantization formats (GPTQ, AWQ, GGUF) that make 70-billion-parameter models run on consumer hardware. And they do all of this on GitHub, in public, which makes GitHub a strong sourcing channel for this role.

This guide covers where to find ML systems engineers, what separates a real systems contributor from someone who fine-tuned a model once, and why the "full-stack ML engineer" job description is hurting your pipeline. It also walks through a sourcing workflow that can reach these engineers before they accept a competing offer.

The ML systems engineering talent market

Between 20,000 and 50,000 engineers globally write C++ for ML systems. The Rust-plus-ML intersection is smaller: 10,000 to 30,000 engineers who work in both. These estimates are rough because the role sits at a boundary that neither systems programming nor ML communities fully own. A C++ engineer working on game engine physics is not an ML systems engineer. A Python ML engineer who calls model.predict() is not one either. The people in the overlap have a specific and uncommon combination of skills.

Demand has outstripped supply since GPT-3 in 2020, and the gap widened with every subsequent model release. Rust ML job postings grew 67% year-over-year. The salary premium for ML systems work runs 15-20% above base ML roles. Senior ML systems engineers in the US earn $220,000 to $350,000 in total compensation; staff-level engineers at top AI labs clear $500,000. These are not outliers driven by a few Bay Area companies. They reflect a global shortage where more production inference systems are being built than there are engineers who know how to build them.

The hiring cycle reflects the scarcity. Expect 60 to 120 days to fill a senior ML systems role. Staff-level or highly specialized positions like custom CUDA kernel development can take longer. The bottleneck is sourcing, not evaluation. Most companies know how to interview systems engineers. The hard part is getting qualified candidates into the pipeline at all.

Where do these engineers come from? Most followed one of three paths. Some are traditional systems programmers with C++ backgrounds in compilers, databases, or game engines who moved into ML infrastructure. They bring performance intuition and memory management discipline, then learn ML concepts on the job. Others are ML researchers who got curious about what happens below the Python API layer, the kind of person who started training models but wanted to know why torch.compile produces the output it does, or why their model serves 3x slower than it should. The rest come from compiler and hardware engineering at GPU companies, chip startups, and HPC labs, bringing optimization skills directly to ML workloads.

The "full-stack ML engineer" myth

Most ML job descriptions ask for one person who can do everything: research new architectures, train models at scale, optimize inference, build APIs, deploy to production, and monitor performance. This person barely exists. The skills needed for ML research (mathematical intuition, experiment design, paper reading) and ML systems engineering (memory management, kernel optimization, hardware-aware programming) overlap only slightly. Looking for someone who does both well is like looking for a backend engineer who is also a graphic designer. Some exist. You cannot build a hiring pipeline around finding them.

The practical cost of this confusion is wasted sourcing effort. A recruiter searching for "ML engineers" on GitHub will find thousands of Python developers who train models. That is fine, but it does not help if you need someone to optimize inference latency, implement custom CUDA kernels, or port a model to edge hardware. The two roles need different sourcing strategies, different evaluation criteria, and different compensation expectations.

When writing job descriptions for ML systems roles, be specific. Name the inference frameworks you use or plan to use. Say whether the work involves CUDA and GPU programming, CPU optimization with SIMD, or both. Specify whether the candidate will work with existing C++ codebases (PyTorch, ONNX Runtime) or build new infrastructure in Rust. State the quantization formats relevant to your pipeline. ML systems engineers respond to technical detail because it signals you understand what the job actually involves. Generic "AI/ML" descriptions get ignored.

Where ML systems engineers contribute on GitHub

ML systems work happens almost entirely in the open. The frameworks, inference engines, and optimization tools that define this field are open source. That makes contribution history the most reliable signal for identifying ML systems engineers. We covered seniority signals on GitHub broadly; here are the repositories and patterns that matter for ML systems specifically.

PyTorch internals. pytorch/pytorch is a 900,000-line codebase. The part that matters for systems engineering is the C++ core: ATen (the tensor library), the autograd engine, torch::jit (TorchScript), and the dispatcher that routes operations to CPU, CUDA, or other backends. A contributor who works on the Python API is an ML engineer. A contributor who works on ATen operator implementations, CUDA kernel registrations, or memory allocator changes is an ML systems engineer. Both profiles live in the same repository but do very different work. Check file paths: aten/src/ATen/, c10/, and torch/csrc/ are where systems work happens.

llama.cpp and GGML. ggml-org/llama.cpp has over 85,000 stars. It proved that large language models could run on consumer hardware with aggressive quantization, and that changed how the entire ecosystem thinks about inference. In February 2026, the project joined Hugging Face, gaining institutional backing while keeping its open development model. ggml-org/ggml is the underlying tensor library. Contributors to these projects understand quantization at a bit-manipulation level, SIMD optimization for CPU inference, memory-mapped model loading, and the practical tradeoffs between model quality and inference speed. This community is about as close to pure ML systems engineering as you will find on GitHub.

ONNX Runtime. onnx/onnxruntime is Microsoft's cross-platform inference engine. It supports CPU, GPU, and specialized hardware accelerators through an execution provider architecture. Contributors work on graph optimizations, operator implementations, and hardware-specific backends. The codebase is mostly C++ with a focus on performance portability. Engineers who contribute here know how to make the same model run efficiently across different hardware targets, which is a different skill than optimizing for a single GPU architecture.

TensorFlow. tensorflow/tensorflow has lost research mindshare to PyTorch but remains widely used in production inference, especially via TensorFlow Lite (mobile and edge) and TensorFlow Serving. The C++ core and XLA compiler infrastructure are where ML systems engineers contribute. TensorFlow contributors tend to have more production deployment experience than PyTorch contributors, since TensorFlow's ecosystem was built around serving from the start.

vLLM. vllm-project/vllm is the leading open source LLM serving framework. It introduced PagedAttention, which fixed the memory fragmentation problem that made LLM serving wasteful. vLLM contributors understand GPU memory management, attention mechanism optimization, batching strategies for variable-length sequences, and what it takes to serve a language model at scale. If you are hiring for LLM inference, this is one of the first repositories to look at.

Model quantization projects. Quantization reduces model precision (from 32-bit floats to 8-bit integers, 4-bit, or lower) to shrink the memory footprint and speed up inference. The major formats are GPTQ (GPU-focused), AWQ (activation-aware weight quantization), and GGUF (the format llama.cpp uses). Contributors to these projects understand numerical precision tradeoffs, calibration data selection, mixed-precision strategies, and the hardware constraints that make quantization necessary. Look for contributions to AutoGPTQ, mit-han-lab/llm-awq, and quantization-related changes in llama.cpp.
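The core mechanic behind all of these formats can be sketched with symmetric int8 quantization, the simplest possible scheme. This is a toy illustration of the precision/memory tradeoff, not any specific format's algorithm; GPTQ, AWQ, and GGUF refine it with group-wise scales, activation-aware calibration, and sub-byte packing:

```python
# Toy symmetric int8 quantization: one float scale per tensor, each
# weight stored as a single signed byte. Real formats use per-group
# scales and calibration data to reduce the accuracy loss.

def quantize_int8(weights):
    """Map float weights to int8 values plus one float scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

weights = [0.82, -1.27, 0.003, 0.5]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
# Each weight now costs 1 byte instead of 4, at a small accuracy cost:
# the smallest weight (0.003) rounds to zero entirely.
```

Calibration, in this framing, is the problem of choosing scales so that the weights that matter most survive the rounding, which is exactly what activation-aware methods like AWQ optimize for.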

TensorRT and inference optimization. NVIDIA's TensorRT is the standard tool for optimizing and deploying trained models on NVIDIA GPUs. It performs layer fusion, precision calibration, and kernel auto-tuning to get maximum throughput from GPU hardware. Engineers who work with TensorRT usually also know NVIDIA's lower-level tools: CUTLASS for custom GEMM implementations, cuDNN for optimized primitives, and Nsight for GPU profiling. TensorRT experience is a strong signal for production ML deployment, especially at companies running inference at scale on NVIDIA hardware.

Rust in ML infrastructure

For decades, ML systems meant C++. Rust is changing that faster than most hiring managers realize. Rust developers are already in high demand across systems programming; the ML overlap makes them even more scarce.

Hugging Face has been the main driver. Their Rust ML projects are not experiments. huggingface/tokenizers is written in Rust and used in production by thousands of companies through Python bindings. It handles text tokenization orders of magnitude faster than pure Python. huggingface/safetensors is a Rust-based model serialization format that replaced pickle-based formats (which could execute arbitrary code on load) and is now the default for model distribution on the Hugging Face Hub. The ML ecosystem depends on both of these daily.

huggingface/candle is more ambitious: a full ML framework written in Rust. It supports CPU and GPU inference, with a focus on minimal dependencies, fast compilation, and deployment without Python. Candle targets inference workloads where Python's overhead and dependency management are liabilities. Contributors to candle understand both Rust's ownership model and ML computation graphs, a rare combination.

tracel-ai/burn takes a different approach. Burn provides a complete training and inference framework with multiple backends (CPU, CUDA, Vulkan, WebGPU) and a focus on flexibility. Its contributors understand backend abstraction, cross-platform GPU programming, and framework design well enough to work across all of those targets.

Why Rust? Start with memory safety without garbage collection. Inference servers run for weeks or months. A slow memory leak in a C++ inference server eventually causes an OOM crash and a 3am page. Rust's ownership model catches these bugs at compile time. Then there is dependency management. C++ dependency tooling (CMake, vcpkg, Conan) is a constant source of friction. Cargo just works. And Rust compiles to a single static binary with no runtime dependencies, which simplifies deployment to edge devices, containers, and serverless environments.

The 67% year-over-year growth in Rust ML job postings shows companies acting on these advantages. The talent pool is small (10,000 to 30,000 engineers with both Rust and ML skills), so the salary premium is real. Engineers who can write performant Rust for ML workloads while also understanding quantization, attention mechanisms, and memory-efficient inference are some of the hardest people to hire right now.

Quality signals in ML systems code

Evaluating ML systems contributions calls for different criteria than evaluating ML research or general software engineering. General seniority signals still apply, but this domain has specific markers that separate production-grade systems engineers from hobbyists.

C++ quality signals. RAII (Resource Acquisition Is Initialization) patterns in GPU code are the first thing to look for. An experienced ML systems engineer wraps GPU memory allocations in scope-guarded objects that automatically free memory on destruction. This prevents the GPU memory leaks that plague production inference servers. Template metaprogramming for kernel dispatch is another strong signal: type-dispatched kernels that handle float16, float32, bfloat16, and int8 through compile-time dispatch rather than runtime branching show someone who understands both C++ and GPU performance. Familiarity with memory layouts (NCHW vs NHWC) and why the choice matters for different hardware targets points to production experience. SIMD optimization (SSE, AVX, AVX-512, NEON) for CPU inference paths is the kind of low-level performance work that separates systems engineers from application developers.
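The memory-layout point is concrete: the same logical element lives at a different linear offset under NCHW than under NHWC, which determines whether an inner loop over spatial positions or over channels hits contiguous memory. A minimal sketch of the index arithmetic (in Python for brevity; the formulas are layout facts, not tied to any framework):

```python
# Linear offset of element (n, c, h, w) in a dense 4D tensor under the
# two common layouts. In NCHW, adjacent w values are contiguous; in
# NHWC, adjacent c values are. Which is faster depends on whether the
# hardware's inner loop walks spatial positions or channels.

def offset_nchw(n, c, h, w, C, H, W):
    return ((n * C + c) * H + h) * W + w

def offset_nhwc(n, c, h, w, C, H, W):
    return ((n * H + h) * W + w) * C + c

# Same logical element, different physical position in memory:
C, H, W = 3, 4, 4
a = offset_nchw(0, 1, 2, 3, C, H, W)
b = offset_nhwc(0, 1, 2, 3, C, H, W)
```

An engineer who can explain why a convolution kernel prefers one layout on GPU and the other on certain CPU SIMD paths is demonstrating exactly the production intuition described above.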

Rust quality signals. Zero-copy tensor operations are the Rust equivalent of RAII in C++. An experienced Rust ML engineer uses lifetimes and borrows to process tensor data without unnecessary copies, keeping Rust's safety guarantees while matching C++ performance. Safe GPU memory management through Rust's type system (wrapping CUDA allocations in types that enforce ownership semantics) is a strong signal. Custom operator implementations that plug into a framework's computation graph show someone who understands both Rust and ML framework internals. Unsafe blocks used sparingly and correctly (for FFI calls to CUDA, for hardware-specific intrinsics) show a mature Rust developer. Too much unsafe code suggests someone fighting the language rather than using it.

Quantization expertise. Look for code that implements or extends quantization schemes. The details matter: does the engineer understand calibration dataset selection and its effect on quantized model accuracy? Do they implement mixed-precision strategies where different layers get different bit widths based on sensitivity analysis? Can they explain the tradeoffs between GPTQ (slower quantization, good quality), AWQ (activation-aware, better preservation of important weights), and GGUF (flexible, CPU-optimized)? Quantization is where systems engineering meets ML theory, and engineers who do it well need to understand both.

Inference optimization patterns. KV-cache management for transformer models is one of the central challenges of LLM inference. Engineers who contribute to vLLM's PagedAttention, implement continuous batching, or optimize speculative decoding are working on the hardest problems in production LLM serving. Kernel fusion (combining multiple small GPU operations into a single kernel launch to reduce overhead) is another signal. Look for flash attention implementations, fused layer norm + residual operations, or custom attention kernels that handle variable-length sequences efficiently.
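The scale of the KV-cache problem is easy to see with back-of-the-envelope arithmetic. The dimensions below are illustrative, roughly those of a 7B-class model with full (non-grouped) KV heads; the formula itself is standard:

```python
# Per-token KV-cache cost for a transformer: 2 tensors (K and V) per
# layer, each kv_heads * head_dim values, at dtype_bytes per value
# (2 bytes for float16).

def kv_cache_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Illustrative 7B-class dimensions: 32 layers, 32 KV heads, head_dim 128.
per_token = kv_cache_bytes_per_token(32, 32, 128)  # 524,288 bytes
per_seq = per_token * 4096                         # 2 GiB at 4k context
# Naively pre-allocating that worst case for every slot in a batch is
# the waste that PagedAttention's block-based allocation eliminates.
```

At 2 GiB of cache per maximum-length sequence, an 80 GB A100 fits only a few dozen naively allocated slots, which is why paging the cache in small blocks was such a large throughput win.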

Profiling and benchmarking. ML systems engineers who include detailed benchmarks with their contributions (latency distributions, throughput under load, memory high-water marks) show a production mindset. A pull request that says "faster inference" is less convincing than one that says "reduced P99 latency from 84ms to 51ms for batch size 32 on A100, measured over 10,000 requests." Look for contributions that reference specific profiling tools like NVIDIA Nsight Systems, perf, or Instruments on macOS.
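If you want to sanity-check the percentile claims in a PR like that, the math is simple. The sketch below uses the nearest-rank convention; real profiling tools differ slightly in interpolation, so treat it as one common definition rather than the definitive one:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value >= p% of the samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical per-request latencies in milliseconds:
latencies_ms = [41, 44, 45, 47, 48, 52, 55, 61, 63, 84]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
# A mean of ~54ms would hide the 84ms tail that P99 exposes.
```

The point of P99 over the mean is exactly what the contribution pattern above rewards: tail latency is what users of a loaded inference server actually experience.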

Red flags. Watch out for contributions that are mostly Python wrappers around existing C++ or Rust libraries with no systems-level changes. Model training scripts, Jupyter notebooks, and Hugging Face Transformers fine-tuning code are ML engineering, not ML systems engineering. Someone with heavy GitHub activity in transformers Python code but no contributions to inference engines or low-level libraries is probably the wrong profile for a systems role.

The Apple MLX effect

Apple's MLX framework deserves separate attention because it is changing where ML inference happens. ml-explore/mlx has 24,900 stars and a growing ecosystem of 4,316 pre-converted models ready to run on Apple Silicon. MLX is built for the unified memory architecture of M-series chips, where CPU and GPU share the same memory pool. This eliminates the PCIe transfer bottleneck that plagues discrete GPU setups and makes Apple hardware competitive for inference workloads that fit in unified memory.

In March 2026, Ollama switched its backend to MLX for Apple Silicon Macs. Ollama is the most popular tool for running LLMs locally, so choosing MLX over llama.cpp for Apple hardware says something about where the framework stands. The MLX contributor community is growing fast. Engineers who contribute here work under a different set of constraints than CUDA developers: Metal compute shaders instead of CUDA kernels, unified memory management instead of host-device transfers, and Apple's Neural Engine for specific operations.

MLX contributors are worth paying attention to because they often have both ML depth and Apple platform experience. Companies building on-device AI products (mobile apps, desktop tools, edge inference) should look at this talent pool first. The overlap with traditional CUDA/C++ ML systems engineers is smaller than you might expect. Many MLX contributors come from the Apple developer ecosystem (Swift, Objective-C, Metal) rather than from ML infrastructure backgrounds.

A practical ML systems sourcing workflow

Here is a step-by-step approach to finding ML systems engineers, from defining the role to first outreach.

Step 1: Define the systems scope. "ML systems engineer" is still too broad. What kind of systems work? Inference optimization (latency, throughput, memory efficiency for serving trained models)? Framework internals (contributing to or extending PyTorch, TensorFlow, or a Rust framework)? Model compression (quantization, pruning, distillation for constrained environments)? Edge deployment (running models on mobile, IoT, or embedded hardware)? Inference serving infrastructure (vLLM, TensorRT, autoscaling, batching)? Each maps to different repositories, different languages, and different candidate profiles. An engineer who optimizes CUDA kernels for A100 inference is not the same person who deploys quantized models to Android phones. Pin down the technical domain before you start sourcing.

Step 2: Map to target repositories. Based on the scope, pick the 5-10 GitHub repositories where your ideal candidate would show up. For LLM inference: llama.cpp, vLLM, candle. For framework work: pytorch/pytorch (C++ paths), burn, ggml. For quantization: AutoGPTQ, llm-awq, quantization contributions within llama.cpp. For Apple platforms: mlx. For production serving: onnxruntime, TensorRT-related repos. For GPU-specific optimization: CUTLASS, Triton, CUDA samples. Being specific about repositories keeps your sourcing aimed at the right kind of systems engineer.

Step 3: Extract and filter contributors. Pull recent contributors from your target repositories. Focus on the last 6 months for active candidates. Filter by contribution type: code changes to C++, Rust, or CUDA files carry more signal than documentation updates or Python test changes. Pay attention to file paths within repositories. In pytorch/pytorch, contributions to aten/src/ and torch/csrc/ are systems work; contributions to torch/nn/ are application-layer Python. In llama.cpp, look for contributors who touch quantization implementations, SIMD-optimized kernels, or backend-specific code rather than CLI argument parsing or README updates.
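The path-based filtering in this step is mechanical enough to script. A sketch using the pytorch/pytorch conventions described above; the prefix and suffix lists are an illustrative heuristic, and a real pipeline would pull changed paths from the GitHub API's pull-request files endpoint:

```python
# Classify changed file paths as systems-level vs application-layer work
# for pytorch/pytorch, per the directory conventions discussed above.

SYSTEMS_PREFIXES = ("aten/src/", "c10/", "torch/csrc/")
SYSTEMS_SUFFIXES = (".cpp", ".cc", ".cu", ".cuh", ".h")

def is_systems_change(path):
    """True if a changed file looks like C++/CUDA core work."""
    return path.startswith(SYSTEMS_PREFIXES) and path.endswith(SYSTEMS_SUFFIXES)

changed = [
    "aten/src/ATen/native/cuda/SoftMax.cu",   # CUDA kernel: systems
    "torch/nn/modules/linear.py",             # Python API: application
    "c10/core/Allocator.h",                   # allocator header: systems
    "docs/source/notes/cuda.rst",             # documentation
]
systems_files = [p for p in changed if is_systems_change(p)]
```

The same shape of filter works for llama.cpp (quantization and backend source files versus CLI and README changes); only the prefix and suffix lists change per repository.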

Step 4: Evaluate depth. For each candidate, assess whether their contributions show systems depth or surface-level involvement. Depth signals: PRs that change memory management or allocation strategies, contributions with benchmark results, code reviews on performance-sensitive changes, multi-commit contributions that iterate on an optimization based on profiling data. Surface signals: single-commit typo fixes, documentation-only contributions, Python-only changes to systems-level projects. Both count, but weight them accordingly. A developer with three merged PRs and benchmark data in llama.cpp is more likely a strong ML systems engineer than one with twenty documentation fixes across ten ML repos.
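One way to make "weight them accordingly" concrete is a simple per-candidate score. The weights and the shape of the input records below are illustrative, not a validated rubric; the point is that the scoring mirrors the depth signals above rather than raw contribution counts:

```python
# Toy depth score over merged PRs. Each "pr" dict is a stand-in for
# whatever your sourcing pipeline extracts; the weights are illustrative.

def depth_score(prs):
    score = 0.0
    for pr in prs:
        if pr["kind"] == "code" and pr.get("touches_systems_paths"):
            score += 3.0              # systems-level code change
            if pr.get("has_benchmarks"):
                score += 2.0          # benchmark data: strong depth signal
        elif pr["kind"] == "code":
            score += 1.0              # application-layer code change
        else:
            score += 0.25             # docs/typo fixes count, but barely
    return score

candidate_a = [  # three systems PRs with benchmark data in llama.cpp
    {"kind": "code", "touches_systems_paths": True, "has_benchmarks": True},
] * 3
candidate_b = [  # twenty documentation fixes across ML repos
    {"kind": "docs"},
] * 20
```

Under this weighting, candidate A's three benchmarked systems PRs outscore candidate B's twenty documentation fixes, which matches the judgment the step describes.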

Step 5: Check adjacent repositories. ML systems engineers rarely contribute to only one project. Someone who works on llama.cpp often also contributes to ggml, whisper.cpp, or their own inference experiments. A PyTorch C++ contributor might also show up in ONNX Runtime or Triton. Cross-referencing contributions across related repositories gives you a fuller picture of a candidate's expertise. A developer who contributes C++ optimizations to PyTorch and also maintains a Rust inference library has exactly the skill combination that makes this role so hard to fill.

Step 6: Outreach with technical specificity. Generic outreach fails for this audience. ML systems engineers get recruiter messages that say "AI/ML opportunity" and delete them. Effective outreach to developers means referencing specific contributions. "I saw your PR implementing 4-bit quantization with group-wise scaling in llama.cpp. We're building inference infrastructure for on-device LLMs and your work on memory-efficient weight packing is directly relevant to our biggest bottleneck." That kind of message gets read. It shows you understand the work, you have a real technical problem, and you are not sending the same email to 500 people. Reference their actual code. Mention the specific technical challenge your team faces. Say why their expertise matters for your problem.

Step 7: Consider adjacent backgrounds. The ML systems talent pool is small enough that strict filtering eliminates too many viable candidates. Engineers with strong C++ systems backgrounds in adjacent domains are worth considering even without direct ML experience. Compiler engineers understand code generation and optimization passes, which translates directly to ML compiler work (XLA, TVM, MLIR). Game engine developers know GPU programming, memory management, and real-time performance constraints. Database systems engineers understand memory layouts, cache efficiency, and query optimization. HPC engineers understand distributed computation, SIMD optimization, and scaling. Look for these backgrounds combined with personal ML projects or recent open source contributions to ML infrastructure. The niche stack sourcing strategies we covered for other languages apply here too. A strong C++ systems engineer can usually become productive in ML infrastructure in weeks, not months.

Step 8: Scale with tooling. The manual workflow above works, but it is limited by how many contributor graphs you can review in a day. Tools like riem.ai automate the discovery step by scanning 30 million-plus GitHub events per month and matching contributors to ML systems repositories based on actual code changes. Describe the role in plain language ("C++ engineers who contribute to llama.cpp or PyTorch internals with experience in quantization or inference optimization") and get a ranked list of candidates with contribution summaries, quality scores, and recent activity data. The ranking weights systems-level contributions higher than application-layer work, which is the distinction that matters most for this role.

Frequently asked questions

How many ML systems engineers are there?

Roughly 20,000 to 50,000 engineers globally write C++ for ML systems work, and 10,000 to 30,000 have both Rust and ML skills. These numbers are imprecise because ML systems engineering sits between systems programming and machine learning, and neither community fully claims the other. The talent pool is growing but remains much smaller than demand. Every major AI lab, every company running LLM inference at scale, and every inference framework team is hiring from the same limited pool.

What salary should I expect to pay ML systems engineers?

ML systems engineers command a 15-20% salary premium over base ML roles. Senior ML systems engineers in the US typically earn $220,000 to $350,000 in total compensation, with staff-level engineers at top AI labs exceeding $500,000. Rust ML roles carry an additional premium because of scarcity. AI labs set the salary floor since they can offer equity in high-growth companies, so startups competing for this talent need to lead with interesting technical problems, open source contribution opportunities, and the chance to work on inference at real scale.

What is the difference between an ML engineer and an ML systems engineer?

An ML engineer trains models, designs architectures, runs experiments, and works primarily in Python with PyTorch or TensorFlow at the API level. An ML systems engineer makes those models run fast in production. They write C++ and Rust code for inference engines, optimize memory layouts, implement kernel fusion, write CUDA kernels for custom operations, and build the infrastructure that serves models at scale. The ML engineer cares about model accuracy. The ML systems engineer cares about latency, throughput, and memory footprint. Both roles require understanding ML concepts, but the systems engineer spends more time in a profiler than in a Jupyter notebook.

Should I hire C++ or Rust ML systems engineers?

It depends on your existing codebase and where the ecosystem is heading. C++ dominates ML infrastructure today: PyTorch's core, TensorFlow, ONNX Runtime, and TensorRT are all C++. If you are contributing to or building on top of these projects, C++ is the practical choice. Rust is gaining ground fast, with a 67% increase in ML-related job postings. Hugging Face's candle and tracel-ai's burn are production-grade Rust ML frameworks. Rust's memory safety guarantees matter for inference servers that must run for weeks without leaking memory. For new inference infrastructure projects, Rust is increasingly the default choice. For work that touches existing C++ ML codebases, C++ remains necessary.

What GitHub projects should I look for when sourcing ML systems engineers?

For C++ ML systems: pytorch/pytorch (C++ core), ggml-org/llama.cpp, ggml-org/ggml, onnx/onnxruntime, and tensorflow/tensorflow. For Rust ML: huggingface/candle, tracel-ai/burn, huggingface/tokenizers, and huggingface/safetensors. For inference infrastructure: vllm-project/vllm, ml-explore/mlx, and TensorRT-related repositories. For model quantization: look for contributions to GPTQ, AWQ, and GGUF format implementations. Contributors to any of these projects are working at the intersection of systems programming and machine learning, which is exactly the skill set you need.

How long does it take to hire an ML systems engineer?

Expect 60 to 120 days for senior ML systems engineers, and potentially longer for staff-level roles or positions requiring specific domain expertise like custom CUDA kernel development. The bottleneck is almost entirely supply. AI labs and inference startups are all competing for the same small pool. Sourcing from GitHub contribution data helps because many ML systems engineers are active in open source but not on job boards. Referrals from existing systems engineers are the other high-yield channel. Standard recruiter outreach with generic AI job descriptions will not work for this audience.

Find the engineers who've already built it

Search 30M+ monthly GitHub events. Match on real code, not resumes.

Get started