April 2026 · 18 min read

How to hire Triton and ML compiler engineers in 2026: A sourcing guide

The AI industry has a bottleneck, and it is not data or compute. It is the engineers who write the software that turns compute into performance. Triton has roughly 500 to 1,000 proficient developers worldwide. ML compiler engineers number maybe 3,000 to 8,000. These are the scarcest skills in AI, and every company training or serving models needs them.

Every large language model you have heard of runs on GPU kernels somebody had to write. Every custom AI chip that reaches production needs a compiler somebody had to build. The gap between "we have the hardware" and "the hardware runs our models fast" is filled by a small group of engineers who understand both silicon and software at a level most developers never reach. These are Triton developers, MLIR contributors, XLA engineers, and the people who build compiler backends for chips that did not exist two years ago.

This guide covers the market for these engineers, where they contribute on GitHub, what separates a strong candidate from someone who read a tutorial, and how to build a sourcing workflow that reaches them before your competitors do. We covered niche language sourcing strategies previously. ML compilers and GPU kernel programming are a different kind of niche: not a language community, but an engineering discipline that spans multiple languages, frameworks, and hardware targets.

Why ML compiler engineers are the scarcest skill in AI

The talent pool numbers tell the story. There are an estimated 500 to 1,000 developers worldwide who can write production Triton kernels. The broader ML compiler community (MLIR, XLA, TVM, custom chip toolchains) is perhaps 3,000 to 8,000 engineers globally. For context, there are roughly 4 million JavaScript developers and 16 million Python developers. ML compiler engineering is not a small pond. It is a puddle.

The supply constraint has structural causes that will not resolve soon. This work requires an unusual combination of skills: compiler theory (SSA form, dataflow analysis, register allocation), GPU architecture (memory hierarchies, warp scheduling, occupancy), linear algebra, and the ability to reason about performance at the instruction level. University programs produce maybe a few hundred graduates per year with this background, split between traditional compiler courses and GPU computing courses that rarely overlap. No university teaches Triton in its standard curriculum. The language was only open-sourced by OpenAI in 2021.

Meanwhile, demand is accelerating from multiple directions at once. Every AI lab optimizing inference latency needs kernel engineers. Every company training large models needs engineers who can squeeze more FLOPS out of their GPU clusters. Every custom AI chip startup needs compiler engineers to make their silicon programmable. Cerebras, Groq, SambaNova, Tenstorrent, d-Matrix, Esperanto, Graphcore (now part of SoftBank) all need compiler teams. The hyperscalers (Google with TPUs, Amazon with Trainium/Inferentia, Microsoft with Maia) need them too. And NVIDIA itself has hundreds of compiler engineers working on CUDA, cuDNN, and Triton integration.

Compensation reflects the imbalance. The average base salary for ML compiler engineers is around $147,000 in the US, but that number is misleading. Total compensation at the companies competing hardest for this talent runs $250,000 to $500,000 or more. Staff-level compiler engineers at OpenAI, Google DeepMind, or NVIDIA can earn $400,000 to $600,000 when equity is included. Startups compete with equity upside and the promise of greenfield compiler work, which appeals to engineers who want to design systems from scratch rather than maintain existing ones.

Hiring cycles are long. Expect 90 to 180 days for senior roles. The limiting factor is not your interview process. It is finding candidates who exist, are reachable, and are open to a conversation. Most qualified engineers are employed, well-compensated, and not actively looking. Passive candidate sourcing is not optional for this market. It is the only strategy that works.

Triton: the next CUDA

Triton was created at OpenAI by Philippe Tillet to solve a specific problem: writing GPU kernels is too hard and too slow. CUDA requires developers to think in terms of individual threads, manage shared memory explicitly, handle memory coalescing, and optimize for hardware-specific constraints. A typical CUDA kernel for a fused attention operation might be 500 to 1,000 lines. The equivalent Triton kernel is 100 to 200 lines. That productivity gap adds up fast when you are iterating on model architectures and kernel performance is the bottleneck.

Triton's approach is block-level programming. Instead of managing individual GPU threads, developers write operations on blocks (tiles) of data. The compiler handles thread management, memory coalescing, shared memory allocation, and many of the low-level details that make CUDA painful. A Triton kernel looks like Python with explicit block dimensions and pointer arithmetic. Do not be fooled by the Python-like syntax. Writing a correct Triton kernel still requires understanding GPU memory hierarchies (registers, shared memory, L1/L2 cache, global memory), data layout, and parallel reduction patterns. The language removes the mechanical complexity of CUDA without removing the conceptual complexity of GPU programming.
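The block decomposition that Triton automates can be sketched in plain Python. This is an illustration of the programming model, not Triton itself: the names `pid`, `offsets`, and `mask` mirror the common Triton idioms (`tl.program_id`, `tl.arange`, masked loads), but the loop over `pid` stands in for programs that a real GPU launches in parallel.

```python
# Pure-Python sketch of Triton-style block decomposition for a vector add.
# Each "program" (pid) owns one block of BLOCK_SIZE elements; a mask guards
# the ragged last block so out-of-bounds elements are never touched.

def cdiv(a, b):
    """Ceiling division: how many blocks are needed to cover a elements."""
    return (a + b - 1) // b

def vector_add(x, y, BLOCK_SIZE=4):
    n = len(x)
    out = [0] * n
    grid = cdiv(n, BLOCK_SIZE)                # one program instance per block
    for pid in range(grid):                   # the GPU runs these in parallel
        offsets = [pid * BLOCK_SIZE + i for i in range(BLOCK_SIZE)]
        mask = [off < n for off in offsets]   # guard the ragged last block
        for off, ok in zip(offsets, mask):
            if ok:
                out[off] = x[off] + y[off]    # masked load, compute, store
    return out

print(vector_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))
# → [11, 22, 33, 44, 55]
```

The point of the sketch is what is absent: no per-thread indexing, no explicit shared memory, no coalescing logic. The developer declares the block shape; the compiler handles the rest.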

The triton-lang/triton repository has 18,800 GitHub stars. But starring a repo and writing production kernels with it are different activities separated by months of study. The 500 to 1,000 estimate of proficient developers is based on the number of people who have written non-trivial Triton code in production or open source settings. That number is growing, but slowly, because the prerequisites (GPU architecture knowledge, compiler understanding, performance optimization experience) take years to develop.

Triton's importance goes beyond OpenAI. PyTorch's torch.compile uses Triton as a backend through Inductor. This means every PyTorch user who enables torch.compile is running Triton-generated kernels. Flash Attention, the memory-efficient attention algorithm that made training long-context models practical, has Triton implementations alongside its CUDA versions. When researchers publish new attention mechanisms or activation functions, Triton is increasingly the first implementation language rather than CUDA. Triton is winning.

For hiring, this means Triton experience is becoming a proxy for "can optimize GPU workloads for ML." It is not a replacement for CUDA knowledge. Many of the hardest problems still require dropping to CUDA or PTX. But a developer who writes Triton kernels fluently understands GPU architecture at a level that cannot be faked. They know why tiling strategies matter, what shared memory bank conflicts are, how to reason about occupancy, and when auto-tuning will find a good configuration versus when manual optimization is needed.

The ML compiler stack: XLA, MLIR, TVM, and beyond

Triton is one layer of a much larger compiler ecosystem that makes ML hardware usable. Understanding this stack is necessary for sourcing because the engineers who work at different layers have different skills and come from different backgrounds.

XLA (Accelerated Linear Algebra). XLA is Google's compiler for ML workloads, originally built for TensorFlow and now the backend for JAX. XLA takes a computation graph, fuses operations, optimizes memory layout, and generates code for CPUs, GPUs, and TPUs. The openxla/xla repository is where this work happens. XLA engineers understand graph-level optimizations (operator fusion, layout assignment, memory scheduling) and hardware-specific code generation. They tend to have strong backgrounds in traditional compiler engineering plus ML systems knowledge. Google, which created XLA, employs a large portion of the active contributors, but the project is open source and contributors come from other companies too.

MLIR (Multi-Level Intermediate Representation). MLIR is a compiler infrastructure project within LLVM, created by Chris Lattner (who also created LLVM, Clang, and Swift) while at Google. MLIR provides a framework for building domain-specific compilers by defining "dialects" that represent computations at different abstraction levels. Nearly every serious ML compiler project now uses MLIR: Triton compiles through MLIR, XLA uses MLIR dialects, and custom AI chip compilers are built on MLIR. The llvm/llvm-project repository (specifically the mlir/ subdirectory) is where most of this work happens. MLIR contributors tend to be strong compiler engineers. They understand type systems, intermediate representations, optimization passes, and how to design abstractions that work across hardware targets.

TVM (Apache TVM). TVM is an open source compiler framework for deep learning that generates optimized code for diverse hardware backends: CPUs, GPUs, FPGAs, and custom accelerators. TVM uses a different approach from XLA, focusing on search-based auto-tuning to find optimal operator implementations. The apache/tvm repository is the main codebase. TVM contributors often have experience with auto-scheduling, cost models, and hardware-aware optimization. TVM has been particularly influential in the edge ML and embedded ML space, where diverse hardware targets make manual optimization impractical.

Mojo and Modular. Mojo is a new programming language created by Chris Lattner's company Modular. It is designed as a superset of Python with systems programming capabilities and first-class support for MLIR. Mojo aims to be the language that replaces both Python (for ML research) and C++/CUDA (for ML infrastructure) with a single language that can do both. The modular/mojo repository is active and growing. Mojo contributors tend to have experience with MLIR, LLVM, and language design. The project is still early, but it is attracting experienced compiler engineers who want to work on language design for ML.

Custom AI chip toolchains. Custom chips are where the hiring pressure is worst. Every custom AI chip needs a compiler stack that takes ML models (typically represented as ONNX, TorchScript, or StableHLO graphs) and generates optimized code for the chip's specific architecture. Cerebras has its WSE compiler. Groq has its TSP compiler. Tenstorrent has its Buda compiler. SambaNova has its SambaFlow compiler. Most of these are built on MLIR and many of their engineers contribute to upstream MLIR. The compiler is often what determines whether a chip succeeds or fails in the market. A chip with great hardware but a bad compiler will produce poor real-world performance and lose to NVIDIA even with silicon that is superior on paper.

PyTorch compiler stack. PyTorch's compiler ecosystem has grown significantly. torch.compile, powered by TorchDynamo (graph capture), TorchInductor (code generation), and Triton (GPU kernel backend), forms a full compilation pipeline from eager Python to optimized GPU code. Engineers who contribute to this stack within the pytorch/pytorch repository understand both the framework's internals and the compilation challenges specific to ML workloads: dynamic shapes, control flow, memory management for large models, and operator fusion strategies.

Where Triton and ML compiler engineers contribute on GitHub

Unlike language-specific ecosystems where one or two repositories dominate, ML compiler work is spread across dozens of projects. GitHub is the best sourcing channel for these engineers because almost all of the infrastructure they build is open source. But you need to know where to look.

Triton itself. triton-lang/triton (18,800 stars) is the obvious starting point. Contributors to the compiler (not just example kernels) understand MLIR, GPU code generation, and auto-tuning. The contributor list is relatively short. People who submit PRs to Triton's compiler passes, code generation backends, or auto-tuner are exactly the kind of GPU programming engineers you want to hire. Even substantial issue discussions and bug reports on this repo indicate someone who works with the compiler at a deep level.

LLVM and MLIR. llvm/llvm-project is massive (over 100,000 commits, thousands of contributors), so you need to filter. Focus on the mlir/ subdirectory, particularly dialects related to ML: linalg, tensor, bufferization, gpu, and vector dialects. Contributors to these areas understand how ML computations are represented, transformed, and lowered to hardware-specific code. LLVM backend contributors who work on GPU targets (AMDGPU, NVPTX) are also relevant. The LLVM community ran code review on Phabricator until moving to GitHub pull requests, and design discussion still happens on Discourse, so GitHub contribution graphs alone undercount both history and current activity. Check both.

XLA and StableHLO. openxla/xla and openxla/stablehlo are the core repositories for Google's ML compiler stack. StableHLO defines a stable operation set for ML compilers, making it a portable representation that different backends can target. Contributors here understand graph-level optimizations, hardware-specific lowering, and the challenges of compiling dynamic ML workloads. Many XLA contributors work at Google, but the OpenXLA initiative has attracted contributions from Intel, AMD, and AI chip startups.

PyTorch internals. Within the pytorch/pytorch monorepo, look for contributions to torch/_inductor/, torch/_dynamo/, and torch/compiler/. These directories contain the graph capture and code generation machinery behind torch.compile. Inductor contributors are writing the logic that generates Triton kernels from PyTorch operations. This is a small group within the much larger PyTorch contributor base, and they have a rare combination of framework internals knowledge and compiler engineering skills.

Custom kernel libraries. NVIDIA/cutlass (CUDA Templates for Linear Algebra Subroutines) is where NVIDIA publishes high-performance GEMM and convolution kernels. Flash Attention repositories (Dao-AILab/flash-attention) contain heavily optimized CUDA and Triton code that ships in production ML systems. NVIDIA/FasterTransformer and its successor NVIDIA/TensorRT-LLM are inference optimization libraries. Contributors to these repos write the kernels that set the speed limit for LLM training and inference. They tend to be CUDA experts who may or may not also write Triton.

TVM. apache/tvm contributors understand auto-scheduling, operator fusion, and cross-hardware compilation. TVM's contributor community is more geographically distributed than XLA or Triton, with significant contributions from research groups in China, the US, and Europe. TVM experience maps well to custom AI chip work because TVM was designed from the start to target diverse hardware.

Mojo. modular/mojo is newer but growing. Early contributors to Mojo tend to be experienced systems programmers and compiler engineers who are attracted to the language design work. Modular has hired aggressively from the LLVM and MLIR communities, so Mojo contributors often have deep compiler backgrounds.

JAX ecosystem. jax-ml/jax is a framework, but contributions to its compilation pipeline (tracing, staging, lowering to HLO) require compiler engineering skills. JAX's pjit, custom_vjp, and shard_map primitives expose compiler-level concepts to users, so contributors who work on these interfaces understand both the user-facing API and the compilation machinery underneath.

Quality signals in GPU kernel and compiler code

Evaluating ML compiler candidates on GitHub is harder than evaluating, say, a React developer. The code is dense, domain-specific, and often involves concepts that most software engineers never encounter. General seniority signals still apply, but there are specific markers that separate real expertise from surface familiarity.

GPU memory hierarchy reasoning. The single most telling signal in GPU kernel code is how a developer handles memory. GPUs have a deep memory hierarchy: registers (fastest, per-thread), shared memory (fast, per-block), L1/L2 cache, and global memory (slow, device-wide). A developer who writes a kernel that reads from global memory in a tight loop does not understand GPUs, no matter how correct the logic is. An experienced developer structures computations to load data into shared memory in coalesced patterns, operates on it in shared memory or registers, and writes results back. Look for explicit shared memory management, memory coalescing patterns, and comments explaining data movement decisions.

Tiling strategies. Tiling is the technique of breaking a large computation into smaller blocks that fit into fast memory levels. In Triton, this is the fundamental programming model. In CUDA, it is a design choice that separates fast kernels from slow ones. A strong candidate chooses tile sizes based on hardware constraints (shared memory capacity, register file size, warp size) and the computation's data reuse patterns. If you see a Triton kernel with carefully chosen BLOCK_SIZE parameters and auto-tuning configurations that sweep a meaningful range, that person understands the relationship between tile shape and hardware utilization.
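The tile loop structure described above can be shown in a toy form. This is a plain-Python tiled matrix multiply, not GPU code: the three outer loops walk tiles, and the inner loops are the work that, in a real kernel, would run on data already staged into shared memory. The tile sizes here are arbitrary; on hardware they would be chosen against shared memory capacity, register file size, and reuse patterns.

```python
# Toy tiled matrix multiply. Each (i0, j0, k0) iteration corresponds to one
# tile step: in a real GPU kernel, the A and B tiles for that step are
# loaded into shared memory once, then each element is reused BM or BN
# times from fast storage instead of being re-read from global memory.

def matmul_tiled(A, B, BM=2, BN=2, BK=2):
    M, K, N = len(A), len(A[0]), len(B[0])
    C = [[0] * N for _ in range(M)]
    for i0 in range(0, M, BM):                    # tile rows of C
        for j0 in range(0, N, BN):                # tile columns of C
            for k0 in range(0, K, BK):            # tile the reduction dim
                for i in range(i0, min(i0 + BM, M)):
                    for j in range(j0, min(j0 + BN, N)):
                        for k in range(k0, min(k0 + BK, K)):
                            C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1, 2], [3, 4], [5, 6]]
B = [[7, 8, 9], [10, 11, 12]]
print(matmul_tiled(A, B))
# → [[27, 30, 33], [61, 68, 75], [95, 106, 117]]
```

A candidate who can explain why `BM * BK + BK * BN` values of fast storage buy `BM * BN * BK` multiply-accumulates of reuse understands the tiling tradeoff this sketch flattens into loops.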

Kernel fusion. Fusing multiple operations into a single kernel reduces memory traffic, which is often the bottleneck in GPU computation. A naive approach runs each operation as a separate kernel, reading from and writing to global memory between them. An experienced developer identifies fusion opportunities: an elementwise operation following a reduction, or a softmax followed by a matrix multiply. Flash Attention is the canonical example of aggressive kernel fusion. If a candidate has implemented fused kernels or contributed fusion passes to a compiler, they understand why ML workloads are bottlenecked by memory traffic and how to fix it.
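The memory-traffic argument for fusion is simple arithmetic, and it is worth making explicit. The back-of-envelope model below counts bytes moved for a chain of elementwise operations, unfused versus fused; the numbers are illustrative, not measurements.

```python
# Back-of-envelope global memory traffic for a chain of `num_ops`
# elementwise operations over n elements of dtype_bytes each.
# Unfused: every op round-trips global memory (one read, one write).
# Fused: a single read and a single write cover the whole chain.

def traffic_bytes(n, num_ops, dtype_bytes=2):
    unfused = num_ops * 2 * n * dtype_bytes   # each op: read n + write n
    fused = 2 * n * dtype_bytes               # one read + one write total
    return unfused, fused

unfused, fused = traffic_bytes(n=1 << 20, num_ops=3, dtype_bytes=2)
print(unfused / fused)  # → 3.0: fusing a 3-op chain cuts traffic 3x
```

Since elementwise chains are almost always memory-bound, that 3x traffic reduction translates nearly directly into speedup, which is why fusion passes are among the highest-value optimizations in every ML compiler.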

Auto-tuning awareness. Triton and TVM both use auto-tuning to find optimal configurations (tile sizes, number of warps, number of stages) for a given kernel on specific hardware. An experienced developer writes kernels with tunable parameters and understands which parameters matter most for performance. They know that the optimal configuration changes between GPU generations (what works on an A100 may not be optimal on an H100) and between problem sizes. Look for auto-tuning configs in Triton code, or cost model contributions in TVM. A developer who hardcodes a single configuration without auto-tuning parameters is either writing a one-off kernel or does not understand the optimization space.
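The shape of an auto-tuning sweep looks like the sketch below: enumerate candidate configurations, reject the ones that violate hardware constraints, score the rest, keep the best. The cost model here is a toy stand-in; real auto-tuners (Triton's `@triton.autotune`, TVM's schedulers) time the actual kernel on actual hardware or use learned cost models.

```python
# Toy auto-tuning sweep. The constraint (shared memory capacity) is real
# in spirit; the cost function is an invented stand-in for measuring the
# kernel on hardware.
import itertools

SHARED_MEM_LIMIT = 48 * 1024                  # bytes, typical per-SM budget

def toy_cost(block_m, block_n, num_warps):
    tile_bytes = block_m * block_n * 2        # fp16 tile footprint (toy)
    if tile_bytes > SHARED_MEM_LIMIT:
        return float("inf")                   # config does not fit: reject
    # Invented tradeoff: bigger tiles amortize loads; extra warps add cost.
    return 1.0 / (block_m * block_n) + 0.01 * num_warps

def autotune():
    configs = itertools.product([32, 64, 128], [32, 64, 128], [2, 4, 8])
    return min(configs, key=lambda c: toy_cost(*c))

print(autotune())  # → (128, 128, 2): largest tile that fits, fewest warps
```

The structure is the signal to look for in real code: a declared configuration space, feasibility constraints, and a measured (not assumed) selection. A hardcoded single configuration skips all three.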

Compiler pass design. For compiler engineers (as opposed to kernel developers), the quality signal is in optimization pass design. A well-written compiler pass has a clear scope (one transformation per pass), handles edge cases without special-casing everything, is composable with other passes, and has test coverage that includes adversarial inputs. In MLIR, this means well-defined pattern rewrites and proper handling of types and attributes. In XLA, it means passes that work correctly across the full HLO operation set. Look for passes that include regression tests, performance benchmarks, and clear documentation of what they optimize and when they should run.
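The "one transformation per pass" discipline is easiest to see in miniature. The sketch below folds constants over a tiny invented three-address IR; real passes (MLIR pattern rewrites, XLA HLO passes) have the same shape at vastly larger scale: match a pattern, rewrite it, and leave everything else untouched so passes compose.

```python
# Minimal constant-folding pass over a toy three-address IR. Each
# instruction is (dst, op, a, b); operands are either int constants or
# string value names. The pass folds integer adds, propagates the folded
# constants into later uses, and leaves every other instruction alone.

def fold_constants(ir):
    env, out = {}, []
    for dst, op, a, b in ir:
        a = env.get(a, a)                     # propagate known constants
        b = env.get(b, b)
        if op == "add" and isinstance(a, int) and isinstance(b, int):
            env[dst] = a + b                  # fold and drop the instruction
        else:
            out.append((dst, op, a, b))       # untouched: pass stays narrow
    return out, env

ir = [("t0", "add", 2, 3),                    # foldable: becomes t0 = 5
      ("t1", "add", "x", "t0"),               # rewritten to add x, 5
      ("t2", "mul", "t1", 4)]                 # out of scope: left alone
folded, consts = fold_constants(ir)
print(folded)  # → [('t1', 'add', 'x', 5), ('t2', 'mul', 't1', 4)]
```

The qualities listed above are all visible even here: the scope is one transformation, the non-matching `mul` is passed through rather than special-cased, and the function is trivially composable with a second pass over its output.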

Hardware-software co-design thinking. The strongest ML compiler engineers do not just write software that targets hardware. They reason about how compiler decisions affect hardware utilization and, sometimes, how hardware design should change to make compilation easier. This shows up in code reviews, design documents, and issue discussions where someone explains why a particular instruction scheduling strategy matters for a specific chip's pipeline depth, or why a memory layout decision enables better vectorization. If you see a candidate's PR review that says "this will cause bank conflicts on shared memory because the stride is a multiple of 32" or "this layout change enables 2x vectorization on the load path," they can reason about how code maps to silicon.

Benchmarking rigor. Performance work without measurement is guessing. Strong candidates include benchmarks with their kernel or compiler contributions. Not just "it runs," but "here is the throughput in TFLOPS, here is the memory bandwidth utilization, here is the comparison to cuBLAS/cuDNN, here is how it scales with problem size." They report numbers on specific hardware (A100 80GB, H100 SXM) rather than vague claims. They understand roofline analysis and can explain whether a kernel is compute-bound or memory-bound. Without this discipline, you are looking at someone who writes correct code that nobody would ship.
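The compute-bound versus memory-bound call is a one-line roofline check, shown below with approximate A100 80GB peak numbers (FP16 tensor cores, HBM2e); treat the constants as ballpark figures, not datasheet quotes.

```python
# Roofline sanity check: a kernel is memory-bound when its arithmetic
# intensity (FLOPs per byte of global memory traffic) falls below the
# hardware ridge point (peak FLOP/s divided by peak bandwidth).
# Constants below are approximate A100 80GB figures.

PEAK_FLOPS = 312e12      # ~312 TFLOP/s FP16 tensor core peak
PEAK_BW = 2.0e12         # ~2.0 TB/s HBM bandwidth

def bound(flops, bytes_moved):
    intensity = flops / bytes_moved
    ridge = PEAK_FLOPS / PEAK_BW             # ~156 FLOPs per byte
    return "compute-bound" if intensity >= ridge else "memory-bound"

# Elementwise fp16 add: 1 FLOP per element, 6 bytes moved (2 reads, 1 write).
n = 1 << 20
print(bound(n, 6 * n))                        # → memory-bound

# Large fp16 matmul: 2*M*N*K FLOPs over comparatively little data.
M = N = K = 4096
print(bound(2 * M * N * K, 2 * (M * K + K * N + M * N)))  # → compute-bound
```

A candidate who reports "82 percent of peak bandwidth on a memory-bound kernel" has done this analysis; one who reports only a wall-clock time has not.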

A practical sourcing workflow

The workflow for sourcing ML compiler engineers differs from mainstream engineering roles in important ways. The pool is so small that traditional funnel math (source 100 candidates, contact 50, interview 10, hire 1) does not apply. You might identify 30 qualified candidates total and need to convert a meaningful percentage of them.

Step 1: Define the layer. "ML compiler engineer" spans at least three distinct roles that require different expertise. Kernel developers write GPU kernels in Triton or CUDA. They need GPU architecture knowledge, numerical methods, and performance optimization skills. Compiler frontend engineers work on graph-level optimizations: operator fusion, shape inference, memory planning, and layout assignment. They need compiler theory and ML model architecture knowledge. Compiler backend engineers generate code for specific hardware targets. They need ISA-level understanding of the target chip, register allocation expertise, and instruction scheduling skills. A candidate who is strong at one of these may be weak at the others. Decide which layer you need before sourcing.

Step 2: Map the repositories. Based on the layer, identify where your ideal candidate would be contributing. For kernel developers: triton-lang/triton (examples and kernel libraries), NVIDIA/cutlass, Dao-AILab/flash-attention, pytorch/pytorch (inductor kernels). For compiler frontend: openxla/xla, pytorch/pytorch (dynamo, inductor passes), apache/tvm. For compiler backend: llvm/llvm-project (MLIR, NVPTX, AMDGPU backends), triton-lang/triton (code generation), modular/mojo.

Step 3: Extract and filter contributors. The contributor lists for these repos are shorter than you might expect. triton-lang/triton has a few hundred active contributors over its lifetime. The MLIR subdirectory of LLVM has a few hundred more. Filter for recent activity (last 6 months), non-trivial contributions (not just documentation fixes), and code review participation. Code reviews are a particularly strong signal in compiler projects because reviewing compiler code requires the same depth of understanding as writing it. A developer who reviews MLIR pattern rewrites or Triton code generation passes understands the system at a deep level.
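The filtering step can be expressed as a small predicate over contributor activity records. The record fields below (`merged_prs`, `reviews`, `last_active_days`, `paths`) are hypothetical names for illustration; in practice they would be assembled from the GitHub API or an events dataset, and the path prefixes would be tuned per repository.

```python
# Sketch of the contributor filter from Step 3: recent activity,
# non-trivial contributions, and touches to compiler internals rather
# than docs or examples. Field names and path prefixes are hypothetical.

def is_candidate(c, recency_days=180):
    substantive = c["merged_prs"] >= 3 or c["reviews"] >= 5
    recent = c["last_active_days"] <= recency_days
    deep_paths = any(p.startswith(("lib/", "python/triton/compiler"))
                     for p in c["paths"])      # compiler internals, not docs
    return substantive and recent and deep_paths

contributors = [
    {"login": "a", "merged_prs": 7, "reviews": 12,
     "last_active_days": 30, "paths": ["lib/Conversion/", "docs/"]},
    {"login": "b", "merged_prs": 1, "reviews": 0,
     "last_active_days": 400, "paths": ["docs/"]},
]
print([c["login"] for c in contributors if is_candidate(c)])  # → ['a']
```

Note that review participation counts on equal footing with merged PRs, for the reason given above: reviewing compiler code demands the same depth as writing it.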

Step 4: Check academic and conference connections. ML compiler engineering straddles academia and industry. The top conferences are CGO (Code Generation and Optimization), PLDI (Programming Language Design and Implementation), ASPLOS (Architectural Support for Programming Languages and Operating Systems), and MLSys. Published authors at these venues who also contribute to open source projects are often the strongest candidates. Check Google Scholar profiles and cross-reference with GitHub accounts. Many ML compiler engineers have PhDs, and their dissertation work often maps directly to their industry contributions.

Step 5: Expand to adjacent talent. Not every strong ML compiler candidate has "compiler" in their GitHub bio. Traditional compiler engineers from LLVM, GCC, or JVM projects can transition to ML compilers. High-performance computing (HPC) engineers who optimize scientific code for GPUs have directly transferable skills. Numerical linear algebra researchers who implement dense and sparse solvers understand the computational patterns that ML compilers optimize. CUDA developers who have hit the limits of manual kernel optimization and want to work at a higher level of abstraction are natural Triton adopters. Search for these adjacent profiles and look for signals that they are moving toward ML: stars on ML compiler repos, personal projects that apply compilers to ML workloads, or conference attendance at ML systems venues.

Step 6: Look beyond GitHub. The LLVM and MLIR communities use Discourse forums (discourse.llvm.org) and mailing lists for technical discussions. Active participants in MLIR-related Discourse threads may not have large GitHub contribution histories but could be deeply knowledgeable. The MLIR Open Design Meetings are recorded and publicly available; speakers at these meetings are identifiable. Triton has a community Slack channel. The PyTorch dev forums have threads on Inductor and TorchDynamo. These are small, technical communities where active participants are worth sourcing.

Step 7: Scale with tooling. The manual workflow above is necessary for the first few hires, but it does not scale. Tools like riem.ai automate contributor discovery across 30 million-plus monthly GitHub events. Instead of manually reviewing contributor graphs for a dozen repositories, you can describe the profile in natural language ("engineers who contribute to Triton compiler backends or MLIR GPU dialects") and get a ranked list of candidates with contribution summaries and quality scores. This matters more for ML compilers than for mainstream engineering roles because the signal-to-noise ratio in manual searching is lower. The engineers you want are spread across many repositories, often working on unglamorous compiler passes rather than high-visibility features.

Writing outreach that gets responses

ML compiler engineers receive recruiter messages constantly. The bad ones are obvious: "I noticed your experience with machine learning" attached to a generic job description. These get deleted. Effective developer outreach requires specificity, and for this audience the bar is higher than usual because the candidates can immediately tell whether you understand what they do.

Reference specific contributions. "I saw your PR adding vectorized load support to the Triton AMDGPU backend" is a first sentence that gets the message read. "I noticed your work on GPU computing" is a first sentence that gets the message deleted. These engineers have invested years developing rare expertise. Showing that you understand their specific work distinguishes you from recruiters who searched "compiler engineer" on LinkedIn.

Lead with the technical problem. ML compiler engineers choose roles based on the problem, not the company's funding round or office perks. "We are building a compiler backend for a novel dataflow architecture and need someone who understands tiling and scheduling for non-von Neumann hardware" is more compelling than "we are a well-funded AI startup looking for talented engineers." Describe the actual technical challenge. If you are building a custom chip compiler, explain what makes your architecture different from GPUs. If you are optimizing inference, explain what latency targets you are trying to hit and why existing solutions fall short.

Be specific about compute access. Compiler engineers need hardware to test on. If you have a cluster of H100s, say so. If you are building a custom chip, explain whether the candidate will have access to silicon or only simulators. Compute access is a real differentiator for this audience. An engineer who can test their compiler changes on actual hardware iterates faster and learns faster than one waiting for simulation results.

Address compensation directly. Do not make candidates guess. If you can offer $300,000+ total compensation, say so in the first message. If you are a startup competing with NVIDIA and Google on equity, explain the equity structure and your reasoning for why it is competitive. ML compiler engineers know their market value. Trying to get them into a conversation before discussing compensation wastes their time and yours.

Mention the team. These engineers care about who they will work with. If you have senior compiler engineers on the team already, name them (with permission). If your team includes people who have contributed to LLVM, XLA, or Triton, that is a stronger signal than your company's valuation. Compiler engineering is deeply collaborative. A candidate who knows they will work alongside people who understand their domain is more likely to engage than one joining as the first compiler engineer in a company full of ML researchers.

Frequently asked questions

How many Triton developers are there?

Estimated 500 to 1,000 proficient Triton developers worldwide as of 2026. The language was only open-sourced by OpenAI in 2021, and no universities teach it in their standard curriculum. Despite 18,800 GitHub stars on the triton-lang/triton repository, the gap between starring a project and writing production GPU kernels in it is enormous. Most Triton-proficient engineers learned it on the job at AI labs or through self-directed study of GPU architecture. The pool is growing as more AI companies adopt Triton over raw CUDA, but experienced practitioners remain extremely scarce.

What salary should I expect to pay ML compiler engineers?

ML compiler engineers average around $147,000 base salary in the US, but total compensation at top companies ranges from $250,000 to $500,000 or more. At OpenAI, Google DeepMind, NVIDIA, and Modular, senior ML compiler engineers can earn $400,000 to $600,000 in total compensation including equity. Startups building custom AI chips (Cerebras, Groq, SambaNova, Tenstorrent) offer competitive packages with significant equity upside. The pay reflects how few of these engineers exist: every AI company needs them and there are not enough to go around.

What is the difference between Triton and CUDA?

CUDA is NVIDIA's low-level GPU programming model that gives developers direct control over GPU threads, shared memory, and hardware-specific features. Triton is a higher-level language created by OpenAI that compiles Python-like code into optimized GPU kernels. Triton abstracts away thread management and memory coalescing, letting developers think in terms of blocks and tiles instead of individual threads. A Triton kernel is typically 3 to 5 times shorter than equivalent CUDA code. The tradeoff is less fine-grained control over hardware, though Triton's auto-tuning often produces performance within 10 to 20 percent of hand-optimized CUDA. Triton is sometimes called "the next CUDA" because it lowers the barrier to GPU kernel programming while still requiring deep understanding of GPU memory hierarchies and parallel computation.

Do ML compiler engineers need hardware experience?

It depends on the role. Compiler engineers working on frontends (graph-level optimizations, operator fusion, shape inference) can work primarily in software without deep hardware knowledge. But engineers working on backends (code generation, register allocation, instruction scheduling for specific hardware targets) need to understand the target architecture at a detailed level: memory hierarchies, execution units, instruction latencies, and data movement costs. For custom AI chips (Cerebras, Groq, Tenstorrent), hardware-software co-design experience is almost always required. The strongest ML compiler engineers can read a chip's architecture manual and reason about how compiler decisions affect silicon utilization.

What projects should I look for on GitHub when hiring ML compiler engineers?

For Triton: triton-lang/triton (18,800 stars). For MLIR and LLVM: llvm/llvm-project, especially the mlir/ subdirectory. For XLA: openxla/xla and openxla/stablehlo. For Mojo: modular/mojo. For PyTorch internals: pytorch/pytorch contributions to torch.compile, Inductor, or TorchDynamo. For JAX internals: jax-ml/jax contributions to the compiler pipeline. For TVM: apache/tvm. For custom kernel libraries: NVIDIA/cutlass, NVIDIA/FasterTransformer, and flash-attention repositories. Contributors to any of these projects do the compiler and hardware work that ML infrastructure depends on.

How long does it take to hire a Triton or ML compiler engineer?

Expect 90 to 180 days for senior ML compiler roles, and potentially longer for staff-level positions. The bottleneck is not pipeline efficiency but candidate availability. With only 500 to 1,000 proficient Triton developers and 3,000 to 8,000 ML compiler engineers globally, most qualified candidates are already employed at well-funded AI companies with strong retention packages. Sourcing from GitHub contribution data, MLIR and LLVM community mailing lists, and conference speaker lists (CGO, PLDI, ASPLOS, MLSys) can help identify candidates who are not on job boards. Be prepared to compete on compensation, technical challenge, and compute access.

Find the engineers who've already built it

Search 30M+ monthly GitHub events. Match on real code, not resumes.

Get started