A real kernel-optimization run on a real GPU. Hardware: NVIDIA A10 (sm_86, full passthrough, Lambda Cloud). Input: a deliberately naive 1024×1024 single-precision matmul. The agent loop measured it, diagnosed it, rewrote it, and proved the rewrite was correct and faster before accepting it.
From ncu --set full on the naive kernel — 89 metrics parsed, five of them discriminating:
SM throughput pinned at 94% looks compute-bound — but DRAM at 1.5% and L1 at 99% tells a sharper story.
The SMs aren't choked on FLOPs — they're choked on the load/store unit hitting L1. Each output element re-reads its row of A and column of B from L1 with no proper data reuse.
Fix: shared-memory tiling — load tiles into __shared__ once, reuse them across many threads.
This refined verdict — distinguishing compute-bound subtypes — is what the Profiler MCP passes to the Optimizer as the diagnosis string.
The supervisor accepts only when all three gates pass: compile, correctness, and a speedup threshold. Process wall-time would hide a small kernel inside cudaMalloc + memcpy overhead, so the sandbox uses kernel-only cudaEventRecord timing parsed from the kernel's own stderr.
| Gate | Baseline | Candidate | Result |
|---|---|---|---|
| nvcc -O3 -arch=sm_86 | OK | OK | ✓ compile_ok |
| stdout checksum + 4 corners | reference | identical (rtol 1e−3) | ✓ correctness=match |
| Kernel time (mean of 20 cudaEvent iters) | 1.185 ms | 0.936 ms | ✓ 1.27× ≥ 1.2× bar |
| Process wall time | 218 ms | 217 ms | cudaMalloc + 8 MB memcpy dominate; kernel speedup invisible at this layer |
Loop time end-to-end: ~45 seconds. One profile, one research call, one optimize call, one verify — accepted on the first attempt.
Generated by nvidia/llama-3.3-nemotron-super-49b-v1 on NVIDIA NIM, given the Profiler's diagnosis plus a canonical shared-memory-tiling exemplar in the system prompt. No external libraries (cuBLAS / cuDNN / thrust are explicitly forbidden — only hand-written kernels). No retry needed.
#define TILE 32
__global__ void tiled_matmul(const float* A, const float* B, float* C, int N) {
__shared__ float sA[TILE][TILE];
__shared__ float sB[TILE][TILE];
int row = blockIdx.y * TILE + threadIdx.y;
int col = blockIdx.x * TILE + threadIdx.x;
float acc = 0.0f;
for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
int aCol = t * TILE + threadIdx.x;
int bRow = t * TILE + threadIdx.y;
sA[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
sB[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
__syncthreads();
#pragma unroll
for (int k = 0; k < TILE; ++k) {
acc += sA[threadIdx.y][k] * sB[k][threadIdx.x];
}
__syncthreads();
}
if (row < N && col < N) C[row * N + col] = acc;
}
A correct, idiomatic 32×32 tile with paired __syncthreads() around the compute and an unrolled inner reduction. The kind of kernel a senior CUDA dev writes for a first-pass optimization — produced autonomously, then proven correct against the baseline's stdout reference.
Four agents and three MCP tool servers, orchestrated by a LangGraph state machine with a verification loop. The loop only accepts a kernel that is both measurably faster and bitwise-equivalent to the baseline. Failed attempts feed back as anti-context: "your previous attempt didn't clear the bar — try a different approach."
Compiles the kernel with nvcc, runs ncu --set full, parses 89 metrics, and produces a structured bottleneck diagnosis — including subtype refinement (e.g. compute-bound + L1 saturated + DRAM idle ⇒ load/store-unit limited, not FMA-limited).
Searches a CUDA-doc RAG corpus (embedded with NVIDIA's NeMo Retriever, reranked, then fed to a reasoner) for techniques relevant to the diagnosed bottleneck. Degrades gracefully when the corpus is empty.
Generates a rewritten kernel given the diagnosis + research notes + a canonical tiling exemplar. Strict rules in the prompt: hand-written kernel only (no cuBLAS/cuDNN/thrust), preserve the public signature, time the kernel with cudaEventRecord and emit KERNEL_MS: on stderr.
Compiles the candidate, runs it under cudaEventRecord timing (mean of 20 kernel iterations per run, after warmup), parses KERNEL_MS from stderr, compares stdout to the baseline reference. The verify node either accepts (compile_ok & correctness=match & speedup ≥ 1.2×), retries with the failed attempt as anti-context, or exhausts after max_iterations.
Ten things this exercise made explicit that aren't in the CUDA marketing copy. Most cost an hour each the first time:
RunPod Community pods are missing CAP_SYS_ADMIN from their capability bounding set. setcap cap_sys_admin+ep ncu succeeds, then exec dies with Operation not permitted. Nsight needs the cap; the namespace forbids it.
Vultr's "A16 16 GB" plan is an A16-4Q vGPU profile — the hypervisor strips perf counters from guests regardless of driver flags. Same trap as Colab.
CUDA toolkit is preinstalled but Nsight Compute isn't. apt install nsight-compute brings it down from NVIDIA's repo (~400 MB).
NVreg_RestrictProfilingToAdminUsers=0 in modprobe.d only takes effect after a reboot — the running NVIDIA driver is in use by Lambda Stack services and can't be hot-reloaded.
Lambda Stack's default Python is 3.10. The NVIDIA Agent Toolkit hard-requires ≥3.11; install python3.11 from the deadsnakes PPA (already configured) and rebuild the venv.
It's nvidia/nvidia-nemotron-nano-9b-v2, not nvidia/nemotron-nano-9b-v2. The wrong name returns 404, easily confused for an auth problem.
Either pass api_key= explicitly, or set -a; source .env; set +a to push the key into the process env before launching python. Don't trust auto-detection.
load_dotenv() walks up from the calling .py file's location, not CWD. Pass dotenv_path=Path(...)/'.env' explicitly or your env stays empty.
A 2 ms kernel inside a 200 ms process — the rest is CUDA context init + 8 MB of cudaMemcpy + printf — shows 1.00× wall delta even when the kernel is 4× faster. Time with cudaEventRecord inside the .cu and emit KERNEL_MS on stderr.
Mutating state inside a conditional-edge function is silently dropped. State writes belong on real nodes (e.g. accepted_node); the conditional just returns the routing key.
{
"date": "2026-05-28",
"gpu": "NVIDIA A10 (sm_86, full passthrough)",
"provider": "Lambda Cloud on-demand",
"kernel": "naive_matmul (1024x1024 SP, un-tiled)",
"profile": {
"verdict": "compute-bound (L1/load-throughput limited - shared-memory tiling)",
"sm_throughput_pct": 93.93,
"l1_throughput_pct": 99.28,
"dram_throughput_pct": 1.49,
"l1_hit_rate_pct": 94.93,
"achieved_occupancy_pct": 65.72,
"metrics_parsed": 89
},
"verify": {
"compile_ok": true,
"correctness": "match",
"baseline_kernel_ms": 1.185,
"candidate_kernel_ms": 0.936,
"speedup": 1.27,
"iteration_accepted": 1,
"max_iterations": 3,
"accept_threshold": 1.2
},
"loop_time_seconds": 45.5,
"cost_usd_approx": { "gpu": 5, "nim_tokens": 1 }
}
tempfile directory; production should run inside gVisor / Firecracker with cgroup limits and no network.We'll profile your hot kernel, diagnose the bottleneck on real hardware, and propose — and verify — a faster rewrite.
Get in touch