Tech Arch
← Projects

CUDA Ops — Verified Run

A real kernel-optimization run on a real GPU. Hardware: NVIDIA A10 (sm_86, full passthrough, Lambda Cloud). Input: a deliberately naive 1024×1024 single-precision matmul. The agent loop measured it, diagnosed it, rewrote it, and proved the rewrite was correct and faster before accepting it.

Baseline kernel
1.185 ms
naive_matmul, cudaEvent timed
Optimized kernel
0.936 ms
agent-generated tiled rewrite
Verified speedup
1.27×
accepted at iteration 1
Diagnosed bottleneck
L1 load-throughput limited
SM 94% / DRAM 1.5%

Profiler diagnosis

From ncu --set full on the naive kernel — 89 metrics parsed, five of them discriminating:

SM throughput93.93%
L1/TEX throughput99.28%
DRAM throughput1.49%
L1 hit rate94.93%
Achieved occupancy65.72%

What that means

SM throughput pinned at 94% looks compute-bound — but DRAM at 1.5% and L1 at 99% tells a sharper story.

Verdict

The SMs aren't choked on FLOPs — they're choked on the load/store unit hitting L1. Each output element re-reads its row of A and column of B from L1 with no proper data reuse.

Fix: shared-memory tiling — load tiles into __shared__ once, reuse them across many threads.

This refined verdict — distinguishing compute-bound subtypes — is what the Profiler MCP passes to the Optimizer as the diagnosis string.

Verification result

The supervisor accepts only when all three gates pass: compile, correctness, and a speedup threshold. Process wall-time would hide a small kernel inside cudaMalloc + memcpy overhead, so the sandbox uses kernel-only cudaEventRecord timing parsed from the kernel's own stderr.

Gate Baseline Candidate Result
nvcc -O3 -arch=sm_86 OK OK ✓ compile_ok
stdout checksum + 4 corners reference identical (rtol 1e−3) ✓ correctness=match
Kernel time (mean of 20 cudaEvent iters) 1.185 ms 0.936 ms ✓ 1.27× ≥ 1.2× bar
Process wall time 218 ms 217 ms cudaMalloc + 8 MB memcpy dominate; kernel speedup invisible at this layer

Loop time end-to-end: ~45 seconds. One profile, one research call, one optimize call, one verify — accepted on the first attempt.

The kernel the agent wrote

Generated by nvidia/llama-3.3-nemotron-super-49b-v1 on NVIDIA NIM, given the Profiler's diagnosis plus a canonical shared-memory-tiling exemplar in the system prompt. No external libraries (cuBLAS / cuDNN / thrust are explicitly forbidden — only hand-written kernels). No retry needed.

#define TILE 32

__global__ void tiled_matmul(const float* A, const float* B, float* C, int N) {
    __shared__ float sA[TILE][TILE];
    __shared__ float sB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;

        sA[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        sB[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();

        #pragma unroll
        for (int k = 0; k < TILE; ++k) {
            acc += sA[threadIdx.y][k] * sB[k][threadIdx.x];
        }
        __syncthreads();
    }

    if (row < N && col < N) C[row * N + col] = acc;
}

A correct, idiomatic 32×32 tile with paired __syncthreads() around the compute and an unrolled inner reduction. The kind of kernel a senior CUDA dev writes for a first-pass optimization — produced autonomously, then proven correct against the baseline's stdout reference.

System design

Four agents and three MCP tool servers, orchestrated by a LangGraph state machine with a verification loop. The loop only accepts a kernel that is both measurably faster and bitwise-equivalent to the baseline. Failed attempts feed back as anti-context: "your previous attempt didn't clear the bar — try a different approach."

CUDA Ops architecture: a Supervisor agent orchestrates three specialist agents — Profiler (Nsight Compute), Research (RAG over CUDA docs), Optimizer (NVIDIA NIM Nemotron) — over a shared state, with verification feedback.
The 4-agent team — Supervisor orchestrates Profiler, Research, and Optimizer over shared state.
CUDA Ops verification loop: Optimizer writes new code, Sandbox compiles + benchmarks it, Verify checks compile_ok and correctness=match and speedup≥threshold; if all gates pass the candidate is accepted, otherwise the failed attempt feeds back as anti-context and the loop retries up to max_iterations.
The verification loop — accept only if compile + correctness + speedup all pass; otherwise retry with the failed attempt as anti-context.

The pipeline, stage by stage

1

Profiler agent Nsight Compute

Compiles the kernel with nvcc, runs ncu --set full, parses 89 metrics, and produces a structured bottleneck diagnosis — including subtype refinement (e.g. compute-bound + L1 saturated + DRAM idle ⇒ load/store-unit limited, not FMA-limited).

CUDA / nvcc Nsight Compute (ncu) FastMCP
structured ProfileReport (verdict + 89 metrics)
2

Research agent NeMo Retriever · pgvector

Searches a CUDA-doc RAG corpus (embedded with NVIDIA's NeMo Retriever, reranked, then fed to a reasoner) for techniques relevant to the diagnosed bottleneck. Degrades gracefully when the corpus is empty.

NIM Embed NIM Rerank pgvector
ResearchFindings (summary + citations)
3

Optimizer agent Nemotron Super 49B

Generates a rewritten kernel given the diagnosis + research notes + a canonical tiling exemplar. Strict rules in the prompt: hand-written kernel only (no cuBLAS/cuDNN/thrust), preserve the public signature, time the kernel with cudaEventRecord and emit KERNEL_MS: on stderr.

NIM / Nemotron 49B LangChain
candidate .cu file
4

Sandbox + Verify nvcc + cudaEvent + LangGraph

Compiles the candidate, runs it under cudaEventRecord timing (mean of 20 kernel iterations per run, after warmup), parses KERNEL_MS from stderr, compares stdout to the baseline reference. The verify node either accepts (compile_ok & correctness=match & speedup ≥ 1.2×), retries with the failed attempt as anti-context, or exhausts after max_iterations.

LangGraph nvcc FastAPI

What the bring-up surfaced

Ten things this exercise made explicit that aren't in the CUDA marketing copy. Most cost an hour each the first time:

1 · Container GPU clouds can't profile

RunPod Community pods are missing CAP_SYS_ADMIN from their capability bounding set. setcap cap_sys_admin+ep ncu succeeds, then exec dies with Operation not permitted. Nsight needs the cap; the namespace forbids it.

2 · vGPU partitions can't profile either

Vultr's "A16 16 GB" plan is an A16-4Q vGPU profile — the hypervisor strips perf counters from guests regardless of driver flags. Same trap as Colab.

3 · Lambda Stack ships nvcc, not ncu

CUDA toolkit is preinstalled but Nsight Compute isn't. apt install nsight-compute brings it down from NVIDIA's repo (~400 MB).

4 · The driver flag needs a reboot

NVreg_RestrictProfilingToAdminUsers=0 in modprobe.d only takes effect after a reboot — the running NVIDIA driver is in use by Lambda Stack services and can't be hot-reloaded.

5 · nvidia-nat needs Python 3.11+

Lambda Stack's default Python is 3.10. The NVIDIA Agent Toolkit hard-requires ≥3.11; install python3.11 from the deadsnakes PPA (already configured) and rebuild the venv.

6 · The NIM model id has a doubled prefix

It's nvidia/nvidia-nemotron-nano-9b-v2, not nvidia/nemotron-nano-9b-v2. The wrong name returns 404, easily confused for an auth problem.

7 · ChatNVIDIA env-var pickup is unreliable

Either pass api_key= explicitly, or set -a; source .env; set +a to push the key into the process env before launching python. Don't trust auto-detection.

8 · load_dotenv searches up from the script

load_dotenv() walks up from the calling .py file's location, not CWD. Pass dotenv_path=Path(...)/'.env' explicitly or your env stays empty.

9 · Wall-time hides kernel speedup

A 2 ms kernel inside a 200 ms process — the rest is CUDA context init + 8 MB of cudaMemcpy + printf — shows 1.00× wall delta even when the kernel is 4× faster. Time with cudaEventRecord inside the .cu and emit KERNEL_MS on stderr.

10 · LangGraph conditional edges are pure routers

Mutating state inside a conditional-edge function is silently dropped. State writes belong on real nodes (e.g. accepted_node); the conditional just returns the routing key.

Raw run record

{
  "date": "2026-05-28",
  "gpu": "NVIDIA A10 (sm_86, full passthrough)",
  "provider": "Lambda Cloud on-demand",
  "kernel": "naive_matmul (1024x1024 SP, un-tiled)",
  "profile": {
    "verdict": "compute-bound (L1/load-throughput limited - shared-memory tiling)",
    "sm_throughput_pct": 93.93,
    "l1_throughput_pct": 99.28,
    "dram_throughput_pct": 1.49,
    "l1_hit_rate_pct": 94.93,
    "achieved_occupancy_pct": 65.72,
    "metrics_parsed": 89
  },
  "verify": {
    "compile_ok": true,
    "correctness": "match",
    "baseline_kernel_ms": 1.185,
    "candidate_kernel_ms": 0.936,
    "speedup": 1.27,
    "iteration_accepted": 1,
    "max_iterations": 3,
    "accept_threshold": 1.2
  },
  "loop_time_seconds": 45.5,
  "cost_usd_approx": { "gpu": 5, "nim_tokens": 1 }
}

What it does not do

Want this on your kernels?

We'll profile your hot kernel, diagnose the bottleneck on real hardware, and propose — and verify — a faster rewrite.

Get in touch