Tech Arch
← Projects

Memory Profiler — Live Stats

A real profile run, measured on a free Tesla T4 GPU. Model: microsoft/Phi-3-mini-4k-instruct (3.8B params).

Weights (FP16)
7.64 GB
Peak memory
7.86 GB
INT8 reduction
−47%
Bottleneck
Memory-bound

Memory breakdown (FP16)

Weights7.64 GB · 97%
KV cache + activations0.21 GB · 3%

Weights dominate — the textbook signal that quantization is the highest-leverage win.

FP16 vs INT8 (peak memory)

FP167.86 GB
7.86
INT84.25 GB
4.25

INT8 frees 3.6 GB — enough to fit on a smaller, cheaper GPU.

Bottleneck analysis

Compute (SM) utilization80.8%
Memory-bandwidth utilization82.8%

Verdict: memory-bound. Bandwidth utilization edges out compute — every output token reads all weights from HBM, so smaller weights and faster memory help more than more FLOPs.

KV cache grows with context

Measured ~0.41 MB per token — KV cache scales linearly with context length, the core driver of agentic-AI memory pressure.

🤖 Advisor recommendations

1 · Quantize weights FP16 → INT8

Weights are 97% of peak. Measured INT8: 7.64→4.02 GB (47% smaller); peak 7.86→4.25 GB, freeing 3.6 GB.

Savings: 3.6 GB · Quality cost: ~<1% (bitsandbytes / AWQ / GPTQ)

2 · Optimize for memory bandwidth

Decode is memory-bandwidth-bound. Quantization and faster HBM help more than added FLOPs; batching amortizes the per-token weight read.

3 · Re-profile at production context length

This run's KV cache is small (short prompt). KV grows ~linearly with context × concurrency — size it at your real workload to find the true ceiling.

Raw profile output

{
  "model": "microsoft/Phi-3-mini-4k-instruct",
  "gpu": "Tesla T4",
  "fp16": { "weights_gb": 7.642, "kv_cache_plus_activations_gb": 0.215,
            "peak_total_gb": 7.857, "avg_compute_util_pct": 80.8,
            "avg_memory_bw_util_pct": 82.8, "memory_bound": true },
  "int8": { "weights_gb": 4.024, "peak_total_gb": 4.253, "memory_bound": false },
  "kv_by_tokens": { "128": 0.057, "512": 0.213, "1024": 0.42 }
}

System design

A two-stage agentic pipeline. A GPU Profiler measures what's actually happening in memory on real hardware; an LLM Advisor then reasons over those measurements to produce ranked, quantified fixes. Profiling runs on-demand on a Modal GPU; the advisor runs on NVIDIA NIM.

MemoryProfiler architecture: you provide a model and a target GPU; the Profiler and Advisor agents run the model on a real GPU and return a ranked report plus JSON and a dashboard.
The big picture — request → two agents → real GPU → ranked report.
The two agents inside: a Profiler agent (load_model, memory_snapshot, run_inference, nvidia_smi_sample) and an Advisor agent that reads the profile JSON and outputs ranked recommendations.
The two agents — the Profiler measures, the Advisor interprets.

The pipeline, stage by stage

1

Request

A user runs the memoryprofiler CLI or calls the FastAPI service with a model and a target GPU.

CLI FastAPI
model + target GPU
2

Profiler agent Modal · Tesla T4

Loads the model on a real GPU and measures the memory split — weights vs KV cache vs activations — peak usage, and whether the workload is memory-bound or compute-bound.

PyTorch Modal GPU
measured profile (JSON)
3

Advisor agent NVIDIA Nemotron · NIM

Reads the measured profile and reasons over GPU economics, returning ranked, plain-English recommendations with quantified savings — no ML-performance engineer required.

NeMo Agent Toolkit NIM / Nemotron
ranked recommendations
4

Output

Delivered as a CLI report, raw JSON, or this live dashboard — the verdict, headline savings, and ranked next steps.

Want a profile of your own workload?

We'll measure where your GPU memory and money go — and hand you a quantified plan.

Request an audit