Memory Profiler
An agentic tool that profiles how an LLM uses GPU memory on real hardware and produces ranked, quantified optimization recommendations, so teams can cut inference cost without an ML-performance engineer on staff.
How it works
- Profiler: loads a model on a GPU, measures how memory splits across weights, KV cache, and activations, and classifies the workload as memory-bound or compute-bound.
- Advisor: NVIDIA Nemotron (via NIM) reads the profile and returns ranked, plain-English recommendations.
- Interface: exposed as a CLI and a FastAPI service; the GPU step runs on demand via Modal.
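The memory-bound vs compute-bound classification can be illustrated with a simple roofline check. This is a minimal sketch, not the tool's actual implementation: the `Gpu` class and `classify` function are hypothetical names, the T4 figures (about 65 TFLOPS FP16 tensor throughput and 320 GB/s memory bandwidth) are nominal values treated as assumptions, and the parameter count is made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    """Hypothetical hardware description used only for this sketch."""
    name: str
    peak_flops: float     # peak compute throughput, FLOP/s
    mem_bandwidth: float  # peak memory bandwidth, bytes/s

    @property
    def ridge_point(self) -> float:
        # Arithmetic intensity (FLOP/byte) at which compute and memory limits meet
        return self.peak_flops / self.mem_bandwidth


def classify(flops: float, bytes_moved: float, gpu: Gpu) -> str:
    """Label a workload by comparing its arithmetic intensity to the ridge point."""
    intensity = flops / bytes_moved
    return "memory-bound" if intensity < gpu.ridge_point else "compute-bound"


# Assumed nominal figures for a Tesla T4; check the datasheet before relying on them.
T4 = Gpu("Tesla T4", peak_flops=65e12, mem_bandwidth=320e9)

# Batch-1 decoding: each FP16 weight (2 bytes) is read once per token and
# used in one multiply-add (2 FLOPs), so intensity is about 1 FLOP/byte.
n_params = 1_000_000_000  # hypothetical model size
decode = classify(flops=2 * n_params, bytes_moved=2 * n_params, gpu=T4)
print(T4.name, "decode step:", decode)
```

Under these assumptions the ridge point is roughly 200 FLOP/byte, so single-stream token generation sits far on the memory-bound side. A real profiler would presumably derive the FLOP and byte counts from measurements rather than from a parameter count.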
Measured result (Tesla T4)
- Weights: 7.64 GB in FP16 → 4.02 GB in INT8
- 47% reduction in weight memory (about 3.6 GB freed)
- The workload was correctly identified as memory-bandwidth-bound
- Measured KV cache size scaled approximately linearly with context length
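The arithmetic behind these figures can be checked with a short sketch. The `reduction` helper reproduces the savings from the two measured weight sizes above. The `kv_cache_bytes` helper uses the standard KV cache sizing formula to show why linear scaling with context length is expected; its default layer count, KV head count, and head dimension are hypothetical, since the profiled model's configuration is not given here.

```python
def reduction(before_gb: float, after_gb: float) -> tuple[float, float]:
    """Return (GB freed, percent reduction) between two memory footprints."""
    freed = before_gb - after_gb
    return freed, 100.0 * freed / before_gb


def kv_cache_bytes(seq_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, batch: int = 1,
                   bytes_per_elem: int = 2) -> int:
    """KV cache size in bytes; the default model shape is a made-up example."""
    # Leading factor 2: one key tensor and one value tensor per layer.
    # seq_len appears exactly once, so the size is linear in context length.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Measured weight sizes from the results above (FP16 vs INT8)
freed_gb, pct = reduction(7.64, 4.02)
print(f"freed {freed_gb:.2f} GB ({pct:.0f}% reduction)")

# Doubling the context length doubles the cache
for ctx in (1024, 2048, 4096):
    print(ctx, "tokens:", kv_cache_bytes(ctx) / 2**20, "MiB")
```

Every term other than `seq_len` is fixed for a given model and batch size, which is why a measured curve that is close to a straight line is a useful sanity check on the profiler.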