Tech Arch

Projects

Real, working agentic AI — built end to end, measured on real hardware, and shipped to public repositories.

Memory Profiler

An agentic tool that profiles how an LLM uses GPU memory on real hardware and produces ranked, quantified optimization recommendations — cutting inference cost without an ML-performance engineer on staff.

How it works

  • Profiler: loads a model on a GPU, measures how memory splits across weights, KV cache, and activations, and classifies the workload as memory-bound or compute-bound.
  • Advisor: NVIDIA Nemotron (via NIM) reads the profile and returns ranked, plain-English recommendations.
  • Delivery: exposed as a CLI and a FastAPI service; the GPU step runs on demand via Modal.

Measured result (Tesla T4)

  • FP16 weights: 7.64 GB → INT8: 4.02 GB
  • 47% memory reduction, 3.6 GB freed
  • Correctly detected the workload as memory-bandwidth-bound
  • KV cache measured scaling ~linearly with context length
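
The numbers above can be sanity-checked with a little arithmetic. The sketch below is illustrative only and is not the profiler's code: `kv_cache_bytes` uses the standard back-of-envelope estimate for a decoder-only transformer (one K and one V tensor per layer), and the layer count, KV-head count, and head dimension are hypothetical values, not the profiled model's configuration. `reduction` simply recomputes the freed memory and percentage from the two measured weight sizes.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size: K and V tensors for every layer and token."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


def reduction(before_gb, after_gb):
    """Return (GB freed, percent reduction) between two measurements."""
    freed = before_gb - after_gb
    return freed, 100.0 * freed / before_gb


if __name__ == "__main__":
    # Hypothetical model shape, chosen only to show the linear trend.
    cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128)
    for seq_len in (1024, 2048, 4096):
        mib = kv_cache_bytes(seq_len=seq_len, **cfg) / 2**20
        print(f"context {seq_len}: {mib:.0f} MiB of KV cache (FP16)")

    # Measured weight sizes from the Tesla T4 run above.
    freed_gb, pct = reduction(7.64, 4.02)
    print(f"{freed_gb:.2f} GB freed, {pct:.0f}% reduction")
```

Because sequence length enters the estimate as a plain multiplier, doubling the context doubles the estimate, which matches the roughly linear scaling the profiler measured.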
Python · PyTorch · NVIDIA NeMo Agent Toolkit · NIM / Nemotron · FastAPI · Modal

CUDA Ops

Verified 1.27× on naive matmul · A10

A multi-agent system that takes a CUDA kernel, profiles it with NVIDIA Nsight Compute on a real GPU, diagnoses the bottleneck (compute-bound vs memory-bound vs L1/load-throughput-limited), and uses an LLM to generate a faster rewrite — then proves the change is correct and measurably faster before accepting it. First verified run: a naive 1024×1024 matmul (1.185 ms baseline) was rewritten by the optimizer as a shared-memory-tiled kernel (0.936 ms, identical output) — accepted in a single iteration.
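
The "prove it before accepting it" step can be pictured as a small gate. The following is a minimal sketch under stated assumptions, not the project's actual code: the function names, the tolerances, and the `min_speedup` threshold are all hypothetical, and the output buffers are stand-in Python lists rather than real GPU results.

```python
import math


def outputs_match(reference, candidate, rel_tol=1e-4, abs_tol=1e-6):
    """Element-wise comparison of two flat output buffers."""
    if len(reference) != len(candidate):
        return False
    return all(math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
               for r, c in zip(reference, candidate))


def accept_rewrite(baseline_ms, candidate_ms, reference, candidate,
                   min_speedup=1.05):
    """Accept a rewrite only if it is both correct and measurably faster.

    Returns (accepted, speedup). min_speedup is a hypothetical threshold.
    """
    if baseline_ms <= 0 or candidate_ms <= 0:
        raise ValueError("timings must be positive")
    speedup = baseline_ms / candidate_ms
    correct = outputs_match(reference, candidate)
    return correct and speedup >= min_speedup, speedup


if __name__ == "__main__":
    # Timings from the first verified run; the buffers are placeholders.
    accepted, speedup = accept_rewrite(1.185, 0.936, [1.0, 2.0], [1.0, 2.0])
    print(f"accepted={accepted}, speedup={speedup:.2f}x")
```

Checking correctness and speed together means a faster kernel with wrong output is rejected just like a correct kernel that is no faster.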

Nsight Compute · LangGraph · NIM / Nemotron 49B · FastMCP · CUDA / nvcc · FastAPI

chinnamAI

Live · Public Agentic · Stateless

A public, stateless agentic AI article generator. Type any AI/ML topic; a LangGraph state machine (Research → Drafter → Critic → Verify with a retry edge) runs against an arXiv research-paper corpus and produces a sourced, grounded article, streamed live to the visitor and downloadable as a PDF. Every agent has real tool use: the Researcher picks its own searches, the Drafter calls verify_claim mid-write to ground specific assertions before they ship, and the Critic produces a structured grounded/ungrounded verdict via forced tool output. Per-run cost capped, per-IP rate-limited, nothing saved.
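
To make the control flow concrete, here is a rough sketch of a four-node pipeline with a retry edge, written in plain Python rather than with the LangGraph API. Everything in it is assumed for illustration: the node bodies are stubs, `MAX_ATTEMPTS` is a hypothetical cap, and routing an ungrounded Critic verdict back to the Drafter is only one plausible reading of where the retry edge sits.

```python
from typing import Any, Callable, Dict

State = Dict[str, Any]
MAX_ATTEMPTS = 3  # hypothetical cap on drafting attempts


def research(state: State) -> str:
    state["sources"] = [f"stub source for {state['topic']}"]
    return "draft"


def draft(state: State) -> str:
    state["attempts"] = state.get("attempts", 0) + 1
    state["article"] = f"draft #{state['attempts']} on {state['topic']}"
    return "critic"


def critic(state: State) -> str:
    # Stub verdict: pretend the first draft is ungrounded.
    state["grounded"] = state["attempts"] >= 2
    if state["grounded"]:
        return "verify"
    if state["attempts"] < MAX_ATTEMPTS:
        return "draft"  # assumed retry edge: back to the Drafter
    return "end"  # attempts exhausted; stop without verifying


def verify(state: State) -> str:
    state["verified"] = bool(state["grounded"])
    return "end"


NODES: Dict[str, Callable[[State], str]] = {
    "research": research, "draft": draft,
    "critic": critic, "verify": verify,
}


def run(topic: str) -> State:
    """Walk the graph from 'research' until a node routes to 'end'."""
    state: State = {"topic": topic, "trace": []}
    node = "research"
    while node != "end":
        state["trace"].append(node)
        node = NODES[node](state)
    return state


if __name__ == "__main__":
    print(" -> ".join(run("KV cache quantization")["trace"]))
```

Each node returns the name of the next node, so the retry edge is just a conditional return value, and the attempt cap keeps the loop bounded.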

LangGraph · Claude Sonnet 4.6 · RAG · pgvector · Next.js 16 · SSE streaming · Puppeteer PDF
Try it now →

Want something like this built for your team?

Get in touch