Projects | Tech Arch Inc

Apex Copilot

Verified 100% limit-safe vs 67% base

A small open code model fine-tuned into a Salesforce Apex/LWC specialist that writes governor-limit-safe, bulkified code. On an objective, executable eval over a held-out suite, it beats the base model: 100% governor-limit-safe output vs 67% for the base, and +27 points pass@1. Try it live on the project page: type a Salesforce task and watch the base and fine-tuned models answer side by side, each scored in real time.
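As a rough illustration of how a base-vs-fine-tuned comparison like this can be scored, here is a minimal Python sketch. It is not the project's eval harness: the `EvalResult` record, the `rate` and `compare` helpers, and the three-task toy data are all hypothetical, and the real checks (running generated Apex against tests and governor limits) are assumed to have already produced the booleans.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """One task's outcome for one model (hypothetical record shape)."""
    task_id: str
    passed: bool       # the first sample passed the task's executable checks
    limit_safe: bool   # no governor-limit violation, e.g. no SOQL/DML in a loop


def rate(results, field_name):
    """Fraction of results whose boolean field is True."""
    if not results:
        raise ValueError("no results to score")
    return sum(getattr(r, field_name) for r in results) / len(results)


def compare(base, tuned):
    """Percentage-point deltas (tuned minus base) for both metrics."""
    return {
        "pass@1_delta_pts": round(
            100 * (rate(tuned, "passed") - rate(base, "passed")), 1),
        "limit_safe_delta_pts": round(
            100 * (rate(tuned, "limit_safe") - rate(base, "limit_safe")), 1),
    }


# Toy data only: three made-up tasks, not the project's held-out suite.
base_results = [
    EvalResult("t1", passed=True, limit_safe=True),
    EvalResult("t2", passed=False, limit_safe=True),
    EvalResult("t3", passed=False, limit_safe=False),
]
tuned_results = [
    EvalResult("t1", passed=True, limit_safe=True),
    EvalResult("t2", passed=True, limit_safe=True),
    EvalResult("t3", passed=False, limit_safe=True),
]

deltas = compare(base_results, tuned_results)
```

With one sample per task, pass@1 reduces to the plain pass fraction, which is why a single `rate` helper covers both metrics here.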

QLoRA / PEFT, Qwen2.5-Coder, Eval harness, Transformers / TRL, RunPod Serverless

Try it live →

CUDA Ops

Verified 1.27× on naive matmul · A10

A multi-agent system that takes a CUDA kernel, profiles it with NVIDIA Nsight Compute on a real GPU, diagnoses the bottleneck (compute-bound, memory-bound, or L1/load-throughput-limited), and uses an LLM to generate a faster rewrite. It accepts the rewrite only after proving that the change is correct and measurably faster. First verified run: the optimizer rewrote a naive 1024×1024 matmul (1.185 ms baseline) as a shared-memory-tiled kernel (0.936 ms, identical output, about 1.27× faster), accepted in a single iteration.
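The accept-or-reject step described above can be sketched in a few lines of Python. This is an illustrative gate, not the project's code: the function names, the exact-match default, and the `min_speedup` threshold are assumptions, and the timings are taken as already measured.

```python
import math


def speedup(baseline_ms, candidate_ms):
    """Ratio of baseline time to candidate time (>1 means faster)."""
    if baseline_ms <= 0 or candidate_ms <= 0:
        raise ValueError("timings must be positive")
    return baseline_ms / candidate_ms


def outputs_match(reference, candidate, rel_tol=0.0, abs_tol=0.0):
    """Element-wise comparison; the default tolerances demand identical values."""
    if len(reference) != len(candidate):
        return False
    return all(
        math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, b in zip(reference, candidate)
    )


def accept(baseline_ms, candidate_ms, reference, candidate, min_speedup=1.05):
    """Accept a rewrite only if it is both correct and measurably faster.

    min_speedup is a hypothetical threshold chosen for this sketch.
    """
    return (outputs_match(reference, candidate)
            and speedup(baseline_ms, candidate_ms) >= min_speedup)


# Timings from the verified run; the output vectors are stand-ins.
ratio = speedup(1.185, 0.936)   # about 1.27
ok = accept(1.185, 0.936, [1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

Checking correctness before speed means a fast but wrong kernel is rejected regardless of its timing.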

Nsight Compute, LangGraph, NIM / Nemotron 49B, FastMCP, CUDA / nvcc, FastAPI

View the run →

Memory Profiler

An agentic tool that profiles how an LLM uses GPU memory on real hardware and produces ranked, quantified optimization recommendations, cutting inference cost without requiring an ML-performance engineer on staff.

How it works

• Profiler: loads a model on a GPU, measures how memory splits across weights, KV cache, and activations, and classifies the workload as memory-bound or compute-bound.
• Advisor: NVIDIA Nemotron (via NIM) reads the profile and returns ranked, plain-English recommendations.
• Interfaces: exposed as a CLI and a FastAPI service; the GPU step runs on demand via Modal.

Measured result (Tesla T4)

• FP16 weights: 7.64 GB → INT8: 4.02 GB
• 47% memory reduction, 3.6 GB freed
• Correctly detected the workload as memory-bandwidth-bound
• KV cache measured scaling ~linearly with context length
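
The arithmetic behind these figures can be checked with a short Python sketch. The weight sizes are the measured values above; the KV-cache formula is the standard per-token accounting for a decoder-only transformer, and the model dimensions passed to it are hypothetical, not those of the profiled model.

```python
def reduction(before_gb, after_gb):
    """Return (GB freed, percent reduction) for a before/after memory size."""
    freed = before_gb - after_gb
    return freed, 100 * freed / before_gb


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len,
                   batch_size=1, bytes_per_elem=2):
    """Size of the KV cache: one K and one V tensor per layer.

    Every factor except context_len is fixed for a given model and batch,
    so the result grows linearly with context length.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * context_len * batch_size * bytes_per_elem)


# Measured FP16 -> INT8 weight sizes from the Tesla T4 run.
freed_gb, pct = reduction(7.64, 4.02)   # about 3.62 GB freed, about 47.4%

# Hypothetical model shape (32 layers, 32 KV heads, head_dim 128, FP16 cache).
kv_1k = kv_cache_bytes(32, 32, 128, context_len=1024)
kv_2k = kv_cache_bytes(32, 32, 128, context_len=2048)
```

Doubling `context_len` doubles the cache size, which matches the roughly linear scaling the profiler measured.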

Python, PyTorch, NVIDIA NeMo Agent Toolkit, NIM / Nemotron, FastAPI, Modal

View live stats →

chinnamAI

Live · Public · Agentic · Stateless

A public, stateless agentic AI article generator. Type any AI/ML topic; a LangGraph state machine (Research → Drafter → Critic → Verify, with a retry edge) runs against an arXiv research-paper corpus and produces a sourced, grounded article, streamed live to the visitor and downloadable as PDF. Every agent has real tool use: the Researcher picks its own searches, the Drafter calls verify_claim mid-write to ground specific assertions before they ship, and the Critic produces a structured grounded/ungrounded verdict via forced tool output. Per-run cost is capped, requests are rate-limited per IP, and nothing is saved.
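
To make the control flow concrete, here is a plain-Python sketch of a four-node state machine with a retry edge. It deliberately does not use the LangGraph API, and it is not the site's implementation: the node bodies are stubs, `MAX_RETRIES` is a made-up cap, and routing the retry back to the research step is an assumption, since the description does not say where the retry edge lands.

```python
from dataclasses import dataclass, field

MAX_RETRIES = 2  # hypothetical cap on retry loops


@dataclass
class State:
    topic: str
    sources: list = field(default_factory=list)
    draft: str = ""
    grounded: bool = False
    attempts: int = 0
    trace: list = field(default_factory=list)


def research_node(state):
    state.trace.append("research")
    state.sources.append(f"stub source for {state.topic}")  # stand-in for search
    return "draft"


def draft_node(state):
    state.trace.append("draft")
    state.attempts += 1
    state.draft = f"Draft {state.attempts} on {state.topic}"
    return "critic"


def critic_node(state):
    state.trace.append("critic")
    # Stub verdict: pretend the first draft is ungrounded and the second passes.
    state.grounded = state.attempts >= 2
    return "verify"


def verify_node(state):
    state.trace.append("verify")
    if state.grounded or state.attempts > MAX_RETRIES:
        return None          # finish: grounded, or retry budget exhausted
    return "research"        # retry edge (target is an assumption)


NODES = {
    "research": research_node,
    "draft": draft_node,
    "critic": critic_node,
    "verify": verify_node,
}


def run(topic):
    """Walk the graph from the research node until a node returns None."""
    state = State(topic=topic)
    node = "research"
    while node is not None:
        node = NODES[node](state)
    return state


final = run("speculative decoding")
```

Each node returns the name of the next node, so the retry edge is just `verify_node` returning an earlier node's name, and the attempt counter keeps the loop bounded.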

LangGraph, Claude Sonnet 4.6, RAG · pgvector, Next.js 16, SSE streaming, Puppeteer PDF

Try it now →