Fine-tune a small open code model into a Salesforce Apex/LWC specialist that writes governor-limit-safe, bulkified code — and prove it beats the base model on an objective, executable Salesforce eval (not "vibes").
A QLoRA fine-tune of Qwen2.5-Coder-3B-Instruct, scored against the base model on a held-out 15-task Salesforce suite by the objective checks below. Reproducible — the eval and dataset are in the repo.
| Model | pass@1 | limit-safe | parses |
|---|---|---|---|
| Qwen2.5-Coder-3B (base) | 40.0% | 66.7% | 100% |
| + Apex Copilot fine-tune | 66.7% | 100% | 100% |
Honest read: the fine-tune eliminated every governor-limit violation (the core goal) and lifted pass@1 by ~27 points. It regressed on 3 tasks by occasionally dropping a required idiom (e.g. an LWC .catch) — the expected signature of a small first-pass dataset, and the next thing to fix with more diverse training data.
Type any Salesforce task. Both models answer the same prompt on an on-demand GPU, and each output is scored by the governor-limit checker in real time.
The first run after an idle period spins up a GPU (~1–2 min); subsequent runs are quick. Requests are rate-limited per visitor, and outputs are scored by the same static checks (not a live compile).
This is the actual eval that grades the model, ported to run client-side. Paste any Apex (or load a sample) and it flags the production killers: SOQL/DML inside loops, hard-coded Ids, and structural errors. No server, no GPU.
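To make the idea concrete, here is a minimal sketch of how such a static check could work. It is not the project's eval/checks.py: the function name `check_apex`, the finding labels, and the Id heuristic (a quoted 15- or 18-character alphanumeric literal containing a digit) are assumptions for illustration. It only tracks loops with braces, so a brace-less `for (...) insert x;` would slip through.

```python
import re

# A string literal or a comment, whichever starts first, so an apostrophe
# inside a comment or "//" inside a string is not misread.
_LITERAL_OR_COMMENT = re.compile(r"'(?:\\.|[^'\\\n])*'|//[^\n]*|/\*.*?\*/", re.DOTALL)
# Heuristic (assumption): quoted 15- or 18-char alphanumeric literal with a digit.
_ID_LITERAL = re.compile(r"'(?=[A-Za-z0-9]*\d)(?:[A-Za-z0-9]{15}|[A-Za-z0-9]{18})'")
_TOKEN = re.compile(r"[A-Za-z_]\w*|[{}\[\]();.]")
_DML = {"insert", "update", "delete", "upsert", "undelete", "merge"}
_STATEMENT_START = {None, "{", "}", ";", ")", "else"}


def check_apex(source: str) -> list[str]:
    """Return a list of findings; an empty list means the checks passed."""
    findings: list[str] = []

    def blank(match: re.Match) -> str:
        text = match.group(0)
        if not text.startswith("'"):
            return " "  # comment: drop it
        if _ID_LITERAL.fullmatch(text):
            findings.append("hard-coded Id")
        return "''"  # keep an empty literal as a placeholder

    tokens = _TOKEN.findall(_LITERAL_OR_COMMENT.sub(blank, source))
    block_is_loop: list[bool] = []  # one flag per open "{": is it a loop body?
    pending_loop = False            # saw for/while/do, waiting for its "{"
    paren_depth = 0
    prev = None
    for i, tok in enumerate(tokens):
        word = tok.lower()
        in_loop = any(block_is_loop)
        if word in ("for", "while", "do"):
            pending_loop = True
        elif tok == "(":
            paren_depth += 1
        elif tok == ")":
            paren_depth -= 1
        elif tok == "{":
            block_is_loop.append(pending_loop)
            pending_loop = False
        elif tok == "}":
            if block_is_loop:
                block_is_loop.pop()
            else:
                findings.append("unbalanced braces")
        elif tok == ";" and paren_depth == 0:
            pending_loop = False  # e.g. the tail of do { } while (x);
        elif tok == "[" and i + 1 < len(tokens) and tokens[i + 1].lower() == "select":
            # A query in a for-loop header runs once, so only loop bodies count.
            if in_loop:
                findings.append("SOQL in loop")
        elif word in _DML and in_loop:
            via_database = prev == "." and i >= 2 and tokens[i - 2].lower() == "database"
            if prev in _STATEMENT_START or via_database:
                findings.append("DML in loop")
        prev = word
    if block_is_loop:
        findings.append("unbalanced braces")
    return findings


BAD_APEX = """
public class Bad {
    public static void run(List<Account> accts) {
        for (Account a : accts) {
            List<Contact> cs = [SELECT Id FROM Contact WHERE AccountId = :a.Id];
            update cs;
        }
    }
}
"""

GOOD_APEX = """
public with sharing class Good {
    public static void run(List<Account> accts) {
        Map<Id, Account> byId = new Map<Id, Account>(accts);
        List<Contact> cs = [SELECT Id FROM Contact WHERE AccountId IN :byId.keySet()];
        for (Contact c : cs) {
            c.Description = 'synced';
        }
        update cs;
    }
}
"""

print(check_apex(BAD_APEX))
print(check_apex(GOOD_APEX))
```

The sketch tokenizes rather than parses, which matches the page's own caveat that the checks are static and do not compile the code.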
Every one of the 15 held-out tasks, scored by the same checks. 7 flipped FAIL→PASS (4 of them by eliminating SOQL/DML in loops); 3 regressed by dropping a required idiom, the honest cost of a small first dataset.
| Salesforce task | Base | Fine-tuned |
|---|---|---|
| Bulk-safe trigger → child Contacts | ✕ SOQL in loop | ✓ PASS ↑ fixed |
| Set Opportunity desc from Account | ✓ PASS | ✕ dropped idiom ↓ regressed |
| Trigger-handler framework | ✓ PASS | ✕ dropped idiom ↓ regressed |
| Queueable callout | ✕ missing AllowsCallouts | ✓ PASS ↑ fixed |
| Batch: deactivate empty Accounts | ✓ PASS | ✓ PASS |
| Apex test class (factory + asserts) | ✕ DML in loop | ✕ missing @isTest |
| Parent-to-child SOQL | ✕ SOQL in loop | ✓ PASS ↑ fixed |
| FLS/CRUD-safe query | ✕ no sharing/USER_MODE | ✓ PASS ↑ fixed |
| Partial-success DML | ✕ no error handling | ✓ PASS ↑ fixed |
| LWC imperative Apex call | ✓ PASS | ✕ dropped .catch ↓ regressed |
| Recursion control | ✓ PASS | ✓ PASS |
| Bulk-safe Account counter | ✕ SOQL in loop | ✓ PASS ↑ fixed |
| Schedulable batch | ✕ DML in loop | ✓ PASS ↑ fixed |
| Callout from trigger context | ✕ missing async | ✕ missing async |
| before-insert addError validation | ✓ PASS | ✓ PASS |
Real outputs from the verified run, scored by eval/checks.py. To generate fresh outputs for any task, use the live base vs fine-tuned demo above, which runs on a scale-to-zero GPU.
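The headline numbers can be recomputed from the per-task table. The sketch below transcribes the table's outcomes and assumes one sample per task (so pass@1 is the fraction of tasks that pass) and that "limit-safe" means the recorded failure is not a SOQL/DML-in-loop label. The names `RESULTS`, `LIMIT_VIOLATIONS`, and `summarize` are illustrative, not taken from the repo.

```python
# (task, base failure, fine-tuned failure); None means the task passed.
RESULTS = [
    ("Bulk-safe trigger -> child Contacts", "SOQL in loop", None),
    ("Set Opportunity desc from Account", None, "dropped idiom"),
    ("Trigger-handler framework", None, "dropped idiom"),
    ("Queueable callout", "missing AllowsCallouts", None),
    ("Batch: deactivate empty Accounts", None, None),
    ("Apex test class (factory + asserts)", "DML in loop", "missing @isTest"),
    ("Parent-to-child SOQL", "SOQL in loop", None),
    ("FLS/CRUD-safe query", "no sharing/USER_MODE", None),
    ("Partial-success DML", "no error handling", None),
    ("LWC imperative Apex call", None, "dropped .catch"),
    ("Recursion control", None, None),
    ("Bulk-safe Account counter", "SOQL in loop", None),
    ("Schedulable batch", "DML in loop", None),
    ("Callout from trigger context", "missing async", "missing async"),
    ("before-insert addError validation", None, None),
]
LIMIT_VIOLATIONS = {"SOQL in loop", "DML in loop"}


def percent(count: int, total: int) -> float:
    return round(100 * count / total, 1)


def summarize(failures: list) -> dict:
    """Compute pass@1 and limit-safe percentages from one failure label per task."""
    total = len(failures)
    passed = sum(f is None for f in failures)
    limit_safe = sum(f not in LIMIT_VIOLATIONS for f in failures)
    return {"pass@1": percent(passed, total), "limit-safe": percent(limit_safe, total)}


base = summarize([b for _, b, _ in RESULTS])
tuned = summarize([t for _, _, t in RESULTS])
fixed = sum(b is not None and t is None for _, b, t in RESULTS)
regressed = sum(b is None and t is not None for _, b, t in RESULTS)

print("base:", base)
print("fine-tuned:", tuned)
print("fixed:", fixed, "regressed:", regressed)
```

Because the table records a single label per failing task, this reproduces the published percentages but cannot show whether a failing output had additional issues.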
Actual generated output for one task: "return Contacts by last name, enforcing the running user's field- and object-level security." These are the models' own responses, not cherry-picked prose; the base output is abridged where the comments say so.
Base model:

```apex
public class ContactQueryService {
    public static List<Contact> getContactsByLastName(String lastName) {
        if (!UserInfo.isCurrentUserSystemAdmin()) {
            throw new SecurityException('No permission');
        }
        // comment claims "FLS enforcement", but the query has none:
        List<Contact> contacts = [SELECT Id, FirstName, LastName, Email
                                  FROM Contact WHERE LastName = :lastName];
        // ... + a bogus per-record contact.hasAccess() check
        return contacts;
    }
}
```
The base model invents a manual admin gate, queries with no FLS enforcement, and calls a method that doesn't exist (contact.hasAccess()). The class is not declared with sharing, and the query has no WITH USER_MODE.
Fine-tuned model:

```apex
public with sharing class ContactSearch {
    public static List<Contact> byLastName(String value) {
        return [SELECT Id, LastName FROM Contact
                WHERE LastName = :value WITH USER_MODE];
    }
}
```
The fine-tuned answer is idiomatic and correct: with sharing plus WITH USER_MODE enforce the running user's record sharing and field- and object-level security at the query.
Honest note: the eval's checks are static (no live compile), so a passing answer isn't a guarantee of compilation. On governor-limit safety, the signal the project targets, the fine-tune went from 66.7% to 100%.
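A required-idiom check for this task could look like the sketch below. It is an assumption about how such a check might be written, not the repo's implementation; `check_security_idioms` and its messages are hypothetical. It accepts either `WITH USER_MODE` or the older `WITH SECURITY_ENFORCED` clause, and like the rest of the eval it is purely textual.

```python
import re

_SHARING = re.compile(r"\b(with|without|inherited)\s+sharing\b", re.IGNORECASE)
_SOQL = re.compile(r"\[\s*SELECT\b.*?\]", re.IGNORECASE | re.DOTALL)
_ENFORCED = re.compile(r"\bWITH\s+(USER_MODE|SECURITY_ENFORCED)\b", re.IGNORECASE)


def check_security_idioms(source: str) -> list[str]:
    """Flag a missing 'with sharing' declaration and unenforced inline queries."""
    problems = []
    sharing = _SHARING.search(source)
    if sharing is None or sharing.group(1).lower() != "with":
        problems.append("class is not declared with sharing")
    for query in _SOQL.findall(source):
        if not _ENFORCED.search(query):
            problems.append("query lacks WITH USER_MODE")
    return problems


BASE_OUTPUT = """
public class ContactQueryService {
    public static List<Contact> getContactsByLastName(String lastName) {
        List<Contact> contacts = [SELECT Id, FirstName, LastName, Email
                                  FROM Contact WHERE LastName = :lastName];
        return contacts;
    }
}
"""

TUNED_OUTPUT = """
public with sharing class ContactSearch {
    public static List<Contact> byLastName(String value) {
        return [SELECT Id, LastName FROM Contact
                WHERE LastName = :value WITH USER_MODE];
    }
}
"""

print(check_security_idioms(BASE_OUTPUT))
print(check_security_idioms(TUNED_OUTPUT))
```

A textual check like this can be fooled by a keyword inside a comment, which is one reason a static pass is a signal rather than proof of correctness.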
Build & verify · offline, on a GPU
Serve · the live demo
The LoRA adapter hot-swaps on a single model load, so one GPU serves both base and fine-tuned for the side-by-side. The API key stays server-side (never in the browser), and requests are async-polled so no call exceeds the 60s web limit during a GPU cold start.
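The async-polling pattern described above can be sketched as follows. This is an illustration under assumptions, not the demo's actual server code: `JobStore`, `wait_for`, and the in-memory dictionary are hypothetical, and a background thread stands in for the GPU call. The point is that `submit` and `poll` each return immediately, so no single request has to stay open through a cold start.

```python
import threading
import time
import uuid


class JobStore:
    """In-memory job table: submit returns at once, the work runs in the background."""

    def __init__(self, generate):
        self._generate = generate  # slow callable, e.g. a call to the GPU backend
        self._jobs = {}
        self._lock = threading.Lock()

    def submit(self, prompt: str) -> str:
        job_id = uuid.uuid4().hex
        with self._lock:
            self._jobs[job_id] = {"status": "pending", "result": None}
        threading.Thread(target=self._run, args=(job_id, prompt), daemon=True).start()
        return job_id

    def _run(self, job_id: str, prompt: str) -> None:
        try:
            update = {"status": "done", "result": self._generate(prompt)}
        except Exception as exc:  # report the failure instead of losing the job
            update = {"status": "error", "result": str(exc)}
        with self._lock:
            self._jobs[job_id] = update

    def poll(self, job_id: str) -> dict:
        with self._lock:
            return dict(self._jobs[job_id])


def wait_for(store: JobStore, job_id: str, interval: float = 0.01, timeout: float = 5.0) -> dict:
    """Client side: poll with short calls until the job leaves the pending state."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = store.poll(job_id)
        if state["status"] != "pending":
            return state
        time.sleep(interval)
    raise TimeoutError(job_id)


def fake_generate(prompt: str) -> str:
    time.sleep(0.05)  # stand-in for a cold start plus generation
    return prompt.upper()


store = JobStore(fake_generate)
job = store.submit("bulk-safe trigger")
print(wait_for(store, job))
```

In a real deployment the job table would live somewhere shared across requests, and the polling interval would be seconds rather than milliseconds.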
A held-out suite of Salesforce tasks scored by objective, executable checks — does the Apex parse, stay governor-limit-safe (no SOQL/DML in loops), and use the right idioms? The same checks run live above. You can't claim a win without a way to measure it.
Instruction→Apex pairs (own patterns + templated generation + public docs — no proprietary code), with every training example filtered through the same checks and a guard against leaking the eval tasks. Bad patterns never enter training.
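The quality gate and leak guard described above could be combined roughly as in this sketch. The dictionary keys, the word-level Jaccard similarity, and the 0.8 threshold are assumptions chosen for illustration; `passes_checks` stands in for whatever static checks the eval applies.

```python
import re


def _words(text: str) -> set:
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two prompts, from 0.0 (disjoint) to 1.0 (identical)."""
    wa, wb = _words(a), _words(b)
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)


def filter_dataset(examples, eval_prompts, passes_checks, max_overlap=0.8):
    """Keep examples whose code passes the checks and whose prompt is not an eval task."""
    kept = []
    for example in examples:
        if not passes_checks(example["completion"]):
            continue  # bad pattern: never enters training
        if any(jaccard(example["instruction"], p) >= max_overlap for p in eval_prompts):
            continue  # too close to a held-out task: would leak the eval
        kept.append(example)
    return kept


eval_prompts = ["Write a bulk-safe trigger that counts child Contacts"]
examples = [
    {"instruction": "Write a bulk-safe trigger that counts child contacts.", "completion": "ok"},
    {"instruction": "Deactivate stale Accounts in a batch job", "completion": "ok"},
    {"instruction": "Update each Opportunity", "completion": "BAD"},
]
kept = filter_dataset(examples, eval_prompts, passes_checks=lambda code: code != "BAD")
print([e["instruction"] for e in kept])
```

Word overlap only catches near-verbatim leaks; paraphrased eval tasks would need a stricter comparison.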
Parameter-efficient fine-tuning (4-bit base + LoRA adapters) of an open code model on a single on-demand GPU — a few dollars, a few hours. The adapter hot-swaps on/off so one model in memory serves both base and fine-tuned for the comparison.
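Why one model in memory can serve both variants follows from how LoRA is structured: the base weight stays frozen and the adapter adds a scaled low-rank update that can be skipped. The toy layer below illustrates that in plain Python; it is not the project's training code, the class and attribute names are made up, and 4-bit quantization of the base weight is not modelled.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class LoRALinear:
    """y = W x + (alpha / r) * B (A x), with the second term switchable."""

    def __init__(self, weight, lora_a, lora_b, alpha):
        self.weight = weight            # frozen base weight, shape (out, in)
        self.lora_a = lora_a            # trainable, shape (r, in)
        self.lora_b = lora_b            # trainable, shape (out, r)
        self.scale = alpha / len(lora_a)
        self.adapter_enabled = True

    def forward(self, x):
        y = matvec(self.weight, x)
        if self.adapter_enabled:
            delta = matvec(self.lora_b, matvec(self.lora_a, x))
            y = [yi + self.scale * di for yi, di in zip(y, delta)]
        return y


layer = LoRALinear(
    weight=[[1.0, 0.0], [0.0, 1.0]],
    lora_a=[[1.0, 1.0]],        # rank r = 1
    lora_b=[[2.0], [0.0]],
    alpha=1.0,
)
x = [1.0, 2.0]
tuned_out = layer.forward(x)    # adapter on: fine-tuned behaviour
layer.adapter_enabled = False
base_out = layer.forward(x)     # adapter off: base behaviour, same weights in memory
print(tuned_out, base_out)
```

In Hugging Face PEFT, the `disable_adapter()` context manager on a `PeftModel` serves a similar purpose; whether this project uses it is not stated here.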
Score fine-tuned vs base on the held-out suite (pass@1, % limit-safe, win-rate). Only if it measurably wins does it ship — served on scale-to-zero GPU with rate limits and a per-run cost cap.
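A ship gate of this kind can be expressed as a small predicate. The sketch below is one possible reading of "measurably wins": no regression on limit safety plus a minimum pass@1 gain. The `EvalSummary` type, `should_ship`, and the 5-point threshold are assumptions, not values from the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSummary:
    pass_at_1: float   # fraction of tasks passing, 0.0 to 1.0
    limit_safe: float  # fraction of outputs with no SOQL/DML in loops


def should_ship(base: EvalSummary, tuned: EvalSummary, min_pass_gain: float = 0.05) -> bool:
    """Ship only if limit safety does not regress and pass@1 improves enough."""
    no_safety_regression = tuned.limit_safe >= base.limit_safe
    enough_gain = tuned.pass_at_1 - base.pass_at_1 >= min_pass_gain
    return no_safety_regression and enough_gain


# Headline numbers from the results table.
base = EvalSummary(pass_at_1=0.400, limit_safe=0.667)
tuned = EvalSummary(pass_at_1=0.667, limit_safe=1.000)
print(should_ship(base, tuned))
```

Encoding the gate as code keeps the ship decision tied to the same numbers the eval reports.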
Eval harness — objective governor-limit / bulkification checks (live above)
Training dataset — built & quality-gated by the same checks
QLoRA training run — verified (100% limit-safe / 66.7% pass@1 vs base)
Live model demo — type a task, base vs fine-tuned, on on-demand GPU (above)
The headline numbers above are from a real, reproducible run (held-out 15-task suite) — not estimates. The type-a-task demo runs the actual fine-tuned model on a scale-to-zero GPU; the governor-limit eval runs in your browser.
Fine-tuning, an objective eval that proves it works, and cost-aware serving — built end to end.