Tech Arch

Apex Copilot — Salesforce Fine-Tune

Verified: 100% governor-limit-safe vs 67% base · Eval live: try it below

Fine-tune a small open code model into a Salesforce Apex/LWC specialist that writes governor-limit-safe, bulkified code — and prove it beats the base model on an objective, executable Salesforce eval (not "vibes").

Verified result

A QLoRA fine-tune of Qwen2.5-Coder-3B-Instruct, scored against the base model on a held-out 15-task Salesforce suite by the objective checks below. Reproducible — the eval and dataset are in the repo.

Governor-limit-safe: 100% (15/15 · base 66.7%)
pass@1: 66.7% (10/15 · base 40.0%)
Win-rate vs base: +4 (fixed 7, regressed 3)
Parses: 100% (both models)

Model | pass@1 | limit-safe | parses
Qwen2.5-Coder-3B (base) | 40.0% | 66.7% | 100%
+ Apex Copilot fine-tune | 66.7% | 100% | 100%

Honest read: the fine-tune eliminated every governor-limit violation (the core goal) and lifted pass@1 by ~27 points. It regressed on 3 tasks by occasionally dropping a required idiom (e.g. an LWC .catch) — the expected signature of a small first-pass dataset, and the next thing to fix with more diverse training data.

Try it live — base vs. fine-tuned

live model · ~1–2 min on first run (GPU cold start)

Type any Salesforce task. Both models answer the same prompt on an on-demand GPU, and each output is scored by the governor-limit checker in real time.


First run after idle spins up a GPU (~1–2 min); subsequent runs are quick. Rate-limited per visitor; outputs scored by the same static checks (not a live compile).

Live: governor-limit checker

runs in your browser

This is the actual eval that grades the model — ported to run client-side. Paste any Apex (or load a sample) and it flags the production killers: SOQL/DML inside loops, hard-coded Ids, and structural errors. No server, no GPU.
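The kind of static check described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's eval/checks.py or its browser port: the function name `check_apex`, the regexes, and the hard-coded-Id heuristic (a quoted 15- or 18-character value starting with `0`) are all assumptions made for the example. It masks comments and string literals, finds each loop body, and flags SOQL or DML inside it, while leaving a SOQL for-loop header alone.

```python
import re

# String literals and comments; matching strings first keeps "//" inside a literal intact.
TOKEN_RE = re.compile(r"'(?:\\.|[^'\\\n])*'|//[^\n]*|/\*.*?\*/", re.DOTALL)
LOOP_RE = re.compile(r"\b(?:for|while)\s*\(|\bdo\s*\{", re.IGNORECASE)
SOQL_RE = re.compile(r"\[\s*SELECT\b", re.IGNORECASE)
DML_RE = re.compile(
    r"\b(?:insert|update|delete|upsert|undelete)\s+[A-Za-z_]"
    r"|\bDatabase\.(?:insert|update|delete|upsert|undelete)\s*\(",
    re.IGNORECASE,
)
# Heuristic only (an assumption): a quoted 15- or 18-character Id-like value.
ID_RE = re.compile(r"'0[0-9A-Za-z]{14}(?:[0-9A-Za-z]{3})?'")


def _mask(source, keep_strings):
    """Blank out comments (and optionally strings), preserving offsets and newlines."""
    def repl(m):
        text = m.group(0)
        if keep_strings and text.startswith("'"):
            return text
        return re.sub(r"[^\n]", " ", text)
    return TOKEN_RE.sub(repl, source)


def _match(text, start, open_ch, close_ch):
    """Index of the delimiter that closes text[start], or -1 if unbalanced."""
    depth = 0
    for i in range(start, len(text)):
        if text[i] == open_ch:
            depth += 1
        elif text[i] == close_ch:
            depth -= 1
            if depth == 0:
                return i
    return -1


def _loop_bodies(code):
    """Yield (start, end) offsets of every loop body, excluding the loop header."""
    for m in LOOP_RE.finditer(code):
        pos = m.end() - 1  # index of "(" for for/while, "{" for do
        if code[pos] == "(":
            close = _match(code, pos, "(", ")")
            if close == -1:
                continue
            pos = close + 1
            while pos < len(code) and code[pos].isspace():
                pos += 1
        if pos < len(code) and code[pos] == "{":
            end = _match(code, pos, "{", "}")
        else:
            end = code.find(";", pos)  # single-statement body
        if end != -1:
            yield pos, end + 1


def check_apex(source):
    """Return a sorted list of (line, issue) tuples; an empty list means no findings."""
    code = _mask(source, keep_strings=True)
    structure = _mask(source, keep_strings=False)
    issues = set()
    for start, end in _loop_bodies(structure):
        body = structure[start:end]
        for kind, pattern in (("SOQL in loop", SOQL_RE), ("DML in loop", DML_RE)):
            for hit in pattern.finditer(body):
                line = structure.count("\n", 0, start + hit.start()) + 1
                issues.add((line, kind))
    for hit in ID_RE.finditer(code):
        issues.add((code.count("\n", 0, hit.start()) + 1, "hard-coded Id"))
    if structure.count("{") != structure.count("}"):
        issues.add((0, "unbalanced braces"))
    return sorted(issues)


BAD = """
for (Account a : accounts) {
    List<Contact> cs = [SELECT Id FROM Contact WHERE AccountId = :a.Id];
    update cs;
}
"""
GOOD = """
for (Contact c : [SELECT Id, AccountId FROM Contact WHERE AccountId IN :ids]) {
    byAccount.put(c.AccountId, c);
}
update byAccount.values();
"""
print(check_apex(BAD))
print(check_apex(GOOD))
```

A regex scan like this cannot follow a query hidden behind a helper method called from a loop, which is one reason a static pass is a signal rather than a compile-level guarantee.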


Per-task: base → fine-tuned

Every one of the 15 held-out tasks, scored by the same checks. 7 flipped FAIL→PASS (mostly eliminated SOQL/DML-in-loop); 3 regressed by dropping a required idiom — the honest cost of a small first dataset.

Salesforce task | Base | Fine-tuned
Bulk-safe trigger → child Contacts | ✕ SOQL in loop | ✓ PASS ↑ fixed
Set Opportunity desc from Account | ✓ PASS | ✕ dropped idiom ↓ regressed
Trigger-handler framework | ✓ PASS | ✕ dropped idiom ↓ regressed
Queueable callout | ✕ missing AllowsCallouts | ✓ PASS ↑ fixed
Batch: deactivate empty Accounts | ✓ PASS | ✓ PASS
Apex test class (factory + asserts) | ✕ DML in loop | ✕ missing @isTest
Parent-to-child SOQL | ✕ SOQL in loop | ✓ PASS ↑ fixed
FLS/CRUD-safe query | ✕ no sharing/USER_MODE | ✓ PASS ↑ fixed
Partial-success DML | ✕ no error handling | ✓ PASS ↑ fixed
LWC imperative Apex call | ✓ PASS | ✕ dropped .catch ↓ regressed
Recursion control | ✓ PASS | ✓ PASS
Bulk-safe Account counter | ✕ SOQL in loop | ✓ PASS ↑ fixed
Schedulable batch | ✕ DML in loop | ✓ PASS ↑ fixed
Callout from trigger context | ✕ missing async | ✕ missing async
before-insert addError validation | ✓ PASS | ✓ PASS

Real outputs from the verified run, scored by eval/checks.py. The interactive type-a-task demo above generates base and fine-tuned answers live on a scale-to-zero GPU.
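The headline numbers follow directly from the per-task table. As a quick arithmetic check, the sketch below transcribes each row as a (base passed, fine-tuned passed) pair in table order and recomputes pass@1, fixed, regressed, and the net win count. The names `RESULTS` and `summarize` are illustrative and are not taken from the repo.

```python
# (base_passed, tuned_passed) for the 15 held-out tasks, in table order.
RESULTS = [
    (False, True),   # Bulk-safe trigger -> child Contacts
    (True, False),   # Set Opportunity desc from Account
    (True, False),   # Trigger-handler framework
    (False, True),   # Queueable callout
    (True, True),    # Batch: deactivate empty Accounts
    (False, False),  # Apex test class (factory + asserts)
    (False, True),   # Parent-to-child SOQL
    (False, True),   # FLS/CRUD-safe query
    (False, True),   # Partial-success DML
    (True, False),   # LWC imperative Apex call
    (True, True),    # Recursion control
    (False, True),   # Bulk-safe Account counter
    (False, True),   # Schedulable batch
    (False, False),  # Callout from trigger context
    (True, True),    # before-insert addError validation
]


def summarize(results):
    """Recompute the summary metrics from per-task pass/fail pairs."""
    n = len(results)
    fixed = sum(1 for base, tuned in results if not base and tuned)
    regressed = sum(1 for base, tuned in results if base and not tuned)
    return {
        "base_pass_at_1": round(100 * sum(base for base, _ in results) / n, 1),
        "tuned_pass_at_1": round(100 * sum(tuned for _, tuned in results) / n, 1),
        "fixed": fixed,
        "regressed": regressed,
        "net": fixed - regressed,
    }


print(summarize(RESULTS))
```

With one sample per task, pass@1 is simply the fraction of tasks whose single answer passes, so 6/15 gives 40.0% and 10/15 gives 66.7%.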

Real example — same prompt, both models

Actual generated output for one task: "return Contacts by last name, enforcing the running user's field- and object-level security." Not cherry-picked prose — the verbatim model responses.

Base — Qwen2.5-Coder-3B ✕ FAIL
public class ContactQueryService {
  public static List<Contact> getContactsByLastName(String lastName) {
    if (!UserInfo.isCurrentUserSystemAdmin()) {
      throw new SecurityException('No permission');
    }
    // comment claims "FLS enforcement" — but the query has none:
    List<Contact> contacts = [SELECT Id, FirstName, LastName, Email
                               FROM Contact WHERE LastName = :lastName];
    // ... + a bogus per-record contact.hasAccess() check
    return contacts;
  }
}

Invents a manual admin gate, queries with no FLS, and calls a method that doesn't exist. Not with sharing, no WITH USER_MODE.

+ Apex Copilot fine-tune ✓ PASS
public with sharing class ContactSearch {
  public static List<Contact> byLastName(String value) {
    return [SELECT Id, LastName FROM Contact
            WHERE LastName = :value WITH USER_MODE];
  }
}

Idiomatic and correct: with sharing + WITH USER_MODE enforces the running user's security at the query.

Honest note: the eval's checks are static (no live compile), so a passing answer isn't a guarantee of compilation — but on governor-limit safety, the signal the project targets, the fine-tune went from 67% → 100%.

System design

Build & verify · offline, on a GPU

Dataset (123, checks-gated) → QLoRA fine-tune (Qwen2.5-Coder-3B) → Eval gate (governor-limit checks) → LoRA adapter

Serve · the live demo

Browser → techarchinc proxy (key in Secret Manager · rate-limit) → RunPod Serverless (base ⇄ adapter hot-swap · scale-to-zero) → base + tuned → PASS/FAIL

The LoRA adapter hot-swaps on a single model load, so one GPU serves both base and fine-tuned for the side-by-side. The API key stays server-side (never in the browser), and requests are async-polled so no call exceeds the 60s web limit during a GPU cold start.
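The async-polling pattern mentioned above can be sketched as submit once, then make short status calls until the job finishes. This is a hedged illustration rather than the site's proxy code: `run_with_polling`, `FakeBackend`, the status strings, and the default intervals are assumptions, and the backend is an in-process stand-in so the example runs without a network or GPU.

```python
import time
from typing import Callable, Dict


def run_with_polling(
    submit: Callable[[str], str],
    status: Callable[[str], Dict],
    prompt: str,
    interval_s: float = 2.0,
    deadline_s: float = 180.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict:
    """Submit a job, then poll with short calls so no single request runs long."""
    job_id = submit(prompt)
    started = clock()
    while clock() - started < deadline_s:
        state = status(job_id)  # each status call returns quickly
        if state["status"] in ("COMPLETED", "FAILED"):
            return state
        sleep(interval_s)
    return {"id": job_id, "status": "TIMED_OUT"}


class FakeBackend:
    """Hypothetical stand-in for a GPU endpoint that needs a few polls to warm up."""

    def __init__(self, polls_until_ready: int) -> None:
        self.remaining = polls_until_ready

    def submit(self, prompt: str) -> str:
        return "job-1"

    def status(self, job_id: str) -> Dict:
        if self.remaining > 0:
            self.remaining -= 1
            return {"id": job_id, "status": "IN_PROGRESS"}
        return {"id": job_id, "status": "COMPLETED", "output": "// generated Apex"}


backend = FakeBackend(polls_until_ready=3)
result = run_with_polling(backend.submit, backend.status, "Write a bulk-safe trigger",
                          interval_s=0.0, sleep=lambda seconds: None)
print(result["status"])
```

The design point is that the cold start is absorbed by the number of polls rather than by the duration of any single request, so each call stays well under a per-request time limit.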

How it works

1. Eval harness first (the credibility core)

A held-out suite of Salesforce tasks scored by objective, executable checks — does the Apex parse, stay governor-limit-safe (no SOQL/DML in loops), and use the right idioms? The same checks run live above. You can't claim a win without a way to measure it.
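One way to express the "right idioms" part of such a harness is a per-task list of required patterns. The sketch below is an assumption about shape, not the repo's implementation: the `Task` class, the `score` function, the regexes, and the crude delimiter-balance stand-in for "parses" are all hypothetical. The example task mirrors the FLS/CRUD-safe query row, which expects `with sharing` and `WITH USER_MODE`.

```python
import re
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Task:
    name: str
    required: Dict[str, str] = field(default_factory=dict)  # idiom label -> regex


def score(task: Task, output: str) -> dict:
    """Score one model output: balanced delimiters plus every required idiom present."""
    missing = [label for label, pattern in task.required.items()
               if not re.search(pattern, output, re.IGNORECASE)]
    # Crude stand-in for a parse check: braces and parentheses must balance.
    parses = (output.count("{") == output.count("}")
              and output.count("(") == output.count(")"))
    return {"task": task.name, "parses": parses, "missing": missing,
            "passed": parses and not missing}


FLS_TASK = Task(
    name="FLS/CRUD-safe query",
    required={
        "with sharing": r"\bwith\s+sharing\b",
        "WITH USER_MODE": r"\bWITH\s+USER_MODE\b",
    },
)

TUNED_OUTPUT = """
public with sharing class ContactSearch {
  public static List<Contact> byLastName(String value) {
    return [SELECT Id, LastName FROM Contact
            WHERE LastName = :value WITH USER_MODE];
  }
}
"""
BASE_OUTPUT = """
public class ContactQueryService {
  public static List<Contact> getContactsByLastName(String lastName) {
    return [SELECT Id, FirstName, LastName, Email
            FROM Contact WHERE LastName = :lastName];
  }
}
"""
print(score(FLS_TASK, TUNED_OUTPUT))
print(score(FLS_TASK, BASE_OUTPUT))
```

Keeping the required idioms as data next to each task makes a regression such as a dropped `.catch` show up as a named missing idiom instead of a bare FAIL.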

2. A dataset held to the same bar

Instruction→Apex pairs (own patterns + templated generation + public docs — no proprietary code), with every training example filtered through the same checks and a guard against leaking the eval tasks. Bad patterns never enter training.
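A dataset gate of this kind can be sketched as two filters applied to every candidate pair: a leak guard against the eval prompts, then the quality checks. The version below is illustrative only. The n-gram overlap rule, the function names, and the one-regex `crude_checks` stand-in for the real checks are assumptions, not the project's pipeline.

```python
import re


def normalize(text):
    """Lowercase and keep only alphanumeric words, so punctuation cannot hide a match."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def ngrams(text, n=5):
    words = normalize(text).split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def leaks_eval(instruction, eval_prompts, n=5):
    """True if the instruction equals, or shares any word n-gram with, an eval prompt."""
    norm, grams = normalize(instruction), ngrams(instruction, n)
    return any(norm == normalize(p) or grams & ngrams(p, n) for p in eval_prompts)


def crude_checks(apex):
    """Stand-in for the real checks: reject an obvious SOQL query inside a for-loop body."""
    pattern = r"for\s*\([^)]*\)\s*\{[^}]*\[\s*SELECT"
    return re.search(pattern, apex, re.IGNORECASE | re.DOTALL) is None


def build_dataset(candidates, eval_prompts, passes_checks=crude_checks):
    """Keep only pairs that neither leak an eval task nor fail the checks."""
    return [ex for ex in candidates
            if not leaks_eval(ex["instruction"], eval_prompts)
            and passes_checks(ex["apex"])]


EVAL_PROMPTS = [
    "Return Contacts by last name, enforcing the running user's "
    "field- and object-level security.",
]
CANDIDATES = [
    {"instruction": "Write a bulk-safe loop that maps Contacts to their Accounts.",
     "apex": "for (Contact c : [SELECT AccountId FROM Contact WHERE AccountId IN :ids]) "
             "{ n.put(c.AccountId, 1); }"},
    {"instruction": "Return contacts by last name, enforcing the running user's "
                    "field and object level security",
     "apex": "return [SELECT Id FROM Contact WHERE LastName = :v WITH USER_MODE];"},
    {"instruction": "Look up one Contact for every Account in the list.",
     "apex": "for (Account a : accts) { Contact c = "
             "[SELECT Id FROM Contact WHERE AccountId = :a.Id LIMIT 1]; }"},
]
kept = build_dataset(CANDIDATES, EVAL_PROMPTS)
print([ex["instruction"] for ex in kept])
```

In this toy run the second candidate is dropped as a near-copy of an eval prompt and the third is dropped for querying inside a loop, so only the first pair survives.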

3. QLoRA fine-tuning

Parameter-efficient fine-tuning (4-bit base + LoRA adapters) of an open code model on a single on-demand GPU — a few dollars, a few hours. The adapter hot-swaps on/off so one model in memory serves both base and fine-tuned for the comparison.
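The reason one model in memory can serve both variants is that a LoRA layer computes the frozen base output plus a scaled low-rank correction, and that correction can be switched off. The toy layer below shows only that arithmetic, y = Wx + (alpha / r) · B(Ax), in plain Python. It is a conceptual sketch under stated assumptions: it omits the 4-bit quantization that QLoRA adds, and the class name and numbers are made up for the example.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen base weight plus an optional low-rank update."""

    def __init__(self, weight, lora_a, lora_b, alpha):
        self.weight = weight              # frozen base weight, shape (out, in)
        self.lora_a = lora_a              # trainable down-projection, shape (r, in)
        self.lora_b = lora_b              # trainable up-projection, shape (out, r)
        self.scale = alpha / len(lora_a)  # alpha / r
        self.adapter_enabled = True

    def forward(self, x):
        y = matvec(self.weight, x)
        if self.adapter_enabled:
            delta = matvec(self.lora_b, matvec(self.lora_a, x))
            y = [yi + self.scale * di for yi, di in zip(y, delta)]
        return y


layer = LoRALinear(
    weight=[[1, 0], [0, 1]],  # identity base weight
    lora_a=[[1, 1]],          # rank r = 1
    lora_b=[[1], [0]],
    alpha=2,
)
x = [1, 2]
tuned = layer.forward(x)          # base output plus adapter correction
layer.adapter_enabled = False
base = layer.forward(x)           # same weights in memory, adapter skipped
print(tuned, base)
```

Because the base weights are never modified, flipping the flag changes behavior without reloading anything, which is what makes a side-by-side comparison on a single GPU cheap.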

4. Verify, then serve

Score fine-tuned vs base on the held-out suite (pass@1, % limit-safe, win-rate). Only if it measurably wins does it ship — served on scale-to-zero GPU with rate limits and a per-run cost cap.
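A ship gate like this reduces to a small predicate over the two score sets. The criteria below are one hypothetical reading of "measurably wins" (no loss in limit safety, a pass@1 gain above a threshold, more fixes than regressions). The function name, dictionary keys, and threshold are assumptions, while the example numbers are the ones reported in the verified run.

```python
def should_ship(base, tuned, min_pass_gain=0.0):
    """Hypothetical gate: ship only when the fine-tune beats the base on every criterion."""
    return (
        tuned["limit_safe"] >= base["limit_safe"]
        and tuned["pass_at_1"] - base["pass_at_1"] > min_pass_gain
        and tuned["fixed"] > tuned["regressed"]
    )


BASE = {"pass_at_1": 40.0, "limit_safe": 66.7}
TUNED = {"pass_at_1": 66.7, "limit_safe": 100.0, "fixed": 7, "regressed": 3}

print(should_ship(BASE, TUNED))
```

Raising `min_pass_gain` is a simple way to demand a margin on a 15-task suite, where a single task moves pass@1 by about 6.7 points.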

Status — built in the open

Eval harness — objective governor-limit / bulkification checks (live above)

Training dataset — built & quality-gated by the same checks

QLoRA training run — verified (100% limit-safe / 66.7% pass@1 vs base)

Live model demo — type a task, base vs fine-tuned, on on-demand GPU (above)

The headline numbers above are from a real, reproducible run (held-out 15-task suite) — not estimates. The type-a-task demo runs the actual fine-tuned model on a scale-to-zero GPU; the governor-limit eval runs in your browser.
