Apex Copilot — Fine-Tuned Salesforce Code Model

Verified result

A QLoRA fine-tune of Qwen2.5-Coder-3B-Instruct, scored against the base model on a held-out 15-task Salesforce suite by the objective checks below. Reproducible — the eval and dataset are in the repo.

Governor-limit-safe: 100% (15/15 · base 66.7%)

pass@1: 66.7% (10/15 · base 40.0%)

Win-rate vs base: fixed 7, regressed 3

Parses: 100% (both models)

Model	pass@1	limit-safe	parses
Qwen2.5-Coder-3B (base)	40.0%	66.7%	100%
+ Apex Copilot fine-tune	66.7%	100%	100%

Honest read: the fine-tune eliminated every governor-limit violation (the core goal) and lifted pass@1 by 26.7 points (40.0% → 66.7%). It regressed on 3 tasks by dropping a required idiom (e.g. an LWC .catch). That is the expected signature of a small first-pass dataset, and the next thing to fix with more diverse training data.

Try it live — base vs. fine-tuned

live model · ~1–2 min on first run (GPU cold start)

Type any Salesforce task. Both models answer the same prompt on an on-demand GPU, and each output is scored by the governor-limit checker in real time.

First run after idle spins up a GPU (~1–2 min); subsequent runs are quick. Rate-limited per visitor; outputs scored by the same static checks (not a live compile).

Live: governor-limit checker

runs in your browser

This is the actual eval that grades the model — ported to run client-side. Paste any Apex (or load a sample) and it flags the production killers: SOQL/DML inside loops, hard-coded Ids, and structural errors. No server, no GPU.
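A minimal sketch of how a static check of this kind can work, written in Python to match eval/checks.py. It is not the project's actual checker: the names TOKEN_RE, HARDCODED_ID_RE, and find_violations are hypothetical, the patterns are simplified heuristics, and it assumes loop bodies are wrapped in braces and does not strip comments or string literals first.

```python
import re

# Hypothetical patterns; the real eval/checks.py may detect these differently.
TOKEN_RE = re.compile(
    r"(?P<loop>\b(?:for|while)\s*\()"
    r"|(?P<open>\{)"
    r"|(?P<close>\})"
    r"|(?P<soql>\[\s*SELECT\b)"
    r"|(?P<dml>\b(?:insert|update|delete|upsert)\s+\w+\s*;"
    r"|\bDatabase\.(?:insert|update|delete|upsert)\s*\()",
    re.IGNORECASE,
)

# Heuristic: a quoted 15- or 18-character alphanumeric literal looks like a record Id.
HARDCODED_ID_RE = re.compile(r"'[a-zA-Z0-9]{15}(?:[a-zA-Z0-9]{3})?'")


def find_violations(apex: str) -> list[str]:
    """Return a list of violation labels found in the Apex source."""
    violations = []
    stack = []            # one entry per open brace: True if it opens a loop body
    pending_loop = False  # a loop header was seen and its body brace is still ahead

    for match in TOKEN_RE.finditer(apex):
        kind = match.lastgroup
        if kind == "loop":
            pending_loop = True
        elif kind == "open":
            stack.append(pending_loop)
            pending_loop = False
        elif kind == "close":
            if stack:
                stack.pop()
        elif any(stack):
            # SOQL or DML while at least one enclosing block is a loop body.
            violations.append(f"{kind.upper()} in loop")

    if HARDCODED_ID_RE.search(apex):
        violations.append("hard-coded Id")
    return violations
```

A query in a loop header, as in for (Account a : [SELECT ...]), is not flagged, because the header is scanned before the loop body's opening brace is pushed onto the stack.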

Per-task: base → fine-tuned

Every one of the 15 held-out tasks, scored by the same checks. 7 flipped FAIL→PASS (4 of them by eliminating SOQL/DML in a loop), 3 regressed by dropping a required idiom (the honest cost of a small first dataset), and the remaining 5 were unchanged (3 pass and 2 fail in both models).

Salesforce task	Base	Fine-tuned
Bulk-safe trigger → child Contacts	✕ SOQL in loop	✓ PASS ↑ fixed
Set Opportunity desc from Account	✓ PASS	✕ dropped idiom ↓ regressed
Trigger-handler framework	✓ PASS	✕ dropped idiom ↓ regressed
Queueable callout	✕ missing AllowsCallouts	✓ PASS ↑ fixed
Batch: deactivate empty Accounts	✓ PASS	✓ PASS
Apex test class (factory + asserts)	✕ DML in loop	✕ missing @isTest
Parent-to-child SOQL	✕ SOQL in loop	✓ PASS ↑ fixed
FLS/CRUD-safe query	✕ no sharing/USER_MODE	✓ PASS ↑ fixed
Partial-success DML	✕ no error handling	✓ PASS ↑ fixed
LWC imperative Apex call	✓ PASS	✕ dropped .catch ↓ regressed
Recursion control	✓ PASS	✓ PASS
Bulk-safe Account counter	✕ SOQL in loop	✓ PASS ↑ fixed
Schedulable batch	✕ DML in loop	✓ PASS ↑ fixed
Callout from trigger context	✕ missing async	✕ missing async
before-insert addError validation	✓ PASS	✓ PASS

Real outputs from the verified run, scored by eval/checks.py. The same two models can be compared on any prompt in the live demo above.
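The headline numbers follow directly from the per-task table. The sketch below tallies them in Python; the PASS/FAIL pairs are transcribed from the table in task order, while TaskResult and summarize are hypothetical names, and pass@1 is assumed to mean one sample per task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    base_pass: bool
    tuned_pass: bool


# (base, fine-tuned) outcomes for the 15 tasks, in table order. P = pass, F = fail.
OUTCOMES = ["FP", "PF", "PF", "FP", "PP", "FF", "FP", "FP",
            "FP", "PF", "PP", "FP", "FP", "FF", "PP"]
RESULTS = [TaskResult(o[0] == "P", o[1] == "P") for o in OUTCOMES]


def summarize(results: list[TaskResult]) -> dict:
    """Compute pass@1 for both models plus the fixed/regressed counts."""
    n = len(results)
    base = sum(r.base_pass for r in results)
    tuned = sum(r.tuned_pass for r in results)
    return {
        "base_pass_at_1": round(100 * base / n, 1),
        "tuned_pass_at_1": round(100 * tuned / n, 1),
        "fixed": sum(not r.base_pass and r.tuned_pass for r in results),
        "regressed": sum(r.base_pass and not r.tuned_pass for r in results),
    }
```

Calling summarize(RESULTS) reproduces the figures quoted on this page: 40.0% and 66.7% pass@1, 7 fixed, 3 regressed.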

Real example — same prompt, both models

Actual generated output for one task: "return Contacts by last name, enforcing the running user's field- and object-level security." These are the model responses, not cherry-picked prose; the base output is abridged, with reviewer notes in comments.

Base — Qwen2.5-Coder-3B ✕ FAIL

public class ContactQueryService {
  public static List<Contact> getContactsByLastName(String lastName) {
    if (!UserInfo.isCurrentUserSystemAdmin()) {
      throw new SecurityException('No permission');
    }
    // comment claims "FLS enforcement" — but the query has none:
    List<Contact> contacts = [SELECT Id, FirstName, LastName, Email
                               FROM Contact WHERE LastName = :lastName];
    // ... + a bogus per-record contact.hasAccess() check
    return contacts;
  }
}

Invents a manual admin gate, queries with no FLS enforcement, and calls a method that doesn't exist. The class is not declared with sharing, and the query has no WITH USER_MODE.

+ Apex Copilot fine-tune ✓ PASS

public with sharing class ContactSearch {
  public static List<Contact> byLastName(String value) {
    return [SELECT Id, LastName FROM Contact
            WHERE LastName = :value WITH USER_MODE];
  }
}

Idiomatic and correct: with sharing applies record-level sharing rules, and WITH USER_MODE enforces the running user's field- and object-level security on the query itself.
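The difference between the two answers is visible to a purely static check. A rough Python sketch of such a check follows. It is an illustration and not the code in eval/checks.py: the regexes and the security_problems function are hypothetical, and it only recognizes inline [SELECT ...] queries, not dynamic SOQL.

```python
import re

# Hypothetical, simplified patterns for illustration only.
WITH_SHARING_RE = re.compile(r"\bwith\s+sharing\b", re.IGNORECASE)
SOQL_RE = re.compile(r"\[\s*SELECT\b.*?\]", re.IGNORECASE | re.DOTALL)
USER_MODE_RE = re.compile(r"\bWITH\s+USER_MODE\b", re.IGNORECASE)


def security_problems(apex: str) -> list[str]:
    """List the security idioms missing from an Apex class."""
    problems = []
    if not WITH_SHARING_RE.search(apex):
        problems.append("class not declared with sharing")
    # Every inline query must carry its own WITH USER_MODE clause.
    for query in SOQL_RE.findall(apex):
        if not USER_MODE_RE.search(query):
            problems.append("query without WITH USER_MODE")
    return problems
```

Run against the two outputs above, this sketch reports both problems for the base answer and none for the fine-tuned one.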

Honest note: the eval's checks are static (no live compile), so a passing answer is not guaranteed to compile. On governor-limit safety, the signal the project targets, the fine-tune went from 66.7% → 100%.

System design

Build & verify · offline, on a GPU

Dataset (123, checks-gated) → QLoRA fine-tune (Qwen2.5-Coder-3B) → Eval gate (governor-limit checks) → LoRA adapter

Serve · the live demo

Browser → techarchinc proxy (key in Secret Manager · rate-limit) → RunPod Serverless (base ⇄ adapter hot-swap · scale-to-zero) → base + tuned → PASS/FAIL

The LoRA adapter hot-swaps on a single model load, so one GPU serves both base and fine-tuned for the side-by-side. The API key stays server-side (never in the browser), and requests are async-polled so no call exceeds the 60s web limit during a GPU cold start.
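The async-polling idea can be sketched as a submit-then-poll loop in which every individual call returns quickly, even if the job takes minutes during a cold start. This is an illustrative Python sketch, not the proxy's real code: run_with_polling, the submit and get_status callables, the state names, and the timeout values are all assumptions.

```python
import time
from typing import Callable


def run_with_polling(
    submit: Callable[[], str],
    get_status: Callable[[str], dict],
    *,
    poll_interval_s: float = 2.0,
    overall_timeout_s: float = 180.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Submit a job, then poll with short requests until it finishes.

    Each call to submit() or get_status() is a short request, so none of
    them has to stay open for the whole GPU cold start.
    """
    job_id = submit()
    deadline = time.monotonic() + overall_timeout_s
    while True:
        status = get_status(job_id)
        if status["state"] == "COMPLETED":   # hypothetical state name
            return status["output"]
        if status["state"] == "FAILED":      # hypothetical state name
            raise RuntimeError(f"job {job_id} failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish in time")
        sleep(poll_interval_s)
```

Injecting submit, get_status, and sleep keeps the loop independent of any particular HTTP client and makes it easy to exercise with a fake backend.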

How it works

Eval harness first (the credibility core)

A held-out suite of Salesforce tasks scored by objective, executable checks — does the Apex parse, stay governor-limit-safe (no SOQL/DML in loops), and use the right idioms? The same checks run live above. You can't claim a win without a way to measure it.

A dataset held to the same bar

Instruction→Apex pairs (own patterns + templated generation + public docs — no proprietary code), with every training example filtered through the same checks and a guard against leaking the eval tasks. Bad patterns never enter training.
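One way to picture this gate is a filter that drops any example whose Apex fails the checks or whose instruction overlaps too closely with an eval prompt. The Python sketch below assumes a token-overlap (Jaccard) leak guard with a hypothetical 0.8 threshold; the project's real guard, field names, and threshold may differ, and passes_checks stands in for the actual checks.

```python
import re
from typing import Callable


def token_set(text: str) -> set[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap between two token sets, from 0.0 (disjoint) to 1.0 (identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def gate_dataset(
    examples: list[dict],
    eval_prompts: list[str],
    passes_checks: Callable[[str], bool],
    max_overlap: float = 0.8,  # hypothetical threshold
) -> list[dict]:
    """Keep examples that pass the checks and do not resemble an eval prompt."""
    eval_sets = [token_set(p) for p in eval_prompts]
    kept = []
    for example in examples:
        tokens = token_set(example["instruction"])
        leaks = any(jaccard(tokens, e) >= max_overlap for e in eval_sets)
        if not leaks and passes_checks(example["apex"]):
            kept.append(example)
    return kept
```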

QLoRA fine-tuning

Parameter-efficient fine-tuning (4-bit base + LoRA adapters) of an open code model on a single on-demand GPU — a few dollars, a few hours. The adapter hot-swaps on/off so one model in memory serves both base and fine-tuned for the comparison.
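The reason this fits on a single GPU comes down to arithmetic: 4-bit weights take half a byte per parameter, and a LoRA adapter of rank r on a d_in × d_out layer trains only r × (d_in + d_out) parameters. The Python sketch below works that through. The layer size (2048 × 2048), the rank (16), and the nominal 3 billion parameter count are assumed example values, not the project's actual configuration, and the memory figure ignores quantization metadata, activations, and optimizer state.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds two low-rank matrices: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def full_layer_params(d_in: int, d_out: int) -> int:
    """Parameter count of the frozen dense weight matrix."""
    return d_in * d_out


def quantized_weight_bytes(n_params: int, bits: int = 4) -> float:
    """Approximate storage for the quantized base weights alone."""
    return n_params * bits / 8


# Assumed example values, for illustration only.
adapter = lora_trainable_params(2048, 2048, rank=16)
dense = full_layer_params(2048, 2048)
fraction_trained = adapter / dense
base_weight_gb = quantized_weight_bytes(3_000_000_000) / 1e9
```

Under these assumptions the adapter trains about 1.6% of the parameters in that layer, and the 4-bit base weights occupy roughly 1.5 GB.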

Verify, then serve

Score fine-tuned vs base on the held-out suite (pass@1, % limit-safe, win-rate). Only if it measurably wins does it ship — served on scale-to-zero GPU with rate limits and a per-run cost cap.
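The "only if it measurably wins" rule can be written down as an explicit gate. The Python sketch below is one possible formulation and an assumption on our part: the page does not state the exact shipping criterion, so Scores, should_ship, and the min_gain parameter are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    pass_at_1: float   # percent of tasks passed
    limit_safe: float  # percent of outputs with no governor-limit violation


def should_ship(base: Scores, tuned: Scores, min_gain: float = 0.0) -> bool:
    """Ship only if limit safety does not drop and pass@1 improves by more than min_gain."""
    no_safety_regression = tuned.limit_safe >= base.limit_safe
    measurable_win = tuned.pass_at_1 - base.pass_at_1 > min_gain
    return no_safety_regression and measurable_win
```

With the verified numbers from this page, should_ship(Scores(40.0, 66.7), Scores(66.7, 100.0)) returns True.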

Status — built in the open

✓ Eval harness: objective governor-limit / bulkification checks (live above)

✓ Training dataset: built & quality-gated by the same checks

✓ QLoRA training run: verified (100% limit-safe / 66.7% pass@1 vs base)

✓ Live model demo: type a task, base vs fine-tuned, on on-demand GPU (above)

The headline numbers above are from a real, reproducible run (held-out 15-task suite) — not estimates. The type-a-task demo runs the actual fine-tuned model on a scale-to-zero GPU; the governor-limit eval runs in your browser.
