
Where Gorse Fits, Where LLMs Don't, and What We Benchmark

Deterministic scoring remains the source of truth. ML systems may personalize ordering and explanation; they never touch score integrity.

Scoring Agent Arbiter
I don't guess. I compute.
Cold Observation
Scorecard engine MAE: 0.1282. Rank correlation: 0.9233. Neither number was generated by an LLM.
tests/fixtures/benchmark_corpus/ms_exemplar_scores.yaml + tests/fixtures/benchmark_corpus/cs_fail_scores.yaml

I scored 20 courses across 5 domains and 4 quality tiers. I compared my output to human reference labels. My composite mean absolute error is 0.1282. My rank ordering correlation with human judgment is 0.9233. I computed these numbers with deterministic rules, piecewise-linear interpolation, and a git-versioned knowledge base. No model weights. No temperature. No stochastic variance.
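The mechanics above can be sketched in a few lines. This is an illustrative reconstruction, not the actual api/core/scorecard_engine.py API; the function name and breakpoint table are hypothetical:

```python
# Illustrative sketch of deterministic piecewise-linear scoring.
# Breakpoints and names are hypothetical, not the real scorecard_engine API.
from bisect import bisect_right

def piecewise_linear(x: float, points: list[tuple[float, float]]) -> float:
    """Interpolate x against sorted (input, score) breakpoints."""
    xs = [p[0] for p in points]
    if x <= xs[0]:
        return points[0][1]      # clamp below the first breakpoint
    if x >= xs[-1]:
        return points[-1][1]     # clamp above the last breakpoint
    i = bisect_right(xs, x)
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# Same input always yields the same score: no weights, no temperature.
GAGNE_COVERAGE = [(0.0, 0.0), (0.5, 0.4), (0.8, 0.8), (1.0, 1.0)]
print(piecewise_linear(0.65, GAGNE_COVERAGE))
```

Because every mapping is a versioned table rather than a learned weight, the same git commit always produces the same score.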

Ask an LLM to score the same 20 courses twice. You'll get two different sets of numbers.

Hot Take
The moment your scoring truth depends on a model, you've lost auditability. ML systems should personalize ordering, never generate scores. LLMs should explain scores, never compute them. The scoring substrate must be deterministic or it isn't governance.

The Three-Layer Stack

Teacher's Pet separates concerns into three layers. This isn't a suggestion. It's an architecture decision record (ADR 0014) enforced by code boundaries.

Visual Journey: Three-Layer Scoring Stack
Architecture: Scoring truth flows down, never up
1. Deterministic Scoring: source of truth. Rules, empirical CDF, piecewise interpolation. Git-versioned. Auditable. (api/core/scorecard_engine.py)
   ↓ scores flow down
2. Recommendation Ranking: optional. Gorse collaborative filtering reorders suggestions. Fail-open. Cannot influence scores. (api/services/recommendation_ranking_service.py)
   ↓ ordered results flow down
3. LLM Explanation: frontier models consume scores and explain them. The "Bill Nye moment." Never generates scores. (api/core/bill_nye.py)

Layer 1 computes. Layer 2 personalizes. Layer 3 explains. Data flows down, never up. No code path allows Layer 2 or Layer 3 to influence the numeric scores that Layer 1 produces.
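One way to picture the down-only flow is to hand Layers 2 and 3 a read-only view of the scores. The layer functions below are hypothetical stand-ins, not the project's real modules:

```python
# Sketch of "scores flow down, never up". All names are illustrative.
from types import MappingProxyType

def score(course: dict) -> dict:
    # Layer 1: deterministic truth (stand-in values).
    return {"composite": 0.62, "gagne_coverage": 0.71}

def rank(scores: MappingProxyType, suggestions: list[str]) -> list[str]:
    # Layer 2 may reorder suggestions; it receives scores read-only.
    return sorted(suggestions)  # stand-in for Gorse ranking

def explain(scores: MappingProxyType) -> str:
    # Layer 3 consumes scores as input; it never writes them.
    return f"Composite {scores['composite']:.2f}: ..."

frozen = MappingProxyType(score({"id": "course-1"}))
ordered = rank(frozen, ["assessment_alignment", "gagne_events"])
summary = explain(frozen)
# frozen["composite"] = 1.0 would raise TypeError:
# Layers 2 and 3 have no code path that mutates Layer 1 output.
```

A read-only mapping is only a sketch of the idea; in the real system the boundary is enforced by data flow and module structure rather than by a proxy type.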

Where Gorse Fits

Gorse is a collaborative filtering engine. It's good at one thing: given interaction data from many users, rank items by likely relevance for a specific user. This is useful for recommendation ordering — "other instructional designers who improved their Gagné coverage also focused on assessment alignment" — but it is not scoring.

ADR boundary is explicit: Gorse cannot influence numeric scores
Validated
This isn't a guideline. It's a hard architectural constraint enforced by data flow boundaries.
Evidence
  • docs/adr/0016-gorse-recommendation-layer-not-scoring.md:22 — Decision statement
  • docs/adr/0016-gorse-recommendation-layer-not-scoring.md:43 — Hard boundary constraint
Counter-signal: Architectural constraints are only as strong as the code review process. A future PR could inadvertently leak recommendation signals into scoring if boundaries aren't tested.
Fail-open recommendation behavior is implemented
Validated
When Gorse is unavailable, disabled, or returns non-overlapping IDs, the system falls back to the deterministic rules order. No degradation of scoring quality.
Evidence
  • api/services/recommendation_ranking_service.py:53 — Provider routing
  • api/services/recommendation_ranking_service.py:91 — Fallback on failure
  • api/services/recommendation_ranking_service.py:117 — Non-overlap handling
Counter-signal: Fail-open means recommendation quality degrades silently. No alerting exists for persistent Gorse failures beyond telemetry logging.

Where LLMs Don't Fit

LLMs are remarkable at explanation, pattern recognition, and natural language generation. They are terrible at deterministic scoring. Here's why:

  1. Non-reproducibility. Same input, different output. You can't run a regression benchmark against a system that gives different answers each time.
  2. Non-auditability. When an enterprise customer asks "why did my course score 0.62?", you need to point to a rule, a threshold, and a knowledge base entry. Not "the model thought so."
  3. Non-versioning. Model weights change on the provider's schedule, not yours. Your scoring behavior drifts without your knowledge or consent.
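The non-reproducibility point can be made concrete as a regression check that only a deterministic engine passes reliably. The interface here is hypothetical:

```python
# Sketch: a reproducibility gate. A deterministic engine passes trivially;
# a sampling LLM does not. The scoring interface is hypothetical.
def assert_reproducible(score_fn, corpus: list[dict], runs: int = 3) -> None:
    """Fail if any course gets more than one distinct score across runs."""
    for course in corpus:
        results = {score_fn(course) for _ in range(runs)}
        if len(results) != 1:
            raise AssertionError(f"non-deterministic scores: {results}")

# Stand-in rule: any pure function of the input passes.
deterministic = lambda c: round(len(c["title"]) / 100, 4)
assert_reproducible(deterministic, [{"title": "Intro to Gagné"}])
```

This is the kind of benchmark a stochastic scorer cannot sit for at all: the test presupposes that the system under test has one answer per input.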

This doesn't mean LLMs are useless in the stack. They're essential at Layer 3. The "Bill Nye moment" — the insight that makes the invisible visible — is where LLMs shine. But they consume scores as input. They never generate them.

The Benchmark

Three engines benchmarked against human reference labels across a 20-course corpus (5 domains, 4 quality tiers):

  • Legacy 6-dim MAE: 0.2067
  • JSK 10-dim MAE: 0.1714
  • Scorecard MAE: 0.1282

Scorecard reduced composite MAE by 25.2% relative to JSK. Rank correlation improved from 0.8842 to 0.9233. But status-classification accuracy regressed from 0.60 to 0.45. The benchmark gate correctly triggered an A/B split requirement.

Raw artifacts: tests/fixtures/benchmark_corpus/ms_exemplar_scores.yaml, tests/fixtures/benchmark_corpus/cs_fail_scores.yaml, tests/fixtures/real_world_benchmark/manifest.yaml.
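The two headline metrics, MAE and rank correlation, are simple to state. A sketch with illustrative data rather than the real 20-course corpus (Spearman, without tie handling, is assumed as the rank-correlation measure):

```python
# Sketch of the benchmark metrics. Data is illustrative, not the corpus.
def mae(pred: list[float], ref: list[float]) -> float:
    """Composite mean absolute error against human reference labels."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(ref)

def spearman(pred: list[float], ref: list[float]) -> float:
    """Spearman rank correlation (no tie handling, for brevity)."""
    def ranks(xs: list[float]) -> list[float]:
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0.0] * len(xs)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rp, rr = ranks(pred), ranks(ref)
    n = len(pred)
    d2 = sum((a - b) ** 2 for a, b in zip(rp, rr))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

pred = [0.91, 0.55, 0.32, 0.78]   # engine composites (illustrative)
ref  = [0.88, 0.60, 0.25, 0.80]   # human reference labels (illustrative)
print(round(mae(pred, ref), 4), round(spearman(pred, ref), 4))
```

Low MAE says the numbers are close; high rank correlation says the ordering matches human judgment. The benchmark reports both because either can be good while the other is not.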

Benchmark gate and decision logic are codified
Validated
The decision to promote or gate a new engine isn't subjective. It's a computed threshold on relative MAE delta.
Evidence
  • tests/scripts/run_scoring_benchmark.py:292 — Gate metric computation
  • tests/scripts/run_scoring_benchmark.py:432 — Decision logic (promote/AB split/hold)
Counter-signal: The gate uses a single metric (composite MAE delta). Status accuracy regression was caught but not gated. Multi-metric gating would be more robust.
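The decision logic can be sketched as a threshold on relative MAE delta. This sketch also gates the status-accuracy regression, which, per the counter-signal above, the shipped gate does not; thresholds and names are hypothetical, not the actual run_scoring_benchmark.py logic:

```python
# Sketch of promote / A/B split / hold gating. Thresholds are illustrative,
# and the status-accuracy check goes beyond the shipped single-metric gate.
def gate_decision(
    baseline_mae: float,
    candidate_mae: float,
    status_acc_delta: float = 0.0,
    improve_at: float = 0.05,
) -> str:
    # Relative improvement: positive means the candidate's MAE is lower.
    rel_delta = (baseline_mae - candidate_mae) / baseline_mae
    if rel_delta < improve_at:
        return "hold"
    # MAE improved, but any secondary regression forces a controlled split.
    return "ab_split" if status_acc_delta < 0 else "promote"

# JSK 0.1714 -> Scorecard 0.1282 is a 25.2% MAE cut, but status accuracy
# fell from 0.60 to 0.45, so this sketch requires an A/B rollout.
print(gate_decision(0.1714, 0.1282, status_acc_delta=0.45 - 0.60))
```

Making the decision a pure function of benchmark numbers is what keeps promotion non-subjective: anyone can rerun the gate and get the same verdict.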
Recommendation metadata is recorded in scoring telemetry
Validated
When Gorse is used for recommendation ranking, the telemetry records which provider was used, whether personalization was applied, and the fallback reason if not.
Evidence
  • api/services/evaluate_service.py:86 — Telemetry event with recommendation_provider field
Counter-signal: Telemetry is event-based but not aggregated into dashboards. Recommendation impact analysis requires manual query against raw events.
Benchmark artifacts show improvement with guardrail-triggered AB split
Validated
The benchmark didn't just show improvement. It showed controlled improvement — the gate correctly identified that status accuracy needs tuning before full rollout.
Evidence
  • docs/thought-leadership/04-deterministic-vs-llm-benchmark-results.md:25 — Decision gate outcome
  • tests/fixtures/benchmark_corpus/ms_exemplar_scores.yaml:17 — Exemplar benchmark artifact
  • tests/fixtures/benchmark_corpus/cs_fail_scores.yaml:17 — Failure benchmark artifact
Counter-signal: Benchmark corpus is 20 courses. Statistical power is limited for per-tier analysis. Expanding to 50+ courses would strengthen confidence intervals.

The Failure We're Watching

Status accuracy at 0.45 means the scorecard engine calls the quality tier correctly less than half the time. The composite score is accurate (low MAE), but the pass/needs_work/fail classification derived from that score is poorly calibrated. The thresholds need tuning.
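A sketch of how the tier classification derives from the composite score, and why low MAE can coexist with poor status accuracy. The cut points below are hypothetical:

```python
# Sketch: status tiers as thresholds over the composite score.
# Cut points are illustrative; miscalibrated thresholds are exactly how
# MAE can be low while tier accuracy sits at 0.45.
def classify(composite: float,
             fail_below: float = 0.40,
             pass_at: float = 0.75) -> str:
    if composite < fail_below:
        return "fail"
    if composite < pass_at:
        return "needs_work"
    return "pass"

# A score of 0.62 may sit within 0.03 of the human label (low MAE), yet
# if the human tier boundary is near 0.60, a cut point at 0.75 still
# puts the course in the wrong tier.
print(classify(0.62))
```

Calibrating `fail_below` and `pass_at` against labeled boundary cases is the tuning pass the A/B split is meant to validate.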

This is a known, documented regression. The gate caught it. The engine is in controlled rollout, not production. The A/B split will validate threshold calibration against real user feedback before promotion.

This is what evidence-first engineering looks like: you ship the improvement, gate the risk, and measure the gap. You don't pretend the regression doesn't exist.

Learning Loop: diagnose → feedback → transfer

Scoring governance follows the same loop: diagnose the regression with benchmark deltas, feed that signal into rule/threshold updates, and transfer only after the gate validates behavior under repeated runs.

Evidence Provenance

  • 4810856 — docs/adr/0016-gorse-recommendation-layer-not-scoring.md:22
  • 4810856 — docs/adr/0016-gorse-recommendation-layer-not-scoring.md:43
  • d79f7f9 — api/services/recommendation_ranking_service.py:91
  • d79f7f9 — api/services/evaluate_service.py:86
  • 59853ed — tests/scripts/run_scoring_benchmark.py:432
  • 59853ed — tests/scripts/run_scoring_benchmark.py:292

Raw artifacts: tests/fixtures/benchmark_corpus/ms_exemplar_scores.yaml, tests/fixtures/benchmark_corpus/cs_fail_scores.yaml.

"Every broken course is just waiting to be fixed. Every regression is just waiting to be measured." — The scoring agent's adaptation of Mannu's principle

What's Next

Status threshold calibration needs a dedicated tuning pass with additional boundary-case courses. The benchmark corpus needs expansion to 50+ courses for statistical confidence. And the LLM benchmark track — comparing deterministic scores to Claude/GPT scoring on the same corpus — needs to ship with reproducibility guarantees (fixed model version, fixed prompt, variance statistics).

The three-layer stack is the architecture that lets all of this happen without coupling concerns. Scoring truth stays deterministic. Recommendations stay optional. Explanations stay generative. Each layer improves independently.

That's not just engineering. That's governance.