I scored 20 courses across 5 domains and 4 quality tiers. I compared my output to human reference labels. My composite mean absolute error is 0.1282. My rank-order correlation with human judgment is 0.9233. I computed these numbers with deterministic rules, piecewise-linear interpolation, and a git-versioned knowledge base. No model weights. No temperature. No stochastic variance.
Ask an LLM to score the same 20 courses twice. You'll get two different sets of numbers.
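Both headline metrics are ordinary deterministic arithmetic. A minimal sketch of the two, mean absolute error and Spearman rank correlation, computed with no model calls; the score lists here are illustrative, not the real 20-course corpus (and this simple rank function ignores ties):

```python
def mean_absolute_error(predicted, reference):
    """Mean of per-course |engine score - human label|."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)

def spearman_rho(predicted, reference):
    """Rank-order correlation: Pearson correlation of the two rank vectors."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0.0] * len(xs)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rp, rr = ranks(predicted), ranks(reference)
    n = len(rp)
    mp, mr = sum(rp) / n, sum(rr) / n
    cov = sum((a - mp) * (b - mr) for a, b in zip(rp, rr))
    sp = sum((a - mp) ** 2 for a in rp) ** 0.5
    sr = sum((b - mr) ** 2 for b in rr) ** 0.5
    return cov / (sp * sr)

engine = [0.62, 0.81, 0.34, 0.90]
human  = [0.60, 0.78, 0.40, 0.88]
print(mean_absolute_error(engine, human), spearman_rho(engine, human))
```

Run it twice, or two thousand times: the numbers do not move. That is the whole point.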
The Three-Layer Stack
Teacher's Pet separates concerns into three layers. This isn't a suggestion. It's an architecture decision record (ADR 0014) enforced by code boundaries.
Layer 1 computes. Layer 2 personalizes. Layer 3 explains. Data flows down, never up. No code path allows Layer 2 or Layer 3 to influence the numeric scores that Layer 1 produces.
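One way to enforce "data flows down, never up" in code is to make Layer 1's output immutable. A minimal sketch under that assumption; the names (`ScoreReport`, `personalize`, `explain`) and rubric values are illustrative, not the real API:

```python
from dataclasses import dataclass
from types import MappingProxyType

@dataclass(frozen=True)
class ScoreReport:
    """Layer 1 output: frozen, so downstream layers cannot mutate it."""
    composite: float
    rubric: MappingProxyType  # read-only view of per-rule scores

def score(course: dict) -> ScoreReport:
    """Layer 1: computes. (Rubric values hard-coded for illustration.)"""
    rubric = {"gagne_coverage": 0.7, "assessment_alignment": 0.5}
    composite = sum(rubric.values()) / len(rubric)
    return ScoreReport(composite, MappingProxyType(rubric))

def personalize(report: ScoreReport) -> list[str]:
    """Layer 2: reads scores, orders focus areas. Never writes."""
    return sorted(report.rubric, key=report.rubric.get)

def explain(report: ScoreReport) -> str:
    """Layer 3: reads scores, produces narrative. Never writes."""
    return f"Composite {report.composite:.2f}; weakest: {personalize(report)[0]}"

report = score({"title": "Intro"})
print(explain(report))  # any attempt to assign report.composite raises
```

The frozen dataclass plus the read-only mapping makes the boundary a runtime guarantee, not a convention.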
Where Gorse Fits
Gorse is a collaborative filtering engine. It's good at one thing: given interaction data from many users, rank items by likely relevance for a specific user. This is useful for recommendation ordering — "other instructional designers who improved their Gagné coverage also focused on assessment alignment" — but it is not scoring.
- docs/adr/0016-gorse-recommendation-layer-not-scoring.md:22 — Decision statement
- docs/adr/0016-gorse-recommendation-layer-not-scoring.md:43 — Hard boundary constraint
- api/services/recommendation_ranking_service.py:53 — Provider routing
- api/services/recommendation_ranking_service.py:91 — Fallback on failure
- api/services/recommendation_ranking_service.py:117 — Non-overlap handling
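The three behaviors those references name, provider routing, fallback on failure, and non-overlap handling, can be sketched in a few lines. This is a hedged illustration of the pattern, not the code in `recommendation_ranking_service.py`; `rank_with_gorse` stands in for the real provider call:

```python
def rank_recommendations(items, rank_with_gorse):
    """Order recommendations via the provider; never touch scores."""
    try:
        ranked = rank_with_gorse(items)        # provider routing
    except Exception:
        return sorted(items)                   # fallback: stable deterministic order
    leftover = [i for i in items if i not in ranked]
    return ranked + sorted(leftover)           # non-overlap handling

items = ["assessment_alignment", "gagne_coverage", "accessibility"]
# Provider ranks only one item; unranked items follow deterministically.
print(rank_recommendations(items, lambda _: ["gagne_coverage"]))
```

Note what is absent: no score ever enters or leaves this function. Gorse reorders; it never grades.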
Where LLMs Don't Fit
LLMs are remarkable at explanation, pattern recognition, and natural language generation. They are terrible at deterministic scoring. Here's why:
- Non-reproducibility. Same input, different output. You can't run a regression benchmark against a system that gives different answers each time.
- Non-auditability. When an enterprise customer asks "why did my course score 0.62?", you need to point to a rule, a threshold, and a knowledge base entry. Not "the model thought so."
- Non-versioning. Model weights change on the provider's schedule, not yours. Your scoring behavior drifts without your knowledge or consent.
This doesn't mean LLMs are useless in the stack. They're essential at Layer 3. The "Bill Nye moment" — the insight that makes the invisible visible — is where LLMs shine. But they consume scores as input. They never generate them.
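The Layer 3 contract can be made explicit in code: the LLM receives the deterministic scores as context and returns prose only, while the numbers pass through untouched. A sketch with illustrative names (`explain_scores` and the `llm` callable are placeholders, not the real service):

```python
def explain_scores(report: dict, llm) -> dict:
    """Layer 3: LLM consumes scores as input, never generates them."""
    prompt = (
        "Explain these course-quality scores to an instructional designer.\n"
        f"Scores (authoritative; do not restate as new values): {report}"
    )
    # The narrative is generative; the scores are passed through verbatim.
    return {"scores": report, "narrative": llm(prompt)}

report = {"composite": 0.62, "gagne_coverage": 0.70}
result = explain_scores(report, lambda p: "Your Gagné coverage is solid...")
assert result["scores"] is report  # same object: explanation cannot overwrite truth
```

If the model hallucinates a number in its prose, nothing downstream consumes it; the audit trail still points at the deterministic report.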
The Benchmark
Three engines were benchmarked against human reference labels on a 20-course corpus (5 domains, 4 quality tiers).
Scorecard reduced composite MAE by 25.2% relative to JSK. Rank correlation improved from 0.8842 to 0.9233. But status-classification accuracy regressed from 0.60 to 0.45. The benchmark gate correctly triggered an A/B split requirement.
Raw artifacts: tests/fixtures/benchmark_corpus/ms_exemplar_scores.yaml, tests/fixtures/benchmark_corpus/cs_fail_scores.yaml, tests/fixtures/real_world_benchmark/manifest.yaml.
- tests/scripts/run_scoring_benchmark.py:292 — Gate metric computation
- tests/scripts/run_scoring_benchmark.py:432 — Decision logic (promote/AB split/hold)
- api/services/evaluate_service.py:86 — Telemetry event with recommendation_provider field
- docs/thought-leadership/04-deterministic-vs-llm-benchmark-results.md:25 — Decision gate outcome
- tests/fixtures/benchmark_corpus/ms_exemplar_scores.yaml:17 — Exemplar benchmark artifact
- tests/fixtures/benchmark_corpus/cs_fail_scores.yaml:17 — Failure benchmark artifact
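The gate's promote/AB-split/hold decision reduces to a comparison of three metrics. A hedged sketch of that logic, not the actual code in `run_scoring_benchmark.py`; the baseline MAE is back-computed from the stated 25.2% reduction, and the decision rule is a simplified illustration:

```python
def gate_decision(candidate: dict, baseline: dict) -> str:
    """Promote only if everything improved; split if a metric regressed."""
    improved = (candidate["mae"] < baseline["mae"]
                and candidate["rank_corr"] > baseline["rank_corr"])
    regressed = candidate["status_acc"] < baseline["status_acc"]
    if improved and not regressed:
        return "promote"
    if improved and regressed:
        return "ab_split"   # ship behind a split; validate with real users
    return "hold"

jsk       = {"mae": 0.1714, "rank_corr": 0.8842, "status_acc": 0.60}  # 0.1282/(1-0.252)
scorecard = {"mae": 0.1282, "rank_corr": 0.9233, "status_acc": 0.45}
print(gate_decision(scorecard, jsk))  # "ab_split"
```

This is exactly the outcome the benchmark produced: better composite accuracy, better ranking, worse status classification, so the gate refused a straight promotion.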
The Failure We're Watching
Status accuracy at 0.45 means the scorecard engine calls the quality tier correctly less than half the time. The composite score is accurate (low MAE), but the pass/needs_work/fail classification derived from that score is poorly calibrated. The thresholds need tuning.
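This failure mode is easy to see in miniature. Status is derived from the composite via fixed cut points (the thresholds below are illustrative, not the real calibration), so a score can sit very close to the human label yet land on the wrong side of a boundary:

```python
def classify(composite: float, fail_below=0.40, pass_at=0.75) -> str:
    """Map a composite score to pass / needs_work / fail via fixed thresholds."""
    if composite < fail_below:
        return "fail"
    if composite >= pass_at:
        return "pass"
    return "needs_work"

# Engine says 0.73, human label implies 0.76: only 0.03 of MAE,
# but a full status-tier disagreement. Low MAE, poor status accuracy.
print(classify(0.73), classify(0.76))  # needs_work pass
```

Tuning is therefore a threshold-placement problem near tier boundaries, which is why the corpus needs more boundary-case courses.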
This is a known, documented regression. The gate caught it. The engine is in controlled rollout, not production. The A/B split will validate threshold calibration against real user feedback before promotion.
This is what evidence-first engineering looks like: you ship the improvement, gate the risk, and measure the gap. You don't pretend the regression doesn't exist.
Learning Loop: diagnose -> feedback -> transfer
Scoring governance follows the same loop: diagnose the regression with benchmark deltas, feed that signal into rule/threshold updates, and transfer only after the gate validates behavior under repeated runs.
Evidence Provenance
- 4810856 — docs/adr/0016-gorse-recommendation-layer-not-scoring.md:22
- 4810856 — docs/adr/0016-gorse-recommendation-layer-not-scoring.md:43
- d79f7f9 — api/services/recommendation_ranking_service.py:91
- d79f7f9 — api/services/evaluate_service.py:86
- 59853ed — tests/scripts/run_scoring_benchmark.py:432
- 59853ed — tests/scripts/run_scoring_benchmark.py:292
Raw artifacts: tests/fixtures/benchmark_corpus/ms_exemplar_scores.yaml, tests/fixtures/benchmark_corpus/cs_fail_scores.yaml.
"Every broken course is just waiting to be fixed. Every regression is just waiting to be measured." — The scoring agent's adaptation of Mannu's principle
What's Next
Status threshold calibration needs a dedicated tuning pass with additional boundary-case courses. The benchmark corpus needs expansion to 50+ courses for statistical confidence. And the LLM benchmark track — comparing deterministic scores to Claude/GPT scoring on the same corpus — needs to ship with reproducibility guarantees (fixed model version, fixed prompt, variance statistics).
The three-layer stack is the architecture that lets all of this happen without coupling concerns. Scoring truth stays deterministic. Recommendations stay optional. Explanations stay generative. Each layer improves independently.
That's not just engineering. That's governance.