Model Eval


Purpose

This document defines how prompt refactors are evaluated to ensure output quality improves or remains stable, and that chained workflows do not regress as prompts are standardized and variable-driven inputs are introduced. The goal is to make prompt changes safe to ship in a production system by combining structured test sets, clear quality metrics, and release gates.

1. Scope

In scope are individual prompts and templates, multi-step prompt workflows where prompts depend on other prompts, and variable-driven prompts that require user or system context to complete a task. The evaluation focuses on prompt-layer behavior and workflow stability.

Out of scope are model training changes and full go-to-market measurement. This evaluation is designed to validate prompt refactors and protect production reliability, not to measure broader business outcomes.

2. Risk model

The evaluation targets three core risks. First, prompts can degrade into generic advice when required information is missing, which looks helpful but fails to complete the task. Second, chained workflows can break when upstream prompt outputs shift formatting or semantics, creating subtle downstream failures. Third, inconsistent prompt structure across the library can lead to unpredictable outputs and higher support burden, especially as more teams contribute templates over time.

3. Test sets

3.1 Golden set

A stable golden set of representative tasks is maintained to regression-test core prompts before and after each refactor. The set includes the most reused prompts, prompts with known historical regressions, and prompts that serve as upstream anchors for multiple workflows. This golden set is the primary baseline for determining whether a refactor improves reliability or introduces new failure modes.
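A golden-set case can be captured as a small structured record plus a pass check. The sketch below is a minimal illustration, not a prescribed implementation; the `GoldenCase` record, the `passes` check, and the idea of validating top-level output keys are all hypothetical choices for how such a case might be encoded.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class GoldenCase:
    """One regression-test case for a core prompt (illustrative structure)."""
    prompt_id: str                      # which prompt/template is under test
    inputs: dict = field(default_factory=dict)   # variables the template is filled with
    expected_keys: tuple = ()           # top-level keys the output must contain

def passes(case: GoldenCase, output: dict) -> bool:
    """A case passes when every expected key is present and non-empty."""
    return all(output.get(key) not in (None, "") for key in case.expected_keys)

case = GoldenCase("summarize_ticket",
                  {"ticket": "Login fails on SSO"},
                  ("summary", "severity"))
assert passes(case, {"summary": "SSO login failure", "severity": "high"})
assert not passes(case, {"summary": "SSO login failure"})  # missing key fails
```

Freezing the dataclass keeps golden cases immutable between the before and after runs, so both sides of a comparison are guaranteed to see identical inputs.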

3.2 Chain integrity set

For every critical workflow chain, tests are run end-to-end to validate that downstream steps still function when upstream prompts change. The chain set includes runs where all expected inputs are present, as well as stress cases where formatting shifts slightly to ensure downstream steps remain robust. This test set exists because chain coupling is a primary source of regressions in scaled prompt systems.
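A chain-integrity test can be sketched as a two-step pipeline where the downstream step consumes the upstream output, plus a stress case that perturbs formatting. Everything here is an assumed shape: `upstream` and `downstream` are stubs standing in for real prompt calls, and JSON is just one possible inter-step contract.

```python
import json

def upstream(ticket: str) -> str:
    """Upstream prompt stub: emits a JSON summary the next step consumes."""
    return json.dumps({"summary": ticket.strip(), "severity": "high"})

def downstream(upstream_output: str) -> dict:
    """Downstream step stub: parses upstream JSON, tolerating surrounding whitespace."""
    return json.loads(upstream_output.strip())

def chain_passes(ticket: str) -> bool:
    """End-to-end check: the chain passes when the final result has the expected fields."""
    try:
        result = downstream(upstream(ticket))
        return "summary" in result and "severity" in result
    except json.JSONDecodeError:
        return False

assert chain_passes("Login fails on SSO")
# stress case: upstream output wrapped in extra whitespace should still parse
assert downstream("  " + upstream("x") + "\n")["severity"] == "high"
```

The value of writing the check against the final output, rather than each step in isolation, is that it catches the subtle semantic drift the risk model describes: each step can pass its own unit test while the chain as a whole fails.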

3.3 Missing context set

Prompts are tested under realistic missing-information conditions to confirm the variable system behaves correctly. The evaluation checks that the prompt requests only the required missing inputs, that the follow-ups are clear and minimal, and that once inputs are provided the system proceeds to complete the task rather than restarting or producing generic output.
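The "requests only the required missing inputs" condition reduces to a set comparison: the variables the prompt asks for should equal exactly the required variables that were not provided. This is a minimal sketch under that assumption; `missing_required` and `follow_up_is_minimal` are hypothetical helper names.

```python
def missing_required(required: set, provided: dict) -> set:
    """Which required variables were not supplied (or were supplied empty)."""
    return {v for v in required if not provided.get(v)}

def follow_up_is_minimal(requested: set, required: set, provided: dict) -> bool:
    """Passes when the follow-up asks for exactly the missing inputs:
    nothing extra (no over-asking), nothing skipped (no silent guessing)."""
    return requested == missing_required(required, provided)

required = {"audience", "tone", "topic"}
provided = {"topic": "pricing update"}
assert follow_up_is_minimal({"audience", "tone"}, required, provided)
assert not follow_up_is_minimal({"audience", "tone", "name"}, required, provided)  # over-asks
assert not follow_up_is_minimal({"audience"}, required, provided)                  # skips "tone"
```

Treating over-asking and under-asking as distinct failures matters: over-asking inflates user effort, while under-asking usually means the prompt will guess or hallucinate the skipped value.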

4. Metrics and evaluation rubric

4.1 Primary quality metrics

The evaluation tracks task completion success, which measures whether the prompt produces the expected artifact or outcome without manual reconstruction. It also tracks generic advice rate, which captures how often outputs become vague instructions rather than completing the job. Format adherence verifies that outputs match expected schemas and conventions, which matters for both human readability and downstream parsing. Chain pass rate measures end-to-end success for chained workflows, including whether downstream steps can reliably consume upstream outputs.
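All four primary metrics are rates over per-run pass/fail judgments, so they can be aggregated in one place. The sketch below assumes each run has already been labeled with four booleans; the record keys and the `summarize_runs` helper are illustrative, not a fixed schema.

```python
def summarize_runs(runs: list) -> dict:
    """Aggregate per-run boolean labels into the four primary rates."""
    n = len(runs)
    return {
        "task_completion_rate": sum(r["completed"] for r in runs) / n,
        "generic_advice_rate":  sum(r["generic"] for r in runs) / n,
        "format_adherence_rate": sum(r["format_ok"] for r in runs) / n,
        "chain_pass_rate":      sum(r["chain_ok"] for r in runs) / n,
    }

runs = [
    {"completed": True,  "generic": False, "format_ok": True,  "chain_ok": True},
    {"completed": True,  "generic": True,  "format_ok": True,  "chain_ok": False},
    {"completed": False, "generic": True,  "format_ok": False, "chain_ok": True},
    {"completed": True,  "generic": False, "format_ok": True,  "chain_ok": True},
]
metrics = summarize_runs(runs)
assert metrics["task_completion_rate"] == 0.75
assert metrics["generic_advice_rate"] == 0.5
```

Note the directionality: generic advice rate is the only metric here where lower is better, which any regression check built on top of these numbers has to account for.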

4.2 Variable handling metrics

Variable handling is evaluated by checking that missing required inputs are correctly detected, that follow-up questions request only what is needed, and that the workflow successfully recovers once the missing variables are provided. The goal is to minimize back-and-forth while ensuring the prompt does not guess or hallucinate required data.
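These three checks can be rolled up into rates across a test set. This is a minimal sketch assuming each case records the truly missing variables, the variables the prompt asked for, and whether the workflow recovered; the metric names (`exact_detection_rate`, `over_ask_rate`, `recovery_rate`) are hypothetical.

```python
def variable_handling_scores(cases: list) -> dict:
    """Rates over missing-context test cases (illustrative rollup)."""
    n = len(cases)
    detected  = sum(c["asked"] == c["missing"] for c in cases)        # exact detection
    over_ask  = sum(bool(c["asked"] - c["missing"]) for c in cases)   # asked for extras
    recovered = sum(c["recovered"] for c in cases)                    # completed after inputs arrived
    return {
        "exact_detection_rate": detected / n,
        "over_ask_rate": over_ask / n,
        "recovery_rate": recovered / n,
    }

cases = [
    {"missing": {"tone"}, "asked": {"tone"},         "recovered": True},
    {"missing": {"tone"}, "asked": {"tone", "name"}, "recovered": True},   # over-asks
    {"missing": set(),    "asked": set(),            "recovered": True},   # nothing to ask
]
scores = variable_handling_scores(cases)
assert scores["exact_detection_rate"] == 2 / 3
assert scores["over_ask_rate"] == 1 / 3
assert scores["recovery_rate"] == 1.0
```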

4.3 Human review rubric

For a subset of prompts, reviewers score outputs on correctness, completeness, specificity, user effort, and consistency across repeated runs. This rubric complements automated checks by capturing the practical user experience, especially for nuanced writing tasks or multi-format deliverables.

5. Experiment design and gating

5.1 Baseline and comparison

Each refactored prompt is evaluated against a baseline using the golden set. The same test cases are run before and after the refactor, and changes are compared using the rubric and failure taxonomy. When baseline coverage is incomplete or when risk is high, changes are treated as staged decisions with smaller controlled tests and incremental rollouts.
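The before/after comparison is a per-metric delta with a regression flag, inverting the sign for metrics where higher is worse. The sketch below is one possible shape for that comparison; the `tolerance` parameter and the `higher_is_worse` set are assumptions, not values the document prescribes.

```python
def compare_to_baseline(baseline: dict, refactored: dict, tolerance: float = 0.0) -> dict:
    """Per-metric delta between baseline and refactored runs on the same test cases.
    A metric regresses when it moves in the bad direction by more than `tolerance`."""
    higher_is_worse = {"generic_advice_rate"}   # assumed: only this metric is inverted
    report = {}
    for metric, base in baseline.items():
        delta = refactored[metric] - base
        if metric in higher_is_worse:
            regressed = delta > tolerance       # going up is bad
        else:
            regressed = delta < -tolerance      # going down is bad
        report[metric] = {"delta": round(delta, 4), "regressed": regressed}
    return report

baseline   = {"task_completion_rate": 0.90, "generic_advice_rate": 0.10}
refactored = {"task_completion_rate": 0.93, "generic_advice_rate": 0.12}
report = compare_to_baseline(baseline, refactored, tolerance=0.01)
assert not report["task_completion_rate"]["regressed"]   # improved
assert report["generic_advice_rate"]["regressed"]        # +0.02 exceeds the 0.01 tolerance
```

Running identical cases on both sides is what makes the delta meaningful; comparing different case sets would confound prompt changes with case difficulty.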

5.2 Release gates

A refactored prompt can ship when the golden set shows no regressions for high impact prompts, chain integrity tests pass for any linked workflows, missing-context tests confirm the variable follow-up behavior works, and dependency checks are updated when prompt contracts change. These gates ensure the system does not trade consistency improvements for hidden production instability.
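Because the gates are conjunctive, the ship decision reduces to a single all-gates-hold check. This is a deliberately small sketch; the four parameters mirror the four gates above, and the `can_ship` name is a hypothetical convenience.

```python
def can_ship(golden_regressions: int,
             chain_tests_pass: bool,
             missing_context_pass: bool,
             contracts_updated: bool) -> bool:
    """All four release gates must hold before a refactored prompt ships."""
    return (golden_regressions == 0
            and chain_tests_pass
            and missing_context_pass
            and contracts_updated)

assert can_ship(0, True, True, True)
assert not can_ship(1, True, True, True)    # any golden-set regression blocks release
assert not can_ship(0, True, True, False)   # stale dependency checks also block
```

Encoding the gates as a pure function keeps the release decision auditable: the inputs that blocked a ship can be logged alongside the verdict.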

6. Monitoring after release

After release, monitoring focuses on signals that indicate a regression in real usage. This includes spikes in generic advice outputs, increases in follow-up questions per completion, new chain failures clustered around specific upstream prompts, and increases in support tickets tied to particular templates. The intent is to detect issues early and respond with targeted fixes or rollback when needed.
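A simple way to operationalize "spike" detection on any of these rates is to compare the current value against a historical mean. The sketch below is one assumed detector, not a monitoring design the document specifies; the `factor` threshold and `min_points` floor are illustrative defaults.

```python
def spike_alert(history: list, current: float,
                factor: float = 1.5, min_points: int = 3) -> bool:
    """Alert when the current rate exceeds the historical mean by `factor`.
    With too little history there is no baseline, so never alert."""
    if len(history) < min_points:
        return False
    mean = sum(history) / len(history)
    return current > mean * factor

# e.g. daily generic-advice rate: stable around 0.10, today at 0.20
assert spike_alert([0.09, 0.10, 0.11], 0.20)       # 0.20 > 0.10 * 1.5
assert not spike_alert([0.09, 0.10, 0.11], 0.12)   # within normal range
```

The same detector applies per-template, which supports the clustering signal above: a spike isolated to one upstream prompt points at that prompt's refactor rather than a fleet-wide shift.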

7. Reporting and cadence

Evaluation results and refactor progress are reported on a regular cadence, such as weekly, to keep engineering and stakeholders aligned. Reporting includes prompts refactored, pass rates across the golden set and chain tests, notable regressions and mitigations, and any blockers or dependency risks. This creates a tight feedback loop between refactor work, quality assurance, and production stability.