TL;DR
Model drift is when the model behind a fixed model ID changes behaviour without you changing anything. Prompt regression is the symptom: a prompt that used to pass starts failing. PR-time eval tools structurally miss this, because drift happens between your commits — and a check that only runs on pull requests is never running when drift hits.
To catch it you need three things working together:
- Scheduled evals — run a small eval set on a cron (e.g. daily), not just on PRs.
- Baseline compare — store a known-good pass-rate and compare every run to it.
- Alerting — page a human (issue, Slack, email) the moment the pass-rate drops.
Tool landscape, honestly: promptfoo is the strong incumbent for rich PR-time evals. promptdrift is a small single-purpose tool for the scheduled-drift-alarm job. Several platforms (LangSmith, Langfuse, Arize Phoenix, Helicone, and others) overlap from the observability angle. Most can be wired to do drift detection; what matters is that something runs on a schedule and compares to a baseline.
What is LLM model drift?
LLM model drift is when the model behind a fixed model identifier changes behaviour over time without any change on your side. You call the same model ID, send the same prompt, set the same temperature — and get a materially different answer than you did a month ago.
This happens because, with a hosted API, you do not run the model. The provider does. Between your
calls they can retrain or fine-tune, re-quantize for cheaper serving, change the routing or system-level
defaults, patch safety filters, or — most commonly — repoint a floating alias like
-latest at a newer snapshot. From the outside the model ID string is identical. The
behaviour is not.
Concrete shapes this takes in production:
- A JSON-extraction prompt starts wrapping its output in a Markdown ```json fence, so your JSON.parse throws on responses that parsed fine yesterday.
- A classifier that returned bare labels (positive) starts returning polite sentences ("The sentiment here is positive."), breaking a downstream string match.
- A model that reliably refused an unsafe request starts complying, the regression that promptfoo's own blog post titled "Your model upgrade just broke your agent's safety" is about.
- A summarizer gets more verbose after a version bump, blowing past a token budget and a UI layout.
None of these involved a code change. That is what makes drift insidious: there is no diff to review, no failing PR, no deploy to roll back. The only signal is the output, and you only see the output if you are still looking after you shipped.
Model drift vs data drift vs prompt regression
These terms get blurred. They name different things, and the fix for each is different.
| Term | What changes | Cause | How you catch it |
|---|---|---|---|
| Model drift | The model's behaviour for the same inputs | Provider changes the model behind a fixed ID | Scheduled eval of fixed cases vs a baseline |
| Data drift | The distribution of inputs your system sees | The real world changes (new topics, slang, users) | Monitor input distributions + outcome metrics in production |
| Prompt regression | A prompt's output quality drops | Either of the above, or a change you made | Compare current output against a known-good baseline |
So prompt regression is the observable symptom; model drift and data drift are two of its causes (your own edits are a third). This guide focuses on regression caused by model drift, because it is the cause that standard CI is structurally blind to — your own edits show up as a diff, and data drift shows up in production analytics, but model drift leaves no trace anywhere in your own systems.
Why does a fixed model ID change behaviour?
It helps to be specific about the mechanisms, because each one suggests a different mitigation.
1. Floating aliases get repointed
Aliases such as claude-3-5-sonnet-latest or a bare gpt-4o are designed to
track "the current best version". When the provider ships a new snapshot, the alias points at it. You
asked for "latest" and you got it — which is exactly the problem if your prompt was tuned against the
previous one. This is the most common and most preventable form of drift.
2. Versioned snapshots still get deprecated
Pinning a dated snapshot (e.g. a -2024-10-22 style ID) buys stability, but providers
retire old snapshots on a deprecation timeline. When the retirement date arrives you are forced to
migrate, and the migration is itself a drift event — a planned one, but one that can break prompts.
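One way to keep a retirement date from arriving as a surprise is a small check that runs alongside your other scheduled jobs. The following is a minimal sketch under stated assumptions: the model ID, the retirement date, and both function names are hypothetical placeholders, and real dates have to be copied by hand from your provider's deprecation page.

```python
from datetime import date
from typing import Optional

# Hypothetical pinned model ID mapped to a placeholder retirement date.
# Replace with the dates your provider actually publishes.
RETIREMENTS = {
    "example-model-2024-10-22": date(2099, 1, 1),
}


def days_until_retirement(model_id: str, today: date) -> Optional[int]:
    """Return the days left before the snapshot is retired, or None if unknown."""
    retire_on = RETIREMENTS.get(model_id)
    if retire_on is None:
        return None
    return (retire_on - today).days


def needs_migration_plan(model_id: str, today: date, warn_days: int = 60) -> bool:
    """True once the retirement date is within the warning window."""
    remaining = days_until_retirement(model_id, today)
    return remaining is not None and remaining <= warn_days
```

A model ID that is missing from the table returns False here, so it is worth treating "unknown" as its own warning in a real setup.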
3. Infrastructure-level changes
Even when the weights are nominally fixed, serving can change: quantization for cheaper inference, different hardware or kernels, routing across regions or capacity pools, and changes to provider-side system prompts or default sampling. These can shift outputs subtly without any "new model" announcement.
4. Safety / policy updates
Providers continuously update safety classifiers and refusal behaviour. A prompt that sat near a policy boundary can flip from "allowed" to "refused" (or vice-versa) without any model version change.
Why PR-time eval tools structurally miss model drift
The standard, good advice for LLM reliability is "write evals and run them in CI". That is correct and you should do it. But notice when CI runs: on a push or a pull request — i.e. when you change your code. The mental model is the same as a unit test: the code is the variable, the dependencies are fixed, so you test on change.
Model drift breaks that assumption. The variable that changed is the model, on the provider's side, with no commit in your repository. So:
- There is no PR to trigger the eval.
- There is no diff for a reviewer to catch.
- Your CI is green, because nothing in CI ran — and "green CI" reads as "everything is fine".
This is not a criticism of PR-time eval tools; it is a description of their job. PR-time evals catch regressions you introduce, which is genuinely valuable and not what scheduled checks are for. The two are complementary halves of the same discipline, not competitors. The mistake is believing that having PR-time evals means you are covered against drift. You are not.
How do you detect prompt regression from model drift?
The pattern is small and provider-agnostic. Three moving parts:
- A scheduled trigger. Run the eval on a cron — daily is a sensible default; hourly if a regression is expensive and you want to bound exposure. The schedule is the part that makes this work; everything else is mechanics.
- A baseline compare. Store a known-good result (e.g. "pass-rate was 100% on these 12 cases"). Every scheduled run re-computes the pass-rate and compares it to the stored baseline.
- An alert on regression. When the pass-rate drops below the baseline (minus a tolerance you choose), notify a human and fail loudly — open an issue, post to Slack, send an email, flip a status badge red. Silence on regression is the failure mode you are designing against.
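For the alerting part, a Slack incoming webhook is one of the simpler channels: it accepts an HTTP POST with a JSON body containing a `text` field. The sketch below builds such a request with the standard library only. The function names are hypothetical, and the webhook URL is something you would read from a secret, never hard-code.

```python
import json
import urllib.request


def build_slack_alert(webhook_url, baseline_rate, current_rate, newly_failing):
    """Build (but do not send) a POST request describing the regression."""
    failing = ", ".join(newly_failing) or "none"
    text = (
        f"Prompt regression: pass-rate {baseline_rate:.0%} -> {current_rate:.0%}. "
        f"Newly failing: {failing}"
    )
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=10):
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping the request builder separate from the network call makes the message format testable without touching Slack.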
What goes in the eval set
Keep it small, deterministic, and behaviour-focused — this runs unattended, repeatedly, so it must be cheap and stable:
- Pin the model ID in the eval config. Don't eval -latest, or you can't tell drift from an alias repoint (though watching -latest on purpose is a valid way to get early warning of what's coming).
- Use checks that don't need a human: substring contains / not-contains, regex, exact match, JSON-schema validation. These are stable and free to run. Reserve LLM-as-judge for cases that genuinely need it, since a judge model can itself drift.
- Encode the contracts that matter: "output is valid JSON matching this schema", "refuses to reveal the system prompt", "answer contains the right entity", "stays under N tokens".
- Set temperature to 0 where you can, to remove sampling noise from the signal.
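The human-free checks listed above are small enough to sketch in a few lines. This is an illustrative dispatcher, not any particular tool's API: the `run_check` name and the check dictionary shape are assumptions, and the `json-keys` check is a deliberately simplified stand-in for full JSON-schema validation (it only verifies required keys and their Python types).

```python
import json
import re


def run_check(check: dict, output: str) -> bool:
    """Evaluate one deterministic check against a model output."""
    kind, value = check["type"], check.get("value")
    if kind == "contains":
        return value in output
    if kind == "not-contains":
        return value not in output
    if kind == "regex":
        return re.search(value, output) is not None
    if kind == "equals":
        return output.strip() == value
    if kind == "json-keys":
        # Simplified stand-in for JSON-schema validation:
        # the output must be a JSON object with these keys and types.
        try:
            data = json.loads(output)
        except json.JSONDecodeError:
            return False
        if not isinstance(data, dict):
            return False
        return all(isinstance(data.get(key), typ) for key, typ in value.items())
    raise ValueError(f"unknown check type: {kind}")
```

Because every branch is a pure function of the output string, the same output always yields the same verdict, which is what makes the pass-rate comparable between runs.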
The minimal shape, in pseudo-code
```
baseline = load("baseline.json")   # {"pass_rate": 1.0, "passed": [ids of passing cases]}
results = run_cases(cases, model="pinned-model-id", temperature=0)
pass_rate = passed(results) / len(results)
newly_failing = baseline.passed - passed_ids(results)   # set delta: which cases broke
if baseline.pass_rate - pass_rate > tolerance:          # tolerance: allowed drop, e.g. 0.0
    alert("prompt regression: {b} -> {n}, newly failing: {cases}"
          .format(b=baseline.pass_rate, n=pass_rate, cases=newly_failing))
    exit(1)   # fail loudly
exit(0)
```
That is the whole idea. The hard part is not the code — it's the discipline of running it on a schedule forever and acting on the alert. Plenty of teams write the eval and then only ever run it by hand, which reintroduces the exact gap.
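For readers who want something executable, here is one possible runnable rendering of the pseudo-code above. It is a sketch, not a reference implementation: `call_model` is a stub that returns canned text in place of a real provider call, the two cases are invented for illustration, and the baseline is an inline dictionary where a real job would load a committed file.

```python
import json


def call_model(prompt):
    # Stub standing in for a provider call with a pinned model ID and temperature 0.
    # The second answer simulates drift: a sentence where a bare label was expected.
    canned = {
        "Extract the city as JSON: I live in Oslo": '{"city": "Oslo"}',
        "Label the sentiment: great product": "The sentiment here is positive.",
    }
    return canned[prompt]


CASES = [
    {"id": "json-city", "prompt": "Extract the city as JSON: I live in Oslo",
     "check": lambda out: json.loads(out)["city"] == "Oslo"},
    {"id": "bare-label", "prompt": "Label the sentiment: great product",
     "check": lambda out: out.strip() == "positive"},
]


def run_cases(cases, model_fn):
    """Return the set of case ids whose check passed."""
    passed = set()
    for case in cases:
        try:
            if case["check"](model_fn(case["prompt"])):
                passed.add(case["id"])
        except Exception:
            pass  # a check that raises counts as a failure
    return passed


def compare(baseline, passed, total, tolerance=0.0):
    """Return (regressed, pass_rate, newly_failing) relative to the baseline."""
    pass_rate = len(passed) / total
    newly_failing = sorted(set(baseline["passed"]) - passed)
    regressed = baseline["pass_rate"] - pass_rate > tolerance
    return regressed, pass_rate, newly_failing


baseline = {"pass_rate": 1.0, "passed": ["bare-label", "json-city"]}
passed = run_cases(CASES, call_model)
regressed, pass_rate, newly_failing = compare(baseline, passed, len(CASES))
exit_code = 1 if regressed else 0
print(f"pass-rate {baseline['pass_rate']:.0%} -> {pass_rate:.0%}, "
      f"newly failing: {newly_failing}, exit code {exit_code}")
```

With the stubbed drift, the bare-label case fails and the run reports a regression. In a scheduled job you would finish with `sys.exit(exit_code)` so the workflow itself goes red.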
Handling false alarms and legitimate changes
Sometimes the model legitimately changes for the better, or your contract genuinely moves. The flow then is: review the diff, confirm the new behaviour is acceptable, and re-baseline (record the new pass-rate as the new known-good). Re-baselining should be a deliberate, reviewed, committed act — never an automatic "just accept whatever it does now", which would silently defeat the whole point.
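One way to keep re-baselining deliberate is to make it impossible for a normal scheduled run to write the baseline at all. The sketch below gates the write behind an explicit command-line flag. The flag and function names are assumptions made for this example and do not describe any specific tool's implementation.

```python
import argparse
import json
from pathlib import Path


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Scheduled drift check (sketch)")
    parser.add_argument("--baseline", default="baseline.json")
    parser.add_argument("--update-baseline", action="store_true",
                        help="Record the current results as the new known-good baseline")
    return parser.parse_args(argv)


def render_baseline(passed_ids, total):
    """Serialize the current results in a stable, diff-friendly form."""
    payload = {"pass_rate": len(passed_ids) / total, "passed": sorted(passed_ids)}
    return json.dumps(payload, indent=2) + "\n"


def maybe_update_baseline(args, passed_ids, total):
    """Write the baseline only when the flag was passed; return whether it was written."""
    if not args.update_baseline:
        return False
    Path(args.baseline).write_text(render_baseline(passed_ids, total), encoding="utf-8")
    return True
```

Because the baseline is an ordinary file with sorted, indented contents, the change shows up as a readable diff that goes through normal review before it is committed.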
How do you measure drift honestly?
"Measure" deserves its own section because this is where it is easy to fool yourself.
- Pass-rate against fixed cases over time is the honest primary metric. It is comparable run-to-run precisely because the cases don't change.
- Newly-failing cases (the set delta, not just the aggregate) is what makes an alert actionable — "pass-rate dropped 8%" is noise; "the JSON-schema case is now failing" is a ticket.
- Beware tiny samples. A 3-case eval that drops from 100% to 66% might be one flaky run, not drift. Use enough cases that a single sampling fluke doesn't trip the alarm, and/or require N consecutive failing runs before paging. (This studio's own rule: don't draw conclusions from tiny samples.)
- Don't fabricate benchmarks. Resist the urge to publish "model X drifted Y%" numbers unless you can reproduce them on a fixed, disclosed eval set. Drift is real; specific cross-model drift percentages are very sensitive to the eval set and are usually not generalizable. This guide deliberately contains no such numbers.
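The set delta and the N-consecutive-failures rule from the list above fit in two short functions. This is a sketch with hypothetical names. It assumes you keep a history of pass-rates ordered oldest to newest, and the default of two consecutive runs is an arbitrary illustration, not a recommendation.

```python
def newly_failing(baseline_passed, current_passed):
    """Case ids that passed in the baseline but do not pass now."""
    return sorted(set(baseline_passed) - set(current_passed))


def should_page(pass_rates, baseline_rate, tolerance=0.0, consecutive=2):
    """Page only if the most recent `consecutive` runs all fell below the baseline.

    pass_rates is ordered oldest to newest.
    """
    if len(pass_rates) < consecutive:
        return False
    recent = pass_rates[-consecutive:]
    return all(baseline_rate - rate > tolerance for rate in recent)
```

A single bad run followed by a recovery never pages under this rule, at the cost of delaying a real alert by one schedule interval for each extra run you require.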
The honest tool landscape
There is no single category called "model drift detector"; the job is assembled from eval frameworks, observability platforms, and small purpose-built tools. Here is a fair lay of the land. None of these is "the answer" — they occupy different points on the trigger/scope axis.
| Tool / category | Primary job | Typical trigger | Catches drift out of the box? |
|---|---|---|---|
| promptfoo | Rich, declarative eval & red-teaming framework | PR / CI (schedulable, but you wire it) | Only if you add a schedule + baseline + alert; superb at the eval part |
| promptdrift | Single-purpose scheduled drift alarm | Cron (GitHub Action) + on-demand | Yes — that one job is its whole reason to exist |
| LangSmith | Tracing, datasets, evals (LangChain ecosystem) | Production traces + manual/scheduled eval runs | Via datasets + scheduled evals you configure; broad platform |
| Langfuse | Open-source LLM observability + evals | Production traces + eval jobs | Via scheduled/dataset evals you configure |
| Arize Phoenix | Open-source LLM/ML observability & eval | Production traces + eval runs | Built for monitoring; drift is one thing you can set up |
| Helicone / other gateways | Proxy logging, cost, latency, output capture | Live proxied traffic | Gives you the data to spot drift; alarm/baseline is on you |
| Roll your own | A cron job + a few assertions + an alert | Whatever you cron | Yes — the pattern is small; this is a legitimate option |
Categorizations reflect typical/default usage as of mid-2026 and the general design of each tool; every one of these is configurable, actively developed, and may have shipped features since. Verify against current docs before choosing. The honest summary: the heavy platforms can all do drift detection if you configure scheduled evals and alerting; the difference is how much assembly is required and how much surface area you take on.
Does promptfoo catch model drift?
promptfoo is the strongest open eval framework around and worth adopting for PR-time evals and red-teaming. On the specific question: in its common usage it runs in CI on a code change, so by default it isn't running when drift hits. You can schedule it and add a baseline comparison and an alert — promptfoo is flexible enough. The point isn't "promptfoo can't"; it's that catching drift requires the scheduled-trigger + baseline + alert pattern regardless of which tool implements it. Many teams run promptfoo for rich PR evals and a tiny scheduled job for the drift alarm.
Where promptdrift fits (one honest option)
promptdrift is a small open-source tool
(MIT, zero runtime dependencies) that does exactly the scheduled-drift-alarm job and nothing else: it
runs a small eval set on a cron, compares the pass-rate to a stored baseline, and opens/updates a single
GitHub issue (and fails the run, flipping a status badge red) the moment it regresses. It supports
Anthropic and OpenAI, and uses simple checks (contains, not-contains,
regex, equals, json-schema).
It is deliberately not a promptfoo replacement and makes no claim to be a better evaluator — its only value is the scheduled baseline-compare-and-alert mechanism wrapped up so you don't assemble it yourself. If you already run a platform that can be scheduled, you may not need it; if you want the drift alarm in one config file and one workflow, it's one reasonable choice among the options above. We list it here because we maintain it — and because it's genuinely the narrow shape this guide recommends, not because it's the only way to get there.
If you want the scheduled-alarm pattern as a drop-in
One config file plus a scheduled GitHub Action. Open source, MIT, no telemetry, keys read from env only (never logged or stored):
```shell
npx @wartzar-bee/promptdrift --update-baseline
```

```yaml
# .github/workflows/promptdrift.yml
on:
  schedule:
    - cron: "0 8 * * *"   # daily at 08:00 UTC; catches drift between PRs
  workflow_dispatch: {}
permissions:
  issues: write
  contents: read
jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: wartzar-bee/promptdrift@v0
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        with:
          config: promptdrift.json
          baseline: .promptdrift-baseline.json
```
Disclosure: promptdrift is built by the same people who wrote
this guide. The detection pattern above works with any tool — or with a cron job you write yourself.
Use what fits. Source: github.com/wartzar-bee/promptdrift
· package: @wartzar-bee/promptdrift on npm.
Detection & mitigation checklist
A practical sequence, roughly in order of leverage. You do not need all of it on day one — items 1–4 get you most of the protection.
Reduce your exposure
- Pin a versioned/dated model snapshot, not a floating -latest alias, in production. This eliminates the most common drift source.
- Track provider deprecation calendars so a forced migration is planned, not a surprise outage.
- Make outputs robust: parse defensively (tolerate a Markdown JSON fence), validate against a schema, and fail closed with a safe fallback rather than propagating malformed output.
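Defensive parsing, as described in the last item, might look like the following sketch. It tolerates a single Markdown code fence around the JSON, checks that required keys are present, and returns a caller-supplied fallback instead of raising. The function name and the required-keys check are assumptions for illustration, and a full schema validator could replace the key check.

```python
import json
import re

# Matches an output that is entirely one fenced block, with an optional language tag.
_FENCE = re.compile(r"^\s*`{3}[a-zA-Z]*\s*\n(.*?)\n\s*`{3}\s*$", re.DOTALL)


def parse_json_defensively(raw, required_keys=(), fallback=None):
    """Parse model output as a JSON object, failing closed to `fallback`."""
    match = _FENCE.match(raw)
    candidate = match.group(1) if match else raw
    try:
        data = json.loads(candidate)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(data, dict):
        return fallback
    if any(key not in data for key in required_keys):
        return fallback
    return data
```

Returning the fallback rather than raising keeps malformed output from propagating downstream, though you would still want to log or count those cases so that the repair does not hide drift from your scheduled eval.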
Detect what's left
- Write a small, deterministic eval set that encodes your real contracts (JSON shape, refusals, key entities, length bounds). Temperature 0 where possible.
- Run it on a schedule (cron / scheduled CI), not only on PRs. This is the non-negotiable item.
- Store a baseline and compare every run against it; track the delta (which cases newly fail), not just the aggregate pass-rate.
- Alert loudly on regression — issue, Slack, email, paged on-call. Make sure someone actually receives it.
- Optionally watch the -latest alias in parallel as an early-warning canary for what your pinned version will eventually become.
- Guard against flaky alarms with enough cases and/or an N-consecutive-failures rule, so one noisy run doesn't cry wolf.
Respond well
- Triage the alert: is it real drift, a flake, or a legitimate improvement?
- If drift broke a contract: roll back to a known-good pinned version if still available, or patch the prompt / add output repair, then re-verify.
- If the change is acceptable: re-baseline deliberately (review + commit the new known-good), never automatically.
- Keep a short post-incident note — drift recurs, and a record of "this prompt is sensitive to model changes" is worth more than it costs.
FAQ
What is LLM model drift?
Model drift is when the model behind a fixed model identifier changes behaviour over time without any change on your side. Providers retrain, re-quantize, re-route, patch safety filters, or silently repoint a floating alias to a newer snapshot. The same prompt can produce a different answer even though your code, your prompt, and the model ID string are unchanged.
What is prompt regression?
Prompt regression is when a prompt that previously produced correct output starts producing incorrect or lower-quality output. The cause can be your own change, data drift, or model drift. Detecting it requires comparing current behaviour against a known-good baseline.
How do you detect prompt regression from model drift?
Run a small, deterministic eval set against the live (pinned) model on a schedule, compare the pass-rate to a stored baseline, and alert when it drops below a threshold. The scheduled trigger is the essential part — drift happens between your code changes, so a PR-only check never runs when it occurs.
Does promptfoo catch model drift?
promptfoo is an excellent eval framework, but in common usage it runs on a pull request / code change, so by default it isn't running when drift (which involves no code change) occurs. It can be scheduled with a baseline and alert that you configure. The takeaway is structural: catching drift needs a scheduled trigger plus baseline comparison, whichever tool provides it.
How is model drift different from data drift?
Data drift is when the inputs your system receives change distribution over time (new topics, slang, behaviour). Model drift is when the model itself changes behaviour for the same inputs. Data drift is the world changing; model drift is the provider changing the model underneath you.
Can pinning a model version prevent drift?
Pinning a dated/versioned snapshot instead of a floating alias removes the biggest drift source and is strongly recommended. It does not fully eliminate drift: snapshots get deprecated (forcing migration), infrastructure-level changes can still shift behaviour, and you eventually must move to a new version. Pinning plus a scheduled eval is the durable combination.
How often should the scheduled eval run?
Daily is a sensible default. Run more frequently (hourly) if a regression is expensive and you want to bound how long a drift goes unnoticed; less frequently if the eval is costly and the blast radius of a bad day is small. The exact cadence matters less than the fact that it runs unattended on a schedule at all.
Will this cost a lot in API calls?
A focused eval set (a handful to a few dozen cases) on a cheap model, run daily, is negligible cost for most teams. Keep the set small and deterministic; you are testing contracts, not benchmarking the model.
Is this the same as monitoring production outputs?
Complementary. Production monitoring (logging real traffic, watching outcome metrics) catches data drift and real-world failures but is noisy and hard to attribute. A scheduled eval on fixed cases isolates the model variable, so a change in its pass-rate points specifically at the model. Do both if you can; if you can only do one cheaply, the scheduled fixed-case eval is the highest-signal way to catch model drift specifically.