Karoline Leavitt: simulating the press secretary to predict Polymarket word markets
A local working draft for a Karoline Leavitt LoRA project: forecast briefing language and test whether tradable political wording is predictable.
The core idea is simple: if prediction markets are willing to settle on whether a White House spokesperson says a specific word or phrase, then the wording itself is economically meaningful.
That makes a Karoline Leavitt simulator interesting for more than style transfer: the project earns its keep only if a narrow persona adapter helps predict which phrases, frames, or talking points will actually appear in a future briefing.
The Claim
The strongest version of the project is not "we made a model that sounds like Karoline Leavitt."
It is this:
- White House briefings are constrained enough to be forecastable.
- A press-secretary-specific LoRA may improve forecasts over a generic model.
- Some of those forecasts map onto live, tradable semantics in prediction markets.
If that holds up, this would be closer to a market-facing forecasting system than a mere voice-cloning demo.
Why Karoline Leavitt
Karoline Leavitt is a good target because the role is repetitive, public, adversarial, and tightly coupled to the day's official line.
That means the model can be asked to learn several things at once:
- surface style and cadence,
- stock evasions and pivots,
- topic prioritization,
- how briefing language compresses the administration's current agenda.
The real question is whether those patterns are predictive enough to help on held-out future briefings.
Candidate Data Sources
We probably want a layered corpus rather than a single source.
| Source | What it gives us | Notes |
|---|---|---|
| White House briefings and statements | Official transcripts and release dates | Best gold-standard source when transcripts are published. |
| White House remarks | Same-day context from the president and administration | Useful conditioning data even when it is not Leavitt text. |
| White House videos and official YouTube | Video archive and caption fallback | Important because official transcript coverage may be selective. |
| C-SPAN archive | Dated briefing and interview appearances | Probably the best archive-quality expansion set outside the White House site. |
| Factba.se / Roll Call Factbase | Searchability, dates, calendars, release metadata | Helpful as a retrieval and bookkeeping layer. |
| Polymarket Gamma API | Historical market metadata and labels | Needed for event definitions, market timing, and price snapshots. |
Example official transcript page:
- <a href="https://www.whitehouse.gov/briefings-statements/2025/01/press-briefing-by-press-secretary-karoline-leavitt/" target="_blank" rel="noopener noreferrer">Press Briefing by Press Secretary Karoline Leavitt</a>
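On the Gamma side, the first bookkeeping step is selecting wording-sensitive markets out of general market metadata. The sketch below filters already-fetched records; the `question` field and the `say`/`mention` cues are illustrative assumptions, not a verified Gamma API schema:

```python
# Sketch: filter Polymarket-style market metadata down to "say X" / "mention X"
# wording markets. The record shape (a "question" field) is an assumption here,
# not a documented Gamma API schema.
from typing import Iterable

WORDING_CUES = ("say", "mention")  # heuristic cues for wording-sensitive markets

def is_wording_market(question: str) -> bool:
    """Heuristic: does the market question hinge on a spoken word or phrase?"""
    padded = f" {question.lower()} "
    return any(f" {cue} " in padded for cue in WORDING_CUES)

def select_wording_markets(markets: Iterable[dict]) -> list[dict]:
    """Keep only markets whose question text looks wording-sensitive."""
    return [m for m in markets if is_wording_market(m.get("question", ""))]
```

A keyword filter like this will over- and under-select; a manual pass over the candidates is still needed before any market enters the benchmark.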
Immediate corpus-building note: the safest plan is to keep only dated, source-linked text with strict event-level deduplication. The same briefing may appear as an official transcript, a video posting, a C-SPAN entry, and a searchable archive record.
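The dedup note above can be sketched as a keyed collapse over dated records, keeping the best source per event. The source-quality ranking, the record fields, and the prefix-based event key are all heuristic assumptions:

```python
# Sketch of event-level deduplication: collapse multiple records of the same
# briefing (official transcript, caption, C-SPAN entry, archive copy) into one
# item, keeping the highest-quality source. Ranking and key are assumptions.
import re

SOURCE_RANK = {"official_transcript": 0, "official_caption": 1,
               "broadcast_transcript": 2, "recovered": 3}  # lower = better

def event_key(record: dict) -> tuple:
    """Dedup key: date plus a normalized prefix of the text (heuristic)."""
    prefix = re.sub(r"\W+", " ", record["text"].lower()).strip()[:200]
    return (record["date"], prefix)

def dedupe(records: list[dict]) -> list[dict]:
    """Keep one record per event key, preferring higher-quality sources."""
    best: dict[tuple, dict] = {}
    for r in records:
        k = event_key(r)
        if k not in best or SOURCE_RANK[r["source_type"]] < SOURCE_RANK[best[k]["source_type"]]:
            best[k] = r
    return list(best.values())
```

A text-prefix key only catches near-identical copies; caption and transcript versions of the same briefing usually differ, so a date-plus-speaker key or fuzzy matching would be needed as a second pass.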
Why Word Markets Matter
Prediction markets have repeatedly listed contracts on whether a politician will say a particular word or phrase in a speech or briefing.
That means there is already a market for narrow semantic events.
Examples:
- <a href="https://polymarket.com/event/will-president-biden-say-folks-in-his-first-joint-address" target="_blank" rel="noopener noreferrer">Will President Biden say "folks" in his first joint address?</a>
- <a href="https://polymarket.com/event/will-president-biden-mention-coronavirus-3-or-more-times-in-his-first-joint-address" target="_blank" rel="noopener noreferrer">Will Biden mention coronavirus 3+ times?</a>
- <a href="https://polymarket.com/event/will-president-biden-mention-cryptocurrency-in-the-2022-state-of-the-union-address" target="_blank" rel="noopener noreferrer">Will Biden mention cryptocurrency in the 2022 State of the Union?</a>
- <a href="https://polymarket.com/event/will-president-biden-mention-swift-in-the-2022-state-of-the-union-address" target="_blank" rel="noopener noreferrer">Will Biden mention SWIFT?</a>
- <a href="https://polymarket.com/event/will-biden-say-recession-in-his-tuesday-speech-on-inflation" target="_blank" rel="noopener noreferrer">Will Biden say recession?</a>
This is the cleanest motivation for the project. If markets settle on wording, then a simulator that improves wording forecasts is potentially useful in a way that ordinary political style transfer is not.
Proposed Benchmark
The benchmark should be forward-only and date-clean.
Train on briefings up to date T, validate on the next block, and test only on later briefings. No random split. No splitting individual Q-and-A segments from the same briefing across train and test.
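A minimal sketch of that split, assuming each item is a whole briefing carrying an ISO date string. Because the partition operates on briefing dates, Q-and-A segments from one briefing can never straddle the boundary:

```python
# Sketch of a forward-only, date-clean split. ISO date strings ("YYYY-MM-DD")
# compare correctly as plain strings, so no date parsing is needed here.
def forward_split(briefings: list[dict], train_end: str, val_end: str):
    """Partition briefings: train <= train_end < val <= val_end < test."""
    train = [b for b in briefings if b["date"] <= train_end]
    val = [b for b in briefings if train_end < b["date"] <= val_end]
    test = [b for b in briefings if b["date"] > val_end]
    return train, val, test
```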
The first pass should probably score four task types:
1. Phrase appearance: does a target word or phrase appear at all?
2. Talking-point ranking: which themes are most likely to show up?
3. Likely-answer generation: how would Leavitt answer a specific reporter question?
4. Market calibration: how do the model's probabilities compare with Polymarket prices?
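The phrase-appearance target reduces to counting mentions in a transcript, with the caveat that real market settlement can hinge on exact contract rules, so a scorer like this is an approximation, not a resolution source:

```python
# Sketch of the phrase-appearance target: count mentions of a phrase in a
# transcript. Covers both "says X" (threshold 1) and "mentions X N+ times"
# style markets. Approximate only; settlement follows each contract's rules.
import re

def mention_count(transcript: str, phrase: str) -> int:
    """Case-insensitive whole-word count of a phrase."""
    pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
    return len(re.findall(pattern, transcript.lower()))

def phrase_appears(transcript: str, phrase: str, threshold: int = 1) -> bool:
    """Binary label for wording markets with an optional mention threshold."""
    return mention_count(transcript, phrase) >= threshold
```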
Core metrics:
- Brier score and log loss for binary phrase markets,
- precision/recall or F1 for mention detection,
- calibration plots,
- LoRA vs base ablations,
- retrieval-only and market-only baselines.
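The binary-market metrics above are cheap to compute directly; a minimal sketch, taking model probabilities and resolved 0/1 outcomes:

```python
# Sketch of the core binary-market metrics: Brier score, log loss, and an
# equal-width calibration table (the numeric input to a calibration plot).
import math

def brier(probs, outcomes):
    """Mean squared error between probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def log_loss(probs, outcomes, eps=1e-12):
    """Mean negative log-likelihood, with clipping to avoid log(0)."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)

def calibration_bins(probs, outcomes, n_bins=10):
    """(mean predicted, observed frequency, count) per nonempty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    return [(sum(p for p, _ in b) / len(b), sum(y for _, y in b) / len(b), len(b))
            for b in bins if b]
```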
The benchmark should also separate three different sources of predictive power:
- persona/style signal,
- same-day administration signal,
- generic current-events signal.
Otherwise we could mistake "the model read today's agenda correctly" for "the Karoline adapter added something real."
Leakage And Contamination Risks
This project is only interesting if the held-out set is genuinely future-facing.
Main risks:
- the same event appears in multiple archives,
- later transcripts leak into the training corpus through scraped mirrors,
- models may already know older briefings from pretraining,
- market settlement can hinge on exact rules rather than intuitive wording.
That last point matters a lot. A model can be directionally right about the briefing and still lose the market if the contract resolves on one exact phrase.
The cleanest version of the benchmark would use materially post-cutoff Leavitt briefings for the main test set, especially for open models whose pretraining windows likely end before the relevant White House period.
Prior Work And Framing
Several adjacent literatures matter here.
| Reference | Why it matters |
|---|---|
| Argyle et al. - Out of One, Many | Political-science precedent for using language models as simulated human samples. |
| Horton - Homo Silicus | Frames LLMs as simulated agents rather than just text generators. |
| Park et al. - Generative Agents | Useful conceptual framing for persistent role-consistent simulacra. |
| Approaching Human-Level Forecasting with Language Models | Bridge from persona simulation to forecast quality. |
| Exploring Decentralized Prediction Markets on Polymarket | Useful grounding if we want to justify Polymarket as a serious target. |
| Hu et al. - LoRA | Base adaptation method for the project. |
| RoleLLM | Closest benchmark literature on role-playing ability. |
| Are Large Language Models Actually Good at Text Style Transfer? | Good caution against over-reading surface imitation. |
The article should probably lean on one idea above all others: success here would not prove that a model can "know the future." It would show that official political language is structured and coordinated enough to be forecastable.
Concrete Research Questions
- Does a Karoline-specific LoRA beat the base model on held-out phrase prediction?
- Does it beat a retrieval-only baseline built from prior briefings?
- Does it add anything on top of same-day White House context?
- Can it beat Polymarket prices on a subset of wording markets, or at least improve calibration when combined with them?
- Does the adapter mostly change tone, or does it genuinely improve content prediction?
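The market-combination question admits a simple first baseline: blend model and market probabilities in logit space and fit the weight on the validation block. This is a sketch of one reasonable combination rule, not a claim about the right one:

```python
# Sketch: logit-space mixture of a model probability and the market-implied
# probability. w = 0 recovers the market, w = 1 recovers the model; w is a
# hyperparameter to fit on the validation block.
import math

def logit(p: float) -> float:
    return math.log(p / (1 - p))

def sigmoid(x: float) -> float:
    return 1 / (1 + math.exp(-x))

def blend(model_p: float, market_p: float, w: float = 0.5) -> float:
    """Weighted logit-space average of model and market probabilities."""
    return sigmoid(w * logit(model_p) + (1 - w) * logit(market_p))
```

If the fitted w on held-out markets is near zero, the adapter is adding nothing over the market; that is a useful negative result in itself.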
Immediate Build Plan
1. Build a dated corpus of official Leavitt appearances with source URLs and deduplication.
2. Mark each item by source quality: official transcript, official caption, broadcast transcript, or manually recovered transcript.
3. Build a companion context dataset from same-day White House remarks, calendar items, executive actions, and major news triggers.
4. Collect historical wording-sensitive prediction markets and normalize their settlement rules.
5. Run a baseline stack before any LoRA training: generic frontier model, base local model, retrieval-only baseline, and market-implied probabilities.
6. Only then test whether a LoRA adds incremental predictive signal.
Open Questions
- How much official Leavitt transcript material actually exists in reusable text form?
- How often do briefing words matter because of Leavitt specifically, versus because the administration's line was already obvious?
- Which market formulations are robust enough for evaluation, and which are too settlement-fragile?
- Is the right unit a full simulated briefing, or a narrower next-answer / phrase-probability task?
For now, the right posture is to treat this as a research workbench: collect the corpus, define the clean benchmark, and resist the temptation to celebrate a style clone before we know whether it predicts anything tradable.