Karoline Leavitt: simulating the press secretary to predict Polymarket word markets
A local working draft for a Karoline Leavitt LoRA project: forecast briefing language and test whether tradable political wording is predictable.
The core idea is simple: if prediction markets are willing to settle on whether a White House spokesperson says a specific word or phrase, then the wording itself is economically meaningful.
That makes a Karoline Leavitt simulator interesting for more than style transfer: the project earns its keep only if a narrow persona adapter helps predict which phrases, frames, or talking points will actually appear in a future briefing.
The Claim
The strongest version of the project is not "we made a model that sounds like Karoline Leavitt."
It is this:
- White House briefings are constrained enough to be forecastable.
- A press-secretary-specific LoRA may improve forecasts over a generic model.
- Some of those forecasts map onto live, tradable semantics in prediction markets.
If that holds up, this would be closer to a market-facing forecasting system than a mere voice-cloning demo.
Why Karoline Leavitt
Karoline Leavitt is a good target because the role is repetitive, public, adversarial, and tightly coupled to the day's official line.
That means the model can be asked to learn several things at once:
- surface style and cadence,
- stock evasions and pivots,
- topic prioritization,
- how briefing language compresses the administration's current agenda.
The real question is whether those patterns are predictive enough to help on held-out future briefings.
Candidate Data Sources
We probably want a layered corpus rather than a single source.
| Source | What it gives us | Notes |
|---|---|---|
| White House briefings and statements | Official transcripts and release dates | Best gold-standard source when transcripts are published. |
| White House remarks | Same-day context from the president and administration | Useful conditioning data even when it is not Leavitt text. |
| White House videos and official YouTube | Video archive and caption fallback | Important because official transcript coverage may be selective. |
| C-SPAN archive | Dated briefing and interview appearances | Probably the best archive-quality expansion set outside the White House site. |
| Factba.se / Roll Call Factbase | Searchability, dates, calendars, release metadata | Helpful as a retrieval and bookkeeping layer. |
| Polymarket Gamma API | Historical market metadata and labels | Needed for event definitions, market timing, and price snapshots. |
Example official transcript page:
- <a href="https://www.whitehouse.gov/briefings-statements/2025/01/press-briefing-by-press-secretary-karoline-leavitt/" target="_blank" rel="noopener noreferrer">Press Briefing by Press Secretary Karoline Leavitt</a>
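On the Gamma side, the first bookkeeping step is selecting wording-sensitive markets out of general market metadata. The sketch below filters already-fetched records; the `question` field and the `say`/`mention` cues are illustrative assumptions, not a verified Gamma API schema:

```python
# Sketch: filter Polymarket-style market metadata down to "say X" / "mention X"
# wording markets. The record shape (a "question" field) is an assumption here,
# not a documented Gamma API schema.
from typing import Iterable

WORDING_CUES = ("say", "mention")  # heuristic cues for wording-sensitive markets

def is_wording_market(question: str) -> bool:
    """Heuristic: does the market question hinge on a spoken word or phrase?"""
    padded = f" {question.lower()} "
    return any(f" {cue} " in padded for cue in WORDING_CUES)

def select_wording_markets(markets: Iterable[dict]) -> list[dict]:
    """Keep only markets whose question text looks wording-sensitive."""
    return [m for m in markets if is_wording_market(m.get("question", ""))]
```

A keyword filter like this will over- and under-select; a manual pass over the candidates is still needed before any market enters the benchmark.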
Immediate corpus-building note: the safest plan is to keep only dated, source-linked text with strict event-level deduplication. The same briefing may appear as an official transcript, a video posting, a C-SPAN entry, and a searchable archive record.
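The dedup note above can be sketched as a keyed collapse over dated records, keeping the best source per event. The source-quality ranking, the record fields, and the prefix-based event key are all heuristic assumptions:

```python
# Sketch of event-level deduplication: collapse multiple records of the same
# briefing (official transcript, caption, C-SPAN entry, archive copy) into one
# item, keeping the highest-quality source. Ranking and key are assumptions.
import re

SOURCE_RANK = {"official_transcript": 0, "official_caption": 1,
               "broadcast_transcript": 2, "recovered": 3}  # lower = better

def event_key(record: dict) -> tuple:
    """Dedup key: date plus a normalized prefix of the text (heuristic)."""
    prefix = re.sub(r"\W+", " ", record["text"].lower()).strip()[:200]
    return (record["date"], prefix)

def dedupe(records: list[dict]) -> list[dict]:
    """Keep one record per event key, preferring higher-quality sources."""
    best: dict[tuple, dict] = {}
    for r in records:
        k = event_key(r)
        if k not in best or SOURCE_RANK[r["source_type"]] < SOURCE_RANK[best[k]["source_type"]]:
            best[k] = r
    return list(best.values())
```

A text-prefix key only catches near-identical copies; caption and transcript versions of the same briefing usually differ, so a date-plus-speaker key or fuzzy matching would be needed as a second pass.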
Why Word Markets Matter
Prediction markets have repeatedly listed contracts on whether a politician will say a particular word or phrase in a speech or briefing.
That means there is already a market for narrow semantic events.
Examples:
- <a href="https://polymarket.com/event/will-president-biden-say-folks-in-his-first-joint-address" target="_blank" rel="noopener noreferrer">Will President Biden say "folks" in his first joint address?</a>
- <a href="https://polymarket.com/event/will-president-biden-mention-coronavirus-3-or-more-times-in-his-first-joint-address" target="_blank" rel="noopener noreferrer">Will Biden mention coronavirus 3+ times?</a>
- <a href="https://polymarket.com/event/will-president-biden-mention-cryptocurrency-in-the-2022-state-of-the-union-address" target="_blank" rel="noopener noreferrer">Will Biden mention cryptocurrency in the 2022 State of the Union?</a>
- <a href="https://polymarket.com/event/will-president-biden-mention-swift-in-the-2022-state-of-the-union-address" target="_blank" rel="noopener noreferrer">Will Biden mention SWIFT?</a>
- <a href="https://polymarket.com/event/will-biden-say-recession-in-his-tuesday-speech-on-inflation" target="_blank" rel="noopener noreferrer">Will Biden say recession?</a>
This is the cleanest motivation for the project. If markets settle on wording, then a simulator that improves wording forecasts is potentially useful in a way that ordinary political style transfer is not.
Proposed Benchmark
The benchmark should be forward-only and date-clean.
Train on briefings up to date T, validate on the next block, and test only on later briefings. No random split. No splitting individual Q-and-A segments from the same briefing across train and test.
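A minimal sketch of that split, assuming each item is a whole briefing carrying an ISO date string. Because the partition operates on briefing dates, Q-and-A segments from one briefing can never straddle the boundary:

```python
# Sketch of a forward-only, date-clean split. ISO date strings ("YYYY-MM-DD")
# compare correctly as plain strings, so no date parsing is needed here.
def forward_split(briefings: list[dict], train_end: str, val_end: str):
    """Partition briefings: train <= train_end < val <= val_end < test."""
    train = [b for b in briefings if b["date"] <= train_end]
    val = [b for b in briefings if train_end < b["date"] <= val_end]
    test = [b for b in briefings if b["date"] > val_end]
    return train, val, test
```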
The first pass should probably score four task types:
1. Phrase appearance: does a target word or phrase appear at all?
2. Talking-point ranking: which themes are most likely to show up?
3. Likely-answer generation: how would Leavitt answer a specific reporter question?
4. Market calibration: how do the model's probabilities compare with Polymarket prices?
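The phrase-appearance target reduces to counting mentions in a transcript, with the caveat that real market settlement can hinge on exact contract rules, so a scorer like this is an approximation, not a resolution source:

```python
# Sketch of the phrase-appearance target: count mentions of a phrase in a
# transcript. Covers both "says X" (threshold 1) and "mentions X N+ times"
# style markets. Approximate only; settlement follows each contract's rules.
import re

def mention_count(transcript: str, phrase: str) -> int:
    """Case-insensitive whole-word count of a phrase."""
    pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
    return len(re.findall(pattern, transcript.lower()))

def phrase_appears(transcript: str, phrase: str, threshold: int = 1) -> bool:
    """Binary label for wording markets with an optional mention threshold."""
    return mention_count(transcript, phrase) >= threshold
```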
Core metrics:
- Brier score and log loss for binary phrase markets,
- precision/recall or F1 for mention detection,
- calibration plots,
- LoRA vs base ablations,
- retrieval-only and market-only baselines.
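The binary-market metrics above are cheap to compute directly; a minimal sketch, taking model probabilities and resolved 0/1 outcomes:

```python
# Sketch of the core binary-market metrics: Brier score, log loss, and an
# equal-width calibration table (the numeric input to a calibration plot).
import math

def brier(probs, outcomes):
    """Mean squared error between probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def log_loss(probs, outcomes, eps=1e-12):
    """Mean negative log-likelihood, with clipping to avoid log(0)."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)

def calibration_bins(probs, outcomes, n_bins=10):
    """(mean predicted, observed frequency, count) per nonempty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    return [(sum(p for p, _ in b) / len(b), sum(y for _, y in b) / len(b), len(b))
            for b in bins if b]
```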
The benchmark should also separate three different sources of predictive power:
- persona/style signal,
- same-day administration signal,
- generic current-events signal.
Otherwise we could mistake "the model read today's agenda correctly" for "the Karoline adapter added something real."
Leakage And Contamination Risks
This project is only interesting if the held-out set is genuinely future-facing.
Main risks:
- the same event appears in multiple archives,
- later transcripts leak into the training corpus through scraped mirrors,
- models may already know older briefings from pretraining,
- market settlement can hinge on exact rules rather than intuitive wording.
That last point matters a lot. A model can be directionally right about the briefing and still lose the market if the contract resolves on one exact phrase.
The cleanest version of the benchmark would use materially post-cutoff Leavitt briefings for the main test set, especially for open models whose pretraining windows likely end before the relevant White House period.
Prior Work And Framing
Several adjacent literatures matter here.
| Reference | Why it matters |
|---|---|
| Argyle et al. - Out of One, Many | Political-science precedent for using language models as simulated human samples. |
| Horton - Homo Silicus | Frames LLMs as simulated agents rather than just text generators. |
| Park et al. - Generative Agents | Useful conceptual framing for persistent role-consistent simulacra. |
| Approaching Human-Level Forecasting with Language Models | Bridge from persona simulation to forecast quality. |
| Exploring Decentralized Prediction Markets on Polymarket | Useful grounding if we want to justify Polymarket as a serious target. |
| Hu et al. - LoRA | Base adaptation method for the project. |
| RoleLLM | Closest benchmark literature on role-playing ability. |
| Are Large Language Models Actually Good at Text Style Transfer? | Good caution against over-reading surface imitation. |
The article should probably lean on one idea above all others: success here would not prove that a model can "know the future." It would show that official political language is structured and coordinated enough to be forecastable.
Concrete Research Questions
- Does a Karoline-specific LoRA beat the base model on held-out phrase prediction?
- Does it beat a retrieval-only baseline built from prior briefings?
- Does it add anything on top of same-day White House context?
- Can it beat Polymarket prices on a subset of wording markets, or at least improve calibration when combined with them?
- Does the adapter mostly change tone, or does it genuinely improve content prediction?
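The market-combination question admits a simple first baseline: blend model and market probabilities in logit space and fit the weight on the validation block. This is a sketch of one reasonable combination rule, not a claim about the right one:

```python
# Sketch: logit-space mixture of a model probability and the market-implied
# probability. w = 0 recovers the market, w = 1 recovers the model; w is a
# hyperparameter to fit on the validation block.
import math

def logit(p: float) -> float:
    return math.log(p / (1 - p))

def sigmoid(x: float) -> float:
    return 1 / (1 + math.exp(-x))

def blend(model_p: float, market_p: float, w: float = 0.5) -> float:
    """Weighted logit-space average of model and market probabilities."""
    return sigmoid(w * logit(model_p) + (1 - w) * logit(market_p))
```

If the fitted w on held-out markets is near zero, the adapter is adding nothing over the market; that is a useful negative result in itself.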
Immediate Build Plan
1. Build a dated corpus of official Leavitt appearances with source URLs and deduplication.
2. Mark each item by source quality: official transcript, official caption, broadcast transcript, or manually recovered transcript.
3. Build a companion context dataset from same-day White House remarks, calendar items, executive actions, and major news triggers.
4. Collect historical wording-sensitive prediction markets and normalize their settlement rules.
5. Run a baseline stack before any LoRA training: generic frontier model, base local model, retrieval-only baseline, and market-implied probabilities.
6. Only then test whether a LoRA adds incremental predictive signal.
Open Questions
- How much official Leavitt transcript material actually exists in reusable text form?
- How often do briefing words matter because of Leavitt specifically, versus because the administration's line was already obvious?
- Which market formulations are robust enough for evaluation, and which are too settlement-fragile?
- Is the right unit a full simulated briefing, or a narrower next-answer / phrase-probability task?
For now, the right posture is to treat this as a research workbench: collect the corpus, define the clean benchmark, and resist the temptation to celebrate a style clone before we know whether it predicts anything tradable.