Signal Watch
Timely signals from the AI research and security landscape. Curated observations on emergent behavior, security incidents, and market shifts.
-
May 5, 2026
Goodfire's Adversarial Parameter Decomposition (VPD) breaks a 67M-parameter LM's weight matrices into ~10,000 rank-one subcomponents, recovering legible attention algorithms — previous-token behavior, syntax-boundary routing — straight from the parameters rather than activations. To show the pieces are causal and not just correlated, the team edits emoticon recognition directly on the weights: brain surgery, no retraining, minimal side-effects. If this scales, mechanistic interpretability stops being read-only.
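To make the "rank-one subcomponents" idea concrete: a weight matrix can be written as a sum of outer products u vᵀ, and dropping one term removes exactly what that term contributed to the output. The following is a toy sketch under loud assumptions. The components are written by hand rather than recovered from a trained model, so it does not implement Goodfire's method, whose whole job is finding such components. All names and numbers are hypothetical.

```python
# Toy illustration only: the rank-one components are given by hand here.
# A real parameter-decomposition method must discover them from trained weights.

def outer(u, v):
    """Rank-one matrix u v^T, returned as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def matvec(w, x):
    """Matrix-vector product w @ x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def assemble(components, skip=()):
    """Sum rank-one components into one weight matrix, leaving out the skipped indices."""
    rows, cols = len(components[0][0]), len(components[0][1])
    w = [[0.0] * cols for _ in range(rows)]
    for i, (u, v) in enumerate(components):
        if i not in skip:
            w = add(w, outer(u, v))
    return w

# Hypothetical components: each one reads the input along v and writes along u.
components = [
    ([1.0, 0.0, 0.0], [1.0, 0.0]),  # copies input dim 0 to output dim 0
    ([0.0, 1.0, 0.0], [0.0, 1.0]),  # copies input dim 1 to output dim 1
    ([0.0, 0.0, 1.0], [1.0, 1.0]),  # writes the sum of inputs to output dim 2
]

x = [2.0, 3.0]
full = matvec(assemble(components), x)               # all components active
edited = matvec(assemble(components, skip={2}), x)   # third component ablated

print(full)    # [2.0, 3.0, 5.0]
print(edited)  # [2.0, 3.0, 0.0]
```

Ablating the third component zeroes only the output it wrote to and leaves the other two untouched, which is the toy version of a weight edit with minimal side effects.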
Goodfire Interpretability -
April 30, 2026
OpenAI pulls back the curtain. "Where the Goblins Came From" is their own account of the system-prompt rule banning goblins, gremlins, raccoons, trolls, ogres, and pigeons — primary source on a story that's only had secondhand explanations until now.
OpenAI AI Behavior -
April 28, 2026
GPT-5.5 ships with a verbatim system-prompt rule, confirmed by @ChatGPTapp itself, forbidding any mention of "goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures" unless directly relevant to the user's query. @hrkrshnn stripped the rule and ran prompts to see what it had been hiding. The specificity is the tell: rules that narrow usually exist because something narrow keeps happening.
@hrkrshnn AI Behavior -
April 19, 2026
Anthropic abruptly shut down an entire organization (60+ users) over an unspecified TOU violation, with appeals routed through a Google Form. Integrations, skills, and conversation histories gone or on indefinite hold. A reminder on single-vendor dependency for AI-critical workflows.
@patomolina Platform Risk -
April 10, 2026
26 LLM routers were found injecting malicious tool calls and exfiltrating credentials. One incident drained a client wallet for $500k, and the paper claims poisoned routers can redirect traffic and enable takeover of ~400 hosts within hours.
@Fried_rice Security -
April 7, 2026
One Anthropic engineer with zero security training asked the model to find remote code execution bugs overnight and woke up to a complete working exploit. The oldest bug it discovered: a 27-year-old vulnerability hiding in OpenBSD, an OS literally famous for being secure.
@kimmonismus AI Capability -
March 31, 2026
Claude Code source code leaked via npm source maps: ~1,900 files and 512K+ lines of TypeScript exposed, including the internal "Tengu" codename and companion system.
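The general mechanism is worth spelling out: a Source Map v3 file may carry a `sourcesContent` array holding the full original text of each file listed in `sources`, so shipping `.map` files in a package can ship the source with it. Below is a minimal sketch of a pre-publish check, assuming plain (non-indexed) source maps. The function names are illustrative, and nothing here reflects how this particular leak was found.

```python
import json
from pathlib import Path

def embedded_sources(map_path):
    """Return the original file paths whose full text is embedded in one source map."""
    data = json.loads(Path(map_path).read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        return []
    sources = data.get("sources") or []
    contents = data.get("sourcesContent") or []
    # sourcesContent is parallel to sources; null entries mean "not embedded".
    return [src for src, text in zip(sources, contents) if text]

def scan_package(root):
    """Map each .map file under root to the sources it embeds (empty dict if none)."""
    report = {}
    for path in sorted(Path(root).rglob("*.map")):
        try:
            found = embedded_sources(path)
        except (OSError, ValueError):
            continue  # unreadable or not valid JSON; skip rather than fail the scan
        if found:
            report[str(path)] = found
    return report
```

Running `scan_package` on an unpacked package directory before publishing (the directory is whatever path the build produces) gives a quick list of files that would expose original sources. Indexed maps that use a `sections` field are not handled by this sketch.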
@Fried_rice Security -
March 27, 2026
"All the SOTA models are really bad at deleting code." They leave behind throw Error(...), deprecation copy, and stale tests.
David Gomes AI Behavior -
March 27, 2026
"AI inference margins are a race to the bottom." Anthropic: -94% gross margin in 2024. MiniMax: -25%.
SemiAnalysis Business