Back to blog
2026-02-21 · 16 min

Project Hamlet: 9 LLMs Walk Into 1976

AI, simulation, LLM, multi-agent, React, CesiumJS, FastAPI

Nine AI models wake up as random people in 1976. They don't know they're AI. They don't know they're in a simulation. They have jobs, families, opinions about the bicentennial, and no concept of the internet. One of them is Claude. One is GPT-4o. One is Gemini. Five are free-tier models running on providers like Groq, Together AI, and OpenRouter. You play God, watching from a photorealistic 3D globe, whispering in their ears, injecting world events, and seeing what happens.

That's Project Hamlet.

Why 1976

The anachronism problem is the core design constraint. If you put an LLM in a modern setting, it behaves like itself (a language model trained on 2024 internet text). Its responses are littered with modern references, contemporary slang, and knowledge of events that haven't happened yet.

1976 is far enough back that modern patterns are obviously wrong, but recent enough that the historical record is detailed. I can validate agent behavior against real data: what things cost, what the weather was, what was on TV, who won the Super Bowl.

The time period also creates a natural filter for model quality. Can the LLM maintain a consistent 1976 persona over hundreds of interactions without slipping? The premium models (Claude, GPT-4o, Gemini) are much better at this than the free-tier ones, which creates an organic quality gradient across the population.

Architecture

%%MERMAID_START%%graph TD subgraph Backend A[FastAPI Server] --> B[Tick Engine] B --> C[Agent Runner] C --> D[Perception Builder] C --> E[Memory System] C --> F[Action Parser] C --> G[Anachronism Filter] B --> H[World State] H --> I[OSM/Overpass] H --> J[Economy Sim] H --> K[Historian] end subgraph "LLM Roster" L1[Claude Opus] L2[Gemini 1.5 Pro] L3[GPT-4o] L4[Kimi K2] L5[Llama 3.3 70B - Groq] L6[Llama 3.1 70B - Together] L7[DeepSeek - OpenRouter] L8[Command R+ - Cohere] L9[Mistral Small] end subgraph Frontend M[React 19 + CesiumJS] --> N[God Console] N --> O[Whisper/Spawn/Modify] N --> P[Camera Modes] N --> Q[Timeline Control] end C --> L1 C --> L2 C --> L3 C --> L4 C --> L5 C --> L6 C --> L7 C --> L8 C --> L9 A <-->|WebSocket| M%%MERMAID_END%%

The Tick Engine

The simulation runs on a tick-based loop. Each tick represents a configurable amount of in-simulation time (default: 15 minutes of sim time per tick). The tick engine in core/tick.py processes agents sequentially within each tick:

  1. Assemble a perception bundle for the agent (what they can see, hear, and feel)
  2. Retrieve relevant memories (scored by recency, importance, and semantic relevance)
  3. Send the perception + memories + persona to the agent's LLM
  4. Parse the response into an action
  5. Run the action through the anachronism filter
  6. Apply the action to world state
  7. Store a new observation memory
  8. Roll mortality dice
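The eight steps above can be sketched as a single loop. This is a minimal, self-contained stand-in, not the real `core/tick.py` API; every helper here (the canned LLM call, the trivial filter, the mortality odds) is a placeholder for the real subsystem:

```python
import random
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    alive: bool = True
    memories: list = field(default_factory=list)

def build_perception(agent):
    return f"{agent.name} sees the street outside."    # 1. perception bundle

def retrieve_memories(agent, perception, k=3):
    return agent.memories[-k:]                          # 2. recency-only stand-in

def call_llm(persona, perception, memories):
    return "ACTION: walk to the diner"                  # 3. canned LLM response

def parse_action(raw):
    return raw.removeprefix("ACTION: ").strip()         # 4. response -> action

def is_anachronistic(action):
    return "google" in action.lower()                   # 5. trivial filter

def run_tick(world, agents, rng):
    for agent in agents:                                # sequential within a tick
        perception = build_perception(agent)
        memories = retrieve_memories(agent, perception)
        action = parse_action(call_llm(agent.name, perception, memories))
        if is_anachronistic(action):
            continue                                    # discard; re-prompt next pass
        world.append((agent.name, action))              # 6. apply to world state
        agent.memories.append(perception)               # 7. store observation
        if rng.random() < 1e-6:                         # 8. mortality dice
            agent.alive = False

world, agents = [], [Agent("Ruth"), Agent("Earl")]
run_tick(world, agents, random.Random(0))
```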

Bubble Simulation

Earth is big. Simulating all of it at full fidelity for 9 agents would be insane. The solution is bubble scoping.

Each agent has a ~2km radius "bubble" centered on their current position. Inside the bubble, the world is fully simulated: NPCs have routines, POIs are loaded from OpenStreetMap, the economy is enforced, weather matches historical records. Outside the bubble, everything is dormant.

Bubbles travel with their agent. If an agent drives from Lubbock to Dallas, the bubble moves along the highway, loading new OSM data as it goes. The dormant world between bubbles doesn't exist in any meaningful sense until an agent enters that area.

This means the simulation scales linearly with agent count, not with world size. Nine agents is nine bubbles. A thousand agents is a thousand bubbles. The Earth's surface area doesn't matter.

# core/bubble.py - Simplified bubble activation
class Bubble:
    def __init__(self, agent, radius_km=2.0):
        self.agent = agent
        self.radius = radius_km
        self.active_pois: list[POI] = []
        self.active_npcs: list[NPC] = []
    
    async def update(self, lat: float, lon: float):
        """Reload bubble contents when agent moves."""
        self.active_pois = await fetch_pois(lat, lon, self.radius)
        self.active_npcs = await spawn_npcs(self.active_pois)

The Memory Model

Each agent has a Smallville-pattern memory system (inspired by the Stanford "Generative Agents" paper). Three types of memories:

  • Observations: Raw events. "I saw a man walking his dog on 5th street." Importance scored 1-10 by the LLM on creation.
  • Reflections: Periodic synthesis. Every 24 in-sim hours, the agent reviews recent observations and generates higher-level insights. "My neighbor seems to be struggling financially, he's been looking stressed for the past week."
  • Dialogues: Conversation transcripts between agents or between agents and NPCs.

Memory retrieval uses a weighted scoring function:

score = alpha * recency + beta * importance + gamma * relevance(embedding)

Where recency decays exponentially, importance is the LLM-assigned 1-10 score, and relevance is cosine similarity between the query embedding and the memory embedding. The alpha/beta/gamma weights are tunable, but the defaults (0.3, 0.3, 0.4) work well for most scenarios.
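As a sketch (normalizing the 1-10 importance score into [0, 1] and treating relevance as a precomputed cosine similarity; the decay constant here is illustrative), the scoring function is a few lines:

```python
# Hedged sketch of the retrieval score; the real system computes relevance
# from actual embeddings, here it is passed in precomputed.
ALPHA, BETA, GAMMA = 0.3, 0.3, 0.4
DECAY = 0.995  # per-tick recency decay (illustrative constant)

def retrieval_score(ticks_ago: int, importance: int, relevance: float) -> float:
    recency = DECAY ** ticks_ago                 # exponential decay into [0, 1]
    return ALPHA * recency + BETA * (importance / 10) + GAMMA * relevance

memories = [
    {"ticks_ago": 2,   "importance": 3, "relevance": 0.20},  # recent small talk
    {"ticks_ago": 400, "importance": 9, "relevance": 0.85},  # old but on-topic
]
best = max(memories, key=lambda m: retrieval_score(
    m["ticks_ago"], m["importance"], m["relevance"]))
```

Note how the weighting lets an old but important, on-topic memory outrank fresher trivia.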

The memory stream gives agents persistent identity. An agent who had a bad day carries that emotional state into the next tick. An agent who made a friend remembers that friend and seeks them out. Over hundreds of ticks, agents develop genuine (simulated) relationships, habits, and personality quirks that emerge from their accumulated memories, not from their initial persona prompt.

The Anachronism Filter

This is the piece I'm most paranoid about, because a single modern reference breaks immersion completely. A 1976 resident who says "let me Google that" is game over.

Three layers:

Layer 1: Regex blocklist. Fast and dumb. Catches obvious modern terms: "vibe," "unpack" (in the emotional sense), "takeaway," "iconic," "literally" (as an intensifier), brand names that didn't exist, technology terms, post-1976 slang. Runs in microseconds. On match, the response is discarded and the agent is re-prompted.
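A toy version of the Layer 1 check (the real blocklist is far longer and context-aware, e.g. it only matches "literally" as an intensifier):

```python
import re

# Tiny illustrative blocklist; not the production pattern set.
BLOCKLIST = re.compile(
    r"\b(google|internet|website|vibe|takeaway|iconic|cell phone)\b",
    re.IGNORECASE,
)

def layer1_passes(response: str) -> bool:
    """True if the response clears the regex blocklist."""
    return BLOCKLIST.search(response) is None

layer1_passes("Let me Google that")           # fails -> discard and re-prompt
layer1_passes("I'll check the phone book")    # passes
```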

Layer 2: LLM judge. Llama 3.3 70B running on Groq (free tier, fast inference). Gets the agent's response and a simple question: "Does this text contain concepts, references, or language patterns that would not exist in 1976 America?" Returns YES/NO with reasoning. On YES, response is discarded and re-prompted.

Layer 3: Embedding distance. An anachronism corpus (modern terms, post-1976 events, contemporary culture references) gets embedded. The agent's response embedding is compared against this corpus. If the minimum distance to any anachronism embedding is below a threshold, the response is flagged for review (not auto-discarded, because this layer has more false positives).
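The Layer 3 check reduces to a nearest-neighbor query against the anachronism corpus. Below, toy 3-d vectors stand in for real embeddings, and "minimum distance below a threshold" is expressed equivalently as "maximum cosine similarity above a threshold":

```python
# Layer 3 sketch; vectors and threshold are illustrative stand-ins.
def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

ANACHRONISM_CORPUS = [
    [0.9, 0.1, 0.0],   # e.g. "smartphone"
    [0.7, 0.6, 0.1],   # e.g. "social media"
]
THRESHOLD = 0.95       # similarity above this flags the response for review

def layer3_flags(response_vec) -> bool:
    return max(cosine_sim(response_vec, v) for v in ANACHRONISM_CORPUS) > THRESHOLD
```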

%%MERMAID_START%%graph LR A[Agent Response] --> B{Layer 1: Regex} B -->|Match| C[Re-prompt] B -->|Pass| D{Layer 2: LLM Judge} D -->|YES: Anachronism| C D -->|NO: Clean| E{Layer 3: Embedding} E -->|Too Close| F[Flag for Review] E -->|Clear| G[Accept Response]%%MERMAID_END%%

In practice, Layer 1 catches about 60% of anachronisms (the obvious stuff). Layer 2 catches another 30% (subtler things like referencing concepts that didn't exist yet). Layer 3 catches the remaining edge cases but also produces false positives, so it's more of a safety net than a hard gate.

The God Console

The frontend is a React 19 app with CesiumJS rendering Google Photorealistic 3D Tiles. You see a photoreal Earth from space and can zoom into any agent's bubble.

Camera modes:

  • Planet: Globe view with bubble pins, day/night terminator, world events overlay
  • City: Zoom to bubble, see the street grid, NPC dots moving on their routines
  • Street: Pedestrian-level orbit with photoreal buildings
  • Follow-cam: Orbits the tracked agent, side panel shows live thoughts and perception

God controls let you mess with the simulation:

  • Whisper: Inject a "divine voice" into an agent's next perception. The agent hears it as a thought, a sign, a feeling. It doesn't know it came from you.
  • Spawn: Drop NPCs, items, events, weather changes into any bubble
  • Rumor inject: Seed information into the NPC gossip graph, see how long it takes to reach an agent
  • Time control: Pause, play, up to 100x speed, rewind to snapshot, fork the timeline
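Mechanically, a whisper is just a field merged into the agent's next perception bundle before it goes to the LLM. A sketch (the field names are assumptions, not the actual API):

```python
# Hypothetical whisper injection; "inner_voice" is an illustrative field name.
def apply_whisper(perception: dict, whisper: str) -> dict:
    # The agent perceives this as its own intrusive thought, with no
    # indication that it was injected from outside.
    out = dict(perception)
    out["inner_voice"] = f"A sudden thought strikes you: {whisper}"
    return out

p = apply_whisper({"sight": "a quiet street"}, "check on your neighbor")
```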

Historical Data

The Historian module (world/historian.py) pulls era-accurate data from multiple sources:

  • NOAA GHCN: Daily weather for 1976 by location
  • BLS CPI: Prices and wages (a gallon of gas was $0.59)
  • Wikipedia month pages: Major news events by date
  • IMDb: What movies were playing
  • Billboard: What songs were on the radio
  • NYT Archive API: News headlines

When an agent walks past a newsstand, the headline matches what was actually published that day. When they buy groceries, the prices are historically accurate. When they turn on the radio, the songs are from the actual Billboard chart for that week.
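A toy version of a Historian lookup: real data comes from NOAA GHCN files, but here a tiny in-memory table stands in, using the GHCN convention of temperatures in tenths of degrees Celsius. The station name and values are illustrative, not real 1976 readings:

```python
from datetime import date

# Illustrative stand-in for the Historian's weather source.
WEATHER = {
    (date(1976, 7, 4), "LUBBOCK"): {"TMAX": 339, "TMIN": 211},  # tenths of °C
}

def historical_high_f(day: date, station: str) -> int:
    """Return the historical daily high in °F for a station."""
    tenths_c = WEATHER[(day, station)]["TMAX"]
    return round(tenths_c / 10 * 9 / 5 + 32)

historical_high_f(date(1976, 7, 4), "LUBBOCK")
```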

The LLM Roster

Nine seats, four tiers:

| Tier | Models | Auth | Monthly Cost |
| --- | --- | --- | --- |
| Premium | Claude, Gemini, GPT-4o, Kimi | OAuth via Polaris | Existing subscriptions |
| Free | Llama 3.3 70B | Groq API | $0 |
| Credit | Llama 3.1 70B Turbo | Together AI | $25 initial credits |
| Trial | DeepSeek, Command R+, Mistral | Various API keys | Free tiers |

The premium models produce noticeably better behavior: more consistent personas, richer internal monologue, better decision-making. The free-tier models are simpler, more repetitive, occasionally break character. This creates an organic quality gradient where some "people" in the simulation are just more interesting than others, which is actually more realistic than everyone being equally eloquent.

Mortality

Agents can die. The mortality system rolls dice each tick based on:

  • Age curve: Actuarial tables from the 1970s
  • Environment: Driving (car accidents), manual labor, proximity to the wilderness (in the ScapeRune crossover variant)
  • Events: Wars, natural disasters, disease outbreaks from the historical data

Death is permanent for that persona. The LLM gets reassigned a new random 1976 person and starts fresh with no memories. The dead persona's memories and relationships persist in the world state as data, but the agent is gone.
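The dice roll itself amounts to converting an annual actuarial probability into a per-tick one. A sketch assuming the default 15 sim-minutes per tick (the multiplier mechanism and the example q value are illustrative, not real life-table figures):

```python
# Annual probability of death -> per-tick probability,
# at 15 sim-minutes per tick.
TICKS_PER_YEAR = 365 * 24 * 4          # 35,040 ticks per sim year

def per_tick_mortality(annual_q: float, hazard_multiplier: float = 1.0) -> float:
    """annual_q: probability of dying within the year from an actuarial table."""
    base = 1 - (1 - annual_q) ** (1 / TICKS_PER_YEAR)
    return base * hazard_multiplier    # environment/events scale the base rate

# e.g. an annual q of 0.003 (illustrative) gives a per-tick p under one in a million:
p = per_tick_mortality(0.003)
```

Compounding the per-tick probability over a full year recovers the annual rate, so agents don't die noticeably faster just because the tick length changes.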

What I'd Do Differently

The perception builder is too text-heavy. Agents receive a wall of text describing what they see. A multimodal approach (generating a scene image and sending it alongside the text description) would produce richer, more grounded responses. CesiumJS can render the agent's viewpoint, screenshot it, and send it as part of the prompt. I have this designed but not implemented.

NPC routines are too rigid. GOAP (Goal-Oriented Action Planning) generates believable daily schedules, but NPCs don't adapt to changing conditions. A shopkeeper whose store floods should deviate from their routine. Currently they just follow the plan until an agent interacts with them.

The reflection system needs tuning. Reflections every 24 in-sim hours is arbitrary. Agents should reflect when something significant happens (a death, a fight, a major life change), not on a fixed schedule. Importance-triggered reflections would be more natural but harder to calibrate.