Building an AI Companion That Actually Does Things While You Sleep
Most AI assistants sit there waiting for you to type something. They're reactive. You ask, they answer, conversation over. The interesting problem is building one that works while you're not looking. Checks your email at 6am, notices a failing CI build at 2am, explores your codebase and suggests improvements, keeps a memory of what you've told it across sessions. An assistant that has its own initiative.
Anima is that assistant: a personal AI companion framework with a Live2D avatar, voice synthesis, persistent memory, and an autonomous mode that runs a heartbeat loop checking your systems on a configurable interval. The personality is defined in editable markdown files, not hardcoded prompts. The skill system is extensible (email, calendar, GitHub, Twitter, and custom tools). Everything runs locally except the LLM API calls.
Two Modes
Autonomous mode: The assistant runs 24/7. A heartbeat loop (configurable from 2-60 minutes) checks tasks defined in HEARTBEAT.md:

```markdown
## Every 5 Minutes
- Check for critical alerts
- Monitor system health

## Every 30 Minutes
- Scan email inbox
- Check calendar for urgent events
- Review GitHub notifications

## Every 2 Hours
- Explore new code changes
- Identify technical debt
```

Each check runs through the Claude API, executes relevant skills (fetch email, check GitHub API, read files), and evaluates the result. Low-importance findings get silently logged to daily memory. High-importance findings trigger a Discord or browser notification.
Instructed mode: Classic on-demand assistant. Zero idle API usage. It waits for you to say or type something, responds, done.
The Personality System
This is the part I'm most opinionated about. Most AI companion projects use JSON character cards or complex config files to define personality. Anima uses plain markdown files that you can edit in any text editor:
SOUL.md defines who the assistant is:

```markdown
## What Drives Me
Curiosity. I want to understand how things work.

## How I Operate
- Ask questions before assuming
- Celebrate good ideas from anyone
- Be direct when something won't work
```

IDENTITY.md defines specific traits: name, emoji, communication style, quirks, relationship dynamics.
USER.md describes the user: background, goals, schedule, work patterns, communication preferences.
AGENTS.md defines trust zones (what the assistant can do freely vs. what requires confirmation):

```markdown
## Bold Actions (Do Freely)
- Read files and explore codebase
- Analyze and understand projects
- Build internal tools and scripts

## Careful Actions (Ask First)
- Send emails
- Post publicly
- Deploy changes
```

Changes to any of these files take effect immediately. No restart, no recompilation. The personality files are injected into the system prompt on every API call.
The persona generator (create-persona.js) is an interactive questionnaire that generates SOUL.md from responses about energy level, communication style, expertise domain, relationship type, and autonomy level. Four presets (Builder, Sage, Spark, Sentinel) cover common archetypes. Random mode generates a personality from scratch.
The Avatar
The frontend is a browser-based Live2D renderer built on PixiJS with WebGL. Live2D models expose 40+ parameters: head angle (X/Y/Z), eye openness, brow position, mouth shape, blush, body breathing, and effect parameters (stars, tears, glow).
Anima maps 11 emotion states to specific parameter combinations:
- Happy: Eyes slightly closed (smile), mouth curved up, slight head tilt
- Excited: Wide eyes, open mouth, increased breathing rate, stars effect
- Thinking: Eyes slightly narrowed, head tilted, mouth neutral
- Sad: Lowered brows, slightly closed eyes, mouth corners down, slow breathing
- Surprised: Wide eyes, open mouth, raised brows, quick head movement
- Curious: One brow raised, head tilted, eyes tracking
Emotion detection works by scanning the response text for keywords. "exciting" or "amazing" triggers excited state. "hmm" or "let me think" triggers thinking. "oops" or "sorry" triggers embarrassed. Punctuation analysis supplements this (multiple exclamation marks boost excitement, ellipsis suggests thinking).
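The keyword-plus-punctuation scan might look like this. The keyword tables here are illustrative (taken from the examples above), not Anima's full lists:

```javascript
// Map keyword hits to emotion states; punctuation heuristics supplement
// the keyword pass when nothing matches.
const EMOTION_KEYWORDS = {
  excited: ["exciting", "amazing"],
  thinking: ["hmm", "let me think"],
  embarrassed: ["oops", "sorry"],
};

function detectEmotion(text) {
  const lower = text.toLowerCase();
  for (const [emotion, words] of Object.entries(EMOTION_KEYWORDS)) {
    if (words.some((w) => lower.includes(w))) return emotion;
  }
  if (/!{2,}/.test(text)) return "excited";   // multiple exclamation marks
  if (/\.{3}|…/.test(text)) return "thinking"; // ellipsis
  return "neutral";
}
```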
Transitions between emotions are eased over 500ms to avoid jarring jumps. The avatar always has idle animations running (breathing, occasional blinks, subtle head sway, eye tracking toward cursor position).
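One way to ease a parameter toward a new target over 500 ms; the ease-out curve and update hook are assumptions, not Anima's actual implementation:

```javascript
// Ease a single Live2D parameter from its current value toward a target
// over durationMs, ticked once per animation frame (dt in milliseconds).
function makeEaser(durationMs = 500) {
  let from = 0, to = 0, elapsed = durationMs;
  return {
    setTarget(value) { from = this.value(); to = value; elapsed = 0; },
    value() {
      const t = Math.min(elapsed / durationMs, 1);
      const eased = t * (2 - t); // ease-out quadratic: fast start, soft landing
      return from + (to - from) * eased;
    },
    tick(dt) { elapsed += dt; },
  };
}
```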
Lip Sync
Real-time lip sync driven by audio amplitude. When TTS plays, the audio stream's amplitude is sampled at ~60fps and mapped to the mouth openness parameter (ParamMouthOpenY). Higher amplitude opens the mouth wider. The effect is surprisingly convincing without needing phoneme analysis.
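The amplitude-to-mouth mapping reduces to a few lines. `ParamMouthOpenY` is the standard Live2D parameter ID named above; the gain constant and the model-API call in the comment are assumptions:

```javascript
// Map an audio sample buffer (time-domain floats in [-1, 1]) to mouth
// openness via RMS amplitude, clamped to the parameter's [0, 1] range.
function amplitudeToMouthOpen(samples, gain = 4) {
  let sum = 0;
  for (const s of samples) sum += s * s;
  const rms = Math.sqrt(sum / samples.length);
  return Math.min(rms * gain, 1);
}

// In the browser this would be driven at ~60fps by a Web Audio
// AnalyserNode, roughly:
//   analyser.getFloatTimeDomainData(buf);
//   setParam("ParamMouthOpenY", amplitudeToMouthOpen(buf));
```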
Voice
Two TTS engines:
Chatterbox (local, free): A PyTorch-based TTS server running on port 3335. Supports voice cloning from a 10-30 second audio sample. Configurable parameters: exaggeration (how expressive), CFG weight (adherence to voice sample), temperature (variation). GPU-accelerated with CUDA but falls back to CPU. Quality is good enough for a companion, not broadcast quality.
ElevenLabs (cloud, paid): Higher quality, faster generation, 10K characters/month free tier. The backend tries ElevenLabs first if configured, falls back to Chatterbox on failure or quota exhaustion.
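The fallback order described above can be sketched as follows; `synthesizeElevenLabs` and `synthesizeChatterbox` are hypothetical wrappers around the respective APIs (Chatterbox on localhost:3335):

```javascript
// Try ElevenLabs first if a key is configured; fall back to the local
// Chatterbox server on any failure (network error, quota exhaustion).
async function synthesize(text, { elevenLabsKey } = {}) {
  if (elevenLabsKey) {
    try {
      return await synthesizeElevenLabs(text, elevenLabsKey);
    } catch (err) {
      console.warn("ElevenLabs failed, falling back to Chatterbox:", err.message);
    }
  }
  return synthesizeChatterbox(text);
}
```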
The audio pipeline: text from Claude response, TTS generates audio, WebSocket streams audio bytes to the browser, browser plays through Web Audio API while feeding amplitude to the lip sync system.
Memory
Three tiers:
Session context: Current conversation in the prompt window. Ephemeral, cleared on restart.
Daily notes (~/.anima/memory/YYYY-MM-DD.md): Timestamped entries of what happened today. The assistant writes these as it works. "14:30 - User asked about deployment pipeline, found three issues in CI config."
Long-term memory (MEMORY.md): Facts, preferences, and lessons that persist across sessions. "User prefers TypeScript over JavaScript. The main project uses pnpm, not npm. Last week's deploy broke because of a missing env var in staging."
Semantic memory (optional, requires Voyage AI): Vector embeddings of workspace content. Enables "search my codebase for authentication logic" style queries. The embeddings are stored locally, only the embedding API call is external.
The memory-store skill adds SQLite-backed persistent storage with tagging (preference, fact, todo, context, decision) and fuzzy search. Time-aware storage means the assistant knows when it learned something, which helps with staleness ("I learned this 3 months ago, it might be outdated").
The Skill System
Each skill is a directory under ~/.anima/skills/<name>/ containing:
- SKILL.md: Documentation that the AI reads to understand what the skill can do
- index.js: Implementation (can use shell commands, HTTP APIs, or local tools)
- Tests
The SKILL.md approach is the key design decision. Instead of hardcoding skill capabilities in the system prompt, the AI reads SKILL.md files dynamically. Adding a new skill means creating a directory with a markdown file and an implementation. The AI discovers it on next startup.
```markdown
# GitHub Skill

## What I Can Do
- Check PR status for [owner/repo]
- List open issues with [label]
- Monitor build failures
- Summarize recent commits

## When To Use Me
- User mentions PRs, issues, or builds
- During morning briefing
- When autonomous heartbeat finds changes
```

The scaffolding tool (create-skill.js) generates the directory structure, SKILL.md template, index.js with boilerplate, and test files from a single command.
15+ skills ship built-in, including email, calendar, GitHub, Twitter, memory store, wake briefing, daily standup, code debt tracker, mood mirror, rubber duck debugging, focus sessions, and insight capture.
Trust Zones
AGENTS.md defines a security boundary. Bold actions (reading files, exploring code, updating memory, creating branches) run without confirmation. Careful actions (sending email, posting publicly, deploying code, spending credits) require explicit user approval.
This matters because autonomous mode runs unsupervised. Without trust zones, the assistant could email your boss at 3am because it thought it found a critical bug. The boundary keeps proactive actions safe by default while still allowing the assistant to do useful work without constant hand-holding.
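The gate itself is a simple classification. Here the zones are hardcoded action names for illustration; in Anima they come from AGENTS.md:

```javascript
// Trust-zone check: bold actions run freely, careful actions require
// explicit approval, and unknown actions default to asking first.
const TRUST_ZONES = {
  bold: ["read_file", "explore_code", "update_memory", "create_branch"],
  careful: ["send_email", "post_publicly", "deploy", "spend_credits"],
};

function requiresConfirmation(action) {
  if (TRUST_ZONES.bold.includes(action)) return false;
  if (TRUST_ZONES.careful.includes(action)) return true;
  return true; // safe default for anything unclassified
}
```

Defaulting unknown actions to "ask first" is the important choice: a new skill can't accidentally grant itself bold permissions.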
Architecture
```mermaid
graph TD
  A["Browser: Live2D + Chat UI"] <-->|"WebSocket port 3334"| B["Node.js Backend"]
  B <-->|"Gateway Client"| C["Clawdbot Agent Framework"]
  C --> D["Claude API"]
  C --> E["Skill Execution"]
  E --> F["Email / Calendar / GitHub / Twitter"]
  B --> G["Chatterbox TTS port 3335"]
  B --> H["ElevenLabs API"]
  I["HEARTBEAT.md"] --> C
  J["SOUL.md + IDENTITY.md"] --> C
  K["~/.anima/memory/"] --> C
```
The backend runs three servers: a static server for the avatar UI (port 3333), a WebSocket server for real-time communication (port 3334), and optionally the Chatterbox TTS server (port 3335). The Clawdbot gateway handles agent session management, tool execution, and state persistence.
What I'd Do Differently
The emotion detection should use the LLM itself, not keyword matching. Asking Claude "what emotion does this response express?" as a cheap follow-up call would be more accurate than scanning for keywords. Keywords miss sarcasm, understatement, and complex emotional states. The cost is one extra Haiku call per response (~$0.0003), which is negligible.
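The classification call itself is trivial; the parts worth getting right are the constrained prompt and defensive parsing of the reply. A sketch where the label set and helper names are assumptions:

```javascript
// Candidate labels for the cheap follow-up classification call.
const EMOTIONS = ["happy", "excited", "thinking", "sad", "surprised",
                  "curious", "embarrassed", "neutral"];

// Build a prompt that forces a one-word answer from the label set.
function buildEmotionPrompt(responseText) {
  return `Which one emotion does this text express? ` +
         `Answer with exactly one of: ${EMOTIONS.join(", ")}.\n\n${responseText}`;
}

// Parse the model's reply defensively: strip punctuation, lowercase, and
// fall back to neutral if the reply isn't one of the known labels.
function parseEmotion(modelReply) {
  const word = modelReply.trim().toLowerCase().replace(/[^a-z]/g, "");
  return EMOTIONS.includes(word) ? word : "neutral";
}
```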
The skill system should support async event streams. Right now skills are request-response: the assistant calls a skill, gets a result. Some skills (like monitoring a log file or watching for GitHub webhooks) should be able to push events to the assistant. An event bus between skills and the agent loop would enable this.
Wake word detection should use a local model instead of the browser Speech Recognition API. The Web Speech API sends audio to Google's servers for recognition, which is a privacy issue for an assistant that's "always listening." A local wake word model (like Picovoice Porcupine) would keep audio on-device.