
Testing Guide

Seraph has automated backend and frontend test coverage, with CI running on every push and pull request.

Running Tests

Backend

cd backend
uv sync --group dev # Install dev dependencies (first time)
uv run pytest -v # Run all tests with coverage
uv run pytest --no-cov # Run without coverage (faster)
uv run pytest tests/test_session.py -v # Run a single file
uv run pytest -k "test_create" # Run tests matching a pattern

Coverage is enabled by default via pyproject.toml:

[tool.pytest.ini_options]
asyncio_mode = "auto"
addopts = "--cov=src --cov-report=term-missing"

Runtime Evals

As part of the S1-B3 reliability work, there is also a deterministic eval harness for the core guardian/runtime contracts:

cd backend
uv run python -m src.evals.harness --list   # List available scenarios
uv run python -m src.evals.harness          # Run all scenarios
uv run python -m src.evals.harness --scenario rest_chat_behavior
uv run python -m src.evals.harness --scenario rest_chat_approval_contract
uv run python -m src.evals.harness --scenario rest_chat_timeout_contract
uv run python -m src.evals.harness --scenario websocket_chat_behavior
uv run python -m src.evals.harness --scenario websocket_chat_approval_contract
uv run python -m src.evals.harness --scenario websocket_chat_timeout_contract
uv run python -m src.evals.harness --scenario strategist_tick_behavior
uv run python -m src.evals.harness --scenario strategist_tick_learning_continuity_behavior
uv run python -m src.evals.harness --scenario guardian_state_synthesis
uv run python -m src.evals.harness --scenario guardian_world_model_behavior
uv run python -m src.evals.harness --scenario observer_refresh_behavior
uv run python -m src.evals.harness --scenario observer_delivery_decision_behavior
uv run python -m src.evals.harness --scenario native_presence_notification_behavior
uv run python -m src.evals.harness --scenario native_desktop_shell_behavior
uv run python -m src.evals.harness --scenario cross_surface_notification_controls_behavior
uv run python -m src.evals.harness --scenario cross_surface_continuity_behavior
uv run python -m src.evals.harness --scenario intervention_policy_behavior
uv run python -m src.evals.harness --scenario observer_delivery_salience_confidence_behavior
uv run python -m src.evals.harness --scenario guardian_feedback_loop
uv run python -m src.evals.harness --scenario provider_fallback_chain
uv run python -m src.evals.harness --scenario provider_health_reroute
uv run python -m src.evals.harness --scenario local_runtime_profile
uv run python -m src.evals.harness --scenario helper_local_runtime_paths
uv run python -m src.evals.harness --scenario context_window_summary_audit
uv run python -m src.evals.harness --scenario agent_local_runtime_profile
uv run python -m src.evals.harness --scenario delegation_local_runtime_profile
uv run python -m src.evals.harness --scenario delegated_tool_workflow_behavior
uv run python -m src.evals.harness --scenario delegated_tool_workflow_degraded_behavior
uv run python -m src.evals.harness --scenario workflow_composition_behavior
uv run python -m src.evals.harness --scenario mcp_specialist_local_runtime_profile
uv run python -m src.evals.harness --scenario embedding_runtime_audit
uv run python -m src.evals.harness --scenario vector_store_runtime_audit
uv run python -m src.evals.harness --scenario soul_runtime_audit
uv run python -m src.evals.harness --scenario vault_runtime_audit
uv run python -m src.evals.harness --scenario filesystem_runtime_audit
uv run python -m src.evals.harness --scenario runtime_model_overrides
uv run python -m src.evals.harness --scenario runtime_fallback_overrides
uv run python -m src.evals.harness --scenario runtime_profile_preferences
uv run python -m src.evals.harness --scenario runtime_path_patterns
uv run python -m src.evals.harness --scenario provider_policy_capabilities
uv run python -m src.evals.harness --scenario provider_policy_scoring
uv run python -m src.evals.harness --scenario provider_policy_safeguards
uv run python -m src.evals.harness --scenario provider_routing_decision_audit
uv run python -m src.evals.harness --scenario session_bound_llm_trace
uv run python -m src.evals.harness --scenario session_consolidation_behavior
uv run python -m src.evals.harness --scenario scheduled_local_runtime_profile
uv run python -m src.evals.harness --scenario daily_briefing_fallback
uv run python -m src.evals.harness --scenario daily_briefing_delivery_behavior
uv run python -m src.evals.harness --scenario shell_tool_runtime_audit
uv run python -m src.evals.harness --scenario browser_runtime_audit
uv run python -m src.evals.harness --scenario web_search_runtime_audit
uv run python -m src.evals.harness --scenario web_search_empty_result_audit
uv run python -m src.evals.harness --scenario observer_calendar_source_audit
uv run python -m src.evals.harness --scenario observer_git_source_audit
uv run python -m src.evals.harness --scenario observer_goal_source_audit
uv run python -m src.evals.harness --scenario observer_time_source_audit
uv run python -m src.evals.harness --scenario observer_delivery_gate_audit
uv run python -m src.evals.harness --scenario observer_delivery_transport_audit
uv run python -m src.evals.harness --scenario observer_daemon_ingest_audit
uv run python -m src.evals.harness --scenario mcp_test_api_audit
uv run python -m src.evals.harness --scenario skills_api_audit
uv run python -m src.evals.harness --scenario tool_policy_guardrails_behavior
uv run python -m src.evals.harness --scenario screen_repository_runtime_audit
uv run python -m src.evals.harness --scenario daily_briefing_degraded_memories_audit
uv run python -m src.evals.harness --scenario activity_digest_degraded_delivery_behavior
uv run python -m src.evals.harness --scenario activity_digest_degraded_summary_audit
uv run python -m src.evals.harness --scenario evening_review_degraded_delivery_behavior
uv run python -m src.evals.harness --scenario evening_review_degraded_inputs_audit

This runner does not call external providers. It exercises core seams with controlled mocks so the following contracts stay easy to verify after reliability changes:

  • REST and WebSocket chat behavior
  • Guardian-state synthesis, guardian world-model behavior, and the guardian feedback loop
  • Strategist learned-native-delivery continuity and intervention policy behavior
  • Observer salience/confidence/interruption-cost behavior, including calibrated high-salience delivery versus degraded-confidence defer
  • Observer refresh and delivery behavior
  • Native notification fallback, native desktop presence status, and the safe test-notification path
  • Browser-side inspect/dismiss controls for pending native notifications and the unified cross-surface continuity snapshot
  • Session consolidation behavior
  • Strategist and scheduled proactive flow behavior
  • Delegated tool-heavy workflow behavior and reusable workflow composition behavior
  • Ordered fallback routing and health-aware provider rerouting
  • Runtime-path profile preferences, wildcard runtime-path rules, and runtime-path primary and fallback overrides
  • Capability-aware runtime policy intents, weighted provider policy scoring, and strict required-capability plus cost/latency guardrail rerouting
  • Structured routing decision auditing and session-bound helper LLM trace visibility
  • Local profile routing for helpers, agents, all current scheduled jobs, delegation, and MCP specialists
  • Embedding-model, vector-store, soul-file, vault-repository, and filesystem boundary failures
  • Context-window degradation
  • Daily-briefing, activity-digest, and evening-review degraded-input fallback auditing
  • Tool/MCP policy guardrails, including secret-ref containment metadata
  • Proactive delivery transport and daemon ingest
  • Manual MCP test API auth-required/success/failure behavior
  • Skills toggle/reload audit behavior
  • Screen observation summary/cleanup boundary behavior
  • Observer source availability and time/goal summaries
  • Sandbox, browser, filesystem, and web-search timeout/empty-result auditing
  • Tool degradation behavior and audit visibility for strategist/helper paths

Frontend

cd frontend
npm install # Install dependencies (first time)
npm test # Run all tests (single run)
npm run test:watch # Watch mode (re-runs on changes)
npm run test:coverage # Run with coverage report

Frontend tests use Vitest with jsdom, configured in vite.config.ts.
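For orientation, the Vitest section of vite.config.ts is roughly shaped like the sketch below (illustrative only; the real file may add plugins, setup files, and coverage options):

import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    // jsdom gives component and store tests a browser-like DOM.
    environment: "jsdom",
  },
});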

Test Structure

Backend (backend/tests/)

File | Tests | Coverage
test_agent.py | 8 | Agent factory — tool count, model creation, context injection
test_catalog_api.py | 9 | Catalog API — browse catalog, install skills/MCP servers
test_browser_tool.py | 3 | Browser tool — blocked internal URLs plus success and timeout runtime audit logging
test_chat_api.py | 5 | REST chat endpoint — success, session continuity, errors
test_consolidation_reliability.py | 6 | Memory consolidation reliability — edge cases, retry behavior
test_consolidator.py | 5 | Memory consolidation — extract facts, soul updates, markdown fences, LLM failure
test_context_window.py | 19 | Token-aware context window — budget management, keep first/last, summarization, runtime audit logging
test_activity_digest.py | 6 | Activity digest — skip/no data, happy path, runtime path, timeout, degraded summary-input audit visibility
test_daily_briefing.py | 8 | Daily briefing — happy path, context/LLM failure, empty data, events in prompt, degraded memory-input audit visibility
test_delegation.py | 10 | Delegation architecture — orchestrator, specialist routing, depth limits
test_delivery.py | 20 | Delivery coordinator — deliver/queue/drop routing, intervention persistence, native notification fallback, budget decrement, and bundle formatting
test_embedder.py | 3 | Embedding model boundary — load success, load failure, encode failure runtime audit logging
test_e2e_conversation.py | 3 | End-to-end conversation flow — full agent interaction paths
test_evening_review.py | 10 | Evening review — happy path, no goals/messages, DB/LLM failure, date filtering, degraded-input audit visibility
test_goal_tree_integrity.py | 12 | Goal tree integrity — parent-child relationships, path consistency, cascading
test_goals_api.py | 10 | Goals HTTP endpoints — create, list, filter, tree, dashboard, update, delete
test_goals_repository.py | 21 | GoalRepository — CRUD, tree building, dashboard stats, cascading deletes
test_guardian_feedback.py | 2 | Guardian feedback repository — intervention persistence, outcome updates, explicit feedback, summary generation
test_guardian_state.py | 4 | Guardian-state synthesis — state assembly, world-model fields, confidence/salience labels, recent feedback injection, agent injection, strategist context
test_intervention_policy.py | 7 | Intervention policy — explicit act, bundle, defer, request-approval, stay-silent, high-interruption bundling, and low-salience suppression decisions
test_http_mcp_server.py | 16 | HTTP MCP server — request handling, internal URL blocking, timeout, truncation
test_insight_queue.py | 12 | Insight queue — enqueue, drain, peek, ordering, expiry
test_insight_queue_expiry.py | 8 | Insight queue expiry — TTL, cleanup, edge cases
test_mcp_api.py | 7 | MCP HTTP API endpoints — token update, manual server test auth/success/failure flows, and runtime audit logging
test_mcp_manager.py | 31 | MCP server integration — connect, disconnect, failure handling, token auth, env var resolution
test_observer_api.py | 22 | Observer API endpoints — state, context POST, daemon status, native notification list/dismiss controls, safe native test notification enqueue, native notification poll/ack, and explicit intervention feedback
test_observer_calendar.py | 4 | Calendar observer source — event parsing, empty/failure handling, runtime audit logging
test_observer_git.py | 7 | Git observer source — commit parsing, missing repo/reflog handling, runtime audit logging
test_observer_goals.py | 4 | Goals observer source — active goals summary and runtime audit logging
test_observer_manager.py | 26 | ContextManager — refresh, salience/confidence/interruption-cost derivation, state transitions, budget reset
test_screen_observation.py | 14 | Screen observation repository — create, backfill, summaries, cleanup, and runtime audit logging
test_observer_time.py | 14 | Time observer source — time-of-day, working hours, timezone, runtime audit logging
test_onboarding_edge_cases.py | 2 | Onboarding edge cases — skip, restart
test_native_tools_loader.py | 6 | Native tool auto-discovery + legacy alias compatibility — scan, expected tools, no duplicates, caching, reload, shared module state
test_profile.py | 7 | User profile + onboarding — get/create, mark/reset complete, HTTP endpoints
test_scheduler.py | 12 | Scheduler engine — job registration, start/stop, job execution
test_seed_config.py | 7 | Seed config — default MCP servers, default skills, first-run seeding
test_session.py | 23 | SessionManager — async DB-backed CRUD, history, pagination, title generation
test_sessions_api.py | 8 | Session HTTP endpoints — list, messages, update title, delete
test_settings_api.py | 6 | Settings API — interruption mode get/set
test_shell_tool.py | 9 | Shell execution — success, errors, size limits, timeout, connection errors, runtime audit logging
test_skills.py | 30 | Skills system — loading, gating, enable/disable, frontmatter parsing, API, and runtime audit logging for toggle/reload
test_soul.py | 12 | Soul file persistence — read/write, section update, ensure exists, runtime audit logging
test_specialists.py | 30 | Specialist agents — factory, tool domains, MCP specialist generation, runtime-path routing
test_strategist.py | 12 | Strategist agent — JSON parsing (valid, fenced, invalid, empty, partial), agent creation
test_timeouts.py | 5 | Execution timeouts — agent, briefing, consolidation timeouts
test_tool_registry.py | 4 | Tool metadata registry — lookup, required fields, copy safety
test_tools.py | 20 | Filesystem tools, template tool, web search, and filesystem/web-search runtime audit logging
test_workflows.py | 12 | Workflow composition — loader, gating, sequential execution, API, metadata, and delegation exposure
test_user_state.py | 57 | User state machine — derive_state, IDE deep work, should_deliver, budget, interruption modes
test_vault_api.py | 4 | Vault API — list keys, delete keys
test_vault_crypto.py | 4 | Vault crypto — Fernet encrypt/decrypt, key generation
test_vault_repository.py | 14 | Vault repository — store, get, list, delete, upsert, and runtime audit logging for success/missing/failure paths
test_vault_tools.py | 7 | Vault agent tools — store_secret, get_secret, list_secrets, delete_secret
test_vector_store.py | 3 | Vector store boundary — add success, search empty-result, add failure runtime audit logging
test_websocket.py | 3 | WebSocket — ping/pong, invalid JSON, skip onboarding

Frontend (frontend/src/)

File | Tests | Coverage
game/objects/SpeechBubble.test.ts | 25 | Speech bubble — show/hide, positioning, text wrapping, timeout, animation
stores/chatStore.test.ts | 16 | Zustand chat store — sync actions (messages, panels, visual state) + async actions (profile, sessions, onboarding)
game/lib/mapParsers.test.ts | 15 | Map parsers — magic effect pool building, animation parsing, custom properties
components/settings/DaemonStatus.test.tsx | 3 | Native Presence card — daemon status rendering, safe desktop test-notification enqueue/refresh, and browser-side dismiss controls for pending native notifications
components/cockpit/layouts.test.ts | 4 | Cockpit layout presets — default/focus/review density expectations
stores/cockpitLayoutStore.test.ts | 4 | Cockpit layout store — preset switching, inspector visibility, reset behavior
hooks/useKeyboardShortcuts.test.ts | 6 | Keyboard shortcuts — cockpit composer focus, layout switching, inspector toggle, legacy panel handling, input focus exclusion
lib/toolParser.test.ts | 12 | Tool detection — regex patterns, fallback substring match, Phase 1/2/MCP tools
game/objects/MagicEffect.test.ts | 12 | Magic effects — pool cycling, spawn, fade, destroy lifecycle
stores/questStore.test.ts | 10 | Zustand quest store — goal CRUD, tree, dashboard, filters, refresh
lib/animationStateMachine.test.ts | 10 | Animation targets — tool→position mapping, facing direction, idle/thinking states
hooks/useWebSocket.test.ts | 6 | WebSocket hook — connect, reconnect, message dispatch, ping
config/constants.test.ts | 4 | Constant integrity — tool count, position ranges, scene keys, waypoint count

Writing New Tests

Backend: Using the async_db Fixture

All database-dependent tests use the shared async_db fixture from conftest.py. It creates an in-memory SQLite database and patches get_session across all modules.

from src.agent.session import SessionManager

async def test_example(async_db):
    sm = SessionManager()
    session = await sm.get_or_create("test-id")
    assert session.title == "New Conversation"
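
For orientation, the fixture is roughly shaped like the sketch below (illustrative only; the real implementation lives in backend/tests/conftest.py, and the patched module path is a placeholder):

import pytest_asyncio
from sqlalchemy.ext.asyncio import async_sessionmaker, create_async_engine


@pytest_asyncio.fixture
async def async_db(monkeypatch):
    # In-memory SQLite engine; assumes the aiosqlite driver is installed.
    engine = create_async_engine("sqlite+aiosqlite:///:memory:")
    factory = async_sessionmaker(engine, expire_on_commit=False)

    async def get_session():
        async with factory() as session:
            yield session

    # The real conftest.py also creates the schema and patches get_session in
    # every module that imports it; "src.db.get_session" is a placeholder path.
    monkeypatch.setattr("src.db.get_session", get_session)

    yield factory
    await engine.dispose()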

For HTTP endpoint tests, use the client fixture (which depends on async_db):

async def test_list_goals(client):
    res = await client.get("/api/goals")
    assert res.status_code == 200
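
The client fixture itself can be as small as the following sketch (assuming FastAPI plus httpx; the app import path is a placeholder):

import pytest_asyncio
from httpx import ASGITransport, AsyncClient


@pytest_asyncio.fixture
async def client(async_db):
    from src.main import app  # placeholder import path

    # Route requests straight into the ASGI app, no network needed.
    transport = ASGITransport(app=app)
    async with AsyncClient(transport=transport, base_url="http://test") as c:
        yield c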

Frontend: Mocking Fetch

Store tests mock globalThis.fetch and reset store state between tests:

import { vi, beforeEach } from "vitest";
import { useChatStore } from "./chatStore";

const mockFetch = vi.fn();
globalThis.fetch = mockFetch;

beforeEach(() => {
  useChatStore.setState({ messages: [], sessionId: null });
  vi.clearAllMocks();
});
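
With fetch mocked this way, an async-action test resolves the mock and then asserts on store state. The action name, endpoint, and response shape below are assumptions for illustration, not the real chatStore API:

import { expect, it } from "vitest";

it("loads sessions from the API", async () => {
  // Hypothetical action and endpoint; adjust to the real store methods.
  mockFetch.mockResolvedValueOnce({
    ok: true,
    json: async () => [{ id: "abc", title: "New Conversation" }],
  });

  await useChatStore.getState().fetchSessions();

  expect(mockFetch).toHaveBeenCalledWith("/api/sessions");
});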

What Is NOT Tested

These areas are intentionally excluded from the test suite:

  • Legacy Phaser game objects (VillageScene, AgentSprite, UserSprite, SpeechBubble) — historical code that requires WebGL context and fragile mocking
  • Legacy EventBus.ts — single-line Phaser EventEmitter wrapper tied to the retired game shell
  • LanceDB vector_store.py — requires a real embedding model to be loaded
  • Full WS message streaming — complex sync/async interaction with agent streaming; basic WS tests cover ping, error handling, and skip_onboarding

CI/CD

Tests run automatically on pushes to develop and main, plus pull requests targeting either branch, via GitHub Actions (.github/workflows/test.yml).

Two parallel jobs:

  • backend-tests: Ubuntu, Python 3.12, uv sync --group dev, uv run pytest -v
  • frontend-tests: Ubuntu, Node 20, npm ci, npm test

Redundant runs are cancelled automatically via concurrency groups.
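
For orientation, the workflow is roughly shaped like the sketch below (illustrative only; action versions and step details in the real .github/workflows/test.yml may differ):

name: test

on:
  push:
    branches: [develop, main]
  pull_request:
    branches: [develop, main]

# Cancel superseded runs for the same ref.
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  backend-tests:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: backend
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - uses: astral-sh/setup-uv@v5
      - run: uv sync --group dev
      - run: uv run pytest -v

  frontend-tests:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: frontend
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm test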