Workstream 03: Runtime Reliability

Status On develop

  • Workstream 03 is only partially shipped on develop.

Paired Research

Shipped On develop

  • degraded-mode fallback in the context window when tiktoken cannot load offline
  • centralized provider-agnostic LLM runtime settings
  • ordered fallback-chain routing across shared completion and agent-model paths
  • health-aware cooldown rerouting across shared completion and agent-model paths
  • runtime-path-specific profile preference chains across shared completion and agent-model paths
  • runtime-path-specific primary model overrides
  • runtime-path-specific fallback-chain overrides
  • wildcard runtime-path routing rules, with exact-path overrides taking precedence
  • first-class local runtime profile for helper, all current scheduled completion jobs, core agent, delegation, and connected MCP-specialist paths
  • strict runtime-path provider safeguards for required capability intents plus cost, latency, task-class, and budget guardrails, with degrade-open behavior when no compliant target exists
  • timeout-safe audit visibility into primary-vs-fallback completion and agent-model behavior
  • session-bound LLM runtime traces for helper and agent flows, including request-id visibility for routing and fallback decisions
  • fallback-capable model wrappers for chat, onboarding, strategist, and specialists
  • repeatable runtime eval harness for guardian, core chat behavior, observer refresh and delivery behavior, session consolidation behavior, tool/MCP policy guardrails, proactive flow behavior, delegated workflow behavior, workflow composition behavior, threaded workflow recovery, capability repair, observer, storage, and integration seam checks
  • runtime audit coverage across chat, WebSocket, scheduler jobs including daily-briefing, activity-digest, and evening-review degraded-input fallbacks, strategist helpers, proactive delivery transport, MCP lifecycle and manual test API paths, skills toggle/reload paths, observer lifecycle plus screen observation summary/cleanup boundaries, embedding, vector store, soul file, vault repository, filesystem, browser, sandbox, and web search paths
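As an illustration of how several of the shipped routing items fit together, here is a minimal sketch of wildcard runtime-path rules with exact-path precedence, combined with an ordered fallback chain that moves cooling-down targets to the back. It is not the project's implementation: `PathRule`, `resolve_rule`, `CooldownTracker`, `ordered_targets`, and the 60-second default are hypothetical names and values chosen for this example.

```python
import fnmatch
import time
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class PathRule:
    # Hypothetical rule shape: an exact runtime path or a wildcard pattern.
    pattern: str
    primary: str
    fallbacks: list[str] = field(default_factory=list)


def resolve_rule(path: str, rules: list[PathRule]) -> Optional[PathRule]:
    """Return the rule for a runtime path; an exact match beats any wildcard."""
    for rule in rules:
        if rule.pattern == path:
            return rule
    for rule in rules:
        if fnmatch.fnmatchcase(path, rule.pattern):
            return rule
    return None


class CooldownTracker:
    """Remembers recently failed targets so routing can deprioritize them."""

    def __init__(self, cooldown_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._failed_at: dict[str, float] = {}

    def mark_failed(self, target: str) -> None:
        self._failed_at[target] = self._clock()

    def is_cooling(self, target: str) -> bool:
        failed = self._failed_at.get(target)
        return failed is not None and self._clock() - failed < self._cooldown


def ordered_targets(rule: PathRule, health: CooldownTracker) -> list[str]:
    """Primary first, then fallbacks in order; cooling targets are tried last."""
    chain = [rule.primary, *rule.fallbacks]
    healthy = [t for t in chain if not health.is_cooling(t)]
    cooling = [t for t in chain if health.is_cooling(t)]
    return healthy + cooling
```

Keeping cooling targets at the end of the list, rather than dropping them, means a path still has something to try when every configured target has failed recently. Whether the real router drops or reorders such targets is not stated here, so treat that choice as an assumption.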

Working On Now

  • Runtime Reliability is no longer the repo-wide active focus now that provider-policy-explainability-and-budgets-v3 and guardian-behavioral-evals-v9 have shipped
  • the previous runtime-focused queue is fully shipped on develop
  • provider-policy-safeguards-v3, provider-policy-explainability-and-budgets-v3, and guardian-behavioral-evals-v9 are now represented in the shipped batch, including richer routing reason surfaces, budget/task-class guardrails, and deeper deterministic proof for bootstrap plus branching behavior
  • richer provider policy is still open on develop, but the remaining work is simulation-grade planning, budget steering, and cross-surface legibility rather than first-pass hard requirements and tier guardrails
  • the next runtime-facing queue item is provider-policy-simulation-and-budget-planning-v1

Still To Do On develop

  • deepen provider selection policy beyond the shipped weighted scoring, required capability safeguards, task/budget guardrails, path patterns, explicit overrides, ordered fallbacks, cooldown rerouting, and first operator-facing routing summaries
  • expand eval coverage beyond the shipped REST, WebSocket, observer refresh, delivery policy, strategist-learning continuity, consolidation, proactive, tool-policy guardrail, threaded workflow recovery, capability repair/bootstrap, delegated workflow, and workflow-composition behavioral contracts

Completed PR Sequence

This sequence is the finished Runtime Reliability execution order on develop.

  1. behavioral-evals-core-chat: add behavioral eval contracts for REST chat and WebSocket chat, including fallback, timeout, approval, and audit expectations
  2. behavioral-evals-proactive-flows: add behavioral evals for strategist tick, daily briefing, evening review, and activity digest with expected degraded behavior and delivery outcomes
  3. behavioral-evals-tool-heavy-flow: add one delegated tool-heavy workflow contract covering routing, tool execution, audit, and degraded or failure handling
  4. provider-policy-capabilities: add provider capability metadata and runtime-path policy intents such as fast, cheap, reasoning, and local_first
  5. provider-routing-decision-audit: log structured routing decisions that explain the chosen target, rejected targets, and rejection reasons
  6. local-routing-gap-closure: verify the remaining worthwhile local-routing surface across onboarding, strategist, and all scheduled completion jobs so the queue does not stay open on assumed gaps
  7. incident-trace-gap-closure: bind session-aware helper and agent LLM runtime events into the same audit trace so target choice, reroutes, and fallback outcomes can be explained for one session incident
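To make items 4 and 5 of this sequence concrete, the sketch below shows one possible shape for a structured routing decision: required capability intents are checked per target, rejections are recorded with reasons, and the result is serialized as one audit line. All names (`Target`, `RoutingDecision`, `decide`, `to_audit_line`) and the record fields are hypothetical. The degrade-open step, which falls back to the first configured target when nothing is compliant, is an assumption about what "degrade-open" means.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Target:
    name: str
    capabilities: set[str]  # e.g. {"fast", "cheap", "reasoning", "local_first"}


@dataclass
class RoutingDecision:
    runtime_path: str
    request_id: str
    chosen: Optional[str] = None
    degraded_open: bool = False
    rejected: list[dict] = field(default_factory=list)


def decide(runtime_path: str, request_id: str,
           required: set[str], targets: list[Target]) -> RoutingDecision:
    """Pick the first compliant target and record why the others were rejected."""
    decision = RoutingDecision(runtime_path, request_id)
    for target in targets:
        missing = sorted(required - target.capabilities)
        if missing:
            decision.rejected.append(
                {"target": target.name, "reason": "missing_capabilities",
                 "missing": missing})
        elif decision.chosen is None:
            decision.chosen = target.name
        else:
            decision.rejected.append(
                {"target": target.name, "reason": "lower_priority"})
    if decision.chosen is None and targets:
        # Assumed degrade-open behavior: serve from the first configured
        # target instead of failing the request outright.
        decision.chosen = targets[0].name
        decision.degraded_open = True
    return decision


def to_audit_line(decision: RoutingDecision) -> str:
    """Serialize the decision as a single JSON line for an audit log."""
    return json.dumps(asdict(decision), sort_keys=True)
```

Carrying the request id in every record is what lets one session incident be reconstructed from the log, in the spirit of item 7.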

Next Most Valuable PR Sequence

This is the next ordered Runtime Reliability slice after the completed incident-trace queue. The repo-wide cross-workstream queue lives in 00-master-roadmap.md.

  1. provider-policy-scoring: deepen provider routing with weighted policy scoring, explicit capability preferences, and clearer target ranking so runtime-path selection is stronger than simple preference chains and cooldown skips
  2. behavioral-evals-guardian-flows: expand behavioral eval coverage beyond chat and scheduler seams into observer refresh, consolidation, proactive delivery, and policy-mode guardrails so broader guardian behavior is regression-tested
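The weighted policy scoring described in item 1 could look roughly like the sketch below. This is a hedged illustration, not the shipped scorer: the `Candidate` fields, the normalization to [0, 1], the weight keys, and the tie-break by name are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    cost: float              # assumed normalized to [0, 1]; lower is better
    latency: float           # assumed normalized to [0, 1]; lower is better
    capability_match: float  # assumed normalized to [0, 1]; higher is better


def score(candidate: Candidate, weights: dict[str, float]) -> float:
    """Weighted sum in which cheaper, faster, better-matching targets score higher."""
    return (weights["capability"] * candidate.capability_match
            + weights["cost"] * (1.0 - candidate.cost)
            + weights["latency"] * (1.0 - candidate.latency))


def rank(candidates: list[Candidate], weights: dict[str, float]) -> list[Candidate]:
    """Sort best-first; ties break on name so the ranking is deterministic."""
    return sorted(candidates, key=lambda c: (-score(c, weights), c.name))
```

With one weight profile per policy intent (for example a cost-heavy profile for `cheap` and a capability-heavy profile for `reasoning`), the same candidate list can rank differently per runtime path, which a plain preference chain cannot express.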

Non-Goals

  • pretending the runtime is done because the fallback baseline works
  • live-provider eval dependence for every reliability check

Acceptance Checklist

  • provider failure with configured fallbacks does not collapse the entire chat path
  • runtime paths can force distinct primary and fallback routing without changing the global baseline
  • dynamic runtime paths can inherit wildcard routing rules without losing exact-path control
  • a local or non-OpenRouter path is demonstrably possible across helper, all current scheduled completion jobs, core agent, delegation, and connected MCP-specialist flows
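  • key flows are observable and debuggable: routing and fallback decisions can be followed through runtime audit events and session-bound traces with request ids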
  • the project has broad repeatable eval coverage for core guardian behavior beyond the shipped REST, WebSocket, observer refresh, delivery policy, consolidation, proactive, tool-policy guardrail, threaded workflow recovery, capability repair/bootstrap, delegated workflow, and workflow-composition behavioral contracts
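The first checklist item can be verified without any live provider, in line with the non-goal above. The following is a minimal sketch of such a deterministic check using stub providers. `ProviderError`, the stub functions, `complete_with_fallbacks`, and `check_fallback_keeps_chat_alive` are hypothetical names and do not come from the project's eval harness.

```python
from typing import Callable


class ProviderError(Exception):
    """Raised by a stub provider to simulate an outage."""


def failing_provider(prompt: str) -> str:
    raise ProviderError("simulated outage")


def echo_provider(prompt: str) -> str:
    return f"echo: {prompt}"


def complete_with_fallbacks(
    prompt: str, chain: list[tuple[str, Callable[[str], str]]]
) -> tuple[str, list[dict]]:
    """Try each (name, provider) pair in order; return the reply and an attempt log."""
    attempts: list[dict] = []
    for name, provider in chain:
        try:
            reply = provider(prompt)
        except ProviderError as exc:
            attempts.append({"target": name, "outcome": "error", "detail": str(exc)})
            continue
        attempts.append({"target": name, "outcome": "ok"})
        return reply, attempts
    raise ProviderError("all targets in the fallback chain failed")


def check_fallback_keeps_chat_alive() -> list[dict]:
    """Behavioral contract: a primary failure must not collapse the chat path."""
    chain = [("primary", failing_provider), ("fallback", echo_provider)]
    reply, attempts = complete_with_fallbacks("hello", chain)
    assert reply == "echo: hello"
    assert [a["target"] for a in attempts] == ["primary", "fallback"]
    assert attempts[0]["outcome"] == "error"
    return attempts
```

Returning the attempt log alongside the reply lets the same check also assert the audit expectation, namely that the failed primary attempt is visible and not silently swallowed.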