Workstream 03: Runtime Reliability

Status On `develop`

Workstream 03 is only partially shipped on develop.

Paired Research

primary design doc: 03. Runtime And Reliability

Shipped On `develop`

Working On Now

Runtime Reliability is no longer the repo-wide active focus now that provider-policy-explainability-and-budgets-v3 and guardian-behavioral-evals-v9 have shipped
the previous runtime-focused queue is fully shipped on develop
provider-policy-safeguards-v3, provider-policy-explainability-and-budgets-v3, and guardian-behavioral-evals-v9 are all in the shipped batch, which adds richer routing reason surfaces, budget/task-class guardrails, and deeper deterministic proof for bootstrap plus branching behavior
richer provider policy work still remains on develop, but what is left is simulation-grade planning, budget steering, and cross-surface legibility rather than first-pass hard requirements and tier guardrails
the next runtime-facing queue item is provider-policy-simulation-and-budget-planning-v1

Still To Do On `develop`

deepen provider selection policy beyond the shipped weighted scoring, required capability safeguards, task/budget guardrails, path patterns, explicit overrides, ordered fallbacks, cooldown rerouting, and first operator-facing routing summaries
expand eval coverage beyond the shipped REST, WebSocket, observer refresh, delivery policy, strategist-learning continuity, consolidation, proactive, tool-policy guardrail, threaded workflow recovery, capability repair/bootstrap, delegated workflow, and workflow-composition behavioral contracts

Completed PR Sequence

This sequence is the finished Runtime Reliability execution order on develop.

behavioral-evals-core-chat: add behavioral eval contracts for REST chat and WebSocket chat, including fallback, timeout, approval, and audit expectations
behavioral-evals-proactive-flows: add behavioral evals for strategist tick, daily briefing, evening review, and activity digest with expected degraded behavior and delivery outcomes
behavioral-evals-tool-heavy-flow: add one delegated tool-heavy workflow contract covering routing, tool execution, audit, and degraded or failure handling
provider-policy-capabilities: add provider capability metadata and runtime-path policy intents such as fast, cheap, reasoning, and local_first
provider-routing-decision-audit: log structured routing decisions that explain the chosen target, rejected targets, and rejection reasons
local-routing-gap-closure: verify the remaining worthwhile local-routing surface across onboarding, strategist, and all scheduled completion jobs so the queue does not stay open on assumed gaps
incident-trace-gap-closure: bind session-aware helper and agent LLM runtime events into the same audit trace so target choice, reroutes, and fallback outcomes can be explained for one session incident
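
As an illustration of the provider-routing-decision-audit item above, a structured routing decision could be shaped roughly as follows. This is a minimal sketch, not the repo's actual schema: the class names, field names, target names, and the `cooldown_active` reason string are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RejectedTarget:
    # Hypothetical shape: one rejected target plus a machine-readable reason.
    target: str
    reason: str


@dataclass
class RoutingDecision:
    # Hypothetical audit record; the session_id ties the decision to one incident trace.
    session_id: str
    runtime_path: str
    chosen_target: str
    rejected: list[RejectedTarget] = field(default_factory=list)

    def to_audit_line(self) -> str:
        # Serialize with sorted keys so audit lines are stable and easy to diff.
        return json.dumps(asdict(self), sort_keys=True)


decision = RoutingDecision(
    session_id="sess-1",
    runtime_path="chat.rest",
    chosen_target="local/small-model",
    rejected=[RejectedTarget("remote/large-model", "cooldown_active")],
)
audit_line = decision.to_audit_line()
```

Keeping the rejected targets and their reasons in the same record as the chosen target is what lets one log line answer "why this provider, and why not the others" for a single session.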

Next Most Valuable PR Sequence

This is the next ordered Runtime Reliability slice after the completed incident-trace queue. The repo-wide cross-workstream queue lives in 00-master-roadmap.md.

provider-policy-scoring: deepen provider routing with weighted policy scoring, explicit capability preferences, and clearer target ranking so runtime-path selection is stronger than simple preference chains and cooldown skips
behavioral-evals-guardian-flows: expand behavioral eval coverage beyond chat and scheduler seams into observer refresh, consolidation, proactive delivery, and policy-mode guardrails so broader guardian behavior is regression-tested
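
To make the provider-policy-scoring item concrete, the sketch below ranks targets by a weighted score after filtering on required capabilities and cooldown. It is an assumption-laden illustration only: `Target`, `rank_targets`, the trait values, and the weights are hypothetical, and only the intent names (fast, cheap, reasoning) come from this roadmap.

```python
from dataclasses import dataclass


@dataclass
class Target:
    # Hypothetical provider target description.
    name: str
    capabilities: frozenset
    traits: dict  # intent name -> strength in [0, 1]
    cooling_down: bool = False


def rank_targets(targets, required, weights):
    """Return target names ordered best-first.

    Targets that are cooling down or lack a required capability are
    excluded outright; the rest are ordered by weighted trait score,
    with the name as a deterministic tie-breaker.
    """
    ranked = []
    for t in targets:
        if t.cooling_down or not required <= t.capabilities:
            continue
        score = sum(weights.get(k, 0.0) * v for k, v in t.traits.items())
        ranked.append((score, t.name))
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in ranked]


targets = [
    Target("local-a", frozenset({"tools"}),
           {"fast": 0.9, "cheap": 1.0, "reasoning": 0.3}),
    Target("remote-b", frozenset({"tools", "vision"}),
           {"fast": 0.5, "cheap": 0.2, "reasoning": 0.9}),
    Target("remote-c", frozenset({"tools"}),
           {"fast": 0.7, "cheap": 0.6, "reasoning": 0.8}, cooling_down=True),
]

# A reasoning-heavy path weights reasoning above cost.
reasoning_rank = rank_targets(targets, frozenset({"tools"}),
                              {"reasoning": 2.0, "cheap": 0.5})
vision_rank = rank_targets(targets, frozenset({"vision"}), {"fast": 1.0})
```

Unlike a simple preference chain, the ordering here changes with the weights supplied by the runtime path, while hard requirements and cooldown remain filters rather than score inputs.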

Non-Goals

pretending the runtime is done because the fallback baseline works
depending on live providers for every reliability eval

Acceptance Checklist

provider failure with configured fallbacks does not collapse the entire chat path
runtime paths can force distinct primary and fallback routing without changing the global baseline
dynamic runtime paths can inherit wildcard routing rules without losing exact-path control
a local or non-OpenRouter path is demonstrably possible across helper, all current scheduled completion jobs, core agent, delegation, and connected MCP-specialist flows
key flows are observable and debuggable from the audit trace, including target choice, reroutes, and fallback outcomes
the project has broad repeatable eval coverage for core guardian behavior beyond the shipped REST, WebSocket, observer refresh, delivery policy, consolidation, proactive, tool-policy guardrail, threaded workflow recovery, capability repair/bootstrap, delegated workflow, and workflow-composition behavioral contracts
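
The first three checklist items can be exercised deterministically without a live provider. The sketch below is one possible way to do that, assuming routing rules are keyed by dotted runtime-path names and matched with shell-style wildcards; the rule format, path names, target names, and both helper functions are hypothetical.

```python
from fnmatch import fnmatchcase


def resolve_route(path, rules, baseline):
    """Pick the route for a runtime path.

    An exact rule wins, then the longest matching wildcard pattern,
    then the global baseline, which is never mutated.
    """
    if path in rules:
        return rules[path]
    matches = [pattern for pattern in rules if fnmatchcase(path, pattern)]
    if matches:
        return rules[max(matches, key=len)]
    return baseline


def complete_with_fallback(route, call):
    """Try the primary target, then each fallback in order."""
    errors = []
    for target in [route["primary"], *route["fallbacks"]]:
        try:
            return target, call(target)
        except RuntimeError as exc:
            errors.append((target, str(exc)))
    raise RuntimeError(f"all targets failed: {errors}")


baseline = {"primary": "remote-b", "fallbacks": []}
rules = {
    "scheduler.*": {"primary": "local-a", "fallbacks": ["remote-b"]},
    "scheduler.daily_briefing": {"primary": "remote-b", "fallbacks": ["local-a"]},
}


def fake_call(target):
    # Stand-in for a provider call: the local target is simulated as down.
    if target == "local-a":
        raise RuntimeError("provider unavailable")
    return "ok"


wildcard_route = resolve_route("scheduler.evening_review", rules, baseline)
exact_route = resolve_route("scheduler.daily_briefing", rules, baseline)
result = complete_with_fallback(wildcard_route, fake_call)
```

Because the provider call is injected, the same check can run in a repeatable eval suite, which fits the non-goal of not depending on live providers for every reliability eval.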
