09. Reference Systems And Evidence
Purpose
This file defines how competitive and architectural claims are allowed to enter the Seraph docs.
The rule is simple: if a claim affects roadmap priority, product positioning, or a “we are better/worse than X” statement, it needs a source.
Allowed Evidence Types
Use these in descending order of strength:
- direct inspection of the Seraph repo
- official reference-system docs and official repos
- primary research papers and arXiv preprints
- clearly labeled vendor/product docs for interface references
If a claim cannot be supported by one of those, it should be marked Unknown.
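The ordering and fallback rule above can be sketched as a small helper. This is an illustrative sketch, not part of the Seraph codebase; the names `EvidenceStrength` and `classify_claim` are hypothetical.

```python
from enum import IntEnum

class EvidenceStrength(IntEnum):
    """Allowed evidence types; lower value means stronger evidence."""
    REPO_INSPECTION = 1   # direct inspection of the Seraph repo
    OFFICIAL_DOCS = 2     # official reference-system docs and repos
    PRIMARY_RESEARCH = 3  # primary research papers and arXiv preprints
    VENDOR_DOCS = 4       # clearly labeled vendor/product docs

def classify_claim(evidence: list[EvidenceStrength]) -> str:
    """A claim with no allowed evidence type must be marked Unknown."""
    if not evidence:
        return "Unknown"
    return f"Supported (strongest: {min(evidence).name})"
```

With this shape, a claim backed by both vendor docs and repo inspection reports `REPO_INSPECTION` as its strongest basis, and an unsupported claim falls through to `Unknown` rather than silently passing.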
Disallowed Shortcuts
- social-media lore as product truth
- benchmark claims without a linked source
- “everybody knows” comparisons
- smoothing uncertainty into confident prose
Comparison Method
When comparing Seraph to OpenClaw, Hermes, or IronClaw:
- compare against Seraph’s shipped surface on develop
- compare only on explicit benchmark axes
- label each axis Ahead, At Par, Behind, or Unknown
- give a short why plus one or more source links
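The labeling rule above can be made mechanical with a small record type that rejects entries violating it. This is a sketch under the assumptions that `AxisComparison` and its field names are hypothetical and that any non-Unknown label must carry at least one source link, as the method requires.

```python
from dataclasses import dataclass, field

VALID_LABELS = {"Ahead", "At Par", "Behind", "Unknown"}

@dataclass
class AxisComparison:
    axis: str        # explicit benchmark axis being compared
    label: str       # one of Ahead, At Par, Behind, Unknown
    rationale: str   # the short "why"
    sources: list[str] = field(default_factory=list)

    def validate(self) -> None:
        if self.label not in VALID_LABELS:
            raise ValueError(f"invalid label: {self.label!r}")
        # every non-Unknown label needs at least one source link
        if self.label != "Unknown" and not self.sources:
            raise ValueError(f"label {self.label!r} requires a source link")
```

An `Unknown` entry is valid with no sources, which matches the rule that uncertainty is recorded rather than smoothed over; an `Ahead` entry with no sources fails validation.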
Verified Source Registry
Local repo truth
- Seraph repo under /Users/bigcube/Desktop/repos/seraph
- local MAAS docs under /Users/bigcube/Desktop/repos/maas/docs
Official reference-system docs reviewed for this pass
Verified interface references
Primary research sources used in this pass
- MemoryBank
- MemGPT
- Generative Agents
- LoCoMo
- LongMemEval
- HACO
- AgentBench
- SWE-bench
- OSWorld
- τ-bench
- GAIA
- Mixed-initiative UI principles
- Attention-sensitive alerting
Known Unknowns
The current official materials for OpenClaw, Hermes, and IronClaw do not provide enough evidence to score every axis confidently.
Known areas where caution is required:
- intervention quality measured over time
- published runtime-eval rigor beyond tests/diagnostics
- real-world outcome quality for self-healing and autonomous routines
Those should stay Unknown until stronger official evidence appears.
Implication For The Docs
The research tree should now do two things at once:
- define the target Seraph product
- show exactly where that target comes from and where the evidence is still thin
That keeps the roadmap anchored to verifiable gaps rather than taste alone.
The implementation tree should mirror that work explicitly:
- research evidence rules should have an implementation-side docs contract mirror
- benchmark logic in research should have a benchmark-status mirror in implementation
- superiority-program logic in research should have a delivery mirror in implementation
- the live PR queue should live only in implementation, not as a stale duplicate in research