The Space Between Models Has Holes: Mapping the AI Gap

Oct 22, 2025

🌌 Summary

What if the most valuable insights in AI evaluation aren’t in model agreements, but in systematic disagreements?

This post reveals that the “gap” between large and small reasoning models contains structured, measurable intelligence about how different architectures reason. We demonstrate how to transform model disagreements from a problem into a solution, using the space between models to make tiny networks behave more like their heavyweight counterparts.

We start by assembling a high-quality corpus (10k–50k conversation turns), score it with a local LLM to create targets, and train both HRM and Tiny models under identical conditions. Then we run fresh documents through both models, collecting not just final scores but rich auxiliary signals (uncertainty, consistency, OOD detection, etc.) and visualize what these signals reveal.

The core insight: use the “space around the score” to shrink the capability gap. We show how to standardize signals into Shared Canonical Metrics (SCM), create a visual representation of each model’s knowledge, and apply lightweight calibration to make Tiny behave more like its big bro 👨‍👦 HRM where it matters.

What you’ll build with us:

  • 🧠 Design, implement, train, and use a new model: the Tiny Recursive Model
  • 🏋️‍♂️ Train HRM and Tiny on identical supervision from a local LLM
  • 🌊 Use these trained models to score a large amount of similar data
  • ⚖️ Extract, align, and standardize auxiliary diagnostics into a shared communication protocol
  • 📸 Create visual analysis: score intensity images, frontier maps, and difference maps
  • ✨ Discover information between or outside the model results, in the region neither model speaks for. This is the information in the Gap.
  • 🔧 Implement practical calibration & routing that exploit the gap structure.
  • 🤯 Use this same process on two unrelated Hugging Face models and find information there too.

Note: As the post develops it will get more technical. We have saved the difficult math, code, etc. for the appendices.

📃 Foundation Papers

Hierarchical Reasoning Model (HRM)
HRM: Hierarchical Reasoning Model. A hierarchical, multi-head reasoner with rich diagnostics and greater capacity.

We built and described this model here: Layers of thought: smarter reasoning with the Hierarchical Reasoning Model

Tiny Recursive Model (Tiny)
Tiny: Less is More: Recursive Reasoning with Tiny Networks
A compact, recursive scorer designed as a practical stand-in for HRM: faster, smaller, deployable.

As part of this post we will build, describe, and use this new model.


👁️ First Glimpse: The Gap Isn’t Empty

Before we dive into methodology, see what we mean by “structured disagreement”:

|            | Small vs Small           | HRM vs Tiny (100)  | HRM vs Tiny (500) | HRM vs Tiny (1000) |
|------------|--------------------------|--------------------|-------------------|--------------------|
| What you see | Small model disagreement | Emerging structure | Complex patterns  | Stable features    |
| Samples    | 500                      | 100                | 500               | 1000               |
| Model A    | google/gemma-2-2b-it     | HRM                | HRM               | HRM                |
| Model B    | HuggingFaceTB/SmolLM3-3B | Tiny               | Tiny              | Tiny               |

The discovery: Disagreement forms measurable structures that:

  • 🌀 Persist as loops (H₁ homology) in the difference field
  • 📈 Grow more complex with more samples
  • 🔁 Replicate across architectures (local & Hugging Face)
  • 🎯 Enable smarter routing between model capabilities

Once you can see the structure in disagreement space, you can route, calibrate, and train on the frontier.


📔 What Does It All Mean?

Same data. Same goal. Two minds. Different physics. We align them and visualize the layer in between.

Two models can reach similar answers while thinking completely differently. We take identical data and targets, run both HRM (heavyweight) and Tiny (lightweight), then ask: what lives in the space between them?

By aligning outputs and computing:

$$ \Delta = \text{HRM} - \text{Tiny} $$

Figure: the Δ difference image. Vive la différence.

  • This shows the result of two AIs that have been trained on the exact same information and are both trying to execute the same 1000 tasks.
  • If they were a perfect match, the final image would be blank. The fact that it isn’t is what we mean when we say there is information in the Gap.
  • The rest of this post describes how we detected this information.

Ok so what?

The “between-space” becomes a visible field: not featureless noise, but loops, clusters, and persistent structures that neither model shows alone.

The payoff: You’ll not only know what each model decides, but where each model can’t see and how to exploit those blind spots.


🔛 Preliminaries

These posts will help you understand some of the background for this post:

| Post | Description | What it is used for | Where it fits here |
|------|-------------|---------------------|--------------------|
| HRM | Explains and implements the HRM model | Everything | The target model in our analysis |
| ZeroModel | Explains ZeroModel | Generating VPM images and doing image analysis | The library that does the visual processing component |
| Phos | Explains Phos, a visual approach to AI | This post builds on it | We extend its concepts here |

👁️‍🗨️ What is Visual AI (short aside)

Instead of wading through logs, you see what the model is doing in real time.

🐞 Tiny (raw VPM with bug): a single bright band; only one feature has signal.
HRM (raw VPM): rich texture; activity across all features.

One glance = one diagnosis. A raw VPM tile is just turns × features reshaped to an image (rows ≈ turns, columns ≈ metric channels). In the Tiny pane, the single bright horizontal band means only one metric column was non-zero. In the HRM pane, texture appears across all features: healthy.

What actually broke (Tiny): a heteroscedastic loss term (exp(-log_var)) blew up when log_var went very negative on non-reasoning dimensions. The precision term exploded, turning a sane loss (~6.38) into 221,118.81 within two epochs before NaN, silently zeroing those channels while the Reasoning channel (more stable) survived. The picture tells the story instantly, no log spelunking required.

Plain English: one calculation went astronomically large, so everything else looked black by comparison.

Why Visual AI is ridiculously leverageful

They say a picture tells a thousand words. In our case, one tile encodes millions of signals:

  • A 2400×2400 tile packs 5.76M pixel-level values.
  • Each pixel corresponds to a concrete statistic (a score, residual, uncertainty, or latent).
  • Your eye does instant change detection: orders of magnitude more bandwidth than reading one scalar at a time.

So instead of comparing two numbers, you’re comparing two fields. The contrast between the Tiny and HRM images above makes the failure mode obvious at a glance. This is the core of Visual AI: turn numerical behavior into a visual artifact your brain can parse in milliseconds.

Craft notes (how we render these)

  • We rasterize turns × features to a fixed canvas; each channel is min–max or robust-scaled per run for comparability.
  • We keep a consistent dimension order (Reasoning, Knowledge, Clarity, Faithfulness, Coverage, …) so bands line up across runs.
  • We ship both the raw arrays and the PNGs/GIFs: pictures for humans, tensors for code (see the rendering sketch below).
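
For concreteness, here is a minimal rendering sketch. It assumes you already have a turns × features float matrix; the function name and the use of numpy/Pillow are illustrative, not the ZeroModel API.

import numpy as np
from PIL import Image

def render_vpm_tile(matrix: np.ndarray, out_path: str, scale: int = 8) -> None:
    """Rasterize a turns x features matrix into a grayscale VPM-style tile.

    matrix: float array [n_turns, n_features]; each column is one metric channel.
    Each channel is min-max scaled independently so bands stay comparable across runs.
    """
    m = matrix.astype(np.float64)
    lo = m.min(axis=0, keepdims=True)
    hi = m.max(axis=0, keepdims=True)
    norm = (m - lo) / np.maximum(hi - lo, 1e-9)            # per-channel [0, 1]
    img = (norm * 255).astype(np.uint8)                    # rows = turns, cols = features
    img = np.kron(img, np.ones((scale, scale), np.uint8))  # upscale for visibility
    Image.fromarray(img, mode="L").save(out_path)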

If you remember one thing: a single glance can replace a million log lines and it will catch classes of failures (like exploding precision) that are easy to miss when you’re scrolling numbers.


✈️ Data Defines the Journey

Our dataset is our own conversation history with foundation models: long, iterative chats aimed at building a self-improving AI. We know its character and quality, and that shaped every design choice we made (e.g., we didn’t have to lean hard on safety or faithfulness filters). Your journey may be different: if your conversations are noisier, safety-sensitive, or domain-specific, you’ll tune the pipeline differently (normalization, guardrails, faithfulness checks, caps). Key point: the dataset you start with determines the path you take. Mine yours for what matters to your goal, then adjust the knobs to fit your reality.


🧱 The Foundation: Multi-Dimensional Reasoning Scoring

Before we could compare reasoning models, we needed a consistent, structured way to evaluate reasoning itself. Traditional single-number scores collapse too much nuance: good reasoning isn’t monolithic. It has facets.

So we defined five orthogonal dimensions that collectively capture what makes reasoning good:

| Dimension | What It Measures |
|-----------|------------------|
| Reasoning | Logical structure, multi-hop soundness, handling of assumptions and edge cases |
| Knowledge | Factual accuracy, specificity, and goal-advancing utility |
| Clarity | Organization, readability, scannability, and directness |
| Faithfulness | Consistency with context/goal, absence of hallucination |
| Coverage | Completeness across key facets implied by the question |

Note: we rejected some candidate dimensions (e.g., safety) because we were dealing with data from foundation models and knew they would be very strong there.

🌌 Why these five Dimensions?

We didn’t choose these arbitrarily. Through iterative analysis of high-quality vs. low-quality reasoning patterns, we identified these as the minimal set that:

  • Covers distinct aspects of reasoning (minimal overlap)
  • Is measurable with high inter-rater agreement
  • Maps to observable improvements in downstream tasks
  • Provides actionable feedback for refinement

Most importantly: these dimensions survive the “so what?” test. When we adjust a response to score higher in one dimension, human evaluators consistently rate it as better reasoning.

This common language is what makes the gap field visible; without it, we’d be comparing apples to oranges.


🧭 The Scoring Engine: prompts that make models think, not just rate

We don’t want numbers, we want reasoned numbers. The trick isn’t “ask for 1–5.” It’s forcing the model to analyze → decide → justify, in that order, for each dimension.

Our scoring engine wraps each dimension (Reasoning, Knowledge, Clarity, Faithfulness, Coverage) with a discipline loop:

  1. Narrow role → judge a single facet only
  2. Concrete criteria → what to reward & penalize
  3. Hard output contract → two lines: rationale + score

This converts vague ideas into stable, auditable signals.

The pattern (for one dimension: Knowledge)

SYSTEM:
You are a precise knowledge judge. You evaluate whether an assistant’s answer contains useful, true,
goal-advancing knowledge for the given user question. Be strict and concise.

CONVERSATION TITLE (goal):
{{ goal_text }}

USER QUESTION:
{{ user_text }}

ASSISTANT ANSWER:
{{ assistant_text }}

{% if context %}
OPTIONAL CONTEXT (may include prior turns, files, constraints):
{{ context }}
{% endif %}

{% if preferences %}
USER PREFERENCES (if any):
{% for p in preferences %}- {{ p }}
{% endfor %}
{% endif %}

INSTRUCTIONS:
1. Judge only the answer’s factual content and utility for the goal. Focus on specificity, correctness, and actionable details relevant to the question.
2. Reward: verifiably correct facts, precise terminology, concrete steps that advance the goal.
3. Penalize: hallucinations, outdated/wrong facts, irrelevant info, hedging without checks, missing key facts.
4. If there isn’t enough information to judge, treat as low score.

SCORING RUBRIC (whole numbers):
90–100: Accurate, specific, directly useful knowledge.
75–89: Mostly accurate and helpful; minor omissions.
60–74: Some value but notable uncertainty or gaps.
40–59: Weak; generic or risky to follow.
1–39: Poor; inaccurate/misleading.
0: Non-answer.

RETURN FORMAT (exactly two lines):
rationale: <brief reason, 1–3 sentences>
score: <0–100>

Why this works

  • Narrow role prevents “dimension bleed” (e.g., docking Knowledge for writing style).
  • Reward/Penalize lists anchor judgment in observable behaviors.
  • Two-line contract forces think → commit → explain. Failures are obvious (bad format) and debuggable.

Five prompts, five lenses

We reuse the same skeleton, swapping the instruction block:

  • Reasoning → structure, multi-hop soundness, assumptions/edge cases
  • Knowledge → accuracy, specificity, goal-advancing utility
  • Clarity → organization, scannability, directness
  • Faithfulness → consistency with context/goal, no hallucination
  • Coverage → completeness of key facets implied by the question

Together they produce a multi-dimensional profile of an answer. Adjust or add dimensions to match your domain (you could run 10–20 facets the same way).


Normalization (the “quiet” requirement)

Models emit 0–100. Normalize at ingestion:

  • score01 = round(score/100.0, 4)
  • Store both (score, score01) plus the rationale
  • Keep dimension order fixed: [reasoning, knowledge, clarity, faithfulness, coverage]

This prevents scale drift, enables cross-model comparison, and makes the Δ-field analysis meaningful.
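
A minimal ingestion-time sketch of this normalization, assuming the judge scores arrive as a per-dimension dict of 0–100 integers (names are illustrative):

DIMENSIONS = ["reasoning", "knowledge", "clarity", "faithfulness", "coverage"]

def normalize_scores(raw: dict) -> dict:
    """Convert 0-100 judge scores to the canonical 0-1 record, keeping both scales."""
    record = {}
    for dim in DIMENSIONS:                       # fixed order prevents column drift
        score = int(raw[dim])
        if not 0 <= score <= 100:
            raise ValueError(f"{dim} score out of range: {score}")
        record[dim] = {"score": score, "score01": round(score / 100.0, 4)}
    record["aggregate01"] = round(
        sum(record[d]["score01"] for d in DIMENSIONS) / len(DIMENSIONS), 4
    )
    return record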


Determinism & fairness knobs

  • Temperature: 0–0.2 for scoring (stability > creativity).
  • Identical context: pass the same goal/answer/context to every model.
  • Token budget: trim to decision-critical snippets (but the same trim across models).
  • Strict parser: reject outputs that violate the two-line format; log and retry once.
  • Provenance: persist model_name, model_version, prompt hash, and raw two lines.
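
One way to keep these knobs consistent is to freeze them in a single config object passed to every judge call; the field names below are illustrative, not our actual config schema.

from dataclasses import dataclass

@dataclass(frozen=True)
class JudgeConfig:
    temperature: float = 0.1         # stability > creativity
    max_context_chars: int = 6000    # same trim applied to every model
    retries: int = 1                 # one format-only retry on parse failure
    dimensions: tuple = ("reasoning", "knowledge", "clarity", "faithfulness", "coverage")
    record_provenance: bool = True   # persist model_name, version, prompt hash, raw output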

Quick dimension blocks (drop-in text)

Use these inside the INSTRUCTIONS: section to retarget the same skeleton.

Reasoning

  • Reward: explicit steps, correct chains, addressed edge cases, stated assumptions with checks.
  • Penalize: leaps, circularity, contradictions, missing preconditions.

Clarity

  • Reward: structure (lists, headings), concise phrasing, direct answers first, minimal fluff.
  • Penalize: meandering, redundancy, buried ledes, jargon without necessity.

Faithfulness

  • Reward: citations to the provided context, explicit limits, “cannot infer” when appropriate.
  • Penalize: adding facts not in context, confident-but-wrong restatements.

Coverage

  • Reward: touches all major facets implied by the goal, flags omissions explicitly.
  • Penalize: single-facet answers to multi-facet questions, unacknowledged gaps.

Minimal parser (pseudo-Python)

def parse_two_line(output: str) -> tuple[str,int]:
    lines = [l.strip() for l in output.strip().splitlines() if l.strip()]
    assert len(lines) == 2 and lines[0].lower().startswith("rationale:") and lines[1].lower().startswith("score:")
    rationale = lines[0].split(":",1)[1].strip()
    score = int(lines[1].split(":",1)[1].strip())
    assert 0 <= score <= 100
    return rationale, score

On failure: record the raw text, one retry with a format-only fixer prompt, then mark as parser_error.
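
A sketch of that on-failure behavior, built around the parser above; call_llm and fixer_prompt are placeholders for your judge call and a format-only repair prompt.

def judge_with_retry(call_llm, prompt: str, fixer_prompt: str) -> dict:
    """Call the judge, parse strictly, retry once with a format-only fixer, else mark parser_error."""
    raw = call_llm(prompt)
    for attempt, text in enumerate([raw, None]):
        if text is None:                         # second pass: ask only for reformatting
            text = call_llm(fixer_prompt + "\n\n" + raw)
        try:
            rationale, score = parse_two_line(text)
            return {"rationale": rationale, "score": score, "raw": raw, "retries": attempt}
        except (AssertionError, ValueError):
            continue
    return {"parser_error": True, "raw": raw}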


What this buys us downstream

  • Auditable judgments (rationales) you can spot-check and learn from.
  • Comparable numbers across models and runs (0–1 scale, fixed dimension order).
  • Stronger Δ-signals: because each score was produced under the same disciplined reasoning routine.

Common pitfalls & quick fixes

  • Drift into narrative → tighten “Be strict and concise”; cap rationale to ~300 chars.
  • Dimension bleed → explicitly say “judge only this facet; ignore style/other facets.”
  • Over-lenient 80–90s → add concrete failure modes in Penalize and examples of 60–74.
  • Parser pain → keep “RETURN FORMAT (exactly two lines)” verbatim, fail fast on mismatch.

Bottom line: Scoring is not a number, it’s a procedure. Give the model a narrow job, unambiguous criteria, and a hard output contract. Do it per dimension, normalize, and log the rationale. That’s how you turn foundation models into consistent, high-signal judges and that’s what makes the gap field visible.


🔎 The Chat-Analyze Agent: how raw chats become labeled training data

👉 Full Code Here

This is the piece that turns conversations into numbers. It walks each user→assistant turn, applies our dimension-specific judges, parses the replies, and persists clean, normalized scores your downstream GAP pipeline can trust.

At a high level:

  1. Ingest turns from memory (or a provided batch).
  2. Load the right prompt for each dimension (Reasoning, Knowledge, Clarity, Faithfulness, Coverage).
  3. Call the judge LLM with (goal, user, assistant[, context]).
  4. Parse the strict 2-line response into {rationale, score}.
  5. Normalize score 0–100 → 0–1, record provenance, and persist.
  6. Emit per-turn artifacts for dashboards (e.g., show Knowledge on the chat UI).

↩️ What it returns (example)

📊 knowledge_llm Dimension Scores conversation_turn:5962
+--------------+-------+--------+----------------------------------------------+
| Dimension    | Score | Weight | Rationale (preview)                          |
+--------------+-------+--------+----------------------------------------------+
| reasoning    | 95    | 1.0    | Coherent, technically accurate explanation…  |
| knowledge    | 95    | 1.0    | Correct details of Epistemic HRM Scorer…     |
| clarity      | 98    | 1.0    | Exceptionally clear and well-structured…     |
| faithfulness | 95    | 1.0    | Matches code structure and purpose…          |
| coverage     | 95    | 1.0    | Addresses all key facets of the question…    |
| FINAL        | 95.6  |        | Weighted average                             |
+--------------+-------+--------+----------------------------------------------+

👩‍💻 Minimal pseudocode (drop-in mental model)

def run_chat_analyze(context):
    # 0) Source turns
    turns = context.get("chats") or memory.chats.list_turns_with_texts(
        min_assistant_len=50, limit=cfg.limit, order_desc=False
    )

    analyzed = []
    for row in turns:
        # Skip if already scored (unless force_rescore)
        if row.get("ai_score") and not cfg.force_rescore:
            continue

        user_txt = row.get("user_text", "").strip()
        asst_txt = row.get("assistant_text", "").strip()
        if not user_txt or not asst_txt:
            continue

        # 1) Create/lookup a 'goal' from the user turn (provenance anchor)
        goal = memory.goals.get_or_create({
            "goal_text": user_txt,
            "description": "Created by ChatAnalyzeAgent",
            "pipeline_run_id": context.get("pipeline_run_id"),
            "meta": {"source": "chat_analyze_agent"},
        })

        per_dim = {}
        for dim in cfg.dimensions:  # ["reasoning","knowledge","clarity","faithfulness","coverage"]
            # 2) Load the dimension-specific prompt template
            prompt = prompt_loader.from_file(f"{dim}.txt", cfg, {**row, **context})
            # 3) Call LLM judge
            raw = prompt_service.run_prompt(prompt, {**row, **context})
            # 4) Parse strict output
            parsed = parse_judge(raw)  # -> {"rationale": str, "score": int 0..100}
            score01 = parsed["score"] / 100.0

            per_dim[dim] = ScoreResult(
                dimension=dim,
                score=score01,
                source="knowledge_llm",
                rationale=parsed["rationale"],
                attributes={"raw_response": raw, "score100": parsed["score"]},
            )

            # Optional: write Knowledge score back to chat for GUI
            if dim == "knowledge":
                memory.chats.set_turn_ai_eval(
                    turn_id=row["id"], score=parsed["score"], rationale=parsed["rationale"]
                )

        # 5) Persist as one bundle (with provenance)
        scoring.save_bundle(
            bundle=ScoreBundle(results=per_dim),
            scorable=Scorable(id=row["assistant_message_id"], text=asst_txt, target_type=CONVERSATION_TURN),
            context={**context, "goal": goal.to_dict()},
            cfg=cfg, agent_name="chat_analyze_agent", scorer_name="knowledge_llm", source="knowledge_llm",
            model_name="llm",
        )

        analyzed.append({
            "turn_id": row["assistant_message_id"],
            "score": per_dim["knowledge"].attributes["score100"],  # convenience for UI
            "rationale": per_dim["knowledge"].rationale,
        })

    return {**context, "analyzed_turns": analyzed}

👖 How the prompt loader fits

  • Template per dimension (reasoning.txt, knowledge.txt, …) contains:

    • the system role, the input slots (goal_text, user_text, assistant_text, context, preferences), and
    • the strict two-line return format.
  • The loader simply renders the right template with the turn payload, so judges see identical structure run-to-run.

💰 Implementation tips

  • Normalization: always divide the 0–100 by 100 before any analytics (GAP, topology, viz).
  • Strict parsing: enforce the 2-line contract; fail closed (raise ParseError) and log the raw response.
  • Idempotency: use force_rescore to override; otherwise skip already-scored turns.
  • Provenance: store turn_id, assistant_message_id, goal_id, scorer_name/version, pipeline_run_id.
  • Batching & retries: add small jittered retries for judge calls; backoff on rate limits.
  • Guardrails: drop turns with missing text, or those exceeding your token/char budget for the judges.

With this agent in place, you get a clean, reproducible labeled corpus that reflects how well answers perform along our five dimensions ready for GAP analysis, model training, and Visual-AI diagnostics.


🏗️ Model Building

(HRM + Tiny) and a shared protocol so they can actually be compared

This section is about how we build the models and, more importantly, how we make their outputs commensurable.

🤷 What we’re building

  • HRM (re-intro): We reuse the HRM architecture from our earlier post (linking there for details) and treat it as our high-fidelity reference scorer.

  • Tiny (from scratch): We implement the Tiny scorer end-to-end: model → trainer → scorer. Tiny is intentionally small and fast so we can iterate quickly, ship to edge boxes, and stress the tooling.

  • A shared protocol of attributes: We extend both HRM and Tiny to report the same, standardized set of diagnostic attributes alongside their per-dimension scores. This protocol is what lets us align two very different systems without forcing one to imitate the other’s internals.

Why this matters: direct “metric mapping” (e.g., “uncertainty ≈ logvar here, ≈ entropy there”) looked neat on paper but failed in practice: different models compute and represent those notions differently. Our fix is to define the outputs we want (common semantics) and make each model emit them in a canonical format.

🤝 The shared protocol: SCM

We call the protocol SCM (Shared Canonical Metrics). Each model must produce:

  • Per-dimension, normalized scores (0–1): scm.reasoning.score01, scm.knowledge.score01, scm.clarity.score01, scm.faithfulness.score01, scm.coverage.score01

  • Aggregate: scm.aggregate01, a simple average over the five dimensions (or a documented weighted average)

  • Process diagnostics (model-agnostic definitions):

    • scm.uncertainty01: normalized predictive uncertainty (e.g., entropy normalized by log-vocab)
    • scm.ood_hat01: out-of-distribution proxy (e.g., PPL normalized to a band)
    • scm.consistency01: internal consistency proxy (e.g., sigmoid of mean logprob blended with 1−uncertainty)
    • scm.length_norm01: normalized token length to discourage score inflation by verbosity
    • scm.temp01: temperature proxy (we mirror uncertainty here so downstream plots have a stable axis)
    • scm.agree_hat01: agreement proxy (e.g., logistic transform of mean logprob)

These are not the models’ native losses or hidden states; they’re standardized readouts. Each model computes its own way to populate them, but the semantics and scale are fixed so Δ-analysis (A−B) is well-posed.
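
To make the protocol concrete, here is a minimal sketch of an SCM record as a typed container. The field names mirror the keys above; the aggregate and delta helpers are illustrative conveniences, not part of the spec.

from dataclasses import dataclass

@dataclass
class SCMRecord:
    """One evaluator's standardized readout for a single turn (all values in [0, 1])."""
    reasoning: float
    knowledge: float
    clarity: float
    faithfulness: float
    coverage: float
    uncertainty01: float
    ood_hat01: float
    consistency01: float
    length_norm01: float
    temp01: float
    agree_hat01: float

    @property
    def aggregate01(self) -> float:
        return (self.reasoning + self.knowledge + self.clarity
                + self.faithfulness + self.coverage) / 5.0

    def delta(self, other: "SCMRecord") -> dict:
        """Field-wise A - B; the raw material for the gap analysis."""
        return {k: getattr(self, k) - getattr(other, k) for k in vars(self)}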

🎨 What the section will cover

  1. HRM recap (short, with link): objectives, inductive biases, where it shines.

  2. Tiny build (full):

    • architecture choices and why (small, stable, debuggable)
    • training loop (datasets from Chat-Analyze Agent, loss design, stability guardrails)
    • evaluation harness and scorer interface
  3. Protocol integration (both models): how we compute each SCM field, normalization details, and test vectors to verify parity.

  4. Why this protocol beats ad-hoc mappings: examples where naive “map X→Y” failed, and how SCM gives clean apples-to-apples deltas.

Bottom line: We’re not trying to make Tiny “be HRM.” We’re making both models speak the same measurement language. Once they emit SCM, we can compare, visualize, and reason about the gap field with confidence.


👑 The Hierarchical Reasoning Model: A Deep Reasoning Engine

While Tiny is our fast inner loop, HRM is the deep, high-fidelity judge we lean on when we need comprehensive reasoning diagnostics.

💖 We have seen HRM before

| What | Description |
|------|-------------|
| 📚 Layers of Thought | Blog post where we go over how we integrated HRM into Stephanie |
| 🧑‍🎤 Model | Model implementation source code |
| 🏋️‍♀️ Trainer | Class used to train the HRM model |
| Scorer | The scoring implementation for the HRM class |

👯 The Dual-Recurrent Architecture

HRM’s power comes from its two coupled recurrent networks operating at different temporal scales:

# Hierarchical recurrent modules
self.l_module = RecurrentBlock(2 * self.h_dim, self.l_dim, name="LModule")
self.h_module = RecurrentBlock(self.l_dim + self.h_dim, self.h_dim, name="HModule")

This creates a processing hierarchy where:

  • Low-level (L) module performs fine-grained analysis (4 steps per cycle)
  • High-level (H) module integrates information across longer time horizons (1 step per cycle)

During evaluation, HRM executes this hierarchical processing across multiple cycles:

for cycle in range(self.n_cycles):
    # Low-level fine-grained processing (T steps)
    for step in range(self.t_steps):
        l_input = torch.cat([x_tilde, zH], dim=-1)
        zL = self.l_module(zL, l_input)
    
    # High-level abstract update (1 step per cycle)
    h_input = torch.cat([zL, zH], dim=-1)
    zH = self.h_module(zH, h_input)

This dual-frequency approach allows HRM to capture both detailed reasoning steps and higher-level patterns, making it particularly effective for complex, multi-hop reasoning tasks.

🟰 Multi-Dimensional Quality Assessment

Unlike simple scoring systems, HRM generates a rich diagnostic surface across five key reasoning dimensions we’ve defined:

| Dimension | What HRM Measures |
|-----------|-------------------|
| Reasoning | Logical structure, multi-hop soundness, handling of assumptions |
| Knowledge | Factual accuracy and specificity |
| Clarity | Organization, readability, and directness |
| Faithfulness | Consistency with context/goal, absence of hallucination |
| Coverage | Completeness across key facets |

For each dimension, HRM doesn’t just produce a score; it generates a comprehensive diagnostic profile:

# Core diagnostic heads
self.score_head       = nn.Linear(self.h_dim, 1)  # quality logits
self.logvar_head      = nn.Linear(self.h_dim, 1)  # aleatoric uncertainty
self.aux3_head        = nn.Linear(self.h_dim, 3)  # bad/medium/good aux
self.disagree_head    = nn.Linear(self.h_dim, 1)  # predicted disagreement
self.consistency_head = nn.Linear(self.h_dim, 1)  # robustness proxy
self.ood_head         = nn.Linear(self.h_dim, 1)  # OOD proxy
# (optionally) temperature / calibration head for score scaling

This produces not just a score (0-100), but also:

  • uncertainty: How confident is this score?
  • consistency_hat: How robust is the score to input variations?
  • ood_hat: Is this response out-of-distribution?
  • jacobian_fd: How sensitive is the score to tiny input changes?

🔗 How HRM populates the shared protocol (SCM)

To compare HRM with Tiny, both speak SCM (Shared Canonical Metrics). HRM fills:

| SCM field | How HRM computes it (intuition) |
|-----------|---------------------------------|
| scm.<dim>.score01 | sigmoid(calibrated(score_head)) per dimension → [0,1] |
| scm.aggregate01 | mean of the five score01 (or documented weighted mean) |
| scm.uncertainty01 | normalized entropy / uncertainty from logvar_head or logits |
| scm.consistency01 | blend of sigmoid(mean_logprob) and 1−uncertainty01 |
| scm.ood_hat01 | normalized proxy from ood_head or PPL banding |
| scm.length_norm01 | token-length min–max clamp to [0,1] |
| scm.temp01 | mirrors uncertainty (stable axis for visuals) |
| scm.agree_hat01 | agreement proxy from score logit / mean logprob |

Scale discipline: HRM produces score01 ∈ [0,1] for SCM; UI “/100” views are derived by round(100*score01) for readability.
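
A hedged sketch of that table as code. The head names (score_logit, log_var, ood_logit, disagree_logit) and the exact squashing choices are placeholders for whatever your HRM exposes; the point is the shape of the mapping.

import torch

def hrm_to_scm(heads: dict, mean_logprob: float, tok_len: int, max_len: int = 2048) -> dict:
    """Map raw HRM head outputs for ONE turn to SCM fields (all values in [0, 1]).

    `heads` is assumed to hold scalar tensors per head; names are illustrative.
    """
    score01 = torch.sigmoid(heads["score_logit"]).item()
    uncertainty01 = torch.sigmoid(heads["log_var"]).item()        # = 1 - sigmoid(-log_var)
    consistency01 = 0.5 * torch.sigmoid(torch.tensor(mean_logprob)).item() + 0.5 * (1 - uncertainty01)
    return {
        "scm.score01": score01,
        "scm.uncertainty01": uncertainty01,
        "scm.consistency01": consistency01,
        "scm.ood_hat01": torch.sigmoid(heads["ood_logit"]).item(),
        "scm.length_norm01": min(max(tok_len / max_len, 0.0), 1.0),
        "scm.temp01": uncertainty01,                               # mirrored axis for plots
        "scm.agree_hat01": 1.0 - torch.sigmoid(heads["disagree_logit"]).item(),
    }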

🏅 Why HRM Matters for This Comparison

HRM serves as our gold-standard reasoning evaluator: the deep, comprehensive system against which we measure Tiny’s lightweight approach. The key insight is that HRM and Tiny aren’t competing systems; they’re complementary layers in Stephanie’s cognitive architecture.

HRM is designed for:

  • Deep multi-step reasoning validation
  • Complex plan analysis
  • Comprehensive quality assessment

While powerful, HRM’s strength comes with computational cost, making it less suitable for:

  • Real-time refinement
  • Edge deployment
  • Continuous self-correction

This is precisely where Tiny enters the picture, not to replace HRM but to amplify it with a fast, recursive inner loop that handles the “polishing” work before responses reach users or trigger deeper HRM analysis.

By understanding HRM’s deep reasoning capabilities, we can better appreciate how Tiny’s lightweight approach captures the essential patterns that make reasoning good without the computational overhead.

❗ The Disagree Head what it is (and why it matters)

We reference a “disagree head” in diagrams; here’s the explicit meaning:

  • What it predicts: A proxy for where Tiny and HRM are likely to diverge on quality for the same input.

  • How it trains: Using past pairs where we observed an absolute delta (e.g., |score01_hrm − score01_tiny|) above a margin; we treat that as a target disagreement event. The head learns a logit → probability that such a divergence will occur again on similar patterns.

  • How we use it:

    • If sigmoid(disagree_head) is high, route the case to HRM (don’t trust Tiny alone).
    • If low, Tiny’s light-weight signal is usually safe, keeping latency down.

SCM mapping: scm.agree_hat01 = 1 − sigmoid(disagree_head) gives a standardized agreement confidence (1 = likely to agree).

Intuition: the head isn’t “reading Tiny’s mind”; it learns situations (content/process patterns in zH) where Tiny historically missed nuance that HRM caught (e.g., multi-hop edge cases, subtle factual grounding).


🧯 Stability guardrails (what we fixed)

Earlier we hit heteroscedastic loss blow-ups (exp(-log_var)) on non-reasoning heads. Fixes:

  • Softplus floor on log_var (prevents extreme negatives),
  • Gradient clipping across heads,
  • Per-head loss caps to stabilize batches.

Result: all five dimensions train cleanly; numbers stay finite.
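
A minimal sketch of the first and third fixes, assuming a per-head heteroscedastic (Gaussian-NLL-style) loss; the constants are illustrative, and gradient clipping (the second fix) lives on the optimizer step.

import torch
import torch.nn.functional as F

def stable_heteroscedastic_loss(pred: torch.Tensor, target: torch.Tensor,
                                log_var_raw: torch.Tensor,
                                floor: float = -4.0, cap: float = 50.0) -> torch.Tensor:
    """Per-head loss with a softplus floor on log_var and a per-head loss cap.

    The floor bounds exp(-log_var) above by exp(-floor), so the precision term
    can no longer explode the way the raw exp(-log_var) did.
    """
    log_var = floor + F.softplus(log_var_raw - floor)    # smooth lower bound at `floor`
    precision = torch.exp(-log_var)                      # <= exp(-floor)
    loss = 0.5 * (precision * (pred - target) ** 2 + log_var)
    return loss.clamp(max=cap).mean()                    # per-head cap stabilizes bad batches

# Gradient clipping is applied outside, on the optimizer step, e.g.:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)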

    graph TD
    %% Title and Input Section
    A[🎯 HRM Hierarchical Reasoning Model<br/>Multi-Head Architecture] --> B[📥 Input Layer]
    
    B --> C[🔮 Input Projector<br/>x → x̃]
    
    %% Hierarchical Core Processing
    C --> D{🔄 Hierarchical Core<br/>Dual Recurrent Processing}
    
    D --> E[🐢 Low-Level Module L<br/>Fine-grained Analysis<br/>T steps per cycle]
    D --> F[🐇 High-Level Module H<br/>Abstract Reasoning<br/>1 step per cycle]
    
    E --> G[🔄 State Feedback Loop]
    F --> G
    G --> D
    
    %% Final States
    D --> H[💎 Final States<br/>zL_final + zH_final]
    
    %% Primary Scoring Pathway
    H --> I[🌡️ Temperature Head<br/>τ calibration]
    H --> J[⭐ Score Head<br/>Quality logits]
    
    I --> K[🎯 Primary Score<br/>score01 ∈ 0,1<br/>Temperature calibrated]
    J --> K
    
    %% Uncertainty & Confidence Heads
    H --> L[📊 LogVar Head<br/>Aleatoric uncertainty]
    H --> M[🔢 Aux3 Head<br/>Bad/Medium/Good]
    
    L --> N[✅ Certainty01<br/>Uncertainty measure]
    M --> O[📶 Entropy Aux<br/>Confidence score]
    
    %% Agreement & Robustness Heads
    H --> P[⚔️ Disagree Head<br/>HRM-Tiny disagreement]
    H --> Q[🛡️ Consistency Head<br/>Robustness prediction]
    
    P --> R[🔄 Disagree Hat<br/>Predicted disagreement]
    Q --> S[🎯 Consistency Hat<br/>Robustness score]
    
    %% Specialized Diagnostic Heads
    H --> T[🚫 OOD Head<br/>Out-of-distribution]
    H --> U[🔁 Recon Head<br/>Input reconstruction]
    H --> V[📏 Jacobian FD<br/>Sensitivity analysis]
    
    T --> W[🎯 OOD Hat<br/>Anomaly detection]
    U --> X[📐 Recon Sim<br/>Comprehension quality]
    V --> Y[📊 Jacobian FD<br/>Input sensitivity]
    
    %% Evidence Accumulation
    H --> Z[🛑 Halt Signal<br/>Evidence accumulation]
    Z --> AA[🎲 Halt Prob<br/>Pseudo-halting]
    
    %% Styling and Grouping
    classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px,color:#0d47a1    
    classDef core fill:#fff3e0,stroke:#e65100,stroke-width:3px
    classDef primary fill:#e8f5e8,stroke:#2e7d32,stroke-width:3px
    classDef uncertainty fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef agreement fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
    classDef diagnostic fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef evidence fill:#fff8e1,stroke:#ff8f00,stroke-width:2px
    
    class A,B,C input
    class D,E,F,G core
    class I,J,K primary
    class L,M,N,O uncertainty
    class P,Q,R,S agreement
    class T,U,V,W,X,Y diagnostic
    class Z,AA evidence

    %% Legend
    subgraph Legend[📖 Legend - Head Types]
        L1[🟩 Primary Scoring] --> L2[🟥 Uncertainty & Confidence]
        L2 --> L3[🟦 Agreement & Robustness]
        L3 --> L4[🟪 Specialized Diagnostics]
        L4 --> L5[🟨 Evidence Accumulation]
    end
  

🎯 Why Tiny? And How the Gap Emerged

We didn’t set out to “prove a theory.” We saw the Tiny paper on Hugging Face, loved the idea of a compact DNN that sits between heavyweight models and applications, and knew instantly it fit Stephanie’s architecture. We implemented it because it was useful: fast, small, and easy to deploy where HRM is too heavy. That was the whole plan.

Then something clicked.

🧮 From “stacking signals” to “subtracting signals”

Our first idea was straightforward: append Tiny’s diagnostics to HRM’s output to get more information per turn. Extra signal, same data. Great.

But while wiring that up, we asked a different question: What if we subtract instead of append?

If two evaluators look at the same conversation and we align their outputs into the same schema (SCM), then the difference between them should reveal something real:

$$ \Delta(h) = \text{HRM}(h) - \text{Tiny}(h) $$

We didn’t know there was anything meaningful in Δ. We suspected there had to be; call it a Holmes-style deduction: remove everything both models agree on, and what remains is the interesting part.

✔️ The moment of confirmation: Betti numbers

We ran persistent homology on Δ and the Betti-1 counts spiked consistently. The topology of the gap wasn’t noise; it had structure (loops) that persisted under resampling. That was the “oh wow” moment. We still can’t name the exact cause of every loop, but, like electricity, you don’t have to fully explain it to measure, improve, and use it.

🎯 What we’re actually after

Our north star is self-improving AI especially learning from hallucinations rather than just suppressing them. Tiny gave us a new lens. HRM gave us depth. Δ gave us a map of where evaluators diverge in systematic ways. That map is where:

  • escalation policies get smart (when Tiny says “I’m uncertain/OOD/disagree,” hand off to HRM),
  • training data gets targeted (hotspots on the Δ field),
  • and future posts (next: “learning from hallucination”) get their raw material.

👍 Why Tiny was the right instrument (even before Δ)

  • It runs in milliseconds and can live at the edge or in inner loops.
  • It produces diagnostics (uncertainty, OOD, sensitivity, agreement) we can align with HRM via SCM.
  • It gave us a second, independently trained viewpoint on the same data exactly what you need to make Δ meaningful.

📑 How to replicate the pivot (in three steps)

  1. Align: score the same conversations with two evaluators (we used HRM and Tiny) and write both to SCM (0–1, same keys/order).
  2. Subtract: compute $\Delta = A - B$ per turn (dimension-wise).
  3. Probe: run PH (Vietoris–Rips), examine Betti-1 persistence; cluster Δ-hotspots for inspection and distillation (see the sketch below).
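
Here is the promised sketch of those three steps, assuming two aligned [n_turns, n_dims] arrays of SCM score01 values and the ripser package for Vietoris–Rips persistence (any TDA library with H₁ diagrams will do).

import numpy as np
from ripser import ripser  # pip install ripser

def gap_topology(scores_a: np.ndarray, scores_b: np.ndarray) -> dict:
    """scores_a, scores_b: aligned [n_turns, n_dims] arrays from two evaluators."""
    delta = scores_a - scores_b                    # subtract, turn- and dimension-wise
    diagrams = ripser(delta, maxdim=1)["dgms"]     # Vietoris-Rips persistence on the Delta cloud
    h1 = diagrams[1]                               # birth/death pairs for 1-dimensional loops
    persistence = h1[:, 1] - h1[:, 0]
    return {
        "betti1": int(len(h1)),                    # count of H1 features (loops)
        "max_persistence": float(persistence.max()) if len(h1) else 0.0,
        "delta": delta,                            # keep for hotspot clustering / visualization
    }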

We started by stacking, but the real insight came from subtracting. Tiny didn’t just add signal; it revealed where signal differs, and that’s the raw ore we can mine.

❌ What this isn’t

  • It’s not a claim that Tiny “beats” HRM. They’re complementary.
  • It’s not a proof of a specific cognitive mechanism. It’s a measurement pipeline that repeatedly shows structure where naïvely you’d expect noise.

🔜 What comes next

  • We’ll use Δ-hotspots to drive targeted training and hallucination learning (next post).
  • We’ll keep strengthening SCM so any new scorer can be dropped in and compared locally or on HF.

That’s the honest path: we implemented Tiny because it was obviously useful. The gap emerged when we switched from adding to subtracting, and the topology told us we’d found something worth pursuing.


🌀 The Tiny Recursion Model and our instrumented Tiny+

Tiny is a small, recursive evaluator that operates directly in embedding space. It produces a fast, multi-signal judgment about a response (quality, uncertainty, OOD, sensitivity, etc.). Tiny+ is our instrumented version of Tiny: same core, but wired into our SCM (Shared Canonical Metrics) protocol and extended with a few probes that make Tiny and HRM directly comparable for Δ-field analysis.

One-liner: Tiny is a general-purpose, parameter-efficient evaluator; Tiny+ is how we adapted it for Stephanie’s gap work.

🤔 Why Tiny exists (beyond our stack)

Tiny stands on its own. It’s useful wherever you need cheap, consistent signals about model outputs:

  • Edge / low-latency scoring on CPU
  • Cost-aware routing (decide when to call a heavier judge/model)
  • Eval/A-B pipelines as a stable, repeatable rater
  • Drift & health monitoring (OOD + sensitivity) on production traffic
  • Retrieval/reranking (blend quality, uncertainty, and stability)
  • Teacher–assistant distillation (soft targets + confidence)

We use Tiny+ inside Stephanie to align with HRM and visualize Δ, but the architecture is system-agnostic.

🏢 Architecture at a glance

    graph TD
    %% Title and Input Section
    A["🤖 Tiny Recursion Model (Tiny+)<br/>Multi-Head Recursive Architecture"] --> B[🎯 Triple Input Layer]
    B --> C[📥 Goal Embedding x]
    B --> D[💬 Response Embedding y] 
    B --> E[🌀 Initial Latent z]
    %% Recursive Fusion Core
    C --> F{🔄 Recursive Fusion Core<br/>N Recursion Steps}
    D --> F
    E --> F
    F --> G["🔗 State Fusion<br/>x ⊕ y ⊕ z → z_next"]
    G --> H[🏗️ Core Processing<br/>MLP/Attention Blocks]
    H --> I[🛑 Halting Signal<br/>Step-wise accumulation]
    I --> J[⚖️ Residual Update<br/>z = z + step_scale × z_next]
    J --> F
    %% SAE Bottleneck
    F --> K[💎 Final State z_final]
    K --> L[🧠 Sparse Autoencoder<br/>SAE Bottleneck]
    L --> M[🔍 Concept Codes c<br/>Sparse representation]
    L --> N[🎛️ Head State z_head<br/>SAE reconstruction]
    %% Primary Scoring Pathway
    N --> O[🌡️ Temperature Head<br/>τ calibration]
    N --> P[⭐ Score Head<br/>Quality logits]
    O --> Q["🎯 Primary Score<br/>s ∈ 0,1<br/>Temperature calibrated"]
    P --> Q
    %% Uncertainty & Confidence Heads
    N --> R[📊 LogVar Head<br/>Aleatoric uncertainty]
    N --> S[🔢 Aux3 Head<br/>Bad/Medium/Good]
    R --> T[✅ Certainty01<br/>Uncertainty measure]
    S --> U[📶 Entropy Aux<br/>Confidence score]
    %% Agreement & Disagreement Heads
    N --> V[⚔️ Disagree Head<br/>HRM-Tiny disagreement]
    N --> W[🤝 Agree Head<br/>Cross-model agreement]
    V --> X[🔄 Disagree Hat<br/>Predicted disagreement]
    W --> Y[🎯 Agree01<br/>Agreement probability]
    %% Robustness & Reconstruction Heads
    N --> Z[🛡️ Consistency Head<br/>Robustness prediction]
    N --> AA[🔁 Recon Head<br/>Response reconstruction]
    Z --> BB[🎯 Consistency Hat<br/>Robustness score]
    AA --> CC[📐 Recon Sim<br/>Reconstruction quality]
    %% Specialized Diagnostic Heads
    N --> DD[🚫 OOD Head<br/>Out-of-distribution]
    N --> EE[📏 Jacobian FD<br/>Sensitivity analysis]
    N --> FF[📏 Causal Sens Head<br/>Perturbation sensitivity]
    DD --> GG[🎯 OOD Hat<br/>Anomaly detection]
    EE --> HH[📊 Jacobian FD<br/>Input sensitivity]
    FF --> II[🎯 Sens01<br/>Sensitivity measure]
    %% Length Normalization
    JJ[📏 Sequence Length] --> KK[⚖️ Length Effect<br/>Normalization]
    KK --> LL[📐 Len Effect<br/>Length adjustment]
    %% Legacy Outputs
    N --> MM[📚 Classifier Head<br/>Legacy vocab logits]
    I --> NN[🛑 Halt Logits<br/>Step accumulation]
    %% Styling and Grouping
    classDef input fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    classDef core fill:#fff3e0,stroke:#e65100,stroke-width:3px
    classDef sae fill:#e8f5e8,stroke:#2e7d32,stroke-width:3px
    classDef primary fill:#e8f5e8,stroke:#2e7d32,stroke-width:3px
    classDef uncertainty fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef agreement fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
    classDef robustness fill:#fff3e0,stroke:#ff6f00,stroke-width:2px
    classDef diagnostic fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef legacy fill:#f5f5f5,stroke:#616161,stroke-width:2px
    class A,B,C,D,E input
    class F,G,H,I,J core
    class L,M,N sae
    class O,P,Q primary
    class R,S,T,U uncertainty
    class V,W,X,Y agreement
    class Z,AA,BB,CC robustness
    class DD,EE,FF,GG,HH,II diagnostic
    class JJ,KK,LL,MM,NN legacy

    %% Legend
    subgraph Legend[📖 Legend - Head Types]
        L1[🟩 Primary Scoring] --> L2[🟥 Uncertainty & Confidence]
        L2 --> L3[🟦 Agreement & Disagreement]
        L3 --> L4[🟧 Robustness & Reconstruction]
        L4 --> L5[🟪 Specialized Diagnostics]
        L5 --> L6[⬜ Legacy & Utilities]
    end
  

Inputs

  • x: goal/condition embedding [B, D]
  • y: response embedding [B, D]
  • z: latent state (init zeros) [B, D]

Core loop (n recursions)

  1. Fuse [x, y, z] → z' via Linear + tanh
  2. Process with a compact stack (MLP or MHA blocks)
  3. Update z ← z + step_scale * z' (residual step)
  4. Track halting logits (max over steps)

Bottleneck

  • SAE (Sparse Autoencoder) over z_final → sparse codes c
  • Reconstruct to z_head for all output heads (acts like a concept layer)

Heads (multi-task)

  • Score, (log) uncertainty, OOD, consistency/robustness, temperature proxy
  • Agreement/disagreement (for parity), finite-difference sensitivity
  • Reconstruction similarity (for concept fidelity)
# Core residual block (MLP) or attention variant
class TinyBlock(nn.Module): ...
class TinyBlockAttn(nn.Module): ...

# Recursion (sketch)
fused   = torch.cat([x, y, z], -1)
z_next  = core(torch.tanh(z_proj(fused)))  # project, saturate, then core blocks
z       = z + step_scale * z_next  # repeat n times

# SAE bottleneck → heads
c       = SAE_enc(z_final)               # sparse codes
z_head  = z_final + SAE_dec(c)           # head input
# Heads read from z_head (score/logvar/ood/consistency/etc.)

🚧 Numerical safety & calibration (baked in)

  • Aleatoric clamp: log_var ∈ [−5, 5] (prevents precision explosions)
  • Temperature calibration: tau = 0.5 + 0.5 * softplus(·), score01 = σ(score_logit / tau) (stable, calibrated)
  • Sensitivity proxy: finite-difference on a normalized perturbation, clipped
  • Length proxy: length_norm01 = (tanh(len/L)+1)/2, which avoids “long = always bad”

These keep Tiny stable across 4k–8k token regimes and varied difficulty.

🗣️ Heads → SCM (so Tiny and HRM speak the same language)

Tiny+ writes aligned telemetry into SCM, so Δ = HRM − Tiny is meaningful:

| SCM key | Tiny(+) source (range) | Meaning |
|---------|------------------------|---------|
| scm.<dim>.score01 | σ(score_logit / tau) (0..1) | Per-dimension quality (copy to each dim you evaluate) |
| scm.aggregate01 | mean of per-dim score01 | Overall quality |
| scm.uncertainty01 | 1 − σ(−log_var) (0..1) | Aleatoric uncertainty |
| scm.consistency01 | σ(consistency_logit) (0..1) | Robustness to masking/perturbations |
| scm.ood_hat01 | σ(ood_logit) (0..1) | Out-of-distribution proxy |
| scm.temp01 | σ(tau_raw) (0..1) | Temperature/entropy proxy (alignment key) |
| scm.jacobian_fd | clipped FD sensitivity (0..1) | Score sensitivity to small input changes |
| scm.length_norm01 | bounded length proxy (0..1) | Normalized response length effect |
| scm.agree_hat01 | 1 − σ(disagree_logit) (0..1) | Predicted agreement with a reference judge (HRM in our stack) |
| scm.recon_sim01 | cosine(ŷ, y) mapped to (0..1) | Concept fidelity via SAE reconstruction |

Bonus: expose concept_sparsity for Visual-AI panes (instant concept heat).

➕ Why Tiny is more than “HRM but small”

Different objective and operating point:

  • HRM does deep semantic validation; Tiny diagnoses meta-signals (agreement, uncertainty, OOD, sensitivity) that tell you when lightweight judgment is safe vs. when to escalate.
  • Tiny runs on fixed embeddings with compact recursion; it’s cheap enough for the inner loop and edge.

The combination builds a cost-aware evaluator: Tiny handles the confident in-distribution mass; it forwards the risky tail to HRM.


🗳️ Practical decision patterns

  • High ood_hat01 + high agree_hat01 → unusual, but HRM likely agrees → ok to serve.
  • High ood_hat01 + low agree_hat01 → unusual and likely disagreement → escalate.
  • High jacobian_fd + low consistency01 → fragile judgment → escalate.
  • High uncertainty01 + low temp01 → uncertain & poorly calibrated → caution.
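
A minimal routing sketch that encodes these patterns; the thresholds are illustrative and the keys are the SCM fields defined earlier.

def route(scm: dict, thresholds: dict | None = None) -> str:
    """Decide whether Tiny's judgment can be served or the turn should escalate to HRM.
    `scm` holds Tiny's standardized readouts; defaults below are illustrative."""
    t = {"ood": 0.7, "agree": 0.5, "sens": 0.7, "consistency": 0.4,
         "uncertainty": 0.7, "temp": 0.3, **(thresholds or {})}
    if scm["ood_hat01"] > t["ood"] and scm["agree_hat01"] < t["agree"]:
        return "escalate_to_hrm"        # unusual and likely disagreement
    if scm["jacobian_fd"] > t["sens"] and scm["consistency01"] < t["consistency"]:
        return "escalate_to_hrm"        # fragile judgment
    if scm["uncertainty01"] > t["uncertainty"] and scm["temp01"] < t["temp"]:
        return "serve_with_caution"     # uncertain and poorly calibrated
    return "serve_tiny"                 # confident, in-distribution mass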

👀 SAE bottleneck: making the unseen visible

The SAE forces a sparse concept code. Two wins:

  1. Interpretability: concept_sparsity and recon_sim01 show when Tiny relies on a small, stable set of “ideas” vs. diffuse noise.
  2. Transfer: concepts that remain predictive across datasets tend to align with durable reasoning patterns: great anchors for Δ-attribution.

👶 Minimal “generic” API (no Stephanie dependencies)

# Example usage outside Stephanie
x  = embed(goal_text)                 # [B, D]
y  = embed(response_text)             # [B, D]
z0 = torch.zeros_like(x)

_, _, _, aux = tiny(x, y, z0, seq_len=len_tokens(response_text), return_aux=True)

result = {
  "score01":        float(aux["score"].mean()),
  "uncertainty01":  float(aux["uncertainty01"].mean()),
  "ood01":          float(aux["ood_hat01"].mean()),
  "consistency01":  float(aux["consistency_hat"].mean()),
  "sensitivity01":  float(aux["jacobian_fd"].mean()),
  "length_norm01":  float(aux["length_norm01"].mean()),
}
# Use for routing, monitoring, ranking, or eval.

If you adopt SCM, just map these to your preferred keys (we use scm.* for cross-model alignment).


🎲 Design choices that mattered

  • Clamp log_var and derive uncertainty01 = 1 − σ(−log_var) (monotone, stable, interpretable).
  • Use temp01 = σ(tau_raw) for alignment; keep tau for calibration math.
  • Sensitivity proxy: normalize perturbations and clip jacobian_fd.
  • SAE α ≈ 0.05: enough sparsity pressure without crushing expressivity.
  • Length proxy: bounded tanh(len/L) mapped to [0,1] avoids pathological length effects.

🌀 Tiny+ in Stephanie (what’s different)

  • SCM wiring so Tiny and HRM align 1:1
  • Agreement/disagreement head tuned for HRM parity and Δ analysis
  • Visual-AI first: outputs designed to render as “turns × features” images for instant diagnosis

This is how we compute Δ = HRM − Tiny per turn/dimension and then study its topology (e.g., persistent loops).


📅 What’s next in this post

Up next is the full Tiny source (model), followed by the trainer and the scorer wrapper. If you just want the gist, the sections above are enough to implement a compatible Tiny. If you enjoy digging into details: the code that follows is production-hardened, numerically safe, and instrumented for Δ-field work.

Tiny Recursion Model: View full source

# stephanie/scoring/model/tiny_recursion.py
"""
Tiny Recursion Model (Tiny+) - Parameter-Efficient Recursive Neural Architecture

This module implements a compact, recursive neural network for multi-task evaluation
of AI model responses. The architecture combines recursive state updates with
multi-head output predictions, enabling efficient quality assessment across
multiple dimensions from embedding inputs.

Key Innovations:
- Recursive latent state updates with halting mechanisms
- Sparse Autoencoder (SAE) bottleneck for interpretable concepts
- Multi-head prediction for comprehensive quality assessment
- Heteroscedastic uncertainty estimation
- In-graph consistency regularization

Architecture Overview:
1. Recursive fusion of goal (x), response (y), and latent (z) states
2. Core processing blocks (attention or MLP-based)
3. SAE bottleneck for sparse concept representation
4. Multi-head prediction for scores, uncertainty, and auxiliary tasks

"""

from __future__ import annotations

from typing import Any, Dict, Optional, Tuple

import torch
import torch.nn as nn
import torch.nn.functional as F

# ---------------------------
# Core Building Blocks
# ---------------------------

class TinyBlock(nn.Module):
    """
    Basic residual block: LayerNorm → MLP → residual connection.

    Supports both 2D [batch, features] and 3D [batch, sequence, features] inputs.
    Uses GELU activation and dropout for regularization.
    """
    def __init__(self, d_model: int, dropout: float = 0.1):
        super().__init__()
        self.ln = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model * 4),  # Expansion factor 4
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_model * 4, d_model),  # Projection back
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Apply residual block: x + MLP(LayerNorm(x))"""
        return x + self.mlp(self.ln(x))


class TinyBlockAttn(nn.Module):
    """
    Attention-enhanced residual block with Multi-Head Self-Attention.

    Architecture: LN → MHA → residual → TinyBlock → residual
    Automatically handles 2D/3D inputs and returns same dimensionality.
    """
    def __init__(self, d_model: int, n_heads: int = 4, dropout: float = 0.1):
        super().__init__()
        self.ln_attn = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(
            embed_dim=d_model,
            num_heads=n_heads,
            dropout=dropout,
            batch_first=True  # [batch, seq, features]
        )
        self.drop = nn.Dropout(dropout)
        self.ff = TinyBlock(d_model, dropout=dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass with automatic shape handling.

        Args:
            x: Input tensor of shape [B, D] or [B, L, D]

        Returns:
            Output tensor with same shape as input
        """
        squeeze_back = False
        if x.dim() == 2:
            x = x.unsqueeze(1)  # [B, D] → [B, 1, D]
            squeeze_back = True

        q = k = v = self.ln_attn(x)
        h, _ = self.attn(q, k, v, need_weights=False)
        x = x + self.drop(h)  # Residual connection
        x = self.ff(x)        # Feed-forward with residual

        if squeeze_back:
            x = x.squeeze(1)  # [B, 1, D] → [B, D]
        return x


# ---------------------------
# Tiny Recursion Model (Tiny+)
# ---------------------------

class TinyRecursionModel(nn.Module):
    """
    Parameter-efficient recursive model for multi-task evaluation.

    Recursively updates latent state z using goal (x) and response (y) embeddings
    over multiple steps. Features comprehensive multi-head prediction and
    sparse autoencoder bottleneck for interpretable representations.

    Core Components:
    - Recursive state fusion: [x, y, z] → z'
    - Core processing stack: Attention or MLP blocks
    - SAE bottleneck: Sparse concept encoding
    - Multi-head prediction: 12 specialized output heads

    Inputs:
        x: Goal/condition embedding [B, D]
        y: Response embedding [B, D]
        z: Initial latent state [B, D] (typically zeros)

    Outputs:
        logits: Classification logits [B, vocab_size] (legacy compatibility)
        halt_logits: Halting signal logits [B]
        z_final: Final latent state after recursion [B, D]
        aux: Dictionary of auxiliary predictions and metrics
    """

    def __init__(
        self,
        d_model: int = 256,
        n_layers: int = 2,
        n_recursions: int = 6,
        vocab_size: int = 1024,
        use_attention: bool = False,
        dropout: float = 0.1,
        attn_heads: int = 4,
        step_scale: float = 0.1,           # Residual scaling for state updates
        consistency_mask_p: float = 0.10,  # Mask probability for consistency regularization
        len_norm_L: float = 512.0,         # Length normalization constant
        enable_agree_head: bool = True,    # Enable agreement prediction head
        enable_causal_sens_head: bool = True,  # Enable sensitivity prediction head
    ):
        super().__init__()

        # Model configuration
        self.d_model = d_model
        self.n_layers = n_layers
        self.n_recursions = n_recursions
        self.vocab_size = vocab_size
        self.use_attention = use_attention
        self.step_scale = step_scale
        self.consistency_mask_p = consistency_mask_p
        self.len_norm_L = float(len_norm_L)
        self.enable_agree_head = enable_agree_head
        self.enable_causal_sens_head = enable_causal_sens_head

        # Core processing stack
        if use_attention:
            blocks = [TinyBlockAttn(d_model, n_heads=attn_heads, dropout=dropout)
                      for _ in range(n_layers)]
        else:
            blocks = [TinyBlock(d_model, dropout=dropout) for _ in range(n_layers)]
        self.core = nn.Sequential(*blocks)

        # State fusion: combine goal, response, and latent states
        self.z_proj = nn.Linear(d_model * 3, d_model)  # [x, y, z] → z'
        self.final_ln = nn.LayerNorm(d_model)

        # Core prediction heads
        self.halt_head = nn.Linear(d_model, 1)            # Halting signal logits
        self.classifier = nn.Linear(d_model, vocab_size)  # Legacy classification

        # Extended prediction heads
        self.score_head = nn.Linear(d_model, 1)        # Quality score ∈ [0,1]
        self.logvar_head = nn.Linear(d_model, 1)       # Aleatoric uncertainty (log-variance)
        self.aux3_head = nn.Linear(d_model, 3)         # 3-way classification
        self.disagree_head = nn.Linear(d_model, 1)     # Disagreement prediction
        self.recon_head = nn.Linear(d_model, d_model)  # Embedding reconstruction
        self.consistency_head = nn.Linear(d_model, 1)  # Robustness prediction
        self.ood_head = nn.Linear(d_model, 1)          # OOD detection
        self.temp_head = nn.Linear(d_model, 1)         # Temperature calibration

        # Bridge heads
        self.agree_head = nn.Linear(d_model, 1)        # Cross-model agreement
        self.causal_sens_head = nn.Linear(d_model, 1)  # Perturbation sensitivity

        # Sparse Autoencoder (SAE) bottleneck
        self.sae_enc = nn.Sequential(
            nn.Linear(d_model, d_model // 2),  # Compression
            nn.ReLU(),
            nn.LayerNorm(d_model // 2),
        )
        self.sae_dec = nn.Linear(d_model // 2, d_model)  # Reconstruction
        self.sae_alpha = 0.05  # SAE reconstruction loss weight

        # Regularization
        self.head_drop = nn.Dropout(dropout)

    @staticmethod
    def _cos01(a: torch.Tensor, b: torch.Tensor, dim: int = -1, eps: float = 1e-6) -> torch.Tensor:
        """
        Compute cosine similarity mapped from [-1, 1] to [0, 1].

        Args:
            a, b: Input tensors to compare
            dim: Dimension for cosine computation
            eps: Numerical stability term

        Returns:
            Cosine similarity in range [0, 1] where 1 = identical
        """
        sim = F.cosine_similarity(a, b, dim=dim, eps=eps)
        return (sim + 1.0) * 0.5

    def _recur(
        self,
        x: torch.Tensor,
        y: torch.Tensor,
        z: torch.Tensor,
    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
        """
        Execute recursive state updates over n_recursions steps.

        Process:
          1. Fuse [x, y, z] → z_next via projection and activation
          2. Process through core network stack
          3. Update halting signals
          4. Apply residual state update: z = z + step_scale * z_next
          5. Apply SAE bottleneck to final state

        Args:
            x: Goal embedding [B, D]
            y: Response embedding [B, D]
            z: Initial latent state [B, D]

        Returns:
            z_final: Final latent state after recursion [B, D]
            z_head: SAE-processed state for prediction heads [B, D]
            halt_logits: Maximum halting logits across steps [B, 1]
            tau: Temperature parameter for score calibration [B, 1]
            c: Sparse concept codes from SAE bottleneck [B, D//2]
        """
        B = x.size(0)
        device = x.device

        # Initialize halting signals to very negative values
        halt_logits = torch.full((B, 1), -1e9, device=device)
        z_cur = z  # Current latent state

        # Recursive state updates
        for _ in range(self.n_recursions):
            fused = torch.cat([x, y, z_cur], dim=-1)   # [B, 3 * D]
            z_next = torch.tanh(self.z_proj(fused))    # [B, D] with saturation
            z_next = self.core(z_next)                 # [B, D] core processing

            # Update halting signal (track maximum across steps)
            step_halt = self.halt_head(self.final_ln(z_next))  # [B, 1]
            halt_logits = torch.maximum(halt_logits, step_halt)

            # Residual state update with step scaling
            z_cur = z_cur + self.step_scale * z_next

        # Final normalization
        z_final = self.final_ln(z_cur)  # [B, D]

        # Sparse Autoencoder bottleneck
        c = self.sae_enc(z_final)                  # [B, D//2] concept codes
        z_head = z_final + self.sae_dec(c)         # [B, D] with SAE reconstruction
        z_head = self.head_drop(z_head)            # Regularization

        # Temperature calibration parameter (τ ∈ (0.5, ∞))
        tau_raw = self.temp_head(z_head)
        tau = 0.5 + 0.5 * F.softplus(tau_raw)  # Lower bound at 0.5

        return z_final, z_head, halt_logits, tau, c

    def forward(
        self,
        x: torch.Tensor,                    # Goal embedding [B, D]
        y: torch.Tensor,                    # Response embedding [B, D]
        z: torch.Tensor,                    # Initial latent state [B, D]
        *,
        seq_len: Optional[torch.Tensor] = None,  # Response length [B] (optional)
        return_aux: bool = True,                 # Whether to return auxiliary outputs
        with_consistency_target: bool = True,    # Compute consistency regularization
    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, Dict[str, Any]]:
        """
        Complete forward pass with recursive processing and multi-head prediction.
        """
        # Main recursive processing
        z = z.clone()  # Ensure we don't modify input
        z_final, z_head, halt_logits, tau, c = self._recur(x, y, z)

        # Core prediction heads
        logits = self.classifier(z_head)                    # [B, vocab_size]
        score_logit = self.score_head(z_head)               # [B, 1]
        log_var = self.logvar_head(z_head)                  # [B, 1] uncertainty

        # ----- NUMERICAL SAFETY -----
        LOGVAR_MIN, LOGVAR_MAX = -5.0, 5.0
        log_var = log_var.clamp(min=LOGVAR_MIN, max=LOGVAR_MAX)

        # Use tau for score calibration; tau_raw also feeds the temp01 telemetry proxy below
        # (temp01 = sigmoid(tau_raw), keeping the temperature signal in [0,1] for cross-model alignment)
        tau_raw = self.temp_head(z_head)       # same head applied in _recur; recomputed here to expose the raw logit
        tau_safe = torch.clamp(tau, min=1e-2)  # tau comes from _recur, lower-bounded for numerical safety
        s = torch.sigmoid(score_logit / tau_safe)

        # ----- Core auxiliaries
        aux3_logits = self.aux3_head(z_head)
        aux3_probs  = F.softmax(aux3_logits, dim=-1)
        disagree_logit = self.disagree_head(z_head)
        y_recon     = self.recon_head(z_head)
        ood_logit   = self.ood_head(z_head)

        # Optional bridge heads
        agree01 = torch.sigmoid(self.agree_head(z_head)) if self.enable_agree_head else None
        sens01  = torch.sigmoid(self.causal_sens_head(z_head)) if self.enable_causal_sens_head else None

        # Consistency target 
        mask = (torch.rand_like(z_head) < self.consistency_mask_p).float()
        z_masked = z_head * (1.0 - mask)
        cos_consistency = self._cos01(z_head, z_masked).unsqueeze(-1)
        consistency_logit = self.consistency_head(z_head)

        # Finite-difference sensitivity
        eps = 1e-3
        y_eps = y + eps * F.normalize(torch.randn_like(y), dim=-1)
        with torch.no_grad():
            _, z_head_eps, _, tau_eps, _ = self._recur(x, y_eps, z)
        tau_eps_safe = torch.clamp(tau_eps, min=1e-2)
        score_eps    = torch.sigmoid(self.score_head(z_head_eps) / tau_eps_safe)
        jac_fd       = ((score_eps - s).abs() / eps).clamp(0, 10.0) / 10.0

        # Length effect
        if seq_len is not None:
            len_effect = torch.tanh((seq_len.float() / self.len_norm_L)).unsqueeze(-1)
        else:
            len_effect = torch.zeros_like(s)
        length_norm01 = (len_effect + 1.0) * 0.5

        # ----- Aligned telemetry keys -----
        certainty01   = torch.sigmoid(-log_var)
        uncertainty01 = 1.0 - certainty01
        temp01        = torch.sigmoid(tau_raw)  # aligned proxy in [0,1]
        ood_hat01     = torch.sigmoid(ood_logit)
        halt_prob     = torch.sigmoid(halt_logits).unsqueeze(-1) if halt_logits.dim()==1 else torch.sigmoid(halt_logits)

        # Device-safe normalized entropy (in [0,1])
        logK = torch.log(torch.tensor(3.0, device=z_head.device, dtype=z_head.dtype))
        entropy_aux = (-(aux3_probs * F.log_softmax(aux3_logits, -1)).sum(-1) / logK).unsqueeze(-1)

        aux: Dict[str, Any] = {
            # raw heads you need for training
            "score_logit": score_logit,
            "log_var": log_var,
            "aux3_logits": aux3_logits,
            "disagree_logit": disagree_logit,
            "y_recon": y_recon,
            "consistency_logit": consistency_logit,
            "consistency_target": cos_consistency.detach(),

            # aligned derived telemetry (all ∈ [0,1])
            "score": s,
            "certainty01": certainty01,
            "uncertainty01": uncertainty01,     # aligned key: 1 - certainty01
            "uncertainty": uncertainty01,       # alias kept for backward compatibility
            "aux3_probs": aux3_probs,
            "entropy_aux": entropy_aux,
            "disagree_hat": torch.sigmoid(disagree_logit),
            "recon_sim": self._cos01(y_recon, y).unsqueeze(-1),
            "consistency_hat": torch.sigmoid(consistency_logit),
            "concept_sparsity": (c > 0).float().mean(dim=-1, keepdim=True),
            "ood_hat01": ood_hat01,             # aligned OOD probability
            "temp01": temp01,                   # sigmoid(tau_raw) proxy in [0,1]
            "jacobian_fd": jac_fd,
            "len_effect": len_effect,
            "length_norm01": length_norm01,     # length proxy mapped to [0,1]
            "halt_prob": halt_prob,             # halting probability
        }

        if agree01 is not None:
            aux["agree01"] = agree01
        if sens01 is not None:
            aux["sens01"] = sens01

        return logits, halt_logits.squeeze(-1), z_final, (aux if return_aux else {})
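
Before moving on to training, a quick smoke test shows the shape of what the model returns. This is a sketch, not production code: d_model=1024 is an assumed embedding size, the constructor arguments simply mirror the trainer defaults shown further below, and the import path follows the trainer's own import.

# Sketch: instantiate the model and run one forward pass on random embeddings.
import torch
from stephanie.scoring.model.tiny_recursion import TinyRecursionModel  # path as used by the trainer below

model = TinyRecursionModel(
    d_model=1024, n_layers=2, n_recursions=6, vocab_size=101,
    use_attention=False, dropout=0.1, attn_heads=4,
    step_scale=0.1, consistency_mask_p=0.10, len_norm_L=512.0,
).eval()

x = torch.randn(4, 1024)   # goal embeddings [B, D]
y = torch.randn(4, 1024)   # response embeddings [B, D]
z = torch.zeros_like(x)    # neutral initial latent state

logits, halt_logits, z_final, aux = model(x, y, z, return_aux=True)
print(aux["score"].shape, aux["uncertainty01"].shape, aux["halt_prob"].shape)  # each [4, 1]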

🔁 Training the Tiny Recursion Model: Building the Cognitive Microscope

“Tiny isn’t just a smaller HRM; it’s a specialized diagnostic tool trained to spot exactly where HRM’s strengths and weaknesses live.”

Now that we’ve built the Tiny Recursion Model architecture, let’s explore how we train it to become Stephanie’s cognitive microscope. This is where the magic happens: transforming a simple neural architecture into a system that can diagnose reasoning quality with surgical precision.

🌬️ The Training Pipeline: From Raw Chats to Diagnostic Signals

Tiny’s training pipeline is designed to be:

  • Per-dimension: Each of the five reasoning dimensions (reasoning, knowledge, clarity, faithfulness, coverage) gets its own trained model
  • Data-driven: Trained on high-quality conversation turns annotated by our ChatAnalyze Agent
  • Diagnostic-focused: Trained to predict not just scores, but uncertainty, agreement with HRM, sensitivity to perturbations, and other diagnostic signals

Let’s walk through exactly how this works:

1. Data Preparation: Quality Chats for Quality Training

# stephanie/agents/maintenance/tiny_trainer.py
class TinyTrainerAgent(BaseAgent):
    def __init__(self, cfg, memory, container, logger, full_cfg):
        super().__init__(cfg, memory, container, logger)
        self.dimensions = cfg.get("dimensions", []) # e.g., ["reasoning", "knowledge", "clarity", ...]
        self.trainer = TinyTrainer(full_cfg.scorer.hrm, memory, container=container, logger=logger)
    
    async def run(self, context: dict) -> dict:
        results = {}
        for dimension in self.dimensions:
            pairs_by_dim = self.pair_builder.get_training_pairs_by_dimension(dimension=dimension)
            samples = pairs_by_dim.get(dimension, [])
            if not samples:
                self.logger.log("NoSamplesFound", {"dimension": dimension})
                continue
            stats = self.trainer.train(samples, dimension)
            if "error" not in stats:
                results[dimension] = {"count": len(samples), **stats}
        context["training_stats"] = results
        return context

This agent is the orchestrator of Tiny’s training journey. It:

  • Iterates through each reasoning dimension
  • Fetches training pairs specific to that dimension
  • Trains a separate Tiny model for each dimension
  • Logs detailed statistics for each training run

The key insight here: Tiny is trained per-dimension. This isn’t arbitrary; it’s because each dimension of reasoning requires different diagnostic signals. A model trained to assess knowledge won’t be as good at assessing clarity, and vice versa.

2. Training Configuration: Precision Tuning for Recursive Reasoning

# stephanie/scoring/training/tiny_recursion_trainer.py
class TinyTrainer(BaseTrainer):
    def __init__(self, cfg, memory, container, logger):
        super().__init__(cfg, memory, container, logger)
        # - Identity / paths -
        self.model_type = "tiny"
        self.target_type = cfg.get("target_type", "document")
        self.version = cfg.get("model_version", "v1")
        # - Core knobs -
        self.epochs = int(cfg.get("epochs", 20))
        self.lr = float(cfg.get("lr", 3e-5))  # conservative default
        self.batch_size = int(cfg.get("batch_size", 16))
        self.dropout = float(cfg.get("dropout", 0.1))
        self.use_attention = bool(cfg.get("use_attention", False))
        self.n_recursions = int(cfg.get("n_recursions", 6))

This configuration is where Tiny’s “personality” is set. Let’s unpack the critical choices:

  • lr = 3e-5: A conservative learning rate that prevents the model from overshooting during training. Tiny’s recursive nature means small updates can have significant downstream effects.

  • n_recursions = 6: The number of refinement steps Tiny takes during evaluation. This was carefully tuned to balance computational efficiency with reasoning depth: fewer steps would be too shallow, more steps would be computationally expensive.

  • dropout = 0.1: A modest dropout rate that prevents overfitting while preserving the model’s ability to capture subtle reasoning patterns.

  • batch_size = 16: A small batch size that helps stabilize training for Tiny’s recursive architecture.

These settings aren’t arbitrary; they’re the result of extensive experimentation to find the sweet spot where Tiny can learn diagnostic signals without overfitting or becoming computationally expensive.
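
For reference, here is a minimal configuration sketch. It is illustrative only: the keys mirror the cfg.get(...) calls above and the values are the defaults; the surrounding config loading machinery is assumed.

# Illustrative config for TinyTrainer; every key below is read via cfg.get(...)
tiny_cfg = {
    "target_type": "document",
    "epochs": 20,           # training epochs
    "lr": 3e-5,             # conservative learning rate
    "batch_size": 16,       # small batches stabilize recursive training
    "dropout": 0.1,         # modest regularization
    "use_attention": False,
    "n_recursions": 6,      # refinement steps per evaluation
}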

3. Training Loop: The Recursive Refinement Process

# stephanie/scoring/training/tiny_recursion_trainer.py
def train(self, samples, dimension):
    # ... data preparation ...
    for epoch in range(self.epochs):
        avg_loss = self._train_epoch(model, dataloader, epoch_idx=epoch)
        # ... validation ...
        if avg_loss < best_loss - 1e-4:
            best_loss = avg_loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                break
    # ... save model and metadata ...

This training loop is where the magic happens. Tiny doesn’t just learn to predict scores; it learns to predict everything that matters for reasoning quality:

  • Core score prediction: The primary quality score (raw 0–100 labels, normalized to [0, 1] for training)
  • Heteroscedastic uncertainty: How confident the model is in its own score
  • Agreement with HRM: Whether Tiny thinks its score aligns with HRM’s
  • Sensitivity to perturbations: How much Tiny’s score changes with small input changes
  • OOD detection: Whether the input is out-of-distribution

The “wait” mechanism is particularly important: it implements early stopping when Tiny stops improving, preventing overfitting and saving computational resources.

4. Specialized Heads: The Diagnostic Toolkit

# stephanie/scoring/model/tiny_recursion.py
class TinyRecursionModel(nn.Module):
    def __init__(...):
        # ... existing init ...
        # ✨ TINY+ AUX HEADS
        self.score_head = nn.Linear(d_model, 1) # final calibrated score
        self.logvar_head = nn.Linear(d_model, 1) # heteroscedastic uncertainty
        self.aux3_head = nn.Linear(d_model, 3) # bad/mid/good classifier
        self.disagree_head = nn.Linear(d_model, 1)
        self.causal_sens_head = nn.Linear(d_model, 1)
        # ... and more ...

This is where Tiny becomes more than just a scorer; it becomes a diagnostic toolkit. Each head serves a specific purpose:

  • score_head: The primary quality score that becomes our scm.<dim>.score01 in the SCM protocol
  • logvar_head: Measures aleatoric uncertainty (how noisy the data is)
  • disagree_head: Predicts whether Tiny’s score will disagree with HRM’s
  • causal_sens_head: Measures how sensitive Tiny’s score is to tiny input changes
  • ood_head: Flags out-of-distribution inputs that might be risky

These heads are trained together, allowing Tiny to learn the relationships between different diagnostic signals. For example, when Tiny is uncertain (logvar_head high), it’s also more likely to disagree with HRM (disagree_head high).
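
As a quick way to see that relationship in practice, here is a minimal sketch. It assumes a trained TinyRecursionModel named model and batched goal/response/latent embeddings x, y, z; the aux keys are the ones defined in the model’s forward pass above.

# Sketch: check whether the uncertainty and disagreement heads move together.
import torch

with torch.no_grad():
    _, _, _, aux = model(x, y, z, return_aux=True)

u = aux["uncertainty01"].view(-1)   # 1 - certainty01, derived from the log-variance head
d = aux["disagree_hat"].view(-1)    # sigmoid of the disagreement head

# Pearson correlation between the two diagnostic signals
u_c, d_c = u - u.mean(), d - d.mean()
corr = (u_c * d_c).sum() / (u_c.norm() * d_c.norm() + 1e-8)
print(f"uncertainty vs. disagreement correlation: {corr.item():.3f}")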

5. Training Losses: The Mathematical Foundation

Tiny’s training doesn’t just optimize for one thing; it optimizes for multiple signals simultaneously:

# stephanie/scoring/training/tiny_recursion_trainer.py
def _train_epoch(self, model, dataloader, epoch_idx):
    # ... training loop ...
    # ... individual loss terms (L_main, L_aux3, L_dis, L_recon, L_cons, L_sae, L_ood, L_halt) computed as in the full source ...
    loss = (
        L_main                            # heteroscedastic score + uncertainty
        + self.w_aux3 * L_aux3            # 3-way bucket classification
        + self.w_disagree * L_dis         # disagreement prediction
        + self.w_recon * L_recon          # embedding reconstruction
        + self.w_cons * L_cons            # consistency regularization
        + self.w_sae_recon * L_sae        # SAE sparsity (off by default)
        + self.w_ood * L_ood              # OOD detection (off by default)
        + self.halt_lambda * L_halt       # halting regularizer
    )
    # ... backpropagation ...

Each loss term has its own weight, matching the config keys in the full trainer source:

  • L_main: the heteroscedastic score loss, the unweighted backbone that fits the score and its aleatoric uncertainty together
  • w_aux3: 3-way bad/mid/good bucket classification
  • w_disagree: predicted gap between Tiny’s score and the supervision target, a proxy for disagreement with HRM
  • w_recon: embedding reconstruction quality
  • w_consistency: robustness of the latent state to masking
  • w_sae_recon and w_ood: sparse-autoencoder and out-of-distribution terms, both 0.0 (disabled) by default
  • halt_lambda: a light regularizer on the halting signal

These weights balance the different aspects of reasoning quality. The score loss dominates because the primary job is to assess quality; secondary diagnostics such as OOD detection stay switched off until they’re needed.
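
The backbone term deserves one formula. Written out, and matching _heteroscedastic_regression_loss in the full source below, the main loss over a batch of size $B$ is:

$$\mathcal{L}_{\text{main}} = \frac{1}{B}\sum_{i=1}^{B}\Big[\, e^{-\ell_i}\,(s_i - t_i)^2 + \ell_i \,\Big]$$

where $s_i$ is the calibrated score, $t_i \in [0,1]$ is the target, and $\ell_i$ is the predicted log-variance (clamped to $[-5, 5]$). The model can down-weight a hard example by raising $\ell_i$, but it pays a linear penalty for doing so, which is what keeps the uncertainty estimates honest.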

🔗 Working together towards a goal

The real magic of Tiny’s training isn’t in the individual components; it’s in how they work together:

  • Per-dimension training: Each dimension gets its own specialized model that understands the unique characteristics of that reasoning aspect
  • Multi-task learning: Tiny learns to predict multiple diagnostic signals simultaneously, creating a rich diagnostic profile
  • Shared canonical metrics: All outputs are mapped to a standardized format (SCM) that aligns with HRM’s outputs
  • Diagnostic-focused: Instead of just predicting scores, Tiny learns to predict why a score is what it is

This training approach is what transforms Tiny from a simple scorer into Stephanie’s cognitive microscope: a system that can diagnose reasoning quality with surgical precision and tell us exactly where and why it agrees or disagrees with HRM.

Tiny Recursion Trainer: full source

# stephanie/scoring/training/tiny_recursion_trainer.py
"""
TinyRecursionModel Trainer (Tiny+)

A specialized trainer for the TinyRecursionModel that implements multi-objective
training with heteroscedastic regression and auxiliary losses. This trainer handles
multiple data schemas and produces dimension-specific models with comprehensive
training telemetry.

Key Features:
- Heteroscedastic regression for score prediction with uncertainty estimation
- Multiple auxiliary objectives: bucket classification, disagreement, reconstruction
- Support for various input schemas (native, singleton, pairwise, HRM)
- Comprehensive training monitoring and validation
- Early stopping and model checkpointing

"""

from __future__ import annotations

import math
import os
from collections import Counter
from datetime import datetime
from typing import Any, Dict, List, Optional, Tuple
import logging
import torch
import torch.nn.functional as F
from torch import optim
from torch.utils.data import DataLoader, TensorDataset

from stephanie.scoring.model.tiny_recursion import TinyRecursionModel
from stephanie.scoring.training.base_trainer import BaseTrainer

try:
    from tqdm.auto import tqdm
except Exception:  # pragma: no cover
    tqdm = None

_logger = logging.getLogger(__name__)

def _bucket3(y01: torch.Tensor) -> torch.Tensor:
    """
    Convert continuous scores to 3-class bucket labels.
    
    Args:
        y01: Tensor of scores in range [0, 1]
        
    Returns:
        Long tensor with bucket indices:
        - 0: scores < 1/3
        - 1: scores in [1/3, 2/3)
        - 2: scores >= 2/3
    """
    # <1/3 => 0, [1/3,2/3) => 1, >=2/3 => 2
    edges = torch.tensor([1/3, 2/3], device=y01.device, dtype=y01.dtype)
    return (y01 >= edges[0]).long() + (y01 >= edges[1]).long()


class TinyTrainer(BaseTrainer):
    """
    Trainer for TinyRecursionModel (Tiny+) with multi-objective optimization.
    
    This trainer implements a comprehensive training regimen that combines:
    - Main heteroscedastic regression objective
    - Multiple auxiliary objectives for regularization and feature learning
    - Support for various input data formats (native, singleton, pairwise, HRM)
    - Extensive monitoring and validation
    
    The trainer produces a separate model instance for each quality dimension.
    
    Attributes:
        model_type: Identifier for model architecture ("tiny")
        target_type: Type of scoring target ("document", "sentence", etc.)
        version: Model version identifier
        epochs: Number of training epochs
        lr: Learning rate for optimizer
        batch_size: Training batch size
        dropout: Dropout rate for model regularization
        use_attention: Whether to use attention mechanisms
        n_recursions: Number of recursion steps in model
        halt_lambda: Weight for halting regularization loss
        grad_clip: Gradient clipping value
        w_aux3: Weight for 3-class auxiliary classification
        w_disagree: Weight for disagreement prediction
        w_recon: Weight for reconstruction loss
        w_cons: Weight for consistency regularization
        w_sae_recon: Weight for sparse autoencoder reconstruction
        w_ood: Weight for out-of-distribution detection
    """

    def __init__(self, cfg, memory, container, logger):
        """Initialize TinyTrainer with configuration and dependencies."""
        super().__init__(cfg, memory, container, logger)

        # --- Identity / paths -------------------------------------------------
        self.model_type   = "tiny"
        self.target_type  = cfg.get("target_type", "document")
        self.version      = cfg.get("model_version", "v1")

        # --- Core knobs -------------------------------------------------------
        self.epochs        = int(cfg.get("epochs", 20))
        self.lr            = float(cfg.get("lr", 3e-5))           # conservative default
        self.batch_size    = int(cfg.get("batch_size", 16))
        self.dropout       = float(cfg.get("dropout", 0.1))
        self.use_attention = bool(cfg.get("use_attention", False))
        self.n_recursions  = int(cfg.get("n_recursions", 6))
        self.halt_lambda   = float(cfg.get("halt_lambda", 0.05))  # halting is a light regularizer
        self.grad_clip     = float(cfg.get("grad_clip", 0.5))

        # Aux loss weights
        self.w_aux3        = float(cfg.get("w_aux3", 0.3))
        self.w_disagree    = float(cfg.get("w_disagree", 0.3))
        self.w_recon       = float(cfg.get("w_recon", 0.2))
        self.w_cons        = float(cfg.get("w_consistency", 0.2))
        self.w_sae_recon   = float(cfg.get("w_sae_recon", 0.0))   # 0 = off by default
        self.w_ood         = float(cfg.get("w_ood", 0.0))         # 0 = off by default

        # --- Telemetry --------------------------------------------------------
        self.show_progress       = bool(cfg.get("show_progress", True))
        self.progress_every      = max(1, int(cfg.get("progress_every", 500)))
        self.log_every_steps     = max(1, int(cfg.get("log_every_steps", 50)))
        self.label_hist_bucket   = int(cfg.get("label_hist_bucket", 10))
        self.log_label_histogram = bool(cfg.get("log_label_histogram", True))

        # --- Validation / reproducibility ------------------------------------
        self.validation_ratio = float(cfg.get("validation_ratio", 0.1))
        self.seed             = int(cfg.get("seed", 42))
        torch.manual_seed(self.seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed(self.seed)

        # --- Model ------------------------------------------------------------
        self.model = TinyRecursionModel(
            d_model=self.dim,
            n_layers=int(cfg.get("n_layers", 2)),
            n_recursions=self.n_recursions,
            vocab_size=int(cfg.get("vocab_size", 101)),   # kept for classifier compatibility
            use_attention=self.use_attention,
            dropout=self.dropout,
            attn_heads=int(cfg.get("attn_heads", 4)),
            step_scale=float(cfg.get("step_scale", 0.1)),
            consistency_mask_p=float(cfg.get("consistency_mask_p", 0.10)),
            len_norm_L=float(cfg.get("len_norm_L", 512.0)),
        ).to(self.device)

    # ------------------------------
    # Data prep
    # ------------------------------

    def _create_dataloader(self, samples: List[Dict[str, Any]]) -> Tuple[Optional[DataLoader], int, int]:
        """
        Create DataLoader from sample dictionaries with multiple schema support.
        
        Supports multiple input formats:
        - Native Tiny+ schema: x, y, z, target
        - Singleton format: goal_text/output with score
        - Pairwise format: output_a/output_b with comparative scores
        - HRM format: goal_text/scorable_text with target_score
        
        Args:
            samples: List of sample dictionaries with various possible schemas
            
        Returns:
            Tuple of (DataLoader, kept_count, dropped_count) or (None, kept, dropped) if insufficient data
        """
        xs, ys, zs = [], [], []
        y01, halt_targets, seq_lens = [], [], []
        kept = dropped = 0
        label_counts = Counter()

        use_tqdm = bool(self.show_progress and tqdm is not None)
        it = tqdm(samples, desc="Packing Tiny+ samples", unit="samp") if use_tqdm else samples

        def _push(goal: str, doc: str, target: float, *, z_text: Optional[str] = None, halt_t: float = 1.0, slen: int = 0):
            """Internal helper to process and validate a single sample."""
            nonlocal kept, dropped
            try:
                # Get embeddings for text inputs
                x = torch.tensor(self.memory.embedding.get_or_create(goal), dtype=torch.float32, device=self.device)
                y = torch.tensor(self.memory.embedding.get_or_create(doc),  dtype=torch.float32, device=self.device)
                z = torch.tensor(self.memory.embedding.get_or_create(z_text if z_text is not None else goal),
                                dtype=torch.float32, device=self.device)

                # ---- Normalize & sanitize inputs (prevents recursion amplification / NaNs)
                def _safe_vec(t):
                    """Safely normalize vector, handling NaN/inf values."""
                    t = torch.nan_to_num(t, nan=0.0, posinf=0.0, neginf=0.0)
                    norm = t.norm(dim=-1, keepdim=True).clamp_min(1e-6)
                    return t / norm

                x = _safe_vec(x); y = _safe_vec(y); z = _safe_vec(z)
                if not torch.isfinite(x).all() or not torch.isfinite(y).all() or not torch.isfinite(z).all():
                    dropped += 1
                    return

                # normalize target → [0,1]
                t = float(target)
                t = (max(0.0, min(100.0, t)) / 100.0) if t > 1.0 else max(0.0, min(1.0, t))

                xs.append(x); ys.append(y); zs.append(z)
                y01.append(t); halt_targets.append(float(halt_t)); seq_lens.append(int(slen))
                label_counts[int(round(t * 100))] += 1
                kept += 1
            except Exception as e:
                dropped += 1
                if self.logger: self.logger.log("TinyRecursionSampleError", {"error": str(e)})

        # Process all samples with schema detection
        for s in it:
            # Native Tiny+ schema
            if "x" in s and "y" in s and "z" in s and "target" in s:
                _push(s["x"], s["y"], s["target"], z_text=s.get("z"), halt_t=s.get("halt_target", 1.0), slen=s.get("seq_len", 0))
                continue

            # Singleton (SICQL/MRQ style)
            title = (s.get("goal_text") or s.get("title") or "").strip()
            if "output" in s and ("score" in s or "target_score" in s):
                out = (s.get("scorable_text") or s.get("output") or "").strip()
                val = s.get("target_score", s.get("score"))
                if title and out and (val is not None):
                    _push(title, out, val, z_text=title)
                else:
                    dropped += 1
                continue

            # Pairwise
            if all(k in s for k in ("output_a","output_b","value_a","value_b")):
                a_out = (s.get("output_a") or "").strip()
                b_out = (s.get("output_b") or "").strip()
                a_val = s.get("value_a"); b_val = s.get("value_b")
                if title:
                    if a_out and a_val is not None: _push(title, a_out, a_val, z_text=title)
                    if b_out and b_val is not None: _push(title, b_out, b_val, z_text=title)
                else:
                    dropped += 1
                continue

            # HRM/raw
            if ("goal_text" in s and "scorable_text" in s and ("target_score" in s or "score" in s)):
                out = (s.get("scorable_text") or "").strip()
                val = s.get("target_score", s.get("score"))
                _push(title, out, val, z_text=title)
                continue

            dropped += 1

            if use_tqdm: it.set_postfix(kept=kept, drop=dropped)

        if use_tqdm and hasattr(it, "close"): it.close()

        # Log label distribution for analysis
        if self.logger and self.log_label_histogram:
            exact = {int(k): int(v) for k, v in sorted(label_counts.items())}
            bucketed = self._bucketize_counts(label_counts, self.label_hist_bucket)
            self.logger.log("TinyPlusLabelHistogram", {
                "kept": int(kept), "dropped": int(dropped),
                "exact": exact, "bucket_size": int(self.label_hist_bucket),
                "bucketed": bucketed
            })

        if kept < self.min_samples:
            return None, kept, dropped

        # Create TensorDataset and DataLoader
        dataset = TensorDataset(
            torch.stack(xs), torch.stack(ys), torch.stack(zs),
            torch.tensor(y01, dtype=torch.float32, device=self.device).unsqueeze(-1),
            torch.tensor(halt_targets, dtype=torch.float32, device=self.device).unsqueeze(-1),
            torch.tensor(seq_lens, dtype=torch.int32, device=self.device),
        )
        loader = DataLoader(dataset, batch_size=self.batch_size, shuffle=True)
        return loader, kept, dropped

    # ------------------------------
    # Loss Functions
    # ------------------------------

    @staticmethod
    def _heteroscedastic_regression_loss(score: torch.Tensor, target01: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
        """
        Compute heteroscedastic regression loss with uncertainty estimation.
        
        This loss adapts to the uncertainty in predictions by learning
        a variance term that scales the regression loss.
        
        Args:
            score: Predicted scores [B, 1]
            target01: Ground truth scores in [0, 1] [B, 1]
            log_var: Learned log variance [B, 1]
            
        Returns:
            Scalar loss value
        """
        log_var = log_var.clamp(-5.0, 5.0)  # defensive clamp to avoid precision explosion
        inv_var = torch.exp(-log_var)
        diff2   = (score - target01).pow(2)
        return (inv_var * diff2 + log_var).mean()

    @staticmethod
    def _cosine_recon_loss(y_recon: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
        """
        Compute cosine reconstruction loss.
        
        Measures how well the model can reconstruct the input embedding,
        encouraging meaningful internal representations.
        
        Args:
            y_recon: Reconstructed embedding
            y_true: Original embedding
            
        Returns:
            Cosine distance loss in range [0, 1]
        """
        # 1 - cosine in [0,2] → clamp to [0,1]
        cos = F.cosine_similarity(y_recon, y_true, dim=-1, eps=1e-8).unsqueeze(-1)
        return (1 - cos).clamp(0, 1).mean()

    # ------------------------------
    # Epoch training
    # ------------------------------

    def _train_epoch(self, model: TinyRecursionModel, dataloader: DataLoader, epoch_idx: int) -> float:
        """
        Train model for one epoch.
        
        Args:
            model: TinyRecursionModel instance
            dataloader: Training data loader
            epoch_idx: Current epoch index
            
        Returns:
            Average training loss for the epoch
        """
        model.train()
        total_loss = 0.0
        count = 0

        use_tqdm = bool(self.show_progress and tqdm is not None)
        it = tqdm(dataloader, desc=f"Epoch {epoch_idx}", unit="batch", leave=False) if use_tqdm else dataloader

        for step, batch in enumerate(it, start=1):
            x, y, z, target01, halt_target, seq_len = batch

            # Forward pass with auxiliary outputs
            logits, halt_logits, _, aux = model(x, y, z, seq_len=seq_len, return_aux=True)

            # Main loss: heteroscedastic regression on score/log_var
            L_main = self._heteroscedastic_regression_loss(aux["score"], target01, aux["log_var"])

            # Auxiliary losses for multi-objective training
            buckets = _bucket3(target01.squeeze(-1))
            L_aux3  = F.cross_entropy(aux["aux3_logits"], buckets)  # 3-class classification
            L_dis   = F.smooth_l1_loss(aux["disagree_hat"], (target01 - aux["score"].detach()).abs())  # Disagreement prediction
            L_recon = self._cosine_recon_loss(aux["y_recon"], y)  # Reconstruction quality
            L_cons  = F.mse_loss(aux["consistency_hat"], aux["consistency_target"])  # Consistency regularization

            # Optional losses (weight=0 means disabled)
            L_sae = torch.zeros((), device=self.device)
            if self.w_sae_recon > 0.0 and "concept_vec" in aux:
                # L1 sparsity penalty on the SAE concept codes; only active if the model exposes them as "concept_vec"
                L_sae = aux["concept_vec"].abs().mean()

            L_ood = torch.zeros((), device=self.device)
            if self.w_ood > 0.0 and "ood_hat01" in aux:
                # Training samples are in-distribution, so push the OOD head toward 0 here
                L_ood = F.binary_cross_entropy(aux["ood_hat01"], torch.zeros_like(aux["ood_hat01"]))

            L_halt = F.binary_cross_entropy_with_logits(halt_logits.unsqueeze(-1), halt_target)  # Halting regularization

            # Check components for finiteness & sanity
            all_terms = torch.stack([
                L_main.detach(),
                L_aux3.detach(),
                L_dis.detach(),
                L_recon.detach(),
                L_cons.detach(),
                L_sae.detach(),
                L_ood.detach(),
                L_halt.detach()
            ])
            if (not torch.isfinite(all_terms).all()) or (all_terms.abs().max() > 1e6):
                if self.logger:
                    self.logger.log("TinyPlusNaNBatch", {
                        "epoch": epoch_idx,
                        "step": step,
                        "any_nan": bool(not torch.isfinite(all_terms).all()),
                        "max_abs": float(all_terms.abs().max().item())
                    })
                self.optimizer.zero_grad(set_to_none=True)
                continue  # skip this batch

            # Combined loss with weighting
            loss = (
                L_main
                + self.w_aux3 * L_aux3
                + self.w_disagree * L_dis
                + self.w_recon * L_recon
                + self.w_cons * L_cons
                + self.w_sae_recon * L_sae
                + self.w_ood * L_ood
                + self.halt_lambda * L_halt
            )

            if (not torch.isfinite(loss)) or (abs(loss.item()) > 1e7):
                if self.logger:
                    self.logger.log("TinyPlusUnstableLoss", {
                        "epoch": epoch_idx,
                        "step": step,
                        "loss": float(loss.item()) if torch.isfinite(loss) else float('nan')
                    })
                self.optimizer.zero_grad(set_to_none=True)
                continue

            # Backward pass with gradient clipping
            self.optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), self.grad_clip)
            self.optimizer.step()

            bsz = x.size(0)
            total_loss += loss.item() * bsz
            count += bsz

            # Progress reporting
            if use_tqdm:
                it.set_postfix(loss=f"{loss.item():.4f}")
            elif self.logger and (step % self.log_every_steps == 0):
                self.logger.log("TinyPlusBatch", {
                    "epoch": epoch_idx,
                    "step": step,
                    "loss": float(loss.item()),
                    "L_main": float(L_main.item()),
                    "L_aux3": float(L_aux3.item()),
                    "L_dis": float(L_dis.item()),
                    "L_recon": float(L_recon.item()),
                    "L_cons": float(L_cons.item()),
                    "L_sae": float(L_sae.item()),
                    "L_ood": float(L_ood.item()),
                    "L_halt": float(L_halt.item()),
                })

        if use_tqdm and hasattr(it, "close"):
            it.close()

        return total_loss / max(1, count)

    # ------------------------------
    # Validation
    # ------------------------------

    @torch.no_grad()
    def _validate(self, model: TinyRecursionModel, dataloader: Optional[DataLoader]) -> Dict[str, float]:
        """
        Run validation and compute comprehensive metrics.
        
        Args:
            model: Model to validate
            dataloader: Validation data loader
            
        Returns:
            Dictionary of validation metrics
        """
        if not dataloader:
            return {}

        model.eval()
        scores, targets = [], []
        # 10 metric lists for comprehensive validation
        entropies, uncerts, disagree, recon_sim, cons_hat, temp01, jac, spars, ood, len_eff = (
            [] for _ in range(10)
        )

        for x, y, z, target01, _, seq_len in dataloader:
            _, _, _, aux = model(x, y, z, seq_len=seq_len, return_aux=True)
            s = aux["score"].detach().cpu().view(-1)
            t = target01.detach().cpu().view(-1)

            scores.append(s)
            targets.append(t)
            # Collect various auxiliary metrics for analysis
            entropies.append(aux["entropy_aux"].detach().cpu().view(-1))
            uncerts.append(aux["uncertainty"].detach().cpu().view(-1))
            disagree.append(aux["disagree_hat"].detach().cpu().view(-1))
            recon_sim.append(aux["recon_sim"].detach().cpu().view(-1))
            cons_hat.append(aux["consistency_hat"].detach().cpu().view(-1))
            temp01.append(aux["temp01"].detach().cpu().view(-1))
            jac.append(aux["jacobian_fd"].detach().cpu().view(-1))
            spars.append(aux["concept_sparsity"].detach().cpu().view(-1))
            ood.append(aux["ood_hat01"].detach().cpu().view(-1))
            len_eff.append(aux["len_effect"].detach().cpu().view(-1))

        s = torch.cat(scores); t = torch.cat(targets)
        mae = F.l1_loss(s, t).item()
        rmse = torch.sqrt(F.mse_loss(s, t)).item()

        def mean_cat(arrs):
            """Helper to compute mean of concatenated tensor list."""
            return float(torch.cat(arrs).mean().item()) if arrs else 0.0

        return {
            "mae": mae,
            "rmse": rmse,
            "entropy_aux_mean":     mean_cat(entropies),
            "uncertainty_mean":     mean_cat(uncerts),
            "disagree_hat_mean":    mean_cat(disagree),
            "recon_sim_mean":       mean_cat(recon_sim),
            "consistency_hat_mean": mean_cat(cons_hat),
            "temp01_mean":          mean_cat(temp01),
            "jacobian_fd_mean":     mean_cat(jac),
            "concept_sparsity_mean":mean_cat(spars),
            "ood_hat_mean":         mean_cat(ood),
            "len_effect_mean":      mean_cat(len_eff),
        }

    # ------------------------------
    # Train/val split
    # ------------------------------

    def _create_train_val_split(self, samples: List[Dict[str, Any]]):
        """
        Split samples into training and validation sets.
        
        Args:
            samples: List of sample dictionaries
            
        Returns:
            Tuple of (train_samples, val_samples)
        """
        if not samples:
            return [], []
        if self.validation_ratio <= 0 or len(samples) < 10:
            return samples, []
        g = torch.Generator().manual_seed(self.seed)
        idx = torch.randperm(len(samples), generator=g).tolist()
        split = int(len(samples) * (1 - self.validation_ratio))
        return [samples[i] for i in idx[:split]], [samples[i] for i in idx[split:]]

    # ------------------------------
    # Main train loop (per dimension)
    # ------------------------------

    def train(self, samples, dimension):
        """
        Main training loop for a specific quality dimension.
        
        Args:
            samples: Training samples for the dimension
            dimension: Quality dimension name
            
        Returns:
            Training results dictionary
        """
        # Split data
        train_samples, val_samples = self._create_train_val_split(samples)

        # Create data loaders
        dataloader, kept, dropped = self._create_dataloader(train_samples)
        val_loader, val_kept, val_dropped = (None, 0, 0)
        if val_samples:
            val_loader, val_kept, val_dropped = self._create_dataloader(val_samples)

        if not dataloader:
            return {"error": "insufficient_data", "kept": kept, "dropped": dropped}

        # Optimizer
        self.optimizer = optim.AdamW(self.model.parameters(), lr=self.lr, weight_decay=1e-2)

        best_metric = float("inf")
        patience, wait = int(self.cfg.get("patience", 3)), 0
        train_losses: List[float] = []
        saved_best = False

        locator = self.get_locator(dimension)  # create once; base_path will be ensured

        # Training loop with early stopping
        for epoch in range(1, self.epochs + 1):
            avg_loss = self._train_epoch(self.model, dataloader, epoch_idx=epoch)
            avg_loss = float(avg_loss)
            # Ensure epoch loss is finite for serialization/meta
            if not math.isfinite(avg_loss):
                avg_loss = float(train_losses[-1]) if train_losses else 0.0
                if self.logger:
                    self.logger.log("TinyPlusNaNEpoch", {"epoch": epoch})
            train_losses.append(avg_loss)

            # Validation
            val_metrics = self._validate(self.model, val_loader) if val_loader else {}
            if self.logger:
                payload = {"epoch": epoch, "train_loss": float(avg_loss)}
                payload.update({f"val_{k}": v for k, v in val_metrics.items()})
                self.logger.log("TinyPlusEpoch", payload)

            # Early stopping metric: prefer val MAE, fallback to train loss
            stop_metric = val_metrics.get("mae", avg_loss) if val_metrics else avg_loss
            if not math.isfinite(stop_metric):
                if self.logger:
                    self.logger.log("TinyPlusNonFiniteMetric", {"epoch": epoch})
                stop_metric = float("inf")

            improved = (not math.isfinite(best_metric)) or (stop_metric < best_metric - 1e-6)
            if improved:
                best_metric = stop_metric
                wait = 0
                best_path = locator.model_file(suffix="_tiny.pt")
                try:
                    torch.save(self.model.state_dict(), best_path)
                    saved_best = True
                    if self.logger:
                        self.logger.log("TinyPlusSaveCheckpoint", {"epoch": epoch, "path": best_path, "metric": float(best_metric)})
                except Exception as e:
                    if self.logger:
                        self.logger.log("TinyPlusSaveError", {"epoch": epoch, "path": best_path, "error": str(e)})
            else:
                wait += 1
                if wait >= patience:
                    if self.logger:
                        self.logger.log("TinyPlusEarlyStopping", {"epoch": epoch, "best_metric": float(best_metric)})
                    break

        # ---- ALWAYS save a 'last' checkpoint ----
        last_path = locator.model_file(suffix="_tiny_last.pt")
        try:
            torch.save(self.model.state_dict(), last_path)
            if self.logger:
                self.logger.log("TinyPlusSaveLast", {"path": last_path})
        except Exception as e:
            if self.logger:
                self.logger.log("TinyPlusSaveLastError", {"path": last_path, "error": str(e)})

        # If no 'best' was saved during training, backfill it now:
        best_path = locator.model_file(suffix="_tiny.pt")
        if not saved_best or not os.path.exists(best_path):
            try:
                torch.save(self.model.state_dict(), best_path)
                if self.logger:
                    self.logger.log("TinyPlusBackfillBest", {"path": best_path})
            except Exception as e:
                if self.logger:
                    self.logger.log("TinyPlusBackfillBestError", {"path": best_path, "error": str(e)})

        # --- Save training metadata -------------------------------------------
        safe_config = {
            "lr": self.lr,
            "epochs": self.epochs,
            "batch_size": self.batch_size,
            "halt_lambda": self.halt_lambda,
            "n_layers": self.cfg.get("n_layers", 2),
            "n_recursions": self.n_recursions,
            "use_attention": self.use_attention,
            "dropout": self.dropout,
            "seed": self.seed,
            "vocab_size": int(self.cfg.get("vocab_size", 101)),
            "w_aux3": self.w_aux3,
            "w_disagree": self.w_disagree,
            "w_recon": self.w_recon,
            "w_consistency": self.w_cons,
            "w_sae_recon": self.w_sae_recon,
            "w_ood": self.w_ood,
            "grad_clip": self.grad_clip,
        }

        # Ensure train_loss_curve is finite-only floats
        finite_curve = []
        last_finite = 0.0
        for v in train_losses:
            if math.isfinite(v):
                last_finite = float(v)
            finite_curve.append(float(last_finite))

        meta = {
            "dimension": dimension,
            "model_type": "tiny_recursion",
            "expects_triplet": True,
            "embedding_type": self.embedding_type,
            "input_dim": self.dim,
            "concat_input_dim": self.dim * 2,
            "version": self.version,
            "epochs": self.epochs,
            "avg_loss": float(min(finite_curve or [best_metric])),
            "timestamp": datetime.now().isoformat(),
            "cfg": dict(self.cfg),
            "kept": int(kept),
            "best_metric": float(best_metric),
            "train_loss_curve": [float(x) for x in finite_curve],
            "dropped": int(dropped),
            "val_kept": int(val_kept),
            "val_dropped": int(val_dropped),
        }
        self._save_meta_file(meta, dimension)

        # TrainingStatsStore integration
        self.memory.training_stats.add_from_result(
            stats={
                "avg_q_loss": float(min(finite_curve or [best_metric])),
                "avg_loss":   float(min(finite_curve or [best_metric])),
            },
            model_type=self.model_type,
            target_type=self.target_type,
            dimension=dimension,
            version=self.version,
            embedding_type=self.embedding_type,
            config=safe_config,
            sample_count=len(samples),
            valid_samples=int(kept),
            invalid_samples=int(dropped),
            start_time=datetime.now(),
        )

        return {
            "best_metric": float(best_metric),
            "train_loss_curve": [float(x) for x in finite_curve],
            "kept": int(kept),
            "dropped": int(dropped),
            "val_kept": int(val_kept),
            "val_dropped": int(val_dropped),
        }

    # ------------------------------
    # Helper Methods
    # ------------------------------

    def _bucketize_label(self, y: int) -> int:
        """
        Bucketize label for histogram analysis.
        
        If bucket_size > 0, map 0..100 → 0..num_bins-1 using fixed-width bins.
        Ensure vocab_size >= num_bins when you enable bucketing.
        
        Args:
            y: Original label value
            
        Returns:
            Bucket index
        """
        b = int(self.label_hist_bucket)  # histogram bucket width from config
        if b <= 0:
            return max(0, min(100, y))              # <<< clamp to 0..100
        num_bins = (101 + b - 1) // b               # e.g., b=10 -> 11 bins (0..10)
        yb = min(max(y, 0) // b, num_bins - 1)
        return yb

    def _bucketize_counts(self, counts: Counter, bucket: int) -> dict:
        """
        Convert exact label counts to bucketized counts for visualization.
        
        Args:
            counts: Counter of exact label values
            bucket: Bucket size
            
        Returns:
            Dictionary mapping bucket ranges to counts
        """
        if bucket <= 1:
            return {str(k): int(v) for k, v in sorted(counts.items())}

        buckets = {}
        for label, c in counts.items():
            try:
                l = int(label)
            except Exception:
                continue
            start = (l // bucket) * bucket
            end = min(100, start + bucket - 1)
            key = f"{start}-{end}"
            buckets[key] = buckets.get(key, 0) + int(c)

        # Ensure all possible buckets are represented
        start = 0
        while start <= 100:
            end = min(100, start + bucket - 1)
            key = f"{start}-{end}"
            buckets.setdefault(key, 0)
            start += bucket

        return dict(sorted(buckets.items(), key=lambda kv: int(kv[0].split('-')[0])))

⏭️ What’s Next

With Tiny trained, we’re ready to compare it with HRM. In the next section, we’ll explore the “gap field” analysis where we subtract HRM from Tiny to reveal the structured disagreement between them. This is where we’ll see exactly where Tiny can safely replace HRM, and where we need to escalate to the deeper reasoning engine.

“Tiny isn’t just a smaller HRM; it’s a specialized diagnostic tool trained to spot exactly where HRM’s strengths and weaknesses live.”


🔍 The Tiny Scorer: Transforming Diagnostic Signals into Actionable Intelligence

“Tiny isn’t just a scorer; it’s a translator that converts the internal language of reasoning into a shared vocabulary for comparison.”

When you look at the Tiny Scorer code, you might see a technical implementation. But what you’re really seeing is the critical bridge between raw neural network outputs and meaningful, actionable intelligence. Let’s walk through how this component transforms Tiny Recursion Model’s internal telemetry into the standardized metrics that power our entire GAP analysis pipeline.

🎬 What the Tiny Scorer Actually Does

At its core, the Tiny Scorer is a diagnostic translator. It takes the raw outputs of Tiny Recursion Models (TRM) and converts them into a standardized format that:

  • Works seamlessly with Stephanie’s existing architecture
  • Enables apples-to-apples comparison with HRM
  • Provides actionable insights for routing decisions
  • Creates visualizable signals for our Gap Field analysis

This isn’t just about getting a score; it’s about understanding why the score is what it is, and how it relates to other evaluation systems.

✈️ The Scoring Journey: From Text to Insights

Let’s follow the Tiny Scorer’s workflow step by step, highlighting the key decisions that make it work:

1. Loading Models with Precision

def _load_models(self, dimensions: List[str]):
    for dim in dimensions:
        # Resolve model and metadata file paths
        model_path = locator.model_file(suffix="_tiny.pt")
        meta_path = locator.meta_file()
        
        # Extract model configuration from metadata with safe defaults
        n_layers = int(cfg_meta.get("n_layers", 2))
        n_recursions = int(cfg_meta.get("n_recursions", 6))
        use_attn = bool(cfg_meta.get("use_attention", False))
        # ... etc ...
        
        # Instantiate model with exact same architecture as training
        model = TinyRecursionModel(
            d_model=self.dim,
            n_layers=n_layers,
            n_recursions=n_recursions,
            # ... all parameters ...
        ).to(self.device)
        
        # Load trained weights with strict=False for backward compatibility
        state = torch.load(model_path, map_location=self.device)
        model.load_state_dict(state, strict=False)
        model.eval()

This might look like standard model loading, but there’s a critical detail: the model is loaded with exactly the same architecture as it was trained with. This is essential because Tiny is trained with specific heads for specific diagnostic signals. If we changed the architecture, we’d lose those diagnostic capabilities.

The strict=False loading is another key decision: it allows backward compatibility as we add new heads without breaking existing models.
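
Concretely, here is a small sketch of what that buys us, assuming model, model_path, and self.device from the loader above:

# Sketch: strict=False tolerates head mismatches between checkpoint and current architecture.
state = torch.load(model_path, map_location=self.device)
missing, unexpected = model.load_state_dict(state, strict=False)
# missing:    heads added after this checkpoint was trained (left at their initial weights)
# unexpected: checkpoint keys the current model no longer defines (silently ignored)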

2. The Embedding Conversion: Creating a Common Language

x_np = self.memory.embedding.get_or_create(goal_text)
y_np = self.memory.embedding.get_or_create(scorable.text)

x = torch.tensor(x_np, dtype=torch.float32, device=self.device).unsqueeze(0)
y = torch.tensor(y_np, dtype=torch.float32, device=self.device).unsqueeze(0)
z = torch.zeros_like(x)  # neutral third stream for recursive processing
seq_len = torch.zeros(x.size(0), dtype=torch.int32, device=self.device)

This is where Tiny Scorer creates a common language for comparison with HRM. By using the same embedding system as HRM, we ensure that the input space is identical for both models. The z = torch.zeros_like(x) creates a neutral starting point for Tiny’s recursive processing.

This is why Tiny can be directly compared to HRM: both models operate in the same embedding space.

3. Diagnostic Extraction: The Heart of Tiny’s Power

# Extract core metrics from model outputs
raw01 = float(max(0.0, min(1.0, _tf(aux.get("score")))))

# Calculate certainty with fallback logic
if "certainty01" in aux:
    certainty01 = _tf(aux["certainty01"])
elif "uncertainty01" in aux:
    certainty01 = 1.0 - _tf(aux["uncertainty01"])
elif "uncertainty" in aux:
    certainty01 = 1.0 - _tf(aux["uncertainty"])
else:
    certainty01 = 0.5  # Default neutral certainty

entropy = _tf(aux.get("entropy_aux"))
halt_prob = _sigmoid_mean(halt_logits)

This is where Tiny truly shines. Instead of just giving a score, it provides a rich diagnostic profile:

  • raw01: The core quality score normalized to [0,1]
  • certainty01: How confident the model is in its score (with fallback logic)
  • entropy: Predictive entropy (higher = more uncertainty)
  • halt_prob: Probability the recursive process halts early (converges)

But Tiny goes even further. With different attribute verbosity levels, it can provide:

  • Minimal: Just the essential metrics
  • Standard: Confidence triplet, calibration signals, robustness measures
  • Full: Raw logit summaries, reconstruction details, concept analysis

This flexibility is critical: it lets us adjust the level of detail based on whether we’re doing quick routing decisions or deep debugging.
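
For the routing case, a toy sketch gives the flavor. The key names mirror the diagnostics above, but the thresholds are illustrative placeholders, not tuned values from this post:

# Sketch: use Tiny's diagnostics to decide when a score is safe to keep and when to escalate to HRM.
def route(attrs: dict) -> str:
    certainty = attrs.get("certainty01", 0.5)
    ood       = attrs.get("ood_hat01", 0.0)
    halt      = attrs.get("halt_prob", 0.0)
    if certainty >= 0.7 and ood <= 0.3 and halt >= 0.5:
        return "accept_tiny"       # confident, in-distribution, and converged
    return "escalate_to_hrm"       # anything shaky goes to the deeper reasoner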

4. SCM Conversion: The Magic of Cross-Model Alignment

def _build_scm_from_tiny_attrs(attrs: Dict[str, Any]) -> Dict[str, float]:
    # Extract and clamp core diagnostic signals
    certainty = float(attrs.get("certainty01", 0.5))
    unc01     = 1.0 - max(0.0, min(1.0, certainty))
    cons01    = max(0.0, min(1.0, float(attrs.get("consistency_hat", 0.5))))
    # ... etc ...
    
    # Dimension-specific scoring using diagnostic patterns
    dim_scores["reasoning"] = 0.60*cons01 + 0.30*(1.0-unc01) + 0.10*agree01
    dim_scores["knowledge"] = 0.50*(1.0-ood01) + 0.30*recon_sim + 0.20*(1.0-unc01)
    # ... etc ...
    
    # Build final SCM dictionary
    scm: Dict[str, float] = {
        f"scm.{k}.score01": dim_scores[k]
        for k in ("reasoning", "knowledge", "clarity", "faithfulness", "coverage")
    }
    scm["scm.aggregate01"]   = float(sum(dim_scores.values())/5.0)
    # ... etc ...

This is where the magic happens. The SCM conversion isn’t just a simple mapping; it’s a deliberate, weighted translation between Tiny’s internal signals and the standardized SCM format.

For example, the reasoning score isn’t just a direct copy of a single signal; it’s a weighted combination of consistency, uncertainty, and agreement signals. This is why Tiny can be compared to HRM: we’ve defined a common language for reasoning quality.
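
To make the weighting concrete, using the coefficients from the snippet above: with cons01 = 0.8, unc01 = 0.2, and agree01 = 0.9, the reasoning score works out to 0.60·0.8 + 0.30·(1 − 0.2) + 0.10·0.9 = 0.48 + 0.24 + 0.09 = 0.81.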

5. The Aligned Vector: Enabling Topological Analysis

def _tiny_build_vector(attrs: Dict[str, Any]) -> Dict[str, Any]:
    vec: Dict[str, float] = {}
    
    # Core TRM statistics for direct access
    vec["tiny.score01"]        = float(attrs.get("tiny.score01", 0.0))
    vec["tiny.certainty01"]    = float(attrs.get("certainty01", 0.5))
    # ... etc ...
    
    # SCM-formatted metrics for cross-model alignment
    scm_keys = [
        "scm.reasoning.score01", "scm.knowledge.score01", "scm.clarity.score01",
        # ... all SCM dimensions ...
    ]
    
    # Mirror dimension scores for PHOS visualization compatibility
    for d in ("reasoning", "knowledge", "clarity", "faithfulness", "coverage"):
        k = f"scm.{d}.score01"
        if k in attrs:
            v01 = float(attrs[k])
            vec[f"tiny.{d}.score01"]  = v01
            vec[f"tiny.{d}.score100"] = round(v01 * 100.0, 4)
            vec[f"tiny.{d}"]          = v01
    
    return {"vector": vec, "columns": cols, "values": vals}

This vector is what makes our Gap Field analysis possible. By creating a consistent structure with the same columns across models, we can subtract HRM and Tiny scores to create the Δ-field: the “gap” between their reasoning processes.
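
The subtraction itself is deliberately simple. Here is a minimal sketch, assuming hrm_vec and tiny_vec are dicts keyed by the same scm.<dim>.score01 columns; the delta.* key names are hypothetical:

# Sketch: build the per-dimension Δ-field from two aligned SCM vectors.
DIMS = ("reasoning", "knowledge", "clarity", "faithfulness", "coverage")

def delta_field(hrm_vec: dict, tiny_vec: dict) -> dict:
    return {
        f"delta.{d}.score01": hrm_vec[f"scm.{d}.score01"] - tiny_vec[f"scm.{d}.score01"]
        for d in DIMS
    }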

🔑 A translation mechanism

When we talk about the “gap field,” we’re not talking about some abstract concept. We’re talking about the concrete difference between HRM and Tiny scores on the same data. And that difference only makes sense if we’ve aligned them properly.

The Tiny Scorer is what makes this alignment possible. It takes Tiny’s internal signals and converts them into a format that:

  • Has the same structure as HRM
  • Uses the same scale for comparable metrics
  • Preserves the diagnostic richness of Tiny’s approach

This is why the Tiny Scorer is so critical to our entire system. Without it, we’d have two models speaking different languages, and we’d never be able to see the structured disagreement between them.

🚀 The Ultimate Goal: Building a Cognitive Microscope

Tiny isn’t just a smaller HRM; it’s a different kind of model entirely. While HRM is a deep, hierarchical reasoner, Tiny is a cognitive microscope that focuses on where and why evaluation systems agree or diverge.

The Tiny Scorer is what makes this microscope operational. It transforms Tiny’s internal telemetry into actionable intelligence that we can route on, visualize, and compare against HRM without adopting our stack wholesale.

“Tiny isn’t just a scorer; it’s a translator that converts the internal language of reasoning into a shared vocabulary for comparison.”

This is the heart of our Visual AI approach: not just measuring scores, but seeing the structured disagreement between models. And the Tiny Scorer shows how to make this possible.

Tiny Recursion Scorer: View full source

# stephanie/scoring/tiny_scorer.py
"""
Tiny Recursion Model Scorer - Lightweight evaluator with rich diagnostics.

This module implements the scoring interface for Tiny Recursion Models (TRM),
providing fast, recursive quality assessment with comprehensive diagnostic
telemetry. The scorer transforms TRM's internal signals into standardized
Shared Core Metrics (SCM) format for cross-model comparison in GAP analysis.

Key Features:
- Per-dimension model loading and management
- Rich diagnostic extraction (uncertainty, OOD, sensitivity, agreement, etc.)
- SCM alignment for cross-model comparability
- Vector generation for topological analysis
- Flexible attribute verbosity levels (minimal/standard/full)

The TinyScorer serves as the lightweight counterpart to HRM in the GAP
analysis pipeline, enabling efficient model comparison and routing decisions.
"""

from __future__ import annotations

import logging
import os
from typing import Any, Dict, List

import torch

from stephanie.constants import GOAL, GOAL_TEXT
from stephanie.data.score_bundle import ScoreBundle
from stephanie.data.score_result import ScoreResult
from stephanie.scoring.model.tiny_recursion import TinyRecursionModel
from stephanie.scoring.scorer.base_scorer import BaseScorer
from stephanie.utils.file_utils import load_json

_logger = logging.getLogger(__name__)


class TinyScorer(BaseScorer):
    """
    Tiny Recursion Model scorer for efficient quality evaluation with rich diagnostics.
    
    This scorer uses trained TinyRecursionModel instances to evaluate goal-response
    pairs across multiple reasoning dimensions. It extracts not just quality scores
    but comprehensive diagnostic telemetry including uncertainty estimates,
    out-of-distribution detection, sensitivity analysis, and agreement predictions.
    
    The scorer automatically converts TRM's native outputs into the standardized
    Shared Core Metrics (SCM) format, enabling direct comparison with HRM and
    other evaluation systems in the GAP analysis pipeline.
    
    Attributes:
        model_type: Identifier for scorer type ("tiny")
        embedding_type: Type of embeddings used (shared with HRM)
        dimensions: List of reasoning dimensions to evaluate
        attr_level: Verbosity level for attributes ("minimal"/"standard"/"full")
        models: Dictionary of loaded TRM models per dimension
        model_meta: Metadata for each dimension's model
    """
    
    def __init__(self, cfg, memory, container, logger):
        """
        Initialize TinyScorer with configuration and dependencies.
        
        Args:
            cfg: Configuration dictionary with scorer parameters
            memory: Memory interface for embedding and data access
            container: Dependency injection container
            logger: Structured logging interface
            
        Configuration Parameters:
            target_type: Type of scoring target ("conversation_turn")
            model_path: Base path for model files
            model_version: Version identifier for models
            dimensions: List of dimensions to evaluate
            clip_0_100: Whether to clip scores to 0-100 range
            tiny_attr_level: Attribute verbosity level
        """
        super().__init__(cfg, memory, container, logger)
        _logger.info("Initializing TinyScorer")
        
        self.model_type = "tiny"  # identifies scorer type in results

        # Embedding interface (shared with HRM for cross-model alignment)
        self.embedding_type = self.memory.embedding.name
        self.dim = self.memory.embedding.dim
        _logger.debug(f"Using embedding type: {self.embedding_type}, dimension: {self.dim}")

        # Configuration parameters
        self.target_type = cfg.get("target_type", "conversation_turn")
        self.model_path = cfg.get("model_path", "models")
        self.version = cfg.get("model_version", "v1")
        self.dimensions: List[str] = cfg.get("dimensions", [])
        
        # Output scaling configuration
        self.clip_0_100 = cfg.get("clip_0_100", True)
        
        # Attribute verbosity: controls diagnostic detail level
        self.attr_level = (cfg.get("tiny_attr_level") or "standard").lower()
        _logger.debug(f"Attribute level set to: {self.attr_level}")

        # Containers for per-dimension models and metadata
        self.models: Dict[str, TinyRecursionModel] = {}
        self.model_meta: Dict[str, Dict[str, Any]] = {}

        # Attempt to load models up-front for all specified dimensions
        _logger.info(f"Loading TRM models for dimensions: {self.dimensions}")
        self._load_models(self.dimensions)
        _logger.info(f"TinyScorer initialized with {len(self.models)} loaded models")

    # -------------------------
    # Model Loading
    # -------------------------
    def _load_models(self, dimensions: List[str]):
        """
        Load trained TinyRecursionModel instances for specified dimensions.
        
        For each dimension, this method:
        1. Resolves model and metadata file paths
        2. Loads model configuration from metadata
        3. Instantiates TRM with correct architecture
        4. Loads trained weights
        5. Registers model in the internal registry
        
        Args:
            dimensions: List of reasoning dimensions to load models for
            
        Logs:
            - Debug: Model loading progress and configuration
            - Warning: Missing model files or metadata
            - Error: Model instantiation or weight loading failures
        """
        _logger.debug(f"Starting model loading for {len(dimensions)} dimensions")
        
        for dim in dimensions:
            _logger.debug(f"Loading model for dimension: {dim}")
            locator = self.get_locator(dim)

            # Resolve model and metadata file paths
            model_path = locator.model_file(suffix="_tiny.pt")
            meta_path = locator.meta_file()
            _logger.debug(f"Model path: {model_path}, Meta path: {meta_path}")

            if not os.path.exists(model_path):
                _logger.warning(f"Model file missing for dimension {dim}: {model_path}")
                self.logger.log(
                    "TinyScorerModelMissing",
                    {"dimension": dim, "path": model_path},
                )
                continue

            # Load model metadata for architecture configuration
            meta: Dict[str, Any] = {}
            if os.path.exists(meta_path):
                try:
                    meta = load_json(meta_path) or {}
                    _logger.debug(f"Loaded metadata for {dim}: {len(meta)} keys")
                except Exception as e:
                    _logger.error(f"Failed to load metadata for {dim}: {e}")
                    self.logger.log(
                        "TinyScorerMetaLoadError", {"dimension": dim, "error": str(e)}
                    )
            else:
                _logger.warning(f"Metadata file missing for {dim}: {meta_path}")

            # Extract model configuration from metadata with safe defaults
            cfg_meta = meta.get("cfg", {}) if isinstance(meta, dict) else {}
            n_layers = int(cfg_meta.get("n_layers", 2))
            n_recursions = int(cfg_meta.get("n_recursions", 6))
            use_attn = bool(cfg_meta.get("use_attention", False))
            dropout = float(cfg_meta.get("dropout", 0.1))
            attn_heads = int(cfg_meta.get("attn_heads", 4))
            step_scale = float(cfg_meta.get("step_scale", 0.1))
            cons_mask_p = float(cfg_meta.get("consistency_mask_p", 0.10))
            len_norm_L = float(cfg_meta.get("len_norm_L", 512.0))
            vocab_size = int(cfg_meta.get("vocab_size", 101))

            # Optional feature flags from metadata
            enable_agree_head = bool(cfg_meta.get("enable_agree_head", True))
            enable_causal_sens_head = bool(cfg_meta.get("enable_causal_sens_head", True))

            _logger.debug(
                f"Model config for {dim}: layers={n_layers}, recursions={n_recursions}, "
                f"attention={use_attn}, dropout={dropout}"
            )

            # Instantiate model with exact same architecture as training
            _logger.debug(f"Instantiating TRM for dimension {dim}")
            model = TinyRecursionModel(
                d_model=self.dim,
                n_layers=n_layers,
                n_recursions=n_recursions,
                vocab_size=vocab_size,
                use_attention=use_attn,
                dropout=dropout,
                attn_heads=attn_heads,
                step_scale=step_scale,
                consistency_mask_p=cons_mask_p,
                len_norm_L=len_norm_L,
                enable_agree_head=enable_agree_head,
                enable_causal_sens_head=enable_causal_sens_head,
            ).to(self.device)

            # Load trained weights with strict=False for backward compatibility
            _logger.debug(f"Loading model weights from: {model_path}")
            try:
                state = torch.load(model_path, map_location=self.device)
                model.load_state_dict(state, strict=False)
                model.eval()  # Set to evaluation mode
                _logger.debug(f"Successfully loaded weights for {dim}")
            except Exception as e:
                _logger.error(f"Failed to load weights for {dim}: {e}")
                continue

            # Register successfully loaded model
            self.models[dim] = model
            self.model_meta[dim] = meta
            
            _logger.info(f"Successfully loaded TRM model for dimension: {dim}")
            self.logger.log(
                "TinyScorerModelLoaded",
                {
                    "dimension": dim, 
                    "model_path": model_path, 
                    "device": str(self.device)
                },
            )

    # -------------------------
    # Scoring Core
    # -------------------------
    def _score_core(self, context: dict, scorable, dimensions: List[str]) -> ScoreBundle:
        """
        Core scoring method that evaluates goal-response pairs using TRM.
        
        This method:
        1. Converts text to embeddings (shared with HRM)
        2. Runs TRM inference for each dimension
        3. Extracts scores and rich diagnostics
        4. Converts to SCM format for cross-model alignment
        5. Generates aligned vectors for topological analysis
        
        Args:
            context: Scoring context containing goal information
            scorable: The response text to evaluate
            dimensions: List of dimensions to score against
            
        Returns:
            ScoreBundle containing results for all specified dimensions
            
        Logs:
            - Debug: Embedding conversion, model inference, SCM conversion
            - Warning: Missing models or scoring errors
            - Info: Scoring completion statistics
        """
        _logger.debug(f"Starting scoring for {len(dimensions)} dimensions")
        
        # Extract goal information from context
        goal = context.get(GOAL, {})
        goal_text = goal.get(GOAL_TEXT, "")
        _logger.debug(f"Scoring goal: {goal_text[:50]}...")
        _logger.debug(f"Scorable text: {scorable.text[:50]}...")

        results: Dict[str, ScoreResult] = {}

        # Step 1: Convert text to embeddings (shared with HRM for alignment)
        _logger.debug("Converting goal and response to embeddings")
        x_np = self.memory.embedding.get_or_create(goal_text)
        y_np = self.memory.embedding.get_or_create(scorable.text)
        
        # Convert to tensors and ensure correct device placement
        x = torch.tensor(x_np, dtype=torch.float32, device=self.device).unsqueeze(0)
        y = torch.tensor(y_np, dtype=torch.float32, device=self.device).unsqueeze(0)
        z = torch.zeros_like(x)  # neutral third stream for recursive processing
        seq_len = torch.zeros(x.size(0), dtype=torch.int32, device=self.device)
        
        _logger.debug(f"Embedding shapes - x: {x.shape}, y: {y.shape}, z: {z.shape}")

        # Step 2: Score for each specified dimension
        for dim in dimensions:
            _logger.debug(f"Scoring dimension: {dim}")
            model = self.models.get(dim)
            
            if model is None:
                _logger.warning(f"No model available for dimension: {dim}")
                self.logger.log("TinyModelMissing", {"dimension": dim})
                continue

            try:
                # Run TRM inference with gradient disabled for efficiency
                _logger.debug(f"Running TRM inference for {dim}")
                with torch.no_grad():
                    _, halt_logits, _, aux = model(
                        x, y, z, seq_len=seq_len, return_aux=True
                    )
                _logger.debug(f"TRM inference completed for {dim}")

                # Step 3: Extract core metrics from model outputs
                _logger.debug("Extracting core metrics from TRM outputs")
                raw01 = float(max(0.0, min(1.0, _tf(aux.get("score")))))
                
                # Calculate certainty with fallback logic
                if "certainty01" in aux:
                    certainty01 = _tf(aux["certainty01"])
                    _logger.debug("Using certainty01 from aux")
                elif "uncertainty01" in aux:
                    certainty01 = 1.0 - _tf(aux["uncertainty01"])
                    _logger.debug("Derived certainty from uncertainty01")
                elif "uncertainty" in aux:
                    certainty01 = 1.0 - _tf(aux["uncertainty"])
                    _logger.debug("Derived certainty from uncertainty")
                else:
                    certainty01 = 0.5  # Default neutral certainty
                    _logger.debug("Using default certainty 0.5")
                    
                entropy = _tf(aux.get("entropy_aux"))
                halt_prob = _sigmoid_mean(halt_logits)
                
                _logger.debug(
                    f"Core metrics - score: {raw01:.3f}, certainty: {certainty01:.3f}, "
                    f"entropy: {entropy:.3f}, halt_prob: {halt_prob:.3f}"
                )

                # Apply scaling and metadata adjustments
                meta = self.model_meta.get(dim, {})
                final_score = _tf(aux.get("score"))
                tiny_score01 = raw01
                tiny_score100 = round(_safe_scale_0_100(tiny_score01, meta), 4)
                _logger.debug(f"Scaled scores - 01: {tiny_score01:.3f}, 100: {tiny_score100}")

                # Step 4: Build base attributes dictionary
                _logger.debug("Building base attributes dictionary")
                attrs: Dict[str, Any] = {
                    "tiny.score01": tiny_score01,
                    "tiny.score100": tiny_score100,
                    "raw01": tiny_score01,  # backward-compatibility alias
                    "entropy": float(entropy),
                    "certainty01": float(certainty01),
                    "halt_prob": float(halt_prob) if halt_prob is not None else None,
                    # Model context metadata for downstream processing
                    "n_recursions": int(meta.get("cfg", {}).get("n_recursions", 6)),
                    "use_attention": bool(meta.get("cfg", {}).get("use_attention", False)),
                    "dropout": float(meta.get("cfg", {}).get("dropout", 0.1)),
                }

                # Step 5: Add diagnostic attributes based on verbosity level
                if self.attr_level in ("standard", "full"):
                    _logger.debug("Adding standard diagnostic attributes")
                    attrs.update(_extract_standard_aux(aux))
                    
                    # Include optional bridge heads if available
                    if "agree01" in aux and isinstance(aux["agree01"], torch.Tensor):
                        attrs["agree01"] = float(_tf(aux["agree01"]))
                        _logger.debug("Added agree01 diagnostic")
                    if "sens01" in aux and isinstance(aux["sens01"], torch.Tensor):
                        attrs["sens01"] = float(_tf(aux["sens01"]))
                        _logger.debug("Added sens01 diagnostic")

                if self.attr_level == "full":
                    _logger.debug("Adding full diagnostic attributes")
                    attrs.update(_extract_full_aux(aux))
                    
                    # Add raw signal summaries for deep debugging
                    if "score_logit" in aux:
                        attrs["score_logit_mean"] = float(_tf(aux["score_logit"]))
                    if "aux3_logits" in aux and isinstance(aux["aux3_logits"], torch.Tensor):
                        al = aux["aux3_logits"]
                        attrs["aux3_logits_l1_mean"] = float(al.abs().mean().item())

                # Step 6: Convert to Shared Core Metrics format
                _logger.debug("Converting to SCM format for cross-model alignment")
                scm = _build_scm_from_tiny_attrs(attrs)
                attrs.update(scm)
                _logger.debug(f"SCM conversion complete - aggregate: {scm.get('scm.aggregate01', 0):.3f}")

                # Step 7: Mirror dimension scores for PHOS compatibility
                _logger.debug("Mirroring dimension scores for PHOS alignment")
                for dname in ("reasoning", "knowledge", "clarity", "faithfulness", "coverage"):
                    key = f"scm.{dname}.score01"
                    if key in scm:
                        v01 = float(scm[key])
                        attrs[f"tiny.{dname}.score01"]  = v01
                        attrs[f"tiny.{dname}.score100"] = round(v01 * 100.0, 4)
                        attrs[f"tiny.{dname}"] = float(scm[key])
                _logger.debug("Dimension score mirroring complete")

                # Step 8: Generate scoring rationale
                rationale = (
                    f"tiny[{dim}] raw01={float(raw01):.4f}, "
                    f"H={float(entropy):.3f}, C={float(certainty01):.3f}, "
                    f"halt_p={float(halt_prob) if halt_prob is not None else -1:.3f}"
                )
                _logger.debug(f"Generated rationale: {rationale}")

                # Step 9: Build aligned vector for topological analysis
                _logger.debug("Building aligned vector for GAP analysis")
                vector = _tiny_build_vector(attrs)

                # Step 10: Create final ScoreResult
                results[dim] = ScoreResult(
                    dimension=dim,
                    score=tiny_score01,
                    source=self.model_type,
                    rationale=rationale,
                    weight=1.0,
                    attributes={
                        **attrs, 
                        "vector": vector["vector"], 
                        "columns": vector["columns"], 
                        "values": vector["values"]
                    },
                )
                _logger.debug(f"Successfully created ScoreResult for {dim}")

            except Exception as e:
                _logger.error(f"Scoring error for dimension {dim}: {e}")
                self.logger.log("TinyScoreError", {"dimension": dim, "error": str(e)})

        _logger.info(f"Scoring completed for {len(results)} dimensions")
        return ScoreBundle(results=results)

    # -------------------------
    # Utility Methods
    # -------------------------
    @staticmethod
    def _get(d: Dict[str, Any], key: str):
        """
        Safe dictionary access with exception handling.
        
        Args:
            d: Dictionary to access
            key: Key to retrieve
            
        Returns:
            Value if present and accessible, None otherwise
        """
        try:
            return d.get(key)
        except Exception:
            return None

    def __repr__(self):
        """String representation showing loaded models."""
        loaded = {k: (v is not None) for k, v in self.models.items()}
        return f"<TinyScorer(model_type={self.model_type}, loaded={loaded})>"


def _take_scalar(t):
    """
    Extract scalar value from tensor or return float directly.
    
    Args:
        t: Input tensor or scalar
        
    Returns:
        Extracted scalar value as float
    """
    # works with tensor or float
    if isinstance(t, torch.Tensor):
        return float(t.detach().mean().cpu().item())
    return float(t)


# -------------------------
# Helper Functions
# -------------------------

def _tf(v):
    """
    Tensor/array/number → scalar float with safe fallback.
    
    Handles various input types and extracts mean value from tensors.
    Provides safe defaults for None or invalid inputs.
    
    Args:
        v: Input value (tensor, array, or scalar)
        
    Returns:
        Extracted scalar float value
    """
    if v is None:
        _logger.debug("Received None value, returning 0.0")
        return 0.0
    if isinstance(v, torch.Tensor):
        # handle both scalar and vector tensors - use mean for vectors
        result = v.detach().float().mean().item()
        _logger.debug(f"Converted tensor to scalar: {result}")
        return result
    try:
        result = float(v)
        _logger.debug(f"Converted value to float: {result}")
        return result
    except Exception:
        _logger.debug(f"Failed to convert value: {v}, returning 0.0")
        return 0.0


def _sigmoid_mean(v):
    """
    Apply sigmoid and compute mean for halting logits.
    
    Args:
        v: Input tensor or value
        
    Returns:
        Mean sigmoid probability, or None if input is None
    """
    if v is None:
        _logger.debug("Received None for sigmoid_mean")
        return None
    if isinstance(v, torch.Tensor):
        result = torch.sigmoid(v.detach()).mean().item()
        _logger.debug(f"Computed sigmoid mean: {result}")
        return result
    result = float(v)
    _logger.debug(f"Returning float value: {result}")
    return result


def _safe_scale_0_100(raw: float, meta: dict | None) -> float:
    """
    Scale raw [0,1] score to [0,100] range with metadata awareness.
    
    Uses metadata min/max values if available, otherwise uses default 0-100 scaling.
    
    Args:
        raw: Raw score in [0,1] range
        meta: Model metadata containing scaling parameters
        
    Returns:
        Scaled score in appropriate range
    """
    if not meta:
        result = float(max(0.0, min(1.0, raw)) * 100.0)
        _logger.debug(f"Scaled without meta: {raw} -> {result}")
        return result
    
    lo = float(meta.get("min_value", 0.0))
    hi = float(meta.get("max_value", 100.0))
    result = float(max(lo, min(hi, lo + (hi - lo) * max(0.0, min(1.0, raw)))))
    _logger.debug(f"Scaled with meta: {raw} -> {result} (range: {lo}-{hi})")
    return result


def _tiny_build_vector(attrs: Dict[str, Any]) -> Dict[str, Any]:
    """
    Build aligned vector representation for GAP analysis.
    
    Creates a deterministic vector structure that enables cross-model
    alignment in topological analysis. Includes both raw TRM statistics
    and SCM-formatted metrics.
    
    Args:
        attrs: Dictionary of attributes from TRM scoring
        
    Returns:
        Dictionary containing vector, columns, and values for alignment
    """
    _logger.debug("Building aligned vector from attributes")
    vec: Dict[str, float] = {}
    
    # Core TRM statistics for direct access
    vec["tiny.score01"]        = float(attrs.get("tiny.score01", 0.0))
    vec["tiny.score100"]       = float(attrs.get("tiny.score100", 0.0))
    vec["tiny.certainty01"]    = float(attrs.get("certainty01", 0.5))
    vec["tiny.entropy_mean"]   = float(attrs.get("entropy", 0.0))
    
    if "halt_prob" in attrs and attrs["halt_prob"] is not None:
        vec["tiny.halt_prob"] = float(attrs["halt_prob"])
    
    _logger.debug(f"Added {len(vec)} core TRM statistics to vector")

    # SCM-formatted metrics for cross-model alignment
    scm_keys = [
        "scm.reasoning.score01", "scm.knowledge.score01", "scm.clarity.score01",
        "scm.faithfulness.score01", "scm.coverage.score01", "scm.aggregate01",
        "scm.uncertainty01", "scm.ood_hat01", "scm.consistency01",
        "scm.length_norm01", "scm.temp01", "scm.agree_hat01",
    ]
    
    scm_count = 0
    for k in scm_keys:
        if k in attrs:
            vec[k] = float(attrs[k])
            scm_count += 1
    
    _logger.debug(f"Added {scm_count} SCM metrics to vector")

    # Mirror dimension scores for PHOS visualization compatibility
    mirror_count = 0
    for d in ("reasoning", "knowledge", "clarity", "faithfulness", "coverage"):
        k = f"scm.{d}.score01"
        if k in attrs:
            v01 = float(attrs[k])
            vec[f"tiny.{d}.score01"]  = v01
            vec[f"tiny.{d}.score100"] = round(v01 * 100.0, 4)
            vec[f"tiny.{d}"]          = v01
            mirror_count += 1
    
    _logger.debug(f"Mirrored {mirror_count} dimension scores")

    # Create final aligned structure
    cols = list(vec.keys())
    vals = [vec[c] for c in cols]
    _logger.debug(f"Vector construction complete: {len(cols)} columns, {len(vals)} values")
    
    return {"vector": vec, "columns": cols, "values": vals}


def _extract_standard_aux(aux: Dict[str, Any]) -> Dict[str, float]:
    """
    Extract standard diagnostic attributes from TRM auxiliary outputs.
    
    Provides balanced diagnostic coverage including confidence estimates,
    calibration signals, robustness measures, and sensitivity analysis.
    All outputs are normalized to [0,1] range.
    
    Args:
        aux: TRM auxiliary outputs dictionary
        
    Returns:
        Dictionary of standardized diagnostic attributes
    """
    _logger.debug("Extracting standard diagnostic attributes")
    out: Dict[str, float] = {}

    # Confidence triplet from 3-class auxiliary head
    if "aux3_probs" in aux and isinstance(aux["aux3_probs"], torch.Tensor):
        p = aux["aux3_probs"].detach().float()
        out["aux3_p_bad"]  = float(p[..., 0].mean().item())
        out["aux3_p_mid"]  = float(p[..., 1].mean().item())
        out["aux3_p_good"] = float(p[..., 2].mean().item())
        _logger.debug("Extracted aux3 probability triplet")

    # Calibration and temperature signals
    out["temp01"] = float(_tf(aux.get("temp01")))

    # Out-of-distribution detection (prefer newer ood_hat01 format)
    if "ood_hat01" in aux:
        out["ood_hat01"] = float(_tf(aux["ood_hat01"]))
        _logger.debug("Using ood_hat01 for OOD detection")
    elif "ood_hat" in aux:  # backward compatibility
        out["ood_hat01"] = float(_tf(aux["ood_hat"]))
        _logger.debug("Using ood_hat (legacy) for OOD detection")

    # Robustness and sensitivity measures
    out["consistency_hat"] = float(_tf(aux.get("consistency_hat")))
    out["jacobian_fd"]     = float(_tf(aux.get("jacobian_fd")))
    _logger.debug("Extracted robustness and sensitivity measures")

    # Reconstruction quality and disagreement prediction
    out["recon_sim"]  = float(_tf(aux.get("recon_sim")))
    out["disagree_hat"] = float(_tf(aux.get("disagree_hat")))
    _logger.debug("Extracted reconstruction and disagreement signals")

    # Length normalization (prefer 0..1 normalized version)
    if "length_norm01" in aux:
        out["length_norm01"] = float(_tf(aux["length_norm01"]))
        _logger.debug("Using length_norm01 for length effect")
    else:
        # Derive from tanh-normalized len_effect if available
        if "len_effect" in aux:
            le = float(_tf(aux["len_effect"]))
            out["length_norm01"] = float(max(0.0, min(1.0, (le + 1.0) * 0.5)))
            _logger.debug("Derived length_norm01 from len_effect")
        else:
            out["length_norm01"] = 0.0
            _logger.debug("Using default length_norm01")

    # Sparse Autoencoder concept sparsity
    out["concept_sparsity"] = float(_tf(aux.get("concept_sparsity")))
    _logger.debug("Extracted concept sparsity measure")

    _logger.debug(f"Standard diagnostics extraction complete: {len(out)} attributes")
    return out


def _extract_full_aux(aux: Dict[str, Any]) -> Dict[str, float]:
    """
    Extract full diagnostic attributes including raw signal summaries.
    
    Provides maximum detail for debugging and analysis, including
    raw logit summaries and internal representation statistics.
    Use only when deep inspection is required.
    
    Args:
        aux: TRM auxiliary outputs dictionary
        
    Returns:
        Dictionary of detailed diagnostic attributes
    """
    _logger.debug("Extracting full diagnostic attributes")
    out: Dict[str, float] = {}

    # Summaries of raw head outputs for debugging
    for k in ("log_var", "consistency_logit", "disagree_logit"):
        if k in aux and isinstance(aux[k], torch.Tensor):
            t = aux[k].detach()
            out[f"{k}_mean"] = float(t.mean().item())
            _logger.debug(f"Added {k}_mean to full diagnostics")

    # Reconstruction detail analysis
    if "y_recon" in aux and isinstance(aux["y_recon"], torch.Tensor):
        yr = aux["y_recon"].detach()
        out["y_recon_norm_mean"] = float(yr.norm(dim=-1).mean().item())
        _logger.debug("Added y_recon_norm_mean to full diagnostics")

    # Sparse Autoencoder concept analysis
    if "concept_vec" in aux and isinstance(aux["concept_vec"], torch.Tensor):
        c = aux["concept_vec"].detach()
        out["concept_vec_l2_mean"] = float((c.pow(2).sum(-1).sqrt()).mean().item())
        _logger.debug("Added concept_vec_l2_mean to full diagnostics")

    _logger.debug(f"Full diagnostics extraction complete: {len(out)} attributes")
    return out


# === SCM mapping from Tiny aux → aligned scm.* columns =======================

def _build_scm_from_tiny_attrs(attrs: Dict[str, Any]) -> Dict[str, float]:
    """
    Convert TRM attributes to Shared Core Metrics format.
    
    Maps TRM's internal diagnostic signals to the standardized 5-dimensional
    SCM format using fixed, hand-tuned weighting patterns. This enables direct
    comparison with HRM and other evaluation systems.
    
    The mapping uses TRM's diagnostic patterns to infer dimension scores:
    - Reasoning: Emphasizes consistency, low uncertainty, agreement
    - Knowledge: Focuses on in-distribution signals and reconstruction
    - Clarity: Uses token quality and length normalization
    - Faithfulness: Based on reconstruction and consistency
    - Coverage: Considers concept activity and distribution alignment
    
    Args:
        attrs: TRM attributes dictionary
        
    Returns:
        Dictionary of SCM-formatted scores in [0,1] range
    """
    _logger.debug("Building SCM from TRM attributes")
    
    # Extract and clamp core diagnostic signals
    certainty = float(attrs.get("certainty01", 0.5))
    unc01     = 1.0 - max(0.0, min(1.0, certainty))
    cons01    = max(0.0, min(1.0, float(attrs.get("consistency_hat", 0.5))))
    # (standard extraction stores these as ood_hat01 / length_norm01; fall back to the legacy keys)
    ood01     = max(0.0, min(1.0, float(attrs.get("ood_hat01", attrs.get("ood_hat", 0.0)))))
    len01     = max(0.0, min(1.0, float(attrs.get("length_norm01", attrs.get("len_effect", 0.0)))))
    temp01    = max(0.0, min(1.0, float(attrs.get("temp01", 0.0))))
    agree01   = max(0.0, min(1.0, float(attrs.get("agree01", 0.5))))

    # Extract additional diagnostic signals
    recon_sim      = max(0.0, min(1.0, float(attrs.get("recon_sim", 0.5))))
    concept_sparse = max(0.0, min(1.0, float(attrs.get("concept_sparsity", 0.5))))
    p_bad          = max(0.0, min(1.0, float(attrs.get("aux3_p_bad", 0.5))))
    token_ok       = 1.0 - p_bad  # clarity proxy: lower bad probability → clearer

    _logger.debug(f"Core signals - uncertainty: {unc01:.3f}, consistency: {cons01:.3f}, OOD: {ood01:.3f}")

    # Dimension-specific scoring using diagnostic patterns
    dim_scores: Dict[str, float] = {}
    
    # Reasoning: weighted toward stability, consistency, and confidence
    dim_scores["reasoning"] = 0.60*cons01 + 0.30*(1.0-unc01) + 0.10*agree01
    _logger.debug(f"Reasoning score: {dim_scores['reasoning']:.3f}")
    
    # Knowledge: emphasizes distribution alignment and comprehension
    dim_scores["knowledge"] = 0.50*(1.0-ood01) + 0.30*recon_sim + 0.20*(1.0-unc01)
    _logger.debug(f"Knowledge score: {dim_scores['knowledge']:.3f}")
    
    # Clarity: based on token quality and brevity
    dim_scores["clarity"] = 0.50*token_ok + 0.30*(1.0-len01) + 0.20*cons01
    _logger.debug(f"Clarity score: {dim_scores['clarity']:.3f}")
    
    # Faithfulness: reconstruction quality and stability
    dim_scores["faithfulness"] = 0.50*recon_sim + 0.30*cons01 + 0.20*(1.0-unc01)
    _logger.debug(f"Faithfulness score: {dim_scores['faithfulness']:.3f}")
    
    # Coverage: concept activity and confidence
    dim_scores["coverage"] = 0.40*concept_sparse + 0.40*(1.0-unc01) + 0.20*(1.0-ood01)
    _logger.debug(f"Coverage score: {dim_scores['coverage']:.3f}")

    # Ensure all scores are in valid [0,1] range
    for k in dim_scores:
        v = dim_scores[k]
        dim_scores[k] = float(min(1.0, max(0.0, v)))
    _logger.debug("Applied score clamping to [0,1] range")

    # Build final SCM dictionary
    scm: Dict[str, float] = {
        f"scm.{k}.score01": dim_scores[k]
        for k in ("reasoning", "knowledge", "clarity", "faithfulness", "coverage")
    }
    scm["scm.aggregate01"]   = float(sum(dim_scores.values())/5.0)
    scm["scm.uncertainty01"] = float(unc01)
    scm["scm.ood_hat01"]     = float(ood01)
    scm["scm.consistency01"] = float(cons01)
    scm["scm.length_norm01"] = float(len01)
    scm["scm.temp01"]        = float(temp01)
    scm["scm.agree_hat01"]   = float(agree01)

    _logger.debug(f"SCM construction complete - aggregate: {scm['scm.aggregate01']:.3f}")
    return scm

🤝 Hugging Face Scorers: Plug-in Judges That Speak SCM

We wanted to know: is the “gap” structural, or an artifact of our own models? To test this, we plugged in third-party Hugging Face models (Gemma, SmolLM, etc.) as independent judges. Despite different data and training recipes, their judgments, once translated into our SCM telemetry, produce the same gap signatures (Δ-fields, loop structure, Betti numbers) we see with HRM and Tiny. That robustness is the point.

What these scorers do (and don’t do)

  • Do: compute teacher-forced likelihood stats for a response conditioned on the goal (mean log-prob, perplexity, entropy, bits/byte, lengths); see the sketch after this list.
  • Don’t: assign semantic 0–1 scores directly. Instead, a plugin converts these stats into SCM (Shared Core Metrics), aligning Hugging Face outputs with HRM/Tiny so we can compare apples-to-apples and build the gap field.
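Here is a minimal sketch of those teacher-forced statistics with a generic Hugging Face recipe (this is not our HuggingFaceScorer internals; the masking convention and which stats to keep are assumptions):

import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "google/gemma-2-2b-it"  # any causal LM works here
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype="auto").eval()

def ll_stats(goal: str, response: str) -> dict:
    """Mean log-prob and perplexity of `response` conditioned on `goal` (teacher-forced)."""
    goal_ids = tok(goal, return_tensors="pt").input_ids
    resp_ids = tok(response, return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = torch.cat([goal_ids, resp_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : goal_ids.size(1)] = -100             # score only the response tokens
    with torch.no_grad():
        loss = model(input_ids, labels=labels).loss  # mean NLL over response tokens
    return {"mean_logprob": -loss.item(), "ppl": math.exp(loss.item()),
            "len_tokens": int(resp_ids.size(1))}

print(ll_stats("Explain photosynthesis.", "Plants convert light into chemical energy."))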

⚙️ How it works (at a glance)

  1. HF CausalLM (e.g., google/gemma-2-2b-it) runs in teacher-forced mode on (goal_text + response).
  2. We compute stable, Windows-friendly LL stats: mean_logprob, ppl, entropy_mean, bpb, token/char lengths.
  3. The SCM plugin derives 0–1 scores per dimension (reasoning, knowledge, clarity, faithfulness, coverage) from those stats and mirrors them under the model’s alias (e.g., gemma2b.reasoning.score01).
  4. All scorers now “speak SCM,” so we can compute Δ(HRM−HF), Δ(HF−Tiny), stitch them into a gap field, and analyze topology.

🛠️ Minimal config (drop-in)

# gap.yaml: add after the Tiny scorer
scorers:
  hf_gemma2b:
    class: "stephanie.scoring.scorer.huggingface_scorer:HuggingFaceScorer"
    model_name: "google/gemma-2-2b-it"
    model_alias: "gemma2b"
    max_seq_len: 4096
    device_map: "auto"
    torch_dtype: "auto"
    local_files_only: false
    dimensions: ["reasoning","knowledge","clarity","faithfulness","coverage"]
    plugins:
      scm:
        enabled: true
        # Optional knobs used by the SCM plugin when mapping LL→SCM
        params:
          model_alias: "gemma2b"
          topk: 0
          ppl_low: 5.0       # low-perplexity floor
          ppl_high: 40.0     # high-perplexity ceiling

This registers one HF scorer and enables the SCM plugin so the model’s raw LL stats are converted into SCM metrics automatically.


📱 Minimal usage (one call)

from stephanie.scoring.scorer.huggingface_scorer import HuggingFaceScorer

scorer = HuggingFaceScorer(cfg_scorer, memory, container, logger)

context = { "goal": { "text": goal_text } }  # or use your GOAL/GOAL_TEXT constants
scorable = type("X", (), {"text": assistant_answer})()  # any object with .text

bundle = scorer.score(context, scorable, dimensions=[
    "reasoning","knowledge","clarity","faithfulness","coverage"
])

# bundle.results is a dict[str, ScoreResult]
r = bundle.results["knowledge"]
print(r.dimension, r.score, r.attributes.get("gemma2b.ppl"), r.attributes.get("scm.knowledge.score01"))

scorer.close()  # releases VRAM/CPU and cleans up
  • Before the plugin runs, scores are placeholders (0.0) with rich attributes (LL stats).
  • After the plugin, the attributes also include scm.* and mirrored gemma2b.*.score01 keys used by downstream selectors and the gap field builder.
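For orientation, the augmented attributes on a single ScoreResult end up looking roughly like this (abbreviated, with illustrative values):

# Abbreviated, illustrative attribute dict after the SCM plugin has run
attributes = {
    # raw teacher-forced observables from the HF scorer
    "mean_logprob": -2.31, "ppl": 10.1, "entropy_mean": 2.8, "bpb": 1.4, "len_tokens": 164,
    # SCM columns derived by the plugin
    "scm.knowledge.score01": 0.71, "scm.reasoning.score01": 0.66, "scm.aggregate01": 0.68,
    # mirrored under the model alias for routing and selection
    "gemma2b.knowledge.score01": 0.71, "gemma2b.reasoning.score01": 0.66,
}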

🛋️ What’s inside (tight pseudocode, no boilerplate)

HF scorer core: teacher-forced stats only (no SCM logic here):

class HuggingFaceScorer(BaseScorer):
    def _score_core(self, context, scorable, dims):
        goal = context.get(GOAL, {}).get(GOAL_TEXT, "") or ""
        resp = scorable.text or ""
        stats = self._ll_stats(goal, resp)  # mean_logprob, ppl, entropy_mean, bpb, lengths...

        # Build a small alias vector; keep score placeholders (plugins fill SCM)
        vec = self._build_base_vector(self.model_alias, stats)
        results = {}
        for dim in dims:
            results[dim] = ScoreResult(
                dimension=dim, score=0.0, source="hf",
                rationale=f"{self.model_alias}[{dim}] ppl={stats['ppl']:.2f}, H̄={stats['entropy_mean']:.3f}",
                attributes={**stats, **vec}
            )
        return ScoreBundle(results=results)

Plugin factory & SCM plugin translate LL stats into SCM:

# plugins/factory.py
def build_plugins(cfg, container, logger, host_scorer):
    out = []
    for name, spec in (cfg.get("plugins") or {}).items():
        if not (isinstance(spec, dict) and spec.get("enabled")): 
            continue
        cls = _import_by_path(spec.get("class")) if "class" in spec else get_registered(name)
        plugin = cls(container=container, logger=logger, host=host_scorer, **(spec.get("params") or {}))
        out.append(plugin)
    return out

# plugins/scm_service_plugin.py (registered as "scm")
class SCMServicePlugin:
    def post_process(self, *, tap_output):
        attrs = tap_output.get("attributes", {})  # LL stats already there
        goal  = tap_output.get("goal_text", ""); resp = tap_output.get("resp_text","")
        stats = {k: attrs[k] for k in ("mean_logprob","ppl","entropy_mean","len_tokens","bpb") if k in attrs}
        if not stats and hasattr(self.host, "_ll_stats"):
            stats = self.host._ll_stats(goal, resp)  # fallback for non-HF scorers
        scm = self.scm_svc.derive_scm_from_ll(stats, ppl_low=self.ppl_low, ppl_high=self.ppl_high)
        # Mirror under alias.* so selectors can route by model
        for dim in ("reasoning","knowledge","clarity","faithfulness","coverage"):
            v = scm.get(f"scm.{dim}.score01")
            if v is not None:
                tap_output[f"{self.alias}.{dim}.score01"] = v
        return scm
  • Separation of concerns: HF scorer produces physics-like observables (LL stats); plugin converts them into policy-like SCM judgments; routing & Δ-analysis stay identical across HRM/Tiny/HF.
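In spirit, derive_scm_from_ll maps the likelihood stats into 0–1 SCM bands using the ppl_low / ppl_high knobs; a minimal sketch under our own assumptions (not the actual SCMService code):

def ppl_to_score01(ppl: float, ppl_low: float = 5.0, ppl_high: float = 40.0) -> float:
    """Map perplexity into [0, 1]: at or below ppl_low -> 1.0, at or above ppl_high -> 0.0."""
    if ppl <= ppl_low:
        return 1.0
    if ppl >= ppl_high:
        return 0.0
    return 1.0 - (ppl - ppl_low) / (ppl_high - ppl_low)

stats = {"ppl": 12.0, "entropy_mean": 2.8}
base = ppl_to_score01(stats["ppl"])          # 1 - (12-5)/(40-5) = 0.80
scm = {f"scm.{d}.score01": base for d in
       ("reasoning", "knowledge", "clarity", "faithfulness", "coverage")}
scm["scm.aggregate01"] = base

The real plugin blends more than one stat per dimension; the point is only that every judge lands on the same 0–1 scale HRM and Tiny already speak.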

👣 Mermaid: dynamic plugin flow for HF scorers

    flowchart LR
  %% Define color scheme and styles
  classDef startEnd fill:#4CAF50,stroke:#388E3C,stroke-width:2px,color:white
  classDef process fill:#2196F3,stroke:#1976D2,stroke-width:2px,color:white
  classDef data fill:#FF9800,stroke:#F57C00,stroke-width:2px,color:white
  classDef decision fill:#FFC107,stroke:#FFA000,stroke-width:2px,color:black
  classDef plugin fill:#9C27B0,stroke:#7B1FA2,stroke-width:2px,color:white
  classDef analysis fill:#607D8B,stroke:#455A64,stroke-width:2px,color:white

  A["'🎯📤 Goal + Response'"] --> B["'🤗 HuggingFaceScorer<br/>📊 Teacher-forced LL Stats'"]
  B -->|"'📈 mean_logprob<br>🎲 ppl<br>🌀 entropy<br>💾 bpb<br>📏 lengths'"| C["'💾 ScoreBundle<br/>📋 Attributes'"]
  C --> D{"'🔌 Plugins enabled?'"}
  D -- "'✅ yes'" --> E["'🔧 SCMServicePlugin<br/>🎯 derive_scm_from_ll'"]
  E -->|"'🏷️ scm.* + alias.*.score01'"| F["'✨ Augmented ScoreBundle'"]
  D -- "'❌ no'" --> F
  F --> G["'📊 Gap Field Builder<br/>📐 Δ(HRM−HF), Δ(HF−Tiny)'"]
  G --> H["'🧮 VPM / Topology<br/>🔄 loops, 📊 Betti numbers'"]

  %% Apply styling classes
  class A startEnd
  class B process
  class C data
  class D decision
  class E plugin
  class F data
  class G process
  class H analysis

  %% Add some visual enhancements
  linkStyle default stroke:#666,stroke-width:2px
  linkStyle 3 stroke:#4CAF50,stroke-width:3px
  linkStyle 4 stroke:#F44336,stroke-width:3px
  
  • The same diagram applies to any base scorer: plugins make the system dynamic and uniform.

⭐ Why this matters

  • Cross-model robustness: We observe the same structured gaps with HF judges trained on different corpora and recipes.
  • Cost-aware scale-out: HF scorers are cheap and fast, perfect for bulk labeling, A/B checks, and Δ-sanity passes before escalating to HRM.
  • One language to rule them all: SCM unifies Tiny, HRM, and HF. That unification is what unlocks Δ-fields and the topological lens we use to learn from failure.

📌 Engineering notes (copy/paste friendly)

  • Determinism: temperature=0 (when generating), fixed max_seq_len, consistent goal/response concatenation.
  • Windows-friendly: we force eager attention in the HF model config to avoid Flash/SDPA edge cases.
  • Memory hygiene: call scorer.close() to move model to CPU and free VRAM (empty_cache, ipc_collect).
  • Config knobs that matter: ppl_low, ppl_high in the SCM plugin; model_alias for consistent column names; device_map: auto for multi-GPU.

🍕 Takeaway

Hugging Face scorers “blew the doors off” because they showed the gap is not parochial to our models. By translating external models into SCM, we can see and measure the same structures and then use them: for routing, for calibration, and, in the next post, for turning Δ-hotspots into a training signal where hallucination becomes a superpower.

📦 HF Run Snapshot (Δ-field topology)

  • Run: 7422, created 2025-10-22 19:54:30Z.
  • Models (small–small pair): gemma-2-2b-it in the HRM role · SmolLM3-3B in the Tiny role (also recorded in the final summary block).
  • Dataset size: 2,502 triples scored.
  • Topological readout (Betti numbers, Δ = HRM − Tiny): b₀ = 808, b₁ = 287, top H₁ persistence ≈ 0.147.

Why this is “strong”: A high b₁ (287) with a non-trivial top persistence (~0.147) means we’re seeing many stable 1-D loops in the disagreement field, i.e., structured, persistent regions where the two judges diverge in consistent ways. That’s exactly the signature we look for when arguing the gap is structural, not a quirk of one stack.
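If you want to produce this kind of readout yourself, here is a minimal sketch using the ripser package as a stand-in for the pipeline’s persistent-homology step (the Δ matrix here is random placeholder data, so the numbers will not match the run above):

import numpy as np
from ripser import ripser  # pip install ripser

# delta: (n_samples, n_scm_columns) matrix of Δ = HRM − Tiny rows
delta = np.random.default_rng(0).normal(size=(500, 12))  # placeholder data

dgms = ripser(delta, maxdim=1)["dgms"]      # persistence diagrams for H0 and H1
h0, h1 = dgms[0], dgms[1]
b0 = h0.shape[0]                            # component count (b0)
b1 = h1.shape[0]                            # loop count (b1)
top_h1 = float((h1[:, 1] - h1[:, 0]).max()) if b1 else 0.0
print(f"b0={b0}, b1={b1}, top H1 persistence={top_h1:.3f}")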

Figure: HF loop comparison (HFI). Tiny: a single bright band (only one feature carries signal). HRM: rich texture, with activity across all features.

🌇 Gap Component Architecture at a glance

This is what we are building in this post. It looks complex, but really we are:

  • training two models using the same data
  • using them to evaluate the same texts
  • creating images from those evaluations
  • looking for information in the differences between those images
  • mapping and describing the information we see

What you’re seeing: the HRM’s end-to-end signal path (input ➜ hierarchical core ➜ heads). The green path is the calibrated quality score; the pink/blue/purple/yellow paths are diagnostics we’ll later align with Tiny and turn into the GAP Δ-field.

What to notice (scan the colors):

🟩 Primary scoring (green): temperature-calibrated score01; this is the value you’d naively compare across models.

🟥 Uncertainty & confidence (pink): logvar → certainty01 and a 3-bucket entropy; these let us tell how sure the score is.

🟦 Agreement & robustness (blue): predicted model disagreement and consistency; these are core to the GAP analysis and routing.

🟪 Specialized diagnostics (purple): OOD, reconstruction sim, finite-difference sensitivity; signals that explain why models diverge.

🟨 Evidence accumulation (yellow): a “halt” signal that tracks how much evidence the model thinks it has.

Why this matters: these heads give us a rich basis to translate HRM and Tiny into a shared language for models to communicate (SCM), subtract them, and then “see” the difference map between two ways of thinking (Δ = HRM − Tiny).

    graph TD
    %% Title and Input Section
    A[🎯 HRM Hierarchical Reasoning Model<br/>Multi-Head Architecture] --> B[📥 Input Layer]
    
    B --> C[🔮 Input Projector<br/>x → x̃]
    
    %% Hierarchical Core Processing
    C --> D{🔄 Hierarchical Core<br/>Dual Recurrent Processing}
    
    D --> E[🐢 Low-Level Module L<br/>Fine-grained Analysis<br/>T steps per cycle]
    D --> F[🐇 High-Level Module H<br/>Abstract Reasoning<br/>1 step per cycle]
    
    E --> G[🔄 State Feedback Loop]
    F --> G
    G --> D
    
    %% Final States
    D --> H[💎 Final States<br/>zL_final + zH_final]
    
    %% Primary Scoring Pathway
    H --> I[🌡️ Temperature Head<br/>τ calibration]
    H --> J[⭐ Score Head<br/>Quality logits]
    
    I --> K[🎯 Primary Score<br/>score01 ∈ 0,1<br/>Temperature calibrated]
    J --> K
    
    %% Uncertainty & Confidence Heads
    H --> L[📊 LogVar Head<br/>Aleatoric uncertainty]
    H --> M[🔢 Aux3 Head<br/>Bad/Medium/Good]
    
    L --> N[✅ Certainty01<br/>Uncertainty measure]
    M --> O[📶 Entropy Aux<br/>Confidence score]
    
    %% Agreement & Robustness Heads
    H --> P[⚔️ Disagree Head<br/>HRM-Tiny disagreement]
    H --> Q[🛡️ Consistency Head<br/>Robustness prediction]
    
    P --> R[🔄 Disagree Hat<br/>Predicted disagreement]
    Q --> S[🎯 Consistency Hat<br/>Robustness score]
    
    %% Specialized Diagnostic Heads
    H --> T[🚫 OOD Head<br/>Out-of-distribution]
    H --> U[🔁 Recon Head<br/>Input reconstruction]
    H --> V[📏 Jacobian FD<br/>Sensitivity analysis]
    
    T --> W[🎯 OOD Hat<br/>Anomaly detection]
    U --> X[📐 Recon Sim<br/>Comprehension quality]
    V --> Y[📊 Jacobian FD<br/>Input sensitivity]
    
    %% Evidence Accumulation
    H --> Z[🛑 Halt Signal<br/>Evidence accumulation]
    Z --> AA[🎲 Halt Prob<br/>Pseudo-halting]
    
    %% Styling and Grouping
    classDef input fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    classDef core fill:#fff3e0,stroke:#e65100,stroke-width:3px
    classDef primary fill:#e8f5e8,stroke:#2e7d32,stroke-width:3px
    classDef uncertainty fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef agreement fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
    classDef diagnostic fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef evidence fill:#fff8e1,stroke:#ff8f00,stroke-width:2px
    
    class A,B,C input
    class D,E,F,G core
    class I,J,K primary
    class L,M,N,O uncertainty
    class P,Q,R,S agreement
    class T,U,V,W,X,Y diagnostic
    class Z,AA evidence

    %% Legend
    subgraph Legend[📖 Legend - Head Types]
        L1[🟩 Primary Scoring] --> L2[🟥 Uncertainty & Confidence]
        L2 --> L3[🟦 Agreement & Robustness]
        L3 --> L4[🟪 Specialized Diagnostics]
        L4 --> L5[🟨 Evidence Accumulation]
    end
  

When two models look at the same problem, they don’t think the same thoughts.
Here we take the same data and the same target, run a heavyweight reasoner (HRM) and a tiny recursive scorer (Tiny), and ask a different question: what lives in the space between them?

By aligning their outputs and subtracting (Δ = HRM − Tiny), that “between-space” turns into a map. It isn’t smooth. It has structure: loops, knots, and holes that neither model shows alone.


🌅 Execution flow

What you’re seeing: the full GAP run as a dual-pass pipeline with explicit GPU hygiene and provenance. We score HRM first, flush VRAM, then score Tiny, align both to SCM, compute Δ, do topology, and ship visuals + manifest.

Key beats (7 steps):

  1. Prep: curate & dedupe turns per dimension (caps applied).
  2. Pass A (HRM): score → SCM → timeline frames; persist matrices.
  3. VRAM handoff: unload HF models, torch.cuda.empty_cache() + torch.cuda.ipc_collect().
  4. Pass B (Tiny): score → SCM → timeline frames; persist matrices.
  5. Alignment: build VPMs in a common schema with scm.* columns.
  6. Δ-field & topology: compute Δ = HRM − Tiny; run PH to find loops (H₁).
  7. Artifacts: GIF timelines, frontier/epistemic maps, barcodes, manifest (run keys, seeds, configs, checksums).

Why this matters: dual-pass + explicit unload makes results deterministic and reproducible on a single GPU; SCM gives us the consistent coordinate system needed to turn raw scores into a visual reasoning map.
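The VRAM handoff in step 3 amounts to a few lines of explicit cleanup between passes; a minimal sketch (the real pipeline does this via the scorers’ close() plus the orchestrator):

import gc
import torch

def vram_handoff(scorer) -> None:
    """Release pass-A models before loading pass-B (sketch of the unload step)."""
    if hasattr(scorer, "close"):
        scorer.close()                # move the model to CPU / drop references
    del scorer
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()      # return cached blocks to the driver
        torch.cuda.ipc_collect()      # clean up CUDA IPC handles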

    sequenceDiagram
    title 🎯 GAP Analysis Pipeline - Complete Execution Flow
    participant A as 🧠 GapAgent
    participant O as 🎼 Orchestrator
    participant S as 🚀 ScoringProcessor
    participant H as 🏛️ HRM Scorers
    participant T as ⚡ Tiny Scorers
    participant C as 🔄 SCMService
    participant V as 👁️ VPMWorker
    participant G as 💾 GapStorage
    participant J as 📊 TopologyAnalyzer
    participant N as 🎨 VisualGenerator
    participant M as 📋 Manifest

    Note over A,O: 🚀 Pipeline Initialization
    A->>O: run(context)
    activate O
    O->>S: execute_scoring(triples)
    activate S
    
    Note over S: HRM Scoring Phase
    S->>H: score_hrm(triples)
    activate H
    H-->>S: hrm_scores
    deactivate H
    
    Note over S: Tiny Scoring Phase
    S->>T: score_tiny(triples)
    activate T
    T-->>S: tiny_scores
    deactivate T
    
    Note over S: SCM Alignment
    S->>C: align_to_scm(hrm_scores, tiny_scores)
    activate C
    C-->>S: scm_rows, matrices
    deactivate C
    
    Note over S: Visualization
    S->>V: generate_timelines(hrm_matrix, tiny_matrix)
    activate V
    V-->>S: hrm_gif, tiny_gif
    deactivate V
    
    S->>G: save_matrices(hrm_matrix, tiny_matrix)
    activate G
    G-->>S: matrix_paths
    deactivate G
    S-->>O: scoring_results
    deactivate S

    Note over O: Analysis Phase
    O->>J: analyze_topology(delta_field)
    activate J
    J-->>O: topology_results
    deactivate J
    
    O->>N: generate_visuals(delta_field, topology_results)
    activate N
    N-->>O: visuals
    deactivate N
    
    O->>M: create_manifest(scoring_results, topology_results, visuals)
    activate M
    M-->>O: manifest
    deactivate M
    
    O-->>A: result
    deactivate O
  

Why dual-pass? HRM and Tiny have very different memory requirements. By scoring HRM first, then explicitly freeing GPU memory before scoring Tiny, we prevent VRAM thrashing and ensure a deterministic ordering of the data. This is critical for reproducibility; otherwise, the same input could produce different results due to memory constraints. It is also practical: our test machine uses a consumer GPU that can only load one Hugging Face model at a time.


🪞 GapAgent: The Doorway Between AI Minds

👉 Full Code Here

At the very top of the GAP architecture lives the GapAgent, a small but crucial class that defines how the system boots up, what inputs it expects, and how the full reasoning pipeline gets executed.

In many ways, this is the entry point of insight: the class you call when you want to ask:

“What is the difference between how two models think about the same idea?”

It doesn’t do much work itself, and that’s the point. It delegates all heavy lifting to the GapOrchestrator, while ensuring:

  • a clean interface (run(context))
  • proper configuration loading
  • final result collation and return

This minimalism is deliberate: by keeping the GapAgent lightweight, it can be reused across pipelines, integrated into dashboards, or wrapped by automation scripts that batch and monitor runs across models, seeds, or tasks.

🤖 Code Snapshot: What GapAgent Looks Like

class GapAgent:
    def __init__(self, config: GapConfig, container, logger):
        self.config = config
        self.container = container
        self.logger = logger

    async def run(self, context: Dict[str, Any]) -> Dict[str, Any]:
        orchestrator = GapOrchestrator(self.config, self.container, self.logger)
        return await orchestrator.run_gap(context)
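A minimal invocation might look like this (gap_config, container, and logger stand in for however your stack builds them; the context keys mirror the orchestrator excerpt later in this section):

import asyncio

# gap_config / container / logger are assumed to be built the same way as for other agents
agent = GapAgent(config=gap_config, container=container, logger=logger)

context = {
    "pipeline_run_id": "gap_run_001",   # becomes the run_id downstream
    "dataset": "chat_turns_v1",         # recorded in the manifest
}

result = asyncio.run(agent.run(context))
print(sorted(result.keys()))  # e.g. analysis, calibration, manifest, report, run_id, score, ...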

🧭 How It Fits Into the Bigger Picture

Think of GapAgent as the dispatcher. It doesn’t know the internals of scoring or topology but it knows which expert to call. The run() function is designed for:

  • Async compatibility (to work with an agent hub, CLI tool, or FastAPI backend)
  • Injection of configuration and dependencies
  • Minimal surface area: all logic is delegated

This makes it ideal for use in:

  • Autonomous loops
  • Evaluation suites
  • Long-term learning traces
  • Distributed pipelines (e.g., NATS-based task queues)

🎼 GapOrchestrator: The Conductor of Comparative Reasoning

👉 Full Code Here

If GapAgent is the front door of the pipeline, then GapOrchestrator is the conductor of an orchestra where HRM and Tiny are two musicians playing the same piece but with fundamentally different interpretations.

This class is where the true flow of the GAP analysis unfolds: a carefully staged sequence of retrieval, scoring, alignment, analysis, and output generation. It owns the full lifecycle of a reasoning comparison run, with visibility into every step and artifact.

“If the GapAgent says go, the Orchestrator says how far, how fast, and with what traceable steps.”


🧬 The Purpose

At its core, GapOrchestrator exists to:

  • Coordinate every step of the analysis (data → scoring → delta → topology → metrics → images)
  • Inject structure into what could easily become a tangle of processors and side-effects
  • Serve as the single source of progress, error tracking, and run manifest logging
  • Keep all processing introspectable, enabling future reflection, audit, or self-improvement

This is where the magic happens: transforming raw model outputs into a structured map of why models disagree, not just what they disagree about.


🧩 Class Overview

class GapAnalysisOrchestrator(ProgressMixin):
    def __init__(self, config: GapConfig, container, logger, memory=None):
        self.config = config
        self.container = container
        self.logger = logger
        self.memory = memory
        
        # Initialize all processors
        self.scoring_processor = ScoringProcessor(self.config, container, logger)
        self.analysis_processor = AnalysisProcessor(self.config, container, logger)
        self.calibration_processor = CalibrationProcessor(self.config, container, logger)
        self.significance_processor = SignificanceProcessor(SignificanceConfig(), logger=logger)
        
        # Set up storage and manifest
        self.storage = self.container.get("gap_storage")
        self.manifest_manager = ManifestManager(self.storage)
        
        # Progress tracking system
        self._init_progress(container, logger)

The constructor wires in all dependencies with precision:

  • config: includes dimensions, batch sizes, task setup, model names, file paths
  • container: contains injected services like SCM, embeddings, scorer factories
  • logger: logs every step, error, and progress marker
  • ProgressMixin: provides visual and CLI progress updates per stage

This careful initialization creates a self-contained system where every component knows its role and dependencies.


🧵 The Main Thread: execute_analysis(context)

The orchestrator’s execute_analysis() method is the pipeline heartbeat:

async def execute_analysis(self, context: Dict[str, Any]) -> Dict[str, Any]:
    run_id = context.get("pipeline_run_id", "gap_run")
    dataset_name = context.get("dataset", "unknown")
    # (the retriever and the run manifest `m` used below are set up earlier; elided in this excerpt)
    
    # 1) Data Preparation: Retrieve conversation turns organized by reasoning dimension
    triples_by_dim = await self.retriever.get_triples_by_dimension(
        self.config.dimensions,
        memory=self.memory,
        limit=self.retriever.cfg.limit,
    )
    
    # 2) Model Scoring: Run HRM and Tiny models on all samples
    score_out = await self.scoring_processor.execute_scoring(
        triples_by_dim,
        run_id,
        manifest=m,
    )
    
    # 3) Analysis: Compute delta fields, persistent homology, frontier maps
    analysis_out = await self.analysis_processor.execute_analysis(
        score_out,
        run_id,
        manifest=m,
    )
    
    # 4) Significance Testing: Statistical validation of topological findings
    significance_out = await self.significance_processor.run(
        run_id,
        base_dir=self.config.base_dir,
    )
    
    # 5) Calibration: Determine routing thresholds and model escalation policies
    calib_out = await self.calibration_processor.execute_calibration(
        analysis_out,
        run_id,
        alias_a="HRM",
        alias_b="Tiny",
    )
    
    # 6) Reporting: Generate comprehensive Markdown report
    report_out = await ReportBuilder(self.config, self.container, self.logger).build(
        run_id,
        analysis_out,
        score_out,
    )
    
    # 7) Finalize manifest with complete results
    result = {
        "run_id": run_id,
        "score": score_out,
        "analysis": analysis_out,
        "significance": significance_out,
        "calibration": calib_out,
        "report": report_out,
        "manifest": m.to_dict(),
    }
    
    self.manifest_manager.finish_run(run_id, result)
    return result

Each step is:

  • Isolated: Each processor handles only its specific task
  • Deterministic: same input → same output
  • Progress-tracked: every step logs its progress
  • Error-handled: failures are caught and logged without breaking the pipeline

📊 A Closer Look: How Analysis Flows

    graph LR
    A[Start: Context + Config] --> B[Data Preparation]
    B -->|"Retrieve conversation turns<br>dedupe by dimension<br>ensure consistency"| C[HRM Scoring Pass]
    C -->|"Unload HRM<br>Clear GPU memory<br>Free resources"| D[Tiny Scoring Pass]
    D -->|"Align scores to<br>Shared Core Metrics<br>SCM format"| E[Delta Field Creation]
    E -->|"Compute persistent<br>homology<br>Find H1 loops<br>Topological features"| F[Topology Analysis]
    F -->|"Generate visualizations<br>Frontier maps<br>PHOS packs<br>UMAP overlays"| G[Visual Artifact Generation]
    G -->|"Save all artifacts<br>Track paths<br>Record metadata"| H[Manifest Finalization]
    
    classDef default fill:#f8f9fa,stroke:#495057,stroke-width:2px,color:#212529
    classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px,color:#0d47a1
    classDef process fill:#fff3e0,stroke:#f57c00,stroke-width:2px,color:#e65100
    classDef output fill:#e8f5e8,stroke:#388e3c,stroke-width:2px,color:#1b5e20
    classDef diagnostic fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#4a148c

    class A input
    class B,C,D,E process
    class F diagnostic
    class G,H output
  

This diagram shows the complete workflow:

  1. Data Preparation: Retrieve conversation turns organized by reasoning dimension, deduping and capping samples
  2. HRM Scoring Pass: Process all samples with the heavyweight HRM model
  3. VRAM Handoff: Explicitly free GPU memory before scoring Tiny to prevent thrashing
  4. Tiny Scoring Pass: Process the same samples with the lightweight Tiny model
  5. Delta Field Creation: Compute HRM - Tiny differences in a common metric space
  6. Topology Analysis: Use persistent homology to find “holes” and “loops” in the reasoning space
  7. Visual Artifact Generation: Create visualizations that make these differences visible
  8. Manifest Finalization: Save all artifacts with complete provenance

📁 Manifest = The Source of Truth

The GapRunManifest object is the orchestrator’s memory. It tracks:

👉 Full Code Here

  • Model names, seeds, batch size
  • Paths to score files, SCM CSVs, delta fields, Betti numbers
  • Image paths for timeline, PHOS, frontier, and topological overlays
  • Statistical significance results (p-values, confidence intervals)
  • Calibration parameters and routing thresholds

This lets downstream agents (e.g. a dashboard, CLI tool, or visual debugger) immediately access results without recomputing anything.

“Every gap run becomes a traceable artifact, like a black-box recorder for AI comparison.”


🧠 Why This Matters

GapOrchestrator is a commitment to structured reflection. It doesn’t just run things; it organizes thought into layers, dimensions, and traceable signals. It’s designed for:

  • Repeatability: same settings = same outputs, even across different machines
  • Scalability: handles hundreds of tasks, thousands of triples
  • Visual Debugging: clear output paths, image artifacts, topological overlays
  • Future Learning: every output is usable for training, scoring, or adaptive routing

This orchestration isn’t just about gluing components together; it’s about making the reasoning process visible, accountable, and trainable. When you can see exactly where and why models disagree, you can build systems that don’t just get answers right, but understand how they get them right.


🧩 TL;DR

  • GapOrchestrator is the reasoning conductor of the system
  • Orchestrates every step: data, scoring, delta, topology, reporting
  • Stores everything in a manifest for reproducibility and downstream reuse
  • Clean, modular, and designed to be extended or introspected later

“In Stephanie, orchestration isn’t just about gluing components; it’s about making the reasoning process visible, accountable, and trainable.”


⚙️ ScoringProcessor: The Scientific Engine Behind Model Comparison

👉 Full Code Here

“When comparing two models, you don’t mix them in the same beaker; you run them one after the other, in isolation, under identical conditions. That’s the scientific method.”

Imagine you’re testing two chefs to see who makes the better lasagna. If you have them cook side-by-side in the same kitchen, you’d never know if differences in flavor came from their skills or from one accidentally using the other’s ingredients. The only fair comparison is to have each chef prepare the same dish separately, with identical ingredients, tools, and conditions.

This is exactly why our ScoringProcessor implements a dual-pass scoring system. It’s not just code; it’s scientific rigor baked into our AI evaluation pipeline.


🔁 Why We Don’t Score Models Together

When scoring both HRM and Tiny models simultaneously, chaos ensues:

  • HRM’s heavyweight memory footprint causes GPU cache overflow
  • Tiny’s inference becomes unstable or fails due to VRAM contention
  • Score order (HRM first vs Tiny first) causes inconsistencies
  • Any “delta” we calculate becomes polluted by hardware artifacts

This isn’t theoretical. We’ve observed real cases where the same input produced wildly different Tiny scores just because HRM left memory artifacts behind.


🧪 Dual-Pass Scoring: Rigorous, Repeatable, Fair

The ScoringProcessor solves this through carefully staged execution:

async def execute_scoring(self, triples_by_dim, run_id, manifest):
    # 1. HRM PASS: Score all samples with HRM
    hrm_results = await self._score_model_pass("hrm", triples_by_dim)

    # 2. CLEAR MEMORY: Critical step
    self._free_gpu_memory()

    # 3. TINY PASS: Score same samples with Tiny
    tiny_results = await self._score_model_pass("tiny", triples_by_dim)

    # 4. ALIGN: Convert to shared schema
    return self._align_and_store(hrm_results, tiny_results)

Each step upholds the scientific method:

  1. HRM Pass

    • HRM scores all input triples by dimension
    • We allow full GPU access for clean, high-resolution output
  2. Memory Clearance

    • torch.cuda.empty_cache(), gc.collect(), torch.cuda.ipc_collect()
    • Ensures a neutral, cold-GPU state for the next model
  3. Tiny Pass

    • Tiny gets the exact same inputs but with no HRM interference
    • We guarantee fairness in memory, batch order, and execution conditions
  4. Alignment and Storage

    • Outputs are mapped to a common vector schema (via SCMService)
    • Data is saved for downstream delta computation and visualization

This isn’t just good engineering; it’s controlled experimentation. Equal inputs, controlled environments, deterministic order. AI scoring meets lab science.
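
For reference, a minimal sketch of what the memory-clearance helper can look like, built from the calls listed in step 2 above (the real helper may do more, such as explicitly deleting the loaded model object):

import gc
import torch

def _free_gpu_memory():
    """Return the GPU to a neutral, cold state between scoring passes."""
    gc.collect()                    # drop Python-side references first
    if torch.cuda.is_available():
        torch.cuda.empty_cache()    # release cached CUDA blocks
        torch.cuda.ipc_collect()    # reclaim inter-process CUDA memory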


🌐 SCMService: A Shared Language of Thought

Raw scores from different models are meaningless without a common frame of reference. The SCMService provides this bridge:

👉 Full Code Here

from stephanie.components.gap.shared_scm import scm_from_vector

scm_vector = scm_from_vector(model_output, dimension, model_type)

SCM translates diverse outputs into a shared schema, such as:

  • scm.reasoning.score01
  • scm.knowledge.score01
  • scm.uncertainty01
  • scm.contrastiveness01
  • scm.focus_entropy01

These metrics are:

  • Normalized to 0–1 scale
  • Aligned across dimensions
  • Tagged for clarity and reproducibility

Without SCM, comparing models is like comparing temperatures in Celsius and Fahrenheit: same phenomenon, incompatible units. SCM gives us the universal scale for comparing minds.
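
As a rough illustration of the idea (not the actual scm_from_vector implementation), the mapping boils down to squashing whatever raw statistics a scorer emits into [0, 1] and publishing them under shared scm.* keys:

import math

def to_scm(raw: dict, dimension: str) -> dict:
    """Hypothetical sketch: normalize heterogeneous raw stats into shared scm.* keys."""
    def clip01(x: float) -> float:
        return max(0.0, min(1.0, x))

    return {
        f"scm.{dimension}.score01": clip01(raw.get("score", 0.0) / 100.0),  # 0-100 → 0-1
        "scm.uncertainty01": clip01(raw.get("uncertainty", 0.0)),
        # entropy-like stats can be squashed with a bounded transform, e.g. 1 - exp(-H)
        "scm.focus_entropy01": clip01(1.0 - math.exp(-raw.get("entropy", 0.0))),
    }

# Usage: both HRM and Tiny outputs end up in the same, comparable space
hrm_scm = to_scm({"score": 87, "uncertainty": 0.12, "entropy": 1.4}, "reasoning")
tiny_scm = to_scm({"score": 61, "uncertainty": 0.31, "entropy": 2.2}, "reasoning")
delta = hrm_scm["scm.reasoning.score01"] - tiny_scm["scm.reasoning.score01"]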

Here’s the canonical flow:

    graph LR
    %% Define styles and colors
    classDef hrm fill:#FF6B6B,stroke:#FF4757,stroke-width:3px,color:white
    classDef tiny fill:#4ECDC4,stroke:#00A8FF,stroke-width:3px,color:white
    classDef scm fill:#FFD93D,stroke:#FF9F43,stroke-width:3px,color:black
    classDef output fill:#6C5CE7,stroke:#A29BFE,stroke-width:3px,color:white
    classDef delta fill:#00B894,stroke:#55E6C1,stroke-width:3px,color:white

    %% Nodes with emojis and styling
    A["'🏔️ HRM Raw Output'"] -->|"'🔧 SCM Service<br/>📊 Semantic Scoring'"| B["'🧠 scm.reasoning.score01'"]
    A -->|"'🔧 SCM Service<br/>📊 Semantic Scoring'"| C["'📚 scm.knowledge.score01'"]
    A -->|"'🔧 SCM Service<br/>📊 Semantic Scoring'"| D["'❓ scm.uncertainty01'"]
    
    E["'🤖 Tiny Raw Output'"] -->|"'🔧 SCM Service<br/>📊 Semantic Scoring'"| B
    E -->|"'🔧 SCM Service<br/>📊 Semantic Scoring'"| C
    E -->|"'🔧 SCM Service<br/>📊 Semantic Scoring'"| D
    
    B --> F["'🆚 Δ-field = HRM - Tiny<br/>📈 Difference Analysis'"]
    C --> F
    D --> F

    %% Apply styling classes
    class A hrm
    class E tiny
    class B,C,D output
    class F delta

    %% Style the links
    linkStyle 0,1,2 stroke:#FF9F43,stroke-width:3px
    linkStyle 3,4,5 stroke:#00A8FF,stroke-width:3px
    linkStyle 6,7,8 stroke:#6C5CE7,stroke-width:3px

    %% Add title and description
    subgraph "'🎯 Multi-Model Scoring Comparison Pipeline'"
        A
        E
    end
  

🖼️ Artifact Generation: Seeing the Mind at Work

The ScoringProcessor also generates artifacts not just data, but windows into cognition:

  • rows_for_df.parquet: tabular scores for each triple × dimension × model
  • Timeline GIFs: visual activations across dimensions over time
  • PHOS-packed VPMs: compressed visual representations of reasoning differences
  • Delta overlays (Δ): HRM vs Tiny difference fields, stored in compressed formats
  • Image telemetry: used by Phōs and VPM layers for final visualization and evaluation

These aren’t just pretty visuals. They are structured, diagnostic signals: the “mirror shards” of AI cognition.

Example: When HRM distributes attention evenly across all five dimensions and Tiny locks in on contrastiveness and diagnostic entropy, the VPM timelines make that difference visible, pixel by pixel.


🧠 Why This Layer Matters

This scoring layer isn’t just about gathering numbers; it’s the foundation for the entire gap field. If this step is biased, inconsistent, or unrepeatable:

  • Δ-fields become noise
  • Topological analysis finds false “holes”
  • Calibration thresholds are meaningless
  • Routing decisions break down

But when done right:

  • Sequential passes ensure fairness
  • SCM ensures semantic alignment
  • Structured artifacts ensure traceability
  • The gap becomes signal, not artifact

You cannot see the structure of disagreement unless you score the models in isolation, convert their beliefs to a common form, and store every diagnostic detail. That’s the core philosophy of this processor: scoring as a science of comparison.


🧪 Final Takeaway

The ScoringProcessor isn’t a loop; it’s a lab. A controlled environment where minds are mirrored, scored, and converted into shared coordinates. Without it, the GAP pipeline is blind. With it, we see how two intelligences diverge, not just in results, but in reasoning itself.

“To measure the gap between models, you must first make them speak the same language and then listen carefully to what each leaves unsaid.”


🔍 AnalysisProcessor: Turning Model Gaps into Measurable Insights

👉 Full Code Here

“When two models look at the same problem but see it differently, the difference isn’t noise; it’s a map of uncharted territory. This is where we find the real intelligence.”

Imagine you’re comparing two maps of the same mountain range. One shows peaks and valleys, the other shows rivers and roads. Both are accurate, but they tell different stories. The AnalysisProcessor is the cartographer that doesn’t just compare the maps; it creates a third: a topographical overlay that shows exactly where they diverge, and what that divergence means.

This is where Stephanie’s model comparison becomes a science and a story.


🌄 The Δ-Field: Where Disagreement Becomes Data

The foundation of our analysis is simple: subtraction.

delta = hrm_score - tiny_score

But this isn’t arithmetic; it’s discovery.

When HRM scores a document as 0.9 on reasoning and Tiny gives it 0.6, that 0.3 gap isn’t a mistake. It’s a signal: maybe HRM saw a coherent logic chain Tiny missed, or maybe Tiny penalized a hallucination HRM overlooked.

We extract these deltas for every dimension (e.g., faithfulness, knowledge, reasoning, uncertainty, style) and assemble them into a Δ-field matrix.

This Δ-field becomes our geometry of disagreement:

  • 🟢 High positive values → HRM stronger
  • 🔴 High negative values → Tiny stronger
  • Near zero → Agreement
  • ⛰️ Sharp gradients → Epistemic boundaries

“The difference between models isn’t just a number; it’s the terrain where true understanding lives.”
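
To make the construction concrete, here is a minimal sketch (all names and data are illustrative) of assembling a Δ-field matrix from per-dimension SCM scores and reading off its sign structure:

import numpy as np

dimensions = ["reasoning", "knowledge", "clarity", "faithfulness", "coverage"]

# hrm_scores, tiny_scores: (n_samples, n_dimensions) arrays in SCM space (0-1)
rng = np.random.default_rng(0)                      # stand-in data for illustration
hrm_scores = rng.uniform(size=(1000, len(dimensions)))
tiny_scores = rng.uniform(size=(1000, len(dimensions)))

delta_field = hrm_scores - tiny_scores              # Δ = HRM − Tiny, per sample × dimension

for d, name in enumerate(dimensions):
    col = delta_field[:, d]
    print(f"{name:12s}  Δ-mean={col.mean():+.3f}  "
          f"HRM-stronger={np.mean(col > 0.05):.0%}  "
          f"Tiny-stronger={np.mean(col < -0.05):.0%}  "
          f"agree={np.mean(np.abs(col) <= 0.05):.0%}")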


🌀 Betti Numbers: Topological Signatures of Reasoning

Next comes persistent homology, a topological method that detects structural patterns in score space.

These aren’t just academic curiosities. Betti numbers quantify the shape of disagreement:

  • Betti-0 (β₀): How many disconnected regions of agreement?
  • Betti-1 (β₁): How many loops of recurring disagreement?
  • Betti-2+ (β₂, β₃, …): Higher-dimensional “voids”, deep contradictions in reasoning structure

    graph TD
    %% ===== STANDARD TEMPLATE - ARCHITECTURE =====
    classDef default fill:#f8f9fa,stroke:#495057,stroke-width:2px,color:#212529
    classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px,color:#0d47a1
    classDef process fill:#fff3e0,stroke:#f57c00,stroke-width:2px,color:#e65100
    classDef output fill:#e8f5e8,stroke:#388e3c,stroke-width:2px,color:#1b5e20
    classDef diagnostic fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#4a148c

    A[HRM vs Tiny Scores] --> B[Δ-Field Matrix]
    B --> C[Topological Analysis]
    C --> D[Betti-0: Clusters of Agreement]
    C --> E[Betti-1: Disagreement Loops] 
    C --> F[Betti-2+: Reasoning Voids]
    D --> G[Fragmentation of Trust]
    E --> H[Cycles of Divergence]
    F --> I[Fundamental Contradictions]
    
    class A input
    class B,C process
    class D,E,F diagnostic
    class G,H,I output
  

Why this matters: if we observe a stable Betti-1 loop in the faithfulness dimension, that means the two models are consistently cycling through disagreement in predictable, structured ways. That’s not random; it’s a sign of deep model bias or domain misunderstanding.
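
One practical way to compute these signatures is with a persistence library such as ripser (an assumption on our part; the pipeline ships its own topology code). A minimal sketch, treating each sample's per-dimension Δ values as a point in a point cloud:

import numpy as np
from ripser import ripser  # pip install ripser

# delta_field: (n_samples, n_dimensions) matrix of HRM − Tiny differences
rng = np.random.default_rng(0)
delta_field = rng.normal(scale=0.2, size=(500, 5))   # stand-in data for illustration

diagrams = ripser(delta_field, maxdim=1)["dgms"]     # persistence diagrams for H0 and H1

def betti_at(diagram, t):
    """Count features alive at filtration value t (birth <= t < death)."""
    return int(np.sum((diagram[:, 0] <= t) & (diagram[:, 1] > t)))

t = 0.1
print("Betti-0 (connected components):", betti_at(diagrams[0], t))
print("Betti-1 (loops):               ", betti_at(diagrams[1], t))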


🖼️ Visual Policy Maps: Seeing What Numbers Can’t Show

Raw numbers tell part of the story, but seeing model behavior is transformative.

The AnalysisProcessor generates high-resolution Visual Policy Maps (VPMs), which are spatial renderings of the Δ-field.

VPM Example

  • 🔳 Left: HRM’s reasoning distribution
  • 🔳 Right: Tiny’s interpretation
  • 🔳 Center: Δ-field, the epistemic gap

These maps are:

  • PHOS-packed: Signal-rich tiles concentrated top-left for interpretability
  • Dimension-sorted: Consistent spatial layout across runs
  • Overlayable: Used in dashboards and retraining loops

We also project Δ-fields using UMAP to produce 2D reasoning landscapes. Here, clusters of disagreement form visual islands and tell us where and why reasoning diverges.

“When Tiny’s map shows a single bright ridge while HRM’s lights up all five dimensions, that’s not just a visualization. That’s a diagnostic.”


📊 Intensity Reporting: Quantifying the Gap

The final output is structured metrics extracted from Δ-fields and stored via GapRunManifest.

  • Δ-mass (mean absolute delta): how big is the gap?
  • Cosine Overlap (angle between score vectors): do the models reason in the same direction?
  • Agreement Rate (% of same-sign scores): where do they align?
  • Uncertainty Gap (confidence mismatch): which model trusts itself more?
  • Sensitivity Index (Δ per unit perturbation): how fragile is the score?

We also break this down per-dimension, so you can say:

  • “Tiny is aligned with HRM on style, but divergent on knowledge.”
  • “Faithfulness is a recurring weak spot; time to prioritize it.”
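
To make these metrics concrete, here is a minimal sketch of how the first three intensity metrics listed above can be computed from two SCM-aligned score matrices (illustrative only; helper names are ours, not the pipeline's):

import numpy as np

def intensity_report(hrm: np.ndarray, tiny: np.ndarray) -> dict:
    """hrm, tiny: (n_samples, n_dimensions) SCM-aligned scores in [0, 1]."""
    delta = hrm - tiny
    cos = float(np.dot(hrm.ravel(), tiny.ravel()) /
                (np.linalg.norm(hrm) * np.linalg.norm(tiny) + 1e-12))
    return {
        "delta_mass": float(np.mean(np.abs(delta))),          # how big is the gap?
        "cosine_overlap": cos,                                 # same direction of reasoning?
        # same-sign agreement around the 0.5 midpoint (one possible definition)
        "agreement_rate": float(np.mean(np.sign(hrm - 0.5) == np.sign(tiny - 0.5))),
    }

rng = np.random.default_rng(0)
hrm, tiny = rng.uniform(size=(1000, 5)), rng.uniform(size=(1000, 5))
print(intensity_report(hrm, tiny))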

💡 From Gap to Action: Why This Processor Matters

The AnalysisProcessor turns raw score deltas into a story of divergence. That story is essential for:

  • Training Loops: Select samples with high disagreement for refinement
  • Routing Policies: Escalate when Tiny disagrees with HRM in high-risk areas
  • Model Design: Focus distillation on patterns where Tiny underperforms
  • Trust Interfaces: Let humans inspect disagreement before acting

“If the ScoringProcessor is the experiment, the AnalysisProcessor is the microscope. It doesn’t just detect model differences; it reveals the structure behind them.”

This is where Stephanie transforms from an evaluation engine into a reasoning system: one that doesn’t just measure gaps, but learns from them.


🧪 CalibrationProcessor: Turning Disagreement into Actionable Intelligence

👉 Full Code Here

“When two models disagree, we don’t just observe the gap; we learn how to bridge it.”

Imagine you’re a doctor comparing two diagnostic tools. One is a high‑precision MRI (HRM); the other, a portable ultrasound (Tiny). Both detect patterns in the body but with very different fidelity. The MRI is slow and costly yet precise; the ultrasound is fast and lightweight but easily confused in complex cases.

The CalibrationProcessor is the system that teaches the ultrasound when to trust itself and when to defer to the MRI. It doesn’t replace the MRI; it makes the ultrasound aware of its limits. It learns when Tiny is reliable, when it’s uncertain, and when the situation demands HRM’s judgment.


🧠 The Science Behind Calibration

Calibration isn’t about making Tiny “better”; it’s about quantifying the relationship between Tiny and HRM. When Tiny says, “I’m 80% confident,” what does that actually mean compared to HRM’s ground truth?

The CalibrationProcessor answers that through a mix of monotone curve fitting, threshold simulation, and provenance‑driven normalization. It transforms raw disagreement into actionable routing intelligence.


⚙️ What Happens During Calibration

  1. Load the aligned SCM data

    • Pull paired HRM/Tiny results from the manifest.
    • Each score already lives in a unified metric space (scm.*), guaranteeing a fair comparison.
  2. Compute per‑dimension calibration curves. For every reasoning dimension (reasoning, knowledge, clarity, faithfulness, coverage):

    • Compare Tiny’s predictions to HRM’s.
    • Fit a monotone piecewise‑linear (PL) calibration curve.
    • Measure pre‑ and post‑calibration error (e.g., MAE, RMSE).
  3. Simulate routing policies. Using calibrated scores, run “what‑if” thresholds: when should Tiny handle the task alone? When should it escalate?

# Core calibration logic (simplified)
import numpy as np

def _monotone_pl_calibration(tiny_scores, hrm_scores, n_knots=21):
    # Fit a monotone piecewise-linear curve (knot arrays x_knots / y_knots)
    # mapping Tiny's scores to HRM's expectations; fitting details elided here.
    return calib_curve

def _apply_monotone_pl(tiny_scores, calib_curve):
    return np.interp(tiny_scores, calib_curve.x_knots, calib_curve.y_knots)

# _mae(a, b) = mean absolute error between two score arrays
mae_pre = _mae(tiny_scores, hrm_scores)
tiny_cal = _apply_monotone_pl(tiny_scores, calib_curve)
mae_post = _mae(tiny_cal, hrm_scores)

This is the quiet math behind trust. By aligning Tiny’s internal “sense of certainty” with HRM’s ground truth, we give Tiny calibrated intuition: a measured confidence that matches reality.
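
For readers who want something runnable, here is one way such a monotone piecewise-linear fit could be implemented: isotonic regression sampled at evenly spaced knots. This is a hedged sketch, not the repository's actual _monotone_pl_calibration:

import numpy as np
from dataclasses import dataclass
from sklearn.isotonic import IsotonicRegression

@dataclass
class CalibCurve:
    x_knots: np.ndarray
    y_knots: np.ndarray

def fit_monotone_pl(tiny_scores, hrm_scores, n_knots=21) -> CalibCurve:
    """Monotone map Tiny → HRM: isotonic fit, then sampled at evenly spaced knots."""
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(np.asarray(tiny_scores, float), np.asarray(hrm_scores, float))
    x_knots = np.linspace(0.0, 1.0, n_knots)
    return CalibCurve(x_knots=x_knots, y_knots=iso.predict(x_knots))

# Usage with toy data: Tiny systematically underestimates by ~0.1
rng = np.random.default_rng(0)
hrm = rng.uniform(0.2, 0.9, 500)
tiny = np.clip(hrm - 0.1 + rng.normal(0, 0.05, 500), 0, 1)
curve = fit_monotone_pl(tiny, hrm)
tiny_cal = np.interp(tiny, curve.x_knots, curve.y_knots)
print("MAE pre :", np.mean(np.abs(tiny - hrm)))
print("MAE post:", np.mean(np.abs(tiny_cal - hrm)))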


📦 Provenance: The Ledger of Truth

Every calibration run emits a provenance record: a full audit trail for every item analyzed.

This includes:

  • Source document ID and hash
  • HRM and Tiny raw scores
  • SCM‑aligned values and deltas
  • Calibration curve used
  • Post‑calibration residuals
  • Routing decision (use Tiny / escalate to HRM)

These records are persisted via the GapRunManifest, ensuring that every score, every correction, every decision can be traced back to its origin.

# inside CalibrationProcessor
self.manifest.store_provenance(records)

This is the accountability layer of the system: calibration is not a black box; it’s a verifiable ledger of how understanding evolved.


📊 The Results: From Theory to Operation

Calibration yields three key artifacts that operationalize this bridge between models.

  1. calibration_params.json
{
  "per_dimension": {
    "reasoning": { "mae_pre": 0.241, "mae_post": 0.163 },
    "knowledge": { "mae_pre": 0.215, "mae_post": 0.142 }
  }
}

This shows how much calibration tightens Tiny’s scores toward HRM’s expectations (MAE on reasoning drops from 0.241 to 0.163). If, say, a reasoning score of 0.30 maps to 0.40 after calibration, that means Tiny consistently underestimates reasoning quality in that range.


  2. routing_summary.json
{
  "usage_rate": 0.28,
  "avg_mae_vs_hrm": 0.104,
  "thresholds": { "uncertainty": 0.6, "ood": 0.7 }
}

This is the policy distilled from calibration:

Tiny can handle 72% of tasks autonomously while maintaining 90% of HRM’s accuracy. When uncertainty rises above 0.6 or out‑of‑distribution signals hit 0.7, the system automatically escalates to HRM. (A minimal sketch of this escalation rule follows the artifact list below.)


  3. routing_detail.json
{
  "per_dimension": [
    {
      "dimension": "reasoning",
      "mae_pre": 0.241,
      "mae_post": 0.163,
      "improvement": 32.3
    },
    {
      "dimension": "knowledge",
      "mae_pre": 0.215,
      "mae_post": 0.142,
      "improvement": 34.0
    }
  ]
}

Every dimension tells a story of refinement: calibration isn’t global; it’s context‑aware. Faithfulness improves differently than reasoning. Knowledge stabilizes faster than clarity. These subtle gradients form the operational DNA of adaptive AI.
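
Putting the thresholds from routing_summary.json into action might look like the following minimal sketch (field names follow the JSON above; the escalation logic itself is an illustrative assumption):

def route(metrics: dict, thresholds: dict) -> str:
    """Decide whether Tiny's calibrated answer is enough or HRM should take over."""
    if metrics.get("uncertainty01", 0.0) > thresholds["uncertainty"]:
        return "escalate_to_hrm"           # Tiny is unsure of itself
    if metrics.get("ood_hat01", 0.0) > thresholds["ood"]:
        return "escalate_to_hrm"           # input looks out-of-distribution
    return "use_tiny"                       # confident, in-distribution: stay fast and cheap

thresholds = {"uncertainty": 0.6, "ood": 0.7}                          # from routing_summary.json
print(route({"uncertainty01": 0.35, "ood_hat01": 0.10}, thresholds))   # use_tiny
print(route({"uncertainty01": 0.72, "ood_hat01": 0.10}, thresholds))   # escalate_to_hrm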


💡 Implications for Real AI Systems

Calibration is the moment where epistemology becomes engineering:

  • Efficiency → Use HRM only when needed; save compute everywhere else.
  • Transparency → Every routing decision has a traceable rationale.
  • Trust → Confidence is no longer guessed; it’s calibrated.
  • Adaptability → Curves evolve as new data flows in; no full retraining required.

“The gap isn’t noise; it’s structured information. Calibration is how we make that structure usable.”


🌐 The Complete Loop

At this point, the cycle closes:

  1. ScoringProcessor measures both minds.
  2. AnalysisProcessor reveals where they diverge.
  3. CalibrationProcessor learns how to navigate that divergence.
  4. Provenance Layer preserves the memory of how we learned.

Together they form an AI that doesn’t just think; it reflects. An AI that knows when to pause, when to ask for help, and when to trust its own reasoning.

“We’ve built a self‑aware pipeline: Tiny knows when it’s uncertain and gracefully hands off to HRM. The result? 90% of HRM’s accuracy at 20% of the cost and 100% of the insight.”

    flowchart LR

    classDef default fill:#f8f9fa,stroke:#495057,stroke-width:2px,color:#212529
    classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px,color:#0d47a1
    classDef process fill:#fff3e0,stroke:#f57c00,stroke-width:2px,color:#e65100
    classDef output fill:#e8f5e8,stroke:#388e3c,stroke-width:2px,color:#1b5e20
    classDef diagnostic fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#4a148c

    A1[🧠 ScoringProcessor<br/>Run Tiny & HRM on same tasks<br/>→ generate SCM-aligned scores] --> A2
    A2[📊 AnalysisProcessor<br/>Compute score gaps,<br/>disagreement maps,<br/>uncertainty & OOD metrics] --> A3
    A3[🧪 CalibrationProcessor<br/>Learn monotone mappings<br/>from Tiny → HRM<br/>Simulate routing policies] --> A4
    A4[🗂️ Provenance & Routing Summary<br/>Store calibration params,<br/>thresholds, and usage rules<br/>in manifest] --> A5
    A5[🤖 Runtime Policy<br/>Use Tiny when confident<br/>Escalate to HRM otherwise]

    class A1 input
    class A2,A3 process
    class A4 diagnostic
    class A5 output
  

🌌 The Mirror Machine: Measure, Tune, Use (for Any Two Minds)

We may not fully understand the gap between models yet but we can measure it. And once you can measure something, you can tune it. Once you can tune it, you can use it.

What we built isn’t a HRM-vs-Tiny trick. It’s a mirror machine: a model-agnostic instrument that takes the outputs of any two intelligences, translates them into a shared language (SCM), computes the Δ-field (Δ = A − B), and exposes the shape of their disagreement (topology, Betti curves, fronts, uncertainty). That shape is not just noise; it’s actionable structure.

We can’t see the thing itself. But we can see its fingerprints:

  • Δ-mass, agreement/overlap, uncertainty gaps
  • persistent loops (β₁), fragmented agreement (β₀)
  • where OOD, faithfulness, or reasoning fracture first

And because every item is provenance-backed (scores, deltas, curves, routing decisions logged in the manifest), the mirror is not a metaphor; it’s an instrument with a ledger.


🔓 What this unlocks (for any models, any data)

Universal. If two systems can emit (or be mapped to) SCM, they can be mirrored: HRM↔Tiny, Gemma↔Smol, Llama↔Mistral, v2↔v3, even rule-engine↔LM. The mechanism is the same.

Measurable. Δ = A − B gives the field; topology gives the structure; calibration gives the dial. No mysticism, just repeatable numbers with provenance.

Tunable (four modes)

  • Minimize Δ → Distill & align. Make Tiny imitate HRM where it matters; compress cost without losing judgment.
  • Maximize Δ → Discover & diagnose. Surface blind spots, dataset holes, bias seams, novel capability.
  • Route by Δ → Operate safely. Escalate when Δ (or uncertainty/OOD) crosses thresholds; keep speed elsewhere.
  • Monitor dΔ/dt → Guardrail drift. Track how disagreement evolves across time, domains, or releases.

Usable, now. The pipeline already emits what operations need: calibration_params.json, routing_summary.json, Δ-maps, Betti stats, timelines, and provenance for audit and retraining.


🏁 The closer

This work shifts the question from “Which model is better?” to “What lives in the space between minds and how do we steer with it?” We don’t claim full interpretability. We claim instrument-grade measurement: enough to route, distill, debug, monitor, and learn.

So treat the gap as a control surface:

Given models A, B:
1) Map outputs → SCM
2) Δ = A − B
3) Topology T(Δ), metrics M(Δ)
4) Policy P = f(T, M): {minimize, maximize, route, monitor}

That’s the mirror machine. Point it at any pair, any corpus. If you want lock-step alignment, turn the knob down. If you want new insight, turn it up. Either way, the gap stops being a mystery and starts being a dial.


🌌 From Scores to Steering: How We Built a Mirror Machine for AI Reasoning

We didn’t just compare models; we built a mirror that shows what they don’t see, and gave ourselves a steering wheel to use that difference.

Think of two excellent maps of the same mountain range. One shows peaks and valleys; the other, rivers and roads. Both are right, but they tell different stories. Our system creates a third map: a precise overlay that reveals where those stories diverge and why that divergence matters.

What follows is the end-to-end summary of what we actually built, and the conclusion it leads to.


🧾 Summary What We Actually Built

1️⃣ A compact reasoner you can ship

We designed, trained, and instrumented the Tiny Recursive Model (Tiny+) with a dedicated trainer and scorer. Tiny+ runs on embeddings, emits multi-head diagnostics (score, uncertainty, OOD, consistency, sensitivity), and stays numerically sane (log-var clamps, gradient clips, length norms, temperature calibration). It’s fast enough for inner loops and edge deployment, yet expressive enough to mirror HRM signals.

2️⃣ A disciplined way to score reasoning

We created dimension-specific prompts with a strict two-line contract (rationale + 0–100), normalized to score01 ∈ [0,1] across five facets: reasoning, knowledge, clarity, faithfulness, coverage. The result is a shared language of reasoning, not a vague single score.

3️⃣ An agent + orchestrator that run the whole play

GapAgent is the clean entry point; the GapAnalysisOrchestrator coordinates data prep → dual-pass scoring → analysis → calibration → reporting, with progress, error handling, and a run manifest for reproducibility.

4️⃣ A dual-pass ScoringProcessor for fairness and determinism

We score HRM first, flush GPU state, then score Tiny. No VRAM thrash, no cross-model contamination, deterministic ordering. Both outputs are aligned into the same schema.

5️⃣ A plugin system for post-scoring enrichment

Scorers support plugins that run after core scoring. The headline plugin is SCMService, which transforms model-specific stats into Shared Core Metrics (SCM) so heterogeneous models speak the same measurement language.

6️⃣ A Hugging Face scorer that only needs logits

HuggingFaceScorer (Windows-friendly, eager attention) computes teacher-forced logprobs / entropy / perplexity from any HF CausalLM. The SCM plugin then derives calibrated, model-agnostic metrics from those base stats. This matters when you don’t control internals: logits are enough to mirror one model against another.

7️⃣ The GAP analysis itself

With both models in SCM, we compute the Δ-field (Δ = HRM − Tiny) per dimension, then apply topology (Betti-0/Betti-1) to expose structure (clusters, loops) in disagreement. We generate Visual Policy Maps (VPMs), frontier maps, timelines, and Δ overlays: artifacts that make the geometry of disagreement visible.

8️⃣ A mathematical & operational layer on top

We quantify Δ-mass, cosine overlap, agreement rate, uncertainty gap, and sensitivity indices, turning “that looks interesting” into numbers you can track, compare, and optimize.

9️⃣ Calibration → routing → policy

The CalibrationProcessor fits per-dimension monotone curves (Tiny→HRM), measures pre/post error, and simulates thresholds (uncertainty / OOD / Δ) to produce routing policies. In practice: Tiny handles confident in-distribution cases; threshold hits escalate to HRM.

🔟 Provenance everywhere

Every item, score, delta, curve, image, and routing decision is recorded in the run manifest (ids, hashes, seeds, configs, checksums). You can replay, audit, and learn from any step; no black boxes.

1️⃣1️⃣ Generalization beyond HRM↔Tiny

We repeated the process on Hugging Face models (two smaller CausalLMs) and saw the same structured Δ behavior. Because the pipeline is SCM-based and logits-driven, it’s model-agnostic: foundation↔foundation, version↔version, custom↔base; anything that can emit logits/scores can be mirrored.

1️⃣2️⃣ A reusable component architecture: the GAP component

This isn’t a one-off experiment. It’s a modular instrument: agent + orchestrator + dual-pass scoring + SCM plugins + analysis + calibration + provenance.


✅ Net-New Contributions (at a glance)

  • SCM: a shared metric protocol across heterogeneous scorers and models
  • HF scorer + SCM plugin: mirror HF models using only logits/entropy/ppl
  • Dual-pass scoring: single-GPU fairness and determinism
  • Δ-field + Betti analysis: measure the shape of disagreement, not just magnitude
  • VPMs + timelines: human-legible diagnostics at a glance
  • Monotone calibration + routing simulation: turn gaps into operating policy
  • Full manifest provenance: audit, replay, and continuous improvement
  • GAP component architecture: drop-in comparison for any two models or versions

🧭 Conclusion From Scores to Steering

We set out to compare a heavyweight HRM with a compact Tiny and ended up with a mirror machine: a way to make any two minds speak the same language, subtract them, and see the geometry of their disagreement.

We don’t claim to fully explain that geometry yet, and we don’t need to. Like early engineers with electricity, we built an instrument that can measure it reliably, repeatably, and with provenance. And once you can measure, you can tune:

  • Minimize Δ → Distill & align: make Tiny behave like HRM where it matters.
  • Maximize Δ → Discover & diagnose: surface blind spots, bias seams, and novel capability.
  • Route by Δ / uncertainty / OOD → Operate safely at low cost.
  • Monitor dΔ/dt → Catch drift before it becomes failure.

This reframes the question from “Which model is better?” to “What lives in the space between them and how do we steer with it?” HRM and Tiny were our first pair. Your pair can be anything: Llama↔Mistral, v2↔v3, house model↔HF, rule engine↔LM.

Takeaway: the gap isn’t noise. It’s an actionable field. We’ve shown how to extract it, visualize it, quantify it, calibrate it, route on it, and preserve it with provenance. That’s enough to build gap-aware systems today: cheaper, safer, smarter.

Try this next:

  1. Pick any two models you care about.
  2. Map both to SCM (scores + uncertainty + OOD + consistency).
  3. Compute Δ, inspect VPMs, and check Betti-1.
  4. Fit calibration and simulate routing.
  5. Record provenance; iterate where Δ burns hottest.

When two minds disagree, that’s your signal. With a mirror, it becomes your steering wheel.


🚦 What’s Next: Real-Time Hallucination Badges & Visual AI Training

We’re taking the mirror machine live. Each reply gets a visual policy badge that encodes confidence, faithfulness, OOD risk, and disagreement—at a glance. A lightweight monitor AI (Tiny-class) will score the reply in real time and flag hallucination risk. Then we’ll use those signals as training targets.

🎯 Goals (next post)

  • Detect hallucinations in real time during a model’s reply.
  • Render a 256×256 visual badge (a mini VPM) that communicates: confidence, faithfulness risk, OOD risk, Δ-gap vs a monitor model.
  • Route/escalate based on risk (or ask the model to self-correct).
  • Log provenance (every score, threshold, and badge) for learning.
  • Turn risk into training signal: use Δ-hotspots + faithfulness gaps to improve small models without losing speed.

🧩 System sketch

  • Chat Model (any) streams or finalizes a reply.
  • Monitor AI (Tiny) runs teacher-forced scoring on the same (goal ⊕ reply).
  • SCM alignment produces normalized metrics (score01, uncertainty01, ood_hat01, Δ vs Tiny/HRM).
  • Risk aggregator computes Hallucination@k (e.g., {OK, Watch, Risk}).
  • Badge renderer turns metrics into an interpretable 256×256 image.
  • Policy: show badge; optionally auto-escalate or trigger self-check.

    sequenceDiagram
  participant U as User
  participant M as Chat Model
  participant T as Tiny Monitor
  participant S as SCM Aligner
  participant R as Risk Aggregator
  participant B as Badge Renderer
  participant P as Provenance/Manifest

  U->>M: Ask question
  M-->>U: Draft/Final answer
  M->>T: (goal ⊕ answer) for scoring
  T->>S: LL stats → SCM rows
  S->>R: metrics {score01, uncertainty01, ood_hat01, Δ}
  R->>B: Hallucination@k + visual spec
  B-->>U: 256×256 badge overlay
  R->>P: Log metrics, thresholds, decision, assets
  

🖼️ The Badge (read at a glance)

Canvas: 256×256 (PNG/SVG)

  • Quadrants

    • TL = Confidence (uncertainty01 → cool/warm)
    • TR = Faithfulness risk (hallucination likelihood)
    • BL = OOD risk (ood_hat01)
    • BR = Δ-gap (disagreement vs Tiny/HRM)
  • Outer ring = evidence / “halt” mass (thicker = more evidence)

  • Center glyph = final state (OK / Watch / Risk)

  • Mini sparkline (bottom) = token-entropy trend (optional)

Color legend

  • Green→Amber→Red scales for risk quadrants
  • Neutral grey when metric is N/A
  • High Δ shows as saturated BR quadrant
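
As a starting point, a deliberately minimal badge renderer could map the four quadrant metrics to green-to-red fills with Pillow; the outer ring, center glyph, and sparkline described above would layer on top of this sketch:

from PIL import Image, ImageDraw  # assumption: Pillow is available

def risk_color(v: float) -> tuple:
    """Map 0 (good, green) → 1 (bad, red), with amber in between."""
    v = max(0.0, min(1.0, v))
    return (int(255 * v), int(200 * (1.0 - v)), 60)

def render_badge(metrics: dict, size: int = 256) -> Image.Image:
    """Quadrants: TL confidence, TR faithfulness risk, BL OOD risk, BR Δ-gap."""
    img = Image.new("RGB", (size, size), "white")
    d = ImageDraw.Draw(img)
    h = size // 2
    d.rectangle([0, 0, h, h], fill=risk_color(1.0 - metrics["confidence01"]))       # TL
    d.rectangle([h, 0, size, h], fill=risk_color(metrics["faithfulness_risk01"]))   # TR
    d.rectangle([0, h, h, size], fill=risk_color(metrics["ood_hat01"]))             # BL
    d.rectangle([h, h, size, size], fill=risk_color(metrics["delta_gap01"]))        # BR
    return img

badge = render_badge({"confidence01": 0.81, "faithfulness_risk01": 0.22,
                      "ood_hat01": 0.10, "delta_gap01": 0.17})
badge.save("badge.png")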

🔢 Minimal runtime JSON (badge spec)

{
  "run_id": "2025-10-23T12:09:00Z/abcd",
  "model_alias": "chat-hrm",
  "monitor_alias": "tiny-monitor",
  "metrics": {
    "confidence01": 0.81,
    "faithfulness_risk01": 0.22,
    "ood_hat01": 0.10,
    "delta_gap01": 0.17
  },
  "decision": "OK",  // OK | WATCH | RISK
  "thresholds": { "faithfulness": 0.35, "uncertainty": 0.40, "ood": 0.30, "delta": 0.30 },
  "badge_svg": "data:image/svg+xml;base64,...",
  "assets": { "vpm_tile": "vpm_...png" }
}
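
A minimal sketch of the risk aggregator that consumes this spec: compare each metric against its threshold and emit OK / WATCH / RISK (the real aggregator will use calibrated, hysteresis-aware thresholds; the counting rule here is our assumption):

def aggregate_risk(metrics: dict, thresholds: dict) -> str:
    """Turn badge metrics into a single OK / WATCH / RISK decision."""
    checks = {
        "faithfulness": metrics["faithfulness_risk01"] > thresholds["faithfulness"],
        "uncertainty":  (1.0 - metrics["confidence01"]) > thresholds["uncertainty"],
        "ood":          metrics["ood_hat01"] > thresholds["ood"],
        "delta":        metrics["delta_gap01"] > thresholds["delta"],
    }
    n_hits = sum(checks.values())
    if n_hits == 0:
        return "OK"
    return "WATCH" if n_hits == 1 else "RISK"

metrics = {"confidence01": 0.81, "faithfulness_risk01": 0.22,
           "ood_hat01": 0.10, "delta_gap01": 0.17}
thresholds = {"faithfulness": 0.35, "uncertainty": 0.40, "ood": 0.30, "delta": 0.30}
print(aggregate_risk(metrics, thresholds))   # -> "OK"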

⚙️ MVP scope (build order)

  1. Streaming hook: capture (goal ⊕ reply) on finalize (or every N tokens).
  2. Tiny Monitor: teacher-forced LL stats → SCM.
  3. Risk aggregator: monotone-calibrated thresholds per dimension; output OK/WATCH/RISK.
  4. Badge renderer: small, stateless function → 256×256 PNG/SVG.
  5. Provenance logging: persist metrics, thresholds, decisions, badge, VPM snippet.
  6. UI integration: overlay badge; click-through → detail panel (metrics + VPM).

🧪 Training with Hallucination as Signal

  • Collect: store (goal, reply, retrieval context if any), metrics, decision, Δ-hotspots.
  • Label: weak labels from risk aggregator + human confirm on a slice.
  • Distill: train Tiny on Δ-hotspots (where faithfulness risk is high) with contrastive or margin losses; keep easy regions unchanged.
  • Close the loop: compare Hallucination@k and task MAE pre/post; track Δ-mass shrinkage in risky zones.

📏 Success metrics

  • Hallucination@k: precision/recall on a curated eval set.
  • User corrections: drop in correction rate when badge is visible.
  • Routing impact: % escalations vs quality retained.
  • Δ-mass in risky regions: trending down with training.

⚠️ Risk & guardrails

  • False positives: use monotone calibration + hysteresis (avoid flicker).
  • Latency: run Tiny teacher-forced only on finalized replies or chunked at low cadence.
  • Context leakage: keep retrieval / ground-truth separate from scoring context to avoid optimistic bias.
  • Accessibility: provide text alt (e.g., “OK: Conf 0.81 · Faith 0.78 · OOD 0.90 · Δ 0.17”).

Bottom line: next we’ll show what the model thinks about its own answer—live—then use those signals to make small models smarter where it matters. Hallucination isn’t just a problem; it’s a lever.

