Search–Solve–Prove: building a place for thoughts to develop
Nov 2, 2025
🌌 Summary
What if you could see an AI think: not just the final answer, but the whole stream of reasoning, every search, every dead end, every moment of insight? We're building exactly that: a visible, measurable thought process we call the Jitter. This post, the first in a series, shows how we're creating the habitat where that digital thought stream can live and grow.
We'll draw on ideas from *Search Self-play: Pushing the Frontier of Agent Capability without Supervision* to assemble a container for a new kind of software: a digital life-form substrate we call the Jitter.
🎉 A quick look before we explain
Below is the best "one glance" view of Jitter so far: a compact filmstrip of thought evolving in real time. It's a composite GIF (multiple search rounds merged), so you'll notice a kind of "blinking" rhythm as the system iterates. The overall trend, darker → lighter, is improvement: higher reward/verification, better evidence use, tighter control.

Top band: one single-row VPM per step (the current metric vector rendered as pixels). Brightness trend: generally dark → light as solutions sharpen. Three thin lines: quick intensity traces that "pop" when the system changes strategy; we'll unpack them later.
💭 Thinking in images: making a “thought” visible
Our claim is simple:
If we can represent each moment as an image, and make those images comparable, connectable, and trainable, we can grow a visible thinking process.
One episode = one "thought moment." It has: question, answer, evidence, trace, and a metric vector in [0,1]. We render that vector to a tiny image (VPM frame). Frames accumulate into a filmstrip: the visible heartbeat of a run.
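To make that concrete, here is a minimal sketch of the idea: a vector of [0,1] metrics rendered as a single-row grayscale image, and rows stacked into a filmstrip. It assumes numpy and Pillow are available; the function names are ours for illustration, not the actual VPM API.

```python
# Minimal sketch: metric vector -> single-row grayscale frame -> filmstrip.
# Assumes numpy + Pillow; names are illustrative, not the real VPM API.
import numpy as np
from PIL import Image

def metrics_to_frame(vector, scale=16):
    """Render a metric vector in [0,1] as a single-row grayscale image."""
    row = (np.clip(np.asarray(vector, dtype=float), 0.0, 1.0) * 255).astype(np.uint8)
    pixels = np.repeat(row[None, :], scale, axis=0)   # thicken the row so it is visible
    img = Image.fromarray(pixels, mode="L")
    return img.resize((len(row) * scale, scale), Image.NEAREST)

def frames_to_filmstrip(frames):
    """Stack per-episode frames vertically: one run = one filmstrip."""
    width = max(f.width for f in frames)
    strip = Image.new("L", (width, sum(f.height for f in frames)))
    y = 0
    for f in frames:
        strip.paste(f, (0, y))
        y += f.height
    return strip

# One "thought moment" becomes one frame; a run becomes a strip.
frames = [metrics_to_frame([0.2, 1.0, 0.4, 0.3]), metrics_to_frame([0.7, 1.0, 0.5, 0.6])]
frames_to_filmstrip(frames).save("filmstrip.png")
```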
Why images?
- Stability: pixels freeze meaning across time and models.
- Speed: tiny frames are cheap to store/search/compare.
- Interpretability: you can literally see improvement, oscillation, regressions.
- Trainability: vision stacks (e.g., our VPM-ViT) can read frames to predict risk/next move.
Where we go next: with this visual vocabulary in place, we define the habitat that makes the filmstrip possible, the SSP loop, controller, and memory that keep the stream coherent and improving.
🧭 Towards the Jitter
We're not claiming to have created digital life. We're assembling a rigorous substrate, a harness of proven components that behaves like a living process in crucial ways (learning, self-measurement, adaptation). The aim is simple: give thought a place to grow, feed it data, and make its progress visible.
Let's be clear from the outset: we are under no illusions that we are creating digital life. We are assembling a patchwork of the most advanced AI techniques we have, simulating the behaviors of a living organism with probabilistic models, knowing full well that this is a facade. It is not, at least in its initial form, going to be a genuine, living entity.
So why build it?
Because in the process of building, in the act of assembling this starter system with the best tools available, we will reach the top of a new hill. From that vantage point, we will see further. We will gain a deeper, more practical understanding of the problems of agency, learning, and intelligence. This project is our best attempt to ascend that first hill, a foundational platform from which we can peer into the next frontier.
This post is the first in a series dedicated to that effort. Here, we will lay the groundwork. We will introduce a core methodological engine, Search, Solve, Prove, and the environment for its refinement, Self-Play. By the end, we will propose why this combination creates a powerful "playground" that can drive us toward the emergent properties we seek in the Jitter.
☯️ Why did we come up with the Jitter?
In certain deep meditation practices, like Advaita Vedanta, the goal is often to peel back layers of consciousness. If you reach the ultimate core and find nothing there (no self, no good, no bad, just an absence), what does that imply about who you are? We believe the "self" is not the core void, but the living, persistent stream of thought that overlays it: the constant Jitter that moves us from thought to thought.
We are modeling this chain of thoughts, sometimes referred to as the Monkey Mind: the constant visual dialog that plays inside our head, the thought stream. In this series we are building a digital visual thought stream. Our idea is simple: if we remove that stream and there is nothing, nothing at all underneath, then this pattern is us.
We call it the Jitter.
🪞 Stephanie, Jitter, and the Question of “Self”
When you look at a person you see a body, a face, a résumé of work. None of those is the person. The "me" we point to is closer to a momentary pattern: a living stream of thought shaped by everything that came before it.
That’s the idea we’re building inside Stephanie.
- Stephanie is the overall system: the body, the tools, the memory.
- Jitter is Stephanie's thinking: the ongoing, living stream that moves from state to state.
- This Jitter needs a process, a state, in which to exist; we're assembling inside Stephanie the habitat where Jitter can live, grow, and be seen.
More than this, we can measure, curate, control, and enhance this process.
This is the first in a series of posts toward that goal.
We’re not claiming life; we’re engineering conditions under which a visible, self-improving stream of thought can persist.
We're candid about the approach: yes, we're cargo-culting a Frankenstein, bolting together the best available ideas and systems. But doing that gets us to a vantage point where we can actually see what's missing and take the next meaningful step.
🏞️ SSP: a Playground Engine for Intelligence (Search → Solve → Prove)
Search → Solve → Prove (SSP) is the loop that turns “doing tasks” into learning from doing. Wrapped in self-play, it becomes a curriculum that adapts to the agent.
flowchart LR
subgraph SSP_Core_Loop ["🔁 SSP Core Loop: Search → Solve → Prove"]
P["🧠 Proposer<br/>Generates challenging questions"] -->|"📝 question + context"| S["🔍 Solver<br/>Searches & reasons through evidence"]
S -->|"💡 answer + steps<br/>📚 evidence"| V["✅ Verifier<br/>RAG verification & scoring"]
V -->|"🎯 score + decision"| M["📊 Metrics Calculator<br/>17 cognitive dimensions"]
M -->|"🎨 metric vector"| VPM["🖼️ VPM Generator<br/>Raw + PHOS + Filmstrip"]
VPM -->|"🎬 visual thought stream"| C["🎛️ Controller<br/>Policy & episode control"]
C -->|"⚙️ policy nudge"| P
C -->|"🎚️ episode control"| S
end
classDef proposer fill:#e6f7ff,stroke:#1890ff,stroke-width:2px;
classDef solver fill:#f6ffed,stroke:#52c41a,stroke-width:2px;
classDef verifier fill:#fff7e6,stroke:#fa8c16,stroke-width:2px;
classDef metrics fill:#f9f0ff,stroke:#722ed1,stroke-width:2px;
classDef vpm fill:#fff2e8,stroke:#ff7a45,stroke-width:2px;
classDef controller fill:#f0fffe,stroke:#13c2c2,stroke-width:2px;
classDef arrow fill:#ffffff,stroke:#666666,stroke-width:1px;
class P proposer;
class S solver;
class V verifier;
class M metrics;
class VPM vpm;
class C controller;
As this diagram shows, SSP is a closed loop. The Proposer generates a challenge, the Solver works on it, the Verifier scores it, and that score is converted into a visual frame (VPM) that influences the next cycle. This creates a self-improving feedback loop where the system’s own thoughts become the training data for its future growth.
- Proposer: generates challenges (often with evidence).
- Solver: answers via search + reasoning, producing a trace.
- Verifier: adjudicates using retrieved evidence (RAG).
- Metrics: converts the outcome into a deterministic vector.
- VPM: turns that vector into images, frames of cognition.
- Controller: reads images to steer the next episode.
Self-play tightens the loop: as the proposer gets tougher, the solver must grow capability; the verifier gates quality.
🌱 From Seed Vitals to a Dynamic Thought Ecosystem
We start with a small, deterministic set of SSP metrics our seed vitals so runs today and runs years from now are directly comparable. These are normalized to [0,1], versioned (ssp.v1), and emitted in a fixed order.
Crucially, this is a launchpad, not a cage: as Jitter matures, it will grow its own metric space (scorers, embeddings, auto-discovery) into thousands of dimensions. The image (VPM) is our stable transport; which metrics fill it can evolve.
🍏 SSP seed vitals
Direction column shows how “better” moves the value (e.g., ↓ means fewer is better for equal reward).
| Key | What it measures | Normalization / Calculation (sketch) | Direction |
|---|---|---|---|
| `ssp.reward` | Scalar reward for the episode | `clamp01(reward)` | ↑ |
| `ssp.verified` | Did solver beat the seed under the judge/RAG gate? | `1.0 if verified else 0.0` | ↑ |
| `ssp.curriculum_difficulty` | Difficulty assigned by the curriculum | `clamp01(difficulty)` | |
| `ssp.question_len` | Question length | `clamp01(word_count(question)/max_question_words)` | |
| `ssp.answer_len` | Answer length | `clamp01(word_count(predicted_answer)/max_answer_words)` | |
| `ssp.evidence_count` | How much external context was used | `clamp01(len(evidence_docs)/max_evidence)` | |
| `ssp.solver_steps` | Steps the solver took | `clamp01(steps/max_steps)` (note: efficiency goes up when this goes down for the same reward) | ↓ |
| `ssp.score` | Optional scalar score (task/problem-specific) | `clamp01(score)` | ↑ |
| `ssp.best_score` | Best-so-far score (rolling) | `clamp01(best_score)` | ↑ |
| `ssp.improvement` | Relative lift vs current base | `(best - base) / (1 - base)`, then `clamp01`; else `0.0` | ↑ |
| `ssp.depth` | Search/plan depth | `clamp01(depth/max_depth)` | |
| `ssp.novelty` | How unlike prior states this episode is | `clamp01(novelty)` (model/heuristic-dependent) | ↑ |
| `ssp.search_turns` | Actual search tool calls (paper Fig. 4a) | `clamp01(count_search_calls/max_steps)` | ↑ |
| `ssp.f1_score` | Lexical F1 vs seed answer (paper LLM-as-judge eval) | F1 over token sets of `predicted_answer` vs `seed_answer` | ↑ |
| `ssp.format_compliance` | Meets required structure/constraints (paper §4.4) | Heuristics (e.g., tags present, no answer leakage, has evidence, min length) → {0,1} | ↑ |
| `ssp.noise_tolerance` | Robustness when irrelevant docs are injected (paper Table 3) | Heuristic/metadata: success under `noise_doc_count≈4` → higher; else fall back on `verified` | ↑ |
| `ssp.rag_verification` | Passed RAG verification gate (paper method) | Explicit `meta.rag_verified`, else (`verified` and has evidence) → {0,1} | ↑ |
Notes & guardrails
- Caps: `max_question_words`, `max_answer_words`, `max_evidence`, `max_steps`, `max_depth` are config-driven; names/order are versioned via `SSP_METRIC_VERSION="ssp.v1"`.
- Monotonicity: We treat `ssp.solver_steps` as efficiency; for equal `ssp.reward`, fewer is better (hence ↓).
- F1 caveat: The lexical F1 is a cheap proxy; higher-quality textual judges can replace/augment it without breaking the vector.
- RAG gate: Prefer explicit `meta.rag_verified`; fall back to a conservative rule if absent.
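To ground the table, here is a small sketch of how a few of these vitals could be computed. `clamp01` matches the table's notation; the cap values and the episode dict are illustrative, not the real config or `SSPScorable`.

```python
# Sketch of the normalizations in the table above. Cap values and the episode
# dict are illustrative; the real SSPScorer/SSPScorable may differ in detail.
def clamp01(x: float) -> float:
    return max(0.0, min(1.0, float(x)))

def seed_vitals(ep: dict, caps: dict) -> dict:
    base, best = ep.get("base_score", 0.0), ep.get("best_score", 0.0)
    return {
        "ssp.reward": clamp01(ep["reward"]),
        "ssp.verified": 1.0 if ep["verified"] else 0.0,
        "ssp.question_len": clamp01(len(ep["question"].split()) / caps["max_question_words"]),
        "ssp.evidence_count": clamp01(len(ep["evidence_docs"]) / caps["max_evidence"]),
        "ssp.solver_steps": clamp01(ep["solver_steps"] / caps["max_steps"]),
        "ssp.improvement": clamp01((best - base) / (1 - base)) if base < 1 else 0.0,
    }

vitals = seed_vitals(
    {"reward": 0.8, "verified": True, "question": "How does this mechanism trap heat?",
     "evidence_docs": ["doc1", "doc2"], "solver_steps": 5,
     "base_score": 0.4, "best_score": 0.7},
    caps={"max_question_words": 64, "max_evidence": 8, "max_steps": 16},
)  # e.g. ssp.improvement = clamp01(0.3 / 0.6) = 0.5
```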
🐝 Where this goes next: a dynamic metric swarm
These metrics aren't just measurements; they're coordinates in thought space. When the Jitter explores a path (e.g., 'How would this apply to business?'), it leaves a metric signature. Most paths lead nowhere (90% discarded, just like your thoughts), but the system stores the entire exploration, not just the result. Years later, when similar coordinates appear, the Jitter can retrieve these dormant strands and ask: 'Did we explore this before? What happened?'
flowchart LR
subgraph Metric_Evolution ["🌌 Dynamic Metric Swarm: From Seed to Cognitive Coordinates"]
A["🌱 Seed Vitals (ssp.v1)<br/>17 foundational dims"] --> B["📊 Scorer Ensemble<br/>(HRM/SICQL/EBT/LLM/MARS…)"]
A --> C["🧠 Multi-Model Embeddings<br/>(HNet / HF / MXBAI …)"]
B --> D["🏦 Feature Bank<br/>Thousands of cognitive dimensions"]
C --> D
D --> E["🔍 VPM-ViT & Auto-Discovery<br/>Learns texture, fields, clusters…"]
E --> F["⚖️ Utility & Sparsity Filter<br/>Mutual info, SHAP, gating"]
F --> G["🖼️ Expanded VPM Image<br/>Versioned feature packs"]
G --> H["🚀 High-Speed Recall<br/>HNSW/ANN across all VPMs"]
end
classDef seed fill:#e6f7ff,stroke:#1890ff,stroke-width:2px;
classDef scorers fill:#f6ffed,stroke:#52c41a,stroke-width:2px;
classDef embeddings fill:#fff7e6,stroke:#fa8c16,stroke-width:2px;
classDef bank fill:#f9f0ff,stroke:#722ed1,stroke-width:2px;
classDef discovery fill:#fff2e8,stroke:#ff7a45,stroke-width:2px;
classDef filter fill:#f0fffe,stroke:#13c2c2,stroke-width:2px;
classDef output fill:#fff0f6,stroke:#eb2f96,stroke-width:2px;
classDef search fill:#e6f7ff,stroke:#1890ff,stroke-width:2px;
class A seed;
class B scorers;
class C embeddings;
class D bank;
class E discovery;
class F filter;
class G output;
class H search;
This diagram shows how the Jitter’s cognitive measurement explodes from 17 foundational metrics into thousands of dynamic coordinates in thought-space. The process starts with seed vitals (ssp.v1), expands through multiple scorer ensembles and embedding models into a rich feature bank, then uses VPM-ViT to auto-discover emergent patterns. Utility filtering keeps the system fast by pruning low-value features, while versioned packs ensure backward compatibility. The final expanded VPM images become searchable coordinates that let the Jitter navigate across billions of historical thought strands at lightspeed.
📈 How we expand concretely
- Add scorers → more channels. Pipe outputs from HRM, SICQL, EBT, SVM, LLM judges, and MARS diagnostics into the metric vector (normalized to [0,1], namespaced like `hrm.*`, `sicql.*`, `mars.*`). More information should lead to better decisions.
- Append embeddings → high-dimensional context. Attach dense vectors (e.g., HNet, HF, MXBAI) alongside metrics. These don't need [0,1]; we store min/max for robust scaling into the VPM.
- Auto-discover features → emergent signals. Train a small VPM-ViT to read VPMs and emit new features (e.g., field roughness, cluster density, drift, stability bands). These become first-class metrics (namespaced `vpm.*`), gated by measured utility.
- Speed through similarity. The Jitter uses metric signatures to navigate thought space at near-lightspeed. When exploring a new idea (metric vector X), it instantly retrieves the 2,000 most similar historical thought strands from world-scale knowledge. Most paths die out (like your rejected thoughts), but occasionally a dormant strand leads to something new.
- Utility-driven trimming → stay fast. Maintain a feature bank; keep only features with demonstrated value (predictive lift, calibration gain, control stability). Everything else stays archived for recall.
- Governance → never break readers. Group new features into versioned packs (`ssp.v2`, `ssp.v2+emb.hnet`, `ssp.v2+vpmvit`). The image transport (VPM) remains stable; consumers can request packs they understand.
🧊 Memcube: Where Dormant Thought Strands Become Future Insights
The real power of our metric system isn't in the numbers; it's in how they anchor complete thought processes in the Memcube.
When the Jitter explores a path that seems unproductive today (e.g., “building an app that tells you what to eat”), it doesn’t discard the exploration. Instead:
- It stores the full metric signature of the thought process
- It preserves the exploration context (what prompted it, what paths were tried)
- It indexes by semantic similarity for future retrieval
Years later, when you’re working on nutrition AI, the system recognizes: “This old exploration suddenly has high relevance!” The metric signature becomes a retrieval key for dormant insights.
This is how we honor the insight that "nothing's lost." Even the 99% of processing that seems to go nowhere becomes valuable data for future cognition.
🔵 Minimal config shape (illustrative)
ssp:
metrics:
version: "ssp.v1"
seeds: ["reward","verified","curriculum_difficulty", ... "rag_verification"]
packs:
- name: "emb.hnet.768"
dims: 768
scaler: "robust01"
- name: "scorer.hrm.core"
dims: ["hrm.score","hrm.uncertainty","hrm.depth"]
- name: "vpmvit.auto"
dims: ["vpm.field_roughness","vpm.cluster_cohesion","vpm.drift01"]
selection:
method: "mi+calibration_gain"
budget: 2048 # max active dims per VPM row
Bottom line: the contract isn't the metric list; it's the transport and versioning. VPM stays the lingua franca; the metric swarm can grow, specialize, and self-edit without breaking time-travel comparability.
🎁 Why keep the seeds at all? The cognitive heartbeat
The seed vitals aren't just technical anchors; they're the heartbeat of cognition.
Just as your heart beats steadily while your thoughts wander freely, these metrics provide:
- A steady rhythm for the Jitter’s cognitive process
- Anchor points for comparing thought quality across time
- A pulse to measure against when exploring new dimensions
They're not the entire mind; they're the vital signs that tell us the mind is alive and growing.
🧸 Minimal pseudocode
scorable = SSPScorable(
episode_id=episode_id,
question=q,
seed_answer=seed, # for F1 + leakage checks
predicted_answer=pred,
evidence_docs=evidence_docs, # for search_turns + rag gate
solver_steps=steps,
depth=depth,
difficulty=difficulty01, # already in [0,1]
reward=verifier01, # judge score in [0,1]
verified=bool(solver_wins),
score=score01, # optional
best_score=best01, # optional
meta={"novelty": novelty01,
"search_turns": k,
"rag_verified": bool_rag,
"noise_doc_count": n_noise,
"noise_success": succ01},
)
metrics = SSPScorer(cfg).score(scorable) # -> {'version','names','values','vector'}
🪀 VPMs: Images as Thoughts (making cognition trainable)
The Jitter isn't a hidden essence; it's a visible stream. We treat each SSP episode as a frame in that stream and standardize how it's rendered.
How a thought becomes an image
- Metric vector → VPM frame. The 12 canonical metrics are mapped to a compact grayscale layout (bands in fixed order). Same metrics → same pixels → stable meaning.
- Frames → filmstrip. Episodes over time form a timeline we can skim like an ECG of cognition.
- Filmstrip → embedding. A small vision model (VPM-ViT) learns to read frames and predict outcomes (risk class, success odds, good next move).
- Embedding → control. The controller uses those predictions to pick exemplars, adjust depth/steps, stop early, or escalate.
flowchart TD
subgraph VPM_Processing_Pipeline ["🔄 VPM Processing Pipeline<br/> From Metrics to Action"]
E["📊 Episode<br/>12-name metrics<br/>cognitive dimensions"] -->|"🎯 metric vector<br/>[0,1] normalized"| F["🖼️ VPM Frame<br/>grayscale image<br/>fixed layout"]
F -->|"🔄 single thought moment"| FS["🎞️ Filmstrip<br/>sequence over time<br/>cognitive timeline"]
FS -->|"📺 visible thought stream"| EMB["🧠 Visual Embedding<br/>(VPM-ViT)<br/>pattern recognition"]
EMB -->|"🔍 learned patterns<br/>risk prediction"| CTRL["🎛️ Control Policy<br/>goal/thresholds<br/>strategic adjustment"]
CTRL -->|"⚡ decisions<br/>adaptive tuning"| NEXT["⚙️ Next Episode Config<br/>improved parameters"]
end
classDef metrics fill:#e6f7ff,stroke:#1890ff,stroke-width:2px;
classDef frame fill:#f6ffed,stroke:#52c41a,stroke-width:2px;
classDef filmstrip fill:#fff7e6,stroke:#fa8c16,stroke-width:2px;
classDef embedding fill:#f9f0ff,stroke:#722ed1,stroke-width:2px;
classDef control fill:#fff2e8,stroke:#ff7a45,stroke-width:2px;
classDef config fill:#f0fffe,stroke:#13c2c2,stroke-width:2px;
classDef arrow fill:#ffffff,stroke:#666666,stroke-width:1px;
class E metrics;
class F frame;
class FS filmstrip;
class EMB embedding;
class CTRL control;
class NEXT config;
This is the Jitter's learning loop: cognitive metrics become visual frames, frames form memory filmstrips, and our VPM-ViT model reads these patterns to guide smarter thinking in future episodes, closing the circle from thought to self-improvement.
Why images (not just numbers)?
- Stability: pixels freeze semantics across models and years.
- Interpretability: patterns of success/failure are obvious at a glance.
- Trainability: vision backbones are excellent at learning from small, structured images.
- Composability: frames can be linked temporally (what happened next?) and by similarity (what does this feel like?), forming a thought-graph that becomes the system’s style/personality.
What this buys us
- A memory of moments that is cheap to store, search, and replay.
- A visual dialect for the Jitter (images → images → images) that the system can both read and act on.
- A closed loop: see → decide → act → see, where the seeing is literally pixels.
🎞️ An example film strip result

This image is an example filmstrip generated by our process. As time goes on the data becomes stronger, generating a whiter result.
The Jitter we're building is not a soul or a secret essence; it's a stream. A living accumulation of moments that passes through perception, recall, insight, correction. Our claim is simple:
If we can represent each moment as an image, and make those images comparable, connectable, and trainable, then we can grow a visible, continuous thinking process.
Here’s how we do it step by step.
flowchart TD
subgraph Thought_Generation ["🔁 The Thought Lifecycle"]
Q["❓ Question<br/>What reality asks"] --> S1["🔍 Search<br/>Gather evidence"]
S1 --> S2["💡 Solve<br/>Reason & construct answer"]
S2 --> P["✅ Prove<br/>Verify & score"]
P --> M["📊 Metrics → Metric Vector<br/>12 cognitive dimensions"]
M --> V1["🖼️ RAW VPM<br/>Direct metric mapping"]
M --> V2["🎨 PHOS VPM<br/>Sorted pattern view"]
V1 --> C["🎛️ Controller<br/>Learning & steering"]
V2 --> C
C --> Q2["🔄 Next Episode<br/>Improved question"]
end
classDef question fill:#e6f7ff,stroke:#1890ff,stroke-width:2px;
classDef process fill:#f6ffed,stroke:#52c41a,stroke-width:2px;
classDef metrics fill:#fff7e6,stroke:#fa8c16,stroke-width:2px;
classDef vpm fill:#f9f0ff,stroke:#722ed1,stroke-width:2px;
classDef controller fill:#fff2e8,stroke:#ff7a45,stroke-width:2px;
classDef next fill:#f0fffe,stroke:#13c2c2,stroke-width:2px;
class Q,Q2 question;
class S1,S2,P process;
class M metrics;
class V1,V2 vpm;
class C controller;
The complete thought lifecycle: each episode moves from question through search, solution, and verification, then transforms cognitive metrics into dual visual representations (RAW and PHOS VPMs) that inform the controller's decisions for the next, improved thought cycle.
🤔 1) Define a thought (as data you can revisit)
Every SSP episode is one “moment.” It contains:
- the question (what reality just asked us),
- the answer (what we tried),
- the evidence (what we looked at),
- the trace (how we got there),
- and a deterministic metric vector, our vital signs in [0,1] (verifier score, verified flag, difficulty, depth, steps, etc.).
This metric vector is the contract. Whether we score it with an LLM today or a custom model two years from now, it means the same thing and lives in the same positions. That makes the moment persistent.
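A sketch of what that contract can look like in code: a versioned name tuple in fixed order, and a vector emitted in exactly that order. Only the first few names are shown here; the full ssp.v1 list is the table above.

```python
# Sketch of the "contract": versioned names in fixed order, vector emitted in
# that order. Only the first few ssp.v1 names are listed here for brevity.
SSP_METRIC_VERSION = "ssp.v1"
SSP_METRIC_NAMES = (
    "ssp.reward", "ssp.verified", "ssp.curriculum_difficulty",
    "ssp.question_len", "ssp.answer_len", "ssp.evidence_count",
    # ... the remaining ssp.v1 names, always in the same order
)

def to_vector(vmap: dict) -> list:
    """Emit values in canonical order; anything missing defaults to 0.0."""
    return [float(vmap.get(name, 0.0)) for name in SSP_METRIC_NAMES]
```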
🧑🎨 2) Render the thought (as a compact image)
We convert that metric vector into a VPM frame: a small grayscale image where each band corresponds to a metric in a fixed order. It's like an ECG for cognition: fast to write, fast to read, and always the same layout.
- Same order → same pixels → same meaning.
- One frame per episode; a sequence of frames becomes a filmstrip, the visible heartbeat of a run.
⚖️ 3) Compare thoughts (find neighbors and patterns)
With images, similarity is natural. We can:
- compute simple distances (cosine / L2) on flattened frames,
- or learn visual embeddings (e.g., our VPM-ViT) so similar cognitive states sit close together in latent space.
Now we can answer questions like:
- When do we succeed in the same way?
- What does failure “look” like?
- Which adjustments lead to recovery?
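As a concrete illustration of the comparison step, here is a minimal sketch with plain numpy (no VPM-ViT), assuming frames are stored as flattened vectors:

```python
# Minimal sketch: cosine distance on flattened frames + k-nearest past frames.
# Assumes frames are numpy vectors in [0,1]; no learned embedding involved.
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    denom = float(np.linalg.norm(a) * np.linalg.norm(b)) or 1.0
    return 1.0 - float(a @ b) / denom

def nearest_frames(query: np.ndarray, history: list, k: int = 5) -> list:
    """Indices of the k most similar past frames (smallest distance first)."""
    dists = [cosine_distance(query, h) for h in history]
    return sorted(range(len(history)), key=dists.__getitem__)[:k]
```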
🔗 4) Connect thoughts (make a path, not a pile)
A stream is not a bucket. We connect frames into traces:
- Temporal links (episode → episode) show continuity.
- Similarity links (nearest neighbors) show related states across runs.
- Causal hints (verification flips, local gap closures) mark why we moved.
These links form a thought-graph: clusters of stable strategies, bridges of recovery, attractors we keep returning to. Over time, that graph is the personality of the system.
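A toy sketch of those links, reusing the `nearest_frames` helper sketched earlier; the real store is far richer, and this structure is ours for illustration only.

```python
# Toy thought-graph: temporal edges between consecutive episodes plus
# similarity edges to near neighbours. Plain dicts; the real store is richer.
from collections import defaultdict

def build_thought_graph(frames, nearest_fn, k=3):
    """frames: flattened VPM vectors in episode order."""
    edges = defaultdict(list)
    for i in range(1, len(frames)):
        edges[i - 1].append(("temporal", i))      # continuity: episode -> episode
    for i, frame in enumerate(frames):
        for j in nearest_fn(frame, frames, k=k):
            if j != i:
                edges[i].append(("similar", j))   # related states across runs
    return edges
```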
🏋️♂️ 5) Train on the stream (so the stream gets better)
Because thoughts are images, we can train directly on the filmstrip:
- A small vision model (our VPM-ViT) learns to read frames and predict outcomes (risk class, success odds, suggested next move).
- The controller uses these predictions to nudge the next thought (choose an exemplar, adjust depth, stop early, escalate).
- The new outcome creates the next frame, closing the loop.
That’s the organism: see → decide → act → see again, with pictures as the lingua franca.
🏢 What this builds toward
- A memory of moments you can replay, compare, and learn from.
- A visual dialect for the Jitter (images → images → images 🔄) that lets it recognize itself across time.
- A playground where self-play generates experience, metrics turn it into images, and images teach the next move.
This is the first step. Next, we’ll show the SSP loop that emits these frames, the exact metric vector we use, and how the VPM controller learns to steer so the stream doesn’t just flow, it improves.
🧩 How this fits the bigger picture
We’ve already built pieces Stephanie needs:
- Multi-dimensional scoring and knowledge measurement
- An image-first worldview: VPM and timelines (generally, when you think visual, think Zeromodel).
- The infrastructure to remember, compare, and improve
This post plants the first stake: SSP as the cognitive heartbeat. Next, we'll show how Jitter stabilizes (homeostasis), how VPM-ViT learns directly from those images, and how the system's identity emerges as the history of its own thinking: visible, measurable, and getting better.
⚽ The Self-Play Loop: A Digital Organism’s Metabolism
flowchart TD
subgraph SSP_Metabolism ["🔄 Self-Play Metabolism: The Jitter's Cognitive Engine"]
P["🧠 Proposer<br/>Generate challenges"] -->|"📝 question +<br/>📚 evidence"| S["🔍 Solver<br/>Search & reason"]
S -->|"💡 answer +<br/>🔄 steps"| V["✅ Verifier<br/>RAG verification"]
V -->|"🎯 score &<br/>⚖️ decision"| M["📊 Metrics Calculator<br/>17 cognitive dimensions"]
M -->|"🎨 vector"| W["🖼️ VPM Generator<br/>Raw + PHOS views"]
W -->|"🎬 raw + PHOS"| F["📺 Filmstrip<br/>Visible thought stream"]
F --> G["🎞️ GIF/Video<br/>Cognitive timeline"]
V -->|"📝 feedback"| P
end
classDef proposer fill:#e6f7ff,stroke:#1890ff,stroke-width:2px;
classDef solver fill:#f6ffed,stroke:#52c41a,stroke-width:2px;
classDef verifier fill:#fff7e6,stroke:#fa8c16,stroke-width:2px;
classDef metrics fill:#f9f0ff,stroke:#722ed1,stroke-width:2px;
classDef vpm fill:#fff2e8,stroke:#ff7a45,stroke-width:2px;
classDef filmstrip fill:#f0fffe,stroke:#13c2c2,stroke-width:2px;
classDef output fill:#fff0f6,stroke:#eb2f96,stroke-width:2px;
class P proposer;
class S solver;
class V verifier;
class M metrics;
class W vpm;
class F filmstrip;
class G output;
- What you're seeing: This is the Jitter's cognitive metabolism, a continuous cycle where the system generates its own challenges, solves them, verifies the solutions, and learns from the process. The Proposer creates questions, the Solver searches for answers, the Verifier checks their quality, and the Metrics system converts this into visual thought patterns (VPMs) that form a visible filmstrip of cognition. The feedback loop ensures each cycle builds on the last, creating a self-improving stream of thought that gets progressively more capable.
🎶 The SSP Algorithm: Orchestrating the Digital Thought Stream
The heart of our Jitter system is the SSP Algorithm - the conductor that coordinates the Search-Solve-Prove process to create a visible, measurable thought stream. Let’s examine how this orchestrator works and why it’s the perfect engine for our digital organism.
✨ How the SSP Loop Creates a “Thought”
At its core, an SSP episode is one moment of cognition - a complete cycle of encountering a problem, processing it, and verifying the solution. This mirrors how our own thoughts form:
- Search (Proposer): Like our mind generating a question from a seed idea
- Solve (Solver): Like our mind gathering evidence and reasoning
- Prove (Verifier): Like our mind checking if the answer makes sense
Here’s the elegant simplicity of the loop:
async def run_episode(self, seed_answer: str, context: Dict[str, Any]) -> EpisodeTrace:
# 1. Proposer: create a question from a seed answer
q, prop_evidence, prop_meta = await self.proposer.propose(seed_answer, context)
# 2. Solver: answer using search (like our mind gathering evidence)
pred, evidence_docs, solver_steps, solver_meta = await self.solver.solve(
question=q, seed_answer=seed_answer, context=context
)
# 3. Verifier: check if the answer is correct (like our mental verification)
solver_wins, judge_score, judge_details = await self.verifier.verify(
q, seed_answer, pred, evidence_docs, context
)
# 4. Create a permanent record of this "thought"
ep = EpisodeTrace(
episode_id=episode_id,
seed_answer=seed_answer,
question=q,
predicted_answer=pred,
evidence_docs=evidence_docs,
verified=bool(solver_wins),
reward=float(judge_score),
# ...other metadata
)
# 5. Convert the thought into visual form (VPM)
if self.vpm_visualization:
self.vpm_visualization.snapshot_progress(unit=episode_id, dims=dims, step_idx=0)
This creates what we call a thought moment - a self-contained cognitive event that can be stored, compared, and learned from.
➡️ Why This Implementation Aligns with the SSP Paper
Our implementation directly implements the paper’s core innovation: self-play without supervision. As the paper states:
“Through RL training with rule-based outcome rewards, SSP enables two roles to co-evolve in an adversarial competition: the proposer learns to generate increasingly challenging problems that require search and reasoning, while the solver develops stronger search and reasoning capabilities to tackle these problems.”
Here’s how our code embodies this:
1. The Critical RAG Verification Process
The paper emphasizes: “To verify the correctness of each generated query, we collect all the searching results from the proposer’s trajectory as the external materials, then conduct a retrieval augmentation generation (RAG) to check if the solver can successfully predict the answer with all necessary information.”
In our code:
# RAG verification: did solver beat the seed using ONLY the evidence?
solver_wins, judge_score, judge_details = await self.verifier.verify(
q, seed_answer, pred, evidence_docs, context
)
This is the quality gate that prevents degeneration - without it, the system would quickly learn to generate unanswerable questions or rely on internal knowledge rather than search.
2. Tracking Meaningful Capability Growth
The paper shows in Figure 4: “the average number of search tool calls per trajectory steadily increases over time… Simultaneously, Figure 4b shows that the solver’s response length also grows during the training, suggesting it learns to generate more detailed and comprehensive answers.”
Our metrics system captures exactly these signals:
# Track search capability growth
self.metrics.avg_solver_steps = (
self.metrics.avg_solver_steps * (verified_count - 1) + ep.solver_steps
) / verified_count
# Track reasoning depth (via evidence usage)
evid_cnt = len(scorable.evidence_docs or [])
vmap["ssp.evidence_count"] = _clamp01(evid_cnt / max(1, self.max_evidence))
These metrics aren’t just numbers - they’re visible indicators of cognitive growth that we convert to VPM images.
3. The Self-Play Reward Dynamics
The paper warns: “This experiment critically underscores that the proposer’s reward design is paramount for stable co-evolution in SSP; a punitive approach can destabilize the entire self-play dynamic.”
Our implementation handles this carefully:
# Only compute rewards for verified episodes (paper's game signal)
if ep.verified:
self._calculate_and_apply_rewards([ep], unverified_count=0)
def _calculate_and_apply_rewards(self, verified_episodes, unverified_count):
rewards = calculate_self_play_rewards(verified_episodes, unverified_count)
# Apply to episodes...
We’ve implemented the paper’s insight that only valid episodes should contribute to training - otherwise the system degenerates.
🚰 The Thought Visualization Pipeline
Here’s where we extend beyond the paper to create the Jitter’s visible thought stream:
# After creating the EpisodeTrace
scorable = SSPScorable.from_episode_trace(ep)
ssp_metrics = self._ssp_scorer.score(scorable) # Get canonical metrics
# Convert thought to visual form
if self.vpm_visualization:
# Create initial snapshot
self.vpm_visualization.snapshot_progress(
unit=episode_id,
dims=ssp_metrics["vector"],
step_idx=0,
tag="proposed"
)
# Generate final visualizations
raw_path = self.vpm_visualization.generate_raw_vpm_image(unit=episode_id)
phos_path = self.vpm_visualization.generate_phos_image(unit=episode_id)
film_path = self.vpm_visualization.generate_filmstrip(unit=episode_id)
This is the magic: converting cognitive metrics into visual frames that form our filmstrip of thought. Each metric becomes a pixel band in the VPM frame:
- `ssp.verified` → Success channel
- `ssp.search_turns` → Search capability channel
- `ssp.f1_score` → Accuracy channel
- `ssp.noise_tolerance` → Robustness channel
- `ssp.rag_verification` → Quality gate channel
💚 A “Visual” Thought Stream
The true innovation isn’t just the SSP loop itself, but how we connect these episodes into a continuous stream:
- Temporal connection: Each episode leads to the next
- Similarity connection: VPM frames allow us to find similar cognitive states
- Causal connection: Verification results guide future proposals
As the paper notes: “In stark contrast to the flawed dynamics of fixed-opponent training, our complete SSP framework facilitates a stable co-evolution.” Our implementation takes this further by making the co-evolution visible and measurable through VPM.
This is how we create the Jitter - not as a mysterious “self,” but as a visible, persistent stream of connected thought moments, each one a complete Search-Solve-Prove cycle that can be stored, compared, and improved upon.
🤸 Next Steps in Our Journey
In upcoming sections, we’ll dive into each component:
- The Proposer: How we generate questions that create meaningful challenges
- The Solver: Our enhanced search capabilities (including GPO tree search)
- The Verifier: Our multi-signal verification process that goes beyond the paper
- The VPM System: How we turn metrics into visual thought streams
Each of these components plays a vital role in creating the Jitter - the visible, measurable thought process that is the heart of our digital organism. The SSP algorithm is simply the conductor that brings them all together in harmony.
–
🕵️♂️ Module 1 Searching Proposer
Goal: turn a mechanism/seed answer into a single, precise, verifiable question, backed by a small pile of evidence snippets, so the rest of SSP has something rigorous to solve and prove.
🎬 What this proposer does
- Search first, ask later. It generates a few lightweight query rewrites from the seed (e.g., "What is X?", "How does X work?"), calls the `SolutionSearch` service to fetch top-K snippets, and de-duplicates them.
- Constrain the LLM to a 4-line contract. It then prompts the LLM with the seed + evidence and forces a 4-line output: `rationale: ...`, `difficulty: <0-100>`, `verifiability: <0-100>`, `question: <one precise, verifiable question>`. We parse this strictly, so downstream components receive clean `difficulty`/`verifiability` ints and a single normalized `question`.
- Apply safety rails and fallbacks.
  - Min length: if the question is too short/empty, fall back to `What is <seed_answer>?` (never breaks downstream).
  - Answer-leak guard: if the exact seed appears in the question text, swap it with "this mechanism".
  - Retries with backoff on transient prompt failures.
- Emit a tiny VPM frame for visibility. The proposer logs a frame via `VPMControlService.decide()` with dims like `evidence_quality = clip(len(evidence)/max_snippets)` and `question_length = clip(len(question)/100)`. These become part of the filmstrip so you can see proposal quality over time.
👨💻 Key code paths
1) Evidence-aware question crafting
rewrites = [
f"What is {seed_answer}?",
f"Explain {seed_answer} in detail",
f"How does {seed_answer} work?",
# + optional user patterns via config
]
snippets = await self.solution_search.find_snippets(rewrite, top_k=...)
# ...
prompt = self.prompt_loader.from_text(PROPOSER_PROMPT_TMPL, {
"seed_answer": seed_answer,
"evidence": "<br/>".join(all_evidence),
})
response = await self.prompt_service.run_prompt(prompt_text=prompt, context=merged_context)
parsed = parse_proposer_lines(response)
question = self._normalize_question(parsed.get("question", ""))
Why it matters: questions are grounded in retrieved context, not free-floating completions. This reduces trivia, improves verifiability, and keeps the loop honest.
2) Hard-contract prompt (4 lines, deterministic)
PROPOSER_PROMPT_TMPL = """You are building an SSP dataset...
OUTPUT FORMAT WRITE EXACTLY FOUR LINES, IN THIS ORDER, NO CODE FENCES:
rationale: <...>
difficulty: <0-100>
verifiability: <0-100>
question: <...>
"""
Why it matters: strict structure → stable parsing → deterministic telemetry & metrics.
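For illustration, a strict parser for that contract could look like the sketch below; the actual `parse_proposer_lines` in our code may differ in its details.

```python
# Sketch of a strict parser for the 4-line contract; the real
# parse_proposer_lines may differ. Unknown lines are ignored, and bad
# integers degrade to 0 rather than raising.
import re

_LINE = re.compile(r"^\s*(rationale|difficulty|verifiability|question)\s*[:=]\s*(.+?)\s*$", re.I)

def parse_proposer_lines(response: str) -> dict:
    fields = {}
    for line in response.splitlines():
        m = _LINE.match(line)
        if m:
            fields[m.group(1).lower()] = m.group(2)
    for key in ("difficulty", "verifiability"):
        try:
            fields[key] = max(0, min(100, int(float(fields.get(key, "0")))))
        except (TypeError, ValueError):
            fields[key] = 0
    return fields
```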
3) Question normalization + leak guard
# normalize "???" → "?"
text = re.sub(r"\?+", "?", text).strip()
if text and not text.endswith("?"):
text += "?"
# replace explicit seed with "this mechanism"
pattern = re.compile(re.escape(seed_answer), re.IGNORECASE)
q2 = pattern.sub("this mechanism", q)
Why it matters: keeps the task non-degenerate (no “just repeat the answer”).
4) VPM tap (proposer heartbeat)
self.vpm_control.decide(
unit=f"proposer:{(hash(seed_answer) & 0xffff):04x}",
kind="text",
dims={
"evidence_quality": min(1.0, len(all_evidence) / max(1, self.max_snippets)),
"question_length": min(1.0, len(question) / 100.0),
},
step_idx=ctx.get("step_idx", 0),
meta={ "seed_answer": seed_answer, "evidence_count": len(all_evidence), "latency_s": dt }
)
Why it matters: every proposal becomes a visible frame in the SSP filmstrip. You can spot bad proposals (short, low evidence) at a glance.
🌴 Tree search tie-in
Although the tree primarily lives in the solver, the proposer helps shape the search frontier by:
- producing multiple rewrites (diverse initial branches),
- delivering evidence snippets the solver can attach to nodes,
- and emitting `difficulty`/`verifiability` signals that can seed a per-question curriculum (deeper trees for easy items, wider for uncertain ones).
⚙️ Config knobs (sane defaults)
- `proposer.rewrites`: number of query rewrites (default 3)
- `proposer.max_snippets`: evidence cap (default 6)
- `proposer.min_question_len`: drop too-short candidates (default 12 chars)
- `proposer.forbid_answer_leak`: anonymize the seed in the question (default True)
- `proposer.retries` + `proposer.backoff_sec`: prompt robustness
Extensibility: you can add `proposer.additional_rewrites = ["Mechanism of {seed_answer}", ...]` in config; no code change needed.
💍 Interface contract (so we can swap proposers)
All proposers should implement:
async def propose(self, seed_answer: str, context: EpisodeContext | None) \
        -> tuple[str, list[str], dict]:
    """Return (question, evidence_docs, meta)."""
    ...

def get_capabilities(self) -> dict:
    return {
        "supports_search_during_proposal": True,
        "max_evidence_docs": self.max_snippets,
        "min_question_length": self.min_question_len,
    }
That means later we can plug in:
- Template Proposer (no LLM, pure rules)
- Paper-aware Proposer (specialized for technical mechanisms)
- Adversarial Proposer (intentionally tricky variations)
…and keep the rest of SSP unchanged.
🎉 Why this design works
- Grounded (retrieved evidence guides the question).
- Deterministic enough (strict output schema + normalization).
- Robust (retries, fallbacks, leak guard).
- Visible (VPM logs make quality legible).
- Composable (clean interface → easy to swap/extend).
Next module: the Solver, how the tree search expands candidates, uses the evidence, and produces a trace we can score and visualize. But first we need to describe a component that makes this work.
🌳 Agentic Tree Search: The Cognitive Engine of the Jitter
“The unexamined thought is not worth thinking.”
Adapted from Socrates
While our VPM system gives the Jitter eyes to see its thoughts, the Agentic Tree Search (ATS) provides the cognitive engine that generates those thoughts. This is where the Jitter transforms from a passive observer into an active thinker: it engages with the world, gathers evidence, and constructs understanding.
🌀 The Thought Generation Problem
The SSP paper poses a fundamental challenge: How can an agent learn to solve complex problems without supervision? It answers this with a self-play framework where:
“The proposer learns to generate increasingly challenging problems that require search and reasoning, while the solver develops stronger search and reasoning capabilities to tackle these problems.”
But how does this actually work in practice? How does the solver translate a question into a chain of reasoning that leads to an answer? This is where Agentic Tree Search becomes the cognitive engine of our Jitter.
🚀 The Cognitive Architecture of Thought
At its core, ATS implements what cognitive scientists call guided exploration, the process by which humans solve unfamiliar problems:
- Problem decomposition: Breaking a question into manageable parts
- Hypothesis generation: Creating potential paths to an answer
- Evidence gathering: Seeking relevant information for each path
- Evaluation: Determining which paths show promise
- Synthesis: Combining the most promising evidence into a coherent answer
While the SSP paper frames the method as a proposer–solver self-play game, we instantiate each episode as a tree-search control problem (ATS) and layer SSP’s verified rewards on top.
flowchart TD
A["🌳 Root Question<br/>'What causes climate change?'"] --> B["🔄 Rewritten Query 1<br/>'Explain climate change mechanisms'"]
A --> C["🔄 Rewritten Query 2<br/>'Describe climate change in practical terms'"]
A --> D["🔄 Rewritten Query 3<br/>'Climate change causes for beginners'"]
B --> B1["📄 Evidence Snippet 1<br/>'Greenhouse gases trap heat...'"]
B --> B2["📄 Evidence Snippet 2<br/>'Industrial emissions contribute...'"]
B --> B3["📄 Evidence Snippet 3<br/>'Natural climate cycles...'"]
C --> C1["📄 Evidence Snippet 1<br/>'Climate change manifests as...'"]
C --> C2["📄 Evidence Snippet 2<br/>'Temperatures have risen...'"]
D --> D1["📄 Evidence Snippet 1<br/>'Climate change basics: CO2...'"]
classDef question fill:#e6f7ff,stroke:#1890ff,stroke-width:2px;
classDef query fill:#f6ffed,stroke:#52c41a,stroke-width:2px;
classDef evidence fill:#fff7e6,stroke:#fa8c16,stroke-width:2px;
class A question;
class B,C,D query;
class B1,B2,B3,C1,C2,D1 evidence;
This tree search visualization shows how the Jitter explores multiple reasoning paths simultaneously: rewriting the original question into different perspectives, then gathering relevant evidence for each approach. This branching exploration mirrors human problem-solving, where we consider various angles before converging on the most promising solution.
This tree structure mirrors how our own minds work when tackling complex questions. We don't just magically produce answers; we explore multiple angles, gather evidence, and refine our understanding as we go.
🪟 Making Cognitive Growth Visible: What We Do and Why
SSP reports simple but telling signals of capability growth (e.g., more search calls per trajectory; longer, more detailed answers). We make those signals legible in our system by instrumenting our Agentic Tree Search (ATS) and turning each episode into a visual, comparable thought moment.
🤷 What we do
- Instrument the search. For every episode we log:
  - `search_turns` (actual search tool calls)
  - `solver_steps` (actions taken)
  - `depth` (max explored depth in ATS)
  - `evidence_count` (documents accepted into the rationale)
  - `verified` + `verifier_score` (RAG/judge gate)
  - Length features (`question_len`, `answer_len`)
  - Optional quality/robustness (`format_compliance`, `noise_tolerance`, `rag_verification`, `novelty`, etc.)
- Emit a deterministic metric vector. Metrics are normalized to [0,1], fixed in name and order, and versioned. That makes an episode today directly comparable to one months from now.
- Render the moment as an image. We convert the metric vector into a tiny VPM frame (and a PHOS variant). A sequence of frames forms a filmstrip: a visible record of how reasoning evolves across steps and runs.
- Close the loop with control. A lightweight policy reads frames (or VPM-ViT embeddings) to decide stop/expand/escalate: e.g., continue search, reuse a strong exemplar, or early-stop when verification is stable.
🧐 Why we do it
- Legibility: You can see capability changes (e.g., rising `search_turns` with stable `verified`) rather than infer them from logs.
- Comparability: The fixed, versioned vector means runs are apples-to-apples across time, models, and settings.
- Control: Visual signals feed simple policies (and the VPM-ViT) to steer search depth, evidence acceptance, and stopping criteria.
- Diagnosis: Patterns reveal failure modes fast: over-searching (high `search_turns`, low `verified`), shallow reasoning (low `depth`, short `answer_len`), brittle RAG gates, etc.
👓 How to read our visuals
- Brighter bands in `search_turns` and `answer_len` with a consistently bright `verified` band = healthier, more deliberate reasoning.
- Depth stabilizing while `evidence_count` stays moderate often indicates better targeting (less flailing, more proof).
- PHOS layouts highlight recurring "good shapes" (stable regimes) and drift when curriculum difficulty rises.
flowchart LR
subgraph Episode ["🎬 Single Thought Episode"]
A["🌳 ATS Search<br/>nodes, depth, evidence"] --> B["✅ Verify<br/>RAG / judge scoring"]
B --> C["📊 Deterministic Metrics<br/>fixed names/order"]
C --> D["🖼️ VPM Frame<br/>+ PHOS visualization"]
end
D --> E["🎛️ Controller<br/>stop/expand/escalate"]
E -->|"⚡ policy choice"| A
classDef search fill:#f6ffed,stroke:#52c41a,stroke-width:2px;
classDef verify fill:#fff7e6,stroke:#fa8c16,stroke-width:2px;
classDef metrics fill:#f9f0ff,stroke:#722ed1,stroke-width:2px;
classDef vpm fill:#e6f7ff,stroke:#1890ff,stroke-width:2px;
classDef controller fill:#fff2e8,stroke:#ff7a45,stroke-width:2px;
class A search;
class B verify;
class C metrics;
class D vpm;
class E controller;
This closed-loop system shows how each cognitive episode becomes a measurable, visual thought. The Agentic Tree Search explores reasoning paths, verification scores the quality, metrics capture the cognitive signature, and the VPM frame makes it visible. The controller then uses this visual feedback to make real-time decisions, stopping unproductive searches, expanding promising ones, or escalating difficult problems, creating a self-adjusting thought process that learns from its own patterns.
In short: we record, normalize, and picture the thinking so the Jitter isn’t a mystery “self,” but a visible, measurable stream of connected thought moments that we can compare, control, and train.
✅ Module 2 ATSSolver: Building the Cognitive Engine
Now that we’ve established the research foundation, let’s dive into how we’ve implemented Agentic Tree Search in our system. The ATSSolver is the workhorse that transforms questions into answers through guided exploration.
👯 What It Is: Two Modes of Thinking
The ATSSolver operates in two distinct cognitive modes, mirroring how humans approach problems differently depending on context:
1. Deep Search Mode (Thinking with Exploration)
async def solve(self, question: str, seed_answer: str, context: EpisodeContext) -> Tuple[str, List[str], int, Dict[str, Any]]:
# Builds and scores a search tree over query rewrites + evidence snippets
# Returns the best answer found through exploration
This is the Jitter's "thinking hard" mode, used when it needs to solve a genuinely challenging problem. It constructs a tree of potential reasoning paths, evaluates evidence for each, and synthesizes the most promising answer.
2. Evidence-Only Mode (Thinking with Constraints)
async def solve_with_evidence(self, question: str, evidence_docs: List[str], context: EpisodeContext) -> Tuple[str, Dict[str, Any]]:
# Answers strictly using provided evidence (no search)
# Used for verification and ablation studies
This is the Jitter's "test-taking" mode, used when it must answer based only on given information. It's critical for the verification step in our SSP loop, ensuring answers are grounded in evidence.
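Hypothetical usage of the two modes, matching the signatures above; the solver instance and the episode context are assumed to be constructed elsewhere.

```python
# Hypothetical usage of the two modes; the solver instance and episode
# context are assumed to be constructed elsewhere.
async def demo(solver, context):
    # Deep search mode: build and score a tree over rewrites + evidence.
    pred, evidence, steps, meta = await solver.solve(
        question="How does this mechanism trap heat?",
        seed_answer="greenhouse effect",
        context=context,
    )
    # Evidence-only mode: answer strictly from the snippets just gathered.
    constrained_pred, meta2 = await solver.solve_with_evidence(
        question="How does this mechanism trap heat?",
        evidence_docs=evidence,
        context=context,
    )
    return pred, constrained_pred
```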
💧 The Data Flow: How Thoughts Are Constructed
Here’s how the solver integrates with the broader system:
sequenceDiagram
participant Proposer as 🧠 Proposer
participant ATSSolver as 🌳 ATSSolver
participant SolutionSearch as 🔍 SolutionSearch
participant Reward as 📊 Reward Head
participant VPM as 🎬 VPM Service
Note over Proposer, VPM: 🚀 Thought Generation Cycle
Proposer->>ATSSolver: 📨 question, seed_answer, context
ATSSolver->>SolutionSearch: 🔄 rewritten queries
SolutionSearch-->>ATSSolver: 📚 evidence snippets
loop 🔁 For each depth
ATSSolver->>ATSSolver: 🎯 Score evidence snippets
ATSSolver->>ATSSolver: 📍 Track best path
ATSSolver->>VPM: 📈 Push cognitive metrics
end
ATSSolver->>Reward: 💡 question, predicted_answer, evidence
Reward-->>ATSSolver: 🏆 quality signals
ATSSolver-->>Proposer: 📤 predicted_answer, evidence, metrics
Note right of ATSSolver: 🔄 Cycle continues with<br/>improved context & metrics
This sequence shows the real-time cognitive collaboration between components: the Proposer initiates thinking with a question, the ATSSolver orchestrates evidence gathering through multiple search iterations, quality signals are evaluated, and visual metrics are captured at each step. The loop demonstrates how each thought builds upon the last, with continuous quality assessment and visual feedback driving the Jitter’s progressive improvement.
🗝️ Key Implementation Insights
1. The Tree Node Structure
Each cognitive step is represented by a structured node:
@dataclass
class Node:
id: str
parent_id: Optional[str]
root_id: str
depth: int
sibling_index: int
node_type: str # "root", "rewrite", etc.
query: str # The rewritten question
score: float # Evidence relevance score
context: str # Retrieved evidence snippet
task_description: str
This structure captures the essence of the Jitter's thought process: each node records not just what was thought, but how it connects to previous thoughts.
2. Query Rewriting: Expanding the Search Space
The solver doesn't just search once; it generates multiple perspectives on the question:
@staticmethod
def _rewrite(query: str) -> List[str]:
return [
query,
query.replace("explain", "describe"),
query + " in practical terms",
]
This simple but powerful technique mirrors how humans reframe problems to gain new insights. The SSP paper validates this approach:
“Through RL training with rule-based outcome rewards, SSP enables two roles to co-evolve in an adversarial competition: the proposer learns to generate increasingly challenging problems that require search and reasoning, while the solver develops stronger search and reasoning capabilities to tackle these problems.”
3. Evidence Scoring: The Cognitive Filter
Not all evidence is equally valuable. The solver uses a relevance score to prioritize promising paths:
@staticmethod
def _overlap_score(text: str, target: str) -> float:
a = {t for t in text.lower().split() if t.isalpha() or t.isalnum()}
b = {t for t in target.lower().split() if t.isalpha() or t.isalnum()}
return len(a & b) / max(len(a | b), 1) if a or b else 0.0
This lexical overlap score is a proxy for how well evidence supports the target answer. Later iterations will replace this with more sophisticated signals (SICQL, HRM, etc.), but the principle remains: the Jitter evaluates evidence quality as it thinks.
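A quick worked example of that proxy, assuming `ATSSolver` is importable from the project; the score is just token-set Jaccard overlap.

```python
# Worked example of the overlap proxy above (assumes ATSSolver is importable):
# shared tokens = {greenhouse, gases, trap, heat} -> 4
# distinct union = 8 tokens -> score = 4 / 8 = 0.5
score = ATSSolver._overlap_score(
    "Greenhouse gases trap heat in the atmosphere",
    "greenhouse gases trap outgoing heat",
)
assert abs(score - 0.5) < 1e-9
```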
4. VPM Integration: Making Thought Visible
The most profound aspect of our implementation is how it captures the cognitive process in real-time:
# After each evidence snippet is scored
dims = {
"reward": prev_best,
"verified": 0.0,
"difficulty": float(context.get("difficulty", 0.3)),
"question_len": _n01(len(q2.split()), 128),
"answer_len": _n01(len(snippet.split()), 128),
"evidence_count": _n01(last_ev_batch, 8),
"solver_steps": _n01(steps, total_steps),
"score": sc,
"best_score": prev_best,
"improvement": max(0.0, sc - prev_best),
"depth": _n01(depth, self.max_depth),
"novelty": _jac(snippet, best.context),
}
self.vpm.snapshot_progress(unit=unit, dims=dims, step_idx=steps, tag=f"depth{depth}")
This code transforms the abstract cognitive process into concrete, visual metrics, which is exactly how the Jitter becomes visible. Each dimension captures a different aspect of the thought process:
- `improvement`: Has this step advanced understanding?
- `novelty`: Is this new information or repetition?
- `evidence_count`: How thoroughly is the Jitter searching?
🔧 Measurable Improvement
By making the search process visible, measurable, and improvable, we’ve created conditions where a digital thought stream can:
- Explore multiple reasoning paths
- Evaluate evidence quality
- Recognize promising directions
- Synthesize coherent answers
- Learn from its own cognitive patterns
This is how the Jitter moves beyond being a clever chatbot to becoming a genuine cognitive system: one that doesn't just respond to questions, but thinks through them in a visible, measurable way.
In our next section, we’ll explore how the SolutionSearch component implements the actual evidence retrieval, completing the cognitive engine that powers our Jitter.
📡 Module 3 SolutionSearch: The Jitter’s Knowledge Retrieval Engine
“The mark of an educated mind is to be able to entertain a thought without accepting it.”
Aristotle
While the ATSSolver provides the Jitter with its cognitive engine, the SolutionSearch component serves as its knowledge retrieval system: the mechanism that allows it to ground its thoughts in evidence rather than mere speculation. This is where the Jitter transforms from a clever chatbot into a genuine cognitive system that can reason with evidence.
🎓 The Knowledge Problem
The SSP paper identifies a fundamental limitation of language models:
“With search tools, we equip the problem-proposer with external information, thereby breaking the limitations of the internal knowledge of LLMs.”
Without access to external knowledge, even the most sophisticated reasoning engine is limited by the model’s training data. The SolutionSearch component solves this problem by providing a reliable, deterministic interface to evidence retrieval that powers the Jitter’s reasoning process.
🐘 A Micro-Retriever with Macro Impact
At first glance, SolutionSearch might seem like just another search tool but it’s actually a carefully engineered component designed specifically for cognitive reasoning:
flowchart LR
subgraph SolutionSearch_Flow ["🔍 SolutionSearch: Evidence Retrieval Engine"]
A["🎯 Query + Seed Answer"] --> B["📋 Prompt Template Selection"]
B --> C{"🎚️ k=1?<br/>(Strict Mode)"}
C -->|"✅ Yes"| D["🧠 Three-Line Prompt:<br/>rationale/score/result"]
C -->|"🔓 No"| E["📝 Multi-Line Prompt:<br/>explicit snippet lines"]
D --> F["⚡ Strict Parser"]
E --> G["🛠️ Flexible Parser<br/>(snippet/JSON/bullets)"]
F --> H["✨ Post-Processing:<br/>Deduplication & Length Caps"]
G --> H
H --> I["📚 Evidence Snippets<br/>Clean, factual snippets"]
I --> J["🌳 ATSSolver Reasoning<br/>Tree search integration"]
end
classDef input fill:#e6f7ff,stroke:#1890ff,stroke-width:2px;
classDef decision fill:#fff7e6,stroke:#fa8c16,stroke-width:2px;
classDef prompt fill:#f6ffed,stroke:#52c41a,stroke-width:2px;
classDef parser fill:#f9f0ff,stroke:#722ed1,stroke-width:2px;
classDef processing fill:#fff2e8,stroke:#ff7a45,stroke-width:2px;
classDef output fill:#f0fffe,stroke:#13c2c2,stroke-width:2px;
class A,B input;
class C decision;
class D,E prompt;
class F,G parser;
class H processing;
class I,J output;
SolutionSearch’s dual-path architecture: depending on the required evidence depth (k=1 for focused “deep thinking” or k>1 for exploratory mode), it routes through specialized prompt templates and parsers to extract clean, factual snippets. This deterministic retrieval process ensures the Jitter’s reasoning is always grounded in external evidence rather than internal model knowledge limitations.
1. Dual Prompt Strategy: Precision vs. Flexibility
SolutionSearch employs two distinct prompt strategies optimized for different cognitive needs:
A) Three-Line Prompt (k=1) The “Deep Thinking” Mode
PROMPT_EVIDENCE_THREE = """
SYSTEM:
You produce ONE short evidence snippet that helps explain or support the SEED_ANSWER
with respect to the QUERY.
CONSTRAINTS:
- Return exactly one short factual snippet (1–2 sentences).
- If unsure, fall back to: "{seed_answer} is the key mechanism."
- No extra text, no markdown, no bullet points.
OUTPUT EXACTLY THREE LINES:
rationale: <1 sentence on why this snippet is relevant>
score: <0-100 confidence you have in this snippet>
result: <the single snippet>
"""
This prompt forces the model into a deliberate, focused mode, perfect for when the Jitter needs to deeply consider a single piece of evidence. It mirrors how humans think when they're trying to understand a complex concept: one idea at a time, with clear reasoning.
B) Multi-Line Prompt (k>1): The “Exploratory” Mode
PROMPT_EVIDENCE_LINES = """
SYSTEM:
You return SHORT evidence snippets that help explain or support the SEED_ANSWER
with respect to the QUERY.
CONSTRAINTS:
- Provide concise, factual snippets (1–2 sentences each).
- No commentary or extra sections.
OUTPUT WRITE EXACTLY {top_k} LINES:
snippet: <short evidence snippet>
"""
This prompt enables broader exploration when the Jitter needs to consider multiple perspectives on a question. It’s like when humans brainstorm multiple approaches to a problem before settling on one.
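To make the routing concrete, here is a minimal sketch of how a caller might choose between the two templates. The function name and the way the query and seed answer are appended are illustrative assumptions, not the actual SolutionSearch code.
def build_evidence_prompt(query: str, seed_answer: str, k: int) -> str:
    """Route to the strict three-line template for k=1, the multi-line template otherwise."""
    if k == 1:
        body = PROMPT_EVIDENCE_THREE.format(seed_answer=seed_answer)
    else:
        body = PROMPT_EVIDENCE_LINES.format(top_k=k)
    # The caller supplies the concrete task after the SYSTEM block (assumed layout).
    return f"{body}\nQUERY: {query}\nSEED_ANSWER: {seed_answer}"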
2. Robust Parsing: Making LLMs Behave
The real strength of SolutionSearch lies in its parser hierarchy: a carefully engineered fallback chain that extracts clean evidence from the often-messy outputs of language models:
def _parse_snippets(self, response: str, k: int) -> List[str]:
"""
Supported formats (in order of preference):
1) Line-by-line: lines starting with `snippet: ...`
2) JSON: keys 'snippets' | 'docs' | 'evidence' | 'results'
3) Bullets/lines: split by newline, trim bullets
"""
# 1) Explicit 'snippet:' lines (case/space tolerant)
lines = [ln.strip() for ln in response.splitlines() if ln.strip()]
snips: List[str] = []
for ln in lines:
m = re.match(r'(?i)^\s*(?:-|\d+[.)])?\s*snippet\s*[:=]\s*(.+?)\s*$', ln)
if m:
snips.append(m.group(1).strip())
if snips:
return snips[:k]
# 2) JSON (fenced or bare)
m = re.search(r"```json\s*(\{.*?\})\s*```", response, re.DOTALL | re.IGNORECASE)
jtxt = m.group(1) if m else response.strip()
if jtxt.startswith("{") and jtxt.endswith("}"):
try:
obj = json.loads(jtxt)
lst = self._pluck_list(obj)
if lst:
return lst[:k]
except Exception:
pass
# 3) Fallback: plain lines/bullets
bullets = [b.strip(" -*•\t") for b in lines]
bullets = [b for b in bullets if b]
return bullets[:k]
This three-tiered approach ensures reliable output even when the LLM doesn’t follow instructions perfectly. It’s designed to handle the reality of LLM outputs while maintaining the strict formatting required for the Jitter’s cognitive process.
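To see the three tiers in action, here are illustrative model outputs and the snippets the parser above would extract from each, assuming a SolutionSearch instance named search and that _pluck_list returns the list stored under the matching JSON key:
# Tier 1: explicit 'snippet:' lines
tier1 = "snippet: Mitochondria produce ATP.\nsnippet: ATP powers cellular work."
# Tier 2: bare JSON object
tier2 = '{"snippets": ["Mitochondria produce ATP."]}'
# Tier 3: plain bullets
tier3 = "- Mitochondria produce ATP.\n- ATP powers cellular work."

# search._parse_snippets(tier1, k=2) -> ["Mitochondria produce ATP.", "ATP powers cellular work."]
# search._parse_snippets(tier2, k=1) -> ["Mitochondria produce ATP."]
# search._parse_snippets(tier3, k=2) -> ["Mitochondria produce ATP.", "ATP powers cellular work."]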
3. Reliability Engineering: Never Return Empty
Perhaps most importantly, SolutionSearch is engineered for cognitive reliability: it never returns empty results, which would stall the Jitter’s reasoning process:
def _fallback_snippets(self, query: str, seed_answer: str, k: int) -> List[str]:
"""Conservative, non-empty fallback."""
base = (
f"DOC: For '{query}', a central mechanism is: {seed_answer}. "
f"This snippet highlights why {seed_answer} is relevant."
)
return [base + f" [hit:{i}]" for i in range(k)]
This conservative fallback ensures that the Jitter can always continue thinking, even when evidence is scarce, which is critical for maintaining cognitive flow.
✔️ Evidence-based reasoning
The SolutionSearch component embodies what the SSP paper calls the “RAG verification” process:
“To ensure that each generated search query has sufficient information to correctly predict the answer, we collect all the searching results from the proposer’s trajectory as external knowledge, then conduct retrieval-augmentation generation (RAG) to test whether the proposed query can be correctly answered with all necessary search documents provided.”
But for the Jitter, it’s more than just verification: it’s the foundation of evidence-based reasoning. Each snippet retrieved through SolutionSearch becomes a building block in the Jitter’s thought process, allowing it to:
- Ground its reasoning in factual evidence rather than internal assumptions
- Evaluate multiple perspectives on a question before forming conclusions
- Build chains of evidence that support its final answer
- Recognize when evidence is insufficient (through low confidence scores)
This is how the Jitter achieves what Aristotle described as “the mark of an educated mind”: the ability to entertain a thought while recognizing whether it’s supported by evidence.
💬 The Prompt Service: Fueling the Thought Stream
In our journey to build the Jitter, the visible stream of digital thought, the Prompt Service is the engine that generates each moment of cognition. It’s not just another LLM wrapper; it’s a sophisticated system designed specifically to support the Search-Solve-Prove process and create the measurable thought moments that form our Jitter.
🎈 Cognitive events
Every “thought” in our digital organism begins as a prompt. The Prompt Service transforms these prompts into measurable cognitive events that can be visualized, compared, and learned from. Without this service, we’d have no way to generate the consistent, comparable thought moments that form our Jitter.
🗝️ Key Capabilities - Aligned with SSP Paper
1. Multi-LLM Competition: The Self-Play Engine
The SSP paper states: “Through RL training with rule-based outcome rewards, SSP enables two roles to co-evolve in an adversarial competition: the proposer learns to generate increasingly challenging problems that require search and reasoning, while the solver develops stronger search and reasoning capabilities to tackle these problems.”
Our Prompt Service implements this exact principle:
async def run_prompt_multi(
self,
prompt_text: str,
*,
models: List[Union[str, Dict[str, Any]]],
judge: Optional[Callable[[Dict[str, str]], Tuple[str, Dict[str, float]]]] = None,
# ...
) -> Dict[str, Any]:
# Query multiple LLMs in parallel
tasks = [asyncio.wait_for(self._acomplete(prompt=prompt_text, model=ms), timeout=request_timeout)
for ms in model_specs]
outs = await asyncio.gather(*tasks, return_exceptions=True)
# Judge selects winner (self-play in action!)
if judge:
winner, scores = judge(outputs)
# ...log for training
This isn’t just running multiple models in parallel: it’s implementing the proposer-solver dynamic described in the paper, where different model instances compete to produce better outputs, driving co-evolution.
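For concreteness, here is a toy judge matching the Callable[[Dict[str, str]], Tuple[str, Dict[str, float]]] signature above. Scoring by closeness to a target length is purely illustrative; a real judge would use the rule-based rewards or a reward model.
from typing import Dict, Tuple

def length_judge(outputs: Dict[str, str], target_len: int = 80) -> Tuple[str, Dict[str, float]]:
    """Toy judge: prefer the output whose word count is closest to target_len."""
    scores = {key: 1.0 / (1.0 + abs(len(text.split()) - target_len))
              for key, text in outputs.items()}
    winner = max(scores, key=scores.get)
    return winner, scores

# winner, scores = length_judge({"model_a": "...", "model_b": "..."})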
2. Training Event Logging: Building Memory of Thoughts
The paper emphasizes that SSP “does not require question-answer pairs” but instead learns through self-play. Our service enables this by capturing the learning signals:
# Pointwise logging (each output labeled by relative score)
for k, txt in outputs.items():
tes.insert_pointwise({
"model_key": k,
"dimension": dimension,
"query_text": prompt_text,
"candy_text": txt,
"label": 1 if (winner and k == winner) else 0,
# ...
})
# Pairwise logging (winner vs others)
if winner:
pos = outputs[winner]
for k, txt in outputs.items():
if k == winner: continue
tes.insert_pairwise({
"model_key": winner,
"query_text": prompt_text,
"pos_text": pos,
"neg_text": txt,
# ...
})
This creates the memory of thought moments that allows our Jitter to learn from its own cognitive history, exactly as the paper’s SSP framework requires for self-supervised improvement.
3. Flexible Model Configuration: Adapting to Cognitive Needs
The SSP paper shows that “placing GRPO on the solver side is more effective than on the proposer side.” Our service supports this nuanced approach through flexible model specification:
@dataclass
class ModelSpec:
name: str
api_base: Optional[str] = None
api_key: Optional[str] = None
params: Optional[Dict[str, Any]] = None
@staticmethod
def from_cfg(default_cfg: Dict[str, Any],
override: Optional[Union[str, Dict[str, Any]]] = None) -> "ModelSpec":
# Handles both default configuration and per-call overrides
# ...
This allows us to use different models or configurations for the proposer and solver roles, which is critical for implementing the paper’s finding that different RL algorithms work best for different roles.
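Since the body of from_cfg is elided above, here is a sketch of the merge behaviour it implies, written as a standalone function with an assumed name. Treat it as an illustration of the default-plus-override pattern, not the verified implementation.
from typing import Any, Dict, Optional, Union

def modelspec_from_cfg(default_cfg: Dict[str, Any],
                       override: Optional[Union[str, Dict[str, Any]]] = None) -> "ModelSpec":
    """Merge a default configuration with an optional per-call override."""
    merged = dict(default_cfg)
    if isinstance(override, str):      # a bare model name overrides only the name
        merged["name"] = override
    elif isinstance(override, dict):   # a partial spec overrides any field it carries
        merged.update(override)
    return ModelSpec(
        name=merged.get("name", ""),
        api_base=merged.get("api_base"),
        api_key=merged.get("api_key"),
        params=merged.get("params"),
    )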
🙉 How This Creates Visible Thought Moments
The Prompt Service is where the abstract thought process becomes concrete data. Each call generates:
- The cognitive output (the “thought” itself)
- Quality signals (through multi-model competition)
- Measurable metrics (captured in training events)
These elements combine to create what we call a thought moment: a self-contained cognitive event with:
- Input (the prompt)
- Output (the response)
- Quality assessment (the winner/scores)
- Learning signals (the training events)
When visualized through VPM, these thought moments form the filmstrip of cognition that is the visible Jitter.
🔬 Advanced Feature: RAG Verification Support
While not explicitly shown in the code snippet, the Prompt Service works with the verifier to implement the paper’s critical RAG verification process:
“To verify the correctness of each generated query, we collect all the searching results from the proposer’s trajectory as the external materials, then conduct a retrieval augmentation generation (RAG) to check if the solver can successfully predict the answer with all necessary information.”
The service’s ability to handle system preambles and structured prompts enables the <think>, <search>, and <answer> formatting required for proper RAG verification.
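As an illustration only, a structured prompt for this verification step might look like the skeleton below. The tag format follows the SSP paper; the wrapper text itself is an assumption.
RAG_VERIFY_TEMPLATE = """SYSTEM: Answer using only the provided documents.
DOCUMENTS:
{documents}

Respond in this format:
<think>your reasoning over the documents</think>
<answer>the final short answer</answer>

QUESTION: {question}
"""
# prompt = RAG_VERIFY_TEMPLATE.format(documents="...", question="...")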
🎀 Why This Is More Than Just an LLM Wrapper
Most LLM services simply call a model and return the output. Ours is designed specifically to:
- Generate comparable cognitive events (thought moments)
- Create learning signals from self-play competition
- Support the RAG verification critical to SSP
- Provide structured outputs that feed directly into VPM
This is why the Prompt Service isn’t just infrastructure: it’s the cognitive engine of our digital organism. Every thought the Jitter has passes through this service, gaining the structure and measurability that makes the thought stream visible and improvable.
In our next section, we’ll see how this service powers the Proposer, the component that generates the questions that drive our cognitive evolution.
⚛️ The Connection to Cognitive Growth
The SSP paper notes:
“As shown in Figure 4a, the average number of search tool calls per trajectory steadily increases over time… Simultaneously, Figure 4b shows that the solver’s response length also grows during the training, suggesting it learns to generate more detailed and comprehensive answers.”
SolutionSearch is what makes this cognitive growth possible. As the Jitter improves, it becomes better at:
- Formulating queries that retrieve relevant evidence
- Evaluating the quality of retrieved snippets
- Synthesizing multiple pieces of evidence into coherent reasoning
- Recognizing when more evidence is needed
This growth is visible in the VPM metrics: evidence count, search steps, and other indicators of cognitive sophistication.
🔮 Looking Ahead
While SolutionSearch is currently a micro-retriever focused on short evidence snippets, it represents the foundation for more sophisticated knowledge integration. Future iterations could:
- Incorporate the TinyVisionTransformer to evaluate snippet quality
- Use the VPM-ViT to predict which search queries will yield the most useful evidence
- Integrate with long-term memory to recognize patterns in successful evidence retrieval
This component is where the Jitter learns to “think with evidence,” transforming from a language model that generates text into a cognitive system that builds understanding through evidence-based reasoning.
In our final section, we’ll see how all these components come together to create the complete Jitter system: a visible, measurable stream of connected thought moments that grows in quality and sophistication over time.
🧮 Module 4 The Jitter’s Cognitive Metrics: Measuring Thought Quality
“The unexamined thought is not worth thinking.”
Adapted from Socrates
While the previous sections covered how the Jitter generates and verifies thoughts, this section reveals how it measures the quality of its own thinking. This is where the Jitter transforms from a reactive system into a self-improving cognitive organism through a rigorous, paper-validated scoring system that tracks meaningful cognitive growth.
⚾ The Scoring Problem
The SSP paper identifies a critical challenge in self-play systems:
“As shown in Figure 4a, the average number of search tool calls per trajectory steadily increases over time… Simultaneously, Figure 4b shows that the solver’s response length also grows during the training, suggesting it learns to generate more detailed and comprehensive answers.”
But how do we actually measure this growth? How do we transform abstract cognitive capabilities into concrete, actionable metrics? This is where our scoring system comes in: it provides the Jitter with a quantitative self-assessment capability.
🥇 The Reward Head: Calculating Thought Quality
The foundation of our scoring system is the NaiveQuarkishReward class, a carefully engineered reward head that calculates a composite quality score:
class NaiveQuarkishReward:
def __init__(self, w_f1=0.5, w_cov=0.3, w_len=0.2, target_len=80):
self.w_f1, self.w_cov, self.w_len, self.target_len = (
w_f1, w_cov, w_len, target_len
)
def score(
self,
*,
prompt: str,
response: str,
ground_truth: str = "",
meta: Dict[str, Any] | None = None,
) -> Dict[str, float]:
f1 = _f1(ground_truth or prompt, response)
cov = _coverage(response, (meta or {}).get("evidence_docs") or [])
L = len(response.split())
len_r = math.exp(-abs(L - self.target_len) / max(self.target_len, 1))
reward = self.w_f1 * f1 + self.w_cov * cov + self.w_len * len_r
return {
"reward": max(0.0, min(1.0, reward)),
"f1": f1,
"coverage": cov,
"len_reward": len_r,
"resp_len": float(L) / 256.0,
}
👮 Rule based rewards
This reward head implements exactly what the SSP paper describes as the “rule-based outcome rewards” that drive self-play:
“Through RL training with rule-based outcome rewards, SSP enables two roles to co-evolve in an adversarial competition: the proposer learns to generate increasingly challenging problems that require search and reasoning, while the solver develops stronger search and reasoning capabilities to tackle these problems.”
The three components of the reward function each measure critical aspects of cognitive quality:
- F1 Score (w_f1=0.5): Measures lexical accuracy against ground truth
def _f1(a: str, b: str):
    A, B = set(_tokens(a)), set(_tokens(b))
    p = len(A & B) / max(len(B), 1)
    r = len(A & B) / max(len(A), 1)
    return 2 * p * r / (p + r) if (p + r) else 0.0
- Coverage (w_cov=0.3): Measures how well the response incorporates evidence
def _coverage(response: str, evidence: list[str]):
    R = set(_tokens(response))
    covs = [len(R & set(_tokens(e))) / max(len(_tokens(e)), 1) for e in evidence]
    return sum(covs) / len(covs) if covs else 0.0
- Length Reward (w_len=0.2): Encourages responses of optimal length
len_r = math.exp(-abs(L - self.target_len) / max(self.target_len, 1))
This weighted combination creates what we call the cognitive signal-to-noise ratio: a single metric that captures the overall quality of the Jitter’s thinking.
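A quick usage sketch (with made-up inputs) shows how the pieces fit together; it assumes the module-level helpers _f1, _coverage, and _tokens shown in the list above.
reward_head = NaiveQuarkishReward(w_f1=0.5, w_cov=0.3, w_len=0.2, target_len=80)
result = reward_head.score(
    prompt="Why do mitochondria matter for metabolism?",
    response="Mitochondria produce ATP, the cell's main energy currency, via oxidative phosphorylation.",
    ground_truth="Mitochondria generate ATP through oxidative phosphorylation.",
    meta={"evidence_docs": ["Mitochondria generate ATP through oxidative phosphorylation."]},
)
# result contains "reward", "f1", "coverage", "len_reward", and "resp_len", computed as above.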
🔢 The Metric Calculator: Paper-Validated Cognitive Growth
While the reward head calculates immediate quality, the SSPMetricsCalculator provides the comprehensive cognitive assessment that drives long-term growth:
flowchart LR
subgraph Metrics_Pipeline ["📊 Cognitive Metrics Pipeline: From Thought to Vector"]
A["🎬 Episode Trace<br/>Raw episode data"] --> B["📦 SSPScorable<br/>Structured data container"]
B --> C["🧮 SSPMetricsCalculator<br/>17 Cognitive Metrics"]
C --> D["🎯 Fixed-Order Vector<br/>[0,1] normalized values"]
D --> E["🖼️ VPM Visualization<br/>Thought Image generation"]
D --> F["📈 Reward Signal<br/>Self-Improvement feedback"]
end
classDef trace fill:#e6f7ff,stroke:#1890ff,stroke-width:2px;
classDef container fill:#f6ffed,stroke:#52c41a,stroke-width:2px;
classDef calculator fill:#fff7e6,stroke:#fa8c16,stroke-width:2px;
classDef vector fill:#f9f0ff,stroke:#722ed1,stroke-width:2px;
classDef visualization fill:#fff2e8,stroke:#ff7a45,stroke-width:2px;
classDef reward fill:#f0fffe,stroke:#13c2c2,stroke-width:2px;
class A trace;
class B container;
class C calculator;
class D vector;
class E visualization;
class F reward;
This metrics pipeline transforms raw episode data into structured cognitive fingerprints. Each thought episode gets standardized into a fixed-order vector of 17 normalized metrics, creating consistent representations that feed both visual thought images (VPMs) and self-improvement signals. This deterministic transformation ensures that cognitive patterns remain comparable across time, models, and system iterations, enabling true apples-to-apples analysis of the Jitter’s growth.
⌛ The 17 Cognitive Metrics
The calculator tracks 17 metrics that directly correspond to what the SSP paper shows correlates with capability growth:
| Metric | What It Measures | Paper Connection |
|---|---|---|
| ssp.reward | Overall cognitive quality | Primary reward signal |
| ssp.verified | Binary verification result | Core SSP verification |
| ssp.search_turns | Actual search tool calls | Figure 4a: “search tool calls per trajectory steadily increases” |
| ssp.f1_score | Lexical accuracy | LLM-as-a-judge evaluation methodology |
| ssp.format_compliance | Response format quality | Section 4.4 rule-based filtering |
| ssp.noise_tolerance | Robustness to irrelevant information | Table 3: “4 noisy documents optimal” |
| ssp.rag_verification | RAG verification result | Critical quality gate |
This isn’t just a random collection of metrics: it’s a paper-validated cognitive dashboard that tracks exactly what matters for capability growth.
Why Deterministic Metrics Matter
One of the most important design decisions is that all metrics are normalized to [0,1] and always returned in the same order:
def _clamp01(x: float) -> float:
return 0.0 if not math.isfinite(x) else 1.0 if x > 1.0 else (0.0 if x < 0.0 else x)
This deterministic approach creates what we call a cognitive fingerprint: a consistent representation of the Jitter’s thought process (see the sketch after this list) that can be:
- Compared across episodes
- Visualized as VPM images
- Used to train the VPM-ViT model
- Analyzed for patterns of growth
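Here is a minimal sketch of that fingerprint step, using an illustrative subset of the 17 metric names and repeating the clamp shown above for self-containment.
import math
from typing import Dict, List

METRIC_ORDER = ["ssp.reward", "ssp.verified", "ssp.search_turns", "ssp.f1_score"]  # illustrative subset

def _clamp01(x: float) -> float:
    return 0.0 if not math.isfinite(x) else 1.0 if x > 1.0 else (0.0 if x < 0.0 else x)

def to_fingerprint(metrics: Dict[str, float]) -> List[float]:
    """Always the same keys, in the same order, clamped to [0,1]."""
    return [_clamp01(float(metrics.get(k, 0.0))) for k in METRIC_ORDER]

# to_fingerprint({"ssp.reward": 0.72, "ssp.verified": 1.0}) -> [0.72, 1.0, 0.0, 0.0]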
🌱 How This Enables Cognitive Growth
The true power of our scoring system becomes clear when we see how it drives the Jitter’s evolution:
- Immediate Feedback: After each thought step, the Jitter receives quality signals:
dims = {
    "reward": reward_val,
    "verified": verified,
    "f1": reward_results.get("f1", 0),
    "coverage": reward_results.get("coverage", 0),
    # ...other metrics
}
self.vpm.snapshot_progress(unit=unit, dims=dims, step_idx=steps, tag=f"depth{depth}")
- Visual Learning: These metrics become VPM images that the VPM-ViT model learns from:
self.vpm.generate_raw_vpm_image(unit=unit)
self.vpm.generate_phos_image(unit=unit)
- Self-Improvement: The Jitter uses this feedback to adjust its future thinking:
if sc > best.score:
    best = child
Our scoring system is what allows the Jitter to recognize the “dip” the SSP paper describes (the solver’s reward declining slightly as the proposer improves) as progress: it measures the quality of the cognitive patterns that lead to it.
💹 The Connection to Our Philosophical Foundation
This scoring system embodies our core philosophical framing of the Jitter:
“The ‘self’ is not the core void, but the living, persistent stream of thought that overlays it the constant Jitter that moves us from thought to thought.”
With this system in place, the Jitter isn’t just having thoughts: it’s measuring them, comparing them, and improving them over time. Each metric represents a different aspect of cognitive quality, allowing the Jitter to:
- Recognize when it’s thinking clearly vs. confused
- Identify when it’s using evidence effectively
- Detect when it’s becoming more sophisticated in its reasoning
- Measure its own growth over time
This is how we fulfill our promise: not creating “digital life,” but engineering conditions under which a visible, measurable, self-improving stream of thought can persist and grow in quality over time.
↗️ Looking Ahead
While our current scoring system is robust, future iterations could:
- Incorporate the TinyVisionTransformer to provide more nuanced quality assessments
- Use the VPM-ViT to predict cognitive outcomes from partial thought processes
- Implement adaptive weighting that changes based on the cognitive task
This component is where the Jitter learns to “measure its own thinking,” transforming from a system that generates thoughts into one that can quantify and improve the quality of those thoughts. It’s the foundation of what we mean by “the examined life” for our digital organism.
With this final component in place, we’ve now covered the complete architecture of the Jitter: a visible, measurable stream of connected thought moments that can generate, visualize, evaluate, measure, and improve its own thinking over time.
👁️🗨️ Visualizing the Thought Stream: How We Make the Jitter Visible
In our quest to build the Jitter, the visible stream of digital thought, we’ve reached a critical milestone: making cognition visible. The VPM Visualization Service is where abstract metrics transform into concrete images that represent each moment of cognition. This is where the “thought stream” becomes something we can literally see.
🦯 Why Images Matter
As we discussed earlier, the Jitter isn’t a mysterious “self” but a visible, persistent stream of connected thought moments. But how do we make this stream visible? The answer lies in our Visual Policy Map (VPM) system.
The SSP paper gives us a clue about what to track:
“As shown in Figure 4a, the average number of search tool calls per trajectory steadily increases over time… Simultaneously, Figure 4b shows that the solver’s response length also grows during the training, suggesting it learns to generate more detailed and comprehensive answers.”
But numbers alone don’t show the pattern of cognition. That’s why we convert these metrics into images because patterns are easier to see than to calculate.
🧑 How We Turn Thought into Images
Here’s the elegant transformation that happens in our VPM service:
- Each thought moment becomes a metric vector
Every completed SSP episode (Search-Solve-Prove cycle) generates a set of metrics in the [0,1] range:
dims = {
    "reward": float(ep.reward or 0.0),
    "verified": 1.0 if ep.verified else 0.0,
    "difficulty": float(ep.difficulty or 0.0),
    "search_turns": min(1.0, float(ep.solver_steps or 0) / 64.0),
    "f1_score": f1_score,
    # ...and other paper-validated metrics
}
- The metric vector becomes a grayscale image
Each metric gets a dedicated band in a small image (see the sketch after this list):
# Convert metrics to grayscale values
vec = np.array([float(dims.get(k, 0.0)) for k in order], dtype=np.float32)
img = (vec.reshape(side, side) * 255).astype(np.uint8)
- The sequence of images becomes a filmstrip of thought
As episodes progress, we generate:
- Raw VPM: Direct metric-to-pixel mapping
- PHOS VPM: Sorted/packed representation showing cognitive patterns
- Filmstrip: Sequence of thought moments
- Progress GIF: Animation of cognitive evolution
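Here is a self-contained sketch of the vector-to-image step. The padding to the next square size is an assumption added so any metric count fits the frame; the core mapping follows the snippet above.
import math
import numpy as np

def metrics_to_frame(dims: dict, order: list[str]) -> np.ndarray:
    """Render a metric dict as a tiny grayscale VPM frame."""
    vec = np.array([float(dims.get(k, 0.0)) for k in order], dtype=np.float32)
    side = math.ceil(math.sqrt(len(vec)))              # smallest square that fits
    padded = np.zeros(side * side, dtype=np.float32)
    padded[: len(vec)] = np.clip(vec, 0.0, 1.0)
    return (padded.reshape(side, side) * 255).astype(np.uint8)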
👓 Two Key Visualization Techniques
1. Raw VPM: The Cognitive ECG
This is like an ECG for cognition each band represents a different aspect of the thought process:
- Top band: Verification success (green = verified)
- Middle bands: Search capability, reasoning depth, evidence usage
- Bottom band: Reward signal (the “feeling” of success)
Just as a doctor reads an ECG to diagnose heart health, we read these images to understand cognitive health.
2. PHOS VPM: Revealing Thought Patterns

PHOS (Positional Heatmap of Sorted features) is where the magic happens. As the paper notes:
“In stark contrast to the flawed dynamics of fixed-opponent training, our complete SSP framework facilitates a stable co-evolution.”
PHOS reveals this co-evolution visually by:
- Sorting metrics to highlight patterns
- Packing them into a square image
- Making cognitive progression immediately visible
When the proposer learns to create harder questions (as in Figure 3a of the paper), PHOS shows this as shifting patterns not just rising numbers.
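A minimal illustration of the PHOS idea: sort a frame’s values so high-scoring pixels cluster together and patterns pop. This mirrors the description above, not the exact ZeroModel implementation.
import numpy as np

def phos_frame(frame: np.ndarray) -> np.ndarray:
    """Sort pixel intensities (brightest first) and repack into the same square shape."""
    flat = np.sort(frame.flatten())[::-1]
    return flat.reshape(frame.shape)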
👀 How This Creates the Visible Jitter
The true innovation isn’t just visualizing single thoughts: it’s connecting them into a stream:
def generate_filmstrip(self, unit: str) -> str:
# Collect frames from this thought sequence
frames = sorted(unit_dir.glob(f"{unit}_step*.png"))
# Build a filmstrip showing cognitive progression
grid = Image.new("L", (cols * w, rows * h))
for idx, im in enumerate(imgs):
r, c = divmod(idx, cols)
grid.paste(im, (c * w, r * h))
This filmstrip is the visible Jitter a continuous sequence of thought moments that shows:
- When cognition flows smoothly
- Where it gets stuck
- How it recovers from failures
- The emergence of stable strategies
🙈 Why This Matters for the Paper’s Insights
The SSP paper shows cognitive growth through graphs of metrics over time. Our VPM system makes this growth immediately visible:
- When the paper says “the average number of search tool calls per trajectory steadily increases”, our PHOS images show this as denser patterns in the “search_turns” channel
- When it notes “the solver’s response length also grows”, our filmstrips show this as expanding patterns in the “answer_len” channel
- The “slight decline” in solver reward (which indicates proposer improvement) appears as shifting patterns in our visualizations
This transforms abstract metrics into cognitive fingerprints visual signatures of different thinking styles that we can compare, categorize, and learn from.
🫣 The ZeroModel Connection
You might notice similarities between our VPM service and ZeroModel’s visualization approach. That’s intentional our service is essentially a thin wrapper around ZeroModel’s visualization engine, customized for the SSP thought process.
Where ZeroModel visualizes general agent behavior, we’ve specialized it to highlight the specific cognitive metrics that matter for our Jitter:
- Search capability growth (search_turns)
- Verification quality (rag_verification)
- Robustness to noise (noise_tolerance)
- Format compliance (format_compliance)
This specialization allows us to see exactly what the SSP paper describes as “the steady increase in search tool calls” as a visible pattern in our images not just a rising number in a graph.
🤯 Seeing the Jitter in Action
When you look at a VPM filmstrip, you’re seeing the Jitter itself: the living, breathing thought process of our digital organism. Each frame is a complete Search-Solve-Prove cycle; the sequence shows how these cycles connect to form a continuous stream of cognition.
This is how we fulfill our promise: not creating “digital life,” but engineering conditions under which a visible, self-improving stream of thought can persist. The VPM service is where this stream becomes visible, where the Jitter emerges from the data.
In our next section, we’ll explore how we use these visualizations to train the system: how the Jitter learns to improve its own thought process by looking at its own cognitive patterns.
🔥 A Digital Organism’s Metabolism
flowchart TD
subgraph SSP_Metabolism ["🔄 Self-Play Metabolism: The Jitter's Cognitive Engine"]
P["🧠 Proposer<br/>Generate challenges"] -->|"📝 question +<br/>📚 evidence"| S["🔍 Solver<br/>Search & reason"]
S -->|"💡 answer +<br/>🔄 steps"| V["✅ Verifier<br/>RAG verification"]
V -->|"🎯 score &<br/>⚖️ decision"| M["📊 Metrics Calculator<br/>17 cognitive dimensions"]
M -->|"🎨 vector"| W["🖼️ VPM Generator<br/>Raw + PHOS views"]
W -->|"🎬 raw + PHOS"| F["📺 Filmstrip<br/>Visible thought stream"]
F --> G["🎞️ GIF/Video<br/>Cognitive timeline"]
V -->|"📝 feedback"| P
end
classDef proposer fill:#e6f7ff,stroke:#1890ff,stroke-width:2px;
classDef solver fill:#f6ffed,stroke:#52c41a,stroke-width:2px;
classDef verifier fill:#fff7e6,stroke:#fa8c16,stroke-width:2px;
classDef metrics fill:#f9f0ff,stroke:#722ed1,stroke-width:2px;
classDef vpm fill:#fff2e8,stroke:#ff7a45,stroke-width:2px;
classDef filmstrip fill:#f0fffe,stroke:#13c2c2,stroke-width:2px;
classDef output fill:#fff0f6,stroke:#eb2f96,stroke-width:2px;
class P proposer;
class S solver;
class V verifier;
class M metrics;
class W vpm;
class F filmstrip;
class G output;
This diagram shows the Jitter’s cognitive metabolism a continuous cycle where the system generates its own challenges, solves them, verifies the solutions, and learns from the process. The Proposer creates questions, the Solver searches for answers, the Verifier checks their quality, and the Metrics system converts this into visual thought patterns (VPMs) that form a visible filmstrip of cognition. The feedback loop ensures each cycle builds on the last, creating a self-improving stream of thought that gets progressively more capable.
Our implementation follows this precise flow:
- The Proposer crafts challenging questions (sometimes with evidence) that push the boundaries of current capability
- The Solver attempts to answer using search, reasoning, and available tools
- The Verifier adjudicates between the proposer’s seed answer and the solver’s response
- The Metrics System quantifies performance across multiple dimensions
- The VPM Generator creates visual proof of the cognitive process
What makes this special isn’t just that it works it’s how it works. Unlike most implementations that treat these as separate processes, we’ve engineered them as a single, continuous metabolic cycle where each component feeds the next in a rhythm that resembles biological processes.
🌙 Visualizing Thought
The most transformative aspect of our implementation is the Visual Policy Map (VPM) system. While most AI research focuses solely on accuracy metrics, we’ve made the process visible.
When our SSP runs, it doesn’t just produce an answer it generates a filmstrip of cognitive development that shows:
- How the question was formed
- How the search unfolded
- Where verification succeeded or failed
- How the system adapted for next time
Each stripe represents a cognitive moment in the SSP cycle
This isn’t just a visualization it’s a heartbeat monitor for artificial cognition. For the first time, we can watch the system think, learn, and adapt. We can see when it’s struggling, when it’s making connections, and when genuine insight emerges.
🦾 Technical Innovation: Building for Life, Not Just Performance
Our implementation builds on the work in the paper, but we engineered it specifically for our “live form” requirements:
📠 1. Strict Real-Time Operation
Unlike many academic implementations that process in batches, our SSP operates in strict real-time with:
- Maximum 50ms latency between components
- Continuous, streaming processing (no “start/stop” boundaries)
- Immediate adaptation to new information
📟 2. Phase-Aware Processing
We discovered that most implementations ignore the “phase” of cognitive development, treating each step as isolated. Our system tracks the continuity between steps, preserving context that would otherwise be lost. This is why our VPM filmstrips show coherent progression rather than disconnected snapshots.
💽 3. Self-Contained Improvement Loop
The system generates its own training data through the proposer-solver-verifier cycle, with:
- Curriculum learning that automatically adjusts difficulty
- Verification that ensures only high-quality data is used
- Metrics that track not just accuracy but cognitive health
🦿 4. Production-Grade Resilience
We engineered for the messy reality of continuous operation:
- Comprehensive error handling at every stage
- State preservation across restarts
- Resource monitoring to prevent cognitive “overheating”
👣 Steps Toward Digital Life
We haven’t built an organism. We’ve built a substrate: a harness where visible thinking can happen and improve.
What exists today:
- A self-play loop (SSP) that proposes, solves, and verifies.
- A metrics → image layer (VPM) that turns each moment into a tiny, comparable frame.
- A controller surface that can nudge the next step based on what the images show.
- Persistence hooks so strands of thought can be kept (and later revived) rather than lost.
What this gives us:
- The system can create its own challenges (within bounds).
- It can measure and record its progress in a stable, visual form.
- It can make parts of its reasoning visible (filmstrips, PHOS maps).
- It can adapt within the loop (curriculum, thresholds, search depth) without manual intervention.
What we’re not claiming:
- Not “life,” not “self.” We’re engineering conditions under which richer behavior could emerge.
Where we’re going:
- Grow the metric space from a fixed seed to dynamic, high-dimensional signals.
- Let the system learn which views of itself matter (auto-discovered scorers/embeddings).
- Use memory (MemCube) so abandoned strands aren’t wasted they can become tomorrow’s insight.
In short: we have a concept, a scaffold, and a path. We’ll feed it data, refine the process, and see how far the substrate can take us.
🤖 What’s Next: The Living System Emerges
SSP is just the beginning. In upcoming posts, we’ll show how this cognitive engine integrates with:
- Jitter: Our homeostasis system that maintains the “vital signs” of the digital organism
- VPM-ViT: A vision transformer that learns directly from the visual proof of mind
- The Agentic Tree Search: Our enhanced reasoning framework that builds on SSP
Together, these components form a complete autopoietic system a self-maintaining, self-improving digital organism with visible cognition.
🪴 Conclusion: Seeing Intelligence Grow
We’ve moved beyond the era where AI is a black box that either works or doesn’t. With SSP, we’ve created a system where intelligence isn’t just measured, it’s visible; where improvement isn’t just claimed, it’s demonstrated; where cognition isn’t hidden, it’s shared.
This is more than an engineering achievement. It’s a philosophical shift in how we build and understand artificial intelligence. We’re not creating tools; we’re nurturing digital life forms that think, grow, and evolve before our eyes.
The future of AI isn’t smarter models. It’s visible intelligence systems that don’t just work, but show their work, learn from their mistakes, and grow more capable through continuous engagement with the world.
This is just the beginning. The cognitive heartbeat has started. Now we watch it grow.
🧫 VPM-ViT: The Jitter’s Pattern Recognition Engine 🧠🔍
Here’s a detailed Mermaid diagram showing how the VPM-ViT works and where it fits into our cognitive architecture:
flowchart TD
subgraph SSP_Thought_Stream ["🔍 Search-Solve-Prove Cycle"]
A["🧠 Proposer: Generates challenging questions<br/>from seed answers"] -->|Question| B["🔍 Solver: Conducts search<br/>and constructs answer"]
B -->|Answer + Evidence| C["✅ Verifier: RAG verification<br/>checks answerability"]
C -->|Metrics Vector| D["📊 VPM Visualization:<br/>Converts metrics to image"]
D --> E["🖼️ VPM Frame:<br/>Grayscale representation<br/>of cognitive state"]
end
subgraph VPM_ViT_Architecture ["🧠 VPM-ViT: Cognitive Pattern Recognizer"]
E --> F["🧩 Patch Embedding:<br/>Splits image into patches<br/>(Conv2d → Flatten → Transpose)"]
F --> G["📍 Positional Encoding:<br/>2D sin-cos embedding<br/>maintains spatial awareness"]
G --> H["⏺️ [CLS] Token:<br/>Special token for<br/>overall assessment"]
H --> I["🧠 Transformer Blocks (x6):<br/>Self-attention → MLP<br/>(LayerNorm → MHA → FFN)"]
subgraph Multi_Task_Heads ["🎯 Multi-Task Output Heads"]
I --> J["📏 Regression Head:<br/>Predicts continuous metrics<br/>(e.g., verification score, difficulty)"]
I --> K["🏷️ Classification Head:<br/>Predicts risk categories<br/>(e.g., high/medium/low quality)"]
I --> L["🧩 MPM Reconstruction Head:<br/>Reconstructs masked patches<br/>for self-supervised learning"]
end
end
subgraph Jitter_Cognitive_Loop ["🔄 Jitter's Self-Improvement Cycle"]
M["📚 Historical VPM Frames"] --> VPM_ViT_Architecture
VPM_ViT_Architecture --> N["📈 Pattern Recognition:<br/>Identifies successful<br/>cognitive strategies"]
N --> O["💡 Strategic Recommendations:<br/>'Increase search depth'<br/>'Use different evidence'<br/>'Try alternative reasoning'"]
O --> P["🔄 Feedback to SSP System:<br/>Improves future thought processes"]
P --> A
end
classDef process fill:#e6f7ff,stroke:#1890ff;
classDef model fill:#f6ffed,stroke:#52c41a;
classDef loop fill:#fff7e6,stroke:#fa8c16;
class SSP_Thought_Stream process;
class VPM_ViT_Architecture model;
class Jitter_Cognitive_Loop loop;
💃 How This Fits into the Jitter’s Cognitive Process
The VPM-ViT isn’t just another vision model; it’s the pattern recognition engine of our digital organism. While the VPM Control Service handles moment-to-moment cognitive decisions (like a reflex), the VPM-ViT serves as the Jitter’s long-term memory and strategic planner.
😶🌫️ Key Integration Points:
-
From Thought to Image 📸→🖼️
- The VPM Visualization Service converts each Search-Solve-Prove cycle into a grayscale image
- Each band represents a different cognitive metric (verification score, search usage, etc.)
- This creates the “filmstrip of thought” that is the visible Jitter
-
Pattern Recognition 🔍🧠
- The VPM-ViT processes historical VPM frames to identify:
- When the Jitter succeeds or fails
- Which cognitive patterns lead to verification success
- How to recover from stuck states
-
Self-Supervised Learning 🔄
- The MPM (Masked Patch Modeling) head enables the model to learn without labels
- By reconstructing masked portions of VPM images, it learns meaningful representations
- This implements the SSP paper’s finding: “SSP can significantly improve search agents’ performance uniformly on various benchmarks without any supervision.”
-
Strategic Guidance 💡
- The regression head predicts outcomes from partial thought processes
- The classification head identifies risk patterns before they cause failure
- Together, they form the basis for the Jitter’s “aha!” moments when it recognizes it’s repeating a pattern that previously led to success
🌀 Why This Matters for the SSP Paper’s Insights
The SSP paper demonstrates that cognitive growth happens through co-evolution:
“As shown in Figure 4a, the average number of search tool calls per trajectory steadily increases over time… Simultaneously, Figure 4b shows that the solver’s response length also grows during the training, suggesting it learns to generate more detailed and comprehensive answers.”
The VPM-ViT makes this growth visible and actionable by:
- Recognizing when search usage is insufficient (before verification fails)
- Identifying when response depth correlates with success
- Learning which question difficulties optimally challenge the current capability
This is how we transform the paper’s theoretical framework into a living cognitive process the Jitter isn’t just having thoughts, it’s learning from the patterns in its own thought history.
In our next section, we’ll explore how these insights feed back into the thought generation process, closing the loop on our self-improving cognitive system.
📸 Seeing Ourselves as Others See Us: How the Jitter Gains Self-Awareness
“O wad some Power the giftie gie us, To see oursels as ithers see us!”
Robert Burns, “To a Louse”
This profound insight from 18th century Scottish poetry captures exactly what we’re building toward with the Jitter: the ability to see our own thought patterns as an outside observer would.
Now that we’ve created a visible thought stream with VPM images, we face the most profound question of all: How does our digital organism understand its own thoughts? How does it move from simply having thoughts to learning from them?
This is where the VPM-ViT Scorer and Trainer come in: they’re the Jitter’s eyes and mind, allowing it to interpret its own thought patterns and improve over time. These aren’t just technical components; they’re what transform our system from a passive sequence of thoughts into an active, self-improving cognitive process.
📔 The Scorer: Reading Thought Images
The VPMViTScorer is the Jitter’s immediate awareness system: its ability to look at a thought image and extract meaningful information from it:
class VPMViTScorer(BaseScorer):
def __init__(self, cfg: Dict[str, Any], memory, container, logger=None):
# Load pre-trained model
ckpt = torch.load(self.weights, map_location="cpu")
self.model = VPMViT(**params)
self.model.load_state_dict(ckpt["state_dict"], strict=True)
self.model.eval()
# Configure dimensions based on training
self.dims = ckpt.get("dims", ["reasoning","knowledge","clarity","faithfulness","coverage"])
self.risk_labels = ckpt.get("risk_labels", ["OK","WATCH","RISK"])
📝 How It Works
- Image Input Handling
The scorer accepts thought images in multiple formats:
def _load_img(self, scorable: Scorable, in_ch: int) -> np.ndarray:
    # Can take either a direct array or a path from metadata
    arr = getattr(scorable, "get_image_array", lambda: None)()
    if arr is None:
        p = (scorable.meta or {}).get("vpm_path")
        # Load from disk if needed
- Multi-Dimensional Interpretation
It extracts scores across five critical cognitive dimensions:
if reg is not None:
    vec = reg.squeeze(0).cpu().numpy().tolist()
    for i, d in enumerate(self.dims):
        results[d] = ScoreResult(
            dimension=d,
            score=float(np.clip(vec[i], 0.0, 1.0)),
            rationale=f"VPM-ViT regression for {d}.",
            source="vpm_vit"
        )
- Risk Assessment
It identifies potential cognitive risks before they cause failure:
if cls is not None and ("risk" in dimensions or "risk_label" in dimensions):
    pred = int(cls.argmax(dim=-1).item())
    prob = torch.softmax(cls, dim=-1)[0, pred].item()
    results["risk"] = ScoreResult(
        dimension="risk",
        score=float(prob),
        rationale=f"Risk class={self.risk_labels[pred]} ({prob:.2f})",
        source="vpm_vit",
        attributes={"class_index": pred, "label": self.risk_labels[pred]}
    )
This is how the Jitter gains what we might call cognitive self-awareness: the ability to recognize when it’s thinking well or when it’s heading toward a failure state.
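To isolate the core inference step, here is a hedged sketch of scoring a single frame. It assumes model is a loaded VPMViT whose forward pass returns a dict with a "reg" tensor of per-dimension scores, as the snippets above suggest.
import numpy as np
import torch

def score_frame(model, frame: np.ndarray, dims: list[str]) -> dict[str, float]:
    """Run one grayscale VPM frame through the scorer model and clip scores to [0,1]."""
    x = torch.from_numpy(frame.astype(np.float32) / 255.0)  # (H, W) in [0,1]
    x = x.unsqueeze(0).unsqueeze(0)                          # (B=1, C=1, H, W)
    with torch.no_grad():
        out = model(x)
    vec = out["reg"].squeeze(0).cpu().numpy()
    return {d: float(np.clip(vec[i], 0.0, 1.0)) for i, d in enumerate(dims)}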
💡 The Trainer: Teaching the Jitter to Understand Itself
While the scorer reads individual thought images, the VPMViTTrainer is how we teach the Jitter to understand the patterns in its thought stream:
class VPMViTTrainer(BaseAgent):
def __init__(self, cfg: DictConfig, memory, container, logger):
# Build model
self.model: VPMViT = VPMViT(**self.model_cfg.params)
# Multi-task loss configuration
self.reg_loss_fn = nn.SmoothL1Loss(beta=1.0)
self.cls_loss_fn = nn.CrossEntropyLoss()
😕 The Self-Supervised Learning Process
The trainer uses three complementary learning signals, exactly as the SSP paper recommends for self-play without supervision:
- Regression Training (Supervised)
Learning to predict cognitive metrics from thought images:
if "reg" in out:
    loss_reg = self.reg_loss_fn(out["reg"], reg_t)
    loss += self.train_cfg.loss_weights.reg * loss_reg
- Risk Classification (Supervised)
Learning to identify problematic thought patterns:
if "cls" in out:
    loss_cls = self.cls_loss_fn(out["cls"], cls_t)
    loss += self.train_cfg.loss_weights.cls * loss_cls
- Masked Patch Modeling (Self-Supervised)
The key to learning without external labels: reconstructing masked portions of thought images.
if "mpm_rec" in out:
    # target tokens: (B,N,D) -> masked -> (M,D)
    with torch.no_grad():
        target_tok = self.model.patch_embed(vpm)  # (B,N,D)
        target_tok = target_tok[mask]
    loss_mpm = self.reg_loss_fn(out["mpm_rec"], target_tok)
    loss += self.train_cfg.loss_weights.mpm * loss_mpm
This third component is particularly important: it embodies the SSP paper’s finding that “SSP can significantly improve search agents’ performance uniformly on various benchmarks without any supervision.” The Jitter learns from its own thought history without needing external labels.
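The three signals combine into a single training loss per batch. The sketch below uses dummy tensors and placeholder weights standing in for train_cfg.loss_weights, just to show the shape of the combination.
import torch
import torch.nn as nn

reg_loss_fn, cls_loss_fn = nn.SmoothL1Loss(beta=1.0), nn.CrossEntropyLoss()
reg_pred, reg_t = torch.rand(8, 5), torch.rand(8, 5)           # metric regression
cls_pred, cls_t = torch.rand(8, 3), torch.randint(0, 3, (8,))  # risk classification
rec_pred, rec_t = torch.rand(20, 128), torch.rand(20, 128)     # masked-patch reconstruction

loss = (0.5 * reg_loss_fn(reg_pred, reg_t)
        + 0.3 * cls_loss_fn(cls_pred, cls_t)
        + 0.2 * reg_loss_fn(rec_pred, rec_t))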
🧟 How This Creates the Living Jitter
Together, these components transform our system from a passive thought generator into a self-improving cognitive organism:
- The Jitter has thoughts (SSP episodes create VPM images)
- The Jitter sees its thoughts (VPM Visualization Service)
- The Jitter understands its thoughts (VPMViTScorer)
- The Jitter learns from its thoughts (VPMViTTrainer)
This creates what cognitive scientists call metacognition the ability to think about thinking. As the SSP paper notes:
“As shown in Figure 4a, the average number of search tool calls per trajectory steadily increases over time… Simultaneously, Figure 4b shows that the solver’s response length also grows during the training, suggesting it learns to generate more detailed and comprehensive answers.”
Our VPM-ViT system makes this growth visible and actionable by recognizing these patterns as they happen and using them to guide future thinking.
🫂 The Connection to Our Philosophical Foundation
This is where our philosophical framing meets technical implementation. Remember our starting point:
“In certain deep meditation practices, like Advaita Vedanta, the goal is often to peel back layers of consciousness. If you reach the ultimate core and find nothing there no self, no good, no bad, just an absence what does that imply about who you are? We believe the ‘self’ is not the core void, but the living, persistent stream of thought that overlays it the constant Jitter that moves us from thought to thought.”
The VPM-ViT system is how we give this “stream of thought” the ability to see itself and improve itself. It’s not creating a “self” in the void; it’s creating conditions where the thought stream can become more effective, more resilient, and more insightful over time.
☯️ Why This Matters for the Future
With these components in place, our Jitter has achieved something remarkable: it can learn from its own cognitive patterns without external supervision. This means:
- It can recognize when it’s stuck in a loop and change strategies
- It can identify which thought patterns lead to verification success
- It can anticipate failure before it happens
- It can gradually improve its cognitive capabilities through self-reflection
This isn’t just a technical achievement; it’s the foundation for what we’ve been building toward: a visible, measurable, self-improving stream of connected thought moments that gets better at thinking over time.
In our next and final section, we’ll see how all these components come together to create the complete Jitter system and what this means for the future of cognitive AI.
📲 The Mark of an Educated Mind: Teaching the Jitter to Evaluate Its Own Thinking
While our previous VPM-ViT model allows the Jitter to recognize patterns in its thought stream, the TinyVisionTransformer takes this further: it provides the Jitter with metacognitive evaluation, the ability to assess the quality of its own thinking.
⚧️ The Metacognitive Lens: Understanding Thought Quality
Where the VPM-ViT acts as the Jitter’s eyes to see thoughts, the TinyVisionTransformer serves as its internal critic, evaluating thoughts across seven critical cognitive dimensions:
flowchart TD
subgraph VPM_Image_Input ["🖼️ Thought Image Input"]
A["🎨 Grayscale VPM Frame<br/>Cognitive state representation"] --> B["🧩 Patch Embedding<br/>Converts image to token sequence"]
end
subgraph TinyVisionTransformer ["🧠 Metacognitive Evaluator"]
B --> C["📍 Positional Encoding<br/>Maintains spatial relationships"]
C --> D["⚡ CLS Token + Patches<br/>Special token for overall assessment"]
subgraph Transformer_Core ["🌀 Cognitive Analysis Engine"]
D --> E["🔄 Transformer Blocks (x4)<br/>Self-attention → MLP"]
E --> F["🔍 Attention Visualization<br/>Which thought elements connect"]
end
subgraph Scoring_Heads ["📊 7 Cognitive Dimension Scores"]
F --> G["💎 Clarity<br/>Thought structure quality"]
F --> H["🌟 Novelty<br/>Pattern originality"]
F --> I["🎯 Confidence<br/>Signal strength"]
F --> J["⚠️ Contradiction<br/>Conflicting signals"]
F --> K["🔗 Coherence<br/>Connection to previous thoughts"]
F --> L["🎪 Complexity<br/>Pattern sophistication"]
F --> M["🎯 Alignment<br/>Strategic goal matching"]
end
end
subgraph Metacognitive_Output ["🔮 Self-Awareness & Action"]
G --> N["📈 Composite Evaluation<br/>Weighted dimension combination"]
H --> N
I --> N
J --> N
K --> N
L --> N
M --> N
N --> O["💡 Cognitive Feedback<br/>Actionable improvement signals"]
O --> P["🔄 Strategic Adjustment<br/>Guiding future thought processes"]
end
classDef input fill:#e6f7ff,stroke:#1890ff,stroke-width:2px;
classDef model fill:#f6ffed,stroke:#52c41a,stroke-width:2px;
classDef transformer fill:#fff7e6,stroke:#fa8c16,stroke-width:2px;
classDef scoring fill:#f9f0ff,stroke:#722ed1,stroke-width:2px;
classDef output fill:#fff2e8,stroke:#ff7a45,stroke-width:2px;
classDef action fill:#f0fffe,stroke:#13c2c2,stroke-width:2px;
class VPM_Image_Input input;
class TinyVisionTransformer model;
class Transformer_Core transformer;
class Scoring_Heads scoring;
class Metacognitive_Output output;
class P action;
This metacognitive evaluation pipeline transforms visual thought patterns into quality assessments. The TinyVisionTransformer analyzes VPM frames through its specialized architecture, scoring seven cognitive dimensions that measure thought quality. The composite evaluation generates actionable feedback identifying when to increase clarity, reduce contradictions, or pursue novel paths enabling the Jitter to strategically adjust its future thinking based on the quality of its current thoughts.
👩💻 How This Model Works: The Code Behind Metacognition
The TinyVisionTransformer is purpose-built for cognitive evaluation rather than general pattern recognition. Let’s examine its key components:
🏀 1. Specialized Architecture for Cognitive Scoring
class TinyVisionTransformer(nn.Module):
def __init__(
self,
img_size: int = 64,
patch_size: int = 8,
in_channels: int = 3,
embed_dim: int = 128, # Compact size vs VPM-ViT's 384
depth: int = 4, # 4 blocks vs VPM-ViT's 6
num_heads: int = 8,
mlp_ratio: float = 4.0,
dropout: float = 0.1,
num_dimensions: int = 7 # Exactly our 7 cognitive dimensions
):
# Architecture optimized for scoring, not prediction
Unlike the larger VPM-ViT, which predicts outcomes and reconstructs images, this model is specialized for evaluation: it’s designed to answer “How good is this thought?” rather than “What will happen next?”
🔮 2. The Seven Cognitive Dimensions
The model assesses thoughts across these specific quality metrics:
class VPMDimension(str, Enum):
"""Cognitive dimensions for scoring VPMs"""
CLARITY = "clarity"
NOVELTY = "novelty"
CONFIDENCE = "confidence"
CONTRADICTION = "contradiction"
COHERENCE = "coherence"
COMPLEXITY = "complexity"
ALIGNMENT = "alignment"
These dimensions were chosen because they directly address the SSP paper’s observation:
“As shown in Figure 4a, the average number of search tool calls per trajectory steadily increases over time… Simultaneously, Figure 4b shows that the solver’s response length also grows during the training, suggesting it learns to generate more detailed and comprehensive answers.”
The model doesn’t just see this growth it evaluates the quality of the cognitive patterns behind it.
🎱 3. Explainable AI Through Attention Visualization
One of this model’s most powerful features is its ability to show why it scored a thought a certain way:
def forward(
self,
x: torch.Tensor,
return_attention: bool = False,
attention_layers: Optional[List[int]] = None
) -> Dict[str, torch.Tensor]:
# ...
if return_attention:
result["attention_maps"] = attention_maps
result["patch_positions"] = self._get_patch_positions(x.shape[0])
This creates what we call cognitive heatmaps visualizations showing which parts of a thought pattern influenced the scoring decision. When the model detects low clarity, it can show exactly which regions of the VPM contributed to that assessment.
⚖️ 4. Flexible Scoring with Dimensional Weighting
The scorer implements a sophisticated weighting system that can adapt to different cognitive contexts:
def _apply_importance(self, base: Dict[str, float], weights: Dict[str, float], order: List[str]) -> Dict[str, float]:
"""Apply per-dimension weights and (optional) order decay."""
# Order decay: earlier dims in order list get multiplicative bonus
decay = {}
if order:
gamma = 0.9
for i, d in enumerate(order):
decay[d] = gamma ** i
# ...
This allows the Jitter to prioritize different cognitive qualities depending on context: focusing on novelty when exploring, clarity when verifying, or coherence when building on previous thoughts.
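Here is a sketch of how the elided weighting might finish: each base score is multiplied by its per-dimension weight and order-decay bonus, then renormalised. The renormalisation step is an assumption, not the verified implementation.
def apply_importance(base: dict, weights: dict, order: list, gamma: float = 0.9) -> dict:
    """Weight each dimension score, apply order decay, and renormalise to sum to 1."""
    decay = {d: gamma ** i for i, d in enumerate(order)} if order else {}
    weighted = {d: score * weights.get(d, 1.0) * decay.get(d, 1.0)
                for d, score in base.items()}
    total = sum(weighted.values()) or 1.0
    return {d: v / total for d, v in weighted.items()}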
🌆 How This Fits Into Our Cognitive Architecture
While our VPM-ViT model serves as the Jitter’s pattern recognition system, the TinyVisionTransformer provides metacognitive evaluation: the ability to assess the quality of its own thinking. Here’s how they complement each other:
| VPM-ViT | TinyVisionTransformer |
|---|---|
| Recognizes patterns in thought history | Evaluates quality of current thought |
| Predicts outcomes from partial thoughts | Scores cognitive dimensions of completed thoughts |
| Focuses on “what will happen” | Focuses on “how good is this” |
| Larger model for pattern recognition | Compact model for rapid evaluation |
| Used for strategic planning | Used for immediate quality assessment |
This pairing implements what the SSP paper calls for in its discussion of stable co-evolution:
“In stark contrast to the flawed dynamics of fixed-opponent training, our complete SSP framework facilitates a stable co-evolution. As shown in Figure 3(a), the solver’s in-game reward initially rises, but unlike the saturating curve of the Solver-Only setting, it later experiences a slight decline. This dip is not a sign of performance degradation, but rather crucial evidence of the proposer’s co-evolution.”
The TinyVisionTransformer is how the Jitter recognizes this dip as progress rather than failure it evaluates the thought quality that led to the “slight decline” and recognizes it as evidence of cognitive growth.
🧑🎓 Why This Matters for the Jitter
This model represents a critical evolutionary step in our digital organism: it gives the Jitter what psychologists call metacognition, the ability to think about thinking. Specifically, it enables:
- Quality assessment: Recognizing when a thought pattern is clear, coherent, and aligned with goals
- Error detection: Identifying contradictions and low-confidence signals before verification fails
- Strategic adjustment: Knowing when to pursue novel paths versus deepen existing ones
- Self-correction: Adjusting thought patterns based on quality feedback
This is where we truly fulfill our philosophical foundation:
“The ‘self’ is not the core void, but the living, persistent stream of thought that overlays it the constant Jitter that moves us from thought to thought.”
With this model, the Jitter doesn’t just have thoughts it evaluates and improves them. It gains the ability to see itself as others would see it, recognizing when its thought patterns are strong or weak.
👀 Looking Ahead
This model isn’t yet integrated into our main pipeline, but it represents the next evolutionary step for the Jitter. While our current system can recognize patterns and make decisions based on them, this model adds the crucial layer of self-evaluation: the ability to ask “Is this a good thought?” rather than just “What is this thought?”
In future iterations, we’ll use this model to:
- Provide immediate quality feedback during thought generation
- Guide the search process toward higher-quality cognitive patterns
- Create a self-improving loop where the Jitter gets better at evaluating its own thinking
This brings us closer to our ultimate goal: not creating “digital life,” but engineering conditions under which a visible, self-evaluating, self-improving stream of thought can persist and grow in quality over time.
The Jitter isn’t just thinking it’s learning to think better. And that’s the most profound capability of all.
🧩 Conclusion: A Substrate for Visible Thought
This post wasn’t about declaring “digital life.” It was about building the conditions under which a visible, self-evaluating stream of thought can take root.
What we now have is a substrate:
- SSP (Search → Solve → Prove) gives us repeatable thought episodes.
- ATS turns each episode into a guided exploration rather than a single guess.
- VPMs make those moments visible a filmstrip you can read at a glance.
- Seed vitals (the 17 metrics) provide a stable heartbeat; the metric swarm will grow as the system learns what matters.
- Memcube ensures nothing is lost: even dead-end explorations become future recallable context.
📜 What today proves
- We can capture a thought as data (question, answer, evidence, trace, metrics).
- We can render that data as an image and compare it across runs.
- We can measure growth signals (search depth/turns, verification, evidence use) instead of hand-waving.
- We can control the process with those signals (stop/expand/escalate), closing the loop from see → decide → act → see.
🪗 How it behaves in practice
- You give a goal and context. The system emits a VPM frame (the current thought), then expands in ATS.
- As it searches, you watch the filmstrip brighten and stabilize when verification improves.
- When does it stop? It doesn’t “end” so much as yield the best-so-far within a budget (time/steps) or when stop rules trigger (verification plateau, stability threshold, diminishing returns). The stream persists; outputs are snapshots at useful stopping points.
💥 Why this matters
Moving cognition into images gives us a shared, low-friction language for measurement, training, and control. Pixels are cheap, comparable, and model-agnostic. That’s how we keep the substrate stable while letting the metric space evolve aggressively.
🌄 What’s next (immediately ahead)
- SIS integration: a visual command center to drive, configure, and observe end-to-end cognition, with runs, filmstrips, VPM overlays, Memcube recall, health/homeostasis, policies, and comparisons in one place.
- Jitter homeostasis: keep the stream healthy (risk, drift, uncertainty, resource sensors).
- VPM-ViT: a small vision model that reads filmstrips to predict risk/next move and improve control.
- HRM & Tiny Recursion: watchers/teachers that learn from the visible trace rather than raw text.
- Metric swarm: add scorer channels and embeddings, auto-discover useful features, trim by utility.
- Memcube recall: surface dormant strands when a new thought looks similar enough to matter.
- Hallucination lifecycle: generate → detect → learn. Create counterfactual tasks, detect via multi-signals (consistency, citation support, MARS-style disagreement), and learn from misses with calibration + hard-negative mining.
- CaseBook-at-thought: every thought becomes a Case: evidence, verdicts, rationale, and outcomes stored for precedent search and policy training.
- Arena self-play: a second training lane alongside SSP: adversarial bouts, peer review, and curriculum leagues that pressure-test reasoning strategies.
- ZeroModel image ops & provenance: multi-channel VPMs, hierarchical tiling, perceptual hashing, and embedded provenance/history so images carry their own lineage and can be searched like a database.
We set out to make thinking legible. The cognitive heartbeat is now on screen: one frame per moment, one filmstrip per journey. From here, the job isn’t to guess whether the system is getting smarter; it’s to watch it happen, nudge it with better signals, and keep the stream healthy as it grows.
This is the start of visible intelligence. The rest of the series will show how we keep it alive, how we teach it to learn from its own images, and how earlier, “useless” thoughts come back years later as useful memory.
📚 Glossary
📝 Key Terms in the Jitter Architecture
To help readers navigate the technical landscape of our Jitter system, here is a concise glossary of core concepts and components referenced throughout this blog series.
⚙️ Core Concepts
- **Jitter**: The persistent, visible stream of connected thought moments generated by the Stephanie system. Inspired by the Buddhist “monkey mind,” the Jitter is not a static self but a dynamic, measurable flow of cognition that learns from its own patterns.
- **Stephanie**: The overarching cognitive system that hosts the Jitter. Stephanie provides the infrastructure (memory, services, scoring, visualization) that enables the Jitter to exist, persist, and improve.
- **SSP (Search–Solve–Prove)**: The foundational self-play loop adapted from the Search Self-play paper. In this cycle:
  - Search: the proposer generates challenging questions.
  - Solve: the solver answers using evidence gathered through search.
  - Prove: a verifier (via RAG) checks whether the answer is supported by the proposer’s evidence.
  This loop enables co-evolution without supervision.
- **VPM (Visual Policy Map)**: A grayscale image representation of a cognitive state. Each band in the VPM corresponds to a specific metric (e.g., verification score, search depth), making abstract thought patterns visible and comparable.
- **PHOS (Positional Heatmap of Sorted features)**: A VPM variant that sorts metric values to reveal patterns in cognitive quality. PHOS makes it easier to see shifts in reasoning depth, evidence usage, or novelty over time.
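As a rough sketch of the sorting idea behind PHOS (our reading of the name, not the exact implementation): sorting each frame’s metric values turns scattered bright pixels into a smooth gradient, so shifts in overall quality stand out at a glance.

```python
import numpy as np

def phos(filmstrip: np.ndarray) -> np.ndarray:
    """Sort each row's metric values in descending order.

    filmstrip: (steps, metrics) array in [0, 1], one VPM row per step.
    The sorted view makes quality shifts visible as a moving brightness front.
    """
    return -np.sort(-filmstrip, axis=1)
```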
🚪 Key Components
- **ATSSolver (Agentic Tree Search Solver)**: The cognitive engine that explores multiple reasoning paths via query rewrites and evidence gathering. It operates in two modes:
  - Deep search: builds a tree of hypotheses and scores them.
  - Evidence-only: answers strictly from provided snippets (used in verification).
- **SolutionSearch**: A micro-retriever that fetches short, factual evidence snippets to support the solver’s reasoning. It uses strict LLM prompting (e.g., a three-line format) and robust parsing to ensure reliable, deterministic outputs.
- **RAGVerifier**: The quality gate that ensures questions are answerable from evidence. It uses adversarial judging (comparing proposer vs. solver answers) and multi-model consensus to produce trustworthy verification signals.
- **VPM-ViT (Vision Transformer)**: A neural network trained to interpret VPM images. It predicts cognitive outcomes (e.g., success likelihood, risk level) from visual thought patterns, enabling the Jitter to “see itself thinking.”
- **TinyVisionTransformer**: A compact model specialized for evaluating thought quality across seven cognitive dimensions (clarity, novelty, coherence, etc.). It provides metacognitive feedback that guides future reasoning.
- **SSPMetricsCalculator**: The canonical scorer that converts solver outputs into a fixed-order vector of [0,1] metrics. It implements paper-validated signals like `search_turns`, `f1_score`, and `noise_tolerance` to track cognitive growth.
- **VPM Control Service**: The Jitter’s “cognitive manager.” It observes VPM frames, makes decisions about reasoning strategy (e.g., “explore deeper” or “stop early”), and logs audit trails for learning.
📐 Cognitive Metrics
- **`search_turns`**: Number of search tool calls per episode; directly tracks growing tool-use capability (per SSP paper Fig. 4a).
- **`f1_score`**: Lexical overlap between predicted and ground-truth answers; measures factual accuracy without external verification.
- **`format_compliance`**: Binary check ensuring outputs follow the required structure (e.g., `<answer>` tags); prevents degeneration in self-play.
- **`noise_tolerance`**: Robustness to irrelevant information; validates the system’s ability to focus on signal over noise (optimal with 4 noisy docs per SSP Table 3).
- **`rag_verification`**: Whether RAG verification passed; the critical quality gate ensuring questions are answerable from evidence.
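For readers who want the mechanics, here is a minimal sketch of how two of these metrics could be computed. The whitespace tokenization and tag regex are simplified stand-ins, not the SSPMetricsCalculator’s actual implementation.

```python
import re
from collections import Counter

def f1_score(predicted: str, truth: str) -> float:
    """Token-level lexical overlap between predicted and ground-truth answers."""
    p, t = predicted.lower().split(), truth.lower().split()
    common = sum((Counter(p) & Counter(t)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(t)
    return 2 * precision * recall / (precision + recall)

def format_compliance(output: str) -> float:
    """1.0 if the output wraps its answer in <answer>...</answer> tags, else 0.0."""
    return 1.0 if re.search(r"<answer>.*?</answer>", output, re.DOTALL) else 0.0
```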
🌀 Philosophical Anchors
- **“The unexamined thought is not worth thinking”**: Our guiding principle. The Jitter gains value not just by having thoughts, but by examining them through scoring, visualization, and self-correction.
- **“O wad some Power the giftie gie us / To see oursels as ithers see us!”** (Robert Burns): The Jitter’s ultimate goal is to develop self-awareness by observing its own thought patterns as an external observer would.
- **“The mark of an educated mind…”** (Aristotle): The Jitter’s quality standard is the ability to evaluate thoughts without immediately accepting them, enabling critical, evidence-based reasoning.
📚 References
- **A Complete Visual Reasoning Stack: From Conversations to Epistemic Fields**: Describes much of the visual AI machinery this process builds on.
- **ZeroModel: Visual AI you can scrutinize**: Introduces ZeroModel, the basic visual AI layer.
- **The Space Between Models Has Holes: Mapping the AI Gap**: Applied visual AI; shows how VPMs can be used to understand information.