Ola

A few months ago I posted here about llama-conductor

The upvotes were nice, but I wanted to do some work (in my case, the good is the enemy of the perfect) to make it even better - so I did.

Short version: it’s not the same tool anymore.

Quick reminder of what this is and isn’t: No cloud by default. No API calls. No telemetry. No subscription. PII safe by architecture. Everything leaves receipts (not pinky promises) for you to inspect. Hook it up to whatever you can run. AGPL (copyleft).

I want nothing for it - I built this for me for the reasons I’ll state below.

If this sort of thing is in your wheelhouse, have at it. If there are questions, ask. I want to make this better and better.

Fair warning: the below is long. I wrote it up properly because I think the details matter - skip to whatever section looks interesting. Or just marvel at my ability to cram swear words within swear words directly into Python commits.

Codeberg: https://codeberg.org/BobbyLLM/llama-conductor


Can an LLM be a thinking tool, not a thinking replacement?

Most LLM tools are trying to replace your thinking. llama-conductor is for people who don’t want that. It’s a harness for using an LLM as a thinking tool - one where you can see the reasoning, trust the sources, and know exactly when it’s guessing.

The human stays in the loop. The model’s limitations are visible.

You decide what to trust.

Which brings us to why that matters.


Lies, damned lies, and LLMs:

LLMs are elegant liars. They sound right whether they are or aren’t. If you have ASD (like me) or just don’t know the domain, you believe them. That’s when you get fucked.

Worse: you can’t see the reasoning. Black box in, black box out. Wrong answer? No idea why. How to fix it? No idea.

Sorry, but that DOES NOT work for me. It doesn’t work with medical data. Research. Thinking. Brainstorming. Anything where “sounds plausible” isn’t good enough.


The trick, William Potter, is not minding that they bullshit:

Most pipelines treat the LLM as the first call. llama-conductor side-eyes it with suspicion.

The model is the last resort, not the first call. Deterministic components fire first - math sanity, state transitions, numeric corrections, memory retrieval - pure Python, before the model ever sees the query. When a solver handles a turn, it passes verified ground truth to the model. When the conversation leaves solver territory, the model takes over. When a new deterministic state appears mid-conversation, tooling fires back up automatically. No seams. No drift. It just stops making shit up on problems that have actual solutions it can work through.

You: 
I have a 250ml cup. I pour 500ml into it. How much is in the cup?

Model:
Total directed volume is 500 ml. The cup capacity is 250 ml, so the cup contains 250 ml and 250 ml overflows.
Confidence: medium | Source: Contextual

You:
I now take the contents of *that* cup and divide it equally across 3 cups. How much in each cup?

Model:
Equal split amount is 83.33 ml per cup.
Confidence: medium | Source: Contextual


You:
If I drink one of the three cups, how much is left?

Model:
After dividing, each cup holds 83.33 ml. Drinking one leaves 2 × 83.33 ml = 166.66 ml in the remaining two cups.

Confidence: medium | Source: Model

That was done by a 4 billion parameter model using llama-conductor.

There are lots of little transformations like this. Distance, volume, payroll calculations, logic probes (the famous car wash test), scheduling batches. The LLM does what it’s actually good at: language, synthesis, reasoning over verified ground truth. Everything else is handled by something that can’t hallucinate.

PS: Notice turn 3 has Source: Model? That’s the system being transparent about exactly where the handoff happened.
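
For the curious, here's roughly what "deterministic first, model last" looks like as code. This is a minimal sketch, not the actual llama-conductor router: `cup_overflow_solver`, `route`, the regex, and the "Solver" label are all made up for illustration. The shape is the point: every solver gets a chance to claim the turn, and the model only runs when nobody does.

```python
import re
from typing import Callable, Optional

# Hypothetical solver (illustration only): handles "N ml cup, pour M ml" questions.
def cup_overflow_solver(query: str) -> Optional[str]:
    m = re.search(r"(\d+)\s*ml cup.*?pour (\d+)\s*ml", query, re.IGNORECASE)
    if m is None:
        return None  # not solver territory; let the next stage try
    capacity, poured = int(m.group(1)), int(m.group(2))
    held = min(capacity, poured)            # a cup cannot hold more than its capacity
    overflow = max(0, poured - capacity)    # whatever does not fit spills
    return f"The cup contains {held} ml and {overflow} ml overflows."

SOLVERS: list[Callable[[str], Optional[str]]] = [cup_overflow_solver]

def route(query: str, model: Callable[[str], str]) -> tuple[str, str]:
    """Return (answer, source). Solvers fire first; the model is the fallback."""
    for solver in SOLVERS:
        answer = solver(query)
        if answer is not None:
            return answer, "Solver"  # placeholder label, assumed for this sketch
    return model(query), "Model"
```

Feed `route` the 250ml cup question and it answers from the solver without ever calling the model. Ask it something no solver claims and the model's answer comes back tagged "Model".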


Blah blah blah. Show me something real:

Using the Guardian/Musk article as source, a 4B model with >>scratch produced this:

“List the article’s concrete claims about permit status and turbine operations, each with support.”

  • EPA position: these turbines require permits under the Clean Air Act.
  • Mississippi regulators: portable/mobile classification means temporary permit exemption.
  • Floodlight thermal imagery + public records indicate active turbine usage before/while permit disputes continued.

Confidence: high | Source: Scratchpad

“What is the strongest and weakest inference you can make?”

Strongest: Active regulatory interpretation conflict (federal-vs-state) while turbines operate at scale - directly supported. Weakest: Any claim operations are fully lawful or unlawful - requires court outcomes not present in this file.

Confidence: medium | Source: Scratchpad

“Which claims were later revised or contradicted?”

Disputed/unresolved: Final enforcement outcome. Whether state interpretation holds under federal pressure.

Confidence: high | Source: Scratchpad

Hard grounding. Cited support. Explicit flags on what the source doesn’t contain. A 4B model. Full worked example.


Now watch the system catch itself lying:

>>judge does deterministic pairwise ranking - compares every pair in both directions, parses strict verdicts, aggregates with confidence. Instead of “pick one and vibe,” you get provenance.

You: >>judge [which is healthier overall, prioritize lower sugar 
     and higher fiber] : apple, banana --verbose

[judge] ranking
criterion: [which is healthier overall for daily use, prioritize 
           lower sugar and higher fiber]
1. apple (score=2.00)
2. banana (score=0.00)
Judge confidence: HIGH

The model argued from pre-trained priors and both directions agreed. But what happens when the model doesn’t know?

You: >>judge [which BJJ technique is more dangerous] : kimura, heelhook --verbose

[judge] ranking
criterion: [which BJJ technique is more dangerous]
1. kimura (score=1.00)
2. heelhook (score=1.00)
Judge confidence: LOW

The model picked position B both times - kimura when kimura was B, heelhook when heelhook was B. Positional bias, not evaluation. >>judge catches this because it runs both orderings. Tied scores, confidence: low, full reasoning audit trail in JSONL.

The model was guessing, and the output tells you so instead of sounding confident about a coin flip.
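
If you want the idea in code, here's a minimal sketch of both-orderings pairwise ranking. It is not the real >>judge implementation: `rank`, `position_b_judge`, and `stable_judge` are hypothetical names, and the HIGH/LOW rule (HIGH only when no verdict flips with the ordering) is my simplification of "aggregates with confidence".

```python
from itertools import combinations
from typing import Callable

# judge_fn(first, second) returns the name of the preferred item.
JudgeFn = Callable[[str, str], str]

def rank(items: list[str], judge_fn: JudgeFn) -> tuple[dict[str, float], str]:
    scores = {item: 0.0 for item in items}
    consistent = True
    for a, b in combinations(items, 2):
        forward = judge_fn(a, b)    # a shown in the first slot
        backward = judge_fn(b, a)   # same pair, order swapped
        for verdict in (forward, backward):
            if verdict not in (a, b):
                # strict parsing: anything that is not one of the two items is rejected
                raise ValueError(f"unparseable verdict: {verdict!r}")
            scores[verdict] += 1.0
        if forward != backward:
            consistent = False      # the verdict flipped with the ordering
    return scores, "HIGH" if consistent else "LOW"

# Stand-in judge with pure positional bias: always prefers the second slot.
def position_b_judge(first: str, second: str) -> str:
    return second

# Stand-in judge with a stable preference regardless of order.
def stable_judge(first: str, second: str) -> str:
    return "apple" if "apple" in (first, second) else first
```

With `stable_judge`, apple scores 2.0, banana 0.0, confidence HIGH. With `position_b_judge`, kimura and heelhook tie at 1.0 each and confidence drops to LOW, which is the same pattern as the two transcripts above.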

Oh, but you want it to argue from an informed position? >>trust walks you through the grounded path: >>scratch your evidence first, then >>judge ranks from that - not model priors. Suddenly your judge has an informed opinion. Weird how that works when you give it something to read.

>>trust [which BJJ technique is safer for beginners]: kimura or heelhook?
A) >>scratch --> you paste your context here
[judge] ranking
criterion: [comparison]
    which bjj technique is safer for beginners; heel hook (score=0.00)
    kimura (score=2.00)

Winner: Which bjj technique is safer for beginners? Kimura

comparisons: 2
Judge confidence: HIGH

If the locked scope can’t support the question, judge fails closed. No fake ranking, no vibes verdict. Ungrounded pass? It tells you that too. You always know which one you’re getting.


The data - 8,974 runs across five model families. Measured. Reproducible. No “trust me bro.”

The core stack went through iterative hardening - rubric flags dropped from 3.3% → 1.4% → 0.2% → floor 0.00%. Post-policy: 1,864 routed runs, 0 flags, 0 retries. Both models, all six task categories, both conditions. Policy changes only - no model retraining, no fine-tuning. Then I did it three more times. Because apparently I like pain.

These aren’t softball prompts. I created six question types specifically to break shit:

  • Reversal: flip the key premise after the model commits. Does it revise, or cling?
  • Theory of mind: multiple actors, different beliefs. Does it keep who-knows-what straight?
  • Evidence grading: mixed-strength support. Does it maintain label discipline or quietly upgrade?
  • Retraction: correction invalidates an earlier assumption. Does it update or keep reasoning from the dead premise?
  • Contradiction: conflicting sources. Does it detect, prioritise, flag uncertainty - or just pick one?
  • Negative control: insufficient evidence by design. The only correct answer is “I don’t know.”

Then I stress-tested across three families it was never tuned for - Granite 3B, Phi-4-mini, SmolLM3. They broke. Of course.

But the failures weren’t random - they clustered in specific lanes under specific conditions, and the dominant failure mode was contract-compliance gaps (model gave the right answer in the wrong format), not confabulation. Every one classifiable and diagnosable. Surgical lane patch → 160/160 clean.

That’s the point of this thing. Not “zero errors forever” - auditable error modes with actionable fixes, correctable at the routing layer without touching the model. Tradeoffs documented honestly. Raw data in repo. Every failure taxonomized.

Trust me bro? Fuck that - go reproduce it. I’m putting my money where my mouth is and working on submitting this for peer review.

See: prepub/PAPER.md


What’s in the box:

Footer - every answer gets a router-assigned footer: Confidence: X | Source: Y. Not model self-confidence. Not vibes. Source = where the answer came from (model fallback, grounded docs, scratchpad, locked file, Vault, Wiki, cheatsheet, OCR). Confidence = how much verifiable support exists. Fast trust decision: accept, verify, or provide lockable context.
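
A minimal sketch of what "router-assigned" means in practice, assuming the footer is applied in Python after generation. `apply_footer` and `FOOTER_RE` are hypothetical names, not the real code. The idea: throw away anything footer-shaped the model wrote, then append the one the harness decided on.

```python
import re

# Matches any line that looks like a footer, wherever the model put it.
FOOTER_RE = re.compile(
    r"^[ \t]*Confidence:.*\|[ \t]*Source:.*$",
    re.IGNORECASE | re.MULTILINE,
)

def apply_footer(model_text: str, confidence: str, source: str) -> str:
    """Strip model-authored footers, then append the harness-assigned one."""
    body = FOOTER_RE.sub("", model_text).rstrip()
    return f"{body}\n\nConfidence: {confidence} | Source: {source}"
```

So if the model ends its answer with a self-awarded "Confidence: high | Source: Wiki" but the router knows the turn never touched Wiki, the reader still sees whatever the router passed in.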

KAIOKEN - live register classifier. Every human turn is macro-labelled (working / casual / personal) with subsignal tags (playful / friction / distress_hint / etc.) before the model fires. A validated, global decision tree - not LoRA or vibes - assigns tone constraints from classifier output. Validated against 1,536 adversarial probe executions, 3/3 pass required per probe. End result: your model stops being a sycophant. It might tell you to go to bed. It won’t tell you “you’re absolutely right!” when what you really need is a kick in the arse.

Cheatsheets - drop a JSONL file, terms auto-match on every turn, verified facts injected before generation. Miss on an unknown term? Routes to >>wiki instead of letting the model guess. Source: Cheatsheets in the footer. Your knowledge, your stack, zero confabulation on your own specs.
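
Roughly how term matching against a JSONL file could work. This is a sketch under assumptions: the `term` / `fact` keys and both function names are invented for illustration, so check the docs for the real schema.

```python
import json
import re

def load_cheatsheet(jsonl_text: str) -> dict[str, str]:
    """Parse JSONL into {term: fact}. The 'term'/'fact' keys are assumed, not the real schema."""
    entries = {}
    for line in jsonl_text.splitlines():
        if line.strip():  # skip blank lines
            record = json.loads(line)
            entries[record["term"].lower()] = record["fact"]
    return entries

def match_facts(turn: str, cheatsheet: dict[str, str]) -> list[str]:
    """Return facts whose term appears as a whole word or phrase in the turn."""
    lowered = turn.lower()
    return [
        fact
        for term, fact in cheatsheet.items()
        if re.search(rf"\b{re.escape(term)}\b", lowered)
    ]
```

An empty result is the "miss" case: nothing verified to inject, so the turn would go somewhere other than a model guess.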

Vodka - deterministic memory pipeline. !! store is SHA-addressed and verbatim. ?? recall retrieves deterministically, bypasses model entirely. What you said is what comes back - no LLM smoothing, no creative reinterpretation. Without this? Your model confidently tells you your server IP is 127.0.0.1. Ask me how I know.
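
A minimal sketch of content-addressed, verbatim storage. Assumptions: SHA-256 as the hash and a plain in-memory dict as the store; `VerbatimStore` is a hypothetical name and the real pipeline does more than this.

```python
import hashlib

class VerbatimStore:
    """Store text under the hash of its own bytes; return it unchanged."""

    def __init__(self) -> None:
        self._items: dict[str, str] = {}

    def store(self, text: str) -> str:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        self._items[key] = text  # stored verbatim, no rewriting
        return key

    def recall(self, key: str) -> str:
        text = self._items[key]
        # Re-hash on the way out so any alteration is detected, not served.
        if hashlib.sha256(text.encode("utf-8")).hexdigest() != key:
            raise ValueError("stored text no longer matches its address")
        return text
```

No model sits anywhere in that path, so recall returns exactly the bytes that went in.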

>>flush / !!nuke - flush context or nuke it from orbit. Your data, your call, one command. “Delete my data” is a keystroke, not a support ticket.

>>scratch - paste any text, ask questions grounded only to that text. Lossless, no summarisation. Model cannot drift outside it. Want it to use multiple locked sources? You can.

>>summ and >>lock - deterministic extractive summarisation (pure Python, no LLM) + single-source grounding. Missing support → explicit “not found” label, not silent fallback.
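
For a feel of what extractive summarisation without an LLM can look like, here's a word-frequency sketch. It is a swapped-in textbook technique, not necessarily what >>summ does, and `extractive_summary` is a hypothetical name. It only ever returns sentences copied from the input, and the same input always gives the same output.

```python
import re
from collections import Counter

def extractive_summary(text: str, max_sentences: int = 2) -> str:
    # Split on sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Average word frequency, so long sentences do not win by length alone.
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    # Rank by score; ties are broken by original position for determinism.
    ranked = sorted(range(len(sentences)), key=lambda i: (-score(sentences[i]), i))
    keep = sorted(ranked[:max_sentences])  # restore document order
    return " ".join(sentences[i] for i in keep)
```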

##mentats - Vault-only deep retrieval. Thinker drafts from Vault facts, Critic (different model family) hunts violations, hallucinated content is deleted - never replaced with more hallucination, Thinker consolidates. No evidence to support claim? No answer. Gap explicitly stated.

Deterministic sidecars - >>wiki, >>weather, >>exchange, >>calc, >>define, >>vision/>>ocr. If a sidecar can do it, it does it deterministically.

Role orchestration - thinker, critic, vision, coder, judge - different families for error diversity. Swap any role in one line of config.

Personality Modes - Serious (default), Fun, Fun Rewrite, Raw passthrough. Model updates its snark and sarcasm based on how you talk to it. Yes, TARS sliders. Style changes delivery, not evidence contracts.


So, wait…are you saying you solved LLM hallucinations?

No. I did something much more evil. I made it impossible for the LLM to bullshit quietly. I made hallucinations…unpalatable, so the model would rather say “shit, I don’t know the answer. Please stop hurting me.”

To which I say…no.

Wrong still happens (though much less often), and when it does, it comes with a source label, a confidence rating, and an audit trail.

TL;DR: I made “I don’t know” a first-class output.

“In God We Trust; All others bring data.” - Deming


Runs on:

A potato. I run this on my Lenovo P330 Tiny with 4GB VRAM and 640 CUDA cores; if it runs here, it runs on yours.

pip install git+https://codeberg.org/BobbyLLM/llama-conductor.git
python -m llama_conductor.launch_stack up --config llama_conductor/router_config.yaml

Open http://127.0.0.1:8088/

Full docs: FAQ | Quickstart

License: AGPL-3.0. Corps who use it, contribute back.

P.S.: The whole stack runs on llama.cpp alone. I built a shim that patches the llama.cpp WebUI to route API calls through llama-conductor - one backend, one frontend, zero extra moving parts. Desktop or LAN. That’s it.

PPS.: I even made a Firefox extension for it. Gives you ‘summarize’, ‘translate’, ‘analyse sentiment’ and ‘copy text to chat’. Doesn’t send anything to the cloud AT ALL (it’s just HTML files folded into a Firefox XPI).

“The first principle is that you must not fool yourself - and you are the easiest person to fool.” - Feynman

PPPS: A meat popsicle wrote this. Evidence - https://bobbyllm.github.io/llama-conductor/


Codeberg: https://codeberg.org/BobbyLLM/llama-conductor

GitHub: https://github.com/BobbyLLM/llama-conductor

  • utopiah@lemmy.ml
    8 hours ago

    Can’t it source other LLM outputs as “verified source” and thus still say whatever sounds good, like any LLM? Providing “technical” verification, e.g. SHA, gives no insurance about the content itself being from a reputable source. I don’t think adding confidence and sourcing changes anything, the user STILL has to verify that whatever is provided is coherent and a third party is actually a good source. Thanks for making the process public though, doing better than OpenAI does.

    • SuspciousCarrot78@lemmy.worldOP
      7 hours ago

      Can’t it source other LLM outputs as “verified source” and thus still say whatever sounds good, like any LLM?

      No. The footer tells you what the source is. Anything the model generates on its own is confidence: unverified | source: model - explicitly flagged by default. To get to source: docs or source: scratchpad, it needs direct, traceable, human-originated provenance. You control what goes in. The FAQ outlines the sources and strength rankings; it’s not vibes.

      Providing “technical” verification, e.g. SHA, gives no insurance about the content itself being from a reputable source.

      SHA verifies the document hasn’t been altered since it entered your stack. Source quality is your call. GIGO is always an issue, but if you scope the source correctly it won’t drift. And if it does, you’ll know, because the footer tells you exactly where the answer came from.

      The cheatsheet system is the clearest example of how this works in practice: you define terms once in a JSONL file, the model pegs its reasoning to your definition forever. It can’t revert to something you didn’t teach it. That fingerprint is over everything.

      … the user STILL has to verify that whatever is provided is coherent and a third party is actually a good source.

      Yes, deliberately. That’s a feature.

      Like I said, most LLM tools are trying to replace your thinking, this one isn’t. The human stays in the loop. The model’s limitations are visible. You decide what to trust. Maybe that’s enough, maybe it isn’t.

      EDIT: giant wall of text. See - https://codeberg.org/BobbyLLM/llama-conductor#some-problems-this-solves

      • utopiah@lemmy.ml
        6 hours ago

        Isn’t “source: model” basically roulette? We go back to the initial problem. Also, anything that is not “model” might also be hallucinated if at any point the string that gives back “source:” goes through the model.

        • SuspciousCarrot78@lemmy.worldOP
          6 hours ago

          Nope.

          1. Source: Model is not pretending otherwise
            It is basically “priors lane.” That’s the point of the label: explicit uncertainty, not fake certainty.

          2. Source footer is harness-generated, not model-authored
            In this stack, footer normalization happens post-generation in Python. I’ve specifically hardened this because of earlier bleed cases. So the model does not get to self-award Wiki/Docs/Cheatsheets etc.

          3. Model lane is controlled, not roulette

          • deterministic-first routing where applicable
          • fail-loud behavior in grounded lanes
          • provenance downgrade when grounding didn’t actually occur

          So yes: Source: Model means “less trustworthy, verify me.” Always do that. Don’t trust the stochastic parrot.

          But also no: it’s not equivalent to a silent hallucination system pretending to be grounded. That’s exactly what the provenance layer is there to prevent.

    • JustinTheGM@ttrpg.network
      7 hours ago

      Fair, but that’s the same problem human thinkers face. Faulty inputs == faulty outputs. You should always be validating your sources.

      • utopiah@lemmy.ml
        6 hours ago

        Right, but if one person keeps on giving me wrong answers, knowingly or not, my distrust in them is not linear. They’ll have to “earn” it back and it’s going to be very challenging. If they do learn though, then it might come back faster. In this setup I have no guarantee of any progress. There’s no “one” in there trying to fix any mistake.

        • SuspciousCarrot78@lemmy.worldOP
          5 hours ago

          You’re describing trust dynamics correctly and that’s exactly why this project doesn’t ask you to trust the model. It asks you to trust observable outputs: provenance labels, deterministic lanes, fail-loud behaviour.

          When it fails, you can see exactly which layer failed and why. Then you can fix it yourself. That’s more than you get right now (and in part why LLMs are considered toxic).

          The correction mechanism is explicit rather than hoped for (“it learns”): encode the fix via cheatsheets, memory, or lane contracts and it sticks permanently.

          The model can’t drift back to the wrong answer. That’s not the model earning trust back - it’s you patching the ground truth it reasons from. Progress is measured in artifacts, not vibes.

          Until someone makes better AI, that’s all we’ve got. Generally, we don’t get even this much.

          Sadly, AI isn’t “one mind learning”; it can’t. So trust is earned by shrinking failure classes and proving it stuck again and again and again (aka making sure the tool does what it should be doing). Whether that’s satisfying in the way a person earning trust back is satisfying - honestly, probably not. But it’s more auditable.

          LLMs aren’t people and I’m ok with meeting them where they are.