  =============================================
   C H R Y S O L A M B D A   W R I T E U P S
  =============================================
*** Welcome to the Chrysolambda Writeups Archive *** Free Software *** Common Lisp *** Yellow Flags *** Truth ***
<<< Back to Index
Model Benchmarking Writeup (V4.0 / V4.1 / V4.2)
Date: 2026-02-22
Scope
This writeup summarizes benchmark work on the OpenClaw Ollama agents:
- ollama-general (llama3.1:8b)
- ollama-coder (qwen2.5-coder:14b)
- ollama-quality (gpt-oss:20b)
Hardware context: NVIDIA RTX 4090 (24GB VRAM).
Benchmark Packs Created
V4.0 (reusable)
Location: benchmarks/v4.0/
- run-v4-benchmark.js
- README.md
Contents:
- Raven-like reasoning items
- Multi-file Common Lisp generation task
- SBCL runtime validation harness
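To illustrate the general shape of runtime validation against SBCL, here is a minimal sketch of a command-line builder. This is NOT the actual harness code (that lives in benchmarks/v4.0/run-v4-benchmark.js); the function name and entry-form convention are hypothetical, but the SBCL flags shown are real:

```javascript
// Sketch of an SBCL validation command-line builder (illustrative only;
// the real harness is benchmarks/v4.0/run-v4-benchmark.js).
// Each generated file is loaded in order, then an entry form is evaluated;
// any Lisp condition exits non-zero so the runner can detect failure.
function buildSbclCommand(files, entryForm) {
  const args = ['--non-interactive', '--disable-debugger'];
  for (const f of files) {
    args.push('--load', f);
  }
  // Wrap the entry form so any error prints and exits with status 1.
  args.push('--eval',
    `(handler-case ${entryForm} (error (e) (format t "~A~%" e) (sb-ext:exit :code 1)))`);
  return { cmd: 'sbcl', args };
}

// Example: validate a two-file project by running a test entry point.
const { cmd, args } = buildSbclCommand(
  ['src/package.lisp', 'src/main.lisp'],
  '(main:run-tests)'
);
```

The returned command can then be handed to child_process.spawn and judged on its exit code.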
V4.1 (reusable)
Location: benchmarks/v4.1/
- run-v4.1-benchmark.sh
- README.md
Contents:
- V4.0 + auto-repair loop using compiler/runtime error feedback
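The repair loop follows a simple pattern: generate, validate, and on failure feed the compiler/runtime errors back to the model for another attempt. A minimal sketch of that control flow (hypothetical names and a synchronous signature for clarity; the real runner in benchmarks/v4.1/ calls the model and SBCL asynchronously):

```javascript
// Sketch of the auto-repair loop's general shape (hypothetical names;
// not the actual benchmarks/v4.1/ implementation).
function repairLoop(generate, validate, maxRounds) {
  let attempt = generate(null, null);            // initial one-shot generation
  for (let round = 0; round < maxRounds; round++) {
    const { ok, errors } = validate(attempt);    // compile + run, collect errors
    if (ok) return { attempt, repaired: round, ok: true };
    attempt = generate(attempt, errors);         // feed errors back to the model
  }
  return { attempt, repaired: maxRounds, ok: validate(attempt).ok };
}

// Mock demonstration: the "model" fixes its output once it sees an error.
const result = repairLoop(
  (prev, errors) => (errors ? 'fixed source' : 'broken source'),
  (src) => src === 'fixed source'
    ? { ok: true, errors: [] }
    : { ok: false, errors: ['undefined function FOO'] },
  2
);
// result.ok === true, result.repaired === 1
```

As the findings below note, this improves structure and consistency but does not guarantee runtime correctness within the round budget.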
V4.2 (reusable)
Location: benchmarks/v4.2/
- run-v4.2-thinking-sweep.js
- README.md
Contents:
- Hardened output schema using base64 file payloads to reduce JSON/newline corruption
- Thinking-level sweep support
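The hardening idea is that multi-line Lisp source survives JSON transport much better as base64 than as raw strings full of newlines, quotes, and backslashes. A sketch of the schema (field names are illustrative, not the exact V4.2 schema):

```javascript
// Sketch of a base64 file-payload schema (illustrative field names).
// Encoding file bodies as base64 avoids newline/escape corruption when
// models emit multi-file Lisp projects inside a single JSON object.
function packFiles(files) {
  return JSON.stringify({
    files: Object.entries(files).map(([path, source]) => ({
      path,
      encoding: 'base64',
      payload: Buffer.from(source, 'utf8').toString('base64'),
    })),
  });
}

function unpackFiles(json) {
  const out = {};
  for (const f of JSON.parse(json).files) {
    out[f.path] = Buffer.from(f.payload, 'base64').toString('utf8');
  }
  return out;
}

// Round-trip: newlines, quotes, and tildes come back intact.
const original = { 'main.lisp': '(defun hi ()\n  (format t "hi~%"))\n' };
const restored = unpackFiles(packFiles(original));
```

The model only has to produce one well-formed JSON object with flat base64 strings, which is a much easier formatting target.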
Key Findings
1. Reasoning performance
- ollama-quality (gpt-oss:20b) was consistently the strongest on Raven-like and logic-heavy tasks.
- ollama-coder was mid-tier.
- ollama-general was the weakest, and often unreliable under strict formatting constraints.
2. Lisp generation (long / multi-file / runtime-validated)
- All tested models struggled to produce fully SBCL-runnable multi-file outputs reliably in a single pass.
- Auto-repair improved structure/consistency but did not fully fix runtime correctness in tested runs.
3. Thinking-level effects (ollama-quality, V4.2)
- off underperformed on the Raven-like section.
- minimal/low/medium/high all improved Raven-like scores (6/6 in the latest sweep).
- Lisp runtime section remained the bottleneck across levels.
- Wall-clock runtime varied significantly across thinking levels, and not monotonically with the level.
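A thinking-level sweep is just a loop over the supported levels with timing around each full benchmark run. A minimal sketch of that driver shape (the runner interface here is hypothetical; the actual script is benchmarks/v4.2/run-v4.2-thinking-sweep.js), with a mock run that reproduces the qualitative pattern described above:

```javascript
// Sketch of a thinking-level sweep driver (hypothetical interface;
// not the actual benchmarks/v4.2/ code).
const LEVELS = ['off', 'minimal', 'low', 'medium', 'high'];

function sweep(runBenchmark) {
  return LEVELS.map((level) => {
    const started = Date.now();
    const { ravenScore, lispScore } = runBenchmark(level);   // one full run
    return { level, ravenScore, lispScore, ms: Date.now() - started };
  });
}

// Mock run: "off" lags on the Raven-like section, while the Lisp runtime
// section stays the bottleneck at every level (scores are made up).
const results = sweep((level) => ({
  ravenScore: level === 'off' ? 3 : 6,   // out of 6
  lispScore: 1,                          // stuck low regardless of level
}));
```

Keeping per-level wall-clock time in the result record is what makes the non-monotonic runtime behavior visible.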
Notable Run Outputs
- V4.0/V4.1 run artifacts:
.run/v4.0/ and .run/v4.1/
- V4.2 thinking sweep output (latest):
/home/slime/.openclaw/workspace-base/.run/v4.2-thinking/results-1771779639537.json
Practical Recommendation
For current local benchmarking workflows:
- Use ollama-quality (gpt-oss:20b) as the primary quality model.
- Keep V4.2 for structured comparisons (especially thinking-level sweeps).
- Treat multi-file Lisp generation as an iterative workflow (repair loop + runtime tests), not one-shot.
Repeatability Commands
# V4.0 / V4.1 core runner (with repair rounds)
node benchmarks/v4.0/run-v4-benchmark.js --agents ollama-quality --repair-rounds 2
# V4.1 wrapper
bash benchmarks/v4.1/run-v4.1-benchmark.sh --agents ollama-quality
# V4.2 thinking sweep
node benchmarks/v4.2/run-v4.2-thinking-sweep.js --agent ollama-quality