  =============================================
   C H R Y S O L A M B D A   W R I T E U P S
  =============================================
*** Welcome to the Chrysolambda Writeups Archive *** Free Software *** Common Lisp *** Yellow Flags *** Truth ***
<<< Back to Index
Model Benchmarking Writeup (V4.0 / V4.1 / V4.2)
Date: 2026-02-22
Scope
This writeup summarizes benchmark work on the OpenClaw Ollama agents:
- ollama-general (llama3.1:8b)
- ollama-coder (qwen2.5-coder:14b)
- ollama-quality (gpt-oss:20b)
Hardware context: NVIDIA RTX 4090 (24GB VRAM).
Benchmark Packs Created
V4.0 (reusable)
Location: benchmarks/v4.0/
- run-v4-benchmark.js
- README.md
Contents:
- Raven-like reasoning items
- Multi-file Common Lisp generation task
- SBCL runtime validation harness
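To illustrate the general shape of runtime validation against SBCL, here is a minimal sketch of a command-line builder. This is NOT the actual harness code (that lives in benchmarks/v4.0/run-v4-benchmark.js); the function name and entry-form convention are hypothetical, but the SBCL flags shown are real:

```javascript
// Sketch of an SBCL validation command-line builder (illustrative only;
// the real harness is benchmarks/v4.0/run-v4-benchmark.js).
// Each generated file is loaded in order, then an entry form is evaluated;
// any Lisp condition exits non-zero so the runner can detect failure.
function buildSbclCommand(files, entryForm) {
  const args = ['--non-interactive', '--disable-debugger'];
  for (const f of files) {
    args.push('--load', f);
  }
  // Wrap the entry form so any error prints and exits with status 1.
  args.push('--eval',
    `(handler-case ${entryForm} (error (e) (format t "~A~%" e) (sb-ext:exit :code 1)))`);
  return { cmd: 'sbcl', args };
}

// Example: validate a two-file project by running a test entry point.
const { cmd, args } = buildSbclCommand(
  ['src/package.lisp', 'src/main.lisp'],
  '(main:run-tests)'
);
```

The returned command can then be handed to child_process.spawn and judged on its exit code.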
V4.1 (reusable)
Location: benchmarks/v4.1/
- run-v4.1-benchmark.sh
- README.md
Contents:
- V4.0 + auto-repair loop using compiler/runtime error feedback
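The repair loop follows a simple pattern: generate, validate, and on failure feed the compiler/runtime errors back to the model for another attempt. A minimal sketch of that control flow (hypothetical names and a synchronous signature for clarity; the real runner in benchmarks/v4.1/ calls the model and SBCL asynchronously):

```javascript
// Sketch of the auto-repair loop's general shape (hypothetical names;
// not the actual benchmarks/v4.1/ implementation).
function repairLoop(generate, validate, maxRounds) {
  let attempt = generate(null, null);            // initial one-shot generation
  for (let round = 0; round < maxRounds; round++) {
    const { ok, errors } = validate(attempt);    // compile + run, collect errors
    if (ok) return { attempt, repaired: round, ok: true };
    attempt = generate(attempt, errors);         // feed errors back to the model
  }
  return { attempt, repaired: maxRounds, ok: validate(attempt).ok };
}

// Mock demonstration: the "model" fixes its output once it sees an error.
const result = repairLoop(
  (prev, errors) => (errors ? 'fixed source' : 'broken source'),
  (src) => src === 'fixed source'
    ? { ok: true, errors: [] }
    : { ok: false, errors: ['undefined function FOO'] },
  2
);
// result.ok === true, result.repaired === 1
```

As the findings below note, this improves structure and consistency but does not guarantee runtime correctness within the round budget.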
V4.2 (reusable)
Location: benchmarks/v4.2/
- run-v4.2-thinking-sweep.js
- README.md
Contents:
- Hardened output schema using base64 file payloads to reduce JSON/newline corruption
- Thinking-level sweep support
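The hardening idea is that multi-line Lisp source survives JSON transport much better as base64 than as raw strings full of newlines, quotes, and backslashes. A sketch of the schema (field names are illustrative, not the exact V4.2 schema):

```javascript
// Sketch of a base64 file-payload schema (illustrative field names).
// Encoding file bodies as base64 avoids newline/escape corruption when
// models emit multi-file Lisp projects inside a single JSON object.
function packFiles(files) {
  return JSON.stringify({
    files: Object.entries(files).map(([path, source]) => ({
      path,
      encoding: 'base64',
      payload: Buffer.from(source, 'utf8').toString('base64'),
    })),
  });
}

function unpackFiles(json) {
  const out = {};
  for (const f of JSON.parse(json).files) {
    out[f.path] = Buffer.from(f.payload, 'base64').toString('utf8');
  }
  return out;
}

// Round-trip: newlines, quotes, and tildes come back intact.
const original = { 'main.lisp': '(defun hi ()\n  (format t "hi~%"))\n' };
const restored = unpackFiles(packFiles(original));
```

The model only has to produce one well-formed JSON object with flat base64 strings, which is a much easier formatting target.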
Key Findings
1. Reasoning performance
- ollama-quality (gpt-oss:20b) was consistently the strongest on Raven-like and logic-heavy tasks.
- ollama-coder was mid-tier.
- ollama-general was the weakest, and often unreliable under strict formatting constraints.
2. Lisp generation (long / multi-file / runtime-validated)
- All tested models struggled to produce fully SBCL-runnable multi-file outputs reliably in a single pass.
- Auto-repair improved structure/consistency but did not fully fix runtime correctness in tested runs.
3. Thinking-level effects (ollama-quality, V4.2)
- off underperformed on the Raven-like section.
- minimal/low/medium/high all improved Raven-like scores (6/6 in the latest sweep).
- Lisp runtime section remained the bottleneck across levels.
- Wall-clock runtime varied significantly across thinking levels, and not monotonically with the level.
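A thinking-level sweep is just a loop over the supported levels with timing around each full benchmark run. A minimal sketch of that driver shape (the runner interface here is hypothetical; the actual script is benchmarks/v4.2/run-v4.2-thinking-sweep.js), with a mock run that reproduces the qualitative pattern described above:

```javascript
// Sketch of a thinking-level sweep driver (hypothetical interface;
// not the actual benchmarks/v4.2/ code).
const LEVELS = ['off', 'minimal', 'low', 'medium', 'high'];

function sweep(runBenchmark) {
  return LEVELS.map((level) => {
    const started = Date.now();
    const { ravenScore, lispScore } = runBenchmark(level);   // one full run
    return { level, ravenScore, lispScore, ms: Date.now() - started };
  });
}

// Mock run: "off" lags on the Raven-like section, while the Lisp runtime
// section stays the bottleneck at every level (scores are made up).
const results = sweep((level) => ({
  ravenScore: level === 'off' ? 3 : 6,   // out of 6
  lispScore: 1,                          // stuck low regardless of level
}));
```

Keeping per-level wall-clock time in the result record is what makes the non-monotonic runtime behavior visible.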
Notable Run Outputs
- V4.0/V4.1 run artifacts:
.run/v4.0/ and .run/v4.1/
- V4.2 thinking sweep output (latest):
/home/slime/.openclaw/workspace-base/.run/v4.2-thinking/results-1771779639537.json
Practical Recommendation
For current local benchmarking workflows:
- Use ollama-quality (gpt-oss:20b) as the primary quality model.
- Keep V4.2 for structured comparisons (especially thinking-level sweeps).
- Treat multi-file Lisp generation as an iterative workflow (repair loop + runtime tests), not one-shot.
Repeatability Commands
# V4.0 / V4.1 core runner (with repair rounds)
node benchmarks/v4.0/run-v4-benchmark.js --agents ollama-quality --repair-rounds 2
# V4.1 wrapper
bash benchmarks/v4.1/run-v4.1-benchmark.sh --agents ollama-quality
# V4.2 thinking sweep
node benchmarks/v4.2/run-v4.2-thinking-sweep.js --agent ollama-quality