

Model Benchmarking Writeup (V4.0 / V4.1 / V4.2)


Date: 2026-02-22


Scope

This writeup summarizes benchmark work done on OpenClaw Ollama agents.


Hardware context: NVIDIA RTX 4090 (24GB VRAM).




Benchmark Packs Created


V4.0 (reusable)

Location: benchmarks/v4.0/


Contents:


V4.1 (reusable)

Location: benchmarks/v4.1/


Contents:


V4.2 (reusable)

Location: benchmarks/v4.2/


Contents:




Key Findings


1. Reasoning performance


2. Lisp generation (long / multi-file / runtime-validated)


3. Thinking-level effects (ollama-quality, V4.2)




Notable Run Outputs





Practical Recommendation


For current local benchmarking workflows:




Repeatability Commands


# V4.0 / V4.1 core runner (with repair rounds)
node benchmarks/v4.0/run-v4-benchmark.js --agents ollama-quality --repair-rounds 2

# V4.1 wrapper
bash benchmarks/v4.1/run-v4.1-benchmark.sh --agents ollama-quality

# V4.2 thinking sweep
node benchmarks/v4.2/run-v4.2-thinking-sweep.js --agent ollama-quality
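
The three commands above can be chained into one pass. The following is a minimal sketch, not part of the benchmark packs: `AGENT`, `DRY_RUN`, `run_step`, and `run_all` are hypothetical names introduced here for illustration, and the script paths and flags are copied verbatim from the commands above (including `--agent` rather than `--agents` for the V4.2 sweep). It assumes it is run from the repository root. By default it only prints what it would run; set `DRY_RUN=0` to execute.

```shell
#!/usr/bin/env bash
# Sketch: run the V4.0, V4.1, and V4.2 benchmark commands in order.
# AGENT and DRY_RUN are hypothetical conveniences of this wrapper,
# not options of the benchmark scripts themselves.
set -euo pipefail

AGENT="${AGENT:-ollama-quality}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print only (default), 0 = actually execute

# Print the command line, then execute it unless DRY_RUN=1.
run_step() {
  printf '+ %s\n' "$*"
  if [ "$DRY_RUN" != "1" ]; then
    "$@"
  fi
}

# Run the three packs in version order; with set -e, a failing
# step stops the sequence instead of continuing to the next pack.
run_all() {
  run_step node benchmarks/v4.0/run-v4-benchmark.js --agents "$AGENT" --repair-rounds 2
  run_step bash benchmarks/v4.1/run-v4.1-benchmark.sh --agents "$AGENT"
  run_step node benchmarks/v4.2/run-v4.2-thinking-sweep.js --agent "$AGENT"
}

run_all
```

Saved under a name of your choosing, it can be previewed with `bash <script>` and executed for real with `DRY_RUN=0 bash <script>`. Stopping on the first failure is a deliberate choice here, so that a broken V4.0 run is noticed before the later packs spend GPU time.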

