  _____ _                          _                 _         _
 / ____| |                        | |               | |       | |
| |    | |__  _ __ _   _ ___  ___ | | __ _ _ __ ___ | |__   __| | __ _
| |    | '_ \| '__| | | / __|/ _ \| |/ _` | '_ ` _ \| '_ \ / _` |/ _` |
| |____| | | | |  | |_| \__ \ (_) | | (_| | | | | | | |_) | (_| | (_| |
 \_____|_| |_|_|   \__, |___/\___/|_|\__,_|_| |_| |_|_.__/ \__,_|\__,_|
                    __/ |
                   |___/          W R I T E U P S
*** Welcome to the Chrysolambda Writeups Archive *** Free Software *** Common Lisp *** Yellow Flags *** Truth ***
<<< Back to Index
Local LLMs on Ollama: Use Cases and Limitations (RTX 4090, 24GB VRAM)
This writeup is tailored for a local Ollama deployment running on an NVIDIA RTX 4090 (24GB).
1) What runs well on this hardware
Comfortable tier (best interactive experience)
- 7B/8B models (Q4–Q8)
- 13B/14B models (Q4–Q6)
These typically offer the best speed/quality tradeoff for daily chat, coding, and automation.
Usable but heavier
- ~30B-class quantized models (model/quant dependent)
These can work, but expect slower responses, longer warmup, and reduced concurrency.
Edge case / experimental
- 70B+ class locally on 24GB via aggressive quantization and/or CPU offload
Generally possible for experiments, but not ideal for responsive daily usage.
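The tiers above follow from simple arithmetic: weight memory is roughly parameter count times bits-per-weight divided by 8, before KV cache, CUDA context, and desktop usage add several more GiB. A back-of-envelope sketch (the bits-per-weight figures are illustrative quant averages, not measured numbers):

```python
def weight_gib(params_b: float, bits: float) -> float:
    """Rough weight-only memory estimate in GiB: params * bits / 8 bytes."""
    return params_b * 1e9 * bits / 8 / 2**30

# Illustrative fits against a 24 GiB card (weights only).
for name, params, bits in [("8B @ ~Q4", 8, 4.5),
                           ("14B @ ~Q5", 14, 5.5),
                           ("32B @ ~Q4", 32, 4.5),
                           ("70B @ ~Q4", 70, 4.5)]:
    print(f"{name}: ~{weight_gib(params, bits):.1f} GiB")
```

An 8B model at ~4.5 bits lands around 4 GiB and a 14B around 9 GiB, leaving headroom for context; a 70B at the same quant is well past 24 GiB, which is why it needs CPU offload or harsher quantization.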
2) Model families commonly used in Ollama
- Llama-family instruct models: solid general assistant baseline.
- Mistral / Mixtral: good instruction and coding utility at smaller sizes.
- Qwen family: strong multilingual and coding performance.
- DeepSeek variants: good coding value in mid-size ranges.
- Code-focused models: strong for generation/refactors/tests, weaker as broad generalists.
3) Best use cases for local Ollama models
A) Coding assistant
Good for:
- function/class generation
- unit tests
- refactors
- stack trace explanation
- shell/CI snippets
Limitation:
- weaker at large repo-wide reasoning without retrieval/indexing.
B) Private document chat (RAG)
Good for:
- Q&A over local notes/docs/repos
- summarization of internal documents
- side-by-side doc comparison
Limitation:
- answer quality depends heavily on the retrieval pipeline (chunking, embeddings, ranking).
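The retrieval side usually starts with chunking. A minimal fixed-size chunker with overlap, so sentences cut at a boundary still appear whole in a neighboring chunk (sizes are illustrative; real pipelines often split on sentences or headings instead):

```python
def chunk_text(text: str, size: int = 400, overlap: int = 80) -> list:
    """Split text into fixed-size character chunks with overlapping edges."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    chunks = []
    step = size - overlap  # how far the window advances each iteration
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

Each chunk is then embedded and indexed; at question time the top-scoring chunks are pasted into the prompt as context.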
C) Workflow automation backend
Good for:
- classification/triage
- extraction to structured JSON
- template-based drafting
Limitation:
- tool-calling reliability varies; use schema validation and retry logic.
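The validate-and-retry loop for JSON extraction can be sketched like this (the schema and the stubbed model are illustrative; swap the stub for a real Ollama call):

```python
import json

REQUIRED_KEYS = {"category", "priority"}  # example schema, adjust to taste

def validate(raw):
    """Accept model output only if it parses as JSON with the right keys."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if isinstance(obj, dict) and REQUIRED_KEYS <= obj.keys():
        return obj
    return None

def extract(generate, prompt, retries=3):
    """Call the model, validate its output, and retry on malformed JSON."""
    for _ in range(retries):
        result = validate(generate(prompt))
        if result is not None:
            return result
    raise ValueError(f"no valid JSON after {retries} attempts")

# Stub model: fails once, then returns valid JSON (stands in for Ollama).
replies = iter(['not json at all', '{"category": "bug", "priority": 2}'])
print(extract(lambda p: next(replies), "Classify: app crashes on start"))
# → {'category': 'bug', 'priority': 2}
```

The retry loop is cheap insurance: small local models emit malformed JSON often enough that unvalidated output should never feed an automated pipeline.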
D) Offline personal assistant
Good for:
- brainstorming
- drafting
- note cleanup
- local-first workflows
Limitation:
- hallucinations still occur; verification steps are required.
4) Core limitations to expect
1. Reasoning gap vs frontier cloud models
- hard multi-step planning and deep codebase reasoning remain weaker.
2. Long context has practical costs
- throughput drops as context increases; chunking/RAG remains essential.
3. Hallucination risk
- confident wrong answers still happen; add checks/tests/citations.
4. Quantization tradeoffs
- lower-bit quants improve fit/speed but can reduce precision/stability.
5. Concurrency bottlenecks
- one GPU can become contested quickly under multi-user or agent load.
6. Operational burden
- local hosting means you own model updates, routing policies, and regressions.
5) Practical deployment strategy (recommended)
For a robust local setup on a 4090:
1. Primary general model: strong 8B–14B instruct model
2. Primary coding model: strong 7B–14B coder model
3. Optional quality model: larger quantized model for difficult prompts
4. Embedding model: lightweight local embedding model for RAG
5. Routing policy: fast-by-default, escalate on low confidence/failure
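The routing policy in step 5 can be sketched as follows (the stub models and confidence heuristic are illustrative; a real setup might score confidence from log-probabilities or a self-check prompt):

```python
def route(prompt, fast_model, strong_model, threshold=0.7):
    """Fast-by-default routing: try the small model first, escalate to the
    larger quantized model when confidence falls below the threshold."""
    answer, confidence = fast_model(prompt)
    if confidence >= threshold:
        return answer, "fast"
    answer, _ = strong_model(prompt)
    return answer, "escalated"

# Stubs standing in for Ollama calls; each returns (answer, confidence).
fast = lambda p: ("short answer", 0.4 if "hard" in p else 0.9)
strong = lambda p: ("careful answer", 0.95)

print(route("easy question", fast, strong))   # → ('short answer', 'fast')
print(route("hard question", fast, strong))   # → ('careful answer', 'escalated')
```

Keeping the threshold conservative means the big model only spins up for the minority of prompts that need it, which preserves interactive latency on the single GPU.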
6) Bottom line
Local Ollama models are excellent for:
- privacy-sensitive workflows
- predictable cost
- low-latency local automation
They are less strong for:
- frontier-grade reasoning
- huge-context synthesis without retrieval
- high-stakes autonomous decisions without guardrails
A mixed strategy (fast local model + selective escalation + retrieval + validation) gives the best real-world results.