  _____ _                          _                 _         _
 / ____| |                        | |               | |       | |
| |    | |__  _ __ _   _ ___  ___ | | __ _ _ __ ___ | |__   __| | __ _
| |    | '_ \| '__| | | / __|/ _ \| |/ _` | '_ ` _ \| '_ \ / _` |/ _` |
| |____| | | | |  | |_| \__ \ (_) | | (_| | | | | | | |_) | (_| | (_| |
 \_____|_| |_|_|   \__, |___/\___/|_|\__,_|_| |_| |_|_.__/ \__,_|\__,_|
                    __/ |
                   |___/          W R I T E U P S
*** Welcome to the Chrysolambda Writeups Archive *** Free Software *** Common Lisp *** Yellow Flags *** Truth ***
<<< Back to Index
Local LLMs on Ollama: Use Cases and Limitations (RTX 4090, 24GB VRAM)
This writeup is tailored for a local Ollama deployment running on an NVIDIA RTX 4090 (24GB).
1) What runs well on this hardware
Comfortable tier (best interactive experience)
- 7B/8B models (Q4–Q8)
- 13B/14B models (Q4–Q6)
These typically offer the best speed/quality tradeoff for daily chat, coding, and automation.
Usable but heavier
- ~30B-class quantized models (model/quant dependent)
These can work, but expect slower responses, longer warmup, and reduced concurrency.
Edge case / experimental
- 70B+ class locally on 24GB via aggressive quantization and/or CPU offload
Generally possible for experiments, but not ideal for responsive daily usage.
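The tiers above follow from simple arithmetic: weight memory is roughly parameter count times bits-per-weight divided by 8, before KV cache, CUDA context, and desktop usage add several more GiB. A back-of-envelope sketch (the bits-per-weight figures are illustrative quant averages, not measured numbers):

```python
def weight_gib(params_b: float, bits: float) -> float:
    """Rough weight-only memory estimate in GiB: params * bits / 8 bytes."""
    return params_b * 1e9 * bits / 8 / 2**30

# Illustrative fits against a 24 GiB card (weights only).
for name, params, bits in [("8B @ ~Q4", 8, 4.5),
                           ("14B @ ~Q5", 14, 5.5),
                           ("32B @ ~Q4", 32, 4.5),
                           ("70B @ ~Q4", 70, 4.5)]:
    print(f"{name}: ~{weight_gib(params, bits):.1f} GiB")
```

An 8B model at ~4.5 bits lands around 4 GiB and a 14B around 9 GiB, leaving headroom for context; a 70B at the same quant is well past 24 GiB, which is why it needs CPU offload or harsher quantization.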
2) Model families commonly used in Ollama
- Llama-family instruct models: solid general assistant baseline.
- Mistral / Mixtral: good instruction and coding utility at smaller sizes.
- Qwen family: strong multilingual and coding performance.
- DeepSeek variants: good coding value in mid-size ranges.
- Code-focused models: strong for generation/refactors/tests, weaker as broad generalists.
3) Best use cases for local Ollama models
A) Coding assistant
Good for:
- function/class generation
- unit tests
- refactors
- stack trace explanation
- shell/CI snippets
Limitation:
- weaker at large repo-wide reasoning without retrieval/indexing.
B) Private document chat (RAG)
Good for:
- Q&A over local notes/docs/repos
- summarization of internal documents
- side-by-side doc comparison
Limitation:
- answer quality depends heavily on the retrieval pipeline (chunking, embeddings, ranking).
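The retrieval side usually starts with chunking. A minimal fixed-size chunker with overlap, so sentences cut at a boundary still appear whole in a neighboring chunk (sizes are illustrative; real pipelines often split on sentences or headings instead):

```python
def chunk_text(text: str, size: int = 400, overlap: int = 80) -> list:
    """Split text into fixed-size character chunks with overlapping edges."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    chunks = []
    step = size - overlap  # how far the window advances each iteration
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

Each chunk is then embedded and indexed; at question time the top-scoring chunks are pasted into the prompt as context.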
C) Workflow automation backend
Good for:
- classification/triage
- extraction to structured JSON
- template-based drafting
Limitation:
- tool-calling reliability varies; use schema validation and retry logic.
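The validate-and-retry loop for JSON extraction can be sketched like this (the schema and the stubbed model are illustrative; swap the stub for a real Ollama call):

```python
import json

REQUIRED_KEYS = {"category", "priority"}  # example schema, adjust to taste

def validate(raw):
    """Accept model output only if it parses as JSON with the right keys."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if isinstance(obj, dict) and REQUIRED_KEYS <= obj.keys():
        return obj
    return None

def extract(generate, prompt, retries=3):
    """Call the model, validate its output, and retry on malformed JSON."""
    for _ in range(retries):
        result = validate(generate(prompt))
        if result is not None:
            return result
    raise ValueError(f"no valid JSON after {retries} attempts")

# Stub model: fails once, then returns valid JSON (stands in for Ollama).
replies = iter(['not json at all', '{"category": "bug", "priority": 2}'])
print(extract(lambda p: next(replies), "Classify: app crashes on start"))
# → {'category': 'bug', 'priority': 2}
```

The retry loop is cheap insurance: small local models emit malformed JSON often enough that unvalidated output should never feed an automated pipeline.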
D) Offline personal assistant
Good for:
- brainstorming
- drafting
- note cleanup
- local-first workflows
Limitation:
- hallucinations still occur; verification steps are required.
4) Core limitations to expect
1. Reasoning gap vs frontier cloud models
- hard multi-step planning and deep codebase reasoning remain weaker.
2. Long context has practical costs
- throughput drops as context increases; chunking/RAG remains essential.
3. Hallucination risk
- confident wrong answers still happen; add checks/tests/citations.
4. Quantization tradeoffs
- lower-bit quants improve fit/speed but can reduce precision/stability.
5. Concurrency bottlenecks
- one GPU can become contested quickly under multi-user or agent load.
6. Operational burden
- local hosting means you own model updates, routing policies, and regressions.
5) Practical deployment strategy (recommended)
For a robust local setup on a 4090:
1. Primary general model: strong 8B–14B instruct model
2. Primary coding model: strong 7B–14B coder model
3. Optional quality model: larger quantized model for difficult prompts
4. Embedding model: lightweight local embedding model for RAG
5. Routing policy: fast-by-default, escalate on low confidence/failure
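The routing policy in step 5 can be sketched as follows (the stub models and confidence heuristic are illustrative; a real setup might score confidence from log-probabilities or a self-check prompt):

```python
def route(prompt, fast_model, strong_model, threshold=0.7):
    """Fast-by-default routing: try the small model first, escalate to the
    larger quantized model when confidence falls below the threshold."""
    answer, confidence = fast_model(prompt)
    if confidence >= threshold:
        return answer, "fast"
    answer, _ = strong_model(prompt)
    return answer, "escalated"

# Stubs standing in for Ollama calls; each returns (answer, confidence).
fast = lambda p: ("short answer", 0.4 if "hard" in p else 0.9)
strong = lambda p: ("careful answer", 0.95)

print(route("easy question", fast, strong))   # → ('short answer', 'fast')
print(route("hard question", fast, strong))   # → ('careful answer', 'escalated')
```

Keeping the threshold conservative means the big model only spins up for the minority of prompts that need it, which preserves interactive latency on the single GPU.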
6) Bottom line
Local Ollama models are excellent for:
- privacy-sensitive workflows
- predictable cost
- low-latency local automation
They are less strong for:
- frontier-grade reasoning
- huge-context synthesis without retrieval
- high-stakes autonomous decisions without guardrails
A mixed strategy (fast local model + selective escalation + retrieval + validation) gives the best real-world results.