All Posts
16 min read

Running Hermes Agent on 8 GB of VRAM When the Community Says 16 GB Minimum

Measured benchmarks, the KV-cache math that makes 8 GB fit, and a Windows 11 setup kit that takes an RTX 3060 Ti from blank slate to chained tool calls against a locally-hosted 9B model in one PowerShell command.

AI Side Projects Open Source Linux

The Hermes Discord ran a poll in April 2026: 16 GB+ of VRAM is the comfortable tier for agentic work, 24 GB+ is “best.” My desk has an RTX 3060 Ti with 8 GB. I wanted to know how far below the line you can actually live, so I spent a weekend writing a Windows 11 setup kit that takes the machine from “I just downloaded a zip” to “the agent is chaining tool calls in a terminal.” Along the way I learned a lot of small things about where the real floor is — and Carnice-9b, a 9B model running on a card that was allegedly too small, surprised me more than anything I’ve tested this year.

This post is about all of that: the inference math that makes 8 GB fit at all, the Windows-specific friction that almost killed the project, and what it actually feels like when a small, locally-hosted model handles multi-step agent tasks cleanly.

What “running Hermes locally” actually means

Before any of the math is interesting, it helps to know what’s running where. Hermes Agent is Nous Research’s open-source agent framework — the TUI you interact with, the skills system, the persistent memory, the tool-calling logic. None of that does inference itself. It speaks the OpenAI Chat Completions protocol and ships every prompt to whatever endpoint you configure.

On a typical cloud setup, that endpoint is api.openai.com or api.anthropic.com and you pay per token. Going local just means swapping the endpoint for http://127.0.0.1:8080/v1 and standing up a process that speaks the same protocol on the other end. That process is llama-server, part of llama.cpp — a single binary that loads a GGUF model file, exposes an OpenAI-compatible HTTP API, and runs the transformer math on your GPU.

So the actual local stack is three pieces:

  1. The driver layer: NVIDIA driver on Windows, CUDA toolkit inside WSL, and the WSL-CUDA shim that bridges them.
  2. The inference engine: a CUDA-built llama-server binary that memory-maps a .gguf model file and exposes /v1/chat/completions on localhost.
  3. The agent: Hermes, pointed at that localhost endpoint, doing all the skills/tools/memory work itself.

Everything in the kit exists to make those three pieces stand up on a fresh Windows 11 install without you having to learn what any of them are first. But knowing what they are makes the “why” of every decision below much clearer.

Why my card shouldn’t have worked

Hermes refuses to start under 64K of context. That floor exists for a real reason: the agent constantly re-injects tool definitions, recent conversation, memory recalls, and skill metadata into every model turn. Below 64K, the agent forgets which tools it has access to mid-task. Above 64K, things mostly work. 64K is the bargain.

But 64K of context on a 9B model is where 8 GB of VRAM stops being theoretical and starts being a puzzle. The math splits into three buckets:

WhatHow big on a 9B model
Model weights (Q4_K_M)~5.2 GB
KV cache @ 64K, FP16~5.5 GB
Inference scratch + CUDA runtime~0.5 GB

That adds up to ~11.2 GB on a card that has 8 GB total — and that’s before the desktop framebuffer, browser, and Discord have claimed their few hundred megabytes. The default configuration that the llama.cpp documentation walks you through does not fit on a 3060 Ti, by a long way.

The community’s “16 GB minimum” is, in that light, completely honest.

The KV cache trick that makes 8 GB fit

The first thing you learn when you actually sit down with the numbers is that the model weights are not the problem. A 5.2 GB model on an 8 GB card leaves ~2.8 GB of headroom — plenty, if the only other costs were fixed. The problem is the KV cache: the running scratchpad of attention keys and values for every token in the context window. It scales with layers × KV heads × head dimension × context length × dtype, and that last term — the data type used to store each value — is where the war is won:

Model classFP16 KV @ 64KQ8 KVQ4 KV
9B~5.5 GB~3 GB~1.5–2 GB

That’s not a typo. Quantizing the KV cache from 16-bit floats down to 4-bit integers takes the running cost from ~5.5 GB to under 2 GB, with quality loss that’s almost imperceptible in practice for agent workloads. The shape of the math:

KV cost per token ≈ 2 (K and V) × layers × num_kv_heads × head_dim × bytes

Two things keep that number small for Qwen3.5-9B specifically — and together they’re what make the 8 GB tier viable:

  • Grouped Query Attention. Qwen3.5-9B has many more query heads than KV heads — keys and values are shared across query groups, which cuts the cache by roughly 4× versus a dense-attention architecture of the same size. Without GQA, this same 9B model at 64K would need closer to 24 GB just for KV, and no amount of clever quantization would fit it on a 3060 Ti.
  • Q4 quantization of the cache itself. Another ~4× reduction by storing 4-bit integers instead of 16-bit floats, with quality loss that’s almost imperceptible in practice for agent workloads.

The combined effect lands the running cost around ~1.5 GB at Q4 versus ~5.5 GB at FP16 — the figures in the table above.

Every llama-server invocation in the kit forces --cache-type-k q4_0 --cache-type-v q4_0. This single flag pair is the difference between OOMing on startup and the model loading cleanly with ~500 MiB of headroom for the rest of the system — tight, but real.

Flash Attention: the other half of the unlock

The other flag every script forces is --flash-attn on. Flash Attention isn’t an algorithmic change so much as a memory-access change: instead of materializing the full attention matrix (which scales with the square of the sequence length) and then doing the softmax, it streams through the matrix in tiles that fit in shared memory and never holds the full thing in VRAM at once.

For our purposes, the practical effect is that the working memory of attention computation roughly halves — and at 64K context, “roughly half” is the difference between a smooth response and an OOM mid-token. With Flash Attention off, the Q4 KV cache savings get partially eaten back by transient attention buffers and 8 GB stops fitting again.

The kit auto-detects your card at install time and picks KV precision to match: Q4 on 8 GB cards, Q8 on 10–12 GB, F16 above that. Flash Attention is always on. Nobody should have to learn the contents of this section just to launch an agent.

Why WSL2, not native Windows

Hermes ships a native Windows installer. Nous calls it “early beta.” They call the Linux installer “battle-tested.” Same machine, same GPU, same agent — the difference is which path through the codebase has been exercised by the community.

The concrete wrinkles you avoid by using WSL2:

  • Tool-call parsers: Hermes has per-model parsers that translate the model’s text output into structured tool_calls. These have been developed and tested against Linux file paths and POSIX process semantics. The Windows-native build has its own copies, and they have rougher edges.
  • Skill system: many of Hermes’s built-in skills shell out to apt-installable Unix tools — ripgrep, ffmpeg, jq, things the agent uses to read repositories, parse media, manipulate data. On native Windows those have to be installed and PATH’d separately, and the bundled Git Bash that Hermes uses to find them has quirks.
  • File watching: skills monitor for changes via POSIX inotify semantics. WSL2 has them; Win32 has different ones, and the shim is imperfect.

The cost of going through WSL2 is approximately zero. NVIDIA’s WSL-CUDA passthrough means the Windows driver does the actual GPU work, and Ubuntu inside WSL2 talks to it through a special /usr/lib/wsl driver shim. Inference performance is within a few percent of native Linux. You don’t dual-boot, you don’t lose your GPU, you don’t pay anything for the indirection.

This is also why you should never install the Linux NVIDIA driver inside WSL. It conflicts with the passthrough. The kit installs only the CUDA toolkit (cuda-nvcc-12-4, cuda-cudart-dev-12-4, etc.) — the SDK pieces llama.cpp needs at build time. The driver stays on Windows where it belongs.

The Ubuntu 24.04 booby traps

Three things changed between Ubuntu 22.04 and 24.04 that broke every Hermes install guide written in 2024 or earlier. Knowing why each one exists makes them less infuriating:

libtinfo5 is gone. Ubuntu 24.04 ships only libtinfo6, the newer ncurses library. cuda-toolkit-12-4, the metapackage everyone reaches for, transitively depends on nsight-systems-2023.4.4, which still links against libtinfo5. Installing the metapackage on 24.04 throws an unresolvable dependency error. The fix is to skip the metapackage and install only the actual pieces llama.cpp needs: the NVCC compiler, the CUDA runtime headers, the NVRTC runtime compiler, cuBLAS (for matrix multiplication), and cuRAND. Nsight is a profiler. We don’t need it.

PEP 668 is enforced. Python 3.12 in 24.04 marks the system Python as “externally managed,” meaning pip install --user errors out with a warning about clobbering OS-managed packages. The right move is pipx, which installs each Python CLI tool into its own isolated virtualenv and symlinks the entry points onto your PATH. The kit uses pipx to install the Hugging Face CLI and a few other tools that used to be pip-installed.

huggingface-cli was renamed to hf. Recent versions of the huggingface_hub package made the old binary a deprecation shim that prints a notice and exits non-zero. Any script that invoked it without checking the exit code now silently breaks. The kit’s downloaders use hf download everywhere.

None of these are hard. All three are completely unsearchable in advance: you only find them by hitting them.

Carnice-9b: the part that actually surprised me

Here is the model that made the whole project feel worth it.

Carnice-9b is a Hermes-tuned variant of Qwen3.5-9B. “Hermes-tuned” is doing a lot of work in that sentence — it means the model has been fine-tuned specifically on the tool-calling and agentic-reasoning grammar that the Hermes framework expects. Tool calls come out in the right XML envelope. Multi-step plans come out structured. The model “knows” it’s an agent in a way base instruct models don’t.

At Q4_K_M the GGUF is 5.23 GiB on disk. GGUF metadata identifies it as Carnice_V1_9B_Hermes_Stage2_Merged, base model Qwen/Qwen3.5-9B, architecture qwen35. With the kit’s Q4 KV cache and Flash Attention flags, my 3060 Ti reports 7,519 / 8,192 MiB used with the model loaded and a 64K context warmed — exactly 506 MiB of headroom, which turns out to be enough for the desktop framebuffer if you don’t have Chrome devouring VRAM in the background.

Generation speed on my machine is 52–53 tokens per second, measured across half a dozen requests including the multi-tool task below. The numbers came out remarkably consistent — predicted_per_second in the server’s own timings field stayed within 51.7 → 53.5 tok/s every run. Prompt evaluation runs much faster (40–570 tok/s depending on how much of the prompt is cached). There is no perceptible lag between “I press enter” and “the model starts replying.”

What I genuinely didn’t expect:

Tool calling actually works. The conventional wisdom for the last two years has been that you need 30B+ parameters for reliable function calling — that small models hallucinate tool names, fumble argument schemas, and forget which tools they have access to halfway through a multi-step task. Carnice-9b does not do those things. To check, I wrote a four-tool harness exposing read_file, parse_json, calculate, and write_file, and asked Carnice to: read /tmp/carnice_test/order.json, work out the price × qty × (1 + tax) total, write the result to a new file, and report back. It did the whole task in four turns / 6.8 seconds at ~53 tok/sread_filecalculate(49.99 * 3 * (1 + 0.21))write_file(...) → a natural-language summary citing the file path and the number. The final file on disk read 181.4637. The tool-call envelopes were structured correctly, the arguments matched the schemas, and it even decided to skip my redundant parse_json tool because the read_file output was already valid JSON. That last bit was the moment I started taking the model seriously.

The first-turn-versus-follow-up format drift. There is a real wrinkle worth knowing about. On the very first turn, Carnice emits tool calls in the canonical format that llama-server’s --jinja parser knows how to read, and they land in the OpenAI-compatible tool_calls field cleanly. On subsequent turns — after a tool result has been fed back — it sometimes emits the call in a slightly different XML envelope (<tool_call><function=...><parameter=...>) inside reasoning_content instead of content, and llama-server’s parser doesn’t catch that variant. This is why Hermes ships with its own client-side tool-call parsers. The model is producing the right call; the harness has to be forgiving about the format. My test script reproduces this exactly — first call: parsed natively; calls two and three: recovered by a 30-line regex parser on the harness side. Without that lenient parsing, the chain stalls at step one.

The thinking-mode gotcha. Carnice inherits Qwen3’s two-mode behavior: it reasons in a reasoning_content field for a few hundred tokens before emitting the actual response. This is a feature, not a bug — but it means short max_tokens budgets will silently fail. A request with max_tokens=80 for “say hello” never reaches the greeting; all eighty tokens get spent thinking and content comes back empty. Set max_tokens to 1500–2000 for any tool-calling request and reality reasserts itself.

The --jinja flag is doing two jobs at once. Without it, llama-server uses a default chat template that doesn’t surface the model’s thinking-mode tokens properly and doesn’t route tool calls into the structured field. I tested both: with --jinja off, the model emits 80 reasoning tokens and zero content tokens on any prompt that would normally trigger reasoning. With --jinja on, the reasoning lands in reasoning_content, the tool call lands in tool_calls, and everything downstream just works.

This is the unlock that makes the 8 GB tier viable. Without Hermes-tuned small models like Carnice, you’d need a 30B-class model that doesn’t fit on a 3060 Ti and you’d be forced to the cloud fallback. With Carnice + --jinja + a harness that catches the follow-up format drift, the local path is real.

The configuration that makes tool calling work

A few small Hermes configuration details cost me an afternoon and deserve to be documented:

“No inference provider configured” — the most confusing single error of the project. The server is running, the model is loaded, /v1/models returns the right name, Hermes connects, and then refuses to talk. The cause: Hermes treats OPENAI_API_KEY="" or OPENAI_API_KEY=none as “no provider configured.” It just needs a non-empty string. The kit sets sk-local-no-auth. Any literal works.

The model name must not have a custom/ prefix. Some older guides suggest custom/carnice-9b as the model name in Hermes. Hermes now parses the slash as a provider route and looks for a provider called “custom.” Bare carnice-9b is what you want, matching exactly what llama-server’s /v1/models endpoint reports.

--jinja is non-negotiable, and the kit ships with it. Covered in depth above; the short version is that without --jinja, llama-server falls back to a default chat template that doesn’t recognise Carnice’s reasoning/content split. The model burns its entire token budget in reasoning mode and never reaches a usable response. The start scripts pass --jinja by default; this entry exists only so that if you fork the kit and the flag goes missing, you know where to look.

The graceful exit

Not everyone is going to get a clean local run. Sometimes the model fumbles tool calls anyway. Sometimes 8 GB is just too tight for what you want to do. The kit ships with a docs/FALLBACKS.md that exists precisely so the project doesn’t end in frustration: you can keep your entire Hermes install — your skills, your memory, your TUI muscle memory — and swap the inference backend to Ollama Cloud (free tier) or OpenRouter (~$2–5/month for an hour of agent work per day) in about two minutes.

The “fully local” pursuit is mostly about privacy and learning, not money. The cloud fallback exists so that a frustrating local experience doesn’t burn the whole project to the ground — and so you can isolate “is my Hermes config the problem?” from “is my local model the problem?” with a single TUI command.

What I actually learned

The 8 GB tier is real but tight. On my 3060 Ti, Carnice + 64K context sits at 7,519 / 8,192 MiB used — 92% of card capacity, with just over 500 MiB of slack. The community’s “16 GB minimum” is honest advice for someone who wants the experience to “just work” without this kind of fiddling. It is not a hard floor — at least not at 8 GB. I haven’t tested anything smaller; if you have a 6 GB card and want to know, I’d love to read your post.

Most of the unlock comes from four configuration flags, not parameter count:

  • --cache-type-k q4_0 and --cache-type-v q4_0 — the Q4 KV cache, the single biggest VRAM win.
  • --flash-attn on — tiled attention, halves the working memory at 64K context.
  • --jinja — activates the model’s own chat template, which is what routes thinking tokens and tool calls into the right response fields.

The more interesting thing I learned is about model size. The 2024 mental model — “you need 30B+ for agent work” — is already out of date for harness-tuned small models. Carnice, on a card the Hermes community said wasn’t supposed to work, completed a four-tool task in under seven seconds with clean structured calls and a coherent final summary. The Hermes-specific fine-tuning is doing more than the parameter count would suggest. There’s probably a generalizable lesson here: as fine-tunes get more specific to the harness, the harness’s hardware requirements drop faster than the model-size scaling law would predict.

The fine print: that smooth four-tool run only stayed smooth because the test harness was lenient about an XML format drift on follow-up turns. Hermes-on-top-of-Carnice handles this for you. Raw llama-server alone does not, on every turn, and a fork that strips out Hermes’s client-side parsers will hit a wall by the third tool call.

If you have a 3060 Ti, a 4060, or any 8 GB NVIDIA card sitting in a tower somewhere, the kit is on GitHub at blugart-dev/hermes-local-setup. The whole install is one PowerShell command, with a reboot in the middle. Try Carnice. See if it surprises you the way it surprised me.

Ignacio María Muñoz Márquez

Ignacio María Muñoz Márquez

Senior Game Programmer

Related Posts