What the Creators Say
Hermes Agent, built by Nous Research, has a built-in model catalog and configuration system that defines exactly which models work best for different agent tasks. This article compiles the official recommendations from the Hermes Agent documentation, model catalog, and configuration guides.
Hermes Agent splits work between two kinds of model: a main model for reasoning and a set of auxiliary models (up to 11 slots) for side-jobs such as vision, compression, and title generation. Each can be configured independently.
Model Tiers — Official Recommendations
Hermes Agent Creator categorizes models into four tiers based on use case.
Best for Complex Reasoning & Multi-Step Tool-Calling
These are the recommended models for agent work when you need maximum intelligence. Use these as your main model.
# Best general-purpose agentic model:
anthropic/claude-sonnet-4.6
# Strong reasoning + tool calling:
openai/gpt-5.5-pro
# Huge context window:
google/gemini-3-pro-preview
# Cost-effective coder:
deepseek/deepseek-v4-pro
# Additional frontier options:
anthropic/claude-opus-4.8
openai/gpt-5.5
moonshotai/kimi-k2.6
x-ai/grok-4.3
Minimum context requirement: 64K tokens (128K recommended for optimal multi-step tool-calling workflows).
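As a rough illustration of the context thresholds above, the sketch below classifies a model's context window. It assumes "64K" and "128K" mean 65,536 and 131,072 tokens; the function name and labels are hypothetical and are not part of Hermes Agent.

```python
# Assumption: 64K = 64 * 1024 and 128K = 128 * 1024 tokens.
MIN_CONTEXT = 64 * 1024           # stated minimum
RECOMMENDED_CONTEXT = 128 * 1024  # stated recommendation


def context_rating(context_tokens: int) -> str:
    """Classify a context window against the two thresholds (hypothetical helper)."""
    if context_tokens < MIN_CONTEXT:
        return "too small"
    if context_tokens < RECOMMENDED_CONTEXT:
        return "meets minimum"
    return "recommended"


for size in (32_768, 65_536, 131_072):
    print(size, context_rating(size))
```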
Faster, Cheaper Models for Simple Tasks
Recommended for simple tasks like formatting, renaming, boilerplate generation, and auxiliary side-jobs. Use /model to switch from frontier to fast models mid-session.
openai/gpt-5.4-mini
google/gemini-3.5-flash
anthropic/claude-haiku-4.5
deepseek/deepseek-v4-flash
google/gemini-3.1-pro-preview
qwen/qwen3.7-plus
Free Models for Cost-Effective Experimentation
Available through OpenRouter and Nous Portal free tier. Great for experimentation and lightweight tasks.
openrouter/elephant-alpha
openrouter/owl-alpha
poolside/laguna-m.1:free
tencent/hy3-preview:free
nvidia/nemotron-3-super-120b-a12b:free
nvidia/nemotron-3-ultra-550b-a55b:free
inclusionai/ring-2.6-1t:free
Note: Nemotron 3 Ultra was offered free on Nous Portal June 4-18, 2026. Owl Alpha is currently free on OpenRouter — use it while it lasts.
Self-Hosted for Privacy & Zero API Costs
Hermes Agent Creator's recommended local models for running on your own hardware:
Qwen3.5-9B (Q4_K_M GGUF)
# Size: 5.3 GB · RAM: ~10 GB · Context: 128K
# Backend: llama.cpp
# Best for Apple Silicon:
Qwen3.5-9B (mlx-lm MXP4)
# Size: ~5 GB · RAM: ~12 GB
# Backend: omlx (Apple MLX) — 37% faster than llama.cpp
Additional local options: Qwen3.5-4B-MTP (minimal RAM), Qwen3.5:397b (Ollama Cloud), Qwen3-Coder:480b (Ollama Cloud), Mistral-Large-3:675b (Ollama Cloud).
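Important local model flag: Set --ctx-size 65536 (short form: -c 65536) for llama.cpp to meet the minimum context requirement. Ollama has no equivalent command-line flag; set the context length there through the num_ctx parameter or the OLLAMA_CONTEXT_LENGTH environment variable.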
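To make the llama.cpp flag concrete, here is a minimal sketch that assembles a llama-server command line with the required context size. The model path is a hypothetical placeholder, and the helper only builds the argument list; it does not start a server.

```python
import shlex

MIN_CONTEXT = 65536  # minimum context size given in the text


def llama_server_command(model_path: str, ctx_size: int = MIN_CONTEXT) -> list[str]:
    """Build an argv list for llama.cpp's llama-server (hypothetical helper)."""
    if ctx_size < MIN_CONTEXT:
        raise ValueError(f"ctx_size {ctx_size} is below the {MIN_CONTEXT}-token minimum")
    return ["llama-server", "-m", model_path, "--ctx-size", str(ctx_size)]


# The path below is a placeholder, not a file shipped with Hermes Agent.
cmd = llama_server_command("models/qwen3.5-9b-q4_k_m.gguf")
print(shlex.join(cmd))
```

Rejecting values below the minimum at build time avoids launching a server that the agent would later find too small.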
⚠️ Important: Models NOT Recommended Inside Hermes Agent
Hermes-4-70B / Hermes-4-405B — NOT for Inside Agent
Nous Research's own models are NOT recommended for use INSIDE Hermes Agent. They are frontier hybrid-reasoning models tuned for chat and reasoning, not for the rapid-fire tool-calling loop the agent relies on.
Use them for Nous Chat, research workflows, or via subscription proxy — but not as your agent's main model.
Auxiliary Models — The 11 Side-Job Slots
Hermes Agent uses auxiliary (smaller) models for side-jobs. Each has its own slot and can be overridden independently from the main model. This is where you save money.
📝 Title Gen
A cheap flash model writes session titles as well as Opus does. Suggested: google/gemini-3-flash-preview on OpenRouter.
👁️ Vision
For when the main model lacks vision: point this slot at google/gemini-2.5-flash or gpt-4o-mini for image analysis.
📦 Compression
Summarizing context on Opus burns reasoning tokens for no benefit. A fast chat model does the job at 1/50th the cost.
✅ Approval
Used when approval_mode is set to smart: a fast, cheap model (Haiku, Flash, GPT-5-mini) decides whether to auto-approve low-risk commands.
🌐 Web Extract
Worth overriding if you use web_extract heavily. Summarization doesn't need reasoning, so use a cheap flash model.
🔧 Skills Hub
Usually fine at auto (use main model). hermes skills search uses this slot.
🔌 MCP
Usually fine at auto (use main model). MCP tool routing.
🔀 Triage Specifier
A cheap, capable model works well. Routes Kanban triage specifier.
📋 Kanban Decomposer
Routes Kanban task decomposition — splits triage into child tasks.
👤 Profile Describer
Short, cheap call. Profile-description generation.
🧹 Curator
Can run for minutes on reasoning models, so a cheaper aux model is often worthwhile. Routes the curator skill-usage review pass.
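The override behavior described in this section can be sketched as a simple lookup: a slot set to auto falls back to the main model, and any other value names the model to use. This is an illustrative model of the idea, not Hermes Agent's actual resolution code. The model IDs come from this article's configuration example; the slot names skills_hub, mcp, and curator are hypothetical spellings.

```python
MAIN_MODEL = "anthropic/claude-sonnet-4.6"

# Per-slot overrides; "auto" means "use the main model".
# Slot names skills_hub and mcp are assumed spellings for illustration.
aux_overrides = {
    "title_gen": "google/gemini-3-flash-preview",
    "vision": "google/gemini-2.5-flash",
    "compression": "deepseek/deepseek-v4-flash",
    "approval": "anthropic/claude-haiku-4.5",
    "skills_hub": "auto",
    "mcp": "auto",
}


def resolve_model(slot: str, overrides: dict[str, str], main_model: str) -> str:
    """Return the model for a slot, falling back to the main model (hypothetical helper)."""
    choice = overrides.get(slot, "auto")
    return main_model if choice == "auto" else choice


for slot in ("compression", "mcp", "curator"):
    print(f"{slot}: {resolve_model(slot, aux_overrides, MAIN_MODEL)}")
```

Slots missing from the mapping, such as curator here, are treated the same as auto, so only the side-jobs worth optimizing need an entry.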
Configuration — How to Set It Up
Here's how to configure your recommended models in ~/.hermes/config.yaml:
model:
  provider: "nous"
  default: "anthropic/claude-sonnet-4.6"
  base_url: "https://inference-api.nousresearch.com/v1"
  api_mode: "chat_completions"

# Auxiliary model overrides (cost optimization)
auxiliary:
  title_gen:
    provider: "openrouter"
    model: "google/gemini-3-flash-preview"
  vision:
    provider: "openrouter"
    model: "google/gemini-2.5-flash"
  compression:
    provider: "openrouter"
    model: "deepseek/deepseek-v4-flash"
  approval:
    provider: "openrouter"
    model: "anthropic/claude-haiku-4.5"
Recommended Providers
🏆 Nous Portal — RECOMMENDED
One OAuth login covers 300+ frontier agentic models plus the Tool Gateway (web search, image generation, TTS, browser automation). 10% off token-billed providers.
hermes setup --portal
🔄 OpenRouter — Most Models
400+ models with multi-provider routing. Supports provider routing for cost/speed optimization. Set OPENROUTER_API_KEY in ~/.hermes/.env.
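Before starting a session, it can help to confirm that the key is actually set. The sketch below parses env-file text and checks for OPENROUTER_API_KEY. It assumes the file uses plain KEY=VALUE lines; the parser is a hypothetical helper rather than part of Hermes Agent, and the key value is a placeholder.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments (hypothetical helper)."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Placeholder contents standing in for ~/.hermes/.env
sample = "# OpenRouter credentials\nOPENROUTER_API_KEY=sk-or-placeholder\n"
env = parse_env(sample)
has_key = bool(env.get("OPENROUTER_API_KEY"))
print("OPENROUTER_API_KEY set:", has_key)
```

To check the real file, read it with pathlib.Path.home() / ".hermes" / ".env" and pass its text to parse_env.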
All supported providers: Nous Portal, OpenRouter, OpenAI Codex, Anthropic, Google Gemini, GitHub Copilot, DeepSeek, Alibaba/DashScope, Z.AI/GLM, Kimi/Moonshot, MiniMax, xAI/Grok, AWS Bedrock, Azure AI Foundry, NVIDIA NIM, HuggingFace, Ollama Cloud, LM Studio, and Custom Endpoints.
Cost Optimization Strategies
From the Hermes Agent Creator documentation:
- Use auxiliary models: Override auxiliary tasks with cheaper flash models. Compression can run at 1/50th the cost of the main model.
- Use fast models for simple tasks: Switch to faster models for formatting, renaming, or boilerplate generation.
- Use free tier: Free models are available through OpenRouter and the Nous Portal free tier.
- Run local models: Zero API costs with local deployment. Best for privacy and high-volume usage.
- Compress long sessions: Run /compress before hitting token limits to summarize conversation history.
- Delegate for parallel work: Use delegate_task for parallel subtasks to reduce main conversation token usage.
- Use execute_code: Write Python scripts for batch operations instead of running terminal commands one at a time.
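To put the 1/50th figure for compression in concrete terms, here is a small worked calculation. The token count and the price per million tokens are made-up illustration values, not published pricing, and the function is a hypothetical helper.

```python
def compression_cost(tokens: int, main_price_per_mtok: float,
                     aux_ratio: float = 1 / 50) -> tuple[float, float]:
    """Return (cost on the main model, cost on an auxiliary model at aux_ratio of that price)."""
    main_cost = tokens / 1_000_000 * main_price_per_mtok
    aux_cost = main_cost * aux_ratio
    return main_cost, aux_cost


# Hypothetical numbers: 2M tokens of compression at $15 per million tokens.
main_cost, aux_cost = compression_cost(2_000_000, 15.0)
print(f"main: ${main_cost:.2f}, aux: ${aux_cost:.2f}, saved: ${main_cost - aux_cost:.2f}")
```

Under these assumed numbers the compression work drops from $30.00 to $0.60, which is why the auxiliary slots are the first place to look for savings.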
Conclusion
The Hermes Agent Creator's official recommendations are clear: Claude Sonnet 4.6 is the best general-purpose agentic model, GPT-5.5 Pro for reasoning, Gemini 3 Pro for context, and DeepSeek V4 Pro for coding. For budget users, the free tier models and local Qwen3.5-9B offer excellent value.
The biggest cost savings come from the auxiliary model system: overriding compression, vision, and title generation with cheap flash models cuts the cost of those side-jobs, by as much as 50x in the case of compression. Nous Portal's bundled approach (300+ models plus the Tool Gateway, with 10% off token-billed providers) is the recommended way to run Hermes Agent.
Remember: don't use Hermes-4-70B/405B inside the agent — they're tuned for chat, not the rapid-fire tool-calling loop. Use them for Nous Chat instead.