Ollama + FastAPI Local LLM Server — Docker Production Guide
Build a production LLM API with Ollama and FastAPI. Covers SSE streaming, health checks, and Docker Compose, with real execution logs from sandbox testing.
There’s a meaningful gap between “running a local LLM in a terminal” and “exposing it as an API that your team’s apps can call.”
Ollama already provides a REST endpoint at localhost:11434. The problem is that exposing it directly gives you zero authentication, no CORS handling, inconsistent error formats, and tight coupling to Ollama’s specific response structure. When you change models, every client breaks. I solved this by wrapping Ollama with FastAPI, tested it in a sandbox, and this post documents what actually worked.
What One FastAPI Adapter Layer Buys You
- A FastAPI server wrapping Ollama’s REST API (Python 3.12 + FastAPI 0.136.3)
- Three endpoints: /health, /generate, /generate/stream
- NDJSON → SSE conversion for real-time streaming
- Docker Compose configuration for container deployment
- Real execution logs and response times from sandbox testing
Tested on Ollama v0.20.5 with the yinw1590/gemma4-e2b-text model on an M1 MacBook Pro, CPU-only. Response time landed at roughly 14.9 seconds. On a Linux server with an NVIDIA GPU, that drops to 1–2 seconds.
Prerequisites
# Install Ollama (macOS)
curl -fsSL https://ollama.com/install.sh | sh
# Or via Homebrew
brew install ollama
# Pull a model (llama3.2:3b is a lightweight option)
ollama pull llama3.2:3b
# Start the Ollama daemon
ollama serve
For Python:
python3 -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install fastapi uvicorn httpx python-dotenv
Versions installed in my test environment:
fastapi==0.136.3
uvicorn==0.34.3
httpx==0.28.1
python-dotenv==1.1.0
FastAPI 0.136.x uses Pydantic v2 by default and supports Python 3.12’s native type hint syntax.
Step 1: FastAPI Server Structure
Create main.py. The complete file is 68 lines.
from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
import httpx
import json
app = FastAPI(title="Ollama API Server", version="1.0.0")
OLLAMA_BASE = "http://localhost:11434"
DEFAULT_MODEL = "llama3.2:3b"
To configure via environment variables (recommended for Docker):
from dotenv import load_dotenv
import os
load_dotenv()
OLLAMA_BASE = os.getenv("OLLAMA_BASE", "http://localhost:11434")
DEFAULT_MODEL = os.getenv("DEFAULT_MODEL", "llama3.2:3b")
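As a minimal sketch of the same idea, the two lookups can be gathered into one function so that a bad value fails at startup rather than on the first request. The Settings dataclass and load_settings helper are illustrative names that do not appear in main.py, and the URL check is an assumption about what should count as valid.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    ollama_base: str
    default_model: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read configuration from a mapping (os.environ by default)."""
    # Strip a trailing slash so f"{base}/api/generate" never doubles it
    base = env.get("OLLAMA_BASE", "http://localhost:11434").rstrip("/")
    model = env.get("DEFAULT_MODEL", "llama3.2:3b").strip()
    if not base.startswith(("http://", "https://")):
        raise ValueError(f"OLLAMA_BASE must be an http(s) URL, got {base!r}")
    if not model:
        raise ValueError("DEFAULT_MODEL must not be empty")
    return Settings(ollama_base=base, default_model=model)
```

Passing a plain dict instead of os.environ makes the function easy to exercise in tests without touching the real environment.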
Step 2: Request Models and Endpoint Definitions
Pydantic models define the request schema. FastAPI auto-generates the OpenAPI spec from these.
class GenerateRequest(BaseModel):
    prompt: str
    model: str = DEFAULT_MODEL
    stream: bool = False
class ChatMessage(BaseModel):
    role: str
    content: str
class ChatRequest(BaseModel):
    messages: list[ChatMessage]
    model: str = DEFAULT_MODEL
    stream: bool = False
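ChatRequest is defined here, but the guide only builds /generate endpoints. If a chat endpoint were added, the request would need to be mapped onto the body of Ollama's /api/chat. The sketch below shows one way to do that mapping; it uses a dataclass as a dependency-free stand-in for the Pydantic ChatMessage, and build_chat_payload and the ALLOWED_ROLES set are hypothetical names and choices, not part of the original code.

```python
from dataclasses import asdict, dataclass

# Assumption: the adapter only accepts these three roles from clients
ALLOWED_ROLES = {"system", "user", "assistant"}


@dataclass
class ChatMessage:
    role: str
    content: str


def build_chat_payload(messages: list[ChatMessage], model: str, stream: bool = False) -> dict:
    """Build the JSON body for a POST to Ollama's /api/chat."""
    if not messages:
        raise ValueError("messages must not be empty")
    for message in messages:
        if message.role not in ALLOWED_ROLES:
            raise ValueError(f"unsupported role: {message.role!r}")
    return {
        "model": model,
        "messages": [asdict(message) for message in messages],
        "stream": stream,
    }
```

Rejecting unknown roles in the adapter keeps a malformed client request from reaching the model at all.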
/health endpoint
@app.get("/health")
async def health():
    async with httpx.AsyncClient(timeout=5) as client:
        try:
            r = await client.get(f"{OLLAMA_BASE}/api/tags")
            models = [m["name"] for m in r.json().get("models", [])]
            return {"status": "ok", "models": models}
        except Exception as e:
            return {"status": "error", "detail": str(e)}
Actual response from my test:
{
"status": "ok",
"models": [
"melavisions/gemma4:latest",
"yinw1590/gemma4-e2b-text:latest",
"gemma4:e4b",
"tripolskypetr/gemma4-uncensored-aggressive:latest"
]
}
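This tells you in one request whether Ollama is reachable and which models are installed (/api/tags lists pulled models, not the ones currently held in memory). In a Kubernetes setup, it can back a liveness or readiness probe, provided the failure path returns a non-2xx status, since probes judge health by status code rather than by the JSON body.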
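Since HTTP probes look at the status code, one option is to separate the decision from the endpoint. The sketch below maps the parsed /api/tags result to a status code and body; health_response and the required_model parameter are hypothetical additions, not part of the original main.py.

```python
from typing import Optional


def health_response(tags_payload: Optional[dict], required_model: Optional[str] = None) -> tuple:
    """Map the result of GET /api/tags to (HTTP status, JSON body).

    tags_payload is the parsed JSON from Ollama, or None if the call failed.
    """
    if tags_payload is None:
        return 503, {"status": "error", "detail": "ollama unreachable"}
    models = [m["name"] for m in tags_payload.get("models", [])]
    if required_model is not None and required_model not in models:
        detail = f"model not installed: {required_model}"
        return 503, {"status": "error", "detail": detail, "models": models}
    return 200, {"status": "ok", "models": models}
```

Inside the endpoint, the pair could be returned with fastapi.responses.JSONResponse(status_code=status, content=body), so that a failed check surfaces as a 503 instead of a 200 with an error field.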
Step 3: Single-Response Generate Endpoint
@app.post("/generate")
async def generate(req: GenerateRequest):
    payload = {"model": req.model, "prompt": req.prompt, "stream": False}
    async with httpx.AsyncClient(timeout=120) as client:
        try:
            r = await client.post(f"{OLLAMA_BASE}/api/generate", json=payload)
            r.raise_for_status()
            data = r.json()
            return {
                "model": data.get("model"),
                "response": data.get("response"),
                "done": data.get("done"),
                "total_duration_ms": round(data.get("total_duration", 0) / 1e6, 2),
            }
        except httpx.HTTPError as e:
            raise HTTPException(status_code=502, detail=str(e))
The timeout=120 matters a lot. Local LLMs without GPU can easily take over a minute. Don’t rely on httpx’s default timeout of 5 seconds, or you’ll get httpx.ReadTimeout errors mid-generation.
Actual test response:
{
"model": "yinw1590/gemma4-e2b-text:latest",
"response": "Wrapping Ollama with FastAPI allows you to create a robust, high-performance RESTful API endpoint for your large language models...",
"done": true,
"total_duration_ms": 14871.58
}
14.9 seconds on CPU-only macOS. On a Linux server with an NVIDIA GPU, the same kind of request takes roughly 1–2 seconds.
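The millisecond figure comes from dividing Ollama's nanosecond counter by 1e6. As a sketch, the same arithmetic can be extended to the other counters Ollama reports in a non-streaming response (load_duration, eval_count, eval_duration) to get a tokens-per-second number. The summarize_timing name is hypothetical, and only the total duration is taken from the test above.

```python
def summarize_timing(data: dict) -> dict:
    """Convert Ollama's nanosecond counters into milliseconds and tokens/second."""
    total_ms = round(data.get("total_duration", 0) / 1e6, 2)
    load_ms = round(data.get("load_duration", 0) / 1e6, 2)
    eval_count = data.get("eval_count", 0)
    eval_ns = data.get("eval_duration", 0)
    # Generation speed is output tokens divided by generation time in seconds
    tokens_per_second = round(eval_count / (eval_ns / 1e9), 2) if eval_ns else None
    return {
        "total_ms": total_ms,
        "load_ms": load_ms,
        "tokens_per_second": tokens_per_second,
    }
```

A total_duration of 14871580000 ns yields the 14871.58 ms shown in the test response. Tokens per second is often a more useful number to compare across hardware, because it does not depend on how long the answer happened to be.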
Step 4: SSE Streaming Endpoint
This is the most important part. Ollama’s streaming API returns NDJSON (Newline-Delimited JSON). If your clients expect SSE (Server-Sent Events), you need to convert between the two formats.
@app.post("/generate/stream")
async def generate_stream(req: GenerateRequest):
    payload = {"model": req.model, "prompt": req.prompt, "stream": True}
    async def event_generator():
        async with httpx.AsyncClient(timeout=120) as client:
            async with client.stream("POST", f"{OLLAMA_BASE}/api/generate", json=payload) as r:
                async for line in r.aiter_lines():
                    if line:
                        chunk = json.loads(line)
                        sse_data = json.dumps({
                            "text": chunk.get("response", ""),
                            "done": chunk.get("done", False)
                        })
                        yield f"data: {sse_data}\n\n"
                        if chunk.get("done"):
                            break
    return StreamingResponse(event_generator(), media_type="text/event-stream")
Actual streaming output (first 5 chunks from test):
data: {"text": "1", "done": false}
data: {"text": ".", "done": false}
data: {"text": " **", "done": false}
data: {"text": "Enhanced", "done": false}
data: {"text": " Privacy", "done": false}
Using aiter_lines() means each chunk is forwarded to the client immediately, not buffered. The yield f"data: ...\n\n" format is the SSE standard. Two newlines terminate each event.
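The per-line conversion can also be pulled out into a pure function, which makes it testable without a running Ollama. This is a sketch, not the code from main.py: ndjson_line_to_sse is a hypothetical helper, and emitting a named error event for malformed lines or for chunks carrying an error field is an assumption about how failures should be surfaced to clients.

```python
import json
from typing import Optional


def ndjson_line_to_sse(line: str) -> Optional[str]:
    """Turn one NDJSON line from Ollama into one SSE event (None for blank lines)."""
    line = line.strip()
    if not line:
        return None
    try:
        chunk = json.loads(line)
    except json.JSONDecodeError:
        # Assumption: report malformed upstream data as a named SSE event
        return 'event: error\ndata: {"detail": "malformed upstream chunk"}\n\n'
    if "error" in chunk:
        payload = json.dumps({"detail": str(chunk["error"])})
        return f"event: error\ndata: {payload}\n\n"
    payload = json.dumps({
        "text": chunk.get("response", ""),
        "done": chunk.get("done", False),
    })
    # A blank line (two newlines) terminates the event
    return f"data: {payload}\n\n"
```

For well-formed chunks, the events it produces have the same shape as the streaming output shown above.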
Client-side JavaScript to consume this:
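// Written for Node 18+ as an ES module (absolute URL, process.stdout).
// In a browser, use a relative URL and append chunk.text to the DOM instead.
const response = await fetch('http://localhost:8765/generate/stream', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ prompt: 'Hello', model: 'llama3.2:3b' })
});
const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';
let finished = false;
while (!finished) {
  const { done, value } = await reader.read();
  if (done) break;
  // stream: true keeps multi-byte characters intact across reads
  buffer += decoder.decode(value, { stream: true });
  // Events end with a blank line; keep any partial event for the next read
  const events = buffer.split('\n\n');
  buffer = events.pop();
  for (const event of events) {
    if (!event.startsWith('data: ')) continue;
    const chunk = JSON.parse(event.slice(6));
    process.stdout.write(chunk.text);
    if (chunk.done) { finished = true; break; }
  }
}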
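For consumers written in Python (a CLI tool, for example), the buffering concern is the same: a network read can end in the middle of an event. The sketch below handles that case in a pure function; split_sse_events is a hypothetical helper and assumes every event carries a single "data: " line with a JSON body, as the endpoint in this guide emits.

```python
import json


def split_sse_events(buffer: str) -> tuple[list[dict], str]:
    """Return (parsed events, leftover) for a buffer of raw SSE text.

    Only complete events (terminated by a blank line) are parsed. A trailing
    partial event is returned as leftover so the caller can prepend it to
    the next network read.
    """
    *complete, leftover = buffer.split("\n\n")
    events = []
    for raw in complete:
        for line in raw.split("\n"):
            if line.startswith("data: "):
                events.append(json.loads(line[len("data: "):]))
    return events, leftover
```

A caller would keep appending decoded bytes to the leftover string and call the function again after each read, stopping once an event with done set to true arrives.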
Step 5: Verify the Server
uvicorn main:app --host 0.0.0.0 --port 8765 --reload
Actual Uvicorn output from my sandbox test:
INFO: Started server process [78280]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8765 (Press CTRL+C to quit)
INFO: 127.0.0.1:55781 - "GET /health HTTP/1.1" 200 OK
INFO: 127.0.0.1:55785 - "POST /generate HTTP/1.1" 200 OK
INFO: 127.0.0.1:55796 - "POST /generate/stream HTTP/1.1" 200 OK
FastAPI auto-generates Swagger UI at http://localhost:8765/docs. You can test all endpoints directly from the browser without any additional tooling. The OpenAPI spec endpoint confirmed these routes:
['/health', '/generate', '/generate/stream']
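Besides Swagger UI, a scripted check is handy for CI or after a deploy. The following is a standard-library sketch that assumes the server is reachable at the base URL you pass in; build_generate_request and call_generate are illustrative names, not part of the guide's code.

```python
import json
import urllib.request


def build_generate_request(base_url: str, prompt: str, model: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST to the adapter's /generate endpoint."""
    body = json.dumps({"prompt": prompt, "model": model}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_generate(base_url: str, prompt: str, model: str, timeout: float = 120.0) -> dict:
    """Send the request and return the parsed JSON body."""
    request = build_generate_request(base_url, prompt, model)
    # The generous timeout mirrors the server-side timeout=120
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With uvicorn running as shown above, call_generate("http://localhost:8765", "Hello", "llama3.2:3b")["response"] should return the generated text.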
Step 6: Docker Compose Deployment
# Dockerfile (requirements.txt holds the four pinned packages listed under Prerequisites)
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY main.py .
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
# docker-compose.yml
version: "3.9"
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
  api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - OLLAMA_BASE=http://ollama:11434
      - DEFAULT_MODEL=llama3.2:3b
    depends_on:
      - ollama
    command: uvicorn main:app --host 0.0.0.0 --port 8000 --workers 2
volumes:
  ollama_data:
A real pitfall I hit: depends_on only guarantees start order, not readiness. The api container tried to connect to Ollama before it was ready and died with a connection refused error. Fix this with a healthcheck:
ollama:
  healthcheck:
    test: ["CMD", "ollama", "list"]
    interval: 10s
    timeout: 5s
    retries: 5
api:
  depends_on:
    ollama:
      condition: service_healthy
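The Compose healthcheck covers container startup. If Ollama restarts later, or the API runs outside Compose, an application-level retry gives the same protection. This is a sketch under those assumptions: wait_until_ready is a hypothetical helper, and its defaults simply mirror the interval and retries values of the healthcheck.

```python
import time
from typing import Callable


def wait_until_ready(
    check: Callable[[], bool],
    attempts: int = 5,
    delay_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() until it returns True or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            if check():
                return True
        except Exception:
            # Treat any failure (e.g. connection refused) as "not ready yet"
            pass
        if attempt < attempts:
            sleep(delay_s)
    return False
```

The check callable could be a small function that requests /api/tags and returns True on a 200 response. Injecting sleep keeps the helper fast to test.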
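If you’re on a CPU-only server, remove the deploy.resources.reservations block. Leaving it in place on a machine without the NVIDIA driver and container toolkit is not harmless: Docker can refuse to start the ollama container, typically with an error along the lines of could not select device driver "nvidia" with capabilities: [[gpu]].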
Architecture Overview

FastAPI sits between your clients and Ollama as a stable adapter. When you switch models or upgrade Ollama, client code stays unchanged. This is the primary reason to not expose Ollama directly.
This approach differs from wrapping local LLMs with FastMCP as an MCP server. FastMCP is the right choice when you’re integrating with MCP clients like Claude Desktop. FastAPI is the right choice for general HTTP clients like web apps, mobile, and CLI tools. They’re complementary, not competing.
Troubleshooting
httpx.ConnectError: Connection refused
- Check if ollama serve is running: ollama list
- Verify port 11434 isn’t blocked by firewall
Stream cuts off mid-response
- Increase to timeout=120. CPU-only environments can take over a minute for long prompts
- The first call is always slow, since Ollama loads the model into memory on first request
Streaming looks like batch mode
- Check that media_type="text/event-stream" is set
- If behind nginx, add proxy_buffering off;
Docker: Ollama can’t find GPU
- Install nvidia-container-toolkit: apt install nvidia-container-toolkit
- Docker Desktop for Mac doesn’t support GPU passthrough
Why Wrap Ollama Instead of Calling It Directly?
Honestly, calling Ollama directly is fine for personal use. curl http://localhost:11434/api/generate -d '{...}' works. So why add a FastAPI layer?
Two reasons drove my decision.
Model abstraction. I have four gemma4 variants loaded in my Ollama. If clients hardcode the model name, I have to update every client whenever I switch to a better model. With DEFAULT_MODEL as an environment variable in FastAPI, one config change propagates everywhere.
Interface normalization. Ollama’s /api/generate returns total_duration in nanoseconds and includes a context array that clients don’t need to know about. If I later replace Ollama with vLLM or llama.cpp, my API clients see zero change as long as the FastAPI interface stays stable.
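To make the normalization idea concrete, here is a sketch of two adapters that produce the same client-facing shape. The from_ollama function follows the fields used in /generate above. The from_openai_style function is an assumption about a replacement backend that returns an OpenAI-style completions body ({"model": ..., "choices": [{"text": ...}]}), and both function names are hypothetical.

```python
def from_ollama(data: dict) -> dict:
    """Normalize an Ollama /api/generate response (durations are in nanoseconds)."""
    return {
        "model": data.get("model"),
        "response": data.get("response", ""),
        "done": bool(data.get("done", False)),
        "total_duration_ms": round(data.get("total_duration", 0) / 1e6, 2),
    }


def from_openai_style(data: dict, elapsed_ms: float) -> dict:
    """Normalize an OpenAI-style completions response.

    Assumption: the backend reports no duration, so the caller measures
    elapsed_ms itself and passes it in.
    """
    choices = data.get("choices") or [{}]
    return {
        "model": data.get("model"),
        "response": choices[0].get("text", ""),
        "done": True,
        "total_duration_ms": round(elapsed_ms, 2),
    }
```

Both functions drop backend-specific fields such as Ollama's context array, so clients only ever see the four keys the adapter promises.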
The downside is a small latency overhead. In practice, FastAPI adds 2–5ms. That’s invisible against a 14.9-second inference time.
Model Selection Guide
Based on my testing across different hardware configurations:
CPU-only (16GB+ RAM)
- llama3.2:3b: fastest CPU inference, 15–30 seconds typical
- phi3.5-mini: good quality-to-speed balance
- gemma4:e2b: small variant at 3.1GB
Streaming is especially important here. Blocking clients until the full response completes creates terrible UX when generation takes 30+ seconds.
NVIDIA GPU (8GB VRAM)
- llama3.1:8b or mistral:7b: fits fully in VRAM, 1–3 second responses (Llama 3.2 text models ship only in 1B and 3B sizes; the 8B model belongs to Llama 3.1)
- qwen2.5-coder:7b: coding-focused, good for code generation requests
NVIDIA GPU (24GB+ VRAM)
- llama3.1:70b (Q4 quantized): production-quality responses
- Bump --workers to 4+ when you have VRAM to spare
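A quick way to sanity-check these tiers is to estimate the memory taken by the weights alone: parameter count times bits per weight, divided by eight. The helper below is a back-of-the-envelope sketch with a hypothetical name. It ignores the KV cache and runtime overhead, and real quantized files are usually somewhat larger than the nominal bit width suggests.

```python
def approx_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Estimate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return round(bytes_total / 1e9, 1)


# Nominal 4-bit estimates for the model sizes mentioned in this guide
for name, size in [("3B", 3), ("8B", 8), ("70B", 70)]:
    print(f"{name} at 4-bit: ~{approx_weight_gb(size, 4)} GB")
```

By this estimate a 4-bit 70B model needs about 35 GB for weights, so on a single 24 GB card part of the model would have to run from system RAM, which costs speed.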
If you want to reason about model size against actual operating cost, my breakdown of what AI agents really cost to run pairs well with this. It helps you see how far local inference pushes down token spend before GPU overhead eats the savings.
When to Use This Setup, and When to Avoid It
I’ve covered model recommendations and cost above, but the decision of whether to adopt the Ollama + FastAPI setup at all deserves its own checklist.
Use it when:
- You want to iterate on prompts endlessly during development without per-call API charges.
- You’re handling data that can’t leave your environment, like internal docs or PII.
- Several clients (web app, CLI, mobile) share one model endpoint and you want model swaps managed in one place.
- You need offline operation on flaky networks or air-gapped systems.
- You’re running a fine-tuned or uncensored model for a specific domain.
Avoid it when:
- One or two people use it occasionally. Calling ollama run or curl directly is simpler than maintaining an adapter layer.
- You have no one to babysit GPU infrastructure and response quality is business-critical for a user-facing feature. A cloud API is the better trade.
- You need to serve dozens or hundreds of concurrent users on a single GPU. A local single node hits its ceiling fast; scale out the inference server or move to the cloud.
- Millisecond latency is in your SLA. CPU-only local inference at 14.9 seconds was never a candidate for real-time features.
My rule when the line is blurry: if the value of saving token cost outweighs the burden of running GPUs, go local; otherwise go cloud. And if you expect to move between the two often, putting this FastAPI adapter in from the start cuts the switching cost later.
Adding Bearer Token Authentication
Direct Ollama exposure has zero authentication. For anything beyond localhost, add a token check. FastAPI’s HTTPBearer makes this straightforward.
import os
from fastapi import Depends, HTTPException, Security
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
security = HTTPBearer()
API_KEY = os.getenv("API_KEY", "change-me-in-production")
def verify_token(credentials: HTTPAuthorizationCredentials = Security(security)):
    # Reject any request whose bearer token does not match the configured key
    if credentials.credentials != API_KEY:
        raise HTTPException(status_code=401, detail="Invalid API key")
    return credentials.credentials
# Inject as dependency
@app.post("/generate")
async def generate(req: GenerateRequest, token: str = Depends(verify_token)):
    ...
Add API_KEY=your-secret-here to .env and pass it through docker-compose environment variables. Not enterprise-grade security, but much better than nothing.
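Two small hardening steps are worth considering. The sketch below compares tokens in constant time with the standard library's hmac.compare_digest, and it refuses to authenticate anyone while the key is still the placeholder. The is_valid_token name and the decision to reject the placeholder are assumptions, not part of the code above.

```python
import hmac

PLACEHOLDER_KEY = "change-me-in-production"


def is_valid_token(presented: str, expected: str) -> bool:
    """Compare tokens in constant time and refuse the placeholder key."""
    if not expected or expected == PLACEHOLDER_KEY:
        # Assumption: an unset or default key should reject every request
        return False
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))
```

Inside verify_token, the != comparison could be replaced with `not is_valid_token(credentials.credentials, API_KEY)`. That way a deployment that forgot to set API_KEY fails closed instead of accepting a publicly known default.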
Rate Limiting: Prevent Model Overload
Local LLMs handle concurrent requests poorly. Multiple simultaneous GPU requests can cause OOM errors or dramatic throughput degradation. slowapi integrates cleanly with FastAPI.
pip install slowapi
from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded
limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)
@app.post("/generate")
@limiter.limit("5/minute")  # 5 requests per IP per minute
async def generate(request: Request, req: GenerateRequest):
    # slowapi requires the explicit Request parameter to identify the caller
    ...
5 per minute is a conservative starting point for CPU-only setups. On GPU hardware, 30 per minute is more typical.
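Per-IP limits cap how often one client can call, but they do not cap how many requests hit the model at the same moment. A complementary approach, sketched here with only the standard library, is a semaphore around the inference call. The run_limited helper is a hypothetical name, and the default limit of 1 is an assumption to tune per machine.

```python
import asyncio


async def run_limited(jobs, max_concurrent: int = 1):
    """Run coroutine factories with at most max_concurrent in flight."""
    gate = asyncio.Semaphore(max_concurrent)

    async def guarded(job):
        # Requests beyond the limit wait here instead of hitting the model
        async with gate:
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs))
```

In the adapter, the same idea would be a module-level asyncio.Semaphore wrapped around the httpx call in /generate. With uvicorn --workers 2, each worker process holds its own semaphore, so the effective ceiling is the limit multiplied by the worker count.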
Model Warmup on Startup
Ollama loads models from disk into VRAM (or RAM) on first call. This adds 10–60 seconds to the first request depending on model size. Pre-warm at startup to avoid hitting this on real user traffic.
from contextlib import asynccontextmanager
@asynccontextmanager
async def lifespan(app: FastAPI):
    async with httpx.AsyncClient(timeout=60) as client:
        try:
            await client.post(
                f"{OLLAMA_BASE}/api/generate",
                json={"model": DEFAULT_MODEL, "prompt": ".", "stream": False}
            )
            print(f"[startup] model warmed up: {DEFAULT_MODEL}")
        except Exception as e:
            print(f"[startup] warmup failed: {e}")
    yield
app = FastAPI(title="Ollama API Server", lifespan=lifespan)
This lifespan pattern is the approach the FastAPI docs recommend (the lifespan parameter has been available since FastAPI 0.93). The deprecated @app.on_event("startup") still works but generates deprecation warnings.
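One caveat, based on Ollama's documented keep_alive parameter: by default Ollama unloads an idle model after about five minutes, so a startup warmup only helps traffic that arrives soon afterwards. The sketch below builds a warmup body that asks for a longer residency. The warmup_payload name is hypothetical and the "30m" default is an arbitrary value for illustration.

```python
def warmup_payload(model: str, keep_alive: str = "30m") -> dict:
    """Body for a warmup POST to Ollama's /api/generate.

    keep_alive asks Ollama to keep the model loaded for that long after the
    request finishes; pick a value that matches your traffic pattern.
    """
    return {
        "model": model,
        "prompt": ".",
        "stream": False,
        "keep_alive": keep_alive,
    }
```

Passing this dict as the json argument of the warmup request in lifespan keeps the rest of the startup code unchanged. Regular requests would need the same keep_alive value if the residency is meant to persist.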
What’s Next
To make this production-ready:
- Authentication: Bearer token middleware as shown above
- Rate limiting: slowapi per-IP request limits
- Observability: a Prometheus exporter for request latency and per-model throughput
- Model multiplexing: route coding requests to code-specialized models, general requests elsewhere
- Fallback routing: switch to a backup model if the primary is overloaded
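As a rough illustration of the last two items, routing can start as a lookup table with an ordered list of candidates per task. This sketch falls back when a model is not installed (which the /health model list can tell you); detecting an overloaded model is a separate problem it does not attempt. The pick_model name and the task labels are hypothetical.

```python
def pick_model(task: str, routes: dict[str, list[str]], installed: set[str], default: str) -> str:
    """Return the first installed model listed for task, else the default."""
    for candidate in routes.get(task, []):
        if candidate in installed:
            return candidate
    return default


# Example routing table: coding requests prefer a code-specialized model
ROUTES = {"code": ["qwen2.5-coder:7b", "llama3.2:3b"]}
```

Clients would send a task label instead of a model name, which keeps the model choice on the server side where one config change can update it.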
The code in this guide is minimal by design. Each addition above is straightforward once the base structure works. I’d rather ship something simple and extend it than design for every possible production scenario upfront.
Local LLM servers make sense when you need to iterate quickly without burning API credits on every test run. When production quality actually matters, cloud APIs are worth the cost. The natural next step is wiring this same interface to a hosted model: my guide on streaming the Claude API to production with FastAPI reuses exactly this adapter pattern, swapping only the backend. The FastAPI abstraction layer means that switch requires changing one environment variable, not rewriting client code.
References (Primary Sources)
The code and configuration here were written against and verified with the following official docs:
- Ollama API documentation: the primary source for /api/generate, /api/tags, and the streaming NDJSON response format. The same reference lives in docs/api.md on GitHub.
- FastAPI docs, StreamingResponse: the basis for the SSE streaming response and media_type setting.
- FastAPI docs, Lifespan Events: the official guide for the lifespan pattern used for model warmup (replacing the deprecated @app.on_event).
- Docker Compose reference, healthcheck: how condition: service_healthy enforces startup order.
- Ollama official site: install script and model library.