A 2026 joint analysis by SentinelOne and Censys scanned the public internet for 293 days and found 175,000 unique Ollama instances exposed across 130 countries, most with no authentication and no firewall protection1. Many had tool-calling capabilities enabled, meaning attackers could not just consume the host's compute resources but potentially execute commands on the underlying system.
This is the part of the implementation that gets skipped most often. Ollama works perfectly on the developer's laptop with default settings. The production deployment that survives a security audit, runs as automated failover, and is tested regularly requires the steps in this article.
Why does Ollama need a firewall?
By default, Ollama binds only to 127.0.0.1:11434, which means localhost only2. This default is safe. The Ollama API is unreachable from other machines on the network, and no firewall configuration is strictly necessary.
The default changes the moment OLLAMA_HOST is set to 0.0.0.0:11434, which is required for any deployment where Ollama needs to serve requests from other machines (the most common business use case). At that point, the API is reachable from anywhere on the network. Without authentication and without a firewall, any user on the local network or, worse, any reachable internet host, can:
Submit arbitrary inference requests that pin the GPU for minutes at a time, effectively a denial-of-service attack against the host machine.
Exfiltrate model outputs by sending crafted prompts designed to leak training data or sensitive information that was used in fine-tuning.
Map the environment by querying the API for installed models, GPU specifications, and other host details that inform a larger attack.
Critical: Ollama has no built-in authentication. If OLLAMA_HOST is set to 0.0.0.0, anyone who can reach port 11434 can use the API. Firewall rules are the primary access control.
How is the firewall configured on Linux?
Linux has two common firewall tools. Ubuntu and Debian use ufw (Uncomplicated Firewall). Red Hat, CentOS, and Fedora use firewalld. Both achieve the same result with different syntax.
ufw on Ubuntu and Debian
The pattern is straightforward: deny port 11434 by default, then allow only the specific subnet or IP addresses that should have access.
firewalld on Red Hat, CentOS, Fedora
firewalld uses zones. The pattern is to add port 11434 to an "internal" zone that includes only trusted source addresses, and explicitly close that port in the "public" zone.
How is the firewall configured on Windows?
Windows uses Windows Defender Firewall. PowerShell as Administrator is the simplest way to configure rules consistently. The goal is the same: allow port 11434 only from trusted subnets.
How is the firewall configured on macOS?
macOS uses pf (Packet Filter) for firewall rules. The application firewall in System Settings does not provide enough granularity for port-level control. Editing the pf configuration directly is required.
What additional Ollama hardening matters?
Firewall rules are the first layer. Three more environment variables and configurations reduce the attack surface further.
Restrict CORS origins
Set OLLAMA_ORIGINS to the specific frontend URLs that should be allowed to call the API from a browser. This prevents arbitrary websites from making cross-origin requests to Ollama if a user visits them while on the corporate network3.
Disable the built-in web UI in production
Ollama includes a basic web UI that exposes model metadata and lacks role-based access control. Disable it in production deployments4.
Run Ollama as an unprivileged user
The official Linux installer already creates an ollama system user with no shell access. Verify this on existing installations and avoid running Ollama as root or as the primary user account. Resource limits via systemd cgroups prevent runaway processes from affecting the rest of the system.
Production hardening checklist
- Firewall rules in place restricting port 11434 to trusted sources only
- OLLAMA_ORIGINS set to specific allowed origins, not wildcard
- OLLAMA_NO_WEBSERVER=1 set to disable the unauthenticated UI
- Ollama running as an unprivileged system user, not root
- Reverse proxy with authentication in front of Ollama if accessed across networks
- Logs being collected and reviewed (see Part 3 monitoring script)
- Disk encryption at rest for the model storage directory
How does automatic failover from cloud AI to local AI work?
The architecture is simple: a thin client library sits between the application and the AI provider. Every request goes through the client. The client tries cloud AI first, and if that fails for any reason, retries the same request against local Ollama. The application code calling the client never knows which backend served the response.
The failover client handles three cases:
Connection failure
Cloud AI endpoint is unreachable, DNS fails, or TCP connection times out. Switch to Ollama immediately.
HTTP error
Cloud AI returns 5xx status code (server error) or specific 4xx codes (rate limits, service degraded). Retry with Ollama.
Timeout
Cloud AI accepts the request but takes longer than the timeout threshold. Cancel and retry with Ollama.
A working failover client in Python
The code below is the same pattern PCG uses for production deployments. It handles all three failure modes, logs which backend served each request, and exposes a single interface that drop-in replaces direct calls to the OpenAI or Anthropic SDK.
ai_request() never knows whether the response came from cloud AI or local Ollama. The failover is transparent, which is the whole point.
What is contingency mode and when does it activate?
Contingency mode is the operational state where all AI traffic routes to local Ollama by default, skipping the cloud AI attempt entirely. This is useful in two scenarios.
Known cloud outage. If the team knows the cloud provider is down (from a status page, social media, or repeated failover events in the logs), forcing contingency mode skips the wasted attempt at calling cloud AI and reduces latency for every request during the outage.
Compliance requirements. Some workflows handle data that should never touch cloud providers. Contingency mode can be enabled selectively for these workflows while other parts of the business continue using cloud AI.
Implementation is a single environment variable that the failover client checks before making any cloud request:
How often should the failover system be tested?
Quarterly at minimum. Monthly for business-critical deployments. The test is straightforward and takes about 15 minutes.
Quarterly failover drill
- Pick a low-traffic window (early morning, weekend, post-business hours)
- Block outbound traffic to the cloud AI endpoint at the firewall level for 15 minutes
- Have team members use AI-dependent workflows normally during the block
- Verify that the failover client logged "Cloud unavailable, falling back to Ollama" for every request
- Confirm response quality from local models was acceptable for the workflows tested
- Confirm monitoring alerts fired correctly (the team got notified)
- Remove the firewall block and verify automatic recovery to cloud
- Document any failures, surprises, or workflow gaps for the next iteration
Untested failover is failover that does not work when needed. The drill exists so the team finds problems in a controlled 15-minute window, not during an actual 78-minute Anthropic outage.
Does PCG build production AI continuity systems for clients?
Phoenix Consultants Group has been building production software systems for operational continuity since 1995, and three decades of experience in environments where business-critical software cannot stop translates directly to AI infrastructure. A custom AI continuity engagement covers everything in this series as a single deliverable: hardware assessment, Ollama deployment, monitoring integration, failover client development, security hardening, and team training on the contingency procedures.
The FireFlight Data System, PCG's modular platform for operational data, uses the same engineering discipline. Continuous monitoring, automatic recovery, security defaults that assume the worst, and tested procedures for every failure mode. The Ollama deployment follows that same playbook because the goal is the same: a system that works when the team needs it most.
Need a turnkey AI continuity system?
PCG handles hardware, deployment, monitoring, failover code, security hardening, and team training as one engagement. The diagnostic call is with an engineer, not a sales tier.
Frequently Asked Questions
Does Ollama need a firewall for business use?
How do I configure a firewall for Ollama on Linux?
How does automatic failover from cloud AI to local AI work?
Should failover be automatic or manual?
What is contingency mode for an AI continuity system?
How often should the failover system be tested?
About the Author
Allison Woolbert
CEO and Senior Systems Architect, Phoenix Consultants Group
Allison Woolbert is the principal of Phoenix Consultants Group, the custom software consultancy founded in 1995. PCG has run legacy migration projects across Microsoft Access, Visual FoxPro, Paradox, VB6, and other discontinued platforms for industrial, manufacturing, and environmental services clients since the late 1990s.
Allison leads PCG's discovery and architecture practice, where the first deliverable on every legacy engagement is an honest inventory of what the existing application actually does and what it should do next.
Sources
1 SentinelOne and Censys joint analysis on exposed Ollama instances, early 2026: serverman.co.uk/ai/ollama/ollama-security-guide
2 Ollama default network binding documentation: github.com/ollama/ollama/blob/main/docs/faq.md
3 Ollama environment variables reference, OLLAMA_ORIGINS for CORS control: docs.ollama.com
4 Ollama production security configuration, web UI and authentication: markaicode.com/configure-ollama-firewall-rules-security
Continue Reading
Get the full security and failover guide
This is Part 4 of a 4-part series on building an AI continuity plan with Ollama. Enter your email to unlock the rest of this article including firewall configuration for Linux, Windows, and macOS, a working Python failover client, and the testing drill that keeps the system trustworthy.
We verify your email first. One click confirms your subscription.
Part 2 of this series handled hardware and installation. With Ollama running, the next questions are operational. Which model should handle which task? How does a business team know if the local AI is healthy before the moment they need it? What gets logged, and where do alerts go?
The team that treats local AI like any other production service is the team whose failover actually works during a cloud AI outage. This part covers the practices that get a deployment from "installed" to "trusted."
Which local AI model should the business use?
The honest answer is that no single model is optimal for every task. Frontier cloud models like Claude and GPT-5 are general enough that one model handles everything. Local models are typically more specialized. A practical business deployment has two or three models installed, each assigned to the task it handles best.
For general business tasks (email, documents, summaries)
Llama 3.1 8B is the most versatile general-purpose model. It runs on 8 to 12 GB of VRAM, generates 50 to 70 tokens per second on consumer NVIDIA GPUs, and produces output quality comparable to GPT-3.5 across most business workflows1. For organizations with more memory available, Llama 3.3 70B matches GPT-4 on most benchmarks and runs on 40 GB of VRAM2.
For code generation and review (Claude Code replacement)
Qwen2.5-Coder is the most capable open coding model available through Ollama. The 7B variant runs comfortably on consumer hardware and handles autocomplete, refactoring, and bug fixing. The 32B variant scores 92.7 percent on HumanEval, putting it in the same range as GPT-4o for pure coding benchmarks3.
For structured reasoning and analysis
Microsoft's Phi-4 14B is purpose-built for mathematical reasoning, structured logic, and analytical tasks. It scores 80.4 percent on the MATH benchmark and outperforms general models several times its size on STEM problems4. Phi-4 is the right pick for data analysis pipelines, algorithm design, and any task where step-by-step logical reasoning matters more than creative output.
For fast lightweight tasks
Phi-3 Mini and Gemma 2 2B are lightweight models that run on minimal hardware. They are suitable for fast text classification, simple Q&A, and tasks where response latency matters more than depth. They are not replacements for larger models but they cover the case where a task does not need full reasoning capability.
The recommended emergency library
A typical business deployment for cloud AI failover has these four models installed:
| Model | Size on Disk | VRAM Needed | Replaces |
|---|---|---|---|
llama3.1:8b |
4.7 GB | 8 GB | ChatGPT general use |
qwen2.5-coder:7b |
4.7 GB | 8 GB | Claude Code, Copilot |
phi4:14b |
9 GB | 10 GB | Data analysis, reasoning |
mistral:7b |
4.1 GB | 6 GB | Fast email/document drafting |
Total disk footprint: roughly 22 GB. Total VRAM needed if loading all at once: around 32 GB. With OLLAMA_KEEP_ALIVE set to 30 minutes and OLLAMA_MAX_LOADED_MODELS set to 2, the system loads on demand and unloads idle models, which keeps memory pressure manageable on 16 to 24 GB systems.
How are system prompts different for local AI models?
Local models follow system prompts, but they need more explicit instruction than frontier cloud models. A prompt that works perfectly on Claude may produce inconsistent results on Llama 3.1 8B. Three patterns matter.
Be direct. Frontier models infer intent. Local models follow instructions literally. Replace "Help the user with their question" with "Read the question. Provide a 3-sentence answer. Do not add disclaimers or apologies."
State the output format. If JSON is needed, say so explicitly and provide an example. If a specific length is needed, give a word count or sentence count. Vague instructions get vague results.
Forbid unwanted behaviors explicitly. Phrases like "Do not add commentary" or "Do not explain your reasoning" prevent the model from padding responses with filler. Local models tend to over-explain unless told not to.
A working system prompt template for business document drafting:
Why does monitoring matter for a failover system?
A failover that fails is worse than no failover at all. The team that built it believes coverage exists. When the cloud AI outage finally happens and the local fallback turns out to be broken, the response is slower than if they had planned for full degradation from the start.
Monitoring catches three categories of failure:
Service down. The Ollama process crashed, the systemd service stopped, the host was rebooted but Ollama did not restart, or the API port is no longer accessible.
Models missing. Disk pressure caused models to be cleaned up, a model failed to download, or the OLLAMA_MODELS directory became unmounted. The service is technically running but cannot respond to inference requests.
Performance degraded. The GPU driver was updated and is no longer detected, the system fell back to CPU inference, or the model load time has grown unacceptable. The service responds, but slowly enough that the failover is unusable.
What does a working Ollama monitoring script look like?
The script below performs three checks: API health, model availability, and a test inference. It logs results to a file and exits with a status code that integrations like cron, systemd timer, or external monitoring tools can use.
Scheduling the script
On Linux, schedule the script with cron to run every 5 minutes:
On macOS, use launchd. On Windows, use Task Scheduler. The exact configuration varies by platform, but the pattern is the same: run every 5 minutes, log to a file, and exit non-zero on failure so external monitoring can detect the problem.
How does the team get alerted when monitoring fails?
A log file no one reads is not monitoring. The script above writes to a file by design, because that file becomes the input for whatever alerting system the organization already uses. Three common patterns work.
Existing infrastructure monitoring. If the organization uses Datadog, New Relic, Prometheus, or any other monitoring stack, point it at the log file or the script exit code. The integration is one line of configuration.
Slack or Teams webhooks. A 10-line addition to the script can post to Slack or Teams when checks fail. The webhook URL goes in a config file, the script reads it, and the team gets alerts in the channel they already watch.
Email alerts via cron. Configure cron with a MAILTO header so any non-zero exit triggers an email automatically. Simplest option, no extra code required, works on every Linux system.
Does PCG help operationalize local AI deployments?
Yes. Phoenix Consultants Group has been building operational software since 1995, and the discipline that applies to monitoring legacy databases or production web services applies directly to local AI infrastructure. A custom engagement includes model selection tailored to the client's actual workflows, monitoring integration with the existing alerting stack, runbook documentation, and team training on the operational procedures.
The FireFlight Data System uses the same monitoring philosophy: continuous health checks, automatic recovery, and external alerting that catches the failures recovery does not solve. The Ollama deployment follows the same playbook.
Need monitoring built into your AI failover?
PCG designs custom monitoring and alerting for local AI deployments, integrated with your existing infrastructure.
Frequently Asked Questions
Which local AI model is best for business use?
How do I monitor whether Ollama is working correctly?
Can local AI replace Claude or ChatGPT for daily business work?
How do I write a system prompt for local AI models?
What happens if Ollama crashes during a cloud AI outage?
Should I run the monitoring script on the same machine as Ollama?
About the Author
Allison Woolbert
CEO and Senior Systems Architect, Phoenix Consultants Group
Allison Woolbert is the principal of Phoenix Consultants Group, the custom software consultancy founded in 1995. PCG has run legacy migration projects across Microsoft Access, Visual FoxPro, Paradox, VB6, and other discontinued platforms for industrial, manufacturing, and environmental services clients since the late 1990s.
Allison leads PCG's discovery and architecture practice, where the first deliverable on every legacy engagement is an honest inventory of what the existing application actually does and what it should do next.
Sources
1 Llama 3.1 model documentation and benchmarks, Ollama library: ollama.com/library/llama3.1
2 Llama 3.3 70B model card and benchmark comparisons: ollama.com/library/llama3.3
3 Qwen2.5-Coder benchmarks, HumanEval 92.7 percent: ollama.com/library/qwen2.5-coder
4 Microsoft Phi-4 model card and MATH benchmark: ollama.com/library/phi4
5 Ollama API reference for /api/tags and /api/generate: github.com/ollama/ollama/blob/main/docs/api.md
Continue Reading
Get the full guide on models and monitoring
This is Part 3 of a 4-part series on building an AI continuity plan with Ollama. Enter your email to unlock the rest of this article, including the Python monitoring script and the recommended model library, plus access to all parts of the series.
We verify your email first. One click confirms your subscription.