Phoenix Consultants Group | Custom Computer Programming
  • Custom Software Developers
    • Analyzing Business Needs
    • Custom Application Development
    • Custom Website Development
    • Data Collection and Management
    • Form Design & Development
    • Visual Basic Programming Experts
    • Custom Technology Products & Software Solutions for Business
  • .NET Development
    • Business Logic to .NET Architecture:
    • Smarter Decisions with Intelligent Data Systems
    • Custom .NET Software Development
  • Fireflight Data System
    • Fireflight – Project
  • Data Management
    • Managing Legacy Data and Systems
    • Conversion, Migration & Integration
    • Data Management
    • Data Movement & Middleware Integration Services
    • Enterprise Resource Planning
    • Inventory Management Systems
    • Microsoft Access Solutions
      • Access Database Consulting
      • Access Database Design
      • Access for Rapid Data Development
      • Access Database Programming
  • Case Studies
    • ISO 9000 Documentation & Regulatory Compliance Database
    • Superfund Soil Remediation
    • OSHA Training & Certification
    • Ground Water Monitoring
    • Pest Control Reporting Engine
    • Vineyard Pest Trap Management
    • Fueling System for a Top-5 U.S. Metro Fleet
    • Payroll System for a Multi-Facility Physician Staffing Company
    • Ground Support Equipment (GSE) Management System for Airport Operations
    • (MSDS/SDS) Management System
    • Pesticide Licensing Compliance System
    • EPA Title V Air Quality Management System
  • Tech Wisdom
  • Industries We Serve
    • Custom Software Portfolio
  • Blog
  • About Us
  • Contact Us

Tag: chatgpt down

Last updated: June 2026
Part 4 of 4
The previous three parts of this series got Ollama installed, configured, monitored, and ready. This final part closes the gap between "Ollama is available" and "Ollama is a reliable failover." Firewall configuration prevents accidental exposure. Auto-failover code makes the switch from cloud to local automatic. Drills and contingency procedures verify that the system actually works when needed.

A 2026 joint analysis by SentinelOne and Censys scanned the public internet for 293 days and found 175,000 unique Ollama instances exposed across 130 countries, most with no authentication and no firewall protection [1]. Many had tool-calling capabilities enabled, meaning attackers could not just consume the host's compute resources but potentially execute commands on the underlying system.

This is the part of the implementation that gets skipped most often. Ollama works perfectly on the developer's laptop with default settings. The production deployment that survives a security audit, runs as automated failover, and is tested regularly requires the steps in this article.

Why does Ollama need a firewall?

By default, Ollama binds only to 127.0.0.1:11434, which means localhost only [2]. This default is safe. The Ollama API is unreachable from other machines on the network, and no firewall configuration is strictly necessary.

The default changes the moment OLLAMA_HOST is set to 0.0.0.0:11434, which is required for any deployment where Ollama needs to serve requests from other machines (the most common business use case). At that point, the API is reachable from anywhere on the network. Without authentication and without a firewall, any user on the local network or, worse, any reachable internet host, can:

Submit arbitrary inference requests that pin the GPU for minutes at a time, effectively a denial-of-service attack against the host machine.

Exfiltrate model outputs by sending crafted prompts designed to leak training data or sensitive information that was used in fine-tuning.

Map the environment by querying the API for installed models, GPU specifications, and other host details that inform a larger attack.

Critical: Ollama has no built-in authentication. If OLLAMA_HOST is set to 0.0.0.0, anyone who can reach port 11434 can use the API. Firewall rules are the primary access control.
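
As a quick illustration of the distinction above, the sketch below classifies an OLLAMA_HOST value as loopback-only or network-exposed. The function name is_loopback_only is hypothetical, the parsing covers only the common forms (bare host, host:port, full URL), and an unrecognized hostname is treated as exposed to stay on the safe side.

```python
import ipaddress
from urllib.parse import urlsplit


def is_loopback_only(ollama_host: str) -> bool:
    """Return True if an OLLAMA_HOST value binds to the loopback interface only."""
    value = ollama_host.strip()
    if not value:
        # Unset or empty: Ollama's default of 127.0.0.1:11434 applies
        return True
    # urlsplit needs a scheme to separate host from port reliably
    if "://" not in value:
        value = "http://" + value
    host = urlsplit(value).hostname or ""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (or no host at all): assume exposed
        return False
```

A value that returns False here is one that needs the firewall rules described in the following sections.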

How is the firewall configured on Linux?

Linux has two common firewall tools. Ubuntu and Debian use ufw (Uncomplicated Firewall). Red Hat, CentOS, and Fedora use firewalld. Both achieve the same result with different syntax.

ufw on Ubuntu and Debian

The pattern is straightforward: allow only the specific subnets or IP addresses that should have access, then deny port 11434 for everyone else. ufw evaluates rules in order and stops at the first match, so the allow rules must be added before the deny rule.

# Enable ufw if not already enabled
sudo ufw enable

# Allow Ollama API access from the corporate subnet (example: 192.168.1.0/24)
sudo ufw allow from 192.168.1.0/24 to any port 11434 proto tcp

# Optional: allow access from a specific VPN subnet
sudo ufw allow from 10.8.0.0/24 to any port 11434 proto tcp

# Explicitly deny access from anywhere else to port 11434
sudo ufw deny to any port 11434 proto tcp

# Check the resulting rules
sudo ufw status numbered

firewalld on Red Hat, CentOS, Fedora

firewalld uses zones. The pattern is to add port 11434 to an "internal" zone that includes only trusted source addresses, and explicitly close that port in the "public" zone.

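# Add trusted source to the internal zone
sudo firewall-cmd --zone=internal --add-source=192.168.1.0/24 --permanent

# Allow port 11434 only in the internal zone
sudo firewall-cmd --zone=internal --add-port=11434/tcp --permanent

# Make sure the port is not open in the public zone
# (reports a warning and changes nothing if it was never opened there)
sudo firewall-cmd --zone=public --remove-port=11434/tcp --permanent

# Reload to apply
sudo firewall-cmd --reload

# Verify
sudo firewall-cmd --list-all --zone=internal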

How is the firewall configured on Windows?

Windows uses Windows Defender Firewall. PowerShell as Administrator is the simplest way to configure rules consistently. The goal is the same: allow port 11434 only from trusted subnets.

# Open PowerShell as Administrator

# Allow Ollama API from the corporate subnet
New-NetFirewallRule -DisplayName "Ollama API - Internal" `
  -Direction Inbound -Action Allow `
  -Protocol TCP -LocalPort 11434 `
  -RemoteAddress 192.168.1.0/24

# Do not add a catch-all Block rule for port 11434. In Windows Defender Firewall,
# block rules take precedence over allow rules, so a Block rule with
# -RemoteAddress Any would also block the corporate subnet.
# Rely on the default inbound action (Block) instead, and confirm it is in effect:
Get-NetFirewallProfile | Select-Object Name, DefaultInboundAction

# Verify rules
Get-NetFirewallRule -DisplayName "Ollama API*"

How is the firewall configured on macOS?

macOS uses pf (Packet Filter) for firewall rules. The application firewall in System Settings does not provide enough granularity for port-level control. Editing the pf configuration directly is required.

# Edit the pf configuration
sudo nano /etc/pf.conf

# Add these lines at the bottom
# Block all incoming on port 11434 by default
block in proto tcp from any to any port 11434
# Allow only the trusted subnet
pass in proto tcp from 192.168.1.0/24 to any port 11434

# Load the updated configuration
sudo pfctl -f /etc/pf.conf

# Enable pf if not already enabled
sudo pfctl -e

# Check active rules
sudo pfctl -sr
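
Whichever platform's rules are in place, it helps to confirm them from both a trusted and an untrusted machine. The probe below is a minimal sketch using only the standard library. The function name port_reachable is hypothetical, and a True result means only that a TCP connection succeeded, not that the Ollama API answered.

```python
import socket


def port_reachable(host: str, port: int = 11434, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False
```

Run from a machine inside the allowed subnet, the probe should return True for the Ollama host. Run from any other network, it should return False.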

What additional Ollama hardening matters?

Firewall rules are the first layer. Three more environment variables and configurations reduce the attack surface further.

Restrict CORS origins

Set OLLAMA_ORIGINS to the specific frontend URLs that should be allowed to call the API from a browser. This prevents arbitrary websites from making cross-origin requests to Ollama if a user visits them while on the corporate network [3].

Environment="OLLAMA_ORIGINS=https://docs.internal.corp,https://app.internal.corp"
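
A deployment script could sanity-check the value before writing it into the service configuration. The sketch below is one possible check, not part of Ollama: the function name risky_origins and the choice to flag anything that is not an https:// URL are assumptions made for illustration.

```python
def risky_origins(ollama_origins: str) -> list[str]:
    """Return entries of an OLLAMA_ORIGINS value that deserve a second look."""
    flagged = []
    for origin in (o.strip() for o in ollama_origins.split(",")):
        if not origin:
            continue
        if "*" in origin:
            # Wildcards allow more than one specific site
            flagged.append(origin)
        elif not origin.startswith("https://"):
            # Plain http or a bare hostname
            flagged.append(origin)
    return flagged
```

An empty result for the example value above means every listed origin is a specific HTTPS site.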

Disable the built-in web UI in production

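Ollama includes a basic web UI that exposes model metadata and lacks role-based access control. Disable it in production deployments [4]. Before relying on the variable below, confirm that the installed Ollama version recognizes it; running ollama serve --help lists the supported environment variables.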

Environment="OLLAMA_NO_WEBSERVER=1"

Run Ollama as an unprivileged user

The official Linux installer already creates an ollama system user with no shell access. Verify this on existing installations and avoid running Ollama as root or as the primary user account. Resource limits via systemd cgroups prevent runaway processes from affecting the rest of the system.

Production hardening checklist

  • Firewall rules in place restricting port 11434 to trusted sources only
  • OLLAMA_ORIGINS set to specific allowed origins, not wildcard
  • OLLAMA_NO_WEBSERVER=1 set to disable the unauthenticated UI
  • Ollama running as an unprivileged system user, not root
  • Reverse proxy with authentication in front of Ollama if accessed across networks
  • Logs being collected and reviewed (see Part 3 monitoring script)
  • Disk encryption at rest for the model storage directory

How does automatic failover from cloud AI to local AI work?

The architecture is simple: a thin client library sits between the application and the AI provider. Every request goes through the client. The client tries cloud AI first, and if that fails for any reason, retries the same request against local Ollama. The application code calling the client never knows which backend served the response.

The failover client handles three cases:

Connection failure

Cloud AI endpoint is unreachable, DNS fails, or TCP connection times out. Switch to Ollama immediately.

HTTP error

Cloud AI returns a 5xx status code (server error) or one of a few specific 4xx codes (rate limits, service degraded). Retry with Ollama.

Timeout

Cloud AI accepts the request but takes longer than the timeout threshold. Cancel and retry with Ollama.
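
The HTTP error case is the one that needs an explicit policy, because not every 4xx response means the provider is unhealthy. A minimal sketch of such a policy follows. The name should_fail_over and the choice of 408 and 429 as the retryable 4xx codes are assumptions for illustration; adjust the set to the provider's documented error codes.

```python
# Assumed set of 4xx codes worth a fallback:
# 408 (request timeout) and 429 (rate limited)
RETRYABLE_4XX = {408, 429}


def should_fail_over(status_code: int) -> bool:
    """Decide whether an HTTP status from the cloud provider warrants a fallback."""
    if 500 <= status_code <= 599:
        return True  # server-side failure
    return status_code in RETRYABLE_4XX
```

Under this policy a 401 or 400 does not trigger a fallback, since a bad API key or a malformed request will not be fixed by switching backends silently.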

A working failover client in Python

The code below is the same pattern PCG uses for production deployments. It handles all three failure modes, logs which backend served each request, and exposes a single interface that works as a drop-in replacement for direct calls to the OpenAI or Anthropic SDK. Note that it falls back to Ollama on any non-200 response from the cloud provider, not only on 5xx and rate-limit codes.

#!/usr/bin/env python3
# ai_failover_client.py
# A failover client that tries cloud AI first, then falls back to local Ollama.

import requests
import logging
import os
from typing import Optional

# Configuration via environment variables
CLOUD_API_URL = os.getenv("CLOUD_API_URL", "https://api.openai.com/v1/chat/completions")
CLOUD_API_KEY = os.getenv("CLOUD_API_KEY")
OLLAMA_URL = os.getenv("OLLAMA_URL", "http://localhost:11434/api/chat")
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "llama3.1:8b")
CLOUD_TIMEOUT = int(os.getenv("CLOUD_TIMEOUT", "15"))
OLLAMA_TIMEOUT = int(os.getenv("OLLAMA_TIMEOUT", "60"))

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")


def call_cloud(messages: list, model: str = "gpt-4o") -> Optional[str]:
    """Try the cloud AI provider. Returns response text or None on failure."""
    try:
        r = requests.post(
            CLOUD_API_URL,
            headers={"Authorization": f"Bearer {CLOUD_API_KEY}"},
            json={"model": model, "messages": messages},
            timeout=CLOUD_TIMEOUT
        )
        if r.status_code == 200:
            return r.json()["choices"][0]["message"]["content"]
        logging.warning(f"Cloud returned status {r.status_code}")
        return None
    except requests.exceptions.RequestException as e:
        logging.warning(f"Cloud request failed: {e}")
        return None


def call_ollama(messages: list, model: str = OLLAMA_MODEL) -> Optional[str]:
    """Try the local Ollama instance. Returns response text or None on failure."""
    try:
        r = requests.post(
            OLLAMA_URL,
            json={"model": model, "messages": messages, "stream": False},
            timeout=OLLAMA_TIMEOUT
        )
        if r.status_code == 200:
            return r.json()["message"]["content"]
        logging.error(f"Ollama returned status {r.status_code}")
        return None
    except requests.exceptions.RequestException as e:
        logging.error(f"Ollama request failed: {e}")
        return None


def ai_request(messages: list, cloud_model: str = "gpt-4o") -> dict:
    """
    Main entry point. Tries cloud first, falls back to Ollama on failure.
    Returns a dict with the response text and which backend served it.
    """
    response = call_cloud(messages, model=cloud_model)
    if response is not None:
        logging.info("Served by cloud")
        return {"backend": "cloud", "text": response}

    logging.info("Cloud unavailable, falling back to Ollama")
    response = call_ollama(messages)
    if response is not None:
        return {"backend": "ollama", "text": response}

    logging.error("Both cloud and Ollama failed")
    return {"backend": "none", "text": None, "error": "All backends failed"}


# Usage example
if __name__ == "__main__":
    result = ai_request([
        {"role": "user", "content": "Summarize the benefits of local AI in 50 words."}
    ])
    print(f"[{result['backend']}] {result['text']}")

The key property is that the application code calling ai_request() never knows whether the response came from cloud AI or local Ollama. The failover is transparent, which is the whole point.
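
One way to check that transparency property without a network connection is to pass the backends in as plain functions and substitute stubs in a test. The sketch below is a generalization under that assumption. The name first_available and the (name, callable) pair structure are hypothetical and are not part of the client above.

```python
from typing import Callable, Optional, Sequence, Tuple

# A backend is a label plus a function that returns text, or None on failure
Backend = Tuple[str, Callable[[list], Optional[str]]]


def first_available(messages: list, backends: Sequence[Backend]) -> dict:
    """Try each backend in order and return the first non-None response."""
    for name, call in backends:
        text = call(messages)
        if text is not None:
            return {"backend": name, "text": text}
    return {"backend": "none", "text": None, "error": "All backends failed"}


# Stub backends standing in for the real cloud and Ollama calls
def cloud_down(messages: list) -> Optional[str]:
    return None


def local_ok(messages: list) -> Optional[str]:
    return "local answer"


result = first_available([], [("cloud", cloud_down), ("ollama", local_ok)])
```

With the cloud stub failing, result reports the ollama backend, and the caller's code path is identical to the case where the cloud answered.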

What is contingency mode and when does it activate?

Contingency mode is the operational state where all AI traffic routes to local Ollama by default, skipping the cloud AI attempt entirely. This is useful in two scenarios.

Known cloud outage. If the team knows the cloud provider is down (from a status page, social media, or repeated failover events in the logs), forcing contingency mode skips the wasted attempt at calling cloud AI and reduces latency for every request during the outage.

Compliance requirements. Some workflows handle data that should never touch cloud providers. Contingency mode can be enabled selectively for these workflows while other parts of the business continue using cloud AI.

Implementation is a single environment variable that the failover client checks before making any cloud request:

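# In the client code, add at the top of ai_request()
if os.getenv("AI_CONTINGENCY_MODE") == "true":
    logging.info("Contingency mode active, routing directly to Ollama")
    response = call_ollama(messages)
    if response is not None:
        return {"backend": "ollama", "text": response}
    # Return the failure here instead of falling through to the cloud call,
    # so compliance-restricted requests never reach the cloud provider
    logging.error("Ollama failed while in contingency mode")
    return {"backend": "none", "text": None, "error": "Ollama failed in contingency mode"}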
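
The environment variable covers the manual switch. Entering contingency mode automatically after repeated cloud failures needs a small amount of state. The sketch below is one possible shape for it: the class name ContingencyTracker, the threshold of 3 consecutive failures, and the 5-minute cooldown are all assumptions chosen for illustration, and the clock is injectable so the logic can be tested without waiting.

```python
import time
from typing import Callable, Optional


class ContingencyTracker:
    """Skip the cloud after repeated failures, then probe it again after a cooldown."""

    def __init__(self, threshold: int = 3, cooldown_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self.threshold = threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.consecutive_failures = 0
        self.tripped_at: Optional[float] = None  # when the threshold was reached

    def record_cloud_result(self, ok: bool) -> None:
        """Call after every cloud attempt with its outcome."""
        if ok:
            self.consecutive_failures = 0
            self.tripped_at = None
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold and self.tripped_at is None:
            self.tripped_at = self.clock()

    def skip_cloud(self) -> bool:
        """Return True while requests should go straight to Ollama."""
        if self.tripped_at is None:
            return False
        if self.clock() - self.tripped_at >= self.cooldown_seconds:
            # Cooldown over: allow one probe; a single failure re-trips the tracker
            self.tripped_at = None
            self.consecutive_failures = self.threshold - 1
            return False
        return True
```

A client would consult skip_cloud() before each request and report each cloud attempt through record_cloud_result(), so recovery happens on its own once the provider answers again.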

How often should the failover system be tested?

Quarterly at minimum. Monthly for business-critical deployments. The test is straightforward and takes about 15 minutes.

Quarterly failover drill

  • Pick a low-traffic window (early morning, weekend, post-business hours)
  • Block outbound traffic to the cloud AI endpoint at the firewall level for 15 minutes
  • Have team members use AI-dependent workflows normally during the block
  • Verify that the failover client logged "Cloud unavailable, falling back to Ollama" for every request
  • Confirm response quality from local models was acceptable for the workflows tested
  • Confirm monitoring alerts fired correctly (the team got notified)
  • Remove the firewall block and verify automatic recovery to cloud
  • Document any failures, surprises, or workflow gaps for the next iteration

Untested failover is failover that does not work when needed. The drill exists so the team finds problems in a controlled 15-minute window, not during an actual 78-minute Anthropic outage.
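
The log verification step of the drill can be automated. The sketch below counts the failover client's log messages for the drill window. It assumes the message strings match the ones the client logs ("Served by cloud", "Cloud unavailable, falling back to Ollama", "Both cloud and Ollama failed") and that the caller has already filtered the log to the 15-minute block. The function names are hypothetical.

```python
from collections import Counter
from typing import Iterable

FALLBACK_MSG = "Cloud unavailable, falling back to Ollama"
CLOUD_MSG = "Served by cloud"
BOTH_FAILED_MSG = "Both cloud and Ollama failed"


def summarize_drill(log_lines: Iterable[str]) -> Counter:
    """Count how requests were served according to the failover client's log."""
    counts = Counter()
    for line in log_lines:
        if FALLBACK_MSG in line:
            counts["fallback"] += 1
        elif CLOUD_MSG in line:
            counts["cloud"] += 1
        elif BOTH_FAILED_MSG in line:
            counts["both_failed"] += 1
    return counts


def drill_passed(counts: Counter) -> bool:
    """During the block, every request should fall back and none should fail outright."""
    return counts["fallback"] > 0 and counts["cloud"] == 0 and counts["both_failed"] == 0
```

A "Served by cloud" line inside the drill window means the firewall block did not take effect, which is itself a finding worth documenting.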

Does PCG build production AI continuity systems for clients?

Phoenix Consultants Group has been building production software systems for operational continuity since 1995, and three decades of experience in environments where business-critical software cannot stop translates directly to AI infrastructure. A custom AI continuity engagement covers everything in this series as a single deliverable: hardware assessment, Ollama deployment, monitoring integration, failover client development, security hardening, and team training on the contingency procedures.

The FireFlight Data System, PCG's modular platform for operational data, uses the same engineering discipline. Continuous monitoring, automatic recovery, security defaults that assume the worst, and tested procedures for every failure mode. The Ollama deployment follows that same playbook because the goal is the same: a system that works when the team needs it most.

Need a turnkey AI continuity system?

PCG handles hardware, deployment, monitoring, failover code, security hardening, and team training as one engagement. The diagnostic call is with an engineer, not a sales tier.


Frequently Asked Questions

Does Ollama need a firewall for business use?
Yes. A 2026 SentinelOne and Censys analysis found 175,000 Ollama instances exposed publicly, most with no authentication or firewall protection. Default Ollama binds to localhost only, which is safe. The moment OLLAMA_HOST is changed to 0.0.0.0 to allow network access, a firewall becomes mandatory. Without one, anyone reaching the host can submit inference requests, consume GPU resources, and potentially exfiltrate model outputs.
How do I configure a firewall for Ollama on Linux?
Use ufw on Ubuntu and Debian or firewalld on Red Hat, CentOS, and Fedora. The goal is identical: only the specific IP addresses or subnets that should reach Ollama can connect to port 11434, and every other source is blocked. With ufw, one rule allows the corporate subnet and a second rule denies the port for everyone else.
How does automatic failover from cloud AI to local AI work?
A small client library sits between the application and the AI provider. Each request goes to the cloud AI first. If the cloud AI returns an error, times out, or fails health checks, the same request is automatically retried against Ollama. The application sees one consistent API while the failover happens transparently. Typical implementation is 50 to 80 lines of code in any modern language.
Should failover be automatic or manual?
Automatic for most workflows. Manual failover requires someone to notice the outage and trigger the switch, which adds minutes or hours of delay. Automatic failover switches as soon as the cloud request fails or hits its timeout, so the added delay is bounded by the configured timeout rather than by human reaction time. The exception is workflows with strict compliance or audit requirements where every model output must be logged with its source, in which case manual approval before falling back may be appropriate.
What is contingency mode for an AI continuity system?
Contingency mode is the operational state where all AI traffic routes to local Ollama instead of the cloud provider. It can be triggered automatically by repeated cloud failures or manually by an operator. While in contingency mode, the system logs all requests separately so the team has a record of what ran locally during the outage and can verify outputs after the cloud provider recovers.
How often should the failover system be tested?
Quarterly at minimum, monthly for business-critical deployments. A test drill blocks cloud AI access at the firewall level for 15 minutes during a low-traffic window. The team verifies that all applications continue working, monitors response quality from local models, and confirms that monitoring alerts fired correctly. Untested failover is failover that does not work when needed.

About the Author

Allison Woolbert

CEO and Senior Systems Architect, Phoenix Consultants Group

Allison Woolbert is the principal of Phoenix Consultants Group, the custom software consultancy founded in 1995. PCG has run legacy migration projects across Microsoft Access, Visual FoxPro, Paradox, VB6, and other discontinued platforms for industrial, manufacturing, and environmental services clients since the late 1990s.

Allison leads PCG's discovery and architecture practice, where the first deliverable on every legacy engagement is an honest inventory of what the existing application actually does and what it should do next.


Sources

[1] SentinelOne and Censys joint analysis on exposed Ollama instances, early 2026: serverman.co.uk/ai/ollama/ollama-security-guide

[2] Ollama default network binding documentation: github.com/ollama/ollama/blob/main/docs/faq.md

[3] Ollama environment variables reference, OLLAMA_ORIGINS for CORS control: docs.ollama.com

[4] Ollama production security configuration, web UI and authentication: markaicode.com/configure-ollama-firewall-rules-security

This article is informational and reflects industry observations as of June 2026. It is not legal, compliance, or financial advice for any specific situation. Phoenix Consultants Group, founded 1995, provides custom software development and AI infrastructure consulting. For guidance tailored to your organization's specific requirements, contact PCG directly.


Last updated: June 2026
Part 3 of 4
A working Ollama installation is only half the story. The other half is choosing the right model for each business task and monitoring the system so that failures get detected before they matter. This guide covers model selection by use case, system prompt patterns that work for local models, and a complete Python monitoring script that runs as a scheduled task.

Part 2 of this series handled hardware and installation. With Ollama running, the next questions are operational. Which model should handle which task? How does a business team know if the local AI is healthy before the moment they need it? What gets logged, and where do alerts go?

The team that treats local AI like any other production service is the team whose failover actually works during a cloud AI outage. This part covers the practices that get a deployment from "installed" to "trusted."

Which local AI model should the business use?

The honest answer is that no single model is optimal for every task. Frontier cloud models like Claude and GPT-5 are general enough that one model handles everything. Local models are typically more specialized. A practical business deployment has two or three models installed, each assigned to the task it handles best.

For general business tasks (email, documents, summaries)

Llama 3.1 8B is the most versatile general-purpose model. It runs on 8 to 12 GB of VRAM, generates 50 to 70 tokens per second on consumer NVIDIA GPUs, and produces output quality comparable to GPT-3.5 across most business workflows [1]. For organizations with more memory available, Llama 3.3 70B matches GPT-4 on most benchmarks and runs on 40 GB of VRAM [2].

ollama pull llama3.1:8b     # 4.7 GB download, runs on 8GB+ VRAM
ollama pull llama3.3:70b    # 43 GB download, runs on 40GB+ VRAM

For code generation and review (Claude Code replacement)

Qwen2.5-Coder is the most capable open coding model available through Ollama. The 7B variant runs comfortably on consumer hardware and handles autocomplete, refactoring, and bug fixing. The 32B variant scores 92.7 percent on HumanEval, putting it in the same range as GPT-4o for pure coding benchmarks [3].

ollama pull qwen2.5-coder:7b     # 4.7 GB, daily coding work
ollama pull qwen2.5-coder:32b    # 20 GB, complex multi-file changes

For structured reasoning and analysis

Microsoft's Phi-4 14B is purpose-built for mathematical reasoning, structured logic, and analytical tasks. It scores 80.4 percent on the MATH benchmark and outperforms general models several times its size on STEM problems [4]. Phi-4 is the right pick for data analysis pipelines, algorithm design, and any task where step-by-step logical reasoning matters more than creative output.

ollama pull phi4:14b # 9 GB, structured reasoning tasks

For fast lightweight tasks

Phi-3 Mini and Gemma 2 2B are lightweight models that run on minimal hardware. They are suitable for fast text classification, simple Q&A, and tasks where response latency matters more than depth. They are not replacements for larger models but they cover the case where a task does not need full reasoning capability.

The recommended emergency library

A typical business deployment for cloud AI failover has these four models installed:

Model               Size on Disk   VRAM Needed   Replaces
llama3.1:8b         4.7 GB         8 GB          ChatGPT general use
qwen2.5-coder:7b    4.7 GB         8 GB          Claude Code, Copilot
phi4:14b            9 GB           10 GB         Data analysis, reasoning
mistral:7b          4.1 GB         6 GB          Fast email/document drafting

Total disk footprint: roughly 22 GB. Total VRAM needed if loading all at once: around 32 GB. With OLLAMA_KEEP_ALIVE set to 30 minutes and OLLAMA_MAX_LOADED_MODELS set to 2, the system loads on demand and unloads idle models, which keeps memory pressure manageable on 16 to 24 GB systems.
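
The arithmetic behind those totals, and the effect of capping loaded models at two, can be checked in a few lines. The figures below are copied from the table above and are the article's estimates rather than measurements. The function name pairs_that_fit is hypothetical.

```python
from itertools import combinations

# (disk GB, VRAM GB) per model, copied from the table above
MODELS = {
    "llama3.1:8b": (4.7, 8),
    "qwen2.5-coder:7b": (4.7, 8),
    "phi4:14b": (9.0, 10),
    "mistral:7b": (4.1, 6),
}

total_disk = sum(disk for disk, _ in MODELS.values())
total_vram = sum(vram for _, vram in MODELS.values())


def pairs_that_fit(vram_budget_gb: float) -> list:
    """List the two-model combinations whose combined VRAM fits the budget."""
    return [
        (a, b)
        for a, b in combinations(MODELS, 2)
        if MODELS[a][1] + MODELS[b][1] <= vram_budget_gb
    ]


print(f"Disk: {total_disk:.1f} GB, VRAM if all loaded: {total_vram} GB")
print(f"Pairs that fit in 16 GB: {len(pairs_that_fit(16))} of 6")
```

On a 16 GB system, any pair that includes phi4:14b alongside an 8 GB model exceeds the budget, which is why on-demand loading and unloading matters at the low end of that range.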

How are system prompts different for local AI models?

Local models follow system prompts, but they need more explicit instruction than frontier cloud models. A prompt that works perfectly on Claude may produce inconsistent results on Llama 3.1 8B. Three patterns matter.

Be direct. Frontier models infer intent. Local models follow instructions literally. Replace "Help the user with their question" with "Read the question. Provide a 3-sentence answer. Do not add disclaimers or apologies."

State the output format. If JSON is needed, say so explicitly and provide an example. If a specific length is needed, give a word count or sentence count. Vague instructions get vague results.

Forbid unwanted behaviors explicitly. Phrases like "Do not add commentary" or "Do not explain your reasoning" prevent the model from padding responses with filler. Local models tend to over-explain unless told not to.

A working system prompt template for business document drafting:

# Send via the Ollama API with system field set
You are a professional business writing assistant.
Output rules:
- Write in clear, direct sentences.
- Use the requested document type and length.
- Do not add introductions or conclusions unless asked.
- Do not include phrases like "I hope this helps" or "Let me know if".
- Reply with the document content only.
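
As a rough sketch of what "with system field set" looks like in practice, the code below builds a request for Ollama's /api/generate endpoint, which accepts a system field alongside the prompt. The function names build_payload and draft are hypothetical, and the model name and base URL are assumptions matching the defaults used elsewhere in this series.

```python
import json
import urllib.request

SYSTEM_PROMPT = (
    "You are a professional business writing assistant.\n"
    "Output rules:\n"
    "- Write in clear, direct sentences.\n"
    "- Use the requested document type and length.\n"
    "- Do not add introductions or conclusions unless asked.\n"
    '- Do not include phrases like "I hope this helps" or "Let me know if".\n'
    "- Reply with the document content only."
)


def build_payload(user_prompt: str, model: str = "llama3.1:8b") -> dict:
    """Build the JSON body for a non-streaming /api/generate request."""
    return {
        "model": model,
        "system": SYSTEM_PROMPT,
        "prompt": user_prompt,
        "stream": False,
    }


def draft(user_prompt: str, base_url: str = "http://localhost:11434") -> str:
    """Send the request to a running Ollama instance and return the generated text."""
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(build_payload(user_prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())["response"]
```

Calling draft("Write a 100-word memo announcing the new expense policy.") requires a running Ollama instance with the model pulled; build_payload can be inspected without one.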

Why does monitoring matter for a failover system?

A failover that fails is worse than no failover at all. The team that built it believes coverage exists. When the cloud AI outage finally happens and the local fallback turns out to be broken, the response is slower than if they had planned for full degradation from the start.

Monitoring catches three categories of failure:

Service down. The Ollama process crashed, the systemd service stopped, the host was rebooted but Ollama did not restart, or the API port is no longer accessible.

Models missing. Disk pressure caused models to be cleaned up, a model failed to download, or the OLLAMA_MODELS directory became unmounted. The service is technically running but cannot respond to inference requests.

Performance degraded. The GPU driver was updated and is no longer detected, the system fell back to CPU inference, or the model load time has grown unacceptable. The service responds, but slowly enough that the failover is unusable.

What does a working Ollama monitoring script look like?

The script below performs three checks: API health, model availability, and a test inference. It logs results to a file and exits with a status code that integrations like cron, systemd timer, or external monitoring tools can use.

#!/usr/bin/env python3
# ollama_health_check.py
# Runs three checks against a local Ollama installation.
# Exit 0 if healthy, exit 1 if any check fails.

import requests
import json
import sys
import time
import logging
from pathlib import Path

# Configuration
OLLAMA_URL = "http://localhost:11434"
REQUIRED_MODELS = ["llama3.1:8b", "qwen2.5-coder:7b"]
TEST_MODEL = "llama3.1:8b"
TEST_TIMEOUT_SECONDS = 30
LOG_FILE = "/var/log/ollama_health.log"

logging.basicConfig(
    filename=LOG_FILE,
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s"
)


def check_api():
    """Verify the Ollama API is responding."""
    try:
        r = requests.get(f"{OLLAMA_URL}/", timeout=5)
        if r.status_code == 200:
            logging.info("API check passed")
            return True
        logging.error(f"API returned status {r.status_code}")
        return False
    except requests.exceptions.RequestException as e:
        logging.error(f"API check failed: {e}")
        return False


def check_models():
    """Verify all required models are available."""
    try:
        r = requests.get(f"{OLLAMA_URL}/api/tags", timeout=10)
        data = r.json()
        installed = [m["name"] for m in data.get("models", [])]
        missing = [m for m in REQUIRED_MODELS if m not in installed]
        if missing:
            logging.error(f"Missing models: {missing}")
            return False
        logging.info(f"All {len(REQUIRED_MODELS)} required models present")
        return True
    except Exception as e:
        logging.error(f"Model check failed: {e}")
        return False


def check_inference():
    """Run a test inference and measure response time."""
    payload = {
        "model": TEST_MODEL,
        "prompt": "Reply with the single word: OK",
        "stream": False,
        "options": {"num_predict": 10}
    }
    try:
        start = time.time()
        r = requests.post(
            f"{OLLAMA_URL}/api/generate",
            json=payload,
            timeout=TEST_TIMEOUT_SECONDS
        )
        elapsed = time.time() - start
        if r.status_code != 200:
            logging.error(f"Inference returned status {r.status_code}")
            return False
        if elapsed > TEST_TIMEOUT_SECONDS:
            logging.error(f"Inference too slow: {elapsed:.1f}s")
            return False
        logging.info(f"Inference check passed in {elapsed:.1f}s")
        return True
    except Exception as e:
        logging.error(f"Inference check failed: {e}")
        return False


def main():
    checks = [
        ("API", check_api),
        ("Models", check_models),
        ("Inference", check_inference)
    ]
    failures = []
    for name, check_fn in checks:
        if not check_fn():
            failures.append(name)
    if failures:
        logging.error(f"Health check FAILED. Failed checks: {failures}")
        sys.exit(1)
    logging.info("Health check PASSED")
    sys.exit(0)


if __name__ == "__main__":
    main()

Scheduling the script

On Linux, schedule the script with cron to run every 5 minutes:

# Edit the user crontab
crontab -e

# Add this line
*/5 * * * * /usr/bin/python3 /opt/scripts/ollama_health_check.py

On macOS, use launchd. On Windows, use Task Scheduler. The exact configuration varies by platform, but the pattern is the same: run every 5 minutes, log to a file, and exit non-zero on failure so external monitoring can detect the problem.

The monitoring script does not replace systemd's Restart=always. It catches the failures that automatic restart does not solve, such as missing models or GPU regression. Both layers matter.

How does the team get alerted when monitoring fails?

A log file no one reads is not monitoring. The script above writes to a file by design, because that file becomes the input for whatever alerting system the organization already uses. Three common patterns work.

Existing infrastructure monitoring. If the organization uses Datadog, New Relic, Prometheus, or any other monitoring stack, point it at the log file or the script exit code. The integration is one line of configuration.

Slack or Teams webhooks. A 10-line addition to the script can post to Slack or Teams when checks fail. The webhook URL goes in a config file, the script reads it, and the team gets alerts in the channel they already watch.

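Email alerts via cron. Cron mails whatever a job writes to stdout or stderr to the address set in the crontab's MAILTO variable; it does not look at the exit code. Because the script above logs to a file and prints nothing, have it print a one-line message to stderr when a check fails, then set MAILTO. This is the simplest option, but it requires a working mail transfer agent on the host.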

Does PCG help operationalize local AI deployments?

Yes. Phoenix Consultants Group has been building operational software since 1995, and the discipline that applies to monitoring legacy databases or production web services applies directly to local AI infrastructure. A custom engagement includes model selection tailored to the client's actual workflows, monitoring integration with the existing alerting stack, runbook documentation, and team training on the operational procedures.

The FireFlight Data System uses the same monitoring philosophy: continuous health checks, automatic recovery, and external alerting that catches the failures recovery does not solve. The Ollama deployment follows the same playbook.

Need monitoring built into your AI failover?

PCG designs custom monitoring and alerting for local AI deployments, integrated with your existing infrastructure.

Book Your Free Consultation

Frequently Asked Questions

Which local AI model is best for business use?
There is no single best model. Different tasks suit different models. Llama 3.1 8B is the most versatile general-purpose model for business workflows like email drafting and document analysis. Qwen2.5-Coder 14B can stand in for Claude Code for development teams during an outage. Phi-4 14B excels at structured reasoning and data analysis. The right approach is having two or three models installed and routing tasks to the appropriate one.
How do I monitor whether Ollama is working correctly?
A monitoring script that runs every 5 minutes checks three things: the Ollama API is responding on port 11434, the expected models are installed, and a test inference completes within an acceptable time. The script logs failures, and the alerting layer built on that log notifies the team when checks fail. Without monitoring, the team discovers Ollama is down only when they actually need it during a cloud AI outage.
Can local AI replace Claude or ChatGPT for daily business work?
For most daily tasks, yes. Llama 3.1 70B matches GPT-4 on most general benchmarks. Qwen2.5-Coder 32B scores 92.7 percent on HumanEval, comparable to frontier coding models. The 5 to 10 percent of tasks where cloud AI clearly wins involve complex multi-step reasoning, very long contexts, or niche domains. For everything else, local models are sufficient with appropriate hardware.
How do I write a system prompt for local AI models?
Local models follow system prompts but require more explicit instruction than frontier cloud models. Keep prompts direct and specific. State the role, the output format, the length constraint, and any forbidden behaviors. Avoid clever phrasing. A system prompt that works perfectly on Claude may need to be rewritten in simpler language for a local 8B model to follow consistently.
What happens if Ollama crashes during a cloud AI outage?
Without monitoring, the team only discovers the crash when they try to use Ollama as a backup and it fails. With monitoring in place, the automatic restart on systemd or the Windows service manager recovers Ollama within seconds, and the monitoring script logs the event for later review. The combination of automatic restart and external monitoring prevents the worst-case scenario of a failover that has itself failed.
Should I run the monitoring script on the same machine as Ollama?
Running on the same machine is acceptable for small deployments and catches most failure modes. For business-critical setups, running monitoring on a separate machine catches additional failure modes such as the entire Ollama host being unreachable. A separate machine also avoids the situation where the monitoring script crashes alongside Ollama.

About the Author

Allison Woolbert

CEO and Senior Systems Architect, Phoenix Consultants Group

Allison Woolbert is the principal of Phoenix Consultants Group, the custom software consultancy founded in 1995. PCG has run legacy migration projects across Microsoft Access, Visual FoxPro, Paradox, VB6, and other discontinued platforms for industrial, manufacturing, and environmental services clients since the late 1990s.

Allison leads PCG's discovery and architecture practice, where the first deliverable on every legacy engagement is an honest inventory of what the existing application actually does and what it should do next.


Sources

1 Llama 3.1 model documentation and benchmarks, Ollama library: ollama.com/library/llama3.1

2 Llama 3.3 70B model card and benchmark comparisons: ollama.com/library/llama3.3

3 Qwen2.5-Coder benchmarks, HumanEval 92.7 percent: ollama.com/library/qwen2.5-coder

4 Microsoft Phi-4 model card and MATH benchmark: ollama.com/library/phi4

5 Ollama API reference for /api/tags and /api/generate: github.com/ollama/ollama/blob/main/docs/api.md

This article is informational and reflects industry observations as of June 2026. It is not legal, compliance, or financial advice for any specific situation. Phoenix Consultants Group, founded 1995, provides custom software development and AI infrastructure consulting. For guidance tailored to your organization's specific requirements, contact PCG directly.

This is Part 3 of a 4-part series on building an AI continuity plan with Ollama.

Last updated: June 2026 Part 2 of 4
Setting up local AI as a continuity backup starts with hardware. The wrong GPU choice makes Ollama unusable; the right one makes it nearly invisible. This guide covers the exact hardware specifications for production use, the installation process on macOS, Linux, and Windows, and the configuration variables that turn a development setup into a reliable failover service.

Part 1 of this series established why every AI-dependent business needs a continuity plan and introduced Ollama as the most practical local AI runtime for that role. This part addresses the implementation. Hardware first, because every decision after it depends on hardware reality. Installation second, with the configuration choices that matter for production failover use.

By the end of this guide, Ollama will be running on the target hardware, models will be downloaded, and the foundation for a working continuity plan will be in place. The remaining two parts of the series cover model selection with monitoring, then security and auto-failover integration.

What hardware does Ollama actually need?

Ollama is software. The constraint on whether it works for a business is the hardware it runs on. Three hardware paths qualify for production use, one path does not, and the difference between them is roughly an order of magnitude in response speed.

NVIDIA GPUs (Linux and Windows)

NVIDIA is the most common business path. Ollama requires Compute Capability 5.0 or higher, which includes the GTX 960 and every NVIDIA card released since 2015 [1]. The driver version must be 535 or higher on Linux, or 531 or higher on Windows. Modern data center cards (A100, H100, L40S) work and provide significant headroom for larger models.

Verification is a single command. Run nvidia-smi in a terminal. The output shows the driver version and lists available GPUs. If the command is not found or shows errors, the driver is missing or outdated and must be installed before Ollama will use GPU acceleration.
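For fleets with more than a handful of machines, the same check can be scripted. The sketch below is an assumption-laden illustration rather than an official procedure: the helper names `parse_gpus`, `driver_ok`, and `query_nvidia_smi` are hypothetical, and the 535 threshold is simply the Linux minimum quoted in this guide. It uses the `--query-gpu` and `--format` options of `nvidia-smi` to get machine-readable output.

```python
import subprocess

MIN_DRIVER_LINUX = 535  # Linux minimum quoted in this guide


def parse_gpus(csv_output):
    """Parse 'driver_version, name' CSV rows into (major_version, name) tuples."""
    gpus = []
    for line in csv_output.strip().splitlines():
        driver, name = [field.strip() for field in line.split(",", 1)]
        gpus.append((int(driver.split(".")[0]), name))
    return gpus


def driver_ok(csv_output, minimum=MIN_DRIVER_LINUX):
    """True when at least one GPU is listed and every driver meets the minimum."""
    gpus = parse_gpus(csv_output)
    return bool(gpus) and all(major >= minimum for major, _ in gpus)


def query_nvidia_smi():
    """Run nvidia-smi and return its CSV output, or None if it is unavailable."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version,name",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True, timeout=15,
        )
    except (FileNotFoundError, subprocess.CalledProcessError,
            subprocess.TimeoutExpired):
        return None
    return result.stdout
```

A `None` result from `query_nvidia_smi()` corresponds to the "command not found or shows errors" case above; otherwise `driver_ok(output)` reports whether every listed GPU meets the minimum.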

Apple Silicon (M1 through M4)

Apple Silicon is the simplest path. M1, M2, M3, and M4 chips all support Ollama through Metal GPU acceleration with zero configuration [2]. Install Ollama and it uses the GPU automatically. The unified memory architecture is particularly effective for large models because GPU and CPU share the same memory pool, which means a 32 GB Mac can load models that would require dedicated 32 GB GPU cards on PC hardware.

Intel Macs are not viable. Even on a high-end Intel i9 MacBook Pro, generation speed is in the 4 to 6 tokens-per-second range, similar to CPU-only operation on PC hardware.

AMD GPUs (Linux only, as of mid-2026)

AMD support is real but limited. ROCm 7 on Linux works for most modern AMD GPUs [1]. ROCm on Windows is still classified as experimental and is not officially supported by Ollama. Organizations standardized on AMD GPUs on Windows should plan around this reality, either by switching the AI workload to Linux, using WSL2 with the understanding that performance and stability vary, or running Ollama on CPU as a stopgap.

Hardware sizing by model

Different models require different amounts of memory. The table below shows the recommended hardware tier for each common model at standard quantization (Q4_K_M, which is the default in Ollama and balances quality with memory efficiency).

Available Memory | Recommended Model | Typical Speed | Best For
8 GB RAM (CPU only) | phi3:mini or gemma2:2b | 3 to 8 tokens/sec | Simple Q&A only, not viable for production
16 GB RAM (CPU only) | llama3.1:8b | 5 to 10 tokens/sec | Still too slow for most workflows
8 to 12 GB VRAM (NVIDIA) | llama3.1:8b, qwen2.5-coder:7b | 50 to 70 tokens/sec | Email, documents, code generation
16 GB unified memory (Apple) | llama3.1:8b | 40 to 60 tokens/sec | General business workflows
24 GB VRAM (RTX 3090, 4090, A5000) | qwen2.5-coder:14b, llama3.1:70b (tight; needs partial CPU offload) | 30 to 100 tokens/sec | Complex reasoning, near-frontier quality
48 GB+ VRAM or 64 GB+ unified | llama3.1:70b with headroom | 20 to 50 tokens/sec | Highest-quality local inference

The pattern in the table is consistent: GPU acceleration delivers roughly 10x to 20x faster generation than CPU-only operation. For business failover, only the GPU rows are viable.
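The memory tiers follow from simple arithmetic: parameter count times bits per weight. The sketch below is a rough estimator, not a sizing tool. The figure of about 4.85 bits per weight for Q4_K_M and the 1.5 GB allowance for runtime and context are assumptions chosen for illustration, and the function names are hypothetical; real usage varies with context length and model architecture.

```python
def weights_gb(params_billion, bits_per_weight=4.85):
    """Approximate size of the quantized weights in gigabytes (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion, memory_gb, overhead_gb=1.5):
    """Rough check: weights plus an assumed runtime overhead fit in memory."""
    return weights_gb(params_billion) + overhead_gb <= memory_gb


# Compare three model sizes from the table against a 24 GB card.
for name, size in [("llama3.1:8b", 8), ("qwen2.5-coder:14b", 14),
                   ("llama3.1:70b", 70)]:
    estimate = weights_gb(size)
    print(f"{name}: ~{estimate:.1f} GB weights, fits in 24 GB: {fits(size, 24)}")
```

Under these assumptions the 70B estimate lands above 40 GB, which is why that model only runs comfortably in the 48 GB and 64 GB tiers.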

Should the organization self-host or stick with cloud AI?

Not every organization should build local AI infrastructure. The hardware investment and engineering time matter. A practical decision framework looks at three factors.

Existing hardware. If the team already runs machines with compatible GPUs (developer workstations with NVIDIA cards, Apple Silicon laptops, or Linux servers with discrete GPUs), the marginal cost of adding Ollama is engineering time only. If no suitable hardware exists, the conversation shifts to whether a continuity plan justifies a hardware purchase.

Operational criticality. If the business pauses meaningfully when cloud AI fails (development teams blocked, customer support degraded, content production stopped), local AI failover is justified. If AI use is exploratory or non-critical, the case for local infrastructure is weaker.

Data sensitivity. Organizations handling regulated data (healthcare, legal, financial) often need local AI for reasons beyond continuity. Local execution keeps prompts and responses inside the corporate network, which simplifies GDPR, HIPAA, and SOC 2 compliance.

How is Ollama installed on macOS?

macOS is the fastest path to a working installation. The graphical installer handles everything, including the system service setup that makes Ollama available after restart.

Step 1

Download the macOS installer

Visit ollama.com and download the macOS package. The download is approximately 200 MB.

Step 2

Run the installer and grant permissions

Open the downloaded file and drag Ollama to the Applications folder. Launch Ollama. macOS prompts for permission to install the command-line tools. Approve the prompt. Ollama now runs as a menu bar application and starts automatically at login.

Step 3

Verify the installation

Open Terminal and run the verification commands:

ollama --version
# Should output: ollama version is 0.x.x

curl http://localhost:11434
# Should output: Ollama is running

If both commands succeed, Ollama is installed and the API server is listening. The next step is downloading a model.

How is Ollama installed on Linux?

Linux installation requires a few more steps than macOS, but the result is a more robust production deployment. The official installer creates a systemd service with automatic restart on failure, which is the right baseline for business use.

Step 1

Run the installer

Execute the one-line install script. The script handles dependency detection, GPU driver verification, and systemd service creation:

curl -fsSL https://ollama.com/install.sh | sh

The installer creates an ollama system user and installs the binary to /usr/local/bin/ollama. Model storage defaults to /usr/share/ollama/.ollama/models.

Step 2

Verify the service is running

Check the systemd service status:

sudo systemctl status ollama
# Should show: active (running)

curl http://localhost:11434
# Should output: Ollama is running

Step 3

Configure environment variables for production use

The default installation binds Ollama to localhost only and stores models in the system partition. For production deployments, these defaults often need adjustment. Edit the systemd service:

sudo systemctl edit ollama.service

Add the configuration under the [Service] section. The most common production variables:

[Service]
# Bind to all interfaces (only do this with proper firewall rules in place)
Environment="OLLAMA_HOST=0.0.0.0:11434"
# Store models on a larger drive
Environment="OLLAMA_MODELS=/data/ollama/models"
# Keep models loaded in memory longer to reduce cold-start latency
Environment="OLLAMA_KEEP_ALIVE=30m"
# Limit concurrent loaded models if memory is tight
Environment="OLLAMA_MAX_LOADED_MODELS=2"

Save the file and reload the service:

sudo systemctl daemon-reload
sudo systemctl restart ollama

Step 4

Confirm GPU detection (NVIDIA only)

If the machine has an NVIDIA GPU, verify Ollama is using it:

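# Stop the service first (sudo systemctl stop ollama); a second "ollama serve" cannot bind port 11434
OLLAMA_DEBUG=1 ollama serve 2>&1 | grep -i "cuda\|gpu"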

The output should mention CUDA initialization and list the detected GPU. If it shows CPU mode despite an installed GPU, the driver version is likely below the minimum (535 on Linux). Update the NVIDIA driver and restart.

The systemd service includes Restart=always by default, which means Ollama recovers automatically from crashes or OOM kills. This is the single most important property for a continuity service, since the whole point is that Ollama is available when needed.

How is Ollama installed on Windows?

Windows installation uses the installer from ollama.com or the winget package manager. Both produce the same result: Ollama running as a system tray application with the API server listening on localhost:11434.

Step 1

Install Ollama

Two paths work. Either download the Windows installer from ollama.com and run it, or install via PowerShell with winget:

winget install Ollama.Ollama

The installer adds Ollama to the system PATH and starts the background service.

Step 2

Verify the installation

Open a new PowerShell or Command Prompt window (a new session is required for PATH updates to take effect):

ollama --version
curl http://localhost:11434

Both should succeed. The Ollama system tray icon should also be visible.

Step 3

Configure environment variables

Ollama on Windows reads environment variables from the user and system environment. Quit Ollama from the system tray, then open System Properties through the Settings app or Control Panel. Add environment variables:

  • OLLAMA_HOST = 0.0.0.0:11434 (only with firewall in place)
  • OLLAMA_MODELS = D:\OllamaModels (redirect to larger drive)
  • OLLAMA_KEEP_ALIVE = 30m

Restart Ollama from the Start menu. The new environment variables take effect on the next launch.

How are models downloaded and tested?

With Ollama installed and running, the next step is pulling the model library that the failover system will use. Pull all models during normal operations, while the network is available. Once downloaded, models live locally and require no internet access to run.

# General-purpose business model (8B parameters, works on most hardware)
ollama pull llama3.1

# Coding replacement for Claude Code workflows
ollama pull qwen2.5-coder

# Fast document and email drafting model
ollama pull mistral

# Lightweight model for low-spec hardware
ollama pull phi3

# Verify all models are present and ready
ollama list

Test each model with a real prompt to confirm output quality and response speed before relying on it for failover:

ollama run llama3.1 "Summarize the key risks of cloud AI dependency for a manufacturing business in 100 words."

A well-functioning installation responds within seconds and produces coherent output. If response time exceeds 30 seconds for a short prompt on a GPU-equipped machine, the model is probably running on CPU. Verify GPU acceleration is active.
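Response speed can be measured rather than judged by feel. The sketch below assumes a default local installation at `http://localhost:11434` and uses the `eval_count` and `eval_duration` fields (the latter in nanoseconds) that Ollama returns from a non-streaming `/api/generate` call; the function names are hypothetical and chosen for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default address


def tokens_per_second(result):
    """Compute generation speed from an /api/generate response dictionary."""
    duration_ns = result.get("eval_duration", 0)
    if duration_ns <= 0:
        return 0.0
    return result.get("eval_count", 0) / (duration_ns / 1e9)


def run_prompt(model, prompt, timeout=120):
    """Send one non-streaming generate request and return the parsed JSON."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    request = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `tokens_per_second(run_prompt("llama3.1", "Reply with the single word: OK"))` on the target machine gives a number to compare against the hardware sizing table; a result in the single digits on a GPU-equipped machine suggests the model is running on CPU.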

Does PCG handle Ollama deployment for clients?

Phoenix Consultants Group has been deploying production software systems since 1995, and the operational discipline that applies to legacy migrations and compliance platforms applies equally to local AI infrastructure. A custom Ollama deployment engagement starts with a hardware audit (what compatible machines already exist on the network), continues through installation and configuration tailored to the client's operating systems, and ends with team training on the operational procedures that keep the failover ready.

The same engineering team that builds and maintains the FireFlight Data System manages Ollama deployments. Both involve infrastructure that has to run continuously without manual babysitting, which is what PCG has built for three decades.

Need help deploying Ollama in production?

PCG handles hardware assessment, multi-platform installation, monitoring integration, and team training as a single engagement.

Book Your Free Consultation

Frequently Asked Questions

What GPU do I need to run Ollama for business use?
For business-grade inference speed, NVIDIA GPUs with Compute Capability 5.0 or higher (GTX 960 and newer) with driver version 535 or higher on Linux, or 531 or higher on Windows. Apple Silicon M1 through M4 chips work automatically through Metal. AMD GPUs require ROCm 7 on Linux. The minimum VRAM for a 7-billion-parameter model at standard quantization is 6 GB.
Can Ollama run on an AMD GPU on Windows?
Not natively as of mid-2026. AMD GPU acceleration through ROCm is Linux-only. Windows users with AMD GPUs must either run Ollama on CPU, use WSL2 with experimental ROCm support, or switch to Linux for the AI workload.
How much disk space does an Ollama installation need?
The Ollama runtime itself uses approximately 1 GB. Models are the main storage cost. A baseline emergency library of llama3.1 (4.7 GB), qwen2.5-coder (8 GB), mistral (4.1 GB), and phi3 (2.2 GB) totals roughly 20 GB. Adding a 70B model adds another 40 to 45 GB. Plan for 60 to 80 GB of free disk space for a full business deployment.
Should I install Ollama as a system service or run it manually?
For production failover use, install as a system service. On macOS the desktop application handles this automatically. On Linux, configure as a systemd service with automatic restart on failure. On Windows, install as a system service. Manual ollama serve invocations are appropriate for development testing but do not survive reboots or process crashes.
Where does Ollama store the downloaded models?
Default locations are ~/.ollama/models on macOS, /usr/share/ollama/.ollama/models for the Linux systemd service (the home directory of the ollama system user), and C:\Users\<user>\.ollama\models on Windows. The location is configurable through the OLLAMA_MODELS environment variable. For business deployments, redirecting model storage to a separate drive is recommended.
Do I need to keep Ollama running all the time?
Yes for failover scenarios. Ollama runs as a background service that listens on port 11434. Idle service consumption is minimal because models load into memory only on request and unload after a configurable inactivity period through OLLAMA_KEEP_ALIVE.


Sources

1 Ollama official GPU support documentation, NVIDIA and AMD requirements: github.com/ollama/ollama/blob/main/docs/gpu.md

2 Ollama documentation on Apple Silicon and Metal GPU acceleration: docs.ollama.com

3 Ollama Linux installation and systemd configuration: docs.ollama.com/linux

4 Ollama environment variable reference: github.com/ollama/ollama/blob/main/docs/faq.md



Last updated: May 2026
Cloud AI services like Claude and ChatGPT have become critical business infrastructure, yet most organizations have no plan for when these services fail. An AI continuity plan documents the fallback path: a local AI runtime, monitoring, and tested procedures that keep work moving during cloud outages. The reality is that this protection requires specific hardware to be viable.

Developers use Claude Code to write and review code. Marketing teams draft content with ChatGPT. Operations staff process documents, summarize meetings, and answer internal questions through AI assistants embedded in daily workflows. For a growing share of businesses, AI has joined email and internet access on the list of services that, when they fail, the workday effectively pauses.

Yet most organizations have no continuity plan for AI outages. Disaster recovery exists for servers, databases, and network equipment. AI is rarely included.

How often do cloud AI services actually go down?

More often than most teams realize. OpenAI's published status data shows roughly 99 percent uptime across recent 90-day windows [1]. Anthropic publishes comparable numbers for Claude [2]. A 99 percent figure sounds reassuring until the math is applied: 99 percent uptime equals roughly 7 hours of downtime per month, or 87 hours per year.
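The arithmetic behind those figures is short enough to check directly. The sketch below is only a worked calculation; the function name and the 30-day month are assumptions made for the example. At 99 percent uptime it works out to about 7.2 hours per 30-day month and 87.6 hours per 365-day year.

```python
HOURS_PER_DAY = 24


def downtime_hours(uptime_fraction, days):
    """Hours of expected downtime over a period at a given uptime fraction."""
    return (1 - uptime_fraction) * days * HOURS_PER_DAY


# Compare two availability levels over a month and a year.
for label, uptime in [("99%", 0.99), ("99.9%", 0.999)]:
    monthly = downtime_hours(uptime, 30)
    yearly = downtime_hours(uptime, 365)
    print(f"{label}: {monthly:.1f} h/month, {yearly:.1f} h/year")
```

The second row shows how much each additional "nine" matters: moving from 99 to 99.9 percent cuts expected downtime by a factor of ten.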

Recent major incidents make the abstract concrete. On April 28, 2026, Claude AI suffered a major outage that took down Claude.ai, Claude Code, Claude Chat, and the Anthropic API simultaneously, with more than 12,000 users filing reports on Downdetector before service was restored after roughly 78 minutes [3]. Just 8 days earlier, on April 20, 2026, Claude had experienced a separate partial outage affecting authentication across the same surfaces [3]. ChatGPT experienced a major global disruption on April 20, 2026, with thousands of simultaneous reports across the UK, US, and India, affecting both the chatbot and the Codex platform [4]. Earlier in the year, on February 25 and 26, 2026, OpenAI logged back-to-back incidents affecting artifact generation and ChatGPT Apps integrations [5]. Across the same 90-day window, monitoring services tracked 134 Claude incidents and 54 ChatGPT incidents, with median recovery times measured in hours, not minutes [6].

The pattern is consistent: outages happen, they last hours, and they often hit multiple platforms simultaneously because shared infrastructure underlies them all.

What does an AI outage actually cost a business?

The visible cost is paused work. A development team that has restructured around Claude Code suddenly cannot get code review, suggestions, or refactoring assistance. A marketing team that drafts and edits with ChatGPT loses its content pipeline. Customer support teams that route initial responses through AI have to fall back to fully manual workflows.

The hidden cost is the recovery time after service restores. Teams often spend hours debugging what they assume is their own broken code or misconfigured integrations before realizing the AI provider is the actual problem.

Silent Failures

Cloud AI does not always fail loudly. Models return empty responses, time out unpredictably, or degrade quietly while teams assume their own code is broken.

Shared Infrastructure

Most cloud AI providers rely on overlapping infrastructure layers. When Cloudflare or a major datacenter fails, multiple AI services fail together.

No Tested Fallback

Disaster recovery plans cover servers and databases. AI services rarely appear on the continuity checklist, leaving teams with no documented procedure when outages happen.

What is local AI and how does it fit into a continuity plan?

A continuity plan answers a specific question: when the primary system fails, what runs instead. For cloud AI, the answer is a parallel local AI system operating on the organization's own hardware, ready to take over critical workflows during an outage.

Local AI runs entirely on the user's hardware. No data leaves the building. No internet connection is required after initial setup. The most practical local AI runtime for business use is Ollama, an open-source platform that downloads and serves large language models on the same machine where business applications are running [7].

Once installed, Ollama exposes an HTTP API, on localhost port 11434 by default, that includes endpoints compatible with the OpenAI API format. Business applications that currently call Claude or ChatGPT can be redirected to call Ollama instead with minimal code changes. The fallback is technical, automatic, and verifiable.
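The redirect can be sketched as follows. This is a minimal illustration under stated assumptions, not the failover client covered later in the series: the function names are hypothetical, the base URL assumes a default local installation, and the placeholder API key is included only because OpenAI-style clients send one. It targets the `/v1/chat/completions` endpoint that Ollama provides for OpenAI compatibility.

```python
import json
import urllib.request

LOCAL_BASE_URL = "http://localhost:11434/v1"  # assumed default local address


def build_chat_request(base_url, model, user_text, api_key="ollama"):
    """Build an OpenAI-style chat completion request for a compatible server."""
    body = {"model": model, "messages": [{"role": "user", "content": user_text}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # placeholder value
        },
        method="POST",
    )


def send_chat(request, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        data = json.load(response)
    return data["choices"][0]["message"]["content"]


def chat_with_fallback(primary, fallback, user_text):
    """Try the primary callable; on a network or HTTP error, use the fallback."""
    try:
        return primary(user_text)
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return fallback(user_text)
```

Here `primary` and `fallback` are any callables that take the prompt text and return a reply, for example one wrapping the cloud provider's client and one wrapping `send_chat(build_chat_request(LOCAL_BASE_URL, "llama3.1", text))`. Keeping the two paths behind the same call signature is what makes the switch a small code change.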

A local AI system does not eliminate dependence on cloud AI. It eliminates total dependence on cloud AI. The combination of a primary cloud provider for production workflows and a local fallback for emergencies is what business continuity looks like in 2026.

Does a local AI backup require special hardware?

Yes, and this is the part of the conversation that gets skipped most often. Ollama is free to install, but it is not magic. The models that make it useful for business work require specific hardware to run at usable speeds.

What hardware works

Ollama performs well on NVIDIA GPUs with Compute Capability 5.0 or higher (essentially NVIDIA GTX 960 and newer), Apple Silicon chips (M1 through M4), and AMD GPUs with ROCm 7 drivers on Linux [8]. On these platforms, a 7-billion-parameter model generates between 40 and 120 tokens per second depending on the specific hardware, which is fast enough for production use.

What hardware does not work

CPU-only operation is technically possible. Ollama will install and run on a machine with no GPU. The result, however, is between 3 and 8 tokens per second for the same 7B model [8]. That is too slow for any workflow that involves waiting for a response, which describes nearly all business use cases.

Older Intel Macs (pre-Apple Silicon) and AMD GPUs on Windows currently fall into the unsupported or poorly supported category. Organizations relying on either should plan around that reality before committing to a local AI implementation.

This series is built around organizations with appropriate hardware. CPU-only deployments are addressed honestly: not viable for production failover.

What does this series cover?

This is Part 1 of a four-part series on building an AI continuity plan using Ollama. Subsequent parts go deep on the technical implementation:

Part 2: Hardware Requirements and Installing Ollama. A detailed hardware decision guide, the exact GPU and driver specifications, step-by-step installation on macOS, Linux, and Windows, and the configuration variables that matter for production use.

Part 3: Choosing Models and Monitoring Your Local AI. Which model to assign to which business task, optimized system prompts for local models, and a complete Python monitoring script that runs as a scheduled task and alerts when Ollama goes unhealthy.

Part 4: Securing and Automating Your Failover. Firewall configuration for Linux, Windows, and macOS deployments. Auto-failover client code that detects cloud AI failures and routes requests to Ollama automatically. Full contingency mode procedures and the testing drills that keep the system trustworthy.

Does PCG build custom software systems like this for clients?

Phoenix Consultants Group has been building production software systems for operational continuity since 1995, with three decades of experience in environments where business-critical software cannot stop. The FireFlight Data System, a modular platform PCG developed and maintains, was designed with that same operational reality in mind: hosted on PCG infrastructure, monitored continuously, and architected so that one component's failure does not cascade through the rest.

The same engineering discipline applies to AI infrastructure. A custom AI continuity implementation involves hardware assessment, Ollama deployment, monitoring integration, failover client development, and team training on the contingency procedures. PCG handles all of it as a single engagement.

Building AI continuity for your team?

PCG designs and deploys custom failover systems for businesses dependent on cloud AI. The diagnostic call is with an engineer, not a sales tier.

Book Your Free Consultation


Frequently Asked Questions

What is an AI continuity plan?
An AI continuity plan is a documented strategy for keeping AI-dependent workflows running when cloud AI services like Claude, ChatGPT, or Gemini become unavailable. The plan typically combines a local AI runtime such as Ollama, monitoring scripts, and tested fallback procedures for critical workflows.
How often do cloud AI services like ChatGPT or Claude go down?
OpenAI, Anthropic, and Google all publish uptime data showing roughly 99 percent availability. That sounds high until the math reveals roughly 7 hours of downtime per month. In early 2026 alone, Claude AI had a major 78-minute outage on April 28 affecting all surfaces simultaneously, and ChatGPT had a global disruption on April 20 affecting both the chatbot and the Codex platform.
Can my business run AI locally without internet access?
Yes, once the appropriate models have been downloaded. Ollama and similar local runtimes operate fully offline after initial setup. The constraint is hardware capability rather than network access. Models load into GPU or unified memory and respond to local API calls with no external dependency.
What hardware does a local AI backup system require?
Ollama requires a compatible GPU for business-grade performance. NVIDIA cards with Compute Capability 5.0 or higher, Apple Silicon M1 through M4 chips, or AMD GPUs with ROCm 7 on Linux all qualify. CPU-only operation works technically but generates roughly 3 to 8 tokens per second, which is too slow for most production workflows.
Is local AI a replacement for Claude or ChatGPT?
Not a full replacement. Frontier models from Anthropic and OpenAI still lead on complex reasoning and nuanced output. Local models on appropriate hardware are sufficient for the bulk of daily AI work including drafting, summarizing, code generation, and document analysis. The role of local AI is failover, not displacement.
What does an AI continuity plan cost to set up?
The software is free. Ollama is open source. The investment is engineering time to install, configure monitoring, and integrate failover logic into business applications, plus the hardware itself if a compatible machine is not already available. Most implementations take one afternoon of engineering work and one hour per month to maintain.
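The failover logic mentioned above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a production implementation: the provider functions are stand-ins (a real setup would wrap a cloud API client and a local runtime such as Ollama), and every name here is hypothetical.

```python
from typing import Callable, Sequence

Provider = Callable[[str], str]  # takes a prompt, returns the model's reply


def complete_with_failover(
    prompt: str, providers: Sequence[tuple[str, Provider]]
) -> tuple[str, str]:
    """Try each provider in order; return (name, reply) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # any failure (timeout, HTTP error) triggers fallback
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stand-in providers for illustration only.
def cloud_stub(prompt: str) -> str:
    raise TimeoutError("simulated cloud outage")


def local_stub(prompt: str) -> str:
    return f"[local] {prompt}"


name, reply = complete_with_failover(
    "Draft a status update.", [("cloud", cloud_stub), ("local", local_stub)]
)
print(name, reply)
```

Ordering the list from preferred to fallback keeps the cloud model as the default and sends traffic to the local model only when the cloud call fails.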

About the Author

Allison Woolbert

CEO and Senior Systems Architect, Phoenix Consultants Group

Allison Woolbert is the principal of Phoenix Consultants Group, a custom software consultancy founded in 1995. PCG has run legacy migration projects across Microsoft Access, Visual FoxPro, Paradox, VB6, and other discontinued platforms for industrial, manufacturing, and environmental services clients since the late 1990s.

Allison leads PCG's discovery and architecture practice, where the first deliverable on every legacy engagement is an honest inventory of what the existing application actually does and what it should do next.


Sources

1 OpenAI Status Page, 90-day uptime metrics: status.openai.com

2 Anthropic Status Page, 90-day uptime metrics: status.anthropic.com

3 Rolling Out, Claude AI outage hits 12,000 users in major disruption, April 28, 2026: rollingout.com/2026/04/28/anthropic-claude-outage-users-locked-out

4 Open Magazine, ChatGPT Hit by Major Global Outage, April 20, 2026: openthemagazine.com

5 StatusGator, OpenAI Outage History, February 2026 incidents: statusgator.com/services/openai/outage-history

6 IsDown monitoring data, 90-day incident counts for Claude and ChatGPT, May 2026: isdown.app/status/claude-ai

7 Ollama official documentation: ollama.com

8 Ollama GPU requirements and benchmarks, official documentation: github.com/ollama/ollama/blob/main/docs/gpu.md

This article is informational and reflects industry observations as of May 2026. It is not legal, compliance, or financial advice for any specific situation. Phoenix Consultants Group, founded 1995, provides custom software development and AI infrastructure consulting. For guidance tailored to your organization's specific requirements, contact PCG directly.

Continue the Series

Get the technical implementation guide

Parts 2, 3, and 4 cover hardware requirements, installation, model selection, monitoring, firewall configuration, and auto-failover integration. One installment per week, sent directly to your inbox.


Phoenix Consultants Group - Custom Computer Programming
Phoenix Consultants Group is a Minority Women and Veteran Owned business
LGBT-Owned

Copyright © 2021-2026. All Rights Reserved | Phoenix Consultants Group