Part 1 of this series established why every AI-dependent business needs a continuity plan and introduced Ollama as the most practical local AI runtime for that role. This part addresses the implementation. Hardware first, because every decision after it depends on hardware reality. Installation second, with the configuration choices that matter for production failover use.
By the end of this guide, Ollama will be running on the target hardware, models will be downloaded, and the foundation for a working continuity plan will be in place. The remaining two parts of the series cover model selection with monitoring, then security and auto-failover integration.
What hardware does Ollama actually need?
Ollama is software. The constraint on whether it works for a business is the hardware it runs on. Three hardware paths qualify for production use, one path does not, and the difference between them is roughly an order of magnitude in response speed.
NVIDIA GPUs (Linux and Windows)
NVIDIA is the most common business path. Ollama requires Compute Capability 5.0 or higher, which includes the GTX 960 and every NVIDIA card released since 2015 [1]. The driver version must be 535 or higher on Linux, or 531 or higher on Windows. Modern data center cards (A100, H100, L40S) work and provide significant headroom for larger models.
Verification is a single command. Run nvidia-smi in a terminal. The output shows the driver version and lists available GPUs. If the command is not found or shows errors, the driver is missing or outdated and must be installed before Ollama will use GPU acceleration.
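For teams auditing several machines, the same check can be scripted. The following is a minimal Python sketch, not an official tool: the function names are illustrative, the minimum versions are the ones quoted in this guide, and it assumes nvidia-smi is on the PATH and supports its standard --query-gpu CSV output.

```python
import subprocess

# Minimum major driver versions quoted in this guide.
MIN_DRIVER = {"linux": 535, "windows": 531}


def driver_meets_minimum(driver_version: str, platform: str) -> bool:
    """Compare the major driver version against this guide's minimum."""
    major = int(driver_version.strip().split(".")[0])
    return major >= MIN_DRIVER[platform]


def query_gpus() -> list[tuple[str, str]]:
    """Return (driver_version, gpu_name) pairs reported by nvidia-smi.

    Raises FileNotFoundError when nvidia-smi is not installed, which
    corresponds to the "command is not found" case described above.
    """
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version,name", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    rows = [line.split(",", 1) for line in out.splitlines() if line.strip()]
    return [(version.strip(), name.strip()) for version, name in rows]
```

Calling query_gpus() on a GPU host and passing each reported version to driver_meets_minimum gives a quick pass or fail per machine.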
Apple Silicon (M1 through M4)
Apple Silicon is the simplest path. M1, M2, M3, and M4 chips all support Ollama through Metal GPU acceleration with zero configuration [2]. Install Ollama and it uses the GPU automatically. The unified memory architecture is particularly effective for large models because GPU and CPU share the same memory pool, which means a 32 GB Mac can load models that would otherwise need a high-VRAM dedicated GPU on PC hardware, although part of that memory stays reserved for macOS and running applications.
Intel Macs are not viable. Even on a high-end Intel i9 MacBook Pro, generation speed is in the 4 to 6 tokens-per-second range, similar to CPU-only operation on PC hardware.
AMD GPUs (Linux only, as of mid-2026)
AMD support is real but limited. ROCm 7 on Linux works for most modern AMD GPUs [1]. ROCm on Windows is still classified as experimental and is not officially supported by Ollama. Organizations standardized on AMD GPUs on Windows should plan around this reality, either by switching the AI workload to Linux, using WSL2 with the understanding that performance and stability vary, or running Ollama on CPU as a stopgap.
Hardware sizing by model
Different models require different amounts of memory. The table below shows the recommended hardware tier for each common model at standard quantization (Q4_K_M, which is the default in Ollama and balances quality with memory efficiency).
| Available Memory | Recommended Model | Typical Speed | Best For |
|---|---|---|---|
| 8 GB RAM (CPU only) | phi3:mini or gemma2:2b | 3 to 8 tokens/sec | Simple Q&A only, not viable for production |
| 16 GB RAM (CPU only) | llama3.1:8b | 5 to 10 tokens/sec | Still too slow for most workflows |
| 8 to 12 GB VRAM (NVIDIA) | llama3.1:8b, qwen2.5-coder:7b | 50 to 70 tokens/sec | Email, documents, code generation |
| 16 GB unified memory (Apple) | llama3.1:8b | 40 to 60 tokens/sec | General business workflows |
| 24 GB VRAM (RTX 3090, 4090, A5000) | qwen2.5-coder:14b, llama3.1:70b (tight) | 30 to 100 tokens/sec | Complex reasoning, near-frontier quality |
| 48 GB+ VRAM or 64 GB+ unified | llama3.1:70b with headroom | 20 to 50 tokens/sec | Highest-quality local inference |
The pattern in the table is consistent: GPU acceleration delivers roughly 10x to 20x faster generation than CPU-only operation. For business failover, only the GPU rows are viable.
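During a hardware audit, the table can be encoded as a small lookup. The sketch below transcribes the tiers from this guide. The memory kinds ("cpu", "vram", "unified") and the function name are illustrative choices, and treating all discrete-GPU rows as one "vram" kind is a simplifying assumption.

```python
# Tiers transcribed from the sizing table in this guide:
# (memory kind, minimum GB, recommended models, viable for failover)
TIERS = [
    ("cpu", 8, ["phi3:mini", "gemma2:2b"], False),
    ("cpu", 16, ["llama3.1:8b"], False),
    ("vram", 8, ["llama3.1:8b", "qwen2.5-coder:7b"], True),
    ("vram", 24, ["qwen2.5-coder:14b", "llama3.1:70b"], True),  # 70b is tight here
    ("vram", 48, ["llama3.1:70b"], True),
    ("unified", 16, ["llama3.1:8b"], True),
    ("unified", 64, ["llama3.1:70b"], True),
]


def recommend(kind: str, memory_gb: float):
    """Return (models, viable) for the largest tier that fits, or None."""
    fitting = [tier for tier in TIERS if tier[0] == kind and memory_gb >= tier[1]]
    if not fitting:
        return None
    _, _, models, viable = max(fitting, key=lambda tier: tier[1])
    return models, viable
```

A machine below the smallest tier for its kind returns None, which is a signal that it should not be counted as failover capacity.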
Should the organization self-host or stick with cloud AI?
Not every organization should build local AI infrastructure. The hardware investment and engineering time matter. A practical decision framework looks at three factors.
Existing hardware. If the team already runs machines with compatible GPUs (developer workstations with NVIDIA cards, Apple Silicon laptops, or Linux servers with discrete GPUs), the marginal cost of adding Ollama is engineering time only. If no suitable hardware exists, the conversation shifts to whether a continuity plan justifies a hardware purchase.
Operational criticality. If the business pauses meaningfully when cloud AI fails (development teams blocked, customer support degraded, content production stopped), local AI failover is justified. If AI use is exploratory or non-critical, the case for local infrastructure is weaker.
Data sensitivity. Organizations handling regulated data (healthcare, legal, financial) often need local AI for reasons beyond continuity. Local execution keeps prompts and responses inside the corporate network, which simplifies GDPR, HIPAA, and SOC 2 compliance.
How is Ollama installed on macOS?
macOS is the fastest path to a working installation. The graphical installer handles everything, including the system service setup that makes Ollama available after restart.
Download the macOS installer
Visit ollama.com and download the macOS package. The download is approximately 200 MB.
Run the installer and grant permissions
Open the downloaded file and drag Ollama to the Applications folder. Launch Ollama. macOS prompts for permission to install the command-line tools. Approve the prompt. Ollama now runs as a menu bar application and starts automatically at login.
Verify the installation
Open Terminal and run the verification commands:
ollama --version
curl http://localhost:11434
If both commands succeed, Ollama is installed and the API server is listening. The next step is downloading a model.
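The API check is also the natural building block for a later health probe. A minimal Python sketch follows, assuming the default address localhost:11434. The function name and the injectable opener parameter are illustrative, included only so the check can be exercised without a live server.

```python
import urllib.request


def api_is_listening(base_url: str = "http://localhost:11434",
                     opener=urllib.request.urlopen) -> bool:
    """Return True if the Ollama server answers at base_url with HTTP 200."""
    try:
        with opener(base_url, timeout=5) as response:
            return response.status == 200
    except OSError:
        # urllib's URLError and plain connection failures are both OSError.
        return False
```

A False result means either that Ollama is not running or that it is bound to a different address or port.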
How is Ollama installed on Linux?
Linux installation requires a few more steps than macOS, but the result is a more robust production deployment. The official installer creates a systemd service with automatic restart on failure, which is the right baseline for business use.
Run the installer
Execute the one-line install script. The script handles dependency detection, GPU driver verification, and systemd service creation:
curl -fsSL https://ollama.com/install.sh | sh
The installer creates an ollama system user and installs the binary to /usr/local/bin/ollama. Model storage defaults to /usr/share/ollama/.ollama/models.
Verify the service is running
Check the systemd service status:
systemctl status ollama
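For unattended checks, the service state can be read programmatically. This is a small Python sketch that wraps systemctl is-active. The unit name ollama matches the service the installer creates, while the function name and the injectable run parameter are illustrative.

```python
import subprocess


def service_is_active(unit: str = "ollama", run=subprocess.run) -> bool:
    """Return True when `systemctl is-active <unit>` reports "active"."""
    result = run(["systemctl", "is-active", unit],
                 capture_output=True, text=True)
    return result.returncode == 0 and result.stdout.strip() == "active"
```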
Configure environment variables for production use
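The default installation binds Ollama to localhost only and stores models on the system partition. For production deployments, these defaults often need adjustment. Edit the systemd service:
sudo systemctl edit ollama.service
Add the configuration under the [Service] section, one Environment= line per variable. The most common production variables are OLLAMA_HOST (the bind address, for example 0.0.0.0:11434, and only with a firewall in place), OLLAMA_MODELS (the model storage directory, typically redirected to a larger drive), and OLLAMA_KEEP_ALIVE (how long a model stays loaded after a request, for example 30m).
Save the file and reload the service:
sudo systemctl daemon-reload
sudo systemctl restart ollama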
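When the same settings have to be applied across several servers, generating the override text avoids copy-and-paste drift. The sketch below renders the [Service] section from a dictionary. The host and keep-alive values mirror the ones used elsewhere in this guide, and the models path is a hypothetical example rather than a recommended location.

```python
def render_override(env: dict[str, str]) -> str:
    """Render a systemd drop-in with one Environment= line per variable."""
    lines = ["[Service]"]
    for name, value in env.items():
        lines.append(f'Environment="{name}={value}"')
    return "\n".join(lines) + "\n"


settings = {
    "OLLAMA_HOST": "0.0.0.0:11434",          # only with a firewall in place
    "OLLAMA_MODELS": "/data/ollama/models",  # hypothetical path on a larger volume
    "OLLAMA_KEEP_ALIVE": "30m",
}
print(render_override(settings), end="")
```

If the model directory is moved, the ollama system user needs read and write access to the new location.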
Confirm GPU detection (NVIDIA only)
If the machine has an NVIDIA GPU, verify Ollama is using it by reading the service logs:
journalctl -u ollama --no-pager
The output should mention CUDA initialization and list the detected GPU. If it shows CPU mode despite an installed GPU, the driver version is likely below the minimum (535 on Linux). Update the NVIDIA driver and restart.
How is Ollama installed on Windows?
Windows installation uses a graphical installer or the winget package manager. Both produce the same result: Ollama running as a system tray application with the API server listening on localhost:11434.
Install Ollama
Two paths work. Either download the Windows installer from ollama.com and run it, or install via PowerShell with winget:
winget install Ollama.Ollama
The installer adds Ollama to the system PATH and starts the background service.
Verify the installation
Open a new PowerShell or Command Prompt window (a new session is required for PATH updates to take effect) and run the verification commands:
ollama --version
curl.exe http://localhost:11434
Both should succeed. The Ollama system tray icon should also be visible.
Configure environment variables
Ollama on Windows reads environment variables from the user and system environment. Quit Ollama from the system tray, then open System Properties through the Settings app or Control Panel. Add environment variables:
OLLAMA_HOST=0.0.0.0:11434 (only with a firewall in place)
OLLAMA_MODELS=D:\OllamaModels (redirect to a larger drive)
OLLAMA_KEEP_ALIVE=30m
Restart Ollama from the Start menu. The new environment variables take effect on the next launch.
How are models downloaded and tested?
With Ollama installed and running, the next step is pulling the model library that the failover system will use. Pull all models during normal operations, while the network is available. Once downloaded, models live locally and require no internet access to run.
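Because a missing model only becomes visible during an outage, it is worth confirming that the full set is present on disk. A minimal Python sketch follows, assuming the server's /api/tags endpoint, which lists locally stored models. The REQUIRED list is an example drawn from the sizing table, not a prescribed set.

```python
import json
import urllib.request

REQUIRED = ["llama3.1:8b", "qwen2.5-coder:7b"]  # example set from the sizing table


def missing_models(tags_payload: dict, required: list[str]) -> list[str]:
    """Return the required model names absent from an /api/tags response."""
    installed = {model["name"] for model in tags_payload.get("models", [])}
    return [name for name in required if name not in installed]


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the list of locally stored models from the Ollama API."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return json.load(resp)
```

Running missing_models(fetch_tags(), REQUIRED) on the failover machine should return an empty list once every model has been pulled.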
Test each model with a real prompt to confirm output quality and response speed before relying on it for failover, for example:
ollama run llama3.1:8b "Draft a two-sentence status update for a delayed shipment."
A well-functioning installation responds within seconds and produces coherent output. If response time exceeds 30 seconds for a short prompt on a GPU-equipped machine, the model is probably running on CPU. Verify GPU acceleration is active.
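Both symptoms can be measured rather than judged by feel. The Python sketch below assumes two fields from the Ollama HTTP API: eval_count and eval_duration (in nanoseconds) in a non-streaming /api/generate reply, and size and size_vram in each /api/ps entry for a loaded model. The function names are illustrative.

```python
import json
import urllib.request


def tokens_per_second(generate_response: dict) -> float:
    """Generation speed from eval_count and eval_duration (nanoseconds)."""
    seconds = generate_response["eval_duration"] / 1e9
    return generate_response["eval_count"] / seconds


def vram_fraction(ps_model: dict) -> float:
    """Share of a loaded model held in GPU memory, from one /api/ps entry."""
    if not ps_model["size"]:
        return 0.0
    return ps_model["size_vram"] / ps_model["size"]


def generate(model: str, prompt: str,
             base_url: str = "http://localhost:11434") -> dict:
    """Send one non-streaming prompt to /api/generate and return the JSON reply."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{base_url}/api/generate", data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as resp:
        return json.load(resp)
```

A tokens_per_second result in the single digits, or a vram_fraction well below 1.0, points to the CPU fallback described above and can be compared against the ranges in the sizing table.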
Does PCG handle Ollama deployment for clients?
Phoenix Consultants Group has been deploying production software systems since 1995, and the operational discipline that applies to legacy migrations and compliance platforms applies equally to local AI infrastructure. A custom Ollama deployment engagement starts with a hardware audit (what compatible machines already exist on the network), continues through installation and configuration tailored to the client's operating systems, and ends with team training on the operational procedures that keep the failover ready.
The same engineering team that builds and maintains the FireFlight Data System manages Ollama deployments. Both involve infrastructure that has to run continuously without manual babysitting, which is what PCG has built for three decades.
About the Author
Allison Woolbert
CEO and Senior Systems Architect, Phoenix Consultants Group
Allison Woolbert is the principal of Phoenix Consultants Group, the custom software consultancy founded in 1995. PCG has run legacy migration projects across Microsoft Access, Visual FoxPro, Paradox, VB6, and other discontinued platforms for industrial, manufacturing, and environmental services clients since the late 1990s.
Allison leads PCG's discovery and architecture practice, where the first deliverable on every legacy engagement is an honest inventory of what the existing application actually does and what it should do next.
Sources
[1] Ollama official GPU support documentation, NVIDIA and AMD requirements: github.com/ollama/ollama/blob/main/docs/gpu.md
[2] Ollama documentation on Apple Silicon and Metal GPU acceleration: docs.ollama.com
[3] Ollama Linux installation and systemd configuration: docs.ollama.com/linux
[4] Ollama environment variable reference: github.com/ollama/ollama/blob/main/docs/faq.md
This is Part 2 of a 4-part series on building an AI continuity plan with Ollama. Parts 3 and 4 cover monitoring, model selection, firewall configuration, and auto-failover integration.