
Last updated: June 2026 | Part 3 of 4

A working Ollama installation is only half the story. The other half is choosing the right model for each business task and monitoring the system so that failures get detected before they matter. This guide covers model selection by use case, system prompt patterns that work for local models, and a complete Python monitoring script that runs as a scheduled task.

Part 2 of this series handled hardware and installation. With Ollama running, the next questions are operational. Which model should handle which task? How does a business team know if the local AI is healthy before the moment they need it? What gets logged, and where do alerts go?

The team that treats local AI like any other production service is the team whose failover actually works during a cloud AI outage. This part covers the practices that get a deployment from "installed" to "trusted."

Which local AI model should the business use?

The honest answer is that no single model is optimal for every task. Frontier cloud models like Claude and GPT-5 are general enough that one model handles everything. Local models are typically more specialized. A practical business deployment has two or three models installed, each assigned to the task it handles best.

For general business tasks (email, documents, summaries)

Llama 3.1 8B is the most versatile general-purpose model. It runs on 8 to 12 GB of VRAM, generates 50 to 70 tokens per second on consumer NVIDIA GPUs, and produces output quality comparable to GPT-3.5 across most business workflows¹. For organizations with more memory available, Llama 3.3 70B matches GPT-4 on most benchmarks and runs on 40 GB of VRAM².

    ollama pull llama3.1:8b    # 4.7 GB download, runs on 8GB+ VRAM
    ollama pull llama3.3:70b   # 43 GB download, runs on 40GB+ VRAM

For code generation and review (Claude Code replacement)

Qwen2.5-Coder is the most capable open coding model available through Ollama. The 7B variant runs comfortably on consumer hardware and handles autocomplete, refactoring, and bug fixing. The 32B variant scores 92.7 percent on HumanEval, putting it in the same range as GPT-4o for pure coding benchmarks³.

    ollama pull qwen2.5-coder:7b    # 4.7 GB, daily coding work
    ollama pull qwen2.5-coder:32b   # 20 GB, complex multi-file changes

For structured reasoning and analysis

Microsoft's Phi-4 14B is purpose-built for mathematical reasoning, structured logic, and analytical tasks. It scores 80.4 percent on the MATH benchmark and outperforms general models several times its size on STEM problems⁴. Phi-4 is the right pick for data analysis pipelines, algorithm design, and any task where step-by-step logical reasoning matters more than creative output.

    ollama pull phi4:14b   # 9 GB, structured reasoning tasks

For fast lightweight tasks

Phi-3 Mini and Gemma 2 2B are lightweight models that run on minimal hardware. They are suitable for fast text classification, simple Q&A, and tasks where response latency matters more than depth. They are not replacements for larger models but they cover the case where a task does not need full reasoning capability.
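The task-to-model assignments above can be captured as a simple lookup, so that application code asks for a task category rather than hard-coding a model tag. This is a minimal sketch: the category names, the `pick_model` helper, the choice of `gemma2:2b` for light tasks, and the fallback to the general model are all illustrative assumptions, not part of Ollama.

```python
# Task categories follow the four headings above.
# Category names and the fallback choice are illustrative assumptions.
MODEL_FOR_TASK = {
    "general": "llama3.1:8b",
    "code": "qwen2.5-coder:7b",
    "reasoning": "phi4:14b",
    "light": "gemma2:2b",
}
DEFAULT_MODEL = "llama3.1:8b"


def pick_model(task_type: str) -> str:
    """Return the model tag for a task category, falling back to the general model."""
    return MODEL_FOR_TASK.get(task_type.strip().lower(), DEFAULT_MODEL)


print(pick_model("code"))         # a known category
print(pick_model("translation"))  # unknown category, uses the fallback
```

Keeping the mapping in one place means a model swap is a one-line change rather than a search through every caller.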

The recommended emergency library

A typical business deployment for cloud AI failover has these four models installed:

Model	Size on Disk	VRAM Needed	Replaces
`llama3.1:8b`	4.7 GB	8 GB	ChatGPT general use
`qwen2.5-coder:7b`	4.7 GB	8 GB	Claude Code, Copilot
`phi4:14b`	9 GB	10 GB	Data analysis, reasoning
`mistral:7b`	4.1 GB	6 GB	Fast email/document drafting

Total disk footprint: roughly 22 GB. Total VRAM needed if loading all at once: around 32 GB. With OLLAMA_KEEP_ALIVE set to 30 minutes and OLLAMA_MAX_LOADED_MODELS set to 2, the system loads on demand and unloads idle models, which keeps memory pressure manageable on 16 to 24 GB systems.
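The arithmetic behind those totals, and the effect of capping loaded models at two, can be checked with a few lines. This sketch uses only the disk and VRAM figures from the table; the `worst_case_vram` helper is a hypothetical name, and real memory use also depends on context length and quantization.

```python
from itertools import combinations

# Figures (GB) copied from the table above.
DISK_GB = {"llama3.1:8b": 4.7, "qwen2.5-coder:7b": 4.7, "phi4:14b": 9, "mistral:7b": 4.1}
VRAM_GB = {"llama3.1:8b": 8, "qwen2.5-coder:7b": 8, "phi4:14b": 10, "mistral:7b": 6}


def worst_case_vram(vram_by_model: dict, max_loaded: int) -> int:
    """Largest combined VRAM of any `max_loaded` models resident at the same time."""
    return max(sum(combo) for combo in combinations(vram_by_model.values(), max_loaded))


total_disk = round(sum(DISK_GB.values()), 1)
total_vram_all = sum(VRAM_GB.values())
worst_pair = worst_case_vram(VRAM_GB, 2)

print(f"Disk for all four models: {total_disk} GB")
print(f"VRAM with all four loaded: {total_vram_all} GB")
print(f"Worst-case VRAM with two loaded: {worst_pair} GB")
```

By these table figures the heaviest pair needs about 18 GB, which fits a 24 GB card but not a 16 GB one. On smaller cards, how the runtime handles the overflow is worth verifying on the actual host.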

How are system prompts different for local AI models?

Local models follow system prompts, but they need more explicit instruction than frontier cloud models. A prompt that works perfectly on Claude may produce inconsistent results on Llama 3.1 8B. Three patterns matter.

Be direct. Frontier models infer intent. Local models follow instructions literally. Replace "Help the user with their question" with "Read the question. Provide a 3-sentence answer. Do not add disclaimers or apologies."

State the output format. If JSON is needed, say so explicitly and provide an example. If a specific length is needed, give a word count or sentence count. Vague instructions get vague results.

Forbid unwanted behaviors explicitly. Phrases like "Do not add commentary" or "Do not explain your reasoning" prevent the model from padding responses with filler. Local models tend to over-explain unless told not to.
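The three patterns can be combined in one prompt and paired with a check on the reply. The sketch below is an assumption-laden example, not a tested production prompt: the email classification task, the `SYSTEM_PROMPT` wording, the `REQUIRED_KEYS` set, and the `parse_reply` helper are all hypothetical.

```python
import json

# Hypothetical prompt: direct instruction, explicit format with an example,
# and explicit prohibitions.
SYSTEM_PROMPT = (
    "Classify the email. "
    'Reply with JSON only, exactly in this shape: {"category": "billing", "urgent": false}. '
    "Allowed categories: billing, support, sales. "
    "Do not add commentary. Do not explain your reasoning."
)

REQUIRED_KEYS = {"category", "urgent"}


def parse_reply(text: str):
    """Return the parsed dict if the reply is a JSON object with the required keys, else None."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return None
    return data
```

Validating the reply matters because a smaller model may still wrap the JSON in prose now and then; a `None` result is the signal to retry or fall back.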

A working system prompt template for business document drafting:

    # Send via the Ollama API with system field set
    You are a professional business writing assistant.
    Output rules:
    - Write in clear, direct sentences.
    - Use the requested document type and length.
    - Do not add introductions or conclusions unless asked.
    - Do not include phrases like "I hope this helps" or "Let me know if".
    - Reply with the document content only.
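One way to send the template is through the `system` field of the `/api/generate` endpoint. The following is a minimal sketch using only the Python standard library; the `build_request` and `generate` helpers are hypothetical names, and the URL assumes Ollama's default local address.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default local address

SYSTEM_PROMPT = """You are a professional business writing assistant.
Output rules:
- Write in clear, direct sentences.
- Use the requested document type and length.
- Do not add introductions or conclusions unless asked.
- Do not include phrases like "I hope this helps" or "Let me know if".
- Reply with the document content only."""


def build_request(model: str, user_prompt: str) -> dict:
    """Build the JSON body for a non-streaming /api/generate call."""
    return {"model": model, "system": SYSTEM_PROMPT, "prompt": user_prompt, "stream": False}


def generate(model: str, user_prompt: str, timeout: int = 60) -> str:
    """POST the request to the local Ollama API and return the generated text."""
    body = json.dumps(build_request(model, user_prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With Ollama running, a call such as `generate("llama3.1:8b", "Draft a two-paragraph notice about the office move.")` would return the drafted text.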

Why does monitoring matter for a failover system?

A failover that fails is worse than no failover at all. The team that built it believes coverage exists. When the cloud AI outage finally happens and the local fallback turns out to be broken, the response is slower than if they had planned for full degradation from the start.

Monitoring catches three categories of failure:

Service down. The Ollama process crashed, the systemd service stopped, the host was rebooted but Ollama did not restart, or the API port is no longer accessible.

Models missing. Disk pressure caused models to be cleaned up, a model failed to download, or the OLLAMA_MODELS directory became unmounted. The service is technically running but cannot respond to inference requests.

Performance degraded. The GPU driver was updated and is no longer detected, the system fell back to CPU inference, or the model load time has grown unacceptable. The service responds, but slowly enough that the failover is unusable.
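The CPU fallback case can be detected more directly than by timing alone. As a hedged sketch: Ollama's `GET /api/ps` endpoint lists currently loaded models, and each entry is assumed here to carry `size` and `size_vram` byte counts, so a model whose `size_vram` is well below its `size` is at least partly on the CPU. The helper names and the `min_fraction` threshold are hypothetical, and the field names should be confirmed against the API reference for the installed version.

```python
def gpu_fraction(entry: dict) -> float:
    """Fraction of a loaded model's memory that resides in VRAM (0.0 to 1.0)."""
    size = entry.get("size", 0)
    if size <= 0:
        return 0.0
    return entry.get("size_vram", 0) / size


def cpu_fallback_models(ps_response: dict, min_fraction: float = 0.99) -> list:
    """Names of loaded models with less than `min_fraction` of their memory in VRAM."""
    return [
        m["name"]
        for m in ps_response.get("models", [])
        if gpu_fraction(m) < min_fraction
    ]


# Illustrative response shape, not captured from a real host.
sample = {
    "models": [
        {"name": "llama3.1:8b", "size": 6_000_000_000, "size_vram": 6_000_000_000},
        {"name": "phi4:14b", "size": 10_000_000_000, "size_vram": 0},
    ]
}
print(cpu_fallback_models(sample))
```

Because `/api/ps` only reports models that are currently loaded, a check like this is most useful right after a test inference.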

What does a working Ollama monitoring script look like?

The script below performs three checks: API health, model availability, and a test inference. It logs results to a file and exits with a status code that cron, a systemd timer, or an external monitoring tool can act on.

    #!/usr/bin/env python3
    # ollama_health_check.py
    # Runs three checks against a local Ollama installation.
    # Exit 0 if healthy, exit 1 if any check fails.

    import logging
    import sys
    import time

    import requests

    # Configuration
    OLLAMA_URL = "http://localhost:11434"
    REQUIRED_MODELS = ["llama3.1:8b", "qwen2.5-coder:7b"]
    TEST_MODEL = "llama3.1:8b"
    TEST_TIMEOUT_SECONDS = 30
    # The account that runs the script needs write access to this path.
    LOG_FILE = "/var/log/ollama_health.log"

    logging.basicConfig(
        filename=LOG_FILE,
        level=logging.INFO,
        format="%(asctime)s [%(levelname)s] %(message)s"
    )


    def check_api():
        """Verify the Ollama API is responding."""
        try:
            r = requests.get(f"{OLLAMA_URL}/", timeout=5)
            if r.status_code == 200:
                logging.info("API check passed")
                return True
            logging.error(f"API returned status {r.status_code}")
            return False
        except requests.exceptions.RequestException as e:
            logging.error(f"API check failed: {e}")
            return False


    def check_models():
        """Verify all required models are installed."""
        try:
            r = requests.get(f"{OLLAMA_URL}/api/tags", timeout=10)
            r.raise_for_status()  # treat HTTP errors as a failed check
            data = r.json()
            installed = [m["name"] for m in data.get("models", [])]
            missing = [m for m in REQUIRED_MODELS if m not in installed]
            if missing:
                logging.error(f"Missing models: {missing}")
                return False
            logging.info(f"All {len(REQUIRED_MODELS)} required models present")
            return True
        except Exception as e:
            logging.error(f"Model check failed: {e}")
            return False


    def check_inference():
        """Run a test inference and measure response time."""
        payload = {
            "model": TEST_MODEL,
            "prompt": "Reply with the single word: OK",
            "stream": False,
            "options": {"num_predict": 10}
        }
        try:
            start = time.time()
            r = requests.post(
                f"{OLLAMA_URL}/api/generate",
                json=payload,
                timeout=TEST_TIMEOUT_SECONDS
            )
            elapsed = time.time() - start
            if r.status_code != 200:
                logging.error(f"Inference returned status {r.status_code}")
                return False
            # The requests timeout limits how long we wait for data, not the
            # total duration, so keep an explicit wall-clock check as a backstop.
            if elapsed > TEST_TIMEOUT_SECONDS:
                logging.error(f"Inference too slow: {elapsed:.1f}s")
                return False
            logging.info(f"Inference check passed in {elapsed:.1f}s")
            return True
        except Exception as e:
            logging.error(f"Inference check failed: {e}")
            return False


    def main():
        checks = [
            ("API", check_api),
            ("Models", check_models),
            ("Inference", check_inference)
        ]
        # Run every check so one log entry shows the full picture.
        failures = []
        for name, check_fn in checks:
            if not check_fn():
                failures.append(name)
        if failures:
            logging.error(f"Health check FAILED. Failed checks: {failures}")
            sys.exit(1)
        logging.info("Health check PASSED")
        sys.exit(0)


    if __name__ == "__main__":
        main()
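The inference check above measures wall-clock time only. A complementary signal, sketched here under assumptions, is generation throughput: the non-streaming `/api/generate` response is understood to include `eval_count` (tokens generated) and `eval_duration` (nanoseconds spent generating). The helper names and the `MIN_TOKENS_PER_SECOND` floor are hypothetical and would need tuning per host.

```python
MIN_TOKENS_PER_SECOND = 15.0  # hypothetical floor; tune for the actual hardware


def tokens_per_second(result: dict) -> float:
    """Generation speed computed from an /api/generate response body."""
    duration_ns = result.get("eval_duration", 0)
    if duration_ns <= 0:
        return 0.0
    return result.get("eval_count", 0) * 1e9 / duration_ns


def is_degraded(result: dict, floor: float = MIN_TOKENS_PER_SECOND) -> bool:
    """True when throughput falls below the configured floor."""
    return tokens_per_second(result) < floor


# Illustrative values: 100 tokens in 2 seconds, then 100 tokens in 20 seconds.
fast = {"eval_count": 100, "eval_duration": 2_000_000_000}
slow = {"eval_count": 100, "eval_duration": 20_000_000_000}
print(tokens_per_second(fast), is_degraded(fast))
print(tokens_per_second(slow), is_degraded(slow))
```

A test prompt that yields only a handful of tokens gives a noisy figure, so a throughput check is more reliable with a slightly longer generation than the one-word reply used in the script.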

Scheduling the script

On Linux, schedule the script with cron to run every 5 minutes:

    # Edit the user crontab
    crontab -e

    # Add this line
    */5 * * * * /usr/bin/python3 /opt/scripts/ollama_health_check.py

On macOS, use launchd. On Windows, use Task Scheduler. The exact configuration varies by platform, but the pattern is the same: run every 5 minutes, log to a file, and exit non-zero on failure so external monitoring can detect the problem.

The monitoring script does not replace systemd's Restart=always. It catches the failures that automatic restart does not solve, such as missing models or GPU regression. Both layers matter.

How does the team get alerted when monitoring fails?

A log file no one reads is not monitoring. The script above writes to a file by design, because that file becomes the input for whatever alerting system the organization already uses. Three common patterns work.

Existing infrastructure monitoring. If the organization uses Datadog, New Relic, Prometheus, or any other monitoring stack, point it at the log file or the script exit code. The integration is usually a small amount of configuration.
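When the log file is the integration point, a useful derived signal is the number of consecutive failed runs, which avoids paging on a single transient blip. The sketch below assumes the log format produced by the monitoring script (lines containing "Health check PASSED" or "Health check FAILED"); the `consecutive_failures` helper and the sample lines are hypothetical.

```python
def consecutive_failures(log_lines) -> int:
    """Count failed runs at the end of the log, stopping at the most recent pass."""
    count = 0
    for line in reversed(list(log_lines)):
        if "Health check PASSED" in line:
            break
        if "Health check FAILED" in line:
            count += 1
    return count


# Illustrative log excerpt in the script's format.
sample_log = [
    "2026-06-01 12:00:01,120 [INFO] Health check PASSED",
    "2026-06-01 12:05:01,004 [ERROR] API check failed: connection refused",
    "2026-06-01 12:05:01,050 [ERROR] Health check FAILED. Failed checks: ['API']",
    "2026-06-01 12:10:01,007 [ERROR] API check failed: connection refused",
    "2026-06-01 12:10:01,049 [ERROR] Health check FAILED. Failed checks: ['API']",
]
print(consecutive_failures(sample_log))
```

A monitoring agent could alert once this count reaches a chosen threshold, for example two or three runs in a row.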

Slack or Teams webhooks. A 10-line addition to the script can post to Slack or Teams when checks fail. The webhook URL goes in a config file, the script reads it, and the team gets alerts in the channel they already watch.
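A webhook addition along those lines might look like the following. This is a sketch, not a drop-in integration: it assumes a Slack incoming webhook that accepts a JSON body with a `text` field, the `build_alert` and `post_alert` helpers are hypothetical names, and Teams may require a different payload shape depending on how the webhook is configured.

```python
import json
import urllib.request


def build_alert(failures: list, host: str) -> dict:
    """Build the JSON body for a chat webhook from the list of failed check names."""
    return {
        "text": f"Ollama health check FAILED on {host}. Failed checks: {', '.join(failures)}"
    }


def post_alert(webhook_url: str, payload: dict, timeout: int = 10) -> int:
    """POST the alert to the webhook and return the HTTP status code."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

In the monitoring script, the natural place to call `post_alert` is just before the non-zero exit, with the webhook URL read from a config file or environment variable rather than stored in the source.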

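Email alerts via cron. Cron emails whatever a job writes to stdout or stderr to the address in the crontab's MAILTO setting; it does not act on the exit code by itself. Because the script logs to a file and prints nothing, append `|| echo "Ollama health check failed"` to the crontab entry so that a non-zero exit produces output, and therefore an email. This is the simplest option and needs no extra Python, but the host must have a working local mail transfer agent.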

Does PCG help operationalize local AI deployments?

Yes. Phoenix Consultants Group has been building operational software since 1995, and the discipline that applies to monitoring legacy databases or production web services applies directly to local AI infrastructure. A custom engagement includes model selection tailored to the client's actual workflows, monitoring integration with the existing alerting stack, runbook documentation, and team training on the operational procedures.

The FireFlight Data System uses the same monitoring philosophy: continuous health checks, automatic recovery, and external alerting that catches the failures recovery does not solve. The Ollama deployment follows the same playbook.


Frequently Asked Questions

Which local AI model is best for business use?

There is no single best model. Different tasks suit different models. Llama 3.1 8B is the most versatile general-purpose model for business workflows like email drafting and document analysis. Qwen2.5-Coder (7B for daily work, 32B for complex changes) stands in for Claude Code on development teams. Phi-4 14B excels at structured reasoning and data analysis. The right approach is having two or three models installed and routing tasks to the appropriate one.

How do I monitor whether Ollama is working correctly?

A monitoring script that runs every 5 minutes checks three things: the Ollama API is responding on port 11434, the expected models are installed, and a test inference completes within an acceptable time. The script logs each result and exits non-zero on failure so that cron or an external monitoring tool can raise the alert. Without monitoring, the team discovers Ollama is down only when they actually need it during a cloud AI outage.

Can local AI replace Claude or ChatGPT for daily business work?

For most daily tasks, yes. Llama 3.3 70B matches GPT-4 on most general benchmarks. Qwen2.5-Coder 32B scores 92.7 percent on HumanEval, comparable to frontier coding models. The 5 to 10 percent of tasks where cloud AI clearly wins involve complex multi-step reasoning, very long contexts, or niche domains. For everything else, local models are sufficient with appropriate hardware.

How do I write a system prompt for local AI models?

Local models follow system prompts but require more explicit instruction than frontier cloud models. Keep prompts direct and specific. State the role, the output format, the length constraint, and any forbidden behaviors. Avoid clever phrasing. A system prompt that works perfectly on Claude may need to be rewritten in simpler language for a local 8B model to follow consistently.

What happens if Ollama crashes during a cloud AI outage?

Without monitoring, the team only discovers the crash when they try to use Ollama as a backup and it fails. With monitoring in place, the automatic restart on systemd or the Windows service manager recovers Ollama within seconds, and the monitoring script logs the event for later review. The combination of automatic restart and external monitoring prevents the worst-case scenario of a failover that has itself failed.

Should I run the monitoring script on the same machine as Ollama?

Running on the same machine is acceptable for small deployments and catches most failure modes. For business-critical setups, running monitoring on a separate machine catches additional failure modes such as the entire Ollama host being unreachable. A separate machine also avoids the situation where the monitoring script crashes alongside Ollama.

About the Author

Allison Woolbert

CEO and Senior Systems Architect, Phoenix Consultants Group

Allison Woolbert is the principal of Phoenix Consultants Group, the custom software consultancy founded in 1995. PCG has run legacy migration projects across Microsoft Access, Visual FoxPro, Paradox, VB6, and other discontinued platforms for industrial, manufacturing, and environmental services clients since the late 1990s.

Allison leads PCG's discovery and architecture practice, where the first deliverable on every legacy engagement is an honest inventory of what the existing application actually does and what it should do next.


Sources

¹ Llama 3.1 model documentation and benchmarks, Ollama library: ollama.com/library/llama3.1

² Llama 3.3 70B model card and benchmark comparisons: ollama.com/library/llama3.3

³ Qwen2.5-Coder benchmarks, HumanEval 92.7 percent: ollama.com/library/qwen2.5-coder

⁴ Microsoft Phi-4 model card and MATH benchmark: ollama.com/library/phi4

⁵ Ollama API reference for /api/tags and /api/generate: github.com/ollama/ollama/blob/main/docs/api.md

This article is informational and reflects industry observations as of June 2026. It is not legal, compliance, or financial advice for any specific situation. Phoenix Consultants Group, founded 1995, provides custom software development and AI infrastructure consulting. For guidance tailored to your organization's specific requirements, contact PCG directly.
