Getting started with Ollama: a practical guide

Run AI models on your own hardware in under ten minutes, with no cloud, no API keys, and no monthly fees.

Apr 30, 2026 · 10 min read

If you're tired of paying API fees, hitting message caps, sending your conversations to someone else's servers, or watching platforms change their content policies on you, running AI models locally is the answer most people don't realize is available. The barrier to entry used to be high: tens of gigabytes of dependencies, GPU configuration, and command-line gymnastics. In 2026, it's a five-minute install. Ollama made local AI genuinely accessible, and the experience is closer to "install an app and start typing" than to "set up a development environment."

This post walks through getting Ollama running on your machine, choosing your first model, understanding what your hardware can handle, and taking the next steps once you're past the basics.

What Ollama actually does

Ollama is the runtime that loads AI models on your computer and lets you chat with them. Think of it like Docker for language models: you tell Ollama which model you want, it downloads and configures it, and you start using it. The complexity that used to live in Python environments, CUDA drivers, and quantization scripts is now hidden inside a single tool that works the same way on Windows, macOS, and Linux.

The project has reached over 95,000 GitHub stars and 112 million model pulls for Llama 3.1 alone, making it the most widely-adopted local LLM runtime. The reason is simplicity. The same workflow that works for someone running a 3B model on a laptop also works for someone running a 70B model on a workstation. The interface doesn't change.

What you get from Ollama specifically: zero API costs, complete data privacy (your conversations never leave your machine), offline capability once models are downloaded, and the freedom to run any open-source model that's been ported to the format. For users who care about privacy in NSFW AI contexts, this is the strongest available privacy posture by a wide margin.

Installing Ollama

The installation depends on your platform but is one command or one installer in every case.

On macOS, install via Homebrew if you have it: brew install ollama. Or download the installer from ollama.com. Apple Silicon Macs (M1, M2, M3, M4) automatically use Metal GPU acceleration with no configuration needed.

On Linux, run the installation script: curl -fsSL https://ollama.com/install.sh | sh. The script handles everything including systemd service setup, and Ollama starts automatically.

On Windows, download the installer from ollama.com/download and run it. As of 2026, Windows ARM64 devices get a native build that's noticeably faster than the previous emulated version. NVIDIA GPUs are detected automatically; AMD GPU support via ROCm is improving but still has rough edges.

After installation, open a terminal and run ollama --version. If you see something like ollama version 0.13.x, you're set.

Pulling your first model

The model you start with matters less than people think. Pick something small enough to run smoothly on your hardware, and start chatting. You can always switch later.

For most users, the right starting point in 2026 is Llama 3.2 at the 7B parameter size. Pull it with:

ollama pull llama3.2:7b

The download is around 4.7GB. Subsequent runs start instantly because the model is cached locally.

To start chatting:

ollama run llama3.2:7b

You'll see a >>> prompt. Type a message, press enter, and the model responds. Type /bye to exit.

That's the basic loop. From here, every other use of Ollama is a variation of this pattern: pull a model, run it, chat with it, switch to another model when you want something different.

Understanding what your hardware can handle

The single biggest factor determining your local AI experience is RAM (or VRAM if you have a discrete GPU). Different model sizes have different memory requirements, and trying to run a model that doesn't fit produces either swapping (extremely slow) or outright failure.

Rough sizing rules for the most common quantization (Q4_K_M, which is what Ollama uses by default):

A 3B parameter model needs about 2-3GB of memory. Runs comfortably on any modern laptop. Models like Phi-3 or smaller Llama variants fit here.

A 7B parameter model needs about 4-6GB. Runs on most consumer laptops with 8GB of RAM or more. This is the sweet spot for general use: capable enough to be useful, light enough to run anywhere. Llama 3.2 7B, Mistral 7B, and Qwen 2.5 7B all live here.

A 13B parameter model needs about 8-10GB. Requires a laptop with 16GB of RAM, or any desktop with a mid-tier GPU. Larger context windows and better reasoning than 7B models, but the difference isn't always worth the slower inference speed.

A 34B parameter model needs about 20-24GB. Requires a high-end GPU (RTX 4090 with 24GB VRAM) or a Mac with 32GB+ unified memory.

A 70B parameter model needs about 40-48GB. The territory of dedicated workstations or top-tier Apple Silicon Macs (M2/M3/M4 with 64GB+ unified memory).

If you're not sure what your hardware can handle, start with a 7B model. If it runs smoothly, try 13B. If 13B runs smoothly, try 34B. Work upward until you find the size that gives you good response speed without exhausting memory.

Choosing a model for your use case

Different models excel at different things. The best general-purpose model in 2026 isn't necessarily the best model for your specific use.

For general chat and assistance, Llama 3.2 is the recommended starting point. Balanced capability across coding, writing, reasoning, and conversation.

For coding, DeepSeek Coder V2 outperforms many larger general-purpose models on coding benchmarks. CodeLlama is also widely used. If your primary use is software development, these specialized models are worth pulling alongside a general one.

For roleplay and creative writing, the community-developed roleplay models are noticeably better than general models. Nous Hermes 3, MythoMax L2 13B, and the Eva Qwen variants are widely used. These models are tuned for character consistency, longer outputs, and richer prose.

For uncensored or unfiltered content, you'll need specifically de-aligned models. Dolphin 3.0 (based on Llama 3) is a popular general-purpose uncensored option. Abliterated variants of major models remove the refusal training while preserving capability.

For multilingual work, Qwen 2.5 has strong performance on non-English languages, particularly Chinese, Japanese, and Korean. Mistral 7B is solid for European languages.

For very long contexts, Llama 4 Scout supports up to 10 million tokens in its abliterated form, far beyond anything in the consumer cloud space.

The Ollama model library at ollama.com/library lists everything available with hardware requirements and use case notes. Browse it, find models that match your needs, pull them, and try them.

The Ollama API

Beyond the command-line chat, Ollama exposes a REST API on port 11434 that lets you integrate local models into other applications. This is where Ollama becomes more than a chatbot.

The basic API call:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:7b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

Response is JSON with the generated text. The same API is OpenAI-compatible enough that many tools designed for OpenAI's API work with Ollama by changing the endpoint URL.

This compatibility unlocks a lot. SillyTavern, the most flexible AI roleplay frontend, connects to Ollama as one of its supported backends. Various IDE coding assistants (Continue, Cody) work with Ollama. Custom Python or JavaScript scripts can use the langchain-ollama integration. The pattern is: anything that talks to OpenAI's API can usually be reconfigured to talk to Ollama.

For integration use cases specifically, look at the Ollama documentation at docs.ollama.com, which covers Python, JavaScript, Docker deployment, and the REST API in detail.

Common things to do with a local Ollama setup

Once Ollama is running, there are several common patterns that get the most value out of the setup:

Run a chat frontend. Open WebUI is the most popular browser-based chat interface for Ollama, providing a ChatGPT-like UI for local models. Enchanted (macOS), Chatbox (cross-platform), and several other options exist. The CLI works fine, but a chat UI feels more natural for extended use.

Connect SillyTavern for roleplay. If you want serious AI companion functionality with character cards, lorebooks, and persistent memory, SillyTavern with an Ollama backend is the most flexible option available. The setup takes 30-60 minutes; the resulting capability is unmatched.

Use it as a coding assistant. Connect Continue or Cody to Ollama and use a coding-focused model (DeepSeek Coder, CodeLlama) as your IDE's AI. Free, private, no rate limits.

Build custom applications. For anyone doing development work, Ollama makes prototyping AI features dramatically faster. No API costs while you iterate. No rate limits during testing. Privacy guaranteed because nothing leaves your machine.

Run a private RAG system. Combine Ollama with a vector database (Chroma, Qdrant) to build retrieval-augmented generation systems against your own documents. Useful for searchable personal knowledge bases, document analysis, or any use case that benefits from grounding responses in specific sources.

Replace specific cloud features. If you primarily use AI for one thing (writing assistance, coding help, language translation), running a local model focused on that thing often produces results comparable to cloud services without the cost or privacy tradeoffs.

Performance optimization

Once you're past the basic setup, a few tweaks improve performance:

Use the right quantization. Ollama defaults to Q4_K_M which balances quality and size well. If you want better quality and have memory to spare, Q5 or Q6 quantizations preserve more of the original model's quality. If you need smaller, Q3 versions exist but quality drops noticeably.

Adjust context length. Larger context windows use more memory. If you don't need long contexts, reducing context length lets you run a larger model or run faster.

Configure parallel inference. The OLLAMA_NUM_PARALLEL environment variable controls how many requests can run simultaneously. Higher values use more memory but improve throughput when you have multiple users or applications hitting the same Ollama instance.

Enable GPU offload fully. Check the Ollama logs for n_gpu_layers to see how many layers are running on GPU vs CPU. If you're partial-offloaded when you have enough VRAM for full GPU, force the issue by adjusting the model's modelfile or the OLLAMA_GPU_OVERHEAD environment variable.

Use prompt caching. Repeated queries that share context can reuse processed state. The Ollama API supports this through specific flags; check the docs for the latest syntax.

When local isn't the right choice

For all the upsides of local AI, there are situations where cloud APIs are still the better choice:

When you need frontier-level reasoning. The largest cloud models (Claude Opus, GPT-5.4) are still meaningfully better at complex tasks than anything you can run locally on consumer hardware. If your work requires the absolute best reasoning available, local isn't the answer.

When you need scale you can't support. Running a model for 100 simultaneous users requires expensive hardware and infrastructure. Cloud APIs scale elastically; local doesn't.

When you don't have the hardware. A 4-year-old laptop with 8GB of RAM can run small models but produces a frustrating experience. If your hardware is limited, the cloud might still be more practical until you upgrade.

When the use case is occasional. If you use AI rarely, the time investment of setting up local doesn't pay off compared to a few cents of API calls.

For most regular users, though, especially anyone doing more than light occasional AI use, local Ollama eventually becomes the better experience.

Next steps

Once you have Ollama running:

Try several models. Pull 3-4 different ones from different families (Llama, Mistral, Qwen, Gemma) and see which feels best for your use cases. The "best model" is highly subjective.

Connect a chat UI. Open WebUI or Enchanted dramatically improves the experience over the command line.

If you want roleplay or AI companion functionality, set up SillyTavern with Ollama as the backend.

Explore the model library deeper. There are thousands of models, and finding ones tuned for your specific needs is part of the fun.

Read about hardware requirements if you're considering an upgrade. The right hardware unlocks dramatically better local AI experiences.

The local AI ecosystem is moving fast. The tools available in 2026 are dramatically better than what existed in 2023. The trajectory is toward more capable models running on smaller hardware, with better tooling around them. Getting started now means you're learning a workflow that will keep getting more powerful.

Frequently asked

Do I need a GPU for Ollama?

No, but it helps a lot. CPU inference works for small models but is noticeably slow. Any modern GPU (NVIDIA, Apple Silicon, recent AMD) makes inference 5-20x faster. For 7B models on a recent laptop, CPU might give you 5-10 tokens per second; the same model with GPU acceleration runs 30-60 tokens per second.

How much disk space do I need?

Each model is its own download. A 7B model is about 4-5GB. A 13B model is about 8GB. A 70B model is around 40GB. Plan for several models if you want variety.

Can I run multiple models at once?

Yes, but they share your memory. Loading two 13B models simultaneously needs roughly the memory required for both. Ollama handles model swapping automatically when memory is constrained.

What about Mac vs Windows vs Linux?

All three work well. Mac with Apple Silicon has the best out-of-box experience because Metal acceleration is automatic. Linux is most flexible. Windows works but historically had rough edges around GPU acceleration that have largely been fixed by 2026.

Is Ollama really free?

Yes. Open source, no subscription, no telemetry by default. The only costs are your own hardware and electricity.

Can Ollama run on a phone?

Officially no, but Private LLM and similar apps run smaller models on iOS and Android with surprisingly good results. For full Ollama, you need a laptop or desktop.

How do I uninstall Ollama?

The installer typically includes an uninstaller. Manual removal involves deleting the binary, the model cache (usually in ~/.ollama), and any system service entries. The Ollama documentation covers platform-specific uninstall instructions.

Keep reading

INSIGHT

We Tested Every Free AI Tier So You Don't Have To

INSIGHT

Free AI Changelog — June: What Changed, What Tightened, What's New

GUIDE

Which Free AI Is Right for You? A Simple Decision Guide

GUIDE

Free Tier Report Card: Grading Every AI Companion's Free Experience