Ollama LLM CLI

Ollama cheatsheet — run LLMs locally. ollama pull llama3, ollama run mistral, ollama list, ollama serve API. Run Llama, Mistral, CodeLlama on your own machine.

What it is

Ollama is a command-line tool for running Large Language Models (LLMs) locally on your machine, making it easy to experiment with and integrate AI into your workflows.

Installation

Linux

curl -fsSL https://ollama.com/install.sh | sh

macOS

Download the .dmg file from the Ollama website and install it.

Windows

Download the .exe installer from the Ollama website and run it.

Core Concepts

Model: A specific version of a Large Language Model (e.g., llama2, mistral). Models are downloaded and managed by Ollama.
Pull: Downloading a model from the Ollama library to your local machine.
Run: Starting an interactive session with a downloaded model.
Serve: Running the Ollama server, which exposes an HTTP API for programmatic access to your downloaded models.

Commands / Usage

Managing Models

Pull a model:
```
ollama pull llama3
```
Downloads the llama3 model from the Ollama library.
List downloaded models:
```
ollama list
```
Shows all models currently downloaded on your system.
Show model information:
```
ollama show llama2
```
Displays details about the llama2 model, such as its architecture, parameter count, context length, and quantization.
Remove a model:
```
ollama rm llama2
```
Deletes the llama2 model from your local machine to free up disk space.
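In scripts, a common need is to pull a model only when it is missing. The following is a minimal sketch, assuming `ollama list` prints a header row followed by one row per model with the full name (including tag) in the first column. The helper name `has_model` is hypothetical.

```shell
#!/bin/sh
# has_model NAME: reads `ollama list`-style output on stdin and succeeds
# if NAME appears in the first column (the header row is skipped).
# Assumption: the model name, including its tag, is the first field.
has_model() {
  awk -v want="$1" 'NR > 1 && $1 == want { found = 1 } END { exit !found }'
}

# Only touch Ollama when the CLI is actually installed.
if command -v ollama >/dev/null 2>&1; then
  if ollama list | has_model "llama3:latest"; then
    echo "llama3:latest is already downloaded"
  else
    ollama pull llama3
  fi
fi
```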

Running Models

Start an interactive chat session:
```
ollama run llama3
```
Opens an interactive command-line interface to chat with the llama3 model. Type your messages and press Enter. Type /bye to exit.
Run a model with a prompt (non-interactive):
```
ollama run llama3 "What is the capital of France?"
```
Sends the prompt "What is the capital of France?" to the llama3 model and prints the response. The session exits after the response.
Run a model with specific parameters:
```
/set parameter temperature 0.8
/set parameter top_k 40
/set parameter num_predict 100
```
Typed inside an interactive session (started with ollama run llama3), these set a temperature of 0.8, top-k sampling of 40, and a maximum of 100 tokens. ollama run has no sampling flags such as --temperature; parameters are set in the session, in a Modelfile, or through the API.
- temperature: Controls randomness. Higher values (e.g., 0.8) make output more creative, lower values (e.g., 0.2) make it more focused and deterministic.
- top_k: Limits the sampling pool to the k most likely next tokens.
- num_predict: The maximum number of tokens to predict in the response.
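Sampling settings can also be passed per request through the `options` object of the HTTP API, which is often more convenient in scripts. The following is a minimal sketch, assuming a server on the default port. The helper name `generate_payload` is hypothetical, and it does no JSON escaping, so the prompt must not contain quotes or backslashes.

```shell
#!/bin/sh
# generate_payload MODEL PROMPT: prints a /api/generate request body that
# carries sampling options. Hypothetical helper; PROMPT is not JSON-escaped.
generate_payload() {
  printf '{"model": "%s", "prompt": "%s", "stream": false, ' "$1" "$2"
  printf '"options": {"temperature": 0.8, "top_k": 40, "num_predict": 100}}\n'
}

payload=$(generate_payload llama3 "Write a haiku about autumn.")
printf '%s\n' "$payload"

# Send the request only if a server answers on the default port.
if curl -s --max-time 2 http://localhost:11434/ >/dev/null 2>&1; then
  curl -s http://localhost:11434/api/generate -d "$payload"
fi
```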
Keep a model loaded after the session:
```
ollama run llama3 --keepalive 5m
```
Starts an interactive session with llama3 and keeps the model loaded in memory for 5 minutes after the session ends, so the next session starts faster. This does not restore the previous conversation.
Use a specific model version:
```
ollama run llama3:8b
```
Runs the llama3 variant tagged 8b (the 8-billion-parameter version).
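A model reference has the form name:tag, and a reference without a tag is treated as name:latest. The following sketch illustrates that convention in plain shell. The helper name `model_tag` is hypothetical and is not part of Ollama.

```shell
#!/bin/sh
# model_tag NAME: prints the tag part of a model reference, or "latest"
# when the reference carries no tag.
model_tag() {
  case "$1" in
    *:*) printf '%s\n' "${1##*:}" ;;
    *)   printf 'latest\n' ;;
  esac
}

model_tag llama3       # prints: latest
model_tag llama3:70b   # prints: 70b
```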

Serving Models (API)

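Start the API server:
```
ollama serve
```
Starts the Ollama HTTP server on http://localhost:11434. The server is not tied to one model: each API request names the model to use, and the model is loaded on demand. The desktop apps and the Linux install script typically start this server in the background already, so run it manually only when it is not already running.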
Stop a running server: Press Ctrl+C in the terminal where ollama serve is running.
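Before sending requests, it can help to check that the server is reachable. The following is a minimal sketch, assuming the default address. The helper name `server_up` is hypothetical, while `/api/tags` is the API endpoint that lists local models.

```shell
#!/bin/sh
# server_up [BASE_URL]: succeeds if an HTTP server answers at BASE_URL.
# Defaults to Ollama's standard local address.
server_up() {
  curl -s -f --max-time 2 "${1:-http://localhost:11434}/" >/dev/null 2>&1
}

if server_up; then
  echo "Ollama server is reachable"
  # /api/tags returns the locally available models as JSON.
  curl -s http://localhost:11434/api/tags
else
  echo "No server on localhost:11434; start one with: ollama serve"
fi
```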

Generating Responses (API - with `curl`)

Send a prompt to a running server:
```
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```
Sends a request to the Ollama API to generate a response for the prompt "Why is the sky blue?" using the llama3 model. stream: false means you get the full response at once.
Stream responses from the API:
```
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Explain the concept of recursion in programming.",
  "stream": true
}'
```
Sends a request to stream the response token by token. This is useful for interactive applications where you want to display output as it’s generated.
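A streamed reply arrives as newline-delimited JSON: each line is an object carrying a `response` fragment, and the last one has `"done": true`. The following is a minimal sketch of reassembling the text. It assumes `jq` is installed, and the helper name `join_stream` is hypothetical.

```shell
#!/bin/sh
# join_stream: reads newline-delimited JSON chunks on stdin and prints the
# concatenated "response" fragments, followed by a newline. Requires jq.
join_stream() {
  jq -rj '.response // empty'
  echo
}

# Run the live request only when jq exists and a server answers.
if command -v jq >/dev/null 2>&1 &&
   curl -s --max-time 2 http://localhost:11434/ >/dev/null 2>&1; then
  curl -sN http://localhost:11434/api/generate \
    -d '{"model": "llama3", "prompt": "Why is the sky blue?", "stream": true}' |
    join_stream
fi
```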

Create a chat completion:
```
curl http://localhost:11434/api/chat -d '{
  "model": "llama3",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What are the main benefits of using Python?"}
  ],
  "stream": false
}'
```
Interacts with the model in a chat-like manner, allowing for system messages and a history of user/assistant turns.
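The chat endpoint keeps no conversation state between requests, so a follow-up question has to resend the earlier turns in `messages`. The following is a minimal sketch, assuming a server on the default port. The helper name `chat_payload` is hypothetical, and the assistant message is a placeholder for a reply received earlier.

```shell
#!/bin/sh
# chat_payload MODEL MESSAGES_JSON: wraps a JSON array of messages in a
# /api/chat request body. Hypothetical helper; MESSAGES_JSON is used as-is.
chat_payload() {
  printf '{"model": "%s", "messages": %s, "stream": false}\n' "$1" "$2"
}

# Earlier turns are resent on every request. The assistant text below is a
# placeholder standing in for the reply to the first question.
history='[
  {"role": "user", "content": "What are the main benefits of using Python?"},
  {"role": "assistant", "content": "(previous reply goes here)"},
  {"role": "user", "content": "Which of those matter most for beginners?"}
]'

payload=$(chat_payload llama3 "$history")
printf '%s\n' "$payload"

# Send the request only if a server answers on the default port.
if curl -s --max-time 2 http://localhost:11434/ >/dev/null 2>&1; then
  curl -s http://localhost:11434/api/chat -d "$payload"
fi
```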

Other Commands

Show Ollama version:
```
ollama --version
```
Displays the installed Ollama version.

Common Patterns

Get a quick answer without starting an interactive session:

```
ollama run llama3 "What is the chemical formula for water?"
```

Use Ollama output in a script (e.g., for summarization):

```
TEXT="This is a long piece of text that needs summarization. It covers various topics and goes into great detail about each one. The goal is to condense it into a few key points without losing the essential information."
ollama run llama3 "Summarize the following text: $TEXT"
```

(Note: ollama run takes the prompt as a single positional argument and has no --prompt flag, so the text is embedded in the prompt here through a shell variable. For long or complex input, reading from a file or using the API is often more robust.)
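For longer input, the text can be read from a file and embedded in the prompt. The following is a minimal sketch. The helper name `summarize_prompt` and the file name `notes.txt` are hypothetical.

```shell
#!/bin/sh
# summarize_prompt FILE: prints a prompt that embeds the contents of FILE.
summarize_prompt() {
  printf 'Summarize the following text in a few key points:\n\n%s\n' "$(cat "$1")"
}

# notes.txt is a hypothetical input file; nothing runs unless it exists
# and the ollama CLI is installed.
if command -v ollama >/dev/null 2>&1 && [ -r notes.txt ]; then
  ollama run llama3 "$(summarize_prompt notes.txt)"
fi
```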

Integrate with other command-line tools (example: redirecting output to a file):

```
# Generate a code snippet and save the response to a file.
# The response may include explanatory text and Markdown fences around the code.
ollama run codellama "Write a Python function to calculate factorial." > factorial.py
```

Run a model with custom system prompt via API:

```
curl http://localhost:11434/api/chat -d '{
  "model": "llama3",
  "messages": [
    {"role": "system", "content": "You are a pirate who speaks in a very specific dialect."},
    {"role": "user", "content": "Tell me about the weather today."}
  ],
  "stream": false
}'
```

Gotchas

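Resource Usage: LLMs are resource-intensive. Running large models locally can consume significant RAM, CPU, and (when a GPU is used) GPU memory. As a rough guide, the Ollama README suggests at least 8 GB of RAM for 7B models, 16 GB for 13B models, and 32 GB for 33B models. If Ollama feels slow or unresponsive, check your system’s resource utilization.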
Model Download Size: Models can be several gigabytes in size. Ensure you have sufficient disk space before pulling.
API Port: By default, Ollama serves its API on http://localhost:11434. If that port is already in use, the server fails to start. Set the OLLAMA_HOST environment variable (e.g., OLLAMA_HOST=127.0.0.1:11435) to bind to a different address or port.
Model Naming and Tags: Models can have different versions or sizes specified by tags (e.g., llama3:8b, llama3:70b). Ensure you are pulling and running the intended version. If you just use llama3, Ollama resolves it to the latest tag, which points to the library’s default variant and is not necessarily the largest one.
No Default Model: Ollama doesn’t come with a model pre-installed. ollama run downloads a missing model automatically on first use, which can make the first run slow. Use ollama pull beforehand to download it ahead of time.
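Interactive Session Exit: In interactive mode (ollama run <model>), type /bye or press Ctrl+D to exit. Exiting does not unload the model immediately: the server keeps it in memory for a keep-alive period (5 minutes by default), so it continues to use RAM for a while. ollama ps shows the loaded models, and in recent versions ollama stop <model> unloads one right away.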
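For the API port gotcha above, the server address is controlled by the OLLAMA_HOST environment variable, which the ollama CLI also reads when it acts as a client. The following is a minimal sketch of deriving the API base URL in a script, assuming OLLAMA_HOST holds a plain host:port value. The helper name `api_base` is hypothetical.

```shell
#!/bin/sh
# api_base: prints the API base URL, honoring OLLAMA_HOST when it is set.
# Assumption: OLLAMA_HOST is a plain host:port such as 127.0.0.1:11435.
api_base() {
  printf 'http://%s\n' "${OLLAMA_HOST:-127.0.0.1:11434}"
}

# Start the server on another port (in one terminal):
#   OLLAMA_HOST=127.0.0.1:11435 ollama serve
# Point the CLI at it (in another terminal):
#   OLLAMA_HOST=127.0.0.1:11435 ollama list
echo "API base URL: $(api_base)"
```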