What it is
Ollama is a command-line tool for running Large Language Models (LLMs) locally on your machine, making it easy to experiment with and integrate AI into your workflows.
Installation
Linux
curl -fsSL https://ollama.com/install.sh | sh
macOS
Download the .dmg file from the Ollama website and install it.
Windows
Download the .exe installer from the Ollama website and run it.
Core Concepts
- Model: A specific version of a Large Language Model (e.g., llama2, mistral). Models are downloaded and managed by Ollama.
- Pull: Downloading a model from the Ollama library to your local machine.
- Run: Starting an interactive session with a downloaded model.
- Serve: Running the Ollama server, which exposes your local models through an HTTP API for programmatic access.
Commands / Usage
Managing Models
- Pull a model:
  ollama pull llama3
  Downloads the llama3 model from the Ollama library.
- List downloaded models:
  ollama list
  Shows all models currently downloaded on your system.
- Show model information:
  ollama show llama2
  Displays details about the llama2 model, including its parameters and layers.
- Remove a model:
  ollama rm llama2
  Deletes the llama2 model from your local machine to free up disk space.
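For scripts, the inventory that ollama list prints can also be read over HTTP. The sketch below assumes a server is running on the default address and that GET /api/tags returns a JSON object with a models array whose entries carry name and size (bytes) fields. The helper names summarize_models and fetch_models are hypothetical, not part of Ollama.

```python
import json
import urllib.request

# Assumption: default Ollama address; adjust if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434"


def summarize_models(tags_response: dict) -> list[tuple[str, float]]:
    """Return (name, size in GB) pairs, largest first, from an /api/tags reply."""
    models = tags_response.get("models", [])
    pairs = [(m["name"], round(m.get("size", 0) / 1e9, 2)) for m in models]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


def fetch_models(base_url: str = OLLAMA_URL) -> list[tuple[str, float]]:
    """Query a running server; raises URLError if nothing is listening."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return summarize_models(json.load(resp))
```

Calling fetch_models() against a live server returns the largest models first, which helps when deciding what to remove with ollama rm.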
Running Models
- Start an interactive chat session:
  ollama run llama3
  Opens an interactive command-line interface to chat with the llama3 model. Type your messages and press Enter. Type /bye to exit.
- Run a model with a prompt (non-interactive):
  ollama run llama3 "What is the capital of France?"
  Sends the prompt "What is the capital of France?" to the llama3 model and prints the response. The session exits after the response.
- Run a model with specific parameters:
  ollama run does not accept sampling flags such as --temperature, --top-k, or -n (run ollama run --help to see the flags your version supports). Set parameters from inside an interactive session instead:
  /set parameter temperature 0.8
  /set parameter top_k 40
  /set parameter num_predict 100
  This gives llama3 a temperature of 0.8, top-k sampling of 40, and a maximum of 100 tokens.
  - temperature: Controls randomness. Higher values (e.g., 0.8) make output more creative, lower values (e.g., 0.2) make it more focused and deterministic.
  - top_k: Limits the sampling pool to the k most likely next tokens.
  - num_predict: The maximum number of tokens to predict in the response.
- Keep the model loaded between sessions:
  ollama run llama3 --keepalive 5m
  Starts an interactive session with llama3 and keeps the model loaded for 5 minutes after the session ends, so the next session starts faster. This does not restore the earlier conversation; each ollama run starts with an empty history.
- Use a specific model version:
  ollama run llama3:8b
  Runs the llama3 model with the tag 8b (e.g., the 8-billion parameter version).
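Sampling settings can also be supplied per request through the HTTP API. A minimal sketch, assuming the options object of POST /api/generate accepts temperature, top_k, and num_predict. The helper build_generate_payload is hypothetical and only builds the request body; it does not contact a server.

```python
import json


def build_generate_payload(model, prompt, temperature=None, top_k=None,
                           num_predict=None, stream=False):
    """Build the JSON body for POST /api/generate, omitting unset options."""
    options = {
        "temperature": temperature,
        "top_k": top_k,
        "num_predict": num_predict,
    }
    # Drop options the caller left unset so the model's defaults apply.
    options = {key: value for key, value in options.items() if value is not None}
    payload = {"model": model, "prompt": prompt, "stream": stream}
    if options:
        payload["options"] = options
    return json.dumps(payload)
```

The returned string can be sent as the -d argument of a curl call or as the body of any HTTP client request.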
Serving Models (API)
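- Start the API server:
  ollama serve
  Starts an HTTP server on http://localhost:11434. The command takes no model argument: the server exposes every locally available model, and each API request names the model it wants (e.g., llama3). This is useful for programmatic access. If Ollama was installed as a desktop app or background service, the server may already be running.
- Stop a running server: Press Ctrl+C in the terminal where ollama serve is running.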
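Before sending requests from a script, it can help to confirm that something is answering on the API port. The sketch below assumes the server replies with HTTP 200 on its root URL at the default address. server_is_up is a hypothetical helper, and a True result only shows that an HTTP server answered, not that it is Ollama.

```python
import urllib.error
import urllib.request


def server_is_up(base_url="http://localhost:11434", timeout=2.0):
    """Return True if an HTTP server answers at base_url with status 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False
```

A script might call server_is_up() first and print a hint to run ollama serve when it returns False.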
Generating Responses (API - with curl)
- Send a prompt to a running server:
  curl http://localhost:11434/api/generate -d '{ "model": "llama3", "prompt": "Why is the sky blue?", "stream": false }'
  Sends a request to the Ollama API to generate a response for the prompt "Why is the sky blue?" using the llama3 model. stream: false means you get the full response at once.
- Stream responses from the API:
  curl http://localhost:11434/api/generate -d '{ "model": "llama3", "prompt": "Explain the concept of recursion in programming.", "stream": true }'
  Sends a request to stream the response token by token. This is useful for interactive applications where you want to display output as it’s generated.
- Create a chat completion:
  curl http://localhost:11434/api/chat -d '{ "model": "llama3", "messages": [ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What are the main benefits of using Python?"} ], "stream": false }'
  Interacts with the model in a chat-like manner, allowing for system messages and a history of user/assistant turns.
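A streamed reply has to be reassembled on the client side. The sketch below assumes that /api/generate with "stream": true sends one JSON object per line, each carrying a response fragment and a done flag that is true on the last object. collect_stream is a hypothetical helper that works on any iterable of lines.

```python
import json


def collect_stream(lines):
    """Join the 'response' fragments from streamed /api/generate output."""
    parts = []
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final object marks the end of the reply
    return "".join(parts)
```

With a live server, the iterable can be the response object returned by urllib.request.urlopen, which yields the body one line at a time. To display output as it arrives, print each fragment inside the loop instead of joining at the end.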
Other Commands
- Show Ollama version:
  ollama --version
  Displays the installed Ollama version.
Common Patterns
- Get a quick answer without starting an interactive session:
  ollama run llama3 "What is the chemical formula for water?"
- Use Ollama output in a script (e.g., for summarization):
  echo "This is a long piece of text that needs summarization. It covers various topics and goes into great detail about each one. The goal is to condense it into a few key points without losing the essential information." | ollama run llama3 "Summarize the following text:"
  (Note: ollama run has no --prompt flag. Pass the instruction as a positional argument; the piped text is combined with it to form the prompt. For complex piping, it is often better to save the text to a file first or use the API.)
- Integrate with other command-line tools (example: using output as input for another command):
  # Generate code snippet and save to a file
  ollama run codellama "Write a Python function to calculate factorial." > factorial.py
  The reply may contain explanatory prose or Markdown fences around the code, so review the file before running it.
- Run a model with custom system prompt via API:
  curl http://localhost:11434/api/chat -d '{ "model": "llama3", "messages": [ {"role": "system", "content": "You are a pirate who speaks in a very specific dialect."}, {"role": "user", "content": "Tell me about the weather today."} ], "stream": false }'
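The chat endpoint receives the whole conversation on every request, so a client has to keep the message history itself. A minimal sketch of that bookkeeping, assuming the /api/chat body shape used in the curl example. The helpers start_chat, add_turn, and chat_payload are hypothetical, and this sketch limits roles to the three mentioned in this document.

```python
import json


def start_chat(system_prompt):
    """Begin a message history with a system prompt."""
    return [{"role": "system", "content": system_prompt}]


def add_turn(messages, role, content):
    """Return a new history with one more message appended."""
    if role not in ("system", "user", "assistant"):
        raise ValueError(f"unsupported role: {role}")
    return messages + [{"role": role, "content": content}]


def chat_payload(model, messages, stream=False):
    """Serialize the body for POST /api/chat."""
    return json.dumps({"model": model, "messages": messages, "stream": stream})
```

After each reply, append the model's answer with add_turn(history, "assistant", ...) before adding the next user message, so the model sees the earlier turns.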
Gotchas
- Resource Usage: LLMs are resource-intensive. Running large models locally can consume significant RAM and CPU. If Ollama feels slow or unresponsive, check your system’s resource utilization.
- Model Download Size: Models can be several gigabytes in size. Ensure you have sufficient disk space before pulling.
- API Port: Ollama by default serves its API on http://localhost:11434. If this port is already in use, the server will fail to start. Set the OLLAMA_HOST environment variable to bind a different address or port (e.g., OLLAMA_HOST=127.0.0.1:11435 ollama serve).
- Model Naming and Tags: Models can have different versions or sizes specified by tags (e.g., llama3:8b, llama3:70b). Ensure you are pulling and running the intended version. If you just use llama3, it typically pulls the latest default version.
- No Default Model: Ollama doesn’t come with a model pre-installed. ollama run downloads a missing model automatically on first use, so the first run can take a while; run ollama pull beforehand if you want the download to happen separately.
- Interactive Session Exit: In interactive mode (ollama run <model>), type /bye to exit cleanly. The model is held in memory by the Ollama server rather than by the chat session, so it stays loaded for a keep-alive period (5 minutes by default) after the session ends and keeps consuming resources until then. ollama ps lists the models that are currently loaded.
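The tag rule from the list above can be sketched in a few lines. This assumes that a name without a tag resolves to the tag latest and that only a colon after the last slash separates the tag. split_model_tag is a hypothetical helper, not an Ollama function.

```python
def split_model_tag(name):
    """Split 'model:tag' into (model, tag); a missing tag means 'latest'."""
    # Only a colon after the last slash separates the tag, so a registry
    # host with a port (host:5000/model) is not mistaken for a tag.
    last_segment = name.rsplit("/", 1)[-1]
    if ":" in last_segment:
        model, tag = name.rsplit(":", 1)
        return model, tag
    return name, "latest"
```

A script can use this to log or verify exactly which variant it is about to pull or run.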