Ollama LLM CLI

Ollama cheatsheet — run LLMs locally. ollama pull llama3, ollama run mistral, ollama list, ollama serve API. Run Llama, Mistral, CodeLlama on your own machine.

What it is

Ollama is a command-line tool for running Large Language Models (LLMs) locally on your machine, making it easy to experiment with and integrate AI into your workflows.

Installation

Linux

curl -fsSL https://ollama.com/install.sh | sh

macOS

Download the .dmg file from the Ollama website and install it.

Windows

Download the .exe installer from the Ollama website and run it.

Core Concepts

  • Model: A specific version of a Large Language Model (e.g., llama2, mistral). Models are downloaded and managed by Ollama.
  • Pull: Downloading a model from the Ollama library to your local machine.
  • Run: Starting an interactive session with a downloaded model.
  • Serve: Starting the Ollama server, which exposes every downloaded model through a local HTTP API for programmatic access.

Commands / Usage

Managing Models

  • Pull a model:

    ollama pull llama3
    

    Downloads the llama3 model from the Ollama library.

  • List downloaded models:

    ollama list
    

    Shows all models currently downloaded on your system.

  • Show model information:

    ollama show llama2
    

    Displays details about the llama2 model, such as its architecture, parameter count, context length, and quantization.

  • Remove a model:

    ollama rm llama2
    

    Deletes the llama2 model from your local machine to free up disk space.
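
A script that depends on a model can check ollama list before pulling. The following is a minimal sketch: has_model and ensure_model are hypothetical helper names, and the parsing assumes the model name (including its tag) appears in the first column of the ollama list output, below a header row.

```shell
# has_model NAME: reads `ollama list` output on stdin and succeeds if NAME
# (for example llama3:latest) appears in the first column.
# Assumption: line 1 is a header and column 1 holds the name with its tag.
has_model() {
  awk -v name="$1" 'NR > 1 && $1 == name { found = 1 } END { exit !found }'
}

# ensure_model NAME: pulls NAME only when it is not already downloaded.
ensure_model() {
  if ollama list | has_model "$1"; then
    echo "$1 is already downloaded"
  else
    ollama pull "$1"
  fi
}
```

Usage: ensure_model llama3:latest. Pass the full name with its tag, since a model pulled as llama3 is normally listed as llama3:latest.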

Running Models

  • Start an interactive chat session:

    ollama run llama3
    

    Opens an interactive command-line interface to chat with the llama3 model. Type your messages and press Enter. Type /bye to exit.

  • Run a model with a prompt (non-interactive):

    ollama run llama3 "What is the capital of France?"
    

    Sends the prompt "What is the capital of France?" to the llama3 model and prints the response. The session exits after the response.

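  • Run a model with specific parameters:

    ollama run llama3
    >>> /set parameter temperature 0.8
    >>> /set parameter top_k 40
    >>> /set parameter num_predict 100
    

    ollama run has no command-line flags for sampling parameters. Set them inside the interactive session with /set parameter, as shown here for a temperature of 0.8, top-k sampling of 40, and a maximum of 100 tokens. The same parameters can be fixed in a Modelfile (PARAMETER temperature 0.8) or sent per request in the API's options field.

    • temperature: Controls randomness. Higher values (e.g., 0.8) make output more creative, lower values (e.g., 0.2) make it more focused and deterministic.
    • top_k: Limits the sampling pool to the k most likely next tokens.
    • num_predict: The maximum number of tokens to predict in the response.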
  • Keep the model loaded after the session ends:

    ollama run llama3 --keepalive 5m
    

    Starts an interactive session with llama3 and keeps the model loaded in memory for 5 minutes after the session ends, so the next run starts faster. This does not restore the earlier conversation.

  • Use a specific model version:

    ollama run llama3:8b
    

    Runs the llama3 model with the tag 8b (e.g., the 8-billion parameter version).
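
Sampling parameters can also be sent per request through the options field of the API. The sketch below builds such a request body for /api/generate; generate_payload is a hypothetical helper, and it assumes the model name and prompt contain no double quotes, backslashes, or newlines, because it does no JSON escaping.

```shell
# generate_payload MODEL PROMPT TEMPERATURE TOP_K NUM_PREDICT
# Prints a JSON request body for POST /api/generate.
# Assumption: MODEL and PROMPT need no JSON escaping (no quotes,
# backslashes, or newlines); this sketch inserts them verbatim.
generate_payload() {
  printf '{"model": "%s", "prompt": "%s", "stream": false, "options": {"temperature": %s, "top_k": %s, "num_predict": %s}}\n' \
    "$1" "$2" "$3" "$4" "$5"
}
```

With a server running, the body can be piped straight into curl: generate_payload llama3 "Why is the sky blue?" 0.8 40 100 | curl -s http://localhost:11434/api/generate -d @-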

Serving Models (API)

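  • Start the API server:

    ollama serve
    

    Starts the Ollama HTTP server on http://localhost:11434. ollama serve takes no model argument; each API request names the model it wants. The macOS and Windows apps and the Linux install script normally start this server in the background already, in which case a second ollama serve fails because the port is in use.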

  • Stop a running server: Press Ctrl+C in the terminal where ollama serve is running.
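
Scripts that call the API often need to wait until the server is reachable. The sketch below polls the server's root URL; ollama_url and wait_for_server are hypothetical helper names, and the code assumes that OLLAMA_HOST, when set, has the plain form host:port.

```shell
# ollama_url: prints the base URL of the Ollama server.
# Assumption: OLLAMA_HOST, when set, looks like host:port (no scheme).
ollama_url() {
  printf 'http://%s\n' "${OLLAMA_HOST:-localhost:11434}"
}

# wait_for_server [TRIES]: polls the root URL once per second and
# succeeds as soon as the server answers; fails after TRIES attempts.
wait_for_server() {
  tries="${1:-10}"
  while [ "$tries" -gt 0 ]; do
    if curl -fsS "$(ollama_url)/" >/dev/null 2>&1; then
      return 0
    fi
    tries=$((tries - 1))
    sleep 1
  done
  return 1
}
```

Usage: wait_for_server 30 && curl -s "$(ollama_url)/api/tags" lists the downloaded models once the server is up.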

Generating Responses (API - with curl)

  • Send a prompt to a running server:

    curl http://localhost:11434/api/generate -d '{
      "model": "llama3",
      "prompt": "Why is the sky blue?",
      "stream": false
    }'
    

    Sends a request to the Ollama API to generate a response for the prompt "Why is the sky blue?" using the llama3 model. stream: false means you get the full response at once.

  • Stream responses from the API:

    curl http://localhost:11434/api/generate -d '{
      "model": "llama3",
      "prompt": "Explain the concept of recursion in programming.",
      "stream": true
    }'
    

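    Sends a request to stream the response as a sequence of JSON objects, one per line, each carrying the next piece of the answer. Streaming is the default when "stream" is omitted. This is useful for interactive applications where you want to display output as it’s generated.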

  • Create a chat completion:

    curl http://localhost:11434/api/chat -d '{
      "model": "llama3",
      "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are the main benefits of using Python?"}
      ],
      "stream": false
    }'
    

    Interacts with the model in a chat-like manner, allowing for system messages and a history of user/assistant turns.
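
The API keeps no conversation state between requests, so a multi-turn chat resends the earlier turns in messages each time. The sketch below builds such a body from tab-separated role and content lines; chat_payload is a hypothetical helper, and it assumes the text contains no double quotes, backslashes, or tabs, because it does no JSON escaping.

```shell
# chat_payload MODEL: reads one message per line on stdin in the form
# role<TAB>content and prints a JSON request body for POST /api/chat.
# Assumption: the text needs no JSON escaping; it is inserted verbatim.
chat_payload() {
  awk -F '\t' -v model="$1" '
    BEGIN { printf "{\"model\": \"%s\", \"messages\": [", model }
    { printf "%s{\"role\": \"%s\", \"content\": \"%s\"}", (NR > 1 ? ", " : ""), $1, $2 }
    END { print "], \"stream\": false}" }
  '
}
```

Usage, with a server running: printf 'user\tName a prime number.\nassistant\tSeven.\nuser\tName a larger one.\n' | chat_payload llama3 | curl -s http://localhost:11434/api/chat -d @-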

Other Commands

  • Show Ollama version:
    ollama --version
    
    Displays the installed Ollama version.

Common Patterns

  • Get a quick answer without starting an interactive session:

    ollama run llama3 "What is the chemical formula for water?"
    
  • Use Ollama output in a script (e.g., for summarization):

    echo "This is a long piece of text that needs summarization. It covers various topics and goes into great detail about each one. The goal is to condense it into a few key points without losing the essential information." | ollama run llama3 "Summarize this text in a few key points."
    

    (Note: ollama run has no --prompt flag; the prompt is a plain argument. When text is piped in, Ollama combines it with the prompt argument, so word the prompt so that it reads correctly whichever part comes first. For long or complex input, it is often more reliable to save the text to a file first or to use the API.)

  • Integrate with other command-line tools (example: using output as input for another command):

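    # Generate a code snippet and save the raw response to a file.
    # The response may include explanatory prose and Markdown fences, so review it before running it.
    ollama run codellama "Write a Python function to calculate factorial." > factorial.py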
    
  • Run a model with custom system prompt via API:

    curl http://localhost:11434/api/chat -d '{
      "model": "llama3",
      "messages": [
        {"role": "system", "content": "You are a pirate who speaks in a very specific dialect."},
        {"role": "user", "content": "Tell me about the weather today."}
      ],
      "stream": false
    }'
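
For the summarization pattern above, the text to summarize often lives in a file. The sketch below combines an instruction and a file into a single prompt argument; build_prompt and the file name notes.txt are hypothetical.

```shell
# build_prompt INSTRUCTION FILE: prints the instruction, a blank line,
# and then the contents of FILE, ready to pass as one prompt argument.
build_prompt() {
  printf '%s\n\n%s\n' "$1" "$(cat "$2")"
}
```

Usage: ollama run llama3 "$(build_prompt "Summarize the following text:" notes.txt)". Very large files can exceed the model's context window or the shell's argument length limit, in which case the API is the better route.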
    

Gotchas

  • Resource Usage: LLMs are resource-intensive. Running large models locally can consume significant RAM and CPU. If Ollama feels slow or unresponsive, check your system’s resource utilization.
  • Model Download Size: Models can be several gigabytes in size. Ensure you have sufficient disk space before pulling.
  • API Port: Ollama by default serves its API on http://localhost:11434. If this port is already in use, the server fails to start. To bind a different address or port, set the OLLAMA_HOST environment variable (e.g., OLLAMA_HOST=127.0.0.1:11435 ollama serve).
  • Model Naming and Tags: Models can have different versions or sizes specified by tags (e.g., llama3:8b, llama3:70b). Ensure you are pulling and running the intended version. A bare name such as llama3 resolves to the latest tag, which points at the library's default variant of that model.
  • No Default Model: Ollama doesn’t come with a model pre-installed. You must explicitly ollama pull a model before you can ollama run it.
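  • Interactive Session Exit: In interactive mode (ollama run <model>), type /bye or press Ctrl+D to exit. However the session ends, the server keeps the model loaded in memory for the keep-alive period (5 minutes by default), so resources are not freed immediately. In recent versions, ollama ps shows which models are loaded and ollama stop <model> unloads one right away.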