Prometheus Tooling

promtool cheatsheet — validate Prometheus configs and rules, query metrics, check alerts. promtool check rules, promtool query instant, promtool check config. Full reference.


Prometheus Cheatsheet

What it is

Prometheus is an open-source systems monitoring and alerting toolkit that collects and stores metrics as time-series data. You reach for it when you need to monitor the health and performance of your applications and infrastructure.

Installation

Linux (using package manager - example for Debian/Ubuntu)

sudo apt update
sudo apt install prometheus prometheus-node-exporter

Mac (using Homebrew)

brew install prometheus node_exporter

Windows (download binary from official releases)

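Download the Windows build of Prometheus (prometheus.exe) from the Prometheus downloads page, place it in a directory, and run it from the command line. node_exporter does not target Windows; for host-level metrics there, the community windows_exporter project is the usual equivalent.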

Core Concepts

  • Metrics: Numerical measurements of system or application performance over time. Prometheus supports several types:
    • Counters: Monotonically increasing values (e.g., number of requests served).
    • Gauges: Values that can go up or down (e.g., current memory usage).
    • Histograms: Observations of distributions (e.g., request latencies), allowing calculation of quantiles.
    • Summaries: Similar to Histograms but calculate quantiles on the client side.
  • Time Series: A stream of data points indexed by time. Each time series is uniquely identified by its metric name and a set of key-value pairs called labels.
  • Labels: Key-value pairs that attach metadata to time series (e.g., instance="localhost:9090", job="node"). Label sets define the uniqueness of a time series.
  • Exporters: Services that expose metrics in a format Prometheus can scrape. node_exporter is common for host-level metrics.
  • Scraping: Prometheus periodically fetches (scrapes) metrics from configured targets (exporters).
  • PromQL: Prometheus Query Language, a powerful functional query language to select and aggregate time series data in real-time.
  • Alertmanager: Handles alerts sent by Prometheus, deduplicating, grouping, and routing them to correct receiver integrations like email, PagerDuty, or Slack.
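To make the "metric name plus label set" identity concrete, here is a minimal Python sketch that parses one sample line of the text format an exporter serves. It is a simplification under stated assumptions: no escaped quotes inside label values, no timestamps, and no comment lines. The helper name parse_sample is hypothetical.

```python
import re

# Simplified matcher for one sample line: name{label="value",...} value
# Assumption: no escaped quotes, no trailing timestamp, no "# HELP"/"# TYPE" lines.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_sample(line):
    """Return (series_key, value) for one sample line (hypothetical helper)."""
    m = SAMPLE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a sample line: {line!r}")
    name, raw_labels, raw_value = m.groups()
    labels = dict(LABEL_RE.findall(raw_labels or ""))
    # The metric name plus the full label set identifies the time series;
    # label order does not matter, hence the frozenset.
    series_key = (name, frozenset(labels.items()))
    return series_key, float(raw_value)


a, _ = parse_sample('node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5')
b, _ = parse_sample('node_cpu_seconds_total{mode="idle",cpu="0"} 1240.0')
c, _ = parse_sample('node_cpu_seconds_total{cpu="1",mode="idle"} 987.0')
print(a == b)  # same series, label order differs
print(a == c)  # different series, cpu label differs
```

Changing any single label value produces a new series, which is why label design drives the number of series Prometheus has to store.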

Commands / Usage

Prometheus Server (prometheus)

Starting Prometheus:

prometheus --config.file=prometheus.yml

Starts the Prometheus server using the specified configuration file.

Configuration Flags:

  • --config.file: Path to the Prometheus configuration file (default: prometheus.yml).
  • --web.listen-address: Address and port to listen on for HTTP requests (default: 0.0.0.0:9090).
  • --storage.tsdb.path: Path to the data directory (default: data/).
  • --storage.tsdb.retention.time: How long to retain data (e.g., 15d for 15 days, 4w for 4 weeks) (default: 15d).
  • --web.enable-lifecycle: Enable shutdown and configuration reload via HTTP requests (POST to /-/quit and /-/reload).
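When the server runs with --web.enable-lifecycle, a configuration reload can be requested with an HTTP POST to /-/reload. The sketch below only builds that request in Python; the base URL is an assumption matching the default listen address, and build_reload_request is a hypothetical helper name.

```python
import urllib.request


def build_reload_request(base_url="http://localhost:9090"):
    """Build (but do not send) the POST request that asks for a config reload."""
    return urllib.request.Request(f"{base_url}/-/reload", method="POST")


req = build_reload_request()
print(req.get_method(), req.full_url)

# To send it against a running server started with --web.enable-lifecycle:
# urllib.request.urlopen(req, timeout=5)
```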

Example prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100', '192.168.1.100:9100']

Node Exporter (node_exporter)

Starting Node Exporter:

node_exporter

Starts the node exporter, exposing system metrics on port 9100 by default.

Flags:

  • --web.listen-address: Address and port to listen on (default: :9100).
  • --path.procfs: Path to the proc filesystem (default: /proc).
  • --path.sysfs: Path to the sys filesystem (default: /sys).
  • --collector.<name>: Enable a specific collector (e.g., --collector.cpu, --collector.meminfo).
  • --no-collector.<name>: Disable a specific collector (e.g., --no-collector.diskstats).

Running only specific collectors:

node_exporter --collector.disable-defaults --collector.cpu --collector.meminfo --collector.netdev

Promtool (promtool)

A utility for checking Prometheus configuration and rules.

Checking configuration:

promtool check config prometheus.yml

Validates the syntax of your Prometheus configuration file.

Checking rule files:

promtool check rules rules.yml

Validates the syntax of your alerting or recording rules file.

Unit-testing rules:

promtool test rules rules_test.yml

Runs the unit tests defined in a test file against your alerting and recording rules.

Querying the API (using curl)

Server flags:

curl 'http://localhost:9090/api/v1/status/flags'

Retrieves Prometheus server flags.

Querying data (PromQL):

curl -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=up{job="node_exporter"}'

Executes a PromQL query and returns the result.
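The same instant query can be built and its JSON response unpacked in a few lines of Python. This is a sketch: the sample response is hand-written to match the shape of an instant-vector result, not captured from a real server, and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode


def instant_query_url(expr, base_url="http://localhost:9090"):
    """URL-encode a PromQL expression for the /api/v1/query endpoint."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


def vector_values(response_text):
    """Extract {instance: value} from an instant-vector response body."""
    body = json.loads(response_text)
    if body["status"] != "success":
        raise RuntimeError(body.get("error", "query failed"))
    # Each sample is [timestamp, "value"]; the value arrives as a string.
    return {
        s["metric"].get("instance", ""): float(s["value"][1])
        for s in body["data"]["result"]
    }


# Hand-written sample response (assumption), not real server output.
sample = '''{"status":"success","data":{"resultType":"vector","result":[
 {"metric":{"__name__":"up","instance":"localhost:9100","job":"node_exporter"},"value":[1678886400,"1"]},
 {"metric":{"__name__":"up","instance":"192.168.1.100:9100","job":"node_exporter"},"value":[1678886400,"0"]}]}}'''

url = instant_query_url('up{job="node_exporter"}')
values = vector_values(sample)
print(url)
print(values)
```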

Querying range data:

curl -G 'http://localhost:9090/api/v1/query_range' --data-urlencode 'query=rate(node_cpu_seconds_total{mode="idle"}[5m])' --data-urlencode 'start=1678886400' --data-urlencode 'end=1678890000' --data-urlencode 'step=15s'

Executes a PromQL query over a time range with a specific step.
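The start, end, and step parameters together decide how many points each series returns, and the range endpoint rejects requests that would exceed 11,000 points per series. A small Python sketch of that arithmetic, using the timestamps from the example above (range_points is a hypothetical helper):

```python
MAX_POINTS_PER_SERIES = 11_000  # limit applied by the query_range endpoint


def range_points(start, end, step_seconds):
    """Number of evaluation points for a range query (inclusive of both ends)."""
    if step_seconds <= 0 or end < start:
        raise ValueError("need step > 0 and end >= start")
    return (end - start) // step_seconds + 1


points = range_points(1678886400, 1678890000, 15)
print(points)  # one hour at 15s resolution

# Thirty days at the same step would be far over the limit,
# so a wider range needs a larger step.
too_many = range_points(0, 30 * 86400, 15) > MAX_POINTS_PER_SERIES
print(too_many)
```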

Getting available targets:

curl 'http://localhost:9090/api/v1/targets'

Lists all configured targets and their states.
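A common follow-up is to filter that response for targets that are not healthy. The Python sketch below does this against a hand-written sample body (an assumption, trimmed to the fields it uses); unhealthy_targets is a hypothetical helper name.

```python
import json


def unhealthy_targets(response_text):
    """Return scrape URLs of active targets whose health is not "up"."""
    body = json.loads(response_text)
    return [
        t["scrapeUrl"]
        for t in body["data"]["activeTargets"]
        if t["health"] != "up"
    ]


# Hand-written sample (assumption); a real response carries more fields.
sample_targets = '''{"status":"success","data":{"activeTargets":[
 {"labels":{"job":"node_exporter","instance":"localhost:9100"},
  "scrapeUrl":"http://localhost:9100/metrics","health":"up","lastError":""},
 {"labels":{"job":"node_exporter","instance":"192.168.1.100:9100"},
  "scrapeUrl":"http://192.168.1.100:9100/metrics","health":"down",
  "lastError":"connection refused"}],
 "droppedTargets":[]}}'''

down = unhealthy_targets(sample_targets)
print(down)
```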

Common Patterns

Monitoring CPU Usage (per core, per job):

# In Prometheus UI or via API
rate(node_cpu_seconds_total{mode!="idle"}[5m])

Calculates the per-second rate of non-idle CPU time, averaged over the last 5 minutes, as one series per CPU core and mode.
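To show what a per-second rate over a counter means, here is a deliberately simplified Python sketch. It handles counter resets (a drop in value is treated as a restart from zero) but, unlike the real rate(), it does not extrapolate to the window boundaries. The function name and sample data are illustrative assumptions.

```python
def simple_rate(samples):
    """samples: list of (timestamp_seconds, counter_value), oldest first.

    Simplified stand-in for rate(): total increase divided by elapsed time,
    with counter resets handled, and no boundary extrapolation.
    """
    if len(samples) < 2:
        return None  # a rate needs at least two samples in the window
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        # A drop means the counter reset, so the new value is the increase.
        increase += value if value < prev else value - prev
        prev = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# The counter restarts between t=30 and t=45.
window = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 30.0), (60, 60.0)]
cpu_rate = simple_rate(window)
print(cpu_rate)
```

Because resets are absorbed this way, a process restart does not show up as a negative rate.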

Monitoring Memory Usage (percentage):

# In Prometheus UI or via API
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

Calculates the percentage of used memory.

Monitoring Network Traffic (bytes received per second):

# In Prometheus UI or via API
rate(node_network_receive_bytes_total{device!~"lo"}[5m])

Calculates the rate of received bytes per second on network interfaces, excluding loopback.

Alerting when a service is down:

  • Rule (rules.yml):
    groups:
    - name: example.rules
      rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."

  • Prometheus Configuration (prometheus.yml):
    rule_files:
      - "rules.yml"
    
  • Alertmanager Configuration: Point Prometheus at Alertmanager through the alerting section of prometheus.yml, and define routes and receivers in alertmanager.yml so alerts reach the right destination.
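The for: 5m clause in the InstanceDown rule means the expression has to stay true across consecutive evaluations for that long before the alert fires; until then it is pending. A rough Python sketch of that state logic follows, ignoring newer options such as keep_firing_for; alert_states is a hypothetical helper.

```python
def alert_states(evaluations, for_seconds):
    """evaluations: list of (timestamp_seconds, condition_is_true), in order.

    Returns the alert state after each rule evaluation.
    """
    states = []
    active_since = None
    for ts, is_true in evaluations:
        if not is_true:
            # Any false evaluation resets the pending timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts
        held = ts - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states


# Evaluations every 60s with for: 2m (120s) to keep the example short.
evals = [(0, True), (60, True), (120, True), (180, False), (240, True)]
states = alert_states(evals, for_seconds=120)
print(states)
```

The single false evaluation at 180s sends the alert back to inactive, so a flapping target has to stay down for the full duration again before it fires.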

Recording a complex query for reuse:

  • Rule (rules.yml):
    groups:
    - name: recording.rules
      rules:
      - record: job:request_latency:histogram_quantile
        expr: |
          histogram_quantile(0.95, sum by (le, job) (rate(http_request_duration_seconds_bucket[5m])))
    

This records the 95th percentile of request latency for each job, making it easier to query later.
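For intuition about what histogram_quantile does with the le buckets, here is a simplified Python sketch of the estimate for a classic histogram: find the bucket containing the target rank, then interpolate linearly inside it. The bucket bounds and counts are made up for illustration, and edge cases of the real function (malformed buckets, native histograms) are left out.

```python
import math


def histogram_quantile(q, buckets):
    """buckets: list of (upper_bound, cumulative_count), including +Inf.

    Assumes 0 < q <= 1 and at least one finite bucket besides +Inf.
    """
    if not 0 < q <= 1 or len(buckets) < 2:
        raise ValueError("need 0 < q <= 1 and at least two buckets")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            in_bucket = count - prev_count
            # Linear interpolation between the bucket's lower and upper bound.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


# Made-up cumulative counts per upper bound (seconds).
latency = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, latency)
p99 = histogram_quantile(0.99, latency)
print(p95, p99)
```

The interpolation is why bucket boundaries matter: the estimate can never be more precise than the bucket the quantile lands in, and anything landing in the +Inf bucket is capped at the highest finite bound.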

Using label_replace for dynamic labels:

# In PromQL
label_replace(
  up{job="myjob", instance="192.168.1.1:8080"},
  "datacenter", "$1", "instance", "([^:]+):.*"
)

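Adds a datacenter label whose value is the host portion of the instance label (here 192.168.1.1). The label name is only illustrative; if the regex does not match, the series is returned unchanged.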
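The matching behaviour of label_replace can be imitated on a single label set in Python. This is an approximation under stated assumptions: Python's re module stands in for the RE2 syntax Prometheus uses, only $1-style references are translated, and removal of a label when the replacement expands to an empty string is not modelled.

```python
import re


def label_replace(labels, dst_label, replacement, src_label, regex):
    """Imitate PromQL label_replace on one label set (a dict)."""
    # The regex must match the entire source label value.
    match = re.fullmatch(regex, labels.get(src_label, ""))
    if match is None:
        return dict(labels)  # no match: labels come back unchanged
    # Translate "$1" into Python's "\g<1>" group reference.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    result = dict(labels)
    result[dst_label] = match.expand(template)
    return result


series = {"job": "myjob", "instance": "192.168.1.1:8080"}
out = label_replace(series, "datacenter", "$1", "instance", "([^:]+):.*")
print(out["datacenter"])

unchanged = label_replace({"instance": "no-port"}, "datacenter", "$1",
                          "instance", "([^:]+):.*")
print(unchanged)
```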

Gotchas

  • rate() vs irate(): rate() calculates the average per-second rate over the whole time window, smoothing out spikes, which makes it the safer choice for alerting and recording rules. irate() uses only the last two data points in the window, so it reacts to short-lived spikes and is best kept for graphing fast-moving counters.
  • Label Selectors: Leaving labels out of a selector widens the aggregation. For example, sum(rate(node_cpu_seconds_total[5m])) sums across every core, mode, and instance unless you filter by job or instance.
  • Aggregation grouping (by / without): PromQL has no GROUP BY clause; grouping is done with by (...) or without (...). If you omit it, for example sum(...) instead of sum by (label1, label2) (...), the aggregation collapses all series into a single value instead of one result per group.
  • Time Series Cardinality: Too many unique time series (e.g., from high-cardinality labels like user IDs or request IDs) can overwhelm Prometheus, leading to performance issues and high memory usage. Design your labels carefully.
  • _total suffix: Metrics ending in _total are typically counters and should be used with rate() or increase().
  • Default Retention: Prometheus default retention is 15 days. If you need longer retention, configure --storage.tsdb.retention.time or use remote storage solutions.
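  • Alertmanager Clustering: For high availability, run Alertmanager as a cluster. Clustering is set up with the --cluster.* command-line flags (for example --cluster.peer) rather than in alertmanager.yml, and Prometheus should be configured to send alerts to every Alertmanager instance directly rather than through a load balancer.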
  • Scrape Interval vs Evaluation Interval: The scrape interval determines how often Prometheus fetches data. The evaluation interval determines how often Prometheus re-evaluates alerting and recording rules. They don’t have to be the same, but should be coordinated.
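The difference between a bare sum(...) and sum by (...) from the aggregation gotchas above can be sketched in Python with made-up per-core rates. The sum_by helper and the sample values are assumptions for illustration only.

```python
from collections import defaultdict


def sum_by(series, by=()):
    """series: list of (labels_dict, value). Sum values per group of `by` labels."""
    totals = defaultdict(float)
    for labels, value in series:
        # Only the grouping labels survive; everything else is aggregated away.
        key = tuple((name, labels.get(name, "")) for name in by)
        totals[key] += value
    return dict(totals)


# Made-up per-core rates on two instances.
cpu_rates = [
    ({"instance": "a:9100", "cpu": "0"}, 0.25),
    ({"instance": "a:9100", "cpu": "1"}, 0.5),
    ({"instance": "b:9100", "cpu": "0"}, 0.75),
]

overall = sum_by(cpu_rates)                         # like sum(...)
per_instance = sum_by(cpu_rates, by=("instance",))  # like sum by (instance) (...)
print(overall)
print(per_instance)
```

With no grouping labels everything lands in one group, which is the single value the gotcha warns about.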