Prometheus Cheatsheet
What it is
Prometheus is an open-source systems monitoring and alerting toolkit that collects and stores metrics as time-series data. You reach for it when you need to monitor the health and performance of your applications and infrastructure.
Installation
Linux (using package manager - example for Debian/Ubuntu)
sudo apt update
sudo apt install prometheus prometheus-node-exporter
Mac (using Homebrew)
brew install prometheus node_exporter
Windows (download binary from official releases)
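Download the Windows archive for Prometheus from the Prometheus releases page, extract it to a directory, and run prometheus.exe from the command line. node_exporter targets Unix-like systems; for Windows host metrics, use the community windows_exporter instead.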
Core Concepts
- Metrics: Numerical measurements of system or application performance over time. Prometheus supports four metric types:
  - Counters: Monotonically increasing values that only reset to zero on restart (e.g., number of requests served).
  - Gauges: Values that can go up or down (e.g., current memory usage).
  - Histograms: Observations counted into configurable buckets (e.g., request latencies), so quantiles can be calculated at query time on the server.
  - Summaries: Similar to histograms, but quantiles are calculated on the client side.
- Time Series: A stream of data points indexed by time. Each time series is uniquely identified by its metric name and a set of key-value pairs called labels.
- Labels: Key-value pairs that attach metadata to time series (e.g., instance="localhost:9090", job="node"). Label sets define the uniqueness of a time series.
- Exporters: Services that expose metrics in a format Prometheus can scrape. node_exporter is common for host-level metrics.
- Scraping: Prometheus periodically fetches (scrapes) metrics from configured targets (exporters).
- PromQL: Prometheus Query Language, a functional query language for selecting and aggregating time series data in real time.
- Alertmanager: Handles alerts sent by Prometheus, deduplicating, grouping, and routing them to the correct receiver integration, such as email, PagerDuty, or Slack.
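To make the series-identity rule concrete, here is a minimal Python sketch that parses one line of the text exposition format and derives a key from the metric name plus its labels. The names parse_sample and series_key are hypothetical helpers, not part of any Prometheus client library, and the parser assumes simple lines: no timestamps and no escaped quotes inside label values.

```python
import re

# Matches lines such as: http_requests_total{method="GET",code="200"} 1027
# Simplification (assumption): no trailing timestamp, no escaped quotes.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_sample(line):
    """Split one exposition-format line into (name, labels, value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    name, raw_labels, raw_value = match.groups()
    labels = dict(LABEL_RE.findall(raw_labels or ""))
    return name, labels, float(raw_value)


def series_key(name, labels):
    """Two samples belong to the same time series only if their keys are equal."""
    # Sorting makes the key independent of the order labels were written in.
    return (name, tuple(sorted(labels.items())))
```

Changing any label value (for example code="200" to code="500") yields a different key, which is why every distinct label combination is stored as its own time series.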
Commands / Usage
Prometheus Server (prometheus)
Starting Prometheus:
prometheus --config.file=prometheus.yml
Starts the Prometheus server using the specified configuration file.
Configuration Flags:
- --config.file: Path to the Prometheus configuration file (default: prometheus.yml).
- --web.listen-address: Address and port to listen on for HTTP requests (default: 0.0.0.0:9090).
- --storage.tsdb.path: Path to the data directory (default: data/).
- --storage.tsdb.retention.time: How long to retain data (e.g., 15d for 15 days, 4w for 4 weeks) (default: 15d).
- --web.enable-lifecycle: Enable the HTTP endpoints for reloading the configuration (POST /-/reload) and shutting down the server (POST /-/quit).
Example prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100', '192.168.1.100:9100']
Node Exporter (node_exporter)
Starting Node Exporter:
node_exporter
Starts the node exporter, exposing system metrics on port 9100 by default.
Flags:
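- --web.listen-address: Address and port to listen on (default: :9100).
- --path.procfs: Path to the proc filesystem (default: /proc).
- --path.sysfs: Path to the sys filesystem (default: /sys).
- --collector.<name>: Enable a specific collector (e.g., --collector.cpu, --collector.meminfo).
- --no-collector.<name>: Disable a specific collector (e.g., --no-collector.diskstats).
Enabling only specific collectors:
node_exporter --collector.disable-defaults --collector.cpu --collector.meminfo --collector.netdev
The cpu, meminfo, and netdev collectors are already enabled by default; --collector.disable-defaults turns off every other default collector so that only the listed ones run.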
Promtool (promtool)
A utility for checking Prometheus configuration and rules.
Checking configuration:
promtool check config prometheus.yml
Validates the syntax of your Prometheus configuration file.
Checking rule files:
promtool check rules rules.yml
Validates the syntax of your alerting or recording rules file.
Unit testing rules:
promtool test rules rules_test.yml
Runs the unit tests defined in the given test file (rules_test.yml is an example name) against the rule files it references.
Querying the API (using curl)
Current status:
curl 'http://localhost:9090/api/v1/status/flags'
Retrieves Prometheus server flags.
Querying data (PromQL):
curl -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=up{job="node_exporter"}'
Executes a PromQL query and returns the result.
Querying range data:
curl -G 'http://localhost:9090/api/v1/query_range' --data-urlencode 'query=rate(node_cpu_seconds_total{mode="idle"}[5m])' --data-urlencode 'start=1678886400' --data-urlencode 'end=1678890000' --data-urlencode 'step=15s'
Executes a PromQL query over a time range with a specific step.
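The same requests can be assembled in Python. The sketch below only builds the URLs that the curl commands above produce and decodes an instant-query response body; it assumes the server address from the examples, and SAMPLE_BODY is a hand-written illustration of the JSON response shape, not captured server output. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode

BASE_URL = "http://localhost:9090"  # assumption: same address as the curl examples


def instant_query_url(promql):
    """Build the URL requested by `curl -G .../api/v1/query --data-urlencode query=...`."""
    query_string = urlencode({"query": promql})
    return f"{BASE_URL}/api/v1/query?{query_string}"


def range_query_url(promql, start, end, step):
    """Build a query_range URL; start and end are Unix timestamps."""
    query_string = urlencode(
        {"query": promql, "start": start, "end": end, "step": step}
    )
    return f"{BASE_URL}/api/v1/query_range?{query_string}"


def vector_values(response_body):
    """Map each series' instance label to its value in an instant-query response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    return {
        item["metric"].get("instance", ""): float(item["value"][1])
        for item in payload["data"]["result"]
    }


# Hand-written sample body (not real server output) in the instant-vector shape.
SAMPLE_BODY = (
    '{"status":"success","data":{"resultType":"vector","result":['
    '{"metric":{"__name__":"up","instance":"localhost:9100","job":"node_exporter"},'
    '"value":[1678886400,"1"]}]}}'
)
```

To run a query for real, pass the URL to urllib.request.urlopen and hand the decoded body to vector_values. Note that sample values arrive as strings inside the JSON, hence the float() conversion.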
Getting available targets:
curl 'http://localhost:9090/api/v1/targets'
Lists all configured targets and their states.
Common Patterns
Monitoring CPU Usage (per core, per job):
# In Prometheus UI or via API
rate(node_cpu_seconds_total{mode!="idle"}[5m])
Calculates the per-second average rate of non-idle CPU time over the last 5 minutes.
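As a rough illustration of what rate() computes, the Python sketch below derives a per-second rate from raw counter samples and treats any drop in value as a counter reset. It is a simplification under stated assumptions: the real rate() also extrapolates to the edges of the time window, which this hypothetical simple_rate helper does not.

```python
def simple_rate(samples):
    """Per-second increase of a counter across the given samples.

    `samples` is a list of (unix_timestamp, counter_value) pairs in time order.
    Unlike PromQL's rate(), no extrapolation to the window boundaries is done.
    """
    if len(samples) < 2:
        return None  # a rate needs at least two samples
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # Counter reset (e.g., process restart): the counter began again
            # from zero, so the whole current value is new increase.
            increase += current
    return increase / elapsed
```

For example, simple_rate([(0, 100), (60, 160), (120, 220)]) sees an increase of 120 over 120 seconds, i.e. one unit per second.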
Monitoring Memory Usage (percentage):
# In Prometheus UI or via API
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
Calculates the percentage of used memory.
Monitoring Network Traffic (bytes received per second):
# In Prometheus UI or via API
rate(node_network_receive_bytes_total{device!~"lo"}[5m])
Calculates the rate of received bytes per second on network interfaces, excluding loopback.
Alerting when a service is down:
- Rule (rules.yml):
    groups:
      - name: example.rules
        rules:
          - alert: InstanceDown
            expr: up == 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Instance {{ $labels.instance }} down"
              description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
- Prometheus Configuration (prometheus.yml):
    rule_files:
      - "rules.yml"
- Alertmanager Configuration: Ensure Alertmanager is configured to receive alerts from Prometheus and route them appropriately.
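The effect of the for: 5m clause can be sketched as a small state machine: the alert is pending while the expression has been continuously true for less than the hold duration, and firing once it has been true for at least that long. The Python below is an illustrative model with a hypothetical alert_states helper, not Prometheus code.

```python
def alert_states(evaluations, hold_seconds):
    """Return the alert state after each rule evaluation.

    `evaluations` is a list of (unix_timestamp, condition_is_true) pairs,
    one per evaluation_interval tick, in time order.
    """
    states = []
    active_since = None  # when the condition last became true
    for timestamp, condition in evaluations:
        if not condition:
            # Any false evaluation resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        if timestamp - active_since >= hold_seconds:
            states.append("firing")
        else:
            states.append("pending")
    return states
```

With evaluations every 60 seconds and hold_seconds=300, an instance that goes down at t=0 is pending for the first five evaluations and firing from t=300 onward. A single successful scrape in between resets the timer.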
Recording a complex query for reuse:
- Rule (rules.yml):
    groups:
      - name: recording.rules
        rules:
          - record: job:request_latency:histogram_quantile
            expr: |
              histogram_quantile(0.95, sum by (le, job) (rate(http_request_duration_seconds_bucket[5m])))
This records the 95th percentile of request latency for each job, making it easier to query later.
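To show how a quantile falls out of cumulative buckets, here is a simplified Python model of the linear interpolation that histogram_quantile() performs. It assumes classic cumulative buckets sorted by upper bound and ending in +Inf, a lowest bucket that starts at 0, and 0 < q <= 1. The bucket counts in the usage note are made-up illustration data.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound, end with the +Inf bucket,
    and contain at least one finite bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    total = buckets[-1][1]  # the +Inf bucket counts every observation
    if total == 0:
        return math.nan
    rank = q * total
    previous_bound, previous_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank lands in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Assume observations are spread evenly inside the bucket.
            in_bucket = count - previous_count
            fraction = (rank - previous_count) / in_bucket
            return previous_bound + (upper_bound - previous_bound) * fraction
        previous_bound, previous_count = upper_bound, count
    return math.nan
```

With buckets [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)], the 95th observation of 100 lies halfway through the 0.5 to 1.0 bucket, so the estimate is 0.75 seconds. The result is only as precise as the bucket boundaries allow.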
Using label_replace for dynamic labels:
# In PromQL
label_replace(
up{job="myjob", instance="192.168.1.1:8080"},
"datacenter", "$1", "instance", "([^:]+):.*"
)
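Adds a datacenter label whose value is the first capture group of the regex matched against the instance label, here the host portion (192.168.1.1). If the regex does not match, the series is returned unchanged.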
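The matching rules can be modeled in a few lines of Python. This is an approximation under stated assumptions: Prometheus uses RE2 syntax and anchors the regex to the whole label value, which the sketch imitates with re.fullmatch, and only numbered references such as $1 are handled. The function is a hypothetical stand-in operating on plain dicts.

```python
import re


def label_replace(labels, dst_label, replacement, src_label, regex):
    """Return a copy of `labels` with dst_label set from a regex match on src_label."""
    source_value = labels.get(src_label, "")
    # The regex must match the entire source value, not just a substring.
    match = re.fullmatch(regex, source_value)
    if match is None:
        return dict(labels)  # no match: labels are left unchanged

    def expand(reference):
        # Turn "$1" into the text of capture group 1 (empty if it did not participate).
        return match.group(int(reference.group(1))) or ""

    result = dict(labels)
    result[dst_label] = re.sub(r"\$(\d+)", expand, replacement)
    return result
```

Calling it with {"job": "myjob", "instance": "192.168.1.1:8080"} and the arguments from the query above adds datacenter="192.168.1.1" and leaves the other labels untouched.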
Gotchas
- rate() vs irate(): rate() calculates the average per-second rate over the whole time window, smoothing out spikes. irate() uses only the last two data points in the window, so it reacts to short-lived spikes. Prefer rate() for alerts and slow-moving counters, and keep irate() for graphing volatile, fast-moving counters.
- Label Selectors: Forgetting to include necessary labels in selectors can lead to incorrect aggregations or missed data. For example, sum(rate(node_cpu_seconds_total[5m])) sums across all cores, modes, and instances unless you filter by job or instance.
- Grouping with by/without: PromQL has no GROUP BY clause. If you aggregate without a by (...) or without (...) modifier, for example sum(...) instead of sum by (label1, label2) (...), Prometheus collapses every matching series into a single value instead of returning grouped results.
- Time Series Cardinality: Too many unique time series (e.g., from high-cardinality labels like user IDs or request IDs) can overwhelm Prometheus, leading to performance issues and high memory usage. Design your labels carefully.
- _total suffix: Metrics ending in _total are typically counters and should be used with rate() or increase().
- Default Retention: Prometheus default retention is 15 days. If you need longer retention, configure --storage.tsdb.retention.time or use remote storage solutions.
- Alertmanager Clustering: For high availability, run Alertmanager as a cluster. Peers are joined with the --cluster.* command-line flags (e.g., --cluster.peer) rather than in alertmanager.yml, and Prometheus should be configured to send alerts to every Alertmanager instance.
- Scrape Interval vs Evaluation Interval: The scrape interval determines how often Prometheus fetches data. The evaluation interval determines how often Prometheus re-evaluates alerting and recording rules. They don't have to be the same, but evaluating rules much more often than targets are scraped only re-reads the same samples.
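The grouping gotcha is easy to see in a small model. The Python sketch below imitates sum and sum by (...) over a handful of series; the sum_by helper and the SERIES values are hypothetical illustration data, not output from a real server.

```python
from collections import defaultdict


def sum_by(series, by=()):
    """Sum series values, keeping only the labels named in `by` as the group key."""
    totals = defaultdict(float)
    for labels, value in series:
        group = tuple((name, labels.get(name, "")) for name in by)
        totals[group] += value
    return dict(totals)


# Hypothetical per-second CPU rates for two hosts with two cores each.
SERIES = [
    ({"instance": "a:9100", "cpu": "0"}, 0.25),
    ({"instance": "a:9100", "cpu": "1"}, 0.5),
    ({"instance": "b:9100", "cpu": "0"}, 1.0),
    ({"instance": "b:9100", "cpu": "1"}, 0.25),
]
```

sum_by(SERIES) returns one total under an empty group key, mirroring a bare sum(...), while sum_by(SERIES, by=("instance",)) returns one total per instance, mirroring sum by (instance) (...).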