Prometheus

-open source
-metric-based monitoring system
-does one thing and does it well
-a simple text format makes it easy to expose metrics to Prometheus
-the data model identifies each time series by a name plus an unordered set of key-value pairs called labels
-scraped data is stored in a local time-series database
-no built-in notification delivery (that is Alertmanager's job)
-Caution! -> If you need 100% accuracy, such as for per-request billing, Prometheus is not a good choice, as the collected data will likely not be detailed and complete enough.
-pull-based system
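A minimal sketch of the "unordered set of labels" idea from the data model above. This is an illustration only, not how Prometheus stores series internally; the metric name and label values are made up.

```python
from typing import Dict, FrozenSet, Tuple

SeriesKey = Tuple[str, FrozenSet[Tuple[str, str]]]


def series_key(metric: str, labels: Dict[str, str]) -> SeriesKey:
    """Identity of a time series: metric name plus an unordered label set."""
    return metric, frozenset(labels.items())


# Hypothetical metric and labels, used only for illustration.
a = series_key("http_requests_total", {"method": "GET", "code": "200"})
b = series_key("http_requests_total", {"code": "200", "method": "GET"})
c = series_key("http_requests_total", {"code": "500", "method": "GET"})

print(a == b)  # same labels in a different order: same series
print(a == c)  # one label value differs: a different series
```

Because the label set is unordered, the order in which labels are written never creates a new series, but every new label value does.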


1. The PromQL expression language allows easy metric selection and aggregation. It is used to:

 - create graphs
 - set alert rules
 - expose data
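A small sketch of selection and aggregation in practice: a PromQL expression sent to the server's HTTP query endpoint (/api/v1/query). The metric http_requests_total and the server address are assumptions for illustration; the code only builds the URL and does not contact a server.

```python
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Per-job request rate over the last 5 minutes (hypothetical metric name).
expr = "sum by (job) (rate(http_requests_total[5m]))"
url = instant_query_url("http://localhost:9090", expr)
print(url)
```

The same expression could be pasted into the expression browser on port 9090 or used as a Grafana panel query.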


2. Architecture



3. How to gather data?

-pull-based system
-metrics are regular plain text exposed over HTTP
-metrics exposition format


-each record consists of a metric name, labels and a value
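A sample line in the text exposition format has the shape name{label="value",...} value. The sketch below renders such lines; the helper names and the sample metric are hypothetical, and timestamps, HELP and TYPE lines are left out for brevity.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name, labels, value):
    """Render one sample as a line of the text exposition format."""
    if not labels:
        return f"{name} {value}"
    pairs = ",".join(
        '{}="{}"'.format(key, escape_label_value(str(val)))
        for key, val in sorted(labels.items())
    )
    return name + "{" + pairs + "} " + str(value)


line = render_sample("http_requests_total", {"method": "GET", "code": "200"}, 1027)
print(line)
```

Labels are sorted here only to get stable output; the format itself does not require any particular order.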

-embed into software
  *official client libraries:
    ** Go
    ** Java or Scala
    ** Python
    ** Ruby

  *unofficial third-party client libraries:
    ** Bash
    ** C++
    ** Common Lisp
    ** Elixir
    ** Erlang
    ** Haskell
    ** Lua for Nginx
    ** Lua for Tarantool
    ** .NET / C#
    ** Node.js
    ** PHP
    ** Rust
-or use metrics exporters

## Core components starting at 9090

* 9090 - Prometheus server
* 9091 - Pushgateway
* 9093 - Alertmanager
* 9094 - Alertmanager clustering

## Exporters starting at 9100

* 9100 - Node exporter
* 9101 - HAProxy exporter
* 9102 - StatsD exporter
* 9103 - Collectd exporter
* 9108 - Graphite exporter
* 9110 - Blackbox exporter

-sample exporter in Python with 2 metrics (build runtime and timestamp)
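A possible shape for such an exporter, sketched with the Python standard library only (the official client library would normally do this work). The metric names, the build values and port 8000 are assumptions, not taken from the original talk.

```python
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical values; a real exporter would read them from the build system.
LAST_BUILD = {"runtime_seconds": 42.5, "finished_at": time.time()}


def render_metrics(build):
    """Render the two gauges in the text exposition format."""
    lines = [
        "# HELP build_runtime_seconds Runtime of the last build in seconds.",
        "# TYPE build_runtime_seconds gauge",
        f"build_runtime_seconds {build['runtime_seconds']}",
        "# HELP build_last_finished_timestamp_seconds Unix time when the last build finished.",
        "# TYPE build_last_finished_timestamp_seconds gauge",
        f"build_last_finished_timestamp_seconds {build['finished_at']}",
    ]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(LAST_BUILD).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Serve /metrics until interrupted (port 8000 is an arbitrary choice)."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


print(render_metrics(LAST_BUILD), end="")
# serve()  # uncomment to expose the metrics on http://localhost:8000/metrics
```

With serve() running, a scrape job pointing at localhost:8000 would pick up both gauges on every scrape interval.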



4. Visualization

a) Promdash

b) Grafana
-full support for PromQL
-Prometheus Integration:
  *datasource support
  *Prometheus dashboard
  *PromQL autocomplete
  *Alerts

5. Prometheus Alerts

Alertmanager
-Alertmanager handles alerts sent by client applications such as Prometheus, Grafana, etc.

-Functions:
  *deduplication
  *grouping
  *routing
  *sending
  *silencing
  *inhibition - making one alert dependent on another (an alert is suppressed while a related one is firing)
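To make deduplication and grouping more concrete, here is a toy sketch. It is not Alertmanager's implementation: alerts are plain label dictionaries, the alert names are invented, and routing, silencing and inhibition are ignored.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Drop alerts with identical label sets, then group by the given labels."""
    groups = defaultdict(dict)
    for labels in alerts:
        group_key = tuple((name, labels.get(name, "")) for name in group_by)
        fingerprint = frozenset(labels.items())
        groups[group_key][fingerprint] = labels  # identical label sets collapse
    return {key: list(members.values()) for key, members in groups.items()}


# Hypothetical alerts; the second entry duplicates the first.
alerts = [
    {"alertname": "InstanceDown", "instance": "node1:9100", "severity": "critical"},
    {"alertname": "InstanceDown", "instance": "node1:9100", "severity": "critical"},
    {"alertname": "InstanceDown", "instance": "node2:9100", "severity": "critical"},
    {"alertname": "DiskFull", "instance": "node1:9100", "severity": "warning"},
]

grouped = group_alerts(alerts, group_by=["alertname"])
for key, members in grouped.items():
    print(key, len(members))
```

Each group would then become a single notification, so one outage affecting many instances produces one message instead of one per instance.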

-Alertmanager supports a mesh configuration to create a cluster for high availability. Warning: high availability is under active development.


6. Installation

Method

Recommended
-source
-pre-compiled binary
-docker container

Don't do this
-apt-get install prometheus
-yum install prometheus
-any installation from package


Binary

cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.2.0/prometheus-2.2.0.linux-amd64.tar.gz
tar -xzf prometheus-2.2.0.linux-amd64.tar.gz

sudo chmod +x prometheus-2.2.0.linux-amd64/{prometheus,promtool}
sudo cp prometheus-2.2.0.linux-amd64/{prometheus,promtool} /usr/local/bin
sudo chown root:root /usr/local/bin/{prometheus,promtool}

sudo mkdir -p /etc/prometheus
sudo vim /etc/prometheus/prometheus.yml
promtool check config /etc/prometheus/prometheus.yml
Checking /etc/prometheus/prometheus.yml
SUCCESS: 0 rule files found

prometheus --config.file="/etc/prometheus/prometheus.yml" &

Repeat for every component (prometheus, alertmanager, node_exporter, blackbox_exporter, *_exporter) on multiple nodes, every month or so.

Problems
-too many operations
-won't survive reboot
-no dedicated user
-try changing config
-troublesome upgrade
-SELinux anyone?

Manage (aka why Ansible?) - Cloud Alchemy
https://github.com/cloudalchemy/ansible-prometheus

Goals for Ansible roles
-zero-configuration deployment
-easy management of multiple nodes
-error checking
-multiple CPU architecture support


Where is my config?
-command line parameters
-main configuration file (in YAML)
-files included from the main file (e.g. alert rules or file_sd config for file-based service discovery)
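File-based service discovery reads target lists from separate files, in JSON as well as YAML. The sketch below generates such a list as JSON; the host names, the label and the idea of writing it to a node.json file are assumptions for illustration.

```python
import json


def file_sd_document(hosts, port=9100, labels=None):
    """Build a file_sd target list: one group of targets with optional labels."""
    group = {"targets": [f"{host}:{port}" for host in hosts]}
    if labels:
        group["labels"] = dict(labels)
    return json.dumps([group], indent=2) + "\n"


# Hypothetical hosts running node_exporter on its default port.
doc = file_sd_document(["node1.example.org", "node2.example.org"], labels={"env": "demo"})
print(doc, end="")
```

A script like this could regenerate the file from an inventory; Prometheus then picks up the changed target list without a restart.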


Main config

a) Prometheus

global:
  evaluation_interval: 15s
  scrape_interval: 15s
  scrape_timeout: 10s

  external_labels:
    environment: localhost.localdomain

scrape_configs:
  - job_name: "prometheus"
    metrics_path: "/metrics"
    static_configs:
    - targets:
      - "localhost:9090"
  - job_name: node
    file_sd_configs:
    - files:
      - "/etc/prometheus/file_sd/node.yml"

b) Ansible



Main config (extended)

a) Prometheus

global:
  evaluation_interval: 15s
  scrape_interval: 15s
  scrape_timeout: 10s

  external_labels:
    environment: localhost.localdomain

rule_files:
  - /etc/prometheus/rules/*.rules

alerting:
  alertmanagers:
  - scheme: http
    static_configs:
    - targets:
      - localhost:9093

scrape_configs:
  - job_name: "prometheus"
    metrics_path: "/metrics"
    static_configs:
    - targets:
      - "localhost:9090"
  - job_name: node
    file_sd_configs:
    - files:
      - "/etc/prometheus/file_sd/node.yml"

b) Ansible

prometheus_alertmanager_config:
  - scheme: http
    static_configs:
      - targets:
        - "localhost:9093"

prometheus_scrape_configs:
- job_name: "node"
  file_sd_configs:
  - files:
    - "/etc/prometheus/file_sd/node.yml"

prometheus_targets:
  node:
    - targets:
      - "localhost:9100"


Command line parameters

#Ansible managed file. Be wary of possible overwrites.
[Unit]
Description=Prometheus
After=network.target

[Service]
Type=simple
Environment="GOMAXPROCS=1"
User=prometheus
Group=prometheus
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention=30d \
  --web.listen-address=0.0.0.0:9090 \
  --web.external-url=http://demo.cloudalchemy.org:9090

SyslogIdentifier=prometheus
Restart=always

[Install]
WantedBy=multi-user.target


Failures when changing configuration:
-preflight checks included in the role use 'promtool' in the Ansible 'validate' directive (the new file is validated before it replaces the old one)
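The validate-then-replace idea can be sketched outside Ansible as well. This is an assumption-laden illustration, not the role's code: install_config is a hypothetical helper, and the validator is passed in as a command so that `promtool check config` can be swapped for anything else.

```python
import os
import subprocess
import tempfile

# Validator command; the candidate file path is appended as the last argument.
PROMTOOL = ["promtool", "check", "config"]


def install_config(new_text, dest, validator):
    """Write new_text next to dest, validate it, and only then replace dest."""
    directory = os.path.dirname(os.path.abspath(dest))
    # Same directory as dest, so relative includes resolve and the rename is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".yml")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(new_text)
        result = subprocess.run(validator + [tmp_path], capture_output=True, text=True)
        if result.returncode != 0:
            return False  # the running configuration stays untouched
        os.replace(tmp_path, dest)
        return True
    finally:
        # Clean up the candidate file if it was rejected or something failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

A call such as install_config(text, "/etc/prometheus/prometheus.yml", PROMTOOL) would leave the old file in place whenever promtool rejects the new one, which is the failure the preflight check is meant to prevent.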


Gathering system metrics from many nodes with multiple CPU architectures?

node_exporter
-one binary
-simple configuration with cli flags
-Ansible bonuses:
  *versioning
  *system user management
  *CPU architecture auto-detection
  *systemd service files
  *Linux capabilities support (granting the service selected root privileges without running it as root)
  *basic SELinux support
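CPU architecture auto-detection boils down to mapping the machine type reported by the OS to the architecture suffix used in release archive names. A rough sketch follows; the mapping table is an assumption covering common cases and is not taken from the Ansible role.

```python
import platform

# Machine type as reported by the OS -> architecture suffix in release names.
GOARCH_BY_MACHINE = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
    "armv7l": "armv7",
    "armv6l": "armv6",
    "i386": "386",
    "i686": "386",
}


def release_archive(component, version, machine=None, system="linux"):
    """Return the archive name for this component, version and CPU."""
    machine = (machine or platform.machine()).lower()
    try:
        arch = GOARCH_BY_MACHINE[machine]
    except KeyError:
        raise ValueError(f"unsupported CPU architecture: {machine}") from None
    return f"{component}-{version}.{system}-{arch}.tar.gz"


print(release_archive("prometheus", "2.2.0", machine="x86_64"))
```

Leaving machine unset makes the function detect the local CPU, which is the per-node decision the role automates.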


Example
-demo.cloudalchemy.org
-daily Ansible deploy with Travis CI

Resources:
-presentation.cloudalchemy.org
-github.com/cloudalchemy
-prometheus.io/docs
-www.safaribookonline.com/library/view/prometheus-up/9781492034131
-prometheus.io/webtools/alerting/routing-tree-editor
-prometheusbook.com

Better performance in Prometheus 2.x

InfluxDB works well with it -> remote write and read

Pushgateway - for short-lived systems (jobs that do not live long enough to be scraped)
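A short-lived job can push its metrics to the Pushgateway over HTTP before it exits. The sketch below only builds the request; the job name and metric are hypothetical, and the address assumes a Pushgateway on its default port 9091.

```python
import urllib.request
from urllib.parse import quote


def build_push_request(gateway, job, metrics_text):
    """Build a PUT request that replaces all metrics pushed for this job."""
    url = f"{gateway.rstrip('/')}/metrics/job/{quote(job, safe='')}"
    return urllib.request.Request(
        url,
        data=metrics_text.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


request = build_push_request(
    "http://localhost:9091",
    "nightly_backup",  # hypothetical job name
    "backup_last_success_timestamp_seconds 1520000000\n",
)
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would send it to a running Pushgateway.
```

Prometheus then scrapes the Pushgateway like any other target, so the value stays available after the job itself has finished.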

service discovery




https://www.robustperception.io/tag/prometheus


https://www.youtube.com/watch?v=cNjKWOk4YPU
https://prometheus.io/docs/prometheus/latest/querying/basics/
