check_patroni

Usage: check_patroni [OPTIONS] COMMAND [ARGS]...

  Nagios plugin for patroni.

Options:
  --config FILE         Read option defaults from the specified INI file
                        [default: config.ini]
  -e, --endpoints TEXT  API endpoint. Can be specified multiple times.
                        [default: http://127.0.0.1:8008]
  --cert_file TEXT      File with the client certificate.
  --key_file TEXT       File with the client key.
  --ca_file TEXT        The CA certificate.
  -v, --verbose         Increase verbosity -v (info)/-vv (warning)/-vvv
                        (debug)
  --version
  --timeout INTEGER     Timeout in seconds for the API queries (0 to disable)
                        [default: 2]
  --help                Show this message and exit.

Commands:
  cluster_config_has_changed  Check if the hash of the configuration has...
  cluster_has_leader          Check if the cluster has a leader.
  cluster_has_replica         Check if the cluster has healthy replicas.
  cluster_is_in_maintenance   Check if the cluster is in maintenance mode...
  cluster_node_count          Count the number of nodes in the cluster.
  node_is_alive               Check if the node is alive ie patroni is...
  node_is_pending_restart     Check if the node is in pending restart state.
  node_is_primary             Check if the node is the primary with the...
  node_is_replica             Check if the node is a running replica with...
  node_patroni_version        Check if the version is equal to the input
  node_tl_has_changed         Check if the timeline has changed.

Install

Installation from the git repository:

$ git clone https://github.com/dalibo/check_patroni

Check out another branch if necessary. Then create a dedicated virtual environment and install check_patroni, along with its dependencies, from the repo:

$ cd check_patroni
$ python3 -m venv .venv
$ . .venv/bin/activate
(.venv) $ pip3 install .
(.venv) $ pip3 install .[dev]   # for dev purposes
(.venv) $ pip3 install .[test]  # for testing purposes
(.venv) $ check_patroni

To quit this env and destroy it:

$ deactivate
$ rm -r .venv

Links:

Config file

All global and service-specific parameters can be specified via a config file as follows:

[options]
endpoints = https://10.20.199.3:8008, https://10.20.199.4:8008,https://10.20.199.5:8008
cert_file = ./ssl/my-cert.pem
key_file = ./ssl/my-key.pem
ca_file = ./ssl/CA-cert.pem
timeout = 0

[options.node_is_replica]
lag=100
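
As a rough illustration of how such a file can be read, the sketch below uses Python's standard `configparser`. It is not check_patroni's own loader: the `load_defaults` helper is hypothetical, and both the layering of `[options.<command>]` over `[options]` and the splitting of `endpoints` on commas are assumptions inferred from the sample above.

```python
import configparser

SAMPLE = """
[options]
endpoints = https://10.20.199.3:8008, https://10.20.199.4:8008,https://10.20.199.5:8008
timeout = 0

[options.node_is_replica]
lag=100
"""


def load_defaults(text: str, command: str) -> dict:
    """Hypothetical helper: merge global defaults with one command's section."""
    parser = configparser.ConfigParser()
    parser.read_string(text)

    defaults: dict = {}
    if parser.has_section("options"):
        defaults.update(parser["options"])

    # Assumption: service-specific values override the global ones.
    section = f"options.{command}"
    if parser.has_section(section):
        defaults.update(parser[section])

    # Assumption: endpoints are comma separated, surrounding spaces ignored.
    if "endpoints" in defaults:
        defaults["endpoints"] = [
            item.strip() for item in defaults["endpoints"].split(",") if item.strip()
        ]
    return defaults


if __name__ == "__main__":
    print(load_defaults(SAMPLE, "node_is_replica"))
```

Note that `configparser` returns every value as a string, so `timeout` and `lag` would still need converting before use.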

Thresholds

The format for the threshold parameters is [@][start:][end].

  • start: may be omitted if start == 0
  • ~: means that start is negative infinity
  • If end is omitted, infinity is assumed
  • To invert the match condition, prefix the range expression with @.

A match is found when: start <= VALUE <= end.

For example, the following command will raise:

  • a warning if there are fewer than 2 healthy replicas, which can be translated to outside of range [2;+INF[
  • a critical if there are no healthy replicas, which can be translated to outside of range [1;+INF[

check_patroni -e https://10.20.199.3:8008 cluster_has_replica --warning 2: --critical 1:
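
The range rules above can be sketched in a few lines of Python. This is a minimal illustration of the `[@][start:][end]` format as described in this section, not the parser check_patroni uses; `Range`, `parse_range` and `status` are hypothetical names.

```python
import math
from typing import NamedTuple, Optional


class Range(NamedTuple):
    start: float
    end: float
    invert: bool

    def matches(self, value: float) -> bool:
        """True when start <= value <= end, or the opposite if inverted."""
        inside = self.start <= value <= self.end
        return inside != self.invert


def parse_range(spec: str) -> Range:
    """Parse a threshold written as [@][start:][end]."""
    invert = spec.startswith("@")
    if invert:
        spec = spec[1:]

    if ":" in spec:
        start_text, end_text = spec.split(":", 1)
    else:
        start_text, end_text = "", spec  # start omitted, so it defaults to 0

    if start_text == "~":
        start = -math.inf
    elif start_text == "":
        start = 0.0
    else:
        start = float(start_text)

    end = float(end_text) if end_text else math.inf  # end omitted: infinity
    if start > end:
        raise ValueError(f"start must not exceed end in {spec!r}")
    return Range(start, end, invert)


def status(value: float, warning: Optional[str] = None,
           critical: Optional[str] = None) -> str:
    """An alert is raised when the value does not match the range."""
    if critical is not None and not parse_range(critical).matches(value):
        return "CRITICAL"
    if warning is not None and not parse_range(warning).matches(value):
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    for replicas in (2, 1, 0):
        print(replicas, status(replicas, warning="2:", critical="1:"))
```

With `--warning 2: --critical 1:`, two healthy replicas give OK, one gives WARNING and none gives CRITICAL.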

Cluster services

cluster_config_has_changed

Usage: check_patroni cluster_config_has_changed [OPTIONS]

  Check if the hash of the configuration has changed.

  Note: either a hash or a state file must be provided for this service to
  work.

  Check:
  * `OK`: The hash didn't change
  * `CRITICAL`: The hash of the configuration has changed compared to the input (`--hash`) or last time (`--state-file`)

  Perfdata:
  * `is_configuration_changed` is 1 if the configuration has changed

Options:
  --hash TEXT            A hash to compare with.
  -s, --state-file TEXT  A state file to store the hash of the configuration.
  --help                 Show this message and exit.
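
The comparison against a fixed hash or a state file could look like the sketch below. It is only an illustration of the idea: the helper names are hypothetical, SHA-256 over sorted JSON is an assumed hashing scheme (this document does not say which one check_patroni uses), and treating a missing state file as "unchanged" is also an assumption.

```python
import hashlib
import json
from pathlib import Path
from typing import Optional


def config_hash(config: dict) -> str:
    """Assumed scheme: SHA-256 of the configuration serialized with sorted keys."""
    canonical = json.dumps(config, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()


def config_has_changed(config: dict,
                       state_file: Optional[Path] = None,
                       expected_hash: Optional[str] = None) -> int:
    """Return 1 if the configuration changed, 0 otherwise."""
    if state_file is None and expected_hash is None:
        raise ValueError("either a hash or a state file must be provided")

    current = config_hash(config)
    if expected_hash is not None:
        return int(current != expected_hash)

    # Compare with the hash saved by the previous run, then save the new one.
    previous = state_file.read_text().strip() if state_file.exists() else None
    state_file.write_text(current)
    return int(previous is not None and previous != current)
```

Because the state file is rewritten on every run, a change would be reported once and the following run would return to 0.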

cluster_has_leader

Usage: check_patroni cluster_has_leader [OPTIONS]

  Check if the cluster has a leader.

  Note: there is no difference between a normal and standby leader.

  Check:
  * `OK`: if there is a leader node.
  * `CRITICAL`: otherwise

  Perfdata: `has_leader` is 1 if there is a leader node, 0 otherwise

Options:
  --help  Show this message and exit.

cluster_has_replica

Usage: check_patroni cluster_has_replica [OPTIONS]

  Check if the cluster has healthy replicas.

  A healthy replica:
  * is in running state
  * has a replica role
  * has a lag lower than or equal to max_lag

  Check:
  * `OK`: if the healthy_replica count and their lag are compatible with the replica count threshold.
  * `WARNING` / `CRITICAL`: otherwise

  Perfdata:
  * healthy_replica & unhealthy_replica count
  * the lag of each replica labelled with  "member name"_lag

Options:
  -w, --warning TEXT   Warning threshold for the number of healthy replica
                       nodes.
  -c, --critical TEXT  Critical threshold for the number of healthy replica
                       nodes.
  --max-lag TEXT       maximum allowed lag
  --help               Show this message and exit.
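
To make the "healthy replica" definition concrete, here is a small sketch that classifies cluster members. It assumes each member is a dictionary with `name`, `role`, `state` and `lag` keys, as in the member list returned by Patroni's `/cluster` endpoint; `split_replicas` is a hypothetical helper, not part of check_patroni.

```python
from typing import Iterable, List, Mapping, Optional, Tuple


def split_replicas(members: Iterable[Mapping],
                   max_lag: Optional[int] = None) -> Tuple[List[str], List[str]]:
    """Return the names of healthy and unhealthy replicas."""
    healthy: List[str] = []
    unhealthy: List[str] = []
    for member in members:
        if member.get("role") != "replica":
            continue  # leaders are not counted as replicas

        lag = member.get("lag")
        # Without max_lag the lag is ignored; a non-numeric lag counts as too high.
        lag_ok = max_lag is None or (isinstance(lag, int) and lag <= max_lag)

        if member.get("state") == "running" and lag_ok:
            healthy.append(member["name"])
        else:
            unhealthy.append(member["name"])
    return healthy, unhealthy


if __name__ == "__main__":
    sample = [
        {"name": "srv1", "role": "leader", "state": "running"},
        {"name": "srv2", "role": "replica", "state": "running", "lag": 0},
        {"name": "srv3", "role": "replica", "state": "running", "lag": 5000},
        {"name": "srv4", "role": "replica", "state": "stopped", "lag": "unknown"},
    ]
    print(split_replicas(sample, max_lag=100))
```

The number of healthy replicas is then what the `--warning` and `--critical` thresholds are compared against.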

cluster_is_in_maintenance

Usage: check_patroni cluster_is_in_maintenance [OPTIONS]

  Check if the cluster is in maintenance mode or paused.

  Check:
  * `OK`: If the cluster is not in maintenance mode.
  * `CRITICAL`: otherwise.

  Perfdata:
  * `is_in_maintenance` is 1 if the cluster is in maintenance mode, 0 otherwise

Options:
  --help  Show this message and exit.

cluster_node_count

Usage: check_patroni cluster_node_count [OPTIONS]

  Count the number of nodes in the cluster.

  Check:
  * Compares the number of nodes against the normal and running node warning and critical thresholds.
  * `OK`:  If they are not provided.

  Perfdata:
  * `members`: the member count.
  * all the roles of the nodes in the cluster with their number.
  * all the statuses of the nodes in the cluster with their number.

Options:
  -w, --warning TEXT       Warning threshold for the number of nodes.
  -c, --critical TEXT      Critical threshold for the number of nodes.
  --running-warning TEXT   Warning threshold for the number of running nodes.
  --running-critical TEXT  Critical threshold for the number of running nodes.
  --help                   Show this message and exit.
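
The perfdata described above amounts to a few counters over the member list. The sketch below shows one way to compute them, assuming members are dictionaries with `role` and `state` keys; the `role_` and `state_` label prefixes and the `node_count_perfdata` name are made up for the illustration.

```python
from collections import Counter
from typing import Dict, List, Mapping


def node_count_perfdata(members: List[Mapping]) -> Dict[str, int]:
    """Count members overall, per role and per state."""
    perfdata = {"members": len(members)}
    roles = Counter(member["role"] for member in members)
    states = Counter(member["state"] for member in members)
    # Hypothetical label scheme: prefix each counter with its kind.
    perfdata.update({f"role_{role}": count for role, count in roles.items()})
    perfdata.update({f"state_{state}": count for state, count in states.items()})
    return perfdata


if __name__ == "__main__":
    print(node_count_perfdata([
        {"role": "leader", "state": "running"},
        {"role": "replica", "state": "running"},
        {"role": "replica", "state": "stopped"},
    ]))
```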

Node services

node_is_alive

Usage: check_patroni node_is_alive [OPTIONS]

  Check if the node is alive, i.e. patroni is running.

  Check:
  * `OK`: If patroni is running.
  * `CRITICAL`: otherwise.

  Perfdata:
  * `is_running` is 1 if patroni is running, 0 otherwise

Options:
  --help  Show this message and exit.

node_is_pending_restart

Usage: check_patroni node_is_pending_restart [OPTIONS]

  Check if the node is in pending restart state.

  This situation can arise if the configuration has been modified but requires
  a restart of PostgreSQL to take effect.

  Check:
  * `OK`: if the node has no pending restart tag.
  * `CRITICAL`: otherwise

  Perfdata: `is_pending_restart` is 1 if the node has pending restart tag, 0
  otherwise.

Options:
  --help  Show this message and exit.

node_is_primary

Usage: check_patroni node_is_primary [OPTIONS]

  Check if the node is the primary with the leader lock.

  Check:
  * `OK`: if the node is a primary with the leader lock.
  * `CRITICAL`: otherwise

  Perfdata: `is_primary` is 1 if the node is a primary with the leader lock, 0
  otherwise.

Options:
  --help  Show this message and exit.

node_is_replica

Usage: check_patroni node_is_replica [OPTIONS]

  Check if the node is a running replica with no noloadbalance tag.

  Check:
  * `OK`: if the node is a running replica with no noloadbalance tag and the lag is under the maximum threshold.
  * `CRITICAL`:  otherwise

  Perfdata: `is_replica` is 1 if the node is a running replica with no
  noloadbalance tag and the lag is under the maximum threshold, 0 otherwise.

Options:
  --max-lag TEXT  maximum allowed lag
  --help          Show this message and exit.
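
Patroni's REST API exposes a `/replica` health check that answers with HTTP 200 only for a running replica without the noloadbalance tag, and accepts an optional `lag` query parameter. The sketch below shows how a check like this one could be expressed on top of that endpoint. Whether check_patroni queries it this way is an assumption, and both helper names are hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode


def replica_check_url(endpoint: str, max_lag: Optional[str] = None) -> str:
    """Build the URL of Patroni's replica health check for one endpoint."""
    url = endpoint.rstrip("/") + "/replica"
    if max_lag is not None:
        url += "?" + urlencode({"lag": max_lag})
    return url


def is_replica(status_code: int) -> int:
    """Perfdata value: 1 when the health check answered 200, 0 otherwise."""
    return 1 if status_code == 200 else 0


if __name__ == "__main__":
    print(replica_check_url("http://127.0.0.1:8008", max_lag="100"))
```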

node_patroni_version

Usage: check_patroni node_patroni_version [OPTIONS]

  Check if the version is equal to the input

  Check:
  * `OK`: The version is the same as the input `--patroni-version`
  * `CRITICAL`: otherwise.

  Perfdata:
  * `is_version_ok` is 1 if version is ok, 0 otherwise

Options:
  --patroni-version TEXT  Patroni version to compare to  [required]
  --help                  Show this message and exit.

node_tl_has_changed

Usage: check_patroni node_tl_has_changed [OPTIONS]

  Check if the timeline has changed.

  Note: either a timeline or a state file must be provided for this service to
  work.

  Check:
  * `OK`: The timeline is the same as last time (`--state-file`) or the given timeline (`--timeline`)
  * `CRITICAL`: The timeline is not the same.

  Perfdata:
  * `is_timeline_changed` is 1 if the timeline has changed, 0 otherwise
  * the timeline

Options:
  --timeline TEXT        A timeline number to compare with.
  -s, --state-file TEXT  A state file to store the last tl number into.
  --help                 Show this message and exit.

Test

The pytest tests are in ./test and use a mocker to provide a JSON response instead of having to call the patroni API.
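
The general idea of replacing the API call with a canned JSON response can be sketched with the standard library's `unittest.mock`. This is not taken from the project's test suite: `fetch_cluster` and `fetch_with_canned_response` are hypothetical, and the real tests may rely on a different mocking tool and HTTP client.

```python
import io
import json
import urllib.request
from unittest import mock


def fetch_cluster(endpoint: str) -> dict:
    """Hypothetical client code: read the cluster description from the API."""
    with urllib.request.urlopen(f"{endpoint}/cluster") as response:
        return json.load(response)


def fetch_with_canned_response(payload: dict):
    """Run fetch_cluster with urlopen replaced by an in-memory JSON body."""
    body = io.BytesIO(json.dumps(payload).encode())
    with mock.patch("urllib.request.urlopen", return_value=body) as fake:
        result = fetch_cluster("http://127.0.0.1:8008")
    # Also return the URL that would have been requested.
    return result, fake.call_args[0][0]


def test_fetch_cluster_uses_canned_json():
    payload = {"members": [{"name": "srv1", "role": "leader"}]}
    result, url = fetch_with_canned_response(payload)
    assert result == payload
    assert url == "http://127.0.0.1:8008/cluster"
```

No network access is needed: the patched `urlopen` returns the prepared body, so the test runs the same way on any machine.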

A Vagrant file is available to create an Icinga / opm / Grafana stack and install check_patroni. You can then add a server to the monitoring and watch the graphs in Grafana. It's in ./test/vagrant.

A Vagrant file can be found in [this repository](https://github.com/ioguix/vagrant-patroni) to generate a patroni/etcd setup.