
# check_patroni
A nagios plugin for patroni.
## Features
- Check presence of leader, replicas, node counts.
- Check each node for replication status.
Usage: check_patroni [OPTIONS] COMMAND [ARGS]...
Nagios plugin that uses Patroni's REST API to monitor a Patroni cluster.
--config FILE Read option defaults from the specified INI file
[default: config.ini]
-e, --endpoints TEXT Patroni API endpoint. Can be specified multiple times
or as a list of comma separated addresses. The node
services checks the status of one node, therefore if
several addresses are specified they should point to
different interfaces on the same node. The cluster
services check the status of the cluster, therefore
it's better to give a list of all Patroni node
addresses. [default:]
--cert_file PATH File with the client certificate.
--key_file PATH File with the client key.
--ca_file PATH The CA certificate.
-v, --verbose Increase verbosity -v (info)/-vv (warning)/-vvv
--timeout INTEGER Timeout in seconds for the API queries (0 to disable)
[default: 2]
--help Show this message and exit.
cluster_config_has_changed Check if the hash of the configuration...
cluster_has_leader Check if the cluster has a leader.
cluster_has_replica Check if the cluster has healthy replicas...
cluster_has_scheduled_action Check if the cluster has a scheduled...
cluster_is_in_maintenance Check if the cluster is in maintenance...
cluster_node_count Count the number of nodes in the cluster.
node_is_alive Check if the node is alive ie patroni is...
node_is_leader Check if the node is a leader node.
node_is_pending_restart Check if the node is in pending restart...
node_is_primary Check if the node is the primary with the...
node_is_replica Check if the node is a running replica...
node_patroni_version Check if the version is equal to the input
node_tl_has_changed Check if the timeline has changed.
## Install
check_patroni is licensed under the PostgreSQL license.
$ pip install git+
check_patroni works on Python 3.6. We keep it that way because Patroni also
supports it and there are still lots of RH 7 variants around. That being said,
Python 3.6 has been EOL for ages and there is no support for it in the github
## Support
If you hit a bug or need help, open a GitHub issue. Dalibo has no
commitment on response time for public free support. Thanks for your
contribution!
## Config file
All global and service-specific parameters can be specified via a config file as follows:
endpoints =,,
cert_file = ./ssl/my-cert.pem
key_file = ./ssl/my-key.pem
ca_file = ./ssl/CA-cert.pem
timeout = 0
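
To illustrate how such a file can be consumed, here is a minimal Python sketch that reads the same keys with the standard `configparser` module. The `[options]` section name, the example endpoint URLs and the `load_defaults` helper are assumptions made for this example; they are not taken from the snippet above and do not describe how check_patroni itself parses the file.

```python
import configparser

# Hypothetical file content: the section name and the URLs are placeholders.
SAMPLE = """
[options]
endpoints = https://p1.example.com:8008, https://p2.example.com:8008
cert_file = ./ssl/my-cert.pem
key_file = ./ssl/my-key.pem
ca_file = ./ssl/CA-cert.pem
timeout = 0
"""


def load_defaults(text: str, section: str = "options") -> dict:
    """Return the option defaults found in an INI document."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    opts = parser[section]
    # Endpoints are a comma separated list, as for the -e option.
    endpoints = [e.strip() for e in opts.get("endpoints", "").split(",") if e.strip()]
    return {
        "endpoints": endpoints,
        "cert_file": opts.get("cert_file"),
        "key_file": opts.get("key_file"),
        "ca_file": opts.get("ca_file"),
        # Fall back to the documented default of 2 seconds.
        "timeout": opts.getint("timeout", fallback=2),
    }


defaults = load_defaults(SAMPLE)
```

Options given on the command line would normally override these values; the file only provides defaults.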
## Thresholds
The format for the threshold parameters is `[@][start:][end]`.
* `start:` may be omitted if `start == 0`
* `~:` means that start is negative infinity
* If `end` is omitted, infinity is assumed
* To invert the match condition, prefix the range expression with `@`.
A match is found when: `start <= VALUE <= end`.
For example, the following command will raise:
* a warning if there are fewer than 2 healthy replicas, which translates to a value outside of the range [2;+INF[
* a critical if there are no healthy replicas, which translates to a value outside of the range [1;+INF[
check_patroni -e cluster_has_replica --warning 2: --critical 1:
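
The range syntax above can be sketched in a few lines of Python. This is an illustration of the `[@][start:][end]` rules only, not the parser used by check_patroni; the `Threshold` class and the `status` helper are hypothetical names, and raising an alert when the value does not match the range is the usual Nagios convention, assumed here.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    start: float = 0.0
    end: float = math.inf
    invert: bool = False

    @classmethod
    def parse(cls, spec: str) -> "Threshold":
        """Parse a `[@][start:][end]` range expression."""
        invert = spec.startswith("@")
        body = spec[1:] if invert else spec
        if ":" in body:
            start_text, end_text = body.split(":", 1)
        else:
            # No colon: the whole expression is `end`, start defaults to 0.
            start_text, end_text = "", body
        if start_text == "~":
            start = -math.inf
        elif start_text == "":
            start = 0.0
        else:
            start = float(start_text)
        end = float(end_text) if end_text else math.inf
        if start > end:
            raise ValueError(f"start must not exceed end in {spec!r}")
        return cls(start, end, invert)

    def matches(self, value: float) -> bool:
        """True when start <= value <= end, inverted if `@` was given."""
        inside = self.start <= value <= self.end
        return not inside if self.invert else inside


def status(value: float, warning: str, critical: str) -> str:
    """Assumed convention: alert when the value does not match the range."""
    if not Threshold.parse(critical).matches(value):
        return "CRITICAL"
    if not Threshold.parse(warning).matches(value):
        return "WARNING"
    return "OK"
```

With `--warning 2: --critical 1:`, `status(2, "2:", "1:")` evaluates to `"OK"`, `status(1, "2:", "1:")` to `"WARNING"` and `status(0, "2:", "1:")` to `"CRITICAL"`.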
## SSL
Several options are available:
* the server's CA certificate is not available or trusted by the client system:
* `--ca_file`: your certification chain `cat CA-certificate server-certificate > cabundle`
* you have a client certificate for authenticating with Patroni's REST API:
* `--cert_file`: your certificate or the concatenation of your certificate and private key
* `--key_file`: your private key (optional)
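
The role of each file can be illustrated with Python's standard `ssl` module. This is a sketch under stated assumptions, not the plugin's own code: `build_ssl_context` is a hypothetical helper, and check_patroni may hand these paths to its HTTP client differently.

```python
import ssl
from typing import Optional


def build_ssl_context(
    ca_file: Optional[str] = None,
    cert_file: Optional[str] = None,
    key_file: Optional[str] = None,
) -> ssl.SSLContext:
    """Build a client-side TLS context from the three optional files."""
    # ca_file: bundle used to verify the server. When omitted, the
    # system's trusted CA certificates are used instead.
    context = ssl.create_default_context(cafile=ca_file)
    if cert_file is not None:
        # cert_file: client certificate. key_file may stay None when the
        # private key is concatenated into cert_file.
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return context
```

Server verification stays enabled in every case; the CA file only changes which authorities are trusted.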
## Cluster services
### cluster_config_has_changed
Usage: check_patroni cluster_config_has_changed [OPTIONS]
Check if the hash of the configuration has changed.
Note: either a hash or a state file must be provided for this service to work.
* `OK`: The hash didn't change
* `CRITICAL`: The hash of the configuration has changed compared to the input (`--hash`) or last time (`--state-file`)
* `is_configuration_changed` is 1 if the configuration has changed
--hash TEXT A hash to compare with.
-s, --state-file TEXT A state file to store the hash of the configuration.
--save Set the current configuration hash as the reference
for future calls.
--help Show this message and exit.
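
The state file workflow can be sketched as follows. Everything here is an assumption made for illustration: the hash function (SHA-256 over canonical JSON), the plain-text state file format, the behaviour on a first run and the helper names are not taken from check_patroni, which may compute and store the hash differently.

```python
import hashlib
import json
from pathlib import Path


def config_hash(config: dict) -> str:
    """Hash a configuration independently of key order (assumed scheme)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_configuration_changed(config: dict, state_file: Path, save: bool = False) -> int:
    """Return 1 if the hash differs from the stored reference, else 0."""
    current = config_hash(config)
    previous = state_file.read_text().strip() if state_file.exists() else None
    # Assumption: the first run stores a reference and reports no change.
    # Afterwards the reference only moves when `save` is requested.
    if save or previous is None:
        state_file.write_text(current + "\n")
    return int(previous is not None and previous != current)
```

Without `save`, a changed configuration keeps being reported on every call until the reference is updated, which mirrors the purpose of the `--save` option.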
### cluster_has_leader
Usage: check_patroni cluster_has_leader [OPTIONS]
Check if the cluster has a leader.
This check applies to any kind of leaders including standby leaders.
* `OK`: if there is a leader node.
* `CRITICAL`: otherwise
Perfdata: `has_leader` is 1 if there is a leader node, 0 otherwise
--help Show this message and exit.
### cluster_has_replica
Usage: check_patroni cluster_has_replica [OPTIONS]
Check if the cluster has healthy replicas and/or if some are sync standbies
A healthy replica:
* is in running or streaming state (V3.0.4)
* has a replica or sync_standby role
* has a lag lower than or equal to max_lag
* `OK`: if the healthy_replica count and their lag are compatible with the replica count threshold, and if the sync_replica count is compatible with the sync replica count threshold.
* `WARNING` / `CRITICAL`: otherwise
* healthy_replica & unhealthy_replica count
* the number of sync_replica, they are included in the previous count
* the lag of each replica labelled with "member name"_lag
* a boolean to tell if the node is a sync standby labelled with "member name"_sync
-w, --warning TEXT Warning threshold for the number of healthy replica
-c, --critical TEXT Critical threshold for the number of healthy replica
--sync-warning TEXT Warning threshold for the number of sync replica.
--sync-critical TEXT Critical threshold for the number of sync replica.
--max-lag TEXT maximum allowed lag
--help Show this message and exit.
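
The definition of a healthy replica can be expressed as a short Python sketch over a member list shaped like the `members` array of Patroni's `GET /cluster` response. The sample data, the `replica_perfdata` helper, the lag expressed in bytes and the choice to count sync replicas only among healthy ones are assumptions made for the example.

```python
from typing import Optional

HEALTHY_STATES = {"running", "streaming"}
REPLICA_ROLES = {"replica", "sync_standby"}


def replica_perfdata(members: list, max_lag: Optional[int] = None) -> dict:
    """Count healthy, unhealthy and sync replicas among cluster members."""
    perf = {"healthy_replica": 0, "unhealthy_replica": 0, "sync_replica": 0}
    for member in members:
        if member["role"] not in REPLICA_ROLES:
            continue  # leaders are not replicas
        lag = member.get("lag")
        healthy = (
            member["state"] in HEALTHY_STATES
            and isinstance(lag, int)  # a missing or "unknown" lag is unhealthy
            and (max_lag is None or lag <= max_lag)
        )
        if not healthy:
            perf["unhealthy_replica"] += 1
            continue
        perf["healthy_replica"] += 1
        is_sync = member["role"] == "sync_standby"
        perf["sync_replica"] += int(is_sync)
        perf[f"{member['name']}_lag"] = lag
        perf[f"{member['name']}_sync"] = int(is_sync)
    return perf


# Hypothetical cluster: one leader, two healthy replicas, two unhealthy ones.
members = [
    {"name": "srv1", "role": "leader", "state": "running"},
    {"name": "srv2", "role": "replica", "state": "streaming", "lag": 0},
    {"name": "srv3", "role": "sync_standby", "state": "streaming", "lag": 0},
    {"name": "srv4", "role": "replica", "state": "running", "lag": 5_000_000},
    {"name": "srv5", "role": "replica", "state": "start failed", "lag": "unknown"},
]
perf = replica_perfdata(members, max_lag=1_000_000)
```

The resulting counts are what the `--warning`, `--critical`, `--sync-warning` and `--sync-critical` thresholds would then be compared against.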
### cluster_has_scheduled_action
Usage: check_patroni cluster_has_scheduled_action [OPTIONS]
Check if the cluster has a scheduled action (switchover or restart)
* `OK`: If the cluster has no scheduled action
* `CRITICAL`: otherwise.
* `scheduled_actions` is 1 if the cluster has scheduled actions.
* `scheduled_switchover` is 1 if the cluster has a scheduled switchover.
* `scheduled_restart` counts the number of scheduled restart in the cluster.
--help Show this message and exit.
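
The three perfdata values can be derived from a cluster description as in the sketch below. It assumes a dictionary shaped like Patroni's `GET /cluster` response, with a top-level `scheduled_switchover` entry and a per-member `scheduled_restart` entry; these key names, the sample data and the `scheduled_action_perfdata` helper are assumptions made for illustration.

```python
def scheduled_action_perfdata(cluster: dict) -> dict:
    """Summarise scheduled switchovers and restarts in a cluster description."""
    switchover = int(bool(cluster.get("scheduled_switchover")))
    restarts = sum(
        1 for member in cluster.get("members", []) if member.get("scheduled_restart")
    )
    return {
        "scheduled_actions": int(bool(switchover or restarts)),
        "scheduled_switchover": switchover,
        "scheduled_restart": restarts,
    }


# Hypothetical cluster with one scheduled restart and no switchover.
cluster = {
    "members": [
        {"name": "srv1", "role": "leader"},
        {"name": "srv2", "role": "replica",
         "scheduled_restart": {"schedule": "2030-01-01T00:00:00+00:00"}},
    ]
}
perf = scheduled_action_perfdata(cluster)
```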
### cluster_is_in_maintenance
Usage: check_patroni cluster_is_in_maintenance [OPTIONS]
Check if the cluster is in maintenance mode or paused.
* `OK`: If the cluster is in maintenance mode.
* `CRITICAL`: otherwise.
* `is_in_maintenance` is 1 if the cluster is in maintenance mode, 0 otherwise
--help Show this message and exit.
### cluster_node_count
Usage: check_patroni cluster_node_count [OPTIONS]
Count the number of nodes in the cluster.
The state refers to the state of PostgreSQL. Possible values are:
* initializing new cluster, initdb failed
* running custom bootstrap script, custom bootstrap failed
* starting, start failed
* restarting, restart failed
* running, streaming (for a replica V3.0.4)
* stopping, stopped, stop failed
* creating replica
* crashed
The role refers to the role of the server in the cluster. Possible values are:
* master or leader (V3.0.0+)
* replica
* demoted
* promoted
* uninitialized
* Compares the number of nodes against the normal and healthy (running + streaming) nodes warning and critical thresholds.
* `OK`: If they are not provided.
* `members`: the member count.
* `healthy_members`: the running and streaming member count.
* all the roles of the nodes in the cluster with their count (start with "role_").
* all the statuses of the nodes in the cluster with their count (start with "state_").
-w, --warning TEXT Warning threshold for the number of nodes.
-c, --critical TEXT Critical threshold for the number of nodes.
--healthy-warning TEXT Warning threshold for the number of healthy nodes
(running + streaming).
--healthy-critical TEXT Critical threshold for the number of healthy nodes
(running + streaming).
--help Show this message and exit.
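
The perfdata listed above amounts to a few counters over the member list. The sketch below assumes members shaped like the `members` array of Patroni's `GET /cluster` response; the sample data, the `node_count_perfdata` helper and the replacement of spaces by underscores in state labels are assumptions, not the plugin's documented output format.

```python
from collections import Counter


def node_count_perfdata(members: list) -> dict:
    """Count members overall, healthy members, and members per role and state."""
    roles = Counter(member["role"] for member in members)
    states = Counter(member["state"] for member in members)
    perf = {
        "members": len(members),
        # Healthy means PostgreSQL is running or streaming.
        "healthy_members": states["running"] + states["streaming"],
    }
    for role, count in sorted(roles.items()):
        perf[f"role_{role}"] = count
    for state, count in sorted(states.items()):
        perf[f"state_{state.replace(' ', '_')}"] = count
    return perf


# Hypothetical three node cluster with one failed replica.
members = [
    {"name": "srv1", "role": "leader", "state": "running"},
    {"name": "srv2", "role": "replica", "state": "streaming"},
    {"name": "srv3", "role": "replica", "state": "start failed"},
]
perf = node_count_perfdata(members)
```

`members` would be compared against `--warning` and `--critical`, and `healthy_members` against `--healthy-warning` and `--healthy-critical`.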
## Node services
### node_is_alive
Usage: check_patroni node_is_alive [OPTIONS]
Check if the node is alive ie patroni is running. This is a liveness check
as defined in Patroni's documentation.
* `OK`: If patroni is running.
* `CRITICAL`: otherwise.
* `is_running` is 1 if patroni is running, 0 otherwise
--help Show this message and exit.
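
A liveness probe of this kind can be sketched with the standard library alone. Patroni's REST API exposes `GET /liveness`, which answers with HTTP 200 while the heartbeat loop is running. The `is_running` helper, the decision to bypass proxies and the example URL in the usage note are assumptions; check_patroni may query the node differently.

```python
import urllib.error
import urllib.request


def is_running(endpoint: str, timeout: float = 2.0) -> int:
    """Return 1 if the node answers 200 on /liveness, 0 otherwise."""
    url = endpoint.rstrip("/") + "/liveness"
    # Assumption: the check talks to the node directly, so proxy settings
    # from the environment are ignored.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            return 1 if response.status == 200 else 0
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors such as 503, refused connections and timeouts.
        return 0
```

For example, `is_running("http://127.0.0.1:8008")` probes a node on a hypothetical local address and maps any failure to 0, matching the `is_running` perfdata.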
### node_is_pending_restart
Usage: check_patroni node_is_pending_restart [OPTIONS]
Check if the node is in pending restart state.
This situation can arise if the configuration has been modified but requires
a restart of PostgreSQL to take effect.
* `OK`: if the node has no pending restart tag.
* `CRITICAL`: otherwise
Perfdata: `is_pending_restart` is 1 if the node has pending restart tag, 0 otherwise.
--help Show this message and exit.
### node_is_leader
Usage: check_patroni node_is_leader [OPTIONS]
Check if the node is a leader node.
This check applies to any kind of leaders including standby leaders. To
check explicitly for a standby leader use the `--is-standby-leader` option.
* `OK`: if the node is a leader.
* `CRITICAL`: otherwise
Perfdata: `is_leader` is 1 if the node is a leader node, 0 otherwise.
--is-standby-leader Check for a standby leader
--help Show this message and exit.
### node_is_primary
Usage: check_patroni node_is_primary [OPTIONS]
Check if the node is the primary with the leader lock.
This service is not valid for a standby leader, because this kind of node is
not a primary.
* `OK`: if the node is a primary with the leader lock.
* `CRITICAL`: otherwise
Perfdata: `is_primary` is 1 if the node is a primary with the leader lock, 0 otherwise.
--help Show this message and exit.
### node_is_replica
Usage: check_patroni node_is_replica [OPTIONS]
Check if the node is a running replica with no noloadbalance tag.
It is possible to check if the node is synchronous or asynchronous. If
nothing is specified any kind of replica is accepted. When checking for a
synchronous replica, it's not possible to specify a lag.
* `OK`: if the node is a running replica with no noloadbalance tag and the lag is under the maximum threshold.
* `CRITICAL`: otherwise
Perfdata: `is_replica` is 1 if the node is a running replica with no
noloadbalance tag and the lag is under the maximum threshold, 0 otherwise.
--max-lag TEXT maximum allowed lag
--is-sync check if the replica is synchronous
--is-async check if the replica is asynchronous
--help Show this message and exit.
### node_patroni_version
Usage: check_patroni node_patroni_version [OPTIONS]
Check if the version is equal to the input
* `OK`: The version is the same as the input `--patroni-version`
* `CRITICAL`: otherwise.
* `is_version_ok` is 1 if version is ok, 0 otherwise
--patroni-version TEXT Patroni version to compare to [required]
--help Show this message and exit.
### node_tl_has_changed
Usage: check_patroni node_tl_has_changed [OPTIONS]
Check if the timeline has changed.
Note: either a timeline or a state file must be provided for this service to work.
* `OK`: The timeline is the same as last time (`--state-file`) or the timeline given as input (`--timeline`)
* `CRITICAL`: The timeline is not the same.
* `is_timeline_changed` is 1 if the tl has changed, 0 otherwise
* the timeline
--timeline TEXT A timeline number to compare with.
-s, --state-file TEXT A state file to store the last tl number into.
--save Set the current timeline number as the reference for
future calls.
--help Show this message and exit.