
# check_patroni
A nagios plugin for patroni.
## Features
- Check presence of leader, replicas, node counts.
- Check each node for replication status.
```
Usage: check_patroni [OPTIONS] COMMAND [ARGS]...
Nagios plugin for patroni.
Options:
--config FILE Read option defaults from the specified INI file
[default: config.ini]
-e, --endpoints TEXT API endpoint. Can be specified multiple times.
[default: http://127.0.0.1:8008]
--cert_file TEXT File with the client certificate.
--key_file TEXT File with the client key.
--ca_file TEXT The CA certificate.
-v, --verbose Increase verbosity -v (info)/-vv (warning)/-vvv
(debug)
--version
--timeout INTEGER Timeout in seconds for the API queries (0 to disable)
[default: 2]
--help Show this message and exit.
Commands:
cluster_config_has_changed Check if the hash of the configuration has...
cluster_has_leader Check if the cluster has a leader.
cluster_has_replica Check if the cluster has healthy replicas.
cluster_is_in_maintenance Check if the cluster is in maintenance mode...
cluster_node_count Count the number of nodes in the cluster.
node_is_alive Check if the node is alive ie patroni is...
node_is_pending_restart Check if the node is in pending restart state.
node_is_primary Check if the node is the primary with the...
node_is_replica Check if the node is a running replica with...
node_patroni_version Check if the version is equal to the input
node_tl_has_changed Check if the timeline has changed.
```
## Install
check_patroni is licensed under the PostgreSQL license.
```
$ pip install git+https://github.com/dalibo/check_patroni.git
```
Links:
* [pip & centos 7](https://linuxize.com/post/how-to-install-pip-on-centos-7/)
* [pip & debian10](https://linuxize.com/post/how-to-install-pip-on-debian-10/)
## Support
If you hit a bug or need help, open a [GitHub
issue](https://github.com/dalibo/check_patroni/issues/new). Dalibo has no
commitment on response time for public free support. Thanks for your
contribution!
## Config file
All global and service-specific parameters can be specified in a config file, as follows:
```
[options]
endpoints = https://10.20.199.3:8008, https://10.20.199.4:8008,https://10.20.199.5:8008
cert_file = ./ssl/my-cert.pem
key_file = ./ssl/my-key.pem
ca_file = ./ssl/CA-cert.pem
timeout = 0
[options.node_is_replica]
lag=100
```
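To illustrate how a file with this layout can be read, here is a minimal sketch using Python's standard `configparser`. It is not check_patroni's own loader: the helper name `load_options`, the rule that a service section overrides `[options]`, and the comma splitting of `endpoints` are assumptions made for the example.

```python
import configparser

# Sample content mirroring the layout shown above (TLS options left out).
SAMPLE = """
[options]
endpoints = https://10.20.199.3:8008, https://10.20.199.4:8008,https://10.20.199.5:8008
timeout = 0

[options.node_is_replica]
lag=100
"""


def load_options(text, service=None):
    """Return the global options, overlaid with the service section if any.

    Hypothetical helper: the merge order is an assumption, not documented
    check_patroni behaviour.
    """
    parser = configparser.ConfigParser()
    parser.read_string(text)
    options = dict(parser["options"]) if parser.has_section("options") else {}
    section = f"options.{service}"
    if service and parser.has_section(section):
        options.update(parser[section])
    # Endpoints are comma separated; tolerate optional spaces around commas.
    if "endpoints" in options:
        options["endpoints"] = [
            e.strip() for e in options["endpoints"].split(",") if e.strip()
        ]
    return options


opts = load_options(SAMPLE, "node_is_replica")
print(opts["endpoints"])
print(opts["timeout"], opts["lag"])
```

All values come back as strings, so a real loader would still have to convert `timeout` and similar options to their expected types.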
## Thresholds
The format for the threshold parameters is `[@][start:][end]`.
* `start:` may be omitted if `start == 0`
* `~:` means that start is negative infinity
* If `end` is omitted, infinity is assumed
* To invert the match condition, prefix the range expression with `@`.
A match is found when: `start <= VALUE <= end`.
For example, the following command will raise:
* a warning if there are fewer than 2 nodes, which can be translated to outside of range [2;+INF[
* a critical if there are no nodes, which can be translated to outside of range [1;+INF[
```
check_patroni -e https://10.20.199.3:8008 cluster_has_replica --warning 2: --critical 1:
```
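The range rules above can be sketched in a few lines of Python. This is an illustrative re-implementation, not the code check_patroni runs; the function names `parse_range`, `raises_alert` and `state` are made up for the example. It follows the usual Nagios convention that an alert is raised when the value falls outside the range, or inside it when the range is prefixed with `@`.

```python
import math


def parse_range(spec):
    """Parse a `[@][start:][end]` threshold into (invert, start, end)."""
    invert = spec.startswith("@")
    if invert:
        spec = spec[1:]
    if ":" in spec:
        start_text, end_text = spec.split(":", 1)
    else:
        # No colon: the whole expression is `end`, and start defaults to 0.
        start_text, end_text = "", spec
    if start_text == "~":
        start = -math.inf
    elif start_text == "":
        start = 0.0
    else:
        start = float(start_text)
    # An omitted end means positive infinity.
    end = float(end_text) if end_text else math.inf
    return invert, start, end


def raises_alert(spec, value):
    """True when `value` triggers the threshold described by `spec`."""
    invert, start, end = parse_range(spec)
    inside = start <= value <= end
    return inside if invert else not inside


def state(value, warning, critical):
    """Combine both thresholds; critical takes precedence over warning."""
    if raises_alert(critical, value):
        return "CRITICAL"
    if raises_alert(warning, value):
        return "WARNING"
    return "OK"


# Same thresholds as the command above: --warning 2: --critical 1:
for count in (0, 1, 2):
    print(count, state(count, "2:", "1:"))
# prints:
# 0 CRITICAL
# 1 WARNING
# 2 OK
```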
## Cluster services
### cluster_config_has_changed
```
Usage: check_patroni cluster_config_has_changed [OPTIONS]
Check if the hash of the configuration has changed.
Note: either a hash or a state file must be provided for this service to
work.
Check:
* `OK`: The hash didn't change
* `CRITICAL`: The hash of the configuration has changed compared to the input (`--hash`) or last time (`--state-file`)
Perfdata:
* `is_configuration_changed` is 1 if the configuration has changed
Options:
--hash TEXT A hash to compare with.
-s, --state-file TEXT A state file to store the hash of the configuration.
--help Show this message and exit.
```
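The two comparison modes (a fixed `--hash` or a remembered value in a state file) can be pictured with the sketch below. Everything in it is an assumption made for illustration: the hash function (SHA-256 over canonical JSON), the plain-text state file format, and the choice to report no change on the first run are not taken from check_patroni itself.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def config_hash(config):
    """Hash a configuration dict in a key-order independent way (assumed scheme)."""
    canonical = json.dumps(config, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()


def has_changed(current_hash, expected_hash=None, state_file=None):
    """Compare against a given hash, or against the hash saved last time."""
    if expected_hash is None and state_file is None:
        raise ValueError("either a hash or a state file must be provided")
    if expected_hash is not None:
        return current_hash != expected_hash
    path = Path(state_file)
    previous = path.read_text().strip() if path.exists() else None
    path.write_text(current_hash)  # remember the value for the next run
    # Assumption: with no previous value there is nothing to compare against.
    return previous is not None and previous != current_hash


with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / "config.state"
    first = config_hash({"ttl": 30, "loop_wait": 10})
    second = config_hash({"ttl": 60, "loop_wait": 10})
    print(has_changed(first, state_file=state))   # first run
    print(has_changed(first, state_file=state))   # same configuration
    print(has_changed(second, state_file=state))  # ttl was modified
```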
### cluster_has_leader
```
Usage: check_patroni cluster_has_leader [OPTIONS]
Check if the cluster has a leader.
Note: there is no difference between a normal and standby leader.
Check:
* `OK`: if there is a leader node.
* `CRITICAL`: otherwise
Perfdata: `has_leader` is 1 if there is a leader node, 0 otherwise
Options:
--help Show this message and exit.
```
### cluster_has_replica
```
Usage: check_patroni cluster_has_replica [OPTIONS]
Check if the cluster has healthy replicas.
A healthy replica:
* is in running state
* has a replica role
* has a lag lower than or equal to max_lag
Check:
* `OK`: if the healthy_replica count and their lag are compatible with the replica count threshold.
* `WARNING` / `CRITICAL`: otherwise
Perfdata:
* healthy_replica & unhealthy_replica count
* the lag of each replica labelled with "member name"_lag
Options:
-w, --warning TEXT Warning threshold for the number of healthy replica
nodes.
-c, --critical TEXT Critical threshold for the number of healthy replica
nodes.
--max-lag TEXT maximum allowed lag
--help Show this message and exit.
```
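The definition of a healthy replica given in the help text can be expressed as a small predicate. The sketch below is only an illustration: it assumes each cluster member is a dict with `name`, `role`, `state` and a numeric `lag`, that `max_lag` is a plain number, and that the leader is counted in neither bucket. The helper names are hypothetical.

```python
def is_healthy_replica(member, max_lag=None):
    """Running state, replica role, and lag within max_lag when one is given."""
    if member.get("state") != "running" or member.get("role") != "replica":
        return False
    return max_lag is None or member.get("lag", 0) <= max_lag


def replica_summary(members, max_lag=None):
    """Count healthy and unhealthy replicas and collect per-member lag values."""
    replicas = [m for m in members if m.get("role") == "replica"]
    healthy = sum(1 for m in replicas if is_healthy_replica(m, max_lag))
    summary = {
        "healthy_replica": healthy,
        "unhealthy_replica": len(replicas) - healthy,
    }
    # Lag values are labelled "<member name>_lag", as in the perfdata above.
    for m in replicas:
        summary[f"{m['name']}_lag"] = m.get("lag", 0)
    return summary


# Made-up member list for the example.
members = [
    {"name": "p1", "role": "leader", "state": "running"},
    {"name": "p2", "role": "replica", "state": "running", "lag": 0},
    {"name": "p3", "role": "replica", "state": "running", "lag": 5000},
    {"name": "p4", "role": "replica", "state": "stopped", "lag": 0},
]
print(replica_summary(members, max_lag=1000))
```

With these sample members only `p2` is healthy: `p3` exceeds the lag limit and `p4` is not running.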
### cluster_is_in_maintenance
```
Usage: check_patroni cluster_is_in_maintenance [OPTIONS]
Check if the cluster is in maintenance mode or paused.
Check:
* `OK`: If the cluster is not in maintenance mode.
* `CRITICAL`: otherwise.
Perfdata:
* `is_in_maintenance` is 1 if the cluster is in maintenance mode, 0 otherwise
Options:
--help Show this message and exit.
```
### cluster_node_count
```
Usage: check_patroni cluster_node_count [OPTIONS]
Count the number of nodes in the cluster.
Check:
* Compares the number of nodes against the normal and running node warning and critical thresholds.
* `OK`: If they are not provided.
Perfdata:
* `members`: the member count.
* all the roles of the nodes in the cluster with their number.
* all the statuses of the nodes in the cluster with their number.
Options:
-w, --warning TEXT Warning threshold for the number of nodes.
-c, --critical TEXT Critical threshold for the number of nodes.
--running-warning TEXT Warning threshold for the number of running nodes.
--running-critical TEXT Critical threshold for the number of running nodes.
--help Show this message and exit.
```
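The perfdata listed in the help text (member count, nodes per role, nodes per status) amounts to a few tallies over the member list. A minimal sketch follows, assuming each member is a dict with `role` and `state` keys; the function name `node_counts` and the sample data are hypothetical.

```python
from collections import Counter


def node_counts(members):
    """Tally the members by role and by state."""
    roles = Counter(m["role"] for m in members)
    states = Counter(m["state"] for m in members)
    return {
        "members": len(members),
        "running": states.get("running", 0),
        "roles": dict(roles),
        "states": dict(states),
    }


# Made-up member list for the example.
members = [
    {"name": "p1", "role": "leader", "state": "running"},
    {"name": "p2", "role": "replica", "state": "running"},
    {"name": "p3", "role": "replica", "state": "stopped"},
]
counts = node_counts(members)
print(counts["members"], counts["running"])
print(counts["roles"], counts["states"])
```

In this picture, `members` would be compared with `--warning` / `--critical` and `running` with `--running-warning` / `--running-critical`.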
## Node services
### node_is_alive
```
Usage: check_patroni node_is_alive [OPTIONS]
Check if the node is alive, i.e. patroni is running.
Check:
* `OK`: If patroni is running.
* `CRITICAL`: otherwise.
Perfdata:
* `is_running` is 1 if patroni is running, 0 otherwise
Options:
--help Show this message and exit.
```
### node_is_pending_restart
```
Usage: check_patroni node_is_pending_restart [OPTIONS]
Check if the node is in pending restart state.
This situation can arise if the configuration has been modified but requires
a restart of PostgreSQL to take effect.
Check:
* `OK`: if the node has no pending restart tag.
* `CRITICAL`: otherwise
Perfdata: `is_pending_restart` is 1 if the node has pending restart tag, 0
otherwise.
Options:
--help Show this message and exit.
```
### node_is_primary
```
Usage: check_patroni node_is_primary [OPTIONS]
Check if the node is the primary with the leader lock.
Check:
* `OK`: if the node is a primary with the leader lock.
* `CRITICAL`: otherwise
Perfdata: `is_primary` is 1 if the node is a primary with the leader lock, 0
otherwise.
Options:
--help Show this message and exit.
```
### node_is_replica
```
Usage: check_patroni node_is_replica [OPTIONS]
Check if the node is a running replica with no noloadbalance tag.
Check:
* `OK`: if the node is a running replica with no noloadbalance tag and the lag is under the maximum threshold.
* `CRITICAL`: otherwise
Perfdata: `is_replica` is 1 if the node is a running replica with no
noloadbalance tag and the lag is under the maximum threshold, 0 otherwise.
Options:
--max-lag TEXT maximum allowed lag
--help Show this message and exit.
```
### node_patroni_version
```
Usage: check_patroni node_patroni_version [OPTIONS]
Check if the version is equal to the input
Check:
* `OK`: The version is the same as the input `--patroni-version`
* `CRITICAL`: otherwise.
Perfdata:
* `is_version_ok` is 1 if version is ok, 0 otherwise
Options:
--patroni-version TEXT Patroni version to compare to [required]
--help Show this message and exit.
```
### node_tl_has_changed
```
Usage: check_patroni node_tl_has_changed [OPTIONS]
Check if the timeline has changed.
Note: either a timeline or a state file must be provided for this service to
work.
Check:
* `OK`: The timeline is the same as last time (`--state-file`) or the input timeline (`--timeline`)
* `CRITICAL`: The tl is not the same.
Perfdata:
* `is_timeline_changed` is 1 if the tl has changed, 0 otherwise
* the timeline
Options:
--timeline TEXT A timeline number to compare with.
-s, --state-file TEXT A state file to store the last tl number into.
--help Show this message and exit.
```