
check_patroni

A nagios plugin for patroni.

Features

  • Check presence of leader, replicas, node counts.
  • Check each node for replication status.

Usage: check_patroni [OPTIONS] COMMAND [ARGS]...

  Nagios plugin that uses Patroni's REST API to monitor a Patroni cluster.

Options:
  --config FILE         Read option defaults from the specified INI file
                        [default: config.ini]
  -e, --endpoints TEXT  Patroni API endpoint. Can be specified multiple times
                        or as a list of comma separated addresses. The node
                        services checks the status of one node, therefore if
                        several addresses are specified they should point to
                        different interfaces on the same node. The cluster
                        services check the status of the cluster, therefore
                        it's better to give a list of all Patroni node
                        addresses.  [default: http://127.0.0.1:8008]
  --cert_file PATH      File with the client certificate.
  --key_file PATH       File with the client key.
  --ca_file PATH        The CA certificate.
  -v, --verbose         Increase verbosity -v (info)/-vv (warning)/-vvv
                        (debug)
  --version
  --timeout INTEGER     Timeout in seconds for the API queries (0 to disable)
                        [default: 2]
  --help                Show this message and exit.

Commands:
  cluster_config_has_changed    Check if the hash of the configuration...
  cluster_has_leader            Check if the cluster has a leader.
  cluster_has_replica           Check if the cluster has healthy replicas.
  cluster_has_scheduled_action  Check if the cluster has a scheduled...
  cluster_is_in_maintenance     Check if the cluster is in maintenance...
  cluster_node_count            Count the number of nodes in the cluster.
  node_is_alive                 Check if the node is alive ie patroni is...
  node_is_leader                Check if the node is a leader node.
  node_is_pending_restart       Check if the node is in pending restart...
  node_is_primary               Check if the node is the primary with the...
  node_is_replica               Check if the node is a running replica...
  node_patroni_version          Check if the version is equal to the input
  node_tl_has_changed           Check if the timeline has changed.

Install

check_patroni is licensed under the PostgreSQL license.

$ pip install git+https://github.com/dalibo/check_patroni.git

Links:

Support

If you hit a bug or need help, open a GitHub issue. Dalibo makes no commitment on response time for free public support. Thanks for your contribution!

Config file

All global and service-specific parameters can be specified via a config file as follows:

[options]
endpoints = https://10.20.199.3:8008, https://10.20.199.4:8008,https://10.20.199.5:8008
cert_file = ./ssl/my-cert.pem
key_file = ./ssl/my-key.pem
ca_file = ./ssl/CA-cert.pem
timeout = 0

[options.node_is_replica]
lag=100
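
As a rough illustration of how such a file can be read, the sketch below parses the same layout with Python's standard configparser module. It is not check_patroni's own loading code: the helper name load_defaults and the rule that an [options.<service>] section overrides [options] are assumptions made for the example.

```python
import configparser

SAMPLE = """\
[options]
endpoints = https://10.20.199.3:8008, https://10.20.199.4:8008,https://10.20.199.5:8008
timeout = 0

[options.node_is_replica]
lag=100
"""


def load_defaults(text, service):
    """Merge [options] with [options.<service>] (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    merged = dict(parser["options"]) if parser.has_section("options") else {}
    section = f"options.{service}"
    if parser.has_section(section):
        # Assumption: service specific values win over global ones.
        merged.update(parser[section])
    if "endpoints" in merged:
        # Endpoints are comma separated; tolerate spaces around the commas.
        merged["endpoints"] = [
            e.strip() for e in merged["endpoints"].split(",") if e.strip()
        ]
    return merged


defaults = load_defaults(SAMPLE, "node_is_replica")
print(defaults["endpoints"])
print(defaults["lag"], defaults["timeout"])
```

Note that configparser returns every value as a string, so numeric options such as timeout still need to be converted by the caller.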

Thresholds

The format for the threshold parameters is [@][start:][end].

  • start: may be omitted if start == 0
  • ~: means that start is negative infinity
  • If end is omitted, infinity is assumed
  • To invert the match condition, prefix the range expression with @.

A match is found when: start <= VALUE <= end.

For example, the following command will raise:

  • a warning if there are fewer than 2 healthy replicas, which translates to a value outside of the range [2;+INF[
  • a critical if there are no healthy replicas, which translates to a value outside of the range [1;+INF[

check_patroni -e https://10.20.199.3:8008 cluster_has_replica --warning 2: --critical 1:
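
The range rules above can be sketched in a few lines of Python. This is a minimal illustration of the [@][start:][end] format as described in this section, not the parser check_patroni actually uses; the function names parse_range, matches and status are made up for the example.

```python
import math


def parse_range(spec):
    """Parse '[@][start:][end]' into (invert, start, end)."""
    invert = spec.startswith("@")
    if invert:
        spec = spec[1:]
    if ":" in spec:
        start_txt, end_txt = spec.split(":", 1)
    else:
        # No colon: the whole expression is the end, start defaults to 0.
        start_txt, end_txt = "", spec
    if start_txt == "~":
        start = -math.inf
    elif start_txt == "":
        start = 0.0
    else:
        start = float(start_txt)
    end = math.inf if end_txt == "" else float(end_txt)
    if start > end:
        raise ValueError(f"start must not exceed end in {spec!r}")
    return invert, start, end


def matches(spec, value):
    """True when start <= value <= end, or the opposite with a leading @."""
    invert, start, end = parse_range(spec)
    inside = start <= value <= end
    return not inside if invert else inside


def status(value, warning, critical):
    """An alert is raised when the value does not match the range."""
    if not matches(critical, value):
        return "CRITICAL"
    if not matches(warning, value):
        return "WARNING"
    return "OK"


for replicas in (0, 1, 2):
    print(replicas, status(replicas, warning="2:", critical="1:"))
```

With --warning 2: and --critical 1:, zero healthy replicas is critical, one is a warning, and two or more is OK.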

SSL

Several options are available:

  • the server's CA certificate is not available or trusted by the client system:
    • --ca_file: your certification chain (for example: cat CA-certificate server-certificate > cabundle)
  • you have a client certificate for authenticating with Patroni's REST API:
    • --cert_file: your certificate or the concatenation of your certificate and private key
    • --key_file: your private key (optional)
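
To make the relationship between the three options concrete, here is a small sketch of how they could be mapped onto the verify and cert arguments of an HTTP client such as the requests library. The helper name tls_kwargs is hypothetical and this is not necessarily how check_patroni wires the options internally.

```python
def tls_kwargs(ca_file=None, cert_file=None, key_file=None):
    """Build keyword arguments for a requests style call (hypothetical helper)."""
    kwargs = {}
    if ca_file:
        # requests accepts a path to a CA bundle in `verify`.
        kwargs["verify"] = ca_file
    if cert_file and key_file:
        # Certificate and private key stored in separate files.
        kwargs["cert"] = (cert_file, key_file)
    elif cert_file:
        # A single file holding the certificate followed by the private key.
        kwargs["cert"] = cert_file
    return kwargs


# Possible use, assuming requests is installed and the endpoint is reachable:
#   requests.get("https://10.20.199.3:8008/cluster", timeout=2, **tls_kwargs(...))
print(tls_kwargs(ca_file="cabundle"))
print(tls_kwargs(cert_file="./ssl/my-cert.pem", key_file="./ssl/my-key.pem"))
```

This mirrors why --key_file is optional: when the certificate file already contains the private key, a single path is enough.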

Tests

Crafting repeatable tests using a live Patroni cluster can be intricate. To simplify the development process, interactions with Patroni's API are substituted with a mock function that yields an HTTP return code and a JSON object outlining the cluster's status. The JSON files containing this information are housed in the ./tests/json directory.

This approach has a drawback: if the JSON data is incorrect, or if Patroni has changed without the tests being updated accordingly, the tests might still pass erroneously.
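
The idea of replacing the API call with a canned answer can be sketched as follows. This is a simplified stand-in, not the mock used in the repository: make_fake_api, the routes mapping, the file name cluster_ok.json and the sample payload are all invented for the illustration.

```python
import json
import tempfile
from pathlib import Path


def make_fake_api(json_dir, routes):
    """Return a callable that mimics an API call.

    `routes` maps an endpoint name to (HTTP status code, JSON file name).
    """

    def fake_get(endpoint):
        status, filename = routes[endpoint]
        with open(Path(json_dir) / filename) as fh:
            return status, json.load(fh)

    return fake_get


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a file that would live in a directory such as ./tests/json.
    sample = {"members": [{"name": "srv1", "role": "leader", "state": "running"}]}
    (Path(tmp) / "cluster_ok.json").write_text(json.dumps(sample))

    fake_get = make_fake_api(tmp, {"cluster": (200, "cluster_ok.json")})
    status, body = fake_get("cluster")

print(status, body["members"][0]["role"])
```

A test then exercises a service against fake_get instead of a live node, which keeps it repeatable but also means it only ever sees what the JSON file says.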

To run the tests:

  1. Clone the check_patroni repository, create a virtual environment, and install the script:

    git clone https://github.com/dalibo/check_patroni.git
    cd check_patroni
    python -m venv .venv
    . .venv/bin/activate
    pip install -e .[test]
    
  2. Run the tests

    • Using patroni's nominal replica state of streaming (since v3.0.4):
    pytest ./tests
    
    • Using patroni's nominal replica state of running (before v3.0.4):
    pytest --use-old-replica-state ./tests
    

Please note that for any service that checks the state of a node through Patroni's cluster endpoint, the corresponding JSON test file must be added in ./tests/tools.py.

A bash script, check_patroni.sh, is provided to facilitate testing all services on a Patroni endpoint (./vagrant/check_patroni.sh). It requires one parameter: the endpoint URL that will be used as the argument for the -e/--endpoints option of check_patroni. The script builds a list of service calls and runs them sequentially.

Here's an example usage:

./vagrant/check_patroni.sh http://10.20.30.51:8008

Cluster services

cluster_config_has_changed

Usage: check_patroni cluster_config_has_changed [OPTIONS]

  Check if the hash of the configuration has changed.

  Note: either a hash or a state file must be provided for this service to
  work.

  Check:
  * `OK`: The hash didn't change
  * `CRITICAL`: The hash of the configuration has changed compared to the input (`--hash`) or last time (`--state-file`)

  Perfdata:
  * `is_configuration_changed` is 1 if the configuration has changed

Options:
  --hash TEXT            A hash to compare with.
  -s, --state-file TEXT  A state file to store the hash of the configuration.
  --save                 Set the current configuration hash as the reference
                         for future calls.
  --help                 Show this message and exit.
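
The comparison against a reference hash can be illustrated with the sketch below. The hash algorithm (SHA-256), the JSON canonicalisation and the function names are assumptions chosen for the example; the hash check_patroni computes may differ, so values produced here are not interchangeable with --hash.

```python
import hashlib
import json


def config_hash(config):
    """Hash a configuration dict independently of key order (illustrative)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_config(config, reference_hash):
    """Return (status, is_configuration_changed)."""
    changed = config_hash(config) != reference_hash
    return ("CRITICAL" if changed else "OK", int(changed))


reference = config_hash({"ttl": 30, "loop_wait": 10})
unchanged = check_config({"loop_wait": 10, "ttl": 30}, reference)
modified = check_config({"loop_wait": 10, "ttl": 60}, reference)
print(unchanged, modified)
```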

cluster_has_leader

Usage: check_patroni cluster_has_leader [OPTIONS]

  Check if the cluster has a leader.

  This check applies to any kind of leaders including standby leaders.

  Check:
  * `OK`: if there is a leader node.
  * `CRITICAL`: otherwise

  Perfdata: `has_leader` is 1 if there is a leader node, 0 otherwise

Options:
  --help  Show this message and exit.

cluster_has_replica

Usage: check_patroni cluster_has_replica [OPTIONS]

  Check if the cluster has healthy replicas.

  A healthy replica:
  * is in running or streaming state (V3.0.4)
  * has a replica or sync_standby role
  * has a lag lower than or equal to max_lag

  Check:
  * `OK`: if the healthy_replica count and their lag are compatible with the replica count threshold.
  * `WARNING` / `CRITICAL`: otherwise

  Perfdata:
  * healthy_replica & unhealthy_replica count
  * the lag of each replica labelled with  "member name"_lag

Options:
  -w, --warning TEXT   Warning threshold for the number of healthy replica
                       nodes.
  -c, --critical TEXT  Critical threshold for the number of healthy replica
                       nodes.
  --max-lag TEXT       maximum allowed lag
  --help               Show this message and exit.
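
The definition of a healthy replica given above can be expressed as a short predicate. The sketch assumes each member is a dict with role, state and lag keys, as in the members list returned by Patroni's cluster endpoint, that max_lag is a plain number of bytes, and that a missing or non-numeric lag counts as unhealthy; the function name is hypothetical.

```python
def is_healthy_replica(member, max_lag=None):
    """Apply the three conditions listed in the service description."""
    if member.get("role") not in ("replica", "sync_standby"):
        return False
    if member.get("state") not in ("running", "streaming"):
        return False
    if max_lag is not None:
        lag = member.get("lag")
        # Assumption: a missing or non numeric lag is treated as unhealthy.
        if not isinstance(lag, int) or lag > max_lag:
            return False
    return True


members = [
    {"name": "srv1", "role": "leader", "state": "running"},
    {"name": "srv2", "role": "replica", "state": "streaming", "lag": 0},
    {"name": "srv3", "role": "replica", "state": "running", "lag": 500},
    {"name": "srv4", "role": "replica", "state": "start failed", "lag": 0},
]

healthy = [m["name"] for m in members if is_healthy_replica(m, max_lag=100)]
healthy_any_lag = [m["name"] for m in members if is_healthy_replica(m)]
print(healthy, healthy_any_lag)
```

The resulting count of healthy replicas is what the warning and critical thresholds are compared against.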

cluster_has_scheduled_action

Usage: check_patroni cluster_has_scheduled_action [OPTIONS]

  Check if the cluster has a scheduled action (switchover or restart)

  Check:
  * `OK`: If the cluster has no scheduled action
  * `CRITICAL`: otherwise.

  Perfdata:
  * `scheduled_actions` is 1 if the cluster has scheduled actions.
  * `scheduled_switchover` is 1 if the cluster has a scheduled switchover.
  * `scheduled_restart` counts the number of scheduled restarts in the cluster.

Options:
  --help  Show this message and exit.

cluster_is_in_maintenance

Usage: check_patroni cluster_is_in_maintenance [OPTIONS]

  Check if the cluster is in maintenance mode or paused.

  Check:
  * `OK`: If the cluster is not in maintenance mode.
  * `CRITICAL`: otherwise.

  Perfdata:
  * `is_in_maintenance` is 1 if the cluster is in maintenance mode, 0 otherwise

Options:
  --help  Show this message and exit.

cluster_node_count

Usage: check_patroni cluster_node_count [OPTIONS]

  Count the number of nodes in the cluster.

  The state refers to the state of PostgreSQL. Possible values are:
  * initializing new cluster, initdb failed
  * running custom bootstrap script, custom bootstrap failed
  * starting, start failed
  * restarting, restart failed
  * running, streaming (for a replica V3.0.4)
  * stopping, stopped, stop failed
  * creating replica
  * crashed

  The role refers to the role of the server in the cluster. Possible values
  are:
  * master or leader (V3.0.0+)
  * replica
  * demoted
  * promoted
  * uninitialized

  Check:
  * Compares the number of nodes against the normal and healthy (running + streaming) nodes warning and critical thresholds.
  * `OK`:  If they are not provided.

  Perfdata:
  * `members`: the member count.
  * `healthy_members`: the running and streaming member count.
  * all the roles of the nodes in the cluster with their count (start with "role_").
  * all the statuses of the nodes in the cluster with their count (start with "state_").

Options:
  -w, --warning TEXT       Warning threshold for the number of nodes.
  -c, --critical TEXT      Critical threshold for the number of nodes.
  --healthy-warning TEXT   Warning threshold for the number of healthy nodes
                           (running + streaming).
  --healthy-critical TEXT  Critical threshold for the number of healthy nodes
                           (running + streaming).
  --help                   Show this message and exit.
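
The perfdata described above can be derived from a member list as in the following sketch. It assumes members are dicts with role and state keys; replacing spaces in state names with underscores is a choice made for the example, and the exact labels emitted by check_patroni may differ.

```python
from collections import Counter

HEALTHY_STATES = ("running", "streaming")


def node_count_perfdata(members):
    """Build a perfdata dict from a list of member dicts (illustrative)."""
    perfdata = {
        "members": len(members),
        "healthy_members": sum(m["state"] in HEALTHY_STATES for m in members),
    }
    for role, count in Counter(m["role"] for m in members).items():
        perfdata[f"role_{role}"] = count
    for state, count in Counter(m["state"] for m in members).items():
        # Assumption: spaces are replaced to keep the label a single token.
        perfdata[f"state_{state.replace(' ', '_')}"] = count
    return perfdata


members = [
    {"name": "srv1", "role": "leader", "state": "running"},
    {"name": "srv2", "role": "replica", "state": "streaming"},
    {"name": "srv3", "role": "replica", "state": "streaming"},
    {"name": "srv4", "role": "replica", "state": "start failed"},
]
perfdata = node_count_perfdata(members)
print(perfdata)
```

Here members would be checked against --warning and --critical, and healthy_members against --healthy-warning and --healthy-critical.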

Node services

node_is_alive

Usage: check_patroni node_is_alive [OPTIONS]

  Check if the node is alive, i.e. patroni is running. This is a liveness check
  as defined in Patroni's documentation.

  Check:
  * `OK`: If patroni is running.
  * `CRITICAL`: otherwise.

  Perfdata:
  * `is_running` is 1 if patroni is running, 0 otherwise

Options:
  --help  Show this message and exit.

node_is_pending_restart

Usage: check_patroni node_is_pending_restart [OPTIONS]

  Check if the node is in pending restart state.

  This situation can arise if the configuration has been modified but requires
  a restart of PostgreSQL to take effect.

  Check:
  * `OK`: if the node has no pending restart tag.
  * `CRITICAL`: otherwise

  Perfdata: `is_pending_restart` is 1 if the node has pending restart tag, 0
  otherwise.

Options:
  --help  Show this message and exit.

node_is_leader

Usage: check_patroni node_is_leader [OPTIONS]

  Check if the node is a leader node.

  This check applies to any kind of leaders including standby leaders. To
  check explicitly for a standby leader use the `--is-standby-leader` option.

  Check:
  * `OK`: if the node is a leader.
  * `CRITICAL`: otherwise

  Perfdata: `is_leader` is 1 if the node is a leader node, 0 otherwise.

Options:
  --is-standby-leader  Check for a standby leader
  --help               Show this message and exit.

node_is_primary

Usage: check_patroni node_is_primary [OPTIONS]

  Check if the node is the primary with the leader lock.

  This service is not valid for a standby leader, because this kind of node is
  not a primary.

  Check:
  * `OK`: if the node is a primary with the leader lock.
  * `CRITICAL`: otherwise

  Perfdata: `is_primary` is 1 if the node is a primary with the leader lock, 0
  otherwise.

Options:
  --help  Show this message and exit.

node_is_replica

Usage: check_patroni node_is_replica [OPTIONS]

  Check if the node is a running replica with no noloadbalance tag.

  It is possible to check if the node is synchronous or asynchronous. If
  nothing is specified any kind of replica is accepted. When checking for a
  synchronous replica, it's not possible to specify a lag.

  Check:
  * `OK`: if the node is a running replica with no noloadbalance tag and the lag is under the maximum threshold.
  * `CRITICAL`:  otherwise

  Perfdata: `is_replica` is 1 if the node is a running replica with no
  noloadbalance tag and the lag is under the maximum threshold, 0 otherwise.

Options:
  --max-lag TEXT  maximum allowed lag
  --is-sync       check if the replica is synchronous
  --is-async      check if the replica is asynchronous
  --help          Show this message and exit.

node_patroni_version

Usage: check_patroni node_patroni_version [OPTIONS]

  Check if the version is equal to the input

  Check:
  * `OK`: The version is the same as the input `--patroni-version`
  * `CRITICAL`: otherwise.

  Perfdata:
  * `is_version_ok` is 1 if version is ok, 0 otherwise

Options:
  --patroni-version TEXT  Patroni version to compare to  [required]
  --help                  Show this message and exit.

node_tl_has_changed

Usage: check_patroni node_tl_has_changed [OPTIONS]

  Check if the timeline has changed.

  Note: either a timeline or a state file must be provided for this service to
  work.

  Check:
  * `OK`: The timeline is the same as last time (`--state-file`) or the inputted timeline (`--timeline`)
  * `CRITICAL`: The tl is not the same.

  Perfdata:
  * `is_timeline_changed` is 1 if the tl has changed, 0 otherwise
  * the timeline

Options:
  --timeline TEXT        A timeline number to compare with.
  -s, --state-file TEXT  A state file to store the last tl number into.
  --save                 Set the current timeline number as the reference for
                         future calls.
  --help                 Show this message and exit.
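
The interplay between the state file and --save can be sketched as follows. The JSON layout of the state file, the function name, and the choice to report OK when no reference exists yet are all assumptions made for the illustration; check_patroni's actual state file format and first-run behaviour may differ.

```python
import json
import tempfile
from pathlib import Path


def check_timeline(current_tl, state_file, save=False):
    """Compare the current timeline with the stored one (illustrative)."""
    path = Path(state_file)
    previous = json.loads(path.read_text())["timeline"] if path.exists() else None
    # Assumption: with no stored reference there is nothing to compare against.
    changed = previous is not None and previous != current_tl
    if save:
        # Only an explicit save moves the reference forward.
        path.write_text(json.dumps({"timeline": current_tl}))
    return ("CRITICAL" if changed else "OK", int(changed))


with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / "tl.state"
    first = check_timeline(1, state, save=True)   # stores timeline 1
    second = check_timeline(2, state)             # differs from the reference
    third = check_timeline(2, state, save=True)   # still differs, then stores 2
    fourth = check_timeline(2, state)             # matches the new reference

print(first, second, third, fourth)
```

In this sketch the alert keeps firing until a call with save=True acknowledges the new timeline, which matches the description of --save as setting the reference for future calls.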