Compare commits

...

No commits in common. "debian/latest" and "pristine-tar" have entirely different histories.

108 changed files with 2 additions and 8469 deletions

.coveragerc

@@ -1,3 +0,0 @@
[run]
include =
check_patroni/*

.flake8

@@ -1,13 +0,0 @@
[flake8]
doctests = True
ignore =
# line too long
E501,
# line break before binary operator (added by black)
W503,
exclude =
.git,
.mypy_cache,
.tox,
.venv,
mypy_config = mypy.ini

.github/workflows/lint.yml

@@ -1,16 +0,0 @@
name: Lint
on: [push, pull_request]
jobs:
lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- uses: actions/setup-python@v2
- name: Install tox
run: pip install tox
- name: Lint (black & flake8)
run: tox -e lint
- name: Mypy
run: tox -e mypy

.github/workflows/publish.yml

@@ -1,28 +0,0 @@
name: Publish
on:
push:
tags:
- 'v*'
jobs:
publish:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- uses: actions/setup-python@v2
with:
python-version: '3.10'
- name: Install
run: python -m pip install setuptools wheel twine
- name: Build
run: |
python setup.py check
python setup.py sdist bdist_wheel
python -m twine check dist/*
- name: Publish
run: python -m twine upload dist/*
env:
TWINE_USERNAME: __token__
TWINE_PASSWORD: ${{ secrets.PYPI_TOKEN }}

.github/workflows/tests.yml

@@ -1,22 +0,0 @@
name: Tests
on: [push, pull_request]
jobs:
tests:
runs-on: ubuntu-latest
strategy:
matrix:
include:
- python: "3.7"
- python: "3.11"
steps:
- uses: actions/checkout@v2
- name: Setup Python
uses: actions/setup-python@v2
with:
python-version: ${{ matrix.python }}
- name: Install tox
run: pip install tox
- name: Test
run: tox -e py

.gitignore

@@ -1,11 +0,0 @@
__pycache__/
check_patroni.egg-info
tests/config.ini
vagrant/.vagrant
vagrant/*.state_file
.*.swp
.coverage
.venv/
.tox/
dist/
build/

CHANGELOG.md

@@ -1,97 +0,0 @@
# Change log
## check_patroni 2.0.0 - 2024-04-09
### Changed
* In `cluster_node_count`, a healthy standby, sync replica or standby leader cannot be "in
archive recovery" because this service doesn't check for lag and timelines.
### Added
* Add the timeline in the `cluster_has_replica` perfstats. (#50)
* Add a mention about shell completion support and shell versions in the doc. (#53)
* Add the leader type and whether it's archiving to the `cluster_has_leader` perfstats. (#58)
### Fixed
* Add compatibility with [requests](https://requests.readthedocs.io)
version 2.25 and higher.
* Fix what `cluster_has_replica` deems a healthy replica. (#50, reported by @mbanck)
* Fix `cluster_has_replica` to display perfstats for replicas whenever it's possible (healthy or not). (#50)
* Fix `cluster_has_leader` to correctly check for standby leaders. (#58, reported by @mbanck)
* Fix `cluster_node_count` to correctly manage replication states. (#50, reported by @mbanck)
### Misc
* Improve the documentation for `node_is_replica`.
* Improve test coverage by running an HTTP server to fake the Patroni API (#55
by @dlax).
* Work around old pytest versions in type annotations in the test suite.
* Declare compatibility with click version 7.1 (or higher).
* In tests, work around nagiosplugin 1.3.2 not properly handling stdout
redirection.
## check_patroni 1.0.0 - 2023-08-28
check_patroni is now tagged as Production/Stable.
### Added
* Add `sync_standby` as a valid replica type for `cluster_has_replica`. (contributed by @mattpoel)
* Add info and options (`--sync-warning` and `--sync-critical`) about sync replica to `cluster_has_replica`.
* Add a new service `cluster_has_scheduled_action` to warn of any scheduled switchover or restart.
* Add options to `node_is_replica` to check specifically for a synchronous (`--is-sync`) or asynchronous node (`--is-async`).
* Add `standby-leader` as a valid leader type for `cluster_has_leader`.
* Add a new service `node_is_leader` to check if a node is a leader (which includes standby leader nodes)
### Fixed
* Fix the `node_is_alive` check. (#31)
* Fix the `cluster_has_replica` and `cluster_node_count` checks to account for
the new replica state `streaming` introduced in v3.0.4 (#28, reported by @log1-c)
### Misc
* Create CHANGELOG.md
* Add tests for the output of the scripts in addition to the return code
* Documentation in CONTRIBUTING.md
## check_patroni 0.2.0 - 2023-03-20
### Added
* Add a `--save` option when state files are used
* Modify `-e/--endpoints` to allow a comma separated list of endpoints (#21, reported by @lihnjo)
* Use requests instead of urllib3 (with extensive help from @dlax)
* Change the way logging is handled (with extensive help from @dlax)
### Fix
* Reverse the test for `node_is_pending`
* SSL handling
### Misc
* Several doc fixes and updates
* Use spellcheck and isort
* Remove tests for python 3.6
* Add python tests for python 3.11
## check_patroni 0.1.1 - 2022-07-15
The initial release covers the following checks:
* check a cluster for
+ configuration change
+ presence of a leader
+ presence of a replica
+ maintenance status
* check a node for
+ liveness
+ pending restart status
+ primary status
+ replica status
+ tl change
+ patroni version

CONTRIBUTING.md

@@ -1,94 +0,0 @@
# Contributing to check_patroni
Thanks for your interest in contributing to check_patroni.
## Clone Git Repository
Installation from the git repository:
```
$ git clone https://github.com/dalibo/check_patroni.git
$ cd check_patroni
```
Change the branch if necessary.
## Create Python Virtual Environment
You need a dedicated environment; install the dependencies and then
check_patroni from the repo:
```
$ python3 -m venv .venv
$ . .venv/bin/activate
(.venv) $ pip3 install .[test]
(.venv) $ pip3 install -r requirements-dev.txt
(.venv) $ check_patroni
```
To quit this env and destroy it:
```
$ deactivate
$ rm -r .venv
```
## Development Environment
A vagrant file is available to create an icinga / opm / grafana stack and
install check_patroni. You can then add a server to the supervision and
watch the graphs in grafana. It's in the `vagrant` directory.
A vagrant file can be found in [this
repository](https://github.com/ioguix/vagrant-patroni) to generate a patroni/etcd
setup.
The `README.md` can be generated with `./docs/make_readme.sh`.
## Executing Tests
Crafting repeatable tests using a live Patroni cluster can be intricate. To
simplify the development process, a fake HTTP server is set up as a test
fixture and serves static files (either from `tests/json` directory or from
in-memory data).
One potential drawback: if the JSON data is incorrect, or if Patroni has been
modified without corresponding updates to the tests documented here, the tests
might still pass erroneously.
The tests are executed automatically for each PR by the CI (see
`.github/workflows/lint.yml` and `.github/workflows/tests.yml`).
Running the tests,
* manually:
```bash
pytest --cov tests
```
* or using tox:
```bash
tox -e lint # mypy + flake8 + black + isort + codespell
tox # pytests and "lint" tests for all supported versions of Python
tox -e py # pytests and "lint" tests for the default version of Python
```
Please note that when dealing with any service that checks the state of a node,
the related tests must use the `old_replica_state` fixture to test with both
old (pre 3.0.4) and new replica states.
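As a sketch, such a test could look like the following (hypothetical test:
`fake_restapi` stands in for the project's fake-API fixture, whose real name
may differ):
```python
from click.testing import CliRunner

from check_patroni.cli import main


def test_cluster_node_count_ok(fake_restapi, old_replica_state):
    # `old_replica_state` parametrizes the fake API data with both the
    # pre-3.0.4 ("running") and newer ("streaming", "in archive recovery")
    # replica states.
    runner = CliRunner()
    result = runner.invoke(
        main, ["-e", "http://127.0.0.1:8008", "cluster_node_count"]
    )
    assert result.exit_code == 0
```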
A bash script, `check_patroni.sh`, is provided to facilitate testing all
services on a Patroni endpoint (`./vagrant/check_patroni.sh`). It requires one
parameter: the endpoint URL that will be used as the argument for the
`-e/--endpoints` option of `check_patroni`. This script essentially compiles a
list of service calls and executes them sequentially. It creates a state file
in the directory from which you run the script.
Here's an example usage:
```bash
./vagrant/check_patroni.sh http://10.20.30.51:8008
```

LICENSE

@@ -1,19 +0,0 @@
PostgreSQL Licence
Copyright (c) 2022, DALIBO
Permission to use, copy, modify, and distribute this software and its
documentation for any purpose, without fee, and without a written agreement is
hereby granted, provided that the above copyright notice and this paragraph and
the following two paragraphs appear in all copies.
IN NO EVENT SHALL DALIBO BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL,
INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING LOST PROFITS, ARISING OUT OF THE
USE OF THIS SOFTWARE AND ITS DOCUMENTATION, EVEN IF DALIBO HAS BEEN ADVISED OF
THE POSSIBILITY OF SUCH DAMAGE.
DALIBO SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE
SOFTWARE PROVIDED HEREUNDER IS ON AN "AS IS" BASIS, AND DALIBO HAS NO
OBLIGATIONS TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR
MODIFICATIONS.

MANIFEST.in

@@ -1,10 +0,0 @@
include *.md
include mypy.ini
include pytest.ini
include tox.ini
include .coveragerc
include .flake8
include pyproject.toml
recursive-include docs *.sh
recursive-include tests *.json
recursive-include tests *.py

README.md

@@ -1,514 +0,0 @@
# check_patroni
A Nagios plugin for Patroni.
## Features
- Check presence of leader, replicas, node counts.
- Check each node for replication status.
```
Usage: check_patroni [OPTIONS] COMMAND [ARGS]...
Nagios plugin that uses Patroni's REST API to monitor a Patroni cluster.
Options:
--config FILE Read option defaults from the specified INI file
[default: config.ini]
-e, --endpoints TEXT Patroni API endpoint. Can be specified multiple times
or as a list of comma separated addresses. The node
services check the status of one node, therefore if
several addresses are specified they should point to
different interfaces on the same node. The cluster
services check the status of the cluster, therefore
it's better to give a list of all Patroni node
addresses. [default: http://127.0.0.1:8008]
--cert_file PATH File with the client certificate.
--key_file PATH File with the client key.
--ca_file PATH The CA certificate.
-v, --verbose Increase verbosity -v (info)/-vv (warning)/-vvv
(debug)
--version
--timeout INTEGER Timeout in seconds for the API queries (0 to disable)
[default: 2]
--help Show this message and exit.
Commands:
cluster_config_has_changed Check if the hash of the configuration...
cluster_has_leader Check if the cluster has a leader.
cluster_has_replica Check if the cluster has healthy replicas...
cluster_has_scheduled_action Check if the cluster has a scheduled...
cluster_is_in_maintenance Check if the cluster is in maintenance...
cluster_node_count Count the number of nodes in the cluster.
node_is_alive Check if the node is alive ie patroni is...
node_is_leader Check if the node is a leader node.
node_is_pending_restart Check if the node is in pending restart...
node_is_primary Check if the node is the primary with the...
node_is_replica Check if the node is a replica with no...
node_patroni_version Check if the version is equal to the input
node_tl_has_changed Check if the timeline has changed.
```
## Install
check_patroni is licensed under the PostgreSQL license.
```
$ pip install git+https://github.com/dalibo/check_patroni.git
```
check_patroni works on Python 3.6; we keep it that way because Patroni also
supports it and there are still lots of RH 7 variants around. That being said,
Python 3.6 has been EOL for ages and there is no support for it in the GitHub
CI.
## Support
If you hit a bug or need help, open a [GitHub
issue](https://github.com/dalibo/check_patroni/issues/new). Dalibo has no
commitment on response time for public free support. Thanks for your
contribution!
## Config file
All global and service specific parameters can be specified via a config file as follows:
```
[options]
endpoints = https://10.20.199.3:8008, https://10.20.199.4:8008,https://10.20.199.5:8008
cert_file = ./ssl/my-cert.pem
key_file = ./ssl/my-key.pem
ca_file = ./ssl/CA-cert.pem
timeout = 0
[options.node_is_replica]
lag=100
```
## Thresholds
The format for the threshold parameters is `[@][start:][end]`.
* `start:` may be omitted if `start == 0`
* `~:` means that start is negative infinity
* If `end` is omitted, infinity is assumed
* To invert the match condition, prefix the range expression with `@`.
A match is found when: `start <= VALUE <= end`.
For example, the following command will raise:
* a warning if there are fewer than 2 replicas, which translates to outside of range [2;+INF[
* a critical if there are no replicas (fewer than 1), which translates to outside of range [1;+INF[
```
check_patroni -e https://10.20.199.3:8008 cluster_has_replica --warning 2: --critical 1:
```
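A few more range expressions and the values that trigger an alert, following the rules above:
```
2       alert if the value is outside [0;2]
2:      alert if the value is strictly below 2
~:2     alert if the value is above 2
@2:4    alert if the value is inside [2;4]
```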
## SSL
Several options are available (a combined example follows the list):
* the server's CA certificate is not available or trusted by the client system:
* `--ca_file`: your certification chain `cat CA-certificate server-certificate > cabundle`
* you have a client certificate for authenticating with Patroni's REST API:
* `--cert_file`: your certificate or the concatenation of your certificate and private key
* `--key_file`: your private key (optional)
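For example, reusing the file layout from the config file section above, a check over TLS with client-certificate authentication looks like:
```
check_patroni -e https://10.20.199.3:8008 --cert_file ./ssl/my-cert.pem --key_file ./ssl/my-key.pem --ca_file ./ssl/CA-cert.pem cluster_has_leader
```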
## Shell completion
We use the [click] library which supports shell completion natively.
Shell completion can be added by typing the following command or adding it to
a file specific to your shell of choice.
* for Bash (add to `~/.bashrc`):
```
eval "$(_CHECK_PATRONI_COMPLETE=bash_source check_patroni)"
```
* for Zsh (add to `~/.zshrc`):
```
eval "$(_CHECK_PATRONI_COMPLETE=zsh_source check_patroni)"
```
* for Fish (add to `~/.config/fish/completions/check_patroni.fish`):
```
eval "$(_CHECK_PATRONI_COMPLETE=fish_source check_patroni)"
```
Please note that shell completion is not supported for all shell versions; for
example, only Bash versions 4.4 and newer are supported.
[click]: https://click.palletsprojects.com/en/8.1.x/shell-completion/
## Cluster services
### cluster_config_has_changed
```
Usage: check_patroni cluster_config_has_changed [OPTIONS]
Check if the hash of the configuration has changed.
Note: either a hash or a state file must be provided for this service to
work.
Check:
* `OK`: The hash didn't change
* `CRITICAL`: The hash of the configuration has changed compared to the input (`--hash`) or last time (`--state-file`)
Perfdata:
* `is_configuration_changed` is 1 if the configuration has changed
Options:
--hash TEXT A hash to compare with.
-s, --state-file TEXT A state file to store the hash of the configuration.
--save Set the current configuration hash as the reference
for future calls.
--help Show this message and exit.
```
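For example (the state file path is arbitrary), run once with `--save` to record the reference hash, then without it to detect changes:
```
check_patroni -e https://10.20.199.3:8008 cluster_config_has_changed -s /var/tmp/config.state --save
check_patroni -e https://10.20.199.3:8008 cluster_config_has_changed -s /var/tmp/config.state
```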
### cluster_has_leader
```
Usage: check_patroni cluster_has_leader [OPTIONS]
Check if the cluster has a leader.
This check applies to any kind of leaders including standby leaders.
A leader is a node with the "leader" role and a "running" state.
A standby leader is a node with a "standby_leader" role and a "streaming" or
"in archive recovery" state. Please note that log shipping could be stuck
because the WAL are not available or applicable. Patroni doesn't provide
information about the origin cluster (timeline or lag), so we cannot check
if there is a problem in that particular case. That's why we issue a warning
when the node is "in archive recovery". We suggest using other supervision
tools to do this (eg. check_pgactivity).
Check:
* `OK`: if there is a leader node.
* `WARNING`: if there is a standby leader in archive mode.
* `CRITICAL`: otherwise.
Perfdata:
* `has_leader` is 1 if there is any kind of leader node, 0 otherwise
* `is_standby_leader_in_arc_rec` is 1 if the standby leader node is "in
archive recovery", 0 otherwise
* `is_standby_leader` is 1 if there is a standby leader node, 0 otherwise
* `is_leader` is 1 if there is a "classical" leader node, 0 otherwise
Options:
--help Show this message and exit.
```
### cluster_has_replica
```
Usage: check_patroni cluster_has_replica [OPTIONS]
Check if the cluster has healthy replicas and/or if some are sync standbies
For patroni (and this check):
* a replica is `streaming` if the `pg_stat_wal_receiver` says so.
* a replica is `in archive recovery`, if it's not `streaming` and has a `restore_command`.
A healthy replica:
* has a `replica` or `sync_standby` role
* has the same timeline as the leader and
* is in `running` state (patroni < V3.0.4)
* is in `streaming` or `in archive recovery` state (patroni >= V3.0.4)
* has a lag lower or equal to `max_lag`
Please note that replica `in archive recovery` could be stuck because the
WAL are not available or applicable (the server's timeline has diverged from
the leader's). We already detect the latter but we will miss the former.
Therefore, it's preferable to check for the lag in addition to the healthy
state if you rely on log shipping to help lagging standbies to catch up.
Since we require a healthy replica to have the same timeline as the leader,
it's possible that we raise alerts when the cluster is performing a
switchover or failover and the standbies are in the process of catching up
with the new leader. The alert shouldn't last long.
Check:
* `OK`: if the healthy_replica count and their lag are compatible with the replica count threshold.
and if the sync_replica count is compatible with the sync replica count threshold.
* `WARNING` / `CRITICAL`: otherwise
Perfdata:
* healthy_replica & unhealthy_replica count
* the number of sync_replica, they are included in the previous count
* the lag of each replica labelled with "member name"_lag
* the timeline of each replica labelled with "member name"_timeline
* a boolean to tell if the node is a sync standby labelled with "member name"_sync
Options:
-w, --warning TEXT Warning threshold for the number of healthy replica
nodes.
-c, --critical TEXT Critical threshold for the number of healthy replica
nodes.
--sync-warning TEXT Warning threshold for the number of sync replica.
--sync-critical TEXT Critical threshold for the number of sync replica.
--max-lag TEXT maximum allowed lag
--help Show this message and exit.
```
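For example (illustrative thresholds), to warn when there are fewer than two healthy replicas, go critical when there are none, and warn when no sync standby is left:
```
check_patroni -e https://10.20.199.3:8008 cluster_has_replica -w 2: -c 1: --sync-warning 1:
```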
### cluster_has_scheduled_action
```
Usage: check_patroni cluster_has_scheduled_action [OPTIONS]
Check if the cluster has a scheduled action (switchover or restart)
Check:
* `OK`: If the cluster has no scheduled action
* `CRITICAL`: otherwise.
Perfdata:
* `scheduled_actions` is 1 if the cluster has scheduled actions.
* `scheduled_switchover` is 1 if the cluster has a scheduled switchover.
* `scheduled_restart` counts the number of scheduled restarts in the cluster.
Options:
--help Show this message and exit.
```
### cluster_is_in_maintenance
```
Usage: check_patroni cluster_is_in_maintenance [OPTIONS]
Check if the cluster is in maintenance mode or paused.
Check:
* `OK`: If the cluster is not in maintenance mode (i.e. not paused).
* `CRITICAL`: otherwise.
Perfdata:
* `is_in_maintenance` is 1 if the cluster is in maintenance mode, 0 otherwise
Options:
--help Show this message and exit.
```
### cluster_node_count
```
Usage: check_patroni cluster_node_count [OPTIONS]
Count the number of nodes in the cluster.
The role refers to the role of the server in the cluster. Possible values
are:
* master or leader
* replica
* standby_leader
* sync_standby
* demoted
* promoted
* uninitialized
The state refers to the state of PostgreSQL. Possible values are:
* initializing new cluster, initdb failed
* running custom bootstrap script, custom bootstrap failed
* starting, start failed
* restarting, restart failed
* running, streaming, in archive recovery
* stopping, stopped, stop failed
* creating replica
* crashed
The "healthy" checks only ensures that:
* a leader has the running state
* a standby_leader has the running or streaming (V3.0.4) state
* a replica or sync-standby has the running or streaming (V3.0.4) state
Since we don't check the lag or timeline, "in archive recovery" is not
considered a valid state for this service. See cluster_has_leader and
cluster_has_replica for specialized checks.
Check:
* Compares the number of nodes against the normal and healthy nodes warning and critical thresholds.
* `OK`: If they are not provided.
Perfdata:
* `members`: the member count.
* `healthy_members`: the running and streaming member count.
* all the roles of the nodes in the cluster with their count (start with "role_").
* all the statuses of the nodes in the cluster with their count (start with "state_").
Options:
-w, --warning TEXT Warning threshold for the number of nodes.
-c, --critical TEXT Critical threshold for the number of nodes.
--healthy-warning TEXT Warning threshold for the number of healthy nodes
(running + streaming).
--healthy-critical TEXT Critical threshold for the number of healthy nodes
(running + streaming).
--help Show this message and exit.
```
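For example (illustrative thresholds for a three-node cluster):
```
check_patroni -e https://10.20.199.3:8008 cluster_node_count -w 3: --healthy-warning 3: --healthy-critical 2:
```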
## Node services
### node_is_alive
```
Usage: check_patroni node_is_alive [OPTIONS]
Check if the node is alive, i.e. Patroni is running. This is a liveness check
as defined in Patroni's documentation.
Check:
* `OK`: If patroni is running.
* `CRITICAL`: otherwise.
Perfdata:
* `is_running` is 1 if patroni is running, 0 otherwise
Options:
--help Show this message and exit.
```
### node_is_pending_restart
```
Usage: check_patroni node_is_pending_restart [OPTIONS]
Check if the node is in pending restart state.
This situation can arise if the configuration has been modified but requires
a restart of PostgreSQL to take effect.
Check:
* `OK`: if the node has no pending restart tag.
* `CRITICAL`: otherwise
Perfdata: `is_pending_restart` is 1 if the node has pending restart tag, 0
otherwise.
Options:
--help Show this message and exit.
```
### node_is_leader
```
Usage: check_patroni node_is_leader [OPTIONS]
Check if the node is a leader node.
This check applies to any kind of leaders including standby leaders. To
check explicitly for a standby leader use the `--is-standby-leader` option.
Check:
* `OK`: if the node is a leader.
* `CRITICAL`: otherwise
Perfdata: `is_leader` is 1 if the node is a leader node, 0 otherwise.
Options:
--is-standby-leader Check for a standby leader
--help Show this message and exit.
```
### node_is_primary
```
Usage: check_patroni node_is_primary [OPTIONS]
Check if the node is the primary with the leader lock.
This service is not valid for a standby leader, because this kind of node is
not a primary.
Check:
* `OK`: if the node is a primary with the leader lock.
* `CRITICAL`: otherwise
Perfdata: `is_primary` is 1 if the node is a primary with the leader lock, 0
otherwise.
Options:
--help Show this message and exit.
```
### node_is_replica
```
Usage: check_patroni node_is_replica [OPTIONS]
Check if the node is a replica with no noloadbalance tag.
It is possible to check if the node is synchronous or asynchronous. If
nothing is specified any kind of replica is accepted. When checking for a
synchronous replica, it's not possible to specify a lag.
This service uses the following Patroni endpoints: replica, asynchronous
and synchronous. The first two implement the `lag` tag. For these endpoints
the state of a replica node doesn't reflect the replication state
(`streaming` or `in archive recovery`); we only know if it's `running`. The
timeline is also not checked.
Therefore, if a cluster is using asynchronous replication, it is recommended
to check for the lag to detect a divergence as soon as possible.
Check:
* `OK`: if the node is a running replica with no noloadbalance tag and the lag is under the maximum threshold.
* `CRITICAL`: otherwise
Perfdata: `is_replica` is 1 if the node is a running replica with no
noloadbalance tag and the lag is under the maximum threshold, 0 otherwise.
Options:
--max-lag TEXT maximum allowed lag
--is-sync check if the replica is synchronous
--is-async check if the replica is asynchronous
--help Show this message and exit.
```
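For example, to accept any kind of replica as long as its lag stays under an illustrative 10MB threshold:
```
check_patroni -e https://10.20.199.3:8008 node_is_replica --max-lag 10MB
```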
### node_patroni_version
```
Usage: check_patroni node_patroni_version [OPTIONS]
Check if the version is equal to the input
Check:
* `OK`: The version is the same as the input `--patroni-version`
* `CRITICAL`: otherwise.
Perfdata:
* `is_version_ok` is 1 if version is ok, 0 otherwise
Options:
--patroni-version TEXT Patroni version to compare to [required]
--help Show this message and exit.
```
### node_tl_has_changed
```
Usage: check_patroni node_tl_has_changed [OPTIONS]
Check if the timeline has changed.
Note: either a timeline or a state file must be provided for this service to
work.
Check:
* `OK`: The timeline is the same as last time (`--state-file`) or the provided timeline (`--timeline`)
* `CRITICAL`: The tl is not the same.
Perfdata:
* `is_timeline_changed` is 1 if the tl has changed, 0 otherwise
* the timeline
Options:
--timeline TEXT A timeline number to compare with.
-s, --state-file TEXT A state file to store the last tl number into.
--save Set the current timeline number as the reference for
future calls.
--help Show this message and exit.
```
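For example (timeline number and state file path are illustrative), either compare against a known timeline or track it in a state file:
```
check_patroni -e https://10.20.199.3:8008 node_tl_has_changed --timeline 3
check_patroni -e https://10.20.199.3:8008 node_tl_has_changed -s /var/tmp/tl.state --save
```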

RELEASE.md

@@ -1,38 +0,0 @@
# Release HOW TO
## Preparatory changes
* Review the **Unreleased** section, if any, in `CHANGELOG.md` possibly adding
any missing items from closed issues, merged pull requests, or the git
history directly[^git-changes],
* Rename the **Unreleased** section according to the version to be released,
with a date,
* Bump the version in `check_patroni/__init__.py`,
* Rebuild the `README.md` (`cd docs; ./make_readme.sh`),
* Commit these changes (either on a dedicated branch, before submitting a pull
request or directly on the `master` branch) with the commit message `release
X.Y.Z`.
* Then, when changes landed in the `master` branch, create an annotated (and
possibly signed) tag, as `git tag -a [-s] -m 'release X.Y.Z' vX.Y.Z`,
and,
* Push with `--follow-tags`.
[^git-changes]: Use `git log $(git describe --tags --abbrev=0).. --format=%s
--reverse` to get commits from the previous tag.
## PyPI package
The package is generated and uploaded to PyPI when a `v*` tag is created (see
`.github/workflows/publish.yml`).
Alternatively, the release can be done manually with:
```
tox -e build
tox -e upload
```
## GitHub release
Draft a new release from the release page, choosing the tag just pushed and
copying the relevant change log section as a description.

Binary file not shown.


@@ -0,0 +1 @@
b2b3623e4494aa159395ea754eaeb599bd3e73cf

Binary file not shown.


@@ -0,0 +1 @@
a53d3e6184ae38be80088911068e7b30d2ac2b22

check_patroni/__init__.py

@@ -1,5 +0,0 @@
import logging
__version__ = "2.0.0"
_log: logging.Logger = logging.getLogger(__name__)

check_patroni/__main__.py

@@ -1,4 +0,0 @@
from .cli import main
if __name__ == "__main__":
main()

check_patroni/cli.py

@@ -1,809 +0,0 @@
import logging
import re
from configparser import ConfigParser
from typing import List
import click
import nagiosplugin
from . import __version__, _log
from .cluster import (
ClusterConfigHasChanged,
ClusterConfigHasChangedSummary,
ClusterHasLeader,
ClusterHasLeaderSummary,
ClusterHasReplica,
ClusterHasScheduledAction,
ClusterIsInMaintenance,
ClusterNodeCount,
)
from .convert import size_to_byte
from .node import (
NodeIsAlive,
NodeIsAliveSummary,
NodeIsLeader,
NodeIsLeaderSummary,
NodeIsPendingRestart,
NodeIsPendingRestartSummary,
NodeIsPrimary,
NodeIsPrimarySummary,
NodeIsReplica,
NodeIsReplicaSummary,
NodePatroniVersion,
NodePatroniVersionSummary,
NodeTLHasChanged,
NodeTLHasChangedSummary,
)
from .types import ConnectionInfo, Parameters
DEFAULT_CFG = "config.ini"
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))
_log.addHandler(handler)
def print_version(ctx: click.Context, param: str, value: str) -> None:
if not value or ctx.resilient_parsing:
return
click.echo(f"Version {__version__}")
ctx.exit()
def configure(ctx: click.Context, param: str, filename: str) -> None:
"""Use a config file for the parameters
stolen from https://jwodder.github.io/kbits/posts/click-config/
"""
# FIXME should use click-configfile / click-config-file ?
cfg = ConfigParser()
cfg.read(filename)
ctx.default_map = {}
for sect in cfg.sections():
command_path = sect.split(".")
if command_path[0] != "options":
continue
defaults = ctx.default_map
for cmdname in command_path[1:]:
defaults = defaults.setdefault(cmdname, {})
defaults.update(cfg[sect])
try:
# endpoints is an array of addresses separated by ,
if isinstance(defaults["endpoints"], str):
defaults["endpoints"] = re.split(r"\s*,\s*", defaults["endpoints"])
except KeyError:
pass
@click.group()
@click.option(
"--config",
type=click.Path(dir_okay=False),
default=DEFAULT_CFG,
callback=configure,
is_eager=True,
expose_value=False,
help="Read option defaults from the specified INI file",
show_default=True,
)
@click.option(
"-e",
"--endpoints",
"endpoints",
type=str,
multiple=True,
default=["http://127.0.0.1:8008"],
help=(
"Patroni API endpoint. Can be specified multiple times or as a list "
"of comma separated addresses. "
"The node services checks the status of one node, therefore if "
"several addresses are specified they should point to different "
"interfaces on the same node. The cluster services check the "
"status of the cluster, therefore it's better to give a list of "
"all Patroni node addresses."
),
show_default=True,
)
@click.option(
"--cert_file",
"cert_file",
type=click.Path(exists=True),
default=None,
help="File with the client certificate.",
)
@click.option(
"--key_file",
"key_file",
type=click.Path(exists=True),
default=None,
help="File with the client key.",
)
@click.option(
"--ca_file",
"ca_file",
type=click.Path(exists=True),
default=None,
help="The CA certificate.",
)
@click.option(
"-v",
"--verbose",
"verbose",
count=True,
default=0,
help="Increase verbosity -v (info)/-vv (warning)/-vvv (debug)",
show_default=False,
)
@click.option(
"--version", is_flag=True, callback=print_version, expose_value=False, is_eager=True
)
@click.option(
"--timeout",
"timeout",
default=2,
type=int,
help="Timeout in seconds for the API queries (0 to disable)",
show_default=True,
)
@click.pass_context
@nagiosplugin.guarded
def main(
ctx: click.Context,
endpoints: List[str],
cert_file: str,
key_file: str,
ca_file: str,
verbose: int,
timeout: int,
) -> None:
"""Nagios plugin that uses Patroni's REST API to monitor a Patroni cluster."""
# FIXME Not all "is/has" services have the same return code for ok. Check if it's ok
# We use this to pass parameters instead of ctx.parent.params because the
# latter is typed as Optional[Context] and mypy complains with the following
# error unless we test if ctx.parent is none which looked ugly.
#
# error: Item "None" of "Optional[Context]" has an attribute "params" [union-attr]
# The config file allows endpoints to be specified as a comma separated list of endpoints
# To avoid confusion, we allow the same in command line parameters
tendpoints: List[str] = []
for e in endpoints:
tendpoints += re.split(r"\s*,\s*", e)
endpoints = tendpoints
if verbose == 3:
logging.getLogger("urllib3").addHandler(handler)
logging.getLogger("urllib3").setLevel(logging.DEBUG)
_log.setLevel(logging.DEBUG)
connection_info: ConnectionInfo
if cert_file is None and key_file is None:
connection_info = ConnectionInfo(endpoints, None, ca_file)
else:
connection_info = ConnectionInfo(endpoints, (cert_file, key_file), ca_file)
ctx.obj = Parameters(
connection_info,
timeout,
verbose,
)
@main.command(name="cluster_node_count") # required otherwise _ are converted to -
@click.option(
"-w",
"--warning",
"warning",
type=str,
help="Warning threshold for the number of nodes.",
)
@click.option(
"-c",
"--critical",
"critical",
type=str,
help="Critical threshold for the number of nodes.",
)
@click.option(
"--healthy-warning",
"healthy_warning",
type=str,
help="Warning threshold for the number of healthy nodes (running + streaming).",
)
@click.option(
"--healthy-critical",
"healthy_critical",
type=str,
help="Critical threshold for the number of healthy nodes (running + streaming).",
)
@click.pass_context
@nagiosplugin.guarded
def cluster_node_count(
ctx: click.Context,
warning: str,
critical: str,
healthy_warning: str,
healthy_critical: str,
) -> None:
"""Count the number of nodes in the cluster.
\b
The role refers to the role of the server in the cluster. Possible values
are:
* master or leader
* replica
* standby_leader
* sync_standby
* demoted
* promoted
* uninitialized
\b
The state refers to the state of PostgreSQL. Possible values are:
* initializing new cluster, initdb failed
* running custom bootstrap script, custom bootstrap failed
* starting, start failed
* restarting, restart failed
* running, streaming, in archive recovery
* stopping, stopped, stop failed
* creating replica
* crashed
\b
The "healthy" checks only ensures that:
* a leader has the running state
* a standby_leader has the running or streaming (V3.0.4) state
* a replica or sync-standby has the running or streaming (V3.0.4) state
Since we don't check the lag or timeline, "in archive recovery" is not considered a valid state
for this service. See cluster_has_leader and cluster_has_replica for specialized checks.
\b
Check:
* Compares the number of nodes against the normal and healthy nodes warning and critical thresholds.
* `OK`: If they are not provided.
\b
Perfdata:
* `members`: the member count.
* `healthy_members`: the running and streaming member count.
* all the roles of the nodes in the cluster with their count (start with "role_").
* all the statuses of the nodes in the cluster with their count (start with "state_").
"""
check = nagiosplugin.Check()
check.add(
ClusterNodeCount(ctx.obj.connection_info),
nagiosplugin.ScalarContext(
"members",
warning,
critical,
),
nagiosplugin.ScalarContext(
"healthy_members",
healthy_warning,
healthy_critical,
),
nagiosplugin.ScalarContext("member_roles"),
nagiosplugin.ScalarContext("member_statuses"),
)
check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout)
@main.command(name="cluster_has_leader")
@click.pass_context
@nagiosplugin.guarded
def cluster_has_leader(ctx: click.Context) -> None:
"""Check if the cluster has a leader.
This check applies to any kind of leaders including standby leaders.
A leader is a node with the "leader" role and a "running" state.
A standby leader is a node with a "standby_leader" role and a "streaming"
or "in archive recovery" state. Please note that log shipping could be
stuck because the WAL are not available or applicable. Patroni doesn't
provide information about the origin cluster (timeline or lag), so we
cannot check if there is a problem in that particular case. That's why we
issue a warning when the node is "in archive recovery". We suggest using
other supervision tools to do this (eg. check_pgactivity).
\b
Check:
* `OK`: if there is a leader node.
* `WARNING`: if there is a standby leader in archive mode.
* `CRITICAL`: otherwise.
\b
Perfdata:
* `has_leader` is 1 if there is any kind of leader node, 0 otherwise
* `is_standby_leader_in_arc_rec` is 1 if the standby leader node is "in
archive recovery", 0 otherwise
* `is_standby_leader` is 1 if there is a standby leader node, 0 otherwise
* `is_leader` is 1 if there is a "classical" leader node, 0 otherwise
"""
check = nagiosplugin.Check()
check.add(
ClusterHasLeader(ctx.obj.connection_info),
nagiosplugin.ScalarContext("has_leader", None, "@0:0"),
nagiosplugin.ScalarContext("is_standby_leader_in_arc_rec", "@1:1", None),
nagiosplugin.ScalarContext("is_leader", None, None),
nagiosplugin.ScalarContext("is_standby_leader", None, None),
ClusterHasLeaderSummary(),
)
check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout)
@main.command(name="cluster_has_replica")
@click.option(
"-w",
"--warning",
"warning",
type=str,
help="Warning threshold for the number of healthy replica nodes.",
)
@click.option(
"-c",
"--critical",
"critical",
type=str,
help="Critical threshold for the number of healthy replica nodes.",
)
@click.option(
"--sync-warning",
"sync_warning",
type=str,
help="Warning threshold for the number of sync replica.",
)
@click.option(
"--sync-critical",
"sync_critical",
type=str,
help="Critical threshold for the number of sync replica.",
)
@click.option("--max-lag", "max_lag", type=str, help="maximum allowed lag")
@click.pass_context
@nagiosplugin.guarded
def cluster_has_replica(
ctx: click.Context,
warning: str,
critical: str,
sync_warning: str,
sync_critical: str,
max_lag: str,
) -> None:
"""Check if the cluster has healthy replicas and/or if some are sync standbies
\b
For patroni (and this check):
* a replica is `streaming` if the `pg_stat_wal_receiver` says so.
* a replica is `in archive recovery`, if it's not `streaming` and has a `restore_command`.
\b
A healthy replica:
* has a `replica` or `sync_standby` role
* has the same timeline as the leader and
* is in `running` state (patroni < V3.0.4)
* is in `streaming` or `in archive recovery` state (patroni >= V3.0.4)
* has a lag lower or equal to `max_lag`
Please note that replica `in archive recovery` could be stuck because the WAL
are not available or applicable (the server's timeline has diverged from the
leader's). We already detect the latter but we will miss the former.
Therefore, it's preferable to check for the lag in addition to the healthy
state if you rely on log shipping to help lagging standbies to catch up.
Since we require a healthy replica to have the same timeline as the
leader, it's possible that we raise alerts when the cluster is performing a
switchover or failover and the standbies are in the process of catching up with
the new leader. The alert shouldn't last long.
\b
Check:
* `OK`: if the healthy_replica count and their lag are compatible with the replica count threshold.
and if the sync_replica count is compatible with the sync replica count threshold.
* `WARNING` / `CRITICAL`: otherwise
\b
Perfdata:
* healthy_replica & unhealthy_replica count
* the number of sync_replica, they are included in the previous count
* the lag of each replica labelled with "member name"_lag
* the timeline of each replica labelled with "member name"_timeline
* a boolean to tell if the node is a sync standby labelled with "member name"_sync
"""
tmax_lag = size_to_byte(max_lag) if max_lag is not None else None
check = nagiosplugin.Check()
check.add(
ClusterHasReplica(ctx.obj.connection_info, tmax_lag),
nagiosplugin.ScalarContext(
"healthy_replica",
warning,
critical,
),
nagiosplugin.ScalarContext(
"sync_replica",
sync_warning,
sync_critical,
),
nagiosplugin.ScalarContext("unhealthy_replica"),
nagiosplugin.ScalarContext("replica_lag"),
nagiosplugin.ScalarContext("replica_timeline"),
nagiosplugin.ScalarContext("replica_sync"),
)
check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout)
@main.command(name="cluster_config_has_changed")
@click.option("--hash", "config_hash", type=str, help="A hash to compare with.")
@click.option(
"-s",
"--state-file",
"state_file",
type=str,
help="A state file to store the hash of the configuration.",
)
@click.option(
"--save",
"save_config",
is_flag=True,
default=False,
help="Set the current configuration hash as the reference for future calls.",
)
@click.pass_context
@nagiosplugin.guarded
def cluster_config_has_changed(
ctx: click.Context, config_hash: str, state_file: str, save_config: bool
) -> None:
"""Check if the hash of the configuration has changed.
Note: either a hash or a state file must be provided for this service to work.
\b
Check:
* `OK`: The hash didn't change
* `CRITICAL`: The hash of the configuration has changed compared to the input (`--hash`) or last time (`--state-file`)
\b
Perfdata:
* `is_configuration_changed` is 1 if the configuration has changed
"""
# Note: hash cannot be in the perf data = not a number
if (config_hash is None and state_file is None) or (
config_hash is not None and state_file is not None
):
raise click.UsageError(
"Either --hash or --state-file should be provided for this service", ctx
)
old_config_hash = config_hash
if state_file is not None:
cookie = nagiosplugin.Cookie(state_file)
cookie.open()
old_config_hash = cookie.get("hash")
cookie.close()
check = nagiosplugin.Check()
check.add(
ClusterConfigHasChanged(
ctx.obj.connection_info, old_config_hash, state_file, save_config
),
nagiosplugin.ScalarContext("is_configuration_changed", None, "@1:1"),
ClusterConfigHasChangedSummary(old_config_hash),
)
check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout)
@main.command(name="cluster_is_in_maintenance")
@click.pass_context
@nagiosplugin.guarded
def cluster_is_in_maintenance(ctx: click.Context) -> None:
"""Check if the cluster is in maintenance mode or paused.
\b
Check:
* `OK`: If the cluster is not in maintenance mode (i.e. not paused).
* `CRITICAL`: otherwise.
\b
Perfdata:
* `is_in_maintenance` is 1 if the cluster is in maintenance mode, 0 otherwise
"""
check = nagiosplugin.Check()
check.add(
ClusterIsInMaintenance(ctx.obj.connection_info),
nagiosplugin.ScalarContext("is_in_maintenance", None, "0:0"),
)
check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout)
@main.command(name="cluster_has_scheduled_action")
@click.pass_context
@nagiosplugin.guarded
def cluster_has_scheduled_action(ctx: click.Context) -> None:
"""Check if the cluster has a scheduled action (switchover or restart)
\b
Check:
* `OK`: If the cluster has no scheduled action
* `CRITICAL`: otherwise.
\b
Perfdata:
* `scheduled_actions` is 1 if the cluster has scheduled actions.
* `scheduled_switchover` is 1 if the cluster has a scheduled switchover.
* `scheduled_restart` counts the number of scheduled restarts in the cluster.
"""
check = nagiosplugin.Check()
check.add(
ClusterHasScheduledAction(ctx.obj.connection_info),
nagiosplugin.ScalarContext("has_scheduled_actions", None, "0:0"),
nagiosplugin.ScalarContext("scheduled_switchover"),
nagiosplugin.ScalarContext("scheduled_restart"),
)
check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout)
@main.command(name="node_is_primary")
@click.pass_context
@nagiosplugin.guarded
def node_is_primary(ctx: click.Context) -> None:
"""Check if the node is the primary with the leader lock.
This service is not valid for a standby leader, because this kind of node is not a primary.
\b
Check:
* `OK`: if the node is a primary with the leader lock.
* `CRITICAL`: otherwise
Perfdata: `is_primary` is 1 if the node is a primary with the leader lock, 0 otherwise.
"""
check = nagiosplugin.Check()
check.add(
NodeIsPrimary(ctx.obj.connection_info),
nagiosplugin.ScalarContext("is_primary", None, "@0:0"),
NodeIsPrimarySummary(),
)
check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout)
@main.command(name="node_is_leader")
@click.option(
"--is-standby-leader",
"check_standby_leader",
is_flag=True,
default=False,
help="Check for a standby leader",
)
@click.pass_context
@nagiosplugin.guarded
def node_is_leader(ctx: click.Context, check_standby_leader: bool) -> None:
"""Check if the node is a leader node.
This check applies to any kind of leaders including standby leaders.
To check explicitly for a standby leader use the `--is-standby-leader` option.
\b
Check:
* `OK`: if the node is a leader.
* `CRITICAL`: otherwise
Perfdata: `is_leader` is 1 if the node is a leader node, 0 otherwise.
"""
check = nagiosplugin.Check()
check.add(
NodeIsLeader(ctx.obj.connection_info, check_standby_leader),
nagiosplugin.ScalarContext("is_leader", None, "@0:0"),
NodeIsLeaderSummary(check_standby_leader),
)
check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout)
@main.command(name="node_is_replica")
@click.option("--max-lag", "max_lag", type=str, help="maximum allowed lag")
@click.option(
"--is-sync",
"check_is_sync",
is_flag=True,
default=False,
help="check if the replica is synchronous",
)
@click.option(
"--is-async",
"check_is_async",
is_flag=True,
default=False,
help="check if the replica is asynchronous",
)
@click.pass_context
@nagiosplugin.guarded
def node_is_replica(
ctx: click.Context, max_lag: str, check_is_sync: bool, check_is_async: bool
) -> None:
"""Check if the node is a replica with no noloadbalance tag.
It is possible to check if the node is synchronous or asynchronous. If
nothing is specified any kind of replica is accepted. When checking for a
synchronous replica, it's not possible to specify a lag.
This service uses the following Patroni endpoints: replica, asynchronous
and synchronous. The first two implement the `lag` tag. For these endpoints
the state of a replica node doesn't reflect the replication state
(`streaming` or `in archive recovery`); we only know if it's `running`. The
timeline is also not checked.
Therefore, if a cluster is using asynchronous replication, it is
recommended to check for the lag to detect a divergence as soon as possible.
\b
Check:
* `OK`: if the node is a running replica with no noloadbalance tag and the lag is under the maximum threshold.
* `CRITICAL`: otherwise
Perfdata: `is_replica` is 1 if the node is a running replica with no noloadbalance tag and the lag is under the maximum threshold, 0 otherwise.
"""
if check_is_sync and max_lag is not None:
raise click.UsageError(
"--is-sync and --max-lag cannot be provided at the same time for this service",
ctx,
)
if check_is_sync and check_is_async:
raise click.UsageError(
"--is-sync and --is-async cannot be provided at the same time for this service",
ctx,
)
check = nagiosplugin.Check()
check.add(
NodeIsReplica(ctx.obj.connection_info, max_lag, check_is_sync, check_is_async),
nagiosplugin.ScalarContext("is_replica", None, "@0:0"),
NodeIsReplicaSummary(max_lag, check_is_sync, check_is_async),
)
check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout)
@main.command(name="node_is_pending_restart")
@click.pass_context
@nagiosplugin.guarded
def node_is_pending_restart(ctx: click.Context) -> None:
"""Check if the node is in pending restart state.
This situation can arise if the configuration has been modified but
requires a restart of PostgreSQL to take effect.
\b
Check:
* `OK`: if the node has no pending restart tag.
* `CRITICAL`: otherwise
Perfdata: `is_pending_restart` is 1 if the node has pending restart tag, 0 otherwise.
"""
check = nagiosplugin.Check()
check.add(
NodeIsPendingRestart(ctx.obj.connection_info),
nagiosplugin.ScalarContext("is_pending_restart", None, "0:0"),
NodeIsPendingRestartSummary(),
)
check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout)
@main.command(name="node_tl_has_changed")
@click.option(
"--timeline", "timeline", type=str, help="A timeline number to compare with."
)
@click.option(
"-s",
"--state-file",
"state_file",
type=str,
help="A state file to store the last tl number into.",
)
@click.option(
"--save",
"save_tl",
is_flag=True,
default=False,
help="Set the current timeline number as the reference for future calls.",
)
@click.pass_context
@nagiosplugin.guarded
def node_tl_has_changed(
ctx: click.Context, timeline: str, state_file: str, save_tl: bool
) -> None:
"""Check if the timeline has changed.
Note: either a timeline or a state file must be provided for this service to work.
\b
Check:
* `OK`: The timeline is the same as last time (`--state-file`) or the provided timeline (`--timeline`)
* `CRITICAL`: The tl is not the same.
\b
Perfdata:
* `is_timeline_changed` is 1 if the tl has changed, 0 otherwise
* the timeline
"""
if (timeline is None and state_file is None) or (
timeline is not None and state_file is not None
):
raise click.UsageError(
"Either --timeline or --state-file should be provided for this service", ctx
)
old_timeline = timeline
if state_file is not None:
cookie = nagiosplugin.Cookie(state_file)
cookie.open()
old_timeline = cookie.get("timeline")
cookie.close()
check = nagiosplugin.Check()
check.add(
NodeTLHasChanged(ctx.obj.connection_info, old_timeline, state_file, save_tl),
nagiosplugin.ScalarContext("is_timeline_changed", None, "@1:1"),
nagiosplugin.ScalarContext("timeline"),
NodeTLHasChangedSummary(old_timeline),
)
check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout)
@main.command(name="node_patroni_version")
@click.option(
"--patroni-version",
"patroni_version",
type=str,
help="Patroni version to compare to",
required=True,
)
@click.pass_context
@nagiosplugin.guarded
def node_patroni_version(ctx: click.Context, patroni_version: str) -> None:
"""Check if the version is equal to the input
\b
Check:
* `OK`: The version is the same as the input `--patroni-version`
* `CRITICAL`: otherwise.
\b
Perfdata:
* `is_version_ok` is 1 if version is ok, 0 otherwise
"""
# TODO the version cannot be written in perfdata find something else ?
check = nagiosplugin.Check()
check.add(
NodePatroniVersion(ctx.obj.connection_info, patroni_version),
nagiosplugin.ScalarContext("is_version_ok", None, "@0:0"),
nagiosplugin.ScalarContext("patroni_version"),
NodePatroniVersionSummary(patroni_version),
)
check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout)
@main.command(name="node_is_alive")
@click.pass_context
@nagiosplugin.guarded
def node_is_alive(ctx: click.Context) -> None:
"""Check if the node is alive ie patroni is running. This is
a liveness check as defined in Patroni's documentation.
\b
Check:
* `OK`: If patroni is running.
* `CRITICAL`: otherwise.
\b
Perfdata:
* `is_running` is 1 if patroni is running, 0 otherwise
"""
check = nagiosplugin.Check()
check.add(
NodeIsAlive(ctx.obj.connection_info),
nagiosplugin.ScalarContext("is_alive", None, "@0:0"),
NodeIsAliveSummary(),
)
check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout)

check_patroni/cluster.py

@@ -1,340 +0,0 @@
import hashlib
import json
from collections import Counter
from typing import Any, Iterable, Union
import nagiosplugin
from . import _log
from .types import ConnectionInfo, PatroniResource, handle_unknown
def replace_chars(text: str) -> str:
return text.replace("'", "").replace(" ", "_")
class ClusterNodeCount(PatroniResource):
def probe(self) -> Iterable[nagiosplugin.Metric]:
def debug_member(member: Any, health: str) -> None:
_log.debug(
"Node %(node_name)s is %(health)s: role %(role)s state %(state)s.",
{
"node_name": member["name"],
"health": health,
"role": member["role"],
"state": member["state"],
},
)
# get the cluster info
item_dict = self.rest_api("cluster")
role_counters: Counter[str] = Counter()
roles = []
status_counters: Counter[str] = Counter()
statuses = []
healthy_member = 0
for member in item_dict["members"]:
state, role = member["state"], member["role"]
roles.append(replace_chars(role))
statuses.append(replace_chars(state))
if role == "leader" and state == "running":
healthy_member += 1
debug_member(member, "healthy")
continue
if role in ["standby_leader", "replica", "sync_standby"] and (
(self.has_detailed_states() and state == "streaming")
or (not self.has_detailed_states() and state == "running")
):
healthy_member += 1
debug_member(member, "healthy")
continue
debug_member(member, "unhealthy")
role_counters.update(roles)
status_counters.update(statuses)
# The actual check: members, healthy_members
yield nagiosplugin.Metric("members", len(item_dict["members"]))
yield nagiosplugin.Metric("healthy_members", healthy_member)
# The performance data : role
for role in role_counters:
yield nagiosplugin.Metric(
f"role_{role}", role_counters[role], context="member_roles"
)
# The performance data : statuses (except running)
for state in status_counters:
yield nagiosplugin.Metric(
f"state_{state}", status_counters[state], context="member_statuses"
)
class ClusterHasLeader(PatroniResource):
def probe(self) -> Iterable[nagiosplugin.Metric]:
item_dict = self.rest_api("cluster")
is_leader_found = False
is_standby_leader_found = False
is_standby_leader_in_arc_rec = False
for member in item_dict["members"]:
if member["role"] == "leader" and member["state"] == "running":
is_leader_found = True
break
if member["role"] == "standby_leader":
if member["state"] not in ["streaming", "in archive recovery"]:
# for patroni >= 3.0.4 any state would be wrong
# for patroni < 3.0.4 a state different from running would be wrong
if self.has_detailed_states() or member["state"] != "running":
continue
if member["state"] in ["in archive recovery"]:
is_standby_leader_in_arc_rec = True
is_standby_leader_found = True
break
return [
nagiosplugin.Metric(
"has_leader",
1 if is_leader_found or is_standby_leader_found else 0,
),
nagiosplugin.Metric(
"is_standby_leader_in_arc_rec",
1 if is_standby_leader_in_arc_rec else 0,
),
nagiosplugin.Metric(
"is_standby_leader",
1 if is_standby_leader_found else 0,
),
nagiosplugin.Metric(
"is_leader",
1 if is_leader_found else 0,
),
]
class ClusterHasLeaderSummary(nagiosplugin.Summary):
def ok(self, results: nagiosplugin.Result) -> str:
return "The cluster has a running leader."
@handle_unknown
def problem(self, results: nagiosplugin.Result) -> str:
return "The cluster has no running leader or the standby leader is in archive recovery."
class ClusterHasReplica(PatroniResource):
def __init__(self, connection_info: ConnectionInfo, max_lag: Union[int, None]):
super().__init__(connection_info)
self.max_lag = max_lag
def probe(self) -> Iterable[nagiosplugin.Metric]:
def debug_member(member: Any, health: str) -> None:
_log.debug(
"Node %(node_name)s is %(health)s: lag %(lag)s, state %(state)s, tl %(tl)s.",
{
"node_name": member["name"],
"health": health,
"lag": member["lag"],
"state": member["state"],
"tl": member["timeline"],
},
)
# get the cluster info
cluster_item_dict = self.rest_api("cluster")
replicas = []
healthy_replica = 0
unhealthy_replica = 0
sync_replica = 0
leader_tl = None
# Look for replicas
for member in cluster_item_dict["members"]:
if member["role"] in ["replica", "sync_standby"]:
if member["lag"] == "unknown":
# This could happen if the node is stopped
# nagiosplugin doesn't handle strings in perfstats
# so we have to ditch all the stats in that case
debug_member(member, "unhealthy")
unhealthy_replica += 1
continue
else:
replicas.append(
{
"name": member["name"],
"lag": member["lag"],
"timeline": member["timeline"],
"sync": 1 if member["role"] == "sync_standby" else 0,
}
)
# Get the leader tl if we haven't already
if leader_tl is None:
# If there are no leaders, we will loop here for all
# members because leader_tl will remain None. It's not
# a big deal since having no leader is rare.
for tmember in cluster_item_dict["members"]:
if tmember["role"] == "leader":
leader_tl = int(tmember["timeline"])
break
_log.debug(
"Patroni's leader_timeline is %(leader_tl)s",
{
"leader_tl": leader_tl,
},
)
# Test for an unhealthy replica
if (
self.has_detailed_states()
and not (
member["state"] in ["streaming", "in archive recovery"]
and int(member["timeline"]) == leader_tl
)
) or (
not self.has_detailed_states()
and not (
member["state"] == "running"
and int(member["timeline"]) == leader_tl
)
):
debug_member(member, "unhealthy")
unhealthy_replica += 1
continue
if member["role"] == "sync_standby":
sync_replica += 1
if self.max_lag is None or self.max_lag >= int(member["lag"]):
debug_member(member, "healthy")
healthy_replica += 1
else:
debug_member(member, "unhealthy")
unhealthy_replica += 1
# The actual check
yield nagiosplugin.Metric("healthy_replica", healthy_replica)
yield nagiosplugin.Metric("sync_replica", sync_replica)
# The performance data : unhealthy replica count, replicas lag
yield nagiosplugin.Metric("unhealthy_replica", unhealthy_replica)
for replica in replicas:
yield nagiosplugin.Metric(
f"{replica['name']}_lag", replica["lag"], context="replica_lag"
)
yield nagiosplugin.Metric(
f"{replica['name']}_timeline",
replica["timeline"],
context="replica_timeline",
)
yield nagiosplugin.Metric(
f"{replica['name']}_sync", replica["sync"], context="replica_sync"
)
# FIXME is this needed ??
# class ClusterHasReplicaSummary(nagiosplugin.Summary):
# def ok(self, results):
# def problem(self, results):
class ClusterConfigHasChanged(PatroniResource):
def __init__(
self,
connection_info: ConnectionInfo,
config_hash: str, # Always contains the old hash
state_file: str, # Only used to update the hash in the state_file (when needed)
save: bool = False, # Save the configuration
):
super().__init__(connection_info)
self.state_file = state_file
self.config_hash = config_hash
self.save = save
def probe(self) -> Iterable[nagiosplugin.Metric]:
item_dict = self.rest_api("config")
new_hash = hashlib.md5(json.dumps(item_dict).encode()).hexdigest()
_log.debug("save result: %(issave)s", {"issave": self.save})
old_hash = self.config_hash
if self.state_file is not None and self.save:
_log.debug(
"saving new hash to state file / cookie %(state_file)s",
{"state_file": self.state_file},
)
cookie = nagiosplugin.Cookie(self.state_file)
cookie.open()
cookie["hash"] = new_hash
cookie.commit()
cookie.close()
_log.debug(
"hash info: old hash %(old_hash)s, new hash %(new_hash)s",
{"old_hash": old_hash, "new_hash": new_hash},
)
return [
nagiosplugin.Metric(
"is_configuration_changed",
1 if new_hash != old_hash else 0,
)
]
class ClusterConfigHasChangedSummary(nagiosplugin.Summary):
def __init__(self, config_hash: str) -> None:
self.old_config_hash = config_hash
# Note: It would be helpful to display the old / new hash here. Unfortunately, it's not a metric.
# So we only have the old / expected one.
def ok(self, results: nagiosplugin.Result) -> str:
return f"The hash of patroni's dynamic configuration has not changed ({self.old_config_hash})."
@handle_unknown
def problem(self, results: nagiosplugin.Result) -> str:
return f"The hash of patroni's dynamic configuration has changed. The old hash was {self.old_config_hash}."
class ClusterIsInMaintenance(PatroniResource):
def probe(self) -> Iterable[nagiosplugin.Metric]:
item_dict = self.rest_api("cluster")
# The actual check
return [
nagiosplugin.Metric(
"is_in_maintenance",
1 if "pause" in item_dict and item_dict["pause"] else 0,
)
]
class ClusterHasScheduledAction(PatroniResource):
def probe(self) -> Iterable[nagiosplugin.Metric]:
item_dict = self.rest_api("cluster")
scheduled_switchover = 0
scheduled_restart = 0
if "scheduled_switchover" in item_dict:
scheduled_switchover = 1
for member in item_dict["members"]:
if "scheduled_restart" in member:
scheduled_restart += 1
# The actual check
yield nagiosplugin.Metric(
"has_scheduled_actions",
1 if (scheduled_switchover + scheduled_restart) > 0 else 0,
)
# The performance data: scheduled switchover and scheduled restart counts
yield nagiosplugin.Metric("scheduled_switchover", scheduled_switchover)
yield nagiosplugin.Metric("scheduled_restart", scheduled_restart)

View file

@ -1,59 +0,0 @@
import re
from typing import Tuple, Union
import click
def size_to_byte(value: str) -> int:
"""Convert any size to Byte
>>> size_to_byte('1TB')
1099511627776
>>> size_to_byte('5kB')
5120
>>> size_to_byte('.5kB')
512
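A bare value with no unit is interpreted as bytes:
>>> size_to_byte('512')
512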
>>> size_to_byte('.5 yoyo')
Traceback (most recent call last):
...
click.exceptions.BadParameter: Invalid unit for size .5 yoyo
"""
convert = {
"B": 1,
"kB": 1024,
"MB": 1024 * 1024,
"GB": 1024 * 1024 * 1024,
"TB": 1024 * 1024 * 1024 * 1024,
}
val, unit = strtod(value)
if val is None:
val = 1
if not unit:
# No unit: the value is already in bytes.
# We round because half bytes don't really make sense.
return round(val)
else:
try:
multiplicateur = convert[unit]
except KeyError:
raise click.BadParameter(f"Invalid unit for size {value}")
# We round because half bytes don't really make sense.
return round(val * multiplicateur)
DBL_RE = re.compile(r"^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?")
def strtod(value: str) -> Tuple[Union[float, None], Union[str, None]]:
"""As most as possible close equivalent of strtod(3) function used by postgres to parse parameter values.
>>> strtod(' A ') == (None, 'A')
True
"""
value = str(value).strip()
match = DBL_RE.match(value)
if match:
end = match.end()
return float(value[:end]), value[end:]
return None, value

View file

@ -1,247 +0,0 @@
from typing import Iterable
import nagiosplugin
from . import _log
from .types import APIError, ConnectionInfo, PatroniResource, handle_unknown
class NodeIsPrimary(PatroniResource):
def probe(self) -> Iterable[nagiosplugin.Metric]:
try:
self.rest_api("primary")
except APIError:
return [nagiosplugin.Metric("is_primary", 0)]
return [nagiosplugin.Metric("is_primary", 1)]
class NodeIsPrimarySummary(nagiosplugin.Summary):
def ok(self, results: nagiosplugin.Result) -> str:
return "This node is the primary with the leader lock."
@handle_unknown
def problem(self, results: nagiosplugin.Result) -> str:
return "This node is not the primary with the leader lock."
class NodeIsLeader(PatroniResource):
def __init__(
self, connection_info: ConnectionInfo, check_is_standby_leader: bool
) -> None:
super().__init__(connection_info)
self.check_is_standby_leader = check_is_standby_leader
def probe(self) -> Iterable[nagiosplugin.Metric]:
apiname = "leader"
if self.check_is_standby_leader:
apiname = "standby-leader"
try:
self.rest_api(apiname)
except APIError:
return [nagiosplugin.Metric("is_leader", 0)]
return [nagiosplugin.Metric("is_leader", 1)]
class NodeIsLeaderSummary(nagiosplugin.Summary):
def __init__(self, check_is_standby_leader: bool) -> None:
if check_is_standby_leader:
self.leader_kind = "standby leader"
else:
self.leader_kind = "leader"
def ok(self, results: nagiosplugin.Result) -> str:
return f"This node is a {self.leader_kind} node."
@handle_unknown
def problem(self, results: nagiosplugin.Result) -> str:
return f"This node is not a {self.leader_kind} node."
class NodeIsReplica(PatroniResource):
def __init__(
self,
connection_info: ConnectionInfo,
max_lag: str,
check_is_sync: bool,
check_is_async: bool,
) -> None:
super().__init__(connection_info)
self.max_lag = max_lag
self.check_is_sync = check_is_sync
self.check_is_async = check_is_async
def probe(self) -> Iterable[nagiosplugin.Metric]:
try:
if self.check_is_sync:
api_name = "synchronous"
elif self.check_is_async:
api_name = "asynchronous"
else:
api_name = "replica"
if self.max_lag is None:
self.rest_api(api_name)
else:
self.rest_api(f"{api_name}?lag={self.max_lag}")
except APIError:
return [nagiosplugin.Metric("is_replica", 0)]
return [nagiosplugin.Metric("is_replica", 1)]
class NodeIsReplicaSummary(nagiosplugin.Summary):
def __init__(self, lag: str, check_is_sync: bool, check_is_async: bool) -> None:
self.lag = lag
if check_is_sync:
self.replica_kind = "synchronous replica"
elif check_is_async:
self.replica_kind = "asynchronous replica"
else:
self.replica_kind = "replica"
def ok(self, results: nagiosplugin.Result) -> str:
if self.lag is None:
return (
f"This node is a running {self.replica_kind} with no noloadbalance tag."
)
return f"This node is a running {self.replica_kind} with no noloadbalance tag and the lag is under {self.lag}."
@handle_unknown
def problem(self, results: nagiosplugin.Result) -> str:
if self.lag is None:
return f"This node is not a running {self.replica_kind} with no noloadbalance tag."
return f"This node is not a running {self.replica_kind} with no noloadbalance tag and a lag under {self.lag}."
class NodeIsPendingRestart(PatroniResource):
def probe(self) -> Iterable[nagiosplugin.Metric]:
item_dict = self.rest_api("patroni")
is_pending_restart = item_dict.get("pending_restart", False)
return [
nagiosplugin.Metric(
"is_pending_restart",
1 if is_pending_restart else 0,
)
]
class NodeIsPendingRestartSummary(nagiosplugin.Summary):
def ok(self, results: nagiosplugin.Result) -> str:
return "This node doesn't have the pending restart flag."
@handle_unknown
def problem(self, results: nagiosplugin.Result) -> str:
return "This node has the pending restart flag."
class NodeTLHasChanged(PatroniResource):
def __init__(
self,
connection_info: ConnectionInfo,
timeline: str, # Always contains the old timeline
state_file: str, # Only used to update the timeline in the state_file (when needed)
save: bool, # save timeline in state file
) -> None:
super().__init__(connection_info)
self.state_file = state_file
self.timeline = timeline
self.save = save
def probe(self) -> Iterable[nagiosplugin.Metric]:
item_dict = self.rest_api("patroni")
new_tl = item_dict["timeline"]
_log.debug("save result: %(issave)s", {"issave": self.save})
old_tl = self.timeline
if self.state_file is not None and self.save:
_log.debug(
"saving new timeline to state file / cookie %(state_file)s",
{"state_file": self.state_file},
)
cookie = nagiosplugin.Cookie(self.state_file)
cookie.open()
cookie["timeline"] = new_tl
cookie.commit()
cookie.close()
_log.debug(
"Tl data: old tl %(old_tl)s, new tl %(new_tl)s",
{"old_tl": old_tl, "new_tl": new_tl},
)
# The actual check
yield nagiosplugin.Metric(
"is_timeline_changed",
1 if str(new_tl) != str(old_tl) else 0,
)
# The performance data: the timeline number
yield nagiosplugin.Metric("timeline", new_tl)
class NodeTLHasChangedSummary(nagiosplugin.Summary):
def __init__(self, timeline: str) -> None:
self.timeline = timeline
def ok(self, results: nagiosplugin.Result) -> str:
return f"The timeline is still {self.timeline}."
@handle_unknown
def problem(self, results: nagiosplugin.Result) -> str:
return f"The expected timeline was {self.timeline} got {results['timeline'].metric}."
class NodePatroniVersion(PatroniResource):
def __init__(self, connection_info: ConnectionInfo, patroni_version: str) -> None:
super().__init__(connection_info)
self.patroni_version = patroni_version
def probe(self) -> Iterable[nagiosplugin.Metric]:
item_dict = self.rest_api("patroni")
version = item_dict["patroni"]["version"]
_log.debug(
"Version data: patroni version %(version)s input version %(patroni_version)s",
{"version": version, "patroni_version": self.patroni_version},
)
# The actual check
return [
nagiosplugin.Metric(
"is_version_ok",
1 if version == self.patroni_version else 0,
)
]
class NodePatroniVersionSummary(nagiosplugin.Summary):
def __init__(self, patroni_version: str) -> None:
self.patroni_version = patroni_version
def ok(self, results: nagiosplugin.Result) -> str:
return f"Patroni's version is {self.patroni_version}."
@handle_unknown
def problem(self, results: nagiosplugin.Result) -> str:
# FIXME find a way to make the following work, check if perf data can be strings
# return f"The expected patroni version was {self.patroni_version} got {results['patroni_version'].metric}."
return f"Patroni's version is not {self.patroni_version}."
class NodeIsAlive(PatroniResource):
def probe(self) -> Iterable[nagiosplugin.Metric]:
try:
self.rest_api("liveness")
except APIError:
return [nagiosplugin.Metric("is_alive", 0)]
return [nagiosplugin.Metric("is_alive", 1)]
class NodeIsAliveSummary(nagiosplugin.Summary):
def ok(self, results: nagiosplugin.Result) -> str:
return "This node is alive (patroni is running)."
@handle_unknown
def problem(self, results: nagiosplugin.Result) -> str:
return "This node is not alive (patroni is not running)."

View file

@ -1,114 +0,0 @@
import json
from functools import lru_cache
from typing import Any, Callable, List, Optional, Tuple, Union
from urllib.parse import urlparse
import attr
import nagiosplugin
import requests
from . import _log
class APIError(requests.exceptions.RequestException):
"""This exception is raised when the rest api couldn't
be reached and we got a http status code different from 200.
"""
@attr.s(auto_attribs=True, frozen=True, slots=True)
class ConnectionInfo:
endpoints: List[str] = ["http://127.0.0.1:8008"]
cert: Optional[Union[str, Tuple[str, str]]] = None
ca_cert: Optional[str] = None
@attr.s(auto_attribs=True, frozen=True, slots=True)
class Parameters:
connection_info: ConnectionInfo
timeout: int
verbose: int
@attr.s(auto_attribs=True, eq=False, slots=True)
class PatroniResource(nagiosplugin.Resource):
conn_info: ConnectionInfo
def rest_api(self, service: str) -> Any:
"""Try to connect to all the provided endpoints for the requested service"""
for endpoint in self.conn_info.endpoints:
cert: Optional[Union[Tuple[str, str], str]] = None
verify: Optional[Union[str, bool]] = None
if urlparse(endpoint).scheme == "https":
if self.conn_info.cert is not None:
# we can have: a key + a cert or a single file with key and cert.
cert = self.conn_info.cert
if self.conn_info.ca_cert is not None:
verify = self.conn_info.ca_cert
_log.debug(
"Trying to connect to %(endpoint)s/%(service)s with cert: %(cert)s verify: %(verify)s",
{
"endpoint": endpoint,
"service": service,
"cert": cert,
"verify": verify,
},
)
try:
r = requests.get(f"{endpoint}/{service}", verify=verify, cert=cert)
except Exception as e:
_log.debug(e)
continue
# The status code is already displayed by urllib3
_log.debug(
"api call data: %(data)s", {"data": r.text if r.text else "<Empty>"}
)
if r.status_code != 200:
raise APIError(
f"Failed to connect to {endpoint}/{service} status code {r.status_code}"
)
try:
return r.json()
except (json.JSONDecodeError, ValueError):
return None
raise nagiosplugin.CheckError("Connection failed for all provided endpoints")
@lru_cache(maxsize=None)
def has_detailed_states(self) -> bool:
# get patroni's version to find out if the "streaming" and "in archive recovery" states are available
patroni_item_dict = self.rest_api("patroni")
if tuple(
int(v) for v in patroni_item_dict["patroni"]["version"].split(".", 2)
) >= (3, 0, 4):
_log.debug(
"Patroni's version is %(version)s, more detailed states can be used to check for the health of replicas.",
{"version": patroni_item_dict["patroni"]["version"]},
)
return True
_log.debug(
"Patroni's version is %(version)s, the running state and the timelines must be used to check for the health of replicas.",
{"version": patroni_item_dict["patroni"]["version"]},
)
return False
HandleUnknown = Callable[[nagiosplugin.Summary, nagiosplugin.Results], Any]
def handle_unknown(func: HandleUnknown) -> HandleUnknown:
"""decorator to handle the unknown state in Summary.problem"""
def wrapper(summary: nagiosplugin.Summary, results: nagiosplugin.Results) -> Any:
if results.most_significant[0].state.code == 3:
"""get the appropriate message for all unknown error"""
return results.most_significant[0].hint
return func(summary, results)
return wrapper

25
debian/changelog vendored
View file

@ -1,25 +0,0 @@
check-patroni (2.0.0-1~bpo12+1) bookworm-backports; urgency=medium
* Rebuild for bookworm-backports
-- David Prévot <dprevot@evolix.fr> Thu, 18 Apr 2024 16:10:08 +0200
check-patroni (2.0.0-1) unstable; urgency=medium
[ benoit ]
* cluster_has_replica: fix the way a healthy replica is detected
* Fix the cluster_has_leader service for standby clusters
* Fix cluster_node_count's management of replication states
* Fix cluster_has_leader in archive recovery tests
* Release V2.0.0 (Closes: #1053548)
[ David Prévot ]
* Update Standards-Version to 4.7.0
-- David Prévot <taffit@debian.org> Sun, 14 Apr 2024 09:34:48 +0200
check-patroni (1.0.0-1) unstable; urgency=medium
* Initial release, initiated by py2dsp/3.20230219
-- David Prévot <taffit@debian.org> Wed, 06 Sep 2023 14:26:10 +0530

View file

@ -1,2 +0,0 @@
[name]
check-patroni \- Nagios plugin to check on patroni

27
debian/control vendored
View file

@ -1,27 +0,0 @@
Source: check-patroni
Section: utils
Priority: optional
Maintainer: David Prévot <taffit@debian.org>
Build-Depends: debhelper-compat (= 13),
help2man,
pybuild-plugin-pyproject,
python3-all,
python3-attr,
python3-click,
python3-nagiosplugin,
python3-pytest-mock,
python3-requests,
python3-setuptools
Standards-Version: 4.7.0
Testsuite: autopkgtest-pkg-pybuild
Homepage: https://github.com/dalibo/check_patroni
Vcs-Git: https://salsa.debian.org/debian/check-patroni.git
Vcs-Browser: https://salsa.debian.org/debian/check-patroni
Rules-Requires-Root: no
Package: check-patroni
Architecture: all
Depends: ${misc:Depends}, ${python3:Depends}
Description: Nagios plugin to check on patroni
A nagios plugin for patroni that checks presence of leader, replicas,
and node counts, and also checks each node for replication status.

55
debian/copyright vendored
View file

@ -1,55 +0,0 @@
Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
Upstream-Name: check-patroni
Upstream-Contact: Dalibo <contact@dalibo.com>
Source: https://github.com/dalibo/check_patroni
Files: *
Copyright: 2022, DALIBO <contact@dalibo.com>
License: PostgreSQL
Files: vagrant/*
Copyright: 2019, Jehan-Guillaume (ioguix) de Rorthais
License: BSD-3-clause
License: BSD-3-clause
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
.
* Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
.
* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
.
* Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
License: PostgreSQL
Permission to use, copy, modify, and distribute this software and its
documentation for any purpose, without fee, and without a written agreement is
hereby granted, provided that the above copyright notice and this paragraph and
the following two paragraphs appear in all copies.
.
IN NO EVENT SHALL DALIBO BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL,
INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING LOST PROFITS, ARISING OUT OF THE
USE OF THIS SOFTWARE AND ITS DOCUMENTATION, EVEN IF DALIBO HAS BEEN ADVISED OF
THE POSSIBILITY OF SUCH DAMAGE.
.
DALIBO SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE
SOFTWARE PROVIDED HEREUNDER IS ON AN "AS IS" BASIS, AND DALIBO HAS NO
OBLIGATIONS TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR
MODIFICATIONS.

5
debian/gbp.conf vendored
View file

@ -1,5 +0,0 @@
[DEFAULT]
debian-branch = debian/latest
pristine-tar = True
upstream-branch = upstream/latest
upstream-vcs-tag = v%(version%~%-)s

1
debian/manpages vendored
View file

@ -1 +0,0 @@
debian/tmp/check_patroni.1

View file

@ -1 +0,0 @@
README.md

14
debian/rules vendored
View file

@ -1,14 +0,0 @@
#! /usr/bin/make -f
export PYBUILD_NAME=check-patroni
%:
dh $@ --with python3 --buildsystem=pybuild
execute_before_dh_installman:
mkdir --parent $(CURDIR)/debian/tmp
PYTHONPATH=debian/check-patroni/usr/lib/python3.11/dist-packages \
help2man \
--no-info \
--include=$(CURDIR)/debian/check_patroni.1.in \
debian/check-patroni/usr/bin/check_patroni \
> $(CURDIR)/debian/tmp/check_patroni.1

View file

@ -1 +0,0 @@
3.0 (quilt)

View file

@ -1 +0,0 @@
extend-diff-ignore="^[^/]+.(egg-info|dist-info)/"

View file

@ -1,5 +0,0 @@
---
Bug-Database: https://github.com/dalibo/check_patroni/issues
Bug-Submit: https://github.com/dalibo/check_patroni/issues/new
Repository: https://github.com/dalibo/check_patroni.git
Repository-Browse: https://github.com/dalibo/check_patroni

2
debian/watch vendored
View file

@ -1,2 +0,0 @@
version=4
https://github.com/dalibo/check_patroni/tags (?:.*?/)?v?(\d[\d.]*)\.tar\.gz

View file

@ -1,158 +0,0 @@
#!/bin/bash
if ! command -v check_patroni &>/dev/null; then
echo "check_partroni must be installed to generate the documentation"
exit 1
fi
top_srcdir="$(readlink -m "$0/../..")"
README="${top_srcdir}/README.md"
function readme(){
echo "$1" >> $README
}
function helpme(){
readme
readme '```'
check_patroni $1 --help >> $README
readme '```'
readme
}
cat << '_EOF_' > $README
# check_patroni
A nagios plugin for patroni.
## Features
- Check presence of leader, replicas, node counts.
- Check each node for replication status.
_EOF_
helpme
cat << '_EOF_' >> $README
## Install
check_patroni is licensed under the PostgreSQL license.
```
$ pip install git+https://github.com/dalibo/check_patroni.git
```
check_patroni works on Python 3.6. We keep supporting it because Patroni
also does and there are still lots of RH 7 variants around. That being
said, Python 3.6 has been EOL for ages and it is not covered by the
GitHub CI.
## Support
If you hit a bug or need help, open a [GitHub
issue](https://github.com/dalibo/check_patroni/issues/new). Dalibo has no
commitment on response time for public free support. Thanks for your
contribution!
## Config file
All global and service-specific parameters can be specified via a config file as follows:
```
[options]
endpoints = https://10.20.199.3:8008, https://10.20.199.4:8008, https://10.20.199.5:8008
cert_file = ./ssl/my-cert.pem
key_file = ./ssl/my-key.pem
ca_file = ./ssl/CA-cert.pem
timeout = 0
[options.node_is_replica]
lag=100
```
## Thresholds
The format for the threshold parameters is `[@][start:][end]`.
* `start:` may be omitted if `start == 0`
* `~:` means that start is negative infinity
* If `end` is omitted, infinity is assumed
* To invert the match condition, prefix the range expression with `@`.
A match is found when: `start <= VALUE <= end`.
For example, the following command will raise:
* a warning if there are fewer than 2 replicas, that is when the count is outside of range [2;+INF[
* a critical if there are no replicas, that is when the count is outside of range [1;+INF[
```
check_patroni -e https://10.20.199.3:8008 cluster_has_replica --warning 2: --critical 1:
```
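To illustrate the `@` prefix, here is a hypothetical, roughly equivalent
command using inverted ranges (alert when the value falls inside the
range, which for non-negative counts matches the command above):
```
check_patroni -e https://10.20.199.3:8008 cluster_has_replica --warning @0:1 --critical @0:0
```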
## SSL
Several options are available:
* the server's CA certificate is not available or trusted by the client system:
* `--ca_cert`: your certification chain `cat CA-certificate server-certificate > cabundle`
* you have a client certificate for authenticating with Patroni's REST API:
* `--cert_file`: your certificate or the concatenation of your certificate and private key
* `--key_file`: your private key (optional)
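For example, a hypothetical invocation combining a client certificate and
a custom CA bundle (file paths are placeholders):
```
check_patroni -e https://10.20.199.3:8008 --cert_file ./ssl/my-cert.pem --key_file ./ssl/my-key.pem --ca_cert ./ssl/cabundle node_is_alive
```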
## Shell completion
We use the [click] library which supports shell completion natively.
Shell completion can be added by typing the following command or adding it to
a file specific to your shell of choice.
* for Bash (add to `~/.bashrc`):
```
eval "$(_CHECK_PATRONI_COMPLETE=bash_source check_patroni)"
```
* for Zsh (add to `~/.zshrc`):
```
eval "$(_CHECK_PATRONI_COMPLETE=zsh_source check_patroni)"
```
* for Fish (add to `~/.config/fish/completions/check_patroni.fish`):
```
eval "$(_CHECK_PATRONI_COMPLETE=fish_source check_patroni)"
```
Please note that shell completion is not supported for all shell versions:
for example, only Bash versions 4.4 and newer are supported.
[click]: https://click.palletsprojects.com/en/8.1.x/shell-completion/
_EOF_
readme
readme "## Cluster services"
readme
readme "### cluster_config_has_changed"
helpme cluster_config_has_changed
readme "### cluster_has_leader"
helpme cluster_has_leader
readme "### cluster_has_replica"
helpme cluster_has_replica
readme "### cluster_has_scheduled_action"
helpme cluster_has_scheduled_action
readme "### cluster_is_in_maintenance"
helpme cluster_is_in_maintenance
readme "### cluster_node_count"
helpme cluster_node_count
readme "## Node services"
readme
readme "### node_is_alive"
helpme node_is_alive
readme "### node_is_pending_restart"
helpme node_is_pending_restart
readme "### node_is_leader"
helpme node_is_leader
readme "### node_is_primary"
helpme node_is_primary
readme "### node_is_replica"
helpme node_is_replica
readme "### node_patroni_version"
helpme node_patroni_version
readme "### node_tl_has_changed"
helpme node_tl_has_changed
cat << _EOF_ >> $README
_EOF_

View file

@ -1,27 +0,0 @@
[mypy]
files = .
show_error_codes = true
strict = true
exclude = build/
[mypy-setup]
ignore_errors = True
[mypy-nagiosplugin.*]
ignore_missing_imports = true
[mypy-check_patroni.types]
# no stubs for nagiosplugin => ignore: Class cannot subclass "Resource" (has type "Any") [misc]
disallow_subclassing_any = false
[mypy-check_patroni.node]
# no stubs for nagiosplugin => ignore: Class cannot subclass "Summary" (has type "Any") [misc]
disallow_subclassing_any = false
[mypy-check_patroni.cluster]
# no stubs for nagiosplugin => ignore: Class cannot subclass "Summary" (has type "Any") [misc]
disallow_subclassing_any = false
[mypy-check_patroni.cli]
# no stubs for nagiosplugin => ignore: Untyped decorator makes function "main" untyped [misc]
disallow_untyped_decorators = false

View file

@ -1,7 +0,0 @@
[build-system]
requires = ["setuptools", "setuptools-scm"]
build-backend = "setuptools.build_meta"
[tool.isort]
profile = "black"

View file

@ -1,2 +0,0 @@
[pytest]
addopts = --doctest-modules

View file

@ -1,12 +0,0 @@
black
codespell
isort
flake8
mypy==0.961
pytest
pytest-cov
types-requests
setuptools
tox
twine
wheel

View file

@ -1,58 +0,0 @@
import pathlib
from setuptools import find_packages, setup
HERE = pathlib.Path(__file__).parent
long_description = (HERE / "README.md").read_text()
def get_version() -> str:
fpath = HERE / "check_patroni" / "__init__.py"
with fpath.open() as f:
for line in f:
if line.startswith("__version__"):
return line.split('"')[1]
raise Exception(f"version information not found in {fpath}")
setup(
name="check_patroni",
version=get_version(),
author="Dalibo",
author_email="contact@dalibo.com",
packages=find_packages(include=["check_patroni*"]),
include_package_data=True,
url="https://github.com/dalibo/check_patroni",
license="PostgreSQL",
description="Nagios plugin to check on patroni",
long_description=long_description,
long_description_content_type="text/markdown",
classifiers=[
"Development Status :: 5 - Production/Stable",
"Environment :: Console",
"License :: OSI Approved :: PostgreSQL License",
"Programming Language :: Python :: 3",
"Topic :: System :: Monitoring",
],
keywords="patroni nagios check",
python_requires=">=3.6",
install_requires=[
"attrs >= 17, !=21.1",
"requests",
"nagiosplugin >= 1.3.2",
"click >= 7.1",
],
extras_require={
"test": [
"importlib_metadata; python_version < '3.8'",
"pytest >= 6.0.2",
],
},
entry_points={
"console_scripts": [
"check_patroni=check_patroni.cli:main",
],
},
zip_safe=False,
)

View file

@ -1,65 +0,0 @@
import json
import logging
import shutil
from contextlib import contextmanager
from functools import partial
from http.server import HTTPServer, SimpleHTTPRequestHandler
from pathlib import Path
from typing import Any, Iterator, Mapping, Union
logger = logging.getLogger(__name__)
class PatroniAPI(HTTPServer):
def __init__(self, directory: Path, *, datadir: Path) -> None:
self.directory = directory
self.datadir = datadir
handler_cls = partial(SimpleHTTPRequestHandler, directory=str(directory))
super().__init__(("", 0), handler_cls)
def serve_forever(self, *args: Any) -> None:
logger.info(
"starting fake Patroni API at %s (directory=%s)",
self.endpoint,
self.directory,
)
return super().serve_forever(*args)
@property
def endpoint(self) -> str:
return f"http://{self.server_name}:{self.server_port}"
@contextmanager
def routes(self, mapping: Mapping[str, Union[Path, str]]) -> Iterator[None]:
"""Temporarily install specified files in served directory, thus
building "routes" from given mapping.
The 'mapping' defines target route paths as keys and files to be
installed in the served directory as values. Mapping values of type
'str' are assumed to be file paths relative to 'datadir'.
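Hypothetical usage, mirroring the test suite:
with patroni_api.routes({"cluster": "cluster_has_replica_ok.json"}):
...  # GET <endpoint>/cluster now serves that fixture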
"""
for route_path, fpath in mapping.items():
if isinstance(fpath, str):
fpath = self.datadir / fpath
shutil.copy(fpath, self.directory / route_path)
try:
yield None
finally:
for fname in mapping:
(self.directory / fname).unlink()
def cluster_api_set_replica_running(in_json: Path, target_dir: Path) -> Path:
# starting from 3.0.4 the state of replicas is streaming or in archive recovery
# instead of running
with in_json.open() as f:
js = json.load(f)
for node in js["members"]:
if node["role"] in ["replica", "sync_standby", "standby_leader"]:
if node["state"] in ["streaming", "in archive recovery"]:
node["state"] = "running"
assert target_dir.is_dir()
out_json = target_dir / in_json.name
with out_json.open("w") as f:
json.dump(js, f)
return out_json

View file

@ -1,76 +0,0 @@
import logging
import sys
from pathlib import Path
from threading import Thread
from typing import Any, Iterator, Tuple
from unittest.mock import patch
if sys.version_info >= (3, 8):
from importlib.metadata import version as metadata_version
else:
from importlib_metadata import version as metadata_version
import pytest
from click.testing import CliRunner
from . import PatroniAPI
logger = logging.getLogger(__name__)
def numversion(pkgname: str) -> Tuple[int, ...]:
version = metadata_version(pkgname)
return tuple(int(v) for v in version.split(".", 3))
if numversion("pytest") >= (6, 2):
TempPathFactory = pytest.TempPathFactory
else:
from _pytest.tmpdir import TempPathFactory
@pytest.fixture(scope="session", autouse=True)
def nagioplugin_runtime_stdout() -> Iterator[None]:
# work around https://github.com/mpounsett/nagiosplugin/issues/24 when
# nagiosplugin is older than 1.3.3
if numversion("nagiosplugin") < (1, 3, 3):
target = "nagiosplugin.runtime.Runtime.stdout"
with patch(target, None):
logger.warning("patching %r", target)
yield None
else:
yield None
@pytest.fixture(
params=[False, True],
ids=lambda v: "new-replica-state" if v else "old-replica-state",
)
def old_replica_state(request: Any) -> Any:
return request.param
@pytest.fixture(scope="session")
def datadir() -> Path:
return Path(__file__).parent / "json"
@pytest.fixture(scope="session")
def patroni_api(
tmp_path_factory: TempPathFactory, datadir: Path
) -> Iterator[PatroniAPI]:
"""A fake HTTP server for the Patroni API serving files from a temporary
directory.
"""
httpd = PatroniAPI(tmp_path_factory.mktemp("api"), datadir=datadir)
t = Thread(target=httpd.serve_forever)
t.start()
yield httpd
httpd.shutdown()
t.join()
@pytest.fixture
def runner() -> CliRunner:
"""A CliRunner with stdout and stderr not mixed."""
return CliRunner(mix_stderr=False)

View file

@ -1,16 +0,0 @@
{
"loop_wait": 10,
"master_start_timeout": 300,
"postgresql": {
"parameters": {
"archive_command": "pgbackrest --stanza=main archive-push %p",
"archive_mode": "on",
"max_connections": 300,
"restore_command": "pgbackrest --stanza=main archive-get %f \"%p\""
},
"use_pg_rewind": false,
"use_slot": true
},
"retry_timeout": 10,
"ttl": 30
}

View file

@ -1,33 +0,0 @@
{
"members": [
{
"name": "srv1",
"role": "replica",
"state": "running",
"api_url": "https://10.20.199.3:8008/patroni",
"host": "10.20.199.3",
"port": 5432,
"timeline": 51
},
{
"name": "srv2",
"role": "replica",
"state": "running",
"api_url": "https://10.20.199.4:8008/patroni",
"host": "10.20.199.4",
"port": 5432,
"timeline": 51,
"lag": 0
},
{
"name": "srv3",
"role": "replica",
"state": "running",
"api_url": "https://10.20.199.5:8008/patroni",
"host": "10.20.199.5",
"port": 5432,
"timeline": 51,
"lag": 0
}
]
}

View file

@ -1,33 +0,0 @@
{
"members": [
{
"name": "srv1",
"role": "standby_leader",
"state": "stopped",
"api_url": "https://10.20.199.3:8008/patroni",
"host": "10.20.199.3",
"port": 5432,
"timeline": 51
},
{
"name": "srv2",
"role": "replica",
"state": "streaming",
"api_url": "https://10.20.199.4:8008/patroni",
"host": "10.20.199.4",
"port": 5432,
"timeline": 51,
"lag": 0
},
{
"name": "srv3",
"role": "replica",
"state": "streaming",
"api_url": "https://10.20.199.5:8008/patroni",
"host": "10.20.199.5",
"port": 5432,
"timeline": 51,
"lag": 0
}
]
}

View file

@ -1,33 +0,0 @@
{
"members": [
{
"name": "srv1",
"role": "standby_leader",
"state": "in archive recovery",
"api_url": "https://10.20.199.3:8008/patroni",
"host": "10.20.199.3",
"port": 5432,
"timeline": 51
},
{
"name": "srv2",
"role": "replica",
"state": "streaming",
"api_url": "https://10.20.199.4:8008/patroni",
"host": "10.20.199.4",
"port": 5432,
"timeline": 51,
"lag": 0
},
{
"name": "srv3",
"role": "replica",
"state": "streaming",
"api_url": "https://10.20.199.5:8008/patroni",
"host": "10.20.199.5",
"port": 5432,
"timeline": 51,
"lag": 0
}
]
}

View file

@ -1,33 +0,0 @@
{
"members": [
{
"name": "srv1",
"role": "leader",
"state": "running",
"api_url": "https://10.20.199.3:8008/patroni",
"host": "10.20.199.3",
"port": 5432,
"timeline": 51
},
{
"name": "srv2",
"role": "replica",
"state": "streaming",
"api_url": "https://10.20.199.4:8008/patroni",
"host": "10.20.199.4",
"port": 5432,
"timeline": 51,
"lag": 0
},
{
"name": "srv3",
"role": "replica",
"state": "streaming",
"api_url": "https://10.20.199.5:8008/patroni",
"host": "10.20.199.5",
"port": 5432,
"timeline": 51,
"lag": 0
}
]
}

View file

@ -1,33 +0,0 @@
{
"members": [
{
"name": "srv1",
"role": "standby_leader",
"state": "streaming",
"api_url": "https://10.20.199.3:8008/patroni",
"host": "10.20.199.3",
"port": 5432,
"timeline": 51
},
{
"name": "srv2",
"role": "replica",
"state": "streaming",
"api_url": "https://10.20.199.4:8008/patroni",
"host": "10.20.199.4",
"port": 5432,
"timeline": 51,
"lag": 0
},
{
"name": "srv3",
"role": "replica",
"state": "streaming",
"api_url": "https://10.20.199.5:8008/patroni",
"host": "10.20.199.5",
"port": 5432,
"timeline": 51,
"lag": 0
}
]
}

View file

@ -1,33 +0,0 @@
{
"members": [
{
"name": "srv1",
"role": "leader",
"state": "running",
"api_url": "https://10.20.199.3:8008/patroni",
"host": "10.20.199.3",
"port": 5432,
"timeline": 51
},
{
"name": "srv2",
"role": "replica",
"state": "stopped",
"api_url": "https://10.20.199.4:8008/patroni",
"host": "10.20.199.4",
"port": 5432,
"timeline": 51,
"lag": "unknown"
},
{
"name": "srv3",
"role": "replica",
"state": "streaming",
"api_url": "https://10.20.199.5:8008/patroni",
"host": "10.20.199.5",
"port": 5432,
"timeline": 51,
"lag": 0
}
]
}

View file

@ -1,35 +0,0 @@
{
"members": [
{
"name": "srv1",
"role": "replica",
"state": "running",
"api_url": "https://10.20.199.3:8008/patroni",
"host": "10.20.199.3",
"port": 5432,
"timeline": 51,
"lag": 0
},
{
"name": "srv2",
"role": "replica",
"state": "running",
"api_url": "https://10.20.199.4:8008/patroni",
"host": "10.20.199.4",
"port": 5432,
"timeline": 51,
"lag": 0
},
{
"name": "srv3",
"role": "replica",
"state": "running",
"api_url": "https://10.20.199.5:8008/patroni",
"host": "10.20.199.5",
"port": 5432,
"timeline": 51,
"lag": 0
}
]
}

View file

@ -1,33 +0,0 @@
{
"members": [
{
"name": "srv1",
"role": "leader",
"state": "running",
"api_url": "https://10.20.199.3:8008/patroni",
"host": "10.20.199.3",
"port": 5432,
"timeline": 51
},
{
"name": "srv2",
"role": "replica",
"state": "streaming",
"api_url": "https://10.20.199.4:8008/patroni",
"host": "10.20.199.4",
"port": 5432,
"timeline": 51,
"lag": 10241024
},
{
"name": "srv3",
"role": "replica",
"state": "streaming",
"api_url": "https://10.20.199.5:8008/patroni",
"host": "10.20.199.5",
"port": 5432,
"timeline": 51,
"lag": 20000000
}
]
}

View file

@ -1,33 +0,0 @@
{
"members": [
{
"name": "srv1",
"role": "leader",
"state": "running",
"api_url": "https://10.20.199.3:8008/patroni",
"host": "10.20.199.3",
"port": 5432,
"timeline": 51
},
{
"name": "srv2",
"role": "replica",
"state": "running",
"api_url": "https://10.20.199.4:8008/patroni",
"host": "10.20.199.4",
"port": 5432,
"timeline": 50,
"lag": 1000000
},
{
"name": "srv3",
"role": "replica",
"state": "streaming",
"api_url": "https://10.20.199.5:8008/patroni",
"host": "10.20.199.5",
"port": 5432,
"timeline": 51,
"lag": 0
}
]
}

View file

@ -1,33 +0,0 @@
{
"members": [
{
"name": "srv1",
"role": "leader",
"state": "running",
"api_url": "https://10.20.199.3:8008/patroni",
"host": "10.20.199.3",
"port": 5432,
"timeline": 51
},
{
"name": "srv2",
"role": "replica",
"state": "in archive recovery",
"api_url": "https://10.20.199.4:8008/patroni",
"host": "10.20.199.4",
"port": 5432,
"timeline": 51,
"lag": 0
},
{
"name": "srv3",
"role": "sync_standby",
"state": "streaming",
"api_url": "https://10.20.199.5:8008/patroni",
"host": "10.20.199.5",
"port": 5432,
"timeline": 51,
"lag": 0
}
]
}

View file

@ -1,33 +0,0 @@
{
"members": [
{
"name": "srv1",
"role": "leader",
"state": "running",
"api_url": "https://10.20.199.3:8008/patroni",
"host": "10.20.199.3",
"port": 5432,
"timeline": 51
},
{
"name": "srv2",
"role": "replica",
"state": "streaming",
"api_url": "https://10.20.199.4:8008/patroni",
"host": "10.20.199.4",
"port": 5432,
"timeline": 51,
"lag": 1024
},
{
"name": "srv3",
"role": "replica",
"state": "streaming",
"api_url": "https://10.20.199.5:8008/patroni",
"host": "10.20.199.5",
"port": 5432,
"timeline": 51,
"lag": 0
}
]
}

View file

@ -1,26 +0,0 @@
{
"state": "running",
"postmaster_start_time": "2021-08-11 07:02:20.732 UTC",
"role": "master",
"server_version": 110012,
"cluster_unlocked": false,
"xlog": {
"location": 1174407088
},
"timeline": 51,
"replication": [
{
"usename": "replicator",
"application_name": "srv1",
"client_addr": "10.20.199.3",
"state": "streaming",
"sync_state": "async",
"sync_priority": 0
}
],
"database_system_identifier": "6965971025273547206",
"patroni": {
"version": "3.0.0",
"scope": "patroni-demo"
}
}

View file

@ -1,26 +0,0 @@
{
"state": "running",
"postmaster_start_time": "2021-08-11 07:02:20.732 UTC",
"role": "master",
"server_version": 110012,
"cluster_unlocked": false,
"xlog": {
"location": 1174407088
},
"timeline": 51,
"replication": [
{
"usename": "replicator",
"application_name": "srv1",
"client_addr": "10.20.199.3",
"state": "streaming",
"sync_state": "async",
"sync_priority": 0
}
],
"database_system_identifier": "6965971025273547206",
"patroni": {
"version": "3.1.0",
"scope": "patroni-demo"
}
}

View file

@ -1,27 +0,0 @@
{
"members": [
{
"name": "p1",
"role": "sync_standby",
"state": "streaming",
"api_url": "http://10.20.30.51:8008/patroni",
"host": "10.20.30.51",
"port": 5432,
"timeline": 3,
"scheduled_restart": {
"schedule": "2023-10-08T11:30:00+00:00",
"postmaster_start_time": "2023-08-21 08:08:33.415237+00:00"
},
"lag": 0
},
{
"name": "p2",
"role": "leader",
"state": "running",
"api_url": "http://10.20.30.52:8008/patroni",
"host": "10.20.30.52",
"port": 5432,
"timeline": 3
}
]
}

View file

@ -1,28 +0,0 @@
{
"members": [
{
"name": "p1",
"role": "sync_standby",
"state": "streaming",
"api_url": "http://10.20.30.51:8008/patroni",
"host": "10.20.30.51",
"port": 5432,
"timeline": 3,
"lag": 0
},
{
"name": "p2",
"role": "leader",
"state": "running",
"api_url": "http://10.20.30.52:8008/patroni",
"host": "10.20.30.52",
"port": 5432,
"timeline": 3
}
],
"scheduled_switchover": {
"at": "2023-10-08T11:30:00+00:00",
"from": "p1",
"to": "p2"
}
}

View file

@ -1,33 +0,0 @@
{
"members": [
{
"name": "srv1",
"role": "leader",
"state": "running",
"api_url": "https://10.20.199.3:8008/patroni",
"host": "10.20.199.3",
"port": 5432,
"timeline": 51
},
{
"name": "srv2",
"role": "replica",
"state": "streaming",
"api_url": "https://10.20.199.4:8008/patroni",
"host": "10.20.199.4",
"port": 5432,
"timeline": 51,
"lag": 0
},
{
"name": "srv3",
"role": "sync_standby",
"state": "streaming",
"api_url": "https://10.20.199.5:8008/patroni",
"host": "10.20.199.5",
"port": 5432,
"timeline": 51,
"lag": 0
}
]
}

View file

@ -1,34 +0,0 @@
{
"members": [
{
"name": "srv1",
"role": "leader",
"state": "running",
"api_url": "https://10.20.199.3:8008/patroni",
"host": "10.20.199.3",
"port": 5432,
"timeline": 51
},
{
"name": "srv2",
"role": "replica",
"state": "streaming",
"api_url": "https://10.20.199.4:8008/patroni",
"host": "10.20.199.4",
"port": 5432,
"timeline": 51,
"lag": 0
},
{
"name": "srv3",
"role": "replica",
"state": "streaming",
"api_url": "https://10.20.199.5:8008/patroni",
"host": "10.20.199.5",
"port": 5432,
"timeline": 51,
"lag": 0
}
],
"pause": true
}

View file

@ -1,34 +0,0 @@
{
"members": [
{
"name": "srv1",
"role": "leader",
"state": "running",
"api_url": "https://10.20.199.3:8008/patroni",
"host": "10.20.199.3",
"port": 5432,
"timeline": 51
},
{
"name": "srv2",
"role": "replica",
"state": "streaming",
"api_url": "https://10.20.199.4:8008/patroni",
"host": "10.20.199.4",
"port": 5432,
"timeline": 51,
"lag": 0
},
{
"name": "srv3",
"role": "replica",
"state": "streaming",
"api_url": "https://10.20.199.5:8008/patroni",
"host": "10.20.199.5",
"port": 5432,
"timeline": 51,
"lag": 0
}
],
"pause": false
}

View file

@ -1,33 +0,0 @@
{
"members": [
{
"name": "srv1",
"role": "leader",
"state": "running",
"api_url": "https://10.20.199.3:8008/patroni",
"host": "10.20.199.3",
"port": 5432,
"timeline": 51
},
{
"name": "srv2",
"role": "replica",
"state": "streaming",
"api_url": "https://10.20.199.4:8008/patroni",
"host": "10.20.199.4",
"port": 5432,
"timeline": 51,
"lag": 0
},
{
"name": "srv3",
"role": "replica",
"state": "streaming",
"api_url": "https://10.20.199.5:8008/patroni",
"host": "10.20.199.5",
"port": 5432,
"timeline": 51,
"lag": 0
}
]
}

View file

@ -1,34 +0,0 @@
{
"members": [
{
"name": "srv1",
"role": "leader",
"state": "running",
"api_url": "https://10.20.199.3:8008/patroni",
"host": "10.20.199.3",
"port": 5432,
"timeline": 51
},
{
"name": "srv2",
"role": "replica",
"state": "streaming",
"api_url": "https://10.20.199.4:8008/patroni",
"host": "10.20.199.4",
"port": 5432,
"timeline": 51,
"lag": 0
},
{
"name": "srv3",
"role": "replica",
"state": "streaming",
"api_url": "https://10.20.199.5:8008/patroni",
"host": "10.20.199.5",
"port": 5432,
"timeline": 51,
"lag": 0
}
],
"pause": false
}

View file

@ -1,13 +0,0 @@
{
"members": [
{
"name": "srv1",
"role": "leader",
"state": "running",
"api_url": "https://10.20.199.3:8008/patroni",
"host": "10.20.199.3",
"port": 5432,
"timeline": 51
}
]
}

View file

@ -1,31 +0,0 @@
{
"members": [
{
"name": "srv1",
"role": "leader",
"state": "running",
"api_url": "https://10.20.199.3:8008/patroni",
"host": "10.20.199.3",
"port": 5432,
"timeline": 51
},
{
"name": "srv2",
"role": "replica",
"state": "start failed",
"api_url": "https://10.20.199.4:8008/patroni",
"host": "10.20.199.4",
"port": 5432,
"lag": "unknown"
},
{
"name": "srv3",
"role": "replica",
"state": "start failed",
"api_url": "https://10.20.199.5:8008/patroni",
"host": "10.20.199.5",
"port": 5432,
"lag": "unknown"
}
]
}

View file

@ -1,23 +0,0 @@
{
"members": [
{
"name": "srv1",
"role": "leader",
"state": "running",
"api_url": "https://10.20.199.3:8008/patroni",
"host": "10.20.199.3",
"port": 5432,
"timeline": 51
},
{
"name": "srv3",
"role": "replica",
"state": "streaming",
"api_url": "https://10.20.199.5:8008/patroni",
"host": "10.20.199.5",
"port": 5432,
"timeline": 51,
"lag": 0
}
]
}

View file

@ -1,33 +0,0 @@
{
"members": [
{
"name": "srv1",
"role": "standby_leader",
"state": "in archive recovery",
"api_url": "https://10.20.199.3:8008/patroni",
"host": "10.20.199.3",
"port": 5432,
"timeline": 51
},
{
"name": "srv2",
"role": "replica",
"state": "in archive recovery",
"api_url": "https://10.20.199.4:8008/patroni",
"host": "10.20.199.4",
"port": 5432,
"timeline": 51,
"lag": 0
},
{
"name": "srv3",
"role": "replica",
"state": "streaming",
"api_url": "https://10.20.199.5:8008/patroni",
"host": "10.20.199.5",
"port": 5432,
"timeline": 51,
"lag": 0
}
]
}

View file

@ -1,33 +0,0 @@
{
"members": [
{
"name": "srv1",
"role": "leader",
"state": "running",
"api_url": "https://10.20.199.3:8008/patroni",
"host": "10.20.199.3",
"port": 5432,
"timeline": 51
},
{
"name": "srv2",
"role": "replica",
"state": "streaming",
"api_url": "https://10.20.199.4:8008/patroni",
"host": "10.20.199.4",
"port": 5432,
"timeline": 51,
"lag": 0
},
{
"name": "srv3",
"role": "replica",
"state": "streaming",
"api_url": "https://10.20.199.5:8008/patroni",
"host": "10.20.199.5",
"port": 5432,
"timeline": 51,
"lag": 0
}
]
}

View file

@ -1,23 +0,0 @@
{
"members": [
{
"name": "srv1",
"role": "leader",
"state": "running",
"api_url": "https://10.20.199.3:8008/patroni",
"host": "10.20.199.3",
"port": 5432,
"timeline": 51
},
{
"name": "srv2",
"role": "replica",
"state": "streaming",
"api_url": "https://10.20.199.4:8008/patroni",
"host": "10.20.199.4",
"port": 5432,
"timeline": 51,
"lag": 0
}
]
}

View file

@ -1,26 +0,0 @@
{
"state": "running",
"postmaster_start_time": "2021-08-11 07:02:20.732 UTC",
"role": "master",
"server_version": 110012,
"cluster_unlocked": false,
"xlog": {
"location": 1174407088
},
"timeline": 58,
"replication": [
{
"usename": "replicator",
"application_name": "srv1",
"client_addr": "10.20.199.3",
"state": "streaming",
"sync_state": "async",
"sync_priority": 0
}
],
"database_system_identifier": "6965971025273547206",
"patroni": {
"version": "2.0.2",
"scope": "patroni-demo"
}
}

View file

@ -1,19 +0,0 @@
{
"state": "running",
"postmaster_start_time": "2023-08-23 14:30:50.201691+00:00",
"role": "standby_leader",
"server_version": 140009,
"xlog": {
"received_location": 889192448,
"replayed_location": 889192448,
"replayed_timestamp": null,
"paused": false
},
"timeline": 1,
"dcs_last_seen": 1692805971,
"database_system_identifier": "7270495803765492571",
"patroni": {
"version": "3.1.0",
"scope": "patroni-demo-sb"
}
}

View file

@ -1,26 +0,0 @@
{
"state": "running",
"postmaster_start_time": "2021-08-11 07:02:20.732 UTC",
"role": "master",
"server_version": 110012,
"cluster_unlocked": false,
"xlog": {
"location": 1174407088
},
"timeline": 58,
"replication": [
{
"usename": "replicator",
"application_name": "srv1",
"client_addr": "10.20.199.3",
"state": "streaming",
"sync_state": "async",
"sync_priority": 0
}
],
"database_system_identifier": "6965971025273547206",
"patroni": {
"version": "2.0.2",
"scope": "patroni-demo"
}
}

View file

@ -1,19 +0,0 @@
{
"state": "running",
"postmaster_start_time": "2023-08-23 14:30:50.201691+00:00",
"role": "standby_leader",
"server_version": 140009,
"xlog": {
"received_location": 889192448,
"replayed_location": 889192448,
"replayed_timestamp": null,
"paused": false
},
"timeline": 1,
"dcs_last_seen": 1692805971,
"database_system_identifier": "7270495803765492571",
"patroni": {
"version": "3.1.0",
"scope": "patroni-demo-sb"
}
}

View file

@ -1,27 +0,0 @@
{
"state": "running",
"postmaster_start_time": "2021-08-11 07:02:20.732 UTC",
"role": "master",
"server_version": 110012,
"cluster_unlocked": false,
"xlog": {
"location": 1174407088
},
"timeline": 58,
"replication": [
{
"usename": "replicator",
"application_name": "srv1",
"client_addr": "10.20.199.3",
"state": "streaming",
"sync_state": "async",
"sync_priority": 0
}
],
"pending_restart": true,
"database_system_identifier": "6965971025273547206",
"patroni": {
"version": "2.0.2",
"scope": "patroni-demo"
}
}

View file

@ -1,26 +0,0 @@
{
"state": "running",
"postmaster_start_time": "2021-08-11 07:02:20.732 UTC",
"role": "master",
"server_version": 110012,
"cluster_unlocked": false,
"xlog": {
"location": 1174407088
},
"timeline": 58,
"replication": [
{
"usename": "replicator",
"application_name": "srv1",
"client_addr": "10.20.199.3",
"state": "streaming",
"sync_state": "async",
"sync_priority": 0
}
],
"database_system_identifier": "6965971025273547206",
"patroni": {
"version": "2.0.2",
"scope": "patroni-demo"
}
}

View file

@ -1,19 +0,0 @@
{
"state": "running",
"postmaster_start_time": "2021-08-11 07:57:51.693 UTC",
"role": "replica",
"server_version": 110012,
"cluster_unlocked": false,
"xlog": {
"received_location": 1174407088,
"replayed_location": 1174407088,
"replayed_timestamp": null,
"paused": false
},
"timeline": 58,
"database_system_identifier": "6965971025273547206",
"patroni": {
"version": "2.0.2",
"scope": "patroni-demo"
}
}

View file

@ -1,26 +0,0 @@
{
"state": "running",
"postmaster_start_time": "2021-08-11 07:02:20.732 UTC",
"role": "master",
"server_version": 110012,
"cluster_unlocked": false,
"xlog": {
"location": 1174407088
},
"timeline": 58,
"replication": [
{
"usename": "replicator",
"application_name": "srv1",
"client_addr": "10.20.199.3",
"state": "streaming",
"sync_state": "async",
"sync_priority": 0
}
],
"database_system_identifier": "6965971025273547206",
"patroni": {
"version": "2.0.2",
"scope": "patroni-demo"
}
}

View file

@ -1,26 +0,0 @@
{
"state": "running",
"postmaster_start_time": "2021-08-11 07:02:20.732 UTC",
"role": "master",
"server_version": 110012,
"cluster_unlocked": false,
"xlog": {
"location": 1174407088
},
"timeline": 58,
"replication": [
{
"usename": "replicator",
"application_name": "srv1",
"client_addr": "10.20.199.3",
"state": "streaming",
"sync_state": "async",
"sync_priority": 0
}
],
"database_system_identifier": "6965971025273547206",
"patroni": {
"version": "2.0.2",
"scope": "patroni-demo"
}
}

View file

@ -1,19 +0,0 @@
{
"state": "running",
"postmaster_start_time": "2021-08-11 07:57:51.693 UTC",
"role": "replica",
"server_version": 110012,
"cluster_unlocked": false,
"xlog": {
"received_location": 1174407088,
"replayed_location": 1174407088,
"replayed_timestamp": null,
"paused": false
},
"timeline": 58,
"database_system_identifier": "6965971025273547206",
"patroni": {
"version": "2.0.2",
"scope": "patroni-demo"
}
}

View file

@ -1,26 +0,0 @@
{
"state": "running",
"postmaster_start_time": "2021-08-11 07:02:20.732 UTC",
"role": "master",
"server_version": 110012,
"cluster_unlocked": false,
"xlog": {
"location": 1174407088
},
"timeline": 58,
"replication": [
{
"usename": "replicator",
"application_name": "srv1",
"client_addr": "10.20.199.3",
"state": "streaming",
"sync_state": "async",
"sync_priority": 0
}
],
"database_system_identifier": "6965971025273547206",
"patroni": {
"version": "2.0.2",
"scope": "patroni-demo"
}
}

View file

@ -1,26 +0,0 @@
{
"state": "running",
"postmaster_start_time": "2021-08-11 07:02:20.732 UTC",
"role": "master",
"server_version": 110012,
"cluster_unlocked": false,
"xlog": {
"location": 1174407088
},
"timeline": 58,
"replication": [
{
"usename": "replicator",
"application_name": "srv1",
"client_addr": "10.20.199.3",
"state": "streaming",
"sync_state": "async",
"sync_priority": 0
}
],
"database_system_identifier": "6965971025273547206",
"patroni": {
"version": "2.0.2",
"scope": "patroni-demo"
}
}

View file

@ -1,20 +0,0 @@
from click.testing import CliRunner
from check_patroni.cli import main
from . import PatroniAPI
def test_api_status_code_200(runner: CliRunner, patroni_api: PatroniAPI) -> None:
with patroni_api.routes({"patroni": "node_is_pending_restart_ok.json"}):
result = runner.invoke(
main, ["-e", patroni_api.endpoint, "node_is_pending_restart"]
)
assert result.exit_code == 0
def test_api_status_code_404(runner: CliRunner, patroni_api: PatroniAPI) -> None:
result = runner.invoke(
main, ["-e", patroni_api.endpoint, "node_is_pending_restart"]
)
assert result.exit_code == 3

View file

@ -1,171 +0,0 @@
from pathlib import Path
from typing import Iterator
import nagiosplugin
import pytest
from click.testing import CliRunner
from check_patroni.cli import main
from . import PatroniAPI
@pytest.fixture(scope="module", autouse=True)
def cluster_config_has_changed(patroni_api: PatroniAPI) -> Iterator[None]:
with patroni_api.routes({"config": "cluster_config_has_changed.json"}):
yield None
def test_cluster_config_has_changed_ok_with_hash(
runner: CliRunner, patroni_api: PatroniAPI
) -> None:
result = runner.invoke(
main,
[
"-e",
patroni_api.endpoint,
"cluster_config_has_changed",
"--hash",
"96b12d82571473d13e890b893734e731",
],
)
assert result.exit_code == 0
assert (
result.stdout
== "CLUSTERCONFIGHASCHANGED OK - The hash of patroni's dynamic configuration has not changed (96b12d82571473d13e890b893734e731). | is_configuration_changed=0;;@1:1\n"
)
def test_cluster_config_has_changed_ok_with_state_file(
runner: CliRunner, patroni_api: PatroniAPI, tmp_path: Path
) -> None:
state_file = tmp_path / "cluster_config_has_changed.state_file"
with state_file.open("w") as f:
f.write('{"hash": "96b12d82571473d13e890b893734e731"}')
result = runner.invoke(
main,
[
"-e",
patroni_api.endpoint,
"cluster_config_has_changed",
"--state-file",
str(state_file),
],
)
assert result.exit_code == 0
assert (
result.stdout
== "CLUSTERCONFIGHASCHANGED OK - The hash of patroni's dynamic configuration has not changed (96b12d82571473d13e890b893734e731). | is_configuration_changed=0;;@1:1\n"
)
def test_cluster_config_has_changed_ko_with_hash(
runner: CliRunner, patroni_api: PatroniAPI
) -> None:
result = runner.invoke(
main,
[
"-e",
patroni_api.endpoint,
"cluster_config_has_changed",
"--hash",
"96b12d82571473d13e890b8937ffffff",
],
)
assert result.exit_code == 2
assert (
result.stdout
== "CLUSTERCONFIGHASCHANGED CRITICAL - The hash of patroni's dynamic configuration has changed. The old hash was 96b12d82571473d13e890b8937ffffff. | is_configuration_changed=1;;@1:1\n"
)
def test_cluster_config_has_changed_ko_with_state_file_and_save(
runner: CliRunner, patroni_api: PatroniAPI, tmp_path: Path
) -> None:
state_file = tmp_path / "cluster_config_has_changed.state_file"
with state_file.open("w") as f:
f.write('{"hash": "96b12d82571473d13e890b8937ffffff"}')
# test without saving the new hash
result = runner.invoke(
main,
[
"-e",
patroni_api.endpoint,
"cluster_config_has_changed",
"--state-file",
str(state_file),
],
)
assert result.exit_code == 2
assert (
result.stdout
== "CLUSTERCONFIGHASCHANGED CRITICAL - The hash of patroni's dynamic configuration has changed. The old hash was 96b12d82571473d13e890b8937ffffff. | is_configuration_changed=1;;@1:1\n"
)
state_file = tmp_path / "cluster_config_has_changed.state_file"
cookie = nagiosplugin.Cookie(state_file)
cookie.open()
new_config_hash = cookie.get("hash")
cookie.close()
assert new_config_hash == "96b12d82571473d13e890b8937ffffff"
# test when we save the hash
result = runner.invoke(
main,
[
"-e",
patroni_api.endpoint,
"cluster_config_has_changed",
"--state-file",
str(state_file),
"--save",
],
)
assert result.exit_code == 2
assert (
result.stdout
== "CLUSTERCONFIGHASCHANGED CRITICAL - The hash of patroni's dynamic configuration has changed. The old hash was 96b12d82571473d13e890b8937ffffff. | is_configuration_changed=1;;@1:1\n"
)
cookie = nagiosplugin.Cookie(state_file)
cookie.open()
new_config_hash = cookie.get("hash")
cookie.close()
assert new_config_hash == "96b12d82571473d13e890b893734e731"
def test_cluster_config_has_changed_params(
runner: CliRunner, patroni_api: PatroniAPI, tmp_path: Path
) -> None:
# This one is placed last because it seems like the exceptions are not flushed from stderr for the next tests.
fake_state_file = tmp_path / "fake_file_name.state_file"
result = runner.invoke(
main,
[
"-e",
patroni_api.endpoint,
"cluster_config_has_changed",
"--hash",
"640df9f0211c791723f18fc3ed9dbb95",
"--state-file",
str(fake_state_file),
],
)
assert result.exit_code == 3
assert (
result.stdout
== "CLUSTERCONFIGHASCHANGED UNKNOWN: click.exceptions.UsageError: Either --hash or --state-file should be provided for this service\n"
)
result = runner.invoke(
main, ["-e", "https://10.20.199.3:8008", "cluster_config_has_changed"]
)
assert result.exit_code == 3
assert (
result.stdout
== "CLUSTERCONFIGHASCHANGED UNKNOWN: click.exceptions.UsageError: Either --hash or --state-file should be provided for this service\n"
)

View file

@ -1,139 +0,0 @@
from pathlib import Path
from typing import Iterator, Union
import pytest
from click.testing import CliRunner
from check_patroni.cli import main
from . import PatroniAPI, cluster_api_set_replica_running
@pytest.fixture
def cluster_has_leader_ok(
patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path
) -> Iterator[None]:
cluster_path: Union[str, Path] = "cluster_has_leader_ok.json"
patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json"
if old_replica_state:
cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path)
patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json"
with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}):
yield None
@pytest.mark.usefixtures("cluster_has_leader_ok")
def test_cluster_has_leader_ok(runner: CliRunner, patroni_api: PatroniAPI) -> None:
result = runner.invoke(main, ["-e", patroni_api.endpoint, "cluster_has_leader"])
assert (
result.stdout
== "CLUSTERHASLEADER OK - The cluster has a running leader. | has_leader=1;;@0 is_leader=1 is_standby_leader=0 is_standby_leader_in_arc_rec=0;@1:1\n"
)
assert result.exit_code == 0
@pytest.fixture
def cluster_has_leader_ok_standby_leader(
patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path
) -> Iterator[None]:
cluster_path: Union[str, Path] = "cluster_has_leader_ok_standby_leader.json"
patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json"
if old_replica_state:
cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path)
patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json"
with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}):
yield None
@pytest.mark.usefixtures("cluster_has_leader_ok_standby_leader")
def test_cluster_has_leader_ok_standby_leader(
runner: CliRunner, patroni_api: PatroniAPI
) -> None:
result = runner.invoke(main, ["-e", patroni_api.endpoint, "cluster_has_leader"])
assert (
result.stdout
== "CLUSTERHASLEADER OK - The cluster has a running leader. | has_leader=1;;@0 is_leader=0 is_standby_leader=1 is_standby_leader_in_arc_rec=0;@1:1\n"
)
assert result.exit_code == 0
@pytest.fixture
def cluster_has_leader_ko(
patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path
) -> Iterator[None]:
cluster_path: Union[str, Path] = "cluster_has_leader_ko.json"
patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json"
if old_replica_state:
cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path)
patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json"
with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}):
yield None
@pytest.mark.usefixtures("cluster_has_leader_ko")
def test_cluster_has_leader_ko(runner: CliRunner, patroni_api: PatroniAPI) -> None:
result = runner.invoke(main, ["-e", patroni_api.endpoint, "cluster_has_leader"])
assert (
result.stdout
== "CLUSTERHASLEADER CRITICAL - The cluster has no running leader or the standby leader is in archive recovery. | has_leader=0;;@0 is_leader=0 is_standby_leader=0 is_standby_leader_in_arc_rec=0;@1:1\n"
)
assert result.exit_code == 2
@pytest.fixture
def cluster_has_leader_ko_standby_leader(
patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path
) -> Iterator[None]:
cluster_path: Union[str, Path] = "cluster_has_leader_ko_standby_leader.json"
patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json"
if old_replica_state:
cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path)
patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json"
with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}):
yield None
@pytest.mark.usefixtures("cluster_has_leader_ko_standby_leader")
def test_cluster_has_leader_ko_standby_leader(
runner: CliRunner, patroni_api: PatroniAPI
) -> None:
result = runner.invoke(main, ["-e", patroni_api.endpoint, "cluster_has_leader"])
assert (
result.stdout
== "CLUSTERHASLEADER CRITICAL - The cluster has no running leader or the standby leader is in archive recovery. | has_leader=0;;@0 is_leader=0 is_standby_leader=0 is_standby_leader_in_arc_rec=0;@1:1\n"
)
assert result.exit_code == 2
@pytest.fixture
def cluster_has_leader_ko_standby_leader_archiving(
patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path
) -> Iterator[None]:
cluster_path: Union[str, Path] = (
"cluster_has_leader_ko_standby_leader_archiving.json"
)
patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json"
if old_replica_state:
cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path)
patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json"
with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}):
yield None
@pytest.mark.usefixtures("cluster_has_leader_ko_standby_leader_archiving")
def test_cluster_has_leader_ko_standby_leader_archiving(
runner: CliRunner, patroni_api: PatroniAPI, old_replica_state: bool
) -> None:
result = runner.invoke(main, ["-e", patroni_api.endpoint, "cluster_has_leader"])
if old_replica_state:
assert (
result.stdout
== "CLUSTERHASLEADER OK - The cluster has a running leader. | has_leader=1;;@0 is_leader=0 is_standby_leader=1 is_standby_leader_in_arc_rec=0;@1:1\n"
)
assert result.exit_code == 0
else:
assert (
result.stdout
== "CLUSTERHASLEADER WARNING - The cluster has no running leader or the standby leader is in archive recovery. | has_leader=1;;@0 is_leader=0 is_standby_leader=1 is_standby_leader_in_arc_rec=1;@1:1\n"
)
assert result.exit_code == 1

View file

@ -1,288 +0,0 @@
from pathlib import Path
from typing import Iterator, Union
import pytest
from click.testing import CliRunner
from check_patroni.cli import main
from . import PatroniAPI, cluster_api_set_replica_running
@pytest.fixture
def cluster_has_replica_ok(
patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path
) -> Iterator[None]:
cluster_path: Union[str, Path] = "cluster_has_replica_ok.json"
patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json"
if old_replica_state:
cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path)
patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json"
with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}):
yield None
@pytest.mark.usefixtures("cluster_has_replica_ok")
def test_cluster_has_replica_ok(runner: CliRunner, patroni_api: PatroniAPI) -> None:
result = runner.invoke(main, ["-e", patroni_api.endpoint, "cluster_has_replica"])
assert (
result.stdout
== "CLUSTERHASREPLICA OK - healthy_replica is 2 | healthy_replica=2 srv2_lag=0 srv2_sync=0 srv2_timeline=51 srv3_lag=0 srv3_sync=1 srv3_timeline=51 sync_replica=1 unhealthy_replica=0\n"
)
assert result.exit_code == 0
@pytest.mark.usefixtures("cluster_has_replica_ok")
def test_cluster_has_replica_ok_with_count_thresholds(
runner: CliRunner, patroni_api: PatroniAPI
) -> None:
result = runner.invoke(
main,
[
"-e",
patroni_api.endpoint,
"cluster_has_replica",
"--warning",
"@1",
"--critical",
"@0",
],
)
assert (
result.stdout
== "CLUSTERHASREPLICA OK - healthy_replica is 2 | healthy_replica=2;@1;@0 srv2_lag=0 srv2_sync=0 srv2_timeline=51 srv3_lag=0 srv3_sync=1 srv3_timeline=51 sync_replica=1 unhealthy_replica=0\n"
)
assert result.exit_code == 0
@pytest.mark.usefixtures("cluster_has_replica_ok")
def test_cluster_has_replica_ok_with_sync_count_thresholds(
runner: CliRunner, patroni_api: PatroniAPI
) -> None:
result = runner.invoke(
main,
[
"-e",
patroni_api.endpoint,
"cluster_has_replica",
"--sync-warning",
"1:",
],
)
assert (
result.stdout
== "CLUSTERHASREPLICA OK - healthy_replica is 2 | healthy_replica=2 srv2_lag=0 srv2_sync=0 srv2_timeline=51 srv3_lag=0 srv3_sync=1 srv3_timeline=51 sync_replica=1;1: unhealthy_replica=0\n"
)
assert result.exit_code == 0
@pytest.fixture
def cluster_has_replica_ok_lag(
patroni_api: PatroniAPI, datadir: Path, tmp_path: Path, old_replica_state: bool
) -> Iterator[None]:
cluster_path: Union[str, Path] = "cluster_has_replica_ok_lag.json"
patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json"
if old_replica_state:
cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path)
patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json"
with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}):
yield None
@pytest.mark.usefixtures("cluster_has_replica_ok_lag")
def test_cluster_has_replica_ok_with_count_thresholds_lag(
runner: CliRunner, patroni_api: PatroniAPI
) -> None:
result = runner.invoke(
main,
[
"-e",
patroni_api.endpoint,
"cluster_has_replica",
"--warning",
"@1",
"--critical",
"@0",
"--max-lag",
"1MB",
],
)
assert (
result.stdout
== "CLUSTERHASREPLICA OK - healthy_replica is 2 | healthy_replica=2;@1;@0 srv2_lag=1024 srv2_sync=0 srv2_timeline=51 srv3_lag=0 srv3_sync=0 srv3_timeline=51 sync_replica=0 unhealthy_replica=0\n"
)
assert result.exit_code == 0
@pytest.fixture
def cluster_has_replica_ko(
patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path
) -> Iterator[None]:
cluster_path: Union[str, Path] = "cluster_has_replica_ko.json"
patroni_path: Union[str, Path] = "cluster_has_replica_patroni_verion_3.1.0.json"
if old_replica_state:
cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path)
patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json"
with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}):
yield None
@pytest.mark.usefixtures("cluster_has_replica_ko")
def test_cluster_has_replica_ko_with_count_thresholds(
runner: CliRunner, patroni_api: PatroniAPI
) -> None:
result = runner.invoke(
main,
[
"-e",
patroni_api.endpoint,
"cluster_has_replica",
"--warning",
"@1",
"--critical",
"@0",
],
)
assert (
result.stdout
== "CLUSTERHASREPLICA WARNING - healthy_replica is 1 (outside range @0:1) | healthy_replica=1;@1;@0 srv3_lag=0 srv3_sync=0 srv3_timeline=51 sync_replica=0 unhealthy_replica=1\n"
)
assert result.exit_code == 1
@pytest.mark.usefixtures("cluster_has_replica_ko")
def test_cluster_has_replica_ko_with_sync_count_thresholds(
runner: CliRunner, patroni_api: PatroniAPI
) -> None:
result = runner.invoke(
main,
[
"-e",
patroni_api.endpoint,
"cluster_has_replica",
"--sync-warning",
"2:",
"--sync-critical",
"1:",
],
)
# The lag on srv2 is "unknown". Perfstats cannot carry strings, so all the stats for the second node are dropped.
assert (
result.stdout
== "CLUSTERHASREPLICA CRITICAL - sync_replica is 0 (outside range 1:) | healthy_replica=1 srv3_lag=0 srv3_sync=0 srv3_timeline=51 sync_replica=0;2:;1: unhealthy_replica=1\n"
)
assert result.exit_code == 2
@pytest.fixture
def cluster_has_replica_ko_lag(
patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path
) -> Iterator[None]:
cluster_path: Union[str, Path] = "cluster_has_replica_ko_lag.json"
patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json"
if old_replica_state:
cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path)
patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json"
with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}):
yield None
@pytest.mark.usefixtures("cluster_has_replica_ko_lag")
def test_cluster_has_replica_ko_with_count_thresholds_and_lag(
runner: CliRunner, patroni_api: PatroniAPI
) -> None:
result = runner.invoke(
main,
[
"-e",
patroni_api.endpoint,
"cluster_has_replica",
"--warning",
"@1",
"--critical",
"@0",
"--max-lag",
"1MB",
],
)
assert (
result.stdout
== "CLUSTERHASREPLICA CRITICAL - healthy_replica is 0 (outside range @0:0) | healthy_replica=0;@1;@0 srv2_lag=10241024 srv2_sync=0 srv2_timeline=51 srv3_lag=20000000 srv3_sync=0 srv3_timeline=51 sync_replica=0 unhealthy_replica=2\n"
)
assert result.exit_code == 2
@pytest.fixture
def cluster_has_replica_ko_wrong_tl(
patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path
) -> Iterator[None]:
cluster_path: Union[str, Path] = "cluster_has_replica_ko_wrong_tl.json"
patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json"
if old_replica_state:
cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path)
patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json"
with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}):
yield None
@pytest.mark.usefixtures("cluster_has_replica_ko_wrong_tl")
def test_cluster_has_replica_ko_wrong_tl(
runner: CliRunner, patroni_api: PatroniAPI
) -> None:
result = runner.invoke(
main,
[
"-e",
patroni_api.endpoint,
"cluster_has_replica",
"--warning",
"@1",
"--critical",
"@0",
"--max-lag",
"1MB",
],
)
assert (
result.stdout
== "CLUSTERHASREPLICA WARNING - healthy_replica is 1 (outside range @0:1) | healthy_replica=1;@1;@0 srv2_lag=1000000 srv2_sync=0 srv2_timeline=50 srv3_lag=0 srv3_sync=0 srv3_timeline=51 sync_replica=0 unhealthy_replica=1\n"
)
assert result.exit_code == 1
@pytest.fixture
def cluster_has_replica_ko_all_replica(
patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path
) -> Iterator[None]:
cluster_path: Union[str, Path] = "cluster_has_replica_ko_all_replica.json"
patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json"
if old_replica_state:
cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path)
patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json"
with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}):
yield None
@pytest.mark.usefixtures("cluster_has_replica_ko_all_replica")
def test_cluster_has_replica_ko_all_replica(
runner: CliRunner, patroni_api: PatroniAPI
) -> None:
result = runner.invoke(
main,
[
"-e",
patroni_api.endpoint,
"cluster_has_replica",
"--warning",
"@1",
"--critical",
"@0",
"--max-lag",
"1MB",
],
)
assert (
result.stdout
== "CLUSTERHASREPLICA CRITICAL - healthy_replica is 0 (outside range @0:0) | healthy_replica=0;@1;@0 srv1_lag=0 srv1_sync=0 srv1_timeline=51 srv2_lag=0 srv2_sync=0 srv2_timeline=51 srv3_lag=0 srv3_sync=0 srv3_timeline=51 sync_replica=0 unhealthy_replica=3\n"
)
assert result.exit_code == 2
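
The services above map directly onto the command line; for reference, a minimal
sketch of a manual run matching the threshold tests (the endpoint URL is an
assumption):

```
# Hypothetical endpoint; options mirror the test invocations above.
check_patroni -e http://127.0.0.1:8008 cluster_has_replica \
    --warning @1 --critical @0 --max-lag 1MB
```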

View file

@ -1,51 +0,0 @@
from click.testing import CliRunner
from check_patroni.cli import main
from . import PatroniAPI
def test_cluster_has_scheduled_action_ok(
runner: CliRunner, patroni_api: PatroniAPI
) -> None:
with patroni_api.routes({"cluster": "cluster_has_scheduled_action_ok.json"}):
result = runner.invoke(
main, ["-e", patroni_api.endpoint, "cluster_has_scheduled_action"]
)
assert result.exit_code == 0
assert (
result.stdout
== "CLUSTERHASSCHEDULEDACTION OK - has_scheduled_actions is 0 | has_scheduled_actions=0;;0 scheduled_restart=0 scheduled_switchover=0\n"
)
def test_cluster_has_scheduled_action_ko_switchover(
runner: CliRunner, patroni_api: PatroniAPI
) -> None:
with patroni_api.routes(
{"cluster": "cluster_has_scheduled_action_ko_switchover.json"}
):
result = runner.invoke(
main, ["-e", patroni_api.endpoint, "cluster_has_scheduled_action"]
)
assert result.exit_code == 2
assert (
result.stdout
== "CLUSTERHASSCHEDULEDACTION CRITICAL - has_scheduled_actions is 1 (outside range 0:0) | has_scheduled_actions=1;;0 scheduled_restart=0 scheduled_switchover=1\n"
)
def test_cluster_has_scheduled_action_ko_restart(
runner: CliRunner, patroni_api: PatroniAPI
) -> None:
with patroni_api.routes(
{"cluster": "cluster_has_scheduled_action_ko_restart.json"}
):
result = runner.invoke(
main, ["-e", patroni_api.endpoint, "cluster_has_scheduled_action"]
)
assert result.exit_code == 2
assert (
result.stdout
== "CLUSTERHASSCHEDULEDACTION CRITICAL - has_scheduled_actions is 1 (outside range 0:0) | has_scheduled_actions=1;;0 scheduled_restart=1 scheduled_switchover=0\n"
)

View file

@ -1,49 +0,0 @@
from click.testing import CliRunner
from check_patroni.cli import main
from . import PatroniAPI
def test_cluster_is_in_maintenance_ok(
runner: CliRunner, patroni_api: PatroniAPI
) -> None:
with patroni_api.routes({"cluster": "cluster_is_in_maintenance_ok.json"}):
result = runner.invoke(
main, ["-e", patroni_api.endpoint, "cluster_is_in_maintenance"]
)
assert result.exit_code == 0
assert (
result.stdout
== "CLUSTERISINMAINTENANCE OK - is_in_maintenance is 0 | is_in_maintenance=0;;0\n"
)
def test_cluster_is_in_maintenance_ko(
runner: CliRunner, patroni_api: PatroniAPI
) -> None:
with patroni_api.routes({"cluster": "cluster_is_in_maintenance_ko.json"}):
result = runner.invoke(
main, ["-e", patroni_api.endpoint, "cluster_is_in_maintenance"]
)
assert result.exit_code == 2
assert (
result.stdout
== "CLUSTERISINMAINTENANCE CRITICAL - is_in_maintenance is 1 (outside range 0:0) | is_in_maintenance=1;;0\n"
)
def test_cluster_is_in_maintenance_ok_pause_false(
runner: CliRunner, patroni_api: PatroniAPI
) -> None:
with patroni_api.routes(
{"cluster": "cluster_is_in_maintenance_ok_pause_false.json"}
):
result = runner.invoke(
main, ["-e", patroni_api.endpoint, "cluster_is_in_maintenance"]
)
assert result.exit_code == 0
assert (
result.stdout
== "CLUSTERISINMAINTENANCE OK - is_in_maintenance is 0 | is_in_maintenance=0;;0\n"
)

View file

@ -1,272 +0,0 @@
from pathlib import Path
from typing import Iterator, Union
import pytest
from click.testing import CliRunner
from check_patroni.cli import main
from . import PatroniAPI, cluster_api_set_replica_running
@pytest.fixture
def cluster_node_count_ok(
patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path
) -> Iterator[None]:
cluster_path: Union[str, Path] = "cluster_node_count_ok.json"
patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json"
if old_replica_state:
cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path)
patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json"
with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}):
yield None
@pytest.mark.usefixtures("cluster_node_count_ok")
def test_cluster_node_count_ok(
runner: CliRunner, patroni_api: PatroniAPI, old_replica_state: bool
) -> None:
result = runner.invoke(main, ["-e", patroni_api.endpoint, "cluster_node_count"])
if old_replica_state:
assert (
result.output
== "CLUSTERNODECOUNT OK - members is 3 | healthy_members=3 members=3 role_leader=1 role_replica=2 state_running=3\n"
)
else:
assert (
result.output
== "CLUSTERNODECOUNT OK - members is 3 | healthy_members=3 members=3 role_leader=1 role_replica=2 state_running=1 state_streaming=2\n"
)
assert result.exit_code == 0
@pytest.mark.usefixtures("cluster_node_count_ok")
def test_cluster_node_count_ok_with_thresholds(
runner: CliRunner, patroni_api: PatroniAPI, old_replica_state: bool
) -> None:
result = runner.invoke(
main,
[
"-e",
patroni_api.endpoint,
"cluster_node_count",
"--warning",
"@0:1",
"--critical",
"@2",
"--healthy-warning",
"@2",
"--healthy-critical",
"@0:1",
],
)
if old_replica_state:
assert (
result.output
== "CLUSTERNODECOUNT OK - members is 3 | healthy_members=3;@2;@1 members=3;@1;@2 role_leader=1 role_replica=2 state_running=3\n"
)
else:
assert (
result.output
== "CLUSTERNODECOUNT OK - members is 3 | healthy_members=3;@2;@1 members=3;@1;@2 role_leader=1 role_replica=2 state_running=1 state_streaming=2\n"
)
assert result.exit_code == 0
@pytest.fixture
def cluster_node_count_healthy_warning(
patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path
) -> Iterator[None]:
cluster_path: Union[str, Path] = "cluster_node_count_healthy_warning.json"
patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json"
if old_replica_state:
cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path)
patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json"
with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}):
yield None
@pytest.mark.usefixtures("cluster_node_count_healthy_warning")
def test_cluster_node_count_healthy_warning(
runner: CliRunner, patroni_api: PatroniAPI, old_replica_state: bool
) -> None:
result = runner.invoke(
main,
[
"-e",
patroni_api.endpoint,
"cluster_node_count",
"--healthy-warning",
"@2",
"--healthy-critical",
"@0:1",
],
)
if old_replica_state:
assert (
result.output
== "CLUSTERNODECOUNT WARNING - healthy_members is 2 (outside range @0:2) | healthy_members=2;@2;@1 members=2 role_leader=1 role_replica=1 state_running=2\n"
)
else:
assert (
result.output
== "CLUSTERNODECOUNT WARNING - healthy_members is 2 (outside range @0:2) | healthy_members=2;@2;@1 members=2 role_leader=1 role_replica=1 state_running=1 state_streaming=1\n"
)
assert result.exit_code == 1
@pytest.fixture
def cluster_node_count_healthy_critical(
patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path
) -> Iterator[None]:
cluster_path: Union[str, Path] = "cluster_node_count_healthy_critical.json"
patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json"
if old_replica_state:
cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path)
patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json"
with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}):
yield None
@pytest.mark.usefixtures("cluster_node_count_healthy_critical")
def test_cluster_node_count_healthy_critical(
runner: CliRunner, patroni_api: PatroniAPI
) -> None:
result = runner.invoke(
main,
[
"-e",
patroni_api.endpoint,
"cluster_node_count",
"--healthy-warning",
"@2",
"--healthy-critical",
"@0:1",
],
)
assert (
result.output
== "CLUSTERNODECOUNT CRITICAL - healthy_members is 1 (outside range @0:1) | healthy_members=1;@2;@1 members=3 role_leader=1 role_replica=2 state_running=1 state_start_failed=2\n"
)
assert result.exit_code == 2
@pytest.fixture
def cluster_node_count_warning(
patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path
) -> Iterator[None]:
cluster_path: Union[str, Path] = "cluster_node_count_warning.json"
patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json"
if old_replica_state:
cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path)
patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json"
with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}):
yield None
@pytest.mark.usefixtures("cluster_node_count_warning")
def test_cluster_node_count_warning(
runner: CliRunner, patroni_api: PatroniAPI, old_replica_state: bool
) -> None:
result = runner.invoke(
main,
[
"-e",
patroni_api.endpoint,
"cluster_node_count",
"--warning",
"@2",
"--critical",
"@0:1",
],
)
if old_replica_state:
assert (
result.stdout
== "CLUSTERNODECOUNT WARNING - members is 2 (outside range @0:2) | healthy_members=2 members=2;@2;@1 role_leader=1 role_replica=1 state_running=2\n"
)
else:
assert (
result.stdout
== "CLUSTERNODECOUNT WARNING - members is 2 (outside range @0:2) | healthy_members=2 members=2;@2;@1 role_leader=1 role_replica=1 state_running=1 state_streaming=1\n"
)
assert result.exit_code == 1
@pytest.fixture
def cluster_node_count_critical(
patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path
) -> Iterator[None]:
cluster_path: Union[str, Path] = "cluster_node_count_critical.json"
patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json"
if old_replica_state:
cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path)
patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json"
with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}):
yield None
@pytest.mark.usefixtures("cluster_node_count_critical")
def test_cluster_node_count_critical(
runner: CliRunner, patroni_api: PatroniAPI
) -> None:
result = runner.invoke(
main,
[
"-e",
patroni_api.endpoint,
"cluster_node_count",
"--warning",
"@2",
"--critical",
"@0:1",
],
)
assert (
result.stdout
== "CLUSTERNODECOUNT CRITICAL - members is 1 (outside range @0:1) | healthy_members=1 members=1;@2;@1 role_leader=1 state_running=1\n"
)
assert result.exit_code == 2
@pytest.fixture
def cluster_node_count_ko_in_archive_recovery(
patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path
) -> Iterator[None]:
cluster_path: Union[str, Path] = "cluster_node_count_ko_in_archive_recovery.json"
patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json"
if old_replica_state:
cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path)
patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json"
with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}):
yield None
@pytest.mark.usefixtures("cluster_node_count_ko_in_archive_recovery")
def test_cluster_node_count_ko_in_archive_recovery(
runner: CliRunner, patroni_api: PatroniAPI, old_replica_state: bool
) -> None:
result = runner.invoke(
main,
[
"-e",
patroni_api.endpoint,
"cluster_node_count",
"--healthy-warning",
"@2",
"--healthy-critical",
"@0:1",
],
)
if old_replica_state:
assert (
result.stdout
== "CLUSTERNODECOUNT OK - members is 3 | healthy_members=3;@2;@1 members=3 role_replica=2 role_standby_leader=1 state_running=3\n"
)
assert result.exit_code == 0
else:
assert (
result.stdout
== "CLUSTERNODECOUNT CRITICAL - healthy_members is 1 (outside range @0:1) | healthy_members=1;@2;@1 members=3 role_replica=2 role_standby_leader=1 state_in_archive_recovery=2 state_streaming=1\n"
)
assert result.exit_code == 2

View file

@ -1,30 +0,0 @@
from pathlib import Path
from click.testing import CliRunner
from check_patroni.cli import main
from . import PatroniAPI
def test_node_is_alive_ok(
runner: CliRunner, patroni_api: PatroniAPI, tmp_path: Path
) -> None:
liveness = tmp_path / "liveness"
liveness.touch()
with patroni_api.routes({"liveness": liveness}):
result = runner.invoke(main, ["-e", patroni_api.endpoint, "node_is_alive"])
assert result.exit_code == 0
assert (
result.stdout
== "NODEISALIVE OK - This node is alive (patroni is running). | is_alive=1;;@0\n"
)
def test_node_is_alive_ko(runner: CliRunner, patroni_api: PatroniAPI) -> None:
result = runner.invoke(main, ["-e", patroni_api.endpoint, "node_is_alive"])
assert result.exit_code == 2
assert (
result.stdout
== "NODEISALIVE CRITICAL - This node is not alive (patroni is not running). | is_alive=0;;@0\n"
)
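
`node_is_alive` relies on Patroni's `/liveness` endpoint, which answers 200 as
long as the Patroni process is running; a quick manual probe (host and port are
assumptions):

```
# Prints the HTTP status code; 200 means Patroni is alive on this node.
curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8008/liveness
```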

View file

@ -1,58 +0,0 @@
from typing import Iterator
import pytest
from click.testing import CliRunner
from check_patroni.cli import main
from . import PatroniAPI
@pytest.fixture
def node_is_leader_ok(patroni_api: PatroniAPI) -> Iterator[None]:
with patroni_api.routes(
{
"leader": "node_is_leader_ok.json",
"standby-leader": "node_is_leader_ok_standby_leader.json",
}
):
yield None
@pytest.mark.usefixtures("node_is_leader_ok")
def test_node_is_leader_ok(runner: CliRunner, patroni_api: PatroniAPI) -> None:
result = runner.invoke(main, ["-e", patroni_api.endpoint, "node_is_leader"])
assert result.exit_code == 0
assert (
result.stdout
== "NODEISLEADER OK - This node is a leader node. | is_leader=1;;@0\n"
)
result = runner.invoke(
main,
["-e", patroni_api.endpoint, "node_is_leader", "--is-standby-leader"],
)
assert result.exit_code == 0
assert (
result.stdout
== "NODEISLEADER OK - This node is a standby leader node. | is_leader=1;;@0\n"
)
def test_node_is_leader_ko(runner: CliRunner, patroni_api: PatroniAPI) -> None:
result = runner.invoke(main, ["-e", patroni_api.endpoint, "node_is_leader"])
assert result.exit_code == 2
assert (
result.stdout
== "NODEISLEADER CRITICAL - This node is not a leader node. | is_leader=0;;@0\n"
)
result = runner.invoke(
main,
["-e", patroni_api.endpoint, "node_is_leader", "--is-standby-leader"],
)
assert result.exit_code == 2
assert (
result.stdout
== "NODEISLEADER CRITICAL - This node is not a standby leader node. | is_leader=0;;@0\n"
)

View file

@ -1,29 +0,0 @@
from click.testing import CliRunner
from check_patroni.cli import main
from . import PatroniAPI
def test_node_is_pending_restart_ok(runner: CliRunner, patroni_api: PatroniAPI) -> None:
with patroni_api.routes({"patroni": "node_is_pending_restart_ok.json"}):
result = runner.invoke(
main, ["-e", patroni_api.endpoint, "node_is_pending_restart"]
)
assert result.exit_code == 0
assert (
result.stdout
== "NODEISPENDINGRESTART OK - This node doesn't have the pending restart flag. | is_pending_restart=0;;0\n"
)
def test_node_is_pending_restart_ko(runner: CliRunner, patroni_api: PatroniAPI) -> None:
with patroni_api.routes({"patroni": "node_is_pending_restart_ko.json"}):
result = runner.invoke(
main, ["-e", patroni_api.endpoint, "node_is_pending_restart"]
)
assert result.exit_code == 2
assert (
result.stdout
== "NODEISPENDINGRESTART CRITICAL - This node has the pending restart flag. | is_pending_restart=1;;0\n"
)

View file

@ -1,24 +0,0 @@
from click.testing import CliRunner
from check_patroni.cli import main
from . import PatroniAPI
def test_node_is_primary_ok(runner: CliRunner, patroni_api: PatroniAPI) -> None:
with patroni_api.routes({"primary": "node_is_primary_ok.json"}):
result = runner.invoke(main, ["-e", patroni_api.endpoint, "node_is_primary"])
assert result.exit_code == 0
assert (
result.stdout
== "NODEISPRIMARY OK - This node is the primary with the leader lock. | is_primary=1;;@0\n"
)
def test_node_is_primary_ko(runner: CliRunner, patroni_api: PatroniAPI) -> None:
result = runner.invoke(main, ["-e", patroni_api.endpoint, "node_is_primary"])
assert result.exit_code == 2
assert (
result.stdout
== "NODEISPRIMARY CRITICAL - This node is not the primary with the leader lock. | is_primary=0;;@0\n"
)

View file

@ -1,155 +0,0 @@
from typing import Iterator
import pytest
from click.testing import CliRunner
from check_patroni.cli import main
from . import PatroniAPI
@pytest.fixture
def node_is_replica_ok(patroni_api: PatroniAPI) -> Iterator[None]:
with patroni_api.routes(
{
k: "node_is_replica_ok.json"
for k in ("replica", "synchronous", "asynchronous")
}
):
yield None
@pytest.mark.usefixtures("node_is_replica_ok")
def test_node_is_replica_ok(runner: CliRunner, patroni_api: PatroniAPI) -> None:
result = runner.invoke(main, ["-e", patroni_api.endpoint, "node_is_replica"])
assert result.exit_code == 0
assert (
result.stdout
== "NODEISREPLICA OK - This node is a running replica with no noloadbalance tag. | is_replica=1;;@0\n"
)
def test_node_is_replica_ko(runner: CliRunner, patroni_api: PatroniAPI) -> None:
result = runner.invoke(main, ["-e", patroni_api.endpoint, "node_is_replica"])
assert result.exit_code == 2
assert (
result.stdout
== "NODEISREPLICA CRITICAL - This node is not a running replica with no noloadbalance tag. | is_replica=0;;@0\n"
)
def test_node_is_replica_ko_lag(runner: CliRunner, patroni_api: PatroniAPI) -> None:
# We don't do the check ourselves, patroni does it and changes the return code
result = runner.invoke(
main, ["-e", patroni_api.endpoint, "node_is_replica", "--max-lag", "100"]
)
assert result.exit_code == 2
assert (
result.stdout
== "NODEISREPLICA CRITICAL - This node is not a running replica with no noloadbalance tag and a lag under 100. | is_replica=0;;@0\n"
)
result = runner.invoke(
main,
[
"-e",
patroni_api.endpoint,
"node_is_replica",
"--is-async",
"--max-lag",
"100",
],
)
assert result.exit_code == 2
assert (
result.stdout
== "NODEISREPLICA CRITICAL - This node is not a running asynchronous replica with no noloadbalance tag and a lag under 100. | is_replica=0;;@0\n"
)
@pytest.mark.usefixtures("node_is_replica_ok")
def test_node_is_replica_sync_ok(runner: CliRunner, patroni_api: PatroniAPI) -> None:
# We don't do the check ourselves, patroni does it and changes the return code
result = runner.invoke(
main, ["-e", patroni_api.endpoint, "node_is_replica", "--is-sync"]
)
assert result.exit_code == 0
assert (
result.stdout
== "NODEISREPLICA OK - This node is a running synchronous replica with no noloadbalance tag. | is_replica=1;;@0\n"
)
def test_node_is_replica_sync_ko(runner: CliRunner, patroni_api: PatroniAPI) -> None:
# We don't do the check ourselves, patroni does it and changes the return code
result = runner.invoke(
main, ["-e", patroni_api.endpoint, "node_is_replica", "--is-sync"]
)
assert result.exit_code == 2
assert (
result.stdout
== "NODEISREPLICA CRITICAL - This node is not a running synchronous replica with no noloadbalance tag. | is_replica=0;;@0\n"
)
@pytest.mark.usefixtures("node_is_replica_ok")
def test_node_is_replica_async_ok(runner: CliRunner, patroni_api: PatroniAPI) -> None:
# We don't do the check ourselves, patroni does it and changes the return code
result = runner.invoke(
main, ["-e", patroni_api.endpoint, "node_is_replica", "--is-async"]
)
assert result.exit_code == 0
assert (
result.stdout
== "NODEISREPLICA OK - This node is a running asynchronous replica with no noloadbalance tag. | is_replica=1;;@0\n"
)
def test_node_is_replica_async_ko(runner: CliRunner, patroni_api: PatroniAPI) -> None:
# We don't do the check ourselves, patroni does it and changes the return code
result = runner.invoke(
main, ["-e", patroni_api.endpoint, "node_is_replica", "--is-async"]
)
assert result.exit_code == 2
assert (
result.stdout
== "NODEISREPLICA CRITICAL - This node is not a running asynchronous replica with no noloadbalance tag. | is_replica=0;;@0\n"
)
@pytest.mark.usefixtures("node_is_replica_ok")
def test_node_is_replica_params(runner: CliRunner, patroni_api: PatroniAPI) -> None:
# --is-sync and --is-async are mutually exclusive; the check rejects them together
result = runner.invoke(
main,
[
"-e",
patroni_api.endpoint,
"node_is_replica",
"--is-async",
"--is-sync",
],
)
assert result.exit_code == 3
assert (
result.stdout
== "NODEISREPLICA UNKNOWN: click.exceptions.UsageError: --is-sync and --is-async cannot be provided at the same time for this service\n"
)
# --is-sync and --max-lag are mutually exclusive; the check rejects them together
result = runner.invoke(
main,
[
"-e",
patroni_api.endpoint,
"node_is_replica",
"--is-sync",
"--max-lag",
"1MB",
],
)
assert result.exit_code == 3
assert (
result.stdout
== "NODEISREPLICA UNKNOWN: click.exceptions.UsageError: --is-sync and --max-lag cannot be provided at the same time for this service\n"
)

View file

@ -1,50 +0,0 @@
from typing import Iterator
import pytest
from click.testing import CliRunner
from check_patroni.cli import main
from . import PatroniAPI
@pytest.fixture(scope="module", autouse=True)
def node_patroni_version(patroni_api: PatroniAPI) -> Iterator[None]:
with patroni_api.routes({"patroni": "node_patroni_version.json"}):
yield None
def test_node_patroni_version_ok(runner: CliRunner, patroni_api: PatroniAPI) -> None:
result = runner.invoke(
main,
[
"-e",
patroni_api.endpoint,
"node_patroni_version",
"--patroni-version",
"2.0.2",
],
)
assert result.exit_code == 0
assert (
result.stdout
== "NODEPATRONIVERSION OK - Patroni's version is 2.0.2. | is_version_ok=1;;@0\n"
)
def test_node_patroni_version_ko(runner: CliRunner, patroni_api: PatroniAPI) -> None:
result = runner.invoke(
main,
[
"-e",
patroni_api.endpoint,
"node_patroni_version",
"--patroni-version",
"1.0.0",
],
)
assert result.exit_code == 2
assert (
result.stdout
== "NODEPATRONIVERSION CRITICAL - Patroni's version is not 1.0.0. | is_version_ok=0;;@0\n"
)

View file

@ -1,173 +0,0 @@
from pathlib import Path
from typing import Iterator
import nagiosplugin
import pytest
from click.testing import CliRunner
from check_patroni.cli import main
from . import PatroniAPI
@pytest.fixture
def node_tl_has_changed(patroni_api: PatroniAPI) -> Iterator[None]:
with patroni_api.routes({"patroni": "node_tl_has_changed.json"}):
yield None
@pytest.mark.usefixtures("node_tl_has_changed")
def test_node_tl_has_changed_ok_with_timeline(
runner: CliRunner, patroni_api: PatroniAPI
) -> None:
result = runner.invoke(
main,
[
"-e",
patroni_api.endpoint,
"node_tl_has_changed",
"--timeline",
"58",
],
)
assert result.exit_code == 0
assert (
result.stdout
== "NODETLHASCHANGED OK - The timeline is still 58. | is_timeline_changed=0;;@1:1 timeline=58\n"
)
@pytest.mark.usefixtures("node_tl_has_changed")
def test_node_tl_has_changed_ok_with_state_file(
runner: CliRunner, patroni_api: PatroniAPI, tmp_path: Path
) -> None:
state_file = tmp_path / "node_tl_has_changed.state_file"
with state_file.open("w") as f:
f.write('{"timeline": 58}')
result = runner.invoke(
main,
[
"-e",
patroni_api.endpoint,
"node_tl_has_changed",
"--state-file",
str(state_file),
],
)
assert result.exit_code == 0
assert (
result.stdout
== "NODETLHASCHANGED OK - The timeline is still 58. | is_timeline_changed=0;;@1:1 timeline=58\n"
)
@pytest.mark.usefixtures("node_tl_has_changed")
def test_node_tl_has_changed_ko_with_timeline(
runner: CliRunner, patroni_api: PatroniAPI
) -> None:
result = runner.invoke(
main,
[
"-e",
patroni_api.endpoint,
"node_tl_has_changed",
"--timeline",
"700",
],
)
assert result.exit_code == 2
assert (
result.stdout
== "NODETLHASCHANGED CRITICAL - The expected timeline was 700 got 58. | is_timeline_changed=1;;@1:1 timeline=58\n"
)
@pytest.mark.usefixtures("node_tl_has_changed")
def test_node_tl_has_changed_ko_with_state_file_and_save(
runner: CliRunner, patroni_api: PatroniAPI, tmp_path: Path
) -> None:
state_file = tmp_path / "node_tl_has_changed.state_file"
with state_file.open("w") as f:
f.write('{"timeline": 700}')
# test without saving the new tl
result = runner.invoke(
main,
[
"-e",
patroni_api.endpoint,
"node_tl_has_changed",
"--state-file",
str(state_file),
],
)
assert result.exit_code == 2
assert (
result.stdout
== "NODETLHASCHANGED CRITICAL - The expected timeline was 700 got 58. | is_timeline_changed=1;;@1:1 timeline=58\n"
)
cookie = nagiosplugin.Cookie(state_file)
cookie.open()
new_tl = cookie.get("timeline")
cookie.close()
assert new_tl == 700
# test when we save the hash
result = runner.invoke(
main,
[
"-e",
patroni_api.endpoint,
"node_tl_has_changed",
"--state-file",
str(state_file),
"--save",
],
)
assert result.exit_code == 2
assert (
result.stdout
== "NODETLHASCHANGED CRITICAL - The expected timeline was 700 got 58. | is_timeline_changed=1;;@1:1 timeline=58\n"
)
cookie = nagiosplugin.Cookie(state_file)
cookie.open()
new_tl = cookie.get("timeline")
cookie.close()
assert new_tl == 58
@pytest.mark.usefixtures("node_tl_has_changed")
def test_node_tl_has_changed_params(
runner: CliRunner, patroni_api: PatroniAPI, tmp_path: Path
) -> None:
# This test is placed last because the exceptions do not seem to be flushed from stderr before the following tests.
fake_state_file = tmp_path / "fake_file_name.state_file"
result = runner.invoke(
main,
[
"-e",
patroni_api.endpoint,
"node_tl_has_changed",
"--timeline",
"58",
"--state-file",
str(fake_state_file),
],
)
assert result.exit_code == 3
assert (
result.stdout
== "NODETLHASCHANGED UNKNOWN: click.exceptions.UsageError: Either --timeline or --state-file should be provided for this service\n"
)
result = runner.invoke(main, ["-e", patroni_api.endpoint, "node_tl_has_changed"])
assert result.exit_code == 3
assert (
result.stdout
== "NODETLHASCHANGED UNKNOWN: click.exceptions.UsageError: Either --timeline or --state-file should be provided for this service\n"
)

49
tox.ini
View file

@ -1,49 +0,0 @@
[tox]
# the versions specified here are overridden by the GitHub workflow
envlist = lint, mypy, py{37,38,39,310,311}
skip_missing_interpreters = True
[testenv]
extras = test
commands =
pytest {toxinidir}/check_patroni {toxinidir}/tests {posargs:-vv --log-level=debug}
[testenv:lint]
skip_install = True
deps =
codespell
black
flake8
isort
commands =
codespell {toxinidir}/check_patroni {toxinidir}/tests {toxinidir}/docs/ {toxinidir}/RELEASE.md {toxinidir}/CONTRIBUTING.md
black --check --diff {toxinidir}/check_patroni {toxinidir}/tests
flake8 {toxinidir}/check_patroni {toxinidir}/tests
isort --check --diff {toxinidir}/check_patroni {toxinidir}/tests
[testenv:mypy]
deps =
mypy == 0.961
commands =
# we need to install types-requests
mypy --install-types --non-interactive
[testenv:build]
deps =
wheel
setuptools
twine
allowlist_externals =
rm
commands =
rm --verbose --recursive --force {toxinidir}/dist/
python -m build
python -m twine check dist/*
[testenv:upload]
# requires a check_patroni section in ~/.pypirc
skip_install = True
deps =
twine
commands =
python -m twine upload --repository check_patroni dist/*
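
For reference, these environments are driven locally with tox; a minimal
sketch, where the `pypi-...` token is a placeholder:

```
pip install tox
tox -e lint,mypy    # static checks
tox -e py311        # test suite on one interpreter
# The upload env expects a check_patroni section in ~/.pypirc, e.g.:
cat > ~/.pypirc <<'EOF'
[distutils]
index-servers = check_patroni

[check_patroni]
repository = https://upload.pypi.org/legacy/
username = __token__
password = pypi-...
EOF
tox -e build,upload
```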

View file

@ -1,29 +0,0 @@
BSD 3-Clause License
Copyright (c) 2019, Jehan-Guillaume (ioguix) de Rorthais
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
* Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

View file

@ -1,22 +0,0 @@
export VAGRANT_BOX_UPDATE_CHECK_DISABLE=1
export VAGRANT_CHECKPOINT_DISABLE=1
.PHONY: all prov clean validate
all: prov
prov:
vagrant up --provision
clean:
vagrant destroy -f
validate:
@vagrant validate
@if which shellcheck >/dev/null ;\
then shellcheck provision/* ;\
else echo "WARNING: shellcheck is not in PATH, not checking bash syntax" ;\
fi
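
Typical usage of these targets, for reference:

```
make            # same as `make prov`: vagrant up --provision
make validate   # vagrant validate + shellcheck on provision/*
make clean      # vagrant destroy -f
```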

View file

@ -1,127 +0,0 @@
# Icinga
## Install
Create the VM:
```
make
```
## IcingaWeb
Configure IcingaWeb:
```
http://$IP/icingaweb2/setup
```
* Screen 1: Welcome
Use the Icinga token given at the end of the `icinga2-setup` provision, or:
```
sudo icingacli setup token show
```
Next
* Screen 2: Modules
Activate the Monitoring module (already selected)
Next
* Screen 3: Icinga Web 2
Next
* Screen 4: Authentication
Next
* Screen 5: Database Resource
Database Name: icingaweb_db
Username: supervisor
Password: th3Pass
Charset: UTF8
Validate
Next
* Screen 6: Authentication Backend
Next
* Screen 7: Administration
Fill the blanks
Next
* Screen 8: Application Configuration
Next
* Screen 9: Summary
Next
* Screen 10: Welcome ... again
Next
* Screen 11: Monitoring IDO Resource
Database Name: icinga2
Username: supervisor
Password: th3Pass
Charset: UTF8
Validate
Next
* Screen 12: Command Transport
Transport name: icinga2
Transport Type: API
Host: 127.0.0.1
Port: 5665
User: icinga_api
Password: th3Pass
Next
* Screen 13: Monitoring Security
Next
* Screen 14: Summary
Finish
* Screen 15: Hopefully success
Login
## Add servers to icinga
```
# Connect to the vm
vagrant ssh s1
# Create /etc/icinga2/conf.d/check_patroni.conf
sudo /vagrant/provision/director.bash init cluster1 p1=10.20.89.54 p2=10.20.89.55
# Check and load conf
sudo icinga2 daemon -C
sudo systemctl restart icinga2.service
```
# Grafana
Connect to: http://10.20.89.52:3000/login
User / pass: admin/admin
Import the dashboards from the grafana directory. They are created for cluster1
and the servers p1 and p2.
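As an alternative to the web UI, a dashboard export can be pushed through
Grafana's HTTP API; a minimal sketch, assuming the JSON files live in the
grafana directory (the file name is hypothetical):

```
# Wrap the export in the payload Grafana expects, then POST it.
jq '{dashboard: ., overwrite: true}' grafana/cluster1.json |
  curl -s -u admin:admin -H 'Content-Type: application/json' \
       -X POST http://10.20.89.52:3000/api/dashboards/db -d @-
```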

Some files were not shown because too many files have changed in this diff.