Commit graph

57 commits

Author SHA1 Message Date
benoit 807f9b2071 Release V2.0.0 2024-04-09 16:45:11 +02:00
benoit a4ed20210c Improve doc for node_is_replica
node_is_replica is using the following Patroni endpoints: replica, asynchronous
and synchronous. The first two implement the lag tag. For these endpoints
the state of a replica node doesn't reflect the replication state
(streaming or in archive recovery), we only know if it's running. The
timeline is also not checked.

Therefore, if a cluster is using asynchronous replication, it is recommended
to check for the lag to detect a divergence as soon as possible.
2024-02-26 16:02:53 +01:00
benoit 78ef0f6ada Fix cluster_node_count's management of replication states
The service now supports the `streaming` state.

Since we don't check for lag or timeline in this service, a healthy node
is:

* leader : in a running state
* standby_leader : running (pre Patroni 3.0.4), streaming otherwise
* standby & sync_standby : running (pre Patroni 3.0.4), streaming otherwise

Updated the tests for this service.
2024-01-09 06:50:00 +01:00
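
The rule described in commit 78ef0f6ada can be sketched as follows. This is a minimal illustration, not the project's code: `parse_version` and `is_healthy_member` are hypothetical names, and `replica` is assumed to be the role string for what the message calls a standby.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version such as "3.0.4" into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def is_healthy_member(role: str, state: str, patroni_version: str) -> bool:
    """Apply the per-role state rule; lag and timeline are not checked."""
    if role == "leader":
        return state == "running"
    if role in ("standby_leader", "replica", "sync_standby"):
        # Patroni 3.0.4 and later report "streaming" where older
        # versions reported "running".
        if parse_version(patroni_version) >= (3, 0, 4):
            return state == "streaming"
        return state == "running"
    return False
```

Comparing tuples rather than strings avoids ordering mistakes such as "3.0.10" sorting before "3.0.4".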
benoit 46db3e2d15 Fix the cluster_has_leader service for standby clusters
Before this patch we checked the expected standby leader state
was `running` for all versions of Patroni.

With this patch, for:
* Patroni < 3.0.4, standby leaders are in `running` state.
* Patroni >= 3.0.4, standby leaders can be in `streaming` or `in
archive recovery` state. We will raise a warning for the latter.

The tests were modified to account for this.

Co-authored-by: Denis Laxalde <denis@laxalde.org>
2023-12-18 13:17:37 +01:00
benoit 8d6b8502b6 cluster_has_replica: fix the way a healthy replica is detected
For patroni >= version 3.0.4:
* the role is `replica` or `sync_standby`
* the state is `streaming` or `in archive recovery`
* the timeline is the same as the leader
* the lag is lower or equal to `max_lag`

For prior versions of patroni:
* the role is `replica` or `sync_standby`
* the state is `running`
* the timeline is the same as the leader
* the lag is lower or equal to `max_lag`

Additionally, we now display the timeline in the perfstats. We also try
to display the perf stats of unhealthy replicas as much as possible.

Update tests for cluster_has_replica:
* Fix the tests to make them work with the new algorithm
* Add a specific test for tl divergences
2023-11-11 10:50:35 +01:00
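
The four conditions listed in commit 8d6b8502b6 can be combined into a single predicate. The sketch below is an assumption-laden illustration: `is_healthy_replica` is a hypothetical name, and the member is assumed to be a dict with `role`, `state`, `timeline` and `lag` keys.

```python
from typing import Any, Dict, Tuple


def is_healthy_replica(
    member: Dict[str, Any],
    leader_timeline: int,
    max_lag: int,
    patroni_version: Tuple[int, int, int],
) -> bool:
    """Return True when the member meets all four conditions."""
    if member.get("role") not in ("replica", "sync_standby"):
        return False
    # The accepted states depend on the Patroni version.
    if patroni_version >= (3, 0, 4):
        valid_states = ("streaming", "in archive recovery")
    else:
        valid_states = ("running",)
    if member.get("state") not in valid_states:
        return False
    # A replica on another timeline has diverged from the leader.
    if member.get("timeline") != leader_timeline:
        return False
    lag = member.get("lag")
    # A missing or non-numeric lag is treated as unhealthy here (assumption).
    return isinstance(lag, int) and lag <= max_lag
```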
Denis Laxalde 6ee8db1df2 Avoid using requests's JSONDecodeError
This exception is only present in "recent" versions of requests,
typically not in the version distributed by Debian bullseye. Since
requests' JSONDecodeError is in general a subclass of
json.JSONDecodeError, we use the latter, but also handle the plain
ValueError (which json.JSONDecodeError is a subclass of) because
requests might use simplejson (which uses its own JSONDecodeError, also
a subclass of ValueError).
2023-10-13 11:45:39 +02:00
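
The approach in commit 6ee8db1df2 relies on `json.JSONDecodeError` being a subclass of `ValueError`. A minimal sketch of the pattern follows; `safe_decode` is a hypothetical helper, and the callable stands in for something like a bound `response.json`.

```python
import json
from typing import Any, Callable, Optional


def safe_decode(decode: Callable[[], Any]) -> Optional[Any]:
    """Run a JSON-decoding callable and return None on invalid input."""
    try:
        return decode()
    except (json.JSONDecodeError, ValueError):
        # ValueError alone would be enough, since json.JSONDecodeError
        # derives from it; listing both documents the intent. It also
        # covers decoders that raise their own ValueError subclass.
        return None
```

Because the except clause names only standard library exceptions, the code does not depend on which exception classes the installed version of requests exposes.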
Denis Laxalde a0189ebba7 Fix some typos spotted by codespell 2.2.6 2023-10-03 09:53:53 +02:00
Denis Laxalde 95f21a133d Drop superfluous type annotation of 'self'
See https://mypy.readthedocs.io/en/stable/cheat_sheet_py3.html#classes
> For instance methods, omit type for "self"
2023-10-03 09:39:40 +02:00
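
As a short illustration of the convention quoted in commit 95f21a133d, the hypothetical class below annotates every parameter except `self`, whose type is inferred by the type checker.

```python
class Counter:
    """Hypothetical example class; only non-self parameters are annotated."""

    def __init__(self, start: int = 0) -> None:
        self.value = start

    def increment(self, step: int = 1) -> int:
        # "self" needs no annotation: mypy infers it as Counter.
        self.value += step
        return self.value
```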
benoit 68b230ccb2 Release 1.0.0 2023-08-28 12:09:16 +02:00
benoit ee3837fab1 Add info and options about sync standby to cluster_has_replica
* Add `--sync-warning` and `--sync-critical`
* Add `sync_replica` to track the number of sync replicas in the perf data
* Add `MEMBER-sync` to track if a member is a sync replica in the perf data
2023-08-24 17:34:08 +02:00
benoit b69c2501fd Revert commit 20d9d48
For `node_is_alive`, it seemed to be a good idea to exit with a
`CRITICAL` when the target doesn't exist. But for all the rest, UNKNOWN
(which corresponds to a configuration error) seems better.
2023-08-23 18:35:06 +02:00
benoit 259f04587b Add a node_is_leader service to check for the leader states
It's possible to check for any kind of leader or specifically for a
standby leader.
2023-08-23 18:22:49 +02:00
benoit 8883d6bdc4 Add standby-leader as a valid leader type for cluster_has_leader 2023-08-23 18:22:49 +02:00
benoit 20d9d48d78 Fix the liveness check in node_is_alive
Previously, if a node wasn't reachable, we would get an UNKNOWN error
instead of a CRITICAL error.

```
NODEISALIVE UNKNOWN - Connection failed for all provided endpoints
```

We now get the correct error.

```
NODEISALIVE CRITICAL - This node is not alive (patroni is not running). | is_alive=0;;@0
```
2023-08-23 16:05:02 +02:00
benoit 2bcddf9f87 Add a --is-sync and --is-async to node_is_replica 2023-08-23 15:41:15 +02:00
benoit 99bf1c5bb5 Add new service cluster_has_scheduled_action 2023-08-23 12:08:19 +02:00
benoit d99faeba15 Add tests for the output of the script and support pre/post 3.0.4
* Change all replica status from `running` to `streaming`
* Add an option to pytest to change the state back to `running`
* Also test the output of the script
* Add a quick test script for live clusters
2023-08-23 10:53:09 +02:00
benoit bb68d937e6 Fix a mistake in the vvv output of ClusterConfigHasChanged 2023-08-22 11:15:44 +02:00
benoit d07a26d324 Fix tests in a more pythonic way as proposed by @dlax 2023-08-21 15:13:22 +02:00
benoit 77722f40c1 Fix liveness check
The liveness probe used to return something. It looks like it doesn't do
it anymore and it breaks the `node_is_alive` check.

Issue: #31
2023-08-21 15:13:22 +02:00
benoit a01a535680 Add sync_standby as an acceptable state for a replica 2023-08-21 13:11:08 +02:00
benoit 021b572e53 Redefining cluster_node_count using Patroni 3.0.4's new status indicators
Previously, replica nodes were labeled with a `running` state. As a
result, our checks were based on nodes marked as `running` through
the `--running-[warning|critical]` options.

However, with the recent changes in Patroni 3.0.4, replica nodes now
carry a `streaming` state. This shift in terminology calls for an
adjustment in our approach. A new state, `healthy_member`, has been
introduced to encompass both `running` and `streaming` nodes.

Key Modifications:

* The existing `--running-[warning|critical]` option is now designated
  as `--healthy-[warning|critical]`.
* Introduction of the `healthy_member` perfdata, which serves as the
  reference point for the aforementioned options.
* Updates to documentation, help messages, and tests.
2023-08-21 11:59:55 +02:00
benoit 2f250e846e State change in patroni 3.0.4
Since patroni 3.0.4, standby node nominal state is "streaming" instead
of "running". Some services need to be changed to account for that.

Reported in issue #28
2023-08-21 11:59:55 +02:00
benoit e9f197b9d9 Release 0.2.0 2023-03-20 15:11:03 +01:00
benoit df744bf7dc Use isort to automatically sort imports 2023-03-20 14:56:11 +01:00
benoit dff95eae2f Fixing logging issues in the previous modifications 2023-03-20 12:25:32 +01:00
benoit 7eea4c94be Reorganise for the urllib3 > requests change 2023-03-20 12:25:32 +01:00
benoit 2443505ad6 Change logging
We now use our own logger.
When debug is set (-vvv), we also display urllib3's debug info.
2023-03-20 12:25:32 +01:00
benoit 7815f3379c Reverse the test for node_is_pending
Since the desired state is for there to be no restart pending state, it
makes more sense to modify the service logic so that the return code
reflects this. As a result, the test for the service `node_is_pending`
has been reversed.
2023-03-20 12:25:32 +01:00
benoit 9cd80f5af8 Move from urllib3 to requests 2023-03-20 12:25:32 +01:00
benoit 0800fc72e9 Add more info on states and roles for cluster_node_count 2023-03-10 10:16:01 +01:00
benoit 6f45618e08 Update node_is_alive to mention that it's a liveness check. 2023-03-10 10:16:01 +01:00
benoit 7286638121 -e/endpoints spec update
* it is now possible to specify a comma separated list of endpoints
* the documentation has been updated to explain that:
  + for node services if several addresses are specified they should
    point to different interfaces on the same server.
  + for cluster services several addresses should be used because we
    want the cluster status so the more API we try the better our chance
    of having a reply.
2023-03-10 10:16:01 +01:00
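
The endpoint handling described in commit 7286638121 can be sketched as below. This is an illustration under assumptions, not the project's code: `split_endpoints` and `first_reply` are hypothetical names, and `fetch` is an injected callable standing in for the HTTP request.

```python
from typing import Callable, List


def split_endpoints(value: str) -> List[str]:
    """Split a comma separated endpoint list, ignoring blanks."""
    return [item.strip() for item in value.split(",") if item.strip()]


def first_reply(endpoints: List[str], fetch: Callable[[str], str]) -> str:
    """Return the reply of the first endpoint that answers."""
    for endpoint in endpoints:
        try:
            return fetch(endpoint)
        except OSError:
            # ConnectionError and TimeoutError are OSError subclasses;
            # move on to the next address.
            continue
    raise ConnectionError("Connection failed for all provided endpoints")
```

For cluster services, any member's API can answer, so listing more endpoints raises the chance of a reply.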
benoit 48d2656ec7 Update check_patroni's description to mention Patroni's API 2023-03-10 10:16:01 +01:00
benoit 93b196fb77 Fix a format string for cluster_config_has_changed 2023-03-02 17:32:18 +01:00
benoit 275901006b Add spellcheck + tox in requirements-dev.txt 2023-03-02 17:32:18 +01:00
benoit 908669f073 Add a --save option when state files are used
The checks `cluster_config_has_changed` and `node_tl_has_changed` use a
state file to store the previous value of the config hash and the
timeline.

Previously the check would fail if something changed, but the new value
would be saved directly. This behaviour has changed. The new value
is saved only if `--save` is passed to the check.

This mimics the way [check_pgactivity] manages this kind of check.

[check_pgactivity]: https://github.com/OPMDG/check_pgactivity
2023-03-02 17:32:18 +01:00
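
The `--save` behaviour from commit 908669f073 can be sketched as follows. The function name `has_changed`, the JSON layout of the state file and the handling of a missing state file are all assumptions made for this illustration.

```python
import json
from pathlib import Path


def has_changed(state_file: Path, current: str, save: bool) -> bool:
    """Compare the current value with the stored one.

    The stored value is only overwritten when save is True, so the
    check keeps reporting a change until the operator acknowledges it.
    """
    previous = None
    if state_file.exists():
        previous = json.loads(state_file.read_text()).get("value")
    # With no stored value there is nothing to compare against
    # (assumption: this is reported as "unchanged").
    changed = previous is not None and previous != current
    if save:
        state_file.write_text(json.dumps({"value": current}))
    return changed
```

Without `save`, repeated runs report the same change each time, which keeps the alert visible until the new value is accepted.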
benoit e7e7ac2e3a Release 0.1.1 2022-07-15 11:16:19 +02:00
benoit 1119abcbdd Release 0.1.0 2022-07-13 16:53:16 +02:00
benoit 64d87d505e Black run 2022-07-11 15:16:19 +02:00
benoit aa1de928d3 An attempt at correctly ordering imports 2022-02-07 15:11:05 +01:00
benoit 9ed9b6466d Read state file in cli to pass the info to the checks and summaries 2022-02-07 15:01:50 +01:00
benoit 4de20fefdc Node and Cluster services reviews 2022-02-07 14:18:14 +01:00
benoit 7898011c40 Update the README and help 2022-02-07 11:03:12 +01:00
benoit 561c3ed9da Fix doc layout and threshold doc 2021-12-31 11:30:17 +01:00
benoit 86f8bdb395 Readme: perfdata for node_tl_has_changed 2021-12-08 17:23:56 +01:00
benoit df58901fd9 Typo Readme 2021-12-08 17:14:38 +01:00
benoit 6c696a03ee Mypy fix
Stop using ctx.parent.params to get the verbose and timeout parameters
parsed in main and use ctx.obj instead.

ctx.parent is typed as Optional[Context], which forces us to test
if it's None before using it. It's useless in our case because we know
it's not empty and the resulting code is ugly.

The mypy error:

Item "None" of "Optional[Context]" has no attribute "params"
[union-attr]
2021-09-11 00:36:57 +02:00
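
The typing issue behind commit 6c696a03ee can be reproduced without click. In the sketch below, `Context` is a hypothetical stand-in reduced to three attributes, and the two helper functions are illustrative names only.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Context:
    """Hypothetical stand-in for click's Context."""

    params: Dict[str, Any] = field(default_factory=dict)
    parent: Optional["Context"] = None
    obj: Any = None


def timeout_via_parent(ctx: Context) -> int:
    # Without this guard mypy reports a union-attr error, because
    # ctx.parent may be None.
    if ctx.parent is None:
        raise RuntimeError("no parent context")
    return ctx.parent.params["timeout"]


def timeout_via_obj(ctx: Context) -> int:
    # ctx.obj is typed Any, so no guard is needed.
    return ctx.obj["timeout"]
```

The second form trades static checking of the shared object for shorter code at each call site.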
benoit ca95250f2e Display default values in help 2021-09-10 09:14:19 +02:00
benoit e18ce97d66 Add default value for verbose and fix the type in main parameters 2021-09-09 17:39:12 +02:00