node_is_replica uses the following Patroni endpoints: replica, asynchronous
and synchronous. The first two implement the lag tag. For these endpoints,
the state of a replica node doesn't reflect the replication state
(streaming or in archive recovery); we only know whether it is running. The
timeline is also not checked.
Therefore, if a cluster is using asynchronous replication, it is recommended
to check for the lag to detect a divergence as soon as possible.
The service now supports the `streaming` state.
Since we don't check for lag or timeline in this service, a healthy node
is:
* leader : in a running state
* standby_leader : running (pre Patroni 3.0.4), streaming otherwise
* standby & sync_standby : running (pre Patroni 3.0.4), streaming otherwise
Updated the tests for this service.
Before this patch we checked the expected standby leader state
was `running` for all versions of Patroni.
With this patch, for:
* Patroni < 3.0.4, standby leaders are in `running` state.
* Patroni >= 3.0.4, standby leaders can be in `streaming` or `in
archive recovery` state. We will raise a warning for the latter.
The tests were modified to account for this.
Co-authored-by: Denis Laxalde <denis@laxalde.org>
For patroni >= version 3.0.4:
* the role is `replica` or `sync_standby`
* the state is `streaming` or `in archive recovery`
* the timeline is the same as the leader
* the lag is lower or equal to `max_lag`
For prior versions of patroni:
* the role is `replica` or `sync_standby`
* the state is `running`
* the timeline is the same as the leader
* the lag is lower or equal to `max_lag`
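The two rule sets above can be sketched as a single predicate. This is only an illustration, not the plugin's actual code: the function names, the member dictionary layout (`role`, `state`, `timeline`, `lag` keys) and the convention that `max_lag=None` disables the lag check are all assumptions.

```python
from typing import Any, Dict, Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version such as "3.0.4" into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def is_healthy_replica(
    member: Dict[str, Any],
    leader_tl: int,
    max_lag: Optional[int],
    patroni_version: str,
) -> bool:
    """Hypothetical helper applying the health rules listed above."""
    if member.get("role") not in ("replica", "sync_standby"):
        return False
    # The expected state depends on the Patroni version.
    if parse_version(patroni_version) >= (3, 0, 4):
        expected_states = ("streaming", "in archive recovery")
    else:
        expected_states = ("running",)
    if member.get("state") not in expected_states:
        return False
    # A replica on another timeline has diverged from the leader.
    if member.get("timeline") != leader_tl:
        return False
    # Assumption: max_lag=None means the lag is not checked at all.
    if max_lag is not None:
        lag = member.get("lag")
        if not isinstance(lag, int) or lag > max_lag:
            return False
    return True
```

With these assumptions, a `streaming` replica on the leader's timeline is healthy on 3.0.4, while the same replica reported as `running` is healthy only on older versions.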
Additionally, we now display the timeline in the perfstats. We also try
to display the perf stats of unhealthy replicas as much as possible.
Update tests for cluster_has_replica:
* Fix the tests to make them work with the new algorithm
* Add a specific test for tl divergences
This exception is only present in "recent" versions of requests,
typically not in the version distributed by Debian bullseye. Since
requests' JSONDecodeError is in general a subclass of
json.JSONDecodeError, we use the latter, but also handle the plain
ValueError (which json.JSONDecodeError is a subclass of) because
requests might use simplejson (which uses its own JSONDecodeError, also
a subclass of ValueError).
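A minimal sketch of the resulting error handling, under stated assumptions: `FakeResponse` is a hypothetical stand-in for a requests response (so the example runs without requests installed), and `decode_body` is an illustrative name, not a function from the code base.

```python
import json
from typing import Any, Optional


class FakeResponse:
    """Hypothetical stand-in exposing the same .json() method as a response."""

    def __init__(self, text: str) -> None:
        self.text = text

    def json(self) -> Any:
        return json.loads(self.text)


def decode_body(response: FakeResponse) -> Optional[Any]:
    """Return the decoded JSON body, or None if it cannot be decoded."""
    try:
        return response.json()
    except (json.JSONDecodeError, ValueError):
        # json.JSONDecodeError covers the stdlib decoder; plain ValueError
        # also covers decoders that raise their own ValueError subclass.
        return None
```

Listing both exceptions is redundant for the standard library decoder, but it keeps the intent explicit for readers.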
* Add `--sync-warning` and `--sync-critical`
* Add `sync_replica` to track the number of sync replicas in the perf data
* Add `MEMBER-sync` to track if a member is a sync replica in the perf data
For `node_is_alive`, it seemed to be a good idea to exit with a
`CRITICAL` when the target doesn't exist. But for all the rest, UNKNOWN
(which corresponds to a configuration error) seems better.
Previously, if a node wasn't reachable, we would get an UNKNOWN error
instead of a CRITICAL error.
```
NODEISALIVE UNKNOWN - Connection failed for all provided endpoints
```
We now get the correct error.
```
NODEISALIVE CRITICAL - This node is not alive (patroni is not running). | is_alive=0;;@0
```
* Change all replica status from `running` to `streaming`
* Add an option to pytest to change the state back to `running`
* Also test the output of the script
* Add a quick test script for live clusters
Previously, replica nodes were labeled with a `running` state. As a
result, our checks were based on nodes marked as `running` through
the `--running-[warning|critical]` options.
However, with the recent changes in Patroni 3.0.4, replica nodes now
carry a `streaming` state. This shift in terminology calls for an
adjustment in our approach. A new state, `healthy_member`, has been
introduced to encompass both `running` and `streaming` nodes.
Key Modifications:
* The existing `--running-[warning|critical]` option is now designated
as `--healthy-[warning|critical]`.
* Introduction of the `healthy_member` perfdata, which serves as the
reference point for the aforementioned options.
* Updates to documentation, help messages, and tests.
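As an illustration of what the `healthy_member` perfdata counts, here is a minimal sketch. The member dictionary layout and the function name are assumptions, not the actual implementation.

```python
from typing import Any, Dict, Iterable

# States counted as healthy, per the description above.
HEALTHY_STATES = frozenset({"running", "streaming"})


def count_healthy_members(members: Iterable[Dict[str, Any]]) -> int:
    """Hypothetical helper: number of members in a healthy state."""
    return sum(1 for member in members if member.get("state") in HEALTHY_STATES)
```

The resulting count would then be compared against the `--healthy-[warning|critical]` thresholds.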
Since patroni 3.0.4, standby node nominal state is "streaming" instead
of "running". Some services need to be changed to account for that.
Reported in issue #28
Since the desired state is for there to be no restart pending state, it
makes more sense to modify the service logic so that the return code
reflects this. As a result, the test for the service `node_is_pending`
has been reversed.
* it is now possible to specify a comma separated list of endpoints
* the documentation has been updated to explain that:
+ for node services if several addresses are specified they should
point to different interfaces on the same server.
+ for cluster services several addresses should be used because we
want the cluster status so the more API we try the better our chance
of having a reply.
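A rough sketch of how such a list could be handled, assuming the endpoints are tried in order until one answers. `parse_endpoints`, `first_reply` and the injected `fetch` callable are hypothetical names used for illustration only.

```python
from typing import Any, Callable, List


def parse_endpoints(value: str) -> List[str]:
    """Split a comma separated option value, ignoring blanks."""
    return [item.strip() for item in value.split(",") if item.strip()]


def first_reply(endpoints: List[str], fetch: Callable[[str], Any]) -> Any:
    """Return the first successful reply, trying each endpoint in order."""
    for endpoint in endpoints:
        try:
            return fetch(endpoint)
        except ConnectionError:
            # This API did not answer, try the next one.
            continue
    raise ConnectionError("Connection failed for all provided endpoints")
```

Passing `fetch` as a parameter keeps the sketch independent of any HTTP library.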
The checks `cluster_config_has_changed` and `node_tl_has_changed` use a
state file to store the previous value of the config hash and the
timeline.
Previously the check would fail if something changed, but the new value
would be saved directly. This behavior has changed. The new value
is saved only if `--save` is passed to the check.
This mimics the way [check_pgactivity] manages this kind of check.
[check_pgactivity]: https://github.com/OPMDG/check_pgactivity
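The save-on-request behavior can be sketched as follows. This is an assumption-laden illustration: the JSON state file layout, the function name and the choice to report "no change" when no previous value exists are not taken from the actual code.

```python
import json
from pathlib import Path
from typing import Any, Dict


def value_has_changed(state_file: Path, key: str, new_value: Any, save: bool) -> bool:
    """Compare new_value with the stored one; persist it only when save is set."""
    state: Dict[str, Any] = {}
    if state_file.exists():
        state = json.loads(state_file.read_text())
    old_value = state.get(key)
    # Assumption: with no stored value there is nothing to compare against.
    changed = old_value is not None and old_value != new_value
    if save:
        state[key] = new_value
        state_file.write_text(json.dumps(state))
    return changed
```

Without `save`, a change keeps being reported on every run until the operator acknowledges it by running the check once with the save option.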
Stop using ctx.parent.params to get the verbose and timeout parameters
parsed in main and use ctx.obj instead.
ctx.parent is typed as Optional[Context], which forces us to test
whether it is None before accessing its params. That test is useless in
our case because we know the parent is set, and the resulting code is ugly.
The mypy error:
Item "None" of "Optional[Context]" has no attribute "params"
[union-attr]
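The difference between the two access patterns can be shown with a small stand-in. The `Context` class below only imitates the relevant attributes (`params`, `parent`, `obj`) so the example runs without any dependency; `ConnectionInfo` and both helper functions are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ConnectionInfo:
    """Hypothetical container for the options parsed in main."""

    timeout: int
    verbose: int


@dataclass
class Context:
    """Stand-in imitating the attributes discussed above."""

    params: Dict[str, Any] = field(default_factory=dict)
    parent: Optional["Context"] = None
    obj: Any = None


def timeout_via_parent(ctx: Context) -> int:
    # The None check is required to satisfy the type checker.
    if ctx.parent is None:
        raise RuntimeError("no parent context")
    return int(ctx.parent.params["timeout"])


def timeout_via_obj(ctx: Context) -> int:
    # No narrowing needed: obj is set once in main and read by subcommands.
    info: ConnectionInfo = ctx.obj
    return info.timeout
```

In this stand-in, the subcommand context receives the same `obj` as its parent, which mirrors how the shared object is meant to be read from any subcommand.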