Previously if a node wasn't reachable whe would get an UNKNOWN error.
instead of a CRITICAL error.
```
NODEISALIVE UNKNOWN - Connection failed for all provided endpoints
```
We now get the correct error.
```
NODEISALIVE CRITICAL - This node is not alive (patroni is not running). | is_alive=0;;@0
```
* Change all replica status from `running` to `streaming`
* Add an option to pytest to change the state back to `running`
* Also tests the output of the script
* Add a quick test script for live clusters
Previously, replica nodes were labeled with a `running` state. As a
result, our checks were based on nodes marked as `running` through
the `--running-[warning|critical]` options.
However, with the recent changes in Patroni 3.0.4, replica nodes now
carry a `streaming` state. This shift in terminology calls for an
adjustment in our approach. A new state, `healthy_member`, has been
introduced to encompass both `running` and `streaming` nodes.
Key Modifications:
* The existing `--running-[warning|critical]` option is now designated
as `--healthy-[warning|critical]`.
* Introduction of the `healthy_member` perfdata, which serves as the
reference point for the aforementioned options.
* Updates to documentation, help messages, and tests.
Since patroni 3.0.4, standby node nominal state is "streaming" instead
of "running". Some services need to be changed to account for that.
Reported in issue #28
Since the desired state is for there to be no restart pending state, it
makes more sense to modify the service logic so that the return code
reflects this. As a result, the test for the service `node_is_pending`
has been reversed.
* it is now possible to specify a comma separated list of endpoints
* the documentation as been updated to explain that:
+ for node services if several addresses are specified they should
point to different interfaces on the same server.
+ for cluster services several addresses should be used because we
want the cluster status so the more API we try the better our chance
of having a reply.
The checks `cluster_config_has_changed` and `node_tl_has_changed` use a
state file to store the previous value of the config hash and the
timeline.
Previously the check would fail if something changed, but the new value
would be saved directly. This behavious has changed. The new value
is saved only if `--save` is passed to the check.
The mimics the way [check_pgactivity] manages this kind of checks.
[check_pgactivity]: https://github.com/OPMDG/check_pgactivity