check-patroni

Author	SHA1	Message	Date
benoit	807f9b2071	Release V2.0.0	2024-04-09 16:45:11 +02:00
benoit	a4ed20210c	Improve doc for node_is_replica node_is_replica is using the following Patroni endpoints: replica, asynchronous and synchronous. The first two implement the lag tag. For these endpoints the state of a replica node doesn't reflect the replication state (streaming or in archive recovery); we only know if it's running. The timeline is also not checked. Therefore, if a cluster is using asynchronous replication, it is recommended to check for the lag to detect a divergence as soon as possible.	2024-02-26 16:02:53 +01:00
benoit	78ef0f6ada	Fix cluster_node_count's management of replication states The service now supports the `streaming` state. Since we don't check for lag or timeline in this service, a healthy node is: * leader : in a running state * standby_leader : running (pre Patroni 3.0.4), streaming otherwise * standby & sync_standby : running (pre Patroni 3.0.4), streaming otherwise Updated the tests for this service.	2024-01-09 06:50:00 +01:00
benoit	46db3e2d15	Fix the cluster_has_leader service for standby clusters Before this patch we checked the expected standby leader state was `running` for all versions of Patroni. With this patch, for: * Patroni < 3.0.4, standby leaders are in `running` state. * Patroni >= 3.0.4, standby leaders can be in `streaming` or `in archive recovery` state. We will raise a warning for the latter. The tests were modified to account for this. Co-authored-by: Denis Laxalde <denis@laxalde.org>	2023-12-18 13:17:37 +01:00
benoit	8d6b8502b6	cluster_has_replica: fix the way a healthy replica is detected For patroni >= version 3.0.4: * the role is `replica` or `sync_standby` * the state is `streaming` or `in archive recovery` * the timeline is the same as the leader * the lag is lower than or equal to `max_lag` For prior versions of patroni: * the role is `replica` or `sync_standby` * the state is `running` * the timeline is the same as the leader * the lag is lower than or equal to `max_lag` Additionally, we now display the timeline in the perfstats. We also try to display the perf stats of unhealthy replicas as much as possible. Update tests for cluster_has_replica: * Fix the tests to make them work with the new algorithm * Add a specific test for tl divergences	2023-11-11 10:50:35 +01:00
Denis Laxalde	6ee8db1df2	Avoid using requests's JSONDecodeError This exception is only present in "recent" versions of requests, typically not in the version distributed by Debian bullseye. Since requests' JSONDecodeError is in general a subclass of json.JSONDecodeError, we use the latter, but also handle the plain ValueError (which json.JSONDecodeError is a subclass of) because requests might use simplejson (which uses its own JSONDecodeError, also a subclass of ValueError).	2023-10-13 11:45:39 +02:00
Denis Laxalde	a0189ebba7	Fix some typos spotted by codespell 2.2.6	2023-10-03 09:53:53 +02:00
Denis Laxalde	95f21a133d	Drop superfluous type annotation of 'self' See https://mypy.readthedocs.io/en/stable/cheat_sheet_py3.html#classes > For instance methods, omit type for "self"	2023-10-03 09:39:40 +02:00
benoit	68b230ccb2	Release 1.0.0	2023-08-28 12:09:16 +02:00
benoit	ee3837fab1	Add info and options about sync standby to cluster_has_replica * Add `--sync-warning` and `--sync-critical` * Add `sync_replica` to track the number of sync replicas in the perf data * Add `MEMBER-sync` to track if a member is a sync replica in the perf data	2023-08-24 17:34:08 +02:00
benoit	b69c2501fd	Revert commit `20d9d48` For `node_is_alive`, it seemed to be a good idea to exit with a `CRITICAL` when the target doesn't exist. But for all the rest, UNKNOWN (which corresponds to a configuration error) seems better.	2023-08-23 18:35:06 +02:00
benoit	259f04587b	Add a node_is_leader service to check for the leader states It's possible to check for any kind of leader or specifically for a standby leader.	2023-08-23 18:22:49 +02:00
benoit	8883d6bdc4	Add standby-leader as a valid leader type for cluster_has_leader	2023-08-23 18:22:49 +02:00
benoit	20d9d48d78	Fix the liveness check in node_is_alive Previously if a node wasn't reachable we would get an UNKNOWN error instead of a CRITICAL error. ``` NODEISALIVE UNKNOWN - Connection failed for all provided endpoints ``` We now get the correct error. ``` NODEISALIVE CRITICAL - This node is not alive (patroni is not running). | is_alive=0;;@0 ```	2023-08-23 16:05:02 +02:00
benoit	2bcddf9f87	Add a --is-sync and --is-async to node_is_replica	2023-08-23 15:41:15 +02:00
benoit	99bf1c5bb5	Add new service cluster_has_scheduled_action	2023-08-23 12:08:19 +02:00
benoit	d99faeba15	Add tests for the output of the script and support pre/post 3.0.4 * Change all replica status from `running` to `streaming` * Add an option to pytest to change the state back to `running` * Also tests the output of the script * Add a quick test script for live clusters	2023-08-23 10:53:09 +02:00
benoit	bb68d937e6	Fix a mistake in the vvv output of ClusterConfigHasChanged	2023-08-22 11:15:44 +02:00
benoit	d07a26d324	Fix tests in a more pythonic way as proposed by @dlax	2023-08-21 15:13:22 +02:00
benoit	77722f40c1	Fix liveness check The liveness probe used to return something. It looks like it doesn't do it anymore and it breaks the `node_is_alive` check. Issue: #31	2023-08-21 15:13:22 +02:00
benoit	a01a535680	Add sync_standby as an acceptable state for a replica	2023-08-21 13:11:08 +02:00
benoit	021b572e53	Redefining `cluster_node_count` using Patroni 3.0.4's new status indicators Previously, replica nodes were labeled with a `running` state. As a result, our checks were based on nodes marked as `running` through the `--running-[warning|critical]` options. However, with the recent changes in Patroni 3.0.4, replica nodes now carry a `streaming` state. This shift in terminology calls for an adjustment in our approach. A new state, `healthy_member`, has been introduced to encompass both `running` and `streaming` nodes. Key Modifications: * The existing `--running-[warning|critical]` option is now designated as `--healthy-[warning|critical]`. * Introduction of the `healthy_member` perfdata, which serves as the reference point for the aforementioned options. * Updates to documentation, help messages, and tests.	2023-08-21 11:59:55 +02:00
benoit	2f250e846e	State change in patroni 3.0.4 Since patroni 3.0.4, standby node nominal state is "streaming" instead of "running". Some services need to be changed to account for that. Reported in issue #28	2023-08-21 11:59:55 +02:00
benoit	e9f197b9d9	Release 0.2.0	2023-03-20 15:11:03 +01:00
benoit	df744bf7dc	Use isort to automatically sort imports	2023-03-20 14:56:11 +01:00
benoit	dff95eae2f	Fixing logging issues in the previous modifications	2023-03-20 12:25:32 +01:00
benoit	7eea4c94be	Reorganise for the urllib3 > requests change	2023-03-20 12:25:32 +01:00
benoit	2443505ad6	Change logging We now use our own logger. When debug is set (-vvv), we also display urllib3's debug info.	2023-03-20 12:25:32 +01:00
benoit	7815f3379c	Reverse the test for node_is_pending Since the desired state is for there to be no restart pending state, it makes more sense to modify the service logic so that the return code reflects this. As a result, the test for the service `node_is_pending` has been reversed.	2023-03-20 12:25:32 +01:00
benoit	9cd80f5af8	Move from urllib3 to requests	2023-03-20 12:25:32 +01:00
benoit	0800fc72e9	Add more info on states and roles for cluster_node_count	2023-03-10 10:16:01 +01:00
benoit	6f45618e08	Update node_is_alive to mention that it's a liveness check.	2023-03-10 10:16:01 +01:00
benoit	7286638121	-e/endpoints spec update * it is now possible to specify a comma-separated list of endpoints * the documentation has been updated to explain that: + for node services if several addresses are specified they should point to different interfaces on the same server. + for cluster services several addresses should be used because we want the cluster status so the more API we try the better our chance of having a reply.	2023-03-10 10:16:01 +01:00
benoit	48d2656ec7	Update check_patroni's description to mention Patroni's API	2023-03-10 10:16:01 +01:00
benoit	93b196fb77	Fix a format string for cluster_config_has_changed	2023-03-02 17:32:18 +01:00
benoit	275901006b	Add spellcheck + tox in requirements-dev.txt	2023-03-02 17:32:18 +01:00
benoit	908669f073	Add a `--save` option when state files are used The checks `cluster_config_has_changed` and `node_tl_has_changed` use a state file to store the previous value of the config hash and the timeline. Previously the check would fail if something changed, but the new value would be saved directly. This behaviour has changed. The new value is saved only if `--save` is passed to the check. This mimics the way [check_pgactivity] manages this kind of check. [check_pgactivity]: https://github.com/OPMDG/check_pgactivity	2023-03-02 17:32:18 +01:00
benoit	e7e7ac2e3a	Release 0.1.1	2022-07-15 11:16:19 +02:00
benoit	1119abcbdd	Release 0.1.0	2022-07-13 16:53:16 +02:00
benoit	64d87d505e	Black run	2022-07-11 15:16:19 +02:00
benoit	aa1de928d3	An attempt at correctly ordering imports	2022-02-07 15:11:05 +01:00
benoit	9ed9b6466d	Read state file in cli to pass the info to the checks and summaries	2022-02-07 15:01:50 +01:00
benoit	4de20fefdc	Node and Cluster services reviews	2022-02-07 14:18:14 +01:00
benoit	7898011c40	Update the README and help	2022-02-07 11:03:12 +01:00
benoit	561c3ed9da	Fix doc layout and threshold doc	2021-12-31 11:30:17 +01:00
benoit	86f8bdb395	Readme: perfdata for node_tl_has_changed	2021-12-08 17:23:56 +01:00
benoit	df58901fd9	Typo Readme	2021-12-08 17:14:38 +01:00
benoit	6c696a03ee	Mypy fix Stop using ctx.parent.params to get the verbose and timeout parameters parsed in main and use ctx.obj instead. ctx.parent is typed as Optional[Context] which forces us to test if it's None before using it. It's useless in our case because we know it's not empty and the resulting code is ugly. The mypy error: Item "None" of "Optional[Context]" has no attribute "params" [union-attr]	2021-09-11 00:36:57 +02:00
benoit	ca95250f2e	Display default values in help	2021-09-10 09:14:19 +02:00
benoit	e18ce97d66	Add default value for verbose and fix the type in main parameters	2021-09-09 17:39:12 +02:00
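
The replica health rules listed in commit 8d6b8502b6 can be sketched in Python roughly as follows. This is an illustrative sketch only, not the project's actual code: the function name `is_healthy_replica`, the dictionary keys (`role`, `state`, `timeline`, `lag`) and the `patroni_pre_3_0_4` flag are assumptions made for the example.

```python
from typing import Any, Dict


def is_healthy_replica(
    member: Dict[str, Any],
    leader_timeline: int,
    max_lag: int,
    patroni_pre_3_0_4: bool = False,
) -> bool:
    """Hypothetical helper mirroring the rules in the commit message."""
    # Before Patroni 3.0.4 a replica only reports `running`;
    # from 3.0.4 onward it reports its replication state.
    if patroni_pre_3_0_4:
        healthy_states = {"running"}
    else:
        healthy_states = {"streaming", "in archive recovery"}

    lag = member.get("lag")
    return (
        member.get("role") in {"replica", "sync_standby"}
        and member.get("state") in healthy_states
        # The replica must be on the same timeline as the leader.
        and member.get("timeline") == leader_timeline
        # A missing or non-integer lag is treated as unhealthy here.
        and isinstance(lag, int)
        and lag <= max_lag
    )
```

With these assumptions, a `streaming` replica on the leader's timeline with a lag within `max_lag` is healthy, while the same member in state `running` is only healthy when `patroni_pre_3_0_4=True`.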

57 commits
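
The exception handling described in commit 6ee8db1df2 relies on `json.JSONDecodeError` being a subclass of `ValueError`. A minimal sketch of that idea using only the standard library; the helper name `parse_json_body` is hypothetical, and in the real tool the text would come from an HTTP response rather than a plain string.

```python
import json
from typing import Any, Optional


def parse_json_body(text: str) -> Optional[Any]:
    """Hypothetical helper: return the decoded JSON, or None if invalid."""
    try:
        return json.loads(text)
    except (json.JSONDecodeError, ValueError):
        # json.JSONDecodeError is a subclass of ValueError, so catching
        # ValueError also covers decoders that raise their own
        # ValueError subclass (the commit message cites simplejson).
        return None
```

Listing both exceptions is redundant for the standard library alone, but it documents the intent stated in the commit message.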