cluster_has_replica: fix the way a healthy replica is detected
For patroni >= version 3.0.4:
* the role is `replica` or `sync_standby`
* the state is `streaming` or `in archive recovery`
* the timeline is the same as the leader's
* the lag is lower or equal to `max_lag`

For prior versions of patroni:
* the role is `replica` or `sync_standby`
* the state is `running`
* the timeline is the same as the leader's
* the lag is lower or equal to `max_lag`

Additionally, we now display the timeline in the perfstats, and we try to
display the perf stats of unhealthy replicas as much as possible.

Update tests for cluster_has_replica:
* Fix the tests to make them work with the new algorithm
* Add a specific test for timeline divergences
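A minimal sketch of the resulting decision, condensed from the diff below (names follow the committed code; this is a reading aid, not a drop-in implementation):

```python
from typing import Any, Optional

def is_healthy(member: Any, leader_tl: int, detailed_states: bool, max_lag: Optional[int]) -> bool:
    """Sketch of the new health test for one member of the /cluster payload.

    detailed_states mirrors has_detailed_states(): True when patroni >= 3.0.4.
    """
    if member["role"] not in ("replica", "sync_standby"):
        return False
    ok_states = ("streaming", "in archive recovery") if detailed_states else ("running",)
    if member["state"] not in ok_states:
        return False
    if int(member["timeline"]) != leader_tl:
        return False
    # lag can be the string "unknown" when the node is stopped
    return member["lag"] != "unknown" and (max_lag is None or int(member["lag"]) <= max_lag)
```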
parent 6ee8db1df2
commit 8d6b8502b6
@@ -4,10 +4,14 @@
 ### Added
 
+* Add the timeline in the `cluster_has_replica` perfstats. (#50)
+
 ### Fixed
 
 * Add compatibility with [requests](https://requests.readthedocs.io)
   version 2.25 and higher.
+* Fix what `cluster_has_replica` deems a healthy replica. (#50, reported by @mbanck)
+* Fix `cluster_has_replica` to display perfstats for replicas whenever it's possible (healthy or not). (#50)
 
 ### Misc
README.md (28 changed lines)
@@ -190,10 +190,27 @@ Usage: check_patroni cluster_has_replica [OPTIONS]
 
 Check if the cluster has healthy replicas and/or if some are sync standbys
 
+For patroni (and this check):
+* a replica is `streaming` if `pg_stat_wal_receiver` says so.
+* a replica is `in archive recovery` if it's not `streaming` and has a `restore_command`.
+
 A healthy replica:
-* is in running or streaming state (V3.0.4)
-* has a replica or sync_standby role
-* has a lag lower or equal to max_lag
+* has a `replica` or `sync_standby` role
+* has the same timeline as the leader
+* is in `running` state (patroni < V3.0.4)
+* is in `streaming` or `in archive recovery` state (patroni >= V3.0.4)
+* has a lag lower or equal to `max_lag`
+
+Please note that a replica `in archive recovery` could be stuck because the
+WAL are not available or applicable (the server's timeline has diverged from
+the leader's). We already detect the latter but we will miss the former.
+Therefore, it's preferable to check for the lag in addition to the healthy
+state if you rely on log shipping to help lagging standbys catch up.
+
+Since we require a healthy replica to have the same timeline as the leader,
+it's possible that we raise alerts when the cluster is performing a
+switchover or failover and the standbys are in the process of catching up
+with the new leader. The alert shouldn't last long.
 
 Check:
 * `OK`: if the healthy_replica count and their lag are compatible with the replica count threshold.
@@ -203,8 +220,9 @@ Usage: check_patroni cluster_has_replica [OPTIONS]
 Perfdata:
 * healthy_replica & unhealthy_replica count
 * the number of sync_replica, they are included in the previous count
 * the lag of each replica labelled with "member name"_lag
+* the timeline of each replica labelled with "member name"_timeline
 * a boolean to tell if the node is a sync standby labelled with "member name"_sync
 
 Options:
   -w, --warning TEXT  Warning threshold for the number of healthy replica
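For example, a typical invocation with thresholds mirroring the tests below (the endpoint is illustrative):

```
check_patroni -e https://10.20.199.3:8008 cluster_has_replica --warning @1 --critical @0 --max-lag 1MB
```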
@@ -341,11 +341,29 @@ def cluster_has_replica(
 ) -> None:
     """Check if the cluster has healthy replicas and/or if some are sync standbys
 
+    \b
+    For patroni (and this check):
+    * a replica is `streaming` if `pg_stat_wal_receiver` says so.
+    * a replica is `in archive recovery` if it's not `streaming` and has a `restore_command`.
+
     \b
     A healthy replica:
-    * is in running or streaming state (V3.0.4)
-    * has a replica or sync_standby role
-    * has a lag lower or equal to max_lag
+    * has a `replica` or `sync_standby` role
+    * has the same timeline as the leader
+    * is in `running` state (patroni < V3.0.4)
+    * is in `streaming` or `in archive recovery` state (patroni >= V3.0.4)
+    * has a lag lower or equal to `max_lag`
+
+    Please note that a replica `in archive recovery` could be stuck because the WAL
+    are not available or applicable (the server's timeline has diverged from the
+    leader's). We already detect the latter but we will miss the former.
+    Therefore, it's preferable to check for the lag in addition to the healthy
+    state if you rely on log shipping to help lagging standbys catch up.
+
+    Since we require a healthy replica to have the same timeline as the
+    leader, it's possible that we raise alerts when the cluster is performing a
+    switchover or failover and the standbys are in the process of catching up with
+    the new leader. The alert shouldn't last long.
 
     \b
     Check:
@@ -357,8 +375,9 @@ def cluster_has_replica(
     Perfdata:
     * healthy_replica & unhealthy_replica count
     * the number of sync_replica, they are included in the previous count
     * the lag of each replica labelled with "member name"_lag
+    * the timeline of each replica labelled with "member name"_timeline
     * a boolean to tell if the node is a sync standby labelled with "member name"_sync
     """
 
     tmax_lag = size_to_byte(max_lag) if max_lag is not None else None
@@ -377,6 +396,7 @@ def cluster_has_replica(
         ),
         nagiosplugin.ScalarContext("unhealthy_replica"),
         nagiosplugin.ScalarContext("replica_lag"),
+        nagiosplugin.ScalarContext("replica_timeline"),
         nagiosplugin.ScalarContext("replica_sync"),
     )
     check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout)
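For reference, with the new `replica_timeline` context the plugin output gains a `_timeline` perfdata item per replica, e.g. (taken verbatim from the tests further down):

```
CLUSTERHASREPLICA OK - healthy_replica is 2 | healthy_replica=2 srv2_lag=0 srv2_sync=0 srv2_timeline=51 srv3_lag=0 srv3_sync=1 srv3_timeline=51 sync_replica=1 unhealthy_replica=0
```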
@@ -1,7 +1,7 @@
 import hashlib
 import json
 from collections import Counter
-from typing import Iterable, Union
+from typing import Any, Iterable, Union
 
 import nagiosplugin
 
@@ -83,35 +83,91 @@ class ClusterHasReplica(PatroniResource):
         self.max_lag = max_lag
 
     def probe(self) -> Iterable[nagiosplugin.Metric]:
-        item_dict = self.rest_api("cluster")
+        def debug_member(member: Any, health: str) -> None:
+            _log.debug(
+                "Node %(node_name)s is %(health)s: lag %(lag)s, state %(state)s, tl %(tl)s.",
+                {
+                    "node_name": member["name"],
+                    "health": health,
+                    "lag": member["lag"],
+                    "state": member["state"],
+                    "tl": member["timeline"],
+                },
+            )
+
+        # get the cluster info
+        cluster_item_dict = self.rest_api("cluster")
 
         replicas = []
         healthy_replica = 0
         unhealthy_replica = 0
         sync_replica = 0
-        for member in item_dict["members"]:
-            # FIXME are there other acceptable states
+        leader_tl = None
+
+        # Look for replicas
+        for member in cluster_item_dict["members"]:
             if member["role"] in ["replica", "sync_standby"]:
-                # patroni 3.0.4 changed the standby state from running to streaming
-                if (
-                    member["state"] in ["running", "streaming"]
-                    and member["lag"] != "unknown"
-                ):
+                if member["lag"] == "unknown":
+                    # This could happen if the node is stopped.
+                    # nagiosplugin doesn't handle strings in perfstats,
+                    # so we have to ditch all the stats in that case.
+                    debug_member(member, "unhealthy")
+                    unhealthy_replica += 1
+                    continue
+                else:
                     replicas.append(
                         {
                             "name": member["name"],
                             "lag": member["lag"],
+                            "timeline": member["timeline"],
                             "sync": 1 if member["role"] == "sync_standby" else 0,
                         }
                     )
 
-                    if member["role"] == "sync_standby":
-                        sync_replica += 1
+                # Get the leader tl if we haven't already
+                if leader_tl is None:
+                    # If there is no leader, we will loop here for all
+                    # members because leader_tl will remain None. It's not
+                    # a big deal since having no leader is rare.
+                    for tmember in cluster_item_dict["members"]:
+                        if tmember["role"] == "leader":
+                            leader_tl = int(tmember["timeline"])
+                            break
 
-                    if self.max_lag is None or self.max_lag >= int(member["lag"]):
-                        healthy_replica += 1
-                        continue
-                unhealthy_replica += 1
+                    _log.debug(
+                        "Patroni's leader_timeline is %(leader_tl)s",
+                        {
+                            "leader_tl": leader_tl,
+                        },
+                    )
+
+                # Test for an unhealthy replica
+                if (
+                    self.has_detailed_states()
+                    and not (
+                        member["state"] in ["streaming", "in archive recovery"]
+                        and int(member["timeline"]) == leader_tl
+                    )
+                ) or (
+                    not self.has_detailed_states()
+                    and not (
+                        member["state"] == "running"
+                        and int(member["timeline"]) == leader_tl
+                    )
+                ):
+                    debug_member(member, "unhealthy")
+                    unhealthy_replica += 1
+                    continue
+
+                if member["role"] == "sync_standby":
+                    sync_replica += 1
+
+                if self.max_lag is None or self.max_lag >= int(member["lag"]):
+                    debug_member(member, "healthy")
+                    healthy_replica += 1
+                else:
+                    debug_member(member, "unhealthy")
+                    unhealthy_replica += 1
 
         # The actual check
         yield nagiosplugin.Metric("healthy_replica", healthy_replica)
@@ -123,6 +179,11 @@ class ClusterHasReplica(PatroniResource):
         yield nagiosplugin.Metric(
             f"{replica['name']}_lag", replica["lag"], context="replica_lag"
         )
+        yield nagiosplugin.Metric(
+            f"{replica['name']}_timeline",
+            replica["timeline"],
+            context="replica_timeline",
+        )
         yield nagiosplugin.Metric(
             f"{replica['name']}_sync", replica["sync"], context="replica_sync"
         )
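To make the new rules concrete, here is how `probe()` classifies the members of the `cluster_has_replica_ko_wrong_tl.json` fixture added further down (a worked example, assuming `--max-lag 1MB`):

```
srv1: role "leader", timeline 51            -> leader_tl = 51
srv2: state "running", timeline 50          -> unhealthy (wrong state under patroni >= 3.0.4, and wrong timeline either way)
srv3: state "streaming", timeline 51, lag 0 -> state and timeline OK, lag <= max_lag -> healthy
```

This yields `healthy_replica=1` and `unhealthy_replica=1`, matching the WARNING expected by `test_cluster_has_replica_ko_wrong_tl` below.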
@@ -1,4 +1,5 @@
 import json
+from functools import lru_cache
 from typing import Any, Callable, List, Optional, Tuple, Union
 from urllib.parse import urlparse
 
@@ -29,7 +30,7 @@ class Parameters:
     verbose: int
 
 
-@attr.s(auto_attribs=True, slots=True)
+@attr.s(auto_attribs=True, eq=False, slots=True)
 class PatroniResource(nagiosplugin.Resource):
     conn_info: ConnectionInfo
 
@@ -76,6 +77,27 @@ class PatroniResource(nagiosplugin.Resource):
             return None
         raise nagiosplugin.CheckError("Connection failed for all provided endpoints")
 
+    @lru_cache(maxsize=None)
+    def has_detailed_states(self) -> bool:
+        # Get patroni's version to find out if the "streaming" and
+        # "in archive recovery" states are available.
+        patroni_item_dict = self.rest_api("patroni")
+
+        if tuple(
+            int(v) for v in patroni_item_dict["patroni"]["version"].split(".", 2)
+        ) >= (3, 0, 4):
+            _log.debug(
+                "Patroni's version is %(version)s, more detailed states can be used to check for the health of replicas.",
+                {"version": patroni_item_dict["patroni"]["version"]},
+            )
+            return True
+
+        _log.debug(
+            "Patroni's version is %(version)s, the running state and the timelines must be used to check for the health of replicas.",
+            {"version": patroni_item_dict["patroni"]["version"]},
+        )
+        return False
+
 
 HandleUnknown = Callable[[nagiosplugin.Summary, nagiosplugin.Results], Any]
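The version gate in `has_detailed_states()` relies on Python's lexicographic tuple comparison; a quick illustration (plain Python, not part of the commit):

```python
# Version strings parsed into int tuples compare element by element:
assert tuple(int(v) for v in "3.1.0".split(".", 2)) >= (3, 0, 4)        # detailed states
assert not (tuple(int(v) for v in "3.0.0".split(".", 2)) >= (3, 0, 4))  # legacy "running"
```

The `eq=False` added to `@attr.s` above keeps `PatroniResource` hashable (an attrs-generated `__eq__` would otherwise set `__hash__` to `None`), which the `@lru_cache` on `has_detailed_states()` requires.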
@@ -50,12 +50,13 @@ class PatroniAPI(HTTPServer):
 
 
 def cluster_api_set_replica_running(in_json: Path, target_dir: Path) -> Path:
-    # starting from 3.0.4 the state of replicas is streaming instead of running
+    # starting from 3.0.4 the state of replicas is streaming or in archive
+    # recovery instead of running
     with in_json.open() as f:
         js = json.load(f)
     for node in js["members"]:
         if node["role"] in ["replica", "sync_standby"]:
-            if node["state"] == "streaming":
+            if node["state"] in ["streaming", "in archive recovery"]:
                 node["state"] = "running"
     assert target_dir.is_dir()
     out_json = target_dir / in_json.name
tests/json/cluster_has_replica_ko_all_replica.json (new file, 35 lines)
@@ -0,0 +1,35 @@
+{
+  "members": [
+    {
+      "name": "srv1",
+      "role": "replica",
+      "state": "running",
+      "api_url": "https://10.20.199.3:8008/patroni",
+      "host": "10.20.199.3",
+      "port": 5432,
+      "timeline": 51,
+      "lag": 0
+    },
+    {
+      "name": "srv2",
+      "role": "replica",
+      "state": "running",
+      "api_url": "https://10.20.199.4:8008/patroni",
+      "host": "10.20.199.4",
+      "port": 5432,
+      "timeline": 51,
+      "lag": 0
+    },
+    {
+      "name": "srv3",
+      "role": "replica",
+      "state": "running",
+      "api_url": "https://10.20.199.5:8008/patroni",
+      "host": "10.20.199.5",
+      "port": 5432,
+      "timeline": 51,
+      "lag": 0
+    }
+  ]
+}
tests/json/cluster_has_replica_ko_wrong_tl.json (new file, 33 lines)
@@ -0,0 +1,33 @@
+{
+  "members": [
+    {
+      "name": "srv1",
+      "role": "leader",
+      "state": "running",
+      "api_url": "https://10.20.199.3:8008/patroni",
+      "host": "10.20.199.3",
+      "port": 5432,
+      "timeline": 51
+    },
+    {
+      "name": "srv2",
+      "role": "replica",
+      "state": "running",
+      "api_url": "https://10.20.199.4:8008/patroni",
+      "host": "10.20.199.4",
+      "port": 5432,
+      "timeline": 50,
+      "lag": 1000000
+    },
+    {
+      "name": "srv3",
+      "role": "replica",
+      "state": "streaming",
+      "api_url": "https://10.20.199.5:8008/patroni",
+      "host": "10.20.199.5",
+      "port": 5432,
+      "timeline": 51,
+      "lag": 0
+    }
+  ]
+}
@@ -12,7 +12,7 @@
     {
       "name": "srv2",
       "role": "replica",
-      "state": "streaming",
+      "state": "in archive recovery",
       "api_url": "https://10.20.199.4:8008/patroni",
       "host": "10.20.199.4",
       "port": 5432,
tests/json/cluster_has_replica_patroni_verion_3.0.0.json (new file, 26 lines)
@@ -0,0 +1,26 @@
+{
+  "state": "running",
+  "postmaster_start_time": "2021-08-11 07:02:20.732 UTC",
+  "role": "master",
+  "server_version": 110012,
+  "cluster_unlocked": false,
+  "xlog": {
+    "location": 1174407088
+  },
+  "timeline": 51,
+  "replication": [
+    {
+      "usename": "replicator",
+      "application_name": "srv1",
+      "client_addr": "10.20.199.3",
+      "state": "streaming",
+      "sync_state": "async",
+      "sync_priority": 0
+    }
+  ],
+  "database_system_identifier": "6965971025273547206",
+  "patroni": {
+    "version": "3.0.0",
+    "scope": "patroni-demo"
+  }
+}
tests/json/cluster_has_replica_patroni_verion_3.1.0.json (new file, 26 lines)
@@ -0,0 +1,26 @@
+{
+  "state": "running",
+  "postmaster_start_time": "2021-08-11 07:02:20.732 UTC",
+  "role": "master",
+  "server_version": 110012,
+  "cluster_unlocked": false,
+  "xlog": {
+    "location": 1174407088
+  },
+  "timeline": 51,
+  "replication": [
+    {
+      "usename": "replicator",
+      "application_name": "srv1",
+      "client_addr": "10.20.199.3",
+      "state": "streaming",
+      "sync_state": "async",
+      "sync_priority": 0
+    }
+  ],
+  "database_system_identifier": "6965971025273547206",
+  "patroni": {
+    "version": "3.1.0",
+    "scope": "patroni-demo"
+  }
+}
@@ -13,22 +13,23 @@ from . import PatroniAPI, cluster_api_set_replica_running
 def cluster_has_replica_ok(
     patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path
 ) -> Iterator[None]:
-    path: Union[str, Path] = "cluster_has_replica_ok.json"
+    cluster_path: Union[str, Path] = "cluster_has_replica_ok.json"
+    patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json"
     if old_replica_state:
-        path = cluster_api_set_replica_running(datadir / path, tmp_path)
-    with patroni_api.routes({"cluster": path}):
+        cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path)
+        patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json"
+    with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}):
         yield None
 
 
-# TODO Lag threshold tests
 @pytest.mark.usefixtures("cluster_has_replica_ok")
 def test_cluster_has_relica_ok(runner: CliRunner, patroni_api: PatroniAPI) -> None:
     result = runner.invoke(main, ["-e", patroni_api.endpoint, "cluster_has_replica"])
-    assert result.exit_code == 0
     assert (
         result.stdout
-        == "CLUSTERHASREPLICA OK - healthy_replica is 2 | healthy_replica=2 srv2_lag=0 srv2_sync=0 srv3_lag=0 srv3_sync=1 sync_replica=1 unhealthy_replica=0\n"
+        == "CLUSTERHASREPLICA OK - healthy_replica is 2 | healthy_replica=2 srv2_lag=0 srv2_sync=0 srv2_timeline=51 srv3_lag=0 srv3_sync=1 srv3_timeline=51 sync_replica=1 unhealthy_replica=0\n"
     )
+    assert result.exit_code == 0
 
 
 @pytest.mark.usefixtures("cluster_has_replica_ok")
@@ -47,11 +48,11 @@ def test_cluster_has_replica_ok_with_count_thresholds(
             "@0",
         ],
     )
-    assert result.exit_code == 0
     assert (
         result.stdout
-        == "CLUSTERHASREPLICA OK - healthy_replica is 2 | healthy_replica=2;@1;@0 srv2_lag=0 srv2_sync=0 srv3_lag=0 srv3_sync=1 sync_replica=1 unhealthy_replica=0\n"
+        == "CLUSTERHASREPLICA OK - healthy_replica is 2 | healthy_replica=2;@1;@0 srv2_lag=0 srv2_sync=0 srv2_timeline=51 srv3_lag=0 srv3_sync=1 srv3_timeline=51 sync_replica=1 unhealthy_replica=0\n"
     )
+    assert result.exit_code == 0
 
 
 @pytest.mark.usefixtures("cluster_has_replica_ok")
@@ -68,21 +69,23 @@ def test_cluster_has_replica_ok_with_sync_count_thresholds(
             "1:",
         ],
     )
-    assert result.exit_code == 0
     assert (
         result.stdout
-        == "CLUSTERHASREPLICA OK - healthy_replica is 2 | healthy_replica=2 srv2_lag=0 srv2_sync=0 srv3_lag=0 srv3_sync=1 sync_replica=1;1: unhealthy_replica=0\n"
+        == "CLUSTERHASREPLICA OK - healthy_replica is 2 | healthy_replica=2 srv2_lag=0 srv2_sync=0 srv2_timeline=51 srv3_lag=0 srv3_sync=1 srv3_timeline=51 sync_replica=1;1: unhealthy_replica=0\n"
    )
+    assert result.exit_code == 0
 
 
 @pytest.fixture
 def cluster_has_replica_ok_lag(
     patroni_api: PatroniAPI, datadir: Path, tmp_path: Path, old_replica_state: bool
 ) -> Iterator[None]:
-    path: Union[str, Path] = "cluster_has_replica_ok_lag.json"
+    cluster_path: Union[str, Path] = "cluster_has_replica_ok_lag.json"
+    patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json"
     if old_replica_state:
-        path = cluster_api_set_replica_running(datadir / path, tmp_path)
-    with patroni_api.routes({"cluster": path}):
+        cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path)
+        patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json"
+    with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}):
         yield None
 
 
@@ -104,21 +107,23 @@ def test_cluster_has_replica_ok_with_count_thresholds_lag(
             "1MB",
         ],
     )
-    assert result.exit_code == 0
     assert (
         result.stdout
-        == "CLUSTERHASREPLICA OK - healthy_replica is 2 | healthy_replica=2;@1;@0 srv2_lag=1024 srv2_sync=0 srv3_lag=0 srv3_sync=0 sync_replica=0 unhealthy_replica=0\n"
+        == "CLUSTERHASREPLICA OK - healthy_replica is 2 | healthy_replica=2;@1;@0 srv2_lag=1024 srv2_sync=0 srv2_timeline=51 srv3_lag=0 srv3_sync=0 srv3_timeline=51 sync_replica=0 unhealthy_replica=0\n"
     )
+    assert result.exit_code == 0
 
 
 @pytest.fixture
 def cluster_has_replica_ko(
     patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path
 ) -> Iterator[None]:
-    path: Union[str, Path] = "cluster_has_replica_ko.json"
+    cluster_path: Union[str, Path] = "cluster_has_replica_ko.json"
+    patroni_path: Union[str, Path] = "cluster_has_replica_patroni_verion_3.1.0.json"
     if old_replica_state:
-        path = cluster_api_set_replica_running(datadir / path, tmp_path)
-    with patroni_api.routes({"cluster": path}):
+        cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path)
+        patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json"
+    with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}):
         yield None
 
 
@@ -138,11 +143,11 @@ def test_cluster_has_replica_ko_with_count_thresholds(
             "@0",
         ],
     )
-    assert result.exit_code == 1
     assert (
         result.stdout
-        == "CLUSTERHASREPLICA WARNING - healthy_replica is 1 (outside range @0:1) | healthy_replica=1;@1;@0 srv3_lag=0 srv3_sync=0 sync_replica=0 unhealthy_replica=1\n"
+        == "CLUSTERHASREPLICA WARNING - healthy_replica is 1 (outside range @0:1) | healthy_replica=1;@1;@0 srv3_lag=0 srv3_sync=0 srv3_timeline=51 sync_replica=0 unhealthy_replica=1\n"
     )
+    assert result.exit_code == 1
 
 
 @pytest.mark.usefixtures("cluster_has_replica_ko")
@@ -161,21 +166,24 @@ def test_cluster_has_replica_ko_with_sync_count_thresholds(
             "1:",
         ],
     )
-    assert result.exit_code == 2
+    # The lag on srv2 is "unknown". We don't handle strings in perfstats, so we have to scratch all the second node's stats.
     assert (
         result.stdout
-        == "CLUSTERHASREPLICA CRITICAL - sync_replica is 0 (outside range 1:) | healthy_replica=1 srv3_lag=0 srv3_sync=0 sync_replica=0;2:;1: unhealthy_replica=1\n"
+        == "CLUSTERHASREPLICA CRITICAL - sync_replica is 0 (outside range 1:) | healthy_replica=1 srv3_lag=0 srv3_sync=0 srv3_timeline=51 sync_replica=0;2:;1: unhealthy_replica=1\n"
     )
+    assert result.exit_code == 2
 
 
 @pytest.fixture
 def cluster_has_replica_ko_lag(
     patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path
 ) -> Iterator[None]:
-    path: Union[str, Path] = "cluster_has_replica_ko_lag.json"
+    cluster_path: Union[str, Path] = "cluster_has_replica_ko_lag.json"
+    patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json"
     if old_replica_state:
-        path = cluster_api_set_replica_running(datadir / path, tmp_path)
-    with patroni_api.routes({"cluster": path}):
+        cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path)
+        patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json"
+    with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}):
         yield None
 
 
@@ -197,8 +205,84 @@ def test_cluster_has_replica_ko_with_count_thresholds_and_lag(
             "1MB",
         ],
     )
-    assert result.exit_code == 2
     assert (
         result.stdout
-        == "CLUSTERHASREPLICA CRITICAL - healthy_replica is 0 (outside range @0:0) | healthy_replica=0;@1;@0 srv2_lag=10241024 srv2_sync=0 srv3_lag=20000000 srv3_sync=0 sync_replica=0 unhealthy_replica=2\n"
+        == "CLUSTERHASREPLICA CRITICAL - healthy_replica is 0 (outside range @0:0) | healthy_replica=0;@1;@0 srv2_lag=10241024 srv2_sync=0 srv2_timeline=51 srv3_lag=20000000 srv3_sync=0 srv3_timeline=51 sync_replica=0 unhealthy_replica=2\n"
     )
+    assert result.exit_code == 2
+
+
+@pytest.fixture
+def cluster_has_replica_ko_wrong_tl(
+    patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path
+) -> Iterator[None]:
+    cluster_path: Union[str, Path] = "cluster_has_replica_ko_wrong_tl.json"
+    patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json"
+    if old_replica_state:
+        cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path)
+        patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json"
+    with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}):
+        yield None
+
+
+@pytest.mark.usefixtures("cluster_has_replica_ko_wrong_tl")
+def test_cluster_has_replica_ko_wrong_tl(
+    runner: CliRunner, patroni_api: PatroniAPI
+) -> None:
+    result = runner.invoke(
+        main,
+        [
+            "-e",
+            patroni_api.endpoint,
+            "cluster_has_replica",
+            "--warning",
+            "@1",
+            "--critical",
+            "@0",
+            "--max-lag",
+            "1MB",
+        ],
+    )
+    assert (
+        result.stdout
+        == "CLUSTERHASREPLICA WARNING - healthy_replica is 1 (outside range @0:1) | healthy_replica=1;@1;@0 srv2_lag=1000000 srv2_sync=0 srv2_timeline=50 srv3_lag=0 srv3_sync=0 srv3_timeline=51 sync_replica=0 unhealthy_replica=1\n"
+    )
+    assert result.exit_code == 1
+
+
+@pytest.fixture
+def cluster_has_replica_ko_all_replica(
+    patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path
+) -> Iterator[None]:
+    cluster_path: Union[str, Path] = "cluster_has_replica_ko_all_replica.json"
+    patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json"
+    if old_replica_state:
+        cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path)
+        patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json"
+    with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}):
+        yield None
+
+
+@pytest.mark.usefixtures("cluster_has_replica_ko_all_replica")
+def test_cluster_has_replica_ko_all_replica(
+    runner: CliRunner, patroni_api: PatroniAPI
+) -> None:
+    result = runner.invoke(
+        main,
+        [
+            "-e",
+            patroni_api.endpoint,
+            "cluster_has_replica",
+            "--warning",
+            "@1",
+            "--critical",
+            "@0",
+            "--max-lag",
+            "1MB",
+        ],
+    )
+    assert (
+        result.stdout
+        == "CLUSTERHASREPLICA CRITICAL - healthy_replica is 0 (outside range @0:0) | healthy_replica=0;@1;@0 srv1_lag=0 srv1_sync=0 srv1_timeline=51 srv2_lag=0 srv2_sync=0 srv2_timeline=51 srv3_lag=0 srv3_sync=0 srv3_timeline=51 sync_replica=0 unhealthy_replica=3\n"
+    )
+    assert result.exit_code == 2
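The `old_replica_state` fixture is not part of this diff. Presumably it is a parametrized boolean fixture along these lines (hypothetical sketch, names assumed), which is what makes every test above run against both the pre- and post-3.0.4 state vocabulary:

```python
import pytest

# Hypothetical sketch: parametrize each test over both replica-state vocabularies.
# False -> serve the fixtures as-is ("streaming" / "in archive recovery", patroni 3.1.0)
# True  -> rewrite states to "running" and serve a patroni 3.0.0 version endpoint
@pytest.fixture(params=[False, True], ids=["new_replica_state", "old_replica_state"])
def old_replica_state(request: pytest.FixtureRequest) -> bool:
    return request.param
```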