nagios-nrpe: new async check #132
Reference: evolix/ansible-roles#132
As discussed internally, we want to be able to execute some tasks that might take longer than the usual NRPE timeout (10 seconds) or use too many resources to be run every minute or so.
The proposed solution is to have a wrapper that executes a task and stores its return code and output in temporary files.
Another wrapper would be used by NRPE itself to fetch the result before returning it. It would also check if the output is fresh enough.
check_async_exec would execute /path/to/check --with-option foo and store the result in:
/var/lib/misc/check_async/my_long_check.rc
/var/lib/misc/check_async/my_long_check.out
check_async_report would read the files. If the files were older than the defined max age, it would report the data but with the UNKNOWN code rather than the actual check code.
There is also an option to add a timeout right inside check_async_exec, to mimic what we've also added in alerts_wrapper. It's not its primary responsibility, but it might be useful frequently enough to be relevant to add.
A little bit of bikeshedding:
/var/lib/misc/check_async/my_long_check.status
The .status suffix is more explicit than .rc, isn't an abbreviation and won't be confused with a config file (as some of those end in .rc too).
check_async_exec --name my_long_check -- /path/to/check --with-option foo
When calling another program with its own arguments, it's always better to separate it from the parent command with -- to avoid collisions between the arguments of the two commands.
Overall I think it's a great solution. I'd like to know how such an async check would compose with Icinga's "Check now" button.
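To make the proposal so far concrete, here is a minimal Python sketch of what check_async_exec could look like. It is only an illustration under assumptions: the --timeout and --directory option names, the 0644 file mode, the atomic rename and the choice to record a timeout as UNKNOWN (3) are not decided anywhere in this thread.

```python
"""Hypothetical sketch of check_async_exec: run a check, store status and output."""
import argparse
import os
import subprocess
import tempfile

DEFAULT_DIR = "/var/lib/misc/check_async"  # location from the proposal, still open
UNKNOWN = 3  # Nagios plugin return code for UNKNOWN


def split_args(argv):
    """Split argv at the first "--": wrapper options before, check command after."""
    if "--" in argv:
        index = argv.index("--")
        return argv[:index], argv[index + 1:]
    return argv, []


def write_atomic(path, content):
    """Write to a temporary file, then rename, so readers never see a partial file."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    os.chmod(tmp, 0o644)  # mkstemp creates 0600; assumption: NRPE user must read it
    os.replace(tmp, path)


def run_and_store(name, command, directory=DEFAULT_DIR, timeout=None):
    """Run `command` and store its return code and output under `name`."""
    os.makedirs(directory, exist_ok=True)
    try:
        proc = subprocess.run(command, stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT, text=True, timeout=timeout)
        status, output = proc.returncode, proc.stdout
    except subprocess.TimeoutExpired:
        # Assumption: a timed-out check is recorded as UNKNOWN.
        status, output = UNKNOWN, f"UNKNOWN - check timed out after {timeout}s\n"
    except OSError as exc:
        status, output = UNKNOWN, f"UNKNOWN - cannot run check: {exc}\n"
    write_atomic(os.path.join(directory, f"{name}.out"), output)
    write_atomic(os.path.join(directory, f"{name}.status"), f"{status}\n")
    return status


def main(argv):
    own_args, command = split_args(argv)
    parser = argparse.ArgumentParser(prog="check_async_exec")
    parser.add_argument("--name", required=True)
    parser.add_argument("--timeout", type=float, default=None)  # hypothetical option
    parser.add_argument("--directory", default=DEFAULT_DIR)     # hypothetical option
    args = parser.parse_args(own_args)
    if not command:
        parser.error("no check command given after --")
    run_and_store(args.name, command, args.directory, args.timeout)
    return 0
```

A real script would end with sys.exit(main(sys.argv[1:])) and be called as check_async_exec --name my_long_check -- /path/to/check --with-option foo. Splitting on the first -- before parsing keeps the wrapper's options and the check's options fully separate, which is the point of the separator.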
Thanks for your feedback.
I'm OK with the .status suffix.
I'm also OK with the "end of parameters" signaled by --. This is a conventional way to say that what comes next is not an option of the command itself.
This is however not how timeout works. I have no strong opinion on this.
Regarding the "check now" action in Icinga, it's a good point, but I don't think it's the responsibility of check_async_exec to reschedule the real check, unless there is a semantic in the NRPE protocol to signal to the check that it's not executed on the regular schedule but via an explicit command.
Great!
I had a look at the Nagios documentation about manual checks: it uses the SCHEDULE_FORCED_SVC_CHECK command, but the check does not get any data about which Nagios command requested its execution, either through environment variables or through arguments.
Taking a step back, Nagios already provides a mechanism for asynchronous checks through so-called passive checks. However, it can't be done with NRPE, so we would need an additional agent and daemon.
We'll put aside this "native async checks" since we can't easily add a new agent and the corresponding security policy in our current setup.
We should focus on the semantics and features of the mentioned scripts.
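As a starting point for those semantics, here is a minimal Python sketch of the check_async_report side: read the stored status and output, and downgrade to UNKNOWN when the files are older than the max age. The function name, its parameters, the handling of missing files and the extra first line added for stale results are assumptions for illustration, not agreed behavior.

```python
"""Hypothetical sketch of check_async_report: report a stored check result."""
import os
import time

UNKNOWN = 3  # Nagios plugin return code for UNKNOWN


def report(name, max_age, directory="/var/lib/misc/check_async", now=None):
    """Return (exit_code, output) for the stored result of check `name`.

    `max_age` is in seconds. `now` exists only to make the function testable.
    """
    status_path = os.path.join(directory, f"{name}.status")
    out_path = os.path.join(directory, f"{name}.out")
    try:
        with open(status_path) as handle:
            status = int(handle.read().strip())
        with open(out_path) as handle:
            output = handle.read()
        # The result is only as fresh as the older of the two files.
        oldest = min(os.path.getmtime(status_path), os.path.getmtime(out_path))
    except (OSError, ValueError) as exc:
        # Assumption: a missing or unreadable result is reported as UNKNOWN.
        return UNKNOWN, f"UNKNOWN - no usable async result for {name}: {exc}\n"

    age = (time.time() if now is None else now) - oldest
    if age > max_age:
        # Stale: still report the data, but with the UNKNOWN code.
        header = f"UNKNOWN - async result is {int(age)}s old (max age {max_age}s)\n"
        return UNKNOWN, header + output
    return status, output
```

An NRPE command would print the returned output and exit with the returned code. Putting the staleness notice on the first line is a design choice: Nagios and Icinga show the first output line as the status text, so the operator sees immediately why the check is UNKNOWN.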
Nice!
If I understand correctly, the wrapper check_async_exec would be run by cron?
IMHO, a more coherent and "intuitive" location for check_async_exec would be the directory /usr/lib/nagios/plugins, where all the plugins already live.
For the generated data files, there is already a /var/lib/nagios directory.
On the other hand, we should take care to avoid spreading files across the filesystem. In my daily practice, I find scripts in many different locations (for historical reasons), sometimes with a lack of consistency.
@bwaegeneire
I guess that if we want to force the execution of a script that takes time to run, running it manually on the server is not so bad. As we are (generally) connected to it when we use the "check now" feature (after an operation), running check_async_exec is quite straightforward.
It would be nice if the output of the check stated that it is async, in order to avoid misuse of "check now".
A related proposal, relevant to the topic: adding a script that executes a check locally, given the check name used in the NRPE .cfg files (the name one also sees in Icinga, for instance). This would avoid having to open the NRPE .cfg file each time to find all the check options. I didn't find anything like that on the net (with a quick search).
This could be used to quickly re-run a check_async_exec with nrpe_run_check check_async_1.
Command completion would be a must :)
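A rough Python sketch of how such an nrpe_run_check helper could work: collect the command[name]=... definitions from the NRPE configuration and run the one requested. The function names are hypothetical, the caller has to pass the configuration file paths (for example the ones under /etc/nagios, which is an assumption about the layout), and the command line is split with shlex instead of being handed to a shell, so shell syntax and $ARGn$ macros are not handled.

```python
"""Hypothetical sketch of nrpe_run_check: run an NRPE check by its name."""
import re
import shlex
import subprocess

# Matches lines such as: command[check_load]=/usr/lib/nagios/plugins/check_load -w 5
# Commented-out definitions (starting with "#") do not match.
COMMAND_RE = re.compile(r"^\s*command\[([^\]]+)\]\s*=\s*(.+?)\s*$")


def load_commands(paths):
    """Return {check_name: command_line} from the given NRPE config files.

    Later files override earlier ones when a name is defined twice.
    """
    commands = {}
    for path in paths:
        with open(path) as handle:
            for line in handle:
                match = COMMAND_RE.match(line)
                if match:
                    commands[match.group(1)] = match.group(2)
    return commands


def run_check(name, commands):
    """Run the check registered under `name`; return (exit_code, output)."""
    if name not in commands:
        known = ", ".join(sorted(commands))
        raise KeyError(f"unknown check {name!r}; known checks: {known}")
    # Simplification: no shell involved, so pipes and redirections will not work.
    proc = subprocess.run(shlex.split(commands[name]), stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, text=True)
    return proc.returncode, proc.stdout
```

The sorted keys of load_commands() are also the list of names a shell completion function would need, so completion could be built on the same parsing.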