nagios-nrpe: new async check #132

Open
opened 2021-09-06 16:26:01 +02:00 by jlecour · 5 comments
Owner

As discussed internally, we want to be able to execute some tasks that might take longer than the usual NRPE timeout (10 seconds) or use too many resources to be run every minute or so.

The proposed solution is a wrapper that executes a task and stores its return code and output in temporary files.
Another wrapper would be used by NRPE itself to fetch the stored result and return it. It would also check that the output is fresh enough.

## execute the long check
$ check_async_exec --name my_long_check /path/to/check --with-option foo

## return the result/output of "my_long_check"
$ check_async_report --name my_long_check --max-age 1h

check_async_exec would execute /path/to/check --with-option foo and store:

  • the return code in /var/lib/misc/check_async/my_long_check.rc
  • the output in /var/lib/misc/check_async/my_long_check.out
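A minimal sketch of what check_async_exec could look like, under stated assumptions: the check name is passed as the first positional argument instead of --name, and CHECK_ASYNC_DIR is a hypothetical variable that overrides the storage directory. It is not the actual implementation.

```sh
#!/bin/sh
# Hypothetical sketch of check_async_exec, not the real script.
# CHECK_ASYNC_DIR is an assumed override for the storage directory.
CHECK_ASYNC_DIR="${CHECK_ASYNC_DIR:-/var/lib/misc/check_async}"

check_async_exec() {
    # Usage: check_async_exec NAME COMMAND [ARGS...]
    name="$1"
    shift
    mkdir -p "$CHECK_ASYNC_DIR" || return 3

    # Write to temporary files in the same directory, then rename them,
    # so that a concurrent reader never sees a half-written result.
    out_tmp="$CHECK_ASYNC_DIR/$name.out.tmp.$$"
    rc_tmp="$CHECK_ASYNC_DIR/$name.rc.tmp.$$"

    # Run the real check, capturing stdout and stderr and its exit code.
    rc=0
    "$@" > "$out_tmp" 2>&1 || rc=$?
    printf '%s\n' "$rc" > "$rc_tmp"

    # The return code file is moved last, so its age reflects a complete run.
    mv "$out_tmp" "$CHECK_ASYNC_DIR/$name.out"
    mv "$rc_tmp" "$CHECK_ASYNC_DIR/$name.rc"
}
```

With this sketch, `check_async_exec my_long_check /path/to/check --with-option foo` would leave my_long_check.rc and my_long_check.out in the storage directory.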

check_async_report would read the files. If the files were older than the defined max age, it would report the data but with the UNKNOWN code rather than the actual check code.
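The reporting side could be sketched as follows. This is only an illustration: the maximum age is given in seconds rather than as "1h", CHECK_ASYNC_DIR is a hypothetical override for the storage directory, and only the age of the return code file is checked. The codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```sh
#!/bin/sh
# Hypothetical sketch of check_async_report, not the real script.
CHECK_ASYNC_DIR="${CHECK_ASYNC_DIR:-/var/lib/misc/check_async}"

check_async_report() {
    # Usage: check_async_report NAME MAX_AGE_SECONDS
    name="$1"
    max_age="$2"
    rc_file="$CHECK_ASYNC_DIR/$name.rc"
    out_file="$CHECK_ASYNC_DIR/$name.out"

    if [ ! -r "$rc_file" ] || [ ! -r "$out_file" ]; then
        echo "UNKNOWN - no stored result for $name"
        return 3
    fi

    # File modification time in epoch seconds: GNU stat first, BSD stat as fallback.
    mtime="$(stat -c %Y "$rc_file" 2>/dev/null || stat -f %m "$rc_file")"
    now="$(date +%s)"
    age=$((now - mtime))

    # The stored output is always reported, even when it is stale.
    cat "$out_file"

    if [ "$age" -gt "$max_age" ]; then
        # Too old: keep the data but override the code with UNKNOWN.
        return 3
    fi

    rc="$(cat "$rc_file")"
    case "$rc" in
        0|1|2|3) return "$rc" ;;
        *) return 3 ;;   # anything unexpected is treated as UNKNOWN
    esac
}
```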

There is also an option to add a timeout right inside check_async_exec, to mimic what we've also added in alerts_wrapper. It's not its primary responsibility, but it might be useful often enough to be worth adding.
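One possible way to add such a timeout, sketched here with the timeout command from GNU coreutils, which exits with status 124 when the time limit is reached. The helper name run_with_timeout and the choice to map a timeout to UNKNOWN (3) are assumptions made for illustration.

```sh
#!/bin/sh
# Hypothetical helper, not part of the proposed scripts.
run_with_timeout() {
    # Usage: run_with_timeout DURATION COMMAND [ARGS...]
    duration="$1"
    shift
    rc=0
    timeout "$duration" "$@" || rc=$?
    if [ "$rc" -eq 124 ]; then
        # GNU timeout uses 124 to say the command was killed by the limit.
        echo "UNKNOWN - check timed out after $duration"
        rc=3
    fi
    return "$rc"
}
```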

Owner

A little bit of bikeshedding:

  • the return code in /var/lib/misc/check_async/my_long_check.rc

/var/lib/misc/check_async/my_long_check.status: .status is more explicit than .rc, isn't an abbreviation, and won't be confused with a config file (some of those end in .rc too).

$ check_async_exec --name my_long_check /path/to/check --with-option foo

check_async_exec --name my_long_check -- /path/to/check --with-option foo: when calling another program with its own arguments, it's always better to separate it from the parent command with -- to avoid collisions between the arguments of the two commands.

Overall I think it's a great solution. I'd like to know how such an async check would work with Icinga's "Check now" button.
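A small sketch of how the wrapper could handle the -- separator. The function name check_async_exec_args is hypothetical and it only prints what it parsed, to show that everything after -- is left untouched for the wrapped command.

```sh
#!/bin/sh
# Hypothetical argument parser for illustration only.
check_async_exec_args() {
    name=""
    while [ $# -gt 0 ]; do
        case "$1" in
            --name)
                [ $# -ge 2 ] || { echo "missing value for --name" >&2; return 3; }
                name="$2"
                shift 2
                ;;
            --)
                # End of the wrapper's own options.
                shift
                break
                ;;
            -*)
                echo "unknown option: $1" >&2
                return 3
                ;;
            *)
                break
                ;;
        esac
    done
    echo "name=$name"
    # Whatever remains is the wrapped command and its arguments.
    for arg in "$@"; do
        echo "arg=$arg"
    done
}
```

Called as `check_async_exec_args --name my_long_check -- /path/to/check --name other`, the second --name is passed through as an argument of the wrapped command instead of being read by the wrapper.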

Author
Owner

Thanks for your feedback.

I'm OK with the .status suffix.

I'm also OK with the "end of parameters" signaled by --.
This is a conventional way to say that what comes next is not an option of the current command.
This is however not how timeout works.
I have no strong opinion on this.

Regarding the "check now" action in Icinga, it's a good point, but I don't think it's the responsibility of check_async_exec to reschedule the real check, unless the NRPE protocol has some way to signal to the check that it is being executed via an explicit command rather than on the regular schedule.

Owner

Great!

I had a look at the Nagios documentation about manual checks. It uses the SCHEDULE_FORCED_SVC_CHECK command (https://assets.nagios.com/downloads/nagioscore/docs/externalcmds/cmdinfo.php?command_id=129), but the check does not get any information about which Nagios command requested its execution, either through environment variables or through arguments.

Taking a step back, Nagios already provides a mechanism for asynchronous checks through so-called passive checks (https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/4/en/passivechecks.html). However, it can't be done with NRPE, so we would need an additional agent and daemon.
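For reference, a passive service result is submitted to Nagios as a PROCESS_SERVICE_CHECK_RESULT external command line. The sketch below only formats such a line; the function name and the host and service values are made up for illustration, and how the line would reach the Nagios server is exactly the extra agent and daemon mentioned above.

```sh
#!/bin/sh
# Hypothetical helper that formats a passive check result line.
passive_result_line() {
    # Usage: passive_result_line HOST SERVICE RETURN_CODE OUTPUT
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$(date +%s)" "$1" "$2" "$3" "$4"
}
```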
Author
Owner

We'll put aside this "native async checks" since we can't easily add a new agent and the corresponding security policy in our current setup.

We should focus on the semantics and features of the mentioned scripts.
Owner

Nice!

If I understand correctly, the wrapper check_async_exec would be run by cron?

/var/lib/misc/check_async/

IMHO, a more coherent and "intuitive" location for check_async_exec would be the /usr/lib/nagios/plugins directory, where all the plugins already are.

For the generated data files, there is already a /var/lib/nagios directory.

On the other hand, we should take care to avoid file spreading across the filesystem. In my daily practice, I find scripts in many different locations (for historical reasons), sometimes with a lack of consistency.

@bwaegeneire

Overall I think it's a great solution. I'd like to know how such an async check would work with Icinga's "Check now" button.

I guess that if we want to force the execution of a script that takes time to run, running it manually on the server is not so bad. As we are (generally) connected to it when we use the "check now" feature (after an operation), running check_async_exec is quite straightforward.

It would be nice if the output of the check stated that it is async, in order to avoid misuse of "check now".

A related proposal, relevant to the topic: add a script to execute a check locally, given the check name used in the NRPE .cfg files (the name that one also sees in Icinga, for instance). This would avoid having to open the NRPE .cfg file to look up all the check options each time. I didn't find anything like that on the net (with a quick search).

$ nrpe_run_check --verbose check_load
Reading commands /etc/nagios/*.cfg and /etc/nagios/nrpe.d/*.cfg
Found in nrpe.local.cfg :
command[check_load]=/usr/lib/nagios/plugins/check_load --percpu --warning=0.7,0.6,0.5 --critical=0.9,0.8,0.7
Running command...
Result :
OK - Charge moyenne: 0.00, 0.01, 0.00|load1=0.000;0.700;0.900;0; load5=0.015;0.600;0.800;0; load15=0.000;0.500;0.700;0;

This could be used to quickly re-run a check_async_exec with nrpe_run_check check_async_1.
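A rough sketch of how such an nrpe_run_check could work, assuming the command[name]=... syntax shown in the example output. NRPE_CFG_GLOBS and nrpe_find_command are hypothetical names, the check name is assumed to contain only letters, digits and underscores, and taking the last matching definition is an arbitrary choice.

```sh
#!/bin/sh
# Hypothetical sketch of nrpe_run_check, not an existing tool.
# NRPE_CFG_GLOBS is an assumed override for the config file locations.
NRPE_CFG_GLOBS="${NRPE_CFG_GLOBS:-/etc/nagios/*.cfg /etc/nagios/nrpe.d/*.cfg}"

nrpe_find_command() {
    # Usage: nrpe_find_command CHECK_NAME
    # Prints the command line configured as command[CHECK_NAME]=...
    # The globs are left unquoted on purpose so the shell expands them.
    sed -n "s/^command\[$1\]=//p" $NRPE_CFG_GLOBS 2>/dev/null | tail -n 1
}

nrpe_run_check() {
    # Usage: nrpe_run_check CHECK_NAME
    cmd="$(nrpe_find_command "$1")"
    if [ -z "$cmd" ]; then
        echo "UNKNOWN - no command named $1 in NRPE configuration"
        return 3
    fi
    # Run the configured command line and keep its exit code.
    sh -c "$cmd"
}
```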

Command completion would be a must :)

Reference: evolix/ansible-roles#132