Automatic Host Removal in CheckMK

Ephemeral infrastructure (containers, short-lived VMs, lab environments) is great – until your monitoring system starts collecting zombie hosts like it is preparing for a compliance audit.

In CheckMK 2.4 Raw, “Automatic host removal” is easy to misread: it does not remove hosts just because a service goes CRIT (even the “CheckMK”/agent-related service). Instead, it removes hosts that CheckMK classifies as vanished based on its inventory/service discovery logic.

This post explains what “vanished” really means in this context, the dependency chain that must be in place, and a practical troubleshooting checklist.

Context and requirements

Assumptions for this article:

  • CheckMK 2.4 Raw Edition
  • Hosts are created dynamically (for example via automation/API, piggyback, container integration, or external tooling)
  • You expect hosts to be removed after a timeout (example: “remove after 15 minutes if unreachable”)
  • You are using host labels to target rules (common for container hosts)

Key point: “Host unreachable” is not the same as “host vanished”.

Solution overview (the actual lifecycle)

For automatic removal to happen, CheckMK typically needs this chain:

  1. Periodic service discovery runs
  2. CheckMK detects that hosts/services are vanished (after a configured number of cycles or amount of time, X)
  3. Optionally, CheckMK deactivates vanished hosts
  4. Automatic host removal removes those vanished/deactivated hosts (after Y time)
  5. Removal is executed by a background job/housekeeping, not instantly

If any earlier step does not happen, “Automatic host removal” will have nothing to delete.
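
To make the dependency concrete, here is a minimal Python sketch of the chain as an ordered list of preconditions. It does not talk to CheckMK at all: the step labels and the `first_blocked_step` helper are illustrative names invented for this post, and you would fill in the booleans by hand from what you see in the GUI. The optional deactivation step is left out for brevity.

```python
from typing import Dict, List, Optional

# Ordered lifecycle steps as described in this article.
# The labels are illustrative, not CheckMK identifiers.
LIFECYCLE: List[str] = [
    "periodic discovery runs",
    "host classified as vanished",
    "automatic host removal rule matches",
    "background job executes removal",
]


def first_blocked_step(status: Dict[str, bool]) -> Optional[str]:
    """Return the first lifecycle step that is not satisfied, or None."""
    for step in LIFECYCLE:
        # A step that is missing from the dict counts as not satisfied.
        if not status.get(step, False):
            return step
    return None


# Example: everything is configured, but the host never became "vanished".
status = {
    "periodic discovery runs": True,
    "host classified as vanished": False,
    "automatic host removal rule matches": True,
    "background job executes removal": True,
}
print(first_blocked_step(status))
```

The point of returning only the first blocked step is that later steps are irrelevant until the earlier one is fixed.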

Step-by-step: what to verify (and why it matters)

1) The host must be “vanished” (and usually “inactive”), not just CRIT

“Automatic host removal” is evaluated for vanished hosts – not for hosts that merely have a CRIT service.

A CRIT “CheckMK” service often means one of:

  • agent unreachable
  • piggyback source missing
  • special agent failed

In all of those cases, the host still exists in your configuration, and CheckMK will not necessarily classify it as vanished.

What to verify:

  • In the GUI, open the host and look for its vanished status.
  • Use built-in views and search for something like:
    • Vanished hosts
    • hosts with state “vanished” (depending on your view setup)

If the host never appears as vanished, the removal rule will never trigger, regardless of CRIT services. For containers, this also means the container must no longer appear in the piggyback data. As long as it does, it is only considered DOWN, not vanished, and will therefore not be removed.
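
The piggyback distinction can be sketched as a simple set comparison. This is a simplified model of the behavior described above, not CheckMK's actual implementation: the `classify_hosts` function, the host names, and the result strings are all hypothetical.

```python
from typing import Dict, Set


def classify_hosts(configured: Set[str], in_piggyback: Set[str]) -> Dict[str, str]:
    """Label each configured host by whether the latest piggyback data still mentions it."""
    result: Dict[str, str] = {}
    for host in sorted(configured):
        if host in in_piggyback:
            # Still reported by the source host: at most DOWN, not vanished.
            result[host] = "present (may be DOWN)"
        else:
            # No longer reported at all: a candidate for the vanished state.
            result[host] = "vanished candidate"
    return result


# Hypothetical container hosts; "web-2" has dropped out of the piggyback data.
configured = {"web-1", "web-2"}
in_piggyback = {"web-1"}
for host, state in classify_hosts(configured, in_piggyback).items():
    print(host, "->", state)
```

Only hosts in the second category can ever reach the removal step in this model.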


2) Ensure periodic discovery is running and configured to detect vanished objects

The “vanished” classification is typically built by regular discovery. If discovery is not running (or not applied to the relevant folder/labels), CheckMK never builds the vanished state, and nothing gets removed.

What to check:

  • Your periodic discovery (automation/discovery job) is actually enabled for these hosts/folders.
  • Discovery behavior is configured so that vanished objects are detected and handled (rather than silently ignoring changes).
  • If you use a rule such as Periodic service discovery (exact wording may vary between versions), confirm that its conditions match your container-labeled hosts.

Operationally: if discovery is off, CheckMK is blind to the fact that the host should be considered gone.


3) Confirm rule matching and activate changes (the classic two-step)

Automation rules do nothing unless:

  1. the rule matches the object
  2. changes are activated on the site

What to verify:

  • After creating/changing the rule: Activate changes.
  • Use Analyse / Rule analysis for the specific host:
    • Do label conditions match exactly (key/value, case sensitivity)?
    • Are you matching host labels (not service labels)?
    • Is the host in the folder/site you think it is (important in distributed monitoring)?

If the rule analysis says the rule does not apply, CheckMK will also not apply it at runtime.
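
A common reason for a non-matching rule is a label that is almost, but not exactly, right. The following sketch shows exact key/value matching, assuming (as the checklist above suggests) that comparison is case-sensitive. The `labels_match` helper and the example labels are hypothetical and only illustrate the idea.

```python
from typing import Dict


def labels_match(host_labels: Dict[str, str], condition: Dict[str, str]) -> bool:
    """True only if every condition key/value is present on the host, compared exactly."""
    return all(host_labels.get(key) == value for key, value in condition.items())


# Hypothetical host labels for a container host.
host_labels = {"type": "container", "env": "lab"}

print(labels_match(host_labels, {"type": "container"}))  # exact match
print(labels_match(host_labels, {"type": "Container"}))  # differs only in case
print(labels_match(host_labels, {"role": "container"}))  # wrong key
```

In this model only the first condition matches. If your rule behaves like the second or third case, the rule analysis should show it as not applying.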


4) “The CheckMK service is CRIT” is usually the wrong trigger

People often anchor on “the CheckMK service is CRIT for 15 minutes” and assume removal should happen.

But the removal rule typically expects vanished-host state, not “any service named X is CRIT”.

Also, depending on your monitoring method, the relevant service name can differ:

  • “CheckMK Agent” (agent communication)
  • “CheckMK” (often piggyback-related)
  • “Check_MK” (legacy naming in some setups)
  • a special agent service

Action:

  • Re-read the rule option text carefully.
  • If it references vanished hosts anywhere, your trigger is the vanished mechanism, not a service state.

A small but important distinction: a host can be unreachable while still being very much present (at least in configuration).


5) Background job timing: it will not remove exactly at 15 minutes

Even if the rule says “remove after 15 minutes”, the deletion is performed by periodic background jobs/housekeeping.

Depending on site load and scheduling, this can be delayed beyond your exact threshold.

What to check:

  • Background job / cron status for the site, and whether the site is running normally
  • In distributed setups: that you are checking the site where the host is actually configured

If the housekeeping cycle does not run (or is failing), removal will not happen even with correct rules.
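
To see why the deletion lags behind the threshold, here is a small arithmetic sketch. It assumes the background job runs at a fixed interval starting from a known time. The 10-minute interval is a made-up value for illustration, not a CheckMK default, and `earliest_removal` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def earliest_removal(
    vanished_at: datetime,
    threshold: timedelta,
    job_start: datetime,
    job_interval: timedelta,
) -> datetime:
    """Return the first scheduled job run at or after vanished_at + threshold."""
    eligible_at = vanished_at + threshold
    if eligible_at <= job_start:
        return job_start
    elapsed = eligible_at - job_start
    # Ceiling division: number of whole intervals needed to reach eligible_at.
    runs = -(-elapsed // job_interval)
    return job_start + runs * job_interval


vanished_at = datetime(2025, 1, 1, 12, 0)
removal = earliest_removal(
    vanished_at,
    threshold=timedelta(minutes=15),
    job_start=datetime(2025, 1, 1, 12, 0),
    job_interval=timedelta(minutes=10),  # assumed interval, for illustration only
)
print(removal - vanished_at)
```

With these assumed numbers the host becomes eligible at 12:15, but the next job run is at 12:20, so the effective delay is 20 minutes rather than 15. A loaded or failing job only pushes this further out.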


Common pitfalls and troubleshooting

Quick troubleshooting checklist (do this in order):

  1. Does the host show up in a “Vanished hosts” view at all?
    • If no: discovery/vanished mechanism is not engaged.
  2. Is periodic discovery enabled for those container hosts/folder?
    • If no: enable it and ensure it matches labels/folders.
  3. Rule analysis: does the “Automatic host removal” rule match the host?
    • If no: fix label/folder/site conditions.
  4. Activate changes
    • Do not skip this, even if you are sure you did it.
  5. Wait for housekeeping/background jobs (often longer than 15 minutes)
    • Then re-check vanished state and whether the host gets deleted.

Conclusion

In CheckMK 2.4 Raw, Automatic host removal is not a reaction to CRIT services. It is the final step in a lifecycle that starts with periodic discovery and leads to hosts being classified as vanished (and often deactivated) before the background job deletes them.

If you are troubleshooting this, start by proving one thing: does the host ever become “vanished” in CheckMK? If not, the removal rule is not the problem yet.
