Mission-critical events can happen in your environment while you are away. Our alerting feature lets you route alerts about those events to various destinations.
We support two types of alerts. (1) Event Alerts: alerts generated when an event happens in your account. (2) Threshold Alerts: alerts generated when a metric exceeds a designated threshold.
Both types of alerts support an alert policy. The policy ensures that a given alert can be generated only once during a defined window of time. For example, if the alert policy is set to 120 seconds, only one alert will be sent over a two-minute span, even if the alert has been independently triggered 100 times by a metric exceeding and then dropping beneath the same threshold again and again. Suppressing these repeats keeps your system orderly when a single alert would do. No further alerts are generated for the rest of the window; once it has passed, the next trigger generates a new alert.
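The suppression behavior described above can be sketched in a few lines of Python. This is only an illustration: the `AlertPolicy` class and its `should_send` method are hypothetical names, not part of the product, and the sketch assumes the window starts when an alert is actually sent.

```python
class AlertPolicy:
    """Allow at most one alert per window (hypothetical helper, not the product's API)."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._last_sent = None  # time (in seconds) of the most recent alert that was sent

    def should_send(self, now):
        """Return True if an alert triggered at time `now` may be sent."""
        # Assumption: the window is measured from the last alert that was sent.
        if self._last_sent is None or now - self._last_sent >= self.window_seconds:
            self._last_sent = now
            return True
        return False  # still inside the window: suppress the duplicate


policy = AlertPolicy(window_seconds=120)

# The alert triggers 100 times, once per second, all inside one two-minute window.
sent = [t for t in range(100) if policy.should_send(t)]
```

With a 120-second policy, only the first of the 100 triggers results in an alert being sent; a trigger arriving 120 seconds or more after that first alert would be sent again.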
Event Alerts are based on events that happen in your account. To help ensure that you are alerted only about events of high importance, we provide filters. You can filter by:
- Category, which describes what the event is about, e.g., “High Swap Activity”, “Disk Device Full”, etc. Categories are available as a filter in the Events Dashboard; check here for more information.
- Level (severity): one of info, warning, or critical. This is also available as a filter in the Events Dashboard.
- Host name. Refer to your Host List for a list of possible host names.
- Host tags, used to define groups of hosts. (See also how to filter by tag in the UI.)
The default settings will match on any event in your environment. If you find that these settings generate too many alerts, try applying filters. Filters are combined with an AND clause. For example:
if the host is “any” AND the level is “critical” AND the category is “Host stopped sending data”
would filter out everything except “critical” events in the “Host stopped sending data” category, from any host in your environment.
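As a rough sketch of how AND-combined filters behave, the following Python models each filter field as optional, where an unset field means “any”. The `Event` and `EventFilter` classes, their field names, and the host names are assumptions made for illustration; they are not the product's data model.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Event:
    category: str
    level: str
    host: str
    tags: FrozenSet[str] = frozenset()


@dataclass(frozen=True)
class EventFilter:
    # None means "any" for that field (hypothetical convention).
    category: Optional[str] = None
    level: Optional[str] = None
    host: Optional[str] = None
    tag: Optional[str] = None

    def matches(self, event):
        """An event matches only if every filter that is set agrees (AND)."""
        return (
            (self.category is None or event.category == self.category)
            and (self.level is None or event.level == self.level)
            and (self.host is None or event.host == self.host)
            and (self.tag is None or self.tag in event.tags)
        )


# Host is "any", level is "critical", category is "Host stopped sending data".
rule = EventFilter(level="critical", category="Host stopped sending data")

events = [
    Event("Host stopped sending data", "critical", "web-1"),
    Event("Host stopped sending data", "warning", "web-2"),
    Event("Disk Device Full", "critical", "db-1"),
]
matched_hosts = [e.host for e in events if rule.matches(e)]
```

Here only the event from `web-1` matches, because it is the only one that satisfies both the level and the category filter. An `EventFilter()` with no fields set corresponds to the default settings and matches every event.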
Threshold Alerts are based on metrics that exceed a designated threshold. The alert is generated when the metric crosses the trigger value and stays beyond it for a defined duration. Once that duration has passed, the alert becomes active and a “Threshold Alert” event of the configured level is generated. No further alerts are generated while the original alert remains active. The active state is cleared once the metric reaches the reset value; from then on, new alerts can be generated. For a list of the metric categories, see here.
For example, consider a user who sets a threshold alert on “os.cpu.loadavg” with a trigger value of “4”, a reset value of “2”, a duration of “5 seconds”, and an event level of “critical”. Suppose “os.cpu.loadavg” spikes to “7” and holds that value for 5 seconds. An alert is generated as a “Threshold Alert” event with level “critical”. Now suppose “os.cpu.loadavg” drops to “3” and then rises back to “7” for another 5 seconds. No new alert is generated, because the alert is still active: “3” is above the reset value. Finally, suppose “os.cpu.loadavg” drops to “1” and then goes back up to “7” for another 5 seconds. A new alert is generated, because the metric went below the reset value and the active state was cleared.
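The walkthrough above can be replayed with a small state machine. This is a minimal sketch, not the product's implementation: the `ThresholdAlert` class is hypothetical, samples are assumed to arrive once per second, “exceeds” is read as strictly greater than the trigger value, and the active state is assumed to clear when the metric is at or below the reset value.

```python
class ThresholdAlert:
    """Trigger/reset logic for one metric (hypothetical sketch)."""

    def __init__(self, trigger, reset, duration):
        self.trigger = trigger
        self.reset = reset
        self.duration = duration   # seconds the metric must stay above the trigger
        self.active = False        # True while the original alert is still active
        self._above_since = None   # time at which the metric last rose above the trigger

    def observe(self, t, value):
        """Feed one sample taken at time t; return True if a new alert is generated."""
        if value <= self.reset:
            self.active = False    # reset value reached: new alerts are possible again
        if value > self.trigger:
            if self._above_since is None:
                self._above_since = t
            held = t - self._above_since >= self.duration
            if held and not self.active:
                self.active = True
                return True
        else:
            self._above_since = None  # dropped to or below the trigger: restart the timer
        return False


alert = ThresholdAlert(trigger=4, reset=2, duration=5)

samples = (
    [(t, 7) for t in range(0, 6)]      # holds 7 for 5 seconds
    + [(6, 3)]                         # drops to 3, still above the reset value
    + [(t, 7) for t in range(7, 13)]   # holds 7 for another 5 seconds
    + [(13, 1)]                        # drops to 1, below the reset value
    + [(t, 7) for t in range(14, 20)]  # holds 7 for another 5 seconds
)
fired_at = [t for t, value in samples if alert.observe(t, value)]
```

In this replay, alerts are generated at the end of the first and third spikes only. The second spike produces nothing because the metric never reached the reset value in between, so the alert was still active.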