Some Generic Monitoring Terminology

  • Alert Generation is the process of sending messages via email, pages, SMS, or any other modality to proactively inform a human that something is amiss.
  • The availability of a service is a measurement of how reliably clients are able to access the service over a particular period of time. This is similar to, but not precisely the same as, the quality of the service over the specified time period. If the service is unavailable, then quality is moot. But the quality of an available service may be anywhere from perfect to awful.

    A service is considered to be up/available if it is functioning correctly and is capable of receiving & processing client requests whether or not there are actually any client requests to be processed. Similarly, a service is considered to be down/unavailable if it is unable to receive & process client requests, whether or not there are actually any client requests to be processed.

    A particular client will perceive the service to be up/available only if all network components between the client and the service are functioning properly, as well as the client and the service themselves. (DIAGRAM) This implies that different clients may view the availability of a given service differently, which complicates the business of measuring availability. At one extreme, we may consider the service availability from the perspective of the hosts running the server application. This gives the "most optimistic" availability measurement, since only those network issues which prevent the service hosts from communicating amongst themselves or with any required external services (eg NTP) can count against this type of availability. At the other extreme, we may consider the perspective of any potential client running on any host, anywhere on the internet. This gives the "most pessimistic" availability measurement, since at any given moment there are probably (local) DNS issues that prevent a nonzero fraction of potential clients from ever getting to the service, regardless of how perfectly the service itself may be performing.

    To measure the availability of a service, it is often convenient to deploy a system of agents to various points in the ambient network to monitor both the service and the network itself. Each agent collects information over time and submits reports at regular intervals (or on demand) to a central repository, where another agent uses knowledge about the structure and function of the system to integrate the assembled data into a single continuous state trajectory for the service and ambient network over time. This state trajectory should be relatively sparse, even if large quantities of data are collected by the agents (if the agents are tuned properly, then increasing the frequency with which they collect data should not substantially increase the amount of information in the state trajectory). It should probably be stored in XML, so humans can annotate it by hand and make manual amendments (eg after postmortem reconstruction of "enterprise events").
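
    As a rough illustration of the sparseness property described above, the sketch below collapses raw agent samples into a trajectory that records only state changes, then computes availability over a window. The names Transition, build_trajectory and availability are hypothetical, and the two-state "up"/"down" model is an assumption made for brevity; a real integration agent would also have to reconcile reports from several locations.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Transition:
    timestamp: float  # seconds since the start of the measurement window
    state: str        # "up" or "down" (assumed two-state model)


def build_trajectory(samples: List[Tuple[float, str]]) -> List[Transition]:
    """Collapse (timestamp, state) samples into a list of state changes only."""
    trajectory: List[Transition] = []
    for timestamp, state in sorted(samples):
        # Record a sample only when it differs from the last recorded state.
        if not trajectory or trajectory[-1].state != state:
            trajectory.append(Transition(timestamp, state))
    return trajectory


def availability(trajectory: List[Transition], window_end: float) -> float:
    """Fraction of [first transition, window_end] spent in the "up" state."""
    if not trajectory:
        raise ValueError("empty trajectory")
    total = window_end - trajectory[0].timestamp
    if total <= 0:
        raise ValueError("window_end must be later than the first transition")
    up = 0.0
    # Pair each transition with its successor; a sentinel closes the window.
    closing = trajectory[1:] + [Transition(window_end, "end")]
    for current, following in zip(trajectory, closing):
        if current.state == "up":
            up += following.timestamp - current.timestamp
    return up / total
```

    Sampling twice as often adds samples but no new transitions, so the trajectory stays the same size. Note that a change is timestamped at the first sample that observed it, so the sampling interval bounds the error of the result.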

    The accuracy of any service availability report generated from the state trajectory clearly depends on:

    1. the locations at which agents are deployed
    2. the types of measurements made by the agents
    3. the logic & performance of the agents
    4. the logic & performance of the data integration agent
    5. the match between network quality as perceived by the monitoring agents and network quality as perceived by potential service-using clients

    Reliable agents may compensate somewhat for an unreliable network by "voting" to determine whether or not a service is up/down at any given moment. Unreliable agents are bad news - it is almost better to have no data at all than data from a faulty monitoring agent. So it is very important to thoroughly test agents (and any changes to them) before they are deployed (duh).
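
    One possible reading of "voting" is a simple majority over the most recent report from each agent. The sketch below assumes exactly that; the function name vote, the quorum parameter, and the choice to return None on a tie are illustrative assumptions rather than anything prescribed here.

```python
from collections import Counter
from typing import Dict, Optional


def vote(reports: Dict[str, str], quorum: int = 2) -> Optional[str]:
    """Return the winning state among agent reports, or None if undecided.

    reports maps an agent name to the state it last observed ("up"/"down").
    """
    if not reports or len(reports) < quorum:
        return None  # too few agents reported to decide anything
    ranked = Counter(reports.values()).most_common()
    state, top_count = ranked[0]
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None  # tie between the two most common states
    return state
```

    For example, vote({"nyc": "up", "sfo": "up", "lon": "down"}) returns "up", which treats the single dissenting agent as being on the wrong side of a network problem.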

    Since it is not generally possible to detect all client failures, it is not generally possible to directly measure the impact of a service outage on the community of clients at large. But estimates are useful for assessing damage and setting priorities, and when we need one we use a MODEL####

  • Application Monitoring is the process of detecting and reporting problems with the proper operation of an application or service. It generally takes one of two forms: passive checks and active checks. Passive checks have the advantage of being quicker and less resource-intensive, which is important when monitoring a service whose performance is load-sensitive. Active checks have the advantage of being able to detect unanticipated failure modes, which is important when monitoring applications which depend on a number of distributed network services (eg DNS, NFS, NTP), or which for other reasons (like running on NT) are prone to byzantine failure modes (in which, for example, a process is running but unresponsive).

    Ideally, both active and passive application monitoring data should be combined with other monitoring data via event correlation to reduce the total number of errors/alerts generated, and to simplify failure diagnosis.
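
    To make the contrast concrete, here is a minimal sketch of one check of each kind. It assumes that a passive check inspects evidence the application has already produced (log lines here) and that an active check behaves like a client (a TCP connection attempt here). The function names and the error pattern are hypothetical.

```python
import re
import socket
from typing import Iterable, List

# Assumed pattern; a real deployment would tune this per application.
ERROR_PATTERN = re.compile(r"\b(error|fatal|panic)\b", re.IGNORECASE)


def passive_check(log_lines: Iterable[str]) -> List[str]:
    """Passive: scan what the application already logged; return suspect lines."""
    return [line for line in log_lines if ERROR_PATTERN.search(line)]


def active_check(host: str, port: int, timeout: float = 3.0) -> bool:
    """Active: act like a client and try to open a TCP connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

    A bare connection attempt can still succeed against a process that is running but unresponsive, so an active check meant to catch that failure mode would need to complete a full request and validate the reply.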

  • Network Monitoring is the process of detecting and reporting problems either with the movement of data from one point to another through a network (called End to End network monitoring) or errors reported by network devices themselves (called Passive network monitoring). Some examples:

    Though network problems may be due to configuration issues on one or more network devices, hardware failure, or exceptional load on one or more network devices, they typically manifest themselves in mysterious latency, dropped packets, and application failures. So ideally, data from network monitoring should be combined with other monitoring data via event correlation to reduce the total number of errors/alerts generated, and to simplify failure diagnosis.
  • Network Services - some examples:
  • Event Correlation is the process of combining monitoring data from one or many different sources over time with a functional model expressed as a set of rules reflecting the relationships between the various applications, services, servers, and networks to produce new data, which (if the functional model is good) are less voluminous and more informative.

    Simple forms of event correlation (in the small) may be programmed into monitoring agents themselves. For example, an agent which is configured to only send one alert every 15 minutes must decide whether or not to send an alert for a continuing problem based on how long it has been since the last alert was sent. Agents are also typically configured to send messages about different types of problems to different alert recipients.
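
    The 15-minute rule above amounts to remembering when the last alert for each problem went out. A minimal sketch, assuming alerts are keyed by a problem identifier and that the caller supplies the current time in seconds (the class name AlertThrottle and its methods are hypothetical):

```python
from typing import Dict


class AlertThrottle:
    """Allow at most one alert per problem within min_interval seconds."""

    def __init__(self, min_interval: float = 15 * 60) -> None:
        self.min_interval = min_interval
        self._last_sent: Dict[str, float] = {}

    def should_send(self, problem_id: str, now: float) -> bool:
        last = self._last_sent.get(problem_id)
        if last is not None and now - last < self.min_interval:
            return False  # an alert for this problem went out too recently
        self._last_sent[problem_id] = now
        return True
```

    Keying on the problem identifier keeps a continuing disk alert from delaying the first alert about an unrelated problem.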

    More complex forms of event correlation (in the large) are useful for limiting the number of spurious alerts generated during an outage. For example, problems connecting to an HTTP server (detected via application monitoring) should probably be ignored while the network segment to which it is connected is down (detected via network monitoring).

    Note that in order for event correlation in the large to work, all of the following are required:

    1. an Event Format which allows monitoring data from various sources to be encapsulated in "events", eg SNMP traps
    2. an Event Transport System which allows events to be passed between agents & hosts, eg HTTP, SMTP, or direct tcp/ip
    3. a Functional Model, or Ruleset (set of rules) expressing the relevant relationships between the components of the monitored system
    4. an Event Correlation Engine which applies the rules in the ruleset to an incoming stream of events
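
    The sketch below puts three of these pieces together in miniature: an event format (the Event class), a ruleset (a dependency map turned into a rule), and an engine that applies the rules to a stream of events. The transport is left out, since events are simply passed in as function arguments. All names are hypothetical, and the single rule implements only the HTTP server and network segment example given earlier.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional


@dataclass(frozen=True)
class Event:
    source: str     # which kind of monitoring produced it, eg "network"
    component: str  # the monitored thing, eg "segment-a" or "http-1"
    status: str     # "up" or "down"


# A rule sees the event plus the last known status of every component,
# and returns True if the event should be suppressed.
Rule = Callable[[Event, Dict[str, str]], bool]


def make_dependency_rule(depends_on: Dict[str, str]) -> Rule:
    """Suppress "down" events for a component whose parent is already down."""
    def rule(event: Event, state: Dict[str, str]) -> bool:
        parent = depends_on.get(event.component)
        return (event.status == "down"
                and parent is not None
                and state.get(parent) == "down")
    return rule


class CorrelationEngine:
    def __init__(self, rules: Iterable[Rule]) -> None:
        self.rules = list(rules)
        self.state: Dict[str, str] = {}

    def process(self, event: Event) -> Optional[Event]:
        """Return the event if it should be reported, or None if suppressed."""
        suppressed = any(rule(event, self.state) for rule in self.rules)
        self.state[event.component] = event.status
        return None if suppressed else event
```

    With depends_on = {"http-1": "segment-a"}, a "down" event for http-1 that arrives while segment-a is down yields None, so only the network outage is reported.
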
  • Server Monitoring is the process of detecting and reporting problems with the general health of a machine. It typically involves a set of passive checks (eg syslog, process table), since these are usually quicker and consume fewer resources than active checks.

    Note that the distinction between network monitoring and application monitoring becomes arbitrary if the machine in question is a network device, or if an application running on the machine provides a network service. In such cases we're probably talking about *both* network and application monitoring at the same time.

  • Quality Of Service (AKA QOS) refers to the functional quality of a service, measured in terms of a metric particular to the service. For example, the quality of a VOIP or telephony service could be measured in terms of any of the following metrics:

    For an overview of voice quality terminology, check any of these resources:

  • Uptime: see availability.

  • Last update to this page: 2001/09/13 by Peter Wolfenden