A service is considered to be up/available if it is functioning correctly and is capable of receiving & processing client requests whether or not there are actually any client requests to be processed. Similarly, a service is considered to be down/unavailable if it is unable to receive & process client requests, whether or not there are actually any client requests to be processed.
A particular client will perceive the service to be up/available only if all network components between the client and the service are functioning properly, as well as the client and the service themselves. (DIAGRAM) This implies that different clients may view the availability of a given service differently, which complicates the business of measuring availability. At one extreme, we may consider the service availability from the perspective of the hosts running the server application. This gives the "most optimistic" availability measurement, since only those network issues which prevent the service hosts from communicating amongst themselves or with any required external services (eg NTP) can count against this type of availability. At the other extreme, we may consider the perspective of any potential client running on any host, anywhere on the internet. This gives the "most pessimistic" availability measurement, since at any given moment there are probably (local) DNS issues that prevent a nonzero fraction of potential clients from ever getting to the service, regardless of how perfectly the service itself may be performing.
To measure the availability of a service, it is often convenient to deploy a system of agents to various points in the ambient network to monitor both the service and the network itself. Each agent collects information over time and submits reports at regular intervals (or on demand) to a central repository, where another agent uses knowledge about the structure and function of the system to integrate the assembled data into a single continuous state trajectory for the service and ambient network over time. This state trajectory should be relatively sparse, even if large quantities of data are collected by the agents (if the agents are tuned properly, then increasing the frequency with which they collect data should not substantially increase the amount of information in the state trajectory). It should probably be stored in XML, so humans can annotate it by hand and make manual amendments (eg after postmortem reconstruction of "enterprise events").
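The sparseness property can be illustrated with a minimal sketch in Python. It assumes each agent report reduces to a (timestamp, state) pair and that the trajectory records only state changes; the name `to_trajectory` and the sample data are hypothetical, not part of any particular monitoring system.

```python
from typing import Iterable, List, Tuple

# (timestamp in seconds, observed state such as "up" or "down") -- assumed report shape
Sample = Tuple[float, str]


def to_trajectory(samples: Iterable[Sample]) -> List[Sample]:
    """Keep only the samples at which the observed state changes."""
    trajectory: List[Sample] = []
    for timestamp, state in sorted(samples):
        # Record a point only when the state differs from the last recorded one.
        if not trajectory or trajectory[-1][1] != state:
            trajectory.append((timestamp, state))
    return trajectory


# The same outage observed at 60-second and at 30-second polling intervals.
coarse = [(0, "up"), (60, "up"), (120, "down"), (180, "down"), (240, "up")]
fine = [(t, "down" if 120 <= t < 240 else "up") for t in range(0, 270, 30)]
```

In this example both sample sets reduce to the same three-point trajectory, so doubling the polling frequency adds raw data but no new trajectory entries.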
The accuracy of any service availability report generated from the state trajectory clearly depends on:
Reliable agents may compensate somewhat for an unreliable network by "voting" to determine whether or not a service is up/down at any given moment. Unreliable agents are bad news - it is almost better to have no data at all than data from a faulty monitoring agent. So it is very important to thoroughly test agents (and any changes to them) before they are deployed (duh).
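One possible reading of "voting" is a strict majority among the agents that managed to report, sketched below in Python. The function name `vote`, the "unknown" outcome for ties, and the use of `None` for an agent that could not report are all assumptions made for illustration.

```python
from collections import Counter
from typing import Mapping, Optional


def vote(reports: Mapping[str, Optional[str]]) -> str:
    """Return "up" or "down" by strict majority of reporting agents, else "unknown".

    `reports` maps a (hypothetical) agent name to "up", "down", or None
    when that agent could not report at all.
    """
    votes = Counter(state for state in reports.values() if state is not None)
    total = sum(votes.values())
    for state in ("up", "down"):
        # A strict majority is required; ties and empty input fall through.
        if votes[state] * 2 > total:
            return state
    return "unknown"
```

For example, `vote({"a": "up", "b": "up", "c": "down"})` yields "up", while an even split yields "unknown" rather than guessing. Note that this only helps when the agents themselves are reliable; a faulty agent still casts a full vote.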
Since it is not generally possible to detect all client failures, it is not generally possible to directly measure the impact of a service outage on the community of clients at large. But estimates are useful for assessing damage and setting priorities, and when we need one we use a MODEL####
Such an active check tests the backend database, the frontend HTTP server, the network between the polling agent and the webserver, and the TCP/IP stack on the polling agent host. This is both a blessing (a successful test indicates that all of these system components are functioning properly) and a curse (diagnosing a failure is complicated by the fact that any of the components could be at fault).
Ideally, both active and passive application monitoring data should be combined with other monitoring data via event correlation to reduce the total number of errors/alerts generated, and to simplify failure diagnosis.
Simple forms of event correlation (in the small) may be programmed into monitoring agents themselves. For example, an agent which is configured to only send one alert every 15 minutes must decide whether or not to send an alert for a continuing problem based on how long it has been since the last alert was sent. Agents are also typically configured to send messages about different types of problems to different alert recipients.
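The 15-minute rule can be sketched as a small piece of per-problem state inside the agent. This is a minimal Python illustration; the class name `AlertThrottle` and its interface are hypothetical, and the caller is assumed to supply the current time in seconds.

```python
from typing import Optional


class AlertThrottle:
    """Allow at most one alert per `min_interval_s` seconds for a continuing problem."""

    def __init__(self, min_interval_s: float = 15 * 60) -> None:
        self.min_interval_s = min_interval_s
        self.last_sent: Optional[float] = None  # time of the last alert, if any

    def should_send(self, now: float) -> bool:
        """Return True (and remember `now`) if enough time has passed since the last alert."""
        if self.last_sent is None or now - self.last_sent >= self.min_interval_s:
            self.last_sent = now
            return True
        return False
```

An agent would typically keep one such object per problem and recipient, and call `should_send(time.time())` each time a check fails.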
More complex forms of event correlation (in the large) are useful for limiting the number of spurious alerts generated during an outage. For example, problems connecting to an HTTP server (detected via application monitoring) should probably be ignored while the network segment to which it is connected is down (detected via network monitoring).
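The HTTP example might look like the following Python sketch. It assumes the correlator already knows which network segment each host sits on and which segments network monitoring currently reports as down; `filter_alerts`, `segment_of`, and the host and segment names are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def filter_alerts(
    http_failures: Iterable[str],
    segment_of: Dict[str, str],
    down_segments: Set[str],
) -> List[str]:
    """Return the hosts whose HTTP failures should still raise an alert.

    Failures on hosts attached to a segment that is known to be down are
    suppressed, since the segment outage already explains them. Hosts with
    no known segment are kept, so that missing topology data never hides
    a real problem.
    """
    return [
        host
        for host in http_failures
        if segment_of.get(host) not in down_segments
    ]


# Example: web1 sits on a segment that network monitoring reports as down.
remaining = filter_alerts(
    ["web1", "web2"],
    {"web1": "seg-a", "web2": "seg-b"},
    {"seg-a"},
)
```

Here only the failure on `web2` survives the filter; the `web1` failure is attributed to the segment outage.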
Note that in order for event correlation in the large to work, all of the following are required:
Note that the distinction between network monitoring and application monitoring becomes arbitrary if the machine in question is a network device, or if an application running on the machine provides a network service. In such cases we're probably talking about *both* network and application monitoring at the same time.
For an overview of voice quality terminology, check any of these resources: