Background: Know: SNMP trap, service, failure, error, fault, fault tolerance, Recognize:




Why multiple symptoms of one problem?


Event correlation is a technique for making sense of a large number of events and pinpointing the few that really matter in that mass of information. Its goal is to identify a single problem event in one resource that may cause many symptom events in related resources. A symptom here is an observable effect of a fault at the system boundary. There are several situations in which multiple symptoms may arise from just one problem.
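The idea can be sketched in a few lines of Python. This is a minimal illustration rather than a real correlator: the dependency map `DEPENDS_ON`, the resource names, and the function `correlate` are all hypothetical, and the sketch assumes that each resource depends on at most one upstream resource and that an event on an upstream resource explains any event downstream of it.

```python
# Hypothetical topology: each resource maps to the resource it depends on.
DEPENDS_ON = {
    "host-a": "switch-1",
    "host-b": "switch-1",
    "switch-1": "router-1",
    "router-1": None,
}


def correlate(events):
    """Return the resources whose events are not explained by an upstream event."""
    failing = set(events)
    roots = set()
    for resource in failing:
        # Walk upstream; if any ancestor also reported an event,
        # treat this event as a symptom rather than a problem.
        parent = DEPENDS_ON.get(resource)
        explained = False
        while parent is not None:
            if parent in failing:
                explained = True
                break
            parent = DEPENDS_ON.get(parent)
        if not explained:
            roots.add(resource)
    return roots


events = ["host-a", "host-b", "switch-1"]
print(correlate(events))  # {'switch-1'}
```

Three events are reduced to one candidate problem event, because the two host events lie downstream of the switch that also reported an event.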

Fault Tolerance

Fault tolerance refers to how an operating system responds to a hardware or software failure. Fault tolerance is essentially a system’s ability to allow for failures or malfunctions, and this ability can be provided by software, hardware or a combination of both. Some computer systems have two or more duplicate systems in order to handle faults gracefully.
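As a rough illustration of duplicate systems handling a fault gracefully, the sketch below tries a list of replicas in order and returns the first successful answer. The names `call_with_failover`, `primary`, and `backup` are hypothetical, and the replicas are plain Python functions standing in for real systems.

```python
def call_with_failover(replicas, request):
    """Try each replica in order and return the first successful result."""
    last_error = None
    for replica in replicas:
        try:
            return replica(request)
        except Exception as exc:  # a real system would catch narrower errors
            last_error = exc
    raise RuntimeError("all replicas failed") from last_error


def primary(request):
    # Simulated failure of the primary system.
    raise ConnectionError("primary is down")


def backup(request):
    # The duplicate system that takes over.
    return f"handled {request}"


print(call_with_failover([primary, backup], "ping"))  # handled ping
```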

Symptoms themselves may be unreliable in two ways:

1. Lost:

  • Transmission error: even if data has been entered correctly in a system, it can still become corrupted when it is transmitted.
  • Network congestion: when a node or link carries too much data traffic, the network transmission rate may slow down or data may start to be lost, leading to low quality of service. This results in packet loss and delay.
  • Unreliable transport: SNMP traps are carried over UDP, which can be unreliable across misbehaving networks.

2. Erroneously generated:

As a simple example, a malfunctioning cabinet door sensor may generate several symptoms. For instance, when the cabinet hinges are out of balance, the door may not close properly, may look crooked, or in some cases may move less freely.
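For the lost-symptom case, a small calculation shows why unreliable transport matters once many traps are involved. The sketch below estimates the chance that at least one of several traps goes missing. It assumes that each trap is lost independently with the same probability, which is a simplification; the function name `prob_any_lost` and the 1% loss rate are illustrative only.

```python
def prob_any_lost(n_traps, loss_rate):
    """Probability that at least one of n_traps is lost, assuming independent losses."""
    return 1.0 - (1.0 - loss_rate) ** n_traps


# Illustrative loss rate of 1% per trap.
for n in (1, 10, 100):
    print(n, round(prob_any_lost(n, 0.01), 3))
```

Even with a small per-trap loss rate, the chance of missing at least one symptom grows quickly with the number of traps sent.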

Root cause

A root cause found through event correlation may bring about multiple symptoms, even though it may not be directly observable by either users or network managers. The network manager (NM) therefore needs to work through the symptoms and eliminate candidates to determine the real cause.

In particular, service failure symptoms reported by users do not by themselves identify the root cause. For instance, when the network does not work, the cause is not necessarily that switch XYZ has been turned off. If the network access configuration is wrong, such as an incorrect TCP/IP configuration, the user will also be unable to connect as usual.

A symptom may have multiple potential causes

Take "the network doesn't work" as an example again. If only one user reports the issue, it may still have several possible causes, such as a wrong network configuration or a faulty physical link. However, when many users report the same symptom, a shared root cause becomes more likely. For example, the symptom may result from a problem at an upstream router, such as a worm.
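This narrowing-down can be sketched as a set intersection: the more users report the same symptom, the fewer resources remain that could explain all of the reports. The `PATHS` table, the user and resource names, and the function `candidate_causes` are hypothetical, and the sketch assumes a single shared fault.

```python
# Hypothetical paths: the resources each user's traffic depends on.
PATHS = {
    "alice": {"alice-pc", "switch-1", "router-1"},
    "bob": {"bob-pc", "switch-1", "router-1"},
    "carol": {"carol-pc", "switch-2", "router-1"},
}


def candidate_causes(reporting_users):
    """Return the resources shared by every reporting user's path."""
    paths = [PATHS[user] for user in reporting_users]
    return set.intersection(*paths) if paths else set()


print(sorted(candidate_causes(["alice"])))
# ['alice-pc', 'router-1', 'switch-1']
print(sorted(candidate_causes(["alice", "bob", "carol"])))
# ['router-1']
```

With one report, anything on that user's path is a candidate; with three reports, only the upstream router is common to all of them.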

See also

Corresponding TELE9752 lecture slides


1. Clemm: Network Management Fundamentals, Cisco Press, 2006

2. Comer: Automated Network Management Systems: Current and Future Capabilities, Pearson, 2006

3. Fault tolerant systems

4. A survey of Event Correlation Techniques and Related Topics