
Building a Network Management System

This article looks at current NMS offerings and considers what a "real" NMS would need to provide, and how one might be built.

What is an NMS?

NMS normally stands for "Network Management System". This is nice and easy to say, but very hard to pin down to an exact specification. What constitutes a well-rounded NMS?

I believe it to consist of at least:

  1. Up/Downtime Monitoring
  2. Reporting
  3. Configuration Change Management
  4. IP/Asset Management
  5. Security
  6. Event Correlation/Root Cause
  7. Alerting

There are a large number of Free/Open Source Software and commercial systems that claim to be NMSes, but none come close to covering all this functionality. Typically, systems fall into either a Network Monitoring (Up/Down) or Network Reporting role, not both.

NMS Generations

The types of systems available can be crudely categorized into three distinct generations:

  1. Pure Up/Down Monitoring.
    Typically with just ICMP, but some with applications (DNS, HTTP, etc.).
  2. Event correlation.
    Polling using SNMP, ICMP, and applications. Alerting on SNMP traps and syslog.
  3. Root Cause Analysis.
    Advanced event correlation to keep false positive alerts to a minimum.

Event Correlation/RCA

Event correlation is the core functionality of an NMS. Without it, too many false positive alerts are generated, which makes the system ineffective.

Root Cause Analysis (RCA) takes event correlation a step further. Rather than just dampening alerts from nodes downstream of an existing problem, it alerts only on the real cause of a problem, which significantly reduces the time needed for a fix.
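
As a rough illustration of the difference, the sketch below suppresses alerts for nodes that sit behind another failed node and reports only the failures with a healthy upstream path. The topology, node names, and the `root_causes` function are all hypothetical, and a real RCA engine would need far more than a simple parent/child tree.

```python
from typing import Dict, Optional, Set

# Hypothetical topology: each node maps to its upstream parent
# (None for the node closest to the NMS).
PARENT: Dict[str, Optional[str]] = {
    "core-rtr": None,
    "dist-sw1": "core-rtr",
    "access-sw1": "dist-sw1",
    "web01": "access-sw1",
    "web02": "access-sw1",
    "dist-sw2": "core-rtr",
    "db01": "dist-sw2",
}


def root_causes(down: Set[str], parent: Dict[str, Optional[str]]) -> Set[str]:
    """Return the down nodes whose entire upstream path is still up."""
    causes = set()
    for node in down:
        upstream = parent.get(node)
        # Walk towards the NMS; stop at the first ancestor that is also down.
        while upstream is not None and upstream not in down:
            upstream = parent.get(upstream)
        if upstream is None:
            causes.add(node)  # no failed ancestor, so this is a real cause
    return causes


unreachable = {"access-sw1", "web01", "web02", "db01"}
print(sorted(root_causes(unreachable, PARENT)))  # ['access-sw1', 'db01']
```

Four nodes are unreachable, but only two alerts are raised: web01 and web02 are hidden behind access-sw1, while db01 has nothing failed upstream of it.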

Efficient/Intelligent Polling

Currently, a typical NMS platform will consist of two main systems, with one solution doing the Up/Down monitoring, the other the reporting. This leads to extremely inefficient double polling of devices. Why ping a host to see if it's up when you've just gathered interface stats from it? Some systems can be integrated to help reduce this double polling, but only a single NMS solution will truly provide the most efficient use of the network.
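
One way to picture the saving: if the reporting side records every successful response from a device, the up/down side can skip its ping whenever that record is fresh enough. The following is a minimal sketch under that assumption; the `ReachabilityCache` class and the 60-second freshness window are invented for illustration and do not come from any existing NMS.

```python
from typing import Dict


class ReachabilityCache:
    """Remember when each host last answered any kind of poll."""

    def __init__(self, freshness_seconds: float = 60.0) -> None:
        self.freshness_seconds = freshness_seconds
        self._last_seen: Dict[str, float] = {}

    def record_response(self, host: str, timestamp: float) -> None:
        # Called by the stats collector after a successful poll of the host.
        self._last_seen[host] = timestamp

    def needs_ping(self, host: str, now: float) -> bool:
        # Ping only if we have no recent proof that the host is alive.
        last = self._last_seen.get(host)
        return last is None or (now - last) > self.freshness_seconds


cache = ReachabilityCache(freshness_seconds=60.0)
cache.record_response("router1", timestamp=1000.0)  # interface stats gathered

print(cache.needs_ping("router1", now=1030.0))  # False: seen 30 s ago
print(cache.needs_ping("router1", now=1100.0))  # True: last seen 100 s ago
print(cache.needs_ping("router2", now=1030.0))  # True: never seen
```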

To map, or not to map?

The traditional NMS provides a network map so that operators can point and click through to any problems. Some systems have dropped this functionality, claiming that operators only really need to be told what the real problems are. These are typically Second Generation event correlation engines that just provide a list of problems for the operator.

However, no matter how advanced the logic is in an NMS, it cannot cover all problems, and providing a visual representation for operators to work with can provide major gains. The human brain works best with visual images rather than the written word. NMSes need a map!

It's all about the Man-Machine Interface!

Aside from alerting (via email, SMS, etc.), how should an NMS interface with the operators? There are two distinct camps: dedicated GUI and HTTP. A growing number of HTTP interfaces (typically with some Java thrown in) are being used.

While this type of interface may have its uses, it is not the best medium in an operational environment. A dedicated GUI is the only way to provide a fast, efficient, reliable mechanism for operators to interact with an NMS.

Putting the M in NMS

The M stands for Management, but what's being managed, exactly? Network problems, mainly. A single generic management interface is somewhat of a holy grail that some people have been chasing. Is it achievable?

How far should management be taken? Many vendors have proprietary management software for their systems to provide an alternative to the command line. Should an NMS allow full management of a device without having to resort to a CLI? Some things can be done easily by SNMP, but what interaction should an NMS have with a device's CLI? RANCID provides an easy change management system for routers, but also shows the possibilities of integrating functions into an NMS that are typically done at the CLI level.

Think of being able to make mass changes (for example, SNMP community changes) via a few clicks in a GUI, rather than having to log in manually to thousands of devices.

Current Solutions

F/OSS

I'll mention the commercial solutions as well, as they typically have far better Man-Machine Interfaces. This is a typical problem with F/OSS, as programmers don't usually make good UI engineers.

Commercial

  • HP OpenView
  • SMARTS
  • Aprisma
  • Netcool
  • Concord
  • Proviso
  • InfoVista

Recreation or integration?

The beauty of F/OSS is that we have a huge, growing repository of code. So, do we start coding the "perfect" NMS from scratch, or use the tools already provided and just integrate the functionality we require?

Some commercial vendors make big claims about how their code is multi-threaded and "industrial strength". Producing good, clean, efficient code that will run on many platforms and is part of a large system is hard to do! Such a large, complicated system can also be extremely hard for new coders to get into. Keeping the functionality compartmentalized into small programs can ease these problems. This ties in well with using existing toolsets and just concentrating on an integration issue.

OSSIM is taking the integration approach, and it is well worth watching how well this works. Obviously, the double polling issue rears its head here, and would be a serious limiting factor in any large implementation. Although OSSIM is coming from a security requirements background, it offers an example for the creation of a "proper" F/OSS NMS. How much work is involved in integrating Nagios with RRDtool? Could the cheops-ng GUI be used as the frontend for Nagios?

How do our original NMS requirements map to existing F/OSS projects?

  • Up/Down Monitoring: Nagios, Big Brother
  • Reporting: MRTG, RRDtool
  • Configuration Change Management: RANCID
  • IP/Asset Management: Northstar
  • Security: Snort, Tripwire
  • Alerting: Sendmail, etc.

Most of the functionality is covered across a number of projects. This only leaves Root Cause Analysis. Unfortunately, this is probably one of the hardest things to do.

To Poll, or not to Poll?

First generation NMSes like Nagios and Big Brother rely on polling, via ICMP or an application-specific method (HTTP, FTP, etc.), to do their up/down monitoring. Unfortunately, this really isn't network management. It's just node polling, and has major disadvantages.

To poll means there is a polling interval. What is the state of your network during these intervals? Actively polling the network is also a major scalability problem. The larger your network, the more polling required. Active polling systems are fine for monitoring a handful of systems, but to manage a network, you have to look at other mechanisms.
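
Some back-of-the-envelope arithmetic makes both problems concrete. The figures below (three checks per node, a five-minute interval, a five-second timeout with two retries) are assumptions chosen purely for illustration, as are the two helper functions.

```python
def polling_load(nodes: int, checks_per_node: int, interval_seconds: float) -> float:
    """Average probes per second needed to check every node once per interval."""
    return nodes * checks_per_node / interval_seconds


def worst_case_detection_delay(
    interval_seconds: float, timeout_seconds: float, retries: int
) -> float:
    """Longest a failure can go unnoticed, assuming retries run one after another:
    a full interval passes, then the first try and every retry time out."""
    return interval_seconds + timeout_seconds * (retries + 1)


for nodes in (50, 5_000, 50_000):
    rate = polling_load(nodes, checks_per_node=3, interval_seconds=300)
    print(f"{nodes:>6} nodes -> {rate:8.1f} probes/s")

delay = worst_case_detection_delay(300, timeout_seconds=5, retries=2)
print(f"worst-case detection delay: {delay:.0f} s")
```

The probe rate grows linearly with the number of nodes, and shortening the interval to detect failures sooner raises it further, which is the trade-off that pushes larger networks towards event-driven mechanisms.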

This is where systems such as OpenNMS and JFFNMS come in. These are realtime event-driven systems. Events are typically from SNMP traps, but can come from other sources such as syslog. There is no polling interval as such in these systems. If a node goes down, an SNMP trap is generated by the switch immediately. You now have true realtime network monitoring.

Of course, SNMP traps are typically not generated on application failures. Most NMSes will resort back to polling to monitor applications.
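
The event-driven model can be sketched as a state table that is updated only when something happens, with no polling loop at all. This is a minimal sketch, not how OpenNMS or JFFNMS are implemented: the `Event` and `StateTable` types are hypothetical, and the event kinds simply borrow the names of the standard linkDown and linkUp traps.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Event:
    source: str      # where the event came from, e.g. "snmptrap" or "syslog"
    node: str
    interface: str
    kind: str        # "linkDown" or "linkUp"
    timestamp: float


class StateTable:
    """Track interface state from incoming events instead of polling."""

    def __init__(self) -> None:
        self._state: Dict[Tuple[str, str], Tuple[str, float]] = {}

    def apply(self, event: Event) -> bool:
        """Record the event; return True if the state changed (worth an alert)."""
        key = (event.node, event.interface)
        new_state = "down" if event.kind == "linkDown" else "up"
        old_state = self._state.get(key, ("up", 0.0))[0]  # unknown means up
        self._state[key] = (new_state, event.timestamp)
        return new_state != old_state

    def down(self) -> List[Tuple[str, str]]:
        return sorted(key for key, (state, _) in self._state.items() if state == "down")


table = StateTable()
events = [
    Event("snmptrap", "sw1", "Gi0/1", "linkDown", 10.0),
    Event("syslog", "sw1", "Gi0/1", "linkDown", 10.2),   # duplicate report
    Event("snmptrap", "sw1", "Gi0/2", "linkDown", 11.0),
    Event("snmptrap", "sw1", "Gi0/1", "linkUp", 42.0),
]
for event in events:
    if table.apply(event):
        print(f"ALERT {event.node} {event.interface} {event.kind} via {event.source}")

print(table.down())  # [('sw1', 'Gi0/2')]
```

The duplicate syslog report of the same failure changes nothing and raises no second alert. Application checks would still have to be fed in by a poller, as noted above.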

The Next Generation F/OSS NMS?

It would be nice to see better support for enterprise/carrier-grade functionality in F/OSS NMSes, such as support for bulkstats, NetFlow, and RCA.

However, there is something I have not seen either F/OSS or commercial systems using: host/network sniffing. Having a local host-based sniffer or a dedicated sniffer on a mirrored switch port could yield enormous gains for NMSes:

  • Network efficiency: No polling! No extra traffic is generated, as it relies on seeing exactly what's happening on the network.
  • Spotting problems immediately: It sees TCP RSTs, switch ports losing carrier signal, etc.
  • Real graphing: Not from graphing host to destination, but actual "real world" traffic.
  • The ability to track full user QoS: Tied into the network authentication platform (RADIUS, et al), it can give real world user QoS reporting.
  • Extra functionality: Massive potential for per-IP-block monitoring/reporting, etc.

It's fast, flexible, distributed, and scalable!

Developing an NMS-centric pcap-based sniffer seems like the way forward. It could be easily integrated with current systems by being developed separately, and just generating SNMP traps when required.
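
To give a feel for how little logic the detection side needs, here is a minimal sketch that spots a TCP RST in a raw IPv4 packet using only the Python standard library. It assumes a capture library has already delivered the packet with its link-layer header stripped; the capture itself and the SNMP trap generation are not shown, and `tcp_reset_endpoints` and `build_sample` are hypothetical helper names.

```python
import socket
import struct
from typing import Optional, Tuple

TCP_PROTOCOL = 6   # IPv4 protocol number for TCP
RST_FLAG = 0x04    # RST bit in the TCP flags byte


def tcp_reset_endpoints(ip_packet: bytes) -> Optional[Tuple[str, int, str, int]]:
    """Return (src, sport, dst, dport) if the IPv4 packet is a TCP RST, else None."""
    if len(ip_packet) < 20:
        return None
    version = ip_packet[0] >> 4
    header_len = (ip_packet[0] & 0x0F) * 4  # IHL is counted in 32-bit words
    if version != 4 or header_len < 20 or ip_packet[9] != TCP_PROTOCOL:
        return None
    tcp = ip_packet[header_len:]
    if len(tcp) < 14 or not tcp[13] & RST_FLAG:
        return None
    sport, dport = struct.unpack("!HH", tcp[:4])
    src = socket.inet_ntoa(ip_packet[12:16])
    dst = socket.inet_ntoa(ip_packet[16:20])
    return src, sport, dst, dport


def build_sample(flags: int) -> bytes:
    """Build a bare IPv4+TCP packet for the demo (checksums left as zero)."""
    ip = struct.pack(
        "!BBHHHBBH4s4s", 0x45, 0, 40, 0, 0, 64, TCP_PROTOCOL, 0,
        socket.inet_aton("192.0.2.10"), socket.inet_aton("192.0.2.20"),
    )
    tcp = struct.pack("!HHIIBBHHH", 40000, 80, 0, 0, 5 << 4, flags, 0, 0, 0)
    return ip + tcp


print(tcp_reset_endpoints(build_sample(0x04)))  # ('192.0.2.10', 40000, '192.0.2.20', 80)
print(tcp_reset_endpoints(build_sample(0x10)))  # None (plain ACK)
```

A sniffer built this way would feed matches into whatever raises the trap, keeping the capture component separate from the rest of the NMS as suggested above.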

Recent comments

02 Mar 2011 00:11 Avatar sven_nestle

Yea linux Nag is right as far as inner networking. I know it well. But it's completely wrong for the "big corporate" world.

Do you have secure access to teller machines? No. Will they ever tell you what network they use? No.

They took apart the real internet and left a phony one in its place. They aren't giving up for nothing. And the gov. is even spending huge money developing a "second internet". You heard me right. An internet that only gov. workers can access and develop.

02 Mar 2011 00:07 Avatar sven_nestle

Well true and NOT TRUE. Not true at all.

Have you ever read about various telco softwares? I have a manual for one that does the whole trip (tel, data, tv, phone) and was used in Canada.

Don't forget the telephone company has telco software and hardware and they aren't going to play nice with your linux PC. They and Cisco are going to cut you off however they can to prevent competition (like linux blade routers competing in the tel. / data / video / phone / satellite uplink - tv / cell market.)

How popular would a lot of these TV stations be if anyone could broadcast? The technology is there; it is simply blocked off to competition.

For example: IP. I could get an IP but they block it. Mail. I could host mail but they block it. BROADCAST. Extremely critical if you want to make it big: it's very difficult to obtain / money.

You could access "the internet". They are blocking you out.

10 Oct 2007 15:03 Avatar webmistress

GroundWork Monitor wasn't included here
GroundWork Monitor Open Source is a great option: http://www.groundworkopensource.com/

06 Sep 2006 09:26 Avatar MadEyeMoody

A Real NMS
Most Developers out there today are all jacked up about SOA. It's easy to program, uses SSL / HTTPS for security, and it's becoming very prolific. When you throw in J2EE and JMS, you now have all your Dev guys drooling.

Some of this stuff just doesn't work in certain cases. For example, let's say you have a process that collects performance data on a device in clumps. Like Netflow data. Data sets are huge. And you want your data keyed correctly so that it is usable and functional. So, you end up encoding Netflow data into XML records. This becomes a huge behemoth across the wire, as not only is each record delimited, it is also escaped. For example, you use a field called ACMEVALUE. In XML speak, that's:

<ACMEVALUE>1234567890</ACMEVALUE>

So now, you've added a lot more data to the dataset for the sake of flexibility. And this really adds up across the wire!

The second problem is that it takes a long time to process. A SOAP transaction cannot be completed until all of the data is encapsulated in the SOAP envelope. This may take an inordinate amount of time and blocks vital resources during the process.

In SOA, when you start using stuff like their publish and subscribe in near real time, it ends up blocking during the IO phases, which slows down everything and makes it non-scalable in large environments.

Additionally, everyone is wrapped up around a CMDB concept as introduced by ITIL. Not to say a CMDB won't work.... In some cases, a CMDB is used... It's only localized. Think about Windows registries and you get the gist. Stuff happens too fast on some levels of the data to keep this data in a centralized spot. You end up having to mix data elements and locations dependent upon the volatility and usefulness of the data elements themselves.

SNMP, when you look at it, is a schema for a highly distributed database where the data access mechanism is accomplished via SNMP versus something like SQL*Net or ODBC and SQL.

The thought of a Real NMS is evolving very quickly - almost haphazardly. Yet the technology being used to do the next generation NMS systems lacks stability and may not be very scalable. It has been said that Corporate America is spending a huge amount of money to convert all their applications to SOA and Java, only to lose functionality, stability, and scalability. And worse yet, they are offshoring the coding in many cases, making it impossible to support in the future, with offshoring of support as well!

06 Sep 2006 07:47 Avatar MadEyeMoody

OpenNMS should be added...
OpenNMS is doing very well these days. It should be part of the list.
