[Linux-aus] Looking for linux app....

Fri May 25 00:29:45 UTC 2007

On Thu, 2007-05-24 at 22:33 +1000, Randall Crook wrote:
> Way back when, 6 or 7 years ago, I played with a tool called (I think)
> tkined. It was tied up with another tool called scotty.

It depends what aspect of Scotty you liked.

If you liked its ability to cheaply poll lots of SNMP variables across a
network, then Torrus is good but has a high setup cost (the assumption
is you're an ISP where the cost of setting up the templates is nothing
as long as adding a new router is a one-liner). It will also generate
traps (and thus link to Nagios, etc).

If you want something to do graphs with a low setup cost then Cacti is
good. Downside is that it's almost impossible to extend, so it's nice
for small networks but not serious ones.

Both of these use RRDtool as the database, so you can then use tools
like drraw against them.

If you want network status monitoring with limited pretty graphs, then 
Nagios. It has a world of plug-ins, some good and useful, some not.
Good product, and would be my choice for monitoring a small network.
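
As a rough illustration of what a Nagios plug-in boils down to: it
prints one line of status text and exits with 0 (OK), 1 (WARNING),
2 (CRITICAL) or 3 (UNKNOWN). The sketch below covers only the
threshold logic; the function name, label and thresholds are made up
for the example and are not from any particular plug-in.

```python
# Nagios plug-in exit codes (the standard plug-in convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_threshold(label, value, warn, crit):
    """Return (exit_code, status_line) for one polled value.

    `label`, `warn` and `crit` are hypothetical names for this sketch.
    A value of None means the poll returned nothing.
    """
    if value is None:
        return UNKNOWN, f"{label} UNKNOWN - no data"
    if value >= crit:
        state = CRITICAL
    elif value >= warn:
        state = WARNING
    else:
        state = OK
    return state, f"{label} {STATE_NAMES[state]} - value={value}"
```

A real plug-in would print the status line and pass the code to
sys.exit(); how the value gets polled is left out here.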

If you liked the programming interface then you are a bit stuffed.
None of the replacements have such a nice API into regular SNMP
polling.

There's nothing that will do the HP OpenView trick of generating
a network map from SNMP autodiscovery. That never worked anyway,
you always spend more time tuning the graph than starting from
scratch -- Tkined was particularly irritating that way.

SNMP-based network maps have fallen out of favour as a network
operations tool.  A simple log console of open faults is what's
commonly used.  A "weathermap" will give the 11,000m view of
the network, mainly useful for spotting congested trunk capacity.
RRDtool is used to archive information for capacity planning and
for creating a benchmark for fault analysis (eg, "do we usually
see CRC errors on this link").
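
The "do we usually see CRC errors on this link" question can be
sketched as a comparison against archived samples. This is only an
illustration: it assumes the history has already been pulled out of
the archive as a plain list of per-interval error counts, and the
mean-plus-three-deviations rule is my choice for the example, not
something RRDtool does for you.

```python
from statistics import mean, pstdev


def is_unusual(history, current, k=3.0):
    """Return True if `current` is well above what the archive shows as normal.

    `history` is a list of past per-interval error counts (assumed input);
    `k` is an arbitrary sensitivity factor chosen for this sketch.
    """
    if not history:
        # No baseline at all: treat any error as worth a look.
        return current > 0
    baseline = mean(history)
    spread = pstdev(history)
    return current > baseline + k * spread


# A link that has never shown errors, and one that always shows a few.
quiet_link = [0, 0, 0, 0, 0, 0]
noisy_link = [40, 55, 38, 61, 47, 52]
```

With these made-up samples, 5 errors on quiet_link is flagged while
50 errors on noisy_link is not, which is the point of keeping the
archive.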

In general, select a tool that lets you graph a *lot* of variables.

Interfaces:
 - carrier status
 - in/out packet/octet/flow
 - errors, one graph per error type (if you aggregate them you
   can't tell apart media errors and misconfiguration)
 - 95th percentile of in/out (easily answers "is this a DoS attack")
 - number of filtered packets/octets that would have been
    - forwarded
    - delivered to control plane
 - if ADSL, line gains and count of channels used.
Environment:
 - all available temps, fan speeds, current and voltage
 - CPU utilisation
 - memory use
 - run queue length
Routing:
 - number of neighbours
 - protocol state for each neighbour
 - number of accepted/filtered routes from each neighbour
 - entries in each routing protocol's routing table
    - if OSPF, number of entries in each area
 - size of forwarding table
 - if a simple BGP customer, presence of default route
   from ISP (in fact, this indication alone makes it
   worthwhile for customers to run a simple BGP connection,
   since you can now classify upstream failure into link
   versus network issues).
Spanning tree (I'd suggest running a modern single spanning-tree
across your switched network):
 - interface forwarding state.
 - bridge state.
 - number of neighbours.
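
For the 95th percentile item above, a minimal sketch of the usual
calculation: sort the samples, throw away the top 5%, and take the
highest one left. The second function is one possible reading of the
"is this a DoS attack" remark, namely comparing the current rate with
the usual 95th percentile; the factor of 2 is an arbitrary assumption
and both function names are invented for the example.

```python
def percentile_95(samples):
    """Return the 95th percentile: the highest sample after dropping the top 5%."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer arithmetic for ceil(0.95 * n) - 1, avoiding float rounding.
    index = (95 * len(ordered) + 99) // 100 - 1
    return ordered[index]


def exceeds_usual_peak(current, samples, factor=2):
    """Return True if `current` is more than `factor` times the usual 95th percentile.

    `factor` is an arbitrary threshold chosen for this sketch.
    """
    return current > factor * percentile_95(samples)


# Made-up traffic samples (say, Mbit/s per polling interval).
samples = list(range(1, 101))
```

Here percentile_95(samples) is 95, so a current rate of 300 would be
flagged and a rate of 100 would not.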

You can easily get 100 graphs per box. But when you have a fault
it becomes very easy to track down the root cause of the issue.
You can also see why the tools all use templates: you really
only want to set up the list of variables once per type of
hardware.
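
The template idea can be sketched in a few lines. This is not how
Torrus or Cacti store their templates; it only shows the shape of the
thing. The hardware type names and hostnames are hypothetical, and
the variables are standard IF-MIB object names, treated per device
rather than per interface to keep the example short.

```python
# One list of variables per type of hardware (type names are hypothetical).
TEMPLATES = {
    "edge-router": [
        "ifOperStatus",
        "ifHCInOctets",
        "ifHCOutOctets",
        "ifInErrors",
        "ifOutErrors",
    ],
    "access-switch": [
        "ifOperStatus",
        "ifInErrors",
        "ifOutErrors",
    ],
}

# Adding a new box is a one-liner here (hostnames are made up).
DEVICES = [
    ("rtr1.example.net", "edge-router"),
    ("sw1.example.net", "access-switch"),
]


def graphs_for(devices):
    """Expand (hostname, hardware type) pairs into (hostname, variable) targets."""
    return [(host, var) for host, hw in devices for var in TEMPLATES[hw]]
```

graphs_for(DEVICES) yields eight graph targets from two lines of
device configuration; the setup cost lives in TEMPLATES.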

On a medium-sized customer network I'd also: run something to
produce NetFlow and grab that with flow-tools; run Snort and
feed that into Nagios; configure Nagios to send SMS notifications.

I've seen a network that also used SNMP to monitor Linux end systems.
It was very impressive. Vendors would be sent to replace hard drives
when SMART tests failed, not when the machine (used for customer
service) failed.