Solving a CPU hog – cacti, PHP, snmp forking, snmp v2 and php-snmp bugs

We run cacti to graph a lot of our resources: traffic on switches, routers and servers, obviously, but also application-level metrics such as the number of online users, transactions submitted to a batch system, or hits on an Apache web server. This amount of monitoring can keep cacti pretty busy. For a long time, that wasn’t an issue. Then last July, we started doing a lot more monitoring for a new application architecture we are hosting for a client, and the load on our monitoring server increased quite a bit.
Yearly Load Average

Back in December, you can see a break in the graph. This is when I upgraded this box from Panther Server to Tiger Server. As you can see, the load really jumped, so I dug into the server to see what was causing it. I kept seeing a larger-than-acceptable load average, but just sitting there watching top, I couldn’t see any one process taking up as much CPU as the load suggested. In top, you can generally add up the CPU % of the top 4 or 5 processes and get close to the total CPU utilization it reports. However, top was showing about 45% CPU used, and the top 4 processes only added up to about 20%. Where was this extra CPU time going? I then noticed that most of the CPU usage was in the “system” % and started watching the process list more closely. It turned out there was a continuous stream of snmpget processes being launched and completing one right after another. Aha! Cacti was forking off snmpget processes to retrieve data from the devices we monitor. No single process was doing much heavy processing, but the sheer amount of forking was causing all this load.
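
Short-lived processes like these are easy to miss in top because each one is gone before it climbs the list. One way to make the churn visible is to compare two process-list snapshots taken a moment apart and count the PIDs that are new. The sketch below is a minimal illustration, assuming output in the style of `ps -axo pid,comm`. The helper names and the sample snapshots are made up for the example and are not taken from the server described here.

```python
def parse_ps(text):
    """Parse `ps -axo pid,comm` style output into {pid: command}."""
    procs = {}
    for line in text.strip().splitlines():
        parts = line.split(None, 1)
        # Skip the header row and any malformed lines.
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        procs[int(parts[0])] = parts[1].strip()
    return procs


def new_processes(before, after, name):
    """Return PIDs running `name` that appear in `after` but not in `before`."""
    old = parse_ps(before)
    new = parse_ps(after)
    return sorted(
        pid for pid, comm in new.items()
        if pid not in old and comm.rsplit("/", 1)[-1] == name
    )


# Two invented snapshots taken a moment apart (illustrative data only).
snapshot_1 = """  PID COMM
  101 /usr/sbin/httpd
  412 /usr/bin/snmpget
"""
snapshot_2 = """  PID COMM
  101 /usr/sbin/httpd
  413 /usr/bin/snmpget
  414 /usr/bin/snmpget
"""

print(new_processes(snapshot_1, snapshot_2, "snmpget"))  # [413, 414]
```

If nearly every snapshot shows fresh snmpget PIDs while none of them ever ranks high in top, the cost is likely in process creation rather than in the work each process does.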

So, what to do about all these forked processes? My first thought was that we were monitoring a lot of SNMP v1 devices. SNMP v1 has no bulk request, so values tend to be fetched one request at a time, whereas SNMP v2 adds a bulk request (GETBULK) that returns a range of values at once. So I went through a round of upgrades, either swapping out older v1-only devices for v2-capable ones or upgrading the IOS on some of our Cisco switches to a version that supports SNMP v2. Interestingly, there is a place on the Cisco site where you can directly download the latest software for some older Cisco switches, like the venerable 2924XL devices.
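
To get a rough feel for why fetching a range of values per request matters, here is a back-of-the-envelope count of requests per polling cycle. It is only a sketch: the function name is hypothetical, and the figures (24 ports, 2 counters per port, 10 values per bulk request) are assumptions for illustration, not measurements from our network.

```python
import math


def requests_needed(num_values, values_per_request=1):
    """Number of SNMP requests needed to fetch `num_values` values."""
    if num_values < 0 or values_per_request < 1:
        raise ValueError("num_values must be >= 0 and values_per_request >= 1")
    return math.ceil(num_values / values_per_request)


# Assumed example: a 24-port switch with in and out octet counters per port.
values = 24 * 2

print(requests_needed(values))      # one value per request: 48
print(requests_needed(values, 10))  # up to 10 values per bulk request: 5
```

When every request is also a separately forked process, cutting the request count cuts the number of forks by the same factor.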

Getting rid of some of the SNMP v1 devices did reduce the number of forked processes generated by cacti. You can see that reflected in the drop on the Load Average graph around the end of December. However, this did not solve the issue to the extent I’d hoped; as you can see, there was still considerable load on the box. The only thing left was to update the PHP on the server to a version with the php-snmp functions built in, so that no forking would be necessary. This meant PHP 5.

I updated the server’s MySQL to MySQL 5.0.x, then updated the cacti install to the latest version. Then I downloaded the PHP 5.2 installer from Mr. Liyanage’s site, made the upgrade, and made the appropriate changes to the php.ini file.
I had done this same setup (PHP 5.2, cacti, etc.) for a client before and had had some issues with the snmp functions built into PHP. There is a function in cacti’s lib/snmp.php file called snmp_get_method($version). Its purpose is to find the best method of calling SNMP, based on the requested version of SNMP and the availability of certain functions or callable executables. The issue I had had was that when I used the cacti interface to poll a device for interfaces to graph traffic for, php-snmp would fail and cacti would give a not very helpful snmp error. At that time, I merely added a line that forced cacti to use the snmp binaries. Now it was time to really track this down.

The first error I encountered was in the PHP error log, from PHP calling snmp2_get() when I made cacti repoll a switch:

Could not open snmp connection: Unknown host

Which was obviously not correct. The second error message I saw in the base apache error log was:

No support for requested transport domain "udp"

So I did some googling and found (primarily on the PHP bug tracking site) that php-snmp tended to work fine in the CLI version of PHP (which is what cacti uses for its normal periodic polling), but gave these errors when called from within the Apache module.

I had also turned on debugging in SNMP, by adding:

doDebugging 1

to the snmp client conf file at /usr/share/snmpd/snmp.conf and watching all the relevant logs. SNMP outputs a LOT of info, and I had already gotten an idea of what the basic problem was, so I turned debugging back off.
OK, the basic issue here was that php-snmp works in PHP cli, but not in Apache. I guess you could call it a hack, but I merely added this line to the top of the snmp_get_method() function:

if (!empty($_SERVER['HTTP_HOST'])) return SNMP_METHOD_BINARY;

which basically forces cacti to call the command-line snmp binaries whenever there is an HTTP Host header. That is only the case when this function is called from within Apache, which is only when you are configuring your devices and data sources. (A lot of witches, there.) At all other times, the function continues on and chooses the php-snmp built-ins.

So after all this debugging and tweaking, what’s the result? The polling run, which used to take up to 3-4 minutes to complete when cacti polled all our devices by forking snmpget calls, now takes just under 14 seconds (with only 2 concurrent poller processes). As a result, the load on that box is now down considerably:

Daily Load Average

MUCH better.

And the other things we have running on that server now run much more quickly.
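
As a recap, the decision that the one-line change to snmp_get_method() introduces can be modeled in a few lines. This is a simplified Python model and not cacti’s actual code. The constant values, the `server_vars` dictionary standing in for PHP’s `$_SERVER`, and the `php_snmp_available` flag standing in for cacti’s own availability checks are all assumptions made for illustration.

```python
# Placeholder labels for the two methods; cacti defines its own constants.
SNMP_METHOD_PHP = "php-snmp"
SNMP_METHOD_BINARY = "binary"


def snmp_get_method(server_vars, php_snmp_available=True):
    """Pick how to talk SNMP, mirroring the one-line guard in the post."""
    # Mimic PHP's !empty(): unset, empty string, and "0" all count as empty.
    if server_vars.get("HTTP_HOST") not in (None, "", "0"):
        # Called from a web request (Apache): fall back to the snmp binaries.
        return SNMP_METHOD_BINARY
    # Hypothetical stand-in for cacti's checks on the built-in functions.
    if php_snmp_available:
        return SNMP_METHOD_PHP
    return SNMP_METHOD_BINARY


print(snmp_get_method({"HTTP_HOST": "cacti.example.com"}))  # binary
print(snmp_get_method({}))                                  # php-snmp
```

The web interface only polls when you configure devices, so forking there is cheap. The periodic poller runs under the CLI, where no Host header exists, and it keeps the fast built-in path.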

2 Comments

  1. Mattbot April 20, 2007

    Hi, thanks for posting this, it solved one of the problems I’ve been working on trying to setup Cacti on Mac OS X.

  2. Raphaël Gertz April 1, 2010

    I had a similar problem with snmp2_get, i just got false every time.

    I solved the problem by increasing the timeout to 100000 and it fixed my problem.
