My Blog – Page 2 – R-fx Networks

March 24, 2011September 24, 2018

On The Road: Network Disaster & Dual Public-Private Network

As an administrator within a mid-sized organization, you can find yourself wearing many occupational hats, which becomes only second nature after awhile. One of these many hats I wear, is that of lead network administrator, which is something I am particularly fond of… I love networking and everything about it (except maybe wiring racks and crimping :|).

Today many data center networks are designed in a dual public-private network setup, which simply put is you have a private network parallel to your public network — effectively you run two cat6 copper runs to all racks and servers. The traditional concept behind this is that your servers and/or server customers receive all the benefits a private network entails; unlimited server-to-server traffic, gigabit server-to-server data transfers, secure communication for dedicated back-end application or database servers, out-of-band VPN management, fast off-server local-network backups, relieves congestion on the public network and the list goes on. The costs to run cat6 are about $0.12/ft, it is cheap, provides for a more robust and flexible network environment and simply put is just good practice that leased/colo server customers like to see (we run our private network to all customer servers at work, free of charge!).

There however is an unconventional usage to this dual public-private network implementation that can very well save you some serious headaches, make you the hero of the day and gives true meaning to thinking outside the box.

Our fateful day begins on a beautiful spring day in May of 2009 (how cliche does that sound?). I was just getting on the road heading from Troy, MI to Montreal, Quebec which is about an 8 hour drive. I had to return home to deal with a family emergency and my boss Bill was great about it, offering to drive me to Montreal… One problem though, being a smaller organization our staffing in Troy for the data center pretty much consisted of Bill and myself. Not to worry Bill said!, A quick phone call later and we managed to secure a commitment from one of the data center owners to respond to any hardware events we might have while on the road — after all its only 8 hours away and Bill expected to be back the next day after dropping me off, what could go wrong in a single day!

So, off we went, setting out on the road, we quickly made our way out of Michigan enjoying the wonderfully scenic view of the Ontario landscape (read: a whole lot of nothing). A few stops later for junk food, restroom breaks, and an obligatory stop at one of Canada’s great attractions, Tim Horton’s :), we now found ourselves about 3 or so hours into the journey. Then it happened, Bill’s pager started to go off, seconds later, mine started to go off – something blew up. The laptop bag quickly got pulled out from under my seat as Bill drove and I began to pound away at my keyboard, trying to figure out what was going on. We had multiple servers reporting down, which quickly lead to the realization that all the downed servers were on the same physical rack. The first thing I thought was maybe an APC strip (power outlet strip) tripped or failed, but then I tried to ping some of the private IP’s for the downed servers and they were responding. The conclusion was instant when I saw the private IP’s responding – we just lost a public switch – CRAP!.

Immediately a call was placed to the data center hands that we had thought, and who had committed too, being available to assist us in the event of any kind of failure but we knew this was something he may not even be able to handle. It did not matter, when we needed him, he was nowhere near the data center and couldn’t get there, some commitment that was! At this point Bill took the first U-Turn possible and we started the drive back to Michigan, some 3 hours away, with a rack of servers down from a failed public switch. We were at this point left with our thumbs up our butts, both Bill and myself quietly freaking out in our heads and ultimately unable to do anything, our on-call data center hands failed us and the only other two people who could do anything about it were sitting in the truck 3 hours away.

I sat there, in the truck, contemplating all sorts of things, hoping a power cycle of the switch would work, but sure enough, it did not, that would have been too simple! The more I thought about things, the more I kept returning to the private network, we have all these downed servers and they are responding on the private network…. Then it hit me, why don’t I try route traffic for the downed servers, through a new gateway, a gateway I create on the private network! That was it, I got to work frantically, Bill asking me what I was doing, and me saying only that I am trying something, to give me a few minutes — all the while Bill is still driving back towards Michigan.

The plan was fairly simple, in my head it seemed like that anyways. I would take a server from anywhere else on the network and temporarily use it as a Linux routing/gateway server (think: Windows Internet Connection Sharing) by enabling ip forwarding so it can forward/route packets. Then I would set a static route on the affected servers telling them to route traffic for the public IP network through to the private IP of the designated temporary gateway server, followed by configuring our edge router to static route the IP block of the downed switch to the designated temporary gateway server.

That said it sounds more complicated than it actually is. The server I chose to use for this temporary gateway role was one on the next rack over, lets call this server GW. The private IP on GW is 10.10.7.50 and its public IP is 172.11.14.5. The public IP space that is offline from the downed public switch is 172.11.13.0-255 but each server only has in use 1-3 IP’s.

First things first, on server GW we enabled IP Forwarding with:
# echo 1 > /proc/sys/net/ipv4/ip_forward

Then on one of the affected servers on the downed rack, we needed to add a static route to tell it to route public traffic through the private network to our GW server; note eth1 is our private network interface (this takes care of traffic leaving the downed server):
# route add -net 172.11.13.0/24 gw 10.10.7.50 eth1

Then on the GW server we needed to add a static route similar to above but for each of the downed servers main IP’s.This is slightly tedious but hey its not a perfect situation to begin with right!?. So the downed server we are working on has a public IP of 172.11.13.20 and a private IP of 10.10.7.26, again the private interface is eth1. (this route will take care of traffic going to the downed server):
# route add -host 172.11.13.20 gw 10.10.7.26 eth1

In adding the two routes, the server was immediately able to ping out to the internet but none of it’s IP’s were responding outside of the network. This is because our router was still sending routed traffic for 172.11.13.0/24 to the downed switch. We need to tell it to redirect the traffic to the GW server at 172.11.14.5. Once logged into the edge router running Cisco IOS, I passed it the following static route:
router1(config)# ip route 172.11.13.0 255.255.255.0 172.11.14.5

With that done, the public IP on the downed server started to respond from the internet and traffic began to flow into the server. It was done! I had configured the downed server to route traffic out through an intermediate gateway server on the private network and for that gateway server to likewise route inbound traffic back through the private network to the affected server. Now all that was needed was repeating the first route command on all downed servers then repeating the second route command for each of the downed servers main IP’s. Tedious, not ideal, far from it in fact but it was working – we were bringing servers back up and our total outage time was about 40 minutes. Though significant, less so than it could have been of 3+ hours!

Once I had got the routing working over the private network and the first server back online, this was enough for Bill to turn back around and continue me on my journey to Montreal. Although we could have continued back to Michigan to replace the failed switch, we had a workable solution in place that allowed me to get where I needed to go and handle the failed public switch the next day.

There is arguably more than a few factors to this situation and the network we have at work that made this approach possible but they are outside the scope of this article. The take away is really very simple; a Dual Public-Private Network allows you a great many advantages, as listed earlier in this article, however it is the simple fact of having that private network parallel to your public network that affords you options in a disaster, options you may otherwise not have without it.

What do you think?
Was the choice to continue back on the trip to Montreal the right decision or should we have completely returned to Michigan to replace that failed switch?

March 17, 2011September 24, 2018

Data Integrity: AIDE for Host Based Intrusion Detection

It used to be all the talk, everyone knew it, accepted it but few did anything about it and still even today, very few do anything about it. What is it? Data Integrity. But it is not in the form of how we usually look at data integrity; it is not backups, raid management or similar — it is host based intrusion detection.

What is host based intrusion detection (hIDS)? In it simplest form it is basically the monitoring of a file system for added, deleted or modified content, for the purpose of intrusion detection and (post) compromise forensic analysis. At one time hIDS was a very popular topic with allot of emphasis pushed on it from the security community and although it still is an area of religious focus for some, it is generally a very under utilized part of a well rounded security and management policy. Note how I said management policy there also, as hIDS is not just about intrusion detection but can also play a vital role in day-to-day operations of any organization by providing “change monitoring” capabilities. This can play out in many scenarios but the simplest being that it allows you to track changes to file systems made through regular administration tasks such as software installations, updates or more importantly administrative mistakes. Though the topic of change monitoring can be a whole article in of itself, hIDS to me is vitally important in both respects as an intrusion detection AND change monitoring resource.

I can not beat around the fact that even myself, over the years, have let hIDS fall to the wayside, I used to be the biggest fan of tripwire and would use it on everything. However, over time tripwire became a time consuming, bloated and difficult tool to manage, it is also tediously slow and would cause very undesirable loads on larger systems. This made for hIDS falling out of my regular security and management habits which in turn had a way about sneaking up on me and biting me in the butt whenever a system got compromised or an administrator would make a “oopsy” on a server.

A few years back I experimented with a tool called AIDE (advanced intrusion detection environment), at the time it was the new kid on the block but showed incredible potential with a very simplified configuration approach, fast database build times and reasonably modest resources usage — by tripwire standards it was exactly what I was looking for, simple and fast. AIDE has since grown up a bit, many of the small issues I used to have with it are now fixed and it is now available in the package management for most major distributions including FreeBSD, Ubuntu, Fedora & RHEL (CentOS).

The configuration and deployment scenario we are going to look at today is one that is suitable for web and application servers but really can be broadly applied to just about any system. We are going to slightly sacrifice some monitoring attributes from files on the system in the name of increasing performance and usability while maintaining a complete picture of added, deleted and modified files. So, let’s jump right on in….

The first task is we need to install AIDE, for the purpose of this article I am assuming you are using Fedora or an RHEL based OS (i.e: CentOS), so please refer to your distributions package management or download and compile the sources at http://aide.sourceforge.net/ if a binary version is not available for you.

# yum install -y aide

The binary default installation paths for AIDE place the configuration at /etc/aide.conf , executable at /usr/sbin/aide and databases at /var/lib/aide/. The obviously important part being the configuration file so lets get a handle on that for the moment. The configuration defaults are a little loud, intensive and in my opinion will overwhelm anyone who has never used hIDS before; even for myself the defaults were just too much. That said, we are going to backup the default configuration for reference purposes and download my own custom aide.conf:

# cp /etc/aide.conf /etc/aide.conf.default
# wget http://www.rfxn.com/downloads/aide.conf -O /etc/aide.conf
# chmod 600 /etc/aide.conf

This configuration was created for a WHM/Cpanel server, it is however generalized in nature and can apply to almost any server but will require modification to keep noise to a minimum. Now I stress that fact, noise — hIDS reports can get very loud if you do not tune them and that can lead to them being ignored as a nuisance but more on that later. Lets take a look at the configuration file we just downloaded and I will attempt to break it down for you by each section:

# nano -w /etc/aide.conf
( or your preferred editor *ahem vi* )

The first 10 or so lines of the file declare the output and database paths for AIDE, they should not be edited, the first parts we want to look at follow:

# Whether to gzip the output to database
gzip_dbout=no

# Verbose level of message output - Default 5
verbose=5

These options speak for themselves; do we want to gzip the output databases? No, we do not as our management script that we will run from cron and look at later is going to take care of that for us. Next is is the verbosity level (0-255 — less to more) which defaults at 5. The verbosity is fine left at the default, you can lower it to 2 if you want strictly add/delete/modified info in the reports with NO EXTENDED information on what attributes were modified on files (i.e: user, group, permissions, size, md5) — suitable maybe for a very simplified change management policy. If set to 20 then reports will be exceedingly detailed in item-by-item change information and reports can become massive — so I recommend leaving it at the default of 5 for the best balance of detail and noise reduction.

Next the configuration file lists, in comments, the supported attributes that can be monitored on files and paths and then our default monitoring rules of what attributes we will actually use; this list shows the depth of AIDE and should be reviewed in brief for at least a fundamental understanding of what you are working with:

# These are the default rules.
#
#p:     permissions
#i:     inode:
#n:     number of links
#u:     user
#g:     group
#s:     size
#b:     block count
#m:     mtime
#a:     atime
#c:     ctime
#S:     check for growing size
#md5:    md5 checksum
#sha1:   sha1 checksum
#rmd160: rmd160 checksum
#tiger:  tiger checksum
#haval:  haval checksum
#gost:   gost checksum
#crc32:  crc32 checksum
#E:     Empty group
#>:     Growing logfile p+u+g+i+n+S

# You can create custom rules like this.
R=p+u+g+s+md5
L=p+u+g
>=p+u+g+i+n+S

NORMAL = R
LOG = p+u+g
DIR = p+u+g+md5

The important parts here that we will be using, and can be seen from the custom rules, are p,u,g,s,md5 for permissions, user, group, size and md5 hashes. How does this work in our interest? The basics of permission, user, and group are fundamentals we would always want to be notified of changes on, as really, those are attributes that shouldn’t ever change without an administrator doing so intentionally (i.e: /etc/shadow gets set 666). Then there is size and md5 which will tell us that a file has been modified, though we are not specifically tracking mtime (modified time), it is not strictly needed as md5 will tell us when even a single bit has changed in a file and mtime is an easily forged attribute (although feel free to add m to the R= list for mtime tracking if you desire it).

Then we have the paths to be monitored which you’ll note we are not monitoring on the top level ‘/’ itself but instead a specific list. Although you can monitor from the top level, it is not recommended on very large servers, if you do choose to monitor from the top level then be sure to add ‘!/home’ and other heavily modified user paths into your ignore list (covered next), especially if you have a shared hosted environment. Keep in mind, this is not about monitoring every single user level change but rather the integrity at the system (root) or critical application/content level.

/etc    NORMAL
/boot   NORMAL
/bin    NORMAL
/sbin   NORMAL
/lib    NORMAL
/opt    NORMAL
/usr    NORMAL
/root   NORMAL
/var    NORMAL
/var/log      LOG

## monitoring /home can create excessive run-time delays
# /home   DIR

As mentioned above, monitoring of /home is not the best of ideas, especially on larger servers with hundreds of users. The exception to this rule is smaller servers that are task oriented towards mission critical sites or applications. In these situations, such as my employers and even my own web server that have no other task than to host a few sites, monitoring of /home can be invaluable in detecting intrusions in your web site and web applications. This is especially true if you run billing, support forums, help desks and similar web applications on a single server dedicated to your businesses corporate web presence. So, the take away here is — monitor /home sparingly and evaluate it on a case-by-case basis.

Now, onto our ignore list which is as simple as it gets — any paths that are not subject to monitoring for whatever reason, be it too heavily modified or just administratively not suitable to be reported on.

!/backup
!/dev
!/etc/mtab
!/home/cpeasyapache
!/home/MySQL-install
!/home/[a-z0-9]+/mail
!/home/[a-z0-9]+/logs
!/home/[a-z0-9]+/.sqmaildata
!/home/[a-z0-9]+/.cpanel
!/root/tmp/pear
!/usr/local/apache/domlogs
!/usr/local/apache/htdocs/mrtg
!/usr/local/apache/logs
!/usr/local/bfd/tmp
!/usr/local/cpanel
!/usr/local/sim/internals/stat
!/usr/local/sim/internals/tmp
!/usr/local/sim/logs
!/var/cache
!/var/cpanel
!/var/lib/mlocate
!/var/lib/mysql
!/var/lib/rpm
!/var/spool
!/var/www/html/mrtg
!/tmp/sess_.*
!/var/tmp/sess_.*
!/var/log/dcpumon

Generally speaking, you do want to limit the paths ignored as every ignored path is a potential area that an attacker can store malicious software. That said though, we are trying to strike a balance in our reports that alert us to intrusions while still being reasonable enough in length to be regularly reviewed. The important thing to remember is although an attacker can hide content in these ignored paths, to effectively compromise or backdoor a server, the attacker needs to replace and modify a broad set of binaries and logs on the server, which will stand out clearly in our reports. Nevertheless, remove any paths from the ignore list that may not apply to your environment or add too it as appropriate.

That’s it for the configuration side of AIDE, hopefully you found it straight forward and not too overwhelming, if you did then google tripwire and you’ll thank me later 😉

The next part of our AIDE installation is the management and reporting component. The approach we will be taking is using a management script executed through cron daily, weekly or monthly to perform maintenance tasks and generate reports, which can optionally be emailed. The maintenance consists of compressing and rotating our old AIDE databases and logs to time stamped backups along with deleting data that has aged past a certain point.


# wget http://rfxn.com/downloads/cron.aide -O /etc/cron.weekly/aide
# chmod 755 /etc/cron.weekly/aide

The default for this article will be to run AIDE on a weekly basis, this is what I recommend as I have found that daily creates too many reports that become a burden to check and monthly creates reports that are far too large and noisy — weekly strikes the right balance in report size and frequency. The cron has two variables in it that can be modified for email and max age of databases/logs, so go ahead and open /etc/cron.weekly/aide with your preferred editor and modify them as you see fit.

# email address for reports
email=

# max age of logs and databases in hours
# default 2160 = 90 days
maxage=2160

The e-mail address variable can be left blank to not send any emails, if you choose this then reports can be manually viewed at /var/lib/aide/aide.log and are rotated into time stamped backups after each execution (i.e: aide.log.20110315-162841). The maxage variable, in hours, is the frequency at which aide logs and databases will be deleted, which I think 90 days is a reasonable length of retention. However, I strongly recommend for a number of reasons that you make sure /var/lib/aide is included in your remote backups so that if you ever need it, you can pull in older databases for compromise or change analysis across a wider time range than the default last-execution comparison reports.

Although it is not needed, you can go ahead and give the cron job a first run, or simply wait till the end of the week. Let’s assume your like me though and want to play with your new toy 🙂 We will run it through the time command so you can get an idea of how long execution will take in your environment, might also be a good idea to open a second console and top it to see what the resource hit is like for you — typically all CPU but the script runs AIDE as nice 19 which is the lowest system priority meaning other processes can use CPU before AIDE if they request it.

# time sh /etc/cron.weekly/aide

Let it run, it may take anywhere from 10 to 60 minutes depending on the servers specs and amount of data, for very large servers, especially if you choose to monitor /home, do not be surprised with run times beyond 60 minutes. Once completed check your email or the /var/lib/aide/aide.log file for your first report and that’s it, you are all set.

Two small warnings about report output, the first is that when you perform software updates or your control panel (i.e: WHM/Cpanel) does so automatically, you can obviously expect to see a very loud report generated. You can optionally force the database to regenerate when you run server updates by executing ‘/usr/sbin/aide –init’ and this will keep the next report nice and clean. The second warning is that sometimes the first report can be exceedingly noisy with all kinds of attribute warnings, if this happens give the cron script (/etc/cron.weekly/aide) a second run and you should receive a nice clean report free of warnings and noise.

For convenience, I have also made a small installer script that will take care of everything above in my defaults and install AIDE/cron script for you, suitable for use on additional servers after you’ve run through this on your first server.

# wget http://www.rfxn.com/downloads/install_aide
# sh install_aide "[email protected]"

I hope AIDE proves to be as useful for you as it has been for me, hIDS is a critical component in any security and management policy and you should take the time to tweak the configuration for your specific environment. If you find the reports are too noisy then please ignore paths that are problematic before you ditch AIDE; if you give AIDE a chance it will be good to you and one day it may very well save you in a compromise or administrative “oops” situation.

March 16, 2011September 24, 2018

LMD 1.3.9: Quietly Awesome

It has been a busy couple of weeks for the LMD project, lots of late nights and sleepless days behind me and I can say I am a ‘little’ happier with where things are in the project now 🙂

This release has no major feature changes or additions other than a modification in the default hexdepth that is used to scan malware; increased from 15,736 to 61,440 (1024*60). This enables LMD to better detect threats that it was having a little difficulty with due to the byte size of some malware. At the moment there is no byte-offset feature that would allow us to create more targeted hex signatures, which kind of fly’s in the face of my goal of improving performance — but it is what it is and for the moment things are O.K. With the new hex-depth value, the miss rate on valid malware that rules already exist for is below 1% and I can live with that till I put together a new scanner engine/logic. You can apply this update to your installations by using the -d|–update-ver flags.

In light of performance concerns with the current incarnation of LMD, I felt it prudent, and also has been requested, to create a set of signatures compatible with ClamAV. This allows those who wish to do so, to leverage the LMD signature set with ClamAV’s very impressive scanner performance. Although there are performance concerns with LMD on large file sets, it does need to be said that for day-to-day operations of the cron initiated or inotify real-time scans, there is little to no performance issues. This is strictly a situation where if you choose to scan entire home directory trees, where file lists exceed tens or hundreds of thousands, you now have an option to take advantage of our signatures within a scanner engine that can handle those large file sets.

The LMD converted ClamAV signatures ship as part of the current release of LMD and are stored at /usr/local/maldetect/sigs/ and are named rfxn.ndb and rfxn.hdb. The ClamAV signatures will not be updated with the usage of -u|–update but rather are static files placed inside the release package when it is rebuilt nightly with new signatures. As such, the latest versions of the rfxn.ndb|hdb files are always available at the following URL’s (these are updated whenever LMD base signature are updated — typically daily):
http://www.rfxn.com/downloads/rfxn.ndb
http://www.rfxn.com/downloads/rfxn.hdb

To make use of these signatures in ClamAV using the clamscan command, you would run a command similar to the following:
clamscan -d /usr/local/maldetect/sigs/rfxn.ndb -d /usr/local/maldetect/sigs/rfxn.hdb -d /var/clamav/ –infected -r /path/to/scan

The -d options specify the virus databases to use, when you use the -d option it will exclusively use those databases for virus scans and ignore the default ClamAV virus database, so we redefine -d /var/clamav/ to also have all of the default ClamAV signatures included in our scans. The –infected option will only display those files that are found to be infected and the -r option is for recursive scanning (descend directory tree’s).

That all said, the real guts of changes recently have been in the signatures themselves, we have in the last two weeks went from 6,083 to 6,769 signatures, an increase of 686 — one of the largest updates to signatures to date for a single month. A great deal of these signatures have come from the submissions queue which ended up in such a backlogged state that there was 2,317 items pending review. Currently I have managed to get the queue cut down to 1,381 and I have committed myself to eliminating the backlog by the end of the month or shortly there after (ya we’ve heard that before right ?). It should be understood that reviewing the submissions queue is an exceedingly tedious task as every file needs to be manually reviewed, assessed on its merits as malware (or deleted), then ultimately hex match patterns created, classify the malware, hash it and insert it into the signatures list. As painful of a process as it is at times, I do enjoy it, it is just very time consuming so please show patience when you submit malware with the checkout option.

Finally, I have allot in store for LMD going forward, as always time is the biggest factor but be assured the project will continue to grow and improve in the best interest of detect malware on your servers. Likewise, for any who have followed previous blogs, the dailythreats website is still a work in progress and will compliment the project once it is released, in the very near future. As always, thank you to everyone who uses my work and please consider a donation whenever possible.

Did You Know? Stats:
Active Installations (Unique IP daily update queries): 5,903
Total Downloads (Project to date): 32,749
Total Malware Signatures: 6,769
rfxn.com Malware Repository: 21,569 files / 1.6G data
Tracked Malware URL’s: 18,304
Tracked IRC C&C Botnets: 421

March 9, 2011September 24, 2018

Happy Birthday APF: 8 Years Strong

On this day eight years ago, Advanced Policy Firewall (APF) version 0.5 for Linux was publicly released. Since then, APF has stood the test of time and still remains to this day, one of the most widely used Linux firewall solutions, with especially high usage in the web hosting industry.

I was 18 years old when APF first met the world, I was a very different person back then and so to was the web hosting industry. There was but a handful of dedicated server providers, it was a time when Cobalt RAQ’s still dominated a large part of the leased server market and white-box leased servers were quickly starting to pickup momentum from providers such as rackshack.net. As every other person tried to start a web hosting business, it quickly became clear on many industry forums and websites that there lacked an easy-to-use firewall solution that was also comprehensive. This is where APF came in, it gave to the masses a simple, usable and stable firewall suite for servers that were typically managed by individuals with very little to no experience in Linux ipchains and later iptables firewalling.

The project has seen in excess of 400,000 downloads to date or 55% of all downloads in the last 10 years on rfxn.com, the latest versions which retrieves data daily from rfxn.com reports that there is over 24,000 active APF installations with these features enabled, countless thousands more with them disabled or legacy releases, and countless more applications ship with APF integrated. In excess of 1,700 separate corporate, governmental and organizational networks report using APF (through ASN tracking) and roughly 260,000 web sites are directly protected by APF (through domainsbyip.com tracking). Any google search on APF or related terms quickly brings up tens of thousands of references to the project in an assortment of installation , usage and best practice guides.

It is clear APF is a force, it is here to stay, sure there is much that can be improved upon it and that will come with time but for now, let us acknowledge that this has been a good 8 years for APF and here is to many more, happy birthday APF and Long live open source software 🙂

February 14, 2011September 24, 2018

Raid Management: Know Whats Really Going On

In today’s hosting environment it is common place for servers to have hardware based raid cards but what is not common place is having a reliable method for checking the status of the raid arrays. Few would question the value to data integrity by making use of raid technology but very few organizations and businesses implement the tools required to proactively maintain raid arrays, they simply hope for a DC tech to hear a raid alarm and assume the technician will handle the failure. The reality is very different, data centers are loud and increasingly server-dense so hearing a raid alarm let alone pin-pointing the server with the alarm going off, is a daunting task. I remember more than a few times where I found myself with a paper towel tube to my ear listening server to server to try find that troubled box with the annoying alarm going off. This is not how servers should be managed.

As server administrators or web host operators, it is your responsibility, your duty, to have tools in place that can proactively monitor the status of raid arrays and alert you when an array becomes degraded. That way you can have a paper trail of sorts when something has went wrong, submit a ticket to your data center technicians and have the situation corrected before a degraded array from a single disk failure turns into a multi-disk failure and failed array with data loss.

I have created and been using for sometime a script that can query the status of raid controllers from Areca, 3Ware & MegaRaid. The MegaRaid support is mostly intended for Dell PowerEdge PERC cards, however it should work for most MegaRaid based controllers, ymmv though.

The principle is very simple, the package contains the proprietary command line tools from Areca, 3Ware & MegaRaid that can query the status of respective controllers and then an accompanied ‘check’ script handles determining what raid controller is on the system and then runs the appropriate tool in order to get the raid status and if it is degraded or in any state other than a consistent one, it will dispatch an alert to a configured e-mail address.

Download and extract the package:

# wget http://www.rfxn.com/downloads/raid_check_pub.tar.gz
# tar xvfz raid_check_pub.tar.gz

The package will extract to raid_check/, you should place this under /root/ as the check script expects to be run from /root/raid_check/. If you wish to change the path then please modify ‘raid_check/check’.

With the package now setup under /root/raid_check/ you need to modify the ‘raid_check/check’ script to set an email address that alerts are going to get sent too. Once this is done you should symlink the check script to cron.daily so that raid failures will be picked up on daily cron runs, you may change this to cron.hourly if so desired.

# ln -s /root/raid_check/check /etc/cron.daily/raid_check

That’s it, you can give the check script a run to see if things are working. If there is a failure or inconsistency detected then it will be shown on console in addition to the email alert being sent. If everything is OK and there is no issues detected, then no output will be presented.

# sh /root/raid_check/check

Tip: You can check if your server has a raid card by running the following command:

# cat /proc/scsi/scsi  | grep Vendor

If you see Vendors listed as ATA followed by hard drive model names (i.e: WDC, HD etc…) then your servers disks are directly connected and there is no raid controller present. If on the other hand you see vendor names such as Areca, AMC, 3ware, or MegaRaid then you have a hardware raid controller.