On 2 separate occasions and on 2 separate Lotus Foundations Start IBM servers, we experienced a problem where the server became inaccessible remotely and even on-site. Even the handy LCD panel buttons were unresponsive when pressed. Furthermore, since the server is used as the network’s main router, none of the workstations were able to access the Internet anymore. The only solution was to turn the server off and then back on again. However, this only fixed the problem temporarily. Within a matter of minutes the server became unresponsive again and we had to turn it off and on again to get it back up and running.
Looking at the Web console right after rebooting, we could see that the server was trying to start up all of its tasks and everything looked rather normal. However, within a few minutes we noticed that the CPUs spiked and stayed that way. CPU1 and CPU2 were both pegged at 100%. Eventually, the Web console stopped responding and it was time to turn the server off and on again.
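For anyone who wants to tell this symptom apart from a normal startup burst, the check amounts to noticing that the CPU readings stay at the ceiling rather than spiking briefly. The following is only a minimal sketch in Python, not something we ran on these servers: the function name `is_pegged`, the 99% threshold, the five-reading window, and the sample readings are all assumptions made for illustration, and the readings would have to come from whatever monitoring source is available to you.

```python
def is_pegged(samples, threshold=99.0, min_consecutive=5):
    """Return True if CPU usage stayed at or above `threshold` percent
    for at least `min_consecutive` readings in a row."""
    run = 0
    for usage in samples:
        if usage >= threshold:
            run += 1
            if run >= min_consecutive:
                return True
        else:
            # A single lower reading breaks the streak.
            run = 0
    return False


# Hypothetical readings, taken once a minute after a reboot.
startup_spike = [35.0, 100.0, 100.0, 60.0, 20.0, 15.0, 10.0]
stuck_at_max = [40.0, 85.0, 100.0, 100.0, 100.0, 100.0, 100.0]

print(is_pegged(startup_spike))
print(is_pegged(stuck_at_max))
```

Requiring several consecutive readings is a deliberate choice here, since a server that is starting all of its tasks will briefly hit 100% even when nothing is wrong.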
The servers in question were both IBM Lotus Foundations Start Entry Level servers with two 250 GB hard drives in a RAID 1 array. There was also a separate IDB drive used for backups.
To try to isolate the problem we tried a variety of things:
1. When the server came back up we used the Web console to turn off some tasks on the server, but the server still eventually became unresponsive.
2. Next, we removed one drive of the RAID array and rebooted again. Even booting the server with each disk on its own didn’t solve the problem.
3. Removing both of the drives (and leaving the IDB drive inserted), the server did boot fine and the Internet was accessible. However, the IDB drive showed no backups to restore from so restoring from backup was not an option.
Lotus Foundations support pointed me to some IBM support articles that describe how to handle disk corruption on the server. The article that I used was "Checking the main disk for filesystem corruption": http://www-01.ibm.com/support/docview.wss?uid=swg21387563
There’s no need to go into detail here on how we went about getting the drive back up and running, since the above article does a great job of doing just that. The important thing to note is that, in both cases, following the article did successfully fix the disk corruption and the server came up right away after a reboot. On one server, one file was lost, but we were able to restore it from a recent backup on the IDB drive that had previously registered as having no backups. On the other server, no data was lost.
In the end, the fix was simple and the servers came back up problem-free. However, it was still very disconcerting that the servers became inoperable due to the disk corruption. It would have been better if the server’s router function came up without needing the disks, so that everyone could still use the Internet while the server battled the disk corruption issue. Also, since the server became remotely inaccessible, we had to go on-site to troubleshoot and fix the problem. This was a major inconvenience, but an unavoidable necessity.
We have not been able to determine the cause of the disk corruption. If we do determine the cause, we will post an updated blog entry.
We hope this posting helps other companies supporting Lotus Foundations to arrive more quickly at a possible fix when encountering similar symptoms.