Fornax@UWA Incident 0009

Incident: 0009

Date/Time: 21/12/2012 10:40 AM

iVEC Operations Staff: Ashley Chew

Summary

The SGI DDN IS16000 storage system underwent a vendor firmware update to address some issues. A problem during the update left the storage system non-functional.

Because the physical storage system housing Lustre was unavailable, the maintenance day became an extended outage until the storage system could be restored.


The issue has been resolved.


Root Cause

During a routine firmware upgrade of an SGI IS16000 storage array on Fornax, the system encountered an unexpected problem that caused an extended outage of the storage array. A root cause analysis conducted by DDN technical staff in the USA showed that there was a procedural error during the firmware upgrade. DDN and SGI staff are reviewing the upgrade instructions in DDN’s SFAOS user guide to determine whether any modifications are needed to prevent a repeat. There were no problems with the firmware itself or with the underlying hardware or software. No data was lost during the incident.


Report:


Date 17/12/2012

The Fornax nodes (Compute, Devel, IO, Login and Lustre) were shut down so that no system could access the Lustre filesystem during the update of the SGI DDN IS16000 storage unit.

The on-site vendor engineer followed the strict procedure to upgrade the firmware to 1.5.1.1.

During the last update, a technical problem occurred, which caused the issues mentioned above.

The on-site SGI engineer then contacted the DDN partner for support, and a special procedure was provided.

Date 18/12/2012 

Following the procedure set out by DDN support, all RAID storage pools of the IS16000 underwent a “force verify process”. Depending on their state, some storage pools might then also undergo a RAID rebuild process.

Date 19/12/2012

Most of the storage pools had finished the "force verify process", and those marked for rebuild were undergoing the RAID rebuild process.

Date 20/12/2012

The RAID rebuild of the designated pools finished at approximately 3pm, after which an attempt was made to restore Lustre filesystem functionality for Fornax.

The filesystem was found to be functional, so Fornax was brought back online and released for public access at approximately 6pm.

Action Items: Find out from the vendor what caused the problem with the firmware update. Complete (see Root Cause above).
