Incident: 0009
Date/Time: 21/12/2012 10:40 AM
iVEC Operations Staff: Ashley Chew
Summary
The SGI DDN IS16000 underwent a vendor firmware update, during which problems arose that left the storage system non-functional:
- Disks were missing from storage pools
- Disks were shown as unassigned
- Storage pools became degraded or inoperative
- A host of other issues
Because the physical storage system housing Lustre was unavailable, the maintenance day became an extended outage until the storage system could be restored.
The issue has been resolved.
Root Cause
During a routine firmware upgrade of an SGI IS16000 storage array on Fornax, the system encountered an unexpected problem that caused an extended outage of the storage array. A root cause analysis conducted by DDN technical staff in the USA showed that there was a procedural error during the firmware upgrade. DDN and SGI staff are reviewing the upgrade instructions in DDN’s SFAOS user guide to determine whether any modifications are needed to prevent a repeat. There were no problems with the firmware itself or with the underlying hardware or software. No data was lost during the incident.
Report:
17/12/2012: Fornax (compute, devel, IO, login and Lustre nodes) was shut down so that no system could access the Lustre filesystem during the update of the SGI DDN IS16000 storage unit. The on-site vendor engineer followed the strict procedure to upgrade the firmware to 1.5.1.1. During the last update, a technical problem occurred that produced the issues listed in the Summary. The on-site SGI engineer contacted partner DDN for support, and DDN supplied a special procedure.
18/12/2012: Following the procedure set out by DDN support, all RAID storage pools of the IS16000 underwent a "force verify" process. Depending on their state, storage pools might then also undergo a RAID rebuild.
19/12/2012: Most of the storage pools had finished the "force verify" process, and the pools marked for it were undergoing the RAID rebuild.
20/12/2012: The RAID rebuild of the designated pools finished at approximately 3 pm, after which an attempt was made to restore Lustre filesystem functionality for Fornax. The filesystem was found to be functional, so Fornax was brought back online and released for public access at approximately 6 pm.
Action Items: Find out from the vendor what caused the problem with the firmware update. Status: complete (see Root Cause above).