RAID Documentation


RAID Usage

In the early days we used Arena EX3 external SCSI-attached RAID arrays. These external units featured dual redundant power supplies and held six IDE drives. Later, we found we could use SATA drives, with a low-profile SATA-to-PATA (IDE) adapter and a little bit of connector trimming. This boosted the capacity of these external SCSI devices, but only to a point: the limitations of the internal controller seemed to cap the capacity at 2TB.

So, we phased out these Arena EX3 RAID arrays in favour of big Chenbro rack-mounted chassis, each capable of holding 16 drives. A 3Ware controller with 16 ports addresses these drives, which are hot-pluggable via the Chenbro chassis and drive-bay backplane.

Our in-house-built, Chenbro-chassis RAID arrays are found on Spitfire and Hurricane. Here is how we use them now:

Spitfire

Controller /c1 contains two RAID arrays:

  • /u0 for /home/users
    • /c1/u0 shows up as a single partition /dev/sda1, mounted at /home/users
    • formed from 14x 500GB drives, occupying physical slots p2-p15. RAID-5 usable capacity is n-1 drives, for a nominal 6.5TB (about 6TiB, the reported capacity)
    • uses an XFS filesystem (see the fstab sketch after the controller listing below)


  • /u1 for the Gentoo GNU/Linux operating system
    • /c1/u1 shows up as three partitions, following our classical layout:
      • /dev/sdb1 is mountable at /boot
      • /dev/sdb2 is swap
      • /dev/sdb3 is /, type ext3
    • formed from 2x 150GB drives, occupying physical slots p0-p1. RAID-1 (mirror) capacity is a nominal 150GB (about 140GiB reported)
spitfire ~ # tw_cli /c1 show 

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-5    OK             -       -       64K     6053.47   ON     OFF    
u1    RAID-1    OK             -       -       -       139.688   ON     OFF    

VPort Status         Unit Size      Type  Phy Encl-Slot    Model
------------------------------------------------------------------------------
p0    OK             u1   139.73 GB SATA  0   -            WDC WD1500ADFD-00NL
p1    OK             u1   139.73 GB SATA  1   -            WDC WD1500ADFD-00NL
p2    OK             u0   465.76 GB SATA  2   -            WDC WD5000ABYS-01TN
p3    OK             u0   465.76 GB SATA  3   -            ST3500320NS
p4    OK             u0   465.76 GB SATA  4   -            ST3500320NS
p5    OK             u0   465.76 GB SATA  5   -            ST3500320NS
p6    OK             u0   465.76 GB SATA  6   -            ST500NM0011
p7    OK             u0   465.76 GB SATA  7   -            ST3500320NS         
p8    OK             u0   465.76 GB SATA  8   -            ST500NM0011
p9    OK             u0   465.76 GB SATA  9   -            ST3500320NS
p10   OK             u0   465.76 GB SATA  10  -            ST3500320NS
p11   OK             u0   465.76 GB SATA  11  -            ST3500320NS
p12   OK             u0   465.76 GB SATA  12  -            ST3500320NS
p13   OK             u0   465.76 GB SATA  13  -            ST500NM0011
p14   OK             u0   465.76 GB SATA  14  -            ST3500320NS
p15   OK             u0   465.76 GB SATA  15  -            ST3500320NS

Name  OnlineState  BBUReady  Status    Volt     Temp     Hours  LastCapTest
---------------------------------------------------------------------------
bbu   On           Yes       OK        OK       OK       229    06-Nov-2011
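
For reference, the mount layout described above corresponds roughly to the following /etc/fstab entries. This is only an illustrative sketch: the mount options and the /boot filesystem type (ext2) are assumptions, not copied from the live machine.

spitfire ~ # cat /etc/fstab
/dev/sdb1   /boot         ext2   noauto,noatime   1 2
/dev/sdb3   /             ext3   noatime          0 1
/dev/sdb2   none          swap   sw               0 0
/dev/sda1   /home/users   xfs    noatime          0 0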


Hurricane

Controller /c1 contains two RAID arrays:

  • /u0 for projects, which includes SVN, CVS, software deployments, the Amanda holding disk, and projects/infrastructure (containing eBooks, docs, web_content, scripts and some backups)
    • /c1/u0 shows up as a single partition /dev/sda1, mounted at /mnt/raid
    • formed from 14x 500GB drives, occupying physical slots p2-p15. RAID-5 usable capacity is n-1 drives, for a nominal 6.5TB (about 6TiB, the reported capacity)
    • uses an XFS filesystem
    • Usage breakdown as of June 2012 (see also the quick df check after this list):
hurricane raid # cd /mnt/raid/ ; du -h --max-depth=1
15G	./svn
4.0K	./holding
3.5G	./cvs
214G	./projects
763G	./software
995G	.


  • /u1 for the Gentoo GNU/Linux operating system
    • /c1/u1 shows up as three partitions, following our classical layout:
      • /dev/sdb1 is mountable at /boot
      • /dev/sdb2 is swap
      • /dev/sdb3 is /, type ext3
    • formed from 2x 150GB drives, occupying physical slots p0-p1. RAID-1 (mirror) capacity is a nominal 150GB (about 140GiB reported)
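
A quick way to check overall usage and remaining space on the projects array, complementing the du breakdown above, is df against the mount point:

hurricane ~ # df -h /mnt/raid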


Musashi

Musashi still uses the older external Arena EX3 RAID array.

  • shows up as a single partition, /dev/sdc1, mounted on /mnt/raid4, type XFS
  • /mnt/raid4/mirror is bind-mounted to /export/mirror, where it is NFS-exported (see the sketch below)
  • holds our Gentoo GNU/Linux Mirror, for local Portage/distfile sync
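
For reference, the mount, bind mount, and export described above correspond roughly to the following /etc/fstab and /etc/exports lines. This is an illustrative sketch only: the mount and export options (and the client specification) are assumptions, not copied from the live machine.

musashi ~ # cat /etc/fstab
/dev/sdc1           /mnt/raid4       xfs    noatime   0 0
/mnt/raid4/mirror   /export/mirror   none   bind      0 0

musashi ~ # cat /etc/exports
/export/mirror      *(ro,no_subtree_check)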


Hercules

  • to be added

RAID Maintenance

We have drive failures periodically, reported both through Nagios and via the daily logwatch emails. Failed drives must be replaced promptly to avoid data loss! The Chenbro chassis supports drive hot-swap. Typically, to replace a bad hard drive, you would log onto the afflicted server and type the following:

spitfire ~ # tw_cli
//spitfire> /c1 show

That produces a list of statuses for each drive in the array and clearly identifies the bad drive. Once you know the port name of the bad drive (e.g. p15), you remove that drive from the chassis and replace it with a new one. The controller will automatically commence a RAID rebuild. You can re-run /c1 show later on to check on the array status, as shown below.
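
A rebuild on a large unit can take many hours. One way to keep an eye on it without re-typing the command is to wrap it in watch (the 60-second interval here is just an illustrative choice); the unit's Status column and the %RCmpl column in the listing report the rebuild state and progress:

spitfire ~ # watch -n 60 "tw_cli /c1 show"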


Lately, however, the replaced drive has been showing up as a new unit, u?. Not helpful :-( but here's what to do:

spitfire ~ # tw_cli
//spitfire> /c1 rescan                  (finds the replaced drive and assigns it to a new unit, e.g. u2)
//spitfire> /c1/u2 del                  (do not use remove; del keeps the drive but un-assigns it from the unit)
//spitfire> maint rebuild c1 u0 p15     (example: adds the replaced drive p15 back into unit u0 on controller c1)
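
The same sequence can also be run non-interactively from the shell (as noted in the tips below, tw_cli accepts its commands as arguments), which is handy for copy-pasting; the port and unit names here are just the example names used above:

spitfire ~ # tw_cli /c1 rescan
spitfire ~ # tw_cli /c1/u2 del
spitfire ~ # tw_cli maint rebuild c1 u0 p15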


Some tips:

  • tw_cli doesn't need to be typed on a line by itself. It can be combined with parameters such as: tw_cli /c1 show
  • The /c1 argument specifies the RAID controller. On some servers, such as spitfire, c1 is the correct controller identifier, but other servers might use c0, c2, etc. So, when using the /c1 show command, don't be surprised if you get an error saying that the controller isn't found; just try a different identifier, such as c0, or list the controllers that are present, as shown below.
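
A quick way to see which controller identifiers exist on a given server is the controller listing from a bare show command (the exact columns vary between tw_cli versions):

hurricane ~ # tw_cli show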


Change stripe size on an existing RAID-5 "live" system:

spitfire ~ # tw_cli /c1/u0 migrate type=raid5 stripe=256
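
The migration runs in the background and can take a long time on a large unit. Its progress shows up in the %V/I/M column of the controller listing (the verify/initialise/migrate percentage), so it can be checked with the same show command used elsewhere:

spitfire ~ # tw_cli /c1 show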


To update the firmware from the command line, first download (and unzip if necessary) the firmware file. The update itself goes pretty quickly (less than a minute) and will not take effect until after a reboot:

spitfire ~ # tw_cli /c1 update fw=~/prom0006.img
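
Before and after the reboot it is worth confirming which firmware version the controller is actually running. The controller's firmware attribute reports it (assuming the installed tw_cli version supports this attribute name):

spitfire ~ # tw_cli /c1 show firmware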