RAID Documentation
Latest revision as of 23:13, 16 June 2015

RAID Usage

In the Early Days we used Arena EX3 external SCSI-attached RAID arrays. These external units featured dual redundant power supplies and held six IDE drives. Later, we found we could use SATA drives, with a low-profile SATA-to-PATA (IDE) adapter and a little bit of connector trimming. This boosted the capacity of these external SCSI devices, but only to a point - the limitations of the internal controller seemed to cap the capacity at 2TB.

So, we phased out these Arena EX3 RAID arrays in favour of big Chenbro rack-mounted chassis, each capable of holding 16 drives. A 16-port 3Ware controller addresses these drives, which are hot-pluggable via the Chenbro chassis and drive-bay backplane.

Our in-house-built Chenbro-chassis RAID arrays are found in Spitfire and Hurricane. Here is how we use them now:

Spitfire

Controller /c1 contains two RAID arrays:

  • /u0 for /home/users
    • /c1/u0 shows up as a single partition /dev/sda1, mounted at /home/users
    • formed from 14x 500GB drives, occupying physical slots p2-p15. RAID-5 capacity is n-1 drives (13x 500GB), for a nominal 6.5TB (roughly 6TiB, which is the reported capacity)
    • uses XFS filesystem


  • /u1 for the Gentoo GNU/Linux operating system
    • /c1/u1 shows up as three partitions, following our classical layout:
      • /dev/sdb1 is mountable at /boot
      • /dev/sdb2 is swap
      • /dev/sdb3 is /, type ext3
    • formed from 2x 150GB drives, occupying physical slots p0-p1. RAID-1 (mirror) capacity is nominal 150GB (140GiB reported)
spitfire ~ # tw_cli /c1 show 

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-5    OK             -       -       64K     6053.47   ON     OFF    
u1    RAID-1    OK             -       -       -       139.688   ON     OFF    

VPort Status         Unit Size      Type  Phy Encl-Slot    Model
------------------------------------------------------------------------------
p0    OK             u1   139.73 GB SATA  0   -            WDC WD1500ADFD-00NL
p1    OK             u1   139.73 GB SATA  1   -            WDC WD1500ADFD-00NL
p2    OK             u0   465.76 GB SATA  2   -            WDC WD5000ABYS-01TN
p3    OK             u0   465.76 GB SATA  3   -            ST3500320NS
p4    OK             u0   465.76 GB SATA  4   -            ST3500320NS
p5    OK             u0   465.76 GB SATA  5   -            ST3500320NS
p6    OK             u0   465.76 GB SATA  6   -            ST500NM0011
p7    OK             u0   465.76 GB SATA  7   -            ST3500320NS         
p8    OK             u0   465.76 GB SATA  8   -            ST500NM0011
p9    OK             u0   465.76 GB SATA  9   -            ST3500320NS
p10   OK             u0   465.76 GB SATA  10  -            ST3500320NS
p11   OK             u0   465.76 GB SATA  11  -            ST3500320NS
p12   OK             u0   465.76 GB SATA  12  -            ST3500320NS
p13   OK             u0   465.76 GB SATA  13  -            ST500NM0011
p14   OK             u0   465.76 GB SATA  14  -            ST3500320NS
p15   OK             u0   465.76 GB SATA  15  -            ST3500320NS

Name  OnlineState  BBUReady  Status    Volt     Temp     Hours  LastCapTest
---------------------------------------------------------------------------
bbu   On           Yes       OK        OK       OK       229    06-Nov-2011


Hurricane

Controller /c1 contains two RAID arrays:

  • /u0 for projects, which includes SVN, CVS, software-deployments, the Amanda holding disk, and projects/infrastructure (containing eBooks, docs, web_content, scripts and some backups)
    • /c1/u0 shows up as a single partition /dev/sda1, mounted at /mnt/raid
    • formed from 14x 500GB drives, occupying physical slots p2-p15. RAID-5 capacity is n-1 drives (13x 500GB), for a nominal 6.5TB (roughly 6TiB, which is the reported capacity)
    • uses XFS filesystem
    • Usage, and breakdown as of June 2012:
hurricane raid # cd /mnt/raid/ ; du -h --max-depth=1
15G	./svn
4.0K	./holding
3.5G	./cvs
214G	./projects
763G	./software
995G	.


  • /u1 for the Gentoo GNU/Linux operating system
    • /c1/u1 shows up as three partitions, following our classical layout:
      • /dev/sdb1 is mountable at /boot
      • /dev/sdb2 is swap
      • /dev/sdb3 is /, type ext3
    • formed from 2x 150GB drives, occupying physical slots p0-p1. RAID-1 (mirror) capacity is nominal 150GB (140GiB reported)


Musashi

Still using the older external Arena EX3 RAID array.

  • shows up as a single-partition /dev/sdc1 mounted on /mnt/raid4, type XFS
  • /mnt/raid4/mirror is bind-mounted to /export/mirror where it is NFS-exported
  • holds our Gentoo GNU/Linux Mirror, for local Portage/distfile sync
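The bind mount and NFS export above could be made persistent across reboots. A hedged sketch of the relevant config lines follows - the paths match the description above, but the mount options and client network are assumptions, not copied from Musashi:

```
# /etc/fstab - bind /mnt/raid4/mirror under the NFS export root
/mnt/raid4/mirror  /export/mirror  none  bind  0 0

# /etc/exports - export read-only; the client network here is an assumption
/export/mirror  192.168.0.0/24(ro,no_subtree_check)
```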


Hercules

  • to be added

RAID Maintenance

We have drive failures periodically - reported both through Nagios and via the daily logwatch emails. Failed drives must be replaced promptly, to avoid data loss! The Chenbro chassis supports drive hot-swap. Typically, to replace a bad hard drive, you would log onto the afflicted server and type the following:

spitfire ~ # tw_cli
//spitfire> /c1 show

That produces a list of statuses for each drive in the array and clearly identifies the bad drive. Once you know the name of the bad drive (e.g. p15), physically remove that drive and replace it with a new one. The controller will automatically commence a RAID rebuild. You can use /c1 show later on to check on the array status.


Lately, however, the replaced drive is showing up as a new Unit, u?. Not helpful :-( but here's what to do:

spitfire ~ # tw_cli
//spitfire> /c1 rescan     will find the replaced drive, assign it to a new Unit u2
//spitfire> /c1/u2 del     do not use remove; del will keep the drive, but un-assign it
//spitfire> maint rebuild c1 u0 p15     example, to add replaced drive p15 into Unit u0 on Controller c1


Some tips:

  • tw_cli doesn't need to be typed on a line by itself. It can be combined with parameters such as: tw_cli /c1 show
  • The /c1 argument specifies the RAID controller. On some servers, such as spitfire, c1 is the correct controller identifier, but other servers might use c0, c2, etc. So if /c1 show produces an error saying the controller isn't found, just try a different identifier, such as c0.
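That try-each-identifier tip can be sketched as a small loop. This is a hedged illustration only: probe() stands in for the real `tw_cli /$c show` call, and is stubbed here (pretending only c0 exists) so the snippet runs on a machine without a 3Ware controller.

```shell
# Stub for `tw_cli "/$1 show" >/dev/null 2>&1` - assumes only c0 answers.
probe() {
  [ "$1" = "c0" ]
}

# Try controller identifiers in order until one responds.
for c in c1 c0 c2; do
  if probe "$c"; then
    echo "controller found: /$c"
    break
  fi
done
# → controller found: /c0
```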


Change stripe size on an existing RAID-5 "live" system:

spitfire ~ # tw_cli /c1/u0 migrate type=raid5 stripe=256


Update the firmware using the command-line - first download (and unzip if necessary) the firmware file. This goes pretty quickly (less than a minute) and will not take effect until after reboot:

spitfire ~ # tw_cli /c1 update fw=~/prom0006.img