HOWTO Setup S.M.A.R.T. hard-drive Monitoring
Latest revision as of 16:39, 15 April 2010
Typically, we use a 3Ware controller with anywhere from 2 to 16 individual hard drives attached. Although we already use 3Ware's tw_cli tool in both a daily cron job and a Nagios monitor, it is also a good idea to get a daily logwatch email stanza with drive statistics and alerts. Belts and suspenders, you know.
In addition to (passive) monitoring, we actively invoke Long Self-Tests on each drive. These tests are scheduled to hit low-usage times and to avoid our tape-backup times.
Install the (Gentoo) package:
hostname # emerge -av smartmontools
Start the smartd monitoring daemon automatically in the default runlevel:
hostname # rc-update add smartd default
Representative configuration file:
hostname # emacs -nw /etc/smartd.conf
# This file: /etc/smartd.conf
# created April 14, 2010, Gordon Pritchard
# Monitor 16 SATA disks connected to a 3ware 9650 controller which
# uses the (built-into-kernel) 3w-9xxx driver.
# The first two drives are WD 150GB Raptor, configured as RAID-1
#   long selftest takes 72min
# The remaining 14 drives are Seagate 500GB Barracuda, configured as RAID-5
#   long selftest takes 120min
# In the -s schedule, smartd numbers weekdays 1 (Monday) through 7 (Sunday).

/dev/twa0 \
  -s L/../../7/01 \   # Long Self-tests Sundays between 1-2:30am
  -d 3ware,0 \        # First physical drive, behind 3Ware controller
  -I 194 \            # Ignore temperature
  -I 9 \              # Ignore power-on hours
  -a                  # Report on:
                      #   SMART health status (-H)
                      #   usage failures (-f)
                      #   changes in Prefailure, Usage (-t)
                      #   increases in error (-l error)
                      #   increases in self-test errors (-l selftest)

/dev/twa0 \
  -s L/../../7/03 \   # Long Self-tests Sundays between 3-4:30am
  -d 3ware,1 \        # Second physical drive, behind 3Ware controller
  -I 194 \            # Ignore temperature
  -I 9 \              # Ignore power-on hours
  -a                  # Report on:
                      #   SMART health status (-H)
                      #   usage failures (-f)
                      #   changes in Prefailure, Usage (-t)
                      #   increases in error (-l error)
                      #   increases in self-test errors (-l selftest)

/dev/twa0 \
  -s L/../../7/22 \   # Long Self-tests Sundays between 10pm-midnight
  -d 3ware,2 \        # Third physical drive, behind 3Ware controller
  -I 194 \            # Ignore temperature
  -I 190 \            # Ignore Air Temperature
  -I 9 \              # Ignore power-on hours
  -a                  # Report on:
                      #   SMART health status (-H)
                      #   usage failures (-f)
                      #   changes in Prefailure, Usage (-t)
                      #   increases in error (-l error)
                      #   increases in self-test errors (-l selftest)

/dev/twa0 \
  -s L/../../1/01 \   # Long Self-tests Mondays between 1-3am
  -d 3ware,3 \        # Fourth physical drive, behind 3Ware controller
  -I 194 \            # Ignore temperature
  -I 190 \            # Ignore Air Temperature
  -I 9 \              # Ignore power-on hours
  -a                  # Report on:
                      #   SMART health status (-H)
                      #   usage failures (-f)
                      #   changes in Prefailure, Usage (-t)
                      #   increases in error (-l error)
                      #   increases in self-test errors (-l selftest)

/dev/twa0 \
  -s L/../../1/22 \   # Long Self-tests Mondays between 10pm-midnight
  -d 3ware,4 \        # Fifth physical drive, behind 3Ware controller
  -I 194 \            # Ignore temperature
  -I 190 \            # Ignore Air Temperature
  -I 9 \              # Ignore power-on hours
  -a                  # Report on:
                      #   SMART health status (-H)
                      #   usage failures (-f)
                      #   changes in Prefailure, Usage (-t)
                      #   increases in error (-l error)
                      #   increases in self-test errors (-l selftest)

/dev/twa0 \
  -s L/../../2/01 \   # Long Self-tests Tuesdays between 1-3am
  -d 3ware,5 \        # Sixth physical drive, behind 3Ware controller
  -I 194 \            # Ignore temperature
  -I 190 \            # Ignore Air Temperature
  -I 9 \              # Ignore power-on hours
  -a                  # Report on:
                      #   SMART health status (-H)
                      #   usage failures (-f)
                      #   changes in Prefailure, Usage (-t)
                      #   increases in error (-l error)
                      #   increases in self-test errors (-l selftest)

/dev/twa0 \
  -s L/../../2/22 \   # Long Self-tests Tuesdays between 10pm-midnight
  -d 3ware,6 \        # Seventh physical drive, behind 3Ware controller
  -I 194 \            # Ignore temperature
  -I 190 \            # Ignore Air Temperature
  -I 9 \              # Ignore power-on hours
  -a                  # Report on:
                      #   SMART health status (-H)
                      #   usage failures (-f)
                      #   changes in Prefailure, Usage (-t)
                      #   increases in error (-l error)
                      #   increases in self-test errors (-l selftest)

/dev/twa0 \
  -s L/../../3/01 \   # Long Self-tests Wednesdays between 1-3am
  -d 3ware,7 \        # Eighth physical drive, behind 3Ware controller
  -I 194 \            # Ignore temperature
  -I 190 \            # Ignore Air Temperature
  -I 9 \              # Ignore power-on hours
  -a                  # Report on:
                      #   SMART health status (-H)
                      #   usage failures (-f)
                      #   changes in Prefailure, Usage (-t)
                      #   increases in error (-l error)
                      #   increases in self-test errors (-l selftest)

/dev/twa0 \
  -s L/../../3/22 \   # Long Self-tests Wednesdays between 10pm-midnight
  -d 3ware,8 \        # Ninth physical drive, behind 3Ware controller
  -I 194 \            # Ignore temperature
  -I 190 \            # Ignore Air Temperature
  -I 9 \              # Ignore power-on hours
  -a                  # Report on:
                      #   SMART health status (-H)
                      #   usage failures (-f)
                      #   changes in Prefailure, Usage (-t)
                      #   increases in error (-l error)
                      #   increases in self-test errors (-l selftest)

/dev/twa0 \
  -s L/../../4/01 \   # Long Self-tests Thursdays between 1-3am
  -d 3ware,9 \        # Tenth physical drive, behind 3Ware controller
  -I 194 \            # Ignore temperature
  -I 190 \            # Ignore Air Temperature
  -I 9 \              # Ignore power-on hours
  -a                  # Report on:
                      #   SMART health status (-H)
                      #   usage failures (-f)
                      #   changes in Prefailure, Usage (-t)
                      #   increases in error (-l error)
                      #   increases in self-test errors (-l selftest)

/dev/twa0 \
  -s L/../../4/22 \   # Long Self-tests Thursdays between 10pm-midnight
  -d 3ware,10 \       # Eleventh physical drive, behind 3Ware controller
  -I 194 \            # Ignore temperature
  -I 190 \            # Ignore Air Temperature
  -I 9 \              # Ignore power-on hours
  -a                  # Report on:
                      #   SMART health status (-H)
                      #   usage failures (-f)
                      #   changes in Prefailure, Usage (-t)
                      #   increases in error (-l error)
                      #   increases in self-test errors (-l selftest)

/dev/twa0 \
  -s L/../../5/01 \   # Long Self-tests Fridays between 1-3am
  -d 3ware,11 \       # Twelfth physical drive, behind 3Ware controller
  -I 194 \            # Ignore temperature
  -I 190 \            # Ignore Air Temperature
  -I 9 \              # Ignore power-on hours
  -a                  # Report on:
                      #   SMART health status (-H)
                      #   usage failures (-f)
                      #   changes in Prefailure, Usage (-t)
                      #   increases in error (-l error)
                      #   increases in self-test errors (-l selftest)

/dev/twa0 \
  -s L/../../5/22 \   # Long Self-tests Fridays between 10pm-midnight
  -d 3ware,12 \       # Thirteenth physical drive, behind 3Ware controller
  -I 194 \            # Ignore temperature
  -I 190 \            # Ignore Air Temperature
  -I 9 \              # Ignore power-on hours
  -a                  # Report on:
                      #   SMART health status (-H)
                      #   usage failures (-f)
                      #   changes in Prefailure, Usage (-t)
                      #   increases in error (-l error)
                      #   increases in self-test errors (-l selftest)

/dev/twa0 \
  -s L/../../6/01 \   # Long Self-tests Saturdays between 1-3am
  -d 3ware,13 \       # Fourteenth physical drive, behind 3Ware controller
  -I 194 \            # Ignore temperature
  -I 190 \            # Ignore Air Temperature
  -I 9 \              # Ignore power-on hours
  -a                  # Report on:
                      #   SMART health status (-H)
                      #   usage failures (-f)
                      #   changes in Prefailure, Usage (-t)
                      #   increases in error (-l error)
                      #   increases in self-test errors (-l selftest)

/dev/twa0 \
  -s L/../../6/03 \   # Long Self-tests Saturdays between 3-5am
  -d 3ware,14 \       # Fifteenth physical drive, behind 3Ware controller
  -I 194 \            # Ignore temperature
  -I 190 \            # Ignore Air Temperature
  -I 9 \              # Ignore power-on hours
  -a                  # Report on:
                      #   SMART health status (-H)
                      #   usage failures (-f)
                      #   changes in Prefailure, Usage (-t)
                      #   increases in error (-l error)
                      #   increases in self-test errors (-l selftest)

/dev/twa0 \
  -s L/../../6/22 \   # Long Self-tests Saturdays between 10pm-midnight
  -d 3ware,15 \       # Sixteenth physical drive, behind 3Ware controller
  -I 194 \            # Ignore temperature
  -I 190 \            # Ignore Air Temperature
  -I 9 \              # Ignore power-on hours
  -a                  # Report on:
                      #   SMART health status (-H)
                      #   usage failures (-f)
                      #   changes in Prefailure, Usage (-t)
                      #   increases in error (-l error)
                      #   increases in self-test errors (-l selftest)
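The stanzas in the configuration file differ only in the 3ware port number and the self-test schedule, so they could be generated rather than typed by hand. The following is a minimal sketch, assuming a POSIX shell; the function name smartd_stanza and its argument order are hypothetical and not part of smartmontools. It emits only the directives common to every drive, so drive-specific lines such as -I 190 and the trailing comments would still be added by hand.

```shell
#!/bin/sh
# Hypothetical helper (not part of smartmontools): print one smartd.conf
# stanza for a drive behind /dev/twa0.
# Arguments: 3ware port number, weekday (1=Monday .. 7=Sunday), start hour (00-23).
smartd_stanza() {
    port=$1
    day=$2
    hour=$3
    # Each line except the last ends in a backslash so that smartd
    # treats the whole stanza as one continued entry.
    printf '/dev/twa0 \\\n'
    printf '  -s L/../../%s/%s \\\n' "$day" "$hour"
    printf '  -d 3ware,%s \\\n' "$port"
    printf '  -I 194 \\\n'
    printf '  -I 9 \\\n'
    printf '  -a\n'
}
```

For example, smartd_stanza 3 1 01 prints a stanza for the fourth physical drive with a long self-test starting in the 1am hour on Mondays; the output can be appended to a scratch file and reviewed before it is copied into /etc/smartd.conf.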
Available SMART parameters
In order to find what parameters are available for monitoring or ignoring, issue the smartctl command:
hostname # smartctl --all /dev/twa0 -d 3ware,0
(Of course, change the ,0 to the port number of any other 3Ware-connected drive you're interested in.)
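To look at every drive rather than one port at a time, the same command can be wrapped in a loop. This is a sketch only, assuming a POSIX shell and ports numbered from 0; the function name smart_report_all and the SMARTCTL override variable are hypothetical conveniences, not smartmontools features.

```shell
#!/bin/sh
# Hypothetical helper: run "smartctl --all" against each 3ware port in turn.
# Argument: highest port number to query (defaults to 15, i.e. 16 drives).
# SMARTCTL may be overridden (for example SMARTCTL=echo) for a dry run.
smart_report_all() {
    last_port=${1:-15}
    port=0
    while [ "$port" -le "$last_port" ]; do
        "${SMARTCTL:-smartctl}" --all /dev/twa0 -d 3ware,"$port"
        port=$((port + 1))
    done
}
```

Running smart_report_all 1 as root would query the first two drives; with SMARTCTL=echo set, it only prints the arguments that would be passed.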
How Long Does a (Long) Self-Test Take?
To learn how long you should allow for a Long SelfTest (typical values are 72min for a 150GB WD Raptor, and 150min for a 500GB WD RE2):
hostname # smartctl -c /dev/twa0 -d 3ware,0
(Of course, change the ,0 to the port number of any other 3Ware-connected drive you're interested in.)
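The capabilities report is long, and the figure of interest is the recommended polling time of the extended self-test. The filter below is a sketch for pulling out just that number; the function name long_test_minutes is hypothetical, and it assumes the report prints a line beginning "Extended self-test routine" followed by a "recommended polling time" line with the minutes in parentheses. If your smartctl version lays the report out differently, the pattern will need adjusting.

```shell
#!/bin/sh
# Hypothetical filter: read "smartctl -c" output on stdin and print the
# recommended polling time, in minutes, of the extended (long) self-test.
# Assumed layout:
#   Extended self-test routine
#   recommended polling time:   (  72) minutes.
long_test_minutes() {
    awk '
        /^Extended self-test routine/ { in_extended = 1; next }
        in_extended && /recommended polling time/ {
            gsub(/[^0-9]/, "")   # strip everything except the digits
            print
            exit
        }
    '
}
```

Typical use would be smartctl -c /dev/twa0 -d 3ware,0 | long_test_minutes, which helps when sizing the gap between scheduled self-tests.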