HOWTO Setup S.M.A.R.T. hard-drive Monitoring
Revision as of 18:31, 14 April 2010
Typically, we use a 3Ware controller with anywhere from 2 to 16 individual hard drives attached. Although we already use 3Ware's tw_cli tool in both a daily cron job and a Nagios monitor, it's also a good idea to get a daily logwatch email stanza with drive statistics and alerts. Belts and suspenders, you know.
In addition to (passive) monitoring, we actively invoke Long Self-Tests on each drive. These tests are staggered, one drive per time slot, so that they land in low-usage hours and stay clear of our tape-backup times.
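As a rough illustration of how this scheduling works: per the smartd.conf man page, the -s directive takes an extended regular expression that smartd compares against a 12-character time string of the form T/MM/DD/d/HH, where T is the test type (L for Long) and d is the day of the week, from 1 (Monday) to 7 (Sunday). The sketch below imitates that comparison with standard shell tools. The schedule_matches helper is hypothetical (it is not part of smartmontools) and only approximates what smartd does internally.

```shell
#!/bin/sh
# Sketch only: imitate smartd's "-s REGEXP" time matching.
# Time string layout: T/MM/DD/d/HH, with d = 1 (Monday) .. 7 (Sunday).

# schedule_matches PATTERN TIMESTRING
# Succeeds when the whole time string matches the extended regex.
schedule_matches() {
    pattern=$1
    timestring=$2
    # -E: extended regex, -x: entire line must match, -q: no output
    printf '%s\n' "$timestring" | grep -Eqx -- "$pattern"
}

# Build the time string for the current hour, for a Long test.
# date's %u yields 1..7 with Monday = 1, the same numbering smartd uses.
now=$(date '+L/%m/%d/%u/%H')

if schedule_matches 'L/../../7/01' "$now"; then
    echo "pattern matches $now: a Long Self-test would be started"
else
    echo "pattern does not match $now"
fi
```

Note that a day-of-week field of 0 can never match under this numbering, since the time string only ever carries 1 through 7 in that position.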
Install the (Gentoo) package:
hostname # emerge -av smartmontools
Start the smartd monitoring daemon automatically in the default runlevel:
hostname # rc-update add smartd default
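Before relying on the daemon, it may be worth confirming that each drive answers SMART queries through the controller. The sketch below only prints one smartctl health query per port, so the commands can be reviewed first. The CONTROLLER and PORTS variables and the smartctl_checks function are names made up for this example, with defaults assuming the 16-drive /dev/twa0 setup described on this page.

```shell
#!/bin/sh
# Sketch only: print (not run) one smartctl health query per 3ware port.
# CONTROLLER and PORTS are illustrative assumptions; adjust to your system.
CONTROLLER=${CONTROLLER:-/dev/twa0}
PORTS=${PORTS:-16}

smartctl_checks() {
    port=0
    while [ "$port" -lt "$PORTS" ]; do
        # -H: overall health status
        # -d 3ware,N: address physical drive N behind the controller
        echo "smartctl -H -d 3ware,$port $CONTROLLER"
        port=$((port + 1))
    done
}

smartctl_checks
```

To actually run the queries, the output can be piped to sh as root once it looks right.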
Representative configuration file:
hostname # emacs -nw /etc/smartd.conf
# This file:  /etc/smartd.conf
# created April 14, 2010, Gordon Pritchard
# Monitor 16 SATA disks connected to a 3ware 9650 controller which
# uses the (built-into-kernel) 3w-9xxx driver.
/dev/twa0 \
     -s L/../../7/01 \    # Long Self-tests Sundays between 1-2am (day-of-week: 1=Mon .. 7=Sun)
     -d 3ware,0 \         # First physical drive, behind 3Ware controller
     -I 194 \             # Ignore temperature
     -I 9 \               # Ignore power-on hours
     -a                   # Report on:
                          #  SMART health status (-H)
                          #  usage failures (-f)
                          #  changes in Prefailure, Usage (-t)
                          #  increases in error (-l error)
                          #  increases in self-test errors (-l selftest)
/dev/twa0 \
     -s L/../../7/03 \    # Long Self-tests Sundays between 3-4am
     -d 3ware,1 \         # Second physical drive, behind 3Ware controller
     -I 194 \             # Ignore temperature
     -I 9 \               # Ignore power-on hours
     -a                   # Report on:
                          #  SMART health status (-H)
                          #  usage failures (-f)
                          #  changes in Prefailure, Usage (-t)
                          #  increases in error (-l error)
                          #  increases in self-test errors (-l selftest)
/dev/twa0 \
     -s L/../../7/23 \    # Long Self-tests Sundays between 11pm-midnight
     -d 3ware,2 \         # Third physical drive, behind 3Ware controller
     -I 194 \             # Ignore temperature
     -I 9 \               # Ignore power-on hours
     -a                   # Report on:
                          #  SMART health status (-H)
                          #  usage failures (-f)
                          #  changes in Prefailure, Usage (-t)
                          #  increases in error (-l error)
                          #  increases in self-test errors (-l selftest)
/dev/twa0 \
     -s L/../../1/01 \    # Long Self-tests Mondays between 1-2am
     -d 3ware,3 \         # Fourth physical drive, behind 3Ware controller
     -I 194 \             # Ignore temperature
     -I 9 \               # Ignore power-on hours
     -a                   # Report on:
                          #  SMART health status (-H)
                          #  usage failures (-f)
                          #  changes in Prefailure, Usage (-t)
                          #  increases in error (-l error)
                          #  increases in self-test errors (-l selftest)
/dev/twa0 \
     -s L/../../1/23 \    # Long Self-tests Mondays between 11pm-midnight
     -d 3ware,4 \         # Fifth physical drive, behind 3Ware controller
     -I 194 \             # Ignore temperature
     -I 9 \               # Ignore power-on hours
     -a                   # Report on:
                          #  SMART health status (-H)
                          #  usage failures (-f)
                          #  changes in Prefailure, Usage (-t)
                          #  increases in error (-l error)
                          #  increases in self-test errors (-l selftest)
/dev/twa0 \
     -s L/../../2/01 \    # Long Self-tests Tuesdays between 1-2am
     -d 3ware,5 \         # Sixth physical drive, behind 3Ware controller
     -I 194 \             # Ignore temperature
     -I 9 \               # Ignore power-on hours
     -a                   # Report on:
                          #  SMART health status (-H)
                          #  usage failures (-f)
                          #  changes in Prefailure, Usage (-t)
                          #  increases in error (-l error)
                          #  increases in self-test errors (-l selftest)
/dev/twa0 \
     -s L/../../2/23 \    # Long Self-tests Tuesdays between 11pm-midnight
     -d 3ware,6 \         # Seventh physical drive, behind 3Ware controller
     -I 194 \             # Ignore temperature
     -I 9 \               # Ignore power-on hours
     -a                   # Report on:
                          #  SMART health status (-H)
                          #  usage failures (-f)
                          #  changes in Prefailure, Usage (-t)
                          #  increases in error (-l error)
                          #  increases in self-test errors (-l selftest)
/dev/twa0 \
     -s L/../../3/01 \    # Long Self-tests Wednesdays between 1-2am
     -d 3ware,7 \         # Eighth physical drive, behind 3Ware controller
     -I 194 \             # Ignore temperature
     -I 9 \               # Ignore power-on hours
     -a                   # Report on:
                          #  SMART health status (-H)
                          #  usage failures (-f)
                          #  changes in Prefailure, Usage (-t)
                          #  increases in error (-l error)
                          #  increases in self-test errors (-l selftest)
/dev/twa0 \
     -s L/../../3/23 \    # Long Self-tests Wednesdays between 11pm-midnight
     -d 3ware,8 \         # Ninth physical drive, behind 3Ware controller
     -I 194 \             # Ignore temperature
     -I 9 \               # Ignore power-on hours
     -a                   # Report on:
                          #  SMART health status (-H)
                          #  usage failures (-f)
                          #  changes in Prefailure, Usage (-t)
                          #  increases in error (-l error)
                          #  increases in self-test errors (-l selftest)
/dev/twa0 \
     -s L/../../4/01 \    # Long Self-tests Thursdays between 1-2am
     -d 3ware,9 \         # Tenth physical drive, behind 3Ware controller
     -I 194 \             # Ignore temperature
     -I 9 \               # Ignore power-on hours
     -a                   # Report on:
                          #  SMART health status (-H)
                          #  usage failures (-f)
                          #  changes in Prefailure, Usage (-t)
                          #  increases in error (-l error)
                          #  increases in self-test errors (-l selftest)
/dev/twa0 \
     -s L/../../4/23 \    # Long Self-tests Thursdays between 11pm-midnight
     -d 3ware,10 \        # Eleventh physical drive, behind 3Ware controller
     -I 194 \             # Ignore temperature
     -I 9 \               # Ignore power-on hours
     -a                   # Report on:
                          #  SMART health status (-H)
                          #  usage failures (-f)
                          #  changes in Prefailure, Usage (-t)
                          #  increases in error (-l error)
                          #  increases in self-test errors (-l selftest)
/dev/twa0 \
     -s L/../../5/01 \    # Long Self-tests Fridays between 1-2am
     -d 3ware,11 \        # Twelfth physical drive, behind 3Ware controller
     -I 194 \             # Ignore temperature
     -I 9 \               # Ignore power-on hours
     -a                   # Report on:
                          #  SMART health status (-H)
                          #  usage failures (-f)
                          #  changes in Prefailure, Usage (-t)
                          #  increases in error (-l error)
                          #  increases in self-test errors (-l selftest)
/dev/twa0 \
     -s L/../../5/23 \    # Long Self-tests Fridays between 11pm-midnight
     -d 3ware,12 \        # Thirteenth physical drive, behind 3Ware controller
     -I 194 \             # Ignore temperature
     -I 9 \               # Ignore power-on hours
     -a                   # Report on:
                          #  SMART health status (-H)
                          #  usage failures (-f)
                          #  changes in Prefailure, Usage (-t)
                          #  increases in error (-l error)
                          #  increases in self-test errors (-l selftest)
/dev/twa0 \
     -s L/../../6/01 \    # Long Self-tests Saturdays between 1-2am
     -d 3ware,13 \        # Fourteenth physical drive, behind 3Ware controller
     -I 194 \             # Ignore temperature
     -I 9 \               # Ignore power-on hours
     -a                   # Report on:
                          #  SMART health status (-H)
                          #  usage failures (-f)
                          #  changes in Prefailure, Usage (-t)
                          #  increases in error (-l error)
                          #  increases in self-test errors (-l selftest)
/dev/twa0 \
     -s L/../../6/03 \    # Long Self-tests Saturdays between 3-4am
     -d 3ware,14 \        # Fifteenth physical drive, behind 3Ware controller
     -I 194 \             # Ignore temperature
     -I 9 \               # Ignore power-on hours
     -a                   # Report on:
                          #  SMART health status (-H)
                          #  usage failures (-f)
                          #  changes in Prefailure, Usage (-t)
                          #  increases in error (-l error)
                          #  increases in self-test errors (-l selftest)
/dev/twa0 \
     -s L/../../6/23 \    # Long Self-tests Saturdays between 11pm-midnight
     -d 3ware,15 \        # Sixteenth physical drive, behind 3Ware controller
     -I 194 \             # Ignore temperature
     -I 9 \               # Ignore power-on hours
     -a                   # Report on:
                          #  SMART health status (-H)
                          #  usage failures (-f)
                          #  changes in Prefailure, Usage (-t)
                          #  increases in error (-l error)
                          #  increases in self-test errors (-l selftest)
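Since the sixteen stanzas differ only in port number and time slot, they could also be generated rather than typed by hand. The following is a minimal sketch under that assumption, not a replacement for reviewing the file. The SLOTS list and the generate_stanzas function are names invented for this example. Each slot is a day/hour pair in port order, using the smartd.conf man page's day numbering (1 = Monday through 7 = Sunday), and each stanza is written on one line without the explanatory comments.

```shell
#!/bin/sh
# Sketch only: emit one smartd.conf line per drive behind /dev/twa0.
# SLOTS holds "day/hour" pairs in port order (day 1 = Monday .. 7 = Sunday),
# following the staggered schedule used on this page.
SLOTS="7/01 7/03 7/23 1/01 1/23 2/01 2/23 3/01 3/23 4/01 4/23 5/01 5/23 6/01 6/03 6/23"

generate_stanzas() {
    port=0
    for slot in $SLOTS; do
        # -d 3ware,N: physical drive N; -s: Long Self-test schedule
        # -I 194 / -I 9: ignore temperature and power-on hours; -a: defaults
        printf '/dev/twa0 -d 3ware,%d -s L/../../%s -I 194 -I 9 -a\n' "$port" "$slot"
        port=$((port + 1))
    done
}

generate_stanzas
```

The output can be redirected to a scratch file and compared against /etc/smartd.conf before anything is replaced.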