Tag Archives: ssd

SSD的监控

SSD现在越来越普及了,我们也开始比较大量的使用SSD用作数据库存储,涉及到一些性能方面的可以参考《4k Sector Size》,《What every programmer should know about solid-state drives》。下面主要讲讲SSD的监控。

我们使用的主要是Intel的SSD,服务器包括了HP、IBM、DELL等。使用的OS主要是OEL 5 和OEL 6,工具使用的是smartctl(The smartmontools package contains two utility programs (smartctl and smartd) to control and monitor storage systems using the Self-Monitoring, Analysis and Reporting Technology System (SMART) built into most modern ATA and SCSI harddisks. In many cases, these utilities will provide advanced warning of disk degradation and failure. )

smartmontools 尽量选取比较新的版本,因为有些老的版本对某些硬件或者指标不支持。我这里选用的是6.2的,同时我在不同的OS上编译了两个版本,然后就可以通过脚本自动scp到各台机器:
-rwxr-xr-x 1 mymonitor oinstall 1588025 Sep 12 17:03 smartctl5
-rwxr-xr-x 1 mymonitor oinstall 1744199 Sep 12 17:02 smartctl6

# ----------------------------------------------------------------------------------------
# Func : get_os_version
# ----------------------------------------------------------------------------------------
sub get_os_version() {
	
my $_ssh=shift;
 
	eval {
   		my $osversion= $_ssh->capture2({timeout => 5},'cat /etc/issue | grep release');
		if($_ssh->error){
			$logger->warning("get os version error");
			return -1;
		}else{
			
			if(index($osversion,'release 5')>-1){
				return 5;
			}elsif(index($osversion,'release 6')>-1){
				return 6;
			}else{
				return -1;
			}
		}
  }
  
}
# ----------------------------------------------------------------------------------------
# Func : scp_smartctl
# ----------------------------------------------------------------------------------------
sub scp_smartctl() {
	
my $_ssh=shift;
my $_os_version=shift;
my $_os_host=shift;
 
	eval {
	    if($_os_version==5){
	    	$_ssh->scp_put("$workdir/smartctl5", "$workdir/smartctl");
			if($_ssh->error){
				$logger->error("scp smartctl5 to $_os_host failed:$_ssh->error");
				exit;
			}else{
				$logger->info("scp smartctl5 to $_os_host success.");
				$_ssh->capture2({timeout => 5},"chmod +x $workdir/smartctl");
				if($_ssh->error){
					$logger->error("chmodx $workdir/smartctl on $_os_host failed:$_ssh->error");
					exit;						
				}else{
					$logger->info("chmodx $workdir/smartctl on $_os_host  success.");
				}
				
			}	    		
    	}elsif($_os_version==6){
	    	$_ssh->scp_put("$workdir/smartctl6", "$workdir/smartctl");
			if($_ssh->error){
				$logger->error("scp smartctl6 to $_os_host failed:$_ssh->error");
				exit;
			}else{
				$logger->info("scp smartctl6 to $_os_host success.");
				$_ssh->capture2({timeout => 5},"chmod +x $workdir/smartctl");
				if($_ssh->error){
					$logger->error("chmodx $workdir/smartctl on $_os_host failed:$_ssh->error");
					exit;						
				}else{
					$logger->info("chmodx $workdir/smartctl on $_os_host success.");
				}				
			}		    		
    	}else{
    		exit;
    	}
  	}
  
}

smartctl获取磁盘信息需要RAID卡的支持,各个厂商的实现不太一样,主要是HP的 device TYPE比较不一样:
IBM: smartctl -a -d megaraid,6 /dev/sdd
HP: smartctl -a -d cciss,6 /dev/sdd
DELL:smartctl -a -d megaraid,6 /dev/sdd
通过实验发现最后面要求的disk只要在本服务器上存在就可以了,不一定要跟device id一一匹配。

# ----------------------------------------------------------------------------------------
# Func : get_machine_type
# ----------------------------------------------------------------------------------------
sub get_machine_type() {
	
my $_ssh=shift;
 
	eval {
   		my $machine_type= $_ssh->capture2({timeout => 5},'sudo /usr/sbin/dmidecode -t1 | grep "Manufacturer"');
		if($_ssh->error){
			$logger->warning("get machine type error");
			return "Other";
		}else{
			$machine_type=uc($machine_type);
			if(index($machine_type,'DELL')>-1){
				return "DELL";
			}elsif(index($machine_type,'HP')>-1){
				return "HP";
			}elsif(index($machine_type,'IBM')>-1){
				return "IBM";
			}elsif(index($machine_type,'FUJITSU')>-1){
				return "FUJITSU";			
			}else{
				return "Other";
			}
		}
  }
  
}

上面用到的还有一个device id,如果是LSI的raid卡,先安装raid卡的工具程序,首先去lsi官网去拉个megacli-2.00.11-2.x86_64.rpm。参考
这个链接:http://blog.yufeng.info/archives/1096
我们采用的是一个比较暴力的办法:

# ----------------------------------------------------------------------------------------
# Func : get_ssd_deviceid
# ----------------------------------------------------------------------------------------
sub get_ssd_deviceid() {
	
my $_ssh=shift;
my $_machine_type=shift;

my $_devcount=0;
my $_i;
my $_msg;
my $_ssd_deviceid;
my $_parms_disk;
my $_raid_type;
 
	eval {
		if($_machine_type eq "HP"){
			$_raid_type="cciss";
			$_devcount=$_ssh->capture2('sudo df -h | grep "/dev/cciss/c0d0" | wc -l');
		
			if($_devcount>0){
				$_parms_disk="/dev/cciss/c0d0";
			}else{
				$_parms_disk="/dev/sda";
			}
		}else{
			$_raid_type="megaraid";
			$_parms_disk="/dev/sda";	
		}
		
		#try 	
		for($_i=1;$_i<=20 ; $_i++){
			$_msg=$_ssh->capture2("sudo $workdir/smartctl -a -d $_raid_type,$_i $_parms_disk | grep  'Solid State Device'");
			chomp($_msg);
			if(index($_msg,'Solid State Device')>-1){
				if(!(defined $_ssd_deviceid && "" ne $_ssd_deviceid)){
					$_ssd_deviceid= $_i ;
				}else{
					$_ssd_deviceid = $_ssd_deviceid . "|" . $_i;
				}						
			}
		}
					
		if(defined $_ssd_deviceid && "" ne $_ssd_deviceid){
			return $_parms_disk . "," . $_ssd_deviceid;
		}else{
			return $_parms_disk . "," ;
		}		

	}
  
}

最后就是监控指标,我们主要选取了这几个:Reallocated_Sector_Ct, Available_Reserver_Space, Media_Wearout_Indicator。指标的选择可以参考这个链接:http://www.hellodb.net/2011/04/ssd-media-wear.html

foreach my $id  (@resultids) {
	chomp($id);
	$logger->debug("$result[0] ssd check  deviceid:$id");
	if(defined $id && "" ne $id){
		$chk_ssd_msg=$cssh->capture2("sudo $workdir/smartctl -a -d $raid_type,$id $parms_disk | grep -E \"$result[12]\"");
		$logger->debug("$result[0] ssd check msg:$chk_ssd_msg");
	    foreach my $line (split /\n/,$chk_ssd_msg) { 
		if ($line =~ m{(\d+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)}i) { 
		    $attrname=$2;
		    $attrvalue=$4;
		    $attrthresholds=$tabhash{"$2"};
		    if ($ssdrpt==1){
			$logger->info("$result[0] ssd check: deviceid:$id;ATTRIBUTE_NAME:$attrname;VALUE:$attrvalue;Thresholds:$attrthresholds");	
		    }
		    
		    if($attrvalue <= $attrthresholds && $result[14]>0){
					&send_report("$result[0]:SSD Check warning. deviceid:$id; $attrname:$attrvalue");
			
		    }
		} 
	    }
	}
}