Linux Monitor / Healthcheck script September 3, 2009
Posted by installatore in Miscellaneous, linux, networking, script, solaris. Tags: linux, sys, monitor, admin, check, health, sysadmin
In this article I describe how to use Linux shell commands to build a script, to be scheduled later with cron, that monitors the health of our servers. The script is deliberately simple and can certainly be improved.
First, I would check that the server's filesystems are not full.
[root@rac2 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
14G 13G 775M 95% /
/dev/sda1 99M 19M 76M 20% /boot
tmpfs 502M 300M 202M 60% /dev/shm
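The disk check above can be turned into an automatic test. The following is a minimal sketch, not part of the original script: the function name full_filesystems and the 90% threshold are assumptions chosen for illustration. It expects `df -P` output on standard input, because the POSIX format keeps each filesystem on a single line even when the device name is long.

```shell
#!/bin/sh
# Hypothetical helper (not in the original script): print every filesystem
# whose usage is at or above the given percentage.
# Input: output of `df -P` on standard input.
full_filesystems() {
    threshold=$1
    awk -v limit="$threshold" '
        NR > 1 {                      # skip the header line
            use = $5
            sub(/%/, "", use)         # "95%" -> "95"
            if (use + 0 >= limit + 0)
                print $6 " is at " $5
        }'
}

# Example call on a live system (90 is an arbitrary threshold):
# df -P | full_filesystems 90
```

Mount points that contain spaces would confuse the field splitting, so treat this as a starting point only.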
We also check the most recent logins on the machine; as the saying goes, you can never be too careful.
[root@rac2 ~]# last | head -20
root pts/3 192.168.0.8 Tue Aug 18 09:20 still logged in
oracle pts/3 :0.0 Tue Aug 18 09:17 - 09:20 (00:02)
oracle pts/3 :0.0 Tue Aug 18 08:19 - 09:16 (00:56)
oracle pts/2 :0.0 Tue Aug 18 07:08 still logged in
oracle :0 Tue Aug 18 07:08 still logged in
oracle :0 Tue Aug 18 07:08 - 07:08 (00:00)
root pts/1 openfiler1 Tue Aug 18 07:04 still logged in
reboot system boot 2.6.18-128.4.1.e Tue Aug 18 06:57 (02:25)
root pts/1 :0.0 Tue Aug 18 06:44 - down (00:11)
root pts/2 openfiler1 Tue Aug 18 06:22 - down (00:32)
root pts/1 :0.0 Tue Aug 18 06:15 - 06:43 (00:28)
root :0 Tue Aug 18 06:14 - down (00:40)
root :0 Tue Aug 18 06:14 - 06:14 (00:00)
reboot system boot 2.6.18-128.4.1.e Tue Aug 18 06:10 (00:45)
root pts/1 :0.0 Tue Aug 18 05:49 - down (00:19)
root pts/1 :0.0 Tue Aug 18 05:47 - 05:48 (00:00)
root pts/1 :0.0 Sun Aug 16 16:36 - 05:46 (1+13:10)
root pts/2 :0.0 Sun Aug 16 16:26 - down (1+13:41)
root pts/1 :0.0 Sun Aug 16 16:24 - 16:36 (00:12)
root :0 Sun Aug 16 16:23 - down (1+13:45)
Another thing to check, especially on Internet-facing systems, is the size of the temporary directory and of the files in it, since rootkits and other malware often hide there. In my case the temporary directory was /tmp.
[root@rac2 ~]# du -h /tmp
8.0K /tmp/gconfd-root
4.0K /tmp/keyring-JuVYH2
4.0K /tmp/.oracle
16K /tmp
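Size alone says little about what is sitting in the temporary directory. As a complementary check, here is a small sketch that lists regular files carrying the owner-execute bit, which normally have little reason to be in /tmp. The function name executables_in is an assumption of mine, and this is only one possible heuristic, not a malware scanner.

```shell
#!/bin/sh
# Hypothetical helper (not in the original script): list regular files under
# the given directory that have the owner-execute permission bit set.
executables_in() {
    # -perm -100 matches files whose mode includes the owner-execute bit;
    # errors from unreadable subdirectories are discarded.
    find "$1" -type f -perm -100 2>/dev/null
}

# Example call on a live system:
# executables_in /tmp
```

An empty result is the expected outcome on a healthy machine; anything listed deserves a closer look.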
We also check the state of the motherboard hardware by grepping for every entry that contains State or Status.
[root@rac2 ~]# dmidecode | grep -B 2 Stat
Serial Number: …..
Asset Tag:
Boot-up State: Safe
Power Supply State: Safe
Thermal State: Safe
Security Status: None
--
Max Speed: 5200 MHz
Current Speed: 2400 MHz
Status: Populated, Enabled
--
On Board Device Information
Type: Ethernet
Status: Enabled
--
On Board Device Information
Type: Sound
Status: Enabled
--
On Board Device Information
Type: Other
Status: Enabled
--
Access Method: Memory-mapped physical 32-bit address
Access Address: 0xFFF81000
Status: Valid, Not Full
--
Handle 0x1800, DMI type 24, 5 bytes.
Hardware Security
Power-On Password Status: Enabled
Keyboard Password Status: Not Implemented
Administrator Password Status: Enabled
Front Panel Reset Status: Not Implemented
--
Cooling Device
Type: Fan
Status: OK
--
Cooling Device
Type: Fan
Status: OK
--
Cooling Device
Type: Fan
Status: OK
--
Handle 0x2000, DMI type 32, 11 bytes.
System Boot Information
Status: No errors detected
Next, we look at dropped packets and at transmit and receive errors on the network interfaces. As a rule of thumb, these values should not grow dramatically over the course of a day.
[root@rac2 ~]# ifconfig
eth4 Link encap:Ethernet HWaddr 00:0C:29:2D:6F:3E
inet addr:192.168.0.102 Bcast:192.168.0.255 Mask:255.255.255.0
inet6 addr: fe80::20c:29ff:fe2d:6f3e/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:2814028 errors:0 dropped:0 overruns:0 frame:0
TX packets:383162 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2263618394 (2.1 GiB) TX bytes: 372588412946 (347 GiB)
Base address:0x2000 Memory:d8940000-d8960000
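To watch these counters grow from one day to the next, it helps to have them in a form that is easy to compare. The sketch below assumes a Linux kernel that exposes per-interface counters under /sys/class/net/<interface>/statistics; the function name net_counters and its optional second argument are my own additions for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not in the original script): print the error and drop
# counters of one network interface, one "name=value" pair per line.
# $1 = interface name, $2 = sysfs base directory (defaults to /sys/class/net;
#      it is a parameter only so the function can be tried on a fake tree).
net_counters() {
    iface=$1
    base=${2:-/sys/class/net}
    for counter in rx_errors tx_errors rx_dropped tx_dropped; do
        file="$base/$iface/statistics/$counter"
        # Counters that are missing or unreadable are silently skipped.
        [ -r "$file" ] && printf '%s %s=%s\n' "$iface" "$counter" "$(cat "$file")"
    done
    return 0
}

# Example call on a live system:
# net_counters eth0
```

Saving this output once a day and comparing it with the previous day's file is one simple way to spot counters that are climbing.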
Besides ifconfig, we can use ethtool to check the "physical" side of the network link.
[root@rac2 ~]# ethtool eth0
Settings for eth0:
Supported ports: [ TP ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Supports auto-negotiation: Yes
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Advertised auto-negotiation: Yes
Speed: 1000Mb/s
Duplex: Full
Port: Twisted Pair
PHYAD: 0
Transceiver: internal
Auto-negotiation: on
Supports Wake-on: d
Wake-on: d
Current message level: 0x00000007 (7)
Link detected: yes
With lm_sensors we can read the voltages and temperatures of some components on the machine's motherboard. If you have not done so already, install lm_sensors, run sensors-detect, and follow the guided procedure for the initial configuration.
[root@rac2]# sensors
lm85b-i2c-0-2e
Adapter: SMBus I801 adapter at c400
V1.5: +1.47 V (min = +1.42 V, max = +1.58 V)
VCore: +1.49 V (min = +1.45 V, max = +1.60 V)
V3.3: +3.33 V (min = +3.13 V, max = +3.47 V)
V5: +5.03 V (min = +4.74 V, max = +5.26 V)
V12: +12.25 V (min = +11.38 V, max = +12.62 V)
CPU_Fan: 2386 RPM (min = 4000 RPM) ALARM
fan2: 0 RPM (min = 0 RPM)
fan3: 0 RPM (min = 0 RPM)
fan4: 300 RPM (min = 0 RPM)
CPU: +29°C (low = +10°C, high = +50°C)
Board: +29°C (low = +10°C, high = +35°C)
Remote: +28°C (low = +10°C, high = +35°C)
CPU_PWM: 255
Fan2_PWM: 255
Fan3_PWM: 77
vid: +1.525 V (VRM Version 9.0)
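In the output above, the CPU_Fan reading is flagged with ALARM because it is below its configured minimum. A short sketch that pulls out only such lines follows; the function name sensor_alarms is hypothetical, and it assumes that out-of-range readings are marked with the word ALARM as in this sample.

```shell
#!/bin/sh
# Hypothetical helper (not in the original script): keep only the sensor
# readings flagged with ALARM. Input: output of `sensors` on standard input.
sensor_alarms() {
    # grep exits non-zero when nothing matches; "|| true" keeps the
    # function from reporting that as a failure.
    grep 'ALARM' || true
}

# Example call on a live system:
# sensors | sensor_alarms
```

Placing the alarm lines at the top of the daily report makes them harder to miss than when they are buried in the full listing.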
We should also check dmesg for hardware errors on the disks. dmesg reports errors for other devices as well; below I filter it with grep for the disks, and you can of course adapt the grep pattern as you see fit.
[root@rac2 ~]# dmesg | grep sda
SCSI device sda: 33554432 512-byte hdwr sectors (17180 MB)
sda: Write Protect is off
sda: Mode Sense: 5d 00 00 00
sda: cache data unavailable
sda: assuming drive cache: write through
SCSI device sda: 33554432 512-byte hdwr sectors (17180 MB)
sda: Write Protect is off
sda: Mode Sense: 5d 00 00 00
sda: cache data unavailable
sda: assuming drive cache: write through
sda: sda1 sda2
sd 0:0:0:0: Attached scsi disk sda
EXT3 FS on sda1, internal journal
Then we check the number of "zombie" processes and which processes are using the most RAM and CPU, just to make sure nothing is wrong (the zombie count should never be high).
[root@rac2]# top -bn 2 >> /tmp/top.txt
top - 09:52:46 up 130 days, 5 users, load average: 0.13, 0.21, 0.27
Tasks: 159 total, 3 running, 156 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 1.0%sy, 0.0%ni, 95.0%id, 3.6%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 1027004k total, 936852k used, 90152k free, 20292k buffers
Swap: 2064376k total, 71652k used, 1992724k free, 681484k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1314 root 15 0 12740 1108 804 R 1.0 0.1 0:00.16 top
1274 root 15 0 88948 3328 2588 R 0.3 0.3 0:00.27 sshd
1 root 15 0 10344 508 476 S 0.0 0.0 0:01.22 init
2 root RT -5 0 0 0 S 0.0 0.0 0:00.00 migration/0
3 root 34 19 0 0 0 S 0.0 0.0 0:00.05 ksoftirqd/0
4 root RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/0
5 root 10 -5 0 0 0 S 0.0 0.0 0:05.58 events/0
6 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 khelper
15 root 10 -5 0 0 0 S 0.0 0.0 0:00.01 kthread
19 root 10 -5 0 0 0 S 0.0 0.0 0:08.29 kblockd/0
20 root 20 -5 0 0 0 S 0.0 0.0 0:00.00 kacpid
205 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 cqueue/0
208 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 khubd
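The zombie count can also be obtained as a single number, which is easier to compare against a limit than the full top listing. This sketch assumes the procps version of ps found on Linux, where `ps -eo stat=` prints one process state per line without a header; the function name count_zombies is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not in the original script): count the process states
# that start with "Z" (zombie). Input: one state string per line, as printed
# by `ps -eo stat=`.
count_zombies() {
    awk '$1 ~ /^Z/ { n++ } END { print n + 0 }'
}

# Example call on a live system:
# ps -eo stat= | count_zombies
```

What counts as "too many" depends on the machine, so any alert threshold built on top of this is a local choice.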
Now that every command has been explained, we can build a script to run every day from cron, so that it also emails us a report with the output of all the commands.
#!/bin/sh
# monitor script linux v1
uname -a > /tmp/mhc.txt
echo "----------------------------" >> /tmp/mhc.txt
df -h >> /tmp/mhc.txt
last | head -10 >> /tmp/mhc.txt
echo "----------------------------" >> /tmp/mhc.txt
du -h /tmp >> /tmp/mhc.txt
echo "----------------------------" >> /tmp/mhc.txt
ethtool eth0 >> /tmp/mhc.txt
echo "----------------------------" >> /tmp/mhc.txt
ifconfig eth0 >> /tmp/mhc.txt
echo "----------------------------" >> /tmp/mhc.txt
dmidecode | grep -B 2 Stat >> /tmp/mhc.txt
echo "----------------------------" >> /tmp/mhc.txt
sensors >> /tmp/mhc.txt
echo "----------------------------" >> /tmp/mhc.txt
top -bn 2 >> /tmp/mhc.txt
echo "----------------------------" >> /tmp/mhc.txt
smartctl -d ata -iH /dev/sda >> /tmp/mhc.txt
smartctl -d ata -iH /dev/sdb >> /tmp/mhc.txt
echo "----------------------------" >> /tmp/mhc.txt
dmesg | grep sda >> /tmp/mhc.txt
dmesg | grep sdb >> /tmp/mhc.txt
echo "----------------------------" >> /tmp/mhc.txt
cat /tmp/mhc.txt | mail -s server-health-check your@email.com
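One possible refinement of the script, offered as a sketch rather than as the author's method, is to wrap the repeated "run command, append separator" pattern in a function. The function name section and the REPORT variable are my own; mktemp is assumed to be available and replaces the fixed /tmp/mhc.txt path with an unpredictable file name.

```shell
#!/bin/sh
# Hypothetical helper (not in the original script): run one command and
# append its output, followed by a separator line, to the file in $REPORT.
section() {
    {
        if command -v "$1" >/dev/null 2>&1; then
            "$@" 2>&1                     # keep error messages in the report
        else
            echo "$1: command not found, skipped"
        fi
        echo "----------------------------"
    } >> "$REPORT"
}

REPORT=$(mktemp)    # assumption: mktemp exists on the target system
section uname -a
section df -h
# ...one `section` call per remaining check, then send "$REPORT" by mail
# exactly as the original script does, and remove the file afterwards.
```

Checks that rely on a pipeline, such as `last | head -10`, cannot be passed to section as they are and would need their own small wrapper. To run the finished script daily, a crontab line such as `0 7 * * * /usr/local/sbin/mhc.sh` would do, where both the time and the path are placeholders.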