Bug #1321
Closed
Opium: frequent backtraces in the syslogs and SMART state of one of the disks
Start date:
29/06/2013
Due date:
% done:
100%
Estimated time:
Difficulty:
2 Easy
Description
More and more often, syslog broadcasts messages like the following to the terminals via wall.
They always point to "/sys/devices/virtual/block/vroot7/stat":
kernel:[117500.193975] general protection fault: 0000 [#301] SMP
Message from syslogd@opium at Jun 29 00:21:24 ...
kernel:[117500.194060] last sysfs file: /sys/devices/virtual/block/vroot7/stat
Message from syslogd@opium at Jun 29 00:21:24 ...
kernel:[117500.197068] Stack:
Message from syslogd@opium at Jun 29 00:21:24 ...
kernel:[117500.197570] Call Trace:
Message from syslogd@opium at Jun 29 00:21:24 ...
kernel:[117500.197947] Code: fa 66 66 90 66 66 90 65 8b 04 25 a8 e3 00 00 48 98 49 8b 94 c4 f0 02 00 00 8b 4a 18 89 4c 24 14 48 8b 1a 48 85 db 74 0c 8b 42 14 <48> 8b 04 c3 48 89 02 eb 19 48 8b 4c 24 08 49 89 d0 44 89 ee 83
In the syslogs, we find messages like this one:
[117727.112851] general protection fault: 0000 [#377] SMP
[117727.112935] last sysfs file: /sys/devices/virtual/block/vroot7/stat
[117727.112966] CPU 2
[117727.113020] Modules linked in: cryptd aes_x86_64 aes_generic xts gf128mul ses enclosure usb_storage tun ipt_LOG xt_tcpudp ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables ext4 jbd2 crc16 ext2 dm_crypt firewire_sbp2 loop snd_hda_codec_atihdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcsp asus_atk0110 parport_pc snd_pcm edac_core i2c_piix4 parport snd_timer edac_mce_amd processor i2c_core shpchp button evdev snd soundcore snd_page_alloc pci_hotplug ext3 jbd mbcache dm_mod raid456 md_mod async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx ide_cd_mod sd_mod cdrom crc_t10dif ide_pci_generic ohci_hcd ata_generic ahci thermal r8169 mii atiixp ehci_hcd ide_core firewire_ohci usbcore firewire_core nls_base crc_itu_t floppy thermal_sys libata scsi_mod [last unloaded: scsi_wait_scan]
[117727.115436] Pid: 11837, comm: munin-node Tainted: G D W 2.6.32-bpo.3-vserver-amd64 #1 System Product Name
[117727.115470] RIP: 0010:[<ffffffff810f0f6c>] [<ffffffff810f0f6c>] __kmalloc+0xd2/0x141
[117727.115533] RSP: 0018:ffff88008b103b88 EFLAGS: 00010082
[117727.115564] RAX: 0000000000000000 RBX: 940f003e8348c031 RCX: 0000000000000020
[117727.115597] RDX: ffff880005111f00 RSI: 00000000000000d0 RDI: ffffffff811308e1
[117727.115630] RBP: 0000000000000246 R08: 0000000000000000 R09: 0000000000000000
[117727.115663] R10: 00000000000001c0 R11: ffff88008b103b38 R12: ffffffff8144c4f0
[117727.115696] R13: 00000000000000d0 R14: 00000000000000d0 R15: 000000000000001c
[117727.115730] FS: 00007ff34f358700(0000) GS:ffff880005100000(0000) knlGS:0000000000000000
[117727.115763] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[117727.115794] CR2: 00007ff34e3d8050 CR3: 0000000110aee000 CR4: 00000000000006e0
[117727.115828] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[117727.115861] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[117727.115895] Process munin-node (pid: 11837, threadinfo ffff88008b102000, task ffff88011c2f1590)
[117727.115928] Stack:
[117727.115957] 00000000000001c0 ffffffff811308e1 00000020811306f0 ffff880052445238
[117727.116068] <0> 00000000000001c0 00000000fffffff8 ffffffff811306f0 00000000fffffff4
[117727.116232] <0> 0000000000000001 ffffffff811308e1 ffff88010dc17d00 0000000000000080
[117727.116424] Call Trace:
[117727.116456] [<ffffffff811308e1>] ? load_elf_binary+0x1f1/0x1958
[117727.116488] [<ffffffff811306f0>] ? load_elf_binary+0x0/0x1958
[117727.116519] [<ffffffff811308e1>] ? load_elf_binary+0x1f1/0x1958
[117727.116551] [<ffffffff810f7cd9>] ? do_sync_read+0xce/0x113
[117727.116583] [<ffffffff812fdc17>] ? __down_read+0x15/0xab
[117727.116615] [<ffffffff81065a0e>] ? autoremove_wake_function+0x0/0x2e
[117727.116648] [<ffffffff811306f0>] ? load_elf_binary+0x0/0x1958
[117727.116680] [<ffffffff810fc5a6>] ? search_binary_handler+0xb4/0x245
[117727.116712] [<ffffffff8112f0fc>] ? load_script+0x0/0x1ec
[117727.116743] [<ffffffff8112f2bd>] ? load_script+0x1c1/0x1ec
[117727.116775] [<ffffffff810fc1ba>] ? get_arg_page+0x4b/0xa4
[117727.116807] [<ffffffff810fc5a6>] ? search_binary_handler+0xb4/0x245
[117727.116815] [<ffffffff810fda14>] ? do_execve+0x1e8/0x2dc
[117727.116815] [<ffffffff8100f4eb>] ? sys_execve+0x35/0x4c
[117727.116815] [<ffffffff81010f9a>] ? stub_execve+0x6a/0xc0
[117727.116815] Code: fa 66 66 90 66 66 90 65 8b 04 25 a8 e3 00 00 48 98 49 8b 94 c4 f0 02 00 00 8b 4a 18 89 4c 24 14 48 8b 1a 48 85 db 74 0c 8b 42 14 <48> 8b 04 c3 48 89 02 eb 19 48 8b 4c 24 08 49 89 d0 44 89 ee 83
[117727.116815] RIP [<ffffffff810f0f6c>] __kmalloc+0xd2/0x141
[117727.116815] RSP <ffff88008b103b88>
[117727.116815] ---[ end trace 721a8e84d9e66c0b ]---
The processes named in these traces are a mixed bag: munin-node (many times), nrpe (twice), bash (once), ntp_kernel_pll_...
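To put numbers on that tally, the process names can be counted from the kernel log. The following is a minimal Python sketch, assuming each oops carries a "Pid: ..., comm: ..." line like the one in the trace above; the function name `count_faulting_processes` and the second and third sample lines are made up for illustration.

```python
import re
from collections import Counter

# In the trace above, each oops names the current task on a line of the form
# "Pid: <pid>, comm: <name> ...". Other kernel versions may format this differently.
COMM_RE = re.compile(r"Pid: \d+, comm: (\S+)")


def count_faulting_processes(log_text):
    """Return a Counter mapping process name to the number of oops lines naming it."""
    return Counter(COMM_RE.findall(log_text))


# The first line is copied from the trace above; the other two are hypothetical
# illustrations (invented PIDs and timestamps), not real log entries.
sample_log = "\n".join([
    "[117727.115436] Pid: 11837, comm: munin-node Tainted: G D W 2.6.32-bpo.3-vserver-amd64 #1 System Product Name",
    "[000001.000000] Pid: 100, comm: nrpe Tainted: G D W 2.6.32-bpo.3-vserver-amd64 #1 System Product Name",
    "[000002.000000] Pid: 101, comm: munin-node Tainted: G D W 2.6.32-bpo.3-vserver-amd64 #1 System Product Name",
])

for name, hits in count_faulting_processes(sample_log).most_common():
    print(name, hits)
```

On the host itself, the same function could be fed the contents of the kernel log file (its location depends on the syslog configuration). A wide spread of unrelated process names would be consistent with a fault that is not specific to one program.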
This smells like a hardware problem.
On top of that, the SMART data for /dev/sdc is in a rather sad state:
root@opium:~# smartctl -a /dev/sdc
smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.12 family
Device Model:     ST31000528AS
Serial Number:    5VP4GSKE
Firmware Version: CC38
User Capacity:    1 000 204 886 016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Sat Jun 29 00:33:08 2013 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 609) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 185) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x103f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       120470406
  3 Spin_Up_Time            0x0003   095   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       280
  5 Reallocated_Sector_Ct   0x0033   096   096   036    Pre-fail  Always       -       198
  7 Seek_Error_Rate         0x000f   087   060   030    Pre-fail  Always       -       633051438
  9 Power_On_Hours          0x0032   069   069   000    Old_age   Always       -       27454
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       140
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   088   088   000    Old_age   Always       -       12
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       42950328331
189 High_Fly_Writes         0x003a   099   099   000    Old_age   Always       -       1
190 Airflow_Temperature_Cel 0x0022   051   036   045    Old_age   Always   In_the_past 49 (34 33 54 39)
194 Temperature_Celsius     0x0022   049   064   000    Old_age   Always       -       49 (0 16 0 0)
195 Hardware_ECC_Recovered  0x001a   041   013   000    Old_age   Always       -       120470406
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       3
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       272176372542530
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2042468901
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1485209532

SMART Error Log Version: 1
ATA Error Count: 12 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 12 occurred at disk power-on lifetime: 26972 hours (1123 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 39 59 1b 00

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 35 59 1b 40 00   7d+12:48:02.411  READ FPDMA QUEUED
  27 00 00 00 00 00 e0 00   7d+12:48:02.411  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00   7d+12:48:02.410  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00   7d+12:48:02.410  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00   7d+12:48:02.390  READ NATIVE MAX ADDRESS EXT

Error 11 occurred at disk power-on lifetime: 26972 hours (1123 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 39 59 1b 00

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 35 59 1b 40 00   7d+12:47:59.230  READ FPDMA QUEUED
  27 00 00 00 00 00 e0 00   7d+12:47:59.230  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00   7d+12:47:59.229  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00   7d+12:47:59.229  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00   7d+12:47:59.210  READ NATIVE MAX ADDRESS EXT

Error 10 occurred at disk power-on lifetime: 26972 hours (1123 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 39 59 1b 00

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 35 59 1b 40 00   7d+12:47:56.067  READ FPDMA QUEUED
  27 00 00 00 00 00 e0 00   7d+12:47:56.066  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00   7d+12:47:56.066  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00   7d+12:47:56.065  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00   7d+12:47:56.046  READ NATIVE MAX ADDRESS EXT

Error 9 occurred at disk power-on lifetime: 26972 hours (1123 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 39 59 1b 00

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 35 59 1b 40 00   7d+12:47:52.895  READ FPDMA QUEUED
  27 00 00 00 00 00 e0 00   7d+12:47:52.894  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00   7d+12:47:52.894  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00   7d+12:47:52.893  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00   7d+12:47:52.874  READ NATIVE MAX ADDRESS EXT

Error 8 occurred at disk power-on lifetime: 26972 hours (1123 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 39 59 1b 00

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 35 59 1b 40 00   7d+12:47:49.748  READ FPDMA QUEUED
  27 00 00 00 00 00 e0 00   7d+12:47:49.747  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00   7d+12:47:49.746  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00   7d+12:47:49.746  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00   7d+12:47:49.727  READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
We should check whether the two are related and, if so, take action.
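As a starting point for that check, the SMART attributes that most directly describe damaged or unreadable sectors can be pulled out of the `smartctl -a` output. The Python sketch below is only an illustration: the choice of attributes in `WATCHED` and the function name `nonzero_watched` are assumptions, and the sample rows are copied from the /dev/sdc report in this ticket.

```python
import re

# Attributes often read as signs of failing media. This selection is an
# assumption made for the sketch, not a rule taken from smartmontools.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Reported_Uncorrect",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}

# One row of the attribute table:
# ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW_RE = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)


def nonzero_watched(smartctl_output):
    """Return {attribute name: raw value} for watched attributes with a raw value > 0."""
    found = {}
    for line in smartctl_output.splitlines():
        match = ROW_RE.match(line)
        if match and match.group(2) in WATCHED:
            raw = int(match.group(9))
            if raw > 0:
                found[match.group(2)] = raw
    return found


# Rows copied from the smartctl report for /dev/sdc in this ticket.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   096   096   036    Pre-fail  Always       -       198
187 Reported_Uncorrect      0x0032   088   088   000    Old_age   Always       -       12
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       3
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

for name, raw in sorted(nonzero_watched(SAMPLE).items()):
    print(name, raw)
```

Running the same function on reports taken a few days apart would show whether these counters keep growing, which is one way to tell a stable old defect from a disk that is still degrading.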