Group: fa.linux.kernel




Subject: 2.6.20.3 AMD64 oops in CFQ code
From: linux@horizon.com
Date: 3/22/2007 12:38:58 PM
This is a uniprocessor AMD64 system running software RAID-5 and RAID-10 over multiple PCIe SiI3132 SATA controllers. The hardware has been very stable for a long time, but has been acting up of late since I upgraded to 2.6.20.3. ECC memory should preclude the possibility of bit-flip errors. Kernel 2.6.20.3 + linuxpps patches (confined to drivers/serial, and not actually in use as I stole the serial port for a console). It takes half a day to reproduce the problem, so bisecting would be painful. BackupPC_dump mostly writes to a large (1.7 TB) ext3 RAID5 partition. Here are two oopes, a few minutes (16:31, to be precise) apart. Unusually, it oopsed twice *without* locking up the system.. Usually, I see this followed by an error from drivers/input/keyboard/atkbd.c: printk(KERN_WARNING "atkbd.c: Spurious %s on %s. " "Some program might be trying access hardware directly.\n", emitted at 1 Hz with the keyboard LEDs flashing and the system unresponsive to keyboard or pings. (I think it was spurious ACK on serio/input0, but my memory may be faulty.) If anyone has any suggestions, they'd be gratefully received. Unable to handle kernel NULL pointer dereference at 0000000000000098 RIP: [<ffffffff8031504a>] cfq_dispatch_insert+0x18/0x68 PGD 777e9067 PUD 78774067 PMD 0 Oops: 0000 [1] CPU 0 Modules linked in: ecb Pid: 2837, comm: BackupPC_dump Not tainted 2.6.20.3-g691f5333 #40 RIP: 0010:[<ffffffff8031504a>] [<ffffffff8031504a>] cfq_dispatch_insert+0x18/0 x68 RSP: 0018:ffff8100770bbaf8 EFLAGS: 00010092 RAX: ffff81007fb36c80 RBX: 0000000000000000 RCX: 0000000000000001 RDX: 000000010003e4e7 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffff81007fb37a00 R08: 00000000ffffffff R09: ffff81005d390298 R10: ffff81007fcb4f80 R11: ffff81007fcb4f80 R12: ffff81007facd280 R13: 0000000000000004 R14: 0000000000000001 R15: 0000000000000000 FS: 00002b322d120d30(0000) GS:ffffffff805de000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000098 CR3: 000000007bcf0000 CR4: 00000000000006e0 Process BackupPC_dump (pid: 2837, threadinfo ffff8100770ba000, task ffff81007fc5d8e0) Stack: 0000000000000000 ffff8100770f39f0 0000000000000000 0000000000000004 0000000000000001 ffffffff80315253 ffffffff803b2607 ffff81005da2bc40 ffff81007fac3800 ffff81007facd280 ffff81007facd280 ffff81005d390298 Call Trace: [<ffffffff80315253>] cfq_dispatch_requests+0x152/0x512 [<ffffffff803b2607>] scsi_done+0x0/0x18 [<ffffffff8030d9f1>] elv_next_request+0x137/0x147 [<ffffffff803b7ce0>] scsi_request_fn+0x6a/0x33a [<ffffffff8024d407>] generic_unplug_device+0xa/0xe [<ffffffff80407ced>] unplug_slaves+0x5b/0x94 [<ffffffff80223d65>] sync_page+0x0/0x40 [<ffffffff80223d9b>] sync_page+0x36/0x40 [<ffffffff80256d45>] __wait_on_bit_lock+0x36/0x65 [<ffffffff80237496>] __lock_page+0x5e/0x64 [<ffffffff8028061d>] wake_bit_function+0x0/0x23 [<ffffffff802074de>] find_get_page+0xe/0x2d [<ffffffff8020b38e>] do_generic_mapping_read+0x1c2/0x40d [<ffffffff8020bd80>] file_read_actor+0x0/0x118 [<ffffffff8021422e>] generic_file_aio_read+0x15c/0x19e [<ffffffff8020bafa>] do_sync_read+0xc9/0x10c [<ffffffff80210342>] may_open+0x5b/0x1c6 [<ffffffff802805ef>] autoremove_wake_function+0x0/0x2e [<ffffffff8020a857>] vfs_read+0xaa/0x152 [<ffffffff8020faf3>] sys_read+0x45/0x6e [<ffffffff8025041e>] system_call+0x7e/0x83 Code: 4c 8b ae 98 00 00 00 4c 8b 70 08 e8 63 fe ff ff 8b 43 28 4c RIP [<ffffffff8031504a>] cfq_dispatch_insert+0x18/0x68 RSP <ffff8100770bbaf8> CR2: 0000000000000098 <1>Unable to handle kernel NULL pointer dereference at 0000000000000098 RIP: [<ffffffff8031504a>] cfq_dispatch_insert+0x18/0x68 PGD 79bd2067 PUD 789f9067 PMD 0 Oops: 0000 [2] CPU 0 Modules linked in: ecb Pid: 2834, comm: BackupPC_dump Not tainted 2.6.20.3-g691f5333 #40 RIP: 0010:[<ffffffff8031504a>] [<ffffffff8031504a>] cfq_dispatch_insert+0x18/0 x 68 RSP: 0018:ffff8100789b5af8 EFLAGS: 00010092 RAX: ffff81007fb36c80 RBX: 0000000000000000 RCX: 0000000000000001 RDX: 000000010007ac16 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffff81007fb37a00 R08: 00000000ffffffff R09: ffff810064dd45e0 R10: ffff81007fcb4f80 R11: ffff81007fcb4f80 R12: ffff81007facd280 R13: 0000000000000004 R14: 0000000000000001 R15: 0000000000000000 FS: 00002b0a7c680d30(0000) GS:ffffffff805de000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000098 CR3: 0000000079d36000 CR4: 00000000000006e0 Process BackupPC_dump (pid: 2834, threadinfo ffff8100789b4000, task ffff81007a23 5140) Stack: 0000000000000000 ffff81007b9ebbd0 0000000000000000 0000000000000004 0000000000000001 ffffffff80315253 ffffffff803b2607 ffff81000e67ba00 ffff81007fac3800 ffff81007facd280 ffff81007facd280 ffff810064dd45e0 Call Trace: [<ffffffff80315253>] cfq_dispatch_requests+0x152/0x512 [<ffffffff803b2607>] scsi_done+0x0/0x18 [<ffffffff8030d9f1>] elv_next_request+0x137/0x147 [<ffffffff803b7ce0>] scsi_request_fn+0x6a/0x33a [<ffffffff8024d407>] generic_unplug_device+0xa/0xe [<ffffffff80407ced>] unplug_slaves+0x5b/0x94 [<ffffffff80223d65>] sync_page+0x0/0x40 [<ffffffff80223d9b>] sync_page+0x36/0x40 [<ffffffff80256d45>] __wait_on_bit_lock+0x36/0x65 [<ffffffff80237496>] __lock_page+0x5e/0x64 [<ffffffff8028061d>] wake_bit_function+0x0/0x23 [<ffffffff802074de>] find_get_page+0xe/0x2d [<ffffffff8020b38e>] do_generic_mapping_read+0x1c2/0x40d [<ffffffff8020bd80>] file_read_actor+0x0/0x118 [<ffffffff8021422e>] generic_file_aio_read+0x15c/0x19e [<ffffffff8020bafa>] do_sync_read+0xc9/0x10c [<ffffffff80210342>] may_open+0x5b/0x1c6 [<ffffffff802805ef>] autoremove_wake_function+0x0/0x2e [<ffffffff8020a857>] vfs_read+0xaa/0x152 [<ffffffff8020faf3>] sys_read+0x45/0x6e [<ffffffff8025041e>] system_call+0x7e/0x83 Code: 4c 8b ae 98 00 00 00 4c 8b 70 08 e8 63 fe ff ff 8b 43 28 4c RIP [<ffffffff8031504a>] cfq_dispatch_insert+0x18/0x68 RSP <ffff8100789b5af8> CR2: 0000000000000098 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/

Subject: 2.6.20.3 AMD64 oops in CFQ code
From: linux@horizon.com
Date: 3/22/2007 6:54:51 PM
> 3 (I think) seperate instances of this, each involving raid5. Is your > array degraded or fully operational? Ding! A drive fell out the other day, which is why the problems only appeared recently. md5 : active raid5 sdf4[5] sdd4[3] sdc4[2] sdb4[1] sda4[0] 1719155200 blocks level 5, 64k chunk, algorithm 2 [6/5] [UUUU_U] bitmap: 149/164 pages [596KB], 1024KB chunk H'm... this means that my alarm scripts aren't working. Well, that's good to know. The drive is being re-integrated now. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/

Subject: 2.6.20.3 AMD64 oops in CFQ code
From: linux@horizon.com
Date: 3/23/2007 5:47:03 PM
As an additional data point, here's a libata problem I'm having trying to rebuild the array. I have six identical 400 GB drives (ST3400832AS), and one is giving me hassles. I've run SMART short and long diagnostics, badblocks, and Seagate's "seatools" diagnostic software, and none of these find problems. It is the only one of the six with a non-zero reallocated sector count (it's 26). Anyway, the drive is partitioned into a 45G RAID-10 part and a 350G RAID-5 part. The RAID-10 part integrated successfully, but the RAID-5 got to about 60% and then puked: ata5.00: exception Emask 0x0 SAct 0x1ef SErr 0x0 action 0x2 frozen ata5.00: cmd 61/c0:00:d2:d0:b9/00:00:1c:00:00/40 tag 0 cdb 0x0 data 98304 out res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout) ata5.00: cmd 61/40:08:92:d1:b9/00:00:1c:00:00/40 tag 1 cdb 0x0 data 32768 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata5.00: cmd 61/00:10:d2:d1:b9/01:00:1c:00:00/40 tag 2 cdb 0x0 data 131072 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata5.00: cmd 61/00:18:d2:d2:b9/01:00:1c:00:00/40 tag 3 cdb 0x0 data 131072 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata5.00: cmd 61/00:28:d2:d3:b9/01:00:1c:00:00/40 tag 5 cdb 0x0 data 131072 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata5.00: cmd 61/00:30:d2:d4:b9/01:00:1c:00:00/40 tag 6 cdb 0x0 data 131072 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata5.00: cmd 61/00:38:d2:d5:b9/01:00:1c:00:00/40 tag 7 cdb 0x0 data 131072 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata5.00: cmd 61/00:40:d2:d6:b9/01:00:1c:00:00/40 tag 8 cdb 0x0 data 131072 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata5: soft resetting port ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 300) ata5.00: configured for UDMA/100 ata5: EH complete SCSI device sde: 781422768 512-byte hdwr sectors (400088 MB) sde: Write Protect is off SCSI device sde: write cache: enabled, read cache: enabled, doesn't support DPO or FUA ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen ata5.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0 res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout) ata5: soft resetting port ata5: softreset failed (timeout) ata5: softreset failed, retrying in 5 secs ata5: hard resetting port ata5: softreset failed (timeout) ata5: follow-up softreset failed, retrying in 5 secs ata5: hard resetting port ata5: softreset failed (timeout) ata5: reset failed, giving up ata5.00: disabled ata5: EH complete sd 4:0:0:0: SCSI error: return code = 0x00040000 end_request: I/O error, dev sde, sector 91795259 md: super_written gets error=-5, uptodate=0 raid10: Disk failure on sde3, disabling device. Operation continuing on 5 devices sd 4:0:0:0: SCSI error: return code = 0x00040000 end_request: I/O error, dev sde, sector 481942994 raid5: Disk failure on sde4, disabling device. Operation continuing on 5 devices sd 4:0:0:0: SCSI error: return code = 0x00040000 end_request: I/O error, dev sde, sector 481944018 md: md5: recovery done. RAID10 conf printout: --- wd:5 rd:6 disk 0, wo:0, o:1, dev:sdb3 disk 1, wo:0, o:1, dev:sdc3 disk 2, wo:0, o:1, dev:sdd3 disk 3, wo:1, o:0, dev:sde3 disk 4, wo:0, o:1, dev:sdf3 disk 5, wo:0, o:1, dev:sda3 RAID10 conf printout: --- wd:5 rd:6 disk 0, wo:0, o:1, dev:sdb3 disk 1, wo:0, o:1, dev:sdc3 disk 2, wo:0, o:1, dev:sdd3 disk 4, wo:0, o:1, dev:sdf3 disk 5, wo:0, o:1, dev:sda3 RAID5 conf printout: --- rd:6 wd:5 disk 0, o:1, dev:sda4 disk 1, o:1, dev:sdb4 disk 2, o:1, dev:sdc4 disk 3, o:1, dev:sdd4 disk 4, o:0, dev:sde4 disk 5, o:1, dev:sdf4 RAID5 conf printout: --- rd:6 wd:5 disk 0, o:1, dev:sda4 disk 1, o:1, dev:sdb4 disk 2, o:1, dev:sdc4 disk 3, o:1, dev:sdd4 disk 5, o:1, dev:sdf4 The first error address is just barely inside the RAID-10 part (which ends at sector 91,795,410), while the second and third errors (at 481,942,994) look like where the reconstruction was working. Anyway, what's annoying is that I can't figure out how to bring the drive back on line without resetting the box. It's in a hot-swap enclosure, but power cycling the drive doesn't seem to help. I thought libata hotplug was working? (SiI3132 card, using the sil24 driver.) (H'm... after rebooting, reallocated sectors jumped from 26 to 39. Something is up with that drive.) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/