|
|
Subject: 2.6.20.3 AMD64 oops in CFQ code
From: linux@horizon.com
Date: 3/22/2007 12:38:58 PM
This is a uniprocessor AMD64 system running software RAID-5 and RAID-10
over multiple PCIe SiI3132 SATA controllers. The hardware has been very
stable for a long time, but has been acting up of late since I upgraded
to 2.6.20.3. ECC memory should preclude the possibility of bit-flip
errors.
Kernel 2.6.20.3 + linuxpps patches (confined to drivers/serial, and not
actually in use as I stole the serial port for a console).
It takes half a day to reproduce the problem, so bisecting would be painful.
BackupPC_dump mostly writes to a large (1.7 TB) ext3 RAID5 partition.
Here are two oopes, a few minutes (16:31, to be precise) apart.
Unusually, it oopsed twice *without* locking up the system.. Usually,
I see this followed by an error from drivers/input/keyboard/atkbd.c:
printk(KERN_WARNING "atkbd.c: Spurious %s on %s. "
"Some program might be trying access hardware directly.\n",
emitted at 1 Hz with the keyboard LEDs flashing and the system
unresponsive to keyboard or pings.
(I think it was spurious ACK on serio/input0, but my memory may be faulty.)
If anyone has any suggestions, they'd be gratefully received.
Unable to handle kernel NULL pointer dereference at 0000000000000098 RIP:
[<ffffffff8031504a>] cfq_dispatch_insert+0x18/0x68
PGD 777e9067 PUD 78774067 PMD 0
Oops: 0000 [1]
CPU 0
Modules linked in: ecb
Pid: 2837, comm: BackupPC_dump Not tainted 2.6.20.3-g691f5333 #40
RIP: 0010:[<ffffffff8031504a>] [<ffffffff8031504a>] cfq_dispatch_insert+0x18/0
x68
RSP: 0018:ffff8100770bbaf8 EFLAGS: 00010092
RAX: ffff81007fb36c80 RBX: 0000000000000000 RCX: 0000000000000001
RDX: 000000010003e4e7 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff81007fb37a00 R08: 00000000ffffffff R09: ffff81005d390298
R10: ffff81007fcb4f80 R11: ffff81007fcb4f80 R12: ffff81007facd280
R13: 0000000000000004 R14: 0000000000000001 R15: 0000000000000000
FS: 00002b322d120d30(0000) GS:ffffffff805de000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000098 CR3: 000000007bcf0000 CR4: 00000000000006e0
Process BackupPC_dump (pid: 2837, threadinfo ffff8100770ba000, task ffff81007fc5d8e0)
Stack: 0000000000000000 ffff8100770f39f0 0000000000000000 0000000000000004
0000000000000001 ffffffff80315253 ffffffff803b2607 ffff81005da2bc40
ffff81007fac3800 ffff81007facd280 ffff81007facd280 ffff81005d390298
Call Trace:
[<ffffffff80315253>] cfq_dispatch_requests+0x152/0x512
[<ffffffff803b2607>] scsi_done+0x0/0x18
[<ffffffff8030d9f1>] elv_next_request+0x137/0x147
[<ffffffff803b7ce0>] scsi_request_fn+0x6a/0x33a
[<ffffffff8024d407>] generic_unplug_device+0xa/0xe
[<ffffffff80407ced>] unplug_slaves+0x5b/0x94
[<ffffffff80223d65>] sync_page+0x0/0x40
[<ffffffff80223d9b>] sync_page+0x36/0x40
[<ffffffff80256d45>] __wait_on_bit_lock+0x36/0x65
[<ffffffff80237496>] __lock_page+0x5e/0x64
[<ffffffff8028061d>] wake_bit_function+0x0/0x23
[<ffffffff802074de>] find_get_page+0xe/0x2d
[<ffffffff8020b38e>] do_generic_mapping_read+0x1c2/0x40d
[<ffffffff8020bd80>] file_read_actor+0x0/0x118
[<ffffffff8021422e>] generic_file_aio_read+0x15c/0x19e
[<ffffffff8020bafa>] do_sync_read+0xc9/0x10c
[<ffffffff80210342>] may_open+0x5b/0x1c6
[<ffffffff802805ef>] autoremove_wake_function+0x0/0x2e
[<ffffffff8020a857>] vfs_read+0xaa/0x152
[<ffffffff8020faf3>] sys_read+0x45/0x6e
[<ffffffff8025041e>] system_call+0x7e/0x83
Code: 4c 8b ae 98 00 00 00 4c 8b 70 08 e8 63 fe ff ff 8b 43 28 4c
RIP [<ffffffff8031504a>] cfq_dispatch_insert+0x18/0x68
RSP <ffff8100770bbaf8>
CR2: 0000000000000098
<1>Unable to handle kernel NULL pointer dereference at 0000000000000098 RIP:
[<ffffffff8031504a>] cfq_dispatch_insert+0x18/0x68
PGD 79bd2067 PUD 789f9067 PMD 0
Oops: 0000 [2]
CPU 0
Modules linked in: ecb
Pid: 2834, comm: BackupPC_dump Not tainted 2.6.20.3-g691f5333 #40
RIP: 0010:[<ffffffff8031504a>] [<ffffffff8031504a>] cfq_dispatch_insert+0x18/0
x
68
RSP: 0018:ffff8100789b5af8 EFLAGS: 00010092
RAX: ffff81007fb36c80 RBX: 0000000000000000 RCX: 0000000000000001
RDX: 000000010007ac16 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff81007fb37a00 R08: 00000000ffffffff R09: ffff810064dd45e0
R10: ffff81007fcb4f80 R11: ffff81007fcb4f80 R12: ffff81007facd280
R13: 0000000000000004 R14: 0000000000000001 R15: 0000000000000000
FS: 00002b0a7c680d30(0000) GS:ffffffff805de000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000098 CR3: 0000000079d36000 CR4: 00000000000006e0
Process BackupPC_dump (pid: 2834, threadinfo ffff8100789b4000, task ffff81007a23
5140)
Stack: 0000000000000000 ffff81007b9ebbd0 0000000000000000 0000000000000004
0000000000000001 ffffffff80315253 ffffffff803b2607 ffff81000e67ba00
ffff81007fac3800 ffff81007facd280 ffff81007facd280 ffff810064dd45e0
Call Trace:
[<ffffffff80315253>] cfq_dispatch_requests+0x152/0x512
[<ffffffff803b2607>] scsi_done+0x0/0x18
[<ffffffff8030d9f1>] elv_next_request+0x137/0x147
[<ffffffff803b7ce0>] scsi_request_fn+0x6a/0x33a
[<ffffffff8024d407>] generic_unplug_device+0xa/0xe
[<ffffffff80407ced>] unplug_slaves+0x5b/0x94
[<ffffffff80223d65>] sync_page+0x0/0x40
[<ffffffff80223d9b>] sync_page+0x36/0x40
[<ffffffff80256d45>] __wait_on_bit_lock+0x36/0x65
[<ffffffff80237496>] __lock_page+0x5e/0x64
[<ffffffff8028061d>] wake_bit_function+0x0/0x23
[<ffffffff802074de>] find_get_page+0xe/0x2d
[<ffffffff8020b38e>] do_generic_mapping_read+0x1c2/0x40d
[<ffffffff8020bd80>] file_read_actor+0x0/0x118
[<ffffffff8021422e>] generic_file_aio_read+0x15c/0x19e
[<ffffffff8020bafa>] do_sync_read+0xc9/0x10c
[<ffffffff80210342>] may_open+0x5b/0x1c6
[<ffffffff802805ef>] autoremove_wake_function+0x0/0x2e
[<ffffffff8020a857>] vfs_read+0xaa/0x152
[<ffffffff8020faf3>] sys_read+0x45/0x6e
[<ffffffff8025041e>] system_call+0x7e/0x83
Code: 4c 8b ae 98 00 00 00 4c 8b 70 08 e8 63 fe ff ff 8b 43 28 4c
RIP [<ffffffff8031504a>] cfq_dispatch_insert+0x18/0x68
RSP <ffff8100789b5af8>
CR2: 0000000000000098
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Subject: 2.6.20.3 AMD64 oops in CFQ code
From: linux@horizon.com
Date: 3/22/2007 6:54:51 PM
> 3 (I think) seperate instances of this, each involving raid5. Is your
> array degraded or fully operational?
Ding! A drive fell out the other day, which is why the problems only
appeared recently.
md5 : active raid5 sdf4[5] sdd4[3] sdc4[2] sdb4[1] sda4[0]
1719155200 blocks level 5, 64k chunk, algorithm 2 [6/5] [UUUU_U]
bitmap: 149/164 pages [596KB], 1024KB chunk
H'm... this means that my alarm scripts aren't working. Well, that's
good to know. The drive is being re-integrated now.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Subject: 2.6.20.3 AMD64 oops in CFQ code
From: linux@horizon.com
Date: 3/23/2007 5:47:03 PM
As an additional data point, here's a libata problem I'm having trying to
rebuild the array.
I have six identical 400 GB drives (ST3400832AS), and one is giving
me hassles. I've run SMART short and long diagnostics, badblocks, and
Seagate's "seatools" diagnostic software, and none of these find problems.
It is the only one of the six with a non-zero reallocated sector count
(it's 26).
Anyway, the drive is partitioned into a 45G RAID-10 part and a 350G RAID-5
part. The RAID-10 part integrated successfully, but the RAID-5 got to
about 60% and then puked:
ata5.00: exception Emask 0x0 SAct 0x1ef SErr 0x0 action 0x2 frozen
ata5.00: cmd 61/c0:00:d2:d0:b9/00:00:1c:00:00/40 tag 0 cdb 0x0 data 98304 out
res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5.00: cmd 61/40:08:92:d1:b9/00:00:1c:00:00/40 tag 1 cdb 0x0 data 32768 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5.00: cmd 61/00:10:d2:d1:b9/01:00:1c:00:00/40 tag 2 cdb 0x0 data 131072 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5.00: cmd 61/00:18:d2:d2:b9/01:00:1c:00:00/40 tag 3 cdb 0x0 data 131072 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5.00: cmd 61/00:28:d2:d3:b9/01:00:1c:00:00/40 tag 5 cdb 0x0 data 131072 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5.00: cmd 61/00:30:d2:d4:b9/01:00:1c:00:00/40 tag 6 cdb 0x0 data 131072 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5.00: cmd 61/00:38:d2:d5:b9/01:00:1c:00:00/40 tag 7 cdb 0x0 data 131072 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5.00: cmd 61/00:40:d2:d6:b9/01:00:1c:00:00/40 tag 8 cdb 0x0 data 131072 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5: soft resetting port
ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata5.00: configured for UDMA/100
ata5: EH complete
SCSI device sde: 781422768 512-byte hdwr sectors (400088 MB)
sde: Write Protect is off
SCSI device sde: write cache: enabled, read cache: enabled, doesn't support DPO or
FUA
ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata5.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0
res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5: soft resetting port
ata5: softreset failed (timeout)
ata5: softreset failed, retrying in 5 secs
ata5: hard resetting port
ata5: softreset failed (timeout)
ata5: follow-up softreset failed, retrying in 5 secs
ata5: hard resetting port
ata5: softreset failed (timeout)
ata5: reset failed, giving up
ata5.00: disabled
ata5: EH complete
sd 4:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sde, sector 91795259
md: super_written gets error=-5, uptodate=0
raid10: Disk failure on sde3, disabling device.
Operation continuing on 5 devices
sd 4:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sde, sector 481942994
raid5: Disk failure on sde4, disabling device. Operation continuing on 5 devices
sd 4:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sde, sector 481944018
md: md5: recovery done.
RAID10 conf printout:
--- wd:5 rd:6
disk 0, wo:0, o:1, dev:sdb3
disk 1, wo:0, o:1, dev:sdc3
disk 2, wo:0, o:1, dev:sdd3
disk 3, wo:1, o:0, dev:sde3
disk 4, wo:0, o:1, dev:sdf3
disk 5, wo:0, o:1, dev:sda3
RAID10 conf printout:
--- wd:5 rd:6
disk 0, wo:0, o:1, dev:sdb3
disk 1, wo:0, o:1, dev:sdc3
disk 2, wo:0, o:1, dev:sdd3
disk 4, wo:0, o:1, dev:sdf3
disk 5, wo:0, o:1, dev:sda3
RAID5 conf printout:
--- rd:6 wd:5
disk 0, o:1, dev:sda4
disk 1, o:1, dev:sdb4
disk 2, o:1, dev:sdc4
disk 3, o:1, dev:sdd4
disk 4, o:0, dev:sde4
disk 5, o:1, dev:sdf4
RAID5 conf printout:
--- rd:6 wd:5
disk 0, o:1, dev:sda4
disk 1, o:1, dev:sdb4
disk 2, o:1, dev:sdc4
disk 3, o:1, dev:sdd4
disk 5, o:1, dev:sdf4
The first error address is just barely inside the RAID-10 part (which ends at
sector 91,795,410), while the second and third errors (at 481,942,994)
look like where the reconstruction was working.
Anyway, what's annoying is that I can't figure out how to bring the
drive back on line without resetting the box. It's in a hot-swap enclosure,
but power cycling the drive doesn't seem to help. I thought libata hotplug
was working? (SiI3132 card, using the sil24 driver.)
(H'm... after rebooting, reallocated sectors jumped from 26 to 39.
Something is up with that drive.)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
|