2.6.16-rc1-mm3 [Kernel]

From: Eric Dumazet on 27 Jan 2006 01:10

Andrew Morton wrote:
> Andy Whitcroft <apw(a)shadowen.org> wrote:
>> Yes. I think I have this one. It appears that the patch below is the
>> trigger for all our recent panic woes. The last of the testing should
>> complete in the next few hours and I will be able to confirm that
>> hypothesis; results so far are all good.
>>
>> reduce-size-of-percpudata-and-make-sure-per_cpuobject.patch
>
> That patch did have some missed conversions, which might well explain the
> crash.
>
> Thanks for narrowing it down - I'll keep that patch in next -mm (and will
> include the known fixups). Could you please boot test that? If we're
> still in trouble, I'll drop it.

The NULL choice was maybe wrong. We might need more than one page to fully
catch all accesses. Something like 32KB.

In the meantime, could you apply this one?

Signed-off-by: Eric Dumazet <dada1(a)cosmosbay.com>

From: Andy Whitcroft on 27 Jan 2006 05:20

Eric Dumazet wrote:
> Andrew Morton wrote:
>
>> Andy Whitcroft <apw(a)shadowen.org> wrote:
>>
>>> Yes. I think I have this one. It appears that the patch below is the
>>> trigger for all our recent panic woes. The last of the testing should
>>> complete in the next few hours and I will be able to confirm that
>>> hypothesis; results so far are all good.
>>>
>>> reduce-size-of-percpudata-and-make-sure-per_cpuobject.patch
>>
>>
>> That patch did have some missed conversions, which might well explain the
>> crash.
>>
>> Thanks for narrowing it down - I'll keep that patch in next -mm (and will
>> include the known fixups). Could you please boot test that? If we're
>> still in trouble, I'll drop it.

Sounds eminently fair. I think the patch has merit, so now that we know the
symptoms we can spend a little effort to get the kinks out. Will test
the next -mm as a matter of course.

> The NULL choice was maybe wrong. We might need more than one page to
> fully catch all accesses. Something like 32KB.

The crash behavior is handy to catch that the problem exists, and is
very cheap (0 cost) at run time. However, once it's known I think we
need something more targeted to allow tracking of the cause. Perhaps
we could set the per-cpu offset to -1 or something and simply do
something like the following in per_cpu():

    if (__per_cpu_offset[cpu] == -1)
        BUG();
    else
        *RELOC_HIDE(&per_cpu__##var, __per_cpu_offset[cpu])
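
A minimal user-space sketch of that sentinel check, not kernel code. The
names pcpu_offset, pcpu_counter, pcpu_offset_valid and PCPU_OFFSET_INVALID
are hypothetical, NR_CPUS is fixed at 4 for illustration, and abort()
stands in for BUG():

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

#define NR_CPUS 4                         /* assumed compile-time maximum */
#define PCPU_OFFSET_INVALID ((size_t)-1)  /* hypothetical sentinel value */

/* One slot per CPU; slot 3 stands in for a CPU that can never exist. */
static int counter_area[NR_CPUS];
static size_t pcpu_offset[NR_CPUS] = {
    0 * sizeof(int), 1 * sizeof(int), 2 * sizeof(int), PCPU_OFFSET_INVALID
};

/* Returns nonzero when 'cpu' has a usable per-cpu area. */
static int pcpu_offset_valid(int cpu)
{
    return cpu >= 0 && cpu < NR_CPUS &&
           pcpu_offset[cpu] != PCPU_OFFSET_INVALID;
}

/* Checked accessor: fail at the offending call site instead of
 * dereferencing a bogus address somewhere later. */
static int *pcpu_counter(int cpu)
{
    if (!pcpu_offset_valid(cpu)) {
        fprintf(stderr, "BUG: per-cpu access for impossible cpu %d\n", cpu);
        abort();  /* stand-in for the kernel's BUG() */
    }
    return (int *)((char *)counter_area + pcpu_offset[cpu]);
}
```

Unlike the fault-on-access approach, this costs a compare and branch on
every access, so it would probably only be wanted as a debug option.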

> In the meantime could you apply this one ?
>
> Signed-off-by: Eric Dumazet <dada1(a)cosmosbay.com>
>
>
>
> ------------------------------------------------------------------------
>
> --- a/arch/i386/kernel/nmi.c 2006-01-27 07:51:04.000000000 +0100
> +++ b/arch/i386/kernel/nmi.c 2006-01-27 07:52:14.000000000 +0100
> @@ -148,7 +148,7 @@
> if (nmi_watchdog == NMI_LOCAL_APIC)
> smp_call_function(nmi_cpu_busy, (void *)&endflag, 0, 0);
>
> - for (cpu = 0; cpu < NR_CPUS; cpu++)
> + for_each_cpu(cpu)
> prev_nmi_count[cpu] = per_cpu(irq_stat, cpu).__nmi_count;
> local_irq_enable();
> mdelay((10*1000)/nmi_hz); // wait 10 ticks
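
The point of the change quoted above is to visit only possible CPUs
rather than every index below NR_CPUS. A rough user-space sketch of that
iteration pattern follows; the mask value, the macro
for_each_possible_sketch and the helper snapshot_counts are all
hypothetical and only illustrate the idea:

```c
#include <assert.h>

#define NR_CPUS 8  /* assumed compile-time maximum */

/* Hypothetical possible-CPU mask: CPUs 0, 1 and 3 only. */
static const unsigned long possible_mask = 0x0bUL;

static int cpu_is_possible(int cpu)
{
    return (int)((possible_mask >> cpu) & 1UL);
}

/* Visit only CPUs whose bit is set, skipping impossible ones. */
#define for_each_possible_sketch(cpu)              \
    for ((cpu) = 0; (cpu) < NR_CPUS; (cpu)++)      \
        if (!cpu_is_possible(cpu)) continue; else

/* Copy per-CPU counts for possible CPUs only; returns how many were
 * visited. Slots of impossible CPUs are never touched. */
static int snapshot_counts(const unsigned int *src, unsigned int *dst)
{
    int cpu, visited = 0;

    for_each_possible_sketch(cpu) {
        dst[cpu] = src[cpu];
        visited++;
    }
    return visited;
}
```

With the plain 0..NR_CPUS loop, the copy would also read the slots of
CPUs 2 and 4 to 7, which is the kind of access that now faults when
those CPUs have no per-cpu area.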

No change to the panics in alloc_slabmgmt. A very quick review seems
to say that slab percpu data is actually not in percpu space, so that
seems a little odd. Not had any real time to trace it further.

If you have any missed conversions other than this one, send them along
and I'll put them through the mill.

-apw
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Michal Piotrowski on 27 Jan 2006 05:20

Hi,

On 26/01/06, Nick Piggin <nickpiggin(a)yahoo.com.au> wrote:
> Nick Piggin wrote:
> Sorry, wrong patch.
>
> Note the warnings you are seeing should not result in memory
> corruption, but will result in the given hugepage leaking.
>
> --
> SUSE Labs, Novell Inc.
>
>
> Index: linux-2.6/include/linux/mm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm.h
> +++ linux-2.6/include/linux/mm.h
> @@ -294,6 +294,8 @@ struct page {
> */
> static inline int put_page_testzero(struct page *page)
> {
> + if (unlikely(PageCompound(page)))
> + page = (struct page *)page_private(page);
> BUG_ON(atomic_read(&page->_count) == 0);
> return atomic_dec_and_test(&page->_count);
> }
>
>
>

Now I have got this:

BUG: unable to handle kernel paging request at virtual address eaa34b3c
printing eip:
b0161cdd
*pde = 0048a067
*pte = 3aa34000
Oops: 0000 [#1]
PREEMPT SMP DEBUG_PAGEALLOC
last sysfs file: /devices/pci0000:00/0000:00:1d.1/usb3/idVendor
Modules linked in: snd_intel8x0 snd_ac97_codec snd_ac97_bus
snd_pcm_oss snd_mixer_oss snd_pcm snd_timer ide_cd cdrom intel_agp
agpgart snd i2c_i801 hw_random soundcore snd_page_alloc unix
CPU: 0
EIP: 0060:[<b0161cdd>] Not tainted VLI
EFLAGS: 00010282 (2.6.16-rc1-mm3 #4)
EIP is at do_path_lookup+0x22b/0x259
eax: eaa34b20 ebx: eb328000 ecx: 00000000 edx: eb328f4c
esi: ffffff9c edi: fffffffe ebp: eb328f24 esp: eb328f0c
ds: 007b es: 007b ss: 0068
Process udevd (pid: 731, threadinfo=eb328000 task=eb30ca80)
Stack: <0>00000000 eb5cb000 b015fab1 eb5cb000 eb5cb000 00000000
eb328f40 b01621f3
eb328f4c ffffff9c afbf6dec afbf6dec 00000100 eb328f9c b015bf69 eb328f4c
eaa34b20 b23d5f28 00000000 eb329003 b015f8ce 00000000 00000001 00000000
Call Trace:
[<b0103917>] show_stack_log_lvl+0xaa/0xb5
[<b0103a54>] show_registers+0x132/0x19d
[<b0103d91>] die+0x171/0x1fb
[<b02ab110>] do_page_fault+0x3be/0x568
[<b010343f>] error_code+0x4f/0x54
[<b01621f3>] __user_walk_fd+0x2d/0x41
[<b015bf69>] sys_readlinkat+0x26/0x93
[<b015bfe9>] sys_readlink+0x13/0x15
[<b01028bf>] sysenter_past_esp+0x54/0x75
Code: 00 83 c0 04 e8 9a 82 14 00 8b 03 c7 80 e4 01 00 00 00 00 00 00
8b 55 08 8b 45 ec e8 55 fa ff ff 89 c7 8b 55 08 8b 02 85 c0 74 24 <8b>
50 1c 85 d2 74 1d b8 00 f0 ff ff 21 e0 8b 00 83 b8 d4 04 00
<6>ACPI: PCI Interrupt 0000:02:05.0[A] -> GSI 22 (level, low) -> IRQ 21

Here is dmesg:
http://www.stardust.webpages.pl/files/mm/2.6.16-rc1-mm3/mm-dmesg2

Here is config
http://www.stardust.webpages.pl/files/mm/2.6.16-rc1-mm3/mm-config

Regards,
Michal Piotrowski

From: Eric Dumazet on 27 Jan 2006 05:40

Andy Whitcroft wrote:
> Eric Dumazet wrote:
>> The NULL choice was maybe wrong. We might need more than one page to
>> fully catch all accesses. Something like 32KB.
>
> The crash behavior is handy to catch that the problem exists, and is
> very cheap (0 cost) at run time. However, once it's known I think we
> need something more targeted to allow tracking of the cause. Perhaps
> we could set the per-cpu offset to -1 or something and simply do
> something like the following in per_cpu():
>
>     if (__per_cpu_offset[cpu] == -1)
>         BUG();
>     else
>         *RELOC_HIDE(&per_cpu__##var, __per_cpu_offset[cpu])
>

Yes we can set __per_cpu_offset[not_possible_cpu] to 0, because
[__per_cpu_start,__per_cpu_end) is in init section and should be discarded in
free_initmem(). I'm not sure if the freed virtual space can later be reused.

Eric

From: Reuben Farrelly on 27 Jan 2006 06:50

On 25/01/2006 8:24 p.m., Andrew Morton wrote:
> http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.16-rc1/2.6.16-rc1-mm3/
>
> - Dropped the timekeeping patch series due to a complex timesource selection
> bug.
>
> - Various fixes and updates.
>
>
>
> Changes since 2.6.16-rc1-mm2:

Just triggered this one, which had a fairly bad effect on connectivity to the box:

i2c /dev entries driver
slab error in kmem_cache_destroy(): cache `ip_conntrack': Can't free all objects
[<b010412b>] show_trace+0xd/0xf
[<b01041cc>] dump_stack+0x17/0x19
[<b0155d04>] kmem_cache_destroy+0x9b/0x1a9
[<f0ebf701>] ip_conntrack_cleanup+0x5d/0x10e [ip_conntrack]
[<f0ebe31e>] init_or_cleanup+0x1f8/0x283 [ip_conntrack]
[<f0ec2c4e>] fini+0xa/0x66 [ip_conntrack]
[<b0136d06>] sys_delete_module+0x161/0x1fb
[<b0102b3f>] sysenter_past_esp+0x54/0x75
Removing netfilter NETLINK layer.
[root(a)tornado log]#

I was just reading IMAP mail at the time, ie same as I'd been doing for an hour
or two beforehand and not altering config of the box in any way. I was able to
log on via console but lost all network connectivity and had to reboot :(

Generic details such as the .config are at http://www.reub.net/files/kernel/

reuben