From: Eric Dumazet on 27 Jan 2006 01:10

Andrew Morton wrote:
> Andy Whitcroft <apw(a)shadowen.org> wrote:
>> Yes. I think I have this one. It appears that the patch below is the
>> trigger for all our recent panic woes. The last of the testing should
>> complete in the next few hours and I will be able to confirm that
>> hypothesis; results so far are all good.
>>
>> reduce-size-of-percpudata-and-make-sure-per_cpuobject.patch
>
> That patch did have some missed conversions, which might well explain the
> crash.
>
> Thanks for narrowing it down - I'll keep that patch in next -mm (and will
> include the known fixups). Could you please boot test that? If we're
> still in trouble, I'll drop it.

The NULL choice was maybe wrong. We might need more than one page to fully catch all accesses. Something like 32KB.

In the meantime, could you apply this one?

Signed-off-by: Eric Dumazet <dada1(a)cosmosbay.com>
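As a rough illustration of the "more than one page" concern, here is a small user-space sketch. It rests on an assumption that the thread does not spell out: that with the "NULL choice" a per-cpu access for a non-possible CPU resolves to the variable's offset inside the per-cpu section, so it faults only if that offset lies inside the unmapped region starting at address 0. The names PAGE_SIZE_ASSUMED, GUARD_32K and access_would_fault are hypothetical, and the 4096-byte page size is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed values for illustration; the real page size is arch-specific. */
#define PAGE_SIZE_ASSUMED 4096u
#define GUARD_32K (32u * 1024u)

/*
 * Hypothetical model: the bad access lands at address var_offset, and only
 * addresses below guard_bytes are unmapped. Returns true if the access
 * would fault and therefore be caught.
 */
bool access_would_fault(size_t var_offset, size_t guard_bytes)
{
    /* The guard region is assumed to be a whole number of pages. */
    assert(guard_bytes % PAGE_SIZE_ASSUMED == 0);
    return var_offset < guard_bytes;
}
```

Under this model, a variable at offset 5000 slips past a one-page guard but is caught by a 32KB one, which is the case the message appears to worry about.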
From: Andy Whitcroft on 27 Jan 2006 05:20

Eric Dumazet wrote:
> Andrew Morton wrote:
>
>> Andy Whitcroft <apw(a)shadowen.org> wrote:
>>
>>> Yes. I think I have this one. It appears that the patch below is the
>>> trigger for all our recent panic woes. The last of the testing should
>>> complete in the next few hours and I will be able to confirm that
>>> hypothesis; results so far are all good.
>>>
>>> reduce-size-of-percpudata-and-make-sure-per_cpuobject.patch
>>
>> That patch did have some missed conversions, which might well explain the
>> crash.
>>
>> Thanks for narrowing it down - I'll keep that patch in next -mm (and will
>> include the known fixups). Could you please boot test that? If we're
>> still in trouble, I'll drop it.

Sounds eminently fair. I think the patch has merit, so now that we know the symptoms we can spend a little effort to get the kinks out. Will test the next -mm as a matter of course.

> The NULL choice was maybe wrong. We might need more than one page to
> fully catch all accesses. Something like 32KB.

The crash behaviour is handy to catch that the problem exists, and is very cheap (0 cost) at run time. However, once it's known I think we need something more targeted to allow tracking of the cause. Perhaps we could set the offset thingy to -1 or something and simply do something like the following in per_cpu():

    if (__per_cpu_offset[cpu] == -1)
        BUG();
    else
        *RELOC_HIDE(&per_cpu__##var, __per_cpu_offset[cpu])

> In the meantime could you apply this one ?
>
> Signed-off-by: Eric Dumazet <dada1(a)cosmosbay.com>
>
> ------------------------------------------------------------------------
>
> --- a/arch/i386/kernel/nmi.c 2006-01-27 07:51:04.000000000 +0100
> +++ b/arch/i386/kernel/nmi.c 2006-01-27 07:52:14.000000000 +0100
> @@ -148,7 +148,7 @@
>      if (nmi_watchdog == NMI_LOCAL_APIC)
>          smp_call_function(nmi_cpu_busy, (void *)&endflag, 0, 0);
>
> -    for (cpu = 0; cpu < NR_CPUS; cpu++)
> +    for_each_cpu(cpu)
>          prev_nmi_count[cpu] = per_cpu(irq_stat, cpu).__nmi_count;
>      local_irq_enable();
>      mdelay((10*1000)/nmi_hz); // wait 10 ticks

No change to the panics in alloc_slabmgmt. A very quick review seems to say that slab percpu data is actually not in percpu space, so that seems a little odd. I have not had any real time to trace it further. If you have any other missed ones than this, send them along and I'll put them through the mill.

-apw
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Michal Piotrowski on 27 Jan 2006 05:20

Hi,

On 26/01/06, Nick Piggin <nickpiggin(a)yahoo.com.au> wrote:
> Nick Piggin wrote:
> Sorry, wrong patch.
>
> Note the warnings you are seeing should not result in memory
> corruption, but will result in the given hugepage leaking.
>
> --
> SUSE Labs, Novell Inc.
>
> Index: linux-2.6/include/linux/mm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm.h
> +++ linux-2.6/include/linux/mm.h
> @@ -294,6 +294,8 @@ struct page {
>   */
>  static inline int put_page_testzero(struct page *page)
>  {
> +    if (unlikely(PageCompound(page)))
> +        page = (struct page *)page_private(page);
>      BUG_ON(atomic_read(&page->_count) == 0);
>      return atomic_dec_and_test(&page->_count);
>  }

Now I have got this:

BUG: unable to handle kernel paging request at virtual address eaa34b3c
 printing eip:
b0161cdd
*pde = 0048a067
*pte = 3aa34000
Oops: 0000 [#1]
PREEMPT SMP DEBUG_PAGEALLOC
last sysfs file: /devices/pci0000:00/0000:00:1d.1/usb3/idVendor
Modules linked in: snd_intel8x0 snd_ac97_codec snd_ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm snd_timer ide_cd cdrom intel_agp agpgart snd i2c_i801 hw_random soundcore snd_page_alloc unix
CPU: 0
EIP: 0060:[<b0161cdd>] Not tainted VLI
EFLAGS: 00010282 (2.6.16-rc1-mm3 #4)
EIP is at do_path_lookup+0x22b/0x259
eax: eaa34b20 ebx: eb328000 ecx: 00000000 edx: eb328f4c
esi: ffffff9c edi: fffffffe ebp: eb328f24 esp: eb328f0c
ds: 007b es: 007b ss: 0068
Process udevd (pid: 731, threadinfo=eb328000 task=eb30ca80)
Stack: <0>00000000 eb5cb000 b015fab1 eb5cb000 eb5cb000 00000000 eb328f40 b01621f3
       eb328f4c ffffff9c afbf6dec afbf6dec 00000100 eb328f9c b015bf69 eb328f4c
       eaa34b20 b23d5f28 00000000 eb329003 b015f8ce 00000000 00000001 00000000
Call Trace:
 [<b0103917>] show_stack_log_lvl+0xaa/0xb5
 [<b0103a54>] show_registers+0x132/0x19d
 [<b0103d91>] die+0x171/0x1fb
 [<b02ab110>] do_page_fault+0x3be/0x568
 [<b010343f>] error_code+0x4f/0x54
 [<b01621f3>] __user_walk_fd+0x2d/0x41
 [<b015bf69>] sys_readlinkat+0x26/0x93
 [<b015bfe9>] sys_readlink+0x13/0x15
 [<b01028bf>] sysenter_past_esp+0x54/0x75
Code: 00 83 c0 04 e8 9a 82 14 00 8b 03 c7 80 e4 01 00 00 00 00 00 00 8b 55 08 8b 45 ec e8 55 fa ff ff 89 c7 8b 55 08 8b 02 85 c0 74 24 <8b> 50 1c 85 d2 74 1d b8 00 f0 ff ff 21 e0 8b 00 83 b8 d4 04 00
 <6>ACPI: PCI Interrupt 0000:02:05.0[A] -> GSI 22 (level, low) -> IRQ 21

Here is dmesg:
http://www.stardust.webpages.pl/files/mm/2.6.16-rc1-mm3/mm-dmesg2

Here is config:
http://www.stardust.webpages.pl/files/mm/2.6.16-rc1-mm3/mm-config

Regards,
Michal Piotrowski
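The idea behind the quoted put_page_testzero() change can be sketched in user space as follows. This is only a simplified model, not the kernel code: struct page_model and its fields are hypothetical stand-ins, the head pointer plays the role of page_private(), and the counter is a plain int rather than an atomic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for struct page; field names are illustrative only. */
struct page_model {
    int count;                  /* reference count, kept on the head page */
    bool compound;              /* set on every page of a compound page */
    struct page_model *head;    /* models page_private(): points at the head */
};

/* Drop one reference; returns true when the count reaches zero. */
bool put_page_testzero_model(struct page_model *page)
{
    if (page->compound)
        page = page->head;      /* redirect to the head page's counter */
    assert(page->count > 0);    /* models BUG_ON(count == 0) */
    page->count--;              /* not atomic, unlike the kernel version */
    return page->count == 0;
}
```

In this model a put on a tail page decrements the head's counter, so the tail's own count is never touched.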
From: Eric Dumazet on 27 Jan 2006 05:40

Andy Whitcroft wrote:
> Eric Dumazet wrote:
>> The NULL choice was maybe wrong. We might need more than one page to
>> fully catch all accesses. Something like 32KB.
>
> The crash behaviour is handy to catch that the problem exists, and is
> very cheap (0 cost) at run time. However, once it's known I think we
> need something more targeted to allow tracking of the cause. Perhaps
> we could set the offset thingy to -1 or something and simply do
> something like the following in per_cpu():
>
>     if (__per_cpu_offset[cpu] == -1)
>         BUG();
>     else
>         *RELOC_HIDE(&per_cpu__##var, __per_cpu_offset[cpu])
>

Yes, we can set __per_cpu_offset[not_possible_cpu] to 0, because [__per_cpu_start,__per_cpu_end) is in the init section and should be discarded in free_initmem().

I'm not sure if the freed virtual space can later be reused.

Eric
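The two ideas discussed in this exchange, a sentinel offset that is checked on access and iterating only over possible CPUs as in the for_each_cpu() conversion, can be modelled in user space roughly as below. Everything here is a hypothetical stand-in (NR_CPUS_MODEL, OFFSET_INVALID, the tables and both functions), and the checked accessor returns NULL where the proposal would call BUG().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS_MODEL 4                 /* hypothetical compile-time maximum */
#define OFFSET_INVALID ((size_t)-1)     /* sentinel for CPUs that cannot exist */

/* Model: CPUs 0 and 1 are possible, 2 and 3 are not. */
static const bool cpu_possible_model[NR_CPUS_MODEL] = { true, true, false, false };

/* Per-CPU offsets into the backing array; invalid for non-possible CPUs. */
static const size_t per_cpu_offset_model[NR_CPUS_MODEL] = {
    0, 1, OFFSET_INVALID, OFFSET_INVALID
};

static unsigned int nmi_count_storage[2] = { 7, 9 };

/* Returns NULL (instead of calling BUG()) for a CPU with the sentinel offset. */
unsigned int *per_cpu_ptr_checked(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS_MODEL);
    if (per_cpu_offset_model[cpu] == OFFSET_INVALID)
        return NULL;
    return &nmi_count_storage[per_cpu_offset_model[cpu]];
}

/* Sum over possible CPUs only, skipping the ones a 0..NR_CPUS loop would hit. */
unsigned int sum_possible_cpus(void)
{
    unsigned int total = 0;
    for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++) {
        if (!cpu_possible_model[cpu])
            continue;
        total += *per_cpu_ptr_checked(cpu);
    }
    return total;
}
```

The explicit check costs a compare on every access, which is the trade-off against the zero-cost crash-on-fault approach mentioned in the quoted text.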
From: Reuben Farrelly on 27 Jan 2006 06:50

On 25/01/2006 8:24 p.m., Andrew Morton wrote:
> http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.16-rc1/2.6.16-rc1-mm3/
>
> - Dropped the timekeeping patch series due to a complex timesource selection
>   bug.
>
> - Various fixes and updates.
>
> Changes since 2.6.16-rc1-mm2:

Just triggered this one, which had a fairly bad effect on connectivity to the box:

i2c /dev entries driver
slab error in kmem_cache_destroy(): cache `ip_conntrack': Can't free all objects
 [<b010412b>] show_trace+0xd/0xf
 [<b01041cc>] dump_stack+0x17/0x19
 [<b0155d04>] kmem_cache_destroy+0x9b/0x1a9
 [<f0ebf701>] ip_conntrack_cleanup+0x5d/0x10e [ip_conntrack]
 [<f0ebe31e>] init_or_cleanup+0x1f8/0x283 [ip_conntrack]
 [<f0ec2c4e>] fini+0xa/0x66 [ip_conntrack]
 [<b0136d06>] sys_delete_module+0x161/0x1fb
 [<b0102b3f>] sysenter_past_esp+0x54/0x75
Removing netfilter NETLINK layer.
[root(a)tornado log]#

I was just reading IMAP mail at the time, i.e. the same as I'd been doing for an hour or two beforehand, and not altering the config of the box in any way. I was able to log on via console but lost all network connectivity and had to reboot :(

Generic details such as the .config are at http://www.reub.net/files/kernel/

reuben
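For readers unfamiliar with the "Can't free all objects" message, the general condition it reports can be shown with a toy model: a cache cannot be torn down while objects allocated from it are still handed out. This is not the slab allocator and says nothing about why the module was being unloaded here; struct cache_model and its functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy object cache: tracks only how many objects are still handed out. */
struct cache_model {
    const char *name;
    size_t live_objects;
    bool destroyed;
};

/* Hand out one object from a cache that has not been destroyed. */
void cache_get(struct cache_model *c)
{
    assert(!c->destroyed);
    c->live_objects++;
}

/* Return one previously handed-out object. */
void cache_put(struct cache_model *c)
{
    assert(c->live_objects > 0);
    c->live_objects--;
}

/* Returns 0 on success, -1 if objects are still in use (cache is kept). */
int cache_destroy_model(struct cache_model *c)
{
    if (c->live_objects != 0)
        return -1;
    c->destroyed = true;
    return 0;
}
```

In the model, destroy fails until every outstanding object has been returned, which is the situation the slab error above describes for the ip_conntrack cache.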