Comment 25 for bug 7168

Debian Bug Importer (debzilla) wrote :

Message-ID: <email address hidden>
Date: Thu, 29 Jul 2004 09:20:29 +0200
From: Javier Fernández-Sanguino Peña <email address hidden>
To: Horms <email address hidden>
Cc: <email address hidden>
Subject: Re: Bug#255175: kernel-image-2.4.26-1-686: system crash due to kernel bug

--gKMricLos+KVdGMg
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

> I am not sure that would really help.
> Are you sure that it couldn't be a hardware problem?

I don't see any hardware problems in the log before the kernel oopses. If
there are hardware issues, then it's the kernel's fault that nothing gets
reported.

The only thing I can think of is that there might be some hard drive
problems which the kernel does not report, so that when it tries to use the
swap space it cannot read from or write to it, and this generates the
oopses. Isn't there a tool to test the swap space (besides 'mkswap -c')?

The one thing I'm surprised about is that the oopses vary somewhat in their
messages:

 kernel BUG at mmap.c:1172!
 kernel BUG at page_alloc.c:152!
 kernel BUG at page_alloc.c:221!

Digging into the code for the first one, I find it in mm/mmap.c, exit_mmap():
        /* This is just debugging */
        if (mm->map_count)
                BUG();

And the code for the page_alloc ones is:

mm/page_alloc.c:
     84 static void FASTCALL(__free_pages_ok (struct page *page, unsigned int order));
     85 static void __free_pages_ok (struct page *page, unsigned int order)
     86 {
(...)
    149 buddy1 = base + (page_idx ^ -mask);
    150 buddy2 =3D base + page_idx;
    151 if (BAD_RANGE(zone,buddy1))
    152 BUG();
    153 if (BAD_RANGE(zone,buddy2))
    154 BUG();
(...)
    203 static struct page * rmqueue(zone_t *zone, unsigned int order)
    204 {
(...)
    219 page = list_entry(curr, struct page, list);
    220 if (BAD_RANGE(zone,page))
    221 BUG();

I don't have in-depth knowledge of the kernel, but I don't believe that
hardware issues can make the above code trigger those BUG() calls. It looks
to me as if the kernel is somehow not handling its swap definitions properly.
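
To make the checks quoted above concrete, here is a minimal userspace sketch
of what I understand them to do. It is not the kernel code: struct
zone_model, bad_range() and buddy_index() are names made up for
illustration, the sketch assumes mask is computed as ~0UL << order, and the
real BAD_RANGE macro checks more than the index range modelled here.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for zone_t: only the range of page indexes it owns. */
struct zone_model {
    unsigned long start; /* index of the first page in the zone */
    unsigned long size;  /* number of pages in the zone */
};

/* Simplified range check: true if idx lies outside the zone.
 * Assumption: the real macro performs further checks not modelled here. */
static bool bad_range(const struct zone_model *zone, unsigned long idx)
{
    return idx < zone->start || idx >= zone->start + zone->size;
}

/* Index of the buddy of the block that starts at page_idx.
 * With mask = ~0UL << order, -mask equals 1UL << order, so the XOR
 * flips exactly the bit that distinguishes the two buddies. */
static unsigned long buddy_index(unsigned long page_idx, unsigned int order)
{
    assert(order < sizeof(unsigned long) * CHAR_BIT);
    unsigned long mask = ~0UL << order;
    return page_idx ^ -mask;
}
```

In this model the BUG() at line 152 or 221 corresponds to bad_range()
returning true, that is, to a free-list entry or a computed buddy pointing
outside the zone it is supposed to belong to.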

Can you think of a way in which I could reproduce these errors and maybe
trace the kernel to see what's going on?

> It seems to be rather intermittent and I do not have any
> other reports of similar failures.

The "intermittency" might be related to the fact that it's a problem in the
cleanup of swap pages: when swap is not used, the problem does not show
up. For what it's worth, on my system:

$ free
             total       used       free     shared    buffers     cached
Mem:        386156     381800       4356          0      15712     258080
-/+ buffers/cache:     108008     278148
Swap:       979956       1088     978868

So swap is not usually used. The oopses seem to appear when cron jobs make
intensive use of the system and swap usage goes up and down.
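
For anyone reading the free output above, the "-/+ buffers/cache" line is
just arithmetic on the "Mem:" line. A small sketch of that calculation
follows; struct mem_stats and the two helper functions are names invented
for this example, and all figures are in kB as printed by free.

```c
#include <assert.h>

/* Figures from the "Mem:" line of free, in kB. */
struct mem_stats {
    unsigned long total, used, free_kb, buffers, cached;
};

/* Memory actually used by applications: "used" minus buffers and cache. */
static unsigned long used_without_cache(const struct mem_stats *m)
{
    assert(m->buffers + m->cached <= m->used);
    return m->used - m->buffers - m->cached;
}

/* Memory applications could still get: "free" plus buffers and cache. */
static unsigned long free_with_cache(const struct mem_stats *m)
{
    return m->free_kb + m->buffers + m->cached;
}
```

With the figures above (used 381800, free 4356, buffers 15712, cached
258080) these give 108008 and 278148, which matches the second line of the
output.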

> Pending a way to reliably reproduce the problem,
> or at least some confirmation that it manifests on
> different hardware, I have changed the severity to important.

I understand this, but I would appreciate some indication of how to debug
this issue myself if necessary and trace what the kernel is running into.

Regards

Javier

--gKMricLos+KVdGMg
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: Digital signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)

iD8DBQFBCKU9i4sehJTrj0oRAvApAJ4ky5Rd7V25CxGiCAfaiIj3Y+KAqACeMVqP
d8d4EAQxliq+TY4oEGC4EHs=
=OGPB
-----END PGP SIGNATURE-----

--gKMricLos+KVdGMg--