Defunct process consumes all CPU, init does not reap

Bug #913787 reported by Sampo
This bug affects 4 people
Affects: linux (Ubuntu)
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

I have managed to create a zombie process that consumes 100% of the CPU
and is unkillable. I see two bugs here:

1. A zombie process should not have any CPU activity (only a process
   table slot should be consumed). Something is wrong with the kernel
   allowing a zombie process to still have an executing thread.
2. After its parent is killed, the process is inherited by the init process (pid 1).
   Init should reap the zombie. This is not happening.

How to reproduce: Create a file larger than 4GB in an encrypted home directory. (This can
easily happen if you start downloading a torrent.)

Workaround: do not encrypt home directories. Given that encrypted home directories
become dangerous for nonexpert users due to this bug, the encrypt home directory option
should not be offered in the install process.

Discussion: While the culprit may appear to be the ecryptfs kernel module
or ext4 or their interaction, I claim that despite the module's
misbehaviour, it still should not be possible to have a process in
zombie state executing a thread. I also claim that init still should
reap the process.

It is NOT a filesystem, disk, or crypto CPU performance problem
(the filesystem continues to perform for all other purposes and I waited
over 30h for it to possibly sort itself out). It really is something
unduly stuck in the kernel.

Distribution: LinuxMint 12 "lisa" based on Ubuntu 11.10 "Oneiric" based on Debian?
apt-get upgrade run on 20120107
$ lsb_release -rd
Description: Linux Mint 12 Lisa
Release: 12
$ uname -a
Linux saz 3.0.0-12-generic #20-Ubuntu SMP Fri Oct 7 14:50:42 UTC 2011 i686 i686 i386 GNU/Linux
$ mount |grep ecrypt
/home/sk/.Private on /home/sk type ecryptfs (ecryptfs_check_dev_ruid,ecryptfs_cipher=aes,ecryptfs_key_bytes=16,ecryptfs_unlink_sigs,ecryptfs_sig=42979cfc6e80278a,ecryptfs_fnek_sig=6261c2e7178a57d3)

Potentially related bugs: #431975 (2009-09-17), #888497

Given that this is an obvious and serious bug, it is quite sad that it
has not been fixed in 2 years. If there is no intent to support
encrypted home directories, then the feature should be removed (at
least from install).

Cheers,
--Sampo

P.S. ubuntu-bug PID on the zombie process reports

*** Error: Invalid PID

The specified process ID does not belong to a program.

Press any key to continue...

No pending crash reports. Try --help for more information.

Perhaps ubuntu-bug should be more robust to be useful in submitting
bugs like this.

P.S2: It is quite annoying that I can't get past the "Is one of these bugs the same bug"
question although this is a genuinely different bug. I had to change the subject
line to be less descriptive to get past it. Such automated braindamage *requires* me
to degrade the quality of my bug report.

Tags: bot-comment

Ubuntu Foundations Team Bug Bot (crichton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/913787/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment

Sampo (sampo) wrote : Re: [Bug 913787]

Ubuntu Foundation's Bug Bot <email address hidden> said:

> Thank you for taking the time to report this bug and helping to make
> Ubuntu better. It seems that your bug report is not filed about a
> specific source package though, rather it is just filed against Ubuntu
> in general. It is important that bug reports be filed about source
> packages so that people interested in the package can find the bugs
> about it. You can find some hints about determining what package your
> bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage.

This link is broken.

At any rate, the bug is about the kernel; however, your bug tracker does not accept "kernel"
as a package name, so I was unable to label it correctly.

--Sampo

Phillip Susi (psusi) wrote :

The name of the kernel package is "linux" and the link is not broken. By definition a zombie process can not use any cpu since it has exited. Can you post the output of ps -l on this process?

Also your reproduction steps are incomplete. You said create a 4gb file, but then what? What process are you trying to end and how are you ending it?

affects: ubuntu → linux (Ubuntu)
Changed in linux (Ubuntu):
status: New → Incomplete

Joseph Salisbury (jsalisbury) wrote :

Are you running Ubuntu, or Mint?

Sampo (sampo) wrote :

LinuxMint 12, with all packages upgraded to latest on 20120107.
(This info was available in my original post near the end.)

(The link was broken when I tried it: it showed the typical Wiki "Add this page" stand-in
page. It seems this has been fixed since then.)
(Is there a way to trim the excessive quoted material from this thread? When I replied to the
mail, I did not realize it would end up in the bug tracker - and now the broken link issue
is moot, so it should be trimmed away.)

Are there any investigative steps I could take? My machine still has the zombie eating 100% of the CPU.

Cheers,
--Sampo

Sampo (sampo) wrote :

The zombie is pid 12989

ps -axl
Warning: bad ps syntax, perhaps a bogus '-'? See http://procps.sf.net/faq.html
F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
4 0 1 0 20 0 3316 1944 poll_s Ss ? 0:01 /sbin/init
1 0 2 0 20 0 0 0 kthrea S ? 0:00 [kthreadd]
1 0 3 2 20 0 0 0 run_ks S ? 0:02 [ksoftirqd/0]
1 0 6 2 -100 - 0 0 cpu_st S ? 0:00 [migration/0]
1 0 7 2 -100 - 0 0 cpu_st S ? 0:00 [migration/1]
1 0 9 2 20 0 0 0 run_ks S ? 0:00 [ksoftirqd/1]
1 0 11 2 -100 - 0 0 cpu_st S ? 0:00 [migration/2]
1 0 13 2 20 0 0 0 run_ks S ? 0:02 [ksoftirqd/2]
1 0 14 2 -100 - 0 0 cpu_st S ? 0:00 [migration/3]
1 0 16 2 20 0 0 0 run_ks S ? 0:00 [ksoftirqd/3]
1 0 17 2 0 -20 0 0 rescue S< ? 0:00 [cpuset]
1 0 18 2 0 -20 0 0 rescue S< ? 0:00 [khelper]
1 0 19 2 0 -20 0 0 rescue S< ? 0:00 [netns]
1 0 21 2 20 0 0 0 bdi_sy S ? 0:00 [sync_supers]
1 0 22 2 20 0 0 0 bdi_fo S ? 0:00 [bdi-default]
1 0 23 2 0 -20 0 0 rescue S< ? 0:00 [kintegrityd]
1 0 24 2 0 -20 0 0 rescue S< ? 0:00 [kblockd]
1 0 25 2 0 -20 0 0 rescue S< ? 0:00 [ata_sff]
5 0 26 2 20 0 0 0 hub_th S ? 0:00 [khubd]
1 0 27 2 0 -20 0 0 rescue S< ? 0:00 [md]
1 0 29 2 20 0 0 0 watchd S ? 0:00 [khungtaskd]
1 0 30 2 20 0 0 0 kswapd S ? 0:15 [kswapd0]
1 0 31 2 25 5 0 0 ksm_sc SN ? 0:00 [ksmd]
1 0 32 2 39 19 0 0 khugep SN ? 0:00 [khugepaged]
1 0 33 2 20 0 0 0 fsnoti S ? 0:00 [fsnotify_mark]
1 0 34 2 20 0 0 0 ecrypt S ? 0:00 [ecryptfs-kthrea]
1 0 35 2 0 -20 0 0 rescue S< ? 0:00 [crypto]
1 0 43 2 0 -20 0 0 rescue S< ? 0:00 [kthrotld]
1 0 210 2 20 0 0 0 scsi_e S ? 0:00 [scsi_eh_0]
1 0 211 2 20 0 0 0 scsi_e S ? 0:00 [scsi_eh_1]
1 0 212 2 20 0 0 0 scsi_e S ? 0:00 [scsi_eh_2]
1 0 214 2 20 0 0 0 scsi_e S ? 0:00 [scsi_eh_3]
1 0 216 2 20 0 0 0 scsi_e S ? 0:00 [scsi_eh_4]
1 0 224 2 20 0 0 0 scsi_e S ? 0:00 [scsi_eh_5]
1 0 305 2 20 0 0 0 kjourn S ? 0:10 [jbd2/sda8-8]
1 0 306 2 0 -20 0 0 rescue S< ? 0:00 [ext4-dio-unwrit]
1 0 362 1 20 0 2648 588 poll_s S ? 0:00 upstart-udev-bridge -...

Sampo (sampo) wrote :

Re reproduction:

When you create the file (e.g. by write(2) system calls), the process gets stuck at some point when
the file grows past a threshold, suspected to be 4GB, though my file was actually 5GB.

Then I tried to kill the process (no effect) and then kill -9, which made it a zombie. Then I killed its parent
so init (pid 1) inherited it. Then nothing. Init did not reap it, and I have not killed init because I thought
it may be more interesting to investigate before rebooting.

Cheers,
--Sampo
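
The kill sequence described above (plain kill, then kill -9, checking the state after each step) could be scripted roughly as follows. This is only a sketch: the function names and the one-second grace period are assumptions made for illustration, and the state letter is read from /proc/<pid>/stat as documented in proc(5).

```python
import os
import signal
import time


def read_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if the PID is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so the state is the character two past the last ')'.
    return data[data.rfind(")") + 2]


def kill_with_escalation(pid, grace=1.0):
    """Send SIGTERM, wait, then SIGKILL; return the state seen after each step."""
    seen = {}
    for sig in (signal.SIGTERM, signal.SIGKILL):
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            seen[sig.name] = None
            break
        time.sleep(grace)
        seen[sig.name] = read_state(pid)
        if seen[sig.name] in (None, "Z"):
            break  # gone or already a zombie: further signals cannot help
    return seen
```

A result in which the SIGKILL entry is "Z" and stays that way after the parent has exited would match what is reported here. Note that this only looks at the thread group leader; it says nothing about other threads of the same process.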

Sampo (sampo) wrote :

Easier way to reproduce:

cd /home/sampo # This directory is the encrypted home directory
dd if=/dev/zero of=seek5GB seek=5G bs=1K count=1

Then kill it.

--Sampo
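
A note on the arithmetic of the dd command above, since it affects how large the file is asked to be. GNU dd counts seek= in output blocks, not bytes, and treats K and G as powers of 1024. The sketch below (helper names are made up for illustration) works out the resulting apparent file size; whether that size matters for the hang is not established in this thread.

```python
# Size suffixes as GNU dd interprets them (powers of 1024).
SUFFIXES = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_dd_number(text):
    """Convert a dd-style number such as '5G' or '1K' to an integer."""
    if text[-1] in SUFFIXES:
        return int(text[:-1]) * SUFFIXES[text[-1]]
    return int(text)


def dd_output_size(seek, bs, count):
    """Apparent size in bytes of a new file written with seek=, bs= and count=."""
    block = parse_dd_number(bs)
    # seek= skips that many blocks; count= then writes that many blocks.
    return (parse_dd_number(seek) + parse_dd_number(count)) * block
```

With seek=5G, bs=1K and count=1 this gives 5 * 1024**4 + 1024 bytes, i.e. the write lands 5 TiB into the file rather than 5 GB. A seek of 5M with the same block size would place it at 5 GiB.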

Phillip Susi (psusi) wrote : Re: [Bug 913787] Re: Defunct process consumes all CPU, init does not reap

How about ps -lL 12989?

Sampo (sampo) wrote :

ps -lL 12989
F S   UID   PID  PPID   LWP  C PRI  NI ADDR    SZ WCHAN  TTY    TIME CMD
0 Z  1000 12989     1 12989  0  80   0 -        0 exit   ?      0:30 [transmission-gt] <defunct>
1 R  1000 12989     1 12991 97  80   0 -    37113 -      ?   2694:09 [transmission-gt]
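
The same per-thread view that ps -lL gives can be read directly from /proc/<pid>/task/<tid>/stat. The following is a minimal sketch, assuming the field layout documented in proc(5) (state is field 3, utime field 14, stime field 15); the function names and the proc_root parameter are illustrative only.

```python
import os


def parse_stat(line):
    """Parse one /proc/<pid>/task/<tid>/stat line into a small dict."""
    # comm sits in parentheses and may contain spaces or ')', so split
    # around the last ')' instead of splitting the whole line on whitespace.
    head, _, tail = line.rpartition(")")
    tid_text, comm = head.split("(", 1)
    fields = tail.split()
    return {
        "tid": int(tid_text),
        "comm": comm,
        "state": fields[0],        # field 3 in proc(5)
        "utime": int(fields[11]),  # field 14, clock ticks in user mode
        "stime": int(fields[12]),  # field 15, clock ticks in kernel mode
    }


def thread_states(pid, proc_root="/proc"):
    """Return parsed stat entries for every thread of pid, sorted by TID."""
    task_dir = os.path.join(proc_root, str(pid), "task")
    entries = []
    for tid in os.listdir(task_dir):
        try:
            with open(os.path.join(task_dir, tid, "stat")) as f:
                entries.append(parse_stat(f.read()))
        except (FileNotFoundError, ProcessLookupError):
            continue  # thread exited while we were listing
    return sorted(entries, key=lambda e: e["tid"])
```

For the process in this report one would call thread_states(12989) and compare the state of the leader thread with that of the other thread.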

Phillip Susi (psusi) wrote :

It looks like transmission has one thread trying to exit, which has placed the process in the zombie state, and the other thread is runaway in kernel space. Can you press alt-sysrq-t and look through /var/log/kern.log for the section relating to that process?
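
One way to check the "runaway in kernel space" reading without SysRq is to sample the thread's user and system CPU counters twice and see which one grows. This is a rough sketch, assuming the utime/stime fields of /proc/<pid>/task/<tid>/stat as documented in proc(5); the function names and the simple "larger delta wins" rule are assumptions, not an established diagnostic.

```python
import time


def cpu_ticks(stat_line):
    """Return (utime, stime) in clock ticks from a /proc stat line."""
    # Everything after the last ')' is whitespace separated; utime and
    # stime are fields 14 and 15, i.e. indexes 11 and 12 of that tail.
    fields = stat_line.rpartition(")")[2].split()
    return int(fields[11]), int(fields[12])


def classify(before, after):
    """Label where a thread spent its CPU time between two (utime, stime) samples."""
    d_user = after[0] - before[0]
    d_sys = after[1] - before[1]
    if d_user == 0 and d_sys == 0:
        return "idle"
    return "kernel" if d_sys > d_user else "user"


def sample_thread(pid, tid, interval=1.0):
    """Sample one thread twice and classify the CPU time it used in between."""
    path = f"/proc/{pid}/task/{tid}/stat"
    with open(path) as f:
        before = cpu_ticks(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = cpu_ticks(f.read())
    return classify(before, after)
```

For the thread discussed here the call would be sample_thread(12989, 12991). A "kernel" result would be consistent with the explanation above, though it would not say where in the kernel the thread is spinning.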

Sampo (sampo) wrote :

As unbelievable as it may sound, my machine (Samsung NP300U) does not have a sysrq key. Is there some
other way to produce whatever alt-sysrq-t does?

However, I did find the /var/log/kern.log file. It starts from Jan 8 and the last time stamp was a few minutes ago,
so I believe it is valid.

I tried grepping for oops, trans, 12989, and 12991. No matches.
I also tried to grep for the other stuck process, the one that corresponds to the dd command line above.

I tried attaching to the surviving transmission thread with strace and gdb, but no syscall activity
was visible and I was unable to stop the process to get a stack trace.

If you are online right now, we could skype chat: sampo.kellomaki to make these debug
cycles shorter.

Cheers,
--Sampo

Sampo (sampo) wrote :

Ok found it: echo t >/proc/sysrq-trigger

My dd experiments apparently are not zombies, but are unkillable nevertheless. For the
dd trials a call trace was found in the log. For the runaway transmission-gtk thread (12991)
no such trace could be found. Please find below what I thought relevant. I'll attach the gzipped
full log.

Jan 10 23:30:50 saz kernel: [332630.928121] dd R running 0 20562 1 0x00000004
Jan 10 23:30:50 saz kernel: [332630.928128] 00000000 00000000 f62e9180 00000000 f0495bb4 c10deba0 00003000 00000000
Jan 10 23:30:50 saz kernel: [332630.928141] f0495d34 00000001 00001000 00000000 ffffffff 00002000 00000000 f0495c20
Jan 10 23:30:50 saz kernel: [332630.928154] c10e0554 00002000 00000000 f0495cf8 00001000 00000000 efcf2900 00012eb5
Jan 10 23:30:50 saz kernel: [332630.928166] Call Trace:
Jan 10 23:30:50 saz kernel: [332630.928174] [<c10deba0>] ? generic_file_buffered_write+0x50/0x80
Jan 10 23:30:50 saz kernel: [332630.928182] [<c10e0554>] ? __generic_file_aio_write+0x224/0x4f0
Jan 10 23:30:50 saz kernel: [332630.928191] [<c102b60c>] ? kmap_atomic_prot+0x4c/0x100
Jan 10 23:30:50 saz kernel: [332630.928198] [<c102b6d3>] ? __kmap_atomic+0x13/0x20
Jan 10 23:30:50 saz kernel: [332630.928205] [<c12565c2>] ? scatterwalk_map+0x22/0x30
Jan 10 23:30:50 saz kernel: [332630.928212] [<c1258ebc>] ? blkcipher_walk_next+0x1cc/0x3c0
Jan 10 23:30:50 saz kernel: [332630.928221] [<c1259125>] ? blkcipher_walk_first+0x75/0x160
Jan 10 23:30:50 saz kernel: [332630.928229] [<c125f231>] ? crypto_cbc_encrypt+0x81/0x110
Jan 10 23:30:50 saz kernel: [332630.928236] [<c125f9e0>] ? crypto_aes_set_key+0x30/0x30
Jan 10 23:30:50 saz kernel: [332630.928245] [<c11fb4f4>] ? encrypt_scatterlist+0x94/0x120
Jan 10 23:30:50 saz kernel: [332630.928253] [<c11fbd74>] ? ecryptfs_encrypt_extent+0xe4/0x230
Jan 10 23:30:50 saz kernel: [332630.928262] [<c11fc2f4>] ? ecryptfs_encrypt_page+0x94/0x1c0
Jan 10 23:30:50 saz kernel: [332630.928271] [<c10df494>] ? read_cache_page_async+0x24/0x30
Jan 10 23:30:50 saz kernel: [332630.928279] [<c102b69e>] ? kmap_atomic_prot+0xde/0x100
Jan 10 23:30:50 saz kernel: [332630.928286] [<c11fac46>] ? ecryptfs_write+0x196/0x320
Jan 10 23:30:50 saz kernel: [332630.928293] [<c11f6fe0>] ? ecryptfs_open+0x130/0x2a0
Jan 10 23:30:50 saz kernel: [332630.928301] [<c11f83e8>] ? truncate_upper.isra.12+0x2a8/0x370
Jan 10 23:30:50 saz kernel: [332630.928308] [<c10e0ea3>] ? filemap_fault+0xf3/0x390
Jan 10 23:30:50 saz kernel: [332630.928314] [<c1143178>] ? mntput+0x18/0x30
Jan 10 23:30:50 saz kernel: [332630.928321] [<c113147a>] ? path_put+0x1a/0x20
Jan 10 23:30:50 saz kernel: [332630.928328] [<c11f85ff>] ? ecryptfs_setattr+0x14f/0x260
Jan 10 23:30:50 saz kernel: [332630.928336] [<c152fc2f>] ? do_page_fault+0x22f/0x4a0
Jan 10 23:30:50 saz kernel: [332630.928343] [<c113192b>] ? putname+0x2b/0x40
Jan 10 23:30:50 saz kernel: [332630.928352] [<c11402b9>] ? notify_change+0x149/0x310
Jan 10 23:30:50 saz kernel: [332630.928360] [<c11265a6>] ? do_truncate+0x56/0x90
Jan 10 23:30:50 saz kernel: [332630.928367] [<c152fc2f>] ? do_page_fault+0x22f/0x4a0
Jan 10 23:30:50 saz kernel: [332630.928374] [<c113192b>] ? putname+0x2b/0x...

Sampo (sampo) wrote :

Full log from which the above was extracted. --Sampo
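
For anyone digging through a log like this, the section belonging to one task can be pulled out mechanically instead of grepping by hand. The sketch below assumes the task header layout seen in this thread (command, state, a third column, a number, pid, ppid, flags); the regular expression and function names are illustrative, and anything logged after the last task header is attributed to that last task.

```python
import re

# Header line of one task in a SysRq-t dump, e.g.
#   "... kernel: [332630.928121] dd R running 0 20562 1 0x00000004"
# Captured groups: command, state, pid, ppid.
TASK_HEADER = re.compile(
    r"\]\s+(\S+)\s+(\S)\s+\S+\s+\d+\s+(\d+)\s+(\d+)\s+0x[0-9a-f]+\s*$"
)


def request_task_dump(trigger="/proc/sysrq-trigger"):
    """Ask the kernel to log all task states; needs root on a real system."""
    with open(trigger, "w") as f:
        f.write("t")


def task_section(log_lines, pid):
    """Return the log lines of the dump section belonging to pid."""
    section, collecting = [], False
    for line in log_lines:
        match = TASK_HEADER.search(line)
        if match:
            # A new task header starts here: collect only if it is ours.
            collecting = int(match.group(3)) == pid
        if collecting:
            section.append(line.rstrip("\n"))
    return section
```

Typical use would be to call request_task_dump() as root, then open /var/log/kern.log and pass the file object to task_section together with the pid of interest, for example 20562 for the dd process. Threads other than the group leader are listed under their own thread ID in the pid column, so 12991 would be the number to look for in the transmission case.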

Alejandro R. Mosteo (mosteo) wrote :

I have right now a deluged process behaving exactly as this (100% cpu, unkillable, becomes zombie when killed and keeps eating 100% cpu, using an encrypted filesystem).

Possible duplicates: #925309 #665211 (this last one was reported by me)
Not so sure: #838061

Quite nasty bug, since it forces a reboot.

Here is the info I could gather about the runaway process. Note how it goes from running to zombie:

$ uname -a
Linux isila 3.0.0-15-generic-pae #26-Ubuntu SMP Fri Jan 20 17:07:31 UTC 2012 i686 i686 i386 GNU/Linux
$ ps ax | grep deluged | grep -v grep
 3416 ? Sl 79:04 /usr/bin/python /usr/bin/deluged --port=58846 --config=/home/user/.config/deluge
$ kill 3416 ; sleep 1; ps ax | grep deluged | grep -v grep
 3416 ? Sl 79:22 /usr/bin/python /usr/bin/deluged --port=58846 --config=/home/user/.config/deluge
$ kill -9 3416 ; sleep 1; ps ax | grep deluged | grep -v grep
 3416 ? Zl 79:33 [deluged] <defunct>
$ kill -9 3416 ; sleep 1; ps ax | grep deluged | grep -v grep
 3416 ? Zl 79:37 [deluged] <defunct>
$ top -b -p 3416
top - 13:52:18 up 1:45, 4 users, load average: 1.19, 1.23, 1.58
Tasks: 1 total, 0 running, 0 sleeping, 0 stopped, 1 zombie
Cpu(s): 6.3%us, 19.9%sy, 0.0%ni, 59.9%id, 13.8%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 8179056k total, 7960816k used, 218240k free, 232724k buffers
Swap: 8388604k total, 1316k used, 8387288k free, 6397284k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3416 user      20   0     0    0    0 Z  100  0.0  80:27.44 deluged <defunct>

And using the sysrq-t thingie:

Feb 10 13:52:32 Isila kernel: [ 6329.309318] deluged x ed2a9e8c 0 3416 1 0x00000004
Feb 10 13:52:32 Isila kernel: [ 6329.309321] de99fe40 00000046 00000000 ed2a9e8c 00000000 00000000 f74d0000 c18b2d40
Feb 10 13:52:32 Isila kernel: [ 6329.309324] c18b2d40 e2d22e9d 000005b1 f7886d40 e82d8000 e9830000 f18cf520 f7407e00
Feb 10 13:52:32 Isila kernel: [ 6329.309327] de99fe28 c112392a c1278379 f18cf520 c10b81f2 de99fe28 c1278379 c1278379
Feb 10 13:52:32 Isila kernel: [ 6329.309331] Call Trace:
Feb 10 13:52:32 Isila kernel: [ 6329.309333] [<c112392a>] ? kmem_cache_free+0xea/0x100
Feb 10 13:52:32 Isila kernel: [ 6329.309334] [<c1278379>] ? put_io_context+0x39/0x60
Feb 10 13:52:32 Isila kernel: [ 6329.309336] [<c10b81f2>] ? call_rcu_sched+0x12/0x20
Feb 10 13:52:32 Isila kernel: [ 6329.309338] [<c1278379>] ? put_io_context+0x39/0x60
Feb 10 13:52:32 Isila kernel: [ 6329.309340] [<c1278379>] ? put_io_context+0x39/0x60
Feb 10 13:52:32 Isila kernel: [ 6329.309342] [<c1278379>] ? put_io_context+0x39/0x60
Feb 10 13:52:32 Isila kernel: [ 6329.309344] [<c155ae25>] schedule+0x35/0x50
Feb 10 13:52:32 Isila kernel: [ 6329.309345] [<c1054810>] do_exit+0x1f0/0x3a0
Feb 10 13:52:32 Isila kernel: [ 6329.309347] [<c1061897>] ? recalc_sigpending+0x17/0x40
Feb 10 13:52:32 Isila kernel: [ 6329.309349] [<c1061a11>] ? dequeue_signal+0x31/0x190
Feb 10 13:52:32 Isila kernel: [ 6329.309351] [<c102de68>] ? default_spin_lock_flags+0x8/0x10
Feb 10 13:52:32 Isila kernel: [ 6329.309353] [<c1054b18...

Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired