/usr/bin/script freezes and eats all CPU, has to be killed, which kills shell and programs inside

Bug #1558236 reported by Stéphane Gourichon
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
util-linux (Ubuntu)
New
Undecided
Unassigned

Bug Description

# Context

I've been using /usr/bin/script quite a lot for about 4 years (since about Ubuntu 12.04, maybe before). It always worked without problem.

I'm using several script instance most of the time, in several terminals.

Reproducible: randomly, once every few days.

Problem happens since Ubuntu 16.04. Never seen before (12.04 12.10 13.04 13.10 14.04 14.10 15.04 15.10).

# Expected

Continued operation, terminal running programs normally, little CPU used.

# Observed

Suddenly, terminal freezes.
Other tabs of same terminal window not affected.
Top shows script consumes 100% CPU.

Strace shows an endless stream of this line:

poll([{fd=4, events=POLLIN}], 1, 10) = 1 ([{fd=4, revents=POLLIN}])

/proc/*/fd shows what file descriptor it is:

total 0
dr-x------ 2 stephane stephane 0 mars 16 20:03 .
dr-xr-xr-x 9 stephane stephane 0 mars 16 20:03 ..
lrwx------ 1 stephane stephane 64 mars 16 20:15 0 -> /dev/pts/17 (deleted)
lrwx------ 1 stephane stephane 64 mars 16 20:15 1 -> /dev/pts/17 (deleted)
lrwx------ 1 stephane stephane 64 mars 16 20:03 2 -> /dev/pts/17 (deleted)
lrwx------ 1 stephane stephane 64 mars 16 20:15 3 -> /dev/ptmx
lrwx------ 1 stephane stephane 64 mars 16 20:15 4 -> /dev/pts/21
lrwx------ 1 stephane stephane 64 mars 16 20:15 5 -> anon_inode:[signalfd]
l-wx------ 1 stephane stephane 64 mars 16 20:15 6 -> /somewhere/mylogfile.txt
l-wx------ 1 stephane stephane 64 mars 16 20:15 7 -> /somewhere/mylogfile_timings.txt

What can we do?

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: bsdutils 1:2.27.1-6ubuntu1
ProcVersionSignature: Ubuntu 4.4.0-13.29-generic 4.4.5
Uname: Linux 4.4.0-13-generic x86_64
ApportVersion: 2.20-0ubuntu3
Architecture: amd64
CurrentDesktop: XFCE
Date: Wed Mar 16 20:04:02 2016
ExecutablePath: /usr/bin/script
InstallationDate: Installed on 2016-02-03 (41 days ago)
InstallationMedia: Xubuntu 16.04 LTS "Xenial Xerus" - Alpha amd64 (20160203)
SourcePackage: util-linux
UpgradeStatus: No upgrade log present (probably fresh install)

Revision history for this message
Stéphane Gourichon (stephane-gourichon-lpad) wrote :
Revision history for this message
Stéphane Gourichon (stephane-gourichon-lpad) wrote :

kill -TERM does not affect runaway /usr/bin/script
kill -HUP stops it instantly

Revision history for this message
Stéphane Gourichon (stephane-gourichon-lpad) wrote :

Confirmed that bug only happens on Ubuntu 16.04, not on previous Ubuntu releases and not on Debian 8.3.

Machine has these OSes and can boot any at will, in case it helps: Debian stable (8.3), Xubuntu 16.04, Xubuntu 15.04, Linux Mint 17.2 (based on Ubuntu 14.04), Fedora 23 XFCE Edition.

summary: - script freezes and eats all CPU, has to be killed, which kills shell and
- programs inside
+ /usr/bin/script freezes and eats all CPU, has to be killed, which kills
+ shell and programs inside
Revision history for this message
Stéphane Gourichon (stephane-gourichon-lpad) wrote :
Download full text (3.5 KiB)

## Status

Bug still occurs every few days, when I use Ubuntu 16.04, not when using others.

## Trying to get more precise info

### Location in source code

Following https://wiki.ubuntu.com/DebuggingProgramCrash I could get an interactive gdb prompt showing that it's blocked in the while loop below:

term-utils/script.c lines 278-287:

static void wait_for_empty_fd(int fd)
{
 struct pollfd fds[] = {
            { .fd = fd, .events = POLLIN }
 };

 while (poll(fds, 1, 10) == 1) {
  DBG(IO, ul_debug(" data to read"));
 }
}

### Activity at debugger level

Here's the gdb log of the loop when pressing return key at each prompt repeats "step" gdb command:

wait_for_empty_fd (fd=<optimized out>) at term-utils/script.c:284
284 while (poll(fds, 1, 10) == 1) {
(gdb)
285 DBG(IO, ul_debug(" data to read"));
(gdb)
284 while (poll(fds, 1, 10) == 1) {
(gdb)
poll (__timeout=10, __nfds=1, __fds=0x7fffd0eff840) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
46 return __poll_alias (__fds, __nfds, __timeout);
(gdb)
poll () at ../sysdeps/unix/syscall-template.S:84
84 ../sysdeps/unix/syscall-template.S: Aucun fichier ou dossier de ce type.
(gdb)

(gdb) bt
#0 poll (__timeout=10, __nfds=1, __fds=0x7fffd0eff840) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
#1 wait_for_empty_fd (fd=<optimized out>) at term-utils/script.c:284
#2 write_to_shell (ctl=ctl@entry=0x7fffd0f019c0, buf=buf@entry=0x7fffd0eff890 "sometext"..., bufsz=bufsz@entry=1) at term-utils/script.c:303
#3 0x0000000000402e6d in handle_io (ctl=ctl@entry=0x7fffd0f019c0, fd=0, eof=eof@entry=0x7fffd0f01904) at term-utils/script.c:346
#4 0x00000000004023fb in do_io (ctl=0x7fffd0f019c0) at term-utils/script.c:475
#5 main (argc=<optimized out>, argv=<optimized out>) at term-utils/script.c:781

(gdb) print __fds
$1 = (struct pollfd *) 0x7fffd0eff840
(gdb) print *__fds
$2 = {fd = 4, events = 1, revents = 1}

lrwx------ 1 stephane stephane 64 Apr 5 16:38 0 -> /dev/pts/6
lrwx------ 1 stephane stephane 64 Apr 5 16:38 1 -> /dev/pts/6
lrwx------ 1 stephane stephane 64 Apr 5 16:38 2 -> /dev/pts/6
lrwx------ 1 stephane stephane 64 Apr 5 16:38 3 -> /dev/ptmx
lrwx------ 1 stephane stephane 64 Apr 5 16:38 4 -> /dev/pts/21
lrwx------ 1 stephane stephane 64 Apr 5 16:38 5 -> anon_inode:[signalfd]
l-wx------ 1 stephane stephane 64 Apr 5 16:38 6 -> /tmp/myscriptfile.txt
l-wx------ 1 stephane stephane 64 Apr 5 16:38 7 -> /tmp/myscriptfile.timings

## Summary

* "script" polls slave pts at high speed consuming 100% CPU, assuming this won't last. But sometimes it does last forever.
* Polling code was introduced in commit below:

commit 54c6611d6f7b73609a5331f4d0bcf63c4af6429e
Author: Karel Zak <email address hidden>
Date: Wed Jun 17 15:48:50 2015 +0200

    script: fix EOF problems

    * remove STDIN from poll() if:

      * STDIN already closed -- poll returns POLLHUP, for example:

                echo "date" | script

      * detect EOF as returned by read(), for example:

                script -c "echo Hello" < /dev/null

    * don't write to master when there is still something to read from
      slave (it means shell is not initialized yet or busy)

    Signed-off-b...

Read more...

Revision history for this message
Stéphane Gourichon (stephane-gourichon-lpad) wrote :

# Reproducible? Tried once, takes about an hour.

This was enough to reproduce the bug once within about one hour :

open a terminal, type:

script

then on the shell prompt type:

yes

The 'yes' command produces as intended an infinite stream of lines with the character 'y'.

After about one hour, the bug happened.

Only tried once so far.

Revision history for this message
Stéphane Gourichon (stephane-gourichon-lpad) wrote :

# Workaround? Tried once, works.

TL;DR: reading the slave pts from another process unfreezes and allows continued operation

  cd /proc/24621/fd

  ls -al

total 0
dr-x------ 2 stephane stephane 0 avril 5 16:39 .
dr-xr-xr-x 9 stephane stephane 0 avril 5 15:13 ..
lrwx------ 1 stephane stephane 64 avril 5 16:56 0 -> /dev/pts/8
lrwx------ 1 stephane stephane 64 avril 5 16:56 1 -> /dev/pts/8
lrwx------ 1 stephane stephane 64 avril 5 16:39 2 -> /dev/pts/8
lrwx------ 1 stephane stephane 64 avril 5 16:56 3 -> /dev/ptmx
lrwx------ 1 stephane stephane 64 avril 5 16:56 4 -> /dev/pts/20
lrwx------ 1 stephane stephane 64 avril 5 16:56 5 -> anon_inode:[signalfd]
l-wx------ 1 stephane stephane 64 avril 5 16:56 6 -> /home/stephane/typescript (deleted)

So, pts/8 is the master, pts/20 is the slave.

  echo foo >/dev/pts/8

-> This shows "foo" in the terminal window.

  head -n 1 /dev/pts/8

^C

This blocks my shell, so I do Ctrl-C to regain.

  echo foo >/dev/pts/20

This blocks my shell, so I do Ctrl-C to regain.

^Cbash: echo: erreur d'écriture : Appel système interrompu

This one is interesting:

  head -n 1 /dev/pts/20

It does not block, and works around the bug immediately.

I'll try again.

tags: added: has-workaround regression-release reproducible
Revision history for this message
Stéphane Gourichon (stephane-gourichon-lpad) wrote :
Download full text (5.3 KiB)

# Confirmed that the workaround works in practice.

I had a hang, figured out which /usr/bin/script process was eating the CPU, then issued:

head -n 1 /proc/7514/fd/4

and it continued working.

# Try to figure out what happens.

Below is after the problem was fixed (things worked well again, the terminal running /usr/bin/script has some text flowing at all times).

In another terminal I typed the commands below

{ while sleep 1 ; do date ; done } & # First, get some timestamps to see where program pauses
strace head -n 1 /proc/7514/fd/4

And observed the output below (shortened):

open("/proc/7514/fd/4", O_RDONLY) = 3
read(3, jeudi 7 avril 2016, 14:30:02 (UTC+0200)
jeudi 7 avril 2016, 14:30:03 (UTC+0200)
jeudi 7 avril 2016, 14:30:04 (UTC+0200)

"head" sat waiting until I typed something in the terminal running /usr/bin/script.

jeudi 7 avril 2016, 14:30:05 (UTC+0200)
jeudi 7 avril 2016, 14:30:06 (UTC+0200)
"\n", 8192) = 1
lseek(3, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
fstat(3, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 12), ...}) = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 4), ...}) = 0
write(1, "\n", 1
) = 1
close(3) = 0
close(1) = 0
close(2) = 0
exit_group(0) = ?

I'll compare with what happens next time /usr/bin/script hangs.

Full output

execve("/usr/bin/head", ["head", "-n", "1", "/proc/7514/fd/4"], [/* 72 vars */]) = 0
brk(NULL) = 0xde4000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f35d956c000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=123250, ...}) = 0
mmap(NULL, 123250, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f35d954d000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0`\t\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1868984, ...}) = 0
mmap(NULL, 3971584, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f35d8f7f000
mprotect(0x7f35d913f000, 2097152, PROT_NONE) = 0
mmap(0x7f35d933f000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1c0000) = 0x7f35d933f000
mmap(0x7f35d9345000, 14848, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f35d9345000
close(3) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f35d954c000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f35d954b000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f35d954a000
arch_prctl(ARCH_SET_FS, 0x7f35d954b700) = 0
mprotect(0x7f35d933f000, 16384, PROT_READ) = 0
mprotect(0x608000, 4096, PROT_READ) = 0
mprotect(0x7f35d9...

Read more...

Revision history for this message
Stéphane Gourichon (stephane-gourichon-lpad) wrote :
Download full text (3.8 KiB)

# Summary

Bug hit again, workaround works again, this time with a strace.
I suspect the workaround has a drawback: the characters it eats are lost for the script.
Next time I'll try dd to "lose" only one character.

Thank you for your attention.

# Details

The bug has hit again.

Here is a selected portion of

strace head /proc/7514/fd/4

open("/proc/7514/fd/4", O_RDONLY) = 3
read(3, "\n", 8192) = 1
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 14), ...}) = 0
write(1, "\n", 1
) = 1
read(3, "a", 8192) = 1
read(3, "a", 8192) = 1
read(3, "*", 8192) = 1
read(3, "\177", 8192) = 1
read(3, "e", 8192) = 1
read(3, "e", 8192) = 1
read(3, "\r", 8192) = 1
read(3, ^Cstrace: Process 14679 detached
 <detached ...>

The characters return by read() are what I typed on the terminal hosting the stuck /usr/bin/script.

I think that script unlocked itself (i.e. the bug went away) as soon as the first read() returned, but I'm not sure.

Thank you for your attention.

strace head /proc/7514/fd/4

execve("/usr/bin/head", ["head", "/proc/7514/fd/4"], [/* 31 vars */]) = 0
brk(NULL) = 0x18a8000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ff26a1f6000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=123250, ...}) = 0
mmap(NULL, 123250, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7ff26a1d7000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0`\t\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1868984, ...}) = 0
mmap(NULL, 3971584, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7ff269c09000
mprotect(0x7ff269dc9000, 2097152, PROT_NONE) = 0
mmap(0x7ff269fc9000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1c0000) = 0x7ff269fc9000
mmap(0x7ff269fcf000, 14848, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7ff269fcf000
close(3) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ff26a1d6000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ff26a1d5000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ff26a1d4000
arch_prctl(ARCH_SET_FS, 0x7ff26a1d5700) = 0
mprotect(0x7ff269fc9000, 16384, PROT_READ) = 0
mprotect(0x608000, 4096, PROT_READ) = 0
mprotect(0x7ff26a1f8000, 4096, PROT_READ) = 0
munmap(0x7ff26a1d7000, 123250) = 0
brk(NULL) = 0x18a8000
brk(0x18c9000) = 0x18c9000
open("/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=3001856...

Read more...

Revision history for this message
Stéphane Gourichon (stephane-gourichon-lpad) wrote :

TL;DR: Bug happened again, workaround works.

Have not tried dd yet, but whenever the bug happen I open a shell, use bash Ctrl-R to find the one-liner below, press enter and expect that it works around the bug.

for PROCESSLINE in $( grep -a ^script /proc/*/cmdline -l ) ; do cd $(dirname $PROCESSLINE) ; pwd ; ls -al fd ; strace head -n 1 fd/4 & echo voila ; done

It did.
Here's a relevant part of strace output:

open("fd/4", O_RDONLY) = 3
read(3, "\n", 8192) = 1
lseek(3, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
fstat(3, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 9), ...}) = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 12), ...}) = 0
write(1, "\n", 1
) = 1
close(3) = 0

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.