rsync timeouts and stays in state D

Bug #582804 reported by Raphaël Pinson
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
rsync (Ubuntu)
Expired
Undecided
Unassigned

Bug Description

Binary package hint: rsync

I'm syncing a rather big list of small files using rsync. My client is rsync 3.0.7-1ubuntu1 running on Lucid, and the server is a backport for rsync 3.0.0-1 for Etch (although I had the same problem with rsync 2.6.9 as server, so it seems to be a problem on the client side).

Some time in the middle of the sync, the client freezes. When I kill it (using ctrl+C or kill), the client returns with an error "undefined error" but the process keeps running in state D (as seen in ps), so it cannot be killed. The servers returns, too, and is not seen anymore in the process list.

Here are the last lines of the strace tracks on both the client and the server:

client:
=================
lstat("archives/private/sun-exploit-build/2009-August/date.html", 0x7fff4c604ec0) = -1 ENOENT (No such file or directory)
lstat("archives/private/sun-exploit-build/2009-August/index.html", {st_mode=S_IFLNK|0777, st_size=11, ...}) = 0
lstat("archives/private/sun-exploit-build/2009-August/subject.html", 0x7fff4c604ec0) = -1 ENOENT (No such file or directory)
lstat("archives/private/sun-exploit-build/2009-August/thread.html", 0x7fff4c604ec0) = -1 ENOENT (No such file or directory)
select(5, [3], [4], [4], {60, 0}) = 1 (out [4], left {59, 999997})
write(4, "\37\3\0\7\0\1\0\240\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\240\0\0\0\0\0"..., 803) = 803
gettimeofday({1274270586, 245243}, NULL) = 0
select(4, [3], [], NULL, {60, 0}) = 0 (Timeout)
select(4, [3], [], NULL, {60, 0}) = 0 (Timeout)
select(4, [3], [], NULL, {60, 0}) = 0 (Timeout)
select(4, [3], [], NULL, {60, 0}) = 0 (Timeout)
select(4, [3], [], NULL, {60, 0}) = 0 (Timeout)
select(4, [3], [], NULL, {60, 0}) = 0 (Timeout)
select(4, [3], [], NULL, {60, 0}) = 0 (Timeout)
select(4, [3], [], NULL, {60, 0}) = 0 (Timeout)
select(4, [3], [], NULL, {60, 0}) = 0 (Timeout)
select(4, [3], [], NULL, {60, 0}^C <unfinished ...>
=================

server:
=================
select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0})
write(1, "\374\17\0\7", 4) = 4
select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0})
write(1, "\"> (Priorit&eacute; supervision:"..., 4092) = 4092
select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0})
write(1, "\374\17\0\7", 4) = 4
select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0})
write(1, "AHx8fA6H3JUKXLFmiAGzelQDAw527\nAl"..., 4092) = 4092
select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0})
write(1, "\374\17\0\7", 4) = 4
select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0})
write(1, "itOSosZqrN5to9mRnA\nOGBL3x59BrQUd"..., 4092) = 4092
select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0})
write(1, "\374\17\0\7", 4) = 4
select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0})
write(1, "<email address hidden>"..., 4092) = 4092
select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0})
write(1, "\374\17\0\7", 4) = 4
select(2, NULL, [1], [1], {60, 0}) = 0 (Timeout)
select(2, NULL, [1], [1], {60, 0}) = 0 (Timeout)
select(2, NULL, [1], [1], {60, 0}) = 0 (Timeout)
select(2, NULL, [1], [1], {60, 0}) = 0 (Timeout)
select(2, NULL, [1], [1], {60, 0}) = 0 (Timeout)
select(2, NULL, [1], [1], {60, 0}) = 0 (Timeout)
select(2, NULL, [1], [1], {60, 0}) = 0 (Timeout)
select(2, NULL, [1], [1], {60, 0}) = 0 (Timeout)
select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {51, 216000})
write(1, ".131.30])\n\tby gwprod01m1.infra.m"..., 4092) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
write(2, "rsync: writefd_unbuffered failed"..., 79) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
rt_sigaction(SIGUSR1, {SIG_IGN}, NULL, 8) = 0
rt_sigaction(SIGUSR2, {SIG_IGN}, NULL, 8) = 0
write(2, "rsync error: errors with program"..., 83) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
rt_sigaction(SIGUSR1, {SIG_IGN}, NULL, 8) = 0
rt_sigaction(SIGUSR2, {SIG_IGN}, NULL, 8) = 0
gettimeofday({1274271146, 160025}, NULL) = 0
select(0, NULL, NULL, NULL, {0, 100000}) = 0 (Timeout)
gettimeofday({1274271146, 259470}, NULL) = 0
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
gettimeofday({1274271146, 263457}, NULL) = 0
exit_group(13) = ?
Process 20281 detached
=================

I can't find where this problem comes from, but I get it every time I relaunch rsync.

Revision history for this message
Raphaël Pinson (raphink) wrote :
Download full text (3.3 KiB)

This bug has also been reproduced with a tar-pipe-tar process, using

ssh remote tar cfP - /path/to/dir | tar xf -

After some investigation, it seems there is an issue with the ssh process. When the rsync process is stuck in state 'D', the ssh process is in state 'S+':

client:
=================
root 651 0.0 0.0 49252 1080 ? Ss 09:53 0:00 /usr/sbin/sshd
root 2681 0.0 0.0 70572 3308 ? Ss 09:53 0:00 \_ sshd: root@pts/0
root 2693 0.0 0.0 19340 3496 pts/0 Ss 09:53 0:00 | \_ -bash
root 8210 0.6 0.0 4328 800 pts/0 S+ 10:00 0:12 | \_ strace rsync -av --ignore-existing gforge01.vprod.infra.s1.p.fti.net:/var/lib/mailman/ /var/lib/mailman
root 8211 0.2 0.3 124076 14536 pts/0 D+ 10:00 0:04 | \_ rsync -av --ignore-existing gforge01.vprod.infra.s1.p.fti.net:/var/lib/mailman/ /var/lib/mailman
root 8212 5.0 0.1 44556 6756 pts/0 S+ 10:00 1:39 | \_ ssh gforge01.vprod.infra.s1.p.fti.net rsync --server --sender -vlogDtpre.iLsf . /var/lib/mailman/
root 8213 3.3 0.4 285744 19876 pts/0 D+ 10:00 1:05 | \_ rsync -av --ignore-existing gforge01.vprod.infra.s1.p.fti.net:/var/lib/mailman/ /var/lib/mailman
=================

server:
=================
root 26107 5.4 0.2 13452 6616 ? Ss 10:00 1:55 \_ sshd: root@notty
root 26154 2.5 1.3 93880 41428 ? Ss 10:00 0:54 | \_ rsync --server --sender -vlogDtpre.iLsf . /var/lib/mailman/
=================

The strace track on the ssh process on both sides of the network link show that the processes try to access file descriptors that are not listed in /proc:

client:
=================
root@mail01:/var/log# strace -p 8212
Process 8212 attached - interrupt to quit
select(7, [3 4], [5], NULL, NULL

root@mail01:/var/log# ls -l /proc/8212/fd
total 0
lrwx------ 1 root root 64 May 20 10:28 0 -> socket:[11967]
lrwx------ 1 root root 64 May 20 10:28 1 -> socket:[11970]
l-wx------ 1 root root 64 May 20 10:16 2 -> /var/log/rsync_strace
lrwx------ 1 root root 64 May 20 10:28 3 -> socket:[11980]
lrwx------ 1 root root 64 May 20 10:28 4 -> socket:[11967]
lrwx------ 1 root root 64 May 20 10:28 5 -> socket:[11970]
l-wx------ 1 root root 64 May 20 10:28 6 -> /var/log/rsync_strace
=================

server:
=================
[PROD/MASTER] gforge01:~# strace -p 26107
Process 26107 attached - interrupt to quit
select(14, [3 6], [], NULL, NULL

[PROD/MASTER] gforge01:~# ls -l /proc/26107/fd/
total 0
lrwx------ 1 root root 64 2010-05-20 10:36 0 -> /dev/null
lrwx------ 1 root root 64 2010-05-20 10:36 1 -> /dev/null
l-wx------ 1 root root 64 2010-05-20 10:36 10 -> pipe:[4066236232]
lr-x------ 1 root root 64 2010-05-20 10:36 11 -> pipe:[4066236233]
lr-x------ 1 root root 64 2010-05-20 10:36 13 -> pipe:[4066236234]
lrwx------ 1 root root 64 2010-05-20 10:36 2 -> /dev/null
lrwx------ 1 root root 64 2010-05-20 10:36 3 -> socket:[4066235605]
lrwx------ 1 root root 64 2010-05-20 10:36 4 -> socket:[4066235803]
lrwx------ 1 root root 64 2010-05-20 10:36 5 -> socket:[4066236227]
lr-x------ 1 root root 64 2010-05-20 10:36 6 -> pipe:[4066236229]
l...

Read more...

Revision history for this message
Raphaël Pinson (raphink) wrote :

It might also be important to know that both machines are xen domU servers.

The client machine is an Ubuntu Lucid domU with a 2.6.32-22-server kernel running on a Centos 5.4 dom0 with a 2.6.18-149.el5xen kernel and a Xen 3.4.2 server.

The server machine is a Debian Etch domU with a 2.6.18-164.el5xen kernel running on a Centos 5.4 dom0 with a 2.6.18-164.el5xen kernel and a Xen 3.3.0 server.

Revision history for this message
James Page (james-page) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. We are sorry that we do not always have the capacity to look at all reported bugs in a timely manner. There have been many changes in Ubuntu since that time you reported the bug and your problem may have been fixed with some of the updates. It would help us a lot if you could test it on a newer Ubuntu version. When you test it and it is still an issue, kindly upload the updated logs by running apport-collect 582804 and any other logs that are relevant for this particular issue.

Changed in rsync (Ubuntu):
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for rsync (Ubuntu) because there has been no activity for 60 days.]

Changed in rsync (Ubuntu):
status: Incomplete → Expired
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.