rsync timeouts and stays in state D
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
rsync (Ubuntu) |
Expired
|
Undecided
|
Unassigned |
Bug Description
Binary package hint: rsync
I'm syncing a rather big list of small files using rsync. My client is rsync 3.0.7-1ubuntu1 running on Lucid, and the server is a backport for rsync 3.0.0-1 for Etch (although I had the same problem with rsync 2.6.9 as server, so it seems to be a problem on the client side).
Some time in the middle of the sync, the client freezes. When I kill it (using ctrl+C or kill), the client returns with an error "undefined error" but the process keeps running in state D (as seen in ps), so it cannot be killed. The servers returns, too, and is not seen anymore in the process list.
Here are the last lines of the strace tracks on both the client and the server:
client:
=================
lstat("
lstat("
lstat("
lstat("
select(5, [3], [4], [4], {60, 0}) = 1 (out [4], left {59, 999997})
write(4, "\37\3\
gettimeofday(
select(4, [3], [], NULL, {60, 0}) = 0 (Timeout)
select(4, [3], [], NULL, {60, 0}) = 0 (Timeout)
select(4, [3], [], NULL, {60, 0}) = 0 (Timeout)
select(4, [3], [], NULL, {60, 0}) = 0 (Timeout)
select(4, [3], [], NULL, {60, 0}) = 0 (Timeout)
select(4, [3], [], NULL, {60, 0}) = 0 (Timeout)
select(4, [3], [], NULL, {60, 0}) = 0 (Timeout)
select(4, [3], [], NULL, {60, 0}) = 0 (Timeout)
select(4, [3], [], NULL, {60, 0}) = 0 (Timeout)
select(4, [3], [], NULL, {60, 0}^C <unfinished ...>
=================
server:
=================
select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0})
write(1, "\374\17\0\7", 4) = 4
select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0})
write(1, "\"> (Priorité supervision:"..., 4092) = 4092
select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0})
write(1, "\374\17\0\7", 4) = 4
select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0})
write(1, "AHx8fA6H3JUKXL
select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0})
write(1, "\374\17\0\7", 4) = 4
select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0})
write(1, "itOSosZqrN5to9
select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0})
write(1, "\374\17\0\7", 4) = 4
select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0})
write(1, "<email address hidden>"..., 4092) = 4092
select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0})
write(1, "\374\17\0\7", 4) = 4
select(2, NULL, [1], [1], {60, 0}) = 0 (Timeout)
select(2, NULL, [1], [1], {60, 0}) = 0 (Timeout)
select(2, NULL, [1], [1], {60, 0}) = 0 (Timeout)
select(2, NULL, [1], [1], {60, 0}) = 0 (Timeout)
select(2, NULL, [1], [1], {60, 0}) = 0 (Timeout)
select(2, NULL, [1], [1], {60, 0}) = 0 (Timeout)
select(2, NULL, [1], [1], {60, 0}) = 0 (Timeout)
select(2, NULL, [1], [1], {60, 0}) = 0 (Timeout)
select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {51, 216000})
write(1, ".131.30])\n\tby gwprod01m1.
--- SIGPIPE (Broken pipe) @ 0 (0) ---
write(2, "rsync: writefd_unbuffered failed"..., 79) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
rt_sigaction(
rt_sigaction(
write(2, "rsync error: errors with program"..., 83) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
rt_sigaction(
rt_sigaction(
gettimeofday(
select(0, NULL, NULL, NULL, {0, 100000}) = 0 (Timeout)
gettimeofday(
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
gettimeofday(
exit_group(13) = ?
Process 20281 detached
=================
I can't find where this problem comes from, but I get it every time I relaunch rsync.
This bug has also been reproduced with a tar-pipe-tar process, using
ssh remote tar cfP - /path/to/dir | tar xf -
After some investigation, it seems there is an issue with the ssh process. When the rsync process is stuck in state 'D', the ssh process is in state 'S+':
client: vprod.infra. s1.p.fti. net:/var/ lib/mailman/ /var/lib/mailman vprod.infra. s1.p.fti. net:/var/ lib/mailman/ /var/lib/mailman vprod.infra. s1.p.fti. net rsync --server --sender -vlogDtpre.iLsf . /var/lib/mailman/ vprod.infra. s1.p.fti. net:/var/ lib/mailman/ /var/lib/mailman
=================
root 651 0.0 0.0 49252 1080 ? Ss 09:53 0:00 /usr/sbin/sshd
root 2681 0.0 0.0 70572 3308 ? Ss 09:53 0:00 \_ sshd: root@pts/0
root 2693 0.0 0.0 19340 3496 pts/0 Ss 09:53 0:00 | \_ -bash
root 8210 0.6 0.0 4328 800 pts/0 S+ 10:00 0:12 | \_ strace rsync -av --ignore-existing gforge01.
root 8211 0.2 0.3 124076 14536 pts/0 D+ 10:00 0:04 | \_ rsync -av --ignore-existing gforge01.
root 8212 5.0 0.1 44556 6756 pts/0 S+ 10:00 1:39 | \_ ssh gforge01.
root 8213 3.3 0.4 285744 19876 pts/0 D+ 10:00 1:05 | \_ rsync -av --ignore-existing gforge01.
=================
server:
=================
root 26107 5.4 0.2 13452 6616 ? Ss 10:00 1:55 \_ sshd: root@notty
root 26154 2.5 1.3 93880 41428 ? Ss 10:00 0:54 | \_ rsync --server --sender -vlogDtpre.iLsf . /var/lib/mailman/
=================
The strace track on the ssh process on both sides of the network link show that the processes try to access file descriptors that are not listed in /proc:
client: /var/log# strace -p 8212
=================
root@mail01:
Process 8212 attached - interrupt to quit
select(7, [3 4], [5], NULL, NULL
root@mail01: /var/log# ls -l /proc/8212/fd rsync_strace rsync_strace
total 0
lrwx------ 1 root root 64 May 20 10:28 0 -> socket:[11967]
lrwx------ 1 root root 64 May 20 10:28 1 -> socket:[11970]
l-wx------ 1 root root 64 May 20 10:16 2 -> /var/log/
lrwx------ 1 root root 64 May 20 10:28 3 -> socket:[11980]
lrwx------ 1 root root 64 May 20 10:28 4 -> socket:[11967]
lrwx------ 1 root root 64 May 20 10:28 5 -> socket:[11970]
l-wx------ 1 root root 64 May 20 10:28 6 -> /var/log/
=================
server:
=================
[PROD/MASTER] gforge01:~# strace -p 26107
Process 26107 attached - interrupt to quit
select(14, [3 6], [], NULL, NULL
[PROD/MASTER] gforge01:~# ls -l /proc/26107/fd/
total 0
lrwx------ 1 root root 64 2010-05-20 10:36 0 -> /dev/null
lrwx------ 1 root root 64 2010-05-20 10:36 1 -> /dev/null
l-wx------ 1 root root 64 2010-05-20 10:36 10 -> pipe:[4066236232]
lr-x------ 1 root root 64 2010-05-20 10:36 11 -> pipe:[4066236233]
lr-x------ 1 root root 64 2010-05-20 10:36 13 -> pipe:[4066236234]
lrwx------ 1 root root 64 2010-05-20 10:36 2 -> /dev/null
lrwx------ 1 root root 64 2010-05-20 10:36 3 -> socket:[4066235605]
lrwx------ 1 root root 64 2010-05-20 10:36 4 -> socket:[4066235803]
lrwx------ 1 root root 64 2010-05-20 10:36 5 -> socket:[4066236227]
lr-x------ 1 root root 64 2010-05-20 10:36 6 -> pipe:[4066236229]
l...