bzr diff freezes; kill -9 doesn't kill it.

Bug #665082 reported by gpk
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
bzr (Ubuntu)
New
Undecided
Unassigned
linux (Ubuntu)
New
Undecided
Unassigned

Bug Description

Binary package hint: bzr

Bzr freezes on a repository and will not be terminated by ^C, ^Z, kill, kill -1, or kill -9.

After that sequence of events, htop shows this:
PRI=20, NI=0, VIRT=108M, RES=21816, SHR=4456, S=D, CPU%=0.0 MEM%=0.5, TIME+=0:00.20
COMMAND=/usr/bin/python /usr/bin/bzr diff

$ ps -e -f | grep bzr
zzz 11536 10455 0 13:03 pts/0 00:00:00 /usr/bin/python /usr/bin/bzr diff
zzz 15607 10647 0 13:09 pts/1 00:00:00 grep bzr
$

Here's the output it produced (Some boring bits converted to elipses):
$ bzr diff
=== modified file 'predict.tex'
--- predict.tex 2010-10-21 11:46:43 +0000
+++ predict.tex 2010-10-21 22:28:27 +0000
@@ -42,10 +42,20 @@
 The speakers were 20-28 years old, born to monolingual parents,..., typically a week apart.

+The data was then segmented with a language-independent speech recognition system built with the HTK toolkit\cite{ZZZ}. This system did not use any transcriptions or phonotactics for the segmentation. It's grammar allows any transitions amongst \textbf{c} (consonant),
+\textbf{v} (vowel}, \textbf{m} (nasal/liquid), and \textbf{S} (silence) states except \textbf{S}$\Rightarrow$\textbf{S}.
...
+Roughly speaking, each \textbf{V} region corresponds to the core of a syllable, and \textbf{C} regions combine the final consonant(s) of one syllable with the initial consonant(s) of the next.
+
+\appendices
+\section{Automatic Segmentation}\label{app:seg}

 All data was orthographically transcribed. ... was actually said.

-
-
-\end{document}
\ No newline at end of file
+\end{document}

^C^Z
^Z

ProblemType: Bug
DistroRelease: Ubuntu 10.10
Package: bzr 2.2.1-0ubuntu1
ProcVersionSignature: Ubuntu 2.6.35-22.35-generic 2.6.35.4
Uname: Linux 2.6.35-22-generic x86_64
Architecture: amd64
Date: Fri Oct 22 13:04:58 2010
InstallationMedia: Ubuntu 10.04.1 LTS "Lucid Lynx" - Release amd64 (20100816.1)
ProcEnviron:
 PATH=(custom, user)
 LANG=en_GB.UTF-8
 SHELL=/bin/bash
SourcePackage: bzr

Revision history for this message
gpk (gpk-kochanski) wrote :
Revision history for this message
gpk (gpk-kochanski) wrote :

The same kind of thing happens when I run
bzr commit -m 'Weird'
in the same directory.

Revision history for this message
gpk (gpk-kochanski) wrote :

It may be worth mentioning that various nfs4 threads are busy, for no obvious reason, and that
the bzr repository is mounted locally via nfs4. There are three nfsd threads running at about
20% of CPU each, and two rpciod threads at 7%, kthreadd at 5%, and nfsiod at 5%.

Revision history for this message
Vincent Ladeuil (vila) wrote :

It is indeed worth mentioning that nfs4 is involved.

Could you reproduce the problem if you run the same command on the same bzr branch but on the nfs server ?

Changed in bzr (Ubuntu):
status: New → Incomplete
Revision history for this message
gpk (gpk-kochanski) wrote : Re: [Bug 665082] Re: bzr diff freezes; kill -9 doesn't kill it.

After rebooting the machine, bzr now works as expected.
So, it looks rather less like a bzr problem and more like a kernel problem.

On second thought, it might *also* be worth mentioning that the bzr
problems occurred between a
kernel upgrade and the reboot. Why that should mess up nfs4 is
mysterious to me,
but perhaps that is what happened.

On 22/10/10 18:17, Vincent Ladeuil wrote:
> It is indeed worth mentioning that nfs4 is involved.
>
> Could you reproduce the problem if you run the same command on the same
> bzr branch but on the nfs server ?
>
>
> ** Changed in: bzr (Ubuntu)
> Status: New => Incomplete
>

Revision history for this message
gpk (gpk-kochanski) wrote :
Download full text (6.7 KiB)

It did it again, this time with no kernel upgrade.
Bzr diff hangs, and kill -9 doesn't stop it.
Nfsd, rpciod, nfsiod, and kthreadd are running afterwards, and I don't see any system activity that could explain their activity.

I moved to the server directory, and tried some bzr commands as follows:

$ bzr diff
bzr: ERROR: Could not acquire lock "/export/gpk/Documents/2010/predict_next/.bzr/checkout/dirstate": [Errno 11] Resource temporarily unavailable
$ ps -e -f | grep bzr
gpk 9578 9512 0 08:51 ? 00:00:00 /usr/bin/python /usr/bin/bzr-notify
gpk 11416 9809 0 11:55 pts/1 00:00:00 /usr/bin/python /usr/bin/bzr diff
gpk 19849 20672 0 11:58 pts/3 00:00:00 grep bzr
$ kill 11416
$ kill -9 11416
$ ps -e -f | grep bzr
gpk 9578 9512 0 08:51 ? 00:00:00 /usr/bin/python /usr/bin/bzr-notify
gpk 11416 9809 0 11:55 pts/1 00:00:00 /usr/bin/python /usr/bin/bzr diff
gpk 26240 20672 0 11:58 pts/3 00:00:00 grep bzr
$ rm /export/gpk/Documents/2010/predict_next/.bzr/checkout/dirstate
desk:predict_next$ bzr diff
bzr: ERROR: invalid header line: ''
$

So, breaking the lock simply reveals that the .bzr directory is in an inconsistent state.

I attach the relevant bit of the output of mount:
localhost:/gpk on /home/gpk/n type nfs4 (rw,intr,addr=127.0.0.1,clientaddr=127.0.0.1)

and /etc/exports:
/export 192.168.2.2(rw,fsid=0,root_squash,no_subtree_check,async) 127.0.0.1(rw,fsid=0,root_squash,no_subtree_check,async) 192.168.3.2(rw,fsid=0,root_squash,no_subtree_check,async)

and the output of find .bzr -ls:
$ find .bzr -ls
2150176 0 drwxrwxr-x 1 gpk gpk 108 Oct 21 10:35 .bzr
2150177 0 drwxrwxr-x 1 gpk gpk 0 Oct 21 10:35 .bzr/branch-lock
2150180 0 -rw-rw-r-- 1 gpk gpk 141 Oct 21 10:35 .bzr/README
2150181 0 -rw-rw-r-- 1 gpk gpk 35 Oct 21 10:35 .bzr/branch-format
2150182 0 drwxrwxr-x 1 gpk gpk 104 Oct 22 23:53 .bzr/repository
2150183 0 drwxrwxr-x 1 gpk gpk 0 Oct 22 23:53 .bzr/repository/lock
2150186 0 drwxrwxr-x 1 gpk gpk 1440 Oct 22 23:53 .bzr/repository/indices
2150223 0 -rw-rw-r-- 1 gpk gpk 165 Oct 21 10:36 .bzr/repository/indices/be4c431cb7cba2707a9aac1cfbde3a8f.rix
2150224 0 -rw-rw-r-- 1 gpk gpk 165 Oct 21 10:36 .bzr/repository/indices/be4c431cb7cba2707a9aac1cfbde3a8f.iix
2150225 0 -rw-rw-r-- 1 gpk gpk 226 Oct 21 10:36 .bzr/repository/indices/be4c431cb7cba2707a9aac1cfbde3a8f.tix
2150226 0 -rw-rw-r-- 1 gpk gpk 72 Oct 21 10:36 .bzr/repository/indices/be4c431cb7cba2707a9aac1cfbde3a8f.six
2150227 0 -rw-rw-r-- 1 gpk gpk 184 Oct 21 10:36 .bzr/repository/indices/be4c431cb7cba2707a9aac1cfbde3a8f.cix
2150312 0 -rw-rw-r-- 1 gpk gpk 188 Oct 21 12:46 .bzr/repository/indices/7a8be53033f27ce13033053c17275f63.rix
2150313 0 -rw-rw-r-- 1 gpk gpk 187 Oct 21 12:46 .bzr/repository/indices/7a8be53033f27ce13033053c17275f63.iix
2150314 0 -rw-rw-r-- 1 gpk gpk 216 Oct 21 12:46 .bzr/repository/indices/7a8be53033f27ce13033053c172...

Read more...

Revision history for this message
gpk (gpk-kochanski) wrote :

It's quite reliable now. This time I ran
strace bzr diff
and got the attached output. (I was only able to capture the tail end of the
strace run.)

Bzr seems to freeze in fruncate(). As before various nfs-related daemons
are busy, but the system (and NFS4) remain responsive.

Revision history for this message
gpk (gpk-kochanski) wrote :

OK. I can confirm that it works correctly when not on NFS4, and hangs on the call to ftruncate when on NFS4. It looks like a NFS problem.

For completeness, I will attach strace -f runs on the real file system, and on the NFS4 mount of the file system. They will be called nfsbug2.txt and nfsbug3.txt, respectively.

Then, I shall aim this bug report at the NFS people, rather than you bzr guys.

Revision history for this message
gpk (gpk-kochanski) wrote :

Full strace -f report on bzr in a case where it hangs on a ftruncate() call when mounted on NFS4.

Revision history for this message
gpk (gpk-kochanski) wrote :

This initially looked like a bzr problem, but it now seems like a NFS4 problem. NFS4 hangs and spins its wheels on a call to ftruncate(). This is not killable via a kill -9.

affects: bzr (Ubuntu) → nfs-utils (Ubuntu)
Changed in nfs-utils (Ubuntu):
status: Incomplete → New
Revision history for this message
gpk (gpk-kochanski) wrote :

Here's what NFS4 is doing. I ran nfsstat -Z and it shows a lot of activity, even though the disk itself isn't active. I don't know enough about NFS4 to interpret this, but there are a thousands of
"lock" calls per second, and thousands of "renews" per second, both of which seem strange.
And lots of "write" calls per second that don't cause any disk activity. Odd...

$ nfsstat -Z
Collecting statistics; press CTRL-C to view results from interval (i.e., from pause to CTRL-C).
^CSignal received; displaying (only) statistics gathered over the last 0 minutes, 3 seconds:

Server rpc stats:
calls badcalls badauth badclnt xdrcall
59790 0 0 0 0

Server nfs v4:
null compound
0 0% 59790 100%

Server nfs v4 operations:
op0-unused op1-unused op2-future access close commit
0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
create delegpurge delegreturn getattr getfh link
0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
lock lockt locku lookup lookup_root nverify
19930 20% 0 0% 0 0% 0 0% 0 0% 0 0%
open openattr open_conf open_dgrd putfh putpubfh
0 0% 0 0% 0 0% 0 0% 39860 40% 0 0%
putrootfh read readdir readlink remove rename
0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
renew restorefh savefh secinfo setattr setcltid
19930 20% 0 0% 0 0% 0 0% 0 0% 0 0%
setcltidconf verify write rellockowner bc_ctl bind_conn
0 0% 0 0% 19930 20% 0 0% 0 0% 0 0%
exchange_id create_ses destroy_ses free_stateid getdirdeleg getdevinfo
0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
getdevlist layoutcommit layoutget layoutreturn secinfononam sequence
0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
set_ssv test_stateid want_deleg destroy_clid reclaim_comp
0 0% 0 0% 0 0% 0 0% 0 0%

Client rpc stats:
calls retrans authrefrsh
59784 0 0

Revision history for this message
Steve Langasek (vorlon) wrote :

This is a previously reported bug in the kernel, not in nfs-utils; reassigning to the kernel package and marking as a duplicate.

I haven't had a chance to debug this in any detail. If you have some time to look into it (knowing that it requires frequent reboots of the client to make any progress!), it would be helpful to get a reduced ftruncate() test case that doesn't require bzr to reproduce this, and test with a daily upstream kernel (client and server).

affects: nfs-utils (Ubuntu) → linux (Ubuntu)
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.