github.com/juju/juju/cmd/jujud test timeout on windows

Bug #1581157 reported by Curtis Hovey
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Canonical Juju
Fix Released
Critical
Nate Finch
juju-core
Fix Released
Critical
Nate Finch
1.25
Fix Released
Critical
Nate Finch

Bug Description

As seen in
    http://reports.vapour.ws/releases/issue/57431b4d749a5664f001ee07
As see in these examples
    http://reports.vapour.ws/releases/3966/job/run-unit-tests-win2012-amd64/attempt/2391
    http://reports.vapour.ws/releases/3965/job/run-unit-tests-win2012-amd64/attempt/2388
    http://reports.vapour.ws/releases/3964/job/run-unit-tests-win2012-amd64/attempt/2384

github.com/juju/juju/cmd/jujud times out. there are several other bugs that affect specific tests, but we consistently see a timeout in this suite.

The windows registry issue shown in some of these test results can be ignored; we can contrive a pass even though the tests are genuinely broken. That leaved this timeout as the consistent reason for failure.

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: none → 2.0-beta7
Revision history for this message
Dave Cheney (dave-cheney) wrote :

The cause is

goroutine 43 [chan receive]:
launchpad.net/tomb.(*Tomb).Wait(0xc08227e790, 0x0, 0x0)
 c:/users/admini~1/appdata/local/temp/tmpyrijbo/gogo/src/launchpad.net/tomb/tomb.go:108 +0x5f
github.com/juju/juju/worker/metrics/spool.(*socketListener).Stop(0xc08227e780)
 c:/users/admini~1/appdata/local/temp/tmpyrijbo/gogo/src/github.com/juju/juju/worker/metrics/spool/listener.go:56 +0x18c
github.com/juju/juju/worker/metrics/sender.(*sender).stop(0xc082269770)
 c:/users/admini~1/appdata/local/temp/tmpyrijbo/gogo/src/github.com/juju/juju/worker/metrics/sender/sender.go:104 +0x45
github.com/juju/juju/worker/metrics/sender.(*sender).(github.com/juju/juju/worker/metrics/sender.stop)-fm()
 c:/users/admini~1/appdata/local/temp/tmpyrijbo/gogo/src/github.com/juju/juju/worker/metrics/sender/manifold.go:77 +0x27
github.com/juju/juju/worker/metrics/spool.(*periodicWorker).Kill(0xc0822a40e0)
 c:/users/admini~1/appdata/local/temp/tmpyrijbo/gogo/src/github.com/juju/juju/worker/metrics/spool/listener.go:103 +0x28
github.com/juju/juju/worker/metrics/sender_test.(*ManifoldSuite).setupWorkerTest.func1(0xc0822791d0)
 c:/users/admini~1/appdata/local/temp/tmpyrijbo/gogo/src/github.com/juju/juju/worker/metrics/sender/manifold_test.go:101 +0x3d
github.com/juju/testing.(*CleanupSuite).callStack(0xc0821a9a28, 0xc0822791d0, 0xc08226bdd0, 0x2, 0x2)
 c:/users/admini~1/appdata/local/temp/tmpyrijbo/gogo/src/github.com/juju/testing/cleanup.go:52 +0x51
github.com/juju/testing.(*CleanupSuite).TearDownTest(0xc0821a9a28, 0xc0822791d0)
 c:/users/admini~1/appdata/local/temp/tmpyrijbo/gogo/src/github.com/juju/testing/cleanup.go:46 +0x4d

The Stop method on the metrics server did not stop, so we wait, forever.

Changed in juju-core:
assignee: nobody → Dave Cheney (dave-cheney)
summary: - github.com/juju/juju/cmd/jujud test timeout on widnows
+ github.com/juju/juju/cmd/jujud test timeout on windows
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta7 → 2.0-beta8
tags: removed: blocker
Curtis Hovey (sinzui)
description: updated
tags: added: blocker
Revision history for this message
Casey Marshall (cmars) wrote :

@dave-cheney, if you can reproduce this, can you send a stack trace of all the goroutines?

Specifically, I would like to know if github.com/juju/juju/worker/metrics/spool.socketListener.loop has exited when this happens, and if not, where it is?

Revision history for this message
Casey Marshall (cmars) wrote :

Ah, nevermind, I see a full trace in one of the linked CI reports. npipe's PipeListener Accept and Close both mutate the state of the listener without a mutex.

I'd expect `go test -race` to complain about this. The thing is, the uniter's RunListener also closes the listener like the above example, so I think it could happen there as well.

This may fix the issue: https://github.com/natefinch/npipe/pull/17

I've got a hyperv, will try to test this evening.

Revision history for this message
Casey Marshall (cmars) wrote :

Yep, http://reports.vapour.ws/releases/3965/job/run-unit-tests-win2012-amd64/attempt/2388 is an example of the race hitting the uniter's RunListener. We've gotta fix npipe.

Revision history for this message
Casey Marshall (cmars) wrote :

My fixes so far for this:

https://github.com/natefinch/npipe/pull/17
https://github.com/natefinch/npipe/pull/18

However, I think there's still a problem with Close() not always preempting Accept(), as commented in npipe pull request #18.

I wonder if this use of syscall.INFINITE is too naive (https://github.com/cmars/npipe/blob/7ed58dc21b9e3d962daef664a2f6ed07314544ad/npipe_windows.go#L182) but I know next to nothing about Windows APIs.

Revision history for this message
Casey Marshall (cmars) wrote :

My latest attempt at fixing this, with some promising experimental results: https://github.com/natefinch/npipe/pull/19

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta8 → 2.0-beta9
Revision history for this message
Casey Marshall (cmars) wrote :

I haven't been able to reproduce this test timeout on windows. Ran ./cmd/jujud/agent 300 times on hyper-v with 4 cores.

I did see lots of failures in conn suite setup/teardown, but nothing like the above stack traces.

https://github.com/natefinch/npipe/pull/20

Changed in juju-core:
status: Triaged → Incomplete
Revision history for this message
Alexis Bruemmer (alexis-bruemmer) wrote :

if this bug is incomplete it should not be a blocker.

Revision history for this message
Curtis Hovey (sinzui) wrote :

This test has constantly failed for weeks. There is no doubt The suite is broken and CI will not see a pass not give the entire revision a bless. HYou can remove blocker...but CI will still block because this is an essential part of testing.

Changed in juju-core:
status: Incomplete → Triaged
Revision history for this message
Alexis Bruemmer (alexis-bruemmer) wrote :

Curtis - can you please provide access details to the failing system.

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta9 → 2.0-beta10
Revision history for this message
Nate Finch (natefinch) wrote :
Changed in juju-core:
assignee: Dave Cheney (dave-cheney) → Nate Finch (natefinch)
Revision history for this message
Nate Finch (natefinch) wrote :
Nate Finch (natefinch)
Changed in juju-core:
status: Triaged → In Progress
Changed in juju-core:
status: In Progress → Fix Committed
Revision history for this message
Cheryl Jennings (cherylj) wrote :
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
affects: juju-core → juju
Changed in juju:
milestone: 2.0-beta10 → none
milestone: none → 2.0-beta10
Changed in juju-core:
assignee: nobody → Nate Finch (natefinch)
importance: Undecided → Critical
status: New → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.