etcdctl sometimes hangs during provisioning

Bug #1541105 reported by Corey O'Brien
This bug affects 3 people
Affects   Status         Importance   Assigned to     Milestone
Magnum    Fix Released   Undecided    Corey O'Brien

Bug Description

Occasionally, etcdctl seems to hang during provisioning. Instead of returning and then sleeping or curling, etcdctl never exits. As a result, the agent wait condition is never set to success, so the stack/bay never goes to CREATE_COMPLETE.
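For illustration, a polling loop can guard against a child process that never exits by putting a deadline on each attempt. The sketch below is in Python and is not Magnum's provisioning script; `poll_command` and its parameters are hypothetical names chosen for the example.

```python
import subprocess
import time


def poll_command(cmd, attempt_timeout=5.0, interval=1.0, max_attempts=30):
    """Run cmd repeatedly until it exits 0.

    Returns True on the first successful exit and False once max_attempts
    is used up. Each attempt is killed after attempt_timeout seconds, so a
    hung child cannot block the loop forever.
    """
    for attempt in range(max_attempts):
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=attempt_timeout,
            )
            if result.returncode == 0:
                return True
        except subprocess.TimeoutExpired:
            # subprocess.run has already killed and reaped the child here.
            pass
        # Sleep between attempts, but not after the last one.
        if attempt + 1 < max_attempts:
            time.sleep(interval)
    return False
```

Called as `poll_command(["etcdctl", "ls", "/"])`, a hung `etcdctl` would cost at most `attempt_timeout` seconds per attempt, and the caller gets `False` back instead of blocking the wait condition indefinitely.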

Corey O'Brien (coreypobrien) wrote :

Here is the stack when etcdctl is hung:

github.com/coreos/go-etcd/etcd.(*Client).SendRequest(0xc208074360, 0xc20803b8f0, 0x0, 0x0, 0x0)
/usr/share/gocode/src/github.com/coreos/go-etcd/etcd/requests.go:178 +0x2079
github.com/coreos/go-etcd/etcd.(*Client).getCancelable(0xc208074360, 0x7fffe702af52, 0x2f, 0xc20803b
/usr/share/gocode/src/github.com/coreos/go-etcd/etcd/requests.go:51 +0x41a
github.com/coreos/go-etcd/etcd.(*Client).get(0xc208074360, 0x7fffe702af52, 0x2f, 0xc20803b800, 0xc20
/usr/share/gocode/src/github.com/coreos/go-etcd/etcd/requests.go:62 +0x61
github.com/coreos/go-etcd/etcd.(*Client).RawGet(0xc208074360, 0x7fffe702af52, 0x2f, 0x0, 0x44ab13, 0
/usr/share/gocode/src/github.com/coreos/go-etcd/etcd/get.go:31 +0x26c
github.com/coreos/go-etcd/etcd.(*Client).Get(0xc208074360, 0x7fffe702af52, 0x2f, 0x0, 0x5, 0x0, 0x0)
/usr/share/gocode/src/github.com/coreos/go-etcd/etcd/get.go:11 +0x63
github.com/coreos/etcd/etcdctl/command.lsCommandFunc(0xc208044180, 0xc208074360, 0x5, 0x0, 0x0)
/builddir/build/BUILD/etcd-2.0.10/src/github.com/coreos/etcd/etcdctl/command/ls_command.go:65 +0x125
github.com/coreos/etcd/etcdctl/command.rawhandle(0xc208044180, 0x8b5f90, 0x4, 0x0, 0x0)
/builddir/build/BUILD/etcd-2.0.10/src/github.com/coreos/etcd/etcdctl/command/handle.go:72 +0x5bd
github.com/coreos/etcd/etcdctl/command.handleContextualPrint(0xc208044180, 0x8b5f90, 0x8b5fc0)
/builddir/build/BUILD/etcd-2.0.10/src/github.com/coreos/etcd/etcdctl/command/handle.go:92 +0x32
github.com/coreos/etcd/etcdctl/command.handleLs(0xc208044180, 0x8b5f90)
/builddir/build/BUILD/etcd-2.0.10/src/github.com/coreos/etcd/etcdctl/command/ls_command.go:41 +0x3e
github.com/coreos/etcd/etcdctl/command.func·004(0xc208044180)
/builddir/build/BUILD/etcd-2.0.10/src/github.com/coreos/etcd/etcdctl/command/ls_command.go:34 +0x34
github.com/codegangsta/cli.Command.Run(0x7fe8b0, 0x2, 0x0, 0x0, 0x83d6d0, 0x14, 0x0, 0x0, 0x0, 0x0,
/usr/share/gocode/src/github.com/codegangsta/cli/command.go:101 +0xe42
github.com/codegangsta/cli.(*App).Run(0xc208044000, 0xc20800a000, 0x5, 0x5, 0x0, 0x0)
/usr/share/gocode/src/github.com/codegangsta/cli/app.go:125 +0xb70

Corey O'Brien (coreypobrien) wrote :

Looks like we're running into a bug from the older version of etcdctl on the atomic image. The pre-built image has 2.0.10, but if I update to 2.2.25 everything works correctly.

I haven't narrowed it down to be sure, but I think the bug was fixed in one of these commits:
473207a requests.go: always stop retrying after some attempts
5d62264 requests.go: fix not retry for doing request error
1d96359 requests.go: alwasy retry when failing to get response

It looks like with 2.0.10, if etcd isn't up, 'ls' will hang. We should upgrade to 2.2.25 at least for this polling call to etcdctl.
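The commit titles above suggest the client moved to a bounded retry policy. As a rough sketch of that idea only (this is Python, not the go-etcd code, and `call_with_retries` and `RetriesExhausted` are made-up names), a request wrapper that always gives up after a fixed number of attempts could look like this:

```python
import time


class RetriesExhausted(Exception):
    """Raised when every attempt has failed."""


def call_with_retries(request, max_attempts=3, delay=0.5):
    """Call request() until it succeeds or max_attempts is reached.

    request is any zero-argument callable that raises OSError (for example
    ConnectionRefusedError or a socket timeout) when the server is down.
    """
    last_error = None
    for _ in range(max_attempts):
        try:
            return request()
        except OSError as exc:
            last_error = exc
            time.sleep(delay)
    # Surface the failure instead of looping forever.
    raise RetriesExhausted(f"gave up after {max_attempts} attempts") from last_error
```

The point of the cap is that a server which never comes up produces an error the caller can act on, rather than a process that never returns.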

Changed in magnum:
assignee: nobody → Corey O'Brien (coreypobrien)
Corey O'Brien (coreypobrien) wrote :

Sorry, too many twos. I meant 2.2.5, not 2.2.25.

The other way we can fix this is by temporarily installing 2.2.5 during notify-heat. I attached a patch here for that.

Stephen Gordon (sgordon) wrote :

The F23 atomic host image appears to have:

etcdctl v2.2.1

I have been working on updating the "build your own" instructions in the Magnum dev docs but I'm also wondering if it's worth revisiting whether it's actually required or the newer atomic images provide what is needed...

https://getfedora.org/en/cloud/download/atomic.html

Stephen Gordon (sgordon) wrote :

(I recognize we would still need a bump to get to 2.2.5 - just noting that this type of issue is likely to crop up more in the future)

Corey O'Brien (coreypobrien) wrote :

It would be awesome if a custom image wasn't required. I know there are at least a few tweaks needed for that to happen (interface names were the first thing I ran into).

v2.2.1 may be sufficient too, which would be nice.

Stephen Gordon (sgordon) wrote :

I asked on the M/L as well but is there a list of known tweaks anywhere? As far as I could tell from the Magnum docs the only special handling in the custom F21 image was injecting newer k8s/flannel/etcd packages, but the ones referred to were the F23 versions anyway.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to magnum (master)

Fix proposed to branch: master
Review: https://review.openstack.org/275994

Changed in magnum:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to magnum (master)

Reviewed: https://review.openstack.org/275994
Committed: https://git.openstack.org/cgit/openstack/magnum/commit/?id=1a9806b17f815453ca21c668dcdf85ab26278228
Submitter: Jenkins
Branch: master

commit 1a9806b17f815453ca21c668dcdf85ab26278228
Author: Corey O'Brien <email address hidden>
Date: Wed Feb 3 11:01:05 2016 -0500

    Fix gate issues with functional-api job

    Prevents etcdctl from hanging when etcd has not started by explicitly
    specifying connection timeouts.
    Reduce swarm build time by removing the unnecessary dependency
    between masters and nodes.
    Only create 1 node instead of 2 nodes
    Remove test_update_bay_name_for_existing_bay

    Change-Id: If6724497b47247d2858b6da90309949f92314cfb
    Closes-Bug: 1541105
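To illustrate what an explicit connection timeout buys, here is a small Python sketch that probes etcd's HTTP `/version` endpoint and gives up after a fixed number of seconds. It is not the change merged in the review above; `etcd_is_reachable`, the default URL (etcd's usual client port, 2379, on localhost) and the timeout value are assumptions made for the example.

```python
import urllib.request


def etcd_is_reachable(base_url="http://127.0.0.1:2379", timeout=2.0):
    """Return True if GET <base_url>/version answers 200 within timeout seconds."""
    try:
        with urllib.request.urlopen(base_url + "/version", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, refused connections and socket timeouts
        # are all OSError subclasses.
        return False
```

With the timeout set, a check made before etcd has started returns `False` promptly, so the surrounding script can sleep and try again.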

Changed in magnum:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/magnum 2.0.0

This issue was fixed in the openstack/magnum 2.0.0 release.
