machine agent failed to register IP addresses, borks agent

Bug #1537585 reported by Dave Cheney
This bug affects 8 people
Affects         Status        Importance  Assigned to  Milestone
Canonical Juju  Fix Released  Critical    Tim Penhey
juju-core       Fix Released  Critical    Tim Penhey
juju-core 1.25  Fix Released  Critical    Tim Penhey

Bug Description

See attached log file.

In short, the machine agent went into a loop registering its IP addresses.

2016-01-25 01:44:34 INFO juju.networker networker.go:162 networker is disabled - not starting on machine "machine-3"
2016-01-25 01:44:34 INFO juju.worker runner.go:278 stopped "networker", err: <nil>
2016-01-25 01:44:34 DEBUG juju.worker runner.go:206 "networker" done: <nil>
2016-01-25 01:44:34 DEBUG juju.worker runner.go:209 removing "networker" from known workers
2016-01-25 01:44:34 DEBUG juju.worker.logger logger.go:60 logger setup
2016-01-25 01:44:34 DEBUG juju.worker.proxyupdater proxyupdater.go:153 new proxy settings proxy.Settings{Http:"", Https:"", Ftp:"", NoProxy:""}
2016-01-25 01:44:34 DEBUG juju.network network.go:274 no lxc bridge addresses to filter for machine
2016-01-25 01:44:34 INFO juju.worker.machiner machiner.go:132 setting addresses for machine-3 to ["local-machine:127.0.0.1" "local-cloud:10.245.3.84" "local-machine:::1"]
2016-01-25 01:44:34 DEBUG juju.utils.ssh authorisedkeys.go:92 reading authorised keys file /home/ubuntu/.ssh/authorized_keys
2016-01-25 01:44:34 DEBUG juju.utils.ssh authorisedkeys.go:92 reading authorised keys file /home/ubuntu/.ssh/authorized_keys
2016-01-25 01:44:34 DEBUG juju.utils.ssh authorisedkeys.go:129 writing authorised keys file /home/ubuntu/.ssh/authorized_keys
2016-01-25 01:44:34 DEBUG juju.worker.logger logger.go:45 reconfiguring logging from "<root>=DEBUG" to "<root>=WARNING;unit=DEBUG"
2016-01-25 01:44:34 WARNING juju.cmd.jujud machine.go:1010 determining kvm support: INFO: Your CPU does not support KVM extensions
KVM acceleration can NOT be used
: exit status 1
no kvm containers possible
2016-01-25 01:44:35 ERROR juju.worker runner.go:226 exited "machiner": setting machine addresses: cannot set machine addresses of machine 3: state changing too quickly; try again soon
2016-01-25 01:44:39 ERROR juju.worker runner.go:226 exited "machiner": setting machine addresses: cannot set machine addresses of machine 3: state changing too quickly; try again soon
2016-01-25 01:44:42 ERROR juju.worker runner.go:226 exited "machiner": setting machine addresses: cannot set machine addresses of machine 3: state changing too quickly; try again soon

This continued well after the rest of the environment had stabilised. The error message is misleading: state was not actually changing.

Revision history for this message
Dave Cheney (dave-cheney) wrote :

Adding a comment because you need one to host an attachment.

Changed in juju-core:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Björn Tillenius (bjornt) wrote :

I'm seeing this error quite frequently with 1.25.3.

tags: added: kanban-cross-team landscape
tags: removed: kanban-cross-team
Changed in juju-core:
milestone: none → 1.25.5
Changed in juju-core:
milestone: 1.25.5 → 1.25.6
Revision history for this message
Björn Tillenius (bjornt) wrote :

This doesn't seem to be happening much when deploying trusty workloads, but I can't deploy OpenStack on Xenial with MAAS 1.9 without hitting this bug. There's always some LXC or physical machine that doesn't get added properly.

tags: added: 2.0-count
Curtis Hovey (sinzui)
tags: added: network
no longer affects: juju-core/2.0
Changed in juju-core:
milestone: 1.25.6 → 2.0-beta7
Changed in juju-core:
assignee: nobody → James Tunnicliffe (dooferlad)
Revision history for this message
James Tunnicliffe (dooferlad) wrote :

Just ran juju deploy cs:bundle/openstack-base-42 with the latest code from master (2.0.?) and OpenStack deployed happily. Well, apart from not having enough disks in some nodes for ceph-osd, everything got an address and deployed.

This is on MAAS 1.9 with physical hardware rather than VMs. I will try again, but if I can't recreate the bug I can't really help.

Revision history for this message
James Tunnicliffe (dooferlad) wrote :

Have three successful OpenStack deploys now with Juju 2.0.x and MAAS 1.9. Will try with Juju 1.25.x.

Changed in juju-core:
status: Triaged → Incomplete
Revision history for this message
Björn Tillenius (bjornt) wrote :

FWIW, I'm seeing this on 1.25.x. I didn't try with 2.0.

Another difference is that we first do 'juju add-machine' for each physical machine and LXC that we deploy to. So we never let 'juju deploy' or 'juju add-unit' create the machine, we always use --to. Not sure whether that has an impact.

Revision history for this message
Björn Tillenius (bjornt) wrote :

BTW, I see this error when deploying OpenStack using the Landscape Autopilot. I tried reproducing it by doing only add-machine requests manually, but that seems to work fine.

Changed in juju-core:
status: Incomplete → Triaged
Revision history for this message
James Tunnicliffe (dooferlad) wrote :

Have managed to hit this a couple of times now by deploying six machines with five LXCs on each. No ideas yet.

Revision history for this message
Dimiter Naydenov (dimitern) wrote :

FWIW, "state changing too quickly; try again soon" is a telltale sign of an unhandled ErrAborted while setting the machine addresses (somewhere in state/machine.go).

Revision history for this message
Chad Smith (chad.smith) wrote :

Just hit this again with 2.0beta6. MAAS marks the machine as deployed, but juju leaves it in pending state with hundreds of failure messages.

from juju-2.0 debug-log --include machine-2

machine-2: 2016-05-12 19:07:41 ERROR juju.worker.dependency engine.go:526 "machiner" manifold worker returned unexpected error: setting machine addresses: cannot set machine addresses of machine 2: state changing too quickly; try again soon
...

from juju-2.0 status
[Units]
ID WORKLOAD-STATUS JUJU-STATUS VERSION MACHINE PORTS PUBLIC-ADDRESS MESSAGE

[Machines]
ID STATE DNS INS-ID SERIES AZ
0 started 10.1.100.12 /MAAS/api/1.0/nodes/node-f2f50cc0-0d32-11e6-b381-00163e743f76/ trusty Leonardo
1 started 10.1.100.22 /MAAS/api/1.0/nodes/node-d45a81c4-0d45-11e6-ad89-00163e743f76/ trusty Leonardo
2 pending /MAAS/api/1.0/nodes/node-2d45eb4a-0d30-11e6-9b2e-00163e743f76/ trusty Leonard

Revision history for this message
Chad Smith (chad.smith) wrote :

Additional data from this same deployment: I'm also receiving the same errors for LXCs 0 and 3 on machine 0.

lxc-ls --fancy on machine 0 shows all lxcs have IPs assigned.
juju-machine-0-lxc-0 RUNNING 10.1.100.25 - YES
juju-machine-0-lxc-1 RUNNING 10.1.100.26 - YES
juju-machine-0-lxc-2 RUNNING 10.1.100.27 - YES
juju-machine-0-lxc-3 RUNNING 10.1.100.28 - YES
juju-machine-0-lxc-4 RUNNING 10.1.100.21 - YES
juju-machine-0-lxc-5 RUNNING 10.1.100.24 - YES

Revision history for this message
Chad Smith (chad.smith) wrote :

I've been able to hit this on every single deployment on multiple different MAAS substrates.
We are adding these machines to the admin model on a non-HA controller.

The API requests we are making look like the following: one login, followed by a single AddMachines request per machine:
2016-05-12 16:30:33.576 INFO job-handler-1 JUJU2 request {'Request': 'Login', 'Params': {'credentials': u'e59b4a8ad5863ce5f167eea2312af853', 'auth-tag': u'user-admin@local'}, 'Version': 3, 'Type': 'Admin', 'RequestId': 1}
2016-05-12 16:30:33.578 INFO job-handler-1 JUJU2 request {'Request': 'Login', 'Params': {'credentials': u'e59b4a8ad5863ce5f167eea2312af853', 'auth-tag': u'user-admin@local'}, 'Version': 3, 'Type': 'Admin', 'RequestId': 1}
2016-05-12 16:30:33.579 INFO job-handler-1 JUJU2 request {'Request': 'Login', 'Params': {'credentials': u'e59b4a8ad5863ce5f167eea2312af853', 'auth-tag': u'user-admin@local'}, 'Version': 3, 'Type': 'Admin', 'RequestId': 1}
2016-05-12 16:30:33.602 INFO job-handler-1 JUJU2 request {'Request': 'Login', 'Params': {'credentials': u'e59b4a8ad5863ce5f167eea2312af853', 'auth-tag': u'user-admin@local'}, 'Version': 3, 'Type': 'Admin', 'RequestId': 1}
2016-05-12 16:30:33.603 INFO job-handler-1 JUJU2 request {'Request': 'Login', 'Params': {'credentials': u'e59b4a8ad5863ce5f167eea2312af853', 'auth-tag': u'user-admin@local'}, 'Version': 3, 'Type': 'Admin', 'RequestId': 1}
2016-05-12 16:30:33.605 INFO job-handler-1 JUJU2 request {'Request': 'Login', 'Params': {'credentials': u'e59b4a8ad5863ce5f167eea2312af853', 'auth-tag': u'user-admin@local'}, 'Version': 3, 'Type': 'Admin', 'RequestId': 1}
2016-05-12 16:30:33.606 INFO job-handler-1 JUJU2 request {'Request': 'Login', 'Params': {'credentials': u'e59b4a8ad5863ce5f167eea2312af853', 'auth-tag': u'user-admin@local'}, 'Version': 3, 'Type': 'Admin', 'RequestId': 1}
2016-05-12 16:30:33.607 INFO job-handler-1 JUJU2 request {'Request': 'Login', 'Params': {'credentials': u'e59b4a8ad5863ce5f167eea2312af853', 'auth-tag': u'user-admin@local'}, 'Version': 3, 'Type': 'Admin', 'RequestId': 1}
2016-05-12 16:30:33.688 INFO job-handler-1 JUJU2 request {'Request': 'AddMachines', 'Params': {'MachineParams': [{'Placement': {'Scope': u'88813a52-54dd-4a6c-8d34-63def46e63ab', 'Directive': u'darby.scapestack'}, 'Jobs': ['JobHostUnits']}]}, 'Version': 1, 'Type': 'Client', 'RequestId': 101}
2016-05-12 16:30:33.690 INFO job-handler-1 JUJU2 request {'Request': 'AddMachines', 'Params': {'MachineParams': [{'ContainerType': 'lxc', 'Jobs': ['JobHostUnits'], 'ParentId': u'0'}]}, 'Version': 1, 'Type': 'Client', 'RequestId': 101}
2016-05-12 16:30:33.691 INFO job-handler-1 JUJU2 request {'Request': 'AddMachines', 'Params': {'MachineParams': [{'Placement': {'Scope': u'88813a52-54dd-4a6c-8d34-63def46e63ab', 'Directive': u'sekine.scapestack'}, 'Jobs': ['JobHostUnits']}]}, 'Version': 1, 'Type': 'Client', 'RequestId': 101}
2016-05-12 16:30:33.751 INFO job-handler-1 JUJU2 request {'Request': 'AddMachines', 'Params': {'MachineParams': [{'ContainerType': 'lxc', 'Jobs': ['JobHostUnits'], 'ParentId': u'0'}]}, 'Version': 1, 'Type': 'Client', 'RequestId': 101}
2016-05-12 16:30:33.752 INFO job-handler-1 JUJU2 request {'Request': 'AddMachines', 'Params': ...


Revision history for this message
Chad Smith (chad.smith) wrote :

The end result is that at least one LXC on machine 0 and one physical machine are stuck.

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta7 → 2.0-beta8
Curtis Hovey (sinzui)
tags: added: blocker
Revision history for this message
Katherine Cox-Buday (cox-katherine-e) wrote :

Nate is picking this up as it appears James is OoO at the moment. We can perform hand-offs later if we don't wrap it up by the time James is back.

Changed in juju-core:
assignee: James Tunnicliffe (dooferlad) → Nate Finch (natefinch)
Revision history for this message
Nate Finch (natefinch) wrote :

BTW, I have been unable to reproduce, either on my own system using LXD or on other MAAS systems... though this may be a case of not having enough capacity to run enough machines to repro. Talking with the Landscape folks to see if there's a better way to repro.

Revision history for this message
Brad Marshall (brad-marshall) wrote :

I've just had this occur too, with juju 2.0beta7 and maas 2.0beta5. I ended up having to destroy the unit to get it going again. Unfortunately I don't have an easily reproducible scenario for you.

Revision history for this message
David Britton (dpb) wrote :

FYI -- I have given access to a live environment reproducing this on 1.25.5 for natefinch to play around with.

Revision history for this message
Katherine Cox-Buday (cox-katherine-e) wrote :

Nate is being pulled off on another priority; we will revisit this ASAP. Apologies for the delay.

Changed in juju-core:
assignee: Nate Finch (natefinch) → nobody
Changed in juju-core:
assignee: nobody → Nate Finch (natefinch)
Changed in juju-core:
milestone: 2.0-beta8 → 2.0-beta9
Revision history for this message
Nate Finch (natefinch) wrote :

I have this narrowed down to a few function calls with a good understanding of what the flavor of the problem is, but still trying to nail down specifics.

Revision history for this message
Nate Finch (natefinch) wrote :

Still looking at this. I think we can get more information by logging a bunch of stuff during repro, and figure out what's malfunctioning. Seems like two competing things are trying to reset the machine's addresses somehow.

Curtis Hovey (sinzui)
Changed in juju-core:
importance: High → Critical
Revision history for this message
Nate Finch (natefinch) wrote :

Working with David Britton to get a repro with a custom binary so we get better logging.

Still can't repro on my local machine for some reason. I wonder if something in MAAS is just enough of a speed difference to hit a race condition that other providers are missing.

Ian Booth (wallyworld)
Changed in juju-core:
status: Triaged → In Progress
Chris Gregan (cgregan)
tags: added: cdo-qa-blocker
Revision history for this message
Nate Finch (natefinch) wrote :

We have a repro case we're working with, trying to get additional logs out of it. I'll continue working on this later tonight.

Revision history for this message
Mark Ramm (mark-ramm) wrote :

I can reproduce this reliably on lxd provider by deploying wiki-scalable and realtime-syslog-analytics, and connecting them together, all in one batch.

If further logs, etc are helpful let me know.

Revision history for this message
Nate Finch (natefinch) wrote :

I am now able to repro this bug locally thanks to Mark's suggestion above. This has greatly increased the speed with which I can run debugging cycles. I have an idea of what may be the problem. It looks like we may be setting the origin on one of the addresses incorrectly at some point, and that's causing our assertion to always fail.

Revision history for this message
Nate Finch (natefinch) wrote :

Analysis:

The problem is that we're changing the address from origin "machine" to origin "provider"... the assertion in the code is asserting that we're not changing anything... even though we are. I'm not sure why we care where the address is coming from. But certainly, getting the same address from the machine and the provider shouldn't be an error condition.

here's the output of some additional logging I added:

2016-06-06 19:03:31

xxxx setAddresses for "machineaddresses" on "3": ["local-machine:127.0.0.1" "local-cloud:192.168.88.129" "local-machine:::1"]
Stacktrace:
goroutine 5456 [running]:
github.com/juju/juju/state.(*Machine).setAddresses(0xc4211fc480, 0xc4208691d0, 0x3, 0x3, 0xc420ef4308, 0x1cb5357, 0x10, 0x0, 0x0)
 /home/nate/src/github.com/juju/juju/state/machine.go:1350 +0xc3
github.com/juju/juju/state.(*Machine).SetMachineAddresses(0xc4211fc480, 0xc4208691d0, 0x3, 0x3, 0x3, 0x3)
 /home/nate/src/github.com/juju/juju/state/machine.go:1337 +0xcd
github.com/juju/juju/apiserver/machine.(*MachinerAPI).SetMachineAddresses(0xc420dbe370, 0xc4214f6500, 0x1, 0x4, 0x0, 0x0, 0x0, 0x0, 0x0)
 /home/nate/src/github.com/juju/juju/apiserver/machine/machiner.go:92 +0x3eb
reflect.Value.call(0x1c24500, 0xc420dbe370, 0x1e13, 0x1c9dc84, 0x4, 0xc420f5bb20, 0x1, 0x1, 0x18c6f80, 0xc420f5bb20, ...)
 /home/nate/go/src/reflect/value.go:434 +0x5c9
reflect.Value.Call(0x1c24500, 0xc420dbe370, 0x1e13, 0xc420f5bb20, 0x1, 0x1, 0x1e13, 0x0, 0xc421bb8fa4)
 /home/nate/go/src/reflect/value.go:302 +0xa4
github.com/juju/juju/rpc/rpcreflect.newMethod.func6(0x1c24500, 0xc420dbe370, 0x16, 0x1a0e100, 0xc420f5b860, 0x199, 0x0, 0xc421bb8e50, 0xe10c75, 0xc421bb8e48, ...)
 /home/nate/src/github.com/juju/juju/rpc/rpcreflect/type.go:323 +0x100
github.com/juju/juju/apiserver.(*srvCaller).Call(0xc4214de080, 0x0, 0x0, 0x1a0e100, 0xc420f5b860, 0x199, 0x1944160, 0xc4204d1388, 0x1e15448, 0xc421bb8ee8, ...)
 /home/nate/src/github.com/juju/juju/apiserver/root.go:126 +0xa7
github.com/juju/juju/rpc.(*Conn).runRequest(0xc42076d0e0, 0x2b6cf60, 0xc4214de080, 0x1e12a58, 0x31, 0xc421b37a10, 0x8, 0x1, 0x0, 0x0, ...)
 /home/nate/src/github.com/juju/juju/rpc/server.go:558 +0xc8
created by github.com/juju/juju/rpc.(*Conn).handleRequest
 /home/nate/src/github.com/juju/juju/rpc/server.go:490 +0x6c7

xxxx current addresses:
 local-machine:127.0.0.1
 local-cloud:192.168.88.129
 local-machine:::1
xxx Running setaddress attempt 0.
Ops: []txn.Op{txn.Op{C:"machines", Id:"bfe6a0d7-bf16-45dd-8de8-2e37688eddeb:3", Assert:bson.D{bson.DocElem{Name:"life", Value:bson.D{bson.DocElem{Name:"$ne", Value:2}}}}, Insert:interface {}(nil), Update:bson.D{bson.DocElem{Name:"$set", Value:bson.D{bson.DocElem{Name:"machineaddresses", Value:[]state.address{state.address{Value:"192.168.88.129", AddressType:"ipv4", Scope:"local-cloud", Origin:"machine", SpaceName:""}, state.address{Value:"127.0.0.1", AddressType:"ipv4", Scope:"local-machine", Origin:"machine", SpaceName:""}, state.address{Value:"::1", AddressType:"ipv6", Scope:"local-machine", Origin:"machine", SpaceName:""}}}}}}, Remove:false}, txn.Op{C:"machines", Id:"bfe6a0d7-bf16-45dd-8de8-2e37688eddeb:3", Assert:bson.D{bson.DocElem{Name:"$or", Value:[]bson.D{bson.D{bson...

Revision history for this message
Nate Finch (natefinch) wrote :

Sorry, ignore the bit above about asserting that we're not changing anything... that assertion keeps confusing me. It's asserting that the value hasn't changed out from underneath us... again, I don't actually know why we care, but it still seems like nothing is changing out from underneath us... yet some assertion is failing. AFAIK, there's no way to tell what assertion is failing.
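The assertion being discussed is an optimistic-concurrency check, in the same spirit as compare-and-set: the update only applies if the stored document still matches what was read earlier. A minimal sketch (hypothetical names, not the actual state package code) shows why a purely textual comparison fails when the stored form differs in any way, including mere field reordering, even though the logical content is identical:

```go
package main

import "fmt"

// doc models a stored machine document field as its serialized text.
type doc struct{ addresses string }

// compareAndSet applies the update only if the stored value still equals
// what the caller read earlier -- the same shape as an mgo/txn Assert on
// the previous field value. Any byte-level difference fails the assert.
func compareAndSet(d *doc, expected, updated string) bool {
	if d.addresses != expected {
		return false // assert failed: value "changed underneath us"
	}
	d.addresses = updated
	return true
}

func main() {
	d := &doc{addresses: `[{"origin":"machine","value":"10.0.0.1"}]`}

	// Same logical content, different key order: the assert fails.
	stale := `[{"value":"10.0.0.1","origin":"machine"}]`
	fmt.Println(compareAndSet(d, stale, `[...]`)) // false

	// Byte-identical expected value: the assert passes.
	fmt.Println(compareAndSet(d, d.addresses, `[...]`)) // true
}
```

Under this model, "state changing too quickly" can be reported even when nothing is racing, so long as the stored and expected representations never match.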

Revision history for this message
Nate Finch (natefinch) wrote :

Example machine-0.log with extra logging

Revision history for this message
Nate Finch (natefinch) wrote :
Revision history for this message
Nate Finch (natefinch) wrote :

Quick note, this is reproducible most of the time (but not 100%) with this deploy on LXD:

juju deploy cs:realtime-syslog-analytics && juju deploy cs:mediawiki-scalable

Usually it works well to run this deploy then

juju switch controller
juju ssh 0
sudo tail -f /var/log/juju/machine-0.log | grep "cannot set addresses"

and just wait for the grep to show the right line. Sometimes it may take up to 5 minutes for that line to show up (likely because 13 containers are currently slamming my laptop).

Revision history for this message
Tim Penhey (thumper) wrote :

OK, after a day of slow reproductions and extra logging, I'm not any further than Nate was at the end of the day, but I have confirmed that something weird is going on.

http://pastebin.ubuntu.com/17085730/ contains some painstakingly formatted log output.

The worst bit of this bug is that it is very weirdly intermittent. At the start of the transaction build process I output what was stored in the DB (using a bson.M to avoid assumptions - perhaps should have used a bson.D, but c'est la vie).

All the transaction asserts appear to be met, but for some weird reason it fails.

I have nothing.

Revision history for this message
Tim Penhey (thumper) wrote :

Alrighty then...

The problem boils down to the setting of the machine addresses. This passes the state.address type into the insert for the preferredpublicaddress and preferredprivateaddress.

At some stage between the Update and when the transaction is finished, the ordering of the bson.D structure changes. This makes the assertions in the following calls to set the preferred addresses fail.

Investigation continues as to why the ordering changes. Theoretically, mongo should guarantee order here.

Revision history for this message
Tim Penhey (thumper) wrote :

This only appears to happen when the transaction system is under load. One observed case: the supported containers for the machine were being updated at the same time as the machine addresses were being set. Since two different transaction operations target the same document, weirdness ensues.

Still investigating the weirdness, but this does explain the intermittent nature of the issue.

Revision history for this message
Tim Penhey (thumper) wrote :

Found it all.

mgo under load will stash and reload transactions, converting the order-dependent bson.D into the order-agnostic bson.M.
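The root cause can be illustrated without mgo at all. The sketch below mimics the two bson document types with stdlib code (these `D`/`M` types are stand-ins, not the real bson package): `D` is an ordered list of key/value pairs, `M` is a plain Go map with no ordering. Round-tripping a document through the map form, as the transaction stash effectively does, loses the original field order, so a byte-level assert against the re-serialized document can fail even though the content is unchanged:

```go
package main

import (
	"fmt"
	"sort"
)

// D mimics bson.D: an ordered list of key/value pairs.
type D []struct {
	Key   string
	Value any
}

// M mimics bson.M: a plain map, which carries no ordering at all.
type M map[string]any

// toM converts an ordered document to a map; the field order is discarded.
func toM(d D) M {
	m := make(M, len(d))
	for _, e := range d {
		m[e.Key] = e.Value
	}
	return m
}

// toD rebuilds an ordered document from the map. With no recorded order,
// the best it can do is some arbitrary (here: sorted) key order.
func toD(m M) D {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	d := make(D, 0, len(keys))
	for _, k := range keys {
		d = append(d, struct {
			Key   string
			Value any
		}{k, m[k]})
	}
	return d
}

func main() {
	orig := D{{"value", "10.0.0.1"}, {"addresstype", "ipv4"}}
	round := toD(toM(orig))
	// First key before and after the round trip: the order has changed.
	fmt.Println(orig[0].Key, round[0].Key) // value addresstype
}
```

Hence the intermittency: the reorder only happens on the stash-and-reload path, which is only taken when the transaction system is under load.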

Tim Penhey (thumper)
Changed in juju-core:
assignee: Nate Finch (natefinch) → Tim Penhey (thumper)
status: In Progress → Fix Committed
tags: removed: blocker
Revision history for this message
Cheryl Jennings (cherylj) wrote :
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
Revision history for this message
Björn Tillenius (bjornt) wrote :

I tried to deploy xenial/mitaka with the Landscape Autopilot using the 1.25.6 binaries at http://reports.vapour.ws/releases/4080/binaries.

I can confirm that those binaries indeed seem to fix the problem. I've managed to do quite a few deployments without ever running into this bug.

Revision history for this message
Cheryl Jennings (cherylj) wrote :

Thank you for the verification, Bjorn! We're working to get a 1.25.6 out as soon as we can.

affects: juju-core → juju
Changed in juju:
milestone: 2.0-beta9 → none
milestone: none → 2.0-beta9
Changed in juju-core:
assignee: nobody → Tim Penhey (thumper)
importance: Undecided → Critical
status: New → Fix Released