lxd unit failed to start, juju says host machine has no available device in space but it does

Bug #1825854 reported by Jason Hobbs
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Canonical Juju
Expired
High
Unassigned

Bug Description

A unit failed with this error:

16/lxd/0 down pending bionic host machine "16" has no available device in space(s) "ceph-replica-space"

However, the host has an interface in that space:

  bridges:
    br-bond0-2735:
      interfaces: [bond0.2735]
      addresses:
      - 192.168.35.4/26
      nameservers:
        search: [production.solutionsqa, maas]
        addresses: [10.244.40.34, 10.244.40.31, 10.244.40.33, 10.244.40.30, 10.244.40.32]
      mtu: 9000

This is with juju 2.5.4.

bundle, model crashdump and controller crashdump are available here:
https://oil-jenkins.canonical.com/artifacts/fd1f885f-d02c-42e7-9026-4072f9efc6ac/index.html

Changed in juju:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.6-rc1
Changed in juju:
assignee: nobody → Joseph Phillips (manadart)
status: Triaged → In Progress
Revision history for this message
John A Meinel (jameinel) wrote :

hm, if this is an OIL run, then it is less likely that the issue is a mismatch between Juju's idea of what spaces are and what MAAS thinks spaces are.

Revision history for this message
Joseph Phillips (manadart) wrote :

It appears there is some issue updating the provider addresses in state.

machine-1: 12:28:44 INFO juju.worker.instancepoller machine "16" has new addresses: [local-cloud:10.244.40.254@oam-space(id:6) local-cloud:192.168.33.133@internal-space(id:2) local-cloud:192.168.35.4@ceph-replica-space(id:4) local-cloud:192.168.36.4@ceph-access-space(id:5)]
machine-1: 12:28:54 INFO juju.worker.instancepoller machine "16" has new addresses: [local-cloud:10.244.40.254@oam-space(id:6) local-cloud:192.168.33.133@internal-space(id:2) local-cloud:192.168.35.4@ceph-replica-space(id:4) local-cloud:192.168.36.4@ceph-access-space(id:5)]
machine-1: 12:29:05 INFO juju.worker.instancepoller machine "16" has new addresses: [local-cloud:10.244.40.254@oam-space(id:6) local-cloud:192.168.33.133@internal-space(id:2) local-cloud:192.168.35.4@ceph-replica-space(id:4) local-cloud:192.168.36.4@ceph-access-space(id:5)]

It repeats every 10 seconds. There is no error logged but what we have saved is obviously not equal to what the provider reports, every time they are compared.

Revision history for this message
Joseph Phillips (manadart) wrote :

2019-04-20 12:49:44 INFO juju.network bridge.go:118 bridger returned nothing
2019-04-20 12:49:44 INFO juju.api.common network.go:140 no addresses observed on interface "eth0"
2019-04-20 12:49:44 INFO juju.api.common network.go:140 no addresses observed on interface "eno2"
2019-04-20 12:49:44 INFO juju.api.common network.go:140 no addresses observed on interface "eno3"
2019-04-20 12:49:44 INFO juju.api.common network.go:140 no addresses observed on interface "eno4"
2019-04-20 12:49:44 INFO juju.api.common network.go:140 no addresses observed on interface "eno49"
2019-04-20 12:49:44 INFO juju.api.common network.go:140 no addresses observed on interface "eth6"
2019-04-20 12:49:44 INFO juju.api.common network.go:140 no addresses observed on interface "eth7"
2019-04-20 12:49:44 INFO juju.api.common network.go:140 no addresses observed on interface "bond0.2736"
2019-04-20 12:49:44 INFO juju.api.common network.go:140 no addresses observed on interface "bond0.2735"
2019-04-20 12:49:44 INFO juju.api.common network.go:140 no addresses observed on interface "eth1.2733"
2019-04-20 12:49:44 INFO juju.api.common network.go:140 no addresses observed on interface "virbr0-nic"
2019-04-20 12:49:46 WARNING juju.provisioner provisioner_task.go:1350 failed to start machine 16/lxd/0 (unable to find host brid
ge for space(s) "ceph-replica-space" for container "16/lxd/0"), retrying in 10s (10 more attempts)

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1825854] Re: lxd unit failed to start, juju says host machine has no available device in space but it does

"no addresses observed" seems like it would be because those interfaces are
already bridged, and thus the IP address would be on the bridge and not the
device.

as for why we get:
machine-1: 12:29:05 INFO juju.worker.instancepoller machine "16" has new
addresses: [local-cloud:10.244.40.254@oam-space(id:6)
local-cloud:192.168.33.133@internal-space(id:2)
local-cloud:192.168.35.4@ceph-replica-space(id:4)
local-cloud:192.168.36.4@ceph-access-space(id:5)]

I really don't know. It would be interesting to see the database
information. I know in the past we had problems with sort ordering, and
some problems with locally detected vs provider detected, but I'm pretty
sure those have all been sorted out.

On Mon, Apr 29, 2019 at 1:45 PM Joseph Phillips <email address hidden>
wrote:

> 2019-04-20 12:49:44 INFO juju.network bridge.go:118 bridger returned
> nothing
> 2019-04-20 12:49:44 INFO juju.api.common network.go:140 no addresses
> observed on interface "eth0"
> 2019-04-20 12:49:44 INFO juju.api.common network.go:140 no addresses
> observed on interface "eno2"
> 2019-04-20 12:49:44 INFO juju.api.common network.go:140 no addresses
> observed on interface "eno3"
> 2019-04-20 12:49:44 INFO juju.api.common network.go:140 no addresses
> observed on interface "eno4"
> 2019-04-20 12:49:44 INFO juju.api.common network.go:140 no addresses
> observed on interface "eno49"
> 2019-04-20 12:49:44 INFO juju.api.common network.go:140 no addresses
> observed on interface "eth6"
> 2019-04-20 12:49:44 INFO juju.api.common network.go:140 no addresses
> observed on interface "eth7"
> 2019-04-20 12:49:44 INFO juju.api.common network.go:140 no addresses
> observed on interface "bond0.2736"
> 2019-04-20 12:49:44 INFO juju.api.common network.go:140 no addresses
> observed on interface "bond0.2735"
> 2019-04-20 12:49:44 INFO juju.api.common network.go:140 no addresses
> observed on interface "eth1.2733"
> 2019-04-20 12:49:44 INFO juju.api.common network.go:140 no addresses
> observed on interface "virbr0-nic"
> 2019-04-20 12:49:46 WARNING juju.provisioner provisioner_task.go:1350
> failed to start machine 16/lxd/0 (unable to find host brid
> ge for space(s) "ceph-replica-space" for container "16/lxd/0"), retrying
> in 10s (10 more attempts)
>
> --
> You received this bug notification because you are subscribed to juju.
> Matching subscriptions: juju bugs
> https://bugs.launchpad.net/bugs/1825854
>
> Title:
> lxd unit failed to start, juju says host machine has no available
> device in space but it does
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/juju/+bug/1825854/+subscriptions
>

Revision history for this message
Joseph Phillips (manadart) wrote :

Yes, "no addresses observed" is a red herring here.

The "new addresses" messages appear for the other machines too, so that doesn't appear to be causal - machine 17 gets set up fine with a bond in the space.

A telling message is this one:

machine-0: 12:49:56 WARNING juju.network.containerizer container "16/lxd/0" wants spaces "ceph-access-space", "ceph-replica-space", "oam-space", but host machine "16" has "ceph-access-space", "ceph-replica-space", "internal-space", "oam-space" missing "ceph-replica-space"

The set-up logic interrogates the link-layer devices on the machine, grouping them by space and filtering them. Whatever is stored in state for the address or device causes it to be omitted from consideration, then to generate this message machine.AllSpaces() is used, which tells us the one it thinks is missing is really there.

Unfortunately, we are missing some debug-level logging that would shed more light on this. I don't think I can glean much more by comparing logs and code.

I don't suppose this environment is around any longer fo us to look at the DB...

Changed in juju:
milestone: 2.6-rc1 → 2.6-rc2
Revision history for this message
Joseph Phillips (manadart) wrote :

I've marked this one incomplete for the moment.

If we get more information or a recurrence it can be re-opened.

Changed in juju:
status: In Progress → Incomplete
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Hi Joseph,

Can you please be very clear about what additional information you require and how to go about collecting it?

Changed in juju:
status: Incomplete → New
Revision history for this message
Joseph Phillips (manadart) wrote :

The crashdumps had some logging at debug level, but the machine agent logs were at info.

There are some logging statements in code relevant to this issue that shed light on what data is being processed to get the end (incorrect) result.

I see this line in machine 16's log:

2019-04-20 12:40:52 DEBUG juju.worker.logger logger.go:70 reconfiguring logging from "<root>=DEBUG" to "juju=INFO;juju.worker.dependency=DEBUG;unit=DEBUG"

It would be good if we got logs with the default at DEBUG, but at a pinch getting DEBUG for "juju.network.containerizer" would help some.

Best of all would be a DB dump for the affected model. This can be obtained via:

export JUJU_DEV_FEATURE_FLAGS=developer-mode
juju dump-db -m <model> > db-dump.yaml

Changed in juju:
status: New → Triaged
Changed in juju:
milestone: 2.6-rc2 → 2.6.1
Changed in juju:
milestone: 2.6.1 → 2.6.2
Changed in juju:
milestone: 2.6.2 → 2.6.3
Changed in juju:
milestone: 2.6.3 → 2.6.4
Changed in juju:
milestone: 2.6.4 → 2.6.5
Changed in juju:
milestone: 2.6.5 → 2.6.6
Changed in juju:
milestone: 2.6.6 → 2.6.7
Revision history for this message
Richard Harding (rharding) wrote :

Marking incomplete looking for the additional details that Joe asked for while trying to chase this down.

Changed in juju:
milestone: 2.6.7 → none
status: Triaged → Incomplete
assignee: Joseph Phillips (manadart) → nobody
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for juju because there has been no activity for 60 days.]

Changed in juju:
status: Incomplete → Expired
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.