Improve reporting of IP address starvation in a subnet (or multiple subnets in a space)

Bug #1725356 reported by Dmitrii Shcherbakov
Affects: Canonical Juju
Status: Fix Released
Importance: Medium
Assigned to: Witold Krecicki

Bug Description

I don't think this is a charm-specific problem, so Juju itself could report this situation better.

When IP address starvation happens in a subnet (or in a space with multiple subnets), the only visible symptom is a cryptic error about missing network config for a binding:

root@juju-750932-22-lxd-5:/var/lib/juju/agents/unit-keystone-2/charm# network-get --primary-address public
ERROR no network config found for binding "public"
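For scale, "starvation" here simply means that every usable address in the subnet has already been allocated. As a rough illustration (my own aside, not part of the original report), Python's `ipaddress` module shows how many hosts the 10.30.20.0/22 subnet seen in the logs below can hold:

```python
import ipaddress

# The container's subnet, as seen in the machine agent's logs below.
subnet = ipaddress.ip_network("10.30.20.0/22")

# Total addresses in the block, and usable hosts once the
# network and broadcast addresses are excluded.
total = subnet.num_addresses
usable = total - 2

print(total, usable)  # 1024 1022
```

Once those ~1022 addresses are handed out, any further container provisioning on this subnet will hit the error above.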

This can be debugged by looking at the machine agent's logs; however, it is still hard to tell why the network config is empty for that specific interface.

root@juju-750932-22-lxd-5:/var/lib/juju/agents/unit-keystone-2/charm# grep 'observed network config' /var/log/juju/machine-22-lxd-5.log
2017-10-20 14:32:59 DEBUG juju.worker.machiner machiner.go:172 observed network config updated for "machine-22-lxd-5" to [{1 127.0.0.0/8 65536 0 lo loopback false false loopback 127.0.0.1 [] [] []} {1 ::1/128 65536 0 lo loopback false false loopback ::1 [] [] []} {63 00:16:3e:f6:2b:8c 10.30.20.0/22 1500 0 eth0 ethernet false false static 10.30.21.250 [] [] []} {63 00:16:3e:f6:2b:8c 1500 0 eth0 ethernet false false manual [] [] []} {65 00:16:3e:95:51:7c 1500 0 eth1 ethernet false false manual [] [] []}]
2017-10-20 14:33:02 DEBUG juju.worker.machiner machiner.go:172 observed network config updated for "machine-22-lxd-5" to [{1 127.0.0.0/8 65536 0 lo loopback false false loopback 127.0.0.1 [] [] []} {1 ::1/128 65536 0 lo loopback false false loopback ::1 [] [] []} {2 5a:ac:af:94:c3:9d 1500 0 lxdbr0 bridge false false manual [] [] []} {2 5a:ac:af:94:c3:9d 1500 0 lxdbr0 bridge false false manual [] [] []} {63 00:16:3e:f6:2b:8c 10.30.20.0/22 1500 0 eth0 ethernet false false static 10.30.21.250 [] [] []} {63 00:16:3e:f6:2b:8c 1500 0 eth0 ethernet false false manual [] [] []} {65 00:16:3e:95:51:7c 1500 0 eth1 ethernet false false manual [] [] []}]

root@juju-750932-22-lxd-5:/var/lib/juju/agents/unit-keystone-2/charm# cat /etc/network/interfaces

auto lo eth1 eth0

iface lo inet loopback
  dns-nameservers 10.30.20.21

iface eth0 inet static
  address 10.30.21.250/22
  gateway 10.30.20.21

iface eth1 inet manual

Tracing this to the host shows that the container's veth interface is properly plugged into a host bridge (so the bridger worked), and that the bridge interface has an IP address. This means (and I confirmed it by looking at MAAS) that the subnet configuration for that interface is correct. The real problem only becomes visible when looking at the subnet's address allocations: there were no addresses left to allocate for this container.

root@juju-750932-22-lxd-5:/var/lib/juju/agents/unit-keystone-2/charm# ip a s
...
63: eth0@if64: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:16:3e:f6:2b:8c brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.30.21.250/22 brd 10.30.23.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::216:3eff:fef6:2b8c/64 scope link
       valid_lft forever preferred_lft forever
65: eth1@if66: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:16:3e:95:51:7c brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::216:3eff:fe95:517c/64 scope link
       valid_lft forever preferred_lft forever

nova003:~$ ip a s | grep if65
66: veth2RWQGO@if65: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-bond0.201 state UP group default qlen 1000

ubuntu@nova003:~$ brctl show | grep veth2RWQGO
       veth2RWQGO

nova003:~$ ip -4 -o a s br-bond0.201
40: br-bond0.201 inet 103.77.105.137/26 brd 103.77.105.191 scope global br-bond0.201\ valid_lft forever preferred_lft forever
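The host-side tracing above boils down to mapping interface indexes and names to their IPv4 addresses from `ip -4 -o a s` output. As a small sketch (the helper name and parsing are my own, not Juju code), one such line can be picked apart like this:

```python
import re

def parse_ip_oneline(line):
    """Parse one line of `ip -4 -o a s` output into (ifindex, ifname, cidr).

    The one-line format starts with "<ifindex>: <ifname>    inet <cidr> ...".
    Returns None if the line does not match.
    """
    m = re.match(r"(\d+):\s+(\S+)\s+inet\s+(\S+)", line)
    if not m:
        return None
    idx, name, cidr = m.groups()
    return int(idx), name, cidr

# The bridge line from the host above.
line = ("40: br-bond0.201 inet 103.77.105.137/26 "
        "brd 103.77.105.191 scope global br-bond0.201")
print(parse_ip_oneline(line))  # (40, 'br-bond0.201', '103.77.105.137/26')
```

This is the kind of check a better error message could automate: the bridge clearly has an address, so the missing piece is on the allocation side, not the host networking side.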

Tags: cpe-onsite
Ian Booth (wallyworld)
Changed in juju:
milestone: none → 2.3.0
status: New → Triaged
importance: Undecided → High
Tim Penhey (thumper)
Changed in juju:
importance: High → Medium
milestone: 2.3.0 → 2.3-rc1
Witold Krecicki (wpk)
Changed in juju:
assignee: nobody → Witold Krecicki (wpk)
Witold Krecicki (wpk)
Changed in juju:
status: Triaged → In Progress
Revision history for this message
John A Meinel (jameinel) wrote :

https://github.com/juju/juju/pull/8084

So, do I understand the patch correctly that it is essentially our retry logic interacting poorly? We create a device and fail to give it an IP address, but then, when we retry provisioning, we see the device already exists and assume it was set up correctly.

A different fix could be to detect that the device exists and validate it more thoroughly, but removing the device seems OK. It does seem, though, that if a network break caused provisioning to fail, you would be in the same position: you can't finish setting up the device *nor* delete it, because you have lost connectivity.
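The failure mode described in this comment, where a retry sees a half-created device and assumes success, can be sketched roughly as follows. This is an illustrative model only; the names and data structures are mine, not Juju's actual code.

```python
class AddressExhausted(Exception):
    """No free addresses left in the subnet."""

def provision(devices, allocate_ip, name):
    """Naive provisioning: create the device, then assign an address.

    The bug pattern: if allocate_ip fails, the half-created device is
    left behind, and a later retry sees it and assumes it is complete.
    """
    if name in devices:                  # retry path: device exists...
        return devices[name]             # ...so (wrongly) assume it's done
    devices[name] = {"name": name, "ip": None}
    devices[name]["ip"] = allocate_ip()  # may raise AddressExhausted
    return devices[name]

def provision_cleanly(devices, allocate_ip, name):
    """Variant in the spirit of the fix: remove the device on failure
    so retries start from scratch instead of trusting a stale one."""
    if name in devices and devices[name]["ip"] is not None:
        return devices[name]
    devices[name] = {"name": name, "ip": None}
    try:
        devices[name]["ip"] = allocate_ip()
    except AddressExhausted:
        del devices[name]                # don't leave a half-built device
        raise
    return devices[name]
```

With the naive version, a failed first attempt leaves a device with no IP, and the retry happily returns it; the clean version deletes the device on failure, so a later retry (once addresses are free again) provisions it properly.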

Witold Krecicki (wpk)
Changed in juju:
status: In Progress → Fix Committed
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Probably fix-released now.

Changed in juju:
status: Fix Committed → Fix Released