MAAS non-deterministic private-address with non-ethN interfaces

Bug #1314442 reported by Ryan Finnie
This bug affects 7 people
Affects: juju-core
Status: Fix Released
Importance: Medium
Assigned to: Unassigned

Bug Description

On a MAAS setup, we have the following network configuration[0] on some deployed units; in this example, a nova-compute instance with a bonded interface. eth0-3 are part of bond0, and bond0 carries the IP associated with the hostname (the hostname returned by "unit-get public-address"). However, "unit-get private-address" returns 10.0.3.1, the (unusable) address of the default bridge set up by lxc-net, which appears to be installed by default during preseed.
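
For illustration, this is roughly what the two hook tools return on such a unit (the hostname here is a placeholder; the addresses match the "ip addr" paste below):

$ unit-get public-address
nova-compute-example.maas
$ unit-get private-address
10.0.3.1

The first resolves to 10.34.6.3 on bond0; the second is lxcbr0's bridge address and is unreachable from other units.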

The address ordering, as reported in the machine agent log, seems to be:

2014-04-30 02:10:14 INFO juju.worker.machiner machiner.go:85 setting addresses for machine-11 to ["local-machine:127.0.0.1" "local-cloud:10.0.3.1" "local-cloud:192.168.122.1" "local-cloud:10.34.6.3" "local-machine:::1" "fe80::a47e:62ff:fee1:c302" "fe80::3c1c:1dff:fe50:971f" "fe80::9e8e:99ff:fefb:e6" "fe80::4ab:1ff:fee0:a563" "fe80::70f3:3fff:fea3:b1a7" "fe80::883e:6eff:fef5:4649" "fe80::c1:4eff:fe8d:c869" "fe80::741f:75ff:fe2f:b297" "fe80::3840:c7ff:fe27:fa29" "fe80::702c:fdff:feda:ec19" "fe80::ac5c:f8ff:fe18:9ca8" "fe80::a801:caff:fe1f:adfd" "fe80::fc16:3eff:fef2:5511" "fe80::fc16:3eff:fe4d:c84c" "fe80::fc16:3eff:fe9a:989f" "fe80::5047:a2ff:fe1f:d218" "fe80::8c14:d8ff:fe58:35fd" "fe80::dcf3:39ff:fe9c:c350" "fe80::fc16:3eff:feab:fde2" "fe80::fc86:d6ff:fe8a:6180"]

Juju appears to enumerate addresses in network interface index order and pick the first "local-cloud" address in that order.
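
This matches what the kernel reports: enumerating global IPv4 addresses in interface index order puts lxcbr0 (index 7) and virbr0 (index 9) ahead of bond0 (index 16), so a naive "first local-cloud address" pick returns 10.0.3.1. For example (output trimmed; indices and addresses as in [0]):

$ ip -o -4 addr show scope global
7: lxcbr0    inet 10.0.3.1/24 brd 10.0.3.255 scope global lxcbr0
9: virbr0    inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
16: bond0    inet 10.34.6.3/21 brd 10.34.7.255 scope global bond0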

I think (but am not 100% positive) this problem began with a juju upgrade from 1.16.4 to 1.18.1.

[0]

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP qlen 1000
    link/ether 9c:8e:99:fb:00:e6 brd ff:ff:ff:ff:ff:ff
3: eth1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP qlen 1000
    link/ether 9c:8e:99:fb:00:e6 brd ff:ff:ff:ff:ff:ff
4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 9c:8e:99:fb:00:ea brd ff:ff:ff:ff:ff:ff
5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 9c:8e:99:fb:00:ec brd ff:ff:ff:ff:ff:ff
7: lxcbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether a6:7e:62:e1:c3:02 brd ff:ff:ff:ff:ff:ff
    inet 10.0.3.1/24 brd 10.0.3.255 scope global lxcbr0
    inet6 fe80::a47e:62ff:fee1:c302/64 scope link
       valid_lft forever preferred_lft forever
9: virbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN
    link/ether f2:71:d5:0a:e7:5c brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
11: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
    link/ether 42:f6:63:b1:18:28 brd ff:ff:ff:ff:ff:ff
12: br-int: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether fe:9a:21:47:94:43 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::3c1c:1dff:fe50:971f/64 scope link
       valid_lft forever preferred_lft forever
16: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    link/ether 9c:8e:99:fb:00:e6 brd ff:ff:ff:ff:ff:ff
    inet 10.34.6.3/21 brd 10.34.7.255 scope global bond0
    inet6 fe80::9e8e:99ff:fefb:e6/64 scope link
       valid_lft forever preferred_lft forever
[cut]

Curtis Hovey (sinzui)
tags: added: addressability maas-provider
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
Jacek Nykis (jacekn) wrote:

It appears that this bug takes some time, or certain conditions, to manifest itself. I don't know what the trigger is, but in my environment it started affecting some hosts that had been provisioned many weeks ago, while some recently added hosts are not affected (so far).

In my case this bug has significant impact. I am running Neutron in OpenStack, and the bug caused /etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini to have an incorrect local_ip setting, which then broke connectivity to my OpenStack instances. I could see GRE tunnels using the wrong IPs:
# ovs-vsctl show
xxxxxx
    Bridge br-int
        Port "..."
            tag: 1
            Interface "..."
        Port patch-tun
            Interface patch-tun
                type: patch
                options: {peer=patch-int}
        Port br-int
            Interface br-int
                type: internal
    Bridge br-tun
        Port "gre-9"
            Interface "gre-9"
                type: gre
                options: {in_key=flow, local_ip="192.168.122.1", out_key=flow, remote_ip="..."}
        Port br-tun
            Interface br-tun
                type: internal
        Port "gre-2"
            Interface "gre-2"
                type: gre
                options: {in_key=flow, local_ip="192.168.122.1", out_key=flow, remote_ip="..."}

To test this I reverted the change in ovs_neutron_plugin.ini, restarted neutron-plugin-openvswitch-agent, and connectivity came back immediately.
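
For anyone hitting the same symptom, a minimal sketch of the check and the fix, assuming the config path above and that bond0's address (10.34.6.3 in the description; it will differ per host) is the correct tunnel endpoint:

$ grep '^local_ip' /etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini
local_ip = 192.168.122.1
$ sudo sed -i 's/^local_ip = .*/local_ip = 10.34.6.3/' /etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini
$ sudo service neutron-plugin-openvswitch-agent restart

192.168.122.1 is virbr0, the libvirt default bridge, which is why the GRE tunnels above came up with an unusable local_ip.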

Jacek Nykis (jacekn)
tags: added: canonical-is
Jacek Nykis (jacekn)
tags: added: production
Andrew Wilkins (axwalk) wrote:

I'm guessing this was broken in the course of fixing lp:1303735.

Ryan or Jacek, do either of you have a full machine-0.log from an environment exhibiting this bug? MAAS itself should be providing addresses back to Juju; Juju should prefer those addresses to the ones learned from the machine (i.e. the ones in the log message pasted in the description).

Tom Haddon (mthaddon) wrote:

I've attached a machine-X.log file from an affected machine (a compute node in the affected OpenStack deployment).

ubuntu@nisse:~$ grep 'setting address' /var/log/juju/machine-35.log
2014-05-23 16:50:29 INFO juju.worker.machiner machiner.go:85 setting addresses for machine-35 to ["local-machine:127.0.0.1" "local-cloud:10.34.6.7" "local-machine:::1" "fe80::e611:5bff:fe0d:84a2"]
2014-05-23 16:57:20 INFO juju.worker.machiner machiner.go:85 setting addresses for machine-35 to ["local-machine:127.0.0.1" "local-cloud:10.34.6.7" "local-machine:::1" "fe80::e611:5bff:fe0d:84a2"]
2014-05-29 09:21:20 INFO juju.worker.machiner machiner.go:85 setting addresses for machine-35 to ["local-machine:127.0.0.1" "local-cloud:192.168.122.1" "local-cloud:10.34.6.7" "local-machine:::1" "fe80::cca4:66ff:fe49:9a2f" "fe80::e611:5bff:fe0d:84a2" "fe80::c44f:88ff:fee0:2992"]
2014-05-29 09:42:03 INFO juju.worker.machiner machiner.go:85 setting addresses for machine-35 to ["local-machine:127.0.0.1" "local-cloud:192.168.122.1" "local-cloud:10.34.6.7" "local-machine:::1" "fe80::2830:46ff:fed6:8c71" "fe80::7082:caff:fe96:2cad" "fe80::e611:5bff:fe0d:84a2"]

Seems like it's getting different addresses over time.

Andrew Wilkins (axwalk) wrote:

Thanks Tom. Is it possible to get a scrubbed machine-0.log too? That's the one that will contain log messages about provider addresses (i.e. the ones that MAAS reports, rather than those found on the machine via ifconfig).

Tom Haddon (mthaddon) wrote:

Sure - I don't suppose you have either a handy script for scrubbing this, or one for extracting just the info you're after from that log? I'll try to work on scrubbing machine-0.log, but it may take me a little while as I'm sprinting this week.

Andrew Wilkins (axwalk) wrote:

If you just grep for "has new addresses", then that should be enough for now.
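
Something like this should pull out just those lines with the addresses masked in one go (a rough sketch; adjust the log path to match your bootstrap node):

$ grep 'has new addresses' /var/log/juju/machine-0.log | sed -E 's/[0-9]{1,3}(\.[0-9]{1,3}){3}/x.x.x.x/g'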

Curtis Hovey (sinzui) wrote:

Is bug 1341524 this bug?

Jacek Nykis (jacekn) wrote:

> Is bug 1341524 this bug?

No, I think they are completely different. This one affects private-address on machines with a certain uptime, while 1341524 is a first-boot problem.

Haw Loeung (hloeung) wrote:

Hi,

I ran into this issue earlier today when trying to deploy a service via manual provisioning to some MAAS hosts. It was selecting and using lxcbr0 when it shouldn't have been.

I had to remove the LXC packages and bring down that interface.
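
For reference, the workaround amounted to something like this (a sketch; the exact package set varies by release):

$ sudo apt-get remove --purge lxc
$ sudo ip link set lxcbr0 down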

Curtis Hovey (sinzui)
Changed in juju-core:
importance: High → Medium
tags: added: network
Dimiter Naydenov (dimitern) wrote:

Numerous fixes have landed around picking, sorting, and filtering addresses (for API endpoints, unit private/public addresses, and IPv6 support) since this bug was reported. I'm marking it Fix Released; please reopen it if it still exists in the latest releases.

Changed in juju-core:
status: Triaged → Fix Released