MAAS

virtual nodes don't always PXE boot on the same NIC

Bug #1367482 reported by David Britton on 2014-09-09

This bug affects 3 people

	Status	Importance	Assigned to	Milestone
MAAS	Invalid	High	Unassigned	MAAS 1.7.3
Orange Box	Fix Committed	Medium	Darryl Weaver
libvirt	New	Undecided	Unassigned

Bug Description

dpb@helo:trunk$ juju bootstrap -v
Launching instance
WARNING picked arbitrary tools &{1.20.7-trusty-amd64 https://streams.canonical.com/juju/tools/releases/juju-1.20.7-trusty-amd64.tgz 31028bcfc37fdad261f3fc11b0d27af9e5a2cf0dcc1826f5457d81e920c85d6c 8234430}
- /MAAS/api/1.0/nodes/node-c94777ee-128b-11e4-8262-8cae4cfd5fd8/
Waiting for address
Attempting to connect to node1.maas:22
Attempting to connect to node1.maas:22
Attempting to connect to 10.14.100.2:22

### syslog snip ####

Sep 9 16:28:16 OrangeBox12 dhcpd: DHCPDISCOVER from c0:3f:d5:60:4e:f2 via br0
Sep 9 16:28:16 OrangeBox12 dhcpd: DHCPOFFER on 10.14.100.11 to c0:3f:d5:60:4e:f2 via br0
Sep 9 16:28:16 OrangeBox12 dhcpd: DHCPREQUEST for 10.14.100.11 (10.14.4.1) from c0:3f:d5:60:4e:f2 via br0
Sep 9 16:28:16 OrangeBox12 dhcpd: DHCPACK on 10.14.100.11 to c0:3f:d5:60:4e:f2 via br0
Sep 9 16:28:22 OrangeBox12 dhcpd: DHCPRELEASE of 10.14.100.11 from c0:3f:d5:60:4e:f2 via br0 (not found)
Sep 9 16:28:26 OrangeBox12 maas.dhcp.probe: [INFO] Running periodic DHCP probe.
Sep 9 16:28:26 OrangeBox12 maas.lease_upload_service: [INFO] Scanning DHCP leases...
Sep 9 16:28:26 OrangeBox12 maas.lease_upload_service: [INFO] No leases changed since last scan
Sep 9 16:28:26 OrangeBox12 dhcpd: DHCPDISCOVER from c0:3f:d5:60:4e:f5 via br0
Sep 9 16:28:26 OrangeBox12 dhcpd: DHCPOFFER on 10.14.60.37 to c0:3f:d5:60:4e:f5 via br0
Sep 9 16:28:28 OrangeBox12 dhcpd: DHCPDISCOVER from c0:3f:d5:60:4e:f2 via br0
Sep 9 16:28:28 OrangeBox12 dhcpd: DHCPOFFER on 10.14.100.11 to c0:3f:d5:60:4e:f2 via br0
Sep 9 16:28:28 OrangeBox12 dhcpd: DHCPREQUEST for 10.14.100.11 (10.14.4.1) from c0:3f:d5:60:4e:f2 via br0
Sep 9 16:28:28 OrangeBox12 dhcpd: DHCPACK on 10.14.100.11 to c0:3f:d5:60:4e:f2 via br0

### dns reports: ###

ubuntu@OrangeBox12:~$ dig node1.maas

; <<>> DiG 9.9.5-3-Ubuntu <<>> node1.maas
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 58898
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 2

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;node1.maas. IN A

;; ANSWER SECTION:
node1.maas. 300 IN A 10.14.100.2

;; AUTHORITY SECTION:
maas. 300 IN NS maas.

;; ADDITIONAL SECTION:
maas. 300 IN A 10.14.4.1

;; Query time: 1 msec
;; SERVER: 10.14.4.1#53(10.14.4.1)
;; WHEN: Tue Sep 09 16:37:54 MDT 2014
;; MSG SIZE rcvd: 85

### Interesting leases ####

# The format of this file is documented in the dhcpd.leases(5) manual page.
# This lease file was written by isc-dhcp-4.2.4

host 10.14.100.11 {
  dynamic;
  hardware ethernet c0:3f:d5:60:4e:f2;
  fixed-address 10.14.100.11;
}

host 10.14.100.2 {
  dynamic;
  hardware ethernet c0:3f:d5:63:ff:41;
  fixed-address 10.14.100.2;
}

I'll paste in maas.log and syslog

David Britton (dpb) wrote on 2014-09-09:

leases (48.9 KiB, text/plain)

David Britton (dpb) wrote on 2014-09-09:

last-1000-of-maas.log (209.6 KiB, text/plain)

Julian Edwards (julian-edwards) wrote on 2014-09-09:

Everything looks fine from a MAAS pov;
c0:3f:d5:63:ff:41 has got IP 10.14.100.2
c0:3f:d5:60:4e:f2 has got IP 10.14.100.11

DNS is linking node1.maas to 10.14.100.2

Why do you think it's handing out incorrect addresses? Are you talking about juju not being able to contact the node on the network interface which is in the DNS? (it's always the first/oldest MAC on the node)

Changed in maas:
status:	New → Incomplete

David Britton (dpb) wrote on 2014-09-10: Re: [Bug 1367482] Re: 1.7: maas handed out incorrect address

On Tue, Sep 09, 2014 at 11:42:21PM -0000, Julian Edwards wrote:
> Everything looks fine from a MAAS pov;
> c0:3f:d5:63:ff:41 has got IP 10.14.100.2
> c0:3f:d5:60:4e:f2 has got IP 10.14.100.11
>
> DNS is linking node1.maas to 10.14.100.2

I think maybe I missed telling a key piece of information:

node1.maas (c0:3f:d5:60:4e:f2) should be 10.14.100.11

but DNS thinks it's 10.14.100.2

node9.maas c0:3f:d5:60:4e:f2 I have no idea about, but it probably
wasn't turned on at the time.

Anyway, in the leases file, *:4e:f2 is 10.14.100.11 and is the only
entry in there, yet the dns mapping was wrong.

This is the bug.
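
The comparison described above (leases file versus DNS answer) can be sketched in a few lines of Python. This is only an illustrative check, not part of MAAS: the helper names `parse_host_map` and `dns_mismatch` are made up for this example, and the regular expression assumes host declarations shaped like the two shown in the bug description.

```python
import re

# Matches: host <name> { ... hardware ethernet <mac>; ... fixed-address <ip>; ... }
# Assumes the simple layout seen in the lease snippet from the bug description.
HOST_RE = re.compile(
    r"host\s+(\S+)\s*\{[^}]*?hardware ethernet\s+([0-9a-f:]+);"
    r"[^}]*?fixed-address\s+([\d.]+);[^}]*\}",
    re.IGNORECASE,
)


def parse_host_map(leases_text):
    """Return {mac: ip} for every host declaration found in the text."""
    return {mac.lower(): ip for _name, mac, ip in HOST_RE.findall(leases_text)}


def dns_mismatch(host_map, mac, dns_ip):
    """Return the leased IP if it differs from the DNS answer, else None."""
    leased = host_map.get(mac.lower())
    return leased if leased is not None and leased != dns_ip else None


LEASES = """
host 10.14.100.11 {
  dynamic;
  hardware ethernet c0:3f:d5:60:4e:f2;
  fixed-address 10.14.100.11;
}

host 10.14.100.2 {
  dynamic;
  hardware ethernet c0:3f:d5:63:ff:41;
  fixed-address 10.14.100.2;
}
"""

host_map = parse_host_map(LEASES)
# DNS answered 10.14.100.2 for node1.maas, whose MAC is c0:3f:d5:60:4e:f2.
print(dns_mismatch(host_map, "c0:3f:d5:60:4e:f2", "10.14.100.2"))
```

A non-None result means the host map and DNS disagree for that MAC, which is the symptom reported here.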

Maas version: 1.7.0~beta2+bzr2916-0ubuntu1~ppa1

--
David Britton <email address hidden>

David Britton (dpb) wrote on 2014-09-10:

On Tue, Sep 09, 2014 at 11:42:21PM -0000, Julian Edwards wrote:
> about juju not being able to contact the node on the network interface
> which is in the DNS? (it's always the first/oldest MAC on the node)

Just so it's also clear, these nodes have just one NIC/MAC.

Thanks for looking into this.

--
David Britton <email address hidden>

Julian Edwards (julian-edwards) wrote on 2014-09-10:

On Wednesday 10 September 2014 04:42:57 you wrote:
> On Tue, Sep 09, 2014 at 11:42:21PM -0000, Julian Edwards wrote:
> > about juju not being able to contact the node on the network interface
> > which is in the DNS? (it's always the first/oldest MAC on the node)
>
> Just so it's also clear, these nodes have just one NIC/MAC.
>
> Thanks for looking into this.

Ok that makes more sense, thanks for clarifying.

Does the zone file get rewritten correctly if you restart apache? (restarting
it forces a start up task that rewrites the dns config)

Thanks.

David Britton (dpb) wrote on 2014-09-22: Re: 1.7: maas handed out incorrect address

I don't have this exact config up any longer, so if you can't get enough information from what I posted, I guess we'll have to drop it.

Dean Henrichsmeyer (dean) on 2014-09-22

Changed in maas:
status:	Incomplete → New

Christian Reis (kiko) on 2014-10-02

Changed in maas:
milestone:	none → 1.7.0
importance:	Undecided → High

Julian Edwards (julian-edwards) wrote on 2014-10-07:

I don't have enough information to validate this right now so I'm marking it incomplete again; if you have a way to consistently re-create this that would be fantastic. Otherwise if you see it happen again, please grab a core maas team member *right away* so we can try to track this down. Thank you.

Changed in maas:
status:	New → Incomplete
milestone:	1.7.0 → none
importance:	High → Undecided

Adam Collard (adam-collard) wrote on 2014-11-11:

dhcpd.leases file (29.8 KiB, text/plain)

This happened again in RC1. Sorry, couldn't grab anyone since it was on an Orange Box that was about to leave the metaphorical door.

node0vm2 booted and MAAS thought it had IP 10.14.100.3 but it actually got a 10.14.66.x address. Note the duplicate entries in the leases file (attached).

We worked around it by truncating the leases file and restarting maas-dhcpd

Changed in maas:
status:	Incomplete → Confirmed

David Britton (dpb) wrote on 2014-11-11:

#10

maas-logs.tar.gz (4.6 MiB, application/x-tar)

maas logs: /var/log/maas/**, /var/log/syslog, fubar leases file

Andres Rodriguez (andreserl) wrote on 2014-11-11:

#11

15:01 < roaksoax-us-holiday> dpb: weird, only 134
15:01 < roaksoax-us-holiday> dpb: ok, i think i've seen this bug before
15:01 < dpb> not sure how it got there, but from my understanding it was a pretty vanilla orangebox
15:02 < roaksoax-us-holiday> dpb: i think something else crashes on maas-clusterd(pserv.log), and this causes the dhcp stuff to break
15:02 < dpb> roaksoax-us-holiday: only other weird symptom I saw was it wasn't consistent. we allocated/started the machine to check, came up fine on the 100.3 address. Then, the next bootstrap, same split IP problem.
15:03 < dpb> but ya, could have been something as you say, not sure
15:03 < roaksoax-us-holiday> dpb: at a certain point in time, this happened: http://pastebin.ubuntu.com/8946932/
15:03 < roaksoax-us-holiday> dpb: and i'mn thinking that's the cause
15:04 < roaksoax-us-holiday> dpb: either cluster/region connection broke or something, that caused the DHCP stuff to go funky
15:04 < dpb> ahh, interesting
15:04 < roaksoax-us-holiday> dpb: so, restarting apache2, maas-clusterd and maas-dhcpd should fix this
15:04 < roaksoax-us-holiday> dpb: but ensuring that maas region/cluster *can* communicate
15:04 < dpb> well, IIUC, we had to truncate the file first
15:04 < roaksoax-us-holiday> but I think we definitely need to investigate further
15:04 < sparkiegeek> there was a tick in the cluster page

Christian Reis (kiko) on 2014-11-11

Changed in maas:
milestone:	none → 1.7.1

Julian Edwards (julian-edwards) wrote on 2014-11-12:

#12

Can you guys try this with the latest RC3 please? An important IP fix went in for IP allocation in RC2.

Changed in maas:
status:	Confirmed → Incomplete

Raphaël Badin (rvb) wrote on 2014-11-12:

#13

Next time this happens, it would be great to get the content of the DHCPLease and the StaticIPAddress tables.

A node's IP addresses are: the static addresses allocated to it if it has some (StaticIPAddress: static IP allocated during deployment), the dynamic IP addresses otherwise (DHCPLease: dynamic IP gathered from the leases file).

Here is how to collect that info:

$ sudo maas-region-admin shell

>>> from maasserver.models.staticipaddress import StaticIPAddress
>>> from maasserver.models.dhcplease import DHCPLease
>>> from maasserver.models.nodegroup import NodeGroup

>>> ng = NodeGroup.objects.all()[0]
>>> DHCPLease.objects.get_hostname_ip_mapping(ng)
>>> StaticIPAddress.objects.get_hostname_ip_mapping(ng)

This will help us diagnose the problem: it will allow us to make sure a static address was assigned to the node.
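
The selection rule stated above (static addresses if the node has any, dynamic lease otherwise) can be written out as a small stand-alone function. This is a sketch, not MAAS code: `effective_ips` is a hypothetical helper, and the shapes of the two mappings (hostname to list of static IPs, hostname to a single leased IP) are assumptions made for the example.

```python
from collections import defaultdict


def effective_ips(hostname, static_map, dynamic_map):
    """Static addresses win if the node has any; otherwise use the lease.

    static_map:  hostname -> list of static IPs (assumed shape)
    dynamic_map: hostname -> leased IP (assumed shape)
    """
    static = static_map.get(hostname, [])
    if static:
        return list(static)
    dynamic = dynamic_map.get(hostname)
    return [dynamic] if dynamic else []


# Made-up sample data, only to exercise the three branches.
static_map = defaultdict(list, {"node-a": ["192.0.2.10"]})
dynamic_map = {"node-a": "192.0.2.200", "node-b": "192.0.2.201"}

print(effective_ips("node-a", static_map, dynamic_map))  # static wins
print(effective_ips("node-b", static_map, dynamic_map))  # falls back to lease
print(effective_ips("node-c", static_map, dynamic_map))  # nothing known
```

If both mappings come back empty for a node, this rule has nothing to publish for it, so the two tables need to be captured while the node is still allocated.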

Adam Collard (adam-collard) wrote on 2014-11-19:

#14

This happened again

>>> DHCPLease.objects.get_hostname_ip_mapping(ng)
{}
>>> StaticIPAddress.objects.get_hostname_ip_mapping(ng)
defaultdict(<type 'list'>, {})

Changed in maas:
status:	Incomplete → Confirmed

Julian Edwards (julian-edwards) wrote on 2014-11-19:

#15

(That last comment is not useful at this time as juju had released the node)

Julian Edwards (julian-edwards) wrote on 2014-11-19:

#16

Ok we've got to the bottom of this.

When a node makes a PXE request, we record the NIC that did that as pxe_mac on the Node. This is used as the NIC for which we allocate a static IP.

If it then boots from a different NIC, the previous NIC will get the static IP! The pxe_mac is reset, but by then it's too late.
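
The sequence described above can be modelled with a toy class to make the ordering problem concrete. This is a simplified sketch, not the MAAS data model: the `Node` class, its methods, the MAC addresses and the IP are all placeholders invented for illustration; only the `pxe_mac` name comes from the comment.

```python
class Node:
    """Toy model of a node with several NICs (hypothetical, not MAAS code)."""

    def __init__(self, macs):
        self.macs = list(macs)
        self.pxe_mac = None      # NIC that made the most recent PXE request
        self.static_ips = {}     # mac -> static IP

    def record_pxe_request(self, mac):
        self.pxe_mac = mac

    def allocate_static_ip(self, ip):
        # The static IP is bound to whichever NIC PXE booted last.
        self.static_ips[self.pxe_mac] = ip
        return self.pxe_mac


NIC_A = "52:54:00:00:00:01"  # placeholder MACs
NIC_B = "52:54:00:00:00:02"

node = Node([NIC_A, NIC_B])
node.record_pxe_request(NIC_A)                   # earlier boot used NIC A
bound_to = node.allocate_static_ip("192.0.2.3")  # static IP goes to NIC A
node.record_pxe_request(NIC_B)                   # next boot comes up on NIC B

# pxe_mac now points at NIC B, but the static IP still belongs to NIC A.
print(bound_to == NIC_A, node.pxe_mac in node.static_ips)
```

The last line shows the mismatch: the NIC that actually booted has no static IP, while the address MAAS publishes belongs to a NIC that is not in use.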

Julian Edwards (julian-edwards) wrote on 2014-11-19:

#17

So, it turns out that you cannot pre-set the NIC that should netboot in virt-manager. The solution suggested is to have:

1. a global setting that says whether to assign IPs to all NICs (on managed networks) or just the PXE boot NIC
2. a per-node setting that overrides this

We will arrange for the Landscape guys to carry a custom patch until this fix is available in a package.
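
A rough sketch of how the proposed two-level setting could resolve, with the per-node value overriding the global one. Nothing here reflects an actual MAAS setting: the enum `IPAssignment`, the function `macs_to_assign` and its parameters are hypothetical names chosen for this example, and `node_macs` is assumed to be already filtered to NICs on managed networks.

```python
from enum import Enum


class IPAssignment(Enum):
    ALL_MANAGED_NICS = "all"
    PXE_NIC_ONLY = "pxe"


def macs_to_assign(node_macs, pxe_mac, global_policy, node_policy=None):
    """Return the MACs that should receive a static IP.

    node_policy, when set, overrides global_policy.
    """
    policy = node_policy if node_policy is not None else global_policy
    if policy is IPAssignment.ALL_MANAGED_NICS:
        return list(node_macs)
    return [pxe_mac] if pxe_mac in node_macs else []


macs = ["52:54:00:00:00:01", "52:54:00:00:00:02"]  # placeholder MACs

# Global default is PXE NIC only; one node opts in to all NICs.
print(macs_to_assign(macs, macs[0], IPAssignment.PXE_NIC_ONLY))
print(macs_to_assign(macs, macs[0], IPAssignment.PXE_NIC_ONLY,
                     node_policy=IPAssignment.ALL_MANAGED_NICS))
```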

Julian Edwards (julian-edwards) on 2014-11-19

summary:

- 1.7: maas handed out incorrect address
+ virtual nodes don't always PXE boot on the same NIC

Graham Binns (gmb) wrote on 2014-11-20:

#18

So, after some hacking, we came up with a partial fix — but only a
partial one — for this. We can assign both NICs static IPs and write the
hostmaps out just fine. DNS is a bit trickier — it needs a refactoring
since the DNS Zone config uses `hostname` as a dict key, so you can only
have one entry per host at the moment — but not a blocker.

However, this uncovers several other very problematic failure states:

- We don't currently (AIUI) bring up all the interfaces on a node on
   boot; only one is auto-configured, so even if we assign IPs to all
   the NICs on a node, only one will come up. However, Juju may try the
   IP address for which the NIC hasn't come up, in which case it will
   fail.
- Similarly, we don't guarantee that the networks of all the NICs will
   be routable from Juju's point of view (this doesn't fit the immediate
   case, since both NICs are on the same network here). The upshot is
   that having DNS entries for the node that point at IPs on different
   subnets could lead to Juju using the wrong address, and so failing.

We discussed moving the allocation of static IP and the writing of the
DHCP host map until after the node has finished deploying (i.e. once it
turns off netboot). However, this doesn't solve the problem, because we
can't guarantee that the node will use the correct interface when it
reboots, and we're back in the same state as we were before.

Realistically, the only correct way for us to fix this is to get the
udev rules right to force the correct NIC to be the primary NIC *every*
time the node boots. That's not quick work, however.

Changed in maas:
status:	Confirmed → Triaged
importance:	Undecided → Low
milestone:	1.7.1 → next

Dustin Kirkland  (kirkland) on 2014-12-16

Changed in maas:
importance:	Low → Critical
tags:	added: orange-box

Dustin Kirkland  (kirkland) wrote on 2014-12-16:

#19

@raharper, what, if anything, could we do from the libvirt/qemu side of things to predictably select a PXE device?

Ryan Harper (raharper) wrote on 2014-12-16:

#20

I spent quite a bit of time looking into whether there was something we could do. The current state of PXE control in QEMU isn't ideal. You can disable loading of the option ROM, which prevents any NIC of that *type* from PXE booting. However, there is no control on a per-NIC basis.

The other effect of loading a pxe rom, is that *all* nics of a type (virtio, e1000, rtl8139, etc) will be tried. They will be tried in the order they appear on the PCI bus. The order is static.

From what I understand, the unpredictability doesn't stem directly from QEMU/libvirt: as I said, QEMU attempts PXE boot in exactly the same way every single time. Rather, the relatively short time each netboot attempt runs is the cause of the variability seen by MAAS PXE/DHCP. The first NIC issues a bootp/dhcp broadcast request and, after a few seconds with no response, the ROM tries the next NIC. In the scenario above it's likely that the first NIC times out before it can get a response from MAAS, and QEMU switches to the next NIC.
The performance/timeout of the bridge reacting to a new device (when KVM launches it adds the guest nic to the software bridge, it has a timeout before packets get forwarded to the devices on the bridge) could be tuned to reduce/delay this, but the problem could still show up on a heavily loaded MAAS when PXE/DHCP is slow to respond.
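
The timing argument above can be illustrated with a toy model. This is only a sketch under stated assumptions, not a measurement of QEMU or the PXE ROM: the function `pxe_boot_nic`, the 5-second per-NIC timeout, the bridge delay and the DHCP latency are all made-up values chosen to show how the same fixed NIC order can yield different boot NICs.

```python
def pxe_boot_nic(nics_in_pci_order, rom_timeout, bridge_ready_at, dhcp_latency):
    """Return the first NIC that gets a DHCP reply inside its retry window.

    All times are in seconds from VM start (assumed, illustrative values):
      rom_timeout      how long the ROM waits on each NIC before moving on
      bridge_ready_at  when the bridge starts forwarding the guest's packets
      dhcp_latency     how long the DHCP server takes to answer
    """
    clock = 0.0
    for nic in nics_in_pci_order:
        window_end = clock + rom_timeout
        # A request only gets through once the bridge is forwarding.
        earliest_reply = max(clock, bridge_ready_at) + dhcp_latency
        if earliest_reply <= window_end:
            return nic
        clock = window_end
    return None


nics = ["nic0", "nic1"]
print(pxe_boot_nic(nics, rom_timeout=5, bridge_ready_at=0, dhcp_latency=1))
print(pxe_boot_nic(nics, rom_timeout=5, bridge_ready_at=6, dhcp_latency=1))
print(pxe_boot_nic(nics, rom_timeout=5, bridge_ready_at=30, dhcp_latency=1))
```

With these numbers the first call boots from `nic0`, the second falls through to `nic1`, and the third fails on both, even though the NIC order never changes.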

I've not looked at adjusting the PXE rom timeouts; there are no command line tunables, so it would likely involve generating custom PXE roms (or updating the default, and large default timeouts punish everyone).

For Orange box specifically, I suggest making the following changes to the KVM guest VM to handle this issue:

http://paste.ubuntu.com/9349582/

The changes switch the first NIC to virtio (instead of rtl8139; unrelated, but a better, faster choice), switch the second NIC to e1000, and add the <rom file=''> directive, which disables PXE booting on all e1000 NICs. The VM will now PXE boot only from the virtio NIC. If the bridge is slow or MAAS is busy, the VM may fail to PXE boot at all. This may or may not be more desirable from an Orange Box perspective.
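
For readers without access to the paste, here is a rough sketch of what the two interface definitions described above might look like, generated with Python's standard library. It is not the content of the paste: the helper `interface`, the bridge name `br0` and the overall element layout are assumptions; only the virtio/e1000 split and the empty `rom file` attribute come from the description.

```python
import xml.etree.ElementTree as ET


def interface(bridge, model, disable_pxe=False):
    """Build a libvirt-style <interface> element (illustrative layout)."""
    iface = ET.Element("interface", type="bridge")
    ET.SubElement(iface, "source", bridge=bridge)
    ET.SubElement(iface, "model", type=model)
    if disable_pxe:
        # Empty ROM file: no option ROM is loaded, so this NIC type cannot PXE boot.
        ET.SubElement(iface, "rom", file="")
    return iface


first_nic = interface("br0", "virtio")                     # the only PXE-capable NIC
second_nic = interface("br0", "e1000", disable_pxe=True)   # PXE disabled

for nic in (first_nic, second_nic):
    print(ET.tostring(nic, encoding="unicode"))
```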

David Britton (dpb) wrote on 2014-12-16: Re: [Bug 1367482] Re: virtual nodes don't always PXE boot on the same NIC

#21

On Tue, Dec 16, 2014 at 04:50:18PM -0000, Ryan Harper wrote:
> I spent quite a bit of time to see if there was something to do. The
> current state of pxe control in QEMU isn't ideal. You can disable
> loading of the option rom which prevents any nic of that *type* from pxe
> booting. However, there is no control over a per-nic basis.

@rharper -- didn't you also say something about a bridge timeout?

>
> The changes included switch the first nic to use virtio (instead of
> rtl8139, unrelated but a better, faster choice), and then switch the
> second nic to be e1000, and then include the <rom file=''> directive
> which disables pxe booting on all e1000 nics. The VM now will pxe boot
> only from the virtio nic. If the bridge is slow or maas is busy, then
> the VM may not successfully pxe boot. This may or may not be more
> desirable from an Orange Box perspective.
>

It should put us into a 'fast-fail' situation, and remove a
false-positive (i.e., the node looks fine in MAAS, but is not reachable
via name or IP listed in maas).

If we are using juju -- the problem gets detected as a timeout
eventually (also not ideal). After switching to this, it would get
detected earlier as a 409, and a 'failed deployment' in the MAAS GUI.

I think all-in-all it's a good change for the orange box as it will
allow us to more precisely detect when the problem occurs.

Thanks for this nice write-up, btw.

--
David Britton <email address hidden>

Ryan Harper (raharper) wrote on 2014-12-16:

#22

On Tue, Dec 16, 2014 at 11:08 AM, David Britton <<email address hidden>
> wrote:

> On Tue, Dec 16, 2014 at 04:50:18PM -0000, Ryan Harper wrote:
> > I spent quite a bit of time to see if there was something to do. The
> > current state of pxe control in QEMU isn't ideal. You can disable
> > loading of the option rom which prevents any nic of that *type* from pxe
> > booting. However, there is no control over a per-nic basis.
>
> @rharper -- didn't you also say something about a bridge timeout?
>

Yes, I've not done the benchmarking to determine if this resolves the nic
timeout with MaaS,
but here's some info on bridge forward_delay, defaults and how to modify.

The virbr0 bridge on systems with libvirt installed set the forward_delay
value to 2.0 (seconds).

% brctl showstp virbr0
virbr0
bridge id 8000.000000000000
designated root 8000.000000000000
root port 0 path cost 0
max age 20.00 bridge max age 20.00
hello time 2.00 bridge hello time 2.00
forward delay 2.00 bridge forward delay 2.00
ageing time 300.00
hello timer 0.73 tcn timer 0.00
topology change timer 0.00 gc timer 243.26
flags

However, the default for bridges is much higher:

% sudo brctl addbr testbr
(foudres) ~ % sudo brctl showstp testbr
testbr
bridge id 8000.000000000000
designated root 8000.000000000000
root port 0 path cost 0
max age 20.00 bridge max age 20.00
hello time 2.00 bridge hello time 2.00
forward delay 15.00 bridge forward delay 15.00
ageing time 300.00
hello timer 0.00 tcn timer 0.00
topology change timer 0.00 gc timer 0.00
flags

You can lower this value to 2 or 0. For reference, forwarding delay[1]
is the time spent in each of the Listening and Learning states before
the Forwarding state is entered. This delay is so that when a new bridge
comes onto a busy network it looks at some traffic before participating.

% sudo brctl setfd testbr 0
(foudres) ~ % sudo brctl showstp testbr
testbr
bridge id 8000.000000000000
designated root 8000.000000000000
root port 0 path cost 0
max age 20.00 bridge max age 20.00
hello time 2.00 bridge hello time 2.00
forward delay 0.00 bridge forward delay 0.00
ageing time 300.00
hello timer 0.00 tcn timer 0.00
topology change timer 0.00 gc timer 0.00
flags

If the bridge the KVM VM is on is also on public networks, it's possible
that lowering this value could cause issues[2].

1.
http://www.linuxfoundation.org/collaborate/workgroups/networking/bridge#Forwarding_delay
2.
http://www.microhowto.info/troubleshooting/troubleshooting_ethernet_bridging_on_linux.html#idp212224
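
To check the value on several bridges without reading the output by eye, the `brctl showstp` text can be parsed. This is a minimal sketch: the helper names are made up, the regular expression assumes the output layout shown above, and the doubling simply follows the description that the delay is spent in each of the Listening and Learning states (actual behaviour may differ, for example when STP is disabled).

```python
import re


def forward_delay_seconds(showstp_output):
    """Extract the configured bridge forward delay from `brctl showstp` text."""
    match = re.search(r"bridge forward delay\s+([\d.]+)", showstp_output)
    return float(match.group(1)) if match else None


def listening_plus_learning(showstp_output):
    """Delay before forwarding, assuming both states last one forward delay."""
    delay = forward_delay_seconds(showstp_output)
    return None if delay is None else 2 * delay


# Lines copied from the two outputs above.
VIRBR0 = "forward delay 2.00 bridge forward delay 2.00"
TESTBR = "forward delay 15.00 bridge forward delay 15.00"

print(forward_delay_seconds(VIRBR0), listening_plus_learning(VIRBR0))
print(forward_delay_seconds(TESTBR), listening_plus_learning(TESTBR))
```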

Changed in orange-box:
assignee:	nobody → Darryl Weaver (dweaver)
status:	New → In Progress
importance:	Undecided → Medium
status:	In Progress → Fix Committed

Changed in maas:
status:	Triaged → Invalid
importance:	Critical → High
milestone:	next → none