no GARPs during ephemeral boot

Bug #1677668 reported by Sam Lee
This bug affects 2 people
Affects              Status      Importance  Assigned to  Milestone
MAAS                 Invalid     Undecided   Unassigned
cloud-init (Ubuntu)  Incomplete  Wishlist    Unassigned
isc-dhcp (Ubuntu)    Confirmed   Undecided   Unassigned

Bug Description

Deploys time out with an error on the console that says,

"Can not apply stage final, no datasource found! Likely bad things to come!"

How to duplicate:
MAAS Version 2.1.3+bzr5573-0ubuntu1 (16.04.1)
1) Rack Controller and Region Controller in different VLANs
2) Use Cisco ASA as the router with "ARP Inspection" enabled
3) Clear the router ARP cache
4) Deploy 2 maas machines with interfaces set to "Static assign"
5) Observe deploys successfully
6) Release both machines and swap their IPs.
7) Redeploy the same 2 machines
8) Observe deploy failure, with the machine consoles stuck at the "ubuntu login" screen showing "Can not apply stage final, no datasource found! Likely bad things to come!"

The root cause is that no GARPs are sent during ephemeral PXE booting. In our environment this causes our router (Cisco ASA) to hold on to ARP table entries until they expire (default = 4 hours). Combined with the ASA "ARP Inspection" feature, the router then drops packets from a MaaS machine that is using an IP previously used by a different MaaS machine.

The ephemeral boot image is ephemeral-ubuntu-amd64-ga-16.04-xenial-daily.

Running tcpdump on the Rack Controller showed no GARPs from the deploying MaaS machine. If GARPs had been sent, the router would have refreshed its ARP cache, thus avoiding the ARP Inspection drops.
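
For anyone repeating the capture, a GARP can be recognised mechanically: it is an ARP packet whose sender and target IP addresses are the same. The following is a minimal sketch in Python, not part of the original report. The function name is hypothetical, and it assumes untagged Ethernet frames carrying IPv4-over-Ethernet ARP (VLAN-tagged frames are not handled).

```python
import struct

ETHERTYPE_ARP = 0x0806


def is_gratuitous_arp(frame: bytes) -> bool:
    """Return True if a raw Ethernet frame carries a gratuitous ARP.

    Assumes an untagged frame: 14-byte Ethernet header followed by a
    28-byte IPv4-over-Ethernet ARP body.
    """
    if len(frame) < 42:
        return False
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != ETHERTYPE_ARP:
        return False
    # ARP fixed header: hardware type, protocol type, address lengths, opcode
    htype, ptype, hlen, plen, oper = struct.unpack("!HHBBH", frame[14:22])
    if (htype, ptype, hlen, plen) != (1, 0x0800, 6, 4):
        return False
    sender_ip = frame[28:32]
    target_ip = frame[38:42]
    # Request (1) or reply (2) announcing the sender's own address
    return oper in (1, 2) and sender_ip == target_ip
```

Running every captured frame through a check like this and finding no match for the deploying machine's MAC would correspond to the tcpdump observation above.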

[Excerpt from Cisco ASA]
http://www.cisco.com/c/en/us/td/docs/security/asa/asa94/config-guides/cli/general/asa-94-general-config/basic-arp-mac.pdf
When you enable ARP inspection, the ASA compares the MAC address, IP address, and source interface in
all ARP packets to static entries in the ARP table, and takes the following actions:
• If the IP address, MAC address, and source interface match an ARP entry, the packet is passed through.
• If there is a mismatch between the MAC address, the IP address, or the interface, then the ASA drops
the packet.
• If the ARP packet does not match any entries in the static ARP table, then you can set the ASA to either
forward the packet out all interfaces (flood), or to drop the packet.
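
The three rules in the excerpt can be written down as a small decision function. This is only an illustrative sketch in Python of the quoted text, not of the ASA implementation: the names are hypothetical, and it assumes the static table is looked up by IP address.

```python
from typing import Dict, NamedTuple


class ArpEntry(NamedTuple):
    mac: str
    interface: str


def inspect_arp(static_table: Dict[str, ArpEntry], ip: str, mac: str,
                interface: str, flood_on_miss: bool = True) -> str:
    """Return 'pass', 'drop' or 'flood' for an ARP packet, per the excerpt."""
    entry = static_table.get(ip)
    if entry is None:
        # No static entry at all: behaviour is configurable (flood or drop)
        return "flood" if flood_on_miss else "drop"
    if entry.mac == mac and entry.interface == interface:
        # IP, MAC and source interface all match the static entry
        return "pass"
    # Entry exists but MAC or interface differs
    return "drop"
```

With a static entry mapping 10.1.1.11 to MAC 22:22, an ARP packet for 10.1.1.11 from MAC 11:11 would be dropped under these rules.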

Sam Lee (samlee) wrote :

Forgot to mention that we didn't want to "Static assign" IPs in MaaS. We prefer using "Auto assign", but observed that MaaS will sometimes reuse an IP previously used by a different MaaS machine. Using "Static assign", however, we can reliably work around the issue (or, in this ticket's case, force a failure on demand).

Christian Ehrhardt  (paelzer) wrote :

Hi Sam,
to get this right: the machine PXE boots correctly and starts the ephemeral image with cloud-init, which eventually produces the message.

So with the wrong ARP entry in place, how did it get that far?
That is important when considering where the fix should be implemented.

PXE is essentially DHCP + TFTP info, so the machine gets the new IP and TFTP info in the DHCP response.
At the same time, the router still thinks that IP belongs to the other MAC, where it was in use before, right?
Then, when the datasource is probed over IP, the machine gets no reply because the router still holds the old state, and eventually we see the reported message.
The TFTP fetch should already be IP based, so why isn't that failing?

Surely DS-i / cloud-init could force a GARP somehow, but shouldn't it be the network stack's job (bug or feature) to announce itself with a GARP as soon as it gets an IP, rather than a cloud-init bug later on?

Maybe I'm not deeply enough involved - not marking incomplete/triaged so Scott still looks at it later today.

Sam Lee (samlee) wrote :

Hi Chris,

Some new clarifications are in order. Please disregard the "ARP Inspection" claim. That feature wasn't even enabled.

Here's a very simplified drawing of the setup.

                    +----------------------------------+    +--------------------+
                    |              ROUTER              |    | ARP CACHE          |
                    |                                  |    | (expires 4 hours)  |
                    |                                  |    | 10.1.1.11   22:22  |
                    |                                  |    | 10.1.2.100  33:33  |
                    +----------------------------------+    +--------------------+
                     |                               |
                     |                               |
    +--------------------------------+      +------------------+
    |            SWITCH A            |      |     SWITCH B     |
    +--------------------------------+      +------------------+
         |                     |                     |
         |                     |                     |
+------------------+  +------------------+  +------------------+
| MAAS MACHINE 2   |  | MAAS MACHINE 1   |  | REGION CTLR      |
|                  |  | 10.1.1.11        |  | 10.1.2.100       |
|                  |  | 255.255.255.0    |  | 255.255.255.0    |
| MAC 22:22        |  | MAC 11:11        |  | MAC 33:33        |
+------------------+  +------------------+  +------------------+

1) Assume Machine #2 was last deployed and then released within the past 4 hours, using the IP 10.1.1.11. The router therefore already has an ARP entry in its cache mapping 10.1.1.11 to MAC 22:22.
2) Machine #1 is starting Deployment and happens to receive 10.1.1.11 from the Controller to use as its ephemeral PXE IP.
3) Machine #1 sends a packet to 10.1.2.100:5240
4) Controller sees the packet from 10.1.1.11
5) Controller responds to 10.1.1.11
6) Machine #1 never sees the response packet

We suspect the response packet was sent to Machine #2. We are actively parsing the pcap data to confirm.
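
The suspected sequence can be modelled with a toy ARP cache. The sketch below is a deliberately simplified illustration in Python, not a description of how the ASA works: the class and method names are hypothetical, the clock is passed in explicitly, and entries are only refreshed when the router learns a mapping (for example from a GARP).

```python
from typing import Dict, Optional, Tuple

ARP_TIMEOUT = 4 * 60 * 60  # seconds; the 4-hour expiry mentioned above


class RouterArpCache:
    """Toy model of a router ARP cache with a fixed expiry."""

    def __init__(self, timeout: int = ARP_TIMEOUT) -> None:
        self.timeout = timeout
        self._entries: Dict[str, Tuple[str, int]] = {}  # ip -> (mac, learned_at)

    def learn(self, ip: str, mac: str, now: int) -> None:
        """Record or refresh a mapping, e.g. on receiving a gratuitous ARP."""
        self._entries[ip] = (mac, now)

    def next_hop_mac(self, ip: str, now: int) -> Optional[str]:
        """Return the cached MAC for ip, or None if absent or expired."""
        entry = self._entries.get(ip)
        if entry is None:
            return None
        mac, learned_at = entry
        if now - learned_at >= self.timeout:
            del self._entries[ip]
            return None
        return mac


cache = RouterArpCache()
cache.learn("10.1.1.11", "22:22", now=0)            # Machine #2 deployed earlier

# One hour later Machine #1 boots with 10.1.1.11 but sends no GARP:
stale = cache.next_hop_mac("10.1.1.11", now=3600)   # reply goes to Machine #2

# What a GARP from Machine #1 would have done:
cache.learn("10.1.1.11", "11:11", now=3600)
fresh = cache.next_hop_mac("10.1.1.11", now=3600)   # reply goes to Machine #1
print(stale, fresh)
```

In this model the Controller's response in step 5 is forwarded to MAC 22:22 until either the entry expires or something teaches the router the new mapping.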

Sam Lee (samlee) wrote :

yikes! that did not format well...and I can't edit my own comment. Let me try again...

Sam Lee (samlee) wrote :

attached pic

Sam Lee (samlee) wrote :

I forgot to mention that the Region and Rack Controllers are in separate VLANs. The TFTP conversation happens between the Rack Controller (DHCP/TFTP) and the Machine, which both live on the same subnet, so the router's ARP cache is not a factor.

Christian Ehrhardt  (paelzer) wrote :

Hi Sam,
so in your picture we should add a RACK controller on Switch A, and DHCP/TFTP goes there, right?

And it is the DHCP server on this RACK controller that hands out 10.1.1.11 (again), right?

With that, your description from comment #3 now looks like this:
# I added a RACK controller with 10.1.1.100 - please confirm if that matches and refresh your pic.

1) Assume Machine #2 was last deployed and then released within the past 4 hours, using the IP 10.1.1.11. The router therefore already has an ARP entry in its cache mapping 10.1.1.11 to MAC 22:22.
2) Machine #1 is starting Deployment and happens to receive 10.1.1.11 by DHCP from the RACK Controller to use as its ephemeral PXE IP.
3) TFTP communication is with the RACK controller on 10.1.1.100 and works
4) Machine #1 sends a packet to 10.1.2.100:5240 for websocket feedback
5) REGION Controller sees the packet from 10.1.1.11
6) REGION Controller responds to 10.1.1.11
7) Machine #1 never sees the response packet

Is that a proper summary now?

Also, I thought that most DHCP clients already send a GARP when receiving an IP via DHCP.
In your setup the DHCP client is in some way PXE + boot into cloud-init - I wonder if that is what we miss here.

Nevertheless, without knowing otherwise, I'd expect the PXE bit to be responsible for sending them in this scenario. If not that, then the DHCP server, which "knows" it just changed something. We might force-fix it with cloud-init, but I at least want to understand why there is no GARP from the other, more obvious sources.

Sam Lee (samlee) wrote :

Hi Chris, Yes you are correct, and attached updated pic.

I don't disagree that the PXE/DHCP client should be sending GARPs, but shouldn't any OS that binds to an IP send a GARP as part of its TCP/IP stack initialization? That is, shouldn't the ephemeral boot image itself send a GARP (independent of whether there was one from the PXE client)?

Felipe Reyes (freyes)
tags: added: sts
Jay Vosburgh (jvosburgh) wrote :

Sam / Christian,

This sort of issue is not unheard of in cases where an IP address moves from interface to interface, or between hosts. Most situations that expect this type of issue (e.g., bonding link failover) already issue gratuitous ARPs in order to update L2 peers.

I think the bottom line here is that dhclient (the DHCP client typically used on Ubuntu, and presumably the one in use here) does not implement RFC 5227, "IPv4 Address Conflict Detection," which describes how gratuitous ARPs must be done if the host provides that functionality (5227 2.3, "Announcing an Address").

Most network configuration tools (and the kernel itself) on Linux do not issue gratuitous ARPs by default at address assignment time, so this lack isn't especially unusual. E.g., there is no option in /etc/network/interfaces to instruct ifup to issue a GARP.

I'll note that 5227 is a proposed standard, and, as such, hosts are not required to implement it, so dhclient is not violating any standards by not issuing gratuitous ARPs.
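
For reference, the announcement described in RFC 5227 section 2.3 is a broadcast ARP Request in which the sender and target IP address fields both carry the address being announced. The following is a minimal sketch in Python of building such a frame. It is not from dhclient or any Ubuntu package; the function name is hypothetical, and it assumes untagged Ethernet and IPv4.

```python
import socket
import struct


def build_arp_announcement(mac: str, ip: str) -> bytes:
    """Build an Ethernet frame carrying an ARP announcement for ip.

    mac is the announcing interface's address, e.g. "02:00:00:00:00:11".
    """
    hw = bytes.fromhex(mac.replace(":", ""))
    if len(hw) != 6:
        raise ValueError("expected a 6-byte MAC address")
    addr = socket.inet_aton(ip)

    # Ethernet header: broadcast destination, our source, EtherType ARP
    eth = b"\xff" * 6 + hw + struct.pack("!H", 0x0806)
    # ARP header: Ethernet (1), IPv4 (0x0800), lengths 6 and 4, opcode 1 = request
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    # sender MAC, sender IP, target MAC (zeroed), target IP (same as sender IP)
    arp += hw + addr + b"\x00" * 6 + addr
    return eth + arp
```

Actually transmitting the frame is a separate step; on Linux that would typically mean a raw `AF_PACKET` socket bound to the interface, which requires root privileges, or an existing tool such as arping.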

Now, none of the above actually resolves the problem here, it just explains that you've landed in a corner case that doesn't come up very often.

As far as resolving this, one obvious possibility is to add RFC 5227 functionality to dhclient through its dhclient-script facility (and in fact the man page for that is close to suggesting that: for the BOUND case, the script should "somehow" perform duplicate address detection via ARP).

I'm not too familiar with cloud-init's internals, but for 5227 compliance, the GARP would be issued on every boot, and cloud-init only runs on first boot, so an implementation within cloud-init would likely be setting up some persistent configuration.

Sam Lee (samlee) wrote :

In our case, we don't need a GARP on every boot. We only need it during the MaaS Deploy stage, where the MaaS ephemeral boot image is trying to communicate with the MaaS region controller (in a different VLAN).

The irony is, even if there were a way to add our own GARP instructions to the cloud-init config, the region controller would have no way of sending the commands to the MaaS machine.

Christian Ehrhardt  (paelzer) wrote :

First of all, thanks Jay for the great in-depth insight!

As Sam added, cloud-init can't do a lot here since it hasn't received a config by this stage.
The only thing it could consider doing is to always trigger a GARP unconditionally.

But that would:
1. offend the cloud-init design, which is that it does what it is told (no more, no less)
2. add a dependency on a tool like arping to the image so that cloud-init can issue the GARP

So, one step back - an implementation in dhclient as a script:
1. this would still have to be unconditional, but at least be bound to DHCP
2. it would have the same need for arping (or similar) in the image
  2.1 an alternative to the dependency would be to re-implement it, but redundancy is always worse

I wonder if the following would be an option:
You already get your IP address, so you get your DHCP reply; it's just that the environment doesn't realize you moved. IIRC DHCP can transport extra options, which would get rid of the "unconditional" thing. Could DHCP grow a feature to understand a "dhcp-please-garp" option? Handling this option could then be done in a dhclient script, and on the other side the MAAS DHCP server could present this option.

That at least would
1. make it conditional, only where requested
2. give MAAS control, since it is the DHCP server
3. not affect environments that do not provide this option (for SRU-ability)
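
To make the idea concrete, here is a rough sketch in Python of the decision such a dhclient hook would have to make. It rests on several assumptions: the "dhcp-please-garp" option does not exist today, so the `new_dhcp_please_garp` variable is hypothetical, as is the function name; `reason`, `interface` and `new_ip_address` are variables dhclient-script provides; and the command line assumes the iputils flavour of arping (`-U` unsolicited mode, `-c` count, `-I` interface).

```python
from typing import List, Mapping, Optional

# dhclient-script reasons where an address has just been configured
ANNOUNCE_REASONS = {"BOUND", "REBOOT"}
# Hypothetical variable for the proposed "dhcp-please-garp" option
PLEASE_GARP_VAR = "new_dhcp_please_garp"


def garp_command(env: Mapping[str, str], count: int = 2) -> Optional[List[str]]:
    """Return the arping command to run, or None if no GARP is requested."""
    if env.get("reason") not in ANNOUNCE_REASONS:
        return None
    # Only act when the DHCP server explicitly asked for it
    if env.get(PLEASE_GARP_VAR, "").lower() not in ("1", "true", "on"):
        return None
    interface = env.get("interface")
    address = env.get("new_ip_address")
    if not interface or not address:
        return None
    return ["arping", "-U", "-c", str(count), "-I", interface, address]
```

Since dhclient hooks are shell fragments, a real implementation would more likely express the same checks directly in shell; the point here is only that environments whose DHCP server never sends the option see no change in behaviour.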

Making the cloud-init task a wishlist item, since it seems more a feature request than a bug there and, as outlined above, is not really fixable there. Adding a dhclient task instead.

All of this has to take into account whether systemd-networkd will increasingly replace dhclient. If so, that has to be considered as well, or this will regress as soon as things are switched.

Changed in cloud-init (Ubuntu):
status: New → Incomplete
importance: Undecided → Wishlist
David Andruczyk (dandruczyk) wrote :

This causes problems for me as well during MAAS re-imaging with MAAS 2.9.2. See https://discourse.maas.io/t/changing-ips-and-lack-of-gratuitous-arp-and-the-pain-it-causes/4800

Ideally, when PXE booting, pxelinux.0 should send a gratuitous ARP, and in theory that should solve the issue. Perhaps I'm mistaken...

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in isc-dhcp (Ubuntu):
status: New → Confirmed
Björn Tillenius (bjornt) wrote :

I'm marking this bug as incomplete for MAAS, since it's not clear what actually needs to be fixed in MAAS. It seems like this needs to be fixed at a lower level.

Changed in maas:
status: New → Incomplete
tags: added: se-00140843
Mauricio Faria de Oliveira (mfo) wrote :

Marking bug as Invalid for MAAS as per previous comment, incomplete status since, and input from Björn and Jerzy.

Changed in maas:
status: Incomplete → Invalid