Machines fail to deploy because cloud-init needs to accept both netplan spellings for grat arp
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
MAAS | Fix Released | Undecided | Andres Rodriguez | 2.6.0beta3
cloud-init | Fix Released | Medium | Ryan Harper |
curtin | Invalid | Undecided | Unassigned |
Bug Description
Many nodes failed to boot after installation.
Here is one example, beartic.
beartic.
finishes install: 2019-05-
dhcp's after reboot:
10.244.
10.244.
10.244.
10.244.
grub and grub.cfg:
10.244.
10.244.
10.244.
10.244.
10.244.
10.244.
10.244.
10.244.
10.244.
However, we never received any rsyslog messages or API calls after that.
Related branches
- Server Team CI bot: Needs Fixing (continuous-integration) on 2019-06-04
- cloud-init Commiters: Pending requested 2019-06-04
- Diff: 767 lines (+204/-315), 18 files modified
cloudinit/config/cc_growpart.py (+2/-1)
cloudinit/config/cc_resizefs.py (+3/-3)
cloudinit/config/cc_ubuntu_advantage.py (+1/-1)
cloudinit/net/network_state.py (+8/-0)
cloudinit/sources/DataSourceNoCloud.py (+23/-17)
cloudinit/util.py (+13/-9)
config/cloud.cfg.tmpl (+2/-2)
debian/changelog (+7/-0)
debian/patches/ubuntu-advantage-revert-tip.patch (+5/-255)
tests/unittests/test_datasource/test_azure.py (+0/-24)
tests/unittests/test_datasource/test_nocloud.py (+18/-0)
tests/unittests/test_distros/test_freebsd.py (+45/-0)
tests/unittests/test_ds_identify.py (+20/-0)
tests/unittests/test_handler/test_handler_resizefs.py (+1/-1)
tests/unittests/test_net.py (+46/-0)
tools/ds-identify (+8/-0)
tools/render-cloudcfg (+1/-1)
tools/run-container (+1/-1)
- Server Team CI bot: Approve (continuous-integration) on 2019-05-09
- Chad Smith: Approve on 2019-05-09
- Dan Watkins: Approve on 2019-05-03
- Diff: 82 lines (+54/-0), 2 files modified
cloudinit/net/network_state.py (+8/-0)
tests/unittests/test_net.py (+46/-0)
- Andres Rodriguez (community): Approve on 2019-05-03
- Jason Hobbs (community): Approve on 2019-05-03
- Diff: 48 lines (+9/-5), 2 files modified
src/maasserver/tests/test_preseed_network.py (+4/-2)
src/provisioningserver/utils/netplan.py (+5/-3)
Jason Hobbs (jason-hobbs) wrote : | #1 |
Andres Rodriguez (andreserl) wrote : | #2 |
Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1827238] Re: 2.6beta2: many nodes failed deployment with time out | #3 |
On Wed, May 1, 2019 at 11:40 AM Andres Rodriguez <email address hidden>
wrote:
> So this is what I see on the logs:
>
> 1. On rackd.log on .32, I see the machine PXE boot to start the
> deployment process:
>
> 2019-05-01 10:32:33 provisioningser
> bootx64.efi requested by 10.244.41.7
> 2019-05-01 10:32:33 provisioningser
> bootx64.efi requested by 10.244.41.7
> 2019-05-01 10:32:33 provisioningser
> grubx64.efi requested by 10.244.41.7
> 2019-05-01 10:32:34 provisioningser
> /grub/x86_
> 2019-05-01 10:32:34 provisioningser
> /grub/x86_
> 2019-05-01 10:32:34 provisioningser
> /grub/x86_
> 2019-05-01 10:32:34 provisioningser
> /grub/x86_
> 2019-05-01 10:32:34 provisioningser
> /grub/grub.cfg requested by 10.244.41.7
> 2019-05-01 10:32:34 provisioningser
> /grub/grub.
> 2019-05-01 10:32:34 provisioningser
> /images/
> 10.244.41.7
> 2019-05-01 10:32:36 provisioningser
> /images/
> 10.244.41.7
> 2019-05-01 10:32:58 provisioningser
> /images/
>
> 2. On rackd.log on .30, I see it pxe boot post-deployment (and its told
> to localboot):
>
> 2019-05-01 10:38:13 provisioningser
> bootx64.efi requested by 10.244.41.7
> 2019-05-01 10:38:13 provisioningser
> bootx64.efi requested by 10.244.41.7
> 2019-05-01 10:38:14 provisioningser
> grubx64.efi requested by 10.244.41.7
> 2019-05-01 10:38:15 provisioningser
> /grub/x86_
> 2019-05-01 10:38:15 provisioningser
> /grub/x86_
> 2019-05-01 10:38:15 provisioningser
> /grub/x86_
> 2019-05-01 10:38:15 provisioningser
> /grub/x86_
> 2019-05-01 10:38:15 provisioningser
> /grub/grub.cfg requested by 10.244.41.7
> 2019-05-01 10:38:15 provisioningser
> /grub/grub.
>
>
> 3. I see that curtin has run the deployment process and hasn't reported
> any errors - log: https:/
> config: https:/
>
> So, from all the information above, I don't think we have enough
> information to know what the issue is.
>
> A. The machine was never in...
Changed in maas: | |
status: | New → Incomplete |
I don't see any errors from curtin during this install. It installs GRUB 2 for UEFI, adds the boot entry (0017), and reorders the boot order back to network boot (0016).
Please reopen the curtin task if you believe there's a curtin error during the installation.
Changed in curtin: | |
status: | New → Invalid |
tags: | added: cdo-release-blocker |
Andres Rodriguez (andreserl) wrote : | #5 |
So there are no apparent issues from the MAAS perspective. We really need to see console logs to determine what the issue is. Keeping this as Incomplete for MAAS until console logs can be provided.
PS. You can add kernel params to the machine to log to console, and set up conserver-server to gather the information remotely and automatically, without manual intervention (once conserver-server has been set up, of course).
Blake Rouse (blake-rouse) wrote : | #6 |
Do you have a tcpdump of the network traffic? That is needed to determine what the issue is. We need to inspect the HTTP headers that are being sent from nginx to the client, specifically the "Content-Length" header, to see if there is a mismatch there.
It is possible that the correct Content-Length is being sent to the client, but the client is either closing the HTTP connection too soon or a TCP reset is occurring. That will still be logged as a 200 response on the server side, because the response status was 200. The difference will be whether the actual amount of data sent to the client matches the Content-Length header of the HTTP response.
Knowing whether those mismatch is the first step, because there are a few possible outcomes:
1. The server is reading the file incorrectly and setting a Content-Length that differs from the actual file length.
2. The server is sending the file and something causes the file handle to be closed, so streaming stops (I would expect an entry in the nginx error log if this were to occur, but maybe not).
3. The client is closing the connection before reading all the data being streamed from the server.
4. A TCP reset is occurring and breaking the TCP connection, giving the same result as #3 but caused by a TCP reset instead of a client disconnect.
Also the output of the whole tree structure of /var/lib/
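The Content-Length comparison described in this comment can be sketched from the client side as follows. This is only an illustration under stated assumptions: the function names `classify_transfer` and `fetch_and_check` are hypothetical, the URL passed in is whatever image URL the node would request, and a packet capture is still needed to tell a client disconnect from a TCP reset.

```python
import http.client
import urllib.request


def classify_transfer(declared_length, received_length):
    """Compare the Content-Length header with the bytes actually read."""
    if declared_length is None:
        return "unknown"  # no Content-Length header (e.g. chunked encoding)
    if received_length == declared_length:
        return "complete"
    if received_length < declared_length:
        return "truncated"  # early close or TCP reset (outcomes 2-4)
    return "oversized"  # more data than declared (header is wrong)


def fetch_and_check(url, chunk_size=65536):
    """Download url and report (declared, received, verdict)."""
    received = 0
    with urllib.request.urlopen(url) as response:
        header = response.headers.get("Content-Length")
        declared = int(header) if header is not None else None
        try:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                received += len(chunk)
        except (http.client.IncompleteRead, ConnectionResetError) as exc:
            # Count whatever arrived before the connection broke.
            received += len(getattr(exc, "partial", b""))
    return declared, received, classify_transfer(declared, received)
```

A "truncated" verdict alongside a 200 in the server log would match the scenario described above, where the status code alone hides the short transfer.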
Jason Hobbs (jason-hobbs) wrote : | #7 |
I'm struggling to find the right parameters to get a serial console. I did notice we get this error when booting after the install:
Booting local disk...
Failed to open \efi\boot\
Failed to load image \efi\boot\
start_image() returned Not Found
Jason Hobbs (jason-hobbs) wrote : | #8 |
console output: http://
[ 55.404228] cloud-init[1039]: self._handle_
[ 55.416254] cloud-init[1039]: File "/usr/lib/
[ 55.428150] cloud-init[1039]: item_params.
[ 55.440297] cloud-init[1039]: File "/usr/lib/
[ 55.452161] cloud-init[1039]: 'params': dict((v2key_
[ 55.464226] cloud-init[1039]: KeyError: 'gratuitous-arp'
Andres Rodriguez (andreserl) wrote : | #9 |
MAAS sends this snippet:
network:
  bonds:
    bond0:
      interfaces:
      - eth6
      - eth7
      macaddress: 00:11:0a:66:2e:24
      mtu: 9000
      parameters:
        down-delay: 0
        mode: active-backup
        up-delay: 0
Changed in maas: | |
status: | Incomplete → New |
Andres Rodriguez (andreserl) wrote : Re: Machines fail to deploy because cloud-init misspells gratuitous-arp as gratuitious-arp | #10 |
To document the issue: MAAS sends the key 'gratuitous-arp', following netplan; however, cloud-init misspells this key as 'gratuitious-arp'. MAAS will *work around* this issue until cloud-init fixes it.
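The failure mode can be illustrated with a minimal sketch. It assumes a strict translation table that knows only the misspelled key. The table contents, the internal names, the parameter value, and the names `KNOWN_BOND_PARAMS` and `translate_bond_params` are hypothetical stand-ins, not cloud-init's actual code.

```python
# Hypothetical translation table: netplan-style key -> internal name.
# Only the misspelled 'gratuitious-arp' is known, as described above.
KNOWN_BOND_PARAMS = {
    "mode": "bond-mode",
    "up-delay": "bond-updelay",
    "down-delay": "bond-downdelay",
    "gratuitious-arp": "bond-num-grat-arp",
}


def translate_bond_params(params):
    """Translate every key strictly; an unknown key raises KeyError."""
    return {KNOWN_BOND_PARAMS[key]: value for key, value in params.items()}


# Parameters using the correctly spelled key (illustrative value).
maas_params = {"mode": "active-backup", "gratuitous-arp": 1}

try:
    translate_bond_params(maas_params)
    error = None
except KeyError as exc:
    error = exc.args[0]

print(error)
```

Because the lookup is strict, a single unrecognised key aborts the whole network configuration, which is consistent with the KeyError seen in the console output.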
summary: |
- 2.6beta2: many nodes failed deployment with time out
+ Machines fail to deploy because cloud-init misspells gratuitous-arp as gratuitious-arp
Changed in maas: | |
milestone: | none → 2.6.0rc1 |
assignee: | nobody → Andres Rodriguez (andreserl) |
status: | New → In Progress |
Ryan Harper (raharper) wrote : | #11 |
cloud-init included the only (mis)spelling of gratuitous-arp, per
https:/
This is now fixed in netplan and released. Cloud-init needs to accept both values now.
summary: |
- Machines fail to deploy because cloud-init misspells gratuitous-arp as gratuitious-arp
+ Machines fail to deploy because cloud-init needs to accept both netplan spellings for grat arp
Changed in cloud-init: | |
importance: | Undecided → Medium |
status: | New → In Progress |
Changed in maas: | |
status: | In Progress → Fix Committed |
Ryan Harper (raharper) wrote : | #12 |
The misspelling is part of the original netplan; it has been fixed but only in bionic and newer. For max compatibility it's best to always render/supply the misspelling. Cloud-init will accept either key, but for now will only render the misspelling as all netplan releases support this key.
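A minimal sketch of the behaviour described here: accept either spelling on input, but store and render only the misspelling that every netplan release understands. The names `normalize_bond_params`, `CORRECT_KEY`, and `LEGACY_KEY` are hypothetical, and the tie-break when both spellings are present is an assumption. This is not the code from the actual fix.

```python
CORRECT_KEY = "gratuitous-arp"
LEGACY_KEY = "gratuitious-arp"  # misspelling from the original netplan


def normalize_bond_params(params):
    """Return a copy where either spelling ends up under LEGACY_KEY."""
    normalized = dict(params)  # leave the caller's dict untouched
    if CORRECT_KEY in normalized:
        value = normalized.pop(CORRECT_KEY)
        # Assumption: if both spellings are present, the legacy key's
        # value wins, so existing configurations keep their meaning.
        normalized.setdefault(LEGACY_KEY, value)
    return normalized
```

Normalizing once at the input boundary means the rest of the code only ever has to know about one key.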
Steve Langasek (vorlon) wrote : Re: [Bug 1827238] Re: Machines fail to deploy because cloud-init needs to accept both netplan spellings for grat arp | #13 |
On Sat, May 04, 2019 at 12:34:00AM -0000, Ryan Harper wrote:
> The misspelling is part of the original netplan; it has been fixed but
> only in bionic and newer. For max compatibility it's best to always
> render/supply the misspelling. Cloud-init will accept either key, but
> for now will only render the misspelling as all netplan releases support
> this key.
In the original netplan, it's also misspelled in the output to networkd, so
probably never worked? From xenial, and bionic release pocket:
./src/networkd.c: g_string_
In the NM backend that's spelled 'num_grat_arp' in the output so should
work, but that's also largely irrelevant for cloud-init I would think.
Changed in maas: | |
milestone: | 2.6.0rc1 → 2.6.0beta3 |
Changed in cloud-init: | |
assignee: | nobody → Ryan Harper (raharper) |
This bug is fixed with commit ded1ec81 to cloud-init on branch master.
To view that commit see the following URL:
https:/
Changed in cloud-init: | |
status: | In Progress → Fix Committed |
Changed in maas: | |
status: | Fix Committed → Fix Released |
This bug is believed to be fixed in cloud-init in version 19.2. If this is still a problem for you, please make a comment and set the state back to New.
Thank you.
Changed in cloud-init: | |
status: | Fix Committed → Fix Released |
So this is what I see on the logs:
1. On rackd.log on .32, I see the machine PXE boot to start the deployment process:
2019-05-01 10:32:33 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.244.41.7
2019-05-01 10:32:33 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.244.41.7
2019-05-01 10:32:33 provisioningserver.rackdservices.tftp: [info] grubx64.efi requested by 10.244.41.7
2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/command.lst requested by 10.244.41.7
2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/fs.lst requested by 10.244.41.7
2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/crypto.lst requested by 10.244.41.7
2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/terminal.lst requested by 10.244.41.7
2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/grub.cfg requested by 10.244.41.7
2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/grub.cfg-14:02:ec:41:c7:dc requested by 10.244.41.7
2019-05-01 10:32:34 provisioningserver.rackdservices.http: [info] /images/ubuntu/amd64/ga-18.04/bionic/daily/boot-kernel requested by 10.244.41.7
2019-05-01 10:32:36 provisioningserver.rackdservices.http: [info] /images/ubuntu/amd64/ga-18.04/bionic/daily/boot-initrd requested by 10.244.41.7
2019-05-01 10:32:58 provisioningserver.rackdservices.http: [info] /images/ubuntu/amd64/ga-18.04/bionic/daily/squashfs requested by 10.244.41.7
2. On rackd.log on .30, I see it PXE boot post-deployment (and it's told to localboot):
2019-05-01 10:38:13 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.244.41.7
2019-05-01 10:38:13 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.244.41.7
2019-05-01 10:38:14 provisioningserver.rackdservices.tftp: [info] grubx64.efi requested by 10.244.41.7
2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/command.lst requested by 10.244.41.7
2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/fs.lst requested by 10.244.41.7
2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/crypto.lst requested by 10.244.41.7
2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/terminal.lst requested by 10.244.41.7
2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/grub.cfg requested by 10.244.41.7
2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/grub.cfg-14:02:ec:41:c7:dc requested by 10.244.41.7
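The comparison between the two boots can be sketched as a small log parser. It assumes rackd.log lines of the form `<date> <time> <logger>: [<level>] <path> requested by <client>`. The names `requests_by_client` and `fetched_images`, and the heuristic that a boot requesting no `/images/` path is a post-deployment boot, are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed line format: "<date> <time> <logger>: [<level>] <path> requested by <client>"
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<logger>\S+): \[(?P<level>\w+)\] "
    r"(?P<path>\S+) requested by (?P<client>\S+)$"
)


def requests_by_client(lines):
    """Group (timestamp, path) pairs by the requesting client address."""
    grouped = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:  # silently skip lines that are not file requests
            grouped[match["client"]].append((match["timestamp"], match["path"]))
    return dict(grouped)


def fetched_images(requests):
    """True if any request was for a path under /images/ (install boot)."""
    return any(path.startswith("/images/") for _, path in requests)
```

Under that heuristic, the first log excerpt would count as an install boot and the second as a boot that stopped at GRUB, which is where the console logs become necessary.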
3. I see that curtin has run the deployment process and hasn't reported any errors - log: https://pastebin.ubuntu.com/p/zMgTttxdSj/ | curtin config: https://pastebin.ubuntu.com/p/Y2ZMX6Rstd/
So, from all the information above, I don't think we have enough information to know what the issue is.
A. The machine was never instructed to localboot.
B. The machine was instructed to localboot, but grub failed.
C. The machine booted onto the disk, but either didn't get network or failed to contact metadata.
D. There is a firmw...