[2.5.3] bug: deploy failed when more than 33 rack controller in dns resolver, cause argument list too long and dns resolution stop working

Bug #1828602 reported by Dylan Wang
This bug affects 1 person
Affects    Status           Importance    Assigned to    Milestone
MAAS       Fix Released     Critical      Blake Rouse
2.5        Fix Committed    Critical      Blake Rouse

Bug Description

After upgrading to 2.5.3,

things went well at the beginning, but then we suddenly hit this weird bug:

when deploying a machine, everything looks fine in the log and the deploy finishes within 5 minutes.

I see "Installation complete - Node disabled netboot" in the MAAS events,

however the machine then hangs in the deploying state and eventually ends up as a failed deployment.

I tried reinstalling the rack controller, rebooting the region controller, and deleting the machine and re-deploying;
it always gets stuck at deploying even though the install has finished (and the machine boots into the system without problems).


Revision history for this message
Dylan Wang (hyuwang) wrote :
description: updated
Revision history for this message
Andres Rodriguez (andreserl) wrote :

The error effectively means that the machine never booted into the installed environment, or if it did, it could have failed in many ways, such as cloud-init failing to configure networking, failing to boot from the proper disk, crashing, or failing to reboot.

Please attach the following logs:

1. Console log
2. MAAS install.log
3. Rsyslog for the machine
4. Curtin config

Changed in maas:
status: New → Incomplete
Revision history for this message
Dylan Wang (hyuwang) wrote :

Ever since this happened, all the machines on all rack controllers get stuck at this stage.

But they do boot into the system; the network looks good and they are reachable. It's just that cloud-init fails, and so do the post-deploy events.

Please see the attachments below (regiond log and rsyslog for one of the machines). I'm trying to get the cloud-init/curtin logs as well (I can't log in to the machine, since the SSH keys weren't provisioned onto the server due to the cloud-init failure).

Where can I find the MAAS install.log and the console log?

Changed in maas:
status: Incomplete → New
Revision history for this message
Andres Rodriguez (andreserl) wrote :

Hi there!

Marking this as incomplete as the serial console logs and install log have not been provided. For the latter, please see https://discourse.maas.io/t/getting-curtin-debug-logs/169 . For the serial
console, you will need to get that from your hardware's serial output, captured over IPMI or through the virtual console.

Changed in maas:
status: New → Incomplete
Revision history for this message
Dylan Wang (hyuwang) wrote :

Hi Andres,

I didn't manage to get the curtin logs. I tried:

maas profile node-results read system_id=xxxx name=/tmp/curtin-logs.tar

and it gives me:

Success.
Machine-readable output follows:
[]

Also, for the console log, I can see it via the BMC web console, but when I try to record it with ipmitool (ipmitool -H 10.101.33.200 -U Administrator -P Admin@9000 -I lanplus sol activate | tee /tmp/ipmi.log), the output always stops at boot (the logs after that don't seem to be captured by IPMI SOL):

PXELINUX 6.03 lwIP 20171017 Copyright (C) 1994-2014 H. Peter Anvin et al
Booting local disk ...
Booting...

Could you please give me a hint on how I could get the logs?

Revision history for this message
Dylan Wang (hyuwang) wrote :

I think I found the problem: DNS resolution for the maas-internal domain somehow stops working after deploy.

I injected a public key into the squashfs and managed to get into the system that failed to deploy.

The cloud config looks good, but resolving 117-**-**-**--28.maas-internal failed.

I looked at the curtin config (via maas profile machine get-curtin-config id);

the DNS config is the same as on the server (/etc/netplan/50-cloud-init.yaml).

However, I found two things that are really weird:

nameservers:
    addresses:
    - 117.**.**.18
    - 120.**.**.114
    - 10.101.2.1
    - 10.101.4.1
    - 223.**.**.210
    - 10.101.6.1
    - 10.101.8.1
    - 10.0.0.14
    - 10.101.9.1
    - 10.101.10.1
    - 10.101.12.1
    - 10.101.13.1
    - 10.101.14.1
    - 10.101.15.1
    - 10.101.16.1
    - 10.101.17.1
    - 10.101.18.1
    - 10.101.19.1
    - 10.101.20.1
    - 10.101.21.1
    - 223.**.**.195
    - 10.101.22.1
    - 10.101.23.1
    - 10.101.24.1
    - 10.101.25.1
    - 10.101.26.1
    - 10.101.27.1
    - 10.101.28.1
    - 58.**.**.2
    - 10.101.30.1
    - 10.101.31.1
    - 10.101.32.1
    - 120.**.**.110

1. The DNS server list contains a lot of internal IPs rather than the public rack controller IPs.
2. I tried dig 117-**-**-**--28.maas-internal against every public IP and they all return a valid response, but when I dig @127.0.0.53, it doesn't.
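
The first observation can be made measurable with a small Python sketch that counts how many of the configured nameservers are private addresses. It assumes the netplan-style fragment shown above is available as text. The names parse_nameservers and summarize are hypothetical helpers, not part of MAAS, and masked entries such as 117.**.**.18 are counted as unparsed rather than guessed.

```python
import ipaddress

# A shortened sample using addresses quoted in this report.
SAMPLE = """\
nameservers:
    addresses:
    - 117.**.**.18
    - 10.101.2.1
    - 10.0.0.14
"""


def parse_nameservers(netplan_text):
    """Collect the entries listed under an 'addresses:' key."""
    servers = []
    in_addresses = False
    for line in netplan_text.splitlines():
        stripped = line.strip()
        if stripped == "addresses:":
            in_addresses = True
        elif in_addresses and stripped.startswith("- "):
            servers.append(stripped[2:].strip())
        elif stripped:
            # Any other non-empty line ends the address list.
            in_addresses = False
    return servers


def summarize(servers):
    """Count private, public, and unparseable (e.g. masked) entries."""
    counts = {"total": len(servers), "private": 0, "public": 0, "unparsed": 0}
    for server in servers:
        try:
            address = ipaddress.ip_address(server)
        except ValueError:
            counts["unparsed"] += 1
            continue
        counts["private" if address.is_private else "public"] += 1
    return counts
```

Running summarize(parse_nameservers(text)) on the full list from the curtin config would show how many of the 33 entries are internal addresses.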

Changed in maas:
status: Incomplete → Opinion
status: Opinion → New
Revision history for this message
Dylan Wang (hyuwang) wrote :

I found the root cause: according to the systemd-resolved log, the DNS server list is too long, so DNS resolution stops working.

May 13 13:41:07 ubuntu systemd-resolved[1109]: Failed to read DNS servers for interface eno1, ignoring: Argument list too long
May 13 13:41:07 ubuntu systemd[1]: Started Network Name Resolution.
May 13 13:41:09 ubuntu systemd-resolved[1109]: Failed to read DNS servers for interface eno1, ignoring: Argument list too long
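
To check other machines for the same symptom, the log lines above can be scanned with a short Python sketch. It assumes the journal or syslog text has already been read into a string and that the message format matches the lines quoted here. The function name failed_interfaces is hypothetical.

```python
import re

# Matches the systemd-resolved message format quoted in this report.
PATTERN = re.compile(
    r"systemd-resolved\[\d+\]: Failed to read DNS servers "
    r"for interface (\S+), ignoring: (.+)$"
)


def failed_interfaces(log_text):
    """Map each interface to the set of reasons resolved reported for it."""
    failures = {}
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            interface, reason = match.group(1), match.group(2).strip()
            failures.setdefault(interface, set()).add(reason)
    return failures
```

An interface that appears in the result with the reason "Argument list too long" has had its DNS server list ignored by systemd-resolved.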

summary: - [2.5.3] bug: stuck deploying while deploy is actually finished
+ [2.5.3] bug: deploy failed when more than 33 rack controller in dns
+ resolver, cause argument list too many and dns resolution stop working
summary: [2.5.3] bug: deploy failed when more than 33 rack controller in dns
- resolver, cause argument list too many and dns resolution stop working
+ resolver, cause argument list too long and dns resolution stop working
Revision history for this message
Andres Rodriguez (andreserl) wrote :

Can you provide a bit more information on how your infrastructure is implemented? DNS servers should only be one of two things:

1. All of the rack controllers in the /same/ VLAN
2. Any region controller on the /same/ VLAN

Now, does that mean you have that many subnets in the same VLAN? Can you provide the output of:

maas <user> vlans read ?

Changed in maas:
status: New → Incomplete
Revision history for this message
Dylan Wang (hyuwang) wrote :

Apparently that isn't the case here. We have a single, different VLAN for each rack.

I dug a bit into the code and found that the DNS server list is constructed by something like this:

get curtin config -> NodeNetworkConfiguration -> get_default_dns_servers

This takes a combination of the region controller IPs from get_dns_server_addresses and other routable addresses from (src/maasserver/models/node.py):

get_routable_address_map(RackController.objects.all(), self)

I think this is where it goes wrong: it returns a list of all IPs on all rack controllers, instead of only the routable ones. Also, even if it correctly returned only the routable IPs, there should be a limit rather than including all of them.
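
The suggestion above (keep only reachable addresses and cap the list) could look roughly like the following Python sketch. This is not the MAAS implementation: choose_dns_servers is a hypothetical helper, reachability is approximated as "the address lies in one of the node's own subnets", and both the default limit of 3 and the /24 prefix in the usage note are assumptions made for illustration.

```python
import ipaddress


def choose_dns_servers(candidates, node_subnets, limit=3):
    """Keep candidates inside the node's subnets, up to `limit` entries.

    candidates:   iterable of IP address strings (e.g. rack controller IPs)
    node_subnets: iterable of CIDR strings the node is attached to
    limit:        maximum number of servers to return (assumed value)
    """
    networks = [ipaddress.ip_network(cidr) for cidr in node_subnets]
    chosen = []
    for candidate in candidates:
        address = ipaddress.ip_address(candidate)
        reachable = any(address in network for network in networks)
        if reachable and candidate not in chosen:
            chosen.append(candidate)
        if len(chosen) == limit:
            break
    return chosen
```

For a node on 10.101.2.0/24, choose_dns_servers(["10.101.2.1", "10.101.4.1"], ["10.101.2.0/24"]) keeps only 10.101.2.1, so the rack controllers of the other racks never reach the node's resolver configuration.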

Changed in maas:
status: Incomplete → New
Revision history for this message
Andres Rodriguez (andreserl) wrote :

I agree with you, but without the required data to understand your environment we cannot effectively work to solve the issue.

Changed in maas:
status: New → Incomplete
milestone: none → 2.6.0rc1
assignee: nobody → Blake Rouse (blake-rouse)
importance: Undecided → Critical
Revision history for this message
Dylan Wang (hyuwang) wrote :

I thought I had explained already:

1. All of the rack controllers in the /same/ VLAN:
  no two rack controllers share the same VLAN

2. Any region controller on the /same/ VLAN:
  no region controller shares the same VLAN

In our infra, each fabric has a unique VLAN with unique subnets,

and we have 30+ fabrics like this. Please see the attached screenshot.

Changed in maas:
status: Incomplete → New
Revision history for this message
Blake Rouse (blake-rouse) wrote :

I believe the issue is that get_routable_address_map treats undefined spaces as routable between each other, which is completely against the definition of the undefined space.

Could you try this quick patch to confirm it fixes your issue? I have confirmed that it solves the problem with the undefined space, but I want to confirm it solves the overall issue for you.

http://paste.ubuntu.com/p/gB3rd6PjGb/
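
The behaviour described in this comment can be illustrated with a toy Python model. It is not the linked patch and not MAAS code: the Address tuple, both function names, and the node address 10.101.2.50 in the usage note are hypothetical, and a space of None stands for the undefined space.

```python
from collections import namedtuple

# space is None when the subnet's VLAN has no space assigned (undefined).
Address = namedtuple("Address", ["ip", "space"])


def routable_addresses_buggy(node_addresses, rack_addresses):
    """Treats the undefined space like any other space (the reported bug)."""
    node_spaces = {address.space for address in node_addresses}
    return [rack.ip for rack in rack_addresses if rack.space in node_spaces]


def routable_addresses(node_addresses, rack_addresses):
    """Only addresses sharing a *defined* space count as routable."""
    node_spaces = {
        address.space for address in node_addresses if address.space is not None
    }
    return [
        rack.ip
        for rack in rack_addresses
        if rack.space is not None and rack.space in node_spaces
    ]
```

With a node at Address("10.101.2.50", None) and rack controllers that are all in the undefined space, the first function returns every rack controller address, while the second returns none of them. In this model, same-VLAN DNS servers would have to be added by a separate rule.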

Changed in maas:
status: New → In Progress
Revision history for this message
Dylan Wang (hyuwang) wrote :

Thanks, it does fix our issue.

So the routable addresses are decided by space?

We have never used that; we left the default value for all the racks...

Revision history for this message
Andres Rodriguez (andreserl) wrote : Re: [Bug 1828602] Re: [2.5.3] bug: deploy failed when more than 33 rack controller in dns resolver, cause argument list too long and dns resolution stop working

Dylan,

A space tells you that subnets inside the VLANs are routable between each
other!

https://docs.maas.io/2.5/en/intro-concepts

Spaces <https://docs.maas.io/2.5/en/intro-concepts#spaces>

A *space* is a logical grouping of VLANs whose subnets are able to
communicate with one another. VLANs within each space need not belong to
the same fabric. A default space is not created when MAAS is installed.

On Thu, May 16, 2019 at 2:05 PM Dylan Wang <email address hidden>
wrote:

> Thanks, it does fix our issue.
>
> So the routable address are decided by space?
>
> We have never use that, leave the default value for all the rack...

Revision history for this message
Dylan Wang (hyuwang) wrote :

Hi Andres,

I understand the concept. By "the default space" I meant the undefined space.

Our racks are designed this way:

each rack has a unique VLAN containing two subnets, one public and one internal (not routable to anywhere else).

We didn't make any use of spaces, so we just left it as is.

Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
status: Fix Committed → Fix Released