Failed to allocate the required AUTO IP addresses after 2 retries

Bug #1902425 reported by Guilherme G. Piccoli
This bug affects 11 people
Affects   Status          Importance   Assigned to      Milestone
MAAS      Fix Released    High         Alberto Donato
2.8       Fix Released    High         Alberto Donato
2.9       Fix Committed   High         Alberto Donato

Bug Description

Hi MAAS team, after a recent lab move, we suddenly lost the ability to deploy machines in MAAS 2.8 due to a "fake" IP exhaustion. I say fake because, given the managed subnet configuration (and some inspection of maasdb), we seem to have plenty of IPs to allocate to machines when deploying; instead, we see the following message in the UI when trying to deploy:

"Failed to allocate the required AUTO IP addresses after 2 retries"
The machine instantly gets back to Allocated state.

In maas-regiond, I see the following backtrace:

2020-10-29 18:45:28 maasserver.websockets.protocol: [critical] Error on request (23) machine.action: Failed to allocate the required AUTO IP addresses after 2 retries.
 Traceback (most recent call last):
   File "/usr/lib/python3.6/threading.py", line 864, in run
     self._target(*self._args, **self._kwargs)
   File "/usr/lib/python3/dist-packages/provisioningserver/utils/twisted.py", line 850, in worker
     return target()
   File "/usr/lib/python3/dist-packages/twisted/_threads/_threadworker.py", line 46, in work
     task()
   File "/usr/lib/python3/dist-packages/twisted/_threads/_team.py", line 190, in doWork
     task()
 --- <exception caught here> ---
   File "/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 250, in inContext
     result = inContext.theWork()
   File "/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 266, in <lambda>
     inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
   File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 122, in callWithContext
     return self.currentContext().callWithContext(ctx, func, *args, **kw)
   File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 85, in callWithContext
     return func(*args,**kw)
   File "/usr/lib/python3/dist-packages/provisioningserver/utils/twisted.py", line 885, in callInContext
     return func(*args, **kwargs)
   File "/usr/lib/python3/dist-packages/provisioningserver/utils/twisted.py", line 217, in wrapper
     result = func(*args, **kwargs)
   File "/usr/lib/python3/dist-packages/maasserver/utils/orm.py", line 737, in call_within_transaction
     return func_outside_txn(*args, **kwargs)
   File "/usr/lib/python3/dist-packages/maasserver/utils/orm.py", line 602, in __exit__
     self.fire()
   File "/usr/lib/python3/dist-packages/provisioningserver/utils/twisted.py", line 217, in wrapper
     result = func(*args, **kwargs)
   File "/usr/lib/python3/dist-packages/maasserver/utils/asynchronous.py", line 208, in fire
     self._fire_in_reactor(hook).wait(LONGTIME)
   File "/usr/lib/python3/dist-packages/crochet/_eventloop.py", line 231, in wait
     result.raiseException()
   File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 385, in raiseException
     raise self.value.with_traceback(self.tb)
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 653, in _runCallbacks
     current.result = callback(current.result, *args, **kw)
   File "/usr/lib/python3/dist-packages/maasserver/models/node.py", line 4451, in retrier
     "after %d retries." % max_try_count
 maasserver.exceptions.StaticIPAddressExhaustion: Failed to allocate the required AUTO IP addresses after 2 retries.

2020-10-29 18:45:28 maasserver.dhcp: [info] Successfully configured DHCPv4 on rack controller 'foo (6ekha3)'.
2020-10-29 18:45:28 maasserver.dhcp: [info] Successfully configured DHCPv6 on rack controller 'foo (6ekha3)'.
2020-10-29 18:45:30 maasserver.dhcp: [info] Successfully configured DHCPv4 on rack controller 'foo (6ekha3)'.
2020-10-29 18:45:30 maasserver.dhcp: [info] Successfully configured DHCPv6 on rack controller 'foo (6ekha3)'.
2020-10-29 18:45:31 maasserver.region_controller: [info] Reloaded DNS configuration; ip XXX.XXX.XXX.XXX released
2020-10-29 18:45:36 maasserver.region_controller: [info] Reloaded DNS configuration:
 * ip XXX.XXX.XXX.XXX released
 * ip XXX.XXX.XXX.XXX released

Any advice on how to debug this is greatly appreciated! Thanks in advance

Tags: seg

Guilherme G. Piccoli (gpiccoli) wrote :

I forgot to mention: I restarted the MAAS machine and it didn't help.

Changed in maas:
status: New → Confirmed
Guilherme G. Piccoli (gpiccoli) wrote :
Junien Fridrick (axino) wrote :

I hit this bug as well with MAAS 2.8 and 2.9

David Britton (dpb) wrote :

Subscribing field-high as we have hit this during a new production openstack deployment.

David Britton (dpb) wrote :

Subscribing critical, after confirming with Junien that there is no known workaround.

Changed in maas:
importance: Undecided → High
assignee: nobody → Alberto Donato (ack)
Alberto Donato (ack) wrote :

could you please attach maas logs (regiond/rackd) when this happens?

Changed in maas:
status: Confirmed → Incomplete
Alberto Donato (ack)
Changed in maas:
milestone: none → 2.9.0rc1
Alberto Donato (ack)
Changed in maas:
status: Incomplete → In Progress
Lee Trager (ltrager)
Changed in maas:
milestone: 2.9.0rc1 → 2.9.0rc2
Alberto Donato (ack) wrote :

To give a bit of context, MAAS tries up to 3 times to assign AUTO IPs when deploying, and checks if those IPs are available.

If the check fails, it usually means those IPs are currently in use by something that MAAS doesn't know about (e.g. machines with statically-assigned IPs in a subnet where MAAS controls DHCP).

If the deploy is retried, MAAS will skip those IPs as it now knows they're in use.
This could be a workaround, but of course deploy will fail if the next 3 IPs are also in use.

Ideally, IPs shouldn't be assigned outside of MAAS if the subnet has DHCP managed by MAAS.
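
The retry behaviour described in this comment can be pictured with a small sketch. This is not MAAS code: `claim_auto_ip`, the `is_in_use` probe, and the `known_in_use` set are hypothetical names chosen for illustration, and the exception class is only a stand-in for the one named in the traceback. It assumes three probes per deploy attempt, as stated above.

```python
import ipaddress
from typing import Callable, Iterable, Set


class StaticIPAddressExhaustion(Exception):
    """Illustrative stand-in for the exception named in the traceback."""


def claim_auto_ip(
    candidates: Iterable[str],
    is_in_use: Callable[[str], bool],
    known_in_use: Set[str],
    max_tries: int = 3,
) -> str:
    """Probe up to max_tries candidate addresses and return the first free one.

    Addresses the probe reports as taken are recorded in known_in_use, so a
    later call skips them without spending a try on them.
    """
    tries = 0
    for ip in candidates:
        if ip in known_in_use:
            continue  # already known to be taken
        if tries >= max_tries:
            break  # budget for this deploy attempt is spent
        tries += 1
        if is_in_use(ip):
            known_in_use.add(ip)  # remember the conflict for the next attempt
            continue
        return ip
    raise StaticIPAddressExhaustion(
        "no free address found after %d tries" % tries
    )


# Hypothetical subnet where the first three host addresses are used by
# machines that the allocator does not know about.
rogue = {"10.0.0.1", "10.0.0.2", "10.0.0.3"}
known: Set[str] = set()
candidates = [str(h) for h in ipaddress.ip_network("10.0.0.0/29").hosts()]

try:
    claim_auto_ip(candidates, rogue.__contains__, known)
except StaticIPAddressExhaustion:
    # First deploy fails, but the three conflicts are now recorded.
    second_attempt = claim_auto_ip(candidates, rogue.__contains__, known)
```

In this sketch the first attempt fails and the second succeeds, which matches the described workaround. A fourth unknown conflict would make the retry fail as well.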

Changed in maas:
status: In Progress → Fix Committed
Alberto Donato (ack)
no longer affects: maas/trunk
Changed in maas:
milestone: 2.9.0rc2 → 2.10-next
Pedro Guimarães (pguimaraes) wrote :

Running into this issue on MAAS 2.8

Pedro Guimarães (pguimaraes) wrote :

Deployments beyond 3 physical nodes and 14 VMs on MAAS 2.8 are not working. I get at least one node stuck. Even if I clean MAAS from all nodes and redeploy from enlistment, I still see this issue.

Pedro Guimarães (pguimaraes) wrote :

SOS reports for all 3 infra nodes available on: https://drive.google.com/drive/folders/1YgxJnh0Bp-j_fpU6fh5YHB3wWgWEglL9?usp=sharing

There is no valid workaround for field deployments since we cannot use 2.9 or 2.10 nor we can use 2.8 after the issue starts to show up.

Nobuto Murata (nobuto)
tags: added: ps5
tags: removed: ps5
Arif Ali (arif-ali) wrote :

Based on the patches, it seems this fix has landed, and based on the snap version, I should have it.

I have deployed a brand new MAAS server (lab environment), and set all the interfaces of all my VMs to Auto-Assign (except for the external interface), but I am unable to deploy any VMs or machines through MAAS. I have the snap installed and am using the maas-test-db. When I set the interfaces to grab an IP from DHCP, MAAS successfully gives the IP via that method. Is there anything I am missing?

snap list | grep maas
maas 2.9.0-9137-g.8e920a12b 10860 2.9/stable canonical* -
maas-cli 0.6.5 16 latest/stable canonical* -
maas-test-db 12.4-17-g.9e70484 60 2.9/stable canonical* -

2020-12-29 15:10:43 maasserver.websockets.protocol: [critical] Error on request (56) machine.action: Failed to allocate the required AUTO IP addresses
        Traceback (most recent call last):
          File "/usr/lib/python3.8/threading.py", line 870, in run
            self._target(*self._args, **self._kwargs)
          File "/snap/maas/10860/lib/python3.8/site-packages/provisioningserver/utils/twisted.py", line 825, in worker
            return target()
          File "/snap/maas/10860/usr/lib/python3/dist-packages/twisted/_threads/_threadworker.py", line 46, in work
            task()
          File "/snap/maas/10860/usr/lib/python3/dist-packages/twisted/python/context.py", line 122, in callWithContext
            return self.currentContext().callWithContext(ctx, func, *args, **kw)
          File "/snap/maas/10860/usr/lib/python3/dist-packages/twisted/python/context.py", line 85, in callWithContext
            return func(*args,**kw)
          File "/snap/maas/10860/lib/python3.8/site-packages/provisioningserver/utils/twisted.py", line 860, in callInContext
            return func(*args, **kwargs)
          File "/snap/maas/10860/lib/python3.8/site-packages/provisioningserver/utils/twisted.py", line 192, in wrapper
            result = func(*args, **kwargs)
          File "/snap/maas/10860/lib/python3.8/site-packages/maasserver/utils/orm.py", line 737, in call_within_transaction
            return func_outside_txn(*args, **kwargs)
          File "/snap/maas/10860/lib/python3.8/site-packages/maasserver/utils/orm.py", line 602, in __exit__
            self.fire()
          File "/snap/maas/10860/lib/python3.8/site-packages/provisioningserver/utils/twisted.py", line 192, in wrapper
            result = func(*args, **kwargs)
          File "/snap/maas/10860/lib/python3.8/site-packages/maasserver/utils/asynchronous.py", line 207, in fire
            self._fire_in_reactor(hook).wait(LONGTIME)
          File "/snap/maas/10860/usr/lib/python3/dist-packages/crochet/_eventloop.py", line 231, in wait
            result.raiseException()
          File "/snap/maas/10860/usr/lib/python3/dist-packages/twisted/python/failure.py", line 467, in raiseException
            raise self.value.with_traceback(self.tb)
          File "/snap/maas/10860/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
            result = result.throwExceptionIntoGenerator(g)
          File "/snap/maas/10860...

Arif Ali (arif-ali) wrote :

It seems I have resolved my issue by disabling and re-enabling "managed" on each of my subnets. I'm not sure why I would need to do that with a fresh installation, though. But as mine is a lab, I'm not too fussed, and at least I can crack on.

# Toggle "managed" off for every subnet, then back on for all but the "external" space.
for line in $(maas admin subnets read | jq -c '.[] | {id: .id, space: .space}')
do
  subnet_id=$(echo "$line" | jq '.id')
  # -r prints the raw string, so no sed is needed to strip the quotes
  space=$(echo "$line" | jq -r '.space')

  maas admin subnet update "$subnet_id" managed=False
  [[ $space != "external" ]] && maas admin subnet update "$subnet_id" managed=True
done
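
For anyone who would rather review the commands before running them, here is a minimal Python sketch that only builds the same `maas ... subnet update` command lines from the JSON, without executing anything. The function name `plan_managed_toggle` and the sample data are hypothetical; the only fields read are `id` and `space`, as in the loop above.

```python
import json
from typing import List


def plan_managed_toggle(
    subnets_json: str, profile: str = "admin", skip_space: str = "external"
) -> List[List[str]]:
    """Return the argv lists that would toggle 'managed' off and on again.

    Subnets in skip_space are only switched off, mirroring the shell loop.
    """
    commands: List[List[str]] = []
    for subnet in json.loads(subnets_json):
        base = ["maas", profile, "subnet", "update", str(subnet["id"])]
        commands.append(base + ["managed=False"])
        if subnet["space"] != skip_space:
            commands.append(base + ["managed=True"])
    return commands


# Made-up sample standing in for the output of `maas admin subnets read`.
sample = json.dumps(
    [{"id": 1, "space": "internal"}, {"id": 2, "space": "external"}]
)
for argv in plan_managed_toggle(sample):
    print(" ".join(argv))
```

Each returned list could be passed to `subprocess.run` once the plan looks right.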

Diko Parvanov (dparv) wrote :

happened as well with
MAAS version: 2.9.0~rc3 (9131-g.26dc68728)
JuJu 2.8.6

The machine failed to deploy with 'Failed to allocate the required AUTO IP addresses'; according to the juju machine log, it tries 10 times and then fails.

Checking regiond.log found the proper IP allocation:
==> regiond.log <==
2021-01-15 12:37:36 maasserver.region_controller: [info] Reloaded DNS configuration:
  * ip X allocated
  * ip Y released
  * ip Z allocated
  * ip P allocated
  * ip Q allocated

Using those addresses as Static assign on the bonds then worked, but there seems to be some problem in MAAS.

Albert Valiev (artscout) wrote :

Same here, still have trouble allocating IPs in AUTO mode:
2021-01-28 15:03:27 maasserver.websockets.protocol: [critical] Error on request (30) machine.action: Failed to allocate the required AUTO IP addresses
        Traceback (most recent call last):
          File "/usr/lib/python3.8/threading.py", line 870, in run
            self._target(*self._args, **self._kwargs)
          File "/usr/lib/python3/dist-packages/provisioningserver/utils/twisted.py", line 825, in worker
            return target()
          File "/usr/lib/python3/dist-packages/twisted/_threads/_threadworker.py", line 46, in work
            task()
          File "/usr/lib/python3/dist-packages/twisted/_threads/_team.py", line 190, in doWork
            task()
        --- <exception caught here> ---
          File "/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 250, in inContext
            result = inContext.theWork()
          File "/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 266, in <lambda>
            inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
          File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 122, in callWithContext
            return self.currentContext().callWithContext(ctx, func, *args, **kw)
          File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 85, in callWithContext
            return func(*args,**kw)
          File "/usr/lib/python3/dist-packages/provisioningserver/utils/twisted.py", line 860, in callInContext
            return func(*args, **kwargs)
          File "/usr/lib/python3/dist-packages/provisioningserver/utils/twisted.py", line 192, in wrapper
            result = func(*args, **kwargs)
          File "/usr/lib/python3/dist-packages/maasserver/utils/orm.py", line 737, in call_within_transaction
            return func_outside_txn(*args, **kwargs)
          File "/usr/lib/python3/dist-packages/maasserver/utils/orm.py", line 602, in __exit__
            self.fire()
          File "/usr/lib/python3/dist-packages/provisioningserver/utils/twisted.py", line 192, in wrapper
            result = func(*args, **kwargs)
          File "/usr/lib/python3/dist-packages/maasserver/utils/asynchronous.py", line 207, in fire
            self._fire_in_reactor(hook).wait(LONGTIME)
          File "/usr/lib/python3/dist-packages/crochet/_eventloop.py", line 231, in wait
            result.raiseException()
          File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 467, in raiseException
            raise self.value.with_traceback(self.tb)
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
            result = result.throwExceptionIntoGenerator(g)
          File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
            return g.throw(self.type, self.value, self.tb)
          File "/usr/lib/python3/dist-packages/maasserver/models/node.py", line 5555, in claim_auto_ips
            yield self._claim_auto_ips()
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
            resul...

Arif Ali (arif-ali) wrote :

Currently on snap version at 2.9.1-9153-g.66318f531

I am still facing this issue in my environment, even with a fresh installation (and what I thought was a workaround in comment #13), which I did over the weekend of 22nd Jan. I have a /24 subnet with 50 addresses for DHCP, 10 for VIPs, and 20 reserved for servers. I enabled debug in my environment, and I see that it tries to check for IPs.

Here's a snippet from the regiond.log https://paste.ubuntu.com/p/QN7rHwBNjP/ and rackd.log https://paste.ubuntu.com/p/zvx7jzk53k/

I am hoping these logs will be useful to debug this further, and understand what is going on
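
As a sanity check on the numbers in this comment, the address budget of such a /24 can be worked out with Python's `ipaddress` module. The concrete ranges below are made up (the comment only gives their sizes), and the sketch assumes that both the dynamic (DHCP) range and the reserved ranges are excluded from auto-assignment.

```python
import ipaddress
from typing import List, Tuple


def auto_assignable(cidr: str, excluded_ranges: List[Tuple[str, str]]) -> List[str]:
    """List host addresses in cidr that fall outside every excluded range."""
    network = ipaddress.ip_network(cidr)
    excluded = set()
    for first, last in excluded_ranges:
        start = int(ipaddress.ip_address(first))
        end = int(ipaddress.ip_address(last))
        excluded.update(range(start, end + 1))  # ranges are inclusive
    return [str(h) for h in network.hosts() if int(h) not in excluded]


# Hypothetical layout matching the sizes mentioned: 50 DHCP, 10 VIPs, 20 servers.
EXCLUDED = [
    ("192.168.1.200", "192.168.1.249"),  # dynamic range, 50 addresses
    ("192.168.1.190", "192.168.1.199"),  # VIPs, 10 addresses
    ("192.168.1.10", "192.168.1.29"),    # servers, 20 addresses
]
free = auto_assignable("192.168.1.0/24", EXCLUDED)
print(len(free))  # 254 host addresses minus 80 excluded = 174
```

With that many candidates left (before subtracting a gateway or addresses already assigned), genuine exhaustion is unlikely, which suggests the failure comes from the availability check rather than from a shortage of addresses.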

Changed in maas:
milestone: 2.10.0 → 2.10-beta1
Junien Fridrick (axino) wrote :

Still hitting this with a fresh MAAS 2.9.1-9153-g.66318f531 (snap). The status "Fix released" for MAAS 2.9 should be updated, this is not fixed :)

Junien Fridrick (axino) wrote :

After debugging with the MAAS team, this happened to me because I had machines with static IPs not deployed by MAAS on the subnet. Reserving the used IPs in MAAS or adding the machines as "Devices" from discovery allowed me to get rid of this problem !
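
One way to find such addresses is to compare what is actually seen on the network with what MAAS has on record. The sketch below is only the comparison step: how the `observed` list is gathered (ARP table, ping sweep, MAAS discovery) and how the `known_to_maas` list is exported are left open, and all names and sample addresses are hypothetical.

```python
import ipaddress
from typing import Iterable, List


def unknown_addresses(
    cidr: str, observed: Iterable[str], known_to_maas: Iterable[str]
) -> List[str]:
    """Return observed addresses inside cidr that MAAS has no record of."""
    network = ipaddress.ip_network(cidr)
    known = {ipaddress.ip_address(ip) for ip in known_to_maas}
    seen = {ipaddress.ip_address(ip) for ip in observed}
    suspects = sorted(ip for ip in seen if ip in network and ip not in known)
    return [str(ip) for ip in suspects]


# Made-up data: two hosts answer on the subnet without being known to MAAS.
observed = ["10.12.0.5", "10.12.0.20", "10.12.0.7", "10.14.0.9"]
known_to_maas = ["10.12.0.5"]
to_reserve = unknown_addresses("10.12.0.0/24", observed, known_to_maas)
print(to_reserve)
```

The resulting addresses are the candidates to reserve in MAAS or to add as Devices, as described above.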

Arif Ali (arif-ali) wrote :

@axino, that was the hint for me, and solved it for me now, that deserves some karma :), and whoever helped you in the MAAS team.

I enabled discovery and realised that I had taken my 3 hypervisors out of MAAS management 3 weeks ago, and hence 3 of the 50 available IP addresses were taken in each of the ranges.

But I still had (N - 3) IP addresses available, so why doesn't MAAS just pick another IP and move on? :thinking:

Albert Valiev (artscout) wrote :

I took the patch made by Alberto from git and applied it to my installed 2.9.2 (I had to, because I don't have time to wait for the 2.10 release), and it worked wonderfully after that, so it's a sure fix. Many thanks!

Matt Rae (mattrae) wrote :

I seem to be having this problem after the maas snap auto-updated to 2.8.3/4 from 2.8.2

Deploying with a static ip, in a reserved range, to any subnet gives an error in the regiond log:

maasserver.exceptions.StaticIPAddressExhaustion: Failed to allocate the required AUTO IP addresses

There don't appear to be any unexpected IPs in the subnet when I go to the subnet page, or when I ping the subnet.

Ria Jairam (rjairam) wrote :

I'm also experiencing this. However, our DNS and DHCP is not managed by MAAS (no choice in the matter). How do I just get MAAS to relinquish control to my DHCP?

Matt Rae (mattrae) wrote :

I reproduced this issue multiple times in a 2.8.3 testbed.

Fortunately upgrading to 2.9.2 resolved the issue.

We were forced into this situation because the snap updated to 2.8.3 and 2.8.4, so 2.8.2 was no longer on the system to roll-back to.

To reproduce:
Create a reserved range and then assign a machine a static IP in that range.

On deploy we will see the following in the logs:

maasserver.exceptions.StaticIPAddressExhaustion: Failed to allocate the required AUTO IP addresses

Tatu Ylonen (ylo) wrote :

I am very much suffering from this too. Completely unable to deploy new machines. 2.8/stable (2.8.4-8597-g.05313b458). I had unexplained unused ip addresses on two subnets inside dynamic ranges for managed networks, and one actual misconfigured switch management port inside a dynamic range. The misconfigured host has been reconfigured to be outside dynamic ranges. I have deleted the unknown-usage ip addresses and related interfaces using psql from maasserver_staticipaddress, maasserver_interface_ip_addresses, and maasserver_interfaces. I have tried unmanaging and re-managing all subnets. I've tried disabling and re-enabling DHCP. I've rebooted. I've removed and reinstalled the maas snap (I use a separate postgresql database as per production deployment instructions). I tried upgrading to 2.9/stable and then 2.9/edge but doing so breaks deployments via API, possibly due to an apparmor configuration bug (see #1917640). I have restored the postgresql database from a backup taken before 2.9 deployment and reinstalled maas 2.8/stable but the AUTO IP failure persists. So far nothing has worked.

James Vaughn (jmcvaughn) wrote :

I've just bumped into the same issue with snapped 2.9.2 (9164-g.ac176b5c4).

I seem to have been hitting this issue when repeatedly redeploying a machine, though I can't say for certain if this is a trigger for this behaviour. In this state, the machine can be deployed without issue but its name cannot be resolved via MAAS' DNS.

My workaround was to redeploy with a static address (the same address in this case), then release and set addressing back to auto-assign. Everything then went back to normal.

I've attached the past few days of regiond logging, but note that I've not explicitly tested/triggered this to reproduce the problem; all of the above was just in passing.

The activity with host clatter-deceased on the 2021-03-22 in the attached regiond.log covers the above.

Dan Streetman (ddstreet) wrote :

resetting 2.8 as it's not fix-released, this was encountered on 2.8.4

Dan Streetman (ddstreet) wrote :

ah, i can't modify the settings since this is for the maas project, not ubuntu/maas. this isn't fixed-released for 2.8.

Matt Rae (mattrae) wrote :

For me this issue started happening going from 2.8.2 to 2.8.4.

I reproduced the issue in 2.8.4. Once I updated to 2.9.2, the error went away.

For the people still hitting this in 2.9.2, I wonder if there is a device using an ip on the subnet? If you enable auto discovery, maybe it will discover the device? I'm not sure of an easier way to find out if there is an ip maas is getting stuck on.

Changed in maas:
status: Fix Committed → Fix Released
Tatu Ylonen (ylo) wrote :

I upgraded to 2.8.5-8600-g.efb54078a (channel 2.8/candidate), but I'm still getting the same "Failed to allocate the required AUTO IP addresses" error when trying to deploy. No explanation of why this is happening. The host has two network interfaces, both of which have a statically configured IP address (in different subnets). It has no automatically allocated addresses. I'm releasing the host, then deploying (it ends up in allocated state).

I cannot find any conflicting IP addresses or IP addresses inside dynamic ranges in any subnets. I've tried toggling "Managed allocation" for the involved networks. One thing of note may be that I've configured the IP addresses for the hosts via the API (python-libmaas).

It would be really helpful if the error message contained more information, such as the conflicting IP address or some other explanation why this is happening.

Tatu Ylonen (ylo) wrote :

I've done some more testing with 2.8.5, and I am able to deploy with Auto assign addresses if the subnets are set to unmanaged and I have a reserved range. However, once I assign a static address from the same subnet (outside the reserved ranges - the subnets are unmanaged), deploy will fail with the AUTO IP error. This happens regardless of whether I set the static IP via the GUI or via the API.

FURTHERMORE, deploy will succeed if I set EITHER of the two network interfaces to Auto assign. However, if both interfaces have static IP, then deploy will fail with the AUTO IP error. The interfaces are in different subnets (10.12.0.0/24 and 10.14.0.0/16). I tried setting the IP addresses to 10.12.0.123 and 10.14.0.123, respectively. If one or both are Auto assign, deployment succeeds. If both are set to Static assign, then deployment fails. Neither of these IP addresses have been used for anything ever in the cluster, as far as I know (I also tried with a different address for both).

So, the problem only seems to manifest on the two-interface machine IF BOTH INTERFACES ARE SET TO STATIC ASSIGN. Apparently the IP addresses and subnets are fine, as deploy will succeed if I set either interface to Auto assign. This really looks like a bug still in 2.8.5.

Syed Mohammad Adnan Karim (karimsye) wrote :

I have also done some basic testing with 2.8.5 where I tried to deploy a machine with 2 interfaces, one with a static IP and one unconfigured but I received the same error message of "Failed to allocate the required AUTO IP addresses". I also tried setting both interfaces as static and that also fails with the same message.

I was able to get it working by setting one of the interfaces as Auto assign but this does not meet my requirements/constraints.

Djair Silva (djairdasilva) wrote :

Just to let you know guys,

I was able to upgrade to 2.8.5 and I'm now able to deploy machines, but the PXE interface needs to be set to auto-assign, as DHCP fails with the same AUTO IP issue. The problem is that our 1k deployed machines have the PXE interface configured as DHCP and we can't change it, so the reinstall process will fail and cause problems for us. Could you please help create another fix so that the deployment does not fail if the PXE interface is configured as DHCP?

Thank you so much, the slowness and the search issue have been fixed and we are happy to have this fix in place, thank you so much for the help until now.

Tomas Blatan (tblatan) wrote :
Cedric Lemarchand (cedric-lemarchand) wrote :

Any information regarding a potential fix release for 2.9 branch ?

Cedric Lemarchand (cedric-lemarchand) wrote :

Just upgraded to 3.0.0-10029-g.986ea3e45, bug still occurs.

Zhanglei Mao (zhanglei-mao) wrote :

I hit a similar issue, caused by some machines that were in Ready status but had not actually powered off and were still using IPs that then got assigned to newly deployed machines.

Andrea Ieri (aieri) wrote :

Is this bug fixed in the 3.1 branch? According to comment #35 it's still happening in the latest 3.0/stable so it isn't too clear.

Alberto Donato (ack) wrote :

This bug has been used to track different issues which all end up causing MAAS to fail to pick an IP for the machine.

If you're still experiencing similar issues on 3.0/3.1, please open a new bug attaching regiond/rackd.log, and the output of `maas $profile subnets read`.
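
To make that output easier to read in a new report, it could be condensed to one line per subnet. This is only a sketch: `id` and `space` are the fields used earlier in this thread, while `cidr` and `managed` are assumed field names, so the code falls back to a placeholder when they are missing. The sample JSON is made up.

```python
import json
from typing import List


def summarise_subnets(subnets_json: str) -> List[str]:
    """Return one summary line per subnet from `subnets read` style JSON."""
    lines = []
    for subnet in json.loads(subnets_json):
        lines.append(
            "id=%s cidr=%s space=%s managed=%s"
            % (
                subnet.get("id", "?"),
                subnet.get("cidr", "?"),     # assumed field name
                subnet.get("space", "?"),
                subnet.get("managed", "?"),  # assumed field name
            )
        )
    return lines


# Made-up sample standing in for the output of `maas $profile subnets read`.
sample_output = json.dumps(
    [
        {"id": 1, "cidr": "10.12.0.0/24", "space": "internal", "managed": True},
        {"id": 2, "space": "external"},
    ]
)
summary = summarise_subnets(sample_output)
print("\n".join(summary))
```

The full, unmodified output should still be attached to the bug; the summary is only a reading aid.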
