Provide user with feedback when dnsmasq runs out of ip addresses

Bug #1379494 reported by Sam Stoelinga
Affects: Fuel for OpenStack
Status: Invalid
Importance: Medium
Assigned to: Matthew Mosesohn

Bug Description

On a 50-node environment we had 80+ addresses available in the DHCP range of the Admin/PXE network, but after reboots and possibly moving cables between interfaces, dnsmasq reported that no more addresses were available. This is an important event that should be surfaced to the user so he knows what is going on; he shouldn't have to log in to the Cobbler container and check the dnsmasq logs.

It seems that in our case leases were not released quickly enough, but we should simply have sized our PXE network larger from the start.
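As a rough illustration of what sizing up the subnet buys (Python's stdlib `ipaddress`, not Fuel code; the subnet values are the defaults discussed in this report):

```python
from ipaddress import ip_network

def usable_hosts(cidr: str) -> int:
    """Host addresses in a subnet, excluding network and broadcast."""
    return ip_network(cidr).num_addresses - 2

# Default Fuel Admin/PXE subnet vs. a wider /23.
print(usable_hosts("10.20.0.0/24"))  # 254
print(usable_hosts("10.20.0.0/23"))  # 510
```

Note that the Fuel master, gateway, and any static reservations would further reduce the pool actually available for DHCP.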

On a side note, we had to redeploy our Fuel node in order to change the PXE network subnet size; changing the subnet size should be a built-in feature rather than requiring custom hacks.

Current result:
When dnsmasq runs out of IP addresses, the user has to log in to the Cobbler container and read the logs to find out what is going on.

Expected result:
In fuel-web, show a big red warning indicating that the user is running out of IP addresses in the PXE DHCP range.

Tags: scale
description: updated
Changed in fuel:
milestone: none → 6.0
assignee: nobody → Matthew Mosesohn (raytrac3r)
Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

Astute will need to check /var/log/dnsmasq.log (not docker-logs, because it is in /var/log inside the astute container) for critical errors. I will investigate whether we can predict that a deployment will exhaust the lease pool. As a short-term workaround, we can set the default DHCP range to fit inside 10.20.0.0/23 instead of /24.

Back to Fuel configuration capabilities: Cobbler has always supported changing its range. It just needs an update to astute.yaml and a re-run of Puppet, which you can do today: open fuelmenu, change the ranges, save and quit, then restart the Cobbler container. If we need a quick-and-dirty script for those who don't want to run fuelmenu, we can write one. What isn't flexible is the static range; that is a task the Fuel Python team deprioritized in the past. Maybe it's worth raising again.

As for dnsmasq configuration, the dynamic lease settings seem to hand out one IP during PXE and a second during OS install, possibly issuing a new lease because the client headers changed. Let's see if we can configure dnsmasq to recycle the lease when the same MAC requests an address.
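A minimal sketch of the kind of log check Astute could run (the helper and regex are assumptions, not existing Fuel code; dnsmasq logs a line ending in "no address available" when its pool is exhausted):

```python
import re

# Matches dnsmasq's pool-exhaustion message for a DHCPDISCOVER.
EXHAUSTED = re.compile(r"DHCPDISCOVER\([^)]*\).*no address available")

def pool_exhausted(log_lines):
    """Return True if any log line reports an empty DHCP pool."""
    return any(EXHAUSTED.search(line) for line in log_lines)

sample = [
    "Oct  9 12:00:01 dnsmasq-dhcp[123]: DHCPDISCOVER(eth0) 52:54:00:aa:bb:cc",
    "Oct  9 12:00:05 dnsmasq-dhcp[123]: "
    "DHCPDISCOVER(eth0) 52:54:00:aa:bb:cc no address available",
]
print(pool_exhausted(sample))  # True
```

In practice this would tail /var/log/dnsmasq.log and raise a nailgun notification instead of printing.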

Changed in fuel:
status: New → Triaged
tags: added: scale
Changed in fuel:
importance: Undecided → High
Changed in fuel:
assignee: Matthew Mosesohn (raytrac3r) → Fuel Library Team (fuel-library)
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Fuel Python Team (fuel-python)
Revision history for this message
Dima Shulyak (dshulyak) wrote :

This week we removed the static pool from Nailgun, so the effective pool of IPs is almost a full /24 (by default, of course).
That should be enough for any deployment we support right now; if you plan to go beyond a /24 worth of nodes, consider
using multiple network groups with a separate network for each group.

I will mark this as Incomplete; please check whether current master fulfils your requirements.

As for the discussion above:
- Nailgun/Astute will not monitor dnsmasq leases; this problem is similar to disk-space monitoring and should be addressed properly
- It is quite easy to validate at deployment/provisioning time how many IPs are left in the admin network, and we could add a notification or similar
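The deploy-time check suggested above could be as simple as counting leases inside the configured range. A sketch (`free_ips` is a hypothetical helper; it assumes the standard dnsmasq.leases line format "<expiry> <mac> <ip> <hostname> <client-id>"):

```python
from ipaddress import ip_address

def free_ips(range_start, range_end, lease_lines):
    """Count unleased addresses in [range_start, range_end]."""
    start = int(ip_address(range_start))
    end = int(ip_address(range_end))
    leased = set()
    for line in lease_lines:
        parts = line.split()
        if len(parts) >= 3:  # expiry, mac, ip, ...
            ip = int(ip_address(parts[2]))
            if start <= ip <= end:
                leased.add(ip)
    return (end - start + 1) - len(leased)

leases = ["1412853180 52:54:00:aa:bb:cc 10.20.0.3 node-1 *"]
print(free_ips("10.20.0.3", "10.20.0.254", leases))  # 251
```

Comparing this number against the node count before provisioning would let Nailgun raise the warning the reporter asks for.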

Changed in fuel:
status: Triaged → Incomplete
Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

We believe it shouldn't reproduce based on the recent fix with DHCP pools and proper administrator planning. Lowering priority to medium and moving to 6.1.

Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Matthew Mosesohn (raytrac3r)
milestone: 6.0 → 6.1
importance: High → Medium
Changed in fuel:
status: Incomplete → Invalid
Revision history for this message
Oleksiy Molchanov (omolchanov) wrote :

This bug was Incomplete for more than 4 weeks. We cannot investigate it further, so we are setting the status to Invalid. If you think this is not correct, please feel free to provide the requested information and reopen the bug, and we will look into it further.
