[SRU] dnsmasq fails at leasing issues when using vlan mode

Reported by Chuck Short on 2012-05-31
This bug affects 5 people
Affects           Status  Importance  Assigned to  Milestone
dnsmasq (Ubuntu)          Medium      Unassigned
Precise                   Medium      Chuck Short

Bug Description

** Issue **

There is an issue with the way nova uses dnsmasq in VLAN mode. It starts
up a single copy of dnsmasq for each vlan on the network host (or on
every host in multi_host mode). The problem is in the way that dnsmasq
binds to an ip address and port[2]. Both copies can respond to broadcast
packets, but unicast packets can only be answered by one of the copies.

In nova this means that guests from only one project will get responses
to their unicast dhcp renew requests. Unicast requests from guests in
other projects get ignored. What happens next depends on
the guest OS. Linux generally will send a broadcast packet out after
the unicast fails, and so the only effect is a small (tens of ms) hiccup
while the interface is reconfigured. It can be much worse than that,
however. I have seen cases where Windows just gives up and ends up with
a non-configured interface.
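
For reference, a renewal of the kind described above is a DHCPREQUEST sent unicast to the server, with the client's current address in ciaddr and no server-identifier or requested-address option (RFC 2131, RENEWING state). The following is only a minimal sketch of building such a packet, assuming an Ethernet client; the function name and the example values are illustrative and do not come from nova or dnsmasq.

```python
import ipaddress
import struct

DHCP_MAGIC_COOKIE = b"\x63\x82\x53\x63"
DHCPREQUEST = 3


def build_renew_request(client_ip: str, mac: str, xid: int) -> bytes:
    """Build a DHCPREQUEST as a client in the RENEWING state would send it."""
    chaddr = bytes.fromhex(mac.replace(":", ""))
    header = struct.pack(
        "!BBBBIHH4s4s4s4s16s64s128s",
        1,            # op: BOOTREQUEST
        1,            # htype: Ethernet
        6,            # hlen: length of a MAC address
        0,            # hops
        xid,          # transaction id chosen by the client
        0,            # secs
        0,            # flags: broadcast bit clear
        ipaddress.IPv4Address(client_ip).packed,  # ciaddr: currently leased address
        b"\0" * 4,    # yiaddr
        b"\0" * 4,    # siaddr
        b"\0" * 4,    # giaddr
        chaddr,       # chaddr: struct zero-pads this to 16 bytes
        b"",          # sname: zero-padded to 64 bytes
        b"",          # file: zero-padded to 128 bytes
    )
    # Option 53 (message type) followed by the end option (255).
    options = bytes([53, 1, DHCPREQUEST, 255])
    return header + DHCP_MAGIC_COOKIE + options


packet = build_renew_request("10.0.0.5", "fa:16:3e:00:00:01", xid=0x12345678)
print(len(packet))
```

A guest sends this from UDP port 68 directly to the server's address on port 67. That unicast delivery is exactly the case in which only one of several dnsmasq copies sharing the port receives the packet.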

This bug was first noticed by some users of openstack who rolled their
own fix. Basically, on Linux, if you set the SO_BINDTODEVICE socket
option, it will allow different daemons to share the port and respond to
unicast packets, as long as they listen on different interfaces. I
managed to communicate with Simon Kelley, the maintainer of dnsmasq, and
he has integrated a fix[3] for the issue in the current version[1] of
dnsmasq.

[3] http://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commitdiff;h=9380ba70d67db6b69f817d8e318de5ba1e990b12
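
To illustrate the SO_BINDTODEVICE behaviour described above, here is a minimal sketch of opening a UDP socket that shares the wildcard address and port with other daemons but only receives packets arriving on one interface. It is not the dnsmasq patch itself (that is C code in the commit linked above); the helper names are hypothetical, and the fallback value 25 for the constant is the usual Linux value, used only if the Python build does not export it.

```python
import socket

# Use the platform constant if available; 25 is the usual Linux value.
SO_BINDTODEVICE = getattr(socket, "SO_BINDTODEVICE", 25)


def device_optval(ifname: str) -> bytes:
    """Encode an interface name as a NUL-terminated option value."""
    raw = ifname.encode("ascii")
    if not 0 < len(raw) < 16:  # IFNAMSIZ is 16, including the trailing NUL
        raise ValueError(f"invalid interface name: {ifname!r}")
    return raw + b"\0"


def open_dhcp_socket(ifname: str, port: int = 67) -> socket.socket:
    """Open a UDP socket on the wildcard address, restricted to one interface.

    Binding to port 67 and setting SO_BINDTODEVICE normally require root.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Let several daemons bind the same address and port.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        # Deliver only packets that arrived on `ifname` to this socket.
        sock.setsockopt(socket.SOL_SOCKET, SO_BINDTODEVICE, device_optval(ifname))
        sock.bind(("", port))
    except OSError:
        sock.close()
        raise
    return sock
```

With one such socket per VLAN interface (for example `open_dhcp_socket("vlan100")` and `open_dhcp_socket("vlan101")`, both interface names being made up here), each daemon should see the unicast renewals from its own VLAN instead of all of them landing on a single socket.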

** Development Fix **

This has been fixed in quantal with the newer version of dnsmasq.

** Stable Fix **

I have backported the patch which fixes this issue and attached the debdiff and the build log.

** Test Case **

1. Install openstack with vlan mode.
2. Watch instances lose their IP addresses.

** Regression Potential **

Minimal; most installations don't use this type of networking.

Scott Moser (smoser) wrote :

This looks like something we should pull in.
Since Ubuntu carries the unmodified Debian package, and the Debian maintainer is also the upstream maintainer, we should probably let the quantal package get synced from Debian. Then we can patch the 12.04 Ubuntu version in an SRU.

@Simon,
  If you're reading this, do you have plans for a 2.62 release and a subsequent 2.62-1 upload soon?

Changed in dnsmasq (Ubuntu):
importance: Undecided → Medium
status: New → Triaged
Scott Moser (smoser) on 2012-05-31
Changed in dnsmasq (Ubuntu Precise):
status: New → Triaged
importance: Undecided → Medium

On 31/05/12 14:57, Scott Moser wrote:
> This looks like something we should pull in.
> Since Ubuntu carries the unmodified Debian package, and the Debian maintainer is also the upstream maintainer, we should probably let the quantal package get synced from Debian. Then we can patch the 12.04 Ubuntu version in an SRU.
>
> @Simon,
> If you're reading this, do you have plans for a 2.62 release and a subsequent 2.62-1 upload soon?

I do. There are a few nasty bugs in 2.61 in the new DHCPv6 and router
advertisement code; I plan to release 2.62 to address these in the next
few days.

Cheers,

Simon.

James Page (james-page) on 2012-05-31
Changed in dnsmasq (Ubuntu Precise):
milestone: none → ubuntu-12.04.1
Thierry Carrez (ttx) wrote :

2.62 is in Quantal

Changed in dnsmasq (Ubuntu):
status: Triaged → Fix Released

Hi,
Thanks for your work. It is very bad to have no sound in this version of Ubuntu (kernel 3.2.0-25, and also version 12.10 (Quantal)). The problem appeared with the update from 3.2.0-24 to 3.2.0-25; 3.2.0-25 seems to leave out ALSA.

For the record:
Matching subscriptions: No Audio after update kernel 3 2 025 in Ubuntu 12 04 32bits. Good in kernel 3 2 024 ! Alsa 1 025 not compiled in kernel 3 2 025. It's the same problem with ubuntu 12 10 alpha I tried today too !(but kernel 3 4 ...). A+
> https://bugs.launchpad.net/bugs/1006898

Some more information:
In the console, run this line:
"cat /proc/asound/version"
If that gives you something like this, then the problem lies elsewhere:
Advanced Linux Sound Architecture Driver Version 1.0.25.
Compiled on Mar 9 2012 for kernel 3.2.0-25-generic PAE
If it returns only the line:
Advanced Linux Sound Architecture Driver Version 1.0.24.
then there is no need to look any further: your sound cannot work, because ALSA is not compiled for the kernel you are running at that moment.

Sorry, the above was in French, but I get only "Advanced Linux Sound Architecture Driver Version 1.0.24." when I type "cat /proc/asound/version", even though the installed sound driver is version 1.0.25!

Best regards.

Guy Roche

Chris Halse Rogers (raof) wrote :

This seems like an important bug to fix, but I have reservations about changing dnsmasq's behaviour in a stable update. When you say ‘most installations don't use this type of networking’, what do you mean by ‘most’, is it plausible that someone has relied on this behaviour, and if someone had relied on this behaviour how would this change affect them?

Could this be worked around more safely in openstack?

Christian Parpart (trapni) wrote :

Hey,

Sorry, with "most" I meant "the documentation recommends using VlanManager" (that is, VLAN mode) for networking.

In any case, IMHO you cannot "rely" on such behaviour, because it makes absolutely no sense for hosts that send a DHCPREQUEST to not receive their DHCPACK.

Christian Parpart (trapni) wrote :

> Could this be more safely worked-around in openstack?

I forgot to comment on this one. I am no OpenStack expert; however, OpenStack nova-network relies on dnsmasq to hand out IP addresses via DHCP to its (KVM/...) instances, and OpenStack supports both simple networking (without VLANs) and VLAN networking. So I don't see how OpenStack could work around this, except by using different software than dnsmasq (something that actually works) or by not using VLANs at all.

Chuck Short (zulcss) wrote :

Raof,

By "most" I mean that we don't recommend that people use VLAN mode, but some people do use it, and they cannot use it with the dnsmasq in precise without this fix.

Regards
chuck

Chris Halse Rogers (raof) wrote :

Well, what I meant was: the code that you're touching is in the dnsmasq-base package, and dnsmasq-base is installed on *all* Ubuntu systems, as a dependency of network-manager. It seems that the worst-case regression potential is that we break DNS on all Ubuntu systems, which would be bad :)

lxc and libvirt have run into the same problems, and they added their network interfaces to the global dnsmasq blacklist, which at least means that the behaviour is only changed for users who install lxc or libvirt.

Christian Parpart (trapni) wrote :

And that means what?

Will you (Ubuntu) ignore the bug and leave the patching up to the libvirt/lxc Ubuntu users?

I am confused. :-)

Steve Langasek (vorlon) wrote :

Chuck, please put SRU information in the bug description, not in a comment - it becomes hard to find this information when there are a dozen more comments from testers.

description: updated
Steve Langasek (vorlon) wrote :

Please also complete the test case with explicit information about how users can verify the *fix* for this bug.

Steve Langasek (vorlon) wrote :

I'm afraid I also don't understand this problem statement:

> There is an issue with the way nova uses dnsmasq in VLAN mode. It starts
> up a single copy of dnsmasq for each vlan on the network host (or on
> every host in multi_host mode). The problem is in the way that dnsmasq
> binds to an ip address and port[2]. Both copies can respond to broadcast
> packet, but unicast packets can only be answered by one of the copies.

What exactly is the network configuration that allows this to happen? Does the host have multiple vlan interfaces using the same IP address?

That's the only scenario I see in which SO_BINDTODEVICE should make a difference; but I don't understand why you would be using the same IP address on multiple interfaces, virtual or otherwise.

Steve Langasek (vorlon) wrote :

... and now I've reviewed the debdiff, and found it to not match the upstream commit. This part of the patch to src/network.c is missing:

@@ -254,6 +261,7 @@ static int iface_allowed(struct irec **irecp, int if_index,
       iface->addr = *addr;
       iface->netmask = netmask;
       iface->tftp_ok = tftp_ok;
+      iface->dhcp_ok = dhcp_ok;
       iface->mtu = mtu;
       iface->dad = dad;
       iface->done = 0;

This means the value of dhcp_ok on each interface is *undefined*, and this SRU would cause dnsmasq to *randomly* stop doing DHCP on configured interfaces.

Rejecting from the queue.

Steve Langasek (vorlon) wrote :

Before the SRU team will reconsider an SRU for this, based on the above I would also expect to see a regression test plan that accounts for making sure dnsmasq continues to work correctly in configurations other than the openstack one.

Chuck Short (zulcss) wrote :

I'll fix this up and do as requested.

Changed in dnsmasq (Ubuntu Precise):
assignee: nobody → Stéphane Graber (stgraber)
Stéphane Graber (stgraber) wrote :

Assigned this bug to myself when going through the buglist as it was in my usual package list, though based on past comments, I'm now re-assigning to Chuck as he's more familiar with the issue.

I'll be interested in looking at the diff before it gets pushed to our users though. As Steve said, we have dnsmasq running on most Ubuntu systems (all desktops have it by default) and we really don't want to risk a regression for these.

Changed in dnsmasq (Ubuntu Precise):
assignee: Stéphane Graber (stgraber) → Chuck Short (zulcss)
James Page (james-page) on 2012-08-09
Changed in dnsmasq (Ubuntu Precise):
milestone: ubuntu-12.04.1 → precise-updates
Luc (gmi68745) wrote :

Was this update to dnsmasq released in 12.04.1 ?

root@ubuntu:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 12.04.1 LTS
Release: 12.04
Codename: precise
root@ubuntu:~# apt-cache policy dnsmasq
dnsmasq:
  Installed: 2.59-4
  Candidate: 2.59-4
  Version table:
 *** 2.59-4 0
        500 http://us.archive.ubuntu.com/ubuntu/ precise/universe amd64 Packages
        100 /var/lib/dpkg/status

Stéphane Graber (stgraber) wrote :

No

This is hitting a lot of openstack users who chose 12.04 due to the announcement of backporting openstack to precise for 3 years: https://wiki.ubuntu.com/ServerTeam/CloudArchive

We fall into that category. It's pretty much impossible to use stock versions of dnsmasq and openstack in 12.04. This adds a significant burden to our team. This is a pretty serious problem that needs to find its way into 12.04 (precise), sooner rather than later, especially since it affects the recommended openstack essex configuration (vlan). It's causing our Windows Server instances to fatally lose their IP address config. This essentially makes our cloud unpredictably unstable, with instances coming and going randomly. Yes, there are potential workarounds (at least for openstack), but they're ugly:

https://lists.launchpad.net/openstack/msg11696.html

nova.conf
# release leases immediately on terminate
force_dhcp_release=true
# one week lease time
dhcp_lease_time=604800
# two week disassociate timeout
fixed_ip_disassociate_timeout=1209600
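
The values in that fragment are in seconds. As a quick check of the arithmetic, and of why a long lease hides the problem, the sketch below derives them from durations and computes when a client would first try to renew, assuming the RFC 2131 default of half the lease time for T1 (a server or client may use a different value).

```python
from datetime import timedelta

# Values from the nova.conf fragment, expressed as durations.
dhcp_lease_time = int(timedelta(weeks=1).total_seconds())
fixed_ip_disassociate_timeout = int(timedelta(weeks=2).total_seconds())

# RFC 2131 default: the client first tries to renew (T1) at half the lease.
renew_after = timedelta(seconds=dhcp_lease_time * 0.5)

print(dhcp_lease_time)                # 604800
print(fixed_ip_disassociate_timeout)  # 1209600
print(renew_after)                    # 3 days, 12:00:00
```

Under that assumption a guest attempts a unicast renewal only every three and a half days, so the failure is much rarer, but it is not removed.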

Again, this needs to be fixed considering this is LTS, and considering that precise is supposed to be a solid foundation upon which to build an openstack cloud.

Soren Hansen (soren) wrote :

Chuck, this is still assigned to you. Is it going anywhere?

Chuck Short (zulcss) wrote :

No, we are probably going to backport it to the cloud archive.

Did this make it into 12.04.2 LTS? We still experience breakage here, and must manually apply a newer dnsmasq out of band (which causes all sorts of other administration burdens). I don't see it in cloud-archive either according to the most recent comment 1 month ago.
