nova-network gets release_fixed_ip events from someplace, but the database still keeps them associated with instances

Bug #1026621 reported by Nick Moffitt
40
This bug affects 6 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Triaged
High
Unassigned
nova (Ubuntu)
Triaged
Medium
Unassigned

Bug Description

We've been seeing a lot of instances simply vanish from the network. Usually people have been willing to work around this by simply rebooting or re-creating their instances, but it's troubling for long-running instances (especially those that have volumes associated).

Here's the relevant bit of nova-network.log for one of these:

    2012-07-16 14:06:32 DEBUG nova.rpc.amqp [-] received {u'_context_roles': [u'admin'], u'_context_request_id': u'req-d0905711-c4d1-4452-a3b2-46815d1983d7', u'_context_read_deleted': u'no', u'args': {u'address': u'10.55.60.141'}, u'_context_auth_token': '<SANITIZED>', u'_context_is_admin': True, u'_context_project_id': None, u'_context_timestamp': u'2012-07-16T14:06:32.169100', u'_context_user_id': None, u'method': u'release_fixed_ip', u'_context_remote_address': None} from (pid=493) _safe_log /usr/lib/python2.7/dist-packages/nova/rpc/common.py:160
    2012-07-16 14:06:32 DEBUG nova.rpc.amqp [req-d0905711-c4d1-4452-a3b2-46815d1983d7 None None] unpacked context: {'user_id': None, 'roles': [u'admin'], 'timestamp': '2012-07-16T14:06:32.169100', 'auth_token': '<SANITIZED>', 'remote_address': None, 'is_admin': True, 'request_id': u'req-d0905711-c4d1-4452-a3b2-46815d1983d7', 'project_id': None, 'read_deleted': u'no'} from (pid=493) _safe_log /usr/lib/python2.7/dist-packages/nova/rpc/common.py:160
    2012-07-16 14:06:32 DEBUG nova.network.manager [req-d0905711-c4d1-4452-a3b2-46815d1983d7 None None] Released IP |10.55.60.141| from (pid=493) release_fixed_ip /usr/lib/python2.7/dist-packages/nova/network/manager.py:1260

Then the dhcpbridge shows it being revoked:

    2012-07-16 14:04:29 DEBUG nova.dhcpbridge [-] Called 'old' for mac 'fa:16:3e:11:c5:37' with ip '10.55.60.141' from (pid=23699) main /usr/bin/nova-dhcpbridge:113
    2012-07-16 14:06:32 DEBUG nova.dhcpbridge [-] Called 'del' for mac 'fa:16:3e:11:c5:37' with ip '10.55.60.141' from (pid=24946) main /usr/bin/nova-dhcpbridge:113

Is there any way we can find out what might have placed the release_fixed_ip event on the message queue? There doesn't seeem to be any other mention of the IP in the nova logs on any of our systems.

Revision history for this message
Nick Moffitt (nick-moffitt) wrote :

I should probably clarify that there are no API requests matching this release_fixed_ip event, and the instance owner explicitly did not want the event to take place.

Revision history for this message
Andrew Glen-Young (aglenyoung) wrote :

We're running nova with nova.network.manager.FlatDHCPManager. Grepping the nova-network machine's syslog shows that the last DHCP lease request was on Jul 16 14:04:29

Sample from the log file:

Jul 16 14:04:29 dziban dnsmasq-dhcp[30249]: DHCPREQUEST(br100) 10.55.60.141 fa:16:3e:11:c5:37
Jul 16 14:04:29 dziban dnsmasq-dhcp[30249]: DHCPACK(br100) 10.55.60.141 fa:16:3e:11:c5:37 server-13282

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nova (Ubuntu):
status: New → Confirmed
Revision history for this message
Adam Gandelman (gandelman-a) wrote :

Andrew-

Can't seem to reproduce this locally. Its a long shot, but I wonder if this might be related to the same problems reported in Bug #1006898. Are there multiple instances of dnsmasq running? I know you're not configured for VLANing, but I'm interested to know if the proposed fix there solves this issue since the final symptoms seem awfully similar.

Revision history for this message
Andrew Glen-Young (aglenyoung) wrote :

@Adam:

Bug #1006898 seems unlikely IMHO. We are not running with 'multi_host' nor with VLANS. So there is one dnsmasq process.

The following is just conjecture as the event has rotated out of the logs and I have not done any further testing. These events happen rather infrequently, but are a significant annoyance to be noticed.

The dnsmasq(8) man page implies that dnsmasq will block while running the dhcp-script. Ordinarily this would not be a problem for individual leases, however if the daemon receives a HUP signal it will parse the current lease database and run the dhcp-script with the 'old' action for each lease.

Assuming the above is correct and we have a dnsmasq lease file of 200 records and a runtime of 0.15s per invocation¹ of dhcp-script, then the daemon could be blocking for up to 30 seconds (or longer). If a host was renewing an expired lease at the time, it is conceivable that the lease may expire.

The default lease length is 120s which seems to correlate with the logs provided above. I am not sure if any of this helps.

[1]: Non-significant mean run time determined by running 'time /usr/bin/nova-dhcpbridge' ten times.

Revision history for this message
Thierry Carrez (ttx) wrote :

Keeping on the radar as incomplete...
What Nova version was this happening against ?

Changed in nova:
status: New → Incomplete
Revision history for this message
Thierry Carrez (ttx) wrote :

We cannot solve the issue you reported without more information. Could you please provide the requested information ?

Revision history for this message
gatuus (gatuus) wrote :
Download full text (3.7 KiB)

This is also happening to me EXACTLY as you...

How can I reproduce this bug? This way:

I have a Windows 2008 Raw Image UP in my openstack cloud (ESSEX).
(with a flatdhcp multi-host configuration)

When I create a New instance with this W2008 image everything comes up smoothly , Everything works as it should.. but 5 - 10 minutes later.. .something happens...:

Windows loses Connectivity via rdesktop. I can see the instance running on the compute machine. And I can connect to it via VNC.

After analyzing Windows LOGs I discovered that after Windows Synchronizes the clock with the network clock Windows disables and enables the local network connection (nic).
But it is done very very fast.

 Here, this is when my windows instance loses the IP!! The DHCP Ip address is set to: 169.254.141.59
Why?!? I don't know

I don't know if restarting the network interface is a normal process on the windows world but.. Doing this manually on windows: it has(after 20-100 tries) the same effect, DHCP address gets lost.

and looking at the /var/log/syslog file I have exactly the same error as you:

Feb 15 12:54:49 osqcomp06 2013-02-15 12:54:49 DEBUG nova.dhcpbridge [-] Called 'del' for mac 'fa:16:3e:5d:88:fe' with ip '10.5.96.21' from (pid=26734) main /usr/bin/nova-dhcpbridge:113
Feb 15 12:54:49 osqcomp06 2013-02-15 12:54:49 DEBUG nova.rpc.amqp [req-060372df-c625-491e-8a59-7b037b2f7c2a None None] Making asynchronous cast on network.osqcomp06.kloudopen.com... from (pid=26734) cast /usr/lib/python2.7/dist-packages/nova/rpc/amqp.py:351
Feb 15 12:54:49 osqcomp06 2013-02-15 12:54:49 DEBUG nova.rpc.amqp [req-060372df-c625-491e-8a59-7b037b2f7c2a None None] Pool creating new connection from (pid=26734) create /usr/lib/python2.7/dist-packages/nova/rpc/amqp.py:60
Feb 15 12:54:49 osqcomp06 2013-02-15 12:54:49 INFO nova.rpc.common [req-060372df-c625-491e-8a59-7b037b2f7c2a None None] Connected to AMQP server on osqctrl01:5672
Feb 15 12:54:49 osqcomp06 2013-02-15 12:54:49 DEBUG nova.rpc.amqp [-] received {u'_context_roles': [u'admin'], u'_context_request_id': u'req-060372df-c625-491e-8a59-7b037b2f7c2a', u'_context_read_deleted': u'no', u'args': {u'address': u'10.5.96.21'}, u'_context_auth_token': '<SANITIZED>', u'_context_is_admin': True, u'_context_project_id': None, u'_context_timestamp': u'2013-02-15T18:54:49.242450', u'_context_user_id': None, u'method': u'release_fixed_ip', u'_context_remote_address': None} from (pid=452) _safe_log /usr/lib/python2.7/dist-packages/nova/rpc/common.py:160
Feb 15 12:54:49 osqcomp06 2013-02-15 12:54:49 DEBUG nova.rpc.amqp [req-060372df-c625-491e-8a59-7b037b2f7c2a None None] unpacked context: {'user_id': None, 'roles': [u'admin'], 'timestamp': '2013-02-15T18:54:49.242450', 'auth_token': '<SANITIZED>', 'remote_address': None, 'is_admin': True, 'request_id': u'req-060372df-c625-491e-8a59-7b037b2f7c2a', 'project_id': None, 'read_deleted': u'no'} from (pid=452) _safe_log /usr/lib/python2.7/dist-packages/nova/rpc/common.py:160
Feb 15 12:54:49 osqcomp06 2013-02-15 12:54:49 DEBUG nova.network.manager [req-060372df-c625-491e-8a59-7b037b2f7c2a None None] Released IP |10.5.96.21| from (pid=452) release_fixed_ip /usr/lib/python2.7/dist-packages/n...

Read more...

Revision history for this message
gatuus (gatuus) wrote :

Forgot to tell..
When restarting the network interface from windows via VNC, the interface recovers its IP.

(It affects me everytime somebody creates a new Windows instance)

Changed in nova:
status: Incomplete → Confirmed
Revision history for this message
Bill Rich (bill-rich) wrote :

I am running into the same problem. It only seems to occur with Windows 2008 instances. I am running nova-network 2012.2.4 with flatdhcp, but I have seen the same thing with earlier versions of nova-net.

Revision history for this message
Bill Rich (bill-rich) wrote :

My issue is the timezone changing on my Win2k8 instances. When an instance boots up, it is on UTC. It gets a DHCP lease in this timezone. After a few minutes, the timezone is then changed to PDT and the time Windows thinks it obtained the lease gets screwed up (~136 years in the past). DHCP then gives up on renewing the lease until the instance is rebooted, or it is renewed manually. Once Windows misses renewing the lease, nova marks the lease as released in the db (and nova-dhcpbridge generates the "del" message).

Revision history for this message
Jay Farschman (jfarschman) wrote :

We are seeing the same thing with Windows2012. Lease times are setting to the year 1877.

Revision history for this message
Jay Farschman (jfarschman) wrote :

We appear to have solved this problem for ourselves by changing the way windows deals with time. Remember, the symptom is we are obtaining a lease from August in 1877. Seems like this is caused by windows using local time rather than UTC. The solution is to change windows to UTC.

Changed the RegKey HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\TimeZoneInformation and created a new DWORD called RealTimeIsUniversal and set it to one 1.

Now, I believe there may be a problem with this when we enter and leave daylight savings, but at this point I have a system that boots and keeps it's IP address. I may be using puppet to change RealTimeIsUniversal back to 0 after everything is working.

Comments?

Revision history for this message
Shawn Duex (shawn-5) wrote :

Hey Jay,

Funny you posted this. I was about to post the same. after a bunch of testing and comparing how CLoudStack deals with the Hardware Clock for Windows with how Openstack is configured by default we also noticed that the Nova VM template uses a hardware clock with a UTC offset. This is a standard for Linux but not for Windows. The reg key you listed does allow Windows to deal with a UTC offset hardware clock. In my looking into this I do see that this change is not without issue, as some things will not work correctly. But for the most part on server OS I think it should be fine. CloudStack manages this issue by having a OS_Type for each image and for images that are windows it local time rather that the UTC offset for the hardware clock. This then does not require an image side change for Windows images. Currently we do not see an easy way to implement the same kind of solution as the best we can do is set advanced metadata properties on an image in glance, but then having Nova key off this and use a Windows specific template would require some additional work. I think another aspect of this is that the DHCP leases are set for 120 seconds on OpenStack by default. I think this exacerbates the issue because the lease time is shorter that the time offset. Does anyone know why the DHCP lease times are so short for OpenStack? CloudStack configures the leases to not expire which would also make this issue not happen.

Michael Still (mikal)
tags: added: libvirt
Revision history for this message
Michael Still (mikal) wrote :

Hi! There seem to be two issues here in the one bug -- DHCP refreshes
are causing problems, and pain with windows instances because of how
they handle timezones and that causing extra DHCP refreshes. I'll
talk about each of those separately.

Windows timezones
=================

I'm sorry to hear you're having pain with Windows instances.
Unfortunately I don't know a lot about windows, but it sounds as though
if we could change the bios timezone for Windows instances to the
timezone of the compute node that would alleviate your issue. Is that
correct?

Looking at the code, each instance has an os_type field, which is
inherited from the image the instance is based off (there is an image
property called os_type). Xen expects this value to be "windows" for
windows instances. So, libvirt could do the same thing here and treat
the timezone for instances with os_type == "windows" differently.

I'm going to create a new bug to track this windows issue, so that I
can leave this bug focussed on the DHCP timeout issue. This isn't
because Windows is less important as a guest, but because I don't want
it to be missed in the DHCP discussion that's happening here.

DHCP refreshes
==============

It also sounds as if we could bear in increase the default DHCP interval,
although you can configure that yourself as well. It might also make
sense to ping an IP before we delete its lease.

Cheers,
Michael

Changed in nova:
importance: Undecided → High
status: Confirmed → Triaged
Revision history for this message
Michael Still (mikal) wrote :

The bug for the windows timezone issues is 1231254.

Revision history for this message
Shawn Duex (shawn-5) wrote :
Download full text (4.0 KiB)

Michael,

Thank you for taking a look at this and creating the new bug for changing the way libvirt virtualizes the hardware clock for Windows vs. Linux. I believe this to be the root cause for the bug defined here because of the way Windows handles DHCP renewals.

===DHCP===

As for DHCP, I am not sure anything is wrong with having a very short DHCP lease time of 120 sec as the default for nova. I was more curious what the reasoning behind this was. My understanding is that the DNSMasq default is 12 hours. I found this message also asking about the nova default - https://lists.launchpad.net/openstack/msg24200.html

I also wanted to say that the DHCP lease time of 120 seconds does not really exacerbate the issue as I stated above. What is does is make the issue present faster.

==Windows DHCP lease renewal behavior===

This is my understanding of why we are seeing a lose of IP address for Windows 2008R2 (my reproducible issue)

1. Windows instance boots up on KVM with the hardware clock set to UTC (expects localtime)

2. The instance gets it's DHCP lease. (2 min) it logs the obtained time and expire time for the lease
Ethernet adapter Local Area Connection:

   Connection-specific DNS Suffix . : novalocal
   Description . . . . . . . . . . . : Red Hat VirtIO Ethernet Adapter
   Physical Address. . . . . . . . . : FA-16-3E-23-D2-8E
   DHCP Enabled. . . . . . . . . . . : Yes
   Autoconfiguration Enabled . . . . : Yes
   Link-local IPv6 Address . . . . . : fe80::d420:7346:d258:ffa2%11(Preferred)
   IPv4 Address. . . . . . . . . . . : 192.168.200.25(Preferred)
   Subnet Mask . . . . . . . . . . . : 255.255.255.0
   Lease Obtained. . . . . . . . . . : Thursday, September 26, 2013 5:17:06 PM
   Lease Expires . . . . . . . . . . : Thursday, September 26, 2013 5:47:06 PM
   Default Gateway . . . . . . . . . : 192.168.200.1
   DHCP Server . . . . . . . . . . . : 192.168.200.1
   DHCPv6 IAID . . . . . . . . . . . : 251270718
   DHCPv6 Client DUID. . . . . . . . : 00-01-00-01-19-D6-86-9A-FA-16-3E-23-D2-8E

   DNS Servers . . . . . . . . . . . : 192.168.200.1
   NetBIOS over Tcpip. . . . . . . . : Enabled

3. NTP time sync happens, in my timezone PST (windows default), this sets the instance clock back 7 hours and updates the DHCP lease information as follows
Ethernet adapter Local Area Connection:

   Connection-specific DNS Suffix . : novalocal
   Description . . . . . . . . . . . : Red Hat VirtIO Ethernet Adapter
   Physical Address. . . . . . . . . : FA-16-3E-23-D2-8E
   DHCP Enabled. . . . . . . . . . . : Yes
   Autoconfiguration Enabled . . . . : Yes
   Link-local IPv6 Address . . . . . : fe80::d420:7346:d258:ffa2%11(Preferred)
   IPv4 Address. . . . . . . . . . . : 192.168.200.25(Preferred)
   Subnet Mask . . . . . . . . . . . : 255.255.255.0
   Lease Obtained. . . . . . . . . . : Monday, August 20, 1877 10:48:50 AM
   Lease Expires . . . . . . . . . . : Thursday, September 26, 2013 10:47:09 AM
   Default Gateway . . . . . . . . . : 192.168.200.1
   DHCP Server . . . . . . . . . . . : 192.168.200.1
   DHCPv6 IAID . . . . . . . . . . . : 251270718
   DHCPv6 Client DUID. . . . . . . . : 00-01-00-01-19-D6-86-9A-FA-16-3E-23-D2-8E

   DNS Servers . ...

Read more...

Revision history for this message
Shawn Duex (shawn-5) wrote :

===DHCP===

While I do agree that It would make sense to ping an IP before its lease is deleted as a extra safety measure. This would not have help this specific Windows issue as the IP is released by the instance.

sark2012 (386488135-o)
description: updated
James Page (james-page)
Changed in nova (Ubuntu):
status: Confirmed → Triaged
importance: Undecided → Medium
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.