Bug #1026621 “nova-network gets release_fixed_ip events from som...” : Bugs : nova package : Ubuntu

Revision history for this message

Nick Moffitt (nick-moffitt) wrote on 2012-07-19:

#1

I should probably clarify that there are no API requests matching this release_fixed_ip event, and the instance owner explicitly did not want the event to take place.

Revision history for this message

Andrew Glen-Young (aglenyoung) wrote on 2012-07-19:

#2

We're running nova with nova.network.manager.FlatDHCPManager. Grepping the nova-network machine's syslog shows that the last DHCP lease request was on Jul 16 14:04:29

Sample from the log file:

Jul 16 14:04:29 dziban dnsmasq-dhcp[30249]: DHCPREQUEST(br100) 10.55.60.141 fa:16:3e:11:c5:37
Jul 16 14:04:29 dziban dnsmasq-dhcp[30249]: DHCPACK(br100) 10.55.60.141 fa:16:3e:11:c5:37 server-13282

Revision history for this message

Launchpad Janitor (janitor) wrote on 2012-07-19:

#3

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nova (Ubuntu):
status:	New → Confirmed

Revision history for this message

Adam Gandelman (gandelman-a) wrote on 2012-08-07:

#4

Andrew-

Can't seem to reproduce this locally. Its a long shot, but I wonder if this might be related to the same problems reported in Bug #1006898. Are there multiple instances of dnsmasq running? I know you're not configured for VLANing, but I'm interested to know if the proposed fix there solves this issue since the final symptoms seem awfully similar.

Revision history for this message

Andrew Glen-Young (aglenyoung) wrote on 2012-08-08:

#5

@Adam:

Bug #1006898 seems unlikely IMHO. We are not running with 'multi_host' nor with VLANS. So there is one dnsmasq process.

The following is just conjecture as the event has rotated out of the logs and I have not done any further testing. These events happen rather infrequently, but are a significant annoyance to be noticed.

The dnsmasq(8) man page implies that dnsmasq will block while running the dhcp-script. Ordinarily this would not be a problem for individual leases, however if the daemon receives a HUP signal it will parse the current lease database and run the dhcp-script with the 'old' action for each lease.

Assuming the above is correct and we have a dnsmasq lease file of 200 records and a runtime of 0.15s per invocation¹ of dhcp-script, then the daemon could be blocking for up to 30 seconds (or longer). If a host was renewing an expired lease at the time, it is conceivable that the lease may expire.

The default lease length is 120s which seems to correlate with the logs provided above. I am not sure if any of this helps.

[1]: Non-significant mean run time determined by running 'time /usr/bin/nova-dhcpbridge' ten times.

Revision history for this message

Thierry Carrez (ttx) wrote on 2012-08-22:

#6

Keeping on the radar as incomplete...
What Nova version was this happening against ?

Changed in nova:
status:	New → Incomplete

Revision history for this message

Thierry Carrez (ttx) wrote on 2012-12-13:

#7

We cannot solve the issue you reported without more information. Could you please provide the requested information ?

Revision history for this message

gatuus (gatuus) wrote on 2013-02-15:

#8

Download full text (3.7 KiB)

This is also happening to me EXACTLY as you...

How can I reproduce this bug? This way:

I have a Windows 2008 Raw Image UP in my openstack cloud (ESSEX).
(with a flatdhcp multi-host configuration)

When I create a New instance with this W2008 image everything comes up smoothly , Everything works as it should.. but 5 - 10 minutes later.. .something happens...:

Windows loses Connectivity via rdesktop. I can see the instance running on the compute machine. And I can connect to it via VNC.

After analyzing Windows LOGs I discovered that after Windows Synchronizes the clock with the network clock Windows disables and enables the local network connection (nic).
But it is done very very fast.

Here, this is when my windows instance loses the IP!! The DHCP Ip address is set to: 169.254.141.59
Why?!? I don't know

I don't know if restarting the network interface is a normal process on the windows world but.. Doing this manually on windows: it has(after 20-100 tries) the same effect, DHCP address gets lost.

and looking at the /var/log/syslog file I have exactly the same error as you:

Feb 15 12:54:49 osqcomp06 2013-02-15 12:54:49 DEBUG nova.dhcpbridge [-] Called 'del' for mac 'fa:16:3e:5d:88:fe' with ip '10.5.96.21' from (pid=26734) main /usr/bin/nova-dhcpbridge:113
Feb 15 12:54:49 osqcomp06 2013-02-15 12:54:49 DEBUG nova.rpc.amqp [req-060372df-c625-491e-8a59-7b037b2f7c2a None None] Making asynchronous cast on network.osqcomp06.kloudopen.com... from (pid=26734) cast /usr/lib/python2.7/dist-packages/nova/rpc/amqp.py:351
Feb 15 12:54:49 osqcomp06 2013-02-15 12:54:49 DEBUG nova.rpc.amqp [req-060372df-c625-491e-8a59-7b037b2f7c2a None None] Pool creating new connection from (pid=26734) create /usr/lib/python2.7/dist-packages/nova/rpc/amqp.py:60
Feb 15 12:54:49 osqcomp06 2013-02-15 12:54:49 INFO nova.rpc.common [req-060372df-c625-491e-8a59-7b037b2f7c2a None None] Connected to AMQP server on osqctrl01:5672
Feb 15 12:54:49 osqcomp06 2013-02-15 12:54:49 DEBUG nova.rpc.amqp [-] received {u'_context_roles': [u'admin'], u'_context_request_id': u'req-060372df-c625-491e-8a59-7b037b2f7c2a', u'_context_read_deleted': u'no', u'args': {u'address': u'10.5.96.21'}, u'_context_auth_token': '<SANITIZED>', u'_context_is_admin': True, u'_context_project_id': None, u'_context_timestamp': u'2013-02-15T18:54:49.242450', u'_context_user_id': None, u'method': u'release_fixed_ip', u'_context_remote_address': None} from (pid=452) _safe_log /usr/lib/python2.7/dist-packages/nova/rpc/common.py:160
Feb 15 12:54:49 osqcomp06 2013-02-15 12:54:49 DEBUG nova.rpc.amqp [req-060372df-c625-491e-8a59-7b037b2f7c2a None None] unpacked context: {'user_id': None, 'roles': [u'admin'], 'timestamp': '2013-02-15T18:54:49.242450', 'auth_token': '<SANITIZED>', 'remote_address': None, 'is_admin': True, 'request_id': u'req-060372df-c625-491e-8a59-7b037b2f7c2a', 'project_id': None, 'read_deleted': u'no'} from (pid=452) _safe_log /usr/lib/python2.7/dist-packages/nova/rpc/common.py:160
Feb 15 12:54:49 osqcomp06 2013-02-15 12:54:49 DEBUG nova.network.manager [req-060372df-c625-491e-8a59-7b037b2f7c2a None None] Released IP |10.5.96.21| from (pid=452) release_fixed_ip /usr/lib/python2.7/dist-packages/n...

This is also happening to me  EXACTLY as you...

How can I reproduce this bug? This way:

I have a  Windows 2008 Raw Image UP in my openstack cloud (ESSEX).
(with a flatdhcp multi-host configuration)

When I create a New instance with this W2008 image everything comes up smoothly , Everything works as it should.. but 5 - 10 minutes later.. .something happens...:

Windows loses Connectivity via rdesktop.  I can see the instance running on the compute machine. And I can connect to it via VNC.

After analyzing  Windows LOGs I discovered that after Windows Synchronizes the clock with the network clock Windows disables and enables the local network connection (nic).
But it is done very very fast.

Here, this is when my windows instance loses the IP!! The DHCP Ip address is set to:  169.254.141.59
Why?!?  I don't know

I don't know if restarting the network interface is a normal process on the windows world but.. Doing this manually on windows: it has(after 20-100 tries) the same effect, DHCP address gets lost.

and looking at the /var/log/syslog  file I have exactly the same error as you:

Feb 15 12:54:49 osqcomp06 2013-02-15 12:54:49 DEBUG nova.dhcpbridge [-] Called 'del' for mac 'fa:16:3e:5d:88:fe' with ip '10.5.96.21' from (pid=26734) main /usr/bin/nova-dhcpbridge:113
Feb 15 12:54:49 osqcomp06 2013-02-15 12:54:49 DEBUG nova.rpc.amqp [req-060372df-c625-491e-8a59-7b037b2f7c2a None None] Making asynchronous cast on network.osqcomp06.kloudopen.com... from (pid=26734) cast /usr/lib/python2.7/dist-packages/nova/rpc/amqp.py:351
Feb 15 12:54:49 osqcomp06 2013-02-15 12:54:49 DEBUG nova.rpc.amqp [req-060372df-c625-491e-8a59-7b037b2f7c2a None None] Pool creating new connection from (pid=26734) create /usr/lib/python2.7/dist-packages/nova/rpc/amqp.py:60
Feb 15 12:54:49 osqcomp06 2013-02-15 12:54:49 INFO nova.rpc.common [req-060372df-c625-491e-8a59-7b037b2f7c2a None None] Connected to AMQP server on osqctrl01:5672
Feb 15 12:54:49 osqcomp06 2013-02-15 12:54:49 DEBUG nova.rpc.amqp [-] received {u'_context_roles': [u'admin'], u'_context_request_id': u'req-060372df-c625-491e-8a59-7b037b2f7c2a', u'_context_read_deleted': u'no', u'args': {u'address': u'10.5.96.21'}, u'_context_auth_token': '<SANITIZED>', u'_context_is_admin': True, u'_context_project_id': None, u'_context_timestamp': u'2013-02-15T18:54:49.242450', u'_context_user_id': None, u'method': u'release_fixed_ip', u'_context_remote_address': None} from (pid=452) _safe_log /usr/lib/python2.7/dist-packages/nova/rpc/common.py:160
Feb 15 12:54:49 osqcomp06 2013-02-15 12:54:49 DEBUG nova.rpc.amqp [req-060372df-c625-491e-8a59-7b037b2f7c2a None None] unpacked context: {'user_id': None, 'roles': [u'admin'], 'timestamp': '2013-02-15T18:54:49.242450', 'auth_token': '<SANITIZED>', 'remote_address': None, 'is_admin': True, 'request_id': u'req-060372df-c625-491e-8a59-7b037b2f7c2a', 'project_id': None, 'read_deleted': u'no'} from (pid=452) _safe_log /usr/lib/python2.7/dist-packages/nova/rpc/common.py:160
Feb 15 12:54:49 osqcomp06 2013-02-15 12:54:49 DEBUG nova.network.manager [req-060372df-c625-491e-8a59-7b037b2f7c2a None None] Released IP |10.5.96.21| from (pid=452) release_fixed_ip /usr/lib/python2.7/dist-packages/nova/network/manager.py:1254

The IP is RELEASED...
Why?
I'm still looking for answers ..

Ahh and after that I have the next LOG repeated many times(at least once per minute):

Feb 15 13:32:24 osqcomp06 dnsmasq-dhcp[12632]: DHCPREQUEST(br100) 10.5.96.21 fa:16:3e:5d:88:fe 
Feb 15 13:32:24 osqcomp06 dnsmasq-dhcp[12632]: DHCPACK(br100) 10.5.96.21 fa:16:3e:5d:88:fe testx-030
Feb 15 13:32:24 osqcomp06 2013-02-15 13:32:24 DEBUG nova.dhcpbridge [-] Called 'old' for mac 'fa:16:3e:5d:88:fe' with ip '10.5.96.21' from (pid=31227) main /usr/bin/nova-dhcpbridge:113

Thanks in advance
gatuus

Revision history for this message

gatuus (gatuus) wrote on 2013-02-15:

#9

Forgot to tell..
When restarting the network interface from windows via VNC, the interface recovers its IP.

(It affects me everytime somebody creates a new Windows instance)

Davanum Srinivas (DIMS) (dims-v) on 2013-03-18

Changed in nova:
status:	Incomplete → Confirmed

Revision history for this message

Bill Rich (bill-rich) wrote on 2013-09-20:

#10

I am running into the same problem. It only seems to occur with Windows 2008 instances. I am running nova-network 2012.2.4 with flatdhcp, but I have seen the same thing with earlier versions of nova-net.

Revision history for this message

Bill Rich (bill-rich) wrote on 2013-09-21:

#11

My issue is the timezone changing on my Win2k8 instances. When an instance boots up, it is on UTC. It gets a DHCP lease in this timezone. After a few minutes, the timezone is then changed to PDT and the time Windows thinks it obtained the lease gets screwed up (~136 years in the past). DHCP then gives up on renewing the lease until the instance is rebooted, or it is renewed manually. Once Windows misses renewing the lease, nova marks the lease as released in the db (and nova-dhcpbridge generates the "del" message).

Revision history for this message

Jay Farschman (jfarschman) wrote on 2013-09-23:

#12

We are seeing the same thing with Windows2012. Lease times are setting to the year 1877.

Revision history for this message

Jay Farschman (jfarschman) wrote on 2013-09-25:

#13

We appear to have solved this problem for ourselves by changing the way windows deals with time. Remember, the symptom is we are obtaining a lease from August in 1877. Seems like this is caused by windows using local time rather than UTC. The solution is to change windows to UTC.

Changed the RegKey HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\TimeZoneInformation and created a new DWORD called RealTimeIsUniversal and set it to one 1.

Now, I believe there may be a problem with this when we enter and leave daylight savings, but at this point I have a system that boots and keeps it's IP address. I may be using puppet to change RealTimeIsUniversal back to 0 after everything is working.

Comments?

Revision history for this message

Shawn Duex (shawn-5) wrote on 2013-09-25:

#14

Hey Jay,

Funny you posted this. I was about to post the same. after a bunch of testing and comparing how CLoudStack deals with the Hardware Clock for Windows with how Openstack is configured by default we also noticed that the Nova VM template uses a hardware clock with a UTC offset. This is a standard for Linux but not for Windows. The reg key you listed does allow Windows to deal with a UTC offset hardware clock. In my looking into this I do see that this change is not without issue, as some things will not work correctly. But for the most part on server OS I think it should be fine. CloudStack manages this issue by having a OS_Type for each image and for images that are windows it local time rather that the UTC offset for the hardware clock. This then does not require an image side change for Windows images. Currently we do not see an easy way to implement the same kind of solution as the best we can do is set advanced metadata properties on an image in glance, but then having Nova key off this and use a Windows specific template would require some additional work. I think another aspect of this is that the DHCP leases are set for 120 seconds on OpenStack by default. I think this exacerbates the issue because the lease time is shorter that the time offset. Does anyone know why the DHCP lease times are so short for OpenStack? CloudStack configures the leases to not expire which would also make this issue not happen.

Michael Still (mikal) on 2013-09-26

tags:

added: libvirt

Revision history for this message

Michael Still (mikal) wrote on 2013-09-26:

#15

Hi! There seem to be two issues here in the one bug -- DHCP refreshes
are causing problems, and pain with windows instances because of how
they handle timezones and that causing extra DHCP refreshes. I'll
talk about each of those separately.

Windows timezones
=================

I'm sorry to hear you're having pain with Windows instances.
Unfortunately I don't know a lot about windows, but it sounds as though
if we could change the bios timezone for Windows instances to the
timezone of the compute node that would alleviate your issue. Is that
correct?

Looking at the code, each instance has an os_type field, which is
inherited from the image the instance is based off (there is an image
property called os_type). Xen expects this value to be "windows" for
windows instances. So, libvirt could do the same thing here and treat
the timezone for instances with os_type == "windows" differently.

I'm going to create a new bug to track this windows issue, so that I
can leave this bug focussed on the DHCP timeout issue. This isn't
because Windows is less important as a guest, but because I don't want
it to be missed in the DHCP discussion that's happening here.

DHCP refreshes
==============

It also sounds as if we could bear in increase the default DHCP interval,
although you can configure that yourself as well. It might also make
sense to ping an IP before we delete its lease.

Cheers,
Michael

Changed in nova:
importance:	Undecided → High
status:	Confirmed → Triaged

Revision history for this message

Michael Still (mikal) wrote on 2013-09-26:

#16

The bug for the windows timezone issues is 1231254.

Revision history for this message

Shawn Duex (shawn-5) wrote on 2013-09-26:

#17

Download full text (4.0 KiB)

Michael,

Thank you for taking a look at this and creating the new bug for changing the way libvirt virtualizes the hardware clock for Windows vs. Linux. I believe this to be the root cause for the bug defined here because of the way Windows handles DHCP renewals.

===DHCP===

As for DHCP, I am not sure anything is wrong with having a very short DHCP lease time of 120 sec as the default for nova. I was more curious what the reasoning behind this was. My understanding is that the DNSMasq default is 12 hours. I found this message also asking about the nova default - https://lists.launchpad.net/openstack/msg24200.html

I also wanted to say that the DHCP lease time of 120 seconds does not really exacerbate the issue as I stated above. What is does is make the issue present faster.

==Windows DHCP lease renewal behavior===

This is my understanding of why we are seeing a lose of IP address for Windows 2008R2 (my reproducible issue)

1. Windows instance boots up on KVM with the hardware clock set to UTC (expects localtime)

2. The instance gets it's DHCP lease. (2 min) it logs the obtained time and expire time for the lease
Ethernet adapter Local Area Connection:

   Connection-specific DNS Suffix . : novalocal
   Description . . . . . . . . . . . : Red Hat VirtIO Ethernet Adapter
   Physical Address. . . . . . . . . : FA-16-3E-23-D2-8E
   DHCP Enabled. . . . . . . . . . . : Yes
   Autoconfiguration Enabled . . . . : Yes
   Link-local IPv6 Address . . . . . : fe80::d420:7346:d258:ffa2%11(Preferred)
   IPv4 Address. . . . . . . . . . . : 192.168.200.25(Preferred)
   Subnet Mask . . . . . . . . . . . : 255.255.255.0
   Lease Obtained. . . . . . . . . . : Thursday, September 26, 2013 5:17:06 PM
   Lease Expires . . . . . . . . . . : Thursday, September 26, 2013 5:47:06 PM
   Default Gateway . . . . . . . . . : 192.168.200.1
   DHCP Server . . . . . . . . . . . : 192.168.200.1
   DHCPv6 IAID . . . . . . . . . . . : 251270718
   DHCPv6 Client DUID. . . . . . . . : 00-01-00-01-19-D6-86-9A-FA-16-3E-23-D2-8E

DNS Servers . . . . . . . . . . . : 192.168.200.1
NetBIOS over Tcpip. . . . . . . . : Enabled

3. NTP time sync happens, in my timezone PST (windows default), this sets the instance clock back 7 hours and updates the DHCP lease information as follows
Ethernet adapter Local Area Connection:

   Connection-specific DNS Suffix . : novalocal
   Description . . . . . . . . . . . : Red Hat VirtIO Ethernet Adapter
   Physical Address. . . . . . . . . : FA-16-3E-23-D2-8E
   DHCP Enabled. . . . . . . . . . . : Yes
   Autoconfiguration Enabled . . . . : Yes
   Link-local IPv6 Address . . . . . : fe80::d420:7346:d258:ffa2%11(Preferred)
   IPv4 Address. . . . . . . . . . . : 192.168.200.25(Preferred)
   Subnet Mask . . . . . . . . . . . : 255.255.255.0
   Lease Obtained. . . . . . . . . . : Monday, August 20, 1877 10:48:50 AM
   Lease Expires . . . . . . . . . . : Thursday, September 26, 2013 10:47:09 AM
   Default Gateway . . . . . . . . . : 192.168.200.1
   DHCP Server . . . . . . . . . . . : 192.168.200.1
   DHCPv6 IAID . . . . . . . . . . . : 251270718
   DHCPv6 Client DUID. . . . . . . . : 00-01-00-01-19-D6-86-9A-FA-16-3E-23-D2-8E

DNS Servers . ...

Affects		Status	Importance	Assigned to	Milestone
	OpenStack Compute (nova)	Triaged	High	Unassigned
	nova (Ubuntu)	Triaged	Medium	Unassigned

Ubuntu
nova package

nova-network gets release_fixed_ip events from someplace, but the database still keeps them associated with instances

Bug Description

Other bug subscribers

Remote bug watches

Ubuntunova package

nova-network gets release_fixed_ip events from someplace, but the database still keeps them associated with instances

Bug Description

Other bug subscribers

Remote bug watches

Ubuntu
nova package