OpenStack Compute (nova)

IP's are recycled too quickly

Bug #714577 reported by Soren Hansen on 2011-02-07

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
OpenStack Compute (nova)	Fix Released	Undecided	Unassigned	OpenStack Compute (nova) 2011.2 "cactus"

Bug Description

My automated tests frequently fail. It turns out that it's almost exclusively caused by dnsmasq refusing to hand out an IP to a guest, because an old instance from a previous test run still holds the lease.

Once a lease expires, dnsmasq calls the dhcpbridge which in turn calls the network worker and disassociates the IP. That's great.

However, the compute worker also disassociates the IP when terminating an instance. This is almost always premature, since it's quite unlikely to coincide exactly with the expiration of the lease, so the IP doesn't yet belong back in the pool of available IPs.

We have a couple of options:

1) Add a "disassociated_at" column to fixed_ips. That timestamp is set when the instance is terminated. When we grab an IP from the pool, it must have been disassociated at least 120s earlier (that's the lease time we pass to dnsmasq). We might want to call it "recycle_after"/"keep_until" and adjust the semantics accordingly.

2) Leave it to dnsmasq to release the IPs. We might leak IPs this way if dnsmasq is closed uncleanly or similar, but it does ensure that dnsmasq is willing to dish out this IP if we ask it to. We can probably solve this by doing some cleanup at startup.

3) Other options

Nevertheless, leaving it exclusively to dnsmasq could cause leaks if dnsmasq gets shut down uncleanly.
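
As a rough illustration of option 1, the eligibility check could look something like the sketch below. The function name, the constant, and the handling of a missing timestamp are all hypothetical and not taken from nova; only the 120s lease time comes from the description above.

```python
from datetime import datetime, timedelta

# Lease time passed to dnsmasq, per the description above.
LEASE_TIME = timedelta(seconds=120)


def can_recycle(disassociated_at, now, lease_time=LEASE_TIME):
    """Return True if an IP may be handed out again (hypothetical helper)."""
    if disassociated_at is None:
        # Assumption: no timestamp means the IP was never handed out,
        # so no lease can still be outstanding.
        return True
    # Only recycle once at least one full lease time has passed.
    return now - disassociated_at >= lease_time


now = datetime(2011, 2, 7, 12, 0, 0)
fresh = can_recycle(now - timedelta(seconds=30), now)    # still within the lease
expired = can_recycle(now - timedelta(seconds=120), now)  # lease time has elapsed
```

With a "recycle_after"/"keep_until" column the comparison would simply flip to checking that the stored timestamp is not later than now.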

There

Related branches

lp:~soren/nova/lp714577

Merged into lp:~hudson-openstack/nova/trunk at revision 657

Devin Carlen (community): Approve on 2011-02-09

Vish Ishaya (community): Approve on 2011-02-09

Jay Pipes (community): Approve on 2011-02-09

Revision history for this message

Vish Ishaya (vishvananda) wrote on 2011-02-08: Re: [Bug 714577] [NEW] IP's are recycled too quickly

I think there is an easy solution to this. If you are blowing the database away in between tests, kill the running dnsmasq instances as well. Leases don't get returned to the pool until they have been released by dnsmasq or the timeout hits on the network periodic callback. The release of the IP by the compute host does not return the IP to the pool (unless there is some sort of bug that has been recently added).

Vish

On Feb 7, 2011, at 5:55 AM, Soren Hansen wrote:

> Public bug reported:
> 
> My automated tests frequently fail. It turns out that it's almost
> exclusively caused by dnsmasq refusing to hand out an IP to a guest,
> because an old instance from a previous test run still holds the lease.
> 
> Once a lease expires, dnsmasq calls the dhcpbridge which in turn calls
> the network worker and disassociates the IP. That's great.
> 
> However, the compute worker also disassociates the IP when terminating
> an instance. This is almost always premature, since it's quite unlikely
> that it will coincide exactly with the expiration of the lease, so the
> IP doesn't belong back in the pool of available IP's.
> 
> We have a couple of options:
> 
> 1) Add a "disassociated_at" column to fixed_ips. That timestamp is set
> when the instance is terminated. When we grab an ip from the pool, it
> must have been disassociate at least 120s earlier (that's what we pass
> to dnsmasq as the lease time). We might want to call it
> "recycle_after"/"keep_until" and adjust the semantics accordingly.
> 
> 2) Leave it dnsmasq to release the IPs. We might leak IP's this way if
> dnsmasq is closed uncleanly or similar, but it does ensure that dnsmasq
> is willing to dish out this IP if we ask it to. We can probably solve
> this by doing some cleanup at startup.
> 
> 3) Other options
> 
> Nevertheless, leaving it exclusively to dnsmasq could cause leaks if
> dnsmasq gets shut down uncleanly.
> 
> There
> 
> ** Affects: nova
>     Importance: Undecided
>         Status: New
> 
> -- 
> You received this bug notification because you are a member of Nova Bug
> Team, which is subscribed to OpenStack Compute (nova).
> https://bugs.launchpad.net/bugs/714577
> 
> Title:
>  IP's are recycled too quickly
> 
> Status in OpenStack Compute (Nova):
>  New
> 
> Bug description:
>  My automated tests frequently fail. It turns out that it's almost
>  exclusively caused by dnsmasq refusing to hand out an IP to a guest,
>  because an old instance from a previous test run still holds the
>  lease.
> 
>  Once a lease expires, dnsmasq calls the dhcpbridge which in turn calls
>  the network worker and disassociates the IP. That's great.
> 
>  However, the compute worker also disassociates the IP when terminating
>  an instance. This is almost always premature, since it's quite
>  unlikely that it will coincide exactly with the expiration of the
>  lease, so the IP doesn't belong back in the pool of available IP's.
> 
>  We have a couple of options:
> 
>  1) Add a "disassociated_at" column to fixed_ips. That timestamp is set
>  when the instance is terminated. When we grab an ip from the pool, it
>  must have been disassociate at least 120s earlier (that's what we pass
>  to dnsmasq as the lease time). We might want to call it
>  "recycle_after"/"keep_until" and adjust the semantics accordingly.
> 
>  2) Leave it dnsmasq to release the IPs. We might leak IP's this way if
>  dnsmasq is closed uncleanly or similar, but it does ensure that
>  dnsmasq is willing to dish out this IP if we ask it to. We can
>  probably solve this by doing some cleanup at startup.
> 
>  3) Other options
> 
>  Nevertheless, leaving it exclusively to dnsmasq could cause leaks if
>  dnsmasq gets shut down uncleanly.
> 
>  There
> 
>

Soren Hansen (soren) wrote on 2011-02-08:

I don't generally blow away the db between tests. I can't say for sure whether there may have been a stray dnsmasq left over from a debugging session. I'll make sure that's not the case and see if it still happens.

Thierry Carrez (ttx) on 2011-02-08

Changed in nova:
status:	New → Incomplete

Soren Hansen (soren) wrote on 2011-02-09: Re: [Bug 714577] Re: IP's are recycled too quickly

Problem found.

It turns out that converting the timeout timestamp to a string caused
the SQL to be very inclusive in its recycling of IPs. The linked
branch makes my integration tests very, very happy indeed. No failures
over the last several hours (it used to fail once every 5 minutes or
so).

Vish Ishaya (vishvananda) wrote on 2011-02-09:

Nice. Propose it? Datetimes are still converted to isoformat if passed through json though, so it might be nice to know what the problem was. Was it just that sql didn't like iso format?
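
For illustration, here is a minimal sketch of the JSON round trip mentioned above: the datetime goes out as an isoformat string and comes back as a plain str, so the receiver has to parse it if it wants a real datetime again. The message key and the parsing step are assumptions made for this example, not nova's actual code.

```python
import json
from datetime import datetime

# Format produced by datetime.isoformat() when microseconds are non-zero.
# Caveat: isoformat() omits the fractional part when microseconds == 0,
# in which case this format string would not match.
ISO_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"

timeout = datetime(2011, 2, 9, 19, 52, 1, 355919)

# Sender side: datetime is not JSON-serializable, so it is sent as a string.
payload = json.dumps({"time": timeout.isoformat()})  # "time" is a hypothetical key

# Receiver side: what arrives is a str, not a datetime.
received = json.loads(payload)["time"]

# Parsing it back restores a datetime that compares temporally.
restored = datetime.strptime(received, ISO_FORMAT)
```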

Soren Hansen (soren) wrote on 2011-02-09:

2011/2/9 Vish Ishaya <email address hidden>:
> Nice. Propose it? Datetimes are still converted to isoformat if passed
> through json though, so it might be nice to know what the problem was.
> Was it just that sql didn't like iso format?

That's why I haven't proposed it. I can see *very* clearly that this
has solved my problem (having run hundreds of instances since I
applied this patch, and not a single failure), and I could see in my
debugging that the periodic_task in VlanManager would frequently
disassociate a number of supposedly stale fixed IPs.

--
Soren Hansen | http://linux2go.dk/
Ubuntu Developer | http://www.ubuntu.com/
OpenStack Developer | http://www.openstack.org/

Soren Hansen (soren) wrote on 2011-02-09:

Ok, I've managed to replicate it now (I had to get an ip to deallocate,
but not get released):

SELECT * FROM fixed_ips WHERE network_id IN (SELECT id FROM networks
WHERE host = 'oxygen') AND updated_at < '2011-02-09T19:52:01.355919' AND
instance_id IS NOT NULL AND allocated = 0;

Gave me this result:
2011-02-09 12:25:23.756273|2011-02-09 20:01:43.789045||0|4|10.0.0.3|1|363|0|1|0

So "2011-02-09 20:01:43.789045" sorts earlier than
"2011-02-09T19:52:01.355919", strongly suggesting that because we're
passing a string, we end up doing a lexicographical comparison rather
than a temporal one. We would probably be in good shape if we used
isoformat(' ') instead of just isoformat(), but I think this handling
belongs in the db layer.

I'll propose my branch as is.
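
The comparison described above can be reproduced in isolation with a small sketch like the following. It assumes an SQLite backend with timestamps stored as text (the pipe-separated result row suggests the sqlite3 shell, but that is a guess), and it uses a cut-down, hypothetical two-column table rather than the real fixed_ips schema. The two timestamps are the ones from the query and result row.

```python
import sqlite3
from datetime import datetime

# Values from the query and the result row quoted above.
stored = "2011-02-09 20:01:43.789045"             # updated_at as stored (space separator)
cutoff = datetime(2011, 2, 9, 19, 52, 1, 355919)  # timeout passed to the query

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the real table; only the columns needed here.
conn.execute("CREATE TABLE fixed_ips (address TEXT, updated_at TEXT)")
conn.execute("INSERT INTO fixed_ips VALUES (?, ?)", ("10.0.0.3", stored))


def stale_addresses(cutoff_str):
    """Return addresses whose updated_at compares (as text) below cutoff_str."""
    rows = conn.execute(
        "SELECT address FROM fixed_ips WHERE updated_at < ?", (cutoff_str,)
    ).fetchall()
    return [row[0] for row in rows]


# 'T' sorts after ' ', so the later timestamp wrongly compares as earlier.
with_t = stale_addresses(cutoff.isoformat())
# With a space separator the strings line up and the text order matches time order.
with_space = stale_addresses(cutoff.isoformat(" "))
```

Here with_t contains 10.0.0.3 even though that row was updated after the cutoff, while with_space is empty.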
