OpenStack Compute (Nova)

nova-network spends too much time in VlanManager._disassociate_stale_fixed_ips

Reported by Julien Danjou on 2012-03-26
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
High
Vish Ishaya

Bug Description

2012-03-26 16:11:49 DEBUG nova.manager [-] Running periodic task VlanManager._disassociate_stale_fixed_ips from (pid=13
750) periodic_tasks /usr/lib/python2.7/dist-packages/nova/manager.py:152
2012-03-26 16:13:25 DEBUG nova.rpc.common [-] received...

It takes almost 2 minutes to complete, and while it does that it uses 100 % CPU. Therefore nova-network does not honor any other request, and sometimes nova async calls on AMQP die with timeout.

Julien Danjou (jdanjou) on 2012-03-26
tags: added: essex-rc-potential
Julien Danjou (jdanjou) wrote :
Download full text (4.1 KiB)

I've just hit this backtrace at some point:

2012-03-26 17:38:23 DEBUG nova.manager [-] Running periodic task VlanManager._disassociate_stale_fixed_ips from (pid=4929) periodic_tasks /usr/lib/python2.7/dist-packages/nova/manager.py:152

2012-03-26 17:39:34 ERROR nova.manager [-] Error during VlanManager._disassociate_stale_fixed_ips: (OperationalError) (1153, "Got a packet bigger than 'max_allowed_packet' bytes") 'SELECT fixed_ips.id AS fixed_ips_id \nFROM fixed_ips \nWHERE fixed_ips.id IN (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s,...

Read more...

Julien Danjou (jdanjou) wrote :

Precision: this is with the network VlanManager.

justinsb (justin-fathomdb) wrote :

This makes sense. For whatever reason, you have a lot of stale fixed IPs. A stale fixed_ip has an instance_id but has allocated=False, and is sufficiently old.

I presume this is some sort of lazy reclamation to prevent network problems.

If you have a look in nova/db/sqlalchemy/api.py, you can see that there's a note in there that the direct update doesn't work. If it did work, this problem wouldn't arise.

The problems are:

1) The initial query may be slow; I don't see a helpful index, and so I think probably it's a table scan with a nasty join to the Instances table. I'd like to see the SQL query here though (this needs my SQL debug patch!)
2) Then we probably need to batch the updates to avoid this problem (e.g. do in groups of 100). It would be nice if SQLAlchemy did this for us, but it would be really complicated (technically it would have to know the MySQL packet limit).

For #1, it may be that we can break up the query into separate queries.

If this does take two minutes, I am worried about the locking here. Again, I'd love to see the SQL queries, but this could be a much bigger problem even than it first appears if it effectively locks the Instance, Network & FixedIp tables!

justinsb (justin-fathomdb) wrote :

Tagging as essex-rc-potential because this sounds like it might be causing serious DB issues.

Julien Danjou (jdanjou) wrote :

FYI

mysql> select count(*) from fixed_ips;
+----------+
| count(*) |
+----------+
| 25600 |
+----------+
1 row in set (0.01 sec)

mysql> select count(*) from fixed_ips where allocated=False and instance_id is not null;
+----------+
| count(*) |
+----------+
| 137 |
+----------+
1 row in set (0.00 sec)

I don't know how the lock works, but FWIW I have 3 computes nodes configured with multi_host=True so they all nova-network running. That means the the 3 are running this code at some point (and maybe at the same time).

justinsb (justin-fathomdb) wrote :

Thanks. Are you by any chance running from source? If so, can you please try setting 'echo': 'debug' instead of 'echo': False in /nova/nova/db/sqlalchemy/session.py

That will turn on (really verbose) SQL logging. Technically you only need to do that on one of the compute nodes.

Just so it's clear, I think I have a pretty good understanding of the error message, the command is simply "too big" when the number of fixed_ips gets large. We can patch to work around that fairly easily. I would have chosen 100 which might be too large if the 137 count above is the right count - I think there were more in your original SQL query; this could be another bug (e.g. duplicate values?)

However, I'm also trying to understand the 2 minute timing; that indicates to me that something else is probably going wrong. The query where it finds fixed_ips looks particularly difficult, and I'm worried that it might be causing deeper issues on the database.

Did you know which process was using 100% CPU - it sounds like it was the network process? That sounds like it might not even be the DB (e.g. maybe SQLAlchemy's query construction for an IN clause is N^2 ??)

Julien Danjou (jdanjou) wrote :

Nop, I'm not running from sources, but eventually I can modify session.py on the system.

The process that uses 100% CPU is nova-network. MySQL is doing almost nothing during that time. Unfortunately, I don't know how to profile nova-network to see where it spend its time. :-|

(FWIW I've deactived the code for now by setting by timeout_fixed_ips to False in the manager class, and that fixes almost all of the timeout problem I had with nova-compute. I'm adding this here so if anybody hits that too, there's a quick'n dirty fix).

Thierry Carrez (ttx) on 2012-03-27
Changed in nova:
status: New → Incomplete
Julien Danjou (jdanjou) wrote :
Download full text (4.6 KiB)

I tried with debug in session.py but not sure it changed too much thing. Anyhow reactivating the code did this today:

2012-03-27 17:07:54 DEBUG nova.manager [req-021674b2-060a-4857-9e39-4d8587864b39 None None] Running periodic task VlanManager._disassociate_stale_fixed_ips from (pid=17241) periodic_tasks /usr/lib/python2.7/dist-packages/nova/manager.py:152
2012-03-27 17:10:31 ERROR nova.manager [req-021674b2-060a-4857-9e39-4d8587864b39 None None] Error during VlanManager._disassociate_stale_fixed_ips: (OperationalError) (1153, "Got a packet bigger than 'max_allowed_packet' bytes") 'SELECT fixed_ips.id AS fixed_ips_id \nFROM fixed_ips \nWHERE fixed_ips.id IN (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, (TONS OF %s I WILL NOT PASTE) %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)' ((4L,), (4L,), (4L,), (4L,), (4L,), (4L,), (4L,), (4L,) ... displaying 10 of 1752600 total bound parameter sets ... (286L,), (286L,))
(nova.manager): TRACE: Traceback (most recent call last):
(nova.manager): TRACE: File "/usr/lib/python2.7/dist-packages/nova/manager.py", line 155, in periodic_tasks
(nova.manager): TRACE: task(self, context)
(nova.manager): TRACE: File "/usr/lib/python2.7/dist-packages/nova/network/manager.py", line 751, in _disassociate_stale_fixed_ips
(nova.manager): TRACE: time)
(nova.manager): TRACE: File "/usr/lib/python2.7/dist-packages/nova/db/api.py", line 458, in fixed_ip_disassociate_all_by_timeout
(nova.manager): TRACE: return IMPL.fixed_ip_disassociate_all_by_timeout(context, host, time)
(nova.manager): TRACE: File "/usr/lib/python2.7/dist-packages/nova/db/sqlalchemy/api.py", line 101, in wrapper
(nova.manager): TRACE: return f(*args, **kwargs)
(nova.manager): TRACE: File "/usr/lib/python2.7/dist-packages/nova/db/sqlalchemy/api.py", line 1093, in fixed_ip_disassociate_all_by_timeout
(nova.manager): TRACE: synchronize_session='fetch')
(nova.manager): TRACE: File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2675, in update
(nova.manager): TRACE: params=self._params).fetchall()
(nova.manager): TRACE: File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 801, in execute
(nova.manager): TRACE: clause, params or {})
(nova.manager): TRACE: File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1405, in execute
(nova.manager): TRACE: params)
(nova.manager): TRACE: File "/usr/lib/python2.7/dist...

Read more...

Thierry Carrez (ttx) on 2012-03-27
Changed in nova:
importance: Undecided → Medium
justinsb (justin-fathomdb) wrote :

I still don't see the actual SQL statements, but we do have two clues:

1) Repeated '4' values indicates that values are repeating
2) "displaying 10 of 1752600" indicates that there are a lot of values, which could explain the CPU time. This could conceivably tally with 2 minutes of CPU time.

I'm guessing the first query is returning way too many values. I will look at the code to see if I can get a hint. It would have been great if the SQL statements were there, but I'll see if I can just see the problem.

justinsb (justin-fathomdb) wrote :

This does look similar to Bug #953708 (thanks dprince!) You may well to update to a version that includes that patch (what version are you running?)

It may be that there are still duplicate rows returned by SQL Alchemy (without the exact SQL statements, I'm not really sure). I'm going to propose a simple patch to detect & fix this.

Fix proposed to branch: master
Review: https://review.openstack.org/5872

Changed in nova:
assignee: nobody → justinsb (justin-fathomdb)
status: Incomplete → In Progress

Fix proposed to branch: master
Review: https://review.openstack.org/5893

Changed in nova:
assignee: justinsb (justin-fathomdb) → Vish Ishaya (vishvananda)
Changed in nova:
importance: Medium → High
milestone: none → essex-rc2
Vish Ishaya (vishvananda) wrote :

So the issue here was the old code was doing joins without relationships set in the model (due to us removing foreign keys, leading to a cartesian product)

OLD SQL:

SELECT fixed_ips.created_at AS fixed_ips_created_at, fixed_ips.updated_at AS fixed_ips_updated_at, fixed_ips.deleted_at AS fixed_ips_deleted_at, fixed_ips.deleted AS fixed_ips_deleted, fixed_ips.id AS fixed_ips_id, fixed_ips.address AS fixed_ips_address, fixed_ips.network_id AS fixed_ips_network_id, fixed_ips.virtual_interface_id AS fixed_ips_virtual_interface_id, fixed_ips.instance_id AS fixed_ips_instance_id, fixed_ips.allocated AS fixed_ips_allocated, fixed_ips.leased AS fixed_ips_leased, fixed_ips.reserved AS fixed_ips_reserved, fixed_ips.host AS fixed_ips_host
FROM fixed_ips, instances, networks
WHERE fixed_ips.updated_at < ? AND fixed_ips.instance_id IS NOT NULL AND fixed_ips.allocated = ? AND (instances.host = ? AND networks.multi_host = ? OR networks.host = ?)

(Note the lack of ON clauses in the joins)

SQL after the above patch:

SELECT fixed_ips.id AS fixed_ips_id
FROM fixed_ips JOIN networks ON networks.id = fixed_ips.network_id JOIN instances ON instances.id = fixed_ips.instance_id
WHERE fixed_ips.deleted = ? AND fixed_ips.allocated = ? AND fixed_ips.updated_at < ? AND (instances.host = ? AND networks.multi_host = ? OR networks.host = ?)

Thierry Carrez (ttx) on 2012-03-28
tags: removed: essex-rc-potential

Reviewed: https://review.openstack.org/5893
Committed: http://github.com/openstack/nova/commit/0384aa5416dedc61b661a347e625c827aa6f94a2
Submitter: Jenkins
Branch: master

commit 0384aa5416dedc61b661a347e625c827aa6f94a2
Author: Vishvananda Ishaya <email address hidden>
Date: Tue Mar 27 15:54:40 2012 -0700

    Fix disassociate query to remove foreign keys

     * fixes bug 965333

    Change-Id: I3f7605717825f91fc34d15c14a5fe86824bd3799

Changed in nova:
status: In Progress → Fix Committed

Reviewed: https://review.openstack.org/6017
Committed: http://github.com/openstack/nova/commit/5379b045da533a5fce941bd659fad657a79a154f
Submitter: Jenkins
Branch: milestone-proposed

commit 5379b045da533a5fce941bd659fad657a79a154f
Author: Vishvananda Ishaya <email address hidden>
Date: Tue Mar 27 15:54:40 2012 -0700

    Fix disassociate query to remove foreign keys

     * fixes bug 965333

    Change-Id: I3f7605717825f91fc34d15c14a5fe86824bd3799

Changed in nova:
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2012-04-05
Changed in nova:
milestone: essex-rc2 → 2012.1
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers