Database tuning required

Bug #964824 reported by justinsb
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Medium
Assigned to: Sean Dague
Milestone: 2012.2

Bug Description

I posted this on the ML, but I think it should also be a bug because we have some potentially serious issues. See below for details, but here are the issues (let me know if I should split into separate bugs etc):

"Critical"

Table scan of fixed_ips on the network service (row per IP address?)
Use of MyISAM tables, particularly for s3_images and block_device_mapping
Table scan of virtual_interfaces (row per instance?)
Verify that MySQL isn't doing a table scan on http://paste.openstack.org/show/12110/ when # of instances is large

"Naughty"

(Mostly because the tables are small)

Table scan of s3_images
Table scan of services
Table scan of networks

Low importance

(Re-fetches aren't a big deal if the queries are fast)

Row re-fetches & re-re-fetches

---

The performance of the metadata query with cloud-init has been causing problems for some people (it's so slow that cloud-init times out!), and has led to the suggestion that we need lots of caching. (My hypothesis is that we don't...)

By turning on SQL debugging in SQL Alchemy (for which I've proposed a patch for Essex: https://review.openstack.org/#change,5783), I was able to capture the SQL statements.
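
For anyone who wants to reproduce the capture without the Essex patch, stock SQLAlchemy already supports statement logging; a minimal sketch (the connection URL is a placeholder):

    # Two generic ways to log every SQL statement SQLAlchemy emits; the URL is illustrative.
    import logging
    from sqlalchemy import create_engine

    # Option 1: echo statements directly from the engine
    engine = create_engine("mysql://nova:secret@localhost/nova", echo=True)

    # Option 2: route them through the standard logging module instead
    logging.basicConfig()
    logging.getLogger("sqlalchemy.engine").setLevel(logging.INFO)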

I'm focusing on the SQL statements for the metadata call.

The code does this:

1) Checks the cache to see if it has the data
2) Makes a message-bus call to the network service to get the fixed_ip info from the address
3) Looks up all sort of metadata in the database
4) Formats the reply

#1 means that the first call is slower than the others, so we need to focus on the first call.
#2 could be problematic if the message queue is overloaded or if the network service is slow to respond
#3 could be problematic if the DB isn't working properly
#4 is hopefully not the problem.
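
As a minimal, hypothetical sketch of that flow (every name below is illustrative, not Nova's real API), the cache-aside shape is what makes only the first call pay the RPC and database cost:

    # Hypothetical cache-aside sketch of steps 1-4; names are illustrative, not Nova's API.
    _metadata_cache = {}

    def _rpc_get_fixed_ip(address):
        # stand-in for the message-bus call to the network service (step 2)
        return {"instance_id": 42, "address": address}

    def _db_get_instance(instance_id):
        # stand-in for the metadata lookups in the database (step 3)
        return {"id": instance_id, "hostname": "test"}

    def get_metadata(address):
        if address in _metadata_cache:              # step 1: a cache hit is cheap
            return _metadata_cache[address]
        fixed_ip = _rpc_get_fixed_ip(address)       # step 2: RPC round trip
        instance = _db_get_instance(fixed_ip["instance_id"])  # step 3: DB lookups
        reply = {"hostname": instance["hostname"],  # step 4: format the reply
                 "local-ipv4": fixed_ip["address"]}
        _metadata_cache[address] = reply            # later calls skip steps 2-4
        return reply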

The relevant SQL log from the API server: http://paste.openstack.org/show/12109/

And from the network server: http://paste.openstack.org/show/12116/

I've analyzed each of the SQL statements:

API

http://paste.openstack.org/show/12110/ (Need to check that there isn't a table scan when instance_info_caches is large)
http://paste.openstack.org/show/12111/ Table scan on services table, but this is presumably smallish
http://paste.openstack.org/show/12112/ No index. Table scan on s3_images table. Also this table is MyISAM. Also seems to insert rows on the first call (not shown). Evil.
http://paste.openstack.org/show/12113/
http://paste.openstack.org/show/12114/ block_device_mapping is MyISAM.

Network

http://paste.openstack.org/show/12117/
http://paste.openstack.org/show/12118/ (Fetch virtual_interface by instance_id)
http://paste.openstack.org/show/12119/ (Fetch network by id)
http://paste.openstack.org/show/12120/ Missing index => table scan on networks. Unnecessary row re-fetch.
http://paste.openstack.org/show/12121/ Missing index => table scan on virtual_interfaces. Unnecessary row re-fetch.
http://paste.openstack.org/show/12122/ (Fetch fixed_ips on virtual interface)
http://paste.openstack.org/show/12123/ Missing index => table scan on networks. Unnecessary row re-fetch. (Double re-fetch. What does it mean?)
http://paste.openstack.org/show/12124/ Missing index => table scan on virtual_interfaces. Another re-re-fetch.
http://paste.openstack.org/show/12125/ Missing index => table scan on fixed_ips (Uh-oh - I hope you didn't allocate a /8!!). We do have this row from the virtual interface lookup; perhaps we could remove this query?
http://paste.openstack.org/show/12126/
http://paste.openstack.org/show/12127/
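
The "Missing index => table scan" conclusions above can be double-checked by running EXPLAIN against the captured statements and looking for type=ALL; a rough sketch, where the URL and the statement are simplified stand-ins for the pastes:

    # Check whether a captured statement forces a full table scan; placeholders throughout.
    from sqlalchemy import create_engine, text

    engine = create_engine("mysql://nova:secret@localhost/nova")

    with engine.connect() as conn:
        rows = conn.execute(
            text("EXPLAIN SELECT * FROM fixed_ips WHERE address = :addr"),
            {"addr": "10.0.0.2"},
        )
        for row in rows:
            # type=ALL in the EXPLAIN output means MySQL scans the whole table;
            # ref/const/eq_ref means an index is being used.
            print(row)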

We still have a bunch of MyISAM tables (at least with a devstack install):
http://paste.openstack.org/show/12115/
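
For completeness, a sketch (not part of any proposed fix) of how the remaining MyISAM tables could be found and converted to InnoDB; the database name and URL are illustrative, and since ALTER TABLE rewrites each table, this belongs in a maintenance window:

    # Find MyISAM tables in the nova schema and convert them; placeholders throughout.
    from sqlalchemy import create_engine, text

    engine = create_engine("mysql://nova:secret@localhost/nova")

    with engine.connect() as conn:
        myisam = conn.execute(text(
            "SELECT table_name FROM information_schema.tables "
            "WHERE table_schema = 'nova' AND engine = 'MyISAM'"
        )).fetchall()
        for (name,) in myisam:
            # table names cannot be bound parameters, so the name is interpolated directly
            conn.execute(text("ALTER TABLE %s ENGINE=InnoDB" % name))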

As I see it, these are the issues (in sort of priority order):

Critical

Table scan of fixed_ips on the network service (row per IP address?)
Use of MyISAM tables, particularly for s3_images and block_device_mapping
Table scan of virtual_interfaces (row per instance?)
Verify that MySQL isn't doing a table scan on http://paste.openstack.org/show/12110/ when # of instances is large

Naughty

(Mostly because the tables are small)

Table scan of s3_images
Table scan of services
Table scan of networks

Low importance

(Re-fetches aren't a big deal if the queries are fast)

Row re-fetches & re-re-fetches

My install is nowhere near big enough for any of these to actually cause a real problem, so I'd love to get timings / a log from someone who is having a problem. Even the table scan of fixed_ips should be OK if you have enough RAM.

Tags: db
tags: added: essex-rc-potential
Thierry Carrez (ttx)
Changed in nova:
importance: Undecided → Medium
status: New → Confirmed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/5970

Changed in nova:
assignee: nobody → Sean Dague (sdague-b)
status: Confirmed → In Progress
Sean Dague (sdague) wrote :

The fix I added covers only a small piece of this; we should probably explode the overall bug into a bunch of smaller ones and work them through individually.

Thierry Carrez (ttx)
tags: removed: essex-rc-potential
Sean Dague (sdague)
Changed in nova:
assignee: Sean Dague (sdague-b) → nobody
Changed in nova:
assignee: nobody → Sean Dague (sdague-b)
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/5970
Committed: http://github.com/openstack/nova/commit/27ea547fbd7c936bd017b64b31ecf09ed991c6c0
Submitter: Jenkins
Branch: master

commit 27ea547fbd7c936bd017b64b31ecf09ed991c6c0
Author: Sean Dague <email address hidden>
Date: Thu Mar 29 16:44:56 2012 -0400

    Add index to fixed_ips.address

    Addresses one of the Critical items in bug #964824

    Add migration to add an index to fixed_ips.address to remove a table
    scan from active code.

    Change-Id: Ic5a56bbddd0d434f8a37ad049753e6d25c7ca760
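
For illustration, a migration of the kind the commit describes might look like the following in the sqlalchemy-migrate style Nova used at the time; the index name is illustrative, and the merged change at the review link above is authoritative:

    # Sketch of an "add index to fixed_ips.address" migration; not the actual merged code.
    from sqlalchemy import Index, MetaData, Table


    def upgrade(migrate_engine):
        meta = MetaData(bind=migrate_engine)
        fixed_ips = Table('fixed_ips', meta, autoload=True)
        Index('fixed_ips_address_idx', fixed_ips.c.address).create(migrate_engine)


    def downgrade(migrate_engine):
        meta = MetaData(bind=migrate_engine)
        fixed_ips = Table('fixed_ips', meta, autoload=True)
        Index('fixed_ips_address_idx', fixed_ips.c.address).drop(migrate_engine)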

Changed in nova:
status: In Progress → Fix Committed
Sean Dague (sdague)
Changed in nova:
status: Fix Committed → In Progress
Sean Dague (sdague) wrote :

I'm closing this out as most of the Critical pieces are now addressed in master. I think it's time for a new analysis, with individual bugs opened for individual tunings after this.

Changed in nova:
status: In Progress → Fix Committed
aeva black (tenbrae)
tags: added: db
Devin Carlen (devcamcar)
Changed in nova:
milestone: none → folsom-1
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: folsom-1 → 2012.2