Performance issues with 1k+ Ironic BM instances

Bug #1559246 reported by sergiiF
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

We have an Ironic deployment with about 1500 BMs, 1k+ of them are already provisioned.

The current Ironic architecture doesn't allow us to run more than one 'ironic compute node'. As a result, the nova-compute service is 100% busy with periodic tasks such as updating instance status (this task alone takes about 1.5 minutes!).
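To make the scaling problem concrete, here is a minimal back-of-the-envelope sketch (not nova source code): a single compute service iterates over every node it manages during one periodic cycle, so the cycle time grows linearly with node count. The 90 ms per-node cost is an assumed figure chosen only to match the ~1.5 minute cycle reported above.

```python
# Hypothetical illustration, not nova internals: one nova-compute
# service does O(N) work per periodic cycle when it manages N nodes.
PER_NODE_COST_MS = 90  # assumed per-node DB/RPC cost, in milliseconds

def estimated_cycle_seconds(node_count, per_node_cost_ms=PER_NODE_COST_MS):
    """Estimated duration of one periodic cycle for a single service."""
    return node_count * per_node_cost_ms / 1000

# ~1000 provisioned Ironic nodes behind one nova-compute service:
print(estimated_cycle_seconds(1000))  # 90.0 seconds, i.e. ~1.5 minutes
```

With a conventional hypervisor each nova-compute manages one node, so this linear cost is invisible; it only bites when a single service fronts hundreds of Ironic nodes.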

Tags: ironic
Revision history for this message
Matt Riedemann (mriedem) wrote :

This is lacking quite a bit of information. First, what version of nova/ironic are you on?

Have you done any profiling to see what bottlenecks there might be?

Which periodic tasks specifically are taking a long time?

Also, what is the size of the deployment (how big is the controller)? Talking CPUs/RAM here.

Changed in nova:
status: New → Invalid
Revision history for this message
Andrew Laski (alaski) wrote :

There's not enough here to classify anything as a bug, though there are surely things that could be improved. This is also related to the work proposed in https://review.openstack.org/294795

Revision history for this message
sergiiF (framin) wrote :

>>Have you done any profiling to see what bottlenecks there might be?
>>Which periodic tasks specifically are taking a long time?

The main CPU-consuming task is update_available_resource, and in particular two subroutines:
1. objects.InstanceList.get_by_host_and_node
2. objects.MigrationList.get_in_progress_by_host_and_node
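The per-node query pattern behind these two calls is what ties the cycle time to the node count: each managed node costs its own DB round trips. A toy illustration of the difference between one query per node and one batched query per host (hypothetical helper names and toy data, not nova internals):

```python
from collections import defaultdict

# Toy "instances" table: (host, node, uuid) rows. 12 instances
# spread across 4 nodes, all on one compute host.
DB = [("compute1", f"node-{i % 4}", f"uuid-{i}") for i in range(12)]

def fetch_instances_for_node(host, node):
    # One round trip per node -- the pattern that makes calls like
    # get_by_host_and_node scale linearly with node count.
    return [row for row in DB if row[0] == host and row[1] == node]

def fetch_instances_for_host(host):
    # One round trip for the whole host, grouped by node client-side.
    grouped = defaultdict(list)
    for row in DB:
        if row[0] == host:
            grouped[row[1]].append(row)
    return grouped

per_node = {n: fetch_instances_for_node("compute1", n)
            for n in ("node-0", "node-1", "node-2", "node-3")}
batched = fetch_instances_for_host("compute1")
assert dict(batched) == per_node  # same data, one query instead of N
```

Batching is one plausible mitigation direction; whether and how nova's resource tracker could adopt it is exactly the kind of design question the linked blueprint work touches.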

Revision history for this message
sergiiF (framin) wrote :

>>There's not enough here to classify anything as a bug
Kind of agree. But still, without code changes Ironic is not usable at large scale. I would say there is a bug, and it is in the design:
1. The nova-compute design is not suitable for managing hundreds of instances per compute node.
2. The Ironic design (unless the 'Ironic: Multiple compute host support' blueprint is implemented) assigns all BMs to a single compute node.

Revision history for this message
sergiiF (framin) wrote :

Btw, the mentioned blueprint expects EACH compute node to report all nodes, which doesn't really solve the issue. Resource tracking is the only performance problem we are experiencing at 1k+ node scale.
