Moderate load on OpenStack REST API kills the cloud

Bug #1583546 reported by Dmitry Mescheryakov
Affects: Mirantis OpenStack
Status: Confirmed
Importance: High
Assigned to: MOS Scale

Bug Description

Version: 9.0

Steps to reproduce:
Install a 200-node cloud with Ceilometer. On each compute node, run Ceilometer's pollster with the following incorrect settings:
ceilometer-polling --polling-namespace compute central --config-file /etc/ceilometer/ceilometer.conf

These settings cause each pollster to query every OpenStack REST API endpoint with a basic command once a minute. With 200 nodes, each API endpoint therefore receives about 3 requests per second (200 / 60 ≈ 3.3), for a total of roughly 3 * <number of REST API endpoints> requests per second.
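The load estimate above can be written out as a short calculation. This is only a sketch: it assumes one pollster per node on all 200 nodes, and the endpoint count of 10 in the usage comment is a hypothetical value, since the report does not state how many REST API endpoints the cloud exposes.

```python
def per_endpoint_rate(pollsters, interval_s):
    """Requests per second that a single API endpoint receives."""
    return pollsters / interval_s


def total_rate(pollsters, interval_s, endpoints):
    """Requests per second summed over all API endpoints."""
    return per_endpoint_rate(pollsters, interval_s) * endpoints


# Usage (the endpoint count of 10 is a hypothetical value):
#   per_endpoint_rate(200, 60)   # about 3.3 requests/s per endpoint
#   total_rate(200, 60, 10)      # about 33 requests/s in total
```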

Expected result:
OpenStack services operate normally.

Actual result:
A number of things went wrong:
1. SSH login from the master node to any controller took 30 seconds.
2. The Pacemaker cluster kept breaking. crmd.log on the controllers showed that the crmd master (crmd is part of Pacemaker) constantly migrated, resource checks failed, and so on.
As a result of #2, the RabbitMQ cluster broke at some point.

We checked the following resources:
1. cpu
2. memory
3. disk
4. network

All four seemed to be far from exhausted. In hindsight, we should have checked the entropy pool as well.
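A check of the entropy pool could be as simple as reading the counter the Linux kernel exposes in /proc/sys/kernel/random/entropy_avail. The sketch below assumes a Linux controller node; the threshold of 200 bits is a hypothetical cutoff chosen for illustration, not a value taken from this report.

```python
from pathlib import Path

# Linux exposes the current entropy estimate (in bits) in this file.
ENTROPY_AVAIL = Path("/proc/sys/kernel/random/entropy_avail")


def parse_entropy_avail(text):
    """Parse the contents of entropy_avail into an integer bit count."""
    return int(text.strip())


def read_entropy_bits(path=ENTROPY_AVAIL):
    """Read the kernel's current entropy estimate in bits."""
    return parse_entropy_avail(Path(path).read_text())


def is_entropy_low(bits, threshold=200):
    """Return True if the pool is below the (hypothetical) threshold."""
    return bits < threshold


# Usage on a controller node while the load is running:
#   bits = read_entropy_bits()
#   print(bits, "LOW" if is_entropy_low(bits) else "ok")
```

Sampling this value periodically during the test would show whether the pool drains while the API endpoints are under load.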

We need to reproduce this issue without Ceilometer and find out which resource was exhausted, because this load seems fairly moderate. (Again, don't forget to check entropy.)
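One way to reproduce the load without Ceilometer is a small generator that hits each endpoint at a fixed rate. The following is a minimal sketch using only the Python standard library, not the tooling the team actually used. The names generate_load and http_get are hypothetical, the endpoint URLs must be supplied by the operator, and authentication (for example a Keystone token header) is not handled, so unauthenticated requests may simply be rejected by the API.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def http_get(url, timeout=10.0):
    """Issue one GET request and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def generate_load(endpoints, rate_per_endpoint, duration_s,
                  fetch=http_get, workers=32):
    """Send about rate_per_endpoint requests/s to every endpoint.

    Returns a dict mapping each endpoint to a list of outcomes; each
    outcome is the value returned by fetch or the exception it raised.
    """
    if rate_per_endpoint <= 0:
        raise ValueError("rate_per_endpoint must be positive")
    interval = 1.0 / rate_per_endpoint
    rounds = round(duration_s * rate_per_endpoint)
    futures = {url: [] for url in endpoints}
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for i in range(rounds):
            # Wait until the scheduled start time of this round.
            delay = start + i * interval - time.monotonic()
            if delay > 0:
                time.sleep(delay)
            for url in endpoints:
                futures[url].append(pool.submit(fetch, url))
    # Leaving the "with" block waits for all submitted requests to finish.
    results = {}
    for url, futs in futures.items():
        results[url] = []
        for fut in futs:
            exc = fut.exception()
            results[url].append(exc if exc is not None else fut.result())
    return results


# Usage (the URL below is a placeholder, not taken from this report):
#   generate_load(["http://controller.example:8774/"], 3.3, 600)
```

Running this at roughly 3.3 requests per second per endpoint while watching CPU, memory, disk, network and entropy on the controllers would match the load described above.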

Dina Belova (dbelova)
Changed in mos:
milestone: none → 10.0
Changed in mos:
assignee: nobody → MOS Scale (mos-scale)
Dina Belova (dbelova)
Changed in mos:
importance: Undecided → High
status: New → Confirmed