cinder scheduler/backup using 100% CPU

Bug #1709346 reported by Kevin Carter
This bug affects 1 person
Affects              Status         Importance   Assigned to     Milestone
Cinder               Incomplete     Undecided    Unassigned
OpenStack-Ansible    Fix Released   Undecided    Kevin Carter

Bug Description

On master we're seeing cinder-{backup,scheduler} run at 100% CPU post-deployment. While tempest seems to be passing its tests in the gate, the processes are running off the rails and causing enough impact to result in gate timeouts.

The only fix we've found so far is to restart the processes, though this has to be done manually. Stracing the scheduler process seems to indicate that it's cycling on epoll_ctl(6), but I've not been able to pinpoint why.

We are able to recreate the conditions seen in the gate by performing the following actions:

systemctl stop cinder-{backup,volume}
lxc-destroy -fn $CINDER_API_CONTAINER
openstack-ansible lxc-container-create.yml --limit $CINDER_API_CONTAINER
openstack-ansible os-cinder-install.yml

While monitoring the deployment, it looks like the HUP (the Ansible service reload, https://github.com/cloudnull/os-ansible-deployment/blob/master/playbooks/os-cinder-install.yml#L145-L167) is what causes the issue, though I have no idea why at this time. I've only seen this on master and have not investigated our stable branches, but the problem is likely present in all of them as well, since the same logic was back-ported as far back as Newton.
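The manual equivalent of that reload step is roughly the following (unit names and commands are illustrative, not taken from the playbook):

# Send SIGHUP to the cinder services, as the playbook's reload handler does
# (unit names illustrative; adjust for the actual deployment layout).
systemctl kill --signal=SIGHUP cinder-scheduler
systemctl kill --signal=SIGHUP cinder-backup
# Or, if the services are not systemd-managed:
kill -HUP $(pgrep -f cinder-scheduler)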

description: updated
description: updated
description: updated
Revision history for this message
Sean McGinnis (sean-mcginnis) wrote :

Triggered at the point where SIGHUP is sent.

https://docs.openstack.org/cinder/latest/upgrade.html#rolling-upgrade-process

My hunch is that some loop is not exiting correctly on this signal, causing the process to spin in a tight loop and consume CPU. Hopefully it should be easy to reproduce.

Revision history for this message
Andy McCrae (andrew-mccrae) wrote :

Not that this will be super helpful, but strace on the 100% CPU cinder-scheduler process shows this just recurring constantly:

epoll_wait(6, [], 1023, 0) = 0
gettimeofday({1502716783, 609549}, NULL) = 0
gettimeofday({1502716783, 609623}, NULL) = 0

Unfortunately file descriptor 6 is an anon_inode, so nothing particularly useful there.
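For reference, the same data can be gathered with something like the following (the PID lookup and fd number are illustrative of this particular case):

# Attach strace to the spinning scheduler process
strace -p $(pgrep -f cinder-scheduler | head -1)
# Identify what file descriptor 6 actually refers to
ls -l /proc/$(pgrep -f cinder-scheduler | head -1)/fd/6
lsof -a -p $(pgrep -f cinder-scheduler | head -1) -d 6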

Changed in openstack-ansible:
assignee: nobody → Kevin Carter (kevin-carter)
Changed in openstack-ansible:
assignee: Kevin Carter (kevin-carter) → Andy McCrae (andrew-mccrae)
status: New → In Progress
Revision history for this message
Logan V (loganv) wrote :

Workaround proposed and being tested: https://review.openstack.org/#/c/491550/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-ansible (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/493638

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-ansible (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/493639

Changed in openstack-ansible:
assignee: Andy McCrae (andrew-mccrae) → Kevin Carter (kevin-carter)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-ansible (master)

Reviewed: https://review.openstack.org/491550
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible/commit/?id=7b39cbaa4f1f3254b7969ad9e8c8cb1ec506301a
Submitter: Jenkins
Branch: master

commit 7b39cbaa4f1f3254b7969ad9e8c8cb1ec506301a
Author: Kevin Carter <email address hidden>
Date: Mon Aug 7 12:28:25 2017 -0500

    Remove the reload from the cinder playbook

    This removes the reload from the cinder playbook because it's causing
    the cinder service(s) to consume 100% CPU which causes gate issues, and
    will result in misbehaving deployments in prod.

    Closes-Bug: 1709346
    Change-Id: Ifd3b7b7b177dfb7d6456f802284046dd7ce96a9a
    Signed-off-by: Kevin Carter <email address hidden>

Changed in openstack-ansible:
status: In Progress → Fix Released
no longer affects: cinder
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-ansible (stable/newton)

Reviewed: https://review.openstack.org/493639
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible/commit/?id=191155546c655131222bf1ac7c0101b7eae5d840
Submitter: Jenkins
Branch: stable/newton

commit 191155546c655131222bf1ac7c0101b7eae5d840
Author: Kevin Carter <email address hidden>
Date: Mon Aug 7 12:28:25 2017 -0500

    Remove the reload from the cinder playbook

    This removes the reload from the cinder playbook because it's causing
    the cinder service(s) to consume 100% CPU which causes gate issues, and
    will result in misbehaving deployments in prod.

    Closes-Bug: 1709346
    Change-Id: Ifd3b7b7b177dfb7d6456f802284046dd7ce96a9a
    Signed-off-by: Kevin Carter <email address hidden>

tags: added: in-stable-newton
tags: added: in-stable-ocata
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-ansible (stable/ocata)

Reviewed: https://review.openstack.org/493638
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible/commit/?id=045c4c5601a7f1d2e63303c7420e9fd518704c4b
Submitter: Jenkins
Branch: stable/ocata

commit 045c4c5601a7f1d2e63303c7420e9fd518704c4b
Author: Kevin Carter <email address hidden>
Date: Mon Aug 7 12:28:25 2017 -0500

    Remove the reload from the cinder playbook

    This removes the reload from the cinder playbook because it's causing
    the cinder service(s) to consume 100% CPU which causes gate issues, and
    will result in misbehaving deployments in prod.

    Closes-Bug: 1709346
    Change-Id: Ifd3b7b7b177dfb7d6456f802284046dd7ce96a9a
    Signed-off-by: Kevin Carter <email address hidden>

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/openstack-ansible 16.0.0.0rc1

This issue was fixed in the openstack/openstack-ansible 16.0.0.0rc1 release candidate.

Revision history for this message
Sean McGinnis (sean-mcginnis) wrote :

Would it be possible to add a sleep to the Ansible script before sending the SIGHUP? I'm investigating now, but a good suspicion raised was that the signal may be getting sent before the services are fully up and ready for it.

Or is there already some delay between when the service starts and when it gets sent the SIGHUP?
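If a delay were wanted, a hand-rolled sketch might look like this (service name and timing are illustrative, not taken from the playbook):

# Hypothetical: wait until the scheduler reports active, give it extra time
# to finish starting, then send the reload signal.
until systemctl is-active --quiet cinder-scheduler; do sleep 1; done
sleep 10
systemctl kill --signal=SIGHUP cinder-scheduler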

Revision history for this message
Sean McGinnis (sean-mcginnis) wrote :

I'm also curious about the above, as I have a running setup where I sent SIGHUP to the cinder-scheduler service and was not able to reproduce this behavior.

Revision history for this message
Sean McGinnis (sean-mcginnis) wrote :

Also confirmed on my cinder-backup host: I sent a SIGHUP and did not observe any unusual CPU impact.
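The manual check amounts to roughly the following (process lookup and sampling interval are illustrative):

# Send SIGHUP to cinder-backup, then sample its CPU usage for a few seconds
kill -HUP $(pgrep -f cinder-backup | head -1)
pidstat -p $(pgrep -f cinder-backup | head -1) 1 10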

Changed in cinder:
status: New → Incomplete
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/openstack-ansible 15.1.8

This issue was fixed in the openstack/openstack-ansible 15.1.8 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/openstack-ansible 14.2.8

This issue was fixed in the openstack/openstack-ansible 14.2.8 release.
