cinder scheduler/backup using 100% CPU

Bug #1709346 reported by Kevin Carter
This bug affects 1 person
Affects              Status         Importance   Assigned to     Milestone
Cinder               Incomplete     Undecided    Unassigned
OpenStack-Ansible    Fix Released   Undecided    Kevin Carter

Bug Description

On master we're seeing cinder-{backup,scheduler} run at 100% CPU post-deployment. While tempest seems to be passing its tests in the gate, the processes are running off the rails and causing enough impact to result in gate timeouts.

The only fix we've found so far is to restart the processes, though this has to be done manually. Stracing the scheduler process seems to indicate that it's cycling on epoll_ctl(6), but I've not been able to pinpoint why.

We are able to recreate the conditions seen in the gate by performing the following actions:

systemctl stop cinder-{backup,volume}
lxc-destroy -fn $CINDER_API_CONTAINER
openstack-ansible lxc-container-create.yml --limit $CINDER_API_CONTAINER
openstack-ansible os-cinder-install.yml

While monitoring the deployment, it looks like the HUP (the Ansible service reload, https://github.com/cloudnull/os-ansible-deployment/blob/master/playbooks/os-cinder-install.yml#L145-L167) is what causes the issue, though I have no idea why at this time. I've only seen this on master and have not investigated our stable branches, but the problem is likely present in all of them as well, since the same logic was back-ported as far back as Newton.
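The manual equivalent of that reload step is roughly the following (unit names and commands are illustrative, not taken from the playbook):

# Send SIGHUP to the cinder services, as the playbook's reload handler does
# (unit names illustrative; adjust for the actual deployment layout).
systemctl kill --signal=SIGHUP cinder-scheduler
systemctl kill --signal=SIGHUP cinder-backup
# Or, if the services are not systemd-managed:
kill -HUP $(pgrep -f cinder-scheduler)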

description: updated
description: updated
description: updated
Revision history for this message
Sean McGinnis (sean-mcginnis) wrote :

Triggered at the point where SIGHUP is sent.

https://docs.openstack.org/cinder/latest/upgrade.html#rolling-upgrade-process

My hunch is that some loop is not exiting correctly on this signal, causing the process to spin in a tight loop and consume CPU. Hopefully it should be easy to reproduce.

Revision history for this message
Andy McCrae (andrew-mccrae) wrote :

Not that this will be super helpful, but strace on the 100% CPU cinder-scheduler process shows this just recurring constantly:

epoll_wait(6, [], 1023, 0) = 0
gettimeofday({1502716783, 609549}, NULL) = 0
gettimeofday({1502716783, 609623}, NULL) = 0

Unfortunately file descriptor 6 is an anon_inode, so nothing particularly useful there.
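For reference, the same data can be gathered with something like the following (the PID lookup and fd number are illustrative of this particular case):

# Attach strace to the spinning scheduler process
strace -p $(pgrep -f cinder-scheduler | head -1)
# Identify what file descriptor 6 actually refers to
ls -l /proc/$(pgrep -f cinder-scheduler | head -1)/fd/6
lsof -a -p $(pgrep -f cinder-scheduler | head -1) -d 6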

Changed in openstack-ansible:
assignee: nobody → Kevin Carter (kevin-carter)
Changed in openstack-ansible:
assignee: Kevin Carter (kevin-carter) → Andy McCrae (andrew-mccrae)
status: New → In Progress
Revision history for this message
Logan V (loganv) wrote :

Workaround proposed and being tested: https://review.openstack.org/#/c/491550/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-ansible (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/493638

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-ansible (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/493639

Changed in openstack-ansible:
assignee: Andy McCrae (andrew-mccrae) → Kevin Carter (kevin-carter)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-ansible (master)

Reviewed: https://review.openstack.org/491550
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible/commit/?id=7b39cbaa4f1f3254b7969ad9e8c8cb1ec506301a
Submitter: Jenkins
Branch: master

commit 7b39cbaa4f1f3254b7969ad9e8c8cb1ec506301a
Author: Kevin Carter <email address hidden>
Date: Mon Aug 7 12:28:25 2017 -0500

    Remove the reload from the cinder playbook

    This removes the reload from the cinder playbook because it's causing
    the cinder service(s) to consume 100% CPU which causes gate issues, and
    will result in misbehaving deployments in prod.

    Closes-Bug: 1709346
    Change-Id: Ifd3b7b7b177dfb7d6456f802284046dd7ce96a9a
    Signed-off-by: Kevin Carter <email address hidden>

Changed in openstack-ansible:
status: In Progress → Fix Released
no longer affects: cinder
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-ansible (stable/newton)

Reviewed: https://review.openstack.org/493639
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible/commit/?id=191155546c655131222bf1ac7c0101b7eae5d840
Submitter: Jenkins
Branch: stable/newton

commit 191155546c655131222bf1ac7c0101b7eae5d840
Author: Kevin Carter <email address hidden>
Date: Mon Aug 7 12:28:25 2017 -0500

    Remove the reload from the cinder playbook

    This removes the reload from the cinder playbook because it's causing
    the cinder service(s) to consume 100% CPU which causes gate issues, and
    will result in misbehaving deployments in prod.

    Closes-Bug: 1709346
    Change-Id: Ifd3b7b7b177dfb7d6456f802284046dd7ce96a9a
    Signed-off-by: Kevin Carter <email address hidden>

tags: added: in-stable-newton
tags: added: in-stable-ocata
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-ansible (stable/ocata)

Reviewed: https://review.openstack.org/493638
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible/commit/?id=045c4c5601a7f1d2e63303c7420e9fd518704c4b
Submitter: Jenkins
Branch: stable/ocata

commit 045c4c5601a7f1d2e63303c7420e9fd518704c4b
Author: Kevin Carter <email address hidden>
Date: Mon Aug 7 12:28:25 2017 -0500

    Remove the reload from the cinder playbook

    This removes the reload from the cinder playbook because it's causing
    the cinder service(s) to consume 100% CPU which causes gate issues, and
    will result in misbehaving deployments in prod.

    Closes-Bug: 1709346
    Change-Id: Ifd3b7b7b177dfb7d6456f802284046dd7ce96a9a
    Signed-off-by: Kevin Carter <email address hidden>

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/openstack-ansible 16.0.0.0rc1

This issue was fixed in the openstack/openstack-ansible 16.0.0.0rc1 release candidate.

Revision history for this message
Sean McGinnis (sean-mcginnis) wrote :

Would it be possible to add a sleep to the Ansible script before sending the SIGHUP? I'm investigating now, but a good suspicion raised was that the signal may be getting sent before the services are fully up and ready for it.

Or is there already some delay between when the service starts and when it gets sent the SIGHUP?
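If a delay were wanted, a hand-rolled sketch might look like this (service name and timing are illustrative, not taken from the playbook):

# Hypothetical: wait until the scheduler reports active, give it extra time
# to finish starting, then send the reload signal.
until systemctl is-active --quiet cinder-scheduler; do sleep 1; done
sleep 10
systemctl kill --signal=SIGHUP cinder-scheduler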

Revision history for this message
Sean McGinnis (sean-mcginnis) wrote :

I'm also curious about the above, as I have a running setup where I sent SIGHUP to the cinder-scheduler service and was not able to reproduce this behavior.

Revision history for this message
Sean McGinnis (sean-mcginnis) wrote :

Also confirmed on my cinder-backup host: I sent a SIGHUP and did not observe any unusual CPU impact.
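The manual check amounts to roughly the following (process lookup and sampling interval are illustrative):

# Send SIGHUP to cinder-backup, then sample its CPU usage for a few seconds
kill -HUP $(pgrep -f cinder-backup | head -1)
pidstat -p $(pgrep -f cinder-backup | head -1) 1 10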

Changed in cinder:
status: New → Incomplete
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/openstack-ansible 15.1.8

This issue was fixed in the openstack/openstack-ansible 15.1.8 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/openstack-ansible 14.2.8

This issue was fixed in the openstack/openstack-ansible 14.2.8 release.
