Heavily loaded nova-compute instances don't send reports frequently enough

Bug #1045152 reported by Tiantian Gao
This bug affects 2 people
Affects                   Status        Importance  Assigned to   Milestone
OpenStack Compute (nova)  Fix Released  High        Tiantian Gao
Essex                     Fix Released  Undecided   Unassigned
oslo-incubator            Invalid       High        Unassigned
nova (Ubuntu)             Fix Released  Undecided   Unassigned
Precise                   Fix Released  Undecided   Unassigned

Bug Description

We know that nova-compute calls report_state to update its service information in the DB, and that the default report_interval is 5s.
When the scheduler launches a new instance, it checks the time at which nova-compute last updated that record. If more than 60 seconds have passed, the scheduler considers the host dead.

But when the host (nova-compute) is running a lot of instances (about 70+), the actual interval between reports can sometimes exceed 2 minutes. This results in an ERROR in the scheduler.
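
For illustration, the liveness check described above can be sketched roughly as follows. This is a minimal sketch and not nova's actual scheduler code: `service_is_up` and `down_after_seconds` are hypothetical names, and the 60-second threshold is simply the value quoted in this description.

```python
from datetime import datetime, timedelta


def service_is_up(last_report, now, down_after_seconds=60):
    """Return True if the last report is recent enough (hypothetical helper)."""
    return now - last_report <= timedelta(seconds=down_after_seconds)


now = datetime(2012, 9, 4, 12, 0, 0)
# A node that reported 5 seconds ago is considered alive.
print(service_is_up(now - timedelta(seconds=5), now))
# A busy node whose last report is more than 2 minutes old is treated as dead,
# even though the nova-compute process itself is still running.
print(service_is_up(now - timedelta(minutes=2, seconds=1), now))
```

The point of the sketch is that the check only sees the timestamp of the last report, so a delayed report is indistinguishable from a dead host.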

I think report_state should run in a thread, rather than in an eventlet.spawn.
What do you think?

Thanks

Michael Still (mikal)
Changed in nova:
status: New → Triaged
importance: Undecided → High
summary: - The report_state interval is longger then 10s actually
+ Heavily loaded nova-compute instances don't sent reports frequently
+ enough
Tiantian Gao (gtt116) wrote :

The same happens in periodic_tasks.
While nova-compute is busy running update_available_resource, it cannot respond to AMQP requests.

security vulnerability: no → yes
Changed in nova:
assignee: nobody → NetEase Cloud Team (netease-cloud)
Michael Still (mikal)
tags: added: canonistack
Changed in nova:
assignee: NetEase Cloud Team (netease-cloud) → TianTian Gao (gtt116)
status: Triaged → In Progress
Thierry Carrez (ttx) wrote :

@mikal: why did you flag this as a security vulnerability? It looks like a plain bug to me. Is there any way this could be specifically triggered by an attacker?

Thierry Carrez (ttx) wrote :

Hmm, above comment is actually meant for TianTian Gao.

Tiantian Gao (gtt116) wrote :

Hmm, can somebody explain why eventlet is used to run the pulse (report_state)?
I think the pulse should be put into a native thread.
Thanks
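
A rough sketch of what running the pulse in a native thread could look like, using only the standard library. This is an assumption-laden illustration, not nova code: `start_heartbeat` and `busy_work` are hypothetical helpers, and the intervals are shortened so the example finishes quickly.

```python
import threading
import time


def start_heartbeat(report, interval):
    """Call report() every `interval` seconds in a native thread (hypothetical helper)."""
    stop = threading.Event()

    def loop():
        while not stop.is_set():
            report()
            stop.wait(interval)  # sleeps, but wakes early once stop is set

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return stop, thread


def busy_work(seconds):
    """Stand-in for a long periodic task that never yields cooperatively."""
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        pass


beats = []
stop, thread = start_heartbeat(lambda: beats.append(time.monotonic()), interval=0.01)
busy_work(0.2)
stop.set()
thread.join()
print(len(beats), "heartbeats recorded while the main thread was busy")
```

One caveat: if eventlet's monkey patching has replaced the `threading` module in the process, a thread created this way would itself be a greenthread, so a real change would need to obtain an unpatched thread.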

description: updated
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/12335
Committed: http://github.com/openstack/nova/commit/be72921c6f38b8b71ffc474ceae58e67241dac22
Submitter: Jenkins
Branch: master

commit be72921c6f38b8b71ffc474ceae58e67241dac22
Author: TianTian Gao <gtt116@126.com>
Date: Tue Sep 4 12:01:41 2012 +0800

    Yield to another greenthread when some time-consuming task finished.

    Partially addresses bug #1045152

    On a heavily loaded compute node, it can be observed that periodic tasks
    take so long to run that the report_state() looping call can be blocked from
    running long enough that the scheduler thinks the host is dead.

    Reduce the chance of this happening by yielding to another greenthread
    after each periodic task has completed and each loop in some methods
    that has linear relationship with the number of instances.

    Change-Id: If2b125708da8298b20497e2e08e52280c102f1e1
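
The effect of yielding after each unit of work can be illustrated with a small stand-alone sketch. It uses `asyncio` from the standard library as a stand-in for eventlet, with `await asyncio.sleep(0)` playing the role that `greenthread.sleep(0)` plays in eventlet; the function names and the three-instance loop are hypothetical and are not taken from the commit.

```python
import asyncio


async def heartbeat(log, stop):
    """Stand-in for the report_state() looping call."""
    while not stop.is_set():
        log.append("beat")
        await asyncio.sleep(0)  # let other tasks run


async def periodic_task(instances, log, cooperative):
    """Stand-in for a periodic task that loops over every instance."""
    for inst in instances:
        log.append("work:%s" % inst)
        if cooperative:
            # Yield after each iteration so the heartbeat gets a turn.
            await asyncio.sleep(0)


async def run(cooperative):
    log = []
    stop = asyncio.Event()
    hb = asyncio.create_task(heartbeat(log, stop))
    await periodic_task(range(3), log, cooperative)
    stop.set()
    await hb
    return log


blocking = asyncio.run(run(cooperative=False))
yielding = asyncio.run(run(cooperative=True))
print("without yield:", blocking.count("beat"), "heartbeats")
print("with yield:   ", yielding.count("beat"), "heartbeats")
```

Without the yield, the heartbeat task never gets scheduled while the loop runs; with it, the heartbeat is interleaved with the work. As the commit message says, this only reduces the chance of starvation, since a single long iteration still blocks everything else.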

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
security vulnerability: yes → no
Mark McLoughlin (markmc)
Changed in openstack-common:
status: New → Confirmed
importance: Undecided → High
Alan Pevec (apevec)
tags: added: essex-backport
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/essex)

Fix proposed to branch: stable/essex
Review: https://review.openstack.org/12985

Joshua Harlow (harlowja) wrote :

I'd hope we can make this a native thread, right? Scattering greenthread sleeps all over the code seems to show there is a larger architectural problem here, IMHO.

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/essex)

Reviewed: https://review.openstack.org/12985
Committed: http://github.com/openstack/nova/commit/47dabb30dc09282d56ad1e54c7652bf35394f7df
Submitter: Jenkins
Branch: stable/essex

commit 47dabb30dc09282d56ad1e54c7652bf35394f7df
Author: TianTian Gao <gtt116@126.com>
Date: Tue Sep 4 12:01:41 2012 +0800

    Yield to another greenthread when some time-consuming task finished.

    Partially addresses bug #1045152

    On a heavily loaded compute node, it can be observed that periodic tasks
    take so long to run that the report_state() looping call can be blocked from
    running long enough that the scheduler thinks the host is dead.

    Reduce the chance of this happening by yielding to another greenthread
    after each periodic task has completed and each loop in some methods
    that has linear relationship with the number of instances.

    Change-Id: If2b125708da8298b20497e2e08e52280c102f1e1

Thierry Carrez (ttx)
Changed in nova:
milestone: none → folsom-rc1
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: folsom-rc1 → 2012.2
Mark McLoughlin (markmc)
affects: openstack-common → oslo
Mark McLoughlin (markmc) wrote :

See also https://review.openstack.org/16605 - that review tried to pull in the fix for this bug as part of fixing another issue

flolle (florian-feldhaus) wrote :

I encountered the same problem when running get_vnc_console. After some debugging, it turned out that update_available_resource was blocking the execution of get_vnc_console. This is a big problem, because a GET request for the VNC console waits for get_vnc_console and can end in an HTTP timeout.
I wrote a little script to measure the duration of get_vnc_console, which can be found here: https://gist.github.com/4197449
The result is that whenever update_available_resource is running, the duration of get_vnc_console goes up from 1-2 seconds to 10-30 seconds.
I applied the patch from #8 and restarted nova-compute, nova-api and nova-network, but the problem still persists.
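
For anyone wanting to reproduce this kind of measurement, a generic timing helper might look like the sketch below. It is not the script from the gist: `measure` is a hypothetical helper, and the `time.sleep` call is only a placeholder for the real request being timed.

```python
import time


def measure(call, repeats=5):
    """Return the wall-clock duration in seconds of each call (hypothetical helper)."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        call()
        durations.append(time.perf_counter() - start)
    return durations


# Placeholder workload; replace the lambda with the request to be timed.
durations = measure(lambda: time.sleep(0.01), repeats=3)
print(["%.3f" % d for d in durations])
```

Running it repeatedly while a periodic task is active and while it is idle gives the kind of before/after comparison described above.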

Changed in nova (Ubuntu):
status: New → Fix Released
information type: Public → Public Security
information type: Public Security → Private Security
information type: Private Security → Public Security
information type: Public Security → Public
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/18532

Mark McLoughlin (markmc)
Changed in oslo:
status: Confirmed → Triaged
Brian Murray (brian-murray) wrote : Please test proposed package

Hello TianTian, or anyone else affected,

Accepted nova into precise-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/nova/2012.1.3+stable-20130423-e52e6912-0ubuntu1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in nova (Ubuntu Precise):
status: New → Fix Committed
tags: added: verification-needed
Yolanda Robla (yolanda.robla) wrote : Verification report.

Please find the attached test log from the Ubuntu Server Team's CI infrastructure. As part of the verification process for this bug, Nova has been deployed and configured across multiple nodes using precise-proposed as an installation source. After successful bring-up and configuration of the cluster, a number of exercises and smoke tests have been invoked to ensure the updated package did not introduce any regressions. A number of test iterations were carried out to catch any possible transient errors.

Please note the list of installed packages at the top and bottom of the report.

For records of upstream test coverage of this update, please see the Jenkins links in the comments of the relevant upstream code-review(s):

Trunk review: https://review.openstack.org/12335
Stable review: https://review.openstack.org/12985

As per the provisional Micro Release Exception granted to this package by the Technical Board, we hope this contributes toward verification of this update.

Yolanda Robla (yolanda.robla) wrote :

Test coverage log.

tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package nova - 2012.1.3+stable-20130423-e52e6912-0ubuntu1

---------------
nova (2012.1.3+stable-20130423-e52e6912-0ubuntu1) precise-proposed; urgency=low

  * Resynchronize with stable/essex (e52e6912) (LP: #1089488):
    - [48e81f1] VNC proxy can be made to connect to wrong VM LP: 1125378
    - [3bf5a58] snat rule too broad for some network configurations LP: 1048765
    - [efaacda] DOS by allocating all fixed ips LP: 1125468
    - [b683ced] Add nosehtmloutput as a test dependency.
    - [45274c8] Nova unit tests not running, but still passing for stable/essex
      LP: 1132835
    - [e02b459] vnc unit-test fixes
    - [87361d3] Jenkins jobs fail because of incompatibility between sqlalchemy-
      migrate and the newest sqlalchemy-0.8.0b1 (LP: #1073569)
    - [e98928c] VNC proxy can be made to connect to wrong VM LP: 1125378
    - [c0a10db] DoS through XML entity expansion (CVE-2013-1664) LP: 1100282
    - [243d516] No authentication on block device used for os-volume_boot
      LP: 1069904
    - [80fefe5] use_single_default_gateway does not function correctly
      (LP: #1075859)
    - [bd10241] Essex 2012.1.3 : Error deleting instance with 2 Nova Volumes
      attached (LP: #1079745)
    - [86a5937] do_refresh_security_group_rules in nova.virt.firewall is very
      slow (LP: #1062314)
    - [ae9c5f4] deallocate_fixed_ip attempts to update an already deleted
      fixed_ip (LP: #1017633)
    - [20f98c5] failed to allocate fixed ip because old deleted one exists
      (LP: #996482)
    - [75f6922] snapshot stays in saving state if the vm base image is deleted
      (LP: #921774)
    - [1076699] lock files may be removed in error dues to permissions issues
      (LP: #1051924)
    - [40c5e94] ensure_default_security_group() does not call sgh (LP: #1050982)
    - [4eebe76] At termination, LXC rootfs is not always unmounted before
      rmtree() is called (LP: #1046313)
    - [47dabb3] Heavily loaded nova-compute instances don't sent reports
      frequently enough (LP: #1045152)
    - [b375b4f] When attach volume lost attach when node restart (LP: #1004791)
    - [4ac2dcc] nova usage-list returns wrong usage (LP: #1043999)
    - [014fcbc] Bridge port's hairpin mode not set after resuming a machine
      (LP: #1040537)
    - [2f35f8e] Nova flavor ephemeral space size reported incorrectly
      (LP: #1026210)
  * Dropped, superseeded by new snapshot:
    - debian/patches/CVE-2013-0335.patch: [48e81f1]
    - debian/patches/CVE-2013-1838.patch: [efaacda]
    - debian/patches/CVE-2013-1664.patch: [c0a10db]
    - debian/patches/CVE-2013-0208.patch: [243d516]
 -- Yolanda <email address hidden> Mon, 22 Apr 2013 12:37:08 +0200

Changed in nova (Ubuntu Precise):
status: Fix Committed → Fix Released
Scott Kitterman (kitterman) wrote : Update Released

The verification of this Stable Release Update has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates, please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Changed in oslo:
status: Triaged → Invalid