Heavily loaded nova-compute instances don't sent reports frequently enough

Bug #1045152 reported by Tiantian Gao on 2012-09-03
This bug affects 2 people
Affects                     Importance   Assigned to
OpenStack Compute (nova)    High         Tiantian Gao
  Essex                     Undecided    Unassigned
oslo-incubator              High         Unassigned
nova (Ubuntu)               Undecided    Unassigned
  Precise                   Undecided    Unassigned

Bug Description

We know that nova-compute calls report_state to update its service information in the DB, and the default report_interval is 5s.
When the scheduler runs a new instance, it checks the time at which nova-compute last updated that record. If the update is more than 60 seconds old, the scheduler considers the host dead.

But when the host (nova-compute) is running a lot of instances (about 70+), the actual interval between reports can sometimes be greater than 2 minutes. This results in an ERROR in the scheduler.

I think report_state should run in a native thread rather than in a greenthread started with eventlet.spawn.
What do you guys think?

Thanks
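
The liveness check described in this report can be sketched roughly as follows. This is only an illustration of the idea, not nova's actual code: the function name is_service_up and its parameters are hypothetical, and the 60-second threshold is simply the value given in the report.

```python
from datetime import datetime, timedelta


def is_service_up(last_report, now, service_down_time=60):
    """Return True if the last heartbeat is recent enough.

    last_report and now are datetime objects; service_down_time is the
    maximum tolerated age of the last report, in seconds (assumed value).
    """
    elapsed = (now - last_report).total_seconds()
    return elapsed <= service_down_time


if __name__ == "__main__":
    now = datetime(2012, 9, 3, 12, 0, 0)
    # A report from a few seconds ago: host is considered alive.
    print(is_service_up(now - timedelta(seconds=5), now))
    # A report delayed by two minutes, as seen on the loaded host: considered dead.
    print(is_service_up(now - timedelta(minutes=2), now))
```

With a threshold like this, a single heartbeat delayed by more than a minute is enough for the scheduler to skip an otherwise healthy host.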

Michael Still (mikal) on 2012-09-03
Changed in nova:
status: New → Triaged
importance: Undecided → High
summary: - The report_state interval is longger then 10s actually
+ Heavily loaded nova-compute instances don't sent reports frequently
+ enough
Tiantian Gao (gtt116) wrote :

The same happens in periodic_tasks.
While nova-compute is busy running update_available_resource, it cannot respond to AMQP requests.

security vulnerability: no → yes
Changed in nova:
assignee: nobody → NetEase Cloud Team (netease-cloud)
Michael Still (mikal) on 2012-09-04
tags: added: canonistack
Changed in nova:
assignee: NetEase Cloud Team (netease-cloud) → TianTian Gao (gtt116)
status: Triaged → In Progress
Thierry Carrez (ttx) wrote :

@mikal: why did you flag this as a security vulnerability? It looks like a plain bug to me. Is there any way this could be specifically triggered by an attacker?

Thierry Carrez (ttx) wrote :

Hmm, above comment is actually meant for TianTian Gao.

Tiantian Gao (gtt116) wrote :

Hmm, can somebody explain why eventlet is used to run the pulse (report_state)?
I think the pulse should be put into a native thread.
Thx

description: updated

Reviewed: https://review.openstack.org/12335
Committed: http://github.com/openstack/nova/commit/be72921c6f38b8b71ffc474ceae58e67241dac22
Submitter: Jenkins
Branch: master

commit be72921c6f38b8b71ffc474ceae58e67241dac22
Author: TianTian Gao <gtt116@126.com>
Date: Tue Sep 4 12:01:41 2012 +0800

    Yield to another greenthread when some time-consuming task finished.

    Partially addresses bug #1045152

    On a heavily loaded compute node, it can be observed that periodic tasks
    take so long to run that the report_state() looping call can be blocked from
    running long enough that the scheduler thinks the host is dead.

    Reduce the chance of this happening by yielding to another greenthread
    after each periodic task has completed and each loop in some methods
    that has linear relationship with the number of instances.

    Change-Id: If2b125708da8298b20497e2e08e52280c102f1e1
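
The starvation described in the commit message, and the effect of yielding, can be reproduced with a small cooperative-scheduling sketch. Note the assumptions: this uses the standard library's asyncio as a stand-in for eventlet (in eventlet, the corresponding yield is eventlet.sleep(0)), and report_state and periodic_task here are toy functions, not nova's.

```python
import asyncio


async def report_state(beats):
    """Toy heartbeat: record one beat each time it gets to run."""
    while True:
        beats.append(1)
        await asyncio.sleep(0)  # hand control back to the event loop


async def periodic_task(n_instances, cooperative):
    """Toy periodic task whose cost grows with the number of instances."""
    for _ in range(n_instances):
        sum(range(1000))  # pretend per-instance work that never waits on I/O
        if cooperative:
            await asyncio.sleep(0)  # yield after each loop iteration


async def _run(n_instances, cooperative):
    beats = []
    heartbeat = asyncio.create_task(report_state(beats))
    await periodic_task(n_instances, cooperative)
    count = len(beats)  # beats recorded while the periodic task was running
    heartbeat.cancel()
    try:
        await heartbeat
    except asyncio.CancelledError:
        pass
    return count


def beats_during_task(n_instances=70, cooperative=True):
    return asyncio.run(_run(n_instances, cooperative))


if __name__ == "__main__":
    print("without yielding:", beats_during_task(cooperative=False))
    print("with yielding:   ", beats_during_task(cooperative=True))
```

Without the explicit yield, the heartbeat never runs until the whole task has finished, however long that takes. With it, the heartbeat gets a turn between iterations, which is why the change reduces the problem without removing it: any code path that does not yield can still hold up the heartbeat.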

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx) on 2012-09-05
security vulnerability: yes → no
Mark McLoughlin (markmc) on 2012-09-05
Changed in openstack-common:
status: New → Confirmed
importance: Undecided → High
Alan Pevec (apevec) on 2012-09-13
tags: added: essex-backport
Joshua Harlow (harlowja) wrote :

I'd hope we can make this a native thread right? Scattering greenthread sleep all over the code seems to show there is a larger architectural problem here imho.
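
For comparison, the native-thread alternative raised in this thread might look roughly like the sketch below. It is an assumption-laden illustration, not a patch: HeartbeatThread and its report callback are hypothetical names, and it ignores the question of how a native thread would coexist with eventlet's monkey patching and with database access inside nova-compute.

```python
import threading
import time


class HeartbeatThread(threading.Thread):
    """Call report() every `interval` seconds from a native OS thread."""

    def __init__(self, report, interval):
        super().__init__(daemon=True)
        self._report = report
        self._interval = interval
        self._stop_event = threading.Event()

    def run(self):
        # Event.wait() returns False on timeout and True once stop() is called.
        while not self._stop_event.wait(self._interval):
            self._report()

    def stop(self):
        self._stop_event.set()
        self.join()


def count_beats_while_busy(busy_seconds=0.3, interval=0.02):
    """Count heartbeats delivered while the main thread spins without yielding."""
    beats = []
    heartbeat = HeartbeatThread(lambda: beats.append(time.monotonic()), interval)
    heartbeat.start()
    deadline = time.monotonic() + busy_seconds
    while time.monotonic() < deadline:
        pass  # busy loop: the main thread never yields cooperatively
    heartbeat.stop()
    return len(beats)


if __name__ == "__main__":
    print("beats while busy:", count_beats_while_busy())
```

Because the operating system and the interpreter preempt threads, the heartbeat keeps firing even though the busy loop never yields, so no sleep calls need to be scattered through the periodic tasks.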

Reviewed: https://review.openstack.org/12985
Committed: http://github.com/openstack/nova/commit/47dabb30dc09282d56ad1e54c7652bf35394f7df
Submitter: Jenkins
Branch: stable/essex

commit 47dabb30dc09282d56ad1e54c7652bf35394f7df
Author: TianTian Gao <gtt116@126.com>
Date: Tue Sep 4 12:01:41 2012 +0800

    Yield to another greenthread when some time-consuming task finished.

    Partially addresses bug #1045152

    On a heavily loaded compute node, it can be observed that periodic tasks
    take so long to run that the report_state() looping call can be blocked from
    running long enough that the scheduler thinks the host is dead.

    Reduce the chance of this happening by yielding to another greenthread
    after each periodic task has completed and each loop in some methods
    that has linear relationship with the number of instances.

    Change-Id: If2b125708da8298b20497e2e08e52280c102f1e1

Thierry Carrez (ttx) on 2012-09-19
Changed in nova:
milestone: none → folsom-rc1
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2012-09-27
Changed in nova:
milestone: folsom-rc1 → 2012.2
Mark McLoughlin (markmc) on 2012-11-06
affects: openstack-common → oslo
Mark McLoughlin (markmc) wrote :

See also https://review.openstack.org/16605 - that review tried to pull in the fix for this bug as part of fixing another issue

flolle (florian-feldhaus) wrote :

I encountered the same problem when running get_vnc_console. After some debugging, it turned out that update_available_resource was blocking the execution of get_vnc_console. This is a big problem, because a GET request for the VNC console waits for get_vnc_console and can result in an HTTP timeout.
I wrote a little script to measure the duration of get_vnc_console, which can be found here: https://gist.github.com/4197449
The result is that whenever update_available_resource is running, the duration of get_vnc_console goes up from 1-2 seconds to 10-30 seconds.
I applied the patch from #8 and restarted nova-compute, nova-api and nova-network, but the problem still persists.
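
The linked gist is not reproduced here. A generic way to take this kind of measurement is sketched below; time_calls is a hypothetical helper, and the callable passed to it stands in for whatever request triggers get_vnc_console in a given deployment.

```python
import time


def time_calls(func, repeats=5):
    """Return the wall-clock duration, in seconds, of each call to func."""
    durations = []
    for _ in range(repeats):
        start = time.monotonic()
        func()
        durations.append(time.monotonic() - start)
    return durations


if __name__ == "__main__":
    # Placeholder workload; replace with the real console request.
    for seconds in time_calls(lambda: time.sleep(0.01), repeats=3):
        print("%.3f s" % seconds)
```

Running such a loop continuously and noting which samples overlap with update_available_resource would show the kind of jump described in this comment.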

Changed in nova (Ubuntu):
status: New → Fix Released
information type: Public → Public Security
information type: Public Security → Private Security
information type: Private Security → Public Security
information type: Public Security → Public
Mark McLoughlin (markmc) on 2013-01-12
Changed in oslo:
status: Confirmed → Triaged

Hello TianTian, or anyone else affected,

Accepted nova into precise-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/nova/2012.1.3+stable-20130423-e52e6912-0ubuntu1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in nova (Ubuntu Precise):
status: New → Fix Committed
tags: added: verification-needed

Please find the attached test log from the Ubuntu Server Team's CI infrastructure. As part of the verification process for this bug, Nova has been deployed and configured across multiple nodes using precise-proposed as an installation source. After successful bring-up and configuration of the cluster, a number of exercises and smoke tests have been invoked to ensure the updated package did not introduce any regressions. A number of test iterations were carried out to catch any possible transient errors.

Please note the list of installed packages at the top and bottom of the report.

For records of upstream test coverage of this update, please see the Jenkins links in the comments of the relevant upstream code-review(s):

Trunk review: https://review.openstack.org/12335
Stable review: https://review.openstack.org/12985

As per the provisional Micro Release Exception granted to this package by the Technical Board, we hope this contributes toward verification of this update.

Yolanda Robla (yolanda.robla) wrote :

Test coverage log.

tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package nova - 2012.1.3+stable-20130423-e52e6912-0ubuntu1

---------------
nova (2012.1.3+stable-20130423-e52e6912-0ubuntu1) precise-proposed; urgency=low

  * Resynchronize with stable/essex (e52e6912) (LP: #1089488):
    - [48e81f1] VNC proxy can be made to connect to wrong VM LP: 1125378
    - [3bf5a58] snat rule too broad for some network configurations LP: 1048765
    - [efaacda] DOS by allocating all fixed ips LP: 1125468
    - [b683ced] Add nosehtmloutput as a test dependency.
    - [45274c8] Nova unit tests not running, but still passing for stable/essex
      LP: 1132835
    - [e02b459] vnc unit-test fixes
    - [87361d3] Jenkins jobs fail because of incompatibility between sqlalchemy-
      migrate and the newest sqlalchemy-0.8.0b1 (LP: #1073569)
    - [e98928c] VNC proxy can be made to connect to wrong VM LP: 1125378
    - [c0a10db] DoS through XML entity expansion (CVE-2013-1664) LP: 1100282
    - [243d516] No authentication on block device used for os-volume_boot
      LP: 1069904
    - [80fefe5] use_single_default_gateway does not function correctly
      (LP: #1075859)
    - [bd10241] Essex 2012.1.3 : Error deleting instance with 2 Nova Volumes
      attached (LP: #1079745)
    - [86a5937] do_refresh_security_group_rules in nova.virt.firewall is very
      slow (LP: #1062314)
    - [ae9c5f4] deallocate_fixed_ip attempts to update an already deleted
      fixed_ip (LP: #1017633)
    - [20f98c5] failed to allocate fixed ip because old deleted one exists
      (LP: #996482)
    - [75f6922] snapshot stays in saving state if the vm base image is deleted
      (LP: #921774)
    - [1076699] lock files may be removed in error dues to permissions issues
      (LP: #1051924)
    - [40c5e94] ensure_default_security_group() does not call sgh (LP: #1050982)
    - [4eebe76] At termination, LXC rootfs is not always unmounted before
      rmtree() is called (LP: #1046313)
    - [47dabb3] Heavily loaded nova-compute instances don't sent reports
      frequently enough (LP: #1045152)
    - [b375b4f] When attach volume lost attach when node restart (LP: #1004791)
    - [4ac2dcc] nova usage-list returns wrong usage (LP: #1043999)
    - [014fcbc] Bridge port's hairpin mode not set after resuming a machine
      (LP: #1040537)
    - [2f35f8e] Nova flavor ephemeral space size reported incorrectly
      (LP: #1026210)
  * Dropped, superseeded by new snapshot:
    - debian/patches/CVE-2013-0335.patch: [48e81f1]
    - debian/patches/CVE-2013-1838.patch: [efaacda]
    - debian/patches/CVE-2013-1664.patch: [c0a10db]
    - debian/patches/CVE-2013-0208.patch: [243d516]
 -- Yolanda <email address hidden> Mon, 22 Apr 2013 12:37:08 +0200

Changed in nova (Ubuntu Precise):
status: Fix Committed → Fix Released

The verification of this Stable Release Update has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates, please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Changed in oslo:
status: Triaged → Invalid