Querying Windows via WMI intermittently fails in get_device_number_for_target

Bug #1247901 reported by Jay Bryant
This bug affects 1 person
Affects                     Status         Importance   Assigned to   Milestone
OpenStack Compute (nova)    Fix Released   Medium       Jay Bryant
  Havana                    Fix Released   Medium       Jay Bryant

Bug Description

The WMI query in basevolumeutils.get_device_number_for_target can incorrectly return no device_number during extended stress runs. We observed it returning incorrect data after about an hour of load. The query could also return an empty initiator_sessions object.

Adding a check that the devices list is not empty, plus a retry loop in volumeops._get_mounted_disk_from_lun, lets us avoid the case where the driver concludes it cannot find a mounted disk for the target_iqn.
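A minimal sketch of the empty-result guard, assuming the WMI query result is handed in as a list of session objects instead of being fetched from WMI. This is not nova's actual code: the property names Devices, ScsiLun, and DeviceNumber are modeled on the MSiSCSIInitiator WMI classes and should be treated as assumptions, and SimpleNamespace objects stand in for real WMI results.

```python
from types import SimpleNamespace
from typing import Optional, Sequence


def get_device_number_for_target(initiator_sessions: Sequence,
                                 target_lun: int) -> Optional[int]:
    """Return the disk device number for target_lun, or None if unknown.

    Sketch only: the real helper runs a WMI query for the target's
    iSCSI sessions; here that query result is passed in directly.
    """
    # The query itself may come back empty during long stress runs.
    if not initiator_sessions:
        return None

    devices = initiator_sessions[0].Devices
    # Guard described in this bug: a session whose device list is empty.
    if not devices:
        return None

    for device in devices:
        if device.ScsiLun == target_lun:
            return device.DeviceNumber
    # No device matched the requested LUN.
    return None


# Example with stand-in objects in place of real WMI results.
session = SimpleNamespace(Devices=[SimpleNamespace(ScsiLun=0, DeviceNumber=3)])
print(get_device_number_for_target([session], 0))                      # 3
print(get_device_number_for_target([SimpleNamespace(Devices=[])], 0))  # None
```

Returning None for every "nothing found" case keeps the caller's decision simple: it either retries the lookup or reports that no mounted disk exists.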

Jay Bryant (jsbryant)
tags: added: hyper-v
Alessandro Pilotti (alexpilotti) wrote :

Hi Jay,

Good catch. Can you provide some info on your environment and, if possible, the integration tests you are using to reproduce the issue?

Can you please post a pastebin with the Nova compute log as well?

Tx

Jay Bryant (jsbryant) wrote :

Alex,

The environment we were running in used iSCSI from a Linux-based control node, with the volumes on LVM.

We started seeing the attachments fail intermittently, after about an hour, with:
    raise exception.NotFound(_('Unable to find a mounted disk for '
                               'target_iqn: %s') % target_iqn)

The person who has access to the affected system is on vacation right now, so I can't get you a pastebin yet. I have a patch written up; we ran with it for 7 hours without the problem recurring.

Is this the sort of patch you would expect to come with a unit test as well?

Jay Bryant (jsbryant) wrote :

As for how we encountered the problem, I believe it was while running Tempest.

Jay Bryant (jsbryant)
tags: added: havana-backport-potential
Matt Riedemann (mriedem) wrote :
Changed in nova:
status: New → In Progress
assignee: nobody → Jay Bryant (jsbryant)
Changed in nova:
importance: Undecided → Medium
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/55449
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d143540ad1b69ec93c2b7bfadd1f654c4d8c7a34
Submitter: Jenkins
Branch: master

commit d143540ad1b69ec93c2b7bfadd1f654c4d8c7a34
Author: Jay S. Bryant <email address hidden>
Date: Wed Nov 6 10:49:00 2013 -0600

    hyperv: Retry after WMI query fails to find dev

    During long stress runs the WMI query that is looking for
    the iSCSI device number can incorrectly return no data.
    If the query is retried the appropriate data can then be
    obtained.

    This commit adds a retry loop, calling
    basevolumeutils.get_device_number_for_target to avoid this situation.
    It also handles the case where the devices list returned in
    get_device_number_for_target is empty. The retry loop is
    implemented with new mounted_disk_query_retry_count and
    mounted_disk_query_retry_interval configuration options.

    Unit tests have been added to check the good and bad paths for
    get_mounted_disk_from_lun.

    DocImpact
    Closes-bug: 1247901
    Change-Id: I082c4b1694efcd20cce65293cd330b7a0cf7d470
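The retry behaviour described in this commit message can be sketched as follows. This is a hedged illustration, not the merged nova code: the function name get_mounted_disk_device_number, the injected query callable, and the local NotFound class are hypothetical stand-ins, and the retry_count / retry_interval parameters merely play the role of the mounted_disk_query_retry_count and mounted_disk_query_retry_interval options.

```python
import time
from typing import Callable, Optional


class NotFound(Exception):
    """Hypothetical stand-in for nova.exception.NotFound in this sketch."""


def get_mounted_disk_device_number(
        query: Callable[[], Optional[int]],
        target_iqn: str,
        retry_count: int,
        retry_interval: float,
        sleep: Callable[[float], None] = time.sleep) -> int:
    """Re-run a flaky device-number lookup before giving up.

    query wraps something like get_device_number_for_target and
    returns None when the lookup comes back empty.
    """
    for attempt in range(1, retry_count + 1):
        device_number = query()
        if device_number is not None:
            return device_number
        # Wait between attempts, but not after the final one.
        if attempt < retry_count:
            sleep(retry_interval)
    raise NotFound('Unable to find a mounted disk for target_iqn: %s'
                   % target_iqn)


# Example: a lookup that comes back empty twice, then succeeds.
answers = iter([None, None, 3])
print(get_mounted_disk_device_number(lambda: next(answers), 'iqn.example',
                                     retry_count=5, retry_interval=0.1,
                                     sleep=lambda seconds: None))  # 3
```

Passing sleep as a parameter is a design choice for the sketch: a test can substitute a no-op or recording function and exercise both the good path (the lookup eventually succeeds) and the bad path (NotFound after the last attempt) without real delays.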

Changed in nova:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/havana)

Fix proposed to branch: stable/havana
Review: https://review.openstack.org/72771

Alessandro Pilotti (alexpilotti) wrote :

This fix introduced an issue in the test case:

https://bugs.launchpad.net/nova/+bug/1280379

Changed in nova:
milestone: none → icehouse-3
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/havana)

Reviewed: https://review.openstack.org/72771
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1c7ff2af939e4be049f65d87ef17547ee95285a0
Submitter: Jenkins
Branch: stable/havana

commit 1c7ff2af939e4be049f65d87ef17547ee95285a0
Author: Jay S. Bryant <email address hidden>
Date: Wed Nov 6 10:49:00 2013 -0600

    (cherry picked from commit d143540ad1b69ec93c2b7bfadd1f654c4d8c7a34)

tags: added: in-stable-havana
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: icehouse-3 → 2014.1