When live-migrate fails, lun-id can't be rolled back in Havana

Bug #1419577 reported by Hyun Ha on 2015-02-09
This bug affects 2 people
Affects: OpenStack Compute (nova); Importance: Medium; Assigned to: Lee Yarwood
Affects: OpenStack Security Advisory; Importance: Undecided; Assigned to: Unassigned

Bug Description

Hi, guys

When a live migration fails with an error, the lun-id in the connection_info column of Nova's block_device_mapping table is not rolled back,
and the VM whose migration failed can end up attached to another user's volume.

My test environment is as follows:

OpenStack version: Havana (2013.2.3)
Compute node OS: 3.5.0-23-generic #35~precise1-Ubuntu SMP Fri Jan 25 17:13:26 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
Compute node multipath: multipath-tools 0.4.9-3ubuntu7.2

Test steps:

1) Create 2 compute nodes (host#1 and host#2)
2) Create 1 VM on host#1 (vm01)
3) Create 1 cinder volume (vol01)
4) Attach the volume to vm01 (/dev/vdb)
5) Live-migrate vm01 from host#1 to host#2
6) The live migration succeeds
     - Check the mapper with the multipath command on host#1 (# multipath -ll); the mapper is not deleted
       and the status of the devices is "failed faulty"
     - Check the lun-id of vol01
7) Live-migrate vm01 again, from host#2 back to host#1 (vm01 was migrated to host#2 at step 5)
8) The live migration fails
     - Check the mapper on host#1
     - Check the lun-id of vol01; the LUN now has two igroups
     - Check the connection_info column in Nova's block_device_mapping table; the lun-id has not been rolled back
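
The last check in step 8 can be scripted. The following is only a minimal sketch: it assumes the connection_info column holds a JSON document whose "data" section carries a "target_lun" key (or a "target_luns" list in some multipath setups), and the two sample strings are made-up values, not output from the environment above.

```python
import json


def target_lun(connection_info_json):
    """Return the LUN id(s) recorded in a connection_info JSON string."""
    data = json.loads(connection_info_json).get("data", {})
    # Assumption: single-path attachments record "target_lun";
    # some multipath setups record a "target_luns" list instead.
    if "target_luns" in data:
        return list(data["target_luns"])
    return data.get("target_lun")


# Hypothetical column values captured before step 5 and after step 8.
before = '{"driver_volume_type": "iscsi", "data": {"target_lun": 1}}'
after = '{"driver_volume_type": "iscsi", "data": {"target_lun": 2}}'

lun_changed = target_lun(before) != target_lun(after)
print("lun-id changed after failed migration:", lun_changed)
```

If the value differs after a failed migration while the VM is still running on the original host, the row was not rolled back.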

This bug is a critical security issue because the VM whose migration failed can end up attached to another user's volume.

Every cinder-volume storage backend can have the same problem, because the bug is in the live-migration rollback process.

I suggest the following ways to solve the issue:

1) When a live migration completes, nova should delete the mapper devices on the origin host
2) When a live migration fails, nova should roll back the lun-id in the connection_info column
3) When a live migration fails, cinder should delete the mapping between the LUN and the host (NetApp: igroup, EMC: storage_group ...)
4) When a volume attach is requested, vendors' cinder volume drivers should generate the lun-id randomly to reduce the probability of mis-mapping

Please check this bug.

Thank you.


Hyun Ha (raymon-ha) on 2015-02-09
description: updated
affects: cinder → nova-project
Hyun Ha (raymon-ha) on 2015-02-09
information type: Private Security → Public Security

Since this report concerns a possible security risk, an incomplete security advisory task has been added while the core security reviewers for the affected project or projects confirm the bug and discuss the scope of any vulnerability along with potential solutions.

Is this only in Havana, or does it also reproduce on Icehouse/Juno?

Changed in ossa:
status: New → Incomplete
Hyun Ha (raymon-ha) wrote :

Hi, Tristan

Icehouse and Juno have the same bug.

Please see the related bug that I reported:
https://bugs.launchpad.net/cinder/+bug/1416314

When a live migration fails, the mapping information between host and LUN can become tangled.
This bug affects not only security but also data stability (a filesystem can be corrupted when a LUN is mis-mapped).

Thanks.

Thierry Carrez (ttx) wrote :

I agree that's a bug with critical consequences, however unless I'm mistaken, it's not a situation that can be triggered or predicted by an attacker, in which case it may not be considered a vulnerability ?

I agree with ttx here, without a way to make the migration process fail, this is a bug with security consequence, but not a vulnerability.

@Hahyun, is there a missing step that an attacker can use to make the live migration (step 8) fail?

Else let's triage this as a class D type of report and remove the advisory task. ( https://wiki.openstack.org/wiki/Vulnerability_Management#Incident_report_taxonomy )

Hyun Ha (raymon-ha) wrote :

Hi,
Thank you for your comment.

The reason I reported this issue as a vulnerability is that an attacker can deliberately attach another user's volume to his own VM.
There are many ways to make a live migration fail.

Firstly, there are two issues on Havana (tag: 2013.2.3).
One is the bug with multipath rescan (https://bugs.launchpad.net/nova/+bug/1362916).
The other is that when a volume is detached, the multipath device is not deleted.
For these reasons, live migration fails when a VM is live-migrated to another compute node and then back to the original compute node.

Secondly, if a live migration is started while a process inside the VM instance keeps using a large amount of memory (a benchmark tool, for example),
the live migration can stay in the waiting state and will eventually fail.

There are other ways to make a live migration fail besides the ones I explained above.
For example, bring down a NIC on the compute node and then execute a live migration; the live migration will fail (when using multipath, iSCSI).

Using the rollback bug is just one way that an attacker can attach another user's volume to his VM.
I think the important point is that nova attaches volumes by lun-id, so if the lun-id is changed by an error or by an attacker, it causes critical security issues.
Please think about the situation below:
An attacker gets admin access to the nova DB.
He changes the lun-id of connection_info in the block_device_mapping table.
He hard-reboots his VM, whose volume now has the changed lun-id.
Finally, the attacker easily gets another user's volume on his VM.

I think the root cause of this bug is that nova uses the “lun-id” to map a VM to a volume.
The lun-id is not unique and can change during the attach/detach process because it is generated dynamically.
I'd like to suggest that nova attach volumes to VMs using the "unique-id" of the LUN, not the lun-id.
In addition, the bug that I reported should be fixed.

Users who have VMs on a public cloud based on OpenStack may feel their VMs are unsafe if they learn about the possibility of volume mis-mapping, because one compute node hosts VMs of many different customers.
So I think this issue should be triaged as a Class A type.

Thank you.
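
The point in the comment above, that a lun-id is generated dynamically and is not a stable identifier, can be illustrated with a toy model. This is a sketch only: `ToyLunAllocator` is a hypothetical backend that hands out the lowest free LUN id per host, which is an assumption and not the behavior of any particular cinder driver.

```python
class ToyLunAllocator:
    """Hypothetical backend that assigns the lowest free LUN id per host."""

    def __init__(self):
        self.mappings = {}  # host -> {lun_id: volume}

    def attach(self, host, volume):
        luns = self.mappings.setdefault(host, {})
        lun_id = 0
        while lun_id in luns:
            lun_id += 1
        luns[lun_id] = volume
        return lun_id

    def detach(self, host, volume):
        luns = self.mappings.get(host, {})
        for lun_id, mapped in list(luns.items()):
            if mapped == volume:
                del luns[lun_id]

    def resolve(self, host, lun_id):
        """Return the volume currently exposed at lun_id on host."""
        return self.mappings.get(host, {}).get(lun_id)


backend = ToyLunAllocator()
stale_lun = backend.attach("host1", "vol-tenant-a")
backend.detach("host1", "vol-tenant-a")
backend.attach("host1", "vol-tenant-b")  # reuses the freed lun-id

# A record that still holds stale_lun now points at a different volume.
resolved = backend.resolve("host1", stale_lun)
print(stale_lun, resolved)
```

In this model, any stored record that keeps only the lun-id silently resolves to whichever volume holds that id at the moment it is used.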

Jeremy Stanley (fungi) wrote :

Unless I'm misreading, you're suggesting potential exploits involving :

1. disconnecting physical network interfaces

2. gaining administrative access to the nova database

Our report taxonomy is not based on feelings and user impressions, but rather on the feasibility of exploiting a bug weighed against its implied risks.

Thierry Carrez (ttx) wrote :

I tend to agree that it becomes a security vulnerability if live-migrate regularly fails. Even if the leak can't be triggered or controlled, it is still a privacy issue. We issued one in the past for OSSA-2013-006:

http://security.openstack.org/ossa/OSSA-2013-006.html

Hyun Ha (raymon-ha) wrote :

Hi,
Live-migrate regularly fails on Havana and all branches before Havana.
Live-migrate can fail on Juno and Icehouse under the specific conditions I reported above.
Thank you.

Thierry Carrez (ttx) wrote :

Agree that it's a vulnerability in Havana (since live-migration fails so often there). I wouldn't consider it a vulnerability in Icehouse/Juno, since you can't trigger live migration failure without administrative or physical access to the machines.

It is a bug with security consequences there, and it should be fixed as soon as possible.

Changed in nova-project:
status: New → Confirmed
Thierry Carrez (ttx) wrote :

As far as OSSA is going, I'd rate this class C1 or D.

affects: nova-project → nova
Changed in nova:
importance: Undecided → High
Jeremy Stanley (fungi) wrote :

Agreed on C1: this wouldn't qualify for an advisory since Havana is no longer supported by the VMT, but it's still something a distro carrying Havana packages of Nova might fix on their own and request a corresponding CVE to track.

information type: Public Security → Public
tags: added: security
Jeremy Stanley (fungi) on 2015-03-09
Changed in ossa:
status: Incomplete → Won't Fix
Garth Mollett (gmollett) wrote :

I'm going to go ahead and request a CVE for this on oss-sec at least for havana (which we [redhat] still support downstream) unless someone has a good reason not to? (or beats me to it)

Matthew Edmonds (edmondsw) wrote :

per http://seclists.org/oss-sec/2015/q1/990:

For purposes of CVE, we typically don't think of vulnerabilities in the way expressed in https://bugs.launchpad.net/nova/+bug/1419577/comments/4 "without a way to make the migration process fail, this is a bug with security consequence, but not a vulnerability." In other words, for a CVE, the attacker can be a person who wishes to have an unauthorized volume attachment after the bug is triggered. The attacker does not need to be a person who has determined a reproducible way to trigger the bug.

Jeremy Stanley (fungi) wrote :

And as I indicated in follow-up replies on that thread, the OpenStack VMT doesn't decide whether or not a bug is worthy of getting a CVE assigned (only whether or not we're going to embargo it and/or eventually issue a security advisory about it).

Matt Riedemann (mriedem) on 2015-05-26
tags: added: live-migration volumes
tags: added: live-migrate
removed: live-migration
Matt Riedemann (mriedem) wrote :

I'm trying to sort this out a bit.

Looking at the nova.virt.libvirt.driver.pre_live_migration() method, I see it's connecting to a volume and the connection_info dictionary is updated in the nova.virt.libvirt.volume code, but I don't see where that connection_info dict comes back to the virt driver's pre_live_migration method and persists the change to the database.

This is where pre_live_migration() connects the volume:

http://git.openstack.org/cgit/openstack/nova/tree/nova/virt/libvirt/driver.py?id=2015.1.0#n5813

Let's assume we're using the LibvirtISCSIVolumeDriver volume driver, the connect_volume method in there will update the connection_info dict here:

http://git.openstack.org/cgit/openstack/nova/tree/nova/virt/libvirt/volume.py?id=2015.1.0#n483

That change never gets persisted back to the block_device_mapping table for the bdm instance, but we've connected the volume potentially on another host so if live migration fails and we never rollback the volume connection_info to the source host (before pre_live_migration), and reboot the instance, then the bdm will be recreated from what's in the database which will be wrong.
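
The mechanism described in the comment above can be modelled in a few lines. This is a simplified sketch, not Nova's code: `db_row` stands in for a block_device_mapping row, `toy_connect_volume` stands in for a volume driver's connect_volume method, and the "device_path" key is an illustrative assumption.

```python
import json

# Hypothetical stand-in for one block_device_mapping row.
db_row = {"connection_info": json.dumps({"data": {"target_lun": 1}})}


def toy_connect_volume(connection_info, device_path):
    """Mimic a volume driver that annotates connection_info in place."""
    connection_info["data"]["device_path"] = device_path


# pre_live_migration-like step: load the dict, connect, never save it back.
in_memory = json.loads(db_row["connection_info"])
toy_connect_volume(in_memory, "/dev/disk/by-path/toy-lun-1")

# A later reboot rebuilds the attachment from the stored row only.
rebuilt = json.loads(db_row["connection_info"])

print("device_path" in in_memory["data"])
print("device_path" in rebuilt["data"])
```

The in-memory dict and the stored row diverge, and a reboot only ever sees the stored row, so whatever that row holds after a failed migration is what the instance is rebuilt with.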

Justin Shepherd (jshepher) wrote :

Should this bug remain open as it is targeted to a no longer supported version (havana)?

Matt Riedemann (mriedem) wrote :

I have a feeling that this is fixed via https://review.openstack.org/#/c/211051/ .

Matt Riedemann (mriedem) wrote :

Per comment 17, maybe not given that's post live migration which is only called for a successful live migration, the rollback is called for a failed live migration.

Matt Riedemann (mriedem) wrote :

@Justin, per comment 16, this was reported against Havana but as far as I can tell this is not yet resolved in master (Liberty right now).

Matt Riedemann (mriedem) wrote :

I'm wondering if the fix (https://review.openstack.org/#/c/202770/) for bug 1475411 plays here, i.e. the case that the live migration from A to B is considered successful even though we didn't disconnect the correct volumes, and then that causes the migration from B back to A to fail.

Paul Murray (pmurray) on 2015-11-06
tags: added: live-migration
removed: live-migrate
lvmxh (shaohef) on 2015-11-17
Changed in nova:
assignee: nobody → lvmxh (shaohef)
lvmxh (shaohef) on 2015-12-03
Changed in nova:
assignee: lvmxh (shaohef) → nobody
Tobias Urdin (tobias-urdin) wrote :

Can anybody please verify if my bug is a duplicate of this one? https://bugs.launchpad.net/nova/+bug/1525802

lvmxh (shaohef) on 2016-01-27
information type: Public → Public Security
Jeremy Stanley (fungi) wrote :

Since this bug was switched from public back to public security with no comment explaining why, I have reset it to public again. Please, whenever moving a bug to a security type, add a comment with your reasoning.

information type: Public Security → Public
Lee Yarwood (lyarwood) wrote :

We've seen this downstream against RHEL OSP 7 (kilo) documented (mostly privately) in the following RHBZ:

iscsi details changed for cinder volume using EMCCLIISCSIDriver
https://bugzilla.redhat.com/show_bug.cgi?id=1353147

We manually reverted the changes to the target_luns to work around the issue in this case. This still looks possible against master, so I'm going to propose a change that refreshes connection_info on the source host during _rollback_live_migration.

Fix proposed to branch: master
Review: https://review.openstack.org/338929

Changed in nova:
assignee: nobody → Lee Yarwood (lyarwood)
status: Confirmed → In Progress

Reviewed: https://review.openstack.org/342111
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b83cae02ece4c338e09c3606c6ae69b715bd6f8c
Submitter: Jenkins
Branch: master

commit b83cae02ece4c338e09c3606c6ae69b715bd6f8c
Author: Lee Yarwood <email address hidden>
Date: Thu Jul 14 11:53:09 2016 +0100

    block_device: Make refresh_conn_infos py3 compatible

    Also add a simple test ensuring that refresh_connection_info is called
    for each DriverVolumeBlockDevice derived device provided.

    Related-Bug: #1419577
    Partially-Implements: blueprint goal-python35
    Change-Id: Ib1ff00e7f4f5b599317d7111c322ce9af8a9a2b1

Changed in nova:
assignee: Lee Yarwood (lyarwood) → Dan Smith (danms)
Changed in nova:
assignee: Dan Smith (danms) → Lee Yarwood (lyarwood)

Change abandoned by Lee Yarwood (<email address hidden>) on branch: master
Review: https://review.openstack.org/391598

Matt Riedemann (mriedem) on 2016-12-09
Changed in nova:
status: In Progress → Confirmed
assignee: Lee Yarwood (lyarwood) → nobody
Sean Dague (sdague) on 2016-12-09
Changed in nova:
importance: High → Medium
Lee Yarwood (lyarwood) wrote :

https://review.openstack.org/#/c/338929/ is the correct WIP review that I've been working on. It's currently waiting on https://review.openstack.org/#/c/389608/ so we can use the stashed bdms to reset connection_info without additional calls to Cinder.
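
The idea of resetting connection_info from stashed bdms on rollback can be sketched as follows. This is an illustration under stated assumptions, not the change under review: `live_migrate`, `toy_pre_live_migration` and `failing_migrate` are hypothetical names, and the bdms are plain dicts rather than Nova objects.

```python
import copy


def live_migrate(bdms, pre_live_migration, migrate):
    """Toy flow: stash connection_info, restore it if the migration fails."""
    stashed = {
        bdm["volume_id"]: copy.deepcopy(bdm["connection_info"]) for bdm in bdms
    }
    try:
        pre_live_migration(bdms)
        migrate()
    except Exception:
        # Rollback: reset every bdm from the stash, with no backend calls.
        for bdm in bdms:
            bdm["connection_info"] = stashed[bdm["volume_id"]]
        return False
    return True


def toy_pre_live_migration(bdms):
    # Pretend the destination host was given a different lun-id.
    for bdm in bdms:
        bdm["connection_info"]["data"]["target_lun"] = 7


def failing_migrate():
    raise RuntimeError("simulated live migration failure")


bdms = [{"volume_id": "vol01", "connection_info": {"data": {"target_lun": 1}}}]
ok = live_migrate(bdms, toy_pre_live_migration, failing_migrate)
print(ok, bdms[0]["connection_info"]["data"]["target_lun"])
```

The deep copy matters here: because the destination-side step mutates the dict in place, a shallow reference to the original connection_info would be overwritten along with it.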

Changed in nova:
assignee: nobody → Lee Yarwood (lyarwood)
status: Confirmed → In Progress