console connects to wrong guest during migration screwage

Bug #352302 reported by Mitchell Berger
Affects: Invirt Project
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

At some point this evening, Greg ran into a problem which necessitated
a reboot of shadow-moses. One of my VMs, discuss64, was on s-m at
the time. He attempted to migrate it to citadel-station, and this didn't
quite work out; it was subsequently shown as paused on c-s, though the
migrating domain was still present and running on s-m.

Greg then rebooted shadow-moses, and when he went to restart all the
user VMs that had been unable to get off before the ship sank, he was
given an error trying to create discuss64 on shadow-moses saying that
it was already running on citadel-station. At this point, he asked me to
see whether it was actually running (because it still appeared paused on
c-s).

I logged into the webpage and connected to what the webpage alleged
to be discuss64's console, but actually ended up connected to the
console of moo, an entirely different machine which was running on
citadel-station at the time. This was kind of alarming. We probed discuss64
remotely and found that it seemed to be running fine, wherever it was
hiding.

At this point, Greg destroyed discuss64 on citadel-station and tried to
recreate it via remctl, which spit out an error message at him saying that
it was already running on shadow-moses (where he was previously told
he couldn't create it because it was already running on citadel-station).
Indeed, it still appeared to be fine and running according to athinfo
queries.

When Greg then destroyed it on shadow-moses, it really did shut down.
I then disconnected from the console (of moo, though the webpage
thought it was discuss64) and tried to reconnect, and was told that the
connection failed because the machine was powered off. I turned it
back on via the webpage and connected to the console again, and got
the right machine.

This is all kind of worrisome, and the worst thing we feared was that this
might be a repeat of the near disaster we saw previously during the moves
where machines' metadata was pointing at the wrong LVs, but it doesn't
seem like that was in fact the issue here. We are a bit stumped about
what truly was going on, though, and are hoping someone else will have
thoughts about how I ended up attached to the console of a different
machine that I don't have bits on.

Quentin Smith (quentin-mit) wrote :

It sounds like xenstored was corrupted. The code that's responsible for figuring out which VNC server to connect to is at:

https://xvm.scripts.mit.edu/browser/trunk/packages/invirt-vnc-server/python/vnc/get_port.py

As you can see, it uses the XenAPI to request the information for the specific VM name it is given and then extracts the virtual framebuffer location from that information.
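The lookup described above can be sketched roughly as follows. This is not the contents of get_port.py; it is a minimal illustration that assumes a XenAPI session proxy exposing the standard VM.get_by_name_label, VM.get_name_label, VM.get_consoles, console.get_protocol and console.get_location calls. The function name find_vnc_location and the ConsoleLookupError class are hypothetical. The extra checks show one way such a lookup could refuse to hand back a console when the name does not resolve to exactly one VM, which is the kind of ambiguity a half-finished migration might produce.

```python
class ConsoleLookupError(Exception):
    """Raised when a VM name cannot be mapped to exactly one VNC console."""


def find_vnc_location(xenapi, vm_name):
    """Return the location string of the VNC (RFB) console for vm_name.

    `xenapi` is assumed to behave like `session.xenapi` on a XenAPI
    session object; it is passed in so the function does no I/O itself.
    """
    # Resolve the name to VM references. XenAPI returns a list because
    # name labels are not guaranteed to be unique.
    refs = xenapi.VM.get_by_name_label(vm_name)
    if len(refs) != 1:
        raise ConsoleLookupError(
            "expected exactly one VM named %r, found %d" % (vm_name, len(refs)))
    vm_ref = refs[0]

    # Re-read the label from the record we are about to use, so a stale
    # or corrupted name-to-reference mapping is caught instead of trusted.
    actual_name = xenapi.VM.get_name_label(vm_ref)
    if actual_name != vm_name:
        raise ConsoleLookupError(
            "asked for %r but the record is labelled %r" % (vm_name, actual_name))

    # Collect only the framebuffer (RFB) consoles attached to this VM.
    locations = []
    for console_ref in xenapi.VM.get_consoles(vm_ref):
        if xenapi.console.get_protocol(console_ref) == "rfb":
            locations.append(xenapi.console.get_location(console_ref))

    if len(locations) != 1:
        raise ConsoleLookupError(
            "expected one RFB console for %r, found %d" % (vm_name, len(locations)))
    return locations[0]
```

A caller would pass in the xenapi attribute of an already logged-in session and treat ConsoleLookupError as "do not connect", rather than falling back to whatever console happens to be found.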
