powering off and on an instance can result in instance boot failure due to serial port handling race

Bug #1755981 reported by Chris Friesen
This bug affects 3 people
Affects: OpenStack Compute (nova)
Status: Confirmed
Importance: Medium
Assigned to: Unassigned

Bug Description

The following is specific to the libvirt driver.

When we call power_off() it calls _destroy(), which in turn calls self._get_serial_ports_from_guest() and loops over all the serial ports calling serial_console.release_port() on each. This removes the host TCP port from ALLOCATED_PORTS (which is the set of allocated ports on the host).

Then when we call power_on(), it again calls _destroy(), which again calls self._get_serial_ports_from_guest(). This will return the same set of ports that it did before. This is a problem, because those ports could have been allocated to another instance in the meantime!

So in the case where one or more of those ports had been allocated to another instance, we call serial_console.release_port() on them, and remove them from ALLOCATED_PORTS.

Then as part of power_on() we will create new XML with new serial ports, which could select the ports that we just removed from ALLOCATED_PORTS (which are actually in use by another instance). When qemu tries to bind to this port it will fail, causing the instance to error out and stay in the SHUTOFF state.
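The sequence above can be reproduced with a small stand-alone model. This is only a sketch: FakeGuest, PORT_RANGE, and the simplified power_on()/power_off() helpers are hypothetical stand-ins, not nova code, and the model skips any check that a port is actually free on the host before handing it out.

```python
# Simplified model of the port bookkeeping described in this report.
# None of this is nova's real code; names other than ALLOCATED_PORTS,
# acquire_port and release_port are made up for illustration.

ALLOCATED_PORTS = set()           # set of (host, port) tuples
PORT_RANGE = range(10000, 10003)  # assumed port range


def acquire_port(host):
    """Hand out the first port not recorded in ALLOCATED_PORTS."""
    for port in PORT_RANGE:
        if (host, port) not in ALLOCATED_PORTS:
            ALLOCATED_PORTS.add((host, port))
            return port
    raise RuntimeError("port range exhausted")


def release_port(host, port):
    ALLOCATED_PORTS.discard((host, port))


class FakeGuest:
    """Hypothetical guest whose XML keeps listing its last serial ports."""

    def __init__(self, name):
        self.name = name
        self.xml_ports = []


def destroy(guest, host):
    # Like _destroy(): release every port still listed in the guest XML,
    # whether or not this guest still owns it.
    for port in guest.xml_ports:
        release_port(host, port)


def power_off(guest, host):
    destroy(guest, host)  # the XML, and so xml_ports, is left unchanged


def power_on(guest, host):
    destroy(guest, host)  # releases the stale ports a second time
    guest.xml_ports = [acquire_port(host)]
    return guest.xml_ports[0]


def simulate_race():
    host = "compute-1"
    a, b = FakeGuest("a"), FakeGuest("b")
    first_a = power_on(a, host)
    power_off(a, host)            # a's port goes back to the pool
    port_b = power_on(b, host)    # b is given the port a used to have
    second_a = power_on(a, host)  # a releases b's port, then re-acquires it
    return first_a, port_b, second_a
```

In this model simulate_race() ends with both guests holding the same port, which corresponds to the point where qemu would fail to bind.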

One possible solution would be to call guest.detach_device() on the "serial" and "console" devices from the guest in the power_off() routine. That way when we call _destroy() in the power_on() routine there wouldn't be any devices returned by _get_serial_ports_from_guest(). This is a bit messy though, so if anyone has any better ideas I'd like to hear about it.

melanie witt (melwitt)
Changed in nova:
importance: Undecided → Medium
status: New → Confirmed
yangjie (yang.jie) wrote :

One idea is to change the structure of ALLOCATED_PORTS to use the instance UUID as the key, so that it records which ports are in use by which server. Then, when we call serial_console.release_port() while destroying a server, we only release ports that are not in use by other servers.
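A rough sketch of that idea, assuming a mapping from instance UUID to the ports it owns. PORTS_BY_INSTANCE, PORT_RANGE, and the extra instance_uuid argument are hypothetical and do not match the current serial_console function signatures.

```python
from collections import defaultdict

# Hypothetical replacement for ALLOCATED_PORTS:
# instance UUID -> set of (host, port) tuples owned by that instance.
PORTS_BY_INSTANCE = defaultdict(set)
PORT_RANGE = range(10000, 10003)  # assumed port range


def acquire_port(instance_uuid, host):
    """Allocate the first port that no instance currently owns."""
    used = set().union(*PORTS_BY_INSTANCE.values())
    for port in PORT_RANGE:
        if (host, port) not in used:
            PORTS_BY_INSTANCE[instance_uuid].add((host, port))
            return port
    raise RuntimeError("port range exhausted")


def release_port(instance_uuid, host, port):
    """Release the port only if this instance owns it.

    Returns True if the port was released, False if it was not owned by
    this instance (for example, a stale port read from old guest XML).
    """
    owned = PORTS_BY_INSTANCE.get(instance_uuid, set())
    if (host, port) in owned:
        owned.remove((host, port))
        return True
    return False
```

With this structure, a second release of a stale port by the original instance becomes a no-op, so a port that has since been given to another instance stays marked as allocated.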

David Geng (genggjh) wrote :

We hit a similar issue, with the following error in the nova-compute log:

: libvirtError: internal error: process exited while connecting to monitor: 2021-01-15T02:26:20.537782Z qemu-kvm: -chardev socket,id=charserial0,host=192.168.33.67,port=10001,server,nowait,logfile=/dev/fdset/16,logappend=on: Failed to bind socket: Address already in use
2021-01-15 10:26:20.844 36347 ERROR nova.virt.libvirt.driver [req-11f3c907-ca7c-44fc-98f0-70d9cc01f9f0 650c5993a2d3498bbba3d62e6f338ca6 75767e74f09f4b4caf216f9bbd2a0832 - default default] [instance: e5595da6-0a21-4dcc-8463-cbb0b0dcfc65] Failed to start libvirt guest: libvirtError: internal error: process exited while connecting to monitor: 2021-01-15T02:26:20.537782Z qemu-kvm: -chardev socket,id=charserial0,host=192.168.33.67,port=10001,server,nowait,logfile=/dev/fdset/16,logappend=on: Failed to bind socket: Address already in use

The OpenStack version is Pike.
Is there any solution or workaround?
