Comment 2 for bug 1628168

Kevin (kvasko) wrote:

So a little more information. I was able to get more than one VM to start with a GPU attached (e.g. I had 2 VMs, each with 1 GPU attached). I then restarted the host machine with the GPUs.

It appears that some of the GPUs are getting into an "in-use" state and never return to being available.

On the host system that has the GPUs, when I reboot the machine and run lspci -vnnn | grep VGA, all 8 GPUs show up as the following:

04:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
05:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
06:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
07:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
0d:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
0e:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
0f:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
10:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])

This is with 0 VM instances running that have a GPU associated with them.
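
(As a side note, lspci -nnk shows which kernel driver each card is bound to, which is a quick way to double-check the idle state; for passthrough I'd expect vfio-pci or pci-stub, depending on the setup:)

lspci -nnk -s 04:00.0
# look for the "Kernel driver in use:" line in the output;
# vfio-pci (or pci-stub) means the card is parked for passthrough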

At this point, after a fresh reboot, I started and stopped multiple VMs (3 VMs, each with 1 GPU attached): started them, stopped them, and started them back up. No issues. I did that a few more times, and then randomly I saw this appear for one of the cards when running lspci -vnnn | grep VGA:

0d:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev ff) (prog-if ff)

I've got 2 VMs running with a GPU attached, and at this point any time I try to start another VM with a GPU I get the "no hosts found" error. So what I *think* is happening is this:

After rebooting the host machine, none of the GPUs are in that weird (rev ff) (prog-if ff) state. At that point VMs start up fine with a GPU, until one of the GPUs goes into the (rev ff) (prog-if ff) state. From then on, any time OpenStack tries to schedule a new VM it tries to use the GPU that is in the (rev ff) (prog-if ff) state, since that device is still marked as available in the MySQL database. At that point no other VMs can be created with a GPU.
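
If anyone wants to check that theory on their own deployment, the device state Nova tracks can be queried straight out of the database; a sketch (table and column names are from the Nova schema as I understand it, so treat them as an assumption):

mysql -u root -p nova -e \
    "SELECT address, status, instance_uuid FROM pci_devices WHERE deleted = 0;"

A GPU stuck in (rev ff) still showing status 'available' here would line up with the scheduler repeatedly handing it out.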

I'm not sure what is causing the GPUs to go into the (rev ff) (prog-if ff) state. All I am doing is creating the VM, checking that it launches successfully, logging into it, making sure the VM has a GPU attached, and then deleting it from OpenStack.
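
For reference, that cycle is just the standard CLI workflow, roughly like this (the flavor, image, and network names are hypothetical; the flavor is assumed to carry the usual pci_passthrough:alias property):

openstack server create --flavor gpu.small --image centos7-gpu-test \
    --network private --wait gpu-smoke-1
# log in and confirm the guest actually sees the GPU
# (user/IP depend on the image and network setup)
ssh centos@<vm-ip> 'lspci | grep -i nvidia'
openstack server delete gpu-smoke-1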

I'm testing with the CentOS 7 image from here: http://docs.openstack.org/image-guide/obtain-images.html
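
In case it matters for reproducing this, the image went in the usual way; a sketch (the exact URL and file name are whatever that guide currently points at):

wget http://cloud.centos.org/centos/7/images/CentOS-7-x86_64-GenericCloud.qcow2
openstack image create --disk-format qcow2 --container-format bare \
    --file CentOS-7-x86_64-GenericCloud.qcow2 centos7-gpu-test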

I'm going to try to debug this issue some more to see if I can narrow down the cause of the cards going into that odd state.
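
In the meantime, here's a crude watch loop to catch the moment a card flips into that state (a sketch; adjust the device list to match your topology):

# poll the 8 GPU addresses and log whenever one reports rev ff
while true; do
    for dev in 04:00.0 05:00.0 06:00.0 07:00.0 0d:00.0 0e:00.0 0f:00.0 10:00.0; do
        if lspci -s "$dev" -vnnn | grep -q 'rev ff'; then
            echo "$(date): $dev is in the (rev ff) state"
        fi
    done
    sleep 10
done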