A little more information: I was able to get more than one VM to start with a GPU attached (e.g. I had 2 VMs, each with 1 GPU attached). I then restarted the host machine that has the GPUs.
It appears that some of the GPUs are getting into an "in-use" state and never return to being available.
On the GPU host, after a reboot, running lspci -vnnn | grep VGA shows all 8 GPUs as the following:
04:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
05:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
06:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
07:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
0d:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
0e:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
0f:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
10:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
This is with zero VM instances running that have a GPU attached.
After the fresh reboot I started and stopped multiple VMs (3 VMs, each with 1 GPU attached): started them, stopped them, and started them back up, with no issues. After repeating that a few more times, one of the cards suddenly showed the following in lspci -vnnn | grep VGA:
0d:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev ff) (prog-if ff)
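For anyone hitting the same thing, here is a small helper I would use to spot cards in that state. The filter is my own, not from any OpenStack tooling; "rev ff" in lspci output means the device returns all-ones to PCI config reads, i.e. it has stopped responding on the bus.

```shell
# Hypothetical helper: print the PCI addresses of GPUs that have fallen
# off the bus ("rev ff" = device no longer answers config-space reads).
find_dead_gpus() {
  grep 'rev ff' | awk '{print $1}'
}

# Only invoke lspci if it is installed (it ships in pciutils).
if command -v lspci >/dev/null 2>&1; then
  lspci -vnnn | grep VGA | find_dead_gpus
fi
```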
I now have 2 VMs running with a GPU attached, and at this point any time I try to start another VM with a GPU I get the "no hosts found" error. So here is what I *think* is happening:
After rebooting the host machine, none of the GPUs are in that weird (rev ff) (prog-if ff) state, and VMs with a GPU start up fine. That lasts until one of the GPUs goes into the (rev ff) (prog-if ff) state. From then on, whenever OpenStack tries to schedule a new VM it keeps trying to use the GPU that is in the (rev ff) (prog-if ff) state, since that GPU is still marked as available in the MySQL database. At that point no more VMs can be created with a GPU.
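One way to check the scheduler's view against reality would be to look at what Nova has recorded for passthrough devices. This is a sketch under two assumptions: that the database is named "nova" and that, as in stock Nova, passthrough devices are tracked in the pci_devices table with an address and a status column.

```shell
# Build the query; run it for real on the controller node with:
#   mysql -u nova -p nova -e "$QUERY"
# (database name "nova" and table "pci_devices" are assumptions about
# a stock Nova install, not confirmed against this deployment)
QUERY='SELECT address, status FROM pci_devices WHERE deleted = 0;'
echo "$QUERY"
```

Comparing the addresses marked "available" there against the lspci output should confirm whether the dead card is still being offered to the scheduler.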
I'm not sure what is causing the GPUs to go into the (rev ff) (prog-if ff) state. All I am doing is creating a VM, checking that it launches successfully, logging into it, verifying that it has a GPU attached, and then deleting it from OpenStack.
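For completeness, the create/verify/delete cycle above looks roughly like this. The flavor, image, and instance names are placeholders I made up, not names from this deployment, and the real CLI calls are left commented out so nothing is launched by accident:

```shell
# Sketch of the reproduction loop: boot, verify, delete, repeat.
for i in 1 2 3; do
  name="gpu-test-$i"
  echo "cycle: $name"
  # openstack server create --flavor gpu-flavor --image centos7 --wait "$name"
  # (log in, confirm the guest can see the NVIDIA device, then tear down)
  # openstack server delete --wait "$name"
done
```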
I'm using the CentOS 7 image to test with, from here: http://docs.openstack.org/image-guide/obtain-images.html
I'm going to keep debugging this to see if I can narrow down what causes the cards to go into that odd state.
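One recovery avenue I may try before resorting to a full host reboot is a sysfs remove/rescan of the stuck device. This is an assumption on my part that it helps here; it only works if the card fell off the bus but the slot itself is still healthy, and it must be run as root on the GPU host:

```shell
# Detach the unresponsive device and re-enumerate the PCI bus.
ADDR="0000:0d:00.0"   # substitute the stuck GPU's address from lspci
if [ -w "/sys/bus/pci/devices/$ADDR/remove" ]; then
  echo 1 > "/sys/bus/pci/devices/$ADDR/remove"   # remove the dead device
  sleep 1
  echo 1 > /sys/bus/pci/rescan                   # rescan the whole bus
fi
```

If the card comes back as (rev a1) after the rescan, that at least gives a way to recover without rebooting the whole host.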