HWRNG access failing in docker container for libvirt, killing VMs on start
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
kolla-ansible | New | Undecided | Unassigned |
Bug Description
Trying to give VMs better access to the hardware available, I enabled `rng_dev_`
The Docker container can "see" the device node:
```
(nova-libvirt)
crw-rw-rw- 1 root root 10, 183 Jul 31 02:24 /dev/hwrng
```
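For context, the guest ends up with a virtio-rng device backed by the host's hardware RNG; the libvirt domain XML for such a device looks roughly like this (a sketch — the exact attributes depend on the nova and libvirt versions in use):

```
<!-- virtio RNG device fed from the host's /dev/hwrng -->
<rng model='virtio'>
  <backend model='random'>/dev/hwrng</backend>
</rng>
```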
However, all VMs _silently_ fail to start (nothing is recorded to ElasticSearch; the only thing ES gets is "qemuMonitorIO:618 : internal error: End of file from qemu monitor"). The following error is presented on the CLI when manually trying to start the instance:
```
# virsh start instance-00000019; tail -f /var/log/
Domain instance-00000019 started
==> /var/log/
-device virtio-
-object rng-random,
-device virtio-
-sandbox on,obsolete=
-msg timestamp=on
2021-08-01 13:23:50.508+0000: Domain id=3 is tainted: host-cpu
**
ERROR:/
Bail out! ERROR:/
2021-08-01 13:24:03.876+0000: shutting down, reason=crashed
```
The container can clearly _see_ the device node, but cannot read from it:
```
# dd if=/dev/hwrng of=/tmp/rnd.bin bs=25 count=1
dd: error reading '/dev/hwrng': No such device
0+0 records in
0+0 records out
0 bytes copied, 0.000517002 s, 0.0 kB/s
```
which, if I am grokking this right, prevents the VM from starting due to some Docker-ism.
This seems to be an order-of-operations issue: loading the module after the container starts is the problem here. On a fresh boot of a compute node, udev loads the module before the container starts, and everything works fine. Loading the module while the container is already running produces the effect described above, and stopping then starting the container after the module is loaded also resolves it.
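The working order of operations can be sketched as follows. This assumes the kolla container is named `nova_libvirt` (the default in my deployment) and uses `virtio_rng` purely as a placeholder module name — the actual RNG driver depends on your hardware:

```shell
# 1. Load the hardware RNG driver *before* (re)starting the container.
#    virtio_rng is only an example; pick the driver for your platform.
modprobe virtio_rng

# 2. Confirm the host itself can read from the device.
dd if=/dev/hwrng of=/dev/null bs=25 count=1

# 3. Fully stop and then start the container so it picks up the device;
#    a restart done while the device was unreadable is what breaks things.
docker stop nova_libvirt
docker start nova_libvirt
```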
I think the only actual bug here is the lack of correct log output/collection by the logging stack, and possibly some weirdness with Ubuntu (the host OS).
So if anyone else does this in their setup: first load the kernel modules involved and install a udev rule setting /dev/hwrng to mode 0666, and only then run the `kolla-ansible reconfigure --tags nova` piece, which should perform the container restarts in the right order. If it still doesn't work, my "fix" may have been the piece of fully stopping the container and then starting it back up.
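For the udev piece, a rule along these lines does the job (a sketch — the rule file name is arbitrary, and the match key assumes the device node is named `hwrng` as shown above):

```
# /etc/udev/rules.d/99-hwrng.rules (hypothetical file name)
KERNEL=="hwrng", MODE="0666"
```

After dropping the file in place, `udevadm control --reload` followed by re-triggering (or re-plugging/reloading the module) applies the permissions without a reboot.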