Comment 3 for bug 1808010

Clark Boylan (cboylan) wrote : Re: Tempest cirros boots fail due to lack of disk space

We've done a bit more digging. From an example failure we get cirros console logs that look like this:

2018-12-11 11:17:44.637794 | controller | b"cirros-ds 'local' up at 13.40"
2018-12-11 11:17:44.637907 | controller | b'cp: write error: No space left on device'
2018-12-11 11:17:44.638006 | controller | b'cp: write error: No space left on device'
2018-12-11 11:17:44.638104 | controller | b'cp: write error: No space left on device'
2018-12-11 11:17:44.638201 | controller | b'cp: write error: No space left on device'
2018-12-11 11:17:44.638299 | controller | b'cp: write error: No space left on device'
2018-12-11 11:17:44.638396 | controller | b'cp: write error: No space left on device'
2018-12-11 11:17:44.638531 | controller | b'failed to copy results from configdrive to /run/cirros/datasource'
2018-12-11 11:17:44.638596 | controller | b'Starting network...'
2018-12-11 11:17:44.638669 | controller | b'udhcpc (v1.20.1) started'
2018-12-11 11:17:44.638748 | controller | b'Sending discover...'
2018-12-11 11:17:44.638835 | controller | b'Sending select for 10.1.0.14...'
2018-12-11 11:17:44.638941 | controller | b'Lease of 10.1.0.14 obtained, lease time 86400'
2018-12-11 11:17:44.639038 | controller | b'sh: write error: No space left on device'
2018-12-11 11:17:44.639167 | controller | b'sh: write error: No space left on device'
2018-12-11 11:17:44.639300 | controller | b'route: SIOCADDRT: File exists'
2018-12-11 11:17:44.639459 | controller | b'WARN: failed: route add -net "0.0.0.0/0" gw "10.1.0.1"'

In particular, note the failure to copy the configdrive results to /run/cirros/datasource. Frickler booted a cirros 3.5 instance with 64MB of system memory and found that /run is a tmpfs with only 200kb of space.

I think what is happening here is that the cirros nested VM runs out of "disk" (it's actually memory, since /run is a tmpfs) while writing the config drive info. That in turn leads to the failure to set a default route, which breaks routing of packets back to the tempest host. We use FIPs for the ssh test, so the instance and the tempest host are not directly attached to a shared L2 segment and the default route is required.
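To check whether other failed jobs show the same pattern, the console output can be scanned for the three symptoms seen above. This is only a rough sketch: the function names are made up for illustration, the patterns are copied from the log excerpt in this comment, and it assumes the console log is available as a list of text lines.

```python
import re

# Patterns copied from the console log excerpt in this comment.
ENOSPC = re.compile(r"No space left on device")
COPY_FAILED = re.compile(r"failed to copy results from configdrive")
ROUTE_FAILED = re.compile(r"WARN: failed: route add")


def classify_console_log(lines):
    """Count how many lines match each of the three failure symptoms."""
    counts = {"enospc": 0, "configdrive_copy_failed": 0, "route_add_failed": 0}
    for line in lines:
        if ENOSPC.search(line):
            counts["enospc"] += 1
        if COPY_FAILED.search(line):
            counts["configdrive_copy_failed"] += 1
        if ROUTE_FAILED.search(line):
            counts["route_add_failed"] += 1
    return counts


def matches_tmpfs_exhaustion(lines):
    """True only if all three symptoms appear at least once (hypothetical heuristic)."""
    return all(count > 0 for count in classify_console_log(lines).values())
```

A job whose console log has ENOSPC errors but no failed route add would not match, which should help separate this failure mode from unrelated out-of-space problems.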

As for why this happens more often in certain cloud regions, it may be coincidence, or it could be a timing issue with other things trying to write to the same tmpfs. That race may simply be lost more often on one cloud than on others.