OpenStack-Gate

Bug #1808010
Comment #3


Clark Boylan (cboylan) wrote on 2018-12-11: Re: Tempest cirros boots fail due to lack of disk space

We've done a bit more digging. From an example failure we get cirros console logs that look like this:

2018-12-11 11:17:44.637794 | controller | b"cirros-ds 'local' up at 13.40"
2018-12-11 11:17:44.637907 | controller | b'cp: write error: No space left on device'
2018-12-11 11:17:44.638006 | controller | b'cp: write error: No space left on device'
2018-12-11 11:17:44.638104 | controller | b'cp: write error: No space left on device'
2018-12-11 11:17:44.638201 | controller | b'cp: write error: No space left on device'
2018-12-11 11:17:44.638299 | controller | b'cp: write error: No space left on device'
2018-12-11 11:17:44.638396 | controller | b'cp: write error: No space left on device'
2018-12-11 11:17:44.638531 | controller | b'failed to copy results from configdrive to /run/cirros/datasource'
2018-12-11 11:17:44.638596 | controller | b'Starting network...'
2018-12-11 11:17:44.638669 | controller | b'udhcpc (v1.20.1) started'
2018-12-11 11:17:44.638748 | controller | b'Sending discover...'
2018-12-11 11:17:44.638835 | controller | b'Sending select for 10.1.0.14...'
2018-12-11 11:17:44.638941 | controller | b'Lease of 10.1.0.14 obtained, lease time 86400'
2018-12-11 11:17:44.639038 | controller | b'sh: write error: No space left on device'
2018-12-11 11:17:44.639167 | controller | b'sh: write error: No space left on device'
2018-12-11 11:17:44.639300 | controller | b'route: SIOCADDRT: File exists'
2018-12-11 11:17:44.639459 | controller | b'WARN: failed: route add -net "0.0.0.0/0" gw "10.1.0.1"'

In particular, note the failure to copy configdrive results to /run/cirros/datasource. Frickler managed to boot a cirros 3.5 instance with 64MB of system memory and found that /run is a tmpfs with only 200KB of space.
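To make the /run size check concrete, here is a minimal sketch of how the capacity and free space of the filesystem behind a path can be read with Python's os.statvfs. The function names fs_space_kb and has_room are made up for illustration. cirros itself does not necessarily ship Python, so this is meant for any Linux host where you want to run the same kind of check, not as the method Frickler used.

```python
import os


def fs_space_kb(path):
    """Return (total_kb, free_kb) for the filesystem that holds `path`."""
    st = os.statvfs(path)
    # f_frsize is the fragment size; f_blocks and f_bavail are counted in
    # units of f_frsize. f_bavail is the space available to unprivileged users.
    total_kb = st.f_frsize * st.f_blocks // 1024
    free_kb = st.f_frsize * st.f_bavail // 1024
    return total_kb, free_kb


def has_room(path, needed_bytes):
    """True if the filesystem holding `path` has at least `needed_bytes` free."""
    _, free_kb = fs_space_kb(path)
    return free_kb * 1024 >= needed_bytes
```

Calling fs_space_kb("/run") on a guest would show whether the tmpfs is as small as the 200KB reported above, and has_room("/run", size) gives a quick answer to whether a payload of a given size could fit at all.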

I think what is happening here is that the cirros nested VM runs out of "disk" (it's actually memory, since /run is a tmpfs) while writing the config drive info. That leads to a failure to set the default route, which breaks routing packets back to the tempest host (we use FIPs for the ssh test, so these aren't directly attached shared L2 segments).
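If it helps to check other failed jobs for the same signature, a rough sketch like the following could scan a job log for it. It assumes the line layout seen in the console log above ("timestamp | host | b'...'"); the names LINE_RE, console_messages and diagnose are hypothetical, and a match only means the log is consistent with the theory, not that the theory is proven.

```python
import re

# Assumed layout of a job-log line: "<date> <time> | <host> | b'<console text>'"
# The console text may be wrapped in single or double quotes.
LINE_RE = re.compile(r"""^\S+ \S+ \| \S+ \|\s+b(['"])(.*)\1\s*$""")


def console_messages(log_text):
    """Extract the guest console text from each matching job-log line."""
    msgs = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if m:
            msgs.append(m.group(2))
    return msgs


def diagnose(log_text):
    """Summarize the symptoms discussed in this comment for one console log."""
    msgs = console_messages(log_text)
    result = {
        # Number of ENOSPC write errors (from cp, sh, or anything else).
        "enospc_errors": sum("No space left on device" in m for m in msgs),
        # Whether the configdrive copy into /run/cirros/datasource failed.
        "configdrive_copy_failed": any(
            m.startswith("failed to copy results from configdrive") for m in msgs
        ),
        # Whether adding a route failed.
        "route_add_failed": any(
            m.startswith("WARN: failed: route add") for m in msgs
        ),
    }
    # All three symptoms together are what the example failure shows.
    result["matches_signature"] = (
        result["enospc_errors"] > 0
        and result["configdrive_copy_failed"]
        and result["route_add_failed"]
    )
    return result
```

Running diagnose() over the console output of each failed boot would give a quick count of how many failures show all three symptoms together.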

As for why this happens more often in certain cloud regions: it is perhaps coincidence, or it could be a timing issue with other things trying to write to the same tmpfs. That timing may be hit more often on one cloud than on others.
