Tempest cirros ssh setup fails due to lack of disk space causing config-drive setup to fail forcing fallback to metadata server which fails due to hitting 10 second timeout.

Bug #1808010 reported by Clark Boylan on 2018-12-11
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Medium
Unassigned
OpenStack-Gate
Undecided
Unassigned
devstack
Undecided
Unassigned

Bug Description

Some tempest tests fail because cirros boots up and isn't able to write files due to a lack of disk space. Currently unsure if this is a host cloud hypervisor disk limitation or if we are running out of disk in the test node, or if the nova test flavor is now too small for cirros.

Looking at logstash it seems restricted to a subset of our cloud regions (though not a single cloud) which may indicate it is a host cloud provider disk issue.

Adding this bug so we can track it.

Matt Riedemann (mriedem) wrote :

Until https://review.openstack.org/#/c/619319/ devstack uses root_gb=0 for the flavors that tempest uses, and root_gb=0 means the size of the image is used for the disk on nova-compute, not what is in the flavor (since it's essentially unbounded).

Matt Riedemann (mriedem) wrote :

Looks like this is by far (62%) happening the most in the networking-odl-tempest-fluorine job.

Clark Boylan (cboylan) wrote :

We've done a bit more digging. From an example failure we get cirros console logs that look like this:

2018-12-11 11:17:44.637794 | controller | b"cirros-ds 'local' up at 13.40"
2018-12-11 11:17:44.637907 | controller | b'cp: write error: No space left on device'
2018-12-11 11:17:44.638006 | controller | b'cp: write error: No space left on device'
2018-12-11 11:17:44.638104 | controller | b'cp: write error: No space left on device'
2018-12-11 11:17:44.638201 | controller | b'cp: write error: No space left on device'
2018-12-11 11:17:44.638299 | controller | b'cp: write error: No space left on device'
2018-12-11 11:17:44.638396 | controller | b'cp: write error: No space left on device'
2018-12-11 11:17:44.638531 | controller | b'failed to copy results from configdrive to /run/cirros/datasource'
2018-12-11 11:17:44.638596 | controller | b'Starting network...'
2018-12-11 11:17:44.638669 | controller | b'udhcpc (v1.20.1) started'
2018-12-11 11:17:44.638748 | controller | b'Sending discover...'
2018-12-11 11:17:44.638835 | controller | b'Sending select for 10.1.0.14...'
2018-12-11 11:17:44.638941 | controller | b'Lease of 10.1.0.14 obtained, lease time 86400'
2018-12-11 11:17:44.639038 | controller | b'sh: write error: No space left on device'
2018-12-11 11:17:44.639167 | controller | b'sh: write error: No space left on device'
2018-12-11 11:17:44.639300 | controller | b'route: SIOCADDRT: File exists'
2018-12-11 11:17:44.639459 | controller | b'WARN: failed: route add -net "0.0.0.0/0" gw "10.1.0.1"'

In particular note the failure to copy configdrive results to /run/cirros/datasource. Frickler managed to boot up a cirros 3.5 instance with 64MB system memory and found that /run is a tmpfs with 200kb of disk.

I think what is happening here is that the cirros nested VM is running out of "disk" (its actually memory on tmpfs) to write the config drive info which leads to failing to set a default route which breaks routing packets back to the tempest host (we use FIPs for the ssh test so these aren't direct attached shared l2 segments).

As for why this happens on certain cloud regions more often it is perhaps coincidence or it could be a timing issue with other things trying to write to the same tmpfs. Timing may be hit on one cloud more often than others.

melanie witt (melwitt) wrote :

Just a data point, I was looking at a recently failed job and found the "WARN: failed: route add -net "0.0.0.0/0" gw "10.1.0.1"" message in the cirros console but _no_ messages about disk space:

http://logs.openstack.org/82/623282/3/check/nova-next/a900344/logs/testr_results.html.gz

Here's a logstash query for finding other jobs that fail to add the route, 131 hits in the last 12 hours, all failures:

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22WARN%3A%20failed%3A%20route%20add%20-net%20%5C%5C%5C%220.0.0.0%2F0%5C%5C%5C%22%20gw%20%5C%5C%5C%2210.1.0.1%5C%5C%5C%22%5C%22%20AND%20tags%3Aconsole

melanie witt (melwitt) wrote :

clarkb noted in #openstack-infra that even though we see the WARN message "WARN: failed: route add -net "0.0.0.0/0" gw "10.1.0.1"" in the cirros console log, we also later see:

=== network info ===
if-info: lo,up,127.0.0.1,8,::1
if-info: eth0,up,10.1.0.13,28,fe80::f816:3eff:fe08:fa26
ip-route:default via 10.1.0.1 dev eth0
ip-route:10.1.0.0/28 dev eth0 src 10.1.0.13
ip-route:169.254.169.254 via 10.1.0.1 dev eth0

which seems to show the route was added.

I had been thinking the 100% failure rate with the WARN message meant that it's related, but it might not be because the cirros console log is only collected if a job fails. So even if the WARN message is often emitted on successful job runs, we wouldn't be able to see it in a logstash query.

melanie witt (melwitt) wrote :

Adding more notes based on what clarkb found and described in #openstack-infra:

The cirros console log should print the authorized keys:

2018-12-11 11:17:44.640445 | controller | b'Printing cirros user authorized keys'

but there's no content, so that points to the problem being related to the metadata service not providing the data back.

He pointed out the GET to the metadata API taking over 10 seconds:

Dec 11 10:46:57.286705 ubuntu-xenial-inap-mtl01-0001122145 <email address hidden>[19907]: [pid: 19910|app: 0|req: 17/35] 198.72.124.65 () {40 vars in 705 bytes} [Tue Dec 11 10:46:46 2018] GET /2009-04-04/meta-data/public-keys => generated 27 bytes in 10527 msecs (HTTP/1.1 200) 3 headers in 99 bytes (1 switches on core 0)

and the cirros timeout is 10 seconds:

https://github.com/XANi/cirros/blob/42fdb92afa37a78fb2de179aeb0fc6da10430406/src/usr/bin/ec2metadata#L5

So, next we need to look at what could be causing the slowdown in the metadata API code.

Dr. Jens Harbott (j-harbott) wrote :

O.k., so I did some further local testing with cirros 0.3.5 and config-drive. If I add a keypair that contains 4 pubkeys, I get /run to be 100% filled. If I use a keypair with 8 pubkeys, I'm getting the same "write error" sequence as above, _but_ cirros seems to manage to clean up after that and reverts to using the network based metadata successfully. Also /run is only 32% filled then, so the "/run full" issue only seems to affect instances with a config drive attached. I'll open up a separate cirros bug for that.

So then the real issue seems to be flaky metadata responses, which may have different causes on the network and/or API side. However, we could also try to make our testing more robust against this by teaching tempest to login using the well-known default password if ssh using the key or password that is to have been set via metadata has failed. That way we might be able to gather some more data about the failure scenarios.

Another approach that I'll try to take a look at is to make cirros more robust against these failures by retrying failed attempts and also log more details about the errors.

Matt Riedemann (mriedem) wrote :

(11:25:40 AM) clarkb: frickler: mriedem melwitt to TL;DR the cirros ssh issues, they all root cause down to metadata being slow, but in the case of running out of disk we are attempting to use config drive first (which would ignore the metadata server), but since disk is full that fails then we fall back to metadata then that times out and fails?
(11:26:16 AM) mriedem: metadata api being slow?
(11:26:25 AM) mriedem: as in the guest doesn't get network info fast enough from the meta api?
(11:26:46 AM) clarkb: mriedem: yup, cirros has a 10 second timeout on that network request, and nova was taking 10.something seconds to respond

Which sounds related to bug 1799298 and just general inefficiencies in the metadata-api service serving requests.

tags: added: api gate-failure metadata performance
Clark Boylan (cboylan) on 2018-12-12
summary: - Tempest cirros boots fail due to lack of disk space
+ Tempest cirros ssh setup fails due to lack of disk space causing config-
+ drive setup to fail forcing fallback to metadata server which fails due
+ to hitting 10 second timeout.
Matt Riedemann (mriedem) wrote :

Bug 1567655 would also be related to performance issues in the nova metadata-api because when using neutron, the metadata API has to query security group information from neutron per instance:

https://github.com/openstack/nova/blob/c72dafad8011198b78fd906f2fc7d1126e3f79f3/nova/api/metadata/base.py#L147

If nova were to cache the security group information with the instance in the instance_info_caches table we wouldn't need to query the neutron API.

Dr. Jens Harbott (j-harbott) wrote :

The initial cirros issue when running from a config-drive has been fixed in https://bugs.launchpad.net/cirros/+bug/1808119 and there is a patch for devstack to use the updated cirros version in https://review.openstack.org/624716 . This should mitigate the network metadata issue at least in those scenarios where a config-drive is being used.

Matt Riedemann (mriedem) on 2018-12-13
Changed in nova:
status: New → Confirmed
Changed in devstack:
status: New → In Progress
assignee: nobody → Dr. Jens Harbott (j-harbott)
importance: Undecided → High
Changed in nova:
importance: Undecided → Medium

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.openstack.org/624778
Reason: The join in security_groups should be trivial since the table should be empty in neutron cases, and when getting base metadata we also return the password which is stored in instance.system_metadata, so pre-joining instance_system_metadata is actually something we should be doing when we pull the instance from the cell database - it's used more than just vendordata v2 requests, which makes this change essentially not worthwhile.

We either need to optimize how we do the system_metadata joins in the sqlalchemy API code, or we need to reconsider how we store the instance password (move it out of system_metadata).

Dr. Jens Harbott (j-harbott) wrote :

So the issue of cirros not being able to handle config-drive information properly has been fixed with cirros 0.3.6, which is now used by default in devstack. The remaining issue(s) seem(s) to be somewhere in the nova-neutron interaction, so marking as incomplete in DevStack for now.

Changed in devstack:
status: In Progress → Incomplete
importance: High → Undecided
assignee: Dr. Jens Harbott (j-harbott) → nobody
Matt Riedemann (mriedem) wrote :

https://review.opendev.org/#/c/672715/ might resolve this in the gate.

Matt Riedemann (mriedem) wrote :

I think we need to update the logstash query for this because I'm seeing out of space issues causing ssh to fail but the existing query isn't in the guest console output, e.g.:

https://zuul.opendev.org/t/openstack/build/73e216c0b8d94365a5e5f1c94e6af361/log/job-output.txt#78872

2019-09-29 08:15:56.988235 | controller | b'successful after 1/20 tries: up 16.34. iid=i-0000004a'

2019-09-29 08:15:56.988283 | controller | b'failed to get http://169.254.169.254/2009-04-04/user-data'

2019-09-29 08:15:56.988320 | controller | b'warning: no ec2 metadata for user-data'

2019-09-29 08:15:56.988374 | controller | b'sh: write error: No space left on device'

2019-09-29 08:15:56.988415 | controller | b'Top of dropbear init script'

2019-09-29 08:15:56.988445 | controller | b'Starting dropbear sshd: OK'

2019-09-29 08:15:56.988499 | controller | b'GROWROOT: NOCHANGE: partition 1 is size 2078687. it cannot be grown'

Note the http failure to get user-data but the logstash query is looking for meta-data.

We could probably just look for "sh: write error: No space left on device" or "GROWROOT...it cannot be grown".

This issue was fixed in the openstack/tempest 22.0.0 release.

Matt Riedemann (mriedem) wrote :

This is still an issue, and might be due to this change in devstack to set the root disk size in tempest flavors:

https://review.opendev.org/#/c/619319/

Before that, the flavor root disk was 0 so qemu-img would set the root disk in the guest to whatever was needed for the image and config drive. That change sets the flavor root disk to the size of the image rounded up to the nearest GiB but isn't accounting for config drive.

We keep seeing space/growroot failures in these failed jobs:

sh: write error: No space left on device
Top of dropbear init script
Starting dropbear sshd: OK
GROWROOT: NOCHANGE: partition 1 is size 2078687. it cannot be grown

So we should probably just pad an extra 1GiB on the root disk in those tempest flavors in devstack to be safe.

Matt Riedemann (mriedem) wrote :

Ignore the config drive comment in PS17, that's nothing to do with the flavor's root disk since it's a separate device.

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.opendev.org/690991

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers