Tempest cirros ssh setup fails due to lack of disk space causing config-drive setup to fail forcing fallback to metadata server which fails due to hitting 10 second timeout.

Bug #1808010 reported by Clark Boylan
This bug affects 2 people
Affects          Status         Importance   Assigned to   Milestone
OpenStack-Gate   New            Undecided    Unassigned
devstack         Fix Released   Undecided    Unassigned

Bug Description

Some tempest tests fail because cirros boots up and isn't able to write files due to a lack of disk space. We're currently unsure whether this is a host cloud hypervisor disk limitation, whether we are running out of disk on the test node, or whether the nova test flavor is now too small for cirros.

Looking at logstash, it seems restricted to a subset of our cloud regions (though not a single cloud), which may indicate it is a host cloud provider disk issue.

Adding this bug so we can track it.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Until https://review.openstack.org/#/c/619319/, devstack uses root_gb=0 for the flavors that tempest uses, and root_gb=0 means the size of the image is used for the disk on nova-compute rather than what is in the flavor (the disk is essentially unbounded).
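
For illustration (these are not the exact devstack commands, and the names and IDs are just examples), the difference is between a flavor created with a zero root disk and one with an explicit size:

openstack flavor create --id 42 --ram 64 --disk 0 --vcpus 1 m1.nano   # root_gb=0: guest disk sized from the image
openstack flavor create --id 84 --ram 64 --disk 1 --vcpus 1 m1.micro  # root_gb=1: guest disk capped at 1 GiB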

Revision history for this message
Matt Riedemann (mriedem) wrote :

Looks like this is happening by far the most (62%) in the networking-odl-tempest-fluorine job.

Revision history for this message
Clark Boylan (cboylan) wrote :

We've done a bit more digging. From an example failure we get cirros console logs that look like this:

2018-12-11 11:17:44.637794 | controller | b"cirros-ds 'local' up at 13.40"
2018-12-11 11:17:44.637907 | controller | b'cp: write error: No space left on device'
2018-12-11 11:17:44.638006 | controller | b'cp: write error: No space left on device'
2018-12-11 11:17:44.638104 | controller | b'cp: write error: No space left on device'
2018-12-11 11:17:44.638201 | controller | b'cp: write error: No space left on device'
2018-12-11 11:17:44.638299 | controller | b'cp: write error: No space left on device'
2018-12-11 11:17:44.638396 | controller | b'cp: write error: No space left on device'
2018-12-11 11:17:44.638531 | controller | b'failed to copy results from configdrive to /run/cirros/datasource'
2018-12-11 11:17:44.638596 | controller | b'Starting network...'
2018-12-11 11:17:44.638669 | controller | b'udhcpc (v1.20.1) started'
2018-12-11 11:17:44.638748 | controller | b'Sending discover...'
2018-12-11 11:17:44.638835 | controller | b'Sending select for 10.1.0.14...'
2018-12-11 11:17:44.638941 | controller | b'Lease of 10.1.0.14 obtained, lease time 86400'
2018-12-11 11:17:44.639038 | controller | b'sh: write error: No space left on device'
2018-12-11 11:17:44.639167 | controller | b'sh: write error: No space left on device'
2018-12-11 11:17:44.639300 | controller | b'route: SIOCADDRT: File exists'
2018-12-11 11:17:44.639459 | controller | b'WARN: failed: route add -net "0.0.0.0/0" gw "10.1.0.1"'

In particular, note the failure to copy the configdrive results to /run/cirros/datasource. Frickler managed to boot up a cirros 0.3.5 instance with 64MB of system memory and found that /run is a tmpfs with 200KB of space.

I think what is happening here is that the cirros nested VM is running out of "disk" (it's actually memory backing a tmpfs) when writing the config drive info, which leads to a failure to set a default route, which in turn breaks routing packets back to the tempest host (we use FIPs for the ssh test, so these aren't directly attached shared L2 segments).
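
A quick way to confirm this from inside the guest (illustrative commands, not part of the cirros init scripts):

df -k /run                    # tmpfs size/usage; ~200KB on a 64MB cirros 0.3.5 guest per frickler's test
mount | grep ' /run '         # confirms /run is a memory-backed tmpfs
ls /run/cirros/datasource     # where the config drive contents get copied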

As for why this happens more often in certain cloud regions, it is perhaps coincidence, or it could be a timing issue with other things trying to write to the same tmpfs; the timing may be hit on one cloud more often than on others.

Revision history for this message
melanie witt (melwitt) wrote :

Just a data point, I was looking at a recently failed job and found the "WARN: failed: route add -net "0.0.0.0/0" gw "10.1.0.1"" message in the cirros console but _no_ messages about disk space:

http://logs.openstack.org/82/623282/3/check/nova-next/a900344/logs/testr_results.html.gz

Here's a logstash query for finding other jobs that fail to add the route, 131 hits in the last 12 hours, all failures:

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22WARN%3A%20failed%3A%20route%20add%20-net%20%5C%5C%5C%220.0.0.0%2F0%5C%5C%5C%22%20gw%20%5C%5C%5C%2210.1.0.1%5C%5C%5C%22%5C%22%20AND%20tags%3Aconsole

Revision history for this message
melanie witt (melwitt) wrote :

clarkb noted in #openstack-infra that even though we see the WARN message "WARN: failed: route add -net "0.0.0.0/0" gw "10.1.0.1"" in the cirros console log, we also later see:

=== network info ===
if-info: lo,up,127.0.0.1,8,::1
if-info: eth0,up,10.1.0.13,28,fe80::f816:3eff:fe08:fa26
ip-route:default via 10.1.0.1 dev eth0
ip-route:10.1.0.0/28 dev eth0 src 10.1.0.13
ip-route:169.254.169.254 via 10.1.0.1 dev eth0

which seems to show the route was added.

I had been thinking the 100% failure rate with the WARN message meant that it's related, but it might not be, because the cirros console log is only collected if a job fails. So even if the WARN message is often emitted on successful job runs, we wouldn't see it in a logstash query.

Revision history for this message
melanie witt (melwitt) wrote :

Adding more notes based on what clarkb found and described in #openstack-infra:

The cirros console log should print the authorized keys:

2018-12-11 11:17:44.640445 | controller | b'Printing cirros user authorized keys'

but there's no content, which points to the metadata service not providing the data back.

He pointed out that the GET to the metadata API took over 10 seconds:

Dec 11 10:46:57.286705 ubuntu-xenial-inap-mtl01-0001122145 <email address hidden>[19907]: [pid: 19910|app: 0|req: 17/35] 198.72.124.65 () {40 vars in 705 bytes} [Tue Dec 11 10:46:46 2018] GET /2009-04-04/meta-data/public-keys => generated 27 bytes in 10527 msecs (HTTP/1.1 200) 3 headers in 99 bytes (1 switches on core 0)

and the cirros timeout is 10 seconds:

https://github.com/XANi/cirros/blob/42fdb92afa37a78fb2de179aeb0fc6da10430406/src/usr/bin/ec2metadata#L5

So, next we need to look at what could be causing the slowdown in the metadata API code.
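
A rough way to see the failure mode, assuming a wget-style fetch with a 10 second timeout like the cirros script linked above (this is only a sketch, not the actual cirros code): a response that takes 10.5 seconds fails on the client even though nova eventually returns HTTP 200.

wget -q -T 10 -O - http://169.254.169.254/2009-04-04/meta-data/public-keys \
    || echo "metadata request timed out or failed"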

Revision history for this message
Dr. Jens Harbott (j-harbott) wrote :

O.k., so I did some further local testing with cirros 0.3.5 and config-drive. If I add a keypair that contains 4 pubkeys, /run ends up 100% filled. If I use a keypair with 8 pubkeys, I get the same "write error" sequence as above, _but_ cirros seems to manage to clean up after that and successfully falls back to using the network-based metadata. Also, /run is only 32% filled then, so the "/run full" issue only seems to affect instances with a config drive attached. I'll open up a separate cirros bug for that.

So the real issue seems to be flaky metadata responses, which may have different causes on the network and/or API side. However, we could also try to make our testing more robust against this by teaching tempest to log in using the well-known default password if ssh using the key or password that should have been set via metadata has failed. That way we might be able to gather some more data about the failure scenarios.

Another approach that I'll try to take a look at is making cirros more robust against these failures by retrying failed attempts and also logging more details about the errors.
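
A minimal sketch of that kind of retry, assuming the same wget-style fetch as above (hypothetical, not the actual cirros change):

for attempt in 1 2 3; do
    keys=$(wget -q -T 10 -O - http://169.254.169.254/2009-04-04/meta-data/public-keys) && break
    echo "metadata attempt $attempt failed, retrying" >&2
    sleep 2
done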

Revision history for this message
Matt Riedemann (mriedem) wrote :

(11:25:40 AM) clarkb: frickler: mriedem melwitt to TL;DR the cirros ssh issues, they all root cause down to metadata being slow, but in the case of running out of disk we are attempting to use config drive first (which would ignore the metadata server), but since disk is full that fails then we fall back to metadata then that times out and fails?
(11:26:16 AM) mriedem: metadata api being slow?
(11:26:25 AM) mriedem: as in the guest doesn't get network info fast enough from the meta api?
(11:26:46 AM) clarkb: mriedem: yup, cirros has a 10 second timeout on that network request, and nova was taking 10.something seconds to respond

That sounds related to bug 1799298 and general inefficiencies in how the metadata-api service serves requests.

tags: added: api gate-failure metadata performance
Clark Boylan (cboylan)
summary: - Tempest cirros boots fail due to lack of disk space
+ Tempest cirros ssh setup fails due to lack of disk space causing config-
+ drive setup to fail forcing fallback to metadata server which fails due
+ to hitting 10 second timeout.
Revision history for this message
Matt Riedemann (mriedem) wrote :

Bug 1567655 would also be related to performance issues in the nova metadata-api, because when using neutron the metadata API has to query security group information from neutron per instance:

https://github.com/openstack/nova/blob/c72dafad8011198b78fd906f2fc7d1126e3f79f3/nova/api/metadata/base.py#L147

If nova were to cache the security group information with the instance in the instance_info_caches table we wouldn't need to query the neutron API.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/624778

Revision history for this message
Dr. Jens Harbott (j-harbott) wrote :

The initial cirros issue when running from a config-drive has been fixed in https://bugs.launchpad.net/cirros/+bug/1808119, and there is a patch for devstack to use the updated cirros version in https://review.openstack.org/624716. This should mitigate the network metadata issue at least in those scenarios where a config-drive is being used.

Matt Riedemann (mriedem)
Changed in nova:
status: New → Confirmed
Changed in devstack:
status: New → In Progress
assignee: nobody → Dr. Jens Harbott (j-harbott)
importance: Undecided → High
Changed in nova:
importance: Undecided → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.openstack.org/624778
Reason: The join on security_groups should be trivial since the table should be empty in neutron cases, and when getting base metadata we also return the password, which is stored in instance.system_metadata, so pre-joining instance_system_metadata is actually something we should be doing when we pull the instance from the cell database - it's used for more than just vendordata v2 requests, which makes this change essentially not worthwhile.

We either need to optimize how we do the system_metadata joins in the sqlalchemy API code, or we need to reconsider how we store the instance password (move it out of system_metadata).

Revision history for this message
Dr. Jens Harbott (j-harbott) wrote :

So the issue of cirros not being able to handle config-drive information properly has been fixed with cirros 0.3.6, which is now used by default in devstack. The remaining issue(s) seem to be somewhere in the nova-neutron interaction, so I'm marking this as Incomplete for devstack for now.

Changed in devstack:
status: In Progress → Incomplete
importance: High → Undecided
assignee: Dr. Jens Harbott (j-harbott) → nobody
Revision history for this message
Matt Riedemann (mriedem) wrote :

https://review.opendev.org/#/c/672715/ might resolve this in the gate.

Revision history for this message
Matt Riedemann (mriedem) wrote :

I think we need to update the logstash query for this, because I'm seeing out-of-space issues causing ssh to fail while the string the existing query searches for isn't in the guest console output, e.g.:

https://zuul.opendev.org/t/openstack/build/73e216c0b8d94365a5e5f1c94e6af361/log/job-output.txt#78872

2019-09-29 08:15:56.988235 | controller | b'successful after 1/20 tries: up 16.34. iid=i-0000004a'
2019-09-29 08:15:56.988283 | controller | b'failed to get http://169.254.169.254/2009-04-04/user-data'
2019-09-29 08:15:56.988320 | controller | b'warning: no ec2 metadata for user-data'
2019-09-29 08:15:56.988374 | controller | b'sh: write error: No space left on device'
2019-09-29 08:15:56.988415 | controller | b'Top of dropbear init script'
2019-09-29 08:15:56.988445 | controller | b'Starting dropbear sshd: OK'
2019-09-29 08:15:56.988499 | controller | b'GROWROOT: NOCHANGE: partition 1 is size 2078687. it cannot be grown'

Note the HTTP failure to get user-data, while the logstash query is looking for meta-data.

We could probably just look for "sh: write error: No space left on device" or "GROWROOT...it cannot be grown".
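
For example, queries along these lines should catch both signatures (the tags:console filter matches what the existing query already uses):

message:"sh: write error: No space left on device" AND tags:console
message:"it cannot be grown" AND tags:console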

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tempest 22.0.0

This issue was fixed in the openstack/tempest 22.0.0 release.

Revision history for this message
Matt Riedemann (mriedem) wrote :

This is still an issue, and might be due to this change in devstack to set the root disk size in tempest flavors:

https://review.opendev.org/#/c/619319/

Before that change, the flavor root disk was 0, so qemu-img would size the root disk in the guest to whatever was needed for the image and config drive. That change sets the flavor root disk to the size of the image rounded up to the nearest GiB, but it doesn't account for the config drive.

We keep seeing space/growroot failures in these failed jobs:

sh: write error: No space left on device
Top of dropbear init script
Starting dropbear sshd: OK
GROWROOT: NOCHANGE: partition 1 is size 2078687. it cannot be grown

So we should probably just pad an extra 1GiB on the root disk in those tempest flavors in devstack to be safe.
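
As a rough sketch of what that padding would look like (illustrative values, not the actual devstack change): if the image rounds up to 1GiB, give the flavor 2GiB so there is headroom for growroot.

openstack flavor create --id 42 --ram 64 --disk 2 --vcpus 1 m1.nano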

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to devstack (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/690991

Revision history for this message
Matt Riedemann (mriedem) wrote :

Ignore the config drive comment in PS17; that has nothing to do with the flavor's root disk since the config drive is a separate device.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on devstack (master)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.opendev.org/690991

Revision history for this message
Lee Yarwood (lyarwood) wrote :

I'm still seeing this on nova-next. I'm assuming the initial config drive copy into /run/cirros/ is failing, but these sysvinit scripts are pretty cryptic, so it could be later during the public key import. I also don't see an attempt by cirros to reach out to the metadata service; perhaps that is due to the failure to set a default route?

https://51bb6dbe041694cdacc4-bbe0f194e8a12aaea6b359f6bab3a11a.ssl.cf2.rackcdn.com/705764/8/gate/nova-next/a04af99/testr_results.html

Initializing random number generator...
[ 16.109726] random: dd urandom read with 30 bits of entropy available
done.
Starting acpid: OK
[ 23.289087] random: nonblocking pool is initialized
sh: write error: No space left on device
Starting network...
udhcpc (v1.23.2) started
Sending discover...
Sending select for 10.1.0.5...
Lease of 10.1.0.5 obtained, lease time 86400
route: SIOCADDRT: File exists
WARN: failed: route add -net "0.0.0.0/0" gw "10.1.0.1"
Top of dropbear init script

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to devstack (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/730307

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to devstack (master)

Reviewed: https://review.opendev.org/730307
Committed: https://git.openstack.org/cgit/openstack/devstack/commit/?id=54edc7aeef97df768477b5fa14f8fc45266a9c2e
Submitter: Zuul
Branch: master

commit 54edc7aeef97df768477b5fa14f8fc45266a9c2e
Author: Lee Yarwood <email address hidden>
Date: Fri May 22 13:03:15 2020 +0100

    tempest: Increase m1.nano and m1.micro RAM by 64MB to avoid tmpfs exhaustion

    tmpfs exhaustion has long been suspected as the root issue behind
    failures to load ssh keys and other metadata from local config drives as
    documented in bug #1808010. This can also lead to failures fetching
    metadata from n-metadata-api leaving Tempest unable to SSH into
    instances.

    This change increases the RAM of the m1.nano and m1.micro flavors by
    64MB to hopefully avoid these errors going forward. This is also ahead
    of our eventual upgrade to Cirros 0.5.0 where 128MB becomes a
    requirement.

    Related-Bug: #1808010
    Change-Id: I4b597579cf89939955d3c110c0bd58ca05de61f0

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to devstack (stable/ussuri)

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/730456

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to devstack (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/730457

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to devstack (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/730458

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to devstack (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.opendev.org/730459

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to devstack (stable/rocky)

Reviewed: https://review.opendev.org/730459
Committed: https://git.openstack.org/cgit/openstack/devstack/commit/?id=4acdb97d27fd1db6e66c5bd50fbf201777c67088
Submitter: Zuul
Branch: stable/rocky

commit 4acdb97d27fd1db6e66c5bd50fbf201777c67088
Author: Lee Yarwood <email address hidden>
Date: Fri May 22 13:03:15 2020 +0100

    tempest: Increase m1.nano and m1.micro RAM by 64MB to avoid tmpfs exhaustion

    tmpfs exhaustion has long been suspected as the root issue behind
    failures to load ssh keys and other metadata from local config drives as
    documented in bug #1808010. This can also lead to failures fetching
    metadata from n-metadata-api leaving Tempest unable to SSH into
    instances.

    This change increases the RAM of the m1.nano and m1.micro flavors by
    64MB to hopefully avoid these errors going forward. This is also ahead
    of our eventual upgrade to Cirros 0.5.0 where 128MB becomes a
    requirement.

    Related-Bug: #1808010
    Change-Id: I4b597579cf89939955d3c110c0bd58ca05de61f0
    (cherry picked from commit 54edc7aeef97df768477b5fa14f8fc45266a9c2e)
    (cherry picked from commit 9707dba33830b8a3ca97fa6abf091d9d1d0d7899)

tags: added: in-stable-rocky
tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to devstack (stable/stein)

Reviewed: https://review.opendev.org/730458
Committed: https://git.openstack.org/cgit/openstack/devstack/commit/?id=d9225369eba2bdcbaead5a6d2029fa84a7f568bf
Submitter: Zuul
Branch: stable/stein

commit d9225369eba2bdcbaead5a6d2029fa84a7f568bf
Author: Lee Yarwood <email address hidden>
Date: Fri May 22 13:03:15 2020 +0100

    tempest: Increase m1.nano and m1.micro RAM by 64MB to avoid tmpfs exhaustion

    tmpfs exhaustion has long been suspected as the root issue behind
    failures to load ssh keys and other metadata from local config drives as
    documented in bug #1808010. This can also lead to failures fetching
    metadata from n-metadata-api leaving Tempest unable to SSH into
    instances.

    This change increases the RAM of the m1.nano and m1.micro flavors by
    64MB to hopefully avoid these errors going forward. This is also ahead
    of our eventual upgrade to Cirros 0.5.0 where 128MB becomes a
    requirement.

    Related-Bug: #1808010
    Change-Id: I4b597579cf89939955d3c110c0bd58ca05de61f0
    (cherry picked from commit 54edc7aeef97df768477b5fa14f8fc45266a9c2e)
    (cherry picked from commit 9707dba33830b8a3ca97fa6abf091d9d1d0d7899)

tags: added: in-stable-train
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to devstack (stable/train)

Reviewed: https://review.opendev.org/730457
Committed: https://git.openstack.org/cgit/openstack/devstack/commit/?id=2e73509fbfc6f96b1f1ba690ba4ddf238bdc6266
Submitter: Zuul
Branch: stable/train

commit 2e73509fbfc6f96b1f1ba690ba4ddf238bdc6266
Author: Lee Yarwood <email address hidden>
Date: Fri May 22 13:03:15 2020 +0100

    tempest: Increase m1.nano and m1.micro RAM by 64MB to avoid tmpfs exhaustion

    tmpfs exhaustion has long been suspected as the root issue behind
    failures to load ssh keys and other metadata from local config drives as
    documented in bug #1808010. This can also lead to failures fetching
    metadata from n-metadata-api leaving Tempest unable to SSH into
    instances.

    This change increases the RAM of the m1.nano and m1.micro flavors by
    64MB to hopefully avoid these errors going forward. This is also ahead
    of our eventual upgrade to Cirros 0.5.0 where 128MB becomes a
    requirement.

    Related-Bug: #1808010
    Change-Id: I4b597579cf89939955d3c110c0bd58ca05de61f0
    (cherry picked from commit 54edc7aeef97df768477b5fa14f8fc45266a9c2e)
    (cherry picked from commit 9707dba33830b8a3ca97fa6abf091d9d1d0d7899)

tags: added: in-stable-ussuri
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to devstack (stable/ussuri)

Reviewed: https://review.opendev.org/730456
Committed: https://git.openstack.org/cgit/openstack/devstack/commit/?id=9707dba33830b8a3ca97fa6abf091d9d1d0d7899
Submitter: Zuul
Branch: stable/ussuri

commit 9707dba33830b8a3ca97fa6abf091d9d1d0d7899
Author: Lee Yarwood <email address hidden>
Date: Fri May 22 13:03:15 2020 +0100

    tempest: Increase m1.nano and m1.micro RAM by 64MB to avoid tmpfs exhaustion

    tmpfs exhaustion has long been suspected as the root issue behind
    failures to load ssh keys and other metadata from local config drives as
    documented in bug #1808010. This can also lead to failures fetching
    metadata from n-metadata-api leaving Tempest unable to SSH into
    instances.

    This change increases the RAM of the m1.nano and m1.micro flavors by
    64MB to hopefully avoid these errors going forward. This is also ahead
    of our eventual upgrade to Cirros 0.5.0 where 128MB becomes a
    requirement.

    Related-Bug: #1808010
    Change-Id: I4b597579cf89939955d3c110c0bd58ca05de61f0
    (cherry picked from commit 54edc7aeef97df768477b5fa14f8fc45266a9c2e)

Lee Yarwood (lyarwood)
no longer affects: nova
Changed in devstack:
status: Incomplete → Fix Released