tempest.api.compute.images.test_list_image_filters failing on victoria

Bug #1959014 reported by Rafael Castillo
Affects: tripleo
Status: Won't Fix
Importance: Critical
Assigned to: Unassigned

Bug Description

periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-victoria is failing tempest.api.compute.images.test_list_image_filters.

Affected jobs: https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-victoria

Tempest error:
```
ft1.1: setUpClass (tempest.api.compute.images.test_list_image_filters.ListImageFiltersTestJSON)
testtools.testresult.real._StringException: Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/tempest/api/compute/base.py", line 372, in create_image_from_server
    waiters.wait_for_image_status(client, image_id, wait_until)
  File "/usr/lib/python3.6/site-packages/tempest/common/waiters.py", line 167, in wait_for_image_status
    image = show_image(image_id)
  File "/usr/lib/python3.6/site-packages/tempest/lib/services/compute/images_client.py", line 75, in show_image
    resp, body = self.get("images/%s" % image_id)
  File "/usr/lib/python3.6/site-packages/tempest/lib/common/rest_client.py", line 315, in get
    return self.request('GET', url, extra_headers, headers)
  File "/usr/lib/python3.6/site-packages/tempest/lib/services/compute/base_compute_client.py", line 48, in request
    method, url, extra_headers, headers, body, chunked)
  File "/usr/lib/python3.6/site-packages/tempest/lib/common/rest_client.py", line 704, in request
    self._error_checker(resp, resp_body)
  File "/usr/lib/python3.6/site-packages/tempest/lib/common/rest_client.py", line 810, in _error_checker
    raise exceptions.NotFound(resp_body, resp=resp)
tempest.lib.exceptions.NotFound: Object not found
Details: {'code': 404, 'message': 'Image not found.'}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/tempest/test.py", line 182, in setUpClass
    six.reraise(etype, value, trace)
  File "/usr/local/lib/python3.6/site-packages/six.py", line 719, in reraise
    raise value
  File "/usr/lib/python3.6/site-packages/tempest/test.py", line 175, in setUpClass
    cls.resource_setup()
  File "/usr/lib/python3.6/site-packages/tempest/api/compute/images/test_list_image_filters.py", line 124, in resource_setup
    cls.server2['id'], wait_until='ACTIVE')
  File "/usr/lib/python3.6/site-packages/tempest/api/compute/base.py", line 387, in create_image_from_server
    image_id=image_id)
tempest.exceptions.SnapshotNotFoundException: Server snapshot image 926320d2-d53a-4aff-a679-de0645fd33e1 not found.
```
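For orientation, the failure path in that traceback can be paraphrased as follows (a minimal sketch mirroring tempest's create_image_from_server()/wait_for_image_status() behaviour, not the exact tempest source): Nova accepts the snapshot request and returns an image UUID, tempest polls that image until it reaches ACTIVE, and a 404 during polling is what gets converted into the SnapshotNotFoundException at the bottom.

```
# Paraphrased sketch of the failure path above (not the exact tempest code).
import time

class NotFound(Exception):
    """Stand-in for tempest.lib.exceptions.NotFound (HTTP 404)."""

class SnapshotNotFoundException(Exception):
    """Stand-in for tempest.exceptions.SnapshotNotFoundException."""

def wait_for_image_status(client, image_id, status, interval=1):
    # Poll GET /images/<image_id>; show_image() raises NotFound once the
    # failed snapshot image has been deleted server-side. (The real
    # waiter also enforces a timeout.)
    while True:
        image = client.show_image(image_id)
        if image['status'].upper() == status:
            return image
        time.sleep(interval)

def create_image_from_server(client, server_id, wait_until=None):
    # Nova accepts the snapshot request and hands back the new image UUID.
    image_id = client.create_image(server_id)
    if wait_until is not None:
        try:
            wait_for_image_status(client, image_id, wait_until)
        except NotFound:
            # The 404 seen while polling is re-raised with a clearer
            # message -- the exception at the bottom of the traceback.
            raise SnapshotNotFoundException(
                'Server snapshot image %s not found.' % image_id)
    return image_id
```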

Example logs: https://logserver.rdoproject.org/60/31460/51/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-victoria/6a262b2/

Changed in tripleo:
status: New → Triaged
tags: added: promotion-blocker
Revision history for this message
Artom Lifshitz (notartom) wrote :

The relevant log is most likely:

```
2022-01-25 13:13:35.560 8 ERROR oslo_messaging.rpc.server [req-a487eeb2-b7f9-4712-8a58-6cfcdf9836b4 c0aeae9dc16743558ee7e13bf4f432d9 3c00e69e1dc4414faa97b980e348e301 - default default] Exception during message handling: libvirt.libvirtError: Unable to read from monitor: Connection reset by peer
2022-01-25 13:13:35.560 8 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2022-01-25 13:13:35.560 8 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 2494, in snapshot
2022-01-25 13:13:35.560 8 ERROR oslo_messaging.rpc.server instance.image_ref)
2022-01-25 13:13:35.560 8 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/imagebackend.py", line 452, in direct_snapshot
2022-01-25 13:13:35.560 8 ERROR oslo_messaging.rpc.server raise NotImplementedError(_('direct_snapshot() is not implemented'))
2022-01-25 13:13:35.560 8 ERROR oslo_messaging.rpc.server NotImplementedError: direct_snapshot() is not implemented
2022-01-25 13:13:35.560 8 ERROR oslo_messaging.rpc.server
2022-01-25 13:13:35.560 8 ERROR oslo_messaging.rpc.server During handling of the above exception, another exception occurred:
2022-01-25 13:13:35.560 8 ERROR oslo_messaging.rpc.server
2022-01-25 13:13:35.560 8 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2022-01-25 13:13:35.560 8 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
2022-01-25 13:13:35.560 8 ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
2022-01-25 13:13:35.560 8 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch
2022-01-25 13:13:35.560 8 ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
2022-01-25 13:13:35.560 8 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch
2022-01-25 13:13:35.560 8 ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
2022-01-25 13:13:35.560 8 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/nova/exception_wrapper.py", line 78, in wrapped
2022-01-25 13:13:35.560 8 ERROR oslo_messaging.rpc.server function_name, call_dict, binary)
2022-01-25 13:13:35.560 8 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2022-01-25 13:13:35.560 8 ERROR oslo_messaging.rpc.server self.force_reraise()
2022-01-25 13:13:35.560 8 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2022-01-25 13:13:35.560 8 ERROR oslo_messaging.rpc.server six.reraise(self.type_, self.value, self.tb)
2022-01-25 13:13:35.560 8 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/six.py", line 703, in reraise
2022-01-25 13:13:35.560 8 ERROR oslo_messaging.rpc.server raise value
2022-01-25 13:13:35.560 8 ERROR oslo_messaging.rpc.server File "/usr...
```

Revision history for this message
Artom Lifshitz (notartom) wrote :

So initially Nova tried to do a direct snapshot, which is not implemented in the image backend (expected), so Nova fell back to a live snapshot, and libvirt seems to have gone away during that.
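That control flow looks roughly like this (a paraphrased sketch of the snapshot fallback in Nova's libvirt driver, not the exact nova.virt.libvirt.driver source; the function name is invented for the example):

```
# Paraphrased sketch of the direct-snapshot -> live-snapshot fallback.
def snapshot_root_disk(root_disk, live_snapshot, *args, **kwargs):
    try:
        # First attempt: a backend-native ("direct") snapshot. Only some
        # image backends implement this; the file-backed qcow2 backend
        # used here raises NotImplementedError, as seen in the log.
        return root_disk.direct_snapshot(*args, **kwargs)
    except NotImplementedError:
        # Expected fallback: a live snapshot through libvirt -- the path
        # that then hit "Unable to read from monitor: Connection reset
        # by peer" when QEMU went away.
        return live_snapshot(*args, **kwargs)
```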

Here's what we see in the instance qemu log (logs/overcloud-novacompute-0/var/log/containers/libvirt/qemu/instance-0000003e.log.txt):

```
2022-01-25 13:13:35.058+0000: shutting down, reason=crashed
```

And if we look at the libvirtd log (./logs/overcloud-novacompute-0/var/log/containers/libvirt/libvirtd.log.txt), we see:

```
2022-01-25 13:13:27.647+0000: 31298: error : qemuDomainBlockJobAbort:14523 : invalid argument: disk vda does not have an active block job
2022-01-25 13:13:28.482+0000: 31298: error : qemuMonitorJSONCheckErrorFull:406 : internal error: unable to execute QEMU command 'blockdev-del': Failed to find node with node-name='libvirt-4-format'
2022-01-25 13:13:28.511+0000: 31298: error : qemuMonitorJSONCheckErrorFull:406 : internal error: unable to execute QEMU command 'blockdev-del': Failed to find node with node-name='libvirt-4-storage'
2022-01-25 13:13:33.727+0000: 31297: error : qemuDomainBlockJobAbort:14523 : invalid argument: disk vda does not have an active block job
2022-01-25 13:13:34.856+0000: 52554: error : qemuMonitorIORead:495 : Unable to read from monitor: Connection reset by peer
2022-01-25 13:13:35.083+0000: 47753: warning : virSecuritySELinuxRestoreFileLabel:1454 : cannot resolve symlink /var/lib/nova/instances/snapshots/tmpbe0_42ae/72c39473b96a45bb8eeb51f96167fb79.delta: No such file or directory
2022-01-25 13:13:35.083+0000: 47753: warning : qemuProcessStop:8211 : Unable to restore security label on vda
```

Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

Interesting, it _looks_ like the instance crashed. I wonder if we can get the coredumps here.

We only see "blockdev-del" calls in the libvirt debug log. However, if a snapshot attempt _was_ really made, we should've also seen "blockdev-add" and "blockdev-snapshot-sync" calls. I'm wondering what's missing here. Perhaps we need more input from the libvirt/QEMU block layer devs.

For context, check out the below related (resolved) bug:

https://bugzilla.redhat.com/show_bug.cgi?id=1809409 — Failed to create volume from vm snapshot image

This was resolved via a libvirt fix in libvirt-6.0.0-17.el8 (https://bugzilla.redhat.com/show_bug.cgi?id=1820016). However, the setup here is using libvirt 7.0.

Also, "blockdev-del" means, in QEMU language:

```
# @blockdev-del:
#
# Deletes a block device that has been added using blockdev-add.
# The command will fail if the node is attached to a device or is
# otherwise being used.
```
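Schematically, the gap looks like this (the command names are the ones mentioned above; the comparison itself is only illustrative):

```
# Schematic only: commands a real snapshot attempt should have left in
# the debug log, vs. what this job's log actually shows.
expected_if_snapshot_attempted = [
    "blockdev-add",            # set up the new block node first ...
    "blockdev-snapshot-sync",  # ... take the snapshot onto it ...
    "blockdev-del",            # ... and tear the node down afterwards
]
observed_in_this_log = ["blockdev-del"]  # teardown without any setup

missing = [cmd for cmd in expected_if_snapshot_attempted
           if cmd not in observed_in_this_log]
print(missing)  # -> ['blockdev-add', 'blockdev-snapshot-sync']
```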

Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

Ah-ha, we don't have the full QMP traces of what libvirt is sending to QEMU because this job is not using DevStack -- as this is TripleO :-(

Can you please re-run this job after configuring libvirt log filters? Here are the notes on how to do it:

Do the following on all relevant compute nodes:

(1) Edit this file:
    /var/lib/config-data/puppet-generated/nova_libvirt/etc/libvirt/libvirtd.conf

    And add the following two entries:

        log_filters="1:libvirt 1:qemu 1:conf 1:security 3:event 3:json 3:file 3:object 1:util"
        log_outputs="1:file:/var/log/libvirt/libvirtd.log"

(2) Restart the 'nova_libvirt' container.

(3) Confirm that the newly added config attributes show up in
    libvirtd.conf in the 'nova_libvirt' container (if you see other
    values, comment them out and make sure only the two lines below
    show up):

        $> egrep -i 'log_filters=|log_outputs=' /etc/libvirt/libvirtd.conf
        log_filters="1:libvirt 1:qemu 1:conf 1:security 3:event 3:json 3:file 3:object 1:util"
        log_outputs="1:file:/var/log/libvirt/libvirtd.log"

(4) Then, repeat the test and attach the /var/log/libvirt/libvirtd.log
    (rename them if need be) from all relevant hosts as plain text to
    the bug.

And don't forget: Once done with debugging, disable the above config
filters by commenting out (or removing) the above two lines, and
restart the 'nova_libvirt' container once again.
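
For CI or repeated runs, the four steps above could be scripted roughly as follows (a minimal sketch, assuming podman-managed containers as on these CentOS 8 TripleO computes, root access on the node, and that neither key is already set in the file):

```
# Minimal automation sketch of steps (1)-(4) above; assumes podman-managed
# containers (CentOS 8 TripleO) and that it runs as root on each compute.
import subprocess

CONF = ("/var/lib/config-data/puppet-generated/nova_libvirt"
        "/etc/libvirt/libvirtd.conf")
ENTRIES = [
    'log_filters="1:libvirt 1:qemu 1:conf 1:security '
    '3:event 3:json 3:file 3:object 1:util"',
    'log_outputs="1:file:/var/log/libvirt/libvirtd.log"',
]

# (1) Append the two logging entries to the puppet-generated config.
with open(CONF, "a") as conf:
    conf.write("\n" + "\n".join(ENTRIES) + "\n")

# (2) Restart the container so it picks up the new config.
subprocess.run(["podman", "restart", "nova_libvirt"], check=True)

# (3) Confirm the attributes show up inside the container.
result = subprocess.run(
    ["podman", "exec", "nova_libvirt", "grep", "-E",
     "log_filters=|log_outputs=", "/etc/libvirt/libvirtd.conf"],
    check=True, stdout=subprocess.PIPE, universal_newlines=True)
print(result.stdout)

# (4) Re-run the test, then collect /var/log/libvirt/libvirtd.log from
#     each relevant compute node and attach it to the bug.
```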

Revision history for this message
Marios Andreou (marios-b) wrote :
Revision history for this message
Marios Andreou (marios-b) wrote :

@Kashyap - per your comment #5 - looks like we are already setting [1] the required logging conf:

e.g. at [2]

        * log_filters="1:libvirt 1:qemu 1:conf 1:security 3:event 3:json 3:file 3:object 1:util"
        * log_outputs="3:file:/var/log/libvirt/libvirtd.log"

But I can't find the libvirtd.log - even though we *should* be collecting it [3].

I'm coming up on EOD, so I'll see if dasm|rover can dig into why. Otherwise (or in addition) we can hold a node for you next week so you can check the logs.

[1] https://opendev.org/openstack/tripleo-heat-templates/src/commit/70a8c481df3b886a2172ea5ff0ceaf19273c56d7/deployment/logging/files/nova-libvirt.yaml#L72
[2] https://logserver.rdoproject.org/60/31460/52/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-victoria/d8f303b/logs/overcloud-novacompute-0/var/lib/config-data/puppet-generated/nova_libvirt/etc/libvirt/libvirtd.conf.txt.gz
[3] https://opendev.org/openstack/tripleo-ci/commit/2bae18e0dbedd18f9bfd751d340e4c0fa1f982a9

Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

Hi, Marios

Ah! I just spotted a bug in the log filters you posted above for TripleO; that's why you're not getting the full QMP traces. (I thought this was fixed a long time ago in all TripleO filters, as I recall pointing this out in the past.)

The crucial bug: TripleO is setting

    log_outputs="3:file:/var/log/libvirt/libvirtd.log"

which will not give us useful logs. What we need is this:

    log_outputs="1:file:/var/log/libvirt/libvirtd.log"

Notice the "1" vs. "3" there. ("1" means useful full libvirt/QEMU exchange + warnings/errors, while "3" means only warnings/errors.)

To summarize:

   log_filters="1:libvirt 1:qemu 1:conf 1:security 3:event 3:json 3:file 3:object 1:util"
   log_outputs="1:file:/var/log/libvirt/libvirtd.log"

Revision history for this message
Ronelle Landy (rlandy) wrote :

So I went looking for said faulty string:

log_outputs="3:file:/var/log/libvirt/libvirtd.log

and codesearch pointed to https://opendev.org/openstack/tripleo-heat-templates/commit/5fd7b487fdb12c258759628dd4013e6407680683

```
  LibvirtLogOutputs:
    description: Defines how log outputs are sent, as described in
                 https://libvirt.org/logging.html .
    type: string
    default: '3:file:/var/log/libvirt/libvirtd.log'
```

but ...

looking at the stable/victoria version of that file,

https://opendev.org/openstack/tripleo-heat-templates/src/branch/stable/victoria/deployment/nova/nova-libvirt-container-puppet.yaml

the string is no longer there:

https://opendev.org/openstack/tripleo-heat-templates/commits/branch/stable/victoria/deployment/nova/nova-libvirt-container-puppet.yaml

Revision history for this message
Ronelle Landy (rlandy) wrote :

Unless pinned, the job should be picking up the current version of THT:

https://trunk.rdoproject.org/centos8-victoria/component/tripleo/current/

openstack-tripleo-heat-templates-13.5.1-0.20220127212904.c37277a.el8.noarch.rpm 2022-01-27 21:34 562K
openstack-tripleo-heat-templates-13.5.1-0.20220127212904.c37277a.el8.src.rpm 2022-01-27 21:33 858K

Revision history for this message
Marios Andreou (marios-b) wrote :

Thanks @Kashyap - apologies, I didn't get to cycle back to this one for a few days while dealing with other blockers.

ACK on 1 vs. 3 for the debug level - sorry I missed that. I have posted a debug patch which should give us level 1 at https://review.opendev.org/c/openstack/tripleo-heat-templates/+/827660 and am running the test with it at https://review.rdoproject.org/r/c/testproject/+/38983

If that gives us the required log level in logs/overcloud-novacompute-0/var/lib/config-data/puppet-generated/nova_libvirt/etc/libvirt/libvirtd.conf.txt.gz, then we can hold the node for more debugging as needed - let's see...

Revision history for this message
Marios Andreou (marios-b) wrote (last edit ):

@Kashyap - looks like the log levels should be OK now with the debug patch [1]; test at [2]:

        * https://logserver.rdoproject.org/83/38983/1/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-victoria/b21235c/logs/overcloud-novacompute-0/var/lib/config-data/puppet-generated/nova_libvirt/etc/libvirt/libvirtd.conf.txt.gz
        * log_filters="1:libvirt 1:qemu 1:conf 1:security 3:event 3:json 3:file 3:object 1:util"
        * log_outputs="1:file:/var/log/libvirt/libvirtd.log"

        * https://logserver.rdoproject.org/83/38983/1/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-victoria/b21235c/logs/overcloud-novacompute-0/etc/libvirt/virtlogd.conf.txt.gz
        * log_outputs="1:file:/var/log/libvirt/virtlogd.log"

I have reverted the skip from comment #5 above at [3] and rerun this at [4] - then you should be able to use the logs at [5], for example (from the new run). And/or we can get the node held for more debugging.

[1] https://review.opendev.org/c/openstack/tripleo-heat-templates/+/827660
[2] https://review.rdoproject.org/r/c/testproject/+/38983/1#message-a5466a99e808b2e3584320f6ebed17a55b262b39
[3] https://review.opendev.org/c/openstack/openstack-tempest-skiplist/+/828243
[4] https://review.rdoproject.org/r/c/testproject/+/38983/2#message-700f70c2778a159b7af8448174646987eff768e0
[5] https://logserver.rdoproject.org/83/38983/1/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-victoria/b21235c/logs/overcloud-novacompute-0/var/log/containers/libvirt/libvirtd.log.txt.gz

Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

Hi, Marios. Thanks for the detailed work here. The log now _does_ have broader libvirt <-> QEMU interaction logging (including both the "blockdev-add" and "blockdev-del" calls).

However, unfortunately, in your link [5] I don't see the 'blockdev-del' failure anymore:

    unable to execute QEMU command 'blockdev-del'

It feels like we're back to square one. :-(

Revision history for this message
Marios Andreou (marios-b) wrote (last edit ):

@Kashyap - my comment #11 could have been clearer - the link there ([5]) was from the *previous* run...

To see the result of the test at [4], which includes the debug logging, see https://review.rdoproject.org/r/c/testproject/+/38983/2#message-60216e495f8b81ec5051291afddf44de14b862f1

In particular, you can find the blockdev-del entries in https://logserver.rdoproject.org/83/38983/2/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-victoria/67d64d4/logs/overcloud-novacompute-0/var/log/containers/libvirt/libvirtd.log.1.gz

@Pooja - the test is green there because you aren't running the test which fails (it is skipped); in my test I am using https://review.opendev.org/c/openstack/openstack-tempest-skiplist/+/828243

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

Hi!

Kashyap asked me to look into this from the Nova perspective, so I did.

I started with the original run that is linked in the report https://logserver.rdoproject.org/60/31460/51/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-victoria/6a262b2/

There I see the following:

# A server was created with the domain xml:

```
2022-01-25 13:13:22.963 8 DEBUG nova.virt.libvirt.driver [req-3f14a5ed-531f-4ff7-99ff-5b6197428175 c0aeae9dc16743558ee7e13bf4f432d9 3c00e69e1dc4414faa97b980e348e301 - default default] [instance: f54fd0a5-322a-4cb1-8bb8-604aecc2c4d1] End _get_guest_xml xml=<domain type="qemu">
  <uuid>f54fd0a5-322a-4cb1-8bb8-604aecc2c4d1</uuid>
  <name>instance-0000003e</name>
  <memory>131072</memory>
  <vcpu>1</vcpu>
  <metadata>
    <nova:instance xmlns:nova="http://openstack.org/xmlns/libvirt/nova/1.0">
      <nova:package version="22.3.1-0.20220111022305.28d0059.el8"/>
      <nova:name>tempest-ListImageFiltersTestJSON-server-1201316046</nova:name>
      <nova:creationTime>2022-01-25 13:13:22</nova:creationTime>
      <nova:flavor name="m1.nano">
        <nova:memory>128</nova:memory>
        <nova:disk>1</nova:disk>
        <nova:swap>0</nova:swap>
        <nova:ephemeral>0</nova:ephemeral>
        <nova:vcpus>1</nova:vcpus>
      </nova:flavor>
      <nova:owner>
        <nova:user uuid="c0aeae9dc16743558ee7e13bf4f432d9">tempest-ListImageFiltersTestJSON-6230924-project</nova:user>
        <nova:project uuid="3c00e69e1dc4414faa97b980e348e301">tempest-ListImageFiltersTestJSON-6230924</nova:project>
      </nova:owner>
      <nova:root type="image" uuid="d929eb6c-32ee-439b-85a4-f9b181ea1bb8"/>
    </nova:instance>
  </metadata>
  <sysinfo type="smbios">
    <system>
      <entry name="manufacturer">RDO</entry>
      <entry name="product">OpenStack Compute</entry>
      <entry name="version">22.3.1-0.20220111022305.28d0059.el8</entry>
      <entry name="serial">f54fd0a5-322a-4cb1-8bb8-604aecc2c4d1</entry>
      <entry name="uuid">f54fd0a5-322a-4cb1-8bb8-604aecc2c4d1</entry>
      <entry name="family">Virtual Machine</entry>
    </system>
  </sysinfo>
  <os>
    <type machine="pc-i440fx-rhel7.6.0">hvm</type>
    <boot dev="hd"/>
    <smbios mode="sysinfo"/>
  </os>
  <features>
    <acpi/>
    <apic/>
  </features>
  <cputune>
    <shares>1024</shares>
  </cputune>
  <clock offset="utc"/>
  <cpu match="exact">
    <topology sockets="1" cores="1" threads="1"/>
  </cpu>
  <devices>
    <disk type="file" device="disk">
      <driver name="qemu" type="qcow2" cache="none"/>
      <source file="/var/lib/nova/instances/f54fd0a5-322a-4cb1-8bb8-604aecc2c4d1/disk"/>
      <target bus="virtio" dev="vda"/>
    </disk>
    <serial type="pty">
      <log file="/var/lib/nova/instances/f54fd0a5-322a-4cb1-8bb8-604aecc2c4d1/console.log" append="off"/>
    </serial>
    <input type="tablet" bus="usb"/>
    <graphics type="vnc" autoport="yes" listen="fd00:fd00:fd00:2000::40"/>
    <video>
      <model type="cirrus"/>
    </video>
    <rng model="virtio">
      <backend model="random">/dev/urandom</backend>
    </rng>
    <memballoon model="virtio">
      <stats period="10"/>
    </memballoon>
  </devices>
</domain>
```

# A snapshot is requested ...


Revision history for this message
Ronelle Landy (rlandy) wrote :
Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

Thanks, Gibi, for your analysis, and Ronelle too (for the update in comment #16).

Meanwhile, two points:

(1) The original instance crash that Artom pointed out in comment #2 is not reproducible in later test runs. It seems to be spurious.

(2) The QEMU error "unable to execute QEMU command 'blockdev-del'" was caused by a bug in libvirt, but that is NOT the root cause of the crash in point (1). This seems to be a latent bug in libvirt, uncovered by this test, and it is being fixed upstream (by Peter Krempa):

    https://listman.redhat.com/archives/libvir-list/2022-February/msg00790.html
    — qemu: blockjob: Avoid spurious log errors when cancelling a shallow copy with reused images

Revision history for this message
Ronelle Landy (rlandy) wrote :

victoria is EOL - closing this out

Changed in tripleo:
status: Triaged → Won't Fix