fs020, tempest, image corrupted after upload to glance (checksum mismatch)

Bug #1754036 reported by Matt Young
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned
Milestone: (none)

Bug Description

Tempest jobs in FS 020 are failing, which is blocking promotion.

Image is uploaded:

- https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset020-master/a049ff7/undercloud/home/jenkins/tempest_output.log.txt.gz#_2018-03-06_11_08_59

- https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset020-master/a049ff7/overcloud-controller-foo-0/var/log/nova/nova-conductor.log.txt.gz#_2018-03-06_12_13_29_001

However it looks like there's a checksum mismatch when nova attempts to build the instance:

- https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset020-master/a049ff7/overcloud-controller-foo-0/var/log/nova/nova-conductor.log.txt.gz#_2018-03-06_11_38_17_521

RescheduledException: Build of instance ef1af128-1a0b-4aab-acbe-77b20898b542 was re-scheduled: [Errno 32] Corrupt image download. Checksum was e210650413b9a879eee7ceca4c934258 expected f8ab98ff5e73ebab884d80c9dc9c7290
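For context, the check that raises this error is a straight md5 comparison between the bytes that arrived and the checksum glance recorded for the image. A minimal sketch (the file path and contents are placeholders, not the real image; the expected checksum is the one from the log above):

```shell
# Stand-in for the downloaded image data (placeholder contents, not the real image).
printf 'placeholder image bytes' > /tmp/downloaded_image
# md5 of what actually arrived on the node.
actual=$(md5sum /tmp/downloaded_image | awk '{print $1}')
# 'expected' would come from the glance record (openstack image show -c checksum).
expected=f8ab98ff5e73ebab884d80c9dc9c7290
if [ "$actual" != "$expected" ]; then
    echo "Corrupt image download. Checksum was $actual expected $expected"
fi
```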

===

Current build/status for this job can be found here:

- http://38.145.35.214/#periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset020-master

This job failure is blocking promotion on master:

http://38.145.34.55/master.log

<snip>
2018-03-07 13:51:31,829 16867 INFO promoter Skipping promotion of tripleo-ci-testing to current-tripleo, missing successful jobs: ['periodic-ovb-3ctlr_1comp-featureset035', 'periodic-ovb-1ctlr_1comp-featureset020']
</snip>

===

Root cause not yet determined.

Matt Young (halcyondude)
Changed in tripleo:
status: New → Triaged
importance: Undecided → Critical
Revision history for this message
Matt Young (halcyondude) wrote :

Additional logs / investigation from arxcruz and chandankumar:

Tempest failure (summary):

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset020-master/a049ff7/tempest.html.gz

Glance logs:

- https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset020-master/a049ff7/overcloud-controller-foo-0/var/log/glance/api.log.txt.gz#_2018-03-06_11_14_40_514

===

- https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset020-master/a049ff7/overcloud-controller-foo-0/var/log/glance/api.log.txt.gz#_2018-03-06_11_17_53_887

^^ looks like an error accessing data from store...

2018-03-06 11:17:53.258 26958 INFO eventlet.wsgi.server [req-5dc2dd26-8695-4f7b-9731-a2a0873f615d 862e52cc310742fb85300ea27f7b9fd6 75a0772586844699b6fde3c6ee820e20 - default default] 192.168.24.13 - - [06/Mar/2018 11:17:53] "GET /v2/schemas/image HTTP/1.1" 200 4333 0.004149

2018-03-06 11:17:53.887 26958 ERROR glance_store._drivers.swift.store [req-bcd76943-a8ab-495c-943b-30a4cafb15a4 862e52cc310742fb85300ea27f7b9fd6 75a0772586844699b6fde3c6ee820e20 - default default] Error during chunked upload to backend, deleting stale chunks.: IOError: unexpected end of file while parsing chunked data

2018-03-06 11:17:53.892 26958 ERROR glance.api.v2.image_data [req-bcd76943-a8ab-495c-943b-30a4cafb15a4 862e52cc310742fb85300ea27f7b9fd6 75a0772586844699b6fde3c6ee820e20 - default default] Failed to upload image data due to internal error: IOError: unexpected end of file while parsing chunked data

2018-03-06 11:17:53.929 26958 ERROR glance.common.wsgi [req-bcd76943-a8ab-495c-943b-30a4cafb15a4 862e52cc310742fb85300ea27f7b9fd6 75a0772586844699b6fde3c6ee820e20 - default default] Caught error: unexpected end of file while parsing chunked data: IOError: unexpected end of file while parsing chunked data

<stack_trace/>

Revision history for this message
Matt Young (halcyondude) wrote :

chandankumar built a reproducer in RDO Cloud. It's still deploying the OC at the moment, but in ~1 hour it should be up for debugging. Ping chandankumar or myoung in #oooq for access.

Revision history for this message
Matt Young (halcyondude) wrote :

The reproducer failed to deploy... launching another one.

Revision history for this message
Ben Nemec (bnemec) wrote :

In case it helps, I've run into this sort of error when I have MTU problems on the overcloud. Make sure the overcloud network MTUs are 1450 or less so they fit under the 1500 MTU of the host cloud after adding tunnel overhead.
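For reference, the 1450 figure follows from the VXLAN encapsulation overhead. A quick arithmetic sketch (header sizes are the standard Ethernet/IPv4/UDP/VXLAN ones):

```shell
# Why tenant MTU must be <= 1450 under a 1500-byte host MTU:
# VXLAN wraps each frame in an outer Ethernet + IPv4 + UDP + VXLAN header.
host_mtu=1500
overhead=$((14 + 20 + 8 + 8))     # Ethernet(14) + IPv4(20) + UDP(8) + VXLAN(8) = 50
echo $((host_mtu - overhead))     # prints 1450
```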

Revision history for this message
Matt Young (halcyondude) wrote :

Thanks for the suggestion! I think we're using 1350...

https://github.com/openstack/tripleo-quickstart-extras/blob/master/config/environments/rdocloud.yml#L34

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset020-master/a049ff7/undercloud/home/jenkins/overcloud_network_params.yaml.txt.gz

```
# Defines overcloud network parameters based on parameters given.

parameter_defaults:
  NeutronGlobalPhysnetMtu: 1350
```

However, an initial look at the ifcfg files from the OC controller doesn't show the MTU setting. I'll look into this tomorrow morning to see whether we're setting it dynamically but not persisting it (my guess is that's the case).

Revision history for this message
Matt Young (halcyondude) wrote :

I have a live reproducer for this now. Using the key cschwede gave me, I can add whoever needs to look. I've confirmed that the same IOError exists in

overcloud-controller-foo-0:/var/log/glance/api.log

to get to UC (post reproducer run):

ssh zuul@38.145.32.140

Revision history for this message
yatin (yatinkarel) wrote :

<< # Defines overcloud network parameters based on parameters given.
<<
<< parameter_defaults:
<<   NeutronGlobalPhysnetMtu: 1350

The above setting configures global_physnet_mtu in neutron.conf, which is correctly configured: https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset020-master/a049ff7/overcloud-novacompute-bar-0/etc/neutron/neutron.conf.txt.gz

The MTU on compute nodes is 1500 (2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000), which is possibly the cause; see: https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset020-master/a049ff7/overcloud-novacompute-bar-0/var/log/host_info.txt.gz

On a local reproducer shared by pdeore, I tried "sudo ifconfig eth0 mtu 1350" on the compute node, and instance creation then succeeded in just a few seconds.
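Note the ifconfig change above is transient (lost on reboot). A hedged sketch of applying and persisting the same MTU on a RHEL/CentOS node (the interface name eth0, the value 1350, and the ifcfg path are assumptions taken from these logs; run as root):

```shell
# Apply the reduced MTU now and persist it across reboots.
# eth0 and 1350 are taken from this reproducer; adjust per node.
iface=eth0
mtu=1350
ip link set "$iface" mtu "$mtu"
# network-scripts re-reads this file on boot/ifup (RHEL/CentOS).
echo "MTU=$mtu" >> "/etc/sysconfig/network-scripts/ifcfg-$iface"
```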

Revision history for this message
yatin (yatinkarel) wrote :

On the undercloud, MTUs are set correctly by https://github.com/openstack/tripleo-quickstart-extras/blob/master/playbooks/baremetal-full-overcloud-prep.yml#L8-L12, using the script https://github.com/openstack/tripleo-quickstart-extras/blob/master/roles/baremetal-prep-overcloud/templates/adjust-interface-mtus.sh.j2#L13-L16

For the overcloud, I can't find any such task running; maybe something similar should be done for overcloud nodes to set the correct MTUs.

Revision history for this message
yatin (yatinkarel) wrote :

<< For overcloud i can't find any such task running, may be something similar should be done for overcloud nodes to set correct mtu's.
But before fixing it, we should find out why this issue started appearing recently, as there might be another fix for it.

Revision history for this message
Matt Young (halcyondude) wrote :

OK... in digging some more: for OVB we're setting the OC MTU (or trying to) via dnsmasq by running this on the UC:

- https://github.com/openstack/tripleo-quickstart-extras/blob/master/roles/baremetal-prep-overcloud/tasks/adjust-mtu-dnsmasq-ironic.yml

which uses this template:

- https://github.com/openstack/tripleo-quickstart-extras/blob/master/roles/baremetal-prep-overcloud/templates/adjust-interface-mtus.sh.j2

which sets the MTU for UC interfaces by calling "ip link set", and updates /etc/dnsmasq-ironic.conf on the undercloud with the MTU...

## Adjust interface MTU values for undercloud and overcloud
## =======================================================

## * Adjust interface mtus
## ::

{% for interface in (mtu_interface) %}
    ip link set {{ interface }} mtu {{ mtu }}
    echo "MTU={{ mtu }}" >> /etc/sysconfig/network-scripts/ifcfg-{{ interface }}
{% endfor %}

## * Modify dnsmasq-ironic.conf
## ::

echo -e "\ndhcp-option-force=26,{{ mtu }}" >> /etc/dnsmasq-ironic.conf
systemctl restart 'neutron-*'
systemctl restart openstack-ironic-conductor

---

Digging more...

Revision history for this message
Matt Young (halcyondude) wrote :

The latest queens fs20 is passing

- https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset020-queens/e03e3a4/tempest.html.gz

ironic version there is:

openstack-ironic-api-10.1.2-0.20180308181524.233318f.el7.centos.noarch
openstack-ironic-common-10.1.2-0.20180308181524.233318f.el7.centos.noarch
openstack-ironic-conductor-10.1.2-0.20180308181524.233318f.el7.centos.noarch
openstack-ironic-inspector-7.2.1-0.20180302142656.397a98a.el7.centos.noarch
openstack-ironic-staging-drivers-0.9.0-0.20180220235748.de59d74.el7.centos.noarch

for the master failing job ironic version is:

- https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset020-master/a049ff7/undercloud/var/log/extra/rpm-list.txt.gz

openstack-ironic-api-10.2.0-0.20180306082945.3a0b56a.el7.centos.noarch
openstack-ironic-common-10.2.0-0.20180306082945.3a0b56a.el7.centos.noarch
openstack-ironic-conductor-10.2.0-0.20180306082945.3a0b56a.el7.centos.noarch
openstack-ironic-inspector-7.1.1-0.20180222014434.e26f11c.el7.centos.noarch
openstack-ironic-staging-drivers-0.9.0-0.20180216175338.de59d74.el7.centos.noarch

Revision history for this message
Ronelle Landy (rlandy) wrote :

<< For overcloud i can't find any such task running, may be something similar should be done for overcloud nodes to set correct mtu's.

This was supposed to modify the MTUs on the overcloud:

echo -e "\ndhcp-option-force=26,{{ mtu }}" >> /etc/dnsmasq-ironic.conf
systemctl restart 'neutron-*'
systemctl restart openstack-ironic-conductor

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to instack-undercloud (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/552693

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to instack-undercloud (master)

Reviewed: https://review.openstack.org/552693
Committed: https://git.openstack.org/cgit/openstack/instack-undercloud/commit/?id=1bc7a07decc6e4ace5115711707a69d80b969b5f
Submitter: Zuul
Branch: master

commit 1bc7a07decc6e4ace5115711707a69d80b969b5f
Author: Alex Schultz <email address hidden>
Date: Tue Mar 13 15:46:11 2018 -0600

    Ensure mtu is set correctly on ctlplane

    We were not passing the MTU when we create the ctlplane network so if
    the local_mtu is less than 1500, 1500 is used by the ctlplane network in
    neutron.

    Change-Id: Ic7a4c5a62ff49b2f8964dd58bf5c97d9781c4ce1
    Related-Bug: #1754036
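With this fix applied, a quick way to confirm the behavior described in the commit message (assumes an authenticated undercloud shell; "ctlplane" is the undercloud provisioning network created by instack-undercloud):

```shell
# Show the MTU neutron recorded for the ctlplane network.
# Per the commit message, before the fix this stayed at 1500
# even when local_mtu was lower.
openstack network show ctlplane -c mtu -f value
```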

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to instack-undercloud (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.openstack.org/553024

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to instack-undercloud (stable/queens)

Reviewed: https://review.openstack.org/553024
Committed: https://git.openstack.org/cgit/openstack/instack-undercloud/commit/?id=a31ff5a850e6b099623b2ea5ed4d1c867bd3e067
Submitter: Zuul
Branch: stable/queens

commit a31ff5a850e6b099623b2ea5ed4d1c867bd3e067
Author: Alex Schultz <email address hidden>
Date: Tue Mar 13 15:46:11 2018 -0600

    Ensure mtu is set correctly on ctlplane

    We were not passing the MTU when we create the ctlplane network so if
    the local_mtu is less than 1500, 1500 is used by the ctlplane network in
    neutron.

    Change-Id: Ic7a4c5a62ff49b2f8964dd58bf5c97d9781c4ce1
    Related-Bug: #1754036
    (cherry picked from commit 1bc7a07decc6e4ace5115711707a69d80b969b5f)

tags: added: in-stable-queens
Revision history for this message
John Trowbridge (trown) wrote :

This should be resolved, but we are now seeing https://bugs.launchpad.net/tripleo/+bug/1757111

Changed in tripleo:
status: Triaged → Incomplete
status: Incomplete → Fix Released