tripleo

swift timeout during undercloud deploy

Bug #1797167 reported by Martin Kopec on 2018-10-10

Affects    Status          Importance    Assigned to        Milestone
tripleo    Fix Released    Critical      Bogdan Dobrelya    tripleo stein-1

Bug Description

Gate job: tripleo-ci-centos-7-undercloud-oooq

Copied from a different bug, based on this comment:
https://bugs.launchpad.net/tripleo/+bug/1764777/comments/2

In the most recent report, I found this error in the Mistral logs [3].

2018-10-10 08:03:55.473 26208 ERROR tripleo_common.actions.templates [req-bed99655-45c0-4455-9628-00e97accb2d7 7fff713b28d647d4bb0564dae6a00d32 c5a85f06ef4f47468d7054f618c0febd - default default] Error storing file network/service_net_map.yaml in container overcloud: ClientException: put_object(u'overcloud', u'network/service_net_map.yaml', ...) failure and no ability to reset contents for reupload.
2018-10-10 08:03:55.474 26208 ERROR tripleo_common.actions.templates [req-bed99655-45c0-4455-9628-00e97accb2d7 7fff713b28d647d4bb0564dae6a00d32 c5a85f06ef4f47468d7054f618c0febd - default default] Error occurred while processing custom roles.: Exception: Error storing file network/service_net_map.yaml in container overcloud
2018-10-10 08:03:55.474 26208 ERROR tripleo_common.actions.templates Traceback (most recent call last):
2018-10-10 08:03:55.474 26208 ERROR tripleo_common.actions.templates File "/usr/lib/python2.7/site-packages/tripleo_common/actions/templates.py", line 368, in run
2018-10-10 08:03:55.474 26208 ERROR tripleo_common.actions.templates self._process_custom_roles(context)
2018-10-10 08:03:55.474 26208 ERROR tripleo_common.actions.templates File "/usr/lib/python2.7/site-packages/tripleo_common/actions/templates.py", line 346, in _process_custom_roles
2018-10-10 08:03:55.474 26208 ERROR tripleo_common.actions.templates context=context)
2018-10-10 08:03:55.474 26208 ERROR tripleo_common.actions.templates File "/usr/lib/python2.7/site-packages/tripleo_common/actions/templates.py", line 157, in _j2_render_and_put
2018-10-10 08:03:55.474 26208 ERROR tripleo_common.actions.templates raise Exception(error_msg)
2018-10-10 08:03:55.474 26208 ERROR tripleo_common.actions.templates Exception: Error storing file network/service_net_map.yaml in container overcloud
2018-10-10 08:03:55.474 26208 ERROR tripleo_common.actions.templates

It looks like the upload to Swift failed, and this code isn't written to retry on failure. According to the Swift logs, the failure was caused by a timeout [4]:

Oct 10 08:03:52 centos-7-inap-mtl01-0002810891 proxy-server: ERROR with Object server 192.168.24.1:6000/1 re: Trying to get final status of PUT to /v1/AUTH_c5a85f06ef4f47468d7054f618c0febd/overcloud/network/service_net_map.yaml: Timeout (60.0s) (txn: tx4931d10773dc4044b0baf-005bbdb22c)
Oct 10 08:03:52 centos-7-inap-mtl01-0002810891 proxy-server: Object PUT returning 503 for [503] (txn: tx4931d10773dc4044b0baf-005bbdb22c) (client_ip: 192.168.24.1)
Oct 10 08:03:52 centos-7-inap-mtl01-0002810891 proxy-server: 192.168.24.1 192.168.24.1 10/Oct/2018/08/03/52 PUT /v1/AUTH_c5a85f06ef4f47468d7054f618c0febd/overcloud/network/service_net_map.yaml HTTP/1.0 503 - python-swiftclient-3.5.0 gAAAAABbvbHJ-yJb... 6147 118 - tx4931d10773dc4044b0baf-005bbdb22c - 60.0126 - - 1539158572.454325914...
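For triage, the proxy error line above can be picked apart mechanically. The following is a minimal sketch, not part of tripleo or Swift: the regular expression and the helper names `parse_timeout` and `finished_at` are hypothetical and only match the message shape shown in this report. It also assumes that the epoch value in the access log line (1539158572.454...) is the request start time and that the host logs in UTC; under those assumptions, adding the 60.0126 s duration lands on the 08:03:52 timestamp of the error.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches the proxy-server timeout message shape quoted in this report.
TIMEOUT_RE = re.compile(
    r"ERROR with (?P<server>\w+) server (?P<node>\S+) re: .*?"
    r"Timeout \((?P<seconds>[\d.]+)s\) \(txn: (?P<txn>[\w-]+)\)"
)


def parse_timeout(line):
    """Return server type, node, timeout and transaction id, or None."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    return {
        "server": match.group("server"),
        "node": match.group("node"),
        "seconds": float(match.group("seconds")),
        "txn": match.group("txn"),
    }


def finished_at(start_epoch, duration):
    """Start time (epoch seconds, assumed UTC) plus request duration."""
    start = datetime.fromtimestamp(start_epoch, tz=timezone.utc)
    return start + timedelta(seconds=duration)
```

Calling `finished_at(1539158572.454, 60.0126)` gives a UTC time whose `strftime("%H:%M:%S")` is `08:03:52`, which suggests the PUT started at about 08:02:52 and waited the full 60 seconds before the proxy gave up.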

[1] http://logs.openstack.org/24/608324/1/gate/tripleo-ci-centos-7-undercloud-oooq/0843431/logs/undercloud/home/zuul/undercloud_install.log.txt.gz#_2018-10-10_08_04_20

[2] http://logs.openstack.org/24/608324/1/gate/tripleo-ci-centos-7-undercloud-oooq/0843431/job-output.txt.gz#_2018-10-10_08_04_20_323422

[3] http://logs.openstack.org/24/608324/1/gate/tripleo-ci-centos-7-undercloud-oooq/0843431/logs/undercloud/var/log/mistral/executor.log.txt.gz#_2018-10-10_08_03_55_473

[4] http://logs.openstack.org/24/608324/1/gate/tripleo-ci-centos-7-undercloud-oooq/0843431/logs/undercloud/var/log/swift/swift.log.txt.gz#_Oct_10_08_03_52
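The description notes that the upload code does not retry on failure. As an illustration only, here is a minimal sketch of a retry wrapper around an upload call. It is not the tripleo-common implementation and does not use python-swiftclient: `put_with_retry`, `UploadError` and `FlakyStore` are hypothetical names, and `put` stands for any callable that uploads one object. The payload is passed as bytes so that each attempt can re-send it in full, which sidesteps the "no ability to reset contents for reupload" situation that the error message appears to describe.

```python
import time


class UploadError(Exception):
    """Stand-in for a transient upload failure (for example a proxy 503)."""


def put_with_retry(put, container, name, data, attempts=3, delay=1.0,
                   sleep=time.sleep):
    """Call put(container, name, data) until it succeeds or attempts run out.

    `data` is bytes, so every attempt re-sends the whole payload and
    nothing has to be rewound between tries.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return put(container, name, data)
        except UploadError as exc:
            last_error = exc
            if attempt < attempts:
                sleep(delay * attempt)  # linear backoff between tries
    raise last_error


class FlakyStore:
    """Hypothetical uploader that fails a fixed number of times first."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0
        self.objects = {}

    def put(self, container, name, data):
        self.calls += 1
        if self.calls <= self.failures:
            raise UploadError("503 Service Unavailable")
        self.objects[(container, name)] = data
        return len(data)
```

With `FlakyStore(failures=2)` and `attempts=3`, the third call stores the object. A retry of this kind only helps when the failure is transient, as the later comments suggest it was here; it does not remove the underlying load problem.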

Martin Kopec (mkopec) on 2018-10-10

tags:

added: alert promotion-blocker

Dougal Matthews (d0ugal) on 2018-10-10

Changed in tripleo:
status:	New → Triaged
importance:	Undecided → High

wes hayutin (weshayutin) on 2018-10-10

Changed in tripleo:
milestone:	none → stein-1
importance:	High → Critical

Christian Schwede (cschwede) wrote on 2018-10-10:

Is this a single occurrence, or does it happen repeatedly?

Looking at logfile #4 it seems that the container updater was running at the same time, and once it finished the storage servers (account, container, object) successfully finished storing the object: http://logs.openstack.org/24/608324/1/gate/tripleo-ci-centos-7-undercloud-oooq/0843431/logs/undercloud/var/log/swift/swift.log.txt.gz#_Oct_10_08_04_10

However, the proxy hit the timeout value earlier, so the upload failed from a user perspective.

So my initial guess is that this is a load issue, and because the undercloud runs with a single replica there is no fallback option. Right now I'm just wondering if we need to run the container-updater with a single replica on the undercloud at all - there is likely no benefit, but I need to confirm this.

Marios Andreou (marios-b) wrote on 2018-10-11:

Christian's assessment that this is a load issue makes sense. At least that's what cistatus.tripleo.org tells me: tripleo-ci-centos-7-undercloud-oooq is currently running green and did so all of yesterday. Afaics the only exception is https://review.openstack.org/#/c/608324/, which is pointed to in the description, but it seems to be an isolated case.

I think we should remove the alert and promotion-blocker tags from this and add ci.

tags:

added: ci
removed: alert promotion-blocker

Christian Schwede (cschwede) wrote on 2018-10-12:

There are definitely some Swift containers running on the UC that are not needed with a single replica and that can cause trouble. For example, the DB replicators and auditors will lock the DB, and new objects won't be stored during that time. It looks to me like this is exactly what happened here.

List of containers running on the UC:

[root@undercloud-0 ~]# docker ps --format="{{.Names}}" | grep swift
swift_proxy
swift_container_server
swift_account_server
swift_object_expirer
swift_object_server

swift_container_updater
swift_container_auditor
swift_object_updater
swift_container_replicator
swift_account_auditor
swift_object_replicator
swift_rsync
swift_account_reaper
swift_account_replicator
swift_object_auditor

The auditors and replicators are not required, and just create extra load. In fact these have been disabled on non-containerized UCs: https://github.com/openstack/instack-undercloud/commit/312f42a8c0cc5975e1338225d6b7ca6c9c30e6a8

These should be disabled on the containerized UC as well.
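To make the split concrete, the list above can be filtered by name. This is a small sketch that assumes the container names follow the `swift_<ring>_<daemon>` pattern shown in the `docker ps` output; the helper name `is_auditor_or_replicator` is hypothetical, and the selection reflects only the statement in this comment that auditors and replicators are not required with a single replica.

```python
# Container names copied from the docker ps output in this comment.
CONTAINERS = """
swift_proxy swift_container_server swift_account_server
swift_object_expirer swift_object_server swift_container_updater
swift_container_auditor swift_object_updater swift_container_replicator
swift_account_auditor swift_object_replicator swift_rsync
swift_account_reaper swift_account_replicator swift_object_auditor
""".split()


def is_auditor_or_replicator(name):
    """True for the daemons described above as unnecessary with one replica."""
    return name.endswith(("_auditor", "_replicator"))


# Candidates for disabling, and the containers that would keep running.
not_required = sorted(n for n in CONTAINERS if is_auditor_or_replicator(n))
remaining = sorted(n for n in CONTAINERS if not is_auditor_or_replicator(n))

for name in not_required:
    print(name)
```

On the list above this selects six containers (three auditors and three replicators) and leaves nine running, including the proxy and the account, container and object servers.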

Bogdan Dobrelya (bogdando) on 2018-10-12

Changed in tripleo:
assignee:	nobody → Bogdan Dobrelya (bogdando)
tags:	added: rocky-backport-potential

OpenStack Infra (hudson-openstack) wrote on 2018-10-12: Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.openstack.org/610012

Changed in tripleo:
status:	Triaged → In Progress

OpenStack Infra (hudson-openstack) wrote on 2018-10-12: Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/610017

Marios Andreou (marios-b) wrote on 2018-10-15:

fyi/debug:

https://github.com/willthames/ansible-lint/blob/a5828ff235d83624bcb080af2a5b199a3f4b257f/lib/ansiblelint/rules/NoFormattingInWhenRule.py#L11

& git blame
https://github.com/willthames/ansible-lint/blame/a5828ff235d83624bcb080af2a5b199a3f4b257f/lib/ansiblelint/rules/NoFormattingInWhenRule.py#L11

OpenStack Infra (hudson-openstack) on 2018-10-15

Changed in tripleo:
assignee:	Bogdan Dobrelya (bogdando) → Christian Schwede (cschwede)

Marios Andreou (marios-b) wrote on 2018-10-15:

Sorry, my comment #6 was intended for another bug, my apologies (https://bugs.launchpad.net/tripleo/+bug/1797838 fwiw).

OpenStack Infra (hudson-openstack) on 2018-10-15

Changed in tripleo:
assignee:	Christian Schwede (cschwede) → Bogdan Dobrelya (bogdando)

OpenStack Infra (hudson-openstack) wrote on 2018-10-16: Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/610012
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=47f93e1792a99f2226b5842978cd99dd2d9ef3fb
Submitter: Zuul
Branch: master

commit 47f93e1792a99f2226b5842978cd99dd2d9ef3fb
Author: Bogdan Dobrelya <email address hidden>
Date: Fri Oct 12 13:48:22 2018 +0200

    Disable Swift auditors/replicators on undercloud

    Maintain parity with instack-undercloud
    Ic93082282e9ea481c13832f8ce1265a47f0ef3d5

    Swift is using only a single replica on the undercloud. Therefore
    recovering from a corrupted or lost object is not possible, and running
    replicators and auditors only wastes resources. And may create some
    trouble. For example, the DB replicators and auditors will lock the DB,
    and new objects won't be stored during that time.

    Related-Bug: #1632885
    Closes-Bug: #1797167

    Change-Id: I584cdb03b99721fbdc28bf7f6019d914586341d2
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Changed in tripleo:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2018-10-16: Fix proposed to tripleo-heat-templates (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/610909

OpenStack Infra (hudson-openstack) wrote on 2018-10-17: Related fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/610017
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=1c56834aa801c806a8d32e970b59d344b7b13c59
Submitter: Zuul
Branch: master

commit 1c56834aa801c806a8d32e970b59d344b7b13c59
Author: Bogdan Dobrelya <email address hidden>
Date: Fri Oct 12 13:57:29 2018 +0200

    Use single replica for standalone AIO deployments

    Similarly to undercloud, Swift is using only a single replica on AIO
    (all-in-one standalone). Therefore recovering from a corrupted or lost object
    is not possible, and running replicators and auditors only wastes resources.
    And may create some trouble. For example, the DB replicators and auditors will
    lock the DB, and new objects won't be stored during that time.

    Related-Bug: #1797167

    Change-Id: I839393bf6cbb2303a0359f8aed32b2fc67d46f6a
    Signed-off-by: Bogdan Dobrelya <email address hidden>

OpenStack Infra (hudson-openstack) wrote on 2018-10-19: Fix merged to tripleo-heat-templates (stable/rocky)

Reviewed: https://review.openstack.org/610909
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=9f0975a52c33abcb4ad74a246837b321de3b4ffc
Submitter: Zuul
Branch: stable/rocky

commit 9f0975a52c33abcb4ad74a246837b321de3b4ffc
Author: Bogdan Dobrelya <email address hidden>
Date: Fri Oct 12 13:48:22 2018 +0200

    Disable Swift auditors/replicators on undercloud

    Maintain parity with instack-undercloud
    Ic93082282e9ea481c13832f8ce1265a47f0ef3d5

    Related-Bug: #1632885
    Closes-Bug: #1797167

    Change-Id: I584cdb03b99721fbdc28bf7f6019d914586341d2
    Signed-off-by: Bogdan Dobrelya <email address hidden>
    (cherry picked from commit 47f93e1792a99f2226b5842978cd99dd2d9ef3fb)