Gate job: tripleo-ci-centos-7-undercloud-oooq
Copied from a different bug, based on this comment:
https://bugs.launchpad.net/tripleo/+bug/1764777/comments/2
In the most recent report [1][2], I found this error in the Mistral executor log [3]:
2018-10-10 08:03:55.473 26208 ERROR tripleo_common.actions.templates [req-bed99655-45c0-4455-9628-00e97accb2d7 7fff713b28d647d4bb0564dae6a00d32 c5a85f06ef4f47468d7054f618c0febd - default default] Error storing file network/service_net_map.yaml in container overcloud: ClientException: put_object(u'overcloud', u'network/service_net_map.yaml', ...) failure and no ability to reset contents for reupload.
2018-10-10 08:03:55.474 26208 ERROR tripleo_common.actions.templates [req-bed99655-45c0-4455-9628-00e97accb2d7 7fff713b28d647d4bb0564dae6a00d32 c5a85f06ef4f47468d7054f618c0febd - default default] Error occurred while processing custom roles.: Exception: Error storing file network/service_net_map.yaml in container overcloud
2018-10-10 08:03:55.474 26208 ERROR tripleo_common.actions.templates Traceback (most recent call last):
2018-10-10 08:03:55.474 26208 ERROR tripleo_common.actions.templates File "/usr/lib/python2.7/site-packages/tripleo_common/actions/templates.py", line 368, in run
2018-10-10 08:03:55.474 26208 ERROR tripleo_common.actions.templates self._process_custom_roles(context)
2018-10-10 08:03:55.474 26208 ERROR tripleo_common.actions.templates File "/usr/lib/python2.7/site-packages/tripleo_common/actions/templates.py", line 346, in _process_custom_roles
2018-10-10 08:03:55.474 26208 ERROR tripleo_common.actions.templates context=context)
2018-10-10 08:03:55.474 26208 ERROR tripleo_common.actions.templates File "/usr/lib/python2.7/site-packages/tripleo_common/actions/templates.py", line 157, in _j2_render_and_put
2018-10-10 08:03:55.474 26208 ERROR tripleo_common.actions.templates raise Exception(error_msg)
2018-10-10 08:03:55.474 26208 ERROR tripleo_common.actions.templates Exception: Error storing file network/service_net_map.yaml in container overcloud
2018-10-10 08:03:55.474 26208 ERROR tripleo_common.actions.templates
It looks like the upload to Swift failed, but this code isn't written to retry on failure; a retry sketch follows the log excerpt below. Looking at the Swift logs, the failure was caused by a proxy timeout [4]:
Oct 10 08:03:52 centos-7-inap-mtl01-0002810891 proxy-server: ERROR with Object server 192.168.24.1:6000/1 re: Trying to get final status of PUT to /v1/AUTH_c5a85f06ef4f47468d7054f618c0febd/overcloud/network/service_net_map.yaml: Timeout (60.0s) (txn: tx4931d10773dc4044b0baf-005bbdb22c)
Oct 10 08:03:52 centos-7-inap-mtl01-0002810891 proxy-server: Object PUT returning 503 for [503] (txn: tx4931d10773dc4044b0baf-005bbdb22c) (client_ip: 192.168.24.1)
Oct 10 08:03:52 centos-7-inap-mtl01-0002810891 proxy-server: 192.168.24.1 192.168.24.1 10/Oct/2018/08/03/52 PUT /v1/AUTH_c5a85f06ef4f47468d7054f618c0febd/overcloud/network/service_net_map.yaml HTTP/1.0 503 - python-swiftclient-3.5.0 gAAAAABbvbHJ-yJb... 6147 118 - tx4931d10773dc4044b0baf-005bbdb22c - 60.0126 - - 1539158572.454325914...
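For reference, this is roughly the kind of retry wrapper the template-processing code lacks. It is a minimal sketch, not the actual tripleo-common code: Connection.put_object() and ClientException are real python-swiftclient APIs, while the helper name, attempt count and backoff values are made up here. Note that the "no ability to reset contents for reupload" part of the error means swiftclient could not resend the contents itself, so any retry also has to pass resendable contents (a str/bytes value rather than a half-consumed file object).

import time

from swiftclient.exceptions import ClientException


def put_object_with_retry(swift, container, obj_name, contents,
                          attempts=3, backoff=2.0):
    # swift is a swiftclient.client.Connection. Retry transient
    # failures such as the proxy 503 above, backing off between
    # attempts; contents must be resendable for this to work.
    for attempt in range(1, attempts + 1):
        try:
            return swift.put_object(container, obj_name, contents)
        except ClientException:
            if attempt == attempts:
                raise
            time.sleep(backoff * attempt)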
[1] http://logs.openstack.org/24/608324/1/gate/tripleo-ci-centos-7-undercloud-oooq/0843431/logs/undercloud/home/zuul/undercloud_install.log.txt.gz#_2018-10-10_08_04_20
[2] http://logs.openstack.org/24/608324/1/gate/tripleo-ci-centos-7-undercloud-oooq/0843431/job-output.txt.gz#_2018-10-10_08_04_20_323422
[3] http://logs.openstack.org/24/608324/1/gate/tripleo-ci-centos-7-undercloud-oooq/0843431/logs/undercloud/var/log/mistral/executor.log.txt.gz#_2018-10-10_08_03_55_473
[4] http://logs.openstack.org/24/608324/1/gate/tripleo-ci-centos-7-undercloud-oooq/0843431/logs/undercloud/var/log/swift/swift.log.txt.gz#_Oct_10_08_03_52
Is this a one-off, or does it happen repeatedly?
Looking at logfile [4] it seems that the container updater was running at the same time, and once it finished, the storage servers (account, container, object) successfully finished storing the object: http://logs.openstack.org/24/608324/1/gate/tripleo-ci-centos-7-undercloud-oooq/0843431/logs/undercloud/var/log/swift/swift.log.txt.gz#_Oct_10_08_04_10
However, the proxy hit its timeout earlier, so from the user's perspective the upload failed.
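That can also be confirmed from the client side by polling for the object after the failed PUT. A quick sketch under the same assumptions as above (head_object() and ClientException are real swiftclient APIs; the polling values are arbitrary):

import time

from swiftclient.exceptions import ClientException


def object_eventually_exists(swift, container, obj_name,
                             attempts=6, wait=5.0):
    # head_object() raises ClientException (404) while the object is
    # still missing; return True once the delayed PUT has landed.
    for _ in range(attempts):
        try:
            swift.head_object(container, obj_name)
            return True
        except ClientException:
            time.sleep(wait)
    return False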
So my initial guess is that this is a load issue, and because the undercloud runs with a single replica there is no fallback option. Right now I'm wondering whether we need to run the container-updater on the single-replica undercloud at all - there is likely no benefit, but I need to confirm this.
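To verify the single-replica assumption, the undercloud's ring files can be inspected directly. A sketch assuming the default /etc/swift ring location; swift.common.ring.Ring and its replica_count attribute are part of the Swift codebase:

from swift.common.ring import Ring

# With a single replica, a failed PUT has no alternate object server
# to fall back to, so a proxy timeout surfaces directly to the client.
for name in ('account', 'container', 'object'):
    ring = Ring('/etc/swift', ring_name=name)
    print('%s ring: %s replicas' % (name, ring.replica_count))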