tripleo-ci-centos-8-scenario001-standalone jobs are failing overcloud deploy on master - "ERROR gnocchi.utils: Unable to initialize coordination driver"

Bug #1946045 reported by Ronelle Landy
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned
Milestone: xena-3

Bug Description

tripleo-ci-centos-8-scenario001-standalone jobs running on master are failing the standalone deployment:

2021-10-04 19:55:53,642 [35] ERROR gnocchi.utils: Unable to initialize coordination driver
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/tooz/drivers/redis.py", line 58, in wrapper
    return func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/tooz/drivers/redis.py", line 476, in _start
    self._server_info = self._client.info()
  File "/usr/lib/python3.6/site-packages/redis/client.py", line 1304, in info
    return self.execute_command('INFO')
  File "/usr/lib/python3.6/site-packages/redis/client.py", line 898, in execute_command
    conn = self.connection or pool.get_connection(command_name, **options)
  File "/usr/lib/python3.6/site-packages/redis/connection.py", line 1192, in get_connection
    connection.connect()
  File "/usr/lib/python3.6/site-packages/redis/connection.py", line 567, in connect
    self.on_connect()
  File "/usr/lib/python3.6/site-packages/redis/connection.py", line 643, in on_connect
    auth_response = self.read_response()
  File "/usr/lib/python3.6/site-packages/redis/connection.py", line 739, in read_response
    response = self._parser.read_response()
  File "/usr/lib/python3.6/site-packages/redis/connection.py", line 324, in read_response
    raw = self._buffer.readline()
  File "/usr/lib/python3.6/site-packages/redis/connection.py", line 256, in readline
    self._read_from_socket()
  File "/usr/lib/python3.6/site-packages/redis/connection.py", line 201, in _read_from_socket
    raise ConnectionError(SERVER_CLOSED_CONNECTION_ERROR)
redis.exceptions.ConnectionError: Connection closed by server.
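
For context on where this traceback originates: gnocchi initializes its coordination driver through tooz, roughly as in the sketch below. This is illustrative only; the connection URL and member id are assumptions, not values taken from the job.

```python
# Illustrative sketch only -- the URL and member id are hypothetical placeholders.
from tooz import coordination

coordinator = coordination.get_coordinator(
    'redis://:secret@192.168.24.3:6379',  # coordination URL, here pointing at a proxy/VIP
    b'gnocchi-metricd-example')
# start() opens the redis connection; the ConnectionError above surfaces from here
coordinator.start(start_heart=True)
```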

Full logs are below:

https://ffed76e78c0c5588d273-984ca498bb42ba0203f9a104f5aecd9c.ssl.cf1.rackcdn.com/810948/3/gate/tripleo-ci-centos-8-scenario001-standalone/c89650e/logs/undercloud/var/log/containers/gnocchi/gnocchi-metricd.log

https://ffed76e78c0c5588d273-984ca498bb42ba0203f9a104f5aecd9c.ssl.cf1.rackcdn.com/810948/3/gate/tripleo-ci-centos-8-scenario001-standalone/c89650e/logs/undercloud/home/zuul/standalone_deploy.log

The failure started on 10/04:

https://zuul.opendev.org/t/openstack/builds?job_name=tripleo-ci-centos-8-scenario001-standalone

Ronelle Landy (rlandy)
Changed in tripleo:
milestone: none → xena-3
importance: Undecided → Critical
status: New → Triaged
tags: added: ci promotion-blocker
Revision history for this message
Michele Baldessari (michele) wrote :

Seems to be a problem with haproxy believing redis is down.

Redis itself is up:
187:M 04 Oct 2021 21:03:21.235 * MASTER MODE enabled (user request from 'id=13 addr=/var/run/redis/redis.sock:0 fd=8 name= age=0 idle=0 flags=U db=0 sub=0 psub=0 multi=-1 qbuf=34 qbuf-free=32734 obl=0 oll=0 omem=0 events=r cmd=slaveof')

tcp 0 0 192.168.24.1:6379 0.0.0.0:* LISTEN 90388/redis-server
tcp 0 0 192.168.24.3:6379 0.0.0.0:* LISTEN 79003/haproxy

But haproxy believes it is down:
Oct 4 21:11:31 standalone haproxy[12]: 192.168.24.3:57834 [04/Oct/2021:21:11:31.576] redis redis/<NOSRV> -1/-1/0 0 SC 7/1/0/0/0 0/0
Oct 4 21:11:31 standalone haproxy[12]: 192.168.24.3:57844 [04/Oct/2021:21:11:31.882] redis redis/<NOSRV> -1/-1/0 0 SC 7/1/0/0/0 0/0

The haproxy config is:
frontend redis
  bind 192.168.24.3:6379 transparent
  option tcplog
backend redis_be
  balance first
  tcp-check connect port 6379
  tcp-check send AUTH\ nHyR7JmXmTWzrA0UFH3QQRuVR\r\n
  tcp-check expect string +OK
  tcp-check send PING\r\n
  tcp-check expect string +PONG
  tcp-check send info\ replication\r\n
  tcp-check expect string role:master
  tcp-check send QUIT\r\n
  tcp-check expect string +OK
  option tcp-check
  server standalone.ctlplane.localdomain 192.168.24.1:6379 check fall 5 inter 2000 on-marked-down shutdown-sessions rise 2

Which looks fairly sane. I'll debug this with Damien today.
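
For anyone reproducing this by hand, the two paths can be compared directly with redis-cli (a rough sketch; <password> stands for the AUTH value in the tcp-check lines above):

```
# Hypothetical manual check: query redis directly, then through the haproxy frontend.
redis-cli -h 192.168.24.1 -p 6379 -a '<password>' ping   # direct to the backend, expected to answer PONG
redis-cli -h 192.168.24.3 -p 6379 -a '<password>' ping   # through haproxy, presumably the path gnocchi uses
```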

Revision history for this message
Michele Baldessari (michele) wrote :

Interestingly, it seems to work just fine on a master redis HA tls-e overcloud. We'll deploy a standalone. If we have a means to hold a broken env for a bit, that would be grand too.

Revision history for this message
Bhagyashri Shewale (bhagyashri-shewale) wrote :

Hi All,

This issue is also failing the RDO periodic sc001 and sc002 master jobs [1].

The sc003 and sc004 jobs are also failing; that looks related to this one, but with a different log trace, pasted below.

Trace 1: (sc004 and sc001 are failing with this trace)

```
2021-10-05 07:26:37.543542 | fa163e8e-bc9b-cf6c-9490-00000000112d | FATAL | Container image prepare | undercloud | error={"changed": false, "error": "'config'", "msg": "Error running container image prepare: 'config'", "params": {}, "success": false}
2021-10-05 07:26:37.548548 | fa163e8e-bc9b-cf6c-9490-00000000112d | TIMING | tripleo_container_image_prepare : Container image prepare | undercloud | 0:10:23.737086 | 315.05s

PLAY RECAP *********************************************************************
localhost : ok=1 changed=0 unreachable=0 failed=0 skipped=2 rescued=0 ignored=0
standalone : ok=216 changed=115 unreachable=0 failed=0 skipped=65 rescued=0 ignored=0
undercloud : ok=24 changed=6 unreachable=0 failed=1 skipped=1 rescued=0 ignored=0

```

Trace 2: (sc003 is failing with this trace)

```
INFO:__main__:Setting permission for /var/log/designate/designate-manage.log
++ cat /run_command
+ CMD='/usr/bin/bootstrap_host_exec designate_central su designate -s /bin/bash -c '\''/bin/designate-manage pool update'\'''
+ ARGS=
+ [[ ! -n '' ]]
+ . kolla_extend_start
+ echo 'Running command: '\''/usr/bin/bootstrap_host_exec designate_central su designate -s /bin/bash -c '\''/bin/designate-manage pool update'\'''\'''
+ exec /usr/bin/bootstrap_host_exec designate_central su designate -s /bin/bash -c ''\''/bin/designate-manage' pool 'update'\'''
2021-10-05 07:38:19.367117 | fa163e1f-163f-164c-cb19-0000000026e7 | FATAL | Create containers managed by Podman for /var/lib/tripleo-config/container-startup-config/step_5 | standalone | error={"changed": false, "msg": "Failed containers: designate_pool_manage"}
2021-10-05 07:38:19.368806 | fa163e1f-163f-164c-cb19-0000000026e7 | TIMING | tripleo_container_manage : Create containers managed by Podman for /var/lib/tripleo-config/container-startup-config/step_5 | standalone | 0:26:06.866815 | 28.31s

PLAY RECAP *********************************************************************
localhost : ok=1 changed=0 unreachable=0 failed=0 skipped=2 rescued=0 ignored=0
standalone : ok=424 changed=221 unreachable=0 failed=1 skipped=139 rescued=0 ignored=0
undercloud : ok=84 changed=32 unreachable=0 failed=0 skipped=11 rescued=0 ignored=0
2021-10-05 07:38:19.427101 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Summary Information ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2021-10-05 07:38:19.427764 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Total Tasks: 687 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2021-10-05 07:38:19.428496 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Elapsed Time: 0:26:06.926550 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

```

[1]: https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-8-scenario001-standalone-master&job_name=perio...


Revision history for this message
Damien Ciabrini (dciabrin) wrote :

Sadly, when the original review that enabled the new haproxy config got merged, the gate didn't exercise sc001 or sc004, so we didn't spot anything ahead of time.
I'm deploying a standalone env locally with redis enabled; this should tell us what is wrong and what the best way forward is to fix upstream CI.
More on that later today.

Revision history for this message
Takashi Kajinami (kajinamit) wrote (last edit ):
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (master)
Changed in tripleo:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (master)

Change abandoned by "Takashi Kajinami <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/812461
Reason: The change in puppet-tripleo triggers the sc001 job, which has redis enabled.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (master)
Revision history for this message
Damien Ciabrini (dciabrin) wrote :

I think Takashi's patch from comment #5 should fix the proxy issue.
Coincidentally, Michele found the exact same thing [1] on our local deployment not too long ago and it fixed the redis proxy for us.

[1] https://review.opendev.org/c/openstack/puppet-tripleo/+/812477
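
For clarity, the proposed fix boils down to adding the missing default_backend directive to the frontend; an illustrative version of the corrected config quoted above (not the exact generated output) would look like:

```
frontend redis
  bind 192.168.24.3:6379 transparent
  option tcplog
  # without this, haproxy has no backend to route frontend traffic to, hence "redis/<NOSRV>" in the logs
  default_backend redis_be
```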

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on puppet-tripleo (master)

Change abandoned by "Michele Baldessari <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/puppet-tripleo/+/812477
Reason: https://review.opendev.org/c/openstack/puppet-tripleo/+/812457

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Alex Schultz <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/puppet-tripleo/+/812457
Reason: this is going to fail in the gate until we get the fix for scenario001/004 landed. Will restore in a sec.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (master)

Reviewed: https://review.opendev.org/c/openstack/puppet-tripleo/+/812457
Committed: https://opendev.org/openstack/puppet-tripleo/commit/12f14d95bb1ea4bd78e96f1f350b4bf9e2d38c3a
Submitter: "Zuul (22348)"
Branch: master

commit 12f14d95bb1ea4bd78e96f1f350b4bf9e2d38c3a
Author: Takashi Kajinami <email address hidden>
Date: Tue Oct 5 17:53:44 2021 +0900

    haproxy: Add missing default_backend option

    This change ensures the default_backend option is included in all
    frontends, to solve the issue caused by unavailable backend service.

    Closes-Bug: #1946045
    Change-Id: I4856d33e6eaf5ca0fbfd9c4f33bd0d9433d107fa

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/puppet-tripleo/+/812567

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/puppet-tripleo/+/812567
Committed: https://opendev.org/openstack/puppet-tripleo/commit/cf44010ff6c69e62328785beaf10c8cf33bdf3b5
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit cf44010ff6c69e62328785beaf10c8cf33bdf3b5
Author: Takashi Kajinami <email address hidden>
Date: Tue Oct 5 17:53:44 2021 +0900

    haproxy: Add missing default_backend option

    This change ensures the default_backend option is included in all
    frontends, to solve the issue caused by unavailable backend service.

    Closes-Bug: #1946045
    Change-Id: I4856d33e6eaf5ca0fbfd9c4f33bd0d9433d107fa
    (cherry picked from commit 12f14d95bb1ea4bd78e96f1f350b4bf9e2d38c3a)

tags: added: in-stable-wallaby
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 16.0.0

This issue was fixed in the openstack/puppet-tripleo 16.0.0 release.
