mariadbcheck.socket: Failed to create listening socket

Bug #2003631 reported by Jose Gaitan
This bug affects 1 person
Affects: OpenStack-Ansible
Status: Fix Released
Importance: High
Assigned to: Dmitriy Rabotyagov

Bug Description

Hello,

After rebooting any of the LXC containers of the Galera cluster, HAProxy reports the galera service backend DOWN and it never comes back UP again without manual intervention.

Upon further investigation, we noticed that journalctl -xe --unit=mariadbcheck.socket reports the following:

===
-- Boot 9d7a4cb973fb45d58a63570f39f0b3db --
Jan 21 19:44:51 inf2-mia-galera-container-3b0c3b78 systemd[32]: mariadbcheck.socket: Failed to create listening socket (10.10.37.209:9200): Cannot assign requested address
Jan 21 19:44:51 inf2-mia-galera-container-3b0c3b78 systemd[1]: mariadbcheck.socket: Failed to receive listening socket (10.10.37.209:9200): Input/output error
Jan 21 19:44:51 inf2-mia-galera-container-3b0c3b78 systemd[1]: mariadbcheck.socket: Failed to listen on sockets: Input/output error
Jan 21 19:44:51 inf2-mia-galera-container-3b0c3b78 systemd[1]: mariadbcheck.socket: Failed with result 'resources'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ The unit mariadbcheck.socket has entered the 'failed' state with result 'resources'.
Jan 21 19:44:51 inf2-mia-galera-container-3b0c3b78 systemd[1]: Failed to listen on mariadbcheck socket.
░░ Subject: A start job for unit mariadbcheck.socket has failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ A start job for unit mariadbcheck.socket has finished with a failure.
░░
░░ The job identifier is 54 and the job result is failed.
===

WORKAROUND: Log in to the affected Galera container and manually run "systemctl restart mariadbcheck.socket". This immediately brings the failed socket back up on the container, and HAProxy updates the backend status back to UP.
===
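
The manual workaround could be scripted as a stopgap until a proper fix is available. The following is a minimal Python sketch, not part of OpenStack-Ansible: the function name `restart_if_failed` and its injectable `run` parameter are hypothetical, and it relies only on `systemctl is-failed` and `systemctl restart`. It assumes it is run as root inside the affected container.

```python
import subprocess


def restart_if_failed(unit, run=subprocess.run):
    """Restart `unit` if systemd reports it as failed.

    Returns True when a restart was issued, False otherwise.
    `run` is injectable so the logic can be exercised without systemd.
    """
    # `systemctl is-failed --quiet UNIT` exits with 0 when the unit is failed.
    probe = run(["systemctl", "is-failed", "--quiet", unit], check=False)
    if probe.returncode != 0:
        return False
    # Same command as the manual workaround described above.
    run(["systemctl", "restart", unit], check=True)
    return True
```

Calling `restart_if_failed("mariadbcheck.socket")` shortly after container boot would mirror the manual step. It only masks the symptom and does not address why the socket failed to bind.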

Steps to reproduce:
1. Log in to the Galera LXC container.
2. Reboot the container.

Environment:
-Debian 11 on all hosts.
-OpenStack-Ansible version: stable/zed 26.1.0.dev45
-3x infra nodes and 2x compute hosts.
-HAProxy
-KeepAlived
===

Any suggestions would be appreciated.

Thank you.

Revision history for this message
Dmitriy Rabotyagov (noonedeadpunk) wrote :

Hi, Roger.

Could you check whether this might be related to, and fixed by, https://bugs.launchpad.net/openstack-ansible/+bug/2002653 ?

The fix was applied to stable/zed just a couple of days ago and would require rerunning the bootstrap-ansible.sh script.

Revision history for this message
Jose Gaitan (vchjgaitan) wrote :

Hello Dmitriy,

I appreciate your prompt response to this issue. Unfortunately, I do not think this is related to network.target being removed from the systemd unit. The change from commit If4729eca992a0e647e2f15b3d77ad6300bbf9c12, which removes network.target from the mariadbcheck.socket unit, was merged on January 13, 2023, whereas the OpenStack-Ansible deployment that shows this issue was pulled and bootstrapped from the stable/zed repo on January 20, 2023.

Additionally, after looking at the /etc/systemd/system/mariadbcheck.socket unit file in all three Galera LXC containers, we confirmed that network.target is not included as a dependency, which indicates that the patch from commit If4729eca992a0e647e2f15b3d77ad6300bbf9c12 was already applied to this deployment. See the file content below:

====
root@inf2-mia-galera-container-3b0c3b78:~# cat /etc/systemd/system/mariadbcheck.socket
# Ansible managed

[Unit]
Description=mariadbcheck socket

[Socket]

ListenStream=10.10.37.209:9200
IPAddressDeny=any
IPAddressAllow=10.10.37.132 10.10.37.209 10.10.39.236 10.10.36.11 10.10.36.12 10.10.36.13 127.0.0.1
Accept=yes

[Install]
WantedBy=sockets.target
====

Out of an abundance of caution, we will rerun the bootstrap-ansible.sh script and test again. We will report back the results.

Thank you,

Roger

Revision history for this message
Dmitriy Rabotyagov (noonedeadpunk) wrote :

Ok, according to your output there seems to be no need to re-run bootstrap and the playbooks, as you do indeed seem to have the patch applied already.

I will set up a Debian sandbox shortly to try to reproduce your issue.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ansible-role-systemd_service (master)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to ansible-role-systemd_service (master)

Reviewed: https://review.opendev.org/c/openstack/ansible-role-systemd_service/+/871487
Committed: https://opendev.org/openstack/ansible-role-systemd_service/commit/6a40ec0b85b96e529eb0e3e1e1a1f62cf34d80d2
Submitter: "Zuul (22348)"
Branch: master

commit 6a40ec0b85b96e529eb0e3e1e1a1f62cf34d80d2
Author: Dmitriy Rabotyagov <email address hidden>
Date: Mon Jan 23 16:29:46 2023 +0100

    Ensure daemon is reloaded on socket change

    At the moment our check for whether a socket has changed
    is not valid, since we're checking if the string 'true' is present in
    the list, while the list consists only of boolean values. So we replace
    the map filter with selectattr, as it can apply a truthy test to the
    elements while selecting them, and then check the list length.

    Change-Id: Ib456b4dc2d631bf81633035820444f13ec0f06cb
    Related-Bug: #2003631
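
The faulty check described in this commit message can be illustrated with a plain-Python analogue. This is only a sketch: the role itself uses Jinja2 filters, and the `results` list, its `item` and `changed` keys, and both function names are assumptions made for illustration, loosely modeled on the shape of Ansible loop results.

```python
# Hypothetical per-socket results; only the boolean `changed` flag matters.
results = [
    {"item": "mariadbcheck.socket", "changed": True},
    {"item": "other.socket", "changed": False},
]


def changed_by_string_match(results):
    """Analogue of the old check: extract the flags, then look for 'true'."""
    flags = [r["changed"] for r in results]  # roughly map(attribute='changed')
    # The flags are booleans, so the string 'true' is never found.
    return "true" in flags


def changed_by_truthy_select(results):
    """Analogue of the new check: keep truthy entries and test the length."""
    selected = [r for r in results if r["changed"]]  # roughly selectattr('changed')
    return len(selected) > 0
```

With the sample data, the string-based check reports no change even though one socket changed, which is why the daemon reload would have been skipped, while the truthy selection detects it.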

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ansible-role-systemd_service (stable/zed)

Related fix proposed to branch: stable/zed
Review: https://review.opendev.org/c/openstack/ansible-role-systemd_service/+/871751

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ansible-role-systemd_service (stable/yoga)

Related fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/ansible-role-systemd_service/+/871752

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to ansible-role-systemd_service (stable/zed)

Reviewed: https://review.opendev.org/c/openstack/ansible-role-systemd_service/+/871751
Committed: https://opendev.org/openstack/ansible-role-systemd_service/commit/6551e127f283e202956e0efb89d9e3caed2c4007
Submitter: "Zuul (22348)"
Branch: stable/zed

commit 6551e127f283e202956e0efb89d9e3caed2c4007
Author: Dmitriy Rabotyagov <email address hidden>
Date: Mon Jan 23 16:29:46 2023 +0100

    Ensure daemon is reloaded on socket change

    At the moment our check for whether a socket has changed
    is not valid, since we're checking if the string 'true' is present in
    the list, while the list consists only of boolean values. So we replace
    the map filter with selectattr, as it can apply a truthy test to the
    elements while selecting them, and then check the list length.

    Change-Id: Ib456b4dc2d631bf81633035820444f13ec0f06cb
    Related-Bug: #2003631
    (cherry picked from commit 6a40ec0b85b96e529eb0e3e1e1a1f62cf34d80d2)

tags: added: in-stable-zed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to ansible-role-systemd_service (stable/yoga)

Reviewed: https://review.opendev.org/c/openstack/ansible-role-systemd_service/+/871752
Committed: https://opendev.org/openstack/ansible-role-systemd_service/commit/5c5813f88f651ee64e507458f39cb67ecaf640cc
Submitter: "Zuul (22348)"
Branch: stable/yoga

commit 5c5813f88f651ee64e507458f39cb67ecaf640cc
Author: Dmitriy Rabotyagov <email address hidden>
Date: Mon Jan 23 16:29:46 2023 +0100

    Ensure daemon is reloaded on socket change

    At the moment our check for whether a socket has changed
    is not valid, since we're checking if the string 'true' is present in
    the list, while the list consists only of boolean values. So we replace
    the map filter with selectattr, as it can apply a truthy test to the
    elements while selecting them, and then check the list length.

    Change-Id: Ib456b4dc2d631bf81633035820444f13ec0f06cb
    Related-Bug: #2003631
    (cherry picked from commit 6a40ec0b85b96e529eb0e3e1e1a1f62cf34d80d2)

tags: added: in-stable-yoga
Revision history for this message
Dmitriy Rabotyagov (noonedeadpunk) wrote :

Updating the status here: I haven't been able to find the cause of the reported bug yet. All the patches that were pushed and merged are only related to this bug and likely do not fix it directly.

Revision history for this message
Dmitriy Rabotyagov (noonedeadpunk) wrote :

Ok, I was able to reproduce the issue. It's intermittent and seems to be the result of a race condition during container startup. I still think it's related to https://bugs.launchpad.net/openstack-ansible/+bug/2002653 in one way or another.

root@aio1:/home/debian/openstack-ansible# lxc-stop -n aio1_galera_container-f231c2a5
root@aio1:/home/debian/openstack-ansible# lxc-start -n aio1_galera_container-f231c2a5
root@aio1:/home/debian/openstack-ansible# lxc-attach -n aio1_galera_container-f231c2a5
root@aio1-galera-container-f231c2a5:/# systemctl status mariadbcheck.socket
● mariadbcheck.socket - mariadbcheck socket
     Loaded: loaded (/etc/systemd/system/mariadbcheck.socket; enabled; vendor preset: enabled)
     Active: active (listening) since Thu 2023-02-09 10:17:06 UTC; 3s ago
     Listen: 172.29.239.109:9200 (Stream)
   Accepted: 1; Connected: 0;
      Tasks: 0 (limit: 19191)
     Memory: 4.0K
        CPU: 1ms
     CGroup: /system.slice/mariadbcheck.socket

root@aio1-galera-container-f231c2a5:/# exit
root@aio1:/home/debian/openstack-ansible# lxc-stop -n aio1_galera_container-f231c2a5
root@aio1:/home/debian/openstack-ansible# lxc-start -n aio1_galera_container-f231c2a5
root@aio1:/home/debian/openstack-ansible# lxc-attach -n aio1_galera_container-f231c2a5
root@aio1-galera-container-f231c2a5:/# systemctl status mariadbcheck.socket
● mariadbcheck.socket - mariadbcheck socket
     Loaded: loaded (/etc/systemd/system/mariadbcheck.socket; enabled; vendor preset: enabled)
     Active: failed (Result: resources)
     Listen: 172.29.239.109:9200 (Stream)
   Accepted: 0; Connected: 0;
        CPU: 547us

Feb 09 10:17:20 aio1-galera-container-f231c2a5 systemd[32]: mariadbcheck.socket: Failed to create listening socket (172.29.239.109:9200): Cannot assign requested address
Feb 09 10:17:20 aio1-galera-container-f231c2a5 systemd[1]: mariadbcheck.socket: Failed to receive listening socket (172.29.239.109:9200): Input/output error
Feb 09 10:17:20 aio1-galera-container-f231c2a5 systemd[1]: mariadbcheck.socket: Failed to listen on sockets: Input/output error
Feb 09 10:17:20 aio1-galera-container-f231c2a5 systemd[1]: mariadbcheck.socket: Failed with result 'resources'.
Feb 09 10:17:20 aio1-galera-container-f231c2a5 systemd[1]: Failed to listen on mariadbcheck socket.
root@aio1-galera-container-f231c2a5:/#

Changed in openstack-ansible:
status: New → Confirmed
importance: Undecided → High
Revision history for this message
Dmitriy Rabotyagov (noonedeadpunk) wrote :

Ok, yes, this one is caused by the solution proposed for #2002653. I will work on fixing both of them without creating a third one...

Changed in openstack-ansible:
assignee: nobody → Dmitriy Rabotyagov (noonedeadpunk)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-ansible-galera_server (master)
Changed in openstack-ansible:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-ansible-galera_server (master)

Reviewed: https://review.opendev.org/c/openstack/openstack-ansible-galera_server/+/873334
Committed: https://opendev.org/openstack/openstack-ansible-galera_server/commit/8a8d29ea490fba6695e3356831846466f6991089
Submitter: "Zuul (22348)"
Branch: master

commit 8a8d29ea490fba6695e3356831846466f6991089
Author: Dmitriy Rabotyagov <email address hidden>
Date: Thu Feb 9 22:19:36 2023 +0100

    Allow maridbcheck socket to FreeBind

    Once we removed network.target from the wanted targets for
    mariadbcheck.socket, it started to fail to start up intermittently in
    LXC deployments, since it was trying to bind to an IP address that had
    not been brought up yet. At the same time we can't wait for the IP to
    be up, as OVS, while providing the network, waits for sockets.target
    since it needs ovsdb to be started, so waiting for network.target would
    create a circular dependency.

    To avoid that, we allow the socket to bind to the IP even when the IP
    is not up yet. Another possible solution would be to bind to 0.0.0.0.

    Depends-On: https://review.opendev.org/c/openstack/openstack-ansible/+/872896
    Change-Id: Ia4cde2153813e68419d261cd94e3017523177142
    Closes-Bug: #2003631
    Related-Bug: #2002653
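
The effect of binding before the address is up can be sketched in Python. systemd's `FreeBind=` socket setting sets the Linux `IP_FREEBIND` socket option, and the snippet below sets the same option by hand. It is a minimal illustration and not the role's code: the function name `try_bind` is hypothetical, the fallback value 15 is the Linux constant from `<linux/in.h>` and is an assumption on any other platform, and 192.0.2.10 is a documentation address standing in for an IP that is not yet configured.

```python
import socket

# Not every Python build exposes IP_FREEBIND; 15 is the Linux value
# (an assumption that does not hold on other platforms).
IP_FREEBIND = getattr(socket, "IP_FREEBIND", 15)


def try_bind(address, port=0, freebind=False):
    """Return True if a TCP socket can be bound to (address, port)."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            if freebind:
                # Allow binding to an address that is not (yet) local.
                sock.setsockopt(socket.IPPROTO_IP, IP_FREEBIND, 1)
            sock.bind((address, port))
    except OSError:
        # Includes EADDRNOTAVAIL ("Cannot assign requested address").
        return False
    return True


if __name__ == "__main__":
    print("plain bind:   ", try_bind("192.0.2.10"))
    print("freebind bind:", try_bind("192.0.2.10", freebind=True))
```

On a Linux host where the address is not configured and `net.ipv4.ip_nonlocal_bind` is disabled, the plain bind would be expected to fail with the same "Cannot assign requested address" error seen in the journal, while the free-bind variant should succeed. Binding to 0.0.0.0, the alternative mentioned in the commit message, avoids the problem differently by not depending on any specific address.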

Changed in openstack-ansible:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-ansible-galera_server (stable/zed)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-ansible-galera_server (stable/yoga)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-ansible-galera_server (stable/zed)

Reviewed: https://review.opendev.org/c/openstack/openstack-ansible-galera_server/+/874732
Committed: https://opendev.org/openstack/openstack-ansible-galera_server/commit/f9a8567e61e09e3c6ffd6b8885cb493c6c7a7a70
Submitter: "Zuul (22348)"
Branch: stable/zed

commit f9a8567e61e09e3c6ffd6b8885cb493c6c7a7a70
Author: Dmitriy Rabotyagov <email address hidden>
Date: Thu Feb 9 22:19:36 2023 +0100

    Allow maridbcheck socket to FreeBind

    Once we removed network.target from the wanted targets for
    mariadbcheck.socket, it started to fail to start up intermittently in
    LXC deployments, since it was trying to bind to an IP address that had
    not been brought up yet. At the same time we can't wait for the IP to
    be up, as OVS, while providing the network, waits for sockets.target
    since it needs ovsdb to be started, so waiting for network.target would
    create a circular dependency.

    To avoid that, we allow the socket to bind to the IP even when the IP
    is not up yet. Another possible solution would be to bind to 0.0.0.0.

    Depends-On: https://review.opendev.org/c/openstack/openstack-ansible/+/872896
    Change-Id: Ia4cde2153813e68419d261cd94e3017523177142
    Closes-Bug: #2003631
    Related-Bug: #2002653
    (cherry picked from commit 8a8d29ea490fba6695e3356831846466f6991089)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-ansible-galera_server (stable/yoga)

Reviewed: https://review.opendev.org/c/openstack/openstack-ansible-galera_server/+/874733
Committed: https://opendev.org/openstack/openstack-ansible-galera_server/commit/4acaf657873452e0720a1b3f5ba2f889ab88d96e
Submitter: "Zuul (22348)"
Branch: stable/yoga

commit 4acaf657873452e0720a1b3f5ba2f889ab88d96e
Author: Dmitriy Rabotyagov <email address hidden>
Date: Thu Feb 9 22:19:36 2023 +0100

    Allow maridbcheck socket to FreeBind

    Once we removed network.target from the wanted targets for
    mariadbcheck.socket, it started to fail to start up intermittently in
    LXC deployments, since it was trying to bind to an IP address that had
    not been brought up yet. At the same time we can't wait for the IP to
    be up, as OVS, while providing the network, waits for sockets.target
    since it needs ovsdb to be started, so waiting for network.target would
    create a circular dependency.

    To avoid that, we allow the socket to bind to the IP even when the IP
    is not up yet. Another possible solution would be to bind to 0.0.0.0.

    Depends-On: https://review.opendev.org/c/openstack/openstack-ansible/+/872896
    Change-Id: Ia4cde2153813e68419d261cd94e3017523177142
    Closes-Bug: #2003631
    Related-Bug: #2002653
    (cherry picked from commit 8a8d29ea490fba6695e3356831846466f6991089)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/openstack-ansible-galera_server yoga-eom

This issue was fixed in the openstack/openstack-ansible-galera_server yoga-eom release.
