Comment 3 for bug 1979208

Davide De Pasquale (davidedepasquale) wrote :

Dear Dmitriy,

I have applied your current change proposal. The playbook now runs until the following task, where it appears the etcd cluster (probably the one deployed inside the container, not the infrastructure one) is unable to communicate between all containers.

TASK [systemd_service : Load service] ************************************************************************************************************
failed: [infra1_zun_api_container-ca03ba7a] (item={'service_name': 'etcd', 'service_type': 'notify', 'enabled': True, 'state': 'started', 'execstarts': '/usr/local/bin/etcd', 'config_overrides': {'Unit': {'Description': 'etcd - highly-available key value store', 'Documentation': 'https://github.com/coreos/etcd', 'Wants': {'network-online.target': None}}, 'Service': {'EnvironmentFile': '-/etc/default/%p', 'LimitNOFILE': 65536}}}) => {"ansible_loop_var": "item", "changed": false, "item": {"config_overrides": {"Service": {"EnvironmentFile": "-/etc/default/%p", "LimitNOFILE": 65536}, "Unit": {"Description": "etcd - highly-available key value store", "Documentation": "https://github.com/coreos/etcd", "Wants": {"network-online.target": null}}}, "enabled": true, "execstarts": "/usr/local/bin/etcd", "service_name": "etcd", "service_type": "notify", "state": "started"}, "msg": "Unable to start service etcd.service: Job for etcd.service failed because a timeout was exceeded.\nSee \"systemctl status etcd.service\" and \"journalctl -xe\" for details.\n"}

From the infra1-zun container I checked "journalctl -xe" and collected the following fragment, which seems pertinent to clarifying the error:

Jun 20 22:02:52 infra1-zun-api-container-ca03ba7a etcd[4273]: publish error: etcdserver: request timed out
Jun 20 22:02:53 infra1-zun-api-container-ca03ba7a etcd[4273]: e8c46a04d116fb6 is starting a new election at term 134
Jun 20 22:02:53 infra1-zun-api-container-ca03ba7a etcd[4273]: e8c46a04d116fb6 became candidate at term 135
Jun 20 22:02:53 infra1-zun-api-container-ca03ba7a etcd[4273]: e8c46a04d116fb6 received MsgVoteResp from e8c46a04d116fb6 at term 135
Jun 20 22:02:53 infra1-zun-api-container-ca03ba7a etcd[4273]: e8c46a04d116fb6 [logterm: 1, index: 3] sent MsgVote request to 42216b9f4e65b1c at term 135
Jun 20 22:02:53 infra1-zun-api-container-ca03ba7a etcd[4273]: e8c46a04d116fb6 [logterm: 1, index: 3] sent MsgVote request to 7e13711afbb18a91 at term 135
Jun 20 22:02:55 infra1-zun-api-container-ca03ba7a etcd[4273]: health check for peer 42216b9f4e65b1c could not connect: dial tcp 172.29.236.52:2380: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
Jun 20 22:02:55 infra1-zun-api-container-ca03ba7a etcd[4273]: health check for peer 42216b9f4e65b1c could not connect: dial tcp 172.29.236.52:2380: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")
Jun 20 22:02:55 infra1-zun-api-container-ca03ba7a etcd[4273]: health check for peer 7e13711afbb18a91 could not connect: dial tcp 172.29.236.218:2380: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
Jun 20 22:02:55 infra1-zun-api-container-ca03ba7a etcd[4273]: health check for peer 7e13711afbb18a91 could not connect: dial tcp 172.29.236.218:2380: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")
Jun 20 22:02:55 infra1-zun-api-container-ca03ba7a zun-wsproxy[4345]: 2022-06-20 22:02:55.404 4345 INFO zun.websocket.websocketproxy [-] 172.29.236.11 - - [20/Jun/2022 22:02:55] code 405, message Method Not Allowed
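The unreachable peer endpoints can be pulled out of the journal fragment mechanically. A minimal sketch (two of the log lines above are inlined here; in practice the input would come from `journalctl -u etcd`, and the `extract_peers` helper name is my own):

```shell
#!/bin/sh
# Sketch: list the unique peer endpoints etcd could not reach, based on
# the "could not connect: dial tcp <ip>:<port>" health-check messages.
extract_peers() {
  grep -oE 'dial tcp [0-9.]+:[0-9]+' | awk '{print $3}' | sort -u
}

extract_peers <<'EOF'
etcd[4273]: health check for peer 42216b9f4e65b1c could not connect: dial tcp 172.29.236.52:2380: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
etcd[4273]: health check for peer 7e13711afbb18a91 could not connect: dial tcp 172.29.236.218:2380: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")
EOF
# prints:
# 172.29.236.218:2380
# 172.29.236.52:2380
```

Both endpoints refuse connections on the etcd peer port 2380, which matches the observation below that etcd is not running on the other two containers.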

For completeness: it looks like no etcd service is installed on the infra2-zun and infra3-zun containers.
In fact, apart from the first few tasks, only the infra1-zun container appears in the playbook output up to the point of the error.

I have also tried destroying the zun containers and then re-running setup-hosts, setup-infrastructure (with --limit set to the zun container names) and os-zun-install.yml, but the problem is still there.
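For reference, a sketch of that destroy-and-redeploy sequence, assuming a standard openstack-ansible checkout; ZUN_CONTAINERS is a placeholder for the actual container names from the inventory, not a value taken from my environment:

```shell
# Sketch only: destroy the zun containers and re-run the deployment
# playbooks limited to them. Adjust the limit to the real inventory names.
cd /opt/openstack-ansible/playbooks
openstack-ansible lxc-containers-destroy.yml --limit "$ZUN_CONTAINERS"
openstack-ansible setup-hosts.yml --limit "$ZUN_CONTAINERS"
openstack-ansible setup-infrastructure.yml --limit "$ZUN_CONTAINERS"
openstack-ansible os-zun-install.yml
```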

I hope this helps you identify a possible fix or workaround.
Regards,
Davide