magnum stable/xena launching kubernetes cluster

Bug #1979898 reported by Davide De Pasquale
This bug affects 2 people
Affects            Status  Importance  Assigned to  Milestone
OpenStack-Ansible  New     Undecided   Unassigned

Bug Description

Dear all,

I am moving slowly toward a possible new production configuration of my stable/xena cluster.
I have successfully installed the overall ecosystem with openstack-ansible stable/xena and later added several additional services (I have already opened some bugs about what I found).
Yesterday I tried to launch my first Kubernetes cluster using Magnum.

I found an annoying issue during the creation of the kube-master virtual machine: verification fails for the self-signed SSL certificates that were used to install OpenStack.
The cluster is initialized, and the Heat stack is created and initialized. The master node is created and properly initialized with all the software on board. But the stack then hangs until the 60-minute timeout, and cluster creation fails.

After some investigation and reading five-year-old discussions online, I found someone pointing to a possible SSL issue between the master node and Keystone...

During these 60 minutes I can access the master node via SSH (it runs the Fedora CoreOS 35 image I downloaded for this purpose) and run journalctl -xef to monitor what is happening. I find:

Jun 26 12:00:05 c3-upfijbzzjaut-master-0 systemd[1]: Started Hostname Service.
░░ Subject: A start job for unit systemd-hostnamed.service has finished successfully
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ A start job for unit systemd-hostnamed.service has finished successfully.
░░
░░ The job identifier is 1160.
Jun 26 12:00:05 c3-upfijbzzjaut-master-0 audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-hostnamed comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jun 26 12:00:05 c3-upfijbzzjaut-master-0 audit[2532]: USER_START pid=2532 uid=0 auid=1000 ses=1 subj=system_u:system_r:sshd_t:s0-s0:c0.c1023 msg='op=login id=1000 exe="/usr/sbin/sshd" hostname=? addr=192.168.2.6 terminal=ssh res=success'
Jun 26 12:00:05 c3-upfijbzzjaut-master-0 audit[2532]: CRYPTO_KEY_USER pid=2532 uid=0 auid=1000 ses=1 subj=system_u:system_r:sshd_t:s0-s0:c0.c1023 msg='op=destroy kind=server fp=SHA256:d3:4c:a5:d7:0e:5b:c2:6e:3a:8f:84:e4:75:f7:40:97:b9:11:e0:70:e8:b8:36:3e:e9:33:68:3f:22:6d:dd:0c direction=? spid=2571 suid=1000 exe="/usr/sbin/sshd" hostname=? addr=? terminal=? res=success'
Jun 26 12:00:05 c3-upfijbzzjaut-master-0 audit[2532]: USER_END pid=2532 uid=0 auid=1000 ses=1 subj=system_u:system_r:sshd_t:s0-s0:c0.c1023 msg='op=login id=1000 exe="/usr/sbin/sshd" hostname=? addr=192.168.2.6 terminal=ssh res=success'
Jun 26 12:00:12 c3-upfijbzzjaut-master-0 conmon[2386]: Authorization failed: SSL exception connecting to https://10.0.0.10:5000/v3/auth/tokens: HTTPSConnectionPool(host='10.0.0.10', port=5000): Max retries exceeded with url: /v3/auth/tokens (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1123)')))
Jun 26 12:00:12 c3-upfijbzzjaut-master-0 conmon[2386]: Source [heat] Unavailable.
Jun 26 12:00:12 c3-upfijbzzjaut-master-0 podman[2346]: Authorization failed: SSL exception connecting to https://10.0.0.10:5000/v3/auth/tokens: HTTPSConnectionPool(host='10.0.0.10', port=5000): Max retries exceeded with url: /v3/auth/tokens (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1123)')))
Jun 26 12:00:12 c3-upfijbzzjaut-master-0 podman[2346]: Source [heat] Unavailable.
Jun 26 12:00:12 c3-upfijbzzjaut-master-0 podman[2346]: /var/lib/os-collect-config/local-data not found. Skipping
Jun 26 12:00:12 c3-upfijbzzjaut-master-0 conmon[2386]: /var/lib/os-collect-config/local-data not found. Skipping
Jun 26 12:00:28 c3-upfijbzzjaut-master-0 conmon[2386]: Authorization failed: SSL exception connecting to https://10.0.0.10:5000/v3/auth/tokens: HTTPSConnectionPool(host='10.0.0.10', port=5000): Max retries exceeded with url: /v3/auth/tokens (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1123)')))
Jun 26 12:00:28 c3-upfijbzzjaut-master-0 podman[2346]: Authorization failed: SSL exception connecting to https://10.0.0.10:5000/v3/auth/tokens: HTTPSConnectionPool(host='10.0.0.10', port=5000): Max retries exceeded with url: /v3/auth/tokens (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1123)')))
Jun 26 12:00:28 c3-upfijbzzjaut-master-0 podman[2346]: Source [heat] Unavailable.
Jun 26 12:00:28 c3-upfijbzzjaut-master-0 podman[2346]: /var/lib/os-collect-config/local-data not found. Skipping
Jun 26 12:00:28 c3-upfijbzzjaut-master-0 conmon[2386]: Source [heat] Unavailable.
Jun 26 12:00:28 c3-upfijbzzjaut-master-0 conmon[2386]: /var/lib/os-collect-config/local-data not found. Skipping
Jun 26 12:00:35 c3-upfijbzzjaut-master-0 systemd[1]: systemd-hostnamed.service: Deactivated successfully.
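
A note for readers triaging a similar journal: the repeated failures above can be condensed into one line per process and endpoint. The following is only a minimal sketch, assuming the log lines keep the exact shape quoted in this report; the function name `summarize_ssl_failures` is made up for illustration and is not part of Magnum, Heat, or os-collect-config.

```python
import re
from collections import Counter

# Matches the endpoint after "SSL exception connecting to" and the bracketed
# OpenSSL error code plus its human-readable reason, as seen in the log above.
SSL_FAILURE = re.compile(
    r"SSL exception connecting to (?P<url>https?://\S+?):\s.*?"
    r"\[SSL: (?P<code>[A-Z_]+)\] (?P<reason>[^(]+?) \(_ssl\.c:\d+\)"
)

# In journalctl's short output format the process name precedes "[pid]:".
PROCESS = re.compile(r"(\S+)\[\d+\]:")


def summarize_ssl_failures(journal_text):
    """Count SSL failures per (process, url, code, reason) tuple."""
    counts = Counter()
    for line in journal_text.splitlines():
        failure = SSL_FAILURE.search(line)
        if failure is None:
            continue  # not an SSL failure line
        process = PROCESS.search(line)
        key = (
            process.group(1) if process else "?",
            failure["url"],
            failure["code"],
            failure["reason"].strip(),
        )
        counts[key] += 1
    return counts
```

Feeding it the saved output of journalctl would show that every failure here targets the same Keystone token endpoint with the same reason, "unable to get local issuer certificate", which points at a missing CA on the node rather than at a wrong hostname or an expired certificate.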

This confirms that the master node cannot validate the SSL certificate of the Keystone service.
I have also tried to manually install on the machine the same SSL certificates that are used by my HAProxy nodes (HA config), expecting this to unblock the situation, but without any luck.

Has anyone created a Kubernetes cluster with Magnum recently, since Xena was released?
Do you have any workaround for this problem?

I have also tried to find out whether the software making the request is a script that uses curl, but I did not find it. If it is a curl call, I could modify it by simply adding the -k parameter to skip the SSL certificate check.

Thanks for any help.
Davide
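
A note on the -k idea: the `HTTPSConnectionPool` text in the journal comes from Python's urllib3, so the failing client is most likely Python code rather than a curl call. Instead of disabling verification, one can test from the master node whether a given CA bundle actually validates the Keystone endpoint. The sketch below uses only the Python standard library; the function names and the CA file path in the usage note are assumptions for illustration, and nothing here is taken from the heat-container-agent itself.

```python
import socket
import ssl


def make_context(cafile=None):
    """Build a verifying TLS context.

    With cafile=None the system trust store is used; with a path, only the
    certificates in that file are trusted.
    """
    return ssl.create_default_context(cafile=cafile)


def probe_tls(host, port, cafile=None, timeout=5.0):
    """Attempt a verified TLS handshake and return (ok, detail)."""
    context = make_context(cafile)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                return True, tls.version()
    except ssl.SSLCertVerificationError as exc:
        # Same failure class as the one reported in the journal.
        return False, f"verification failed: {exc.verify_message}"
    except OSError as exc:
        # Covers refused connections, timeouts, and other TLS errors.
        return False, f"connection failed: {exc}"
```

For example, `probe_tls("10.0.0.10", 5000, cafile="/path/to/ca.pem")` (the path is hypothetical) should succeed only if that file contains the CA that issued the certificate served on the Keystone endpoint. If it fails with the system store but succeeds with the bundle, the CA is simply not trusted wherever the agent runs; note that the agent runs in a container, which may not share the host's trust store.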

Davide De Pasquale (davidedepasquale) wrote :

I am starting to think this is a Heat problem.
I started a new cluster installation, and when I log in to the master node (kube master VM) I see printed:

[systemd]
Failed Units: 1
  heat-container-agent.service
[core@c1-2ancod3fmckw-master-0 ~]$

and journalctl always shows the same error.

I think this is not related to openstack-ansible.
Does anyone agree with me?

Jonathan Rosser (jrosser) wrote :

Magnum uses Heat to inject a bunch of scripting into the cluster nodes using cloud-init (or perhaps 'ignition' in newer things?).

There will be logs for that first boot cloud-init type stuff, then logs for the scripts that it unpacks and executes to deploy the cluster.

There are also callbacks that the heat agent makes to the heat API to signal that the deployment is complete.

This all needs unpicking from start to end in the logs to see where it's broken. From the logs you've included, I'm not sure which of these stages things are at when the SSL error occurs.

Davide De Pasquale (davidedepasquale) wrote :

Dear Jonathan, you are right.
I have tried to monitor the whole process with multiple logging sessions.
I will try to be more precise in the next few days, when I can invest more time in the topic.
For the moment I can tell you that the Docker software is installed, and everything needed to complete the installation seems to be present inside (Heat config files, agents, Kubernetes...).
I think the process gets stuck communicating with Keystone, before it can tell Heat that the master installation is complete.
I will investigate further and possibly also get in contact with the Magnum dev team for some clarification.

Davide De Pasquale (davidedepasquale) wrote :

Just for the readers: I think I am affected by this:
http://lists.openstack.org/pipermail/openstack-discuss/2020-January/012146.html

It is the same scenario I see on my master node: an empty certificate.
This is probably not caused by an openstack-ansible error.
Regards,
Davide
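
For anyone wanting to confirm the "empty certificate" symptom on their own node, a quick check is to count the PEM certificate blocks in whichever CA file the node is supposed to trust. This is a rough sketch under stated assumptions: the helper name is invented, the file location is deliberately left as a parameter because this thread does not establish where the CA is written, and the regular expression only recognizes standard PEM framing without validating the certificate contents.

```python
import re
from pathlib import Path

# A PEM certificate is base64 text framed by BEGIN/END CERTIFICATE markers.
PEM_CERT = re.compile(
    r"-----BEGIN CERTIFICATE-----\s+[A-Za-z0-9+/=\s]+?-----END CERTIFICATE-----"
)


def count_pem_certificates(path):
    """Return how many PEM certificate blocks the file contains.

    A missing or empty file yields 0, which would match the empty
    certificate scenario described above.
    """
    candidate = Path(path)
    if not candidate.is_file():
        return 0
    return len(PEM_CERT.findall(candidate.read_text(errors="replace")))
```

A result of 0 for the CA file on the master node would suggest that no CA was delivered to the node at all, which is consistent with the "unable to get local issuer certificate" error in the journal.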

Aref (rfak) wrote :

I got the same error.
a1e95625c8d6 Authorization failed: Unable to establish connection to http://PUBLICURL:5000/v3/auth/tokens
a1e95625c8d6 Source [heat] Unavailable.
a1e95625c8d6 /var/lib/os-collect-config/local-data not found. Skipping
a1e95625c8d6 /var/lib/os-collect-config/local-data not found. Skipping
a1e95625c8d6 /var/lib/os-collect-config/local-data not found. Skipping
a1e95625c8d6 /var/lib/os-collect-config/local-data not found. Skipping
a1e95625c8d6 /var/lib/os-collect-config/local-data not found. Skipping
a1e95625c8d6 /var/lib/os-collect-config/local-data not found. Skipping
a1e95625c8d6 /var/lib/os-collect-config/local-data not found. Skipping
a1e95625c8d6 /var/lib/os-collect-config/local-data not found. Skipping

What is the reason for that?
