nova DB sync failed due to DNS resolution failure of mysql-router service

Bug #2033680 reported by Bas de Bruijne
This bug affects 3 people
Affects         Status      Importance  Assigned to  Milestone
OpenStack Snap  Incomplete  Low         Unassigned
microk8s        New         Unknown

Bug Description

Test run https://solutions.qa.canonical.com/testruns/1cce9709-db0c-4bb1-80cb-a3fc9652a12c (microstack on jammy, single-node), fails with the following status: https://oil-jenkins.canonical.com/artifacts/1cce9709-db0c-4bb1-80cb-a3fc9652a12c/generated/generated/sunbeam/juju_status_openstack.txt

In the debug-log I see a lot of these messages:
=====
machine-0: 07:11:16 INFO juju.kubernetes.klog Waited for 5.723907577s due to client-side throttling, not priority and fairness, request: GET:https://10.245.130.51:16443/api/v1/namespaces/openstack/pods?labelSelector=app.kubernetes.io%2Fname%3Dcinder-mysql-router
machine-0: 07:11:16 ERROR juju.apiserver.uniter resolving "": lookup : no such host
=====

Also, the nova pod logs show:
=====
(...)
2023-08-31T08:07:25.729Z [nova-scheduler] 2023-08-31 08:07:25.715 96 ERROR nova self.dbapi_connection = connection = pool._invoke_creator(self)
2023-08-31T08:07:25.729Z [nova-scheduler] 2023-08-31 08:07:25.715 96 ERROR nova File "/usr/lib/python3/dist-packages/sqlalchemy/engine/create.py", line 590, in connect
2023-08-31T08:07:25.729Z [nova-scheduler] 2023-08-31 08:07:25.715 96 ERROR nova return dialect.connect(*cargs, **cparams)
2023-08-31T08:07:25.729Z [nova-scheduler] 2023-08-31 08:07:25.715 96 ERROR nova File "/usr/lib/python3/dist-packages/sqlalchemy/engine/default.py", line 597, in connect
2023-08-31T08:07:25.729Z [nova-scheduler] 2023-08-31 08:07:25.715 96 ERROR nova return self.dbapi.connect(*cargs, **cparams)
2023-08-31T08:07:25.729Z [nova-scheduler] 2023-08-31 08:07:25.715 96 ERROR nova File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 353, in __init__
2023-08-31T08:07:25.729Z [nova-scheduler] 2023-08-31 08:07:25.715 96 ERROR nova self.connect()
2023-08-31T08:07:25.729Z [nova-scheduler] 2023-08-31 08:07:25.715 96 ERROR nova File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 664, in connect
2023-08-31T08:07:25.729Z [nova-scheduler] 2023-08-31 08:07:25.715 96 ERROR nova raise exc
2023-08-31T08:07:25.729Z [nova-scheduler] 2023-08-31 08:07:25.715 96 ERROR nova oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on 'nova-api-mysql-router.openstack.svc.cluster.local' ([Errno -2] Name or service not known)")
=====

I'm not sure what the cause of the name resolution errors is.

Logs and configs can be found here: https://oil-jenkins.canonical.com/artifacts/1cce9709-db0c-4bb1-80cb-a3fc9652a12c/index.html

Revision history for this message
James Page (james-page) wrote :

The failure to resolve hostnames within K8s most likely lies in the coredns service within Kubernetes - either it's not working or it's just not up-to-date.

FWIW I have seen this issue once before but I never got to the bottom of it.
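A quick way to sanity-check coredns would be something along these lines (assuming a stock MicroK8s install, where the dns addon runs coredns in kube-system):

=====
# is the coredns pod up and ready?
microk8s kubectl -n kube-system get pods -l k8s-app=kube-dns

# recent coredns logs - failed lookups and upstream timeouts show up here
microk8s kubectl -n kube-system logs deployment/coredns --tail=50

# the Corefile shows which upstream resolvers coredns forwards to
microk8s kubectl -n kube-system get configmap coredns -o yaml
=====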

Revision history for this message
James Page (james-page) wrote :

From the coredns pod logs:

[ERROR] plugin/errors: 2 neutron-mysql-router.openstack.svc.cluster.local.maas. A: read udp 10.1.96.129:44899->10.1.24.3:53: i/o timeout
[INFO] 10.1.96.177:56308 - 46111 "AAAA IN nova-api-mysql-router.openstack.svc.cluster.local.maas. udp 72 false 512" - - 0 2.001306915s
[ERROR] plugin/errors: 2 nova-api-mysql-router.openstack.svc.cluster.local.maas. AAAA: read udp 10.1.96.129:36760->10.1.10.2:53: i/o timeout
[INFO] 10.1.96.177:60337 - 37440 "AAAA IN nova-api-mysql-router.openstack.svc.cluster.local.maas. udp 72 false 512" - - 0 2.000504262s
[ERROR] plugin/errors: 2 nova-api-mysql-router.openstack.svc.cluster.local.maas. AAAA: read udp 10.1.96.129:51755->10.1.10.3:53: i/o timeout
[INFO] 10.1.96.141:57493 - 7682 "A IN neutron-mysql-router.openstack.svc.cluster.local.maas. udp 71 false 512" - - 0 2.001294922s
[ERROR] plugin/errors: 2 neutron-mysql-router.openstack.svc.cluster.local.maas. A: read udp 10.1.96.129:33389->10.1.10.3:53: i/o timeout

Revision history for this message
James Page (james-page) wrote :

That looks like the coredns server does not have an answer for "nova-api-mysql-router.openstack.svc.cluster.local", so it passes the query upstream with the .maas search suffix attached - which then fails due to the I/O timeout against the upstream resolver.
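
One way to confirm that from inside the cluster would be to check the search list a pod actually gets and to run the failing lookup from a throwaway pod (the pod name and image below are just examples):

=====
# the pod's resolv.conf shows the search suffixes injected by kubelet;
# .maas appearing here means the node's search domain is inherited by pods
microk8s kubectl -n openstack run dnstest --rm -it --restart=Never \
    --image=busybox:1.36 -- cat /etc/resolv.conf

# the same lookup the nova pod is failing on
microk8s kubectl -n openstack run dnstest --rm -it --restart=Never \
    --image=busybox:1.36 -- nslookup nova-api-mysql-router.openstack.svc.cluster.local
=====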

Revision history for this message
James Page (james-page) wrote :

Would it be possible to get the output of some other kubectl commands so we can check the information Juju passes to K8s?

kubectl get svc --all-namespaces would be a good start.
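
For completeness, something like the following would also show whether the ClusterIP Services and endpoints created for the mysql-router applications are actually present (the openstack namespace is taken from the logs above):

=====
microk8s kubectl get svc --all-namespaces
microk8s kubectl -n openstack get svc,endpoints
=====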

Changed in snap-openstack:
status: New → Incomplete
summary: - Nova DB sync failed
+ nova DB sync failed due to DNS resolution failure of mysql-router
+ service
Changed in snap-openstack:
importance: Undecided → Low
Changed in microk8s:
status: Unknown → New
Revision history for this message
Marian Gasparovic (marosg) wrote :

Saw the same in ceilometer as well after enabling the telemetry plugin.
I will try to add kubectl get svc --all-namespaces to our log collection.

Revision history for this message
Peter Jose De Sousa (pjds) wrote :

Hitting this issue following the quickstart guide at https://ubuntu.com/openstack/install - my nova unit and cinder-ceph unit are both blocked. The cinder-ceph pod logging does not reveal much, although I do have the failure to resolve MySQL in nova.

Unit Workload Agent Address Ports Message
certificate-authority/0* active idle 10.1.2.199
cinder-ceph-mysql-router/0* active idle 10.1.2.224
cinder-ceph/0* blocked idle 10.1.2.226 (ceph) integration missing
cinder-mysql-router/0* active idle 10.1.2.217
cinder-mysql/0* active idle 10.1.2.229 Primary
cinder/0* active idle 10.1.2.225
glance-mysql-router/0* active idle 10.1.2.221
glance-mysql/0* active idle 10.1.2.220 Primary
glance/0* active idle 10.1.2.237
horizon-mysql-router/0* active idle 10.1.2.228
horizon-mysql/0* active idle 10.1.2.236 Primary
horizon/0* active idle 10.1.2.233
keystone-mysql-router/0* active idle 10.1.2.235
keystone-mysql/0* active idle 10.1.2.210 Primary
keystone/0* active idle 10.1.2.241
neutron-mysql-router/0* active idle 10.1.2.234
neutron-mysql/0* active idle 10.1.2.230 Primary
neutron/0* active idle 10.1.2.242
nova-api-mysql-router/0* active idle 10.1.2.243
nova-cell-mysql-router/0* active idle 10.1.2.238
nova-mysql-router/0* active idle 10.1.2.223
nova-mysql/0* active idle 10.1.2.232 Primary
nova/0* blocked idle 10.1.2.231 (workload) DB sync failed
ovn-central/0* active idle 10.1.2.214
ovn-relay/0* active idle 10.1.2.202
placement-mysql-router/0* active idle 10.1.2.215
placement-mysql/0* active idle 10.1.2.212 Primary
placement/0* active idle 10.1.2.222
rabbitmq/0* active idle 10.1.2.209
traefik-public/0* active idle 10.1.2.206
traefik/0* active idle 10.1.2.218

Offer Application Charm Rev Connected Endpoint Interface Role
cert-distributor keystone keystone-k8s 195 0/0 send-ca-cert certificate_transfer provider
certificate-authority certificate-authority self-signed-certificates 151 0/0 certificates tls-certificates provider
cinder-ceph cinder-ceph cinder-ceph-k8s 77 0/0 ceph-access cinder-ceph-key provider
keystone-credentials keystone keystone-k8s 195 0/0 identity-credentials keystone-credentials provider
keystone-endpoints keystone ...


Revision history for this message
Peter Jose De Sousa (pjds) wrote :

After some investigation: this being a MAAS-deployed machine, the host also has the .maas search domain (and others) configured. I commented those out as a workaround, and it is now unblocked.
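
For reference, the change amounts to dropping the extra search domains from the host's resolver config so they stop being inherited by pod DNS settings - roughly like this (contents below are illustrative, not copied from my machine; on some setups this file is managed by netplan/systemd-resolved, so edit whatever generates it):

=====
# /etc/resolv.conf on the host running the snap
nameserver 10.0.0.2
# search maas    <- commented out as a workaround
=====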
