Unhandled error - WSREP has not yet prepared node for application use

Bug #1863579 reported by Michal Nasiadka on 2020-02-17
This bug affects 2 people
Affects    Status     Importance  Assigned to  Milestone
neutron    Confirmed  Medium      Unassigned
oslo.db    Confirmed  Undecided   Unassigned

Bug Description

Neutron fails with the following error:
2020-02-14 15:08:21.663 29 CRITICAL neutron [req-b7615e16-e470-4444-8342-86fca030273c - - - - -] Unhandled error: oslo_db.exception.DBError: (pymysql.err.InternalError) (1047, 'WSREP has not yet prepared node for application use')
[SQL: SELECT agents.id AS agents_id, agents.agent_type AS agents_agent_type, agents.`binary` AS agents_binary, agents.topic AS agents_topic, agents.host AS agents_host, agents.availability_zone AS agents_availability_zone, agents.admin_state_up AS agents_admin_state_up, agents.created_at AS agents_created_at, agents.started_at AS agents_started_at, agents.heartbeat_timestamp AS agents_heartbeat_timestamp, agents.description AS agents_description, agents.configurations AS agents_configurations, agents.resource_versions AS agents_resource_versions, agents.`load` AS agents_load, agents.resources_synced AS agents_resources_synced
FROM agents
WHERE agents.admin_state_up IN (%(admin_state_up_1)s)]
[parameters: {'admin_state_up_1': 1}]
(Background on this error at: http://sqlalche.me/e/2j85)
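For context, pymysql surfaces this as MySQL error code 1047. A minimal sketch of classifying it as a transient "node not ready" condition rather than a fatal one; the `InternalError` class here is a stand-in for `pymysql.err.InternalError` (which carries `(errno, message)` in its args), and `is_wsrep_not_ready` is a hypothetical helper, not oslo.db's actual API:

```python
# Galera raises MySQL error 1047 while a node is not yet in the Primary
# component ("WSREP has not yet prepared node for application use").
WSREP_NOT_READY = 1047


class InternalError(Exception):
    """Stand-in for pymysql.err.InternalError, which carries (errno, msg)."""


def is_wsrep_not_ready(exc: Exception) -> bool:
    """Return True if the exception wraps Galera's 'not ready' error."""
    return (
        isinstance(exc, InternalError)
        and len(exc.args) >= 1
        and exc.args[0] == WSREP_NOT_READY
    )


err = InternalError(1047, "WSREP has not yet prepared node for application use")
print(is_wsrep_not_ready(err))  # True: this failure is transient, not fatal
```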

We see this mainly in the kolla-ansible upgrade CI, where we upgrade a single-node Galera instance. oslo.db seems to get stuck reporting the 1047 error on the connection; once the neutron-server container is restarted, it works properly.
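One mitigation in an upgrade job like this would be to wait until the Galera node reports itself ready before letting services reconnect. Galera exposes this through the `wsrep_ready` status variable (`SHOW STATUS LIKE 'wsrep_ready'`). A sketch of the readiness check, assuming rows come back as `(Variable_name, Value)` tuples as a DB-API cursor would return them; the helper name is illustrative:

```python
def galera_ready(status_rows):
    """Decide readiness from SHOW STATUS LIKE 'wsrep_ready' result rows.

    status_rows: iterable of (Variable_name, Value) tuples, e.g. as
    returned by cursor.fetchall() after the SHOW STATUS query.
    """
    status = {name.lower(): value for name, value in status_rows}
    # Galera reports ON only once the node has joined the Primary
    # component and can serve application traffic.
    return status.get("wsrep_ready") == "ON"


print(galera_ready([("wsrep_ready", "ON")]))   # True
print(galera_ready([("wsrep_ready", "OFF")]))  # False
```

A CI playbook could poll this in a loop after the Galera upgrade and only proceed with service restarts once it returns true.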

CI logs of neutron-server: https://zuul.opendev.org/t/openstack/build/bd8c3d336030431686071cb26f2793d2/log/primary/logs/kolla/neutron/neutron-server.txt#4361

Critical event related to DB:
https://zuul.opendev.org/t/openstack/build/bd8c3d336030431686071cb26f2793d2/log/primary/logs/kolla/neutron/neutron-server.txt#4175

After neutron-server is restarted as part of the upgrade job, everything starts working fine:
https://zuul.opendev.org/t/openstack/build/bd8c3d336030431686071cb26f2793d2/log/primary/logs/kolla/neutron/neutron-server.txt#4373

Radosław Piliszek (yoctozepto) wrote :

As far as neutron goes, this happens in the Ml2Plugin method _start_rpc_notifiers:

https://zuul.opendev.org/t/openstack/build/bd8c3d336030431686071cb26f2793d2/log/primary/logs/kolla/neutron/neutron-server.txt#260

2020-02-14 14:45:22.591 6 DEBUG neutron.plugins.ml2.plugin [req-b7615e16-e470-4444-8342-86fca030273c - - - - -] neutron.plugins.ml2.plugin.Ml2Plugin method _start_rpc_notifiers called with arguments () {} wrapper /var/lib/kolla/venv/lib/python3.6/site-packages/oslo_log/helpers.py:66

Since this error message means "I'm not ready, retry in a bit" rather than "I'm broken, die", we would like it to be treated as such.
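The desired behaviour could be sketched as retrying with backoff while the node reports 1047, instead of propagating it as fatal. Everything below is a stand-alone illustration, not oslo.db code: `InternalError` mimics `pymysql.err.InternalError`, and `query_with_retry` is a hypothetical wrapper (in oslo.db the equivalent fix would be recognizing 1047 as a connection-level error so existing retry machinery handles it):

```python
import time

WSREP_NOT_READY = 1047  # Galera's "not yet prepared for application use"


class InternalError(Exception):
    """Stand-in for pymysql.err.InternalError with (errno, msg) args."""


def query_with_retry(execute, max_retries=5, delay=0.01):
    """Retry `execute` while the Galera node reports it is not yet ready."""
    for attempt in range(max_retries):
        try:
            return execute()
        except InternalError as exc:
            retriable = exc.args and exc.args[0] == WSREP_NOT_READY
            if retriable and attempt < max_retries - 1:
                time.sleep(delay * (2 ** attempt))  # exponential backoff
                continue
            raise  # a different error, or retries exhausted: still fatal


# Simulated node that becomes ready after two failed attempts.
calls = {"n": 0}

def fake_execute():
    calls["n"] += 1
    if calls["n"] < 3:
        raise InternalError(
            WSREP_NOT_READY,
            "WSREP has not yet prepared node for application use")
    return "rows"


print(query_with_retry(fake_execute))  # "rows", after two retries
```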

Changed in neutron:
status: New → Confirmed
Changed in oslo.db:
status: New → Confirmed
Akihiro Motoki (amotoki) on 2020-02-17
tags: added: db
Changed in neutron:
importance: Undecided → Medium
tags: added: low-hanging-fruit
Radosław Piliszek (yoctozepto) wrote :

Extra info: I noticed we are getting this randomly when restarting neutron-server. It prevents a clean shutdown because a background thread seems stuck with this error.

The scenario is that the window in which this error could actually have occurred was roughly 15 minutes earlier, yet it is only raised when neutron-server is shut down. neutron-server may already have been impaired before that restart.

Radosław Piliszek (yoctozepto) wrote :

Ignoring it in kolla-ansible CI: https://review.opendev.org/713133

Radosław Piliszek (yoctozepto) wrote :

It seemingly started happening in Ussuri; we are not seeing this behavior on Train.
