Messages for security rules are being sent to the wrong MQ topic; security rules are out of sync

Bug #1814209 reported by Jack Ivanov
This bug affects 6 people
Affects: neutron
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: none

Bug Description

Hello,

We deployed Neutron + OVS + DVR a long time ago.
After upgrading from Ocata to Pike to Queens, we have a problem with security groups. They are all out of sync because messages are being sent to a queue with no consumers (there were some old Ocata consumers, but we turned them off for testing).

Request logs - https://pastebin.com/80BMDLai

The queue q-agent-notifier-security_group-update doesn't have any consumers at all, so the compute nodes never receive the messages and therefore don't update their security rules. Is this queue still used in Rocky?
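
Not part of the original report, but for anyone checking the same symptom: a minimal Python sketch that lists the relevant queues and their consumer counts through the RabbitMQ management HTTP API. It assumes the management plugin is enabled on port 15672 with guest/guest credentials; the controller host name is hypothetical.

    import requests

    # Hypothetical controller host; adjust host, credentials and vhost
    # for your deployment.
    RABBIT_API = "http://controller:15672/api/queues"

    resp = requests.get(RABBIT_API, auth=("guest", "guest"), timeout=10)
    resp.raise_for_status()

    for queue in resp.json():
        name = queue.get("name", "")
        if name.startswith(("q-agent-notifier-security_group",
                            "neutron-vo-SecurityGroupRule")):
            # A consumer count of 0 matches the symptom above: messages
            # are published but nothing is there to read them.
            print(name, "->", queue.get("consumers", 0), "consumer(s)")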

Sometimes I can see messages being sent to neutron-vo-SecurityGroupRule-1.0, and all the compute nodes receive them. It looks like an intermittent problem.

How to reproduce: upgrade sequentially from Ocata to Pike and then to Rocky.

Why might this happen, and how can it be fixed?

If you need any additional information just let me know.

Thanks!

Revision history for this message
Jack Ivanov (gunph1ld) wrote :

Neutron versions:
python-neutron-12.0.4-1.el7.noarch
python2-neutronclient-6.7.0-1.el7.noarch
python2-neutron-lib-1.13.0-1.el7.noarch
openstack-neutron-common-12.0.4-1.el7.noarch
python-neutron-lbaas-12.0.0-1.el7.noarch
openstack-neutron-ml2-12.0.4-1.el7.noarch
openstack-neutron-openvswitch-12.0.4-1.el7.noarch
openstack-neutron-12.0.4-1.el7.noarch
openstack-neutron-lbaas-12.0.0-1.el7.noarch

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

I suspect there may be a missing patch that would explain how you got into this state.

Revision history for this message
Jack Ivanov (gunph1ld) wrote :

What patch?

For some reason, push notifications for SecurityGroupRule are not being sent.

Revision history for this message
masha atakova (matakova) wrote :

I'm also affected by this bug, it seems.

Steps to reproduce:
0. My system was upgraded from Ocata to Pike to Queens; I'm currently running Queens with the same neutron 12.0.4 packages listed above.
1. Start a VM and add it to a security group.
2. After the VM is running on a hypervisor, add or remove a rule in the security group (a hedged API sketch follows this list).
3. Check iptables on the hypervisor and see that the rule isn't present there.
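
Not from the original comment, but a minimal sketch of step 2 using openstacksdk, in case anyone wants to script the reproduction. The cloud name and security group name are placeholders.

    import openstack

    # 'mycloud' is a placeholder entry in clouds.yaml
    conn = openstack.connect(cloud='mycloud')

    # 'test-sg' is a hypothetical security group attached to the running VM
    sg = conn.network.find_security_group('test-sg')

    # Add a rule; per the report, iptables on the hypervisor never picks it
    # up until the ovs-agent is restarted.
    conn.network.create_security_group_rule(
        security_group_id=sg.id,
        direction='ingress',
        ethertype='IPv4',
        protocol='tcp',
        port_range_min=2222,
        port_range_max=2222,
    )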

However, if I restart the ovs-agent on the hypervisor, it correctly applies all the rules from the security group, so the problem only appears after the ovs-agent has been running for more than 5-10 minutes (the exact interval varies).

Currently, I'm forced to restart ovs-agent on the hypervisor each time I update my security group rules as a workaround.

Revision history for this message
Brian Haley (brian-haley) wrote :

I'm confused. Were all of your neutron-server processes restarted? It almost seems like there is one running on an older version and still publishing to the wrong exchange.

Revision history for this message
Pierre Riteau (priteau) wrote :

I've seen the same issue on a deployment that had been upgraded to Queens and was using a cluster of three RabbitMQ servers for messaging.

I captured the traffic between neutron-server and all cluster instances of RabbitMQ. After isolating the connection established by neutron-server to publish an event such as a security group update, I noticed the following pattern:

- neutron-server sends Basic.publish exchange=neutron-vo-<object_type>-1.1_fanout
- RabbitMQ broker sends TCP ACK, acknowledging reception of the message
- [ No more traffic on this connection ]

Since publisher confirms are in use, we would expect to receive a Basic.Ack message from the broker in addition to the TCP ACK. The correct pattern was visible on other message publications.

I traced the code down to the basic_publish_confirm function of the py-amqp library [1]. This function waits until it receives an Ack message, apparently with no timeout. Broken connections are meant to be detected via AMQP heartbeats or TCP keepalive. However, in this case there appeared to be nothing wrong with the TCP connection: there was simply no AMQP Ack from the broker.
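
To make the blocking behaviour concrete, here is a minimal sketch (not from the bug report) of a publish with publisher confirms enabled, following the py-amqp 2.x API referenced in [1]. With confirm_publish=True, basic_publish is routed through basic_publish_confirm and only returns once the broker sends Basic.Ack; the broker address, credentials and exchange name below are assumptions for illustration.

    import amqp

    conn = amqp.Connection(
        host="controller:5672",   # hypothetical broker address
        userid="guest",
        password="guest",
        confirm_publish=True,     # route basic_publish through basic_publish_confirm
    )
    conn.connect()
    channel = conn.channel()
    channel.exchange_declare("neutron-vo-SecurityGroupRule-1.0_fanout",
                             "fanout", durable=False, auto_delete=True)

    # Returns only after the broker acknowledges the message with Basic.Ack;
    # a TCP-level ACK alone is not enough, and there is no timeout here.
    channel.basic_publish(
        amqp.Message(body=b"security group rule update"),
        exchange="neutron-vo-SecurityGroupRule-1.0_fanout",
        routing_key="",
    )
    conn.close()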

Thus, a Neutron API worker would spawn a green thread to publish events, which would stay stuck in the py-amqp library, unable to release the “event-dispatch” lock. Progressively this would extend to all Neutron API workers.

The documentation about Publisher Confirms in Highly Available (Mirrored) Queues [2] points out that a message will only be confirmed to the publisher when it has been accepted by all of the mirrors. This led me to think that one of the mirrors was misbehaving. On one of the RabbitMQ servers, the command `rabbitmqctl node_health_check` was failing after a timeout of 70 seconds, while the two other nodes reported OK right away. After restarting rabbitmq-server on the server that was failing health checks, the problem could not be reproduced anymore.

[1] https://github.com/celery/py-amqp/blob/v2.1.4/amqp/channel.py#L1754
[2] https://www.rabbitmq.com/ha.html#confirms-transactions

Changed in neutron:
status: New → Invalid