Messages for security rules are being sent to the wrong MQ topic; security rules are out of sync

Bug #1814209 reported by Jack Ivanov
This bug affects 6 people
Affects: neutron
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: none

Bug Description

Hello,

We deployed Neutron + OVS + DVR a long time ago.
After upgrading from Ocata to Pike to Queens, we have a problem with security groups. They are all out of sync because messages are being sent to a queue with no consumers (there were some old Ocata consumers, but we turned them off for testing).

Request logs - https://pastebin.com/80BMDLai

The queue q-agent-notifier-security_group-update doesn't have any consumers at all, so the compute nodes never receive the messages and therefore don't update their security rules. Is this queue still used in Rocky?
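
Not part of the original report, but for anyone checking the same symptom: a minimal Python sketch that lists the relevant queues and their consumer counts through the RabbitMQ management HTTP API. It assumes the management plugin is enabled on port 15672 with guest/guest credentials; the controller host name is hypothetical.

    import requests

    # Hypothetical controller host; adjust host, credentials and vhost
    # for your deployment.
    RABBIT_API = "http://controller:15672/api/queues"

    resp = requests.get(RABBIT_API, auth=("guest", "guest"), timeout=10)
    resp.raise_for_status()

    for queue in resp.json():
        name = queue.get("name", "")
        if name.startswith(("q-agent-notifier-security_group",
                            "neutron-vo-SecurityGroupRule")):
            # A consumer count of 0 matches the symptom above: messages
            # are published but nothing is there to read them.
            print(name, "->", queue.get("consumers", 0), "consumer(s)")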

Sometimes I can see messages being sent to neutron-vo-SecurityGroupRule-1.0, and all the compute nodes receive them. It looks like an intermittent problem.

How to reproduce: upgrade sequentially from Ocata to Pike and then to Rocky.

Why might this happen, and how can it be fixed?

If you need any additional information just let me know.

Thanks!

Revision history for this message
Jack Ivanov (gunph1ld) wrote :

Neutron versions:
python-neutron-12.0.4-1.el7.noarch
python2-neutronclient-6.7.0-1.el7.noarch
python2-neutron-lib-1.13.0-1.el7.noarch
openstack-neutron-common-12.0.4-1.el7.noarch
python-neutron-lbaas-12.0.0-1.el7.noarch
openstack-neutron-ml2-12.0.4-1.el7.noarch
openstack-neutron-openvswitch-12.0.4-1.el7.noarch
openstack-neutron-12.0.4-1.el7.noarch
openstack-neutron-lbaas-12.0.0-1.el7.noarch

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

I suspect there may be a missing patch that would explain how you got into this state.

Revision history for this message
Jack Ivanov (gunph1ld) wrote :

What patch?

For some reason, push notifications for SecurityGroupRule are not being sent.

Revision history for this message
masha atakova (matakova) wrote :

I'm also affected by this bug, it seems.

Steps to reproduce:
0. My system was upgraded from Ocata to Pike to Queens; I'm currently running Queens with the same neutron 12.0.4 packages listed above.
1. Start a VM and add it to a security group.
2. After the VM is running on a hypervisor, add or remove a rule in the security group (a hedged API sketch follows this list).
3. Check iptables on the hypervisor and see that the rule isn't present there.
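
Not from the original comment, but a minimal sketch of step 2 using openstacksdk, in case anyone wants to script the reproduction. The cloud name and security group name are placeholders.

    import openstack

    # 'mycloud' is a placeholder entry in clouds.yaml
    conn = openstack.connect(cloud='mycloud')

    # 'test-sg' is a hypothetical security group attached to the running VM
    sg = conn.network.find_security_group('test-sg')

    # Add a rule; per the report, iptables on the hypervisor never picks it
    # up until the ovs-agent is restarted.
    conn.network.create_security_group_rule(
        security_group_id=sg.id,
        direction='ingress',
        ethertype='IPv4',
        protocol='tcp',
        port_range_min=2222,
        port_range_max=2222,
    )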

However, if I restart the ovs-agent on the hypervisor, it correctly applies all the rules from the security group, so the problem only appears after the ovs-agent has been running for more than 5-10 minutes (the exact interval varies).

Currently, I'm forced to restart ovs-agent on the hypervisor each time I update my security group rules as a workaround.

Revision history for this message
Brian Haley (brian-haley) wrote :

I'm confused. Were all of your neutron-server processes restarted? It almost seems like there is one running on an older version and still publishing to the wrong exchange.

Revision history for this message
Pierre Riteau (priteau) wrote :

I've seen the same issue on a deployment that had been upgraded to Queens and was using a cluster of three RabbitMQ servers for messaging.

I captured the traffic between neutron-server and all cluster instances of RabbitMQ. After isolating the connection established by neutron-server to publish an event such as a security group update, I noticed the following pattern:

- neutron-server sends Basic.publish exchange=neutron-vo-<object_type>-1.1_fanout
- RabbitMQ broker sends TCP ACK, acknowledging reception of the message
- [ No more traffic on this connection ]

Since publisher confirms are in use, we would expect to receive a Basic.Ack message from the broker in addition to the TCP ACK. The correct pattern was visible on other message publications.

I traced the code down to the basic_publish_confirm function of the py-amqp library [1]. This function waits until it receives an Ack message, apparently with no timeout. Broken connections are meant to be detected via AMQP heartbeats or TCP keepalive. However, in this case there appeared to be nothing wrong with the TCP connection: there was simply no AMQP Ack from the broker.
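
To make the blocking behaviour concrete, here is a minimal sketch (not from the bug report) of a publish with publisher confirms enabled, following the py-amqp 2.x API referenced in [1]. With confirm_publish=True, basic_publish is routed through basic_publish_confirm and only returns once the broker sends Basic.Ack; the broker address, credentials and exchange name below are assumptions for illustration.

    import amqp

    conn = amqp.Connection(
        host="controller:5672",   # hypothetical broker address
        userid="guest",
        password="guest",
        confirm_publish=True,     # route basic_publish through basic_publish_confirm
    )
    conn.connect()
    channel = conn.channel()
    channel.exchange_declare("neutron-vo-SecurityGroupRule-1.0_fanout",
                             "fanout", durable=False, auto_delete=True)

    # Returns only after the broker acknowledges the message with Basic.Ack;
    # a TCP-level ACK alone is not enough, and there is no timeout here.
    channel.basic_publish(
        amqp.Message(body=b"security group rule update"),
        exchange="neutron-vo-SecurityGroupRule-1.0_fanout",
        routing_key="",
    )
    conn.close()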

Thus, a Neutron API worker would spawn a green thread to publish events, which would stay stuck in the py-amqp library, unable to release the “event-dispatch” lock. Progressively this would extend to all Neutron API workers.

The documentation about Publisher Confirms in Highly Available (Mirrored) Queues [2] points out that a message will only be confirmed to the publisher when it has been accepted by all of the mirrors. This led me to think that one of the mirrors was misbehaving. On one of the RabbitMQ servers, the command `rabbitmqctl node_health_check` was failing after a timeout of 70 seconds, while the two other nodes reported OK right away. After restarting rabbitmq-server on the server that was failing health checks, the problem could not be reproduced anymore.

[1] https://github.com/celery/py-amqp/blob/v2.1.4/amqp/channel.py#L1754
[2] https://www.rabbitmq.com/ha.html#confirms-transactions

Changed in neutron:
status: New → Invalid