Killing "backup" amphora before "master" is recovered leads to the fact that topology is not restored

Bug #1703547 reported by Anastasia Kuznetsova
This bug affects 1 person
Affects: octavia
Status: Invalid
Importance: High
Assigned to: Unassigned

Bug Description

Environment: Ocata packages with Octavia

Set the following in etc/octavia.conf:

[controller_worker]
loadbalancer_topology = ACTIVE_STANDBY

[house_keeping]
spare_amphora_pool_size = 1

Steps to reproduce:
1. Check that one (spare) amphora VM was created
2. Create an instance
3. SSH to the instance and start two servers on it
4. Create a load balancer graph with two members and the ROUND_ROBIN algorithm (LB, listener, pool with members)
5. Associate a VIP from the public net
6. Check that one more amphora was created for the LB and that there are "master" and "backup" amphorae for the LB.
The spare pool was refilled by creating one more amphora.
7. Send NUM requests to the floating IP and check that they are shared between the two servers
8. Delete the amphora with the "master" role
9. The "backup" amphora should now balance traffic between the two servers
10. Kill the "backup" amphora

Expected result:
After some time, a new "master" amphora is created from the spare pool, and a new "backup" amphora as well.
The topology is recovered.

Observed result:
After some time, a new "master" amphora is created from the spare pool, but there is no "backup" amphora.
Deleting such an LB fails with an error like "lb is in immutable state".
The Neutron and Octavia databases hold different information about the LB state; see the query sketch below.
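One way to see the divergence directly is to compare the two databases; this is only a sketch, the LB UUID is a placeholder and the default 'octavia' and 'neutron' database names are assumed:

mysql> select id, provisioning_status, operating_status from octavia.load_balancer where id = '<lb-uuid>';
mysql> select id, provisioning_status, operating_status from neutron.lbaas_loadbalancers where id = '<lb-uuid>';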

Tags: auto-abandon
Changed in octavia:
status: New → Incomplete
Revision history for this message
Michael Johnson (johnsom) wrote :

Hello again,

So the Neutron database getting out of sync is a known issue. We are doing two things to address it:
1. We are getting rid of neutron-lbaas so we will no longer have the neutron database to get out of sync.
2. In the interim there is a patch up for review that will put in place a workaround: https://review.openstack.org/#/c/478385/

Can you provide the octavia database status information? Preferably the amphora record, amphora_health record, load balancer record, and listener record?

What I would expect is that the health monitor comes back around, notices that the backup amphora is not healthy, and starts a rebuild on it as well. Having the Octavia database state information would help us debug the situation.
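For reference, those records can be pulled from the Octavia database with queries along these lines (a sketch; the LB UUID is a placeholder):

mysql> select * from amphora where load_balancer_id = '<lb-uuid>';
mysql> select * from amphora_health;
mysql> select * from load_balancer where id = '<lb-uuid>';
mysql> select * from listener where load_balancer_id = '<lb-uuid>';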

Revision history for this message
Anastasia Kuznetsova (akuznetsova) wrote :

The case was reproduced, but the behaviour was different: neither the master nor the backup was restored, just one amphora with no role, which does nothing.
1) initial situation
mysql> select * from amphora;
+--------------------------------------+--------------------------------------+-----------+--------------------------------------+---------------+------------+------------+--------------------------------------+--------------------------------------+------------+---------------------+-----------+----------------+---------+---------------+
| id | compute_id | status | load_balancer_id | lb_network_ip | vrrp_ip | ha_ip | vrrp_port_id | ha_port_id | role | cert_expiration | cert_busy | vrrp_interface | vrrp_id | vrrp_priority |
+--------------------------------------+--------------------------------------+-----------+--------------------------------------+---------------+------------+------------+--------------------------------------+--------------------------------------+------------+---------------------+-----------+----------------+---------+---------------+
| 3af251d8-df87-4812-b907-fb69d2dae3db | 9a58b8df-a5de-49e9-aaad-e02d8ffb9648 | ALLOCATED | 2f88ac8e-ba2c-43ef-aa48-17fda3a200a9 | 192.168.0.4 | 10.20.0.12 | 10.20.0.13 | ddf9ab5c-66c0-4b28-bbaf-b27b47c94fba | c600c3c2-61f1-47e0-a1e7-da4296342990 | STANDALONE | 2019-08-04 10:57:43 | 0 | NULL | 1 | NULL |
| bc11b3f1-ee12-4cc2-9019-bd62dd15dd0a | b7b2be08-055f-4626-a257-5dd31b8d9030 | ALLOCATED | 2f88ac8e-ba2c-43ef-aa48-17fda3a200a9 | 192.168.0.10 | 10.20.0.22 | 10.20.0.13 | 8cfa09c3-0934-41ff-9ca0-2b9541cd8347 | c600c3c2-61f1-47e0-a1e7-da4296342990 | BACKUP | 2019-08-04 10:54:32 | 0 | eth1 | 1 | 90 |
+--------------------------------------+--------------------------------------+-----------+--------------------------------------+---------------+------------+------------+--------------------------------------+--------------------------------------+------------+---------------------+-----------+----------------+---------+---------------+
2 rows in set (0.00 sec)

2) After killing the master and, a little bit later, the backup
mysql> select * from amphora;
+--------------------------------------+--------------------------------------+-----------+--------------------------------------+---------------+------------+------------+--------------------------------------+--------------------------------------+------------+---------------------+-----------+----------------+---------+---------------+
| id | compute_id | status | load_balancer_id | lb_network_ip | vrrp_ip | ha_ip | vrrp_port_id | ha_port_id | role | cert_expiration | cert_busy | vrrp_interface | vrrp_id | vrrp_priority |
+--------------------------------------+--------------------------------------+-----------+--------------------------------------+---------------+------------+------...

Changed in octavia:
status: Incomplete → New
Revision history for this message
Michael Johnson (johnsom) wrote :

The database information provided above is inconsistent. The initial-state output does not have a master amphora. There should be one amphora with the role "MASTER" and one with "BACKUP".

Changed in octavia:
status: New → Incomplete
Changed in octavia:
status: Incomplete → Triaged
importance: Undecided → Critical
Revision history for this message
Michael Johnson (johnsom) wrote :

Ok, I can reproduce this.

I set up an active/standby load balancer with members. I then nova-deleted the master, waited ten seconds, then nova-deleted the backup. This was not enough time for the master failover to start; it was still inside the health manager timeout window before failover begins.

I see in the logs that the master starts the failover flow. Then the health manager decides the backup amp has also failed and starts its failover flow.

When the second flow fails to get a lock on the load balancer (as expected since the master failover has the lock), the revert fails with: TypeError: revert() got an unexpected keyword argument 'flow_failures'.
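For context, this TypeError is what TaskFlow raises when a task overrides revert() without accepting the extra keyword arguments (such as 'flow_failures') that the engine injects during the revert phase. A minimal sketch of the pattern, with a hypothetical task name rather than Octavia's actual failover code:

from taskflow import task

class ExampleFailoverTask(task.Task):  # hypothetical task, for illustration only
    def execute(self, loadbalancer):
        # ... work that may later need to be undone ...
        return loadbalancer

    # Broken revert: the engine calls
    #   revert(loadbalancer=..., result=..., flow_failures=...)
    # so a signature without **kwargs raises
    #   TypeError: revert() got an unexpected keyword argument 'flow_failures'
    #
    # def revert(self, loadbalancer, result):
    #     ...

    # Working revert: accept the injected keywords.
    def revert(self, loadbalancer, result, *args, **kwargs):
        # 'flow_failures' is available in kwargs; undo the work here.
        pass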

The unusual part is that the master failover flow does not appear to complete successfully.

Changed in octavia:
importance: Critical → High
Revision history for this message
Anastasia Kuznetsova (akuznetsova) wrote :

Michael,

Sorry, my fault, an incorrect copy-paste from the database. In 1) there should be info about both amphorae, as you mentioned.

Revision history for this message
Vu Nguyen Duy (CBR09) (nguyenduyvu099) wrote :

Hello all,
What about this issue? How can I fix it? Is there any workaround for it?
Thanks

Revision history for this message
Gregory Thiemonge (gthiemonge) wrote : auto-abandon-script

Abandoned after re-enabling the Octavia launchpad.

Changed in octavia:
status: Triaged → Invalid
tags: added: auto-abandon