Killing "backup" amphora before "master" is recovered leads to the fact that topology is not restored

Bug #1703547 reported by Anastasia Kuznetsova
This bug affects 1 person
Affects: octavia
Status: Invalid
Importance: High
Assigned to: Unassigned

Bug Description

Environment: Ocata packages with Octavia

Set the following in etc/octavia.conf:

[controller_worker]
loadbalancer_topology = ACTIVE_STANDBY

[house_keeping]
spare_amphora_pool_size = 1

Steps to reproduce:
1. Check that one (spare) amphora VM was created
2. Create an instance
3. SSH to the instance and start two servers on it
4. Create a load balancer graph with two members and the ROUND_ROBIN algorithm (LB, listener, pool with members)
5. Associate a VIP from the public net
6. Check that one more amphora was created for the LB and that there are "master" and "backup" amphorae for the LB.
The spare pool was refilled by creating one more amphora.
7. Send NUM requests to the floating IP and check that they are shared between the two servers
8. Delete the amphora with the "master" role
9. The "backup" amphora should now balance traffic between the two servers
10. Kill the "backup" amphora

Expected result:
After some time, a new "master" amphora is created from the spare pool, and a new "backup" amphora as well.
The topology is recovered.

Observed result:
After some time, a new "master" amphora is created from the spare pool, but there is no "backup" amphora.
Deleting such an LB fails with an error like "lb is in immutable state".
The Neutron and Octavia databases hold different information about the LB state; see the query sketch below.
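One way to see the divergence directly is to compare the two databases; this is only a sketch, the LB UUID is a placeholder and the default 'octavia' and 'neutron' database names are assumed:

mysql> select id, provisioning_status, operating_status from octavia.load_balancer where id = '<lb-uuid>';
mysql> select id, provisioning_status, operating_status from neutron.lbaas_loadbalancers where id = '<lb-uuid>';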

Tags: auto-abandon
Changed in octavia:
status: New → Incomplete
Revision history for this message
Michael Johnson (johnsom) wrote :

Hello again,

So the Neutron database getting out of sync is a known issue. We are doing two things to address it:
1. We are getting rid of neutron-lbaas so we will no longer have the neutron database to get out of sync.
2. In the interim there is a patch up for review that will put in place a workaround: https://review.openstack.org/#/c/478385/

Can you provide the octavia database status information? Preferably the amphora record, amphora_health record, load balancer record, and listener record?

What I would expect is that the health monitor comes back around, notices that the backup amphora is not healthy, and starts a rebuild on it as well. Having the Octavia database state information would help us debug the situation.
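For reference, those records can be pulled from the Octavia database with queries along these lines (a sketch; the LB UUID is a placeholder):

mysql> select * from amphora where load_balancer_id = '<lb-uuid>';
mysql> select * from amphora_health;
mysql> select * from load_balancer where id = '<lb-uuid>';
mysql> select * from listener where load_balancer_id = '<lb-uuid>';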

Revision history for this message
Anastasia Kuznetsova (akuznetsova) wrote :

The case was reproduced, but the behaviour was different: neither the master nor the backup was restored, just one amphora with no role, which does nothing.
1) initial situation
mysql> select * from amphora;
+--------------------------------------+--------------------------------------+-----------+--------------------------------------+---------------+------------+------------+--------------------------------------+--------------------------------------+------------+---------------------+-----------+----------------+---------+---------------+
| id | compute_id | status | load_balancer_id | lb_network_ip | vrrp_ip | ha_ip | vrrp_port_id | ha_port_id | role | cert_expiration | cert_busy | vrrp_interface | vrrp_id | vrrp_priority |
+--------------------------------------+--------------------------------------+-----------+--------------------------------------+---------------+------------+------------+--------------------------------------+--------------------------------------+------------+---------------------+-----------+----------------+---------+---------------+
| 3af251d8-df87-4812-b907-fb69d2dae3db | 9a58b8df-a5de-49e9-aaad-e02d8ffb9648 | ALLOCATED | 2f88ac8e-ba2c-43ef-aa48-17fda3a200a9 | 192.168.0.4 | 10.20.0.12 | 10.20.0.13 | ddf9ab5c-66c0-4b28-bbaf-b27b47c94fba | c600c3c2-61f1-47e0-a1e7-da4296342990 | STANDALONE | 2019-08-04 10:57:43 | 0 | NULL | 1 | NULL |
| bc11b3f1-ee12-4cc2-9019-bd62dd15dd0a | b7b2be08-055f-4626-a257-5dd31b8d9030 | ALLOCATED | 2f88ac8e-ba2c-43ef-aa48-17fda3a200a9 | 192.168.0.10 | 10.20.0.22 | 10.20.0.13 | 8cfa09c3-0934-41ff-9ca0-2b9541cd8347 | c600c3c2-61f1-47e0-a1e7-da4296342990 | BACKUP | 2019-08-04 10:54:32 | 0 | eth1 | 1 | 90 |
+--------------------------------------+--------------------------------------+-----------+--------------------------------------+---------------+------------+------------+--------------------------------------+--------------------------------------+------------+---------------------+-----------+----------------+---------+---------------+
2 rows in set (0.00 sec)

2) After killing the master and, a little bit later, the backup
mysql> select * from amphora;
+--------------------------------------+--------------------------------------+-----------+--------------------------------------+---------------+------------+------------+--------------------------------------+--------------------------------------+------------+---------------------+-----------+----------------+---------+---------------+
| id | compute_id | status | load_balancer_id | lb_network_ip | vrrp_ip | ha_ip | vrrp_port_id | ha_port_id | role | cert_expiration | cert_busy | vrrp_interface | vrrp_id | vrrp_priority |
+--------------------------------------+--------------------------------------+-----------+--------------------------------------+---------------+------------+------...

Changed in octavia:
status: Incomplete → New
Revision history for this message
Michael Johnson (johnsom) wrote :

The database information provided above is inconsistent. The initial-state output does not have a master amphora. There should be one amphora with the role "MASTER" and one with "BACKUP".

Changed in octavia:
status: New → Incomplete
Changed in octavia:
status: Incomplete → Triaged
importance: Undecided → Critical
Revision history for this message
Michael Johnson (johnsom) wrote :

Ok, I can reproduce this.

I set up an active/standby load balancer with members. I then nova-deleted the master, waited ten seconds, then nova-deleted the backup. This was not enough time for the master failover to start; it was still inside the health manager timeout window before failover begins.

I see in the logs that the master starts the failover flow. Then the health manager decides the backup amp has also failed and starts its failover flow.

When the second flow fails to get a lock on the load balancer (as expected since the master failover has the lock), the revert fails with: TypeError: revert() got an unexpected keyword argument 'flow_failures'.
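For context, this TypeError is what TaskFlow raises when a task overrides revert() without accepting the extra keyword arguments (such as 'flow_failures') that the engine injects during the revert phase. A minimal sketch of the pattern, with a hypothetical task name rather than Octavia's actual failover code:

from taskflow import task

class ExampleFailoverTask(task.Task):  # hypothetical task, for illustration only
    def execute(self, loadbalancer):
        # ... work that may later need to be undone ...
        return loadbalancer

    # Broken revert: the engine calls
    #   revert(loadbalancer=..., result=..., flow_failures=...)
    # so a signature without **kwargs raises
    #   TypeError: revert() got an unexpected keyword argument 'flow_failures'
    #
    # def revert(self, loadbalancer, result):
    #     ...

    # Working revert: accept the injected keywords.
    def revert(self, loadbalancer, result, *args, **kwargs):
        # 'flow_failures' is available in kwargs; undo the work here.
        pass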

The unusual part is that the master failover flow does not appear to complete successfully.

Changed in octavia:
importance: Critical → High
Revision history for this message
Anastasia Kuznetsova (akuznetsova) wrote :

Michael,

Sorry, my fault, an incorrect copy-paste from the database. In 1) there should be info about both amphorae, as you mentioned.

Revision history for this message
Vu Nguyen Duy (CBR09) (nguyenduyvu099) wrote :

Hello all,
What about this issue? How can I fix it? Is there any workaround for it?
Thanks

Revision history for this message
Gregory Thiemonge (gthiemonge) wrote : auto-abandon-script

Abandoned after re-enabling the Octavia launchpad.

Changed in octavia:
status: Triaged → Invalid
tags: added: auto-abandon