Octavia active/standby config + pool with SOURCE_IP session persistence: service is unavailable and LB is not deleted after test

Bug #1690812 reported by Alex Stafeyev
Affects: octavia
Status: Invalid
Importance: Critical
Assigned to: Unassigned

Bug Description

Description of problem:
I run Octavia (RHEL/Ubuntu amphora) in ACTIVE_STANDBY mode.
The test is the neutron-lbaas session persistence scenario test.

Version: Ocata

How reproducible:
100%
Steps to Reproduce:
1. Deploy a setup with Octavia support (preferably with 2 compute nodes) and run all needed post-deployment steps.
https://review.openstack.org/#/c/447496/

2. Run the following neutron-lbaas tempest test:
https://github.com/openstack/neutron-lbaas/blob/master/neutron_lbaas/tests/tempest/v2/scenario/test_session_persistence.py

Actual results:
The test fails and the LB cannot be deleted (please see the attached log file).
A curl to the VIP address returns 503 Service Unavailable. (With the Octavia SINGLE topology this is not reproduced.)

Expected results:
The test should pass. With the SINGLE (as opposed to ACTIVE_STANDBY) Octavia topology it runs well.

Tags: auto-abandon
Revision history for this message
Alex Stafeyev (astafeye) wrote :

Ubuntu amphora

Revision history for this message
Alex Stafeyev (astafeye) wrote :

More investigation logs

Revision history for this message
Nir Magnezi (nmagnezi) wrote :

Debugging shows that this is indeed an issue, probably in how Octavia configures HAProxy.

Looking at the logs inside the amphora instance, I noticed the following:

host-192-168-199-59 haproxy: [ALERT] 134/082747 (11416) : Proxy 'cce34da1-7b4d-4659-bb0b-6cf01ffbcd68': unable to find local peer 'amphora-b8928a25-ca71-4389-8753-6ab3b2fb3d2c.localdomain' in peers section '9c530de5653d474181b73fe70c398ad5_peers'.
host-192-168-199-59 haproxy: [ALERT] 134/082747 (11416) : Fatal errors found in configuration.
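For context on that alert: haproxy requires that its own local peer name (the system hostname, unless overridden with the "-L" flag) match one of the entries in the peers section, otherwise it aborts at startup. A minimal sketch of a valid peers section (the peer names and addresses here are illustrative, not taken from this deployment):

```
# haproxy.cfg fragment (illustrative names/addresses).
# haproxy looks up its local peer name -- the hostname, or the value
# passed via "haproxy -L <name>" -- in this section; if no entry
# matches, it fails with "unable to find local peer ...".
peers example_peers
    peer amphora-one.localdomain 10.0.0.6:1025
    peer amphora-two.localdomain 10.0.0.14:1025
```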

Revision history for this message
Michael Johnson (johnsom) wrote :

The first issue I see is a python 3 bug:

  File "/usr/local/lib/python3.5/dist-packages/octavia/amphorae/backends/agent/api_server/listener.py", line 230, in start_stop_listener
    if 'Job is already running' not in e.output:
TypeError: a bytes-like object is required, not 'str'

That comparison "if 'Job is already running' not in e.output:" should probably be "if b'Job is already running' not in e.output:"

That should fix the first issue. I confirmed that it has not yet been fixed.
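The pitfall above can be reproduced in isolation. A minimal sketch (the shell command is a stand-in for the real haproxy/systemctl invocation) showing that `CalledProcessError.output` is bytes under Python 3, so the comparison needs a bytes literal:

```python
import subprocess

try:
    # Stand-in for the real command: write the message to stderr and fail.
    subprocess.check_output(
        ["sh", "-c", "echo 'Job is already running' >&2; exit 1"],
        stderr=subprocess.STDOUT,
    )
    already_running = False
except subprocess.CalledProcessError as e:
    # Under Python 3, e.output is bytes, so
    #   'Job is already running' not in e.output
    # raises TypeError. The fix is a bytes literal:
    already_running = b"Job is already running" in e.output
```

On Python 2 the str comparison happened to work because `e.output` was a str, which is why the bug only surfaced on Python 3 amphorae.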

Changed in octavia:
status: New → Triaged
importance: Undecided → Critical
Revision history for this message
Michael Johnson (johnsom) wrote :

Issue two: the peers configuration is incorrect. I think you hit the bug reported here: https://launchpad.net/bugs/1681623

There is a patch here: https://review.openstack.org/#/c/455569
I have been struggling with that patch as I could not reproduce, but I think you found the magic combination.

We will move forward with that bug/patch for your second issue.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to octavia (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/480919

Revision history for this message
Nir Magnezi (nmagnezi) wrote :

Michael, thanks for looking into this.
I submitted a tiny patch based on what you wrote in comment #4.
I'll test both this and the patch submitted against bug 1681623 and report my findings.

Revision history for this message
Nir Magnezi (nmagnezi) wrote :

Looks like the fix for bug 1681623 did not resolve this bug.

Using loadbalancer_topology = ACTIVE_STANDBY and tested with the following pool configuration:
neutron lbaas-pool-create --lb-algorithm ROUND_ROBIN --listener listener1 --protocol HTTP --session-persistence type=SOURCE_IP
Created a new pool:
+---------------------+------------------------------------------------+
| Field | Value |
+---------------------+------------------------------------------------+
| admin_state_up | True |
| description | |
| healthmonitor_id | |
| id | b084ed49-038b-45dc-9b4b-8a277f60ba5b |
| lb_algorithm | ROUND_ROBIN |
| listeners | {"id": "8ac3b6b3-680e-4a58-b51d-883283a3caf1"} |
| loadbalancers | {"id": "d9800fd6-f010-4540-8d41-ac24ae325cc2"} |
| members | |
| name | |
| protocol | HTTP |
| session_persistence | {"cookie_name": null, "type": "SOURCE_IP"} |
| tenant_id | d67bee545d534850aedfbe77da709c68 |
+---------------------+------------------------------------------------+

The Octavia Worker shows the following exception:
https://paste.fedoraproject.org/paste/vpzv0KnOjaqYdbtf5K9Bng

Digging inside the amphora VMs, I noticed the following.

One amphora managed to spawn haproxy with no errors; this appears to be the standby amphora.

haproxy.cfg:
============
root@amphora-e643829b-4de3-4ebf-a96c-0bef10389f6f:~# cat /var/lib/octavia/8ac3b6b3-680e-4a58-b51d-883283a3caf1/haproxy.cfg
# Configuration for lb_nir
global
    daemon
    user nobody
    log /dev/log local0
    log /dev/log local1 notice
    stats socket /var/lib/octavia/8ac3b6b3-680e-4a58-b51d-883283a3caf1.sock mode 0666 level user

defaults
    log global
    retries 3
    option redispatch
    timeout connect 5000
    timeout client 50000
    timeout server 50000

peers 8ac3b6b3680e4a58b51d883283a3caf1_peers
    peer yZB0PtEhlFqOwQHLLY3Zj9U_QAg 10.0.0.6:1025
    peer C2BTWaiZ2FW-oF4uOy7c2LeC0mU 10.0.0.14:1025

frontend 8ac3b6b3-680e-4a58-b51d-883283a3caf1
    option httplog
    bind 10.0.0.9:80
    mode http

haproxy log:
============
cat /var/log/haproxy.log
Jul 11 11:11:06 amphora-e643829b-4de3-4ebf-a96c-0bef10389f6f haproxy[1723]: Proxy 8ac3b6b3-680e-4a58-b51d-883283a3caf1 started.
Jul 11 11:11:06 amphora-e643829b-4de3-4ebf-a96c-0bef10389f6f haproxy[1723]: Proxy 8ac3b6b3-680e-4a58-b51d-883283a3caf1 started.

The second amphora fails to spawn haproxy; this appears to be the active amphora:

haproxy.cfg:
============
root@amphora-6ca8b5ae-3185-48b0-9615-9bf8d5df30f0:~# cat /var/lib/octavia/8ac3b6b3-680e-4a58-b51d-883283a3caf1/haproxy.cfg
# Configuration for lb_nir
global
    daemon
    user nobody
    log /dev/log local0
    log /dev/log local1 notice
    stats socket /var/lib/octavia/8ac3b6b3-680e-4a58-b51d-883283a3caf1...


Revision history for this message
Michael Johnson (johnsom) wrote :

Can you confirm that you have a newly built image with the fix from https://review.openstack.org/#/c/455569 ?

It is behaving as if the "-L" command-line flag is missing, which is exactly what that fix addressed.

The manual run below is also missing the -L flag.

/usr/sbin/haproxy -f /var/lib/octavia/8ac3b6b3-680e-4a58-b51d-883283a3caf1/haproxy.cfg -f /var/lib/octavia/haproxy-default-user-group.conf -c -q

See these lines in the source:
https://github.com/openstack/octavia/blob/master/octavia/amphorae/backends/agent/api_server/templates/systemd.conf.j2#L27
https://github.com/openstack/octavia/blob/master/octavia/amphorae/backends/agent/api_server/templates/systemd.conf.j2#L28
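For illustration, the fix renders the amphora's local peer name into the haproxy command line via "-L", so that it matches an entry in the generated peers section. A hedged sketch of what the corrected invocation would look like, reusing the peer name from the haproxy.cfg quoted in comment #7 (the listener path is abbreviated; this is not copied verbatim from the template):

```
/usr/sbin/haproxy -f /var/lib/octavia/<listener_id>/haproxy.cfg \
    -f /var/lib/octavia/haproxy-default-user-group.conf \
    -L yZB0PtEhlFqOwQHLLY3Zj9U_QAg
```

Without "-L", haproxy falls back to the hostname (e.g. amphora-....localdomain), which does not appear in the peers section and triggers the "unable to find local peer" alert from comment #3.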

Revision history for this message
Nir Magnezi (nmagnezi) wrote :

@Michael, you were right.
Apparently, my devstack used a cached amphora image without this fix.
As soon as I generated a new one, test_session_persistence passed with loadbalancer_topology = ACTIVE_STANDBY:

{0} neutron_lbaas.tests.tempest.v2.scenario.test_session_persistence.TestSessionPersistence.test_session_persistence [246.013035s] ... ok

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on octavia (master)

Change abandoned by Nir Magnezi (<email address hidden>) on branch: master
Review: https://review.openstack.org/480919

Revision history for this message
Gregory Thiemonge (gthiemonge) wrote : auto-abandon-script

Abandoned after re-enabling the Octavia launchpad.

Changed in octavia:
status: Triaged → Invalid
tags: added: auto-abandon