Octavia active/standby config + pool with SOURCE_IP session persistence: service is not available and LB is not deleted after test

Bug #1690812 reported by Alex Stafeyev on 2017-05-15
Affects   Status    Importance   Assigned to   Milestone
octavia   Triaged   Critical     Unassigned

Bug Description

Description of problem:
I run Octavia (RHEL or Ubuntu amphora) in ACTIVE_STANDBY mode.
The failing test is the LBaaS scenario session persistence test.

Version:
Ocata

How reproducible:

100%

Steps to Reproduce:
1. Deploy a setup with Octavia support (preferably with 2 compute nodes) and run all required post-deployment steps.
https://review.openstack.org/#/c/447496/

2. Run the following neutron-lbaas tempest test:
https://github.com/openstack/neutron-lbaas/blob/master/neutron_lbaas/tests/tempest/v2/scenario/test_session_persistence.py

Actual results:
The test fails and the LB cannot be deleted (please see the attached log file).
A curl request to the VIP address returns 503 Service Unavailable (this does not reproduce with the Octavia SINGLE topology).

Expected results:
The test should pass. With the SINGLE topology (as opposed to ACTIVE_STANDBY) it runs well.

Alex Stafeyev (astafeye) wrote :

Ubuntu amphora

Alex Stafeyev (astafeye) wrote :

More investigation logs

Nir Magnezi (nmagnezi) wrote :

Debugging shows that this is indeed an issue, probably in how Octavia configures HAProxy.

Looking at the logs inside the amphora instance, I noticed the following:

host-192-168-199-59 haproxy: [ALERT] 134/082747 (11416) : Proxy 'cce34da1-7b4d-4659-bb0b-6cf01ffbcd68': unable to find local peer 'amphora-b8928a25-ca71-4389-8753-6ab3b2fb3d2c.localdomain' in peers section '9c530de5653d474181b73fe70c398ad5_peers'.
host-192-168-199-59 haproxy: [ALERT] 134/082747 (11416) : Fatal errors found in configuration.

Michael Johnson (johnsom) wrote :

The first issue I see is a Python 3 bug:

  File "/usr/local/lib/python3.5/dist-packages/octavia/amphorae/backends/agent/api_server/listener.py", line 230, in start_stop_listener
    if 'Job is already running' not in e.output:
TypeError: a bytes-like object is required, not 'str'

That comparison "if 'Job is already running' not in e.output:" should probably be "if b'Job is already running' not in e.output:"

That should fix the first issue. I confirmed that it has not yet been fixed.
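A minimal sketch of why the comparison fails on Python 3, not the actual Octavia agent code: the output captured on subprocess.CalledProcessError is bytes unless text mode is requested, so a str literal cannot be used in a membership test against it. The helper names capture_failure_output and is_already_running, and the message text passed to the child process, are hypothetical and exist only for illustration.

```python
import subprocess
import sys


def capture_failure_output(message):
    """Run a child process that prints `message` and exits with status 1.

    Returns the output attached to CalledProcessError, which is bytes
    because check_output() is called without text mode.
    """
    cmd = [sys.executable, "-c",
           "import sys; print(sys.argv[1]); sys.exit(1)", message]
    try:
        subprocess.check_output(cmd, stderr=subprocess.STDOUT)
    except subprocess.CalledProcessError as e:
        return e.output
    return b""


def is_already_running(output):
    # A bytes literal on the left makes the membership test valid on Python 3.
    return b"Job is already running" in output


# Hypothetical failure message, used only to exercise the check.
output = capture_failure_output("start: Job is already running: haproxy")

# The original form of the check (str in bytes) raises TypeError on Python 3.
try:
    "Job is already running" in output
    str_check_failed = False
except TypeError:
    str_check_failed = True
```

An alternative would be to decode the output before comparing; the bytes literal is simply the smallest change to the existing line.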

Changed in octavia:
status: New → Triaged
importance: Undecided → Critical

Michael Johnson (johnsom) wrote :

Issue two is the incorrect peers. I think you hit the bug reported here: https://launchpad.net/bugs/1681623

There is a patch here: https://review.openstack.org/#/c/455569
I have been struggling with that patch as I could not reproduce, but I think you found the magic combination.

We will move forward with that bug/patch for your second issue.

Nir Magnezi (nmagnezi) wrote :

Michael, Thanks for looking into this.
I submitted a tiny patch based on what you wrote in comment #4.
I'll test both this and the patch submitted against bug 1681623 and report my findings.

Nir Magnezi (nmagnezi) wrote :

Looks like the fix for bug 1681623 did not resolve this bug.

I used loadbalancer_topology = ACTIVE_STANDBY and tested with the following pool configuration:
neutron lbaas-pool-create --lb-algorithm ROUND_ROBIN --listener listener1 --protocol HTTP --session-persistence type=SOURCE_IP
Created a new pool:
+---------------------+------------------------------------------------+
| Field               | Value                                          |
+---------------------+------------------------------------------------+
| admin_state_up      | True                                           |
| description         |                                                |
| healthmonitor_id    |                                                |
| id                  | b084ed49-038b-45dc-9b4b-8a277f60ba5b           |
| lb_algorithm        | ROUND_ROBIN                                    |
| listeners           | {"id": "8ac3b6b3-680e-4a58-b51d-883283a3caf1"} |
| loadbalancers       | {"id": "d9800fd6-f010-4540-8d41-ac24ae325cc2"} |
| members             |                                                |
| name                |                                                |
| protocol            | HTTP                                           |
| session_persistence | {"cookie_name": null, "type": "SOURCE_IP"}     |
| tenant_id           | d67bee545d534850aedfbe77da709c68               |
+---------------------+------------------------------------------------+

The Octavia Worker shows the following exception:
https://paste.fedoraproject.org/paste/vpzv0KnOjaqYdbtf5K9Bng

Digging inside the amphora VMs, I noticed the following:

One amphora managed to spawn haproxy with no errors; it looks like the standby amphora.

haproxy.cfg:
============
root@amphora-e643829b-4de3-4ebf-a96c-0bef10389f6f:~# cat /var/lib/octavia/8ac3b6b3-680e-4a58-b51d-883283a3caf1/haproxy.cfg
# Configuration for lb_nir
global
    daemon
    user nobody
    log /dev/log local0
    log /dev/log local1 notice
    stats socket /var/lib/octavia/8ac3b6b3-680e-4a58-b51d-883283a3caf1.sock mode 0666 level user

defaults
    log global
    retries 3
    option redispatch
    timeout connect 5000
    timeout client 50000
    timeout server 50000

peers 8ac3b6b3680e4a58b51d883283a3caf1_peers
    peer yZB0PtEhlFqOwQHLLY3Zj9U_QAg 10.0.0.6:1025
    peer C2BTWaiZ2FW-oF4uOy7c2LeC0mU 10.0.0.14:1025

frontend 8ac3b6b3-680e-4a58-b51d-883283a3caf1
    option httplog
    bind 10.0.0.9:80
    mode http

haproxy log:
============
cat /var/log/haproxy.log
Jul 11 11:11:06 amphora-e643829b-4de3-4ebf-a96c-0bef10389f6f haproxy[1723]: Proxy 8ac3b6b3-680e-4a58-b51d-883283a3caf1 started.
Jul 11 11:11:06 amphora-e643829b-4de3-4ebf-a96c-0bef10389f6f haproxy[1723]: Proxy 8ac3b6b3-680e-4a58-b51d-883283a3caf1 started.

The second amphora fails to spawn haproxy; it looks like the active amphora:

haproxy.cfg:
============
root@amphora-6ca8b5ae-3185-48b0-9615-9bf8d5df30f0:~# cat /var/lib/octavia/8ac3b6b3-680e-4a58-b51d-883283a3caf1/haproxy.cfg
# Configuration for lb_nir
global
    daemon
    user nobody
    log /dev/log local0
    log /dev/log local1 notice
    stats socket /var/lib/octavia/8ac3b6b3-680e-4a58-b51d-883283a3caf1...


Michael Johnson (johnsom) wrote :

Can you confirm you have a newly built image with the fix from: https://review.openstack.org/#/c/455569

It is behaving as if the "-L" command line flag is missing, which is what the fix addressed.

The manual run below is also missing the -L flag.

/usr/sbin/haproxy -f /var/lib/octavia/8ac3b6b3-680e-4a58-b51d-883283a3caf1/haproxy.cfg -f /var/lib/octavia/haproxy-default-user-group.conf -c -q

See these lines in the source:
https://github.com/openstack/octavia/blob/master/octavia/amphorae/backends/agent/api_server/templates/systemd.conf.j2#L27
https://github.com/openstack/octavia/blob/master/octavia/amphorae/backends/agent/api_server/templates/systemd.conf.j2#L28
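To illustrate why a missing -L matters: as far as I understand HAProxy, the local peer name defaults to the machine's hostname unless -L overrides it, and that name must appear in the peers section. The sketch below is a simplified, hypothetical checker (it is not part of Octavia or HAProxy); it reuses the peers section from the haproxy.cfg pasted earlier in this report and the amphora hostname from the shell prompt. The parsing is deliberately naive and assumes one directive per line.

```python
# Simplified set of keywords that start a new HAProxy config section.
SECTION_KEYWORDS = {"global", "defaults", "frontend", "backend", "listen",
                    "peers", "userlist", "resolvers", "mailers"}


def peer_names(cfg_text):
    """Return the peer names declared in any 'peers' section of cfg_text."""
    names = []
    in_peers = False
    for raw in cfg_text.splitlines():
        # Drop trailing comments (naive: ignores quoting) and split on spaces.
        fields = raw.split("#", 1)[0].split()
        if not fields:
            continue
        if fields[0] in SECTION_KEYWORDS:
            in_peers = fields[0] == "peers"
        elif in_peers and fields[0] == "peer" and len(fields) >= 3:
            names.append(fields[1])
    return names


# Excerpt of the haproxy.cfg shown earlier in this report.
CFG = """\
peers 8ac3b6b3680e4a58b51d883283a3caf1_peers
    peer yZB0PtEhlFqOwQHLLY3Zj9U_QAg 10.0.0.6:1025
    peer C2BTWaiZ2FW-oF4uOy7c2LeC0mU 10.0.0.14:1025

frontend 8ac3b6b3-680e-4a58-b51d-883283a3caf1
    option httplog
    bind 10.0.0.9:80
    mode http
"""

names = peer_names(CFG)

# Without -L, the local peer name would be the hostname, which is not listed.
hostname_is_peer = "amphora-e643829b-4de3-4ebf-a96c-0bef10389f6f" in names

# With -L set to one of the generated peer names, the lookup succeeds.
generated_name_is_peer = "yZB0PtEhlFqOwQHLLY3Zj9U_QAg" in names
```

The generated peer names in the config never match an amphora hostname, so a start command without -L would be expected to fail with the "unable to find local peer" alert.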

Nir Magnezi (nmagnezi) wrote :

@Michael, you were right.
Apparently, my devstack used a cached amphora image without this fix.
As soon as I generated a new one, test_session_persistence passed with loadbalancer_topology = ACTIVE_STANDBY:

{0} neutron_lbaas.tests.tempest.v2.scenario.test_session_persistence.TestSessionPersistence.test_session_persistence [246.013035s] ... ok

Change abandoned by Nir Magnezi (<email address hidden>) on branch: master
Review: https://review.openstack.org/480919
