Timeout when reading response headers from daemon process 'octavia-api' after restart

Bug #1909004 reported by Jake Hill
This bug affects 3 people
Affects: OpenStack Octavia Charm
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

This may or may not be related to bug #1885936, but I think not, so I'm raising a new bug.

Install cs:openstack-base-70 with the attached overlay for octavia. When everything is green, run e.g.

  watch openstack loadbalancer list

in a terminal. You can see the empty list of loadbalancers.

Forcibly restart the octavia unit, e.g.

  juju ssh octavia/0 -- sudo reboot

The unit seems to come back, but the API no longer responds. Once the unit has settled again, the client returns;

Unable to establish connection to https://192.168.200.193:9876/v2.0/lbaas/loadbalancers: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
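
(As a quick check to separate "the Apache listener is down" from "the WSGI daemon is hanging", something like the following can be run against the API endpoint; the IP is the one from the error above, and the interpretation in the comment is an assumption to verify rather than a confirmed diagnosis:)

  # If the TCP connection is accepted but no response ever arrives,
  # that points at the octavia-api WSGI daemon rather than Apache itself.
  curl -vk https://192.168.200.193:9876/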

In /var/log/apache2/octavia_error.log there are these messages;

[Tue Dec 22 11:13:38.488467 2020] [wsgi:error] [pid 3449:tid 139866935383808] [client 127.0.0.1:50482] Timeout when reading response headers from daemon process 'octavia-api': /usr/bin/octavia-wsgi
[Tue Dec 22 11:15:14.200171 2020] [wsgi:error] [pid 3449:tid 139866901812992] [client 127.0.0.1:52026] Timeout when reading response headers from daemon process 'octavia-api': /usr/bin/octavia-wsgi
[Tue Dec 22 11:16:49.404853 2020] [wsgi:error] [pid 3448:tid 139867030517504] [client 127.0.0.1:53520] Timeout when reading response headers from daemon process 'octavia-api': /usr/bin/octavia-wsgi
[Tue Dec 22 11:18:24.722825 2020] [wsgi:error] [pid 3449:tid 139866658555648] [client 127.0.0.1:55038] Timeout when reading response headers from daemon process 'octavia-api': /usr/bin/octavia-wsgi
[Tue Dec 22 11:20:00.493323 2020] [wsgi:error] [pid 3449:tid 139866264295168] [client 127.0.0.1:56534] Timeout when reading response headers from daemon process 'octavia-api': /usr/bin/octavia-wsgi
[Tue Dec 22 11:21:35.836805 2020] [wsgi:error] [pid 3449:tid 139866650162944] [client 127.0.0.1:58016] Timeout when reading response headers from daemon process 'octavia-api': /usr/bin/octavia-wsgi
[Tue Dec 22 11:23:11.372837 2020] [wsgi:error] [pid 3449:tid 139866415298304] [client 127.0.0.1:59540] Timeout when reading response headers from daemon process 'octavia-api': /usr/bin/octavia-wsgi
[Tue Dec 22 11:24:47.687568 2020] [wsgi:error] [pid 3449:tid 139866364942080] [client 127.0.0.1:32792] Timeout when reading response headers from daemon process 'octavia-api': /usr/bin/octavia-wsgi
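
(For reference, this Apache error means mod_wsgi gave up waiting for the octavia-api daemon process to return response headers. One way to see what the charm rendered for the daemon process, including any socket-timeout/request-timeout options, is something like the following; the exact file names under sites-enabled are an assumption and may differ:)

  # Show the WSGIDaemonProcess directive and any timeout options rendered for octavia-api
  juju ssh octavia/0 -- grep -R "WSGIDaemonProcess\|socket-timeout\|request-timeout" /etc/apache2/sites-enabled/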

Revision history for this message
Jake Hill (routergod) wrote :
Revision history for this message
Liam Young (gnuoy) wrote :

I could not reproduce this issue. Please could you include the logs from octavia/0:/var/log/octavia?

$ openstack loadbalancer list

$ juju ssh octavia/0 -- sudo reboot

Connection to 172.20.0.8 closed by remote host.
Connection to 172.20.0.8 closed.

<Wait for restart>

$ openstack loadbalancer list

$

Changed in charm-octavia:
status: New → Incomplete
Revision history for this message
Jake Hill (routergod) wrote :

Thanks for looking. I've included the apache2 and octavia logs in the tarball. It looks perhaps like some problem with the WSGI invocation; the octavia logs don't give me anything.

I will also add a paste of a console session pinging the API during the reboot. It is a bit odd: the API does appear to come back for at least one request post-reboot.
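
(For anyone wanting the same logs off their own unit, a minimal sketch, assuming the standard log locations, is to bundle them on the unit and copy the tarball back:)

  # Bundle the octavia and apache2 logs on the unit, then copy the tarball locally
  juju ssh octavia/0 -- sudo tar czf /tmp/octavia-logs.tgz /var/log/octavia /var/log/apache2
  juju scp octavia/0:/tmp/octavia-logs.tgz .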

Revision history for this message
Jake Hill (routergod) wrote :

Console session

Revision history for this message
Jake Hill (routergod) wrote :

Apologies, that was pretty pointless as I did not give enough time for the WSGI timeout. Here is a replacement /var/log/apache2/octavia_error.log

Revision history for this message
Jake Hill (routergod) wrote :

Not wanting this bug to expire...

I have just started from scratch with a completely fresh MAAS stack and have deployed a new model based on openstack-base and loadbalancer-octavia.yaml. I am still suffering this issue. Some time (not immediately) after forcibly restarting the octavia/0 machine, I get the message;

Unable to establish connection to https://10.23.129.95:9876/v2.0/lbaas/loadbalancers: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

Changed in charm-octavia:
status: Incomplete → New
Revision history for this message
Jake Hill (routergod) wrote :

Thanks @Alex

I have discovered something. I wonder if this might actually end up being an OVN issue. I found these logs in ovn-controller.log on the unit, and have noticed similar logs on compute chassis previously. Restarting OVN clears it and has done so in this case too;

routergod@juju:~$ juju ssh octavia/3 -- sudo tail /var/log/ovn/ovn-controller.log
2021-04-20T18:51:14.704Z|00010|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connected
2021-04-20T18:51:14.705Z|00001|pinctrl(ovn_pinctrl0)|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting to switch
2021-04-20T18:51:14.706Z|00002|rconn(ovn_pinctrl0)|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting...
2021-04-20T18:51:14.706Z|00003|rconn(ovn_pinctrl0)|INFO|unix:/var/run/openvswitch/br-int.mgmt: connected
2021-04-20T18:51:14.720Z|00011|binding|INFO|Not claiming lport 8a28d5fc-a5ef-4e08-91c2-796a130953a7, chassis juju-dd7be7-2-lxd-9 requested-chassis juju-dd7be7-2-lxd-9.maas
2021-04-20T18:51:31.489Z|00012|binding|INFO|Dropped 1 log messages in last 16 seconds (most recently, 16 seconds ago) due to excessive rate
2021-04-20T18:51:31.489Z|00013|binding|INFO|Not claiming lport 8a28d5fc-a5ef-4e08-91c2-796a130953a7, chassis juju-dd7be7-2-lxd-9 requested-chassis juju-dd7be7-2-lxd-9.maas
2021-04-20T18:51:31.966Z|00014|binding|INFO|Claiming lport 8a28d5fc-a5ef-4e08-91c2-796a130953a7 for this chassis.
2021-04-20T18:51:31.966Z|00015|binding|INFO|8a28d5fc-a5ef-4e08-91c2-796a130953a7: Claiming fa:16:3e:ce:0a:2f fc00:9a48:3a7d:4d21:f816:3eff:fece:a2f
2021-04-20T18:51:31.966Z|00016|binding|INFO|8a28d5fc-a5ef-4e08-91c2-796a130953a7: Claiming unknown
Connection to 10.23.129.104 closed.

routergod@juju:~$ juju run --unit=octavia/3 -- sudo systemctl restart ovn-host
routergod@juju:~$ juju run --unit=octavia/3 -- sudo systemctl restart ovn-controller
routergod@juju:~$ juju run --unit=octavia/3 -- sudo systemctl restart ovsdb-server

routergod@juju:~$ juju ssh octavia/3 -- sudo tail /var/log/ovn/ovn-controller.log
2021-04-20T18:55:28.103Z|00004|rconn(ovn_pinctrl0)|INFO|unix:/var/run/openvswitch/br-int.mgmt: connection closed by peer
2021-04-20T18:55:28.103Z|00014|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connection closed by peer
2021-04-20T18:55:28.198Z|00015|jsonrpc|WARN|unix:/var/run/openvswitch/db.sock: send error: Broken pipe
2021-04-20T18:55:28.198Z|00016|reconnect|WARN|unix:/var/run/openvswitch/db.sock: connection dropped (Broken pipe)
2021-04-20T18:55:29.104Z|00005|rconn(ovn_pinctrl0)|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting...
2021-04-20T18:55:29.105Z|00006|rconn(ovn_pinctrl0)|INFO|unix:/var/run/openvswitch/br-int.mgmt: connected
2021-04-20T18:55:29.105Z|00017|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting...
2021-04-20T18:55:29.105Z|00018|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connected
2021-04-20T18:55:29.198Z|00019|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2021-04-20T18:55:29.199Z|00020|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
Connection to 10.23.129.104 closed.
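
(A hedged way to dig further into the "Not claiming lport ... requested-chassis" messages is to compare the chassis record ovn-controller registered with what the port binding requests. The commands below assume an ovn-central/0 unit hosts the southbound DB and reuse the lport UUID from the log above:)

  # On the octavia unit: the system-id/hostname ovn-controller registers for its chassis
  juju ssh octavia/3 -- sudo ovs-vsctl get Open_vSwitch . external_ids

  # On an OVN central unit: chassis records and the binding for the lport in question
  juju ssh ovn-central/0 -- sudo ovn-sbctl list Chassis
  juju ssh ovn-central/0 -- sudo ovn-sbctl find Port_Binding logical_port=8a28d5fc-a5ef-4e08-91c2-796a130953a7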

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Hi Jake. As this is (possibly?) in a lab setting, it may be helpful to grab a juju-crashdump of the model, which would gather all the logs from all the units: https://github.com/juju/juju-crashdump -- there are/have been some OVN issues fixed during the most recent dev cycle, so it may also be worth (in a lab setting only) trying it with the ~openstack-charmers-next charms.
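
(A minimal sketch of both suggestions; the install method, application names and charm URLs below are assumptions and may differ for this deployment:)

  # Grab a crashdump of the current model (tool from https://github.com/juju/juju-crashdump)
  sudo snap install juju-crashdump --classic
  juju-crashdump

  # Lab only: switch the OVN/octavia charms to the -next variants
  juju upgrade-charm ovn-chassis --switch cs:~openstack-charmers-next/ovn-chassis
  juju upgrade-charm octavia --switch cs:~openstack-charmers-next/octavia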

Revision history for this message
Jake Hill (routergod) wrote :

Hi Alex. Thanks, I will try the -next charms. Uploading the crashdump here was not working; I'm getting a timeout error, but I think that's due to the size. It seems the -b option on crashdump does not work either, so I'll need to find somewhere to stash the file for you.

Revision history for this message
Jake Hill (routergod) wrote :
Revision history for this message
Jake Hill (routergod) wrote :

I get the same with the OVN charms from -next. Have filed https://bugs.launchpad.net/charm-ovn-chassis/+bug/1925793
