Guest VM is stuck in RESIZE state randomly, never going to VERIFY_RESIZE as expected; Connection aborted, Remote end closed connection without response

Bug #1944467 reported by Vladimir Grevtsev
This bug affects 2 people
Affects: OpenStack Nova Cloud Controller Charm (Status: Invalid, Importance: Undecided, Assigned to: Unassigned)

Bug Description

== Environment
focal/ussuri + ovn, latest stable charms
juju status: https://paste.ubuntu.com/p/2725tV47ym/
Hardware: Huawei CH121 V5 with MZ532,4*25GE Mezzanine Card,PCIE 3.0 X16 NICs
External storage: Huawei OceanStor Dorado 6000 V6
juju crashdump: https://drive.google.com/file/d/1oZ6K8PW2euEKgKyaqtq2bdr8gIpznvX5/view?usp=sharing

== Problem

As a part of the cloud testing, we are launching a set of Rally tests, including NovaServers.boot_server_from_volume_and_resize [0].

Currently, we are observing that after invoking a server resize, the guest may remain in the RESIZE state and never transition to VERIFY_RESIZE as the test expects [1] (or, as another theory, it may get there but the test can't reach nova-cloud-controller to verify it), and the test fails with the following traceback:

Traceback (most recent call last):
  File "/snap/fcbtest/14/lib/python3.6/site-packages/rally/task/runner.py", line 69, in _run_scenario_once
    getattr(scenario_inst, method_name)(**scenario_kwargs)
  File "/snap/fcbtest/14/lib/python3.6/site-packages/rally_openstack/task/scenarios/nova/servers.py", line 598, in run
    self._resize(server, to_flavor)
  File "/snap/fcbtest/14/lib/python3.6/site-packages/rally/task/atomic.py", line 91, in func_atomic_actions
    f = func(self, *args, **kwargs)
  File "/snap/fcbtest/14/lib/python3.6/site-packages/rally_openstack/task/scenarios/nova/utils.py", line 699, in _resize
    check_interval=CONF.openstack.nova_server_resize_poll_interval
  File "/snap/fcbtest/14/lib/python3.6/site-packages/rally/task/utils.py", line 213, in wait_for_status
    resource = update_resource(resource)
  File "/snap/fcbtest/14/lib/python3.6/site-packages/rally/task/utils.py", line 78, in _get_from_manager
    raise exceptions.GetResourceFailure(resource=resource, err=e)
rally.exceptions.GetResourceFailure: Failed to get the resource <Server: s_rally_4e75e321_W7NYCnLn>: Unable to establish connection to https://nova.fqdn:8774/v2.1/servers/fe2b3ad9-0bdd-467f-aa9f-cc84dd013856: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

[0] https://github.com/openstack/rally-openstack/blob/ec1f31a685bdcbe10212701f8fd0c35598d4745f/rally_openstack/task/scenarios/nova/servers.py#L555
[1] https://github.com/openstack/rally-openstack/blob/master/rally_openstack/task/scenarios/nova/utils.py#L692-L700

The Rally log is provided for timestamp reference; the crashdump will be attached as soon as we get it from the environment.

2021-09-21 19:54:08.331 1051375 INFO rally.task.runner [-] Task 084a8dd3-6112-4239-baa3-0de54eb6a2a9 | ITER: 1 END: Error GetResourceFailure: Failed to get the resource <Server: s_rally_d7e83197_Xv73v83E>: Unable to establish connection to https://nova.fqdn:8774/v2.1/servers/ba0e2b57-2378-4869-94f2-37661c9b05ff: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))
2021-09-21 19:54:08.333 1051375 INFO rally.task.runner [-] Task 084a8dd3-6112-4239-baa3-0de54eb6a2a9 | ITER: 3 START
2021-09-21 19:55:27.800 1051377 INFO rally.task.runner [-] Task 084a8dd3-6112-4239-baa3-0de54eb6a2a9 | ITER: 2 END: OK
2021-09-21 19:55:27.802 1051377 INFO rally.task.runner [-] Task 084a8dd3-6112-4239-baa3-0de54eb6a2a9 | ITER: 4 START
2021-09-21 19:56:37.150 1051375 INFO rally.task.runner [-] Task 084a8dd3-6112-4239-baa3-0de54eb6a2a9 | ITER: 3 END: OK
2021-09-21 19:56:37.152 1051375 INFO rally.task.runner [-] Task 084a8dd3-6112-4239-baa3-0de54eb6a2a9 | ITER: 5 START
2021-09-21 19:57:41.926 1051377 INFO rally.task.runner [-] Task 084a8dd3-6112-4239-baa3-0de54eb6a2a9 | ITER: 4 END: OK
2021-09-21 19:57:41.928 1051377 INFO rally.task.runner [-] Task 084a8dd3-6112-4239-baa3-0de54eb6a2a9 | ITER: 6 START
2021-09-21 19:58:24.976 1051375 INFO rally.task.runner [-] Task 084a8dd3-6112-4239-baa3-0de54eb6a2a9 | ITER: 5 END: OK
2021-09-21 19:58:24.978 1051375 INFO rally.task.runner [-] Task 084a8dd3-6112-4239-baa3-0de54eb6a2a9 | ITER: 7 START
2021-09-21 19:59:30.661 1051375 INFO rally.task.runner [-] Task 084a8dd3-6112-4239-baa3-0de54eb6a2a9 | ITER: 7 END: Error GetResourceFailure: Failed to get the resource <Server: s_rally_d7e83197_cBm5J6wn>: Unable to establish connection to https://nova.fqdn:8774/v2.1/servers/a1a70ee1-bf89-475e-9f51-480c1cda9bc7: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))
2021-09-21 19:59:30.663 1051375 INFO rally.task.runner [-] Task 084a8dd3-6112-4239-baa3-0de54eb6a2a9 | ITER: 8 START
2021-09-21 20:00:09.757 1051377 INFO rally.task.runner [-] Task 084a8dd3-6112-4239-baa3-0de54eb6a2a9 | ITER: 6 END: OK
2021-09-21 20:00:09.759 1051377 INFO rally.task.runner [-] Task 084a8dd3-6112-4239-baa3-0de54eb6a2a9 | ITER: 9 START
2021-09-21 20:00:38.332 1051375 INFO rally.task.runner [-] Task 084a8dd3-6112-4239-baa3-0de54eb6a2a9 | ITER: 8 END: Error GetResourceFailure: Failed to get the resource <Server: s_rally_d7e83197_XqsmBECG>: Unable to establish connection to https://nova.fqdn:8774/v2.1/servers/9c59e7eb-5fe7-49ea-acea-4aa7a90e74f7: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))
2021-09-21 20:00:38.334 1051375 INFO rally.task.runner [-] Task 084a8dd3-6112-4239-baa3-0de54eb6a2a9 | ITER: 10 START
2021-09-21 20:01:12.398 1051377 INFO rally.task.runner [-] Task 084a8dd3-6112-4239-baa3-0de54eb6a2a9 | ITER: 9 END: Error GetResourceFailure: Failed to get the resource <Server: s_rally_d7e83197_Ju6WcWMd>: Unable to establish connection to https://nova.fqdn:8774/v2.1/servers/031e9c72-6698-4bd5-b6cf-9df9531790d4: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))
2021-09-21 20:01:37.845 1051375 INFO rally.task.runner [-] Task 084a8dd3-6112-4239-baa3-0de54eb6a2a9 | ITER: 10 END: Error GetResourceFailure: Failed to get the resource <Server: s_rally_d7e83197_5mMdRhmh>: Unable to establish connection to https://nova.fqdn:8774/v2.1/servers/a1976740-eee9-4582-bed1-b0421f418c41: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))
2021-09-21 20:01:37.862 1050774 INFO rally.task.context [-] Task 084a8dd3-6112-4239-baa3-0de54eb6a2a9 | Context cleanup@openstack cleanup() started

== Steps tried to reproduce the issue manually (note: I was unable to reproduce it by hand; will try replicating Rally's behaviour)

openstack server create --boot-from-volume 30 --image auto-sync/ubuntu-focal-20.04-amd64-server-20210907-disk1.img --flavor m1.medium --key-name ubuntu-keypair --network internal1-net test-resize --min 20 --max 20
for x in $(seq 1 20); do openstack server resize test-resize-$x --flavor m1.large; done
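
For context, a rough sketch (my own, not part of the original test run) of how Rally's resize-and-poll behaviour could be replicated from Python; the auth details, CA path, flavor and server names below are placeholders:

import time

from keystoneauth1.identity import v3
from keystoneauth1 import session
from novaclient import client as nova_client

# Placeholder credentials; reuse one keystoneauth Session so the HTTPS
# connections are pooled and kept alive, as Rally does.
auth = v3.Password(auth_url="https://keystone.fqdn:5000/v3",
                   username="admin", password="MY_PASSWORD",
                   project_name="admin",
                   user_domain_name="admin_domain",
                   project_domain_name="admin_domain")
sess = session.Session(auth=auth, verify="/path/to/root-ca.crt")
nova = nova_client.Client("2.1", session=sess)

flavor = nova.flavors.find(name="m1.large")
for x in range(1, 21):
    server = nova.servers.find(name="test-resize-%d" % x)
    nova.servers.resize(server, flavor)
    # Poll like Rally's wait_for_status: re-fetch the server every 5 seconds
    # (the default nova_server_resize_poll_interval) until it settles.
    while True:
        server = nova.servers.get(server.id)
        if server.status in ("VERIFY_RESIZE", "ERROR"):
            break
        time.sleep(5)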

Tags: field-high
Revision history for this message
Vladimir Grevtsev (vlgrevtsev) wrote :
description: updated
Revision history for this message
Vladimir Grevtsev (vlgrevtsev) wrote :

+ field-high as this affects an ongoing engagement, and we would like to ask for some guidance in troubleshooting.

tags: added: field-high
Revision history for this message
Corey Bryant (corey.bryant) wrote :

@Vladimir, there are several rabbitmq connection errors at the bottom of the nova-api-wsgi.log.1 such as:

2021-09-20 23:29:37.136 2195757 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: Server unexpectedly closed connection
2021-09-20 23:29:37.171 2195757 INFO oslo.messaging._drivers.impl_rabbit [-] [8a71c4bc-9960-4483-9937-958632127a6a] Reconnected to AMQP server on 10.35.174.58:5672 via [amqp] client with port 34240.
2021-09-20 23:34:37.113 2195758 ERROR oslo.messaging._drivers.impl_rabbit [-] [c1890463-e025-46e4-be9f-34e7c05892dc] AMQP server on 10.35.174.47:5672 is unreachable: Server unexpectedly closed connection. Trying again in 1 seconds.: OSError: Server unexpectedly closed connection

And the corresponding rabbitmq units (0/lxd/18, 1/lxd/19, and 2/lxd/19) don't appear to have /var/log/rabbitmq/ directories or /var/log/juju/unit-rabbitmq-server-*.log files. Is that something to do with the logrotated subordinate?

Corey

Revision history for this message
Nobuto Murata (nobuto) wrote :

I'm still looking at the crashdump, but I've found this (which might or might not be related to the original issue):

[nova-cloud-controller_1]
$ cat 1/lxd/14/var/log/haproxy.log | cut -d' ' -f 12 | sort | uniq -c
    514 --
   1190 cD
    434 CD

According to https://cbonte.github.io/haproxy-dconv/2.0/configuration.html#8.5

"--" - properly completed TCP sessions.
"cD" - "the client-side timeout expired while waiting for the client to send or receive data" + "the session was in the DATA phase"
"CD" - "the TCP session was unexpectedly aborted by the client" + "the session was in the DATA phase"

I'm looking more to see if there is any pattern with the source IP addresses.

Revision history for this message
Nobuto Murata (nobuto) wrote :

The large number of cD/CD entries itself might be a red herring. Our haproxy health-check setup may cause this by not completing/closing the TCP connections properly.

But the following should be the number of success / failure connections from the Rally client excluding other sources.

$ cat 1/lxd/14/var/log/haproxy.log | fgrep "<SOURCE_IP_OF_RALLY_CLIENT>" | cut -d' ' -f 12 | sort | uniq -c
    510 --
     11 cD
    266 CD

Revision history for this message
Nobuto Murata (nobuto) wrote :

We've bumped the number of workers and the haproxy client/server timeouts as follows, but there was no significant difference in the failure rate of the test case:

$ for app in keystone nova-cloud-controller neutron-api glance cinder; do
    juju config $app \
        haproxy-client-timeout=300000 \
        haproxy-server-timeout=300000 \
        worker-multiplier=0.1 ## translated to 12 instead of the default 4 workers
done

Many connections are still marked as cD with the new 5-minute timeout, but the Rally test cases are failing with the same error before the 5-minute timeout is actually reached.

Failed to get the resource <Server: s_rally_ad38efdc_BwUDRf7N>: Unable to establish connection to https://FQDN:8774/v2.1/servers/dfb92fa8-6f79-4022-8bb5-536946396e41: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

Revision history for this message
Nobuto Murata (nobuto) wrote :

Also, we suspected some networking issue between the Rally client and the API services, so we ran Rally on the same physical host as the LXD containers hosting the API services, but it didn't make a difference.

Revision history for this message
Nobuto Murata (nobuto) wrote :

> The large number of cD/CD itself might be a red herring. Our haproxy health check setup may cause this as not completing/closing the tcp connection properly.

This has nothing to do with haproxy's health checks. The cD/CD statuses are actually expected in the end.

“CD” can happen when a client respects and leverages HTTP Keep-Alive, e.g. via the Keystone Sessions method [1] (rather than the Client method) on top of requests/urllib3's connection pooling [2], and then issues a subsequent request more than 5 seconds (Keep-Alive: timeout=5) after the previous one.

Will try to see what's going on in the client side as Rally itself.

[1] https://docs.openstack.org/python-keystoneclient/latest/using-api-v3.html#authenticating-using-sessions
[2] https://docs.python-requests.org/en/master/user/advanced/#keep-alive

“cD” could also happen by the same logic, where a Session was opened but no subsequent request is made within the haproxy client/server timeout (90 seconds by default in the charms).

from keystoneauth1.identity import v3
from keystoneauth1 import session
from keystoneclient.v3 import client

import logging

# Debug logging exposes the Keep-Alive headers and urllib3 connection reuse.
logging.basicConfig(level=logging.DEBUG)

auth = v3.Password(auth_url="https://192.168.151.112:5000/v3",
                   username="admin", password="MY_PASSWORD",
                   project_name="admin",
                   user_domain_name="admin_domain",
                   project_domain_name="admin_domain")

# The Session pools connections via requests/urllib3, so consecutive calls
# reuse the same TCP connection while the server's Keep-Alive window lasts.
sess = session.Session(auth=auth,
    verify="/home/ubuntu/snap/openstackclients/common/root-ca.crt")
keystone = client.Client(session=sess)

keystone.projects.list()

>>> keystone.projects.list()
DEBUG:keystoneauth.session:REQ: curl -g -i -X GET https://192.168.151.112:35357/v3/projects? -H "Accept: application/json" -H "User-Agent: python-keystoneclient" -H "X-Auth-Token: {SHA256}c7ec577b587eee8c5be6ddcd29b8ee01569a1c07de1b08e7dc32be5d2f9d800f"
DEBUG:urllib3.connectionpool:https://192.168.151.112:35357 "GET /v3/projects HTTP/1.1" 200 1458
DEBUG:keystoneauth.session:RESP: [200] Connection: Keep-Alive Content-Length: 1458 Content-Type: application/json Date: Tue, 12 Oct 2021 16:54:57 GMT Keep-Alive: timeout=5, max=99 Server: Apache/2.4.41 (Ubuntu) Vary: X-Auth-Token x-openstack-request-id: req-da09258c-2ece-41d8-b4c2-24eb13db424b

After >5 seconds.

>>> keystone.projects.list()
DEBUG:keystoneauth.session:REQ: curl -g -i -X GET https://192.168.151.112:35357/v3/projects? -H "Accept: application/json" -H "User-Agent: python-keystoneclient" -H "X-Auth-Token: {SHA256}c7ec577b587eee8c5be6ddcd29b8ee01569a1c07de1b08e7dc32be5d2f9d800f"
*** DEBUG:urllib3.connectionpool:Resetting dropped connection: 192.168.151.112 ***
DEBUG:urllib3.connectionpool:https://192.168.151.112:35357 "GET /v3/projects HTTP/1.1" 200 1458
DEBUG:keystoneauth.session:RESP: [200] Connection: Keep-Alive Content-Length: 1458 Content-Type: application/json Date: Tue, 12 Oct 2021 16:53:22 GMT Keep-Alive: timeout=5, max=100 Server: Apache/2.4.41 (Ubuntu) Vary: X-Auth-Token x-openstack-request-id: req-828dcf42-dab2-4571-84b7-91620198adea

Revision history for this message
Nobuto Murata (nobuto) wrote :

As per the previous doc,

https://cbonte.github.io/haproxy-dconv/2.0/configuration.html#8.5
> CD The client unexpectedly aborted during data transfer. This can be
> caused by a browser crash, by an intermediate equipment between the
> client and haproxy which decided to actively break the connection,
> by network routing issues between the client and haproxy, or by a
> keep-alive session between the server and the client terminated first
> by the client.

I think we are seeing this exact case for "CD":

"by a keep-alive session between the server and the client terminated first by the client"

Revision history for this message
Nobuto Murata (nobuto) wrote :

I *think* I know what's going on here.

There is a tricky case where a client makes a request at almost exactly the same moment the HTTP Keep-Alive timeout expires. Our charms do not set any customised Keep-Alive timeout if I'm not mistaken, so they inherit the default of Keep-Alive timeout = 5 seconds from the apache2 package.

And Rally's nova_server_resize_poll_interval has a default value of 5 seconds, which is exactly the same as our Keep-Alive timeout:
https://github.com/openstack/rally-openstack/blob/ec1f31a685bdcbe10212701f8fd0c35598d4745f/rally_openstack/common/cfg/nova.py#L240-L243
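
To illustrate the race (a rough sketch of my own, not taken from the test runs; the endpoint, token, CA path and server ID below are placeholders), polling over a single persistent connection at an interval equal to the server's Keep-Alive timeout maximises the chance of sending a request just as apache2 drops the idle connection:

import time

import requests

NOVA_URL = "https://nova.fqdn:8774/v2.1"                # placeholder endpoint
TOKEN = "gAAAA..."                                      # placeholder token
SERVER_ID = "fe2b3ad9-0bdd-467f-aa9f-cc84dd013856"      # placeholder server

sess = requests.Session()  # keeps the TCP connection alive between requests
headers = {"X-Auth-Token": TOKEN}

for i in range(20):
    try:
        resp = sess.get("%s/servers/%s" % (NOVA_URL, SERVER_ID),
                        headers=headers, verify="/path/to/root-ca.crt",
                        timeout=30)
        print(i, resp.status_code)
    except requests.exceptions.ConnectionError as exc:
        # If the request hits the wire just as apache2 closes the idle
        # keep-alive connection, it surfaces as "Remote end closed
        # connection without response".
        print(i, "connection error:", exc)
    # Sleeping exactly the Keep-Alive timeout (5s) maximises the race window;
    # anything shorter, e.g. Rally's poll interval set to 4, avoids it.
    time.sleep(5)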

If I understand the existing bug reports correctly, the keystone token issuance won't be retried in this corner-case scenario since it's a POST request, not a GET. The keystone library could have some workarounds or mitigations, as covered in the following links:
https://github.com/psf/requests/issues/4664
https://bugs.python.org/issue41345
https://bugs.python.org/msg374466
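
For what it's worth, a hedged sketch (my own assumption, not something the charms or Rally do today) of how a client could opt into retrying dropped keep-alive connections via urllib3's Retry; POST is not in urllib3's default allowed_methods, which matches the point above that the token POST isn't retried:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry connection/read failures for idempotent methods only; urllib3's
# default allowed_methods (method_whitelist in older releases) deliberately
# excludes POST, so a dropped connection during token issuance would still
# surface to the caller.
retry = Retry(total=3, connect=3, read=3, backoff_factor=0.5,
              allowed_methods=frozenset({"HEAD", "GET", "OPTIONS"}))

sess = requests.Session()
sess.mount("https://", HTTPAdapter(max_retries=retry))
# GETs against the Nova API made through this session would now be retried
# after a dropped keep-alive connection instead of failing immediately.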

But fundamentally, we should be able to work around it with nova_server_resize_poll_interval < 5.0 in Rally, or an HTTP Keep-Alive timeout > 5 seconds in apache2. FWIW, a 5-second timeout is too small and a bit inefficient in a busy cloud, IMO.

We're trying to figure out how to set nova_server_resize_poll_interval in Rally, since we don't want to change the HTTP Keep-Alive timeout by hand outside of the charm.

Revision history for this message
Nobuto Murata (nobuto) wrote :

Filed a Rally upstream bug in the meantime:
https://bugs.launchpad.net/rally/+bug/1946912

Now we are testing the following method to see if we can still reproduce the issue or not.

[./config/rally/rally.conf]
====
[openstack]
nova_server_resize_poll_interval = 4
====

fcbtest.rally --config-file ./config/rally/rally.conf \
  task start ./generated/openstack/rally-combined.yaml \
  --task-args '{flavor_name: m1.small, image_name: auto-sync/ubuntu-focal-20.04-amd64-server-20211004-disk1.img}'

Changed in charm-nova-cloud-controller:
status: New → Incomplete
Revision history for this message
Nobuto Murata (nobuto) wrote :

Marking this as Invalid for now since there is no "Remote end closed connection without response" after passing `--config-file ./config/rally/rally.conf` explicitly. We will update the config of the test runner in the field.

For the fundamental "fixes", we can focus on the following bug instead.
https://bugs.launchpad.net/charm-helpers/+bug/1947010

Filed an upstream bug against Rally, but it's not that important since it's just a workaround anyway:
https://bugs.launchpad.net/rally/+bug/1946912

Changed in charm-nova-cloud-controller:
status: Incomplete → Invalid