oslo cache mempool issues with python3

Bug #1812935 reported by Michele Baldessari on 2019-01-22
20
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Ubuntu Cloud Archive
Status tracked in Stein
Rocky
High
Unassigned
Stein
High
Unassigned
oslo.cache
High
Herve Beraud
python-oslo.cache (Ubuntu)
High
Unassigned
Cosmic
High
Unassigned
Disco
High
Unassigned

Bug Description

nova conductor running on a rhel8 host inside f28 based containers hits the following error:
2019-01-17 13:59:37.049 46 DEBUG oslo_concurrency.lockutils [req-284f3071-8eee-4dcb-903c-838f2e024b48 40ca1490773f49f791d3a834af3702c8 8671bdf05abf48f58a9bdcdb0ef4b740 - default default] Lock "b051d003-482d-4cb7-810c-8d256e6c879e" acquired by "nova.context.set_target_cell.<locals>.get_or_set_cached_cell_and_set_connections" :: waited 0.000s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:327
2019-01-17 13:59:37.050 46 DEBUG oslo_concurrency.lockutils [req-284f3071-8eee-4dcb-903c-838f2e024b48 40ca1490773f49f791d3a834af3702c8 8671bdf05abf48f58a9bdcdb0ef4b740 - default default] Lock "b051d003-482d-4cb7-810c-8d256e6c879e" released by "nova.context.set_target_cell.<locals>.get_or_set_cached_cell_and_set_connections" :: held 0.002s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:339
2019-01-17 13:59:37.060 46 DEBUG oslo_db.sqlalchemy.engines [req-284f3071-8eee-4dcb-903c-838f2e024b48 40ca1490773f49f791d3a834af3702c8 8671bdf05abf48f58a9bdcdb0ef4b740 - default default] MySQL server mode set to STRICT_TRANS_TABLES,STRICT_ALL_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,TRADITIONAL,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION _check_effective_sql_mode /usr/lib/python3.6/site-packages/oslo_db/sqlalchemy/engines.py:307
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server [req-284f3071-8eee-4dcb-903c-838f2e024b48 40ca1490773f49f791d3a834af3702c8 8671bdf05abf48f58a9bdcdb0ef4b740 - default default] Exception during message handling: TypeError: object() takes no parameters
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/oslo_cache/_memcache_pool.py", line 163, in _get
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server conn = self.queue.pop().connection
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server IndexError: pop from an empty deque
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server During handling of the above exception, another exception occurred:
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", line 166, in _process_incoming
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 265, in dispatch
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 194, in _do_dispatch
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/nova/conductor/manager.py", line 1303, in schedule_and_build_instances
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server instance.create()
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/oslo_versionedobjects/base.py", line 226, in wrapper
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server return fn(self, *args, **kwargs)
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/nova/objects/instance.py", line 607, in create
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server self._load_ec2_ids()
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/nova/objects/instance.py", line 983, in _load_ec2_ids
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server self.ec2_ids = objects.EC2Ids.get_by_instance(self._context, self)
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/oslo_versionedobjects/base.py", line 184, in wrapper
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server result = fn(cls, context, *args, **kwargs)
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/nova/objects/ec2.py", line 216, in get_by_instance
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server ec2_ids = cls._get_ec2_ids(context, instance)
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/nova/objects/ec2.py", line 200, in _get_ec2_ids
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server ec2_ids['instance_id'] = ec2utils.id_to_ec2_inst_id(instance.uuid)
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/nova/api/ec2/ec2utils.py", line 188, in id_to_ec2_inst_id
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server int_id = get_int_id_from_instance_uuid(ctxt, instance_id)
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/nova/api/ec2/ec2utils.py", line 47, in memoizer
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server value = _CACHE.get(key)
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/nova/cache_utils.py", line 107, in get
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server value = self.region.get(key)
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/dogpile/cache/region.py", line 645, in get
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server value = self.backend.get(key)
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/dogpile/cache/backends/memcached.py", line 161, in get
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server value = self.client.get(key)
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/oslo_cache/backends/memcache_pool.py", line 31, in _run_method
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server with self.client_pool.acquire() as client:
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server File "/usr/lib64/python3.6/contextlib.py", line 81, in __enter__
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server return next(self.gen)
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/oslo_cache/_memcache_pool.py", line 128, in acquire
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server conn = self.get(timeout=self._connection_get_timeout)
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/eventlet/queue.py", line 303, in get
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server return self._get()
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/oslo_cache/_memcache_pool.py", line 214, in _get
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server conn = ConnectionPool._get(self)
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/oslo_cache/_memcache_pool.py", line 165, in _get
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server conn = self._create_connection()
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/oslo_cache/_memcache_pool.py", line 206, in _create_connection
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server return _MemcacheClient(self.urls, **self._arguments)
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server TypeError: object() takes no parameters
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server

nova.conf has:
()[nova@standalone /]$ grep -v ^# /etc/nova/nova.conf |grep '[a-zA-Z]' |grep -i mem
backend=oslo_cache.memcache_pool
memcache_servers=192.168.24.2:11211
memcached_servers=192.168.24.2:11211

memcache seems to be up:
[root@standalone ~]# podman top memcached
USER PID PPID %CPU ELAPSED TTY TIME COMMAND
memcached 1 0 0.000 2h2m27.45050385s ? 0s dumb-init --single-child -- /bin/bash -c source /etc/sysconfig/memcached; /usr/bin/memcached -p ${PORT} -u ${USER} -m ${CACHESIZE} -c ${MAXCONN} $OPTIONS
memcached 8 1 0.000 2h2m27.450722828s ? 0s /bin/bash -c source /etc/sysconfig/memcached; /usr/bin/memcached -p ${PORT} -u ${USER} -m ${CACHESIZE} -c ${MAXCONN} $OPTIONS
memcached 9 8 0.000 2h2m27.450781466s ? 0s /usr/bin/memcached -p 11211 -u memcached -m 11970 -c 8192 -v -l 192.168.24.2 -U 0 -X -t 8 >> /var/log/memcached.log 2>&1

Workaround:
disable caching in nova
crudini --set /var/lib/config-data/puppet-generated/nova/etc/nova/nova.conf cache enabled false
podman restart nova_conductor

Versions inside the f28 nova-conductor container:
python-oslo-versionedobjects-lang-1.34.1-0.20181128123056.50474ad.fc28.noarch
python3-oslo-config-6.7.0-0.20181108120643.64e020a.fc28.noarch
python3-oslo-cache-1.32.0-0.20190108083242.eb68d73.fc28.noarch
python3-oslo-versionedobjects-1.34.1-0.20181128123056.50474ad.fc28.noarch
python-oslo-middleware-lang-3.37.0-0.20181211135004.a609e68.fc28.noarch
puppet-oslo-14.2.0-0.20190111032249.b937844.fc28.noarch
python3-oslo-middleware-3.37.0-0.20181211135004.a609e68.fc28.noarch
python3-oslo-service-1.34.0-0.20190114140259.d987a4a.fc28.noarch
python3-oslo-policy-1.44.0-0.20190108082943.c9ea8f7.fc28.noarch
python-oslo-policy-lang-1.44.0-0.20190108082943.c9ea8f7.fc28.noarch
python-oslo-concurrency-lang-3.29.0-0.20181128184857.0767ddf.fc28.noarch
python3-oslo-utils-3.39.0-0.20190110184625.3823707.fc28.noarch
python3-oslo-vmware-2.32.1-0.20181126101324.04f82ae.fc28.noarch
python-oslo-log-lang-3.42.2-0.20190114115634.1babd44.fc28.noarch
python3-oslo-rootwrap-5.15.1-0.20181226111925.27f2314.fc28.noarch
python3-oslo-context-2.22.0-0.20190114134914.f65408d.fc28.noarch
python3-oslo-privsep-1.30.1-0.20181127103633.9391cbf.fc28.noarch
python-oslo-utils-lang-3.39.0-0.20190110184625.3823707.fc28.noarch
python-oslo-db-lang-4.43.0-0.20190108081838.a6d2cc5.fc28.noarch
python3-memcached-1.58-5.fc29.noarch
python-oslo-cache-lang-1.32.0-0.20190108083242.eb68d73.fc28.noarch
python3-oslo-i18n-3.23.0-0.20190111201133.a5fde9a.fc28.noarch
python3-oslo-concurrency-3.29.0-0.20181128184857.0767ddf.fc28.noarch
python3-oslo-db-4.43.0-0.20190108081838.a6d2cc5.fc28.noarch
python3-oslo-reports-1.29.1-0.20181126102912.dde49a4.fc28.noarch
python-oslo-i18n-lang-3.23.0-0.20190111201133.a5fde9a.fc28.noarch
python3-oslo-messaging-9.3.0-0.20190114135848.13fa4f5.fc28.noarch
python3-oslo-serialization-2.28.1-0.20181001122254.0371c1d.fc28.noarch
python-oslo-vmware-lang-2.32.1-0.20181126101324.04f82ae.fc28.noarch
python-oslo-privsep-lang-1.30.1-0.20181127103633.9391cbf.fc28.noarch
python3-oslo-log-3.42.2-0.20190114115634.1babd44.fc28.noarch
puppet-memcached-3.3.0-0.20180803162752.e517b44.fc28.noarch

------------------------------------------------

Ubuntu SRU details
------------------

[Impact]
See description above.

[Test Case]
1. Deploy an HA Rocky cloud using the openstack-next-charms, with memcache related to nova-cloud-controller.
2. Configure a simple network, upload a bionic image.
3. Try to start an instance via the API. It will stay in build.
4. Try to use the openstack cli to read availability zones. It will
return an error.

For more info see bug 1823740

[Regression Potential]
The regression potential is low. This is a minimal fix that has successfully been reviewed upstream and passed all upstream gate tests. It has already landed in upstream master branch and Ubuntu Disco and received 3 +1's and zuul tests +1 on stable/rocky gerrit reviews.

Herve Beraud (herveberaud) wrote :

Some debug informations to move a little bit forward on this topic...

The exception is raised inside the nova-conductor container.

The problem seems to occur when the cache is enable and the backend used is oslo.cache.

Config in use in /var/lib/config-data/puppet-generated/nova/etc/nova/nova.conf :
backend=oslo_cache.memcache_pool
enabled=True
debug_cache_backend=True

Nova services seems to be unregister:
$ openstack compute service list
+----+--------+------+------+--------+-------+------------+-----------------+-------------+ │·| Id | Binary | Host | Zone | Status | State | Updated_at | Disabled Reason | Forced down | │·
+----+--------+------+------+--------+-------+------------+-----------------+-------------+ │·+----+--------+------+------+--------+-------+------------+-----------------+-------------+

Else when the cache is disable the previous command return results
[stack@win1 ~]$ openstack compute service list
+----+------------------+------------------+----------+---------+-------+----------------------------+
| ID | Binary | Host | Zone | Status | State | Updated At |
+----+------------------+------------------+----------+---------+-------+----------------------------+
| 1 | nova-conductor | win1.localdomain | internal | enabled | up | 2019-01-24T16:07:47.000000 |
| 2 | nova-consoleauth | win1.localdomain | internal | enabled | up | 2019-01-24T16:07:41.000000 |
| 3 | nova-scheduler | win1.localdomain | internal | enabled | up | 2019-01-24T16:07:43.000000 |
| 8 | nova-compute | win1.localdomain | nova | enabled | up | 2019-01-24T16:07:47.000000 |
+----+------------------+------------------+----------+---------+-------+----------------------------+

Same problem when we use `nova service-list`

Herve Beraud (herveberaud) wrote :

I have realized some tests by using other cache backend (dogpile) and the error doesn't occur with it.

With oslo.cache backend we can notice that all acquiring connections seems dropped at each usage of the ConnectionPool.

Oslo cache was managed by oslo.messaging and eventlet.

Also I guess we have a similar issue than this other bug (https://bugs.launchpad.net/oslo.messaging/+bug/1813029), I need to try to bump the version in use to try to fix the problem.

http://lists.openstack.org/pipermail/openstack-discuss/2019-January/001998.html

Herve Beraud (herveberaud) wrote :

When oslo.cache try to create connection by using python-memcached no connections was created...

The following traceback appear when we try to establish a connection http://paste.openstack.org/show/744079/

Herve Beraud (herveberaud) wrote :

If I try to connect me to the given url ['192.168.24.2:11211'] by using a tiny script the connection was establish and I can set and get values.

#!/usr/bin/python3
import memcache

def main():
    mc = memcache.Client(['192.168.24.2:11211'], debug=0)

    mc.set("some_key", "Some value")
    value = mc.get("some_key")
    print(value)

if __name__ == "__main__":
    main()

Herve Beraud (herveberaud) wrote :

Setup seem incomplete, see the following traceback from nova-conductor:

2019-01-28 19:05:15.241 21 ERROR nova.conductor.manager [req-083c4907-07b0-4dc8-bb0f-dff8aaf598e4 45c77dea28a445858e29c376235a021b 96b4d23c11ea4fa8baa6f26761bb2ae8 - default default] No host-to-cell mapping foun│·
d for selected host herve.localdomain. Setup is incomplete.: nova.exception.HostMappingNotFound: Host 'herve.localdomain' is not mapped to any cell

Herve Beraud (herveberaud) wrote :

Setup seem incomplete, see the following traceback from nova-conductor:

2019-01-28 19:05:15.241 21 ERROR nova.conductor.manager [req-083c4907-07b0-4dc8-bb0f-dff8aaf598e4 45c77dea28a445858e29c376235a021b 96b4d23c11ea4fa8baa6f26761bb2ae8 - default default] No host-to-cell mapping foun│·
d for selected host herve.localdomain. Setup is incomplete.: nova.exception.HostMappingNotFound: Host 'herve.localdomain' is not mapped to any cell

During stack deployment we can see in the paunch log that the nova_cellv2_discover_hosts exit in error.
This step try to register services and it fail. See the following traceback => http://paste.openstack.org/show/744238/

We can observe that this paunch command was started at 9:31:13 and we observe that the original memcache error described in this issue occur at 09:31:22 (http://paste.openstack.org/show/744241/) and is due to the fact that services are unregistred. The paunch command exit in error at 09:36:59 certainly due to a timeout or something like that.

Herve Beraud (herveberaud) wrote :

We need to try to find why the following command fail to register services:

podman run --name nova_cellv2_discover_hosts --label config_id=tripleo_step5 --label container_name=nova_cellv2_discover_hosts --label managed_by=paunch --net=host --user=root --volume=/etc/hosts:/etc/hosts:ro --volume=/etc/localtime:/etc/localtime:ro --volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro --volume=/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro --volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro --volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro --volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro --volume=/dev/log:/dev/log --volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro --volume=/etc/puppet:/etc/puppet:ro --volume=/var/lib/config-data/nova_libvirt/etc/my.cnf.d/:/etc/my.cnf.d/:ro --volume=/var/lib/config-data/nova_libvirt/etc/nova/:/etc/nova/:ro --volume=/var/log/containers/nova:/var/log/nova --volume=/var/lib/docker-config-scripts/:/docker-config-scripts/ 192.168.122.1:5000/fedora-binary-nova-compute:ospsprint /docker-config-scripts/pyshim.sh /docker-config-scripts/nova_cell_v2_discover_host.py

Herve Beraud (herveberaud) wrote :

To reproduce:
- activate debug mode before deploying a task =>
- be sure to have the nova cache activated in the config =>
- deploy the stack by using ./nuke.sh and ./deploy.sh
- observe the log file =>

* Warning *:
We also encountered the following podman bug => https://github.com/containers/libpod/issues/2240
when we try to replay the "podman run --name nova_cellv2_discover_hosts" who try to register services.

Herve Beraud (herveberaud) wrote :

===========
* Erratum *
===========
To reproduce:
- activate debug mode before deploying a task => vim tripleo-heat-installer-templates/puppet/services/nova-base.yaml
- be sure to have the nova cache activated in the config => tripleo-heat-installer-templates/standalone_parameters.yaml
- deploy the stack by using ./nuke.sh and ./deploy.sh
- observe the log file => /var/log/paunch.log

Can be related to:
- https://review.openstack.org/#/c/632717/
- https://review.openstack.org/#/c/632097/

Herve Beraud (herveberaud) wrote :

s/deploying a task/deploying a stack/

Ben Nemec (bnemec) on 2019-01-30
Changed in oslo.cache:
status: New → Confirmed
importance: Undecided → High
Herve Beraud (herveberaud) wrote :

Trying to revert this patch (https://review.openstack.org/#/c/632097/):
$ openstack -q --os-interface internal compute service list -c Host -c Zone -f value

Now the previous command become:
$ openstack -q compute service list -c Host -c Zone -f value

And the previous described error still to occur... (service registration error):
python3 /docker-config-scripts/nova_cell_v2_discover_host.py
Could not find versioned identity endpoints when attempting to authenticate. Please check that your auth_url is correct. Unable to establish connection to http://192.168.24.2:5000: HTTPConnectionPool(host='192.168.24.2', port=5000): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f34df66dac8>: Failed to establish a new connection: [Errno 111] Connection refused',))

Also when I try to play these commands directly from the CLI they returns "Cloud standalone was not found."

Herve Beraud (herveberaud) wrote :

I'm not really sure that this issue is an oslo.cache issue.
I think the oslo.cache error described in this issue is a side effect due to service unregistered and the root cause is the service registration fail.

This does not prevent to better secure the oslo.cache code to better notify users when they meets this problem instead crash unproperly.

Thoughts?

Herve Beraud (herveberaud) wrote :

Another interesting behavior to observe is that when I run the the nova_cellv2_discover_hosts from podman and it fail to register service then the container exit and if I try to rerun a new similar container the called script "nova_cell_v2_discover_host.py" mounted from host "/var/lib/docker-config/scripts" have been removed by something (surely at the container shutdown when it facing the issue).

See the following traceback for a fully example => http://paste.openstack.org/show/744274/

Herve Beraud (herveberaud) wrote :

The previous message ^^^^ is due to a "os.unlink(__file__)" who was inside the script so no problems here.

Herve Beraud (herveberaud) wrote :

During the deployment at step 2 and especially during who start containers we facing an issue (http://paste.openstack.org/show/744353/)

TASK [Start containers for step 2] ********************************************************************

Broadcast message from <email address hidden> (Fri 2019-02-01 09:07:43 UTC):

haproxy[536412]: proxy cinder has no server available!

Broadcast message from <email address hidden> (Fri 2019-02-01 09:07:43 UTC):

haproxy[536412]: proxy glance_api has no server available!

Broadcast message from <email address hidden> (Fri 2019-02-01 09:07:43 UTC):

haproxy[536412]: proxy horizon has no server available!

Changed in oslo.cache:
assignee: nobody → Herve Beraud (herveberaud)
Herve Beraud (herveberaud) wrote :

Same kind of errors during step 3:

TASK [Start containers for step 3] *********************************************************************

Broadcast message from <email address hidden> (Fri 2019-02-01 09:45:25 UTC):

haproxy[610039]: proxy mysql has no server available!

Bogdan Dobrelya (bogdando) wrote :

A side note, wrt those haproxy backends flapping events, that should be fixed as of https://review.openstack.org/#/c/633944/

Herve Beraud (herveberaud) wrote :

Each time the main error described in this issue arrive we get the following error just before:

http://paste.openstack.org/show/744400/

2019-02-01 15:18:40.504 23 DEBUG oslo_db.sqlalchemy.engines [req-da88246a-b2fc-45e4-ae37-8b7e7b0f3704 1dbc237008334abcbf881d976b9ee3bb 7e14699fcedc4864bc81c50c8fe8ecd3 - default default] MySQL server mode set to STRICT_TRANS_TABLES,STRICT_ALL_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,TRADITIONAL,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION _check_effective_sql_mode /usr/lib/python3.6/site-packages/oslo_db/sqlalchemy/engines.py:307
2019-02-01 15:18:40.542 23 ERROR oslo_messaging.rpc.server [req-da88246a-b2fc-45e4-ae37-8b7e7b0f3704 1dbc237008334abcbf881d976b9ee3bb 7e14699fcedc4864bc81c50c8fe8ecd3 - default default] Exception during message handling: TypeError: object() takes no parameters
2019-02-01 15:18:40.542 23 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2019-02-01 15:18:40.542 23 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/nova/conductor/manager.py", line 1262, in schedule_and_build_instances
2019-02-01 15:18:40.542 23 ERROR oslo_messaging.rpc.server context, host.service_host)
2019-02-01 15:18:40.542 23 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/oslo_versionedobjects/base.py", line 184, in wrapper
2019-02-01 15:18:40.542 23 ERROR oslo_messaging.rpc.server result = fn(cls, context, *args, **kwargs)
2019-02-01 15:18:40.542 23 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/nova/objects/host_mapping.py", line 100, in get_by_host
2019-02-01 15:18:40.542 23 ERROR oslo_messaging.rpc.server db_mapping = cls._get_by_host_from_db(context, host)
2019-02-01 15:18:40.542 23 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 1012, in wrapper
2019-02-01 15:18:40.542 23 ERROR oslo_messaging.rpc.server return fn(*args, **kwargs)
2019-02-01 15:18:40.542 23 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/nova/objects/host_mapping.py", line 95, in _get_by_host_from_db
2019-02-01 15:18:40.542 23 ERROR oslo_messaging.rpc.server raise exception.HostMappingNotFound(name=host)
2019-02-01 15:18:40.542 23 ERROR oslo_messaging.rpc.server nova.exception.HostMappingNotFound: Host 'herve.localdomain' is not mapped to any cell
2019-02-01 15:18:40.542 23 ERROR oslo_messaging.rpc.server

Michele Baldessari (michele) wrote :
Download full text (3.7 KiB)

So I have done a bit of a code-walkthrough (note I have no idea about oslo nor memcached nor nova)L

1) The first exception we hit is this one:
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.
rpc.server [req-284f3071-8eee-4dcb-903c-838f2e024b48 40ca1490773f49f791d3a834af3702c8 8671bdf05abf48f58a9bdcdb0ef4b740 - default default] Exception during message handling: TypeError: object() takes no parameters
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/oslo_cache/_memcache_pool.py", line 163, in _get
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server conn = self.queue.pop().connection
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server IndexError: pop from an empty deque

The above exception comes from the following lines:
    def _get(self):
        try:
            conn = self.queue.pop().connection
        except IndexError:
            conn = self._create_connection()
        self._acquired += 1
        return conn

2)
So this was the first connection and it failed because the queue of connections is empty and so we caught the IndexError exception but we failed inside the catch branch:
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server During handling of the above exception, another exception occurred:
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
...snip...
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/oslo_cache/_memcache_pool.py", line 214, in _get
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server conn = ConnectionPool._get(self)
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/oslo_cache/_memcache_pool.py", line 165, in _get
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server conn = self._create_connection()
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/oslo_cache/_memcache_pool.py", line 206, in _create_connection
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server return _MemcacheClient(self.urls, **self._arguments)
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server TypeError: object() takes no parameters
2019-01-17 13:59:37.096 46 ERROR oslo_messaging.rpc.server

Now the reason we failed seems to me due to this sequence:
2.1) We call conn = self._create_connection()
    def _create_connection(self):
        return _MemcacheClient(self.urls, **self._arguments)

2.2) The problem seems to be in how _MemcacheClient(...) overrides stuff:
class _MemcacheClient(memcache.Client):
    """Thread global memcache client

    As client is inherited from threading.local we have to restore object
    methods overloaded by threading.local so we can reuse clients in
    different threads
    """
    __delattr__ = object.__delattr__
    __getattribute__ = object.__getattribute__
    __new__ = object.__new__
    __setattr_...

Read more...

Fix proposed to branch: master
Review: https://review.openstack.org/634457

Changed in oslo.cache:
status: Confirmed → In Progress
Herve Beraud (herveberaud) wrote :

FYI The same bug seems to also occur on rocky:
https://bugs.launchpad.net/cloud-archive/+bug/1812672

Ben Nemec (bnemec) wrote :

Looking at https://bugs.launchpad.net/cloud-archive/+bug/1812672 it appears this only happens with eventlet monkey patching in force, which explains why it doesn't reproduce in our unit tests.

I've come up with what appears to be a backportable fix (read: hack) in https://review.openstack.org/639457 I don't fully understand why one thing works in one environment and another thing in a different one, but it works (for a unit test version of "works") in the various combinations of Python and eventlet that I've tried.

Ideally we would merge that before the pymemcache switch and then rip it all out again in favor of pymemcache as soon as the backports are proposed.

Reviewed: https://review.openstack.org/639457
Committed: https://git.openstack.org/cgit/openstack/oslo.cache/commit/?id=f4a25f642991a7114b86f6eb7d0bac3d599953a6
Submitter: Zuul
Branch: master

commit f4a25f642991a7114b86f6eb7d0bac3d599953a6
Author: Ben Nemec <email address hidden>
Date: Tue Feb 26 22:12:23 2019 +0000

    Fix memcache pool client in monkey-patched environments

    First off, this is an ugly hack, but we're dealing with code that
    essentially monkey-patches a monkey-patch. You reap what you sow.

    Per the linked bug, our connection pool client explodes on python 3
    with eventlet monkey-patching in force:

    TypeError: object() takes no parameters

    This is due to the way __new__ is overridden in the class. We need
    to strip arguments from the call before they get to object(), which
    doesn't accept args.

    Unfortunately, when we're _not_ monkey-patched, adding the new
    override implementation fails with:

    TypeError: object.__new__(_MemcacheClient) is not safe,
    use Client.__new__()

    As such, we need different implementations depending on whether we
    are monkey-patched or not. This change passes both with and without
    monkey-patching and adds a unit test that exposes the bug.

    Note that this is a temporary, backportable fix that will ultimately
    be replaced by a switch to the pymemcache library which does not
    have the threading.local problem being worked around here.

    Change-Id: I039dffadeebd0ff4479b9c870c257772c43aba53
    Partial-Bug: 1812935

Ben Nemec (bnemec) wrote :

The above fix was released at the 11th hour for Stein, and Michele verified that it fixes the problem for him. We'll need to backport it at least to Rocky since it's also been reported there, but at least going forward this should no longer be a problem.

There's still a followup patch in the works to move away from this hacky solution and start using pymemcache, but at this point I'm not sure we'll be able to do that for Stein. Planning to discuss it more in the Oslo meeting on Monday.

James Page (james-page) on 2019-04-09
Changed in python-oslo.cache (Ubuntu Cosmic):
status: New → Triaged
Changed in python-oslo.cache (Ubuntu Disco):
status: New → Triaged
Changed in python-oslo.cache (Ubuntu Cosmic):
importance: Undecided → High
Changed in python-oslo.cache (Ubuntu Disco):
importance: Undecided → High
James Page (james-page) on 2019-04-09
Changed in python-oslo.cache (Ubuntu Disco):
status: Triaged → Fix Released
Corey Bryant (corey.bryant) wrote :

For Ubuntu we'll just be SRUing the current temporary backportable fix (ie. gerrit reviews seen above this comment). It may feel a bit misleading to mark disco and stein as fix released at this point but that is what we need to do as we are only really tracking the temporary backportable fix at this point. We'll certainly pick up the permanent fix once that lands in master.

tags: added: cdo-qa cdo-release-blocker foundations-engine
Changed in python-oslo.cache (Ubuntu Disco):
status: Fix Released → In Progress
status: In Progress → Fix Released
Corey Bryant (corey.bryant) wrote :

This has been uploaded to the cosmic unapproved queue where it is awaiting SRU review: https://launchpad.net/ubuntu/cosmic/+queue?queue_state=1&queue_text=python-oslo.cache

description: updated
description: updated

Hello Michele, or anyone else affected,

Accepted python-oslo.cache into cosmic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/python-oslo.cache/1.30.1-0ubuntu1.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-cosmic to verification-done-cosmic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-cosmic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in python-oslo.cache (Ubuntu Cosmic):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-cosmic
Corey Bryant (corey.bryant) wrote :

Hello Michele, or anyone else affected,

Accepted python-oslo.cache into rocky-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:rocky-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-rocky-needed to verification-rocky-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-rocky-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-rocky-needed
Jason Hobbs (jason-hobbs) wrote :

With the package 1.30.1-0ubuntu1.1 from rocky-proposed in place, I ran through the test case for this and did not hit any failures. I can verify this fixes the issue for rocky.

tags: added: verification-rocky-done
removed: verification-rocky-needed

I can confirm it works.

The rocky-proposed package python-oslo.cache-1.30.1-0ubuntu1.1 fixes bug #1822541 too.

I manually upgraded the n-c-c units on a Rocky cloud (applying the proper bindings for n-c-c & memcached according with the comment #14 of bug #1823740) to get the new python-oslo packaged installed and now n-c-c fully works.

The verification of the Stable Release Update for python-oslo.cache has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

James Page (james-page) wrote :

This bug was fixed in the package python-oslo.cache - 1.30.1-0ubuntu1.1~cloud0
---------------

 python-oslo.cache (1.30.1-0ubuntu1.1~cloud0) bionic-rocky; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 python-oslo.cache (1.30.1-0ubuntu1.1) cosmic; urgency=medium
 .
   * d/gbp.conf: Create stable/rocky branch.
   * d/p/fix-memcache-pool-client-in-monkey-patched-environme.patch:
     Cherry-picked from upstream gerrit review to fix Py3 oslo cache
     mempool issue (https://review.openstack.org/#/c/640500) (LP: #1812935).

Reviewed: https://review.opendev.org/640500
Committed: https://git.openstack.org/cgit/openstack/oslo.cache/commit/?id=caf5443de8ff4ff353d741a125f782c67f579f52
Submitter: Zuul
Branch: stable/rocky

commit caf5443de8ff4ff353d741a125f782c67f579f52
Author: Ben Nemec <email address hidden>
Date: Tue Feb 26 22:12:23 2019 +0000

    Fix memcache pool client in monkey-patched environments

    First off, this is an ugly hack, but we're dealing with code that
    essentially monkey-patches a monkey-patch. You reap what you sow.

    Per the linked bug, our connection pool client explodes on python 3
    with eventlet monkey-patching in force:

    TypeError: object() takes no parameters

    This is due to the way __new__ is overridden in the class. We need
    to strip arguments from the call before they get to object(), which
    doesn't accept args.

    Unfortunately, when we're _not_ monkey-patched, adding the new
    override implementation fails with:

    TypeError: object.__new__(_MemcacheClient) is not safe,
    use Client.__new__()

    As such, we need different implementations depending on whether we
    are monkey-patched or not. This change passes both with and without
    monkey-patching and adds a unit test that exposes the bug.

    Note that this is a temporary, backportable fix that will ultimately
    be replaced by a switch to the pymemcache library which does not
    have the threading.local problem being worked around here.

    Change-Id: I039dffadeebd0ff4479b9c870c257772c43aba53
    Partial-Bug: 1812935
    Closes-Bug: 1812672
    (cherry picked from commit f4a25f642991a7114b86f6eb7d0bac3d599953a6)

tags: added: in-stable-rocky

Reviewed: https://review.opendev.org/655468
Committed: https://git.openstack.org/cgit/openstack/oslo.cache/commit/?id=38dcffa904c7d4b5bd61896b972c1e16477fcc56
Submitter: Zuul
Branch: stable/queens

commit 38dcffa904c7d4b5bd61896b972c1e16477fcc56
Author: Ben Nemec <email address hidden>
Date: Tue Feb 26 22:12:23 2019 +0000

    Fix memcache pool client in monkey-patched environments

    First off, this is an ugly hack, but we're dealing with code that
    essentially monkey-patches a monkey-patch. You reap what you sow.

    Per the linked bug, our connection pool client explodes on python 3
    with eventlet monkey-patching in force:

    TypeError: object() takes no parameters

    This is due to the way __new__ is overridden in the class. We need
    to strip arguments from the call before they get to object(), which
    doesn't accept args.

    Unfortunately, when we're _not_ monkey-patched, adding the new
    override implementation fails with:

    TypeError: object.__new__(_MemcacheClient) is not safe,
    use Client.__new__()

    As such, we need different implementations depending on whether we
    are monkey-patched or not. This change passes both with and without
    monkey-patching and adds a unit test that exposes the bug.

    Note that this is a temporary, backportable fix that will ultimately
    be replaced by a switch to the pymemcache library which does not
    have the threading.local problem being worked around here.

    Change-Id: I039dffadeebd0ff4479b9c870c257772c43aba53
    Partial-Bug: 1812935
    Closes-Bug: 1812672
    (cherry picked from commit f4a25f642991a7114b86f6eb7d0bac3d599953a6)
    (cherry picked from commit caf5443de8ff4ff353d741a125f782c67f579f52)

tags: added: in-stable-queens
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.