Zun

VM and zun container errors after host reboot

Bug #1850936 reported by BN
Affects: Zun
Status: Fix Committed
Importance: High
Assigned to: hongbin
Milestone: (none)

Bug Description

**Bug Report**

What happened:

Multinode OpenStack deployed with the Zun service. The private (demo-net) and public networks were created using the init-runonce script. An instance was created and started. A Zun container was created and started without specifying a network; the container started successfully and was assigned to the private demo-net network. However, in `docker network ls` the network name appears as 7629c76e6b80443e033554fb9f3098937e311934e2650586f7c895a64bebcd75 (network ID 7629c76e6b80). Everything was working fine and I could create further instances and containers as well. After the hosts were rebooted, errors started coming up: some instances could not be started, and some containers could not be started either, with the following errors:

VM (nova-conductor.log) -

2019-11-01 16:03:05.919 19 ERROR nova.conductor.manager [req-0ef3fa2c-1db7-4588-9e30-df729a6c64dd db41ed54317a4f6e96ebbaf14a750ba0 02f1fcd1831845ff9c89cdb6906d052e - default default] Failed to schedule instances: MessagingTimeout: Timed out waiting for a reply to message ID 06e283038df54a43a4dc21626eff0b58
2019-11-01 16:03:05.919 19 ERROR nova.conductor.manager Traceback (most recent call last):
2019-11-01 16:03:05.919 19 ERROR nova.conductor.manager File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/conductor/manager.py", line 1356, in schedule_and_build_instances
2019-11-01 16:03:05.919 19 ERROR nova.conductor.manager instance_uuids, return_alternates=True)
2019-11-01 16:03:05.919 19 ERROR nova.conductor.manager File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/conductor/manager.py", line 810, in _schedule_instances
2019-11-01 16:03:05.919 19 ERROR nova.conductor.manager return_alternates=return_alternates)
2019-11-01 16:03:05.919 19 ERROR nova.conductor.manager File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/scheduler/client/query.py", line 42, in select_destinations
2019-11-01 16:03:05.919 19 ERROR nova.conductor.manager instance_uuids, return_objects, return_alternates)
2019-11-01 16:03:05.919 19 ERROR nova.conductor.manager File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/scheduler/rpcapi.py", line 160, in select_destinations
2019-11-01 16:03:05.919 19 ERROR nova.conductor.manager return cctxt.call(ctxt, 'select_destinations', **msg_args)
2019-11-01 16:03:05.919 19 ERROR nova.conductor.manager File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 178, in call
2019-11-01 16:03:05.919 19 ERROR nova.conductor.manager retry=self.retry)
2019-11-01 16:03:05.919 19 ERROR nova.conductor.manager File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_messaging/transport.py", line 128, in _send
2019-11-01 16:03:05.919 19 ERROR nova.conductor.manager retry=retry)
2019-11-01 16:03:05.919 19 ERROR nova.conductor.manager File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 645, in send
2019-11-01 16:03:05.919 19 ERROR nova.conductor.manager call_monitor_timeout, retry=retry)
2019-11-01 16:03:05.919 19 ERROR nova.conductor.manager File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 634, in _send
2019-11-01 16:03:05.919 19 ERROR nova.conductor.manager call_monitor_timeout)
2019-11-01 16:03:05.919 19 ERROR nova.conductor.manager File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 520, in wait
2019-11-01 16:03:05.919 19 ERROR nova.conductor.manager message = self.waiters.get(msg_id, timeout=timeout)
2019-11-01 16:03:05.919 19 ERROR nova.conductor.manager File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 397, in get
2019-11-01 16:03:05.919 19 ERROR nova.conductor.manager 'to message ID %s' % msg_id)
2019-11-01 16:03:05.919 19 ERROR nova.conductor.manager MessagingTimeout: Timed out waiting for a reply to message ID 06e283038df54a43a4dc21626eff0b58
2019-11-01 16:03:05.919 19 ERROR nova.conductor.manager
2019-11-01 16:03:06.514 19 WARNING nova.scheduler.utils [req-0ef3fa2c-1db7-4588-9e30-df729a6c64dd db41ed54317a4f6e96ebbaf14a750ba0 02f1fcd1831845ff9c89cdb6906d052e - default default] Failed to compute_task_build_instances: Timed out waiting for a reply to message ID 06e283038df54a43a4dc21626eff0b58: MessagingTimeout: Timed out waiting for a reply to message ID 06e283038df54a43a4dc21626eff0b58
2019-11-01 16:03:06.519 19 WARNING nova.scheduler.utils [req-0ef3fa2c-1db7-4588-9e30-df729a6c64dd db41ed54317a4f6e96ebbaf14a750ba0 02f1fcd1831845ff9c89cdb6906d052e - default default] [instance: 7d4f4b95-515a-4efc-85ef-9d6019c0a34d] Setting instance to ERROR state.: MessagingTimeout: Timed out waiting for a reply to message ID 06e283038df54a43a4dc21626eff0b58

Zun (zun-compute.log) -

2019-11-01 16:09:04.272 6 ERROR zun.compute.manager [req-3b945ea8-db1d-42e3-bfeb-0746877cd711 db41ed54317a4f6e96ebbaf14a750ba0 02f1fcd1831845ff9c89cdb6906d052e default - -] Unexpected exception: Cannot act on container in 'Error' state: Conflict: Cannot act on container in 'Error' state
2019-11-01 16:09:04.272 6 ERROR zun.compute.manager Traceback (most recent call last):
2019-11-01 16:09:04.272 6 ERROR zun.compute.manager File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/zun/compute/manager.py", line 730, in container_logs
2019-11-01 16:09:04.272 6 ERROR zun.compute.manager since=since)
2019-11-01 16:09:04.272 6 ERROR zun.compute.manager File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/zun/common/utils.py", line 243, in decorated_function
2019-11-01 16:09:04.272 6 ERROR zun.compute.manager return function(*args, **kwargs)
2019-11-01 16:09:04.272 6 ERROR zun.compute.manager File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/zun/container/docker/driver.py", line 99, in decorated_function
2019-11-01 16:09:04.272 6 ERROR zun.compute.manager handle_not_found(e, context, container)
2019-11-01 16:09:04.272 6 ERROR zun.compute.manager File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/zun/container/docker/driver.py", line 86, in handle_not_found
2019-11-01 16:09:04.272 6 ERROR zun.compute.manager "Cannot act on container in '%s' state") % container.status)
2019-11-01 16:09:04.272 6 ERROR zun.compute.manager Conflict: Cannot act on container in 'Error' state
2019-11-01 16:09:04.272 6 ERROR zun.compute.manager
2019-11-01 16:09:04.275 6 ERROR oslo_messaging.rpc.server [req-3b945ea8-db1d-42e3-bfeb-0746877cd711 db41ed54317a4f6e96ebbaf14a750ba0 02f1fcd1831845ff9c89cdb6906d052e default - -] Exception during message handling: Conflict: Cannot act on container in 'Error' state
2019-11-01 16:09:04.275 6 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2019-11-01 16:09:04.275 6 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 166, in _process_incoming
2019-11-01 16:09:04.275 6 ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
2019-11-01 16:09:04.275 6 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 265, in dispatch
2019-11-01 16:09:04.275 6 ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
2019-11-01 16:09:04.275 6 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 194, in _do_dispatch
2019-11-01 16:09:04.275 6 ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
2019-11-01 16:09:04.275 6 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/zun/common/utils.py", line 222, in decorated_function
2019-11-01 16:09:04.275 6 ERROR oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
2019-11-01 16:09:04.275 6 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/zun/compute/manager.py", line 730, in container_logs
2019-11-01 16:09:04.275 6 ERROR oslo_messaging.rpc.server since=since)
2019-11-01 16:09:04.275 6 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/zun/common/utils.py", line 243, in decorated_function
2019-11-01 16:09:04.275 6 ERROR oslo_messaging.rpc.server return function(*args, **kwargs)
2019-11-01 16:09:04.275 6 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/zun/container/docker/driver.py", line 99, in decorated_function
2019-11-01 16:09:04.275 6 ERROR oslo_messaging.rpc.server handle_not_found(e, context, container)
2019-11-01 16:09:04.275 6 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/zun/container/docker/driver.py", line 86, in handle_not_found
2019-11-01 16:09:04.275 6 ERROR oslo_messaging.rpc.server "Cannot act on container in '%s' state") % container.status)
2019-11-01 16:09:04.275 6 ERROR oslo_messaging.rpc.server Conflict: Cannot act on container in 'Error' state
2019-11-01 16:09:04.275 6 ERROR oslo_messaging.rpc.server

Moreover, I tried to create a new instance and got the same error:

Traceback (most recent call last):
  File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/conductor/manager.py", line 1356, in schedule_and_build_instances
    instance_uuids, return_alternates=True)
  File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/conductor/manager.py", line 810, in _schedule_instances
    return_alternates=return_alternates)
  File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/scheduler/client/query.py", line 42, in select_destinations
    instance_uuids, return_objects, return_alternates)
  File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/scheduler/rpcapi.py", line 160, in select_destinations
    return cctxt.call(ctxt, 'select_destinations', **msg_args)
  File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 178, in call
    retry=self.retry)
  File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_messaging/transport.py", line 128, in _send
    retry=retry)
  File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 645, in send
    call_monitor_timeout, retry=retry)
  File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 634, in _send
    call_monitor_timeout)
  File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 520, in wait
    message = self.waiters.get(msg_id, timeout=timeout)
  File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 397, in get
    'to message ID %s' % msg_id)
MessagingTimeout: Timed out waiting for a reply to message ID 06e283038df54a43a4dc21626eff0b58

However, when I created a new container, it started without any issues; it was only the old containers that showed the errors.

In conclusion, after I ran `kolla-ansible -i multinode reconfigure` and it finished without any issues, I was able to create and start new VM instances. However, I still could not start the instances that had shown errors after the host reboot, so they cannot be restored. I also could not start the containers that had been showing errors after the reboot, although I could create new containers, just as before the reconfiguration.
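In other words, the recovery attempt was roughly the following (a sketch; the IDs are placeholders and the image/flavor/network names follow the init-runonce defaults):

$ kolla-ansible -i multinode reconfigure
$ openstack server create --image cirros --flavor m1.tiny --network demo-net new-vm   # works
$ openstack server start <old-errored-instance-id>          # still fails, instance stays in error
$ openstack appcontainer start <old-errored-container-id>   # still fails with the Conflict error above
$ openstack appcontainer run --name new-c cirros ping 8.8.8.8   # works, as before the reconfigure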

P.S. Maybe I did not configure the networks correctly; however, I could not find documentation on the right way to prepare the environment/networking for Zun and Nova so that they work together without any issues.

Thank you

What you expected to happen:

How to reproduce it (minimal and precise): Create an instance in a private network. Create a container without specifying a network. Reboot the hosts. Check the results.
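A minimal command sequence for the above (a sketch; the network, image and flavor names follow the init-runonce defaults and are assumptions):

$ openstack server create --image cirros --flavor m1.tiny --network demo-net vm1
$ openstack appcontainer run --name c1 cirros ping 8.8.8.8    # note: no --net option given
$ # reboot all hosts, then check the results:
$ openstack server list
$ openstack appcontainer list
$ docker network ls    # the Neutron network shows up under its long hash name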

**Environment**:
* OS (e.g. from /etc/os-release): Ubuntu 18.04.3 LTS
* Kernel (e.g. `uname -a`): 4.15.0-45-generic #48-Ubuntu SMP Tue Jan 29 16:28:13 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
* Docker version if applicable (e.g. `docker version`): 19.03.4
* Kolla-Ansible version (e.g. `git head or tag or stable branch` or pip package version if using release): 2.8
* Docker image Install type (source/binary): source
* Docker image distribution: ubuntu
* Are you using official images from Docker Hub or self built? official

* Share your inventory file, globals.yml and other configuration files if relevant

kolla_base_distro: "ubuntu"
kolla_install_type: "source"
openstack_release: "stein"
kolla_internal_vip_address: "10.0.225.254"
network_interface: "enp2s0f0"
neutron_external_interface: "enp2s0f1"
enable_barbican: "yes"
enable_cinder: "yes"
enable_cinder_backup: "yes"
enable_fluentd: "yes"
enable_zun: "yes"
enable_kuryr: "yes"
enable_etcd: "yes"
docker_configure_for_zun: "yes"
enable_magnum: "yes"
enable_ceph: "no"
glance_backend_ceph: "yes"
cinder_backend_ceph: "yes"
nova_backend_ceph: "yes"
glance_enable_rolling_upgrade: "no"
barbican_crypto_plugin: "simple_crypto"
barbican_library_path: "/usr/lib/libCryptoki2_64.so"
ironic_dnsmasq_dhcp_range:
tempest_image_id:
tempest_flavor_ref_id:
tempest_public_network_id:
tempest_floating_network_name:
horizon_port: 48000

----

[control]
localhost ansible_connection=local become=true

[network]
localhost ansible_connection=local become=true

[compute]
localhost ansible_connection=local become=true
10.0.2.1 ansible_user=root ansible_become=true
10.0.3.1 ansible_user=root ansible_become=true

[monitoring]
localhost ansible_connection=local become=true

[storage]
localhost ansible_connection=local become=true
10.0.2.1 ansible_user=root ansible_become=true
10.0.3.1 ansible_user=root ansible_become=true

[deployment]
localhost ansible_connection=local become=true

Radosław Piliszek (yoctozepto) wrote :

Thanks for a comprehensive bug report. Notifying Zun about a possible bug. The error from Nova Conductor, however, suggests the issue lies somewhere else. Could you check that all containers are up on all the nodes (`docker ps -a`) and not constantly restarting or down? Then attach the logs from the broken containers.
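For example, on each node, something like:

$ docker ps -a --filter status=restarting
$ docker ps -a --filter status=exited
$ docker logs --tail 100 <container_name>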

Changed in kolla-ansible:
status: New → Incomplete
BN (zatoichy) wrote :

Hi Radoslaw,

I have rebooted the hosts again. All Docker containers are up and running, except that the chrony container keeps restarting.

hongbin (hongbin034)
Changed in zun:
status: New → Triaged
importance: Undecided → High
assignee: nobody → hongbin (hongbin034)
hongbin (hongbin034) wrote :

On the Zun side, this should be fixed once https://review.opendev.org/#/c/696779/ is merged.

hongbin (hongbin034) wrote :

As I mentioned in comment #3, I changed the status to 'Fix Committed'. Feel free to reset the status if https://review.opendev.org/#/c/696779/ does not address the problem.

Changed in zun:
status: Triaged → Fix Committed
BN (zatoichy) wrote :

I would like to confirm that with ubuntu:source:train this is still not fixed; the kolla/ubuntu-source-kuryr-libnetwork:train container needs to be restarted manually if you want to run containers without any errors after everything is up and running.
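The manual workaround is roughly the following (a sketch; the exact container name may differ between deployments):

$ docker restart kuryr_libnetwork
$ openstack appcontainer start <container-uuid>   # works again after the restart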

hongbin (hongbin034) wrote :

OK, let me backport the patch all the way to train.

no longer affects: kolla-ansible
BN (zatoichy) wrote :

To hongbin:

Also, I have set "Restart Always" on all my containers, and after the reboot they were not restarted and were stuck in the "Error/Stopped" state.
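For reference, this is roughly how the policy was applied on the Docker side (a sketch; container names are placeholders):

$ docker update --restart always <container_name>
$ docker inspect -f '{{.HostConfig.RestartPolicy.Name}}' <container_name>   # should print "always"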
