FATAL | Check Keystone project status | undercloud | item=service failing on fs001, fs035 and fs039

Bug #1918366 reported by Bhagyashri Shewale
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
tripleo
Triaged
High
Unassigned

Bug Description

2021-03-09 12:12:22 | 2021-03-09 12:12:22.596214 | fa163e38-77ee-9ad0-e896-000000009502 | WAITING | Check Keystone project status | undercloud | 1 retries left
2021-03-09 12:12:27 | 2021-03-09 12:12:27.860996 | fa163e38-77ee-9ad0-e896-000000009502 | FATAL | Check Keystone project status | undercloud | item=service | error={"ansible_job_id": "579585367263.345415", "ansible_loop_var": "tripleo_keystone_resources_project_async_result_item", "attempts": 30, "changed": false, "finished": 0, "started": 1, "tripleo_keystone_resources_project_async_result_item": {"ansible_job_id": "579585367263.345415", "ansible_loop_var": "tripleo_keystone_resources_project", "changed": true, "failed": false, "finished": 0, "results_file": "/root/.ansible_async/579585367263.345415", "started": 1, "tripleo_keystone_resources_project": "service"}}
2021-03-09 12:12:27 | 2021-03-09 12:12:27.863547 | fa163e38-77ee-9ad0-e896-000000009502 | TIMING | tripleo_keystone_resources : Check Keystone project status | undercloud | 1:03:07.306702 | 205.05s
2021-03-09 12:12:27 | 2021-03-09 12:12:27.866493 | fa163e38-77ee-9ad0-e896-000000009502 | TIMING | tripleo_keystone_resources : Check Keystone project status | undercloud | 1:03:07.309700 | 205.06s
2021-03-09 12:12:27 |
2021-03-09 12:12:27 | PLAY RECAP *********************************************************************
2021-03-09 12:12:27 | overcloud-controller-0 : ok=458 changed=217 unreachable=0 failed=0 skipped=190 rescued=0 ignored=0
2021-03-09 12:12:27 | overcloud-controller-1 : ok=446 changed=208 unreachable=0 failed=0 skipped=189 rescued=0 ignored=0
2021-03-09 12:12:27 | overcloud-controller-2 : ok=443 changed=208 unreachable=0 failed=0 skipped=189 rescued=0 ignored=0
2021-03-09 12:12:27 | overcloud-novacompute-0 : ok=348 changed=149 unreachable=0 failed=0 skipped=171 rescued=0 ignored=0
2021-03-09 12:12:27 | undercloud : ok=26 changed=8 unreachable=0 failed=1 skipped=3 rescued=0 ignored=0
2021-03-09 12:12:27 | 2021-03-09 12:12:27.910846 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Summary Information ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2021-03-09 12:12:27 | 2021-03-09 12:12:27.911978 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Total Tasks: 1500 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2021-03-09 12:12:27 | 2021-03-09 12:12:27.912528 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Elapsed Time: 1:03:07.355729 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2021-03-09 12:12:27 | 2021-03-09 12:12:27.913018 | UUID | Info | Host | Task Name | Run Time
2021-03-09 12:12:27 | 2021-03-09 12:12:27.913518 | fa163e38-77ee-9ad0-e896-000000003267 | SUMMARY | undercloud | tripleo_container_image_prepare : Container image prepare | 1362.88s
2021-03-09 12:12:27 | 2021-03-09 12:12:27.914073 | fa163e38-77ee-9ad0-e896-000000003d45 | SUMMARY | overcloud-controller-2 | Wait for puppet host configuration to finish | 284.45s
2021-03-09 12:12:27 | 2021-03-09 12:12:27.914583 | fa163e38-77ee-9ad0-e896-0000000035d8 | SUMMARY | overcloud-controller-0 | Pre-fetch all the containers | 254.36s
2021-03-09 12:12:27 | 2021-03-09 12:12:27.915055 | fa163e38-77ee-9ad0-e896-0000000034da | SUMMARY | overcloud-controller-1 | Pre-fetch all the containers | 252.88s
2021-03-09 12:12:27 | 2021-03-09 12:12:27.915460 | fa163e38-77ee-9ad0-e896-000000009502 | SUMMARY | undercloud | tripleo_keystone_resources : Check Keystone project status | 205.06s
2021-03-09 12:12:27 | 2021-03-09 12:12:27.915901 | fa163e38-77ee-9ad0-e896-000000008d34 | SUMMARY | overcloud-controller-0 | tripleo_container_manage : Check podman create status | 188.93s
2021-03-09 12:12:27 | 2021-03-09 12:12:27.916390 | fa163e38-77ee-9ad0-e896-0000000056fb | SUMMARY | overcloud-controller-0 | tripleo_ha_wrapper : Run init bundle puppet on the host for haproxy | 161.81s
2021-03-09 12:12:27 | 2021-03-09 12:12:27.917033 | fa163e38-77ee-9ad0-e896-000000003559 | SUMMARY | overcloud-controller-2 | Pre-fetch all the containers | 143.98s
2021-03-09 12:12:27 | 2021-03-09 12:12:27.917461 | fa163e38-77ee-9ad0-e896-00000000942c | SUMMARY | undercloud | Clean up legacy Cinder keystone catalog entries | 114.05s
2021-03-09 12:12:27 | 2021-03-09 12:12:27.918084 | fa163e38-77ee-9ad0-e896-00000000348d | SUMMARY | overcloud-controller-0 | tripleo_container_tag : Pull 192.168.24.1:8787/tripleomaster/openstack-cinder-volume:dad45a3fb5cf1366c41134a9c25dffdb-updated-20210309094709 image | 107.34s
2021-03-09 12:12:27 | 2021-03-09 12:12:27.918532 | fa163e38-77ee-9ad0-e896-000000003419 | SUMMARY | overcloud-novacompute-0 | Pre-fetch all the containers | 103.92s
2021-03-09 12:12:27 | 2021-03-09 12:12:27.919043 | fa163e38-77ee-9ad0-e896-000000003df3 | SUMMARY | overcloud-controller-1 | Wait for puppet host configuration to finish | 85.76s
2021-03-09 12:12:27 | 2021-03-09 12:12:27.919445 | fa163e38-77ee-9ad0-e896-000000009128 | SUMMARY | overcloud-controller-0 | tripleo_container_manage : Check podman create status | 85.46s
2021-03-09 12:12:27 | 2021-03-09 12:12:27.919852 | fa163e38-77ee-9ad0-e896-00000000348d | SUMMARY | overcloud-controller-1 | tripleo_container_tag : Pull 192.168.24.1:8787/tripleomaster/openstack-cinder-volume:dad45a3fb5cf1366c41134a9c25dffdb-updated-20210309094709 image | 83.95s
2021-03-09 12:12:27 | 2021-03-09 12:12:27.920319 | fa163e38-77ee-9ad0-e896-0000000071d4 | SUMMARY | overcloud-controller-0 | tripleo_ha_wrapper : Run init bundle puppet on the host for ovn_dbs | 75.99s
2021-03-09 12:12:27 | 2021-03-09 12:12:27.920746 | fa163e38-77ee-9ad0-e896-000000003e31 | SUMMARY | overcloud-controller-0 | Wait for puppet host configuration to finish | 73.66s
2021-03-09 12:12:27 | 2021-03-09 12:12:27.921245 | fa163e38-77ee-9ad0-e896-00000000570b | SUMMARY | overcloud-controller-0 | tripleo_ha_wrapper : Run init bundle puppet on the host for mysql | 55.20s
2021-03-09 12:12:27 | 2021-03-09 12:12:27.921714 | fa163e38-77ee-9ad0-e896-00000000571d | SUMMARY | overcloud-controller-0 | tripleo_ha_wrapper : Run init bundle puppet on the host for oslo_messaging_rpc | 53.25s
2021-03-09 12:12:27 | 2021-03-09 12:12:27.922163 | fa163e38-77ee-9ad0-e896-00000000460a | SUMMARY | overcloud-controller-0 | tripleo_container_manage : Check containers status | 53.13s
2021-03-09 12:12:27 | 2021-03-09 12:12:27.922595 | fa163e38-77ee-9ad0-e896-0000000043f2 | SUMMARY | overcloud-controller-1 | tripleo_container_manage : Check containers status | 45.96s
2021-03-09 12:12:27 | 2021-03-09 12:12:27.923167 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ End Summary Information ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2021-03-09 12:12:27 | 2021-03-09 12:12:27.923647 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ State Information ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2021-03-09 12:12:27 | 2021-03-09 12:12:27.924072 | ~~~~~~~~~~~~~~~~~~ Number of nodes which did not deploy successfully: 1 ~~~~~~~~~~~~~~~~~
2021-03-09 12:12:27 | 2021-03-09 12:12:27.924578 | The following node(s) had failures: undercloud
2021-03-09 12:12:27 | 2021-03-09 12:12:27.925014 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2021-03-09 12:12:31 | 2021-03-09 12:12:31.298 167893 INFO tripleoclient.utils.utils [-] Temporary directory [ /tmp/tripleo46t0s1ji ] cleaned up
2021-03-09 12:12:31 | 2021-03-09 12:12:31.299 167893 ERROR tripleoclient.utils.utils [-] Ansible execution failed. playbook: /home/zuul/overcloud-deploy-overcloud/config-download/overcloud/deploy_steps_playbook.yaml, Run Status: failed, Return Code: 2, To rerun the failed command manually execute the following script: /home/zuul/overcloud-deploy-overcloud/config-download/ansible-playbook-command.sh
2021-03-09 12:12:31 | 2021-03-09 12:12:31.308 167893 WARNING tripleoclient.utils.safe_write [-] The output file /home/zuul/overcloud-deploy-overcloud/overcloud-deployment_status.yaml will be overriden: RuntimeError: Ansible execution failed. playbook: /home/zuul/overcloud-deploy-overcloud/config-download/overcloud/deploy_steps_playbook.yaml, Run Status: failed, Return Code: 2, To rerun the failed command manually execute the following script: /home/zuul/overcloud-deploy-overcloud/config-download/ansible-playbook-command.sh
2021-03-09 12:12:31 | Host 10.0.0.5 not found in /home/zuul/.ssh/known_hosts
2021-03-09 12:12:34 | Overcloud Endpoint: https://10.0.0.5:13000
2021-03-09 12:12:34 | Overcloud Horizon Dashboard URL: https://10.0.0.5:443/dashboard
2021-03-09 12:12:34 | Overcloud rc file: /home/zuul/overcloud-deploy-overcloud/overcloudrc and /home/zuul/overcloudrc
2021-03-09 12:12:34 | Overcloud Deployed with error
2021-03-09 12:12:34 | 2021-03-09 12:12:34.903 167893 ERROR tripleoclient.v1.overcloud_deploy.DeployOvercloud [-] Exception occured while running the command: RuntimeError: Ansible execution failed. playbook: /home/zuul/overcloud-deploy-overcloud/config-download/overcloud/deploy_steps_playbook.yaml, Run Status: failed, Return Code: 2, To rerun the failed command manually execute the following script: /home/zuul/overcloud-deploy-overcloud/config-download/ansible-playbook-command.sh
2021-03-09 12:12:34 | 2021-03-09 12:12:34.903 167893 ERROR tripleoclient.v1.overcloud_deploy.DeployOvercloud Traceback (most recent call last):
2021-03-09 12:12:34 | 2021-03-09 12:12:34.903 167893 ERROR tripleoclient.v1.overcloud_deploy.DeployOvercloud File "/usr/lib/python3.6/site-packages/tripleoclient/command.py", line 34, in run
2021-03-09 12:12:34 | 2021-03-09 12:12:34.903 167893 ERROR tripleoclient.v1.overcloud_deploy.DeployOvercloud super(Command, self).run(parsed_args)
2021-03-09 12:12:34 | 2021-03-09 12:12:34.903 167893 ERROR tripleoclient.v1.overcloud_deploy.DeployOvercloud File "/usr/lib/python3.6/site-packages/osc_lib/command/command.py", line 39, in run
2021-03-09 12:12:34 | 2021-03-09 12:12:34.903 167893 ERROR tripleoclient.v1.overcloud_deploy.DeployOvercloud return super(Command, self).run(parsed_args)
2021-03-09 12:12:34 | 2021-03-09 12:12:34.903 167893 ERROR tripleoclient.v1.overcloud_deploy.DeployOvercloud File "/usr/lib/python3.6/site-packages/cliff/command.py", line 185, in run
2021-03-09 12:12:34 | 2021-03-09 12:12:34.903 167893 ERROR tripleoclient.v1.overcloud_deploy.DeployOvercloud return_code = self.take_action(parsed_args) or 0
2021-03-09 12:12:34 | 2021-03-09 12:12:34.903 167893 ERROR tripleoclient.v1.overcloud_deploy.DeployOvercloud File "/usr/lib/python3.6/site-packages/tripleoclient/v1/overcloud_deploy.py", line 1126, in take_action
2021-03-09 12:12:34 | 2021-03-09 12:12:34.903 167893 ERROR tripleoclient.v1.overcloud_deploy.DeployOvercloud raise(deploy_trace)
2021-03-09 12:12:34 | 2021-03-09 12:12:34.903 167893 ERROR tripleoclient.v1.overcloud_deploy.DeployOvercloud File "/usr/lib/python3.6/site-packages/tripleoclient/v1/overcloud_deploy.py", line 1050, in take_action
2021-03-09 12:12:34 | 2021-03-09 12:12:34.903 167893 ERROR tripleoclient.v1.overcloud_deploy.DeployOvercloud setup_only=parsed_args.setup_only
2021-03-09 12:12:34 | 2021-03-09 12:12:34.903 167893 ERROR tripleoclient.v1.overcloud_deploy.DeployOvercloud File "/usr/lib/python3.6/site-packages/tripleoclient/workflows/deployment.py", line 439, in config_download
2021-03-09 12:12:34 | 2021-03-09 12:12:34.903 167893 ERROR tripleoclient.v1.overcloud_deploy.DeployOvercloud forks=forks
2021-03-09 12:12:34 | 2021-03-09 12:12:34.903 167893 ERROR tripleoclient.v1.overcloud_deploy.DeployOvercloud File "/usr/lib/python3.6/site-packages/tripleoclient/utils.py", line 717, in run_ansible_playbook
2021-03-09 12:12:34 | 2021-03-09 12:12:34.903 167893 ERROR tripleoclient.v1.overcloud_deploy.DeployOvercloud raise RuntimeError(err_msg)
2021-03-09 12:12:34 | 2021-03-09 12:12:34.903 167893 ERROR tripleoclient.v1.overcloud_deploy.DeployOvercloud RuntimeError: Ansible execution failed. playbook: /home/zuul/overcloud-deploy-overcloud/config-download/overcloud/deploy_steps_playbook.yaml, Run Status: failed, Return Code: 2, To rerun the failed command manually execute the following script: /home/zuul/overcloud-deploy-overcloud/config-download/ansible-playbook-command.sh
2021-03-09 12:12:34 | 2021-03-09 12:12:34.903 167893 ERROR tripleoclient.v1.overcloud_deploy.DeployOvercloud 
2021-03-09 12:12:34 | 2021-03-09 12:12:34.956 167893 ERROR openstack [-] Ansible execution failed. playbook: /home/zuul/overcloud-deploy-overcloud/config-download/overcloud/deploy_steps_playbook.yaml, Run Status: failed, Return Code: 2, To rerun the failed command manually execute the following script: /home/zuul/overcloud-deploy-overcloud/config-download/ansible-playbook-command.sh: RuntimeError: Ansible execution failed. playbook: /home/zuul/overcloud-deploy-overcloud/config-download/overcloud/deploy_steps_playbook.yaml, Run Status: failed, Return Code: 2, To rerun the failed command manually execute the following script: /home/zuul/overcloud-deploy-overcloud/config-download/ansible-playbook-command.sh
2021-03-09 12:12:34 | 2021-03-09 12:12:34.956 167893 INFO osc_lib.shell [-] END return value: 1
2021-03-09 12:12:37 | + status_code=1
2021-03-09 12:12:37 | + openstack stack list
2021-03-09 12:12:37 | + grep -q overcloud
2021-03-09 12:12:44 | + openstack stack list
2021-03-09 12:12:44 | + grep -Eq '(CREATE|UPDATE)_COMPLETE'
2021-03-09 12:12:47 | + openstack overcloud status --plan overcloud
2021-03-09 12:12:47 | + grep -Eq DEPLOY_SUCCESS
2021-03-09 12:12:53 | + openstack overcloud failures --plan overcloud
2021-03-09 12:12:56 | + true
2021-03-09 12:12:56 | + exit 1

Affected jobs:

1. periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master
2. periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-master
3. periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp_1supp-featureset039-master

#links:

1. https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master/6b98df3/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

2. https://review.rdoproject.org/zuul/builds?pipeline=openstack-periodic-integration-main#

3. https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master/6b98df3/job-output.txt

4. https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master/6b98df3/logs/undercloud/var/log/extra/errors.txt.txt.gz

Tags: alert ci
Revision history for this message
Sandeep Yadav (sandeepyadav93) wrote :
Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :
description: updated
Revision history for this message
Javier Peña (jpena-c) wrote :

I have been checking the logs for the neutron db sync issue. While checking the logs, we see that jobs running on host-id 144026dc3a6267aa9593dce15cccee121a718779ba75fb6ce23b65e7 tend to exhibit this behavior. Here are some examples, all on the same host-id:

- https://logserver.rdoproject.org/51/29351/23/check/periodic-tripleo-ci-centos-8-scenario010-standalone-master/cc0c0bf/logs/undercloud/var/log/containers/stdouts/neutron_db_sync.log.txt.gz
- https://logserver.rdoproject.org/59/32159/4/check/periodic-tripleo-ci-centos-8-scenario010-ovn-provider-standalone-ussuri/28b571c/logs/undercloud/var/log/containers/stdouts/neutron_db_sync.log.txt.gz
- https://logserver.rdoproject.org/59/32159/4/check/periodic-tripleo-ci-centos-8-scenario010-standalone-ussuri/ac89ed2/logs/undercloud/var/log/containers/stdouts/neutron_db_sync.log.txt.gz
- https://logserver.rdoproject.org/60/31460/8/check/periodic-tripleo-ci-centos-8-standalone-full-tempest-api-victoria/0799ef8/logs/undercloud/var/log/containers/stdouts/neutron_db_sync.log.txt.gz
- https://logserver.rdoproject.org/62/29562/5/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-ussuri/731a24c/logs/undercloud/var/log/containers/stdouts/neutron_db_sync.log.txt.gz
- https://logserver.rdoproject.org/89/32289/1/check/rdoinfo-tripleo-master-testing-centos-8-containers-multinode/7a6705d/logs/undercloud/var/log/containers/stdouts/neutron_db_sync.log.txt.gz

When examining the VM behavior during the neutron db-sync operation, we find that the disk usage is very high, but performance is low (99%busy, avq > 6 ms, avio > 10ms, but < 1.0 MB/s written). In another job, running on a different host (host-id cd94de9976e8f8b3f2212dccfe55af90ebe4613f70f21e858f6b7d20), the operation finished in about 20 seconds, and disk performance was normal (15%busy, avq < 5ms, avio < 1ms, > 2 MB/s written). No other VM parameter was abnormal (CPU and memory usage were good).

I've opened a ticket to our cloud provider so they can check the host.

Revision history for this message
Michele Baldessari (michele) wrote :

So I have also been observing keystone timeouts on RDO on some of my patches. The timeouts are because some of the memcached patches that have been merged lately are broken in regards to opening up the firewall rules.

Let's take the first failure mentioned at the beginning:
https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master/6b98df3/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

We see that nova is configured to use memcached:
https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master/6b98df3/logs/overcloud-controller-0/var/lib/config-data/puppet-generated/nova/etc/nova/nova.conf.txt.gz

memcache_servers=172.17.0.139:11211,172.17.0.15:11211,172.17.0.220:11211

The problem is there are no iptables rules for memcached:
https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master/6b98df3/logs/overcloud-controller-0/var/log/extra/network.txt.gz

In there you will see no rules opening up 11211. I think the logic I7a38a01f498d350d065a7c312a6654832fe24e6a is broken in regards to firewall rules, hence the timeouts.

Note: I don't think my analysis applies in case of standalone as there I believe iptables rules do not really play a role.

Revision history for this message
Michele Baldessari (michele) wrote :

Let''s keep this LP for the perf issues seen in general and let's move the memcached firewall rules discussion here https://bugs.launchpad.net/tripleo/+bug/1918891

wes hayutin (weshayutin)
tags: removed: promotion-blocker
Changed in tripleo:
milestone: wallaby-3 → wallaby-rc1
Changed in tripleo:
milestone: wallaby-rc1 → xena-1
Changed in tripleo:
milestone: xena-1 → xena-2
Changed in tripleo:
milestone: xena-2 → xena-3
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.