cinder cmd not working intermittently

Bug #1826886 reported by Peng Peng on 2019-04-29
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
StarlingX
Medium
Ma Zheng

Bug Description

Brief Description
-----------------
as title

Severity
--------
Major

Steps to Reproduce
------------------
....
TC-name: nova/test_evacuate_vms.py::TestOneHostAvail::()::test_reboot_only_host

Expected Behavior
------------------

Actual Behavior
----------------

Reproducibility
---------------
Intermittent 2/10

System Configuration
--------------------
one node system

Lab-name: SM-2

Branch/Pull Time/Commit
-----------------------
stx master as of 20190428T233000Z

Last Pass
---------
20190426T013000Z

Timestamp/Logs
--------------
[2019-04-29 08:28:29,687] 262 DEBUG MainThread ssh.send :: Send 'cinder --os-username 'tenant2' --os-password 'Li69nux*' --os-project-name tenant2 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os_region_name RegionOne show 087fb700-5218-48ad-b2d4-b44009bab31a'
[2019-04-29 08:30:38,402] 387 DEBUG MainThread ssh.expect :: Output:
ERROR: Unable to establish connection to http://cinder-api.openstack.svc.cluster.local:8776/v3/1be5e0ced98a44ecb892c5726077ea2c/volumes/087fb700-5218-48ad-b2d4-b44009bab31a: HTTPConnectionPool(host='cinder-api.openstack.svc.cluster.local', port=8776): Max retries exceeded with url: /v3/1be5e0ced98a44ecb892c5726077ea2c/volumes/087fb700-5218-48ad-b2d4-b44009bab31a (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7faed17a4610>: Failed to establish a new connection: [Errno 110] Connection timed out',))

Test Activity
-------------
Sanity

Peng Peng (ppeng) wrote :
Frank Miller (sensfan22) wrote :

There is not enough info in this bug to start an investigation.
Please provide the following:
1) What steps are done in the TC nova/test_evacuate_vms.py::TestOneHostAvail::()::test_reboot_only_host?
2) What steps were done on this lab in the 10 minutes or so leading up to the cinder command failing?

Note: It is expected openstack commands will fail if a AIO-SX was recently rebooted as it takes some time after the reboot for all of the services to come up.

Changed in starlingx:
status: New → Incomplete
Peng Peng (ppeng) wrote :

Rebooting was not invoked before this cinder CMD failure.

10 mins before the failure happened:
[2019-05-01 08:22:37,006 to [2019-05-01 08:28:01,851, running test_ping_between_two_vms TC
test steps:
- Create a favor
- Create a volume
- Boot 2 VMs, then ping

[2019-05-01 08:29:27,306 TC test_reboot_only_host started

[2019-05-01 08:29:42,583] 262 DEBUG MainThread ssh.send :: Send 'nova --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne quota-update --cores 20 --instances 10 55d4101913c14204a98450ac7c3b09e8'

[2019-05-01 08:29:49,839] 262 DEBUG MainThread ssh.send :: Send 'cinder --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os_region_name RegionOne quota-update --volumes 20 55d4101913c14204a98450ac7c3b09e8'

[2019-05-01 08:30:05,371] 262 DEBUG MainThread ssh.send :: Send 'nova --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne flavor-create --ephemeral 1 --swap 512 flv_ephemswap auto 1024 2 1'

[2019-05-01 08:30:16,901] 262 DEBUG MainThread ssh.send :: Send 'cinder --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os_region_name RegionOne list --all-tenants'

[2019-05-01 08:30:20,576] cinder create: Failed to establish a new connection: [Errno 110] Connection timed out',))

Frank Miller (sensfan22) wrote :

Peng - the timeline you provide is helpful. Since it is taken from a load that was running on May 1st, please attach the logs for this May 1st timeline.

Peng Peng (ppeng) wrote :
Frank Miller (sensfan22) wrote :

Marking stx.2.0 gating as continual access to the cinder CLI is a basic requirement for OpenStack containers. Issue is intermittent so setting priority to medium.

Assigning to Cindy and request assistance to identify a prime to investigate this issue.

Changed in starlingx:
status: Incomplete → Triaged
importance: Undecided → Medium
assignee: nobody → Cindy Xie (xxie1)
tags: added: stx.2.0 stx.retestneeded
Liang Fang (liangfang) on 2019-05-08
Changed in starlingx:
assignee: Cindy Xie (xxie1) → Liang Fang (liangfang)
tags: added: stx.storage
Bruce Jones (brucej) wrote :

Can this bug please be re-tested and verified if it still occurs? It may have been fixed by the big Ceph upgrade.

Changed in starlingx:
status: Triaged → Incomplete
Liang Fang (liangfang) wrote :

From the log, it shows cinder failed to connect db at 09:24:45, but succeed to connect db after 09:52:08

could you please run "kubectl get pod --namespace=openstack | grep cinder-api" to identify if cinder-api pod are running?

also please run "kubectl get pod --namespace=openstack | grep mariadb" to identify if DB are ready.

e.g.
controller-0:~$ kubectl get pod --namespace=openstack | grep mariadb
mariadb-ingress-868456b8dd-8j5wl 0/1 Pending 0 15d
mariadb-ingress-868456b8dd-kshqt 1/1 Running 444 15d
mariadb-ingress-error-pages-9b86bc9f8-r29wr 1/1 Running 19 15d
mariadb-server-0 1/1 Running 118 15d

Is the configuration AIO bare metal or AIO virtualization?
Thanks

Peng Peng (ppeng) wrote :

We are not able to reproduce this issue in recent loads.

tags: removed: stx.retestneeded
Cindy Xie (xxie1) on 2019-05-29
Changed in starlingx:
status: Incomplete → Invalid
Wendy Mitchell (wmitchellwr) wrote :
Download full text (6.4 KiB)

This issue still seems to be apparent (intermittently)

HW lab eg. PV-0
BUILD_ID="20190718T013000Z"

...
[2019-07-25 13:03:25,341] 301 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume show ab0abd17-0da7-4747-b135-827f4b46b7ce'
[2019-07-25 13:03:27,330] 423 DEBUG MainThread ssh.expect :: Output:
+------------------------------+--------------------------------------+
| Field | Value |
+------------------------------+--------------------------------------+
| attachments | [] |
| availability_zone | nova |
| bootable | true |
| consistencygroup_id | None |
| created_at | 2019-07-25T13:02:13.000000 |
| description | None |
| encrypted | False |
| id | ab0abd17-0da7-4747-b135-827f4b46b7ce |
| multiattach | False |
| name | win_2016-1 |
| os-vol-tenant-attr:tenant_id | be2e1a5553164f4db6392523b4ab9fb6 |
| properties | |
| replication_status | None |
| size | 29 |
| snapshot_id | None |
| source_volid | None |
| status | creating |
| type | None |
| updated_at | 2019-07-25T13:02:13.000000 |
| user_id | 0ec62a5feafb4be186e9b6d20543a3bc |
+------------------------------+--------------------------------------+
[sysadmin@controller-0 ~(keystone_admin)]$
[2019-07-25 13:03:27,330] 301 DEBUG MainThread ssh.send :: Send 'echo $?'
[2019-07-25 13:03:27,434] 423 DEBUG MainThread ssh.expect :: Output:
0
[sysadmin@controller-0 ~(keystone_admin)]$
[2019-07-25 13:03:32,439] 1534 DEBUG MainThread ssh.get_active_controller:: Getting active controller client for pv0
[2019-07-25 13:03:32,440] 466 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-07-25 13:03:32,440] 301 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume show ab0abd17-0da7-4747-b135-827f4b46b7ce'
[2019-07-25 13:05:41,732] 423 DEBUG Mai...

Read more...

Wendy Mitchell (wmitchellwr) wrote :
Changed in starlingx:
status: Invalid → Confirmed
tags: added: stx.retestneeded
Liang Fang (liangfang) wrote :

The log is 2019-07-24, but test scripts runs on 2019-07-25.
Could you please make sure the log is valid? Thanks.

From the log, cinder part, I only find DB connection lost at early time. Cinder just use DB, but not deploy DB. So from cinder's point of view, it knows DB lost, but has no idea why DB lost.

Please collect log like this:
1. logs on controller0 and controller1 (if have)
2. logs of test script
3. The time of above logs should match
4. Logs should be collected in time, e.g. within 5 hours, because the logs may be purged after a certain time if something go error and pod restarted.
5. run "kubectl get pod --namespace=openstack", and paste here, so we can know at least the status of pods

Thanks a lot!

Changed in starlingx:
status: Confirmed → Incomplete
Ghada Khalil (gkhalil) wrote :

As per agreement with the community, moving all unresolved medium priority bugs from stx.2.0 to stx.3.0

tags: added: stx.3.0
removed: stx.2.0
Austin Sun (sunausti) wrote :

Hi pengpeng:
   would you answer fangliang's question ?

Thanks.
BR
Austin Sun.

Liang Fang (liangfang) on 2019-11-26
Changed in starlingx:
assignee: Liang Fang (liangfang) → nobody
Austin Sun (sunausti) on 2019-11-26
Changed in starlingx:
assignee: nobody → Peng Peng (ppeng)
Ma Zheng (zheng-vm) on 2019-11-26
Changed in starlingx:
assignee: Peng Peng (ppeng) → Ma Zheng (zheng-vm)
Wendy Mitchell (wmitchellwr) wrote :

I am seeing the command intermittently fail during volume creation eg. where the volume is "downloading" state
see 500 error during downloading in the attached logs (controller-1)

23b20342-607b-40e6-8415-4c421945f969 | win-size29-1 | downloading | 29 |

2019-11-29T22:31:36.000 controller-0 -sh: info HISTORY: PID=2487534 UID=42425 openstack --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume list
[sysadmin@controller-0 ~(keystone_admin)]$ openstack --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume list
The server has either erred or is incapable of performing the requested operation. (HTTP 500) (Request-ID: req-aaffb3cb-f95d-45a3-8716-8e237ee86d39)
[sysadmin@controller-0 ~(keystone_admin)]$ openstack --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume list
+--------------------------------------+-----------------+-------------+------+-------------+
| ID | Name | Status | Size | Attached to |
+--------------------------------------+-----------------+-------------+------+-------------+
| 23b20342-607b-40e6-8415-4c421945f969 | win-size29-1 | downloading | 29 | |
| 63262bd5-b627-467a-99f3-9a7debb0434d | centos-size29-1 | available | 29 | |
| da0812ad-7f05-447a-a9d5-e9d85703937d | win_2016-w2 | error | 29 | |
| 7e80ddc0-6900-487e-92a2-fd44abd2939c | win_2016-1 | error | 29 | |
+--------------------------------------+-----------------+-------------+------+-------------+

Wendy Mitchell (wmitchellwr) wrote :
Download full text (47.8 KiB)

2019-11-22_20-00-00
$ system application-list
+---------------------+--------------------------------+-------------------------------+--------------------+---------+-----------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+--------------------------------+-------------------------------+--------------------+---------+-----------+
| platform-integ-apps | 1.0-8 | platform-integration-manifest | manifest.yaml | applied | completed |
| stx-openstack | 1.0-19-centos-stable-versioned | armada-manifest | stx-openstack.yaml | applied | completed

$ kubectl get pods -o wide -n openstack
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cinder-api-8548c4c585-cxvcv 1/1 Running 0 3h54m 172.16.192.108 controller-0 <none> <none>
cinder-api-8548c4c585-s9lxl 1/1 Running 0 3h54m 172.16.166.136 controller-1 <none> <none>
cinder-backup-5c89d8894b-spnvw 1/1 Running 0 3h54m 172.16.166.153 controller-1 <none> <none>
cinder-backup-5c89d8894b-vbmws 1/1 Running 0 3h54m 172.16.192.67 controller-0 <none> <none>
cinder-backup-storage-init-p625l 0/1 Completed 0 3h54m 172.16.192.98 controller-0 <none> <none>
cinder-bootstrap-9qdrt 0/1 Completed 0 3h54m 172.16.166.183 controller-1 <none> <none>
cinder-create-internal-tenant-w82jz 0/1 Completed 0 3h54m 172.16.166.176 controller-1 <none> <none>
cinder-db-init-tkkvq 0/1 Completed 0 3h54m 172.16.166.135 controller-1 <none> <none>
cinder-db-sync-7nx8w 0/1 Completed 0 3h54m 172.16.166.134 controller-1 <none> <none>
cinder-ks-endpoints-bn5fl 0/9 Completed 0 3h54m 172.16.192.68 controller-0 <none> <none>
cinder-ks-service-tcbkj 0/3 Completed 0 3h54m 172.16.166.170 controller-1 <none> <none>
cinder-ks-user-qp4bt 0/1 Completed 0 3h54m 172.16.166.152 controller-1 <none> <none>
cinder-rabbit-init-6f62w 0/1 Completed 0 3h54m 172.16.166.138 controller-1 <none> <none>
cinder-scheduler-7f46f6654f-88x4f 1/1 Running 0 3h54m 172.16.192.101 controller-0 <none> <none>
cinder-scheduler-7f46f6654f-nl4kq 1/1 Running 0 3h54m 172.16.166.150 controller-1 <none> <none>
cinder-storage-init-vkjsr 0/1 ...

Wendy Mitchell (wmitchellwr) wrote :
Wendy Mitchell (wmitchellwr) wrote :
Download full text (4.1 KiB)

see bash.log (controller-0)
2019-11-29T22:27:58.000 controller-0 -sh: info HISTORY: PID=2487534 UID=42425 openstack --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume list
2019-11-29T22:28:52.000 controller-0 -sh: info HISTORY: PID=2487534 UID=42425 openstack --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume create --image=8263b829-062c-47c4-b765-de0b64f87724 --size=29 --bootable centos-size29-1
2019-11-29T22:28:59.000 controller-0 -sh: info HISTORY: PID=2487534 UID=42425 openstack --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume list
2019-11-29T22:30:25.000 controller-0 -sh: info HISTORY: PID=2487534 UID=42425 openstack --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume create --image=a1a2fbfb-e793-4d27-884e-89fc09b9849f --size=29 --bootable win-size29-1
2019-11-29T22:30:31.000 controller-0 -sh: info HISTORY: PID=2487534 UID=42425 openstack --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume list
2019-11-29T22:31:07.000 controller-0 -sh: info HISTORY: PID=2487534 UID=42425 openstack --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne image show a1a2fbfb-e793-4d27-884e-89fc09b9849f
2019-11-29T22:31:36.000 controller-0 -sh: info HISTORY: PID=2487534 UID=42425 openstack --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume list
2019-11-29T22:32:20.000 controller-0 -sh: info HISTORY: PID=2487534 UID=42425 sudo tail /var/log/bash.log
2019-11-29T22:33:56.000 controller-0 -sh: info HISTORY: PID=2487534 UID=42425 openstack --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.loca...

Read more...

Changed in starlingx:
status: Incomplete → Confirmed
Ma Zheng (zheng-vm) wrote :

Latest case reported by Mitchell is not the same as Peng reported I think.

For the latest report, I think this error is caused by resource limited.
Let's take req-aaffb3cb-f95d-45a3-8716-8e237ee86d39 as an example. As posted in #comment-15, cinder-api received HTTP-500. What I have found in the log of controller-1 (/var/log/containers/cinder-api-8548c4c585-s9lxl_openstack_cinder-api-5c984b95c1058fb9900da2125445bcc4d41dfac2353d7bb0700796d1d611c981.log) at line-7461 is
{"log":"2019-11-29 22:31:43.910 19 ERROR oslo_db.sqlalchemy.exc_filters [req-aaffb3cb-f95d-45a3-8716-8e237ee86d39 f9f2c5d355544777808fc2ea96ae5c9a 109492cf8aed4438b7ecb126063c8025 - default default] DBAPIError exception wrapped from (pymysql.err.InternalError) (1021, u'Disk full (/tmp/#sql_136_1.MAI); waiting for someone to free some space... (errno: 28 \"No space left on device\")')\n","stream":"stdout","time":"2019-11-29T22:31:43.911447596Z"}

DB returned "disk full", which I think caused this error. Plus I found a lot of 'Evicted' status of cinder-volume pods in #comment-16, maybe the "not-enough-space-error" happened on the storage node caused the eviction of cinder-volume pods.

Ma Zheng (zheng-vm) on 2019-12-05
Changed in starlingx:
status: Confirmed → Incomplete
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers