cinder cmd not working intermittently

Bug #1826886 reported by Peng Peng
This bug affects 2 people
Affects: StarlingX
Status: Invalid
Importance: Medium
Assigned to: Ma Zheng

Bug Description

Brief Description
-----------------
The cinder command intermittently fails with a connection timeout (see the logs below).

Severity
--------
Major

Steps to Reproduce
------------------
....
TC-name: nova/test_evacuate_vms.py::TestOneHostAvail::()::test_reboot_only_host

Expected Behavior
------------------

Actual Behavior
----------------

Reproducibility
---------------
Intermittent 2/10

System Configuration
--------------------
one node system

Lab-name: SM-2

Branch/Pull Time/Commit
-----------------------
stx master as of 20190428T233000Z

Last Pass
---------
20190426T013000Z

Timestamp/Logs
--------------
[2019-04-29 08:28:29,687] 262 DEBUG MainThread ssh.send :: Send 'cinder --os-username 'tenant2' --os-password 'Li69nux*' --os-project-name tenant2 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os_region_name RegionOne show 087fb700-5218-48ad-b2d4-b44009bab31a'
[2019-04-29 08:30:38,402] 387 DEBUG MainThread ssh.expect :: Output:
ERROR: Unable to establish connection to http://cinder-api.openstack.svc.cluster.local:8776/v3/1be5e0ced98a44ecb892c5726077ea2c/volumes/087fb700-5218-48ad-b2d4-b44009bab31a: HTTPConnectionPool(host='cinder-api.openstack.svc.cluster.local', port=8776): Max retries exceeded with url: /v3/1be5e0ced98a44ecb892c5726077ea2c/volumes/087fb700-5218-48ad-b2d4-b44009bab31a (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7faed17a4610>: Failed to establish a new connection: [Errno 110] Connection timed out',))

Test Activity
-------------
Sanity

Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Frank Miller (sensfan22) wrote :

There is not enough info in this bug to start an investigation.
Please provide the following:
1) What steps are done in the TC nova/test_evacuate_vms.py::TestOneHostAvail::()::test_reboot_only_host?
2) What steps were done on this lab in the 10 minutes or so leading up to the cinder command failing?

Note: It is expected that openstack commands will fail if an AIO-SX was recently rebooted, as it takes some time after the reboot for all of the services to come up.
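For test automation, the post-reboot window described above can be tolerated by retrying the CLI for a bounded time. A minimal sketch (not part of the test framework; the attempt count and delay are illustrative assumptions):

```shell
# retry_cli ATTEMPTS DELAY CMD...: run CMD up to ATTEMPTS times,
# sleeping DELAY seconds between failures, to ride out the window
# while services come back up after a reboot.
retry_cli() {
  attempts=$1; delay=$2; shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    "$@" && return 0          # success: stop retrying
    i=$((i + 1))
    sleep "$delay"            # transient failure: wait and retry
  done
  return 1                    # still failing after all attempts
}

# Hypothetical usage: keep trying `cinder list` for up to ~5 minutes:
# retry_cli 30 10 cinder list
```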

Changed in starlingx:
status: New → Incomplete
Revision history for this message
Peng Peng (ppeng) wrote :

A reboot was not invoked before this cinder command failure.

10 mins before the failure happened:
From [2019-05-01 08:22:37,006] to [2019-05-01 08:28:01,851], the test_ping_between_two_vms TC was running.
test steps:
- Create a flavor
- Create a volume
- Boot 2 VMs, then ping between them

At [2019-05-01 08:29:27,306] the test_reboot_only_host TC started.

[2019-05-01 08:29:42,583] 262 DEBUG MainThread ssh.send :: Send 'nova --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne quota-update --cores 20 --instances 10 55d4101913c14204a98450ac7c3b09e8'

[2019-05-01 08:29:49,839] 262 DEBUG MainThread ssh.send :: Send 'cinder --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os_region_name RegionOne quota-update --volumes 20 55d4101913c14204a98450ac7c3b09e8'

[2019-05-01 08:30:05,371] 262 DEBUG MainThread ssh.send :: Send 'nova --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne flavor-create --ephemeral 1 --swap 512 flv_ephemswap auto 1024 2 1'

[2019-05-01 08:30:16,901] 262 DEBUG MainThread ssh.send :: Send 'cinder --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os_region_name RegionOne list --all-tenants'

[2019-05-01 08:30:20,576] cinder create: Failed to establish a new connection: [Errno 110] Connection timed out',))

Revision history for this message
Frank Miller (sensfan22) wrote :

Peng - the timeline you provided is helpful. Since it is taken from a load that was running on May 1st, please attach the logs for that May 1st timeline.

Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Frank Miller (sensfan22) wrote :

Marking as stx.2.0 gating, since continuous access to the cinder CLI is a basic requirement for OpenStack containers. The issue is intermittent, so setting priority to medium.

Assigning to Cindy with a request for assistance in identifying a prime to investigate this issue.

Changed in starlingx:
status: Incomplete → Triaged
importance: Undecided → Medium
assignee: nobody → Cindy Xie (xxie1)
tags: added: stx.2.0 stx.retestneeded
Liang Fang (liangfang)
Changed in starlingx:
assignee: Cindy Xie (xxie1) → Liang Fang (liangfang)
tags: added: stx.storage
Revision history for this message
Bruce Jones (brucej) wrote :

Can this bug please be re-tested and verified if it still occurs? It may have been fixed by the big Ceph upgrade.

Changed in starlingx:
status: Triaged → Incomplete
Revision history for this message
Liang Fang (liangfang) wrote :

The log shows that cinder failed to connect to the DB at 09:24:45, but succeeded in connecting after 09:52:08.

Could you please run "kubectl get pod --namespace=openstack | grep cinder-api" to check whether the cinder-api pods are running?

Also, please run "kubectl get pod --namespace=openstack | grep mariadb" to check whether the DB is ready.

e.g.
controller-0:~$ kubectl get pod --namespace=openstack | grep mariadb
mariadb-ingress-868456b8dd-8j5wl 0/1 Pending 0 15d
mariadb-ingress-868456b8dd-kshqt 1/1 Running 444 15d
mariadb-ingress-error-pages-9b86bc9f8-r29wr 1/1 Running 19 15d
mariadb-server-0 1/1 Running 118 15d

Is the configuration AIO bare metal or AIO virtualized?
Thanks
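The READY column in output like the example above can be checked mechanically. A small sketch (not part of any StarlingX tooling; the helper name is illustrative) that flags pods with fewer ready containers than desired, skipping Completed jobs:

```shell
# pods_ready: read `kubectl get pod` text on stdin and print the name of
# every pod whose READY column (e.g. 0/1) shows fewer ready containers
# than desired. Exits non-zero if any such pod is found.
pods_ready() {
  awk '$3 != "Completed" && $2 ~ /\// {
    split($2, r, "/")
    if (r[1] < r[2]) { print $1; bad = 1 }
  }
  END { exit bad }'
}

# Hypothetical usage:
# kubectl get pod --namespace=openstack | pods_ready || echo "pods not ready"
```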

Revision history for this message
Peng Peng (ppeng) wrote :

We are not able to reproduce this issue in recent loads.

tags: removed: stx.retestneeded
Cindy Xie (xxie1)
Changed in starlingx:
status: Incomplete → Invalid
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

This issue still seems to occur (intermittently).

HW lab, e.g. PV-0
BUILD_ID="20190718T013000Z"

...
[2019-07-25 13:03:25,341] 301 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume show ab0abd17-0da7-4747-b135-827f4b46b7ce'
[2019-07-25 13:03:27,330] 423 DEBUG MainThread ssh.expect :: Output:
+------------------------------+--------------------------------------+
| Field | Value |
+------------------------------+--------------------------------------+
| attachments | [] |
| availability_zone | nova |
| bootable | true |
| consistencygroup_id | None |
| created_at | 2019-07-25T13:02:13.000000 |
| description | None |
| encrypted | False |
| id | ab0abd17-0da7-4747-b135-827f4b46b7ce |
| multiattach | False |
| name | win_2016-1 |
| os-vol-tenant-attr:tenant_id | be2e1a5553164f4db6392523b4ab9fb6 |
| properties | |
| replication_status | None |
| size | 29 |
| snapshot_id | None |
| source_volid | None |
| status | creating |
| type | None |
| updated_at | 2019-07-25T13:02:13.000000 |
| user_id | 0ec62a5feafb4be186e9b6d20543a3bc |
+------------------------------+--------------------------------------+
[sysadmin@controller-0 ~(keystone_admin)]$
[2019-07-25 13:03:27,330] 301 DEBUG MainThread ssh.send :: Send 'echo $?'
[2019-07-25 13:03:27,434] 423 DEBUG MainThread ssh.expect :: Output:
0
[sysadmin@controller-0 ~(keystone_admin)]$
[2019-07-25 13:03:32,439] 1534 DEBUG MainThread ssh.get_active_controller:: Getting active controller client for pv0
[2019-07-25 13:03:32,440] 466 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-07-25 13:03:32,440] 301 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume show ab0abd17-0da7-4747-b135-827f4b46b7ce'
[2019-07-25 13:05:41,732] 423 DEBUG Mai...


Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :
Changed in starlingx:
status: Invalid → Confirmed
tags: added: stx.retestneeded
Revision history for this message
Liang Fang (liangfang) wrote :

The log is from 2019-07-24, but the test script ran on 2019-07-25.
Could you please confirm the log is valid? Thanks.

In the cinder portion of the log, I only see that the DB connection was lost early on. Cinder only uses the DB; it does not deploy it. So from cinder's point of view, it knows the DB connection was lost but has no idea why it was lost.

Please collect logs like this:
1. Logs from controller-0 and controller-1 (if present)
2. Logs from the test script
3. The timestamps of the above logs should match
4. Logs should be collected promptly, e.g. within 5 hours, because they may be purged after a certain time if something goes wrong and a pod restarts
5. Run "kubectl get pod --namespace=openstack" and paste the output here, so we at least know the status of the pods

Thanks a lot!
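The collection steps above can be sketched as a small script. All paths here are assumptions; adjust them for the lab under test:

```shell
# Gather the artifacts requested above into one timestamped directory,
# so the controller logs, test-script logs, and pod snapshot all line up.
set -e
TS=$(date +%Y%m%d-%H%M%S)
OUT="cinder-bug-logs-$TS"
mkdir -p "$OUT"

# 1) platform logs from each controller (path is illustrative)
# scp controller-0:/var/log/containers/cinder-api-*.log "$OUT"/ || true

# 2) test-script logs (path is illustrative)
# cp ~/test_results/latest/*.log "$OUT"/ || true

# 5) pod status snapshot, taken at the same time as the logs (steps 3-4)
# kubectl get pod --namespace=openstack > "$OUT/pods.txt"

echo "collected into $OUT"
```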

Changed in starlingx:
status: Confirmed → Incomplete
Revision history for this message
Ghada Khalil (gkhalil) wrote :

As per agreement with the community, moving all unresolved medium priority bugs from stx.2.0 to stx.3.0

tags: added: stx.3.0
removed: stx.2.0
Revision history for this message
Austin Sun (sunausti) wrote :

Hi Peng Peng,
   would you answer Liang Fang's question?

Thanks.
BR
Austin Sun.

Liang Fang (liangfang)
Changed in starlingx:
assignee: Liang Fang (liangfang) → nobody
Austin Sun (sunausti)
Changed in starlingx:
assignee: nobody → Peng Peng (ppeng)
Ma Zheng (zheng-vm)
Changed in starlingx:
assignee: Peng Peng (ppeng) → Ma Zheng (zheng-vm)
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

I am seeing the command intermittently fail during volume creation, e.g. while the volume is in the "downloading" state.
See the 500 error during downloading in the attached logs (controller-1).

23b20342-607b-40e6-8415-4c421945f969 | win-size29-1 | downloading | 29 |

2019-11-29T22:31:36.000 controller-0 -sh: info HISTORY: PID=2487534 UID=42425 openstack --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume list
[sysadmin@controller-0 ~(keystone_admin)]$ openstack --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume list
The server has either erred or is incapable of performing the requested operation. (HTTP 500) (Request-ID: req-aaffb3cb-f95d-45a3-8716-8e237ee86d39)
[sysadmin@controller-0 ~(keystone_admin)]$ openstack --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume list
+--------------------------------------+-----------------+-------------+------+-------------+
| ID | Name | Status | Size | Attached to |
+--------------------------------------+-----------------+-------------+------+-------------+
| 23b20342-607b-40e6-8415-4c421945f969 | win-size29-1 | downloading | 29 | |
| 63262bd5-b627-467a-99f3-9a7debb0434d | centos-size29-1 | available | 29 | |
| da0812ad-7f05-447a-a9d5-e9d85703937d | win_2016-w2 | error | 29 | |
| 7e80ddc0-6900-487e-92a2-fd44abd2939c | win_2016-1 | error | 29 | |
+--------------------------------------+-----------------+-------------+------+-------------+

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

2019-11-22_20-00-00
$ system application-list
+---------------------+--------------------------------+-------------------------------+--------------------+---------+-----------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+--------------------------------+-------------------------------+--------------------+---------+-----------+
| platform-integ-apps | 1.0-8 | platform-integration-manifest | manifest.yaml | applied | completed |
| stx-openstack | 1.0-19-centos-stable-versioned | armada-manifest | stx-openstack.yaml | applied | completed |
+---------------------+--------------------------------+-------------------------------+--------------------+---------+-----------+

$ kubectl get pods -o wide -n openstack
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cinder-api-8548c4c585-cxvcv 1/1 Running 0 3h54m 172.16.192.108 controller-0 <none> <none>
cinder-api-8548c4c585-s9lxl 1/1 Running 0 3h54m 172.16.166.136 controller-1 <none> <none>
cinder-backup-5c89d8894b-spnvw 1/1 Running 0 3h54m 172.16.166.153 controller-1 <none> <none>
cinder-backup-5c89d8894b-vbmws 1/1 Running 0 3h54m 172.16.192.67 controller-0 <none> <none>
cinder-backup-storage-init-p625l 0/1 Completed 0 3h54m 172.16.192.98 controller-0 <none> <none>
cinder-bootstrap-9qdrt 0/1 Completed 0 3h54m 172.16.166.183 controller-1 <none> <none>
cinder-create-internal-tenant-w82jz 0/1 Completed 0 3h54m 172.16.166.176 controller-1 <none> <none>
cinder-db-init-tkkvq 0/1 Completed 0 3h54m 172.16.166.135 controller-1 <none> <none>
cinder-db-sync-7nx8w 0/1 Completed 0 3h54m 172.16.166.134 controller-1 <none> <none>
cinder-ks-endpoints-bn5fl 0/9 Completed 0 3h54m 172.16.192.68 controller-0 <none> <none>
cinder-ks-service-tcbkj 0/3 Completed 0 3h54m 172.16.166.170 controller-1 <none> <none>
cinder-ks-user-qp4bt 0/1 Completed 0 3h54m 172.16.166.152 controller-1 <none> <none>
cinder-rabbit-init-6f62w 0/1 Completed 0 3h54m 172.16.166.138 controller-1 <none> <none>
cinder-scheduler-7f46f6654f-88x4f 1/1 Running 0 3h54m 172.16.192.101 controller-0 <none> <none>
cinder-scheduler-7f46f6654f-nl4kq 1/1 Running 0 3h54m 172.16.166.150 controller-1 <none> <none>
cinder-storage-init-vkjsr 0/1 ...

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

see bash.log (controller-0)
2019-11-29T22:27:58.000 controller-0 -sh: info HISTORY: PID=2487534 UID=42425 openstack --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume list
2019-11-29T22:28:52.000 controller-0 -sh: info HISTORY: PID=2487534 UID=42425 openstack --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume create --image=8263b829-062c-47c4-b765-de0b64f87724 --size=29 --bootable centos-size29-1
2019-11-29T22:28:59.000 controller-0 -sh: info HISTORY: PID=2487534 UID=42425 openstack --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume list
2019-11-29T22:30:25.000 controller-0 -sh: info HISTORY: PID=2487534 UID=42425 openstack --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume create --image=a1a2fbfb-e793-4d27-884e-89fc09b9849f --size=29 --bootable win-size29-1
2019-11-29T22:30:31.000 controller-0 -sh: info HISTORY: PID=2487534 UID=42425 openstack --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume list
2019-11-29T22:31:07.000 controller-0 -sh: info HISTORY: PID=2487534 UID=42425 openstack --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne image show a1a2fbfb-e793-4d27-884e-89fc09b9849f
2019-11-29T22:31:36.000 controller-0 -sh: info HISTORY: PID=2487534 UID=42425 openstack --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume list
2019-11-29T22:32:20.000 controller-0 -sh: info HISTORY: PID=2487534 UID=42425 sudo tail /var/log/bash.log
2019-11-29T22:33:56.000 controller-0 -sh: info HISTORY: PID=2487534 UID=42425 openstack --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.loca...


Changed in starlingx:
status: Incomplete → Confirmed
Revision history for this message
Ma Zheng (zheng-vm) wrote :

I think the latest case reported by Mitchell is not the same as the one Peng reported.

For the latest report, I think this error is caused by limited resources.
Let's take req-aaffb3cb-f95d-45a3-8716-8e237ee86d39 as an example. As posted in #comment-15, cinder-api returned HTTP 500. What I found in the log on controller-1 (/var/log/containers/cinder-api-8548c4c585-s9lxl_openstack_cinder-api-5c984b95c1058fb9900da2125445bcc4d41dfac2353d7bb0700796d1d611c981.log) at line 7461 is:
{"log":"2019-11-29 22:31:43.910 19 ERROR oslo_db.sqlalchemy.exc_filters [req-aaffb3cb-f95d-45a3-8716-8e237ee86d39 f9f2c5d355544777808fc2ea96ae5c9a 109492cf8aed4438b7ecb126063c8025 - default default] DBAPIError exception wrapped from (pymysql.err.InternalError) (1021, u'Disk full (/tmp/#sql_136_1.MAI); waiting for someone to free some space... (errno: 28 \"No space left on device\")')\n","stream":"stdout","time":"2019-11-29T22:31:43.911447596Z"}

The DB returned "disk full", which I think caused this error. I also found a lot of cinder-volume pods in 'Evicted' status in #comment-16; maybe the out-of-space condition on the storage node caused the eviction of the cinder-volume pods.
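A quick way to catch the errno-28 "No space left on device" condition before it takes mariadb down is to check filesystem free space. A minimal sketch (the helper name, threshold, and path are assumptions):

```shell
# check_free PATH MIN_FREE: warn and return non-zero when the filesystem
# holding PATH has less than MIN_FREE percent free space, i.e. the
# condition that produced the "Disk full (/tmp/...)" error above.
check_free() {
  path=$1; min_free=$2
  # POSIX `df -P`: second line, field 5 is the used percentage, e.g. "42%"
  used=$(df -P "$path" | awk 'NR==2 { sub(/%/, "", $5); print $5 }')
  if [ "$((100 - used))" -lt "$min_free" ]; then
    echo "WARNING: $path below ${min_free}% free (used ${used}%)"
    return 1
  fi
  return 0
}

# Hypothetical usage on the mariadb host: check_free /tmp 10
```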

Ma Zheng (zheng-vm)
Changed in starlingx:
status: Confirmed → Incomplete
Revision history for this message
Ma Zheng (zheng-vm) wrote :

We have not observed this error in our recent test.

From Mitchell's latest report, I found that the cinder API worked well after it returned the 500 error (log on cinder-api@controller-1 after 2019-11-29 22:32:01.210).
So I still hold the opinion that it was caused by resource limitation and that it recovered automatically once enough resources were available.

I can see that the system resumed responding to those API requests just a minute later.

Revision history for this message
Austin Sun (sunausti) wrote :

Hi Peng Peng and Wendy:
    have you hit the same issue recently?

Revision history for this message
Ghada Khalil (gkhalil) wrote :

As per agreement with the community, moving unresolved medium priority bugs (< 100 days OR recently reproduced) from stx.3.0 to stx.4.0

As per the comments above, this issue was reproduced on 2019-11-29

tags: added: stx.4.0
removed: stx.3.0
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

closing

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

20200111T023000Z

tags: removed: stx.retestneeded
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Closing (marking the status as Invalid, which we use for Not Reproducible issues) based on the notes above.

Changed in starlingx:
status: Incomplete → Invalid