Volume was observed to go into error state on creation

Bug #1827063 reported by Maria Yousaf
Affects:      StarlingX
Status:       Invalid
Importance:   Medium
Assigned to:  Ma Zheng

Bug Description

Brief Description
-----------------
As part of automated regression testing, a 40 GB volume was created. The volume was observed to go into the error state.

Severity
--------
Major

Steps to Reproduce
------------------
1. Create a 40 GB volume (a representative command is sketched below)
2. Observe the volume go from the "creating" state to the "error" state
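
A representative command, modelled on the CLI calls in the test logs (the auth flags are abbreviated and the volume name/UUID are illustrative):

# create the 40 GB volume (add the --os-password / domain / region flags used in your environment)
cinder --os-username tenant2 --os-project-name tenant2 \
       --os-auth-url http://keystone.openstack.svc.cluster.local/v3 \
       create --name vol_inst2 40
# poll the volume until it leaves "creating"; in the failing case the status becomes "error"
cinder show <volume-uuid> | grep -w status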

Expected Behavior
------------------
Volume should be successfully created

Actual Behavior
----------------
Volume went into error state

Reproducibility
---------------
Intermittent.

System Configuration
--------------------
Dedicated storage

Branch/Pull Time/Commit
-----------------------
master as of 20190427T013000Z

Last Pass
---------
To be added

Timestamp/Logs
--------------
Creating:
[2019-04-28 12:39:56,199] 262 DEBUG MainThread ssh.send :: Send 'cinder --os-username 'tenant2' --os-password 'Li69nux*' --os-project-name tenant2 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os_region_name RegionOne show 4cc75d05-ff48-41bf-b4f3-cbb1c762abb4'
[2019-04-28 12:39:57,564] 387 DEBUG MainThread ssh.expect :: Output:
+------------------------------+--------------------------------------+
| Property | Value |
+------------------------------+--------------------------------------+
| attached_servers | [] |
| attachment_ids | [] |
| availability_zone | nova |
| bootable | false |
| consistencygroup_id | None |
| created_at | 2019-04-28T12:39:51.000000 |
| description | None |
| encrypted | False |
| id | 4cc75d05-ff48-41bf-b4f3-cbb1c762abb4 |
| metadata | |
| multiattach | False |
| name | vol_inst2 |
| os-vol-tenant-attr:tenant_id | 5eddc54506a04b1487601e764f77baa5 |
| replication_status | None |
| size | 40 |
| snapshot_id | None |
| source_volid | None |
| status | creating |
| updated_at | 2019-04-28T12:39:51.000000 |
| user_id | 60b94f9a0e5148f1bfcd1e192ba5f4d7 |
| volume_type | None |
+------------------------------+--------------------------------------+

Error state:

[2019-04-28 12:40:02,063] 387 DEBUG MainThread ssh.expect :: Output:
+------------------------------+--------------------------------------+
| Property | Value |
+------------------------------+--------------------------------------+
| attached_servers | [] |
| attachment_ids | [] |
| availability_zone | nova |
| bootable | false |
| consistencygroup_id | None |
| created_at | 2019-04-28T12:39:51.000000 |
| description | None |
| encrypted | False |
| id | 4cc75d05-ff48-41bf-b4f3-cbb1c762abb4 |
| metadata | |
| multiattach | False |
| name | vol_inst2 |
| os-vol-tenant-attr:tenant_id | 5eddc54506a04b1487601e764f77baa5 |
| replication_status | None |
| size | 40 |
| snapshot_id | None |
| source_volid | None |
| status | error |
| updated_at | 2019-04-28T12:39:58.000000 |
| user_id | 60b94f9a0e5148f1bfcd1e192ba5f4d7 |
| volume_type | None |
+------------------------------+--------------------------------------+
[wrsroot@controller-0 ~(keystone_admin)]$

The following is seen in the logs:

{"log":"2019-04-28 12:26:15.037 19 ERROR oslo_db.sqlalchemy.engines DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: u'SELECT 1'] (Background on this error at: http://sqlalche.me/e/e3q8)\n","stream":"stdout","time":"2019-04-28T12:26:15.038802703Z"}$
{"log":"2019-04-28 12:26:15.037 19 ERROR oslo_db.sqlalchemy.engines \n","stream":"stdout","time":"2019-04-28T12:26:15.038807776Z"}$
{"log":"2019-04-28 12:28:45.069 19 ERROR oslo_db.sqlalchemy.engines [req-e4a3f391-0361-4d4a-b667-f58774dace73 - - - - -] Database connection was found disconnected; reconnecting: DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: u'SELECT 1'] (Background on this error at: http://sqlalche.me/e/e3q8)\n","stream":"stdout","time":"2019-04-28T12:28:45.070504819Z"}$
{"log":"2019-04-28 12:28:45.069 19 ERROR oslo_db.sqlalchemy.engines Traceback (most recent call last):\n","stream":"stdout","time":"2019-04-28T12:28:45.070548809Z"}$
{"log":"2019-04-28 12:28:45.069 19 ERROR oslo_db.sqlalchemy.engines File \"/var/lib/openstack/lib/python2.7/site-packages/oslo_db/sqlalchemy/engines.py\", line 73, in _connect_ping_listener\n","stream":"stdout","time":"2019-04-28T12:28:45.070565662Z"}$
{"log":"2019-04-28 12:28:45.069 19 ERROR oslo_db.sqlalchemy.engines connection.scalar(select([1]))\n","stream":"stdout","time":"2019-04-28T12:28:45.070571682Z"}$
{"log":"2019-04-28 12:28:45.069 19 ERROR oslo_db.sqlalchemy.engines File \"/var/lib/openstack/lib/python2.7/site-packages/sqlalchemy/engine/base.py\", line 912, in scalar\n","stream":"stdout","time":"2019-04-28T12:28:45.070576625Z"}$
{"log":"2019-04-28 12:28:45.069 19 ERROR oslo_db.sqlalchemy.engines return self.execute(object_, *multiparams, **params).scalar()\n","stream":"stdout","time":"2019-04-28T12:28:45.07058163Z"}$
{"log":"2019-04-28 12:28:45.069 19 ERROR oslo_db.sqlalchemy.engines File \"/var/lib/openstack/lib/python2.7/site-packages/sqlalchemy/engine/base.py\", line 980, in execute\n","stream":"stdout","time":"2019-04-28T12:28:45.070587339Z"}$
{"log":"2019-04-28 12:28:45.069 19 ERROR oslo_db.sqlalchemy.engines return meth(self, multiparams, params)\n","stream":"stdout","time":"2019-04-28T12:28:45.070592368Z"}$
{"log":"2019-04-28 12:28:45.069 19 ERROR oslo_db.sqlalchemy.engines File \"/var/lib/openstack/lib/python2.7/site-packages/sqlalchemy/sql/elements.py\", line 273, in _execute_on_connection\n","stream":"stdout","time":"2019-04-28T12:28:45.070596992Z"}$
{"log":"2019-04-28 12:28:45.069 19 ERROR oslo_db.sqlalchemy.engines return connection._execute_clauseelement(self, multiparams, params)\n","stream":"stdout","time":"2019-04-28T12:28:45.07060209Z"}$
{"log":"2019-04-28 12:28:45.069 19 ERROR oslo_db.sqlalchemy.engines File \"/var/lib/openstack/lib/python2.7/site-packages/sqlalchemy/engine/base.py\", line 1099, in _execute_clauseelement\n","stream":"stdout","time":"2019-04-28T12:28:45.070609566Z"}$
{"log":"2019-04-28 12:28:45.069 19 ERROR oslo_db.sqlalchemy.engines distilled_params,\n","stream":"stdout","time":"2019-04-28T12:28:45.070614512Z"}$
{"log":"2019-04-28 12:28:45.069 19 ERROR oslo_db.sqlalchemy.engines File \"/var/lib/openstack/lib/python2.7/site-packages/sqlalchemy/engine/base.py\", line 1240, in _execute_context\n","stream":"stdout","time":"2019-04-28T12:28:45.070619107Z"}$

Test Activity
-------------
Regression

Revision history for this message
Maria Yousaf (myousaf) wrote :
Revision history for this message
Frank Miller (sensfan22) wrote :

Marking as stx.2.0 gating since creating volumes is fundamental to using StarlingX with VMs. The issue is intermittent, so setting the priority to medium.
This may be a duplicate of https://bugs.launchpad.net/starlingx/+bug/1827230

Assigning to Cindy, with a request for assistance in identifying a prime to investigate this issue.

Changed in starlingx:
status: New → Triaged
importance: Undecided → High
importance: High → Medium
assignee: nobody → Cindy Xie (xxie1)
tags: added: stx.20.0 stx.retestneeded
tags: added: stx.2.0
removed: stx.20.0
Revision history for this message
Liang Fang (liangfang) wrote :

Agreed, this should be the same issue as bug 1827230.

The root cause is a lost connection to the database.
To narrow down the issue, we may need to:
1. Log in to the pod running MariaDB and check whether MariaDB is working (a sketch of the check follows this list). I expect it is working; could the bug reporter confirm this?
2. If MariaDB is working correctly, then it is likely a network issue in reaching the DB, and the network team should take a look at this bug.
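
A minimal sketch of check 1, assuming the openstack namespace and the mariadb-server-0 pod name:

# confirm the MariaDB pods are Running/Ready
kubectl -n openstack get pods | grep mariadb
# run a trivial query from inside the pod; a result of "1" means the server is answering
kubectl -n openstack exec -it mariadb-server-0 -- \
    sh -c 'mysql --password="$MYSQL_DBADMIN_PASSWORD" -e "SELECT 1;"'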

Cindy Xie (xxie1)
Changed in starlingx:
assignee: Cindy Xie (xxie1) → Zhuweiwei (vivian.zhu)
Liang Fang (liangfang)
tags: added: stx.networking
Cindy Xie (xxie1)
tags: added: stx.storage
Liang Fang (liangfang)
Changed in starlingx:
assignee: Zhuweiwei (vivian.zhu) → Liang Fang (liangfang)
Liang Fang (liangfang)
Changed in starlingx:
status: Triaged → Incomplete
Revision history for this message
Peng Peng (ppeng) wrote :

We did not see this issue in recent loads

tags: removed: stx.retestneeded
Cindy Xie (xxie1)
Changed in starlingx:
status: Incomplete → Invalid
Revision history for this message
Peng Peng (ppeng) wrote :

Issue was reproduced on
Lab: WCP_113_121
Load: 20190703T013000Z

Collect logs are attached.

The MariaDB login log is shown below:
controller-1:~$ kubectl get pod -n openstack | grep maria
mariadb-ingress-6ff964556d-tb4qt 1/1 Running 0 9h
mariadb-ingress-6ff964556d-zgjf2 1/1 Running 0 165m
mariadb-ingress-error-pages-764cfd869b-fpbtq 1/1 Running 0 165m
mariadb-server-0 1/1 Running 4 8h
mariadb-server-1 1/1 Running 4 164m

controller-1:~$ kubectl exec mariadb-server-0 -n openstack -it bash
root@mariadb-server-0:/#
root@mariadb-server-0:/# env | grep MY
MYSQL_DBADMIN_PASSWORD=eba59f6ebf32Ti0*
MYSQL_DBSST_PASSWORD=password
MYSQL_DBADMIN_USERNAME=root
MARIADB_SERVICE_PORT_MYSQL=3306
MARIADB_SERVER_SERVICE_PORT_MYSQL=3306
MYSQL_DBSST_USERNAME=sst
root@mariadb-server-0:/#
root@mariadb-server-0:/# mysql --password $MYSQL_DBADMIN_PASSWORD
Enter password:
ERROR 1045 (28000): Access denied for user 'root'@'127.0.0.1' (using password: YES)
root@mariadb-server-0:/#
root@mariadb-server-0:/# mysql --password=$MYSQL_DBADMIN_PASSWORD
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 28955
Server version: 10.2.18-MariaDB-1:10.2.18+maria~bionic mariadb.org binary distribution

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| aodh |
| barbican |
| cinder |
| glance |
| gnocchi |
| heat |
| horizon |
| information_schema |
| keystone |
| mysql |
| neutron |
| nova |
| nova_api |
| nova_cell0 |
| panko |
| performance_schema |
| placement |
+--------------------+
17 rows in set (0.00 sec)

MariaDB [(none)]>

Revision history for this message
Peng Peng (ppeng) wrote :
Changed in starlingx:
status: Invalid → Confirmed
Revision history for this message
Liang Fang (liangfang) wrote :

The logs show that the latest create-volume request was sent at 17:36:24 from controller-0, requesting a 2 GB volume from image 'b545cf51-0e6f-4422-b1a8-7e774c054801'. This request was served by controller-0 and the volume was created successfully.

References
----------
file:cinder-api-77f55b6b-8kbss_openstack_cinder-api-d621e86322d63ddc598389f645d86b61b42246cdb88cbd31e668d3faabd07f82.log line 3021
file:cinder-volume-6d6cb49c84-kxc5n_openstack_cinder-volume-d914f3966780b1c3c12173c35bd4b203153ce16b357f7f564b74e88e2aee3c73.log line 237

I have not found any request to create a volume after this point, nor any errors after this point in the log.

I set up a StarlingX environment using the ISO generated on 20190708; volumes were created normally, and I cannot reproduce this issue on my side.

Some questions:
1. When this issue happens, does the same issue occur on subsequent requests? Do all subsequent create-volume requests fail?
2. Is it possible to share your environment when this issue happens? It would be much easier to root-cause this kind of intermittent issue that way. My email: <email address hidden>
3. Could you please record the time at which the create-volume request is sent, so it can be located in the logs (a sketch is given below)? Thanks.
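
For question 3, a sketch of capturing the request time alongside the call (image UUID, size, and volume name are placeholders):

# print the current UTC time, then issue the create request so the timestamp can be matched against the pod logs
date -u '+%Y-%m-%d %H:%M:%S'
openstack volume create --size 2 --image <image-uuid> vol_test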

Changed in starlingx:
status: Confirmed → Incomplete
Revision history for this message
Liang Fang (liangfang) wrote :

We found that the DB connection was lost at 15:56:56. Do you still have a record showing at what time you sent the create-volume request?

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Not sure why this bug has an stx.networking tag on it. This seems to be a storage issue. Internal connection to the DB is not part of the networking domain.

tags: removed: stx.networking
Revision history for this message
Liang Fang (liangfang) wrote :

@peng, if the number of volumes exceeds the quota (e.g. 8), the volume status will be error. If the issue is reproduced, please take a look at the quota (a quick check is sketched below); thanks.
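
A quick sketch of that check (the tenant ID below is taken from the failing request in the description; treat it as an example):

# show the project's volume quota and current usage
cinder quota-usage 5eddc54506a04b1487601e764f77baa5
# or list the project's existing volumes and compare the count against the quota
openstack volume list --project tenant2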

Revision history for this message
Peng Peng (ppeng) wrote :

We have not observed this issue recently. Will provide more information if it is reproduced again.

Revision history for this message
Senthil Mukundakumar (smukunda) wrote :

This issue is reproducible using load 20190807T053000Z on a storage system. The collect-all logs are attached.

TIMESTAMP: 2019-08-08 09:56:03,516
2019-08-08 10:52:27,061

Revision history for this message
Yang Liu (yliu12) wrote :

The above issue happened during sanity. After sanity completed, a number of cinder volumes were stuck in the creating status while one was stuck in-use.
The volume stuck in-use was the first one to have a problem: the VM using it had already been deleted, but the volume remained in-use. Ever since then, all volume creations have been stuck.

These are the steps for the first issue:

# create a cinder volume
2019-08-08T09:37:17.000 controller-1 -sh: info HISTORY: PID=748516 UID=42425 openstack --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume create --size=2 --image=507863f1-b992-4300-967a-e7b8839e06d3 --bootable vol-tenant1-vm-1-1

# launch a vm from it
[2019-08-08 09:37:21,085] 301 DEBUG MainThread ssh.send :: Send 'nova --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne boot --boot-volume=7ca4c340-dee7-4125-a272-7023f833d121 --key-name=keypair-tenant1 --flavor=d469bf62-c5db-4c49-9ac9-c75d01e3522f --poll --nic net-id=cb9285ab-3dd3-4efa-a09d-87bbc56bf7fb --nic net-id=c50df027-2272-41b4-82d7-0ec97634adad tenant1-vm-1'

# force reboot active controller
[2019-08-08 09:44:52,046] 180 INFO MainThread host_helper.reboot_hosts:: Rebooting active controller: controller-1
[2019-08-08 09:44:52,046] 301 DEBUG MainThread ssh.send :: Send 'sudo reboot -f'

# Stop and delete the server
[2019-08-08 09:51:21,774] 301 DEBUG MainThread ssh.send :: Send 'nova --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne stop 00bc77d9-48e0-4906-a711-54dccd939eb8'
[2019-08-08 09:51:28,556] 301 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server delete 00bc77d9-48e0-4906-a711-54dccd939eb8'

# VM was not removed right away, it got stuck for longer than normal
[2019-08-08 09:54:27,307] 301 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server list --a'
+--------------------------------------+--------------+---------+----------+-------+----------------------+
| ID | Name | Status | Networks | Image | Flavor |
+-------------------------------------...


Changed in starlingx:
status: Incomplete → Confirmed
Revision history for this message
Ghada Khalil (gkhalil) wrote :

As per agreement with the community, moving all unresolved medium priority bugs from stx.2.0 to stx.3.0

tags: added: stx.3.0
removed: stx.2.0
Yang Liu (yliu12)
tags: added: stx.retestneeded
Cindy Xie (xxie1)
Changed in starlingx:
assignee: Liang Fang (liangfang) → Ma Zheng (zheng-vm)
Revision history for this message
Ma Zheng (zheng-vm) wrote :

We cannot reproduce this in an AIO-DX environment with load 20190902T230000Z. Steps are as in comment #13: we created a 50 GB volume with cirros-0.4.0-x86_64-disk.img, and the volume was created successfully.

We are doing more testing to verify this bug.

Expecting more comments!

yong hu (yhu6)
tags: added: stx.distro.openstack
Revision history for this message
Ma Zheng (zheng-vm) wrote :

Hi all,

Since the bug has been inactive for more than a month, I'd like to ask about recent test results.
If this bug has not been reproduced in recent testing, could the reporter or someone else change the bug status?

Thanks!

Revision history for this message
Ma Zheng (zheng-vm) wrote :

For a recent update (OpenStack Train integration), we have set up an AIO-DX environment with load 20191119T000000Z.
Steps are as in comment #13: we created a 50 GB volume with cirros-0.4.0-x86_64-disk.img. This bug was not reproduced in our testing.

As the previous logs show, there is a timeout error connecting to RabbitMQ as well as a Ceph cluster error. These errors can occur when the cluster is not yet fully initialized after a reboot.

So we suggest, if testing manually, waiting until the cluster is fully up (a sketch of the checks is below) and seeing whether the issue recurs.
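
A minimal sketch of those readiness checks, run from the active controller (pod names are matched by pattern):

# storage backend: expect HEALTH_OK before creating volumes
ceph -s
# database and messaging pods: expect all Running and Ready
kubectl -n openstack get pods | grep -E 'mariadb|rabbitmq'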

Revision history for this message
Austin Sun (sunausti) wrote :

Maria:
   Would you like to verify this on the latest master build?

Thanks.
BR
Austin Sun.

Ma Zheng (zheng-vm)
Changed in starlingx:
status: Confirmed → Incomplete
Revision history for this message
zhipeng liu (zhipengs) wrote :

Not reproducible in 20191119T000000Z with the Train release.
Waiting for the reporter's confirmation.

Revision history for this message
zhipeng liu (zhipengs) wrote :

Setting to Invalid for now, since we could not reproduce it in the latest build.

Zhipeng

Changed in starlingx:
status: Incomplete → Invalid
Revision history for this message
Peng Peng (ppeng) wrote :

The issue was not reproduced on Train:
2019-11-21_20-00-00
wcp_3-6

Ghada Khalil (gkhalil)
tags: removed: stx.retestneeded