Volume was observed to go into error state on creation

Bug #1827063 reported by Maria Yousaf
Affects:      StarlingX
Status:       Invalid
Importance:   Medium
Assigned to:  Ma Zheng

Bug Description

Brief Description
-----------------
As part of automated regression testing, a 40 GB volume was created. The volume was observed to go into the error state.

Severity
--------
Major

Steps to Reproduce
------------------
1. Create a 40 GB volume (a representative command is sketched below)
2. Observe the volume go from the "creating" state to the "error" state
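
A representative command, modelled on the CLI calls in the test logs (the auth flags are abbreviated and the volume name/UUID are illustrative):

# create the 40 GB volume (add the --os-password / domain / region flags used in your environment)
cinder --os-username tenant2 --os-project-name tenant2 \
       --os-auth-url http://keystone.openstack.svc.cluster.local/v3 \
       create --name vol_inst2 40
# poll the volume until it leaves "creating"; in the failing case the status becomes "error"
cinder show <volume-uuid> | grep -w status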

Expected Behavior
------------------
Volume should be successfully created

Actual Behavior
----------------
Volume went into error state

Reproducibility
---------------
Intermittent.

System Configuration
--------------------
Dedicated storage

Branch/Pull Time/Commit
-----------------------
master as of 20190427T013000Z

Last Pass
---------
To be added

Timestamp/Logs
--------------
Creating:
[2019-04-28 12:39:56,199] 262 DEBUG MainThread ssh.send :: Send 'cinder --os-username 'tenant2' --os-password 'Li69nux*' --os-project-name tenant2 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os_region_name RegionOne show 4cc75d05-ff48-41bf-b4f3-cbb1c762abb4'
[2019-04-28 12:39:57,564] 387 DEBUG MainThread ssh.expect :: Output:
+------------------------------+--------------------------------------+
| Property | Value |
+------------------------------+--------------------------------------+
| attached_servers | [] |
| attachment_ids | [] |
| availability_zone | nova |
| bootable | false |
| consistencygroup_id | None |
| created_at | 2019-04-28T12:39:51.000000 |
| description | None |
| encrypted | False |
| id | 4cc75d05-ff48-41bf-b4f3-cbb1c762abb4 |
| metadata | |
| multiattach | False |
| name | vol_inst2 |
| os-vol-tenant-attr:tenant_id | 5eddc54506a04b1487601e764f77baa5 |
| replication_status | None |
| size | 40 |
| snapshot_id | None |
| source_volid | None |
| status | creating |
| updated_at | 2019-04-28T12:39:51.000000 |
| user_id | 60b94f9a0e5148f1bfcd1e192ba5f4d7 |
| volume_type | None |
+------------------------------+--------------------------------------+

Error state:

[2019-04-28 12:40:02,063] 387 DEBUG MainThread ssh.expect :: Output:
+------------------------------+--------------------------------------+
| Property | Value |
+------------------------------+--------------------------------------+
| attached_servers | [] |
| attachment_ids | [] |
| availability_zone | nova |
| bootable | false |
| consistencygroup_id | None |
| created_at | 2019-04-28T12:39:51.000000 |
| description | None |
| encrypted | False |
| id | 4cc75d05-ff48-41bf-b4f3-cbb1c762abb4 |
| metadata | |
| multiattach | False |
| name | vol_inst2 |
| os-vol-tenant-attr:tenant_id | 5eddc54506a04b1487601e764f77baa5 |
| replication_status | None |
| size | 40 |
| snapshot_id | None |
| source_volid | None |
| status | error |
| updated_at | 2019-04-28T12:39:58.000000 |
| user_id | 60b94f9a0e5148f1bfcd1e192ba5f4d7 |
| volume_type | None |
+------------------------------+--------------------------------------+
[wrsroot@controller-0 ~(keystone_admin)]$

The following is seen in the logs:

{"log":"2019-04-28 12:26:15.037 19 ERROR oslo_db.sqlalchemy.engines DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: u'SELECT 1'] (Background on this error at: http://sqlalche.me/e/e3q8)\n","stream":"stdout","time":"2019-04-28T12:26:15.038802703Z"}$
{"log":"2019-04-28 12:26:15.037 19 ERROR oslo_db.sqlalchemy.engines \n","stream":"stdout","time":"2019-04-28T12:26:15.038807776Z"}$
{"log":"2019-04-28 12:28:45.069 19 ERROR oslo_db.sqlalchemy.engines [req-e4a3f391-0361-4d4a-b667-f58774dace73 - - - - -] Database connection was found disconnected; reconnecting: DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: u'SELECT 1'] (Background on this error at: http://sqlalche.me/e/e3q8)\n","stream":"stdout","time":"2019-04-28T12:28:45.070504819Z"}$
{"log":"2019-04-28 12:28:45.069 19 ERROR oslo_db.sqlalchemy.engines Traceback (most recent call last):\n","stream":"stdout","time":"2019-04-28T12:28:45.070548809Z"}$
{"log":"2019-04-28 12:28:45.069 19 ERROR oslo_db.sqlalchemy.engines File \"/var/lib/openstack/lib/python2.7/site-packages/oslo_db/sqlalchemy/engines.py\", line 73, in _connect_ping_listener\n","stream":"stdout","time":"2019-04-28T12:28:45.070565662Z"}$
{"log":"2019-04-28 12:28:45.069 19 ERROR oslo_db.sqlalchemy.engines connection.scalar(select([1]))\n","stream":"stdout","time":"2019-04-28T12:28:45.070571682Z"}$
{"log":"2019-04-28 12:28:45.069 19 ERROR oslo_db.sqlalchemy.engines File \"/var/lib/openstack/lib/python2.7/site-packages/sqlalchemy/engine/base.py\", line 912, in scalar\n","stream":"stdout","time":"2019-04-28T12:28:45.070576625Z"}$
{"log":"2019-04-28 12:28:45.069 19 ERROR oslo_db.sqlalchemy.engines return self.execute(object_, *multiparams, **params).scalar()\n","stream":"stdout","time":"2019-04-28T12:28:45.07058163Z"}$
{"log":"2019-04-28 12:28:45.069 19 ERROR oslo_db.sqlalchemy.engines File \"/var/lib/openstack/lib/python2.7/site-packages/sqlalchemy/engine/base.py\", line 980, in execute\n","stream":"stdout","time":"2019-04-28T12:28:45.070587339Z"}$
{"log":"2019-04-28 12:28:45.069 19 ERROR oslo_db.sqlalchemy.engines return meth(self, multiparams, params)\n","stream":"stdout","time":"2019-04-28T12:28:45.070592368Z"}$
{"log":"2019-04-28 12:28:45.069 19 ERROR oslo_db.sqlalchemy.engines File \"/var/lib/openstack/lib/python2.7/site-packages/sqlalchemy/sql/elements.py\", line 273, in _execute_on_connection\n","stream":"stdout","time":"2019-04-28T12:28:45.070596992Z"}$
{"log":"2019-04-28 12:28:45.069 19 ERROR oslo_db.sqlalchemy.engines return connection._execute_clauseelement(self, multiparams, params)\n","stream":"stdout","time":"2019-04-28T12:28:45.07060209Z"}$
{"log":"2019-04-28 12:28:45.069 19 ERROR oslo_db.sqlalchemy.engines File \"/var/lib/openstack/lib/python2.7/site-packages/sqlalchemy/engine/base.py\", line 1099, in _execute_clauseelement\n","stream":"stdout","time":"2019-04-28T12:28:45.070609566Z"}$
{"log":"2019-04-28 12:28:45.069 19 ERROR oslo_db.sqlalchemy.engines distilled_params,\n","stream":"stdout","time":"2019-04-28T12:28:45.070614512Z"}$
{"log":"2019-04-28 12:28:45.069 19 ERROR oslo_db.sqlalchemy.engines File \"/var/lib/openstack/lib/python2.7/site-packages/sqlalchemy/engine/base.py\", line 1240, in _execute_context\n","stream":"stdout","time":"2019-04-28T12:28:45.070619107Z"}$

Test Activity
-------------
Regression

Revision history for this message
Maria Yousaf (myousaf) wrote :
Revision history for this message
Frank Miller (sensfan22) wrote :

Marking as stx.2.0 gating since creating volumes is fundamental to using StarlingX with VMs. The issue is intermittent, so setting the priority to medium.
This may be a duplicate of https://bugs.launchpad.net/starlingx/+bug/1827230

Assigning to Cindy, with a request for assistance in identifying a prime to investigate this issue.

Changed in starlingx:
status: New → Triaged
importance: Undecided → High
importance: High → Medium
assignee: nobody → Cindy Xie (xxie1)
tags: added: stx.20.0 stx.retestneeded
tags: added: stx.2.0
removed: stx.20.0
Revision history for this message
Liang Fang (liangfang) wrote :

Agreed, this should be the same issue as bug 1827230.

The root cause is a lost connection to the database.
To narrow down the issue, we may need to:
1. Log in to the pod running MariaDB and check whether MariaDB is working (a sketch of the check follows this list). I expect it is working; could the bug reporter confirm this?
2. If MariaDB is working correctly, then it is likely a network issue in reaching the DB, and the network team should take a look at this bug.
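
A minimal sketch of check 1, assuming the openstack namespace and the mariadb-server-0 pod name:

# confirm the MariaDB pods are Running/Ready
kubectl -n openstack get pods | grep mariadb
# run a trivial query from inside the pod; a result of "1" means the server is answering
kubectl -n openstack exec -it mariadb-server-0 -- \
    sh -c 'mysql --password="$MYSQL_DBADMIN_PASSWORD" -e "SELECT 1;"'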

Cindy Xie (xxie1)
Changed in starlingx:
assignee: Cindy Xie (xxie1) → Zhuweiwei (vivian.zhu)
Liang Fang (liangfang)
tags: added: stx.networking
Cindy Xie (xxie1)
tags: added: stx.storage
Liang Fang (liangfang)
Changed in starlingx:
assignee: Zhuweiwei (vivian.zhu) → Liang Fang (liangfang)
Liang Fang (liangfang)
Changed in starlingx:
status: Triaged → Incomplete
Revision history for this message
Peng Peng (ppeng) wrote :

We did not see this issue in recent loads

tags: removed: stx.retestneeded
Cindy Xie (xxie1)
Changed in starlingx:
status: Incomplete → Invalid
Revision history for this message
Peng Peng (ppeng) wrote :

Issue was reproduced on
Lab: WCP_113_121
Load: 20190703T013000Z

Collect logs are attached.

The MariaDB login log is shown below:
controller-1:~$ kubectl get pod -n openstack | grep maria
mariadb-ingress-6ff964556d-tb4qt 1/1 Running 0 9h
mariadb-ingress-6ff964556d-zgjf2 1/1 Running 0 165m
mariadb-ingress-error-pages-764cfd869b-fpbtq 1/1 Running 0 165m
mariadb-server-0 1/1 Running 4 8h
mariadb-server-1 1/1 Running 4 164m

controller-1:~$ kubectl exec mariadb-server-0 -n openstack -it bash
root@mariadb-server-0:/#
root@mariadb-server-0:/# env | grep MY
MYSQL_DBADMIN_PASSWORD=eba59f6ebf32Ti0*
MYSQL_DBSST_PASSWORD=password
MYSQL_DBADMIN_USERNAME=root
MARIADB_SERVICE_PORT_MYSQL=3306
MARIADB_SERVER_SERVICE_PORT_MYSQL=3306
MYSQL_DBSST_USERNAME=sst
root@mariadb-server-0:/#
root@mariadb-server-0:/# mysql --password $MYSQL_DBADMIN_PASSWORD
Enter password:
ERROR 1045 (28000): Access denied for user 'root'@'127.0.0.1' (using password: YES)
root@mariadb-server-0:/#
root@mariadb-server-0:/# mysql --password=$MYSQL_DBADMIN_PASSWORD
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 28955
Server version: 10.2.18-MariaDB-1:10.2.18+maria~bionic mariadb.org binary distribution

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| aodh |
| barbican |
| cinder |
| glance |
| gnocchi |
| heat |
| horizon |
| information_schema |
| keystone |
| mysql |
| neutron |
| nova |
| nova_api |
| nova_cell0 |
| panko |
| performance_schema |
| placement |
+--------------------+
17 rows in set (0.00 sec)

MariaDB [(none)]>

Revision history for this message
Peng Peng (ppeng) wrote :
Changed in starlingx:
status: Invalid → Confirmed
Revision history for this message
Liang Fang (liangfang) wrote :

The logs show that the latest create-volume request was sent at 17:36:24 from controller-0, requesting a 2 GB volume from image 'b545cf51-0e6f-4422-b1a8-7e774c054801'. This request was served by controller-0 and the volume was created successfully.

References
----------
file:cinder-api-77f55b6b-8kbss_openstack_cinder-api-d621e86322d63ddc598389f645d86b61b42246cdb88cbd31e668d3faabd07f82.log line 3021
file:cinder-volume-6d6cb49c84-kxc5n_openstack_cinder-volume-d914f3966780b1c3c12173c35bd4b203153ce16b357f7f564b74e88e2aee3c73.log line 237

I have not found any request to create a volume after this point, nor any errors after this point in the log.

I set up a StarlingX environment using the ISO generated on 20190708; volumes were created normally, and I cannot reproduce this issue on my side.

Some questions:
1. When this issue happens, does the same issue occur on subsequent requests? Do all subsequent create-volume requests fail?
2. Is it possible to share your environment when this issue happens? It would be much easier to root-cause this kind of intermittent issue that way. My email: <email address hidden>
3. Could you please record the time at which the create-volume request is sent, so it can be located in the logs (a sketch is given below)? Thanks.
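
For question 3, a sketch of capturing the request time alongside the call (image UUID, size, and volume name are placeholders):

# print the current UTC time, then issue the create request so the timestamp can be matched against the pod logs
date -u '+%Y-%m-%d %H:%M:%S'
openstack volume create --size 2 --image <image-uuid> vol_test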

Changed in starlingx:
status: Confirmed → Incomplete
Revision history for this message
Liang Fang (liangfang) wrote :

We found that the DB connection was lost at 15:56:56. Do you still have a record showing at what time you sent the create-volume request?

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Not sure why this bug has an stx.networking tag on it. This seems to be a storage issue. Internal connection to the DB is not part of the networking domain.

tags: removed: stx.networking
Revision history for this message
Liang Fang (liangfang) wrote :

@peng, if the number of volumes exceeds the quota (e.g. 8), the volume status will be error. If the issue is reproduced, please take a look at the quota (a quick check is sketched below); thanks.
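
A quick sketch of that check (the tenant ID below is taken from the failing request in the description; treat it as an example):

# show the project's volume quota and current usage
cinder quota-usage 5eddc54506a04b1487601e764f77baa5
# or list the project's existing volumes and compare the count against the quota
openstack volume list --project tenant2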

Revision history for this message
Peng Peng (ppeng) wrote :

We have not observed this issue recently. Will provide more information if it is reproduced again.

Revision history for this message
Senthil Mukundakumar (smukunda) wrote :

This issue is reproducible using load 20190807T053000Z on a storage system. The collect-all logs are attached.

TIMESTAMP: 2019-08-08 09:56:03,516
2019-08-08 10:52:27,061

Revision history for this message
Yang Liu (yliu12) wrote :

The above issue happened during sanity. After sanity completed, a number of cinder volumes were stuck in the creating status while one was stuck in-use.
The volume stuck in-use was the first one to have a problem: the VM using it had already been deleted, but the volume remained in-use. Ever since then, all volume creations have been stuck.

These are the steps for the first issue:

# create a cinder volume
2019-08-08T09:37:17.000 controller-1 -sh: info HISTORY: PID=748516 UID=42425 openstack --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume create --size=2 --image=507863f1-b992-4300-967a-e7b8839e06d3 --bootable vol-tenant1-vm-1-1

# launch a vm from it
[2019-08-08 09:37:21,085] 301 DEBUG MainThread ssh.send :: Send 'nova --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne boot --boot-volume=7ca4c340-dee7-4125-a272-7023f833d121 --key-name=keypair-tenant1 --flavor=d469bf62-c5db-4c49-9ac9-c75d01e3522f --poll --nic net-id=cb9285ab-3dd3-4efa-a09d-87bbc56bf7fb --nic net-id=c50df027-2272-41b4-82d7-0ec97634adad tenant1-vm-1'

# force reboot active controller
[2019-08-08 09:44:52,046] 180 INFO MainThread host_helper.reboot_hosts:: Rebooting active controller: controller-1
[2019-08-08 09:44:52,046] 301 DEBUG MainThread ssh.send :: Send 'sudo reboot -f'

# Stop and delete the server
[2019-08-08 09:51:21,774] 301 DEBUG MainThread ssh.send :: Send 'nova --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne stop 00bc77d9-48e0-4906-a711-54dccd939eb8'
[2019-08-08 09:51:28,556] 301 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server delete 00bc77d9-48e0-4906-a711-54dccd939eb8'

# VM was not removed right away, it got stuck for longer than normal
[2019-08-08 09:54:27,307] 301 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server list --a'
+--------------------------------------+--------------+---------+----------+-------+----------------------+
| ID | Name | Status | Networks | Image | Flavor |
+-------------------------------------...


Changed in starlingx:
status: Incomplete → Confirmed
Revision history for this message
Ghada Khalil (gkhalil) wrote :

As per agreement with the community, moving all unresolved medium priority bugs from stx.2.0 to stx.3.0

tags: added: stx.3.0
removed: stx.2.0
Yang Liu (yliu12)
tags: added: stx.retestneeded
Cindy Xie (xxie1)
Changed in starlingx:
assignee: Liang Fang (liangfang) → Ma Zheng (zheng-vm)
Revision history for this message
Ma Zheng (zheng-vm) wrote :

We cannot reproduce this in an AIO-DX environment with load 20190902T230000Z. Steps are as in comment #13: we created a 50 GB volume with cirros-0.4.0-x86_64-disk.img, and the volume was created successfully.

We are doing more testing to verify this bug.

Expecting more comments!

yong hu (yhu6)
tags: added: stx.distro.openstack
Revision history for this message
Ma Zheng (zheng-vm) wrote :

Hi all,

Since the bug has been inactive for more than a month, I'd like to ask about recent test results.
If this bug has not been reproduced in recent testing, could the reporter or someone else change the bug status?

Thanks!

Revision history for this message
Ma Zheng (zheng-vm) wrote :

For a recent update (OpenStack Train integration), we have set up an AIO-DX environment with load 20191119T000000Z.
Steps are as in comment #13: we created a 50 GB volume with cirros-0.4.0-x86_64-disk.img. This bug was not reproduced in our testing.

As the previous logs show, there is a timeout error connecting to RabbitMQ as well as a Ceph cluster error. These errors can occur when the cluster is not yet fully initialized after a reboot.

So we suggest, if testing manually, waiting until the cluster is fully up (a sketch of the checks is below) and seeing whether the issue recurs.
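
A minimal sketch of those readiness checks, run from the active controller (pod names are matched by pattern):

# storage backend: expect HEALTH_OK before creating volumes
ceph -s
# database and messaging pods: expect all Running and Ready
kubectl -n openstack get pods | grep -E 'mariadb|rabbitmq'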

Revision history for this message
Austin Sun (sunausti) wrote :

Maria:
   Would you like to verify this on the latest master build?

Thanks.
BR
Austin Sun.

Ma Zheng (zheng-vm)
Changed in starlingx:
status: Confirmed → Incomplete
Revision history for this message
zhipeng liu (zhipengs) wrote :

Not reproducible in 20191119T000000Z with the Train release.
Waiting for the reporter's confirmation.

Revision history for this message
zhipeng liu (zhipengs) wrote :

Setting to Invalid for now, since we could not reproduce it in the latest build.

Zhipeng

Changed in starlingx:
status: Incomplete → Invalid
Revision history for this message
Peng Peng (ppeng) wrote :

The issue was not reproduced on Train:
2019-11-21_20-00-00
wcp_3-6

Ghada Khalil (gkhalil)
tags: removed: stx.retestneeded