Create instance from Image or from Volume fails

Bug #1837241 reported by Cristopher Lemus
Affects: StarlingX
Status: Fix Released
Importance: Critical
Assigned to: zhipeng liu

Bug Description

Brief Description
-----------------
During sanity testing, creating an instance from an image and creating an instance from a snapshot fail on Standard with External Storage (virtual and bare metal, same errors).

Severity
--------
Critical. Instances cannot be created.

Steps to Reproduce
------------------
These are the steps to reproduce the error when creating an instance from an image. We are using a CirrOS image for this test:
1. Create flavor
   openstack flavor create --ram 2048 --disk 20 --vcpus 1 --public --id auto f1.small

2. Create image
   openstack image create --file /home/sysadmin/cirros-0.4.0-x86_64-disk.qcow2 --disk-format qcow2 --public cirros

3. Create network
   openstack network create network-1
   openstack subnet create --network network-1 --subnet-range 192.168.0.0/24 --ip-version 4 --dhcp subnet-1

4. Launch Cirros instance
   openstack server create --image cirros --flavor f1.small --nic net-id=e0ff5c44-ec8d-42f9-b6cb-86e308613860 vm-cirros-1

5. Instance vm-cirros-1 goes to ERROR state
   openstack server show vm-cirros-1 | grep -w status
   ERROR != ACTIVE

Additional output logged at this link: http://paste.openstack.org/show/754600/

Creating instances from snapshots and from templates also fails with the same behavior.
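
For reference, a quick way to pull the failing server's status and the fault message Nova records, built from the same openstack CLI commands used in the steps above (a minimal sketch; the polling loop and the use of the 'fault' field are illustrative additions):

    # Poll the instance until it leaves BUILD, then dump the fault that
    # Nova recorded for the failure.
    for i in $(seq 1 30); do
        status=$(openstack server show vm-cirros-1 -f value -c status)
        [ "$status" != "BUILD" ] && break
        sleep 5
    done
    echo "status: $status"
    if [ "$status" = "ERROR" ]; then
        # 'fault' holds the error text quoted in the comments below
        openstack server show vm-cirros-1 -f value -c fault
    fi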

Expected Behavior
------------------
Instance in ACTIVE status.

Actual Behavior
----------------
Instance in ERROR status.

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
Virtual and Baremetal. Standard with external storage.

Branch/Pull Time/Commit
-----------------------
ISO: 20190719T013000Z

Last Pass
---------
Last pass with previous ISO 20190718.

Timestamp/Logs
--------------
Console output logged at: http://paste.openstack.org/show/754600/. Full collect attached.

Test Activity
-------------
Sanity

Revision history for this message
Cristopher Lemus (cjlemusc) wrote :
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Assigning to the distro.openstack team to investigate. According to the reporter, this is a recent issue and is 100% reproducible. The impact is high given that instances cannot be created.

tags: added: stx.distro.openstack
Changed in starlingx:
assignee: nobody → yong hu (yhu6)
importance: Undecided → Critical
importance: Critical → High
tags: added: stx.2.0
Changed in starlingx:
status: New → Triaged
Revision history for this message
yong hu (yhu6) wrote :

@zhipeng, please analyze this High issue.

Changed in starlingx:
assignee: yong hu (yhu6) → zhipeng liu (zhipengs)
Revision history for this message
zhipeng liu (zhipengs) wrote :

From the provided log, I saw the errors below.
I will try to reproduce it in my setup and check further.

{"log":"2019-07-19 05:17:45.335 1 ERROR oslo.messaging._drivers.impl_rabbit [-] [c4616b8e-1973-4934-abcd-edbccab40c7d] AMQP server on rabbitmq.openstack.svc.cluster.local:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: error: [Errno 104] Connection reset by peer\n","stream":"stdout","time":"2019-07-19T05:17:45.336051349Z"}
{"log":"2019-07-19 05:17:51.061 1 ERROR oslo_db.sqlalchemy.engines [req-c3320e24-d192-4e3e-b87d-f65c2ea1a13e - - - - -] Database connection was found disconnected; reconnecting: DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: u'SELECT 1'] (Background on this error at: http://sqlalche.me/e/e3q8)\n","stream":"stdout","time":"2019-07-19T05:17:51.064104242Z"}

{"log":"2019-07-19 05:17:45.427 1 ERROR oslo.messaging._drivers.impl_rabbit [-] [9b5b205e-f138-4471-9ebc-858358c7baef] AMQP server on rabbitmq.openstack.svc.cluster.local:5672 is unreachable: Basic.cancel: (0) 3. Trying again in 1 seconds.: ConsumerCancelled: Basic.cancel: (0) 3\n","stream":"stdout","time":"2019-07-19T05:17:45.428226962Z"}
{"log":"2019-07-19 05:17:45.598 1 ERROR oslo_db.sqlalchemy.engines [req-7c6bf1f1-4670-47df-8333-063e51a59285 - - - - -] Database connection was found disconnected; reconnecting: DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: u'SELECT 1'] (Background on this error at: http://sqlalche.me/e/e3q8)\n","stream":"stdout","time":"2019-07-19T05:17:45.6008754Z"}
{"log":"2019-07-19 05:17:45.598 1 ERROR oslo_db.sqlalchemy.engines Traceback (most recent call last):\n","stream":"stdout","time":"2019-07-19T05:17:45.60089745Z"}
{"log":"2019-07-19 05:17:45.598 1 ERROR oslo_db.sqlalchemy.engines File \"/var/lib/openstack/lib/python2.7/site-packages/oslo_db/sqlalchemy/engines.py\", line 73, in _connect_ping_listener\n","stream":"stdout","time":"2019-07-19T05:17:45.600901396Z"}
{"log":"2019-07-19 05:17:45.598 1 ERROR oslo_db.sqlalchemy.engines connection.scalar(select([1]))\n","stream":"stdout","time":"2019-07-19T05:17:45.600904434Z"}
{"log":"2019-07-19 05:17:45.598 1 ERROR oslo_db.sqlalchemy.engines File \"/var/lib/openstack/lib/python2.7/site-packages/sqlalchemy/engine/base.py\", line 912, in scalar\n","stream":"stdout","time":"2019-07-19T05:17:45.60090709Z"}
{"log":"2019-07-19 05:17:45.598 1 ERROR oslo_db.sqlalchemy.engines return self.execute(object_, *multiparams, **params).scalar()\n","stream":"stdout","time":"2019-07-19T05:17:45.600911245Z"}
{"log":"2019-07-19 05:17:45.598 1 ERROR oslo_db.sqlalchemy.engines File \"/var/lib/openstack/lib/python2.7/site-packages/sqlalchemy/engine/base.py\", line 980, in execute\n","stream":"stdout","time":"2019-07-19T05:17:45.600913757Z"}
{"log":"2019-07-19 05:17:45.598 1 ERROR oslo_db.sqlalchemy.engines return meth(self, multiparams, params)\n","stream":"stdout","time":"2019-07-19T05:17:45.600916528Z"}
{"log":"2019-07-19 05:17:45.598 1 ERROR oslo_db.sqlalchemy.engines File \"/var/lib/openstack/lib/python2.7/site-packages/sqlalchemy/sql/el...


Revision history for this message
zhipeng liu (zhipengs) wrote :

From the bug description, it seems this only fails on Standard with External Storage (virtual and bare metal, same errors).

From the log at http://paste.openstack.org/show/754600/, it looks like the image format is not right:

fault message: u'Build of instance ab3dfe11-3ac3-4a26-a1ce-3bff1e2ca78b aborted: Image 69e1593f-4d03-4cd5-9779-8ffc1d227da0 is unacceptable: Converted to raw, but format is now qcow2', u'code': 500, u'details': u' File "/var/lib/openstack/lib/python2.7/site-packages/nova/compute/manager.py", line 1984, in _do_build_and_run_instance\n filter_properties, request_spec)\n File "/var/lib/openstack/lib/python2.7/site-packages/nova/compute/manager.py", line 2344, in _build_and_run_instance\n reason=e.format_message())\n', u'created': u'2019-07-19T11:01:40Z'}

Not sure if it is related to another bug reported in this daily sanity test.
https://bugs.launchpad.net/starlingx/+bug/1837263
App platform-integ-apps not applied, Ceph HEALTH_WARN

Zhipeng

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Changing the importance to Critical given that this is causing a red sanity.

Changed in starlingx:
importance: High → Critical
Revision history for this message
yong hu (yhu6) wrote :

@zhipeng, before you create the VM instance, please check that both the platform and stx-openstack applications were correctly applied and that all the pods/jobs/services are in the correct state.
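
A minimal sketch of those checks (standard StarlingX and Kubernetes commands; the grep filter that hides healthy pods is an illustrative addition):

    # Confirm the application apply status, then list any pods that are not
    # Running or Completed.
    system application-list
    kubectl get pods --all-namespaces | grep -vE 'Running|Completed'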

Moreover this error message seems more relevant to this LP:
fault message: u'Build of instance ab3dfe11-3ac3-4a26-a1ce-3bff1e2ca78b aborted: Image 69e1593f-4d03-4cd5-9779-8ffc1d227da0 is unacceptable: Converted to raw, but format is now qcow2'

I would advise you to dig in this direction.
FYI: you can cross-check "openstack image create" against "glance image-create" and see if any differences come out:

glance image-create --name "centos-root-img" --visibility public --disk-format=qcow2 --container-format=bare --file cirros-0.4.0-x86_64-disk.qcow2 --progress

openstack image create --file /home/sysadmin/cirros-0.4.0-x86_64-disk.qcow2 --disk-format qcow2 --public cirros
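
As a side cross-check, it is also worth comparing what the image file actually is with what Glance has registered (a minimal sketch using the same file path and image name as above):

    # Compare the on-disk format of the source file with the disk_format
    # Glance recorded for the uploaded image; both should report qcow2.
    qemu-img info /home/sysadmin/cirros-0.4.0-x86_64-disk.qcow2 | grep 'file format'
    openstack image show cirros -f value -c disk_format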

Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

Hi Yong, Zhipeng,

Something that might help.

1. The image is healthy; it is the exact same image that we use to test on Simplex, Duplex, and Standard. We also verify its checksum before consuming it.

2. Please check this log: http://paste.openstack.org/show/754739/
     a) Creating an instance from an image fails.
     b) Creating an instance from a volume passes.
     c) Creating an instance from a volume snapshot fails.
     d) Creating an instance using a template fails.
   Is there some specific backend storage that instances created from volumes are using?

3. From the previous comment, and from debugging our logs, we found that the command "system host-lvg-modify -b remote compute-0 nova-local" no longer works. I couldn't find how to configure the remote LVG on compute nodes. Could this be the cause of the failure?

Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

Checking the logs further, it seems that the command "system host-lvg-modify" was updated since June 28, so it might not be related to this issue:

./standard_ext_provision_201906280523/debug.log:system: error: unrecognized arguments: -b nova-local

Revision history for this message
zhipeng liu (zhipengs) wrote :

In 20190719T013000Z/outputs/CHANGELOG.txt
I saw we merged
https://review.opendev.org/#/c/661512/
Use true for force_raw_images when using ceph image backend

We need this patch for two reasons:

1. Nova in StarlingX does not yet have patch [0]. We use remote storage (Ceph) as the Nova backend, and if we set force_raw_images to False and boot VMs with a qcow2 image, the VMs fail to boot.

2. Once Nova in StarlingX has patch [0], keeping force_raw_images set to False makes the nova-compute service refuse to start.

So either way, we must set force_raw_images to True.

[0] https://review.opendev.org/#/c/640271/

However, I double-checked our current stein.2 Nova: [0] is not included, and this is the problem!
We have two options:
1) Revert 661512 (the config change) to unblock the sanity test first.
2) Cherry-pick [0] into Nova stein.2.
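
For context, the setting interaction described above can be checked directly on a compute host. A minimal sketch, assuming shell access to the rendered nova.conf (e.g. inside the nova-compute container), which is not something shown in this report:

    # Show the two options whose interaction is being discussed. With the
    # Ceph (rbd) image backend, images_type = rbd combined with
    # force_raw_images = False is the combination behind the "Converted to
    # raw, but format is now qcow2" build abort quoted earlier.
    grep -E '^(force_raw_images|images_type)' /etc/nova/nova.conf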

Zhipeng

zhipeng liu (zhipengs)
Changed in starlingx:
status: Triaged → Confirmed
Revision history for this message
Boxiang Zhu (bxzhu-5355) wrote :

@zhipeng

First of all, the default value of force_raw_images in the Nova project is True.

In stx 1.0, we set it to False and then forced conversion of the image format to raw in rbd with some StarlingX-specific code (not in upstream Nova).

Now in stx 2.0, we use upstream Nova, so we no longer have that StarlingX-specific code to force conversion of the format to raw in rbd. So we must set force_raw_images to True; if we don't, VMs will fail to boot when using qcow2 images.

I know that, at the moment, neither the upstream Nova stein branch nor stx includes patch [0].

[0] https://review.opendev.org/#/c/640271/

Revision history for this message
zhipeng liu (zhipengs) wrote :

Hi all,

After discussing with Boxiang, it might not fix the issue if we just cherry-pick upstream patch 640271.
So I propose to revert the patch below to unblock the sanity test first.
https://review.opendev.org/#/c/661512/

If we want to merge this change again, we need to figure out whether the Nova side or other places need to change as well.

Thanks!
Zhipeng

Changed in starlingx:
status: Confirmed → In Progress
Revision history for this message
yong hu (yhu6) wrote :

Patch under review: https://review.opendev.org/#/c/672287/

Let's see the latest validation on the EB and the sanity test result.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/672287
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=916e716018dcefabb7f14e58db1ad38bc7d17f97
Submitter: Zuul
Branch: master

commit 916e716018dcefabb7f14e58db1ad38bc7d17f97
Author: zhipeng liu <email address hidden>
Date: Tue Jul 23 13:13:01 2019 +0000

    Revert "Use true for force_raw_images when using ceph image backend"

    This reverts commit 82906cd3c946f9689afe03b52ada16707f5b9a92.
    This patch blocked sanity test
    Create instance from Image or from Volume fails

    Closes-Bug: #1837241

    Change-Id: Ic1063f3520a20e962a80300942af09f2aa9c7292
    Signed-off-by: zhipengl <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Chris Friesen (cbf123) wrote :

Boxiang, is there anything in particular that you're worried about when saying that backporting https://review.opendev.org/#/c/640271 might not fix things?

As I recall, we merged https://review.opendev.org/#/c/661512/ because qcow2 images wouldn't work with ceph remote storage, so could someone explain how reverting it will help things?

Revision history for this message
Chris Friesen (cbf123) wrote :

Has anyone actually tried this experiment after backporting https://review.opendev.org/#/c/640271 ?

Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

We used the build that Zhipeng created to test it in the virtual environment; instances were created properly.

With the fix merged in CENGN build 20190724T013000Z, a full sanity was executed on both virtual and bare metal environments. Instances can be created without issues on both.

Thanks for fixing it.
