Create instance from Image or from Volume fails

Bug #1837241 reported by Cristopher Lemus
Affects: StarlingX
Status: Fix Released
Importance: Critical
Assigned to: zhipeng liu

Bug Description

Brief Description
-----------------
During sanity testing, creating an instance from an image and creating an instance from a snapshot fail on Standard with External Storage (virtual and bare metal, same errors).

Severity
--------
Critical. Instances cannot be created.

Steps to Reproduce
------------------
These are the steps to reproduce the error when creating an instance from an image. We are using a CirrOS image for this test:
1. Create flavor
   openstack flavor create --ram 2048 --disk 20 --vcpus 1 --public --id auto f1.small

2. Create image
   openstack image create --file /home/sysadmin/cirros-0.4.0-x86_64-disk.qcow2 --disk-format qcow2 --public cirros

3. Create network
   openstack network create network-1
   openstack subnet create --network network-1 --subnet-range 192.168.0.0/24 --ip-version 4 --dhcp subnet-1

4. Launch Cirros instance
   openstack server create --image cirros --flavor f1.small --nic net-id=e0ff5c44-ec8d-42f9-b6cb-86e308613860 vm-cirros-1

5. Instance vm-cirros-1 goes to ERROR state
   openstack server show vm-cirros-1 | grep -w status
   ERROR != ACTIVE

Additional output logged at this link: http://paste.openstack.org/show/754600/

Creating instances from snapshots and from templates also fails with the same behavior.
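
For reference, a quick way to pull the failing server's status and the fault message Nova records, built from the same openstack CLI commands used in the steps above (a minimal sketch; the polling loop and the use of the 'fault' field are illustrative additions):

    # Poll the instance until it leaves BUILD, then dump the fault that
    # Nova recorded for the failure.
    for i in $(seq 1 30); do
        status=$(openstack server show vm-cirros-1 -f value -c status)
        [ "$status" != "BUILD" ] && break
        sleep 5
    done
    echo "status: $status"
    if [ "$status" = "ERROR" ]; then
        # 'fault' holds the error text quoted in the comments below
        openstack server show vm-cirros-1 -f value -c fault
    fi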

Expected Behavior
------------------
Instance in ACTIVE status.

Actual Behavior
----------------
Instance in ERROR status.

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
Virtual and Baremetal. Standard with external storage.

Branch/Pull Time/Commit
-----------------------
ISO: 20190719T013000Z

Last Pass
---------
Last pass with previous ISO 20190718.

Timestamp/Logs
--------------
Console output logged at: http://paste.openstack.org/show/754600/. Full collect attached.

Test Activity
-------------
Sanity

Revision history for this message
Cristopher Lemus (cjlemusc) wrote :
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Assigning to the distro.openstack team to investigate. According to the reporter, this is a recent issue and is 100% reproducible. The impact is high given that instances cannot be created.

tags: added: stx.distro.openstack
Changed in starlingx:
assignee: nobody → yong hu (yhu6)
importance: Undecided → Critical
importance: Critical → High
tags: added: stx.2.0
Changed in starlingx:
status: New → Triaged
Revision history for this message
yong hu (yhu6) wrote :

@zhipeng, please analyze this High issue.

Changed in starlingx:
assignee: yong hu (yhu6) → zhipeng liu (zhipengs)
Revision history for this message
zhipeng liu (zhipengs) wrote :

From the provided log, I saw the errors below.
I will try to reproduce it in my setup and check further.

{"log":"2019-07-19 05:17:45.335 1 ERROR oslo.messaging._drivers.impl_rabbit [-] [c4616b8e-1973-4934-abcd-edbccab40c7d] AMQP server on rabbitmq.openstack.svc.cluster.local:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: error: [Errno 104] Connection reset by peer\n","stream":"stdout","time":"2019-07-19T05:17:45.336051349Z"}
{"log":"2019-07-19 05:17:51.061 1 ERROR oslo_db.sqlalchemy.engines [req-c3320e24-d192-4e3e-b87d-f65c2ea1a13e - - - - -] Database connection was found disconnected; reconnecting: DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: u'SELECT 1'] (Background on this error at: http://sqlalche.me/e/e3q8)\n","stream":"stdout","time":"2019-07-19T05:17:51.064104242Z"}

{"log":"2019-07-19 05:17:45.427 1 ERROR oslo.messaging._drivers.impl_rabbit [-] [9b5b205e-f138-4471-9ebc-858358c7baef] AMQP server on rabbitmq.openstack.svc.cluster.local:5672 is unreachable: Basic.cancel: (0) 3. Trying again in 1 seconds.: ConsumerCancelled: Basic.cancel: (0) 3\n","stream":"stdout","time":"2019-07-19T05:17:45.428226962Z"}
{"log":"2019-07-19 05:17:45.598 1 ERROR oslo_db.sqlalchemy.engines [req-7c6bf1f1-4670-47df-8333-063e51a59285 - - - - -] Database connection was found disconnected; reconnecting: DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: u'SELECT 1'] (Background on this error at: http://sqlalche.me/e/e3q8)\n","stream":"stdout","time":"2019-07-19T05:17:45.6008754Z"}
{"log":"2019-07-19 05:17:45.598 1 ERROR oslo_db.sqlalchemy.engines Traceback (most recent call last):\n","stream":"stdout","time":"2019-07-19T05:17:45.60089745Z"}
{"log":"2019-07-19 05:17:45.598 1 ERROR oslo_db.sqlalchemy.engines File \"/var/lib/openstack/lib/python2.7/site-packages/oslo_db/sqlalchemy/engines.py\", line 73, in _connect_ping_listener\n","stream":"stdout","time":"2019-07-19T05:17:45.600901396Z"}
{"log":"2019-07-19 05:17:45.598 1 ERROR oslo_db.sqlalchemy.engines connection.scalar(select([1]))\n","stream":"stdout","time":"2019-07-19T05:17:45.600904434Z"}
{"log":"2019-07-19 05:17:45.598 1 ERROR oslo_db.sqlalchemy.engines File \"/var/lib/openstack/lib/python2.7/site-packages/sqlalchemy/engine/base.py\", line 912, in scalar\n","stream":"stdout","time":"2019-07-19T05:17:45.60090709Z"}
{"log":"2019-07-19 05:17:45.598 1 ERROR oslo_db.sqlalchemy.engines return self.execute(object_, *multiparams, **params).scalar()\n","stream":"stdout","time":"2019-07-19T05:17:45.600911245Z"}
{"log":"2019-07-19 05:17:45.598 1 ERROR oslo_db.sqlalchemy.engines File \"/var/lib/openstack/lib/python2.7/site-packages/sqlalchemy/engine/base.py\", line 980, in execute\n","stream":"stdout","time":"2019-07-19T05:17:45.600913757Z"}
{"log":"2019-07-19 05:17:45.598 1 ERROR oslo_db.sqlalchemy.engines return meth(self, multiparams, params)\n","stream":"stdout","time":"2019-07-19T05:17:45.600916528Z"}
{"log":"2019-07-19 05:17:45.598 1 ERROR oslo_db.sqlalchemy.engines File \"/var/lib/openstack/lib/python2.7/site-packages/sqlalchemy/sql/el...


Revision history for this message
zhipeng liu (zhipengs) wrote :

From the bug description, it seems this only fails on Standard with External Storage (virtual and bare metal, same errors).

From the log at http://paste.openstack.org/show/754600/, it looks like the image format is not right:

fault message: u'Build of instance ab3dfe11-3ac3-4a26-a1ce-3bff1e2ca78b aborted: Image 69e1593f-4d03-4cd5-9779-8ffc1d227da0 is unacceptable: Converted to raw, but format is now qcow2', u'code': 500, u'details': u' File "/var/lib/openstack/lib/python2.7/site-packages/nova/compute/manager.py", line 1984, in _do_build_and_run_instance\n filter_properties, request_spec)\n File "/var/lib/openstack/lib/python2.7/site-packages/nova/compute/manager.py", line 2344, in _build_and_run_instance\n reason=e.format_message())\n', u'created': u'2019-07-19T11:01:40Z'}

Not sure if it is related to another bug reported in this daily sanity test.
https://bugs.launchpad.net/starlingx/+bug/1837263
App platform-integ-apps not applied, Ceph HEALTH_WARN

Zhipeng

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Changing the importance to Critical given that this is causing a red sanity.

Changed in starlingx:
importance: High → Critical
Revision history for this message
yong hu (yhu6) wrote :

@zhipeng, before you create the VM instance, please check that both the platform and stx-openstack applications were correctly applied and that all the pods/jobs/services are in the correct state.
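
A minimal sketch of those checks (standard StarlingX and Kubernetes commands; the grep filter that hides healthy pods is an illustrative addition):

    # Confirm the application apply status, then list any pods that are not
    # Running or Completed.
    system application-list
    kubectl get pods --all-namespaces | grep -vE 'Running|Completed'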

Moreover this error message seems more relevant to this LP:
fault message: u'Build of instance ab3dfe11-3ac3-4a26-a1ce-3bff1e2ca78b aborted: Image 69e1593f-4d03-4cd5-9779-8ffc1d227da0 is unacceptable: Converted to raw, but format is now qcow2'

I would advise you to dig in this direction.
FYI: you can cross-check "openstack image create" against "glance image-create" and see if any differences come out:

glance image-create --name "centos-root-img" --visibility public --disk-format=qcow2 --container-format=bare --file cirros-0.4.0-x86_64-disk.qcow2 --progress

openstack image create --file /home/sysadmin/cirros-0.4.0-x86_64-disk.qcow2 --disk-format qcow2 --public cirros
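
As a side cross-check, it is also worth comparing what the image file actually is with what Glance has registered (a minimal sketch using the same file path and image name as above):

    # Compare the on-disk format of the source file with the disk_format
    # Glance recorded for the uploaded image; both should report qcow2.
    qemu-img info /home/sysadmin/cirros-0.4.0-x86_64-disk.qcow2 | grep 'file format'
    openstack image show cirros -f value -c disk_format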

Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

Hi Yong, Zhipeng,

Something that might help.

1. The image is healthy; it is the exact same image that we use to test on Simplex, Duplex, and Standard. We also verify its checksum before consuming it.

2. Please check this log: http://paste.openstack.org/show/754739/
     a) Creating an instance from an image fails.
     b) Creating an instance from a volume passes.
     c) Creating an instance from a volume snapshot fails.
     d) Creating an instance using a template fails.
   Is there some specific backend storage that instances created from volumes are using?

3. From the previous comment, and from debugging our logs, we found that the command "system host-lvg-modify -b remote compute-0 nova-local" no longer works. I couldn't find how to configure the remote LVG on compute nodes. Could this be the cause of the failure?

Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

Checking the logs further, it seems that the command "system host-lvg-modify" was updated since June 28, so it might not be related to this issue:

./standard_ext_provision_201906280523/debug.log:system: error: unrecognized arguments: -b nova-local

Revision history for this message
zhipeng liu (zhipengs) wrote :

In 20190719T013000Z/outputs/CHANGELOG.txt
I saw we merged
https://review.opendev.org/#/c/661512/
Use true for force_raw_images when using ceph image backend

We need this patch for two reasons:

1. Nova in StarlingX does not yet have patch [0]. We use remote storage (Ceph) as the Nova backend, and if we set force_raw_images to False and boot VMs with a qcow2 image, the VMs fail to boot.

2. Once Nova in StarlingX has patch [0], keeping force_raw_images set to False makes the nova-compute service refuse to start.

So either way, we must set force_raw_images to True.

[0] https://review.opendev.org/#/c/640271/

However, I double-checked our current stein.2 Nova: [0] is not included, and this is the problem!
We have two options:
1) Revert 661512 (the config change) to unblock the sanity test first.
2) Cherry-pick [0] into Nova stein.2.
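
For context, the setting interaction described above can be checked directly on a compute host. A minimal sketch, assuming shell access to the rendered nova.conf (e.g. inside the nova-compute container), which is not something shown in this report:

    # Show the two options whose interaction is being discussed. With the
    # Ceph (rbd) image backend, images_type = rbd combined with
    # force_raw_images = False is the combination behind the "Converted to
    # raw, but format is now qcow2" build abort quoted earlier.
    grep -E '^(force_raw_images|images_type)' /etc/nova/nova.conf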

Zhipeng

zhipeng liu (zhipengs)
Changed in starlingx:
status: Triaged → Confirmed
Revision history for this message
Boxiang Zhu (bxzhu-5355) wrote :

@zhipeng

First of all, the default value of force_raw_images in the Nova project is True.

In stx 1.0, we set it to False and then forced conversion of the image format to raw in rbd with some StarlingX-specific code (not in upstream Nova).

Now in stx 2.0, we use upstream Nova, so we no longer have that StarlingX-specific code to force conversion of the format to raw in rbd. So we must set force_raw_images to True; if we don't, VMs will fail to boot when using qcow2 images.

I know that, at the moment, neither the upstream Nova stein branch nor stx includes patch [0].

[0] https://review.opendev.org/#/c/640271/

Revision history for this message
zhipeng liu (zhipengs) wrote :

Hi all,

After discussing with Boxiang, it might not fix the issue if we just cherry-pick upstream patch 640271.
So I propose to revert the patch below to unblock the sanity test first.
https://review.opendev.org/#/c/661512/

If we want to merge this change again, we need to figure out whether the Nova side or other places need to change as well.

Thanks!
Zhipeng

Changed in starlingx:
status: Confirmed → In Progress
Revision history for this message
yong hu (yhu6) wrote :

Patch under review: https://review.opendev.org/#/c/672287/

Let's see the latest validation on the EB and the sanity test result.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/672287
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=916e716018dcefabb7f14e58db1ad38bc7d17f97
Submitter: Zuul
Branch: master

commit 916e716018dcefabb7f14e58db1ad38bc7d17f97
Author: zhipeng liu <email address hidden>
Date: Tue Jul 23 13:13:01 2019 +0000

    Revert "Use true for force_raw_images when using ceph image backend"

    This reverts commit 82906cd3c946f9689afe03b52ada16707f5b9a92.
    This patch blocked sanity test
    Create instance from Image or from Volume fails

    Closes-Bug: #1837241

    Change-Id: Ic1063f3520a20e962a80300942af09f2aa9c7292
    Signed-off-by: zhipengl <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Chris Friesen (cbf123) wrote :

Boxiang, is there anything in particular that you're worried about when saying that backporting https://review.opendev.org/#/c/640271 might not fix things?

As I recall, we merged https://review.opendev.org/#/c/661512/ because qcow2 images wouldn't work with ceph remote storage, so could someone explain how reverting it will help things?

Revision history for this message
Chris Friesen (cbf123) wrote :

Has anyone actually tried this experiment after backporting https://review.opendev.org/#/c/640271 ?

Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

We used the build that Zhipeng created to test it in the virtual environment; instances were created properly.

With the fix merged in CENGN build 20190724T013000Z, a full sanity was executed on both virtual and bare metal environments. Instances can be created without issues on both.

Thanks for fixing it.
