create instance with dashboard / 404 returned

Bug #1682423 reported by Bernhard on 2017-04-13
This bug affects 5 people
Affects: OpenStack Compute (nova)
Importance: Undecided
Assigned to: Unassigned

Bug Description

Description
===========
When launching an instance via the dashboard, a 404 code is returned ("Instance XX-XX-XX could not be found").
The same issue occurs when launching through the nova client, but there it has no effect.
In both cases the instance is created.
You can see a "HTTP exception thrown:" in the nova-api logfile.

Steps to reproduce
==================
A chronological list of steps that reproduce the issue:
* upgraded all nodes to CentOS 7.3
* no issues
* upgraded from Newton to Ocata (RDO)
* 404 issue

* launched an instance through the dashboard.
The instance was added, but a 404 was returned,
so an error is displayed in the dashboard interface (see create_instance_error.png).

For testing purposes I used curl.
see output.txt

* launched an instance using the nova client.
The same error occurs, but it has no effect.

see output.txt

Expected result
===============
A new instance is launched and the wizard closes.

Actual result
=============
I get a 404 code and the instance wizard is still open, although the instance was created and is running.

Environment
===========
openstack-nova-console-15.0.0-1.el7
python-nova-15.0.0-1.el7
openstack-nova-placement-api-15.0.0-1.el7
python2-novaclient-7.1.0-1.el7
openstack-nova-scheduler-15.0.0-1.el7
openstack-nova-conductor-15.0.0-1.el7
openstack-nova-novncproxy-15.0.0-1.el7
openstack-nova-api-15.0.0-1.el7
openstack-nova-common-15.0.0-1.el7
openstack-nova-cert-15.0.0-1.el7
openstack-dashboard-11.0.0-1.el7
python-networking-bgpvpn-dashboard-6.0.0-0.1.681a5e7git.el7

Logs & Configs
==============
see nova-api-logs.txt

Sean Dague (sdague) wrote :

Automatically discovered version ocata in description. If this is incorrect, please update the description to include 'nova version: ...'

tags: added: openstack-version.ocata
Sean Dague (sdague) wrote :

The nova message seems to be that the flavor was not found? Without debug level logs it's not really possible to get any further with this.

Changed in nova:
status: New → Incomplete
Dan Smith (danms) wrote :

How reproducible is this? I would think that tempest would catch this easily, since it does this pattern quite a bit.

MarkMielke (mark-mielke) wrote :

I hit the same thing when I upgraded to Ocata. I meant to update this issue after resolving it, but it slipped my mind...

I believe the issue was that not all of the compute nodes were successfully upgraded, so the messages are limited to an older version, and one of these failed. In my case, I went from Newton to Ocata, and upgraded all of the *active* compute nodes, but it turned out there were other *inactive* compute nodes still registered in the database, and these prevented the API from deciding that it was safe to use the newer messaging version.

I think there is a bug here, but I got past it by deleting the stale records from the database and restarting Nova. This allowed it to detect the latest version for all of the compute nodes and switch to the Ocata messaging version.
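Mark's scenario can be illustrated with a short sketch (hypothetical data and version numbers, not real nova internals): the API keys its behavior off the minimum version across every registered service record, so a single leftover row from a decommissioned node can pin the whole deployment at the old level.

```python
# Hypothetical illustration of the stale-record problem: the minimum
# version across all registered service records is what matters, so one
# leftover row from a replaced node holds everything back.

NEWTON = 16  # illustrative version numbers, not real nova constants
OCATA = 22

services = [
    {"host": "compute1", "version": OCATA},
    {"host": "compute2", "version": OCATA},
    {"host": "old-node", "version": NEWTON},  # replaced long ago, never deregistered
]

def minimum_version(records):
    """The lowest version any registered service reports."""
    return min(r["version"] for r in records)

def stale_hosts(records, required):
    """Hosts whose recorded version is below the required level."""
    return [r["host"] for r in records if r["version"] < required]

assert minimum_version(services) == NEWTON        # held back by the stale row
assert stale_hosts(services, OCATA) == ["old-node"]

# Deleting the stale row (as Mark did) lets the minimum catch up:
services = [r for r in services if r["host"] != "old-node"]
assert minimum_version(services) == OCATA
```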

Sven Michels (smichels) wrote :

Hi,

we're currently facing this issue. We have a fresh new Ocata installation and get the same error
when trying to start an instance. The instance starts fine; the only problem is that it reports
"instance not found" (404) when creating it. But this is not only cosmetic in our case: it breaks
Tempest tests, for example. If someone is willing to dig into it, I'm happy to provide whatever I
can, because I'm currently out of ideas for how to fix this. We checked our RabbitMQ cluster first,
as we thought this might be related, but the cluster looks good, no issues so far.

Cheers,
Sven

MarkMielke (mark-mielke) wrote :

Hi Sven:

See my explanation. While there probably is a bug, you probably don't want to wait for the fix. If you have the same problem as I did, which seems likely, then you need to confirm that *all* of your services are upgraded to Ocata version, and reporting in. You may have to check your database to see which services are not updated yet, and you may find they are machines that you are no longer using, as I did.

Sven Michels (smichels) wrote :

Hey Mark,

but we have a new installation... so this should not apply
in our case. Did you find anything else related to it?

Cheers and thanks,
Sven

Bernhard (b-krieger) wrote :

When adding an instance via the CLI, a 404 is returned too,
but there the error has no effect, unlike when the instance is added through the dashboard.

To work around the error I added a sleep at line 2360 in /usr/lib/python2.7/site-packages/nova/compute/api.py to delay the fetch request:

time.sleep(1.0)
try:
    if uuidutils.is_uuid_like(instance_id):
        LOG.debug("Fetching instance by UUID",
                  instance_uuid=instance_id)

I attached the debug output and the nova-api.log.

Sven Michels (smichels) wrote :

This didn't fix it for me, unfortunately :(

We're now at the latest available ocata packages, but the issue is still there.

MarkMielke (mark-mielke) wrote :

In my case, it was due to the messaging protocol mismatch. As soon as I
ensured that the API server recognized all agents at the latest level, the
problem cleared. The symptoms seem similar, so I expect even though this is
a new installation, you may still be hitting this somehow. For example...
have you pinned the messaging version in the conf files?

For me, it was a lot of hours of tracking down exactly what was going on. I
eventually traced it into the database where I found rows that recorded the
nodes at older messaging versions, and they were machines I didn't even
have online as they had been replaced long ago. Clearing those out was all
that it took...

--
Mark Mielke <email address hidden>

Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired
Kim-Norman Sahm (kisahm) wrote :

I've tried Bernhard's workaround and it's working.

--- api.py 2017-10-10 14:40:14.384511170 +0200
+++ api.py.patched 2017-10-10 14:40:46.192451521 +0200
@@ -83,6 +83,7 @@
 from nova import utils
 from nova.virt import hardware
 from nova.volume import cinder
+from time import sleep

 LOG = logging.getLogger(__name__)

@@ -2354,6 +2355,7 @@
         expected_attrs.extend(['metadata', 'system_metadata',
                                'security_groups', 'info_cache'])
         # NOTE(ameade): we still need to support integer ids for ec2
+        sleep(1.0)
         try:
             if uuidutils.is_uuid_like(instance_id):
                 LOG.debug("Fetching instance by UUID",

Changed in nova:
status: Expired → Confirmed
Matt Riedemann (mriedem) wrote :

There is some confusion here between different issues and what exactly is not working.

The sleep workaround that Bernhard is using is totally unrelated to anything with service versions or message version pins in the services that Mark is talking about.

What Bernhard (and others) are likely hitting with the 404 immediately after the server is created, is that in Ocata we now do this when creating a server:

1. The API service creates a build request (in the nova_api.build_requests table) which is temporary until the instance is created in a cell database.
2. The API casts to conductor to continue building the instance.
3. The conductor service asks the scheduler for a host on which to build the instance, and that host is mapped to a cell via the nova_api.host_mappings table.
4. Once a host is chosen, conductor creates the instance in the cell database and deletes the build request record in the API database.

While this is happening, if requests are coming into the API to get the instance, the API code is going to look up the instance in the cell database via its nova_api.instance_mappings record. If the instance is not mapped to a cell yet, the API will look up the build request record and use that to return details about the instance while it is building.

There should be no window of time where neither exists and you get a 404.
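The lookup flow described above can be sketched as follows (hypothetical in-memory stores and helper names, not the real nova API code): while a server builds, a GET should find it either in its cell database or via the temporary build request.

```python
# Sketch of the API-side lookup during a build: find the instance in its
# cell (if already mapped) or fall back to the temporary build request.
# Structures and names are hypothetical, for illustration only.

class NotFound(Exception):
    pass

def get_instance(instance_id, instance_mappings, build_requests, cells):
    mapping = instance_mappings.get(instance_id)
    if mapping is None:
        raise NotFound(instance_id)      # no record at all -> a real 404
    cell = mapping.get("cell")
    if cell is not None:
        # Instance has been created in a cell; read it from there.
        return cells[cell][instance_id]
    # Not mapped to a cell yet: fall back to the build request, which
    # conductor deletes only after the cell record exists.
    build_req = build_requests.get(instance_id)
    if build_req is None:
        # The race window the 404 reports suggest: neither record is
        # visible, which the intended ordering should make impossible.
        raise NotFound(instance_id)
    return build_req

# While building: only the build request exists.
mappings = {"abc": {"cell": None}}
reqs = {"abc": {"id": "abc", "status": "BUILD"}}
assert get_instance("abc", mappings, reqs, {})["status"] == "BUILD"

# After conductor finishes: the cell record answers instead.
mappings["abc"]["cell"] = "cell1"
cells = {"cell1": {"abc": {"id": "abc", "status": "ACTIVE"}}}
assert get_instance("abc", mappings, {}, cells)["status"] == "ACTIVE"
```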

I do remember there being some intermittent Tempest failures because of a window of time where the build request was gone but the instance wasn't mapped to the cell, but we fixed that and it should already be in Ocata.

What version of Ocata are people experiencing this issue with? If you're still at 15.0.0, then you need to upgrade the controller services to make sure you have the latest fixes.

Matt Riedemann (mriedem) wrote :

What Mark was hitting was maybe this check for a minimum service version:

https://github.com/openstack/nova/blob/15.0.0/nova/compute/api.py#L2263

But that's a minimum service version check on the nova-api service, not the nova-compute services. So I'm not sure why stale dead compute services would have any influence there.

Matt Riedemann (mriedem) wrote :

https://review.openstack.org/#/c/427782/ was a change that made it into 15.0.0 Ocata GA.

However, looking at the code:

- we create an instance mapping record during instance create:

https://github.com/openstack/nova/blob/15.0.0/nova/compute/api.py#L1006

- assuming you make it this far (and your nova-api service version is >= 15):

https://github.com/openstack/nova/blob/15.0.0/nova/compute/api.py#L2271

We get the instance mapping for the instance, created above. The instance might not be in a cell at this point though.

- if the instance is not yet in a cell, we get the representation from the build request:

https://github.com/openstack/nova/blob/15.0.0/nova/compute/api.py#L2281

--

The only other thing I remember affecting that nova-osapi_compute service version check in Ocata was people running nova-api under uwsgi or mod_wsgi, as the API service version isn't set in that case. But running the API under uwsgi wasn't supported in Ocata anyway, so can people confirm that is not how they are running the API?

Matt Riedemann (mriedem) wrote :

The workaround with the 1 second sleep seems to suggest there is a race window where the instance is not mapped to a cell yet but the build request is also gone, which shouldn't happen as we create the instance in the cell before deleting the build request:

https://github.com/openstack/nova/blob/15.0.0/nova/conductor/manager.py#L914

https://github.com/openstack/nova/blob/15.0.0/nova/conductor/manager.py#L937

So while it's possible for the instance to be created in the cell database but the instance mapping record isn't yet updated, we account for that when looking up the instance here:

https://github.com/openstack/nova/blob/15.0.0/nova/compute/api.py#L2276

and here:

https://github.com/openstack/nova/blob/15.0.0/nova/compute/api.py#L2289
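The ordering guarantee Matt points to can be sketched like this (hypothetical in-memory stores, not the real conductor code): the cell record and the instance mapping are written before the build request is removed, so a concurrent GET should always find one of the two.

```python
# Hypothetical sketch of the conductor-side ordering: write the instance
# into the cell and update its mapping BEFORE deleting the build request,
# so a concurrent API lookup never finds neither record.

def schedule_and_build(instance_id, cell, cells, instance_mappings, build_requests):
    # 1. Create the instance in the chosen cell database.
    cells.setdefault(cell, {})[instance_id] = {"id": instance_id, "status": "BUILD"}
    # 2. Point the instance mapping at that cell.
    instance_mappings[instance_id] = {"cell": cell}
    # 3. Only now delete the temporary build request.
    build_requests.pop(instance_id, None)

cells, mappings = {}, {"abc": {"cell": None}}
reqs = {"abc": {"id": "abc", "status": "BUILD"}}
schedule_and_build("abc", "cell1", cells, mappings, reqs)
assert "abc" in cells["cell1"]            # instance now lives in the cell
assert mappings["abc"]["cell"] == "cell1"
assert "abc" not in reqs                  # build request removed last
```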

tags: added: api cells
Matt Riedemann (mriedem) on 2017-10-11
Changed in nova:
status: Confirmed → Incomplete
Matt Riedemann (mriedem) wrote :

This sounds like a duplicate of bug 1661360.

Reviewed: https://review.openstack.org/511503
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=4d7acf3a0caa111f407ea7aac5dde95ff6ca49b6
Submitter: Zuul
Branch: stable/ocata

commit 4d7acf3a0caa111f407ea7aac5dde95ff6ca49b6
Author: Matt Riedemann <email address hidden>
Date: Thu Oct 12 10:28:11 2017 -0400

    Add release note for running nova-api under wsgi in Ocata

    Commit e846c32ce3c3aee2cd83dec7561dc14cb3f0ada8 added a warning
    about running nova-api under wsgi in Ocata but that's not very
    discoverable for people that are upgrading and already running
    in this mode.

    This change adds a release note to make that more obvious and
    link to the known bug.

    Related-Bug: #1661360
    Related-Bug: #1682423

    Change-Id: Id9034f795de1b55ac8190feaa37a370e9afd2d8d

tags: added: in-stable-ocata