Polling for server metadata causes engine to log warnings

Bug #1507568 reported by Steven Hardy
16
This bug affects 3 people
Affects Status Importance Assigned to Milestone
OpenStack Heat
Fix Released
High
Zane Bitter
Kilo
Fix Released
Undecided
Unassigned

Bug Description

When os-collect-config is configured to poll heat's API for server metadata, we see warnings at regular intervals in the engine log like this:

2015-10-19 13:22:34.380 WARNING heat.engine.clients.os.nova [req-2c4da8b3-f441-43c3-bb59-bbe1a6665a6f None 2240eae5cccc439298daa101fc8557bd-96361e11-11f8-45d1-a89b-732401f] Instance (099bb7bb-f22d-4739-a077-4776a09bbda9) not found: Instance 099bb7bb-f22d-4739-a077-4776a09bbda9 could not be found. (HTTP 404) (Request-ID: req-3118e9c7-f5f1-497a-9cb1-fc12da67dd66)

I think this is because the polling identity is a stack domain user, thus can't find the Instance created by the stack while evaluating a custom constraint while we instantiate the stack object while servicing the request.

This is a bit distracting as you get a lot of these in large deployments with lots of instances polling heat - we should find a way to either fix the problem (conditionally evaluate custom constraints only on create/update/preview/validate) or suppress the log warning (or, at least, reduce it to DEBUG).

Revision history for this message
Marcin Zbik (zbikmarc+launchpad) wrote :

It looks that heat engine asks about VM created in heat created project not in project of user who deployed stack.

Revision history for this message
Steven Hardy (shardy) wrote :

> It looks that heat engine asks about VM created in heat created project not in project of user who deployed stack.

Yes, which is expected when polling from the VM using stack domain user credentials

http://hardysteven.blogspot.co.uk/2014/04/heat-auth-model-updates-part-2-stack.html

The credentials deployed inside instances have permission to access a subset of the Heat API surface, but no other services or anything in the project of the user who deployed the stack, which is by design.

So we need to handle that case in a non-failing way and not log errors when a request comes from the project associated with the stack domain users (which is known per stack).

Revision history for this message
Steven Hardy (shardy) wrote :

So, here's what I think needs to happen, I just haven't had time to do it yet:

https://github.com/openstack/heat/blob/master/heat/engine/stack.py#L164

Whenever we handle a request where the request context project ID matches the stack stack_user_project_id, we need to disable evaluation of custom constraints (which I think is where the error is coming from above).

Revision history for this message
Zane Bitter (zaneb) wrote :

I saw this issue as well. The RPC API call used when we poll stack metadata is describe_stack_resource, which is overkill for what we need and returns a bunch of extra information that is thrown away by the ReST API, which was a reasonable trade-off at the time.

However, thanks to http://git.openstack.org/cgit/openstack/heat/commit/?id=6d8a5cb35cec81b23a71e8c6fa692f8e7b4dca1c amongst that information we now return all the attributes of the resource, which means that attempting to read the first_address attribute is inevitable, and hence we get these warnings in the log.

I suspect that it is this, and not any custom constraints, that is the source of the problem (the log message above says "Instance (...) not found", and the only log message that says "Instance" is in server_to_ipaddress(), which is only called when attempting to read the first_address attribute). However, it's entirely possible that custom constraints could also cause similar log messages.

Zane Bitter (zaneb)
Changed in heat:
assignee: nobody → Zane Bitter (zaneb)
status: New → In Progress
importance: Undecided → Critical
importance: Critical → High
Revision history for this message
Steven Hardy (shardy) wrote :

One possible related solution would be to change the default transport to a native (non CFN) one, I'll start a ML thread to discuss it, but looking at the o-c-c code, it seems we use the resource metadata call which shouldn't suffer from this issue:

https://github.com/openstack/os-collect-config/blob/master/os_collect_config/heat.py#L83

It's always seemed pretty wrong that we rely on the CFN API for native server/deployment resources, so this could be an opportunity to fix that, and even deprecate ye-olde CFN transport.

Revision history for this message
Steven Hardy (shardy) wrote :

I started an openstack-dev thread regarding changing the defaults for polling/signalling:

http://lists.openstack.org/pipermail/openstack-dev/2016-January/084693.html

Revision history for this message
Zane Bitter (zaneb) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/271431

Zane Bitter (zaneb)
tags: added: kilo-backport-potential liberty-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/272352

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (stable/kilo)

Fix proposed to branch: stable/kilo
Review: https://review.openstack.org/272353

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/271431
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=f0fa7312a024266cf6a51638a9ea772128bc3b89
Submitter: Jenkins
Branch: master

commit f0fa7312a024266cf6a51638a9ea772128bc3b89
Author: Zane Bitter <email address hidden>
Date: Fri Jan 22 12:10:40 2016 -0500

    Don't calculate attributes for metadata request

    The RPC API call describe_stack_resource is used by two separate ReST API
    calls and one CFN API call: Resource.show and Resource.metadata in heat-api
    and StackController.describe_stack_resource. The metadata call in
    particular throws away all of the results of the describe_stack_resource
    output except for the metadata itself. Either this call or the cfn one
    (depending on configuration) is also used very, very often: it's the one
    that every instance running os-collect-config polls for information about
    software deployments (so, to a first approximation, every server in every
    stack will be calling this once every 30s).

    Since blueprint detailed-resource-show merged in
    6d8a5cb35cec81b23a71e8c6fa692f8e7b4dca1c, every call to the RPC
    describe_stack_resource API has resulted in Heat fetching the attribute
    values for every single attribute in the resource from its corresponding
    OpenStack API (including those calls originating from a request for just
    the metadata, or from the cfn API which similarly discards the attribute
    values).

    This is insanely inefficient (we're making OpenStack API calls to fetch
    data that we don't even return to the user). It's quite possibly a
    substantial part of the reason why we've had so much difficulty scaling up
    Heat to deal with large numbers of servers polling it for config. And it
    tends to leave a lot of annoying warnings in the log messages, not just of
    Heat but also the other services it is polling, since polling the metadata
    generally happens using the heat_stack_user's credentials, not those of the
    actual owner of the resource - it can't actually find the resource and
    returns a 404.

    The fix is to change the default "with_attr" value in the RPC client to
    False, and only pass None (the previous default) or a list of extra
    attributes to include in the specific case where the ReST API will actually
    return the attribute values to the user and not just discard them. Since
    the server previously treated any falsey value for the parameter (e.g.
    None, False or []) as equivalent, and since nothing was previously passing
    False, this change is safe to backport to stable branches. The api and
    engine can be updated in any order without change in behaviour until a new
    api is talking to a new server (at which point the bug is fixed).

    This change also updates the ReST API unit tests to accurately reflect the
    data returned from the engine.

    Change-Id: Ifffeaa552d3205ca06a79adda09d35a77099a2bf
    Closes-Bug: #1507568

Changed in heat:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (stable/kilo)

Reviewed: https://review.openstack.org/272353
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=24228cbc0620d05a7d057edd35d894561eab514d
Submitter: Jenkins
Branch: stable/kilo

commit 24228cbc0620d05a7d057edd35d894561eab514d
Author: Zane Bitter <email address hidden>
Date: Fri Jan 22 12:10:40 2016 -0500

    Don't calculate attributes for metadata request

    The RPC API call describe_stack_resource is used by two separate ReST API
    calls and one CFN API call: Resource.show and Resource.metadata in heat-api
    and StackController.describe_stack_resource. The metadata call in
    particular throws away all of the results of the describe_stack_resource
    output except for the metadata itself. Either this call or the cfn one
    (depending on configuration) is also used very, very often: it's the one
    that every instance running os-collect-config polls for information about
    software deployments (so, to a first approximation, every server in every
    stack will be calling this once every 30s).

    Since blueprint detailed-resource-show merged in
    6d8a5cb35cec81b23a71e8c6fa692f8e7b4dca1c, every call to the RPC
    describe_stack_resource API has resulted in Heat fetching the attribute
    values for every single attribute in the resource from its corresponding
    OpenStack API (including those calls originating from a request for just
    the metadata, or from the cfn API which similarly discards the attribute
    values).

    This is insanely inefficient (we're making OpenStack API calls to fetch
    data that we don't even return to the user). It's quite possibly a
    substantial part of the reason why we've had so much difficulty scaling up
    Heat to deal with large numbers of servers polling it for config. And it
    tends to leave a lot of annoying warnings in the log messages, not just of
    Heat but also the other services it is polling, since polling the metadata
    generally happens using the heat_stack_user's credentials, not those of the
    actual owner of the resource - it can't actually find the resource and
    returns a 404.

    The fix is to change the default "with_attr" value in the RPC client to
    False, and only pass None (the previous default) or a list of extra
    attributes to include in the specific case where the ReST API will actually
    return the attribute values to the user and not just discard them. Since
    the server previously treated any falsey value for the parameter (e.g.
    None, False or []) as equivalent, and since nothing was previously passing
    False, this change is safe to backport to stable branches. The api and
    engine can be updated in any order without change in behaviour until a new
    api is talking to a new server (at which point the bug is fixed).

    This change also updates the ReST API unit tests to accurately reflect the
    data returned from the engine.

    Change-Id: Ifffeaa552d3205ca06a79adda09d35a77099a2bf
    Closes-Bug: #1507568
    (cherry picked from commit f0fa7312a024266cf6a51638a9ea772128bc3b89)

tags: added: in-stable-kilo
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (stable/liberty)

Reviewed: https://review.openstack.org/272352
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=a3bc0962a970bf6554158fc19203c0830a81fb69
Submitter: Jenkins
Branch: stable/liberty

commit a3bc0962a970bf6554158fc19203c0830a81fb69
Author: Zane Bitter <email address hidden>
Date: Fri Jan 22 12:10:40 2016 -0500

    Don't calculate attributes for metadata request

    The RPC API call describe_stack_resource is used by two separate ReST API
    calls and one CFN API call: Resource.show and Resource.metadata in heat-api
    and StackController.describe_stack_resource. The metadata call in
    particular throws away all of the results of the describe_stack_resource
    output except for the metadata itself. Either this call or the cfn one
    (depending on configuration) is also used very, very often: it's the one
    that every instance running os-collect-config polls for information about
    software deployments (so, to a first approximation, every server in every
    stack will be calling this once every 30s).

    Since blueprint detailed-resource-show merged in
    6d8a5cb35cec81b23a71e8c6fa692f8e7b4dca1c, every call to the RPC
    describe_stack_resource API has resulted in Heat fetching the attribute
    values for every single attribute in the resource from its corresponding
    OpenStack API (including those calls originating from a request for just
    the metadata, or from the cfn API which similarly discards the attribute
    values).

    This is insanely inefficient (we're making OpenStack API calls to fetch
    data that we don't even return to the user). It's quite possibly a
    substantial part of the reason why we've had so much difficulty scaling up
    Heat to deal with large numbers of servers polling it for config. And it
    tends to leave a lot of annoying warnings in the log messages, not just of
    Heat but also the other services it is polling, since polling the metadata
    generally happens using the heat_stack_user's credentials, not those of the
    actual owner of the resource - it can't actually find the resource and
    returns a 404.

    The fix is to change the default "with_attr" value in the RPC client to
    False, and only pass None (the previous default) or a list of extra
    attributes to include in the specific case where the ReST API will actually
    return the attribute values to the user and not just discard them. Since
    the server previously treated any falsey value for the parameter (e.g.
    None, False or []) as equivalent, and since nothing was previously passing
    False, this change is safe to backport to stable branches. The api and
    engine can be updated in any order without change in behaviour until a new
    api is talking to a new server (at which point the bug is fixed).

    This change also updates the ReST API unit tests to accurately reflect the
    data returned from the engine.

    Change-Id: Ifffeaa552d3205ca06a79adda09d35a77099a2bf
    Closes-Bug: #1507568
    (cherry picked from commit f0fa7312a024266cf6a51638a9ea772128bc3b89)

tags: added: in-stable-liberty
Revision history for this message
Thierry Carrez (ttx) wrote : Fix included in openstack/heat 6.0.0.0b3

This issue was fixed in the openstack/heat 6.0.0.0b3 development milestone.

Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/heat 2015.1.4

This issue was fixed in the openstack/heat 2015.1.4 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/heat 5.0.2

This issue was fixed in the openstack/heat 5.0.2 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/heat 2015.1.4

This issue was fixed in the openstack/heat 2015.1.4 release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.