metadata service occasionally not returning keys

Bug #1668958 reported by Kevin Benton
This bug affects 10 people
Affects: neutron
Status: Fix Released
Importance: Critical
Assigned to: Kevin Benton

Bug Description

Occasionally we are getting failures like this in the Linux Bridge job in Neutron:

2017-02-28 11:50:22,162 12602 WARNING [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cirros@172.24.5.17 (Error reading SSH protocol banner). Number attempts: 16. Retry after 17 seconds.

I traced it down to the VM not getting keys back from the metadata service even though it has a keypair configured. The request makes it over to Nova metadata with the relevant instance ID but no keys are being returned.

Kevin Benton (kevinbenton) wrote :
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/439560

Kevin Benton (kevinbenton) wrote :

This should be reverted once the bug is fixed to avoid too much log noise: https://review.openstack.org/#/c/439560/

Dr. Jens Harbott (j-harbott) wrote :

I think I found the root cause somewhere else:

If you look at e.g. http://logs.openstack.org/68/439168/2/check/gate-tempest-dsvm-neutron-full-ssh/007c455/console.html#_2017-03-01_15_34_09_762760 there are five connection failures with "Auth failed" before the "connection resets" start.

It seems to me that paramiko does not properly close the transport connection after an auth failure, and the dropbear sshd on the cirros side does indeed define MAX_UNAUTH_PER_IP=5.

So https://review.openstack.org/439638 should fix this issue, which should only occur if the cirros instance doesn't finish adding the SSH key before the 5 connections pile up.
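
The pile-up described above can be sketched as follows. This is only an illustration, not paramiko's or tempest's actual code: `connect_with_retries`, `make_client`, and `AuthFailed` are hypothetical names, and the client is assumed to expose nothing more than `connect()` and `close()`. The idea is that every failed attempt is closed before the next one starts, so unauthenticated connections cannot accumulate toward a per-IP limit such as dropbear's MAX_UNAUTH_PER_IP=5.

```python
import time


class AuthFailed(Exception):
    """Hypothetical stand-in for an SSH authentication error."""


def connect_with_retries(make_client, attempts=16, delay=0.0):
    """Return a connected client, closing every client whose auth failed.

    make_client is a hypothetical factory returning an object with
    connect() and close() methods.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        client = make_client()
        try:
            client.connect()
            return client
        except AuthFailed as exc:
            last_error = exc
            # Release the connection so the server stops counting it
            # as an open unauthenticated session.
            client.close()
            time.sleep(delay)
    raise last_error
```

With this shape, at most one unauthenticated connection is open at any time, regardless of how many retries are needed before the key is installed.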

Kevin Benton (kevinbenton) wrote :

That's not quite the issue. If you look at the console output from the cirros image, it says that it fails to get any SSH keys from the metadata service.

OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/439560
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=99989cb6526fb05bd47941dcaa1e6d95dfedbb20
Submitter: Jenkins
Branch: master

commit 99989cb6526fb05bd47941dcaa1e6d95dfedbb20
Author: Kevin Benton <email address hidden>
Date: Wed Mar 1 04:12:47 2017 -0800

    Add some metadata logging to root cause ssh failure

    Instances aren't getting SSH keys in some circumstances.
    This adds a few debug statements to see if the Nova metadata
    service thinks an instance has keys.

    Related-Bug: #1668958
    Change-Id: I3095d3024b62d5ebb4ea9e388a40ed94ef8f832b

no longer affects: nova
Changed in neutron:
status: New → In Progress
assignee: nobody → Kevin Benton (kevinbenton)
importance: Undecided → Critical
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/441346

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/441353

tags: added: gate-failure ocata-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/441346
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=ff3132d8d455012b2b29f1eb65817f8492f84fe9
Submitter: Jenkins
Branch: master

commit ff3132d8d455012b2b29f1eb65817f8492f84fe9
Author: Kevin Benton <email address hidden>
Date: Fri Mar 3 10:57:57 2017 -0800

    Stop killing conntrack state without CT Zone

    The conntrack clearing code was belligerently killing connections
    without a conntrack zone specifier when it couldn't get the zone
    for a given device. This means it would kill all connections based
    on an IP address match, which meant hitting innocent bystanders
    in other tenant networks with overlapping IP addresses.

    This bad fallback was being triggered every time because it was
    using the wrong identifier for a port to look up the zone.

    This patch fixes the port lookup and adjusts the fallback behavior
    to never clear conntrack entries if we can't find the conntrack
    zone for a port.

    This triggered the bug below (in the cases I root-caused) by
    killing a metadata connection right in the middle of retrieving
    a key.

    Closes-Bug: #1668958
    Change-Id: Ia4ee9b3305e89c958ac927980d80119c53ea519b
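
The fallback change described in this commit message can be sketched roughly as below. This is not the Neutron implementation: `build_conntrack_delete_cmd` and `zones_by_port` are hypothetical names, and the command layout assumes the `conntrack` CLI options `-D` (delete), `-f` (address family), `-d` (destination address), and `-w` (zone). The sketch returns no command at all when the zone lookup fails, instead of falling back to an IP-only match.

```python
def build_conntrack_delete_cmd(zones_by_port, port_id, remote_ip,
                               ethertype="ipv4"):
    """Return a zone-scoped conntrack delete command, or None.

    zones_by_port is a hypothetical mapping of port ID to conntrack zone.
    """
    zone = zones_by_port.get(port_id)
    if zone is None:
        # Unknown zone: do nothing. An IP-only delete would also remove
        # entries belonging to other tenant networks that reuse the
        # same address.
        return None
    return ["conntrack", "-D", "-f", ethertype,
            "-d", remote_ip, "-w", str(zone)]
```

Looking up the zone with the wrong identifier (for example a device name where a port ID is expected) takes the `None` branch here, which is harmless, whereas with the old fallback it led to an unscoped delete.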

Changed in neutron:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/441546

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ocata)

Reviewed: https://review.openstack.org/441546
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=5a0700ee9f1c2fc7d651003b4ede8d850199c28b
Submitter: Jenkins
Branch: stable/ocata

commit 5a0700ee9f1c2fc7d651003b4ede8d850199c28b
Author: Kevin Benton <email address hidden>
Date: Fri Mar 3 10:57:57 2017 -0800

    Stop killing conntrack state without CT Zone

    The conntrack clearing code was belligerently killing connections
    without a conntrack zone specifier when it couldn't get the zone
    for a given device. This means it would kill all connections based
    on an IP address match, which meant hitting innocent bystanders
    in other tenant networks with overlapping IP addresses.

    This bad fallback was being triggered every time because it was
    using the wrong identifier for a port to look up the zone.

    This patch fixes the port lookup and adjusts the fallback behavior
    to never clear conntrack entries if we can't find the conntrack
    zone for a port.

    This triggered the bug below (in the cases I root-caused) by
    killing a metadata connection right in the middle of retrieving
    a key.

    Closes-Bug: #1668958
    Change-Id: Ia4ee9b3305e89c958ac927980d80119c53ea519b
    (cherry picked from commit ff3132d8d455012b2b29f1eb65817f8492f84fe9)

tags: added: in-stable-ocata
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Jens Rosenboom (<email address hidden>) on branch: master
Review: https://review.openstack.org/441023
Reason: Not needed anymore

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/441353
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=c76164c058a0cfeee3eb46b523a9ad012f93dd51
Submitter: Jenkins
Branch: master

commit c76164c058a0cfeee3eb46b523a9ad012f93dd51
Author: Kevin Benton <email address hidden>
Date: Fri Mar 3 11:18:28 2017 -0800

    Move conntrack zones to IPTablesFirewall

    The regular IPTablesFirewall needs zones to support safely
    clearing conntrack entries.

    In order to support the single bridge use case, the conntrack
    manager had to be refactored slightly to allow zones to be
    either unique to ports or unique to networks.

    Since all ports in a network share a bridge in the IPTablesDriver
    use case, a zone per port cannot be used since there is no way
    to distinguish which zone traffic should be checked against when
    traffic enters the bridge from outside the system.

    A zone per network is adequate for the single bridge per network
    solution since it implicitly does not suffer from the double-bridge
    cross in a single network that led to per port usage in OVS.[1]

    This had to adjust the functional firewall tests to use the correct
    bridge name now that it's relevant in the non hybrid IPTables case.

    1. Ibe9e49653b2a280ea72cb95c2da64cd94c7739da

    Closes-Bug: #1668958
    Closes-Bug: #1657260
    Change-Id: Ie88237d3fe4807b712a7ec61eb932748c38952cc
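
The per-port versus per-network zone choice described in this commit message can be illustrated with a small sketch. It is not the refactored conntrack manager itself: `ZoneAllocator`, `zone_for`, and the port dictionaries are hypothetical, and real zone IDs would come from a bounded range rather than a simple counter.

```python
import itertools


class ZoneAllocator:
    """Hand out conntrack zone IDs keyed per port or per network."""

    def __init__(self, per_network):
        self.per_network = per_network
        self._zones = {}
        self._next_id = itertools.count(1)

    def zone_for(self, port):
        # port is a hypothetical dict with "id" and "network_id" keys.
        key = port["network_id"] if self.per_network else port["id"]
        if key not in self._zones:
            self._zones[key] = next(self._next_id)
        return self._zones[key]
```

In per-network mode, every port on the same network resolves to the same zone, which fits the single-bridge-per-network case where traffic entering the bridge cannot be attributed to one specific port.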

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/455399

tags: added: neutron-proactive-backport-potential
tags: added: neutron-easy-proactive-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 10.0.1

This issue was fixed in the openstack/neutron 10.0.1 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 11.0.0.0b1

This issue was fixed in the openstack/neutron 11.0.0.0b1 development milestone.

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ocata)

Reviewed: https://review.openstack.org/455399
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=9a920fe0a561d36db95e27ac5673a9dba4d845d3
Submitter: Jenkins
Branch: stable/ocata

commit 9a920fe0a561d36db95e27ac5673a9dba4d845d3
Author: Kevin Benton <email address hidden>
Date: Fri Mar 3 11:18:28 2017 -0800

    Move conntrack zones to IPTablesFirewall

    The regular IPTablesFirewall needs zones to support safely
    clearing conntrack entries.

    In order to support the single bridge use case, the conntrack
    manager had to be refactored slightly to allow zones to be
    either unique to ports or unique to networks.

    Since all ports in a network share a bridge in the IPTablesDriver
    use case, a zone per port cannot be used since there is no way
    to distinguish which zone traffic should be checked against when
    traffic enters the bridge from outside the system.

    A zone per network is adequate for the single bridge per network
    solution since it implicitly does not suffer from the double-bridge
    cross in a single network that led to per port usage in OVS.[1]

    This had to adjust the functional firewall tests to use the correct
    bridge name now that it's relevant in the non hybrid IPTables case.

    1. Ibe9e49653b2a280ea72cb95c2da64cd94c7739da

    Closes-Bug: #1668958
    Closes-Bug: #1657260
    Change-Id: Ie88237d3fe4807b712a7ec61eb932748c38952cc
    (cherry picked from commit c76164c058a0cfeee3eb46b523a9ad012f93dd51)

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/460905

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/460906

tags: removed: neutron-easy-proactive-backport-potential neutron-proactive-backport-potential ocata-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/newton)

Reviewed: https://review.openstack.org/460905
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=4f01368001304d0c42265a6498e43c93b5aec49b
Submitter: Jenkins
Branch: stable/newton

commit 4f01368001304d0c42265a6498e43c93b5aec49b
Author: Kevin Benton <email address hidden>
Date: Fri Mar 3 10:57:57 2017 -0800

    Stop killing conntrack state without CT Zone

    The conntrack clearing code was belligerently killing connections
    without a conntrack zone specifier when it couldn't get the zone
    for a given device. This means it would kill all connections based
    on an IP address match, which meant hitting innocent bystanders
    in other tenant networks with overlapping IP addresses.

    This bad fallback was being triggered every time because it was
    using the wrong identifier for a port to look up the zone.

    This patch fixes the port lookup and adjusts the fallback behavior
    to never clear conntrack entries if we can't find the conntrack
    zone for a port.

    This triggered the bug below (in the cases I root-caused) by
    killing a metadata connection right in the middle of retrieving
    a key.

    Closes-Bug: #1668958
    Change-Id: Ia4ee9b3305e89c958ac927980d80119c53ea519b
    (cherry picked from commit ff3132d8d455012b2b29f1eb65817f8492f84fe9)
    (cherry picked from commit 5a0700ee9f1c2fc7d651003b4ede8d850199c28b)

tags: added: in-stable-newton
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/460906
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=f142cde767c9ff1d9f787048bb4754b95aea8e84
Submitter: Jenkins
Branch: stable/newton

commit f142cde767c9ff1d9f787048bb4754b95aea8e84
Author: Kevin Benton <email address hidden>
Date: Fri Mar 3 11:18:28 2017 -0800

    Move conntrack zones to IPTablesFirewall

    The regular IPTablesFirewall needs zones to support safely
    clearing conntrack entries.

    In order to support the single bridge use case, the conntrack
    manager had to be refactored slightly to allow zones to be
    either unique to ports or unique to networks.

    Since all ports in a network share a bridge in the IPTablesDriver
    use case, a zone per port cannot be used since there is no way
    to distinguish which zone traffic should be checked against when
    traffic enters the bridge from outside the system.

    A zone per network is adequate for the single bridge per network
    solution since it implicitly does not suffer from the double-bridge
    cross in a single network that led to per port usage in OVS.[1]

    This had to adjust the functional firewall tests to use the correct
    bridge name now that it's relevant in the non hybrid IPTables case.

    1. Ibe9e49653b2a280ea72cb95c2da64cd94c7739da

    Closes-Bug: #1668958
    Closes-Bug: #1657260
    Change-Id: Ie88237d3fe4807b712a7ec61eb932748c38952cc
    (cherry picked from commit c76164c058a0cfeee3eb46b523a9ad012f93dd51)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 9.4.0

This issue was fixed in the openstack/neutron 9.4.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 10.0.2

This issue was fixed in the openstack/neutron 10.0.2 release.
