M->Newton upgrade causes all fip to disappear.

Bug #1763322 reported by Sofer Athlan-Guyot
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Sofer Athlan-Guyot

Bug Description

Hi,

Reported here https://bugzilla.redhat.com/show_bug.cgi?id=1499201

All FIPs (floating IPs) associated with instances before the upgrade become unreachable after the upgrade.

The neutron agent list returns duplicate entries for each controller, with half of the agents being down:

neutron agent-list
+--------------------------------------+--------------------+------------------------------------+-------------------+-------+----------------+---------------------------+
| id | agent_type | host | availability_zone | alive | admin_state_up | binary |
+--------------------------------------+--------------------+------------------------------------+-------------------+-------+----------------+---------------------------+
| 0d87b463-d27c-4b90-b43c-4420d367a0bb | Open vSwitch agent | overcloud-controller-2.localdomain | | :-) | True | neutron-openvswitch-agent |
| 1f21aad1-0688-41df-94fc-afffbc6ad639 | Metadata agent | overcloud-controller-1 | | xxx | True | neutron-metadata-agent |
| 22c57edf-8015-4172-acdd-5ca30fe9d2fd | L3 agent | overcloud-controller-1.localdomain | nova | :-) | True | neutron-l3-agent |
| 2552feaf-1429-495b-a25f-1a492e5a6668 | Metadata agent | overcloud-controller-2 | | xxx | True | neutron-metadata-agent |
| 340db352-474c-4a31-a62a-e9a0f4406bd1 | DHCP agent | overcloud-controller-0 | nova | xxx | True | neutron-dhcp-agent |
| 53b444b0-abbc-4825-b1ad-8622c77aa36e | L3 agent | overcloud-controller-0.localdomain | nova | :-) | True | neutron-l3-agent |
| 54caa3f7-53ec-4f27-9252-a774b78c06c9 | Open vSwitch agent | overcloud-controller-1.localdomain | | :-) | True | neutron-openvswitch-agent |
| 6caf3e06-c3f5-4e50-99fa-c4f6ae4bdbb5 | DHCP agent | overcloud-controller-2 | nova | xxx | True | neutron-dhcp-agent |
| 6e362eb0-678b-434f-b2bd-746107610114 | DHCP agent | overcloud-controller-1.localdomain | nova | :-) | True | neutron-dhcp-agent |
| 82286c37-7c71-446a-a4b1-73647834944f | Metadata agent | overcloud-controller-0 | | xxx | True | neutron-metadata-agent |
| 83117afc-c8f7-4b5d-b9d5-859f960c677c | Metadata agent | overcloud-controller-0.localdomain | | :-) | True | neutron-metadata-agent |
| 844ef54d-69db-4728-b203-869136ef4368 | Open vSwitch agent | overcloud-controller-1 | | xxx | True | neutron-openvswitch-agent |
| 84c43451-4890-4448-a746-f4cab94cc767 | Open vSwitch agent | overcloud-controller-2 | | xxx | True | neutron-openvswitch-agent |
| 85330667-84b7-4bbf-93be-9dadd0736eea | Open vSwitch agent | overcloud-controller-0.localdomain | | :-) | True | neutron-openvswitch-agent |
| 87172b40-265c-4b24-a44f-ae7c5f2bb116 | L3 agent | overcloud-controller-0 | nova | xxx | True | neutron-l3-agent |
| 88ea82ef-22e8-46dd-850b-5f34efd83bf5 | Metadata agent | overcloud-controller-2.localdomain | | :-) | True | neutron-metadata-agent |
| 8b30b03a-9c32-4c03-bf44-2ac1fd4492fe | DHCP agent | overcloud-controller-0.localdomain | nova | :-) | True | neutron-dhcp-agent |
| c448cc63-29d8-4a41-a71d-97e499958aef | Metadata agent | overcloud-controller-1.localdomain | | :-) | True | neutron-metadata-agent |
| d17095af-7799-4024-abbf-b7c01efee452 | DHCP agent | overcloud-controller-1 | nova | xxx | True | neutron-dhcp-agent |
| d1f664e1-6539-41a3-9686-1e828b9258af | Open vSwitch agent | overcloud-controller-0 | | xxx | True | neutron-openvswitch-agent |
| d5f26fbc-02ab-4866-945c-c798e80de94f | L3 agent | overcloud-controller-2.localdomain | nova | :-) | True | neutron-l3-agent |
| d6af24ef-2b49-4477-923d-b29bc7e13e86 | L3 agent | overcloud-controller-1 | nova | xxx | True | neutron-l3-agent |
| d71fbc5a-a3da-4eb2-bf76-f06c6130c895 | DHCP agent | overcloud-controller-2.localdomain | nova | :-) | True | neutron-dhcp-agent |
| de1c1c12-3e2f-4ebf-9daa-f2b0b3eb3b38 | Open vSwitch agent | overcloud-compute-1.localdomain | | :-) | True | neutron-openvswitch-agent |
| ec13520a-dcc2-4b34-bbfc-4a6c76466379 | L3 agent | overcloud-controller-2 | nova | xxx | True | neutron-l3-agent |
| f24abfbf-3c42-45cf-9d39-d2eb11feb6e9 | Open vSwitch agent | overcloud-compute-0.localdomain | | :-) | True | neutron-openvswitch-agent |
+--------------------------------------+--------------------+------------------------------------+-------------------+-------+----------------+---------------------------+
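
The duplication in the listing above can be spotted mechanically. The following is a minimal sketch, not part of the original report: it parses `neutron agent-list` style output and groups agents by agent type and short hostname, so that `overcloud-controller-0` and `overcloud-controller-0.localdomain` land in the same group. The function names and the three-row sample are illustrative assumptions.

```python
from collections import defaultdict

# Three rows copied from the listing above, used as sample input.
SAMPLE = """
| id | agent_type | host | availability_zone | alive | admin_state_up | binary |
| 53b444b0-abbc-4825-b1ad-8622c77aa36e | L3 agent | overcloud-controller-0.localdomain | nova | :-) | True | neutron-l3-agent |
| 87172b40-265c-4b24-a44f-ae7c5f2bb116 | L3 agent | overcloud-controller-0 | nova | xxx | True | neutron-l3-agent |
| f24abfbf-3c42-45cf-9d39-d2eb11feb6e9 | Open vSwitch agent | overcloud-compute-0.localdomain | | :-) | True | neutron-openvswitch-agent |
"""


def parse_agent_rows(table_text):
    """Yield one dict per data row; the first '|' row is taken as the header."""
    header = None
    for line in table_text.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # skip '+---+' borders and blank lines
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        if header is None:
            header = cells
            continue
        yield dict(zip(header, cells))


def find_duplicate_agents(rows):
    """Group rows by (agent_type, short hostname) and keep groups with more than one entry."""
    groups = defaultdict(list)
    for row in rows:
        short_host = row["host"].split(".", 1)[0]
        groups[(row["agent_type"], short_host)].append(row)
    return {key: members for key, members in groups.items() if len(members) > 1}


if __name__ == "__main__":
    for (agent_type, host), members in find_duplicate_agents(parse_agent_rows(SAMPLE)).items():
        states = ", ".join(f"{m['host']} alive={m['alive']}" for m in members)
        print(f"{agent_type} on {host}: {states}")
```

Any group reported here has one agent registered under the short hostname and one under the FQDN, which is the symptom described in this bug.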

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/560855

Changed in tripleo:
importance: Critical → High
Changed in tripleo:
assignee: nobody → Sofer Athlan-Guyot (sofer-athlan-guyot)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (stable/newton)

Reviewed: https://review.openstack.org/560855
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=85b489bf4fcfc8a19a82e4d713d0adde25463a9c
Submitter: Zuul
Branch: stable/newton

commit 85b489bf4fcfc8a19a82e4d713d0adde25463a9c
Author: Sofer Athlan-Guyot <email address hidden>
Date: Thu Apr 12 12:11:16 2018 +0200

    [NEWTON ONLY] Adjust NeutronAllowL3AgentFailover to new default.

    The default value of NeutronAllowL3AgentFailover has changed
    to true. But in Mitaka it was set to false in the registry, so
    that value persists and overrides the new default.

    This currently has no impact, as neutron ignores this value when
    the l3 agents are in HA configuration (this parameter only works
    when non-HA).

    But it's still good to have, for two reasons:

     - it simplifies recovery if we hit bug #1763322, see [1]
     - it will not cause unexpected issues if we offer the possibility
       to have non-HA l3 agents.

    [1] https://bugzilla.redhat.com/attachment.cgi?id=1421308

    Change-Id: Ibba351ff625abfe8fd5b1f76b9a49bac120cbd8a
    Related-Bug: #1763322

tags: added: in-stable-newton
Changed in tripleo:
milestone: rocky-1 → rocky-2
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/563572

OpenStack Infra (hudson-openstack) wrote : Change abandoned on puppet-tripleo (stable/queens)

Change abandoned by Athlan-Guyot sofer (<email address hidden>) on branch: stable/queens
Review: https://review.openstack.org/563572
Reason: In favor of https://review.openstack.org/#/c/562542/ which will handle the no value to value switch in newton.

Changed in tripleo:
assignee: Sofer Athlan-Guyot (sofer-athlan-guyot) → Oliver Walsh (owalsh)
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/newton)

Reviewed: https://review.openstack.org/562542
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=6938b5d1f2747416338ce188068d7a967cba617b
Submitter: Zuul
Branch: stable/newton

commit 6938b5d1f2747416338ce188068d7a967cba617b
Author: Sofer Athlan-Guyot <email address hidden>
Date: Thu Apr 19 12:38:31 2018 +0200

    [Newton only] Do not overwrite current {neutron,nova}::host value.

    In newton we are setting both nova::host and neutron::host values
    explicitly to the fqdn.

    This can cause problems during an upgrade from Mitaka. The previous
    host value (defaulting to Python's socket.gethostname()) could return
    only the short hostname[0]. This means that during the upgrade we
    change this identifier. At restart nova/neutron create *new* agents,
    which are unaware of the existing workload.

    For neutron, the problem is that due to [1] and the fact that the L3
    agents are in HA mode, the workloads previously defined on those
    agents get lost and FIPs become unreachable.

    For nova, it is no longer possible to send commands to VMs that
    existed before the upgrade.

    This patch checks the current live value of the host parameter through
    a fact and sets the nova::host and neutron::host values to it when we
    are not doing an initial deployment (that is, on upgrade/update).

    For nova, we directly use nova-manage to get the current live value.
    Using the mysql parameter directly has the advantage that it's defined
    on all types of nodes (controller *and* compute). As a matter of fact,
    the required auth parameters are usually not defined on compute nodes.

    For neutron, when auth is available in the configuration (on the
    controllers) we use that. There is no neutron-manage equivalent here,
    so we use the nova value when auth is unavailable. When host is unset,
    they both use socket.gethostname(), so it should be the same value.
    Using auth on the controllers adds another level of confidence though.
    And the controllers are where the l3 agents are, so better safe than
    sorry.

    This patch is newton only, as that is where we set this parameter for
    the first time. After that (ocata on) we use [2] to make sure
    that those parameters are never rewritten.

    [0] https://bugzilla.redhat.com/show_bug.cgi?id=1499201
    [1] https://review.openstack.org/#/c/560855/
    [2] need to be backported to ocata https://review.openstack.org/#/q/I8f075a5ad869ef0dc72a700dcb7be0b6efca787a

    Partial-Bug: #1763322
    Change-Id: Ieb92ff161d1684c214382c5eb6b5949efc3fe75c
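
The selection rule described in the commit message above can be sketched as follows. This is a minimal illustration in Python, not the actual puppet-tripleo implementation (which uses a Puppet fact); the function names, the `initial_deployment` flag, and the fallback to the FQDN when no live value is found are assumptions made for the example. Only `socket.gethostname()` and `socket.getfqdn()` are real standard-library calls.

```python
import socket


def choose_host_value(live_value, fqdn, initial_deployment):
    """Pick the value to use for nova::host / neutron::host.

    live_value: identifier the services currently register with, or None/"" if unknown.
    fqdn: fully qualified name that Newton would otherwise set explicitly.
    initial_deployment: True on a fresh deployment, False on upgrade/update.
    """
    if initial_deployment or not live_value:
        # Nothing to preserve (assumed fallback): use the Newton behaviour.
        return fqdn
    # Upgrade/update: keep the identifier existing agents are known by,
    # so that no *new* agents get created at restart.
    return live_value


def would_rename(live_value, fqdn):
    """True when forcing the FQDN would change an identifier already in use."""
    return bool(live_value) and live_value != fqdn


if __name__ == "__main__":
    # On many systems the first value is the short name and the second the FQDN,
    # which is the mismatch this bug is about.
    print("socket.gethostname():", socket.gethostname())
    print("socket.getfqdn():    ", socket.getfqdn())
    print(choose_host_value("overcloud-controller-0",
                            "overcloud-controller-0.localdomain",
                            initial_deployment=False))
```

With a Mitaka-era live value of `overcloud-controller-0`, the sketch keeps that short name on upgrade instead of switching to `overcloud-controller-0.localdomain`.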

Changed in tripleo:
milestone: rocky-2 → rocky-3
Sofer Athlan-Guyot (sofer-athlan-guyot) wrote :

You also need https://review.openstack.org/#/c/568552/, which is merged, for this to be fixed.

Sofer Athlan-Guyot (sofer-athlan-guyot) wrote :

We have solved M->N, but the problem now arises in Ocata (for environments deployed in Mitaka). See https://bugzilla.redhat.com/show_bug.cgi?id=1596571 for an example.

Changed in tripleo:
assignee: Oliver Walsh (owalsh) → Sofer Athlan-Guyot (sofer-athlan-guyot)
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (master)

Change abandoned by Oliver Walsh (<email address hidden>) on branch: master
Review: https://review.openstack.org/562876

Changed in tripleo:
milestone: rocky-3 → rocky-rc1
Changed in tripleo:
milestone: rocky-rc1 → stein-1
Changed in tripleo:
milestone: stein-1 → stein-2
Sofer Athlan-Guyot (sofer-athlan-guyot) wrote :

The main issue is now fixed for the M->N upgrade. If needed, we'll open other LP bugs for other releases.

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Change abandoned on puppet-tripleo (master)

Change abandoned by Alex Schultz (<email address hidden>) on branch: master
Review: https://review.opendev.org/555732
Reason: This review is > 180 days without comment and WIP -1. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and contacting the reviewers. For more details check policy https://specs.openstack.org/openstack/tripleo-specs/specs/policy/patch-abandonment.html
