neutron failed to deploy in a multi-node deployment

Bug #1546789 reported by Lingfeng Xiong
This bug affects 10 people
Affects   Status         Importance   Assigned to   Milestone
kolla     Fix Released   Critical     Steven Dake
Mitaka    Fix Released   Critical     Steven Dake

Bug Description

I deployed ubuntu binary images in a multi-node environment. It failed with:

TASK: [neutron | Reading json from variable] **********************************
skipping: [node-compute]
fatal: [node-controller] => One or more undefined variables: 'dict object' has no attribute 'stdout'

If I deploy only on the controller node by removing everything from the [compute] section of the inventory, the deployment succeeds.

Steven Dake (sdake)
Changed in kolla:
status: New → Triaged
importance: Undecided → High
milestone: none → mitaka-3
Lingfeng Xiong (xionglingfeng) wrote :

Hi Steven,
I saw that you set this bug to Triaged. Could you share the root cause of this bug and/or possible workarounds?

Lingfeng Xiong (xionglingfeng) wrote :

My current workaround is:
1. Remove all compute nodes from the inventory file, leaving only the controller and network node (they are the same host in my environment).
2. Run the deployment with this inventory file.
3. Add the compute nodes back to the inventory file.
4. In kolla/ansible/roles/neutron/tasks/bootstrap.yml, change

set_fact:
    database_created: "{{ (database.stdout.split('localhost | SUCCESS => ')[1]|$

to

set_fact:
    database_created: "true"

5. Run the deployment again.

It is mandatory to use the original bootstrap.yml to finish the initial deployment on the controller node (the correct database is created in this step), and only then make the modification and deploy to the compute nodes. If the deployment is run with the modified bootstrap.yml directly on a fresh multi-node environment, it will succeed, but neutron cannot start because the service/endpoint in keystone and the databases in mariadb are missing.
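
For reference, the expression replaced in step 4 appears to take the registered output of a local Ansible command, cut off the 'localhost | SUCCESS => ' prefix, and parse the remainder as JSON (the full form is visible in the changed_when line of the diff quoted further down in this report). The Python sketch below only mimics that parsing. The sample stdout string and the function name are made up for illustration, and modelling a skipped task as a dict without a 'stdout' key is an assumption about how Ansible 1.9 registers skipped results.

```python
import json

PREFIX = "localhost | SUCCESS => "


def database_changed(registered):
    """Mimic (database.stdout.split(PREFIX)[1] | from_json).changed."""
    stdout = registered["stdout"]  # raises KeyError when the task was skipped
    payload = stdout.split(PREFIX)[1]  # everything after the prefix is JSON
    return json.loads(payload)["changed"]


# Hypothetical output of a successful database-creation task.
ok_result = {"stdout": 'localhost | SUCCESS => {"changed": true, "db": "neutron"}'}
# Hypothetical registered result of a skipped task: no stdout at all.
skipped_result = {"skipped": True}

print(database_changed(ok_result))
try:
    database_changed(skipped_result)
except KeyError as exc:
    print("undefined attribute:", exc)
```

The second call fails in the same way as the "'dict object' has no attribute 'stdout'" error above: the value being parsed was never produced because the task that should have registered it was skipped.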

Thiago Gomes (fthiagogv) wrote :
Thiago Gomes (fthiagogv)
Changed in kolla:
assignee: nobody → Thiago Gomes (fthiagogv)
Sam Yaple (s8m) wrote :

This is unfortunate, but it is a bug in Ansible 1.x. It is fixed in Ansible 2.x, but our playbooks are not compatible with 2.x.

In the newton cycle we will fix this issue indirectly by switching to Ansible 2.x.

Changed in kolla:
milestone: mitaka-3 → none
OpenStack Infra (hudson-openstack) wrote : Change abandoned on kolla (master)

Change abandoned by Thiago Gomes (<email address hidden>) on branch: master
Review: https://review.openstack.org/285408

Thiago Gomes (fthiagogv)
Changed in kolla:
assignee: Thiago Gomes (fthiagogv) → nobody
Vikram Hosakote (vhosakot) wrote :

I don't think this is a valid kolla bug. As Sam mentioned, this bug will be fixed in Newton when kolla moves to Ansible 2.x.

Martin Matyáš (martinx-maty) wrote :

How is this issue planned to be handled for Mitaka? Will there be an Ansible 1.9.x release with a fix for this?

Technically it may not be a valid bug against Kolla, but it affects Kolla's functionality significantly: deployment of a multi-node configuration fails. The workaround mentioned above works, but it requires manual actions that are not simple to automate. The other workaround, changing the service distribution across nodes, is not a good option either, I think.

Note that another workaround was mentioned on Kolla's IRC channel:
http://eavesdrop.openstack.org/irclogs/%23kolla/%23kolla.2016-03-11.log.html#t2016-03-11T19:56:39
It works for me: tweak site.yml for the neutron role by putting the following services at the top of the section, in this order:
    - neutron-server
    - neutron-dhcp-agent
    - neutron-l3-agent
    - neutron-metadata-agent

in neutron section:
https://github.com/openstack/kolla/blob/906c13eb6148d0c48b5f5ae157cfb10113efe173/ansible/site.yml#L101

Would this be an acceptable fix/workaround to include in kolla directly?

Steven Dake (sdake)
Changed in kolla:
milestone: none → newton-1
Steven Dake (sdake)
Changed in kolla:
assignee: nobody → Michał Jastrzębski (inc007)
status: Triaged → Confirmed
Ganesh Maharaj Mahalingam (ganesh-mahalingam) wrote :

This seems to be an option that worked without having to re-order the list in site.yml. Maybe this can be pursued.

diff --git a/ansible/roles/neutron/tasks/bootstrap.yml b/ansible/roles/neutron/tasks/bootstrap.yml
index 30c9006..c149072 100644
--- a/ansible/roles/neutron/tasks/bootstrap.yml
+++ b/ansible/roles/neutron/tasks/bootstrap.yml
@@ -10,8 +10,9 @@
   changed_when: "{{ database.stdout.find('localhost | SUCCESS => ') != -1 and
                     (database.stdout.split('localhost | SUCCESS => ')[1]|from_json).changed }}"
   failed_when: database.stdout.split()[2] != 'SUCCESS'
-  run_once: True
-  delegate_to: "{{ groups['neutron-server'][0] }}"
+  delegate_to: "{{ inventory_hostname }}"
+  until:
+    database.stdout.split()[2] == "SUCCESS"

 - name: Reading json from variable
   set_fact:

Steven Dake (sdake) wrote :

Ganesh,

I have a patch up which may be more suitable based upon something Vikram said tonight. Can you give it a spin and see if it works?

Thanks
-steve

Changed in kolla:
importance: High → Critical
assignee: Michał Jastrzębski (inc007) → Steven Dake (sdake)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla (master)

Fix proposed to branch: master
Review: https://review.openstack.org/299803

Changed in kolla:
status: Confirmed → In Progress
Ganesh Maharaj Mahalingam (ganesh-mahalingam) wrote :

From the logs, this is a plausible theory. When the play starts, a large list is created of all the hosts on which the play should run, based on the order of the hosts. The playbook then goes through each of the tasks/includes and runs them on the hosts that match the criterion.

delegate_to is broken in Ansible < 2.0, per these reports: https://github.com/ansible/ansible/issues/14684 and https://github.com/ansible/ansible/pull/15024

In all the plays where 'run_once' is enabled, the playbook attempts the task on the first machine in the list (created above) and skips it if that machine doesn't match the criterion.

For example, neutron endpoint creation fails: http://paste.openstack.org/show/492636/

This is followed by creating the config drives, which works correctly because those tasks are not 'run_once': http://paste.openstack.org/show/492637/

Changing the order only changes the order of that list, and the key tasks that are run_once obviously need to run on the server nodes (neutron in this case). Changing the order should therefore have minimal impact here.

Ganesh Maharaj Mahalingam (ganesh-mahalingam) wrote :

The patch above includes the recent change in which the ordering of hosts in 'ansible/site.yml' was changed to put neutron-server at the top of the list. With Ansible 1.9.4, as recommended by kolla, delegate_to and run_once have issues: the task is attempted on the first host for which the details are gathered and, irrespective of the outcome, it is attempted only once. The first host is apparently chosen from the ordering of the list of hosts in 'ansible/site.yml'. All the other plays have their respective servers at the top of the list; neutron did not. As a result, the neutron database creation and service registration tasks were always attempted on the first compute node and skipped. This patch should be revisited/reverted once we move to a version of Ansible in which the delegate_to issues are fixed, to make sure that the code works as expected.
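
The behaviour described above can be illustrated with a small Python model. This is only a sketch of the theory in this comment, not Ansible's actual implementation: the function name is hypothetical, the host names are taken from the original report, and "eligible" stands in for whatever condition restricts the bootstrap task to the neutron-server group.

```python
def run_once_target(play_hosts, eligible_hosts):
    """Return the host that actually runs a run_once task, or None.

    Models the theory above: only the first host of the play is tried,
    and the task is skipped when that host is not eligible.
    """
    first = play_hosts[0]
    return first if first in eligible_hosts else None


neutron_servers = {"node-controller"}

# Compute node listed first: the bootstrap task is skipped everywhere.
before = run_once_target(["node-compute", "node-controller"], neutron_servers)
# neutron-server hosts listed first, as in the reordered site.yml.
after = run_once_target(["node-controller", "node-compute"], neutron_servers)

print(before)  # None
print(after)   # node-controller
```

Under this model, reordering the hosts changes nothing except which host is tried first, which is why the workaround is expected to have minimal impact.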

OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla (master)

Reviewed: https://review.openstack.org/299803
Committed: https://git.openstack.org/cgit/openstack/kolla/commit/?id=0bba5fe0007b281c0a2ee75e4f1c9f3950413e6f
Submitter: Jenkins
Branch: master

commit 0bba5fe0007b281c0a2ee75e4f1c9f3950413e6f
Author: Steven Dake <email address hidden>
Date: Thu Mar 31 04:04:27 2016 -0400

    Workaround ansible bug related to delegate_to

    Currently the delegate_to doesnt happen and the neutron role creation is
    attempted once on the first server and is skipped. The re-ordering of hosts in
    site.yml seems to make the first host to be one inside neutron-server group
    yielding the expected results. This patch needs to be re-visited as soon as a
    version of ansible is chosen that fixes the issues with delegate_to

    Co-Authored-By: Steven Dake <email address hidden>
    Co-Authored-By: Vikram Hosakote <email address hidden>
    Co-Authored-By: Nate Potter <email address hidden>
    Co-Authored-By: Ganesh Mahalingam <email address hidden>
    Change-Id: Ia712b323aa9d750d470a11ee899ab1b3054a903f
    Partial-Bug: #1546789

OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/300655

Steven Dake (sdake)
Changed in kolla:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla (stable/mitaka)

Reviewed: https://review.openstack.org/300655
Committed: https://git.openstack.org/cgit/openstack/kolla/commit/?id=8dc91eafb55297e93bc9b59058e118ceda587c35
Submitter: Jenkins
Branch: stable/mitaka

commit 8dc91eafb55297e93bc9b59058e118ceda587c35
Author: Steven Dake <email address hidden>
Date: Thu Mar 31 04:04:27 2016 -0400

    Workaround ansible bug related to delegate_to

    Currently the delegate_to doesnt happen and the neutron role creation is
    attempted once on the first server and is skipped. The re-ordering of hosts in
    site.yml seems to make the first host to be one inside neutron-server group
    yielding the expected results. This patch needs to be re-visited as soon as a
    version of ansible is chosen that fixes the issues with delegate_to

    Co-Authored-By: Steven Dake <email address hidden>
    Co-Authored-By: Vikram Hosakote <email address hidden>
    Co-Authored-By: Nate Potter <email address hidden>
    Co-Authored-By: Ganesh Mahalingam <email address hidden>
    Change-Id: Ia712b323aa9d750d470a11ee899ab1b3054a903f
    Partial-Bug: #1546789
    (cherry picked from commit 0bba5fe0007b281c0a2ee75e4f1c9f3950413e6f)

tags: added: in-stable-mitaka
Christian Berendt (berendt) wrote :

I hit the same bug with Glance when enabling Ceph and running Glance on different nodes than the Ceph monitor services.

After changing the order of the hosts used for the Glance tasks, everything works as expected.

TASK: [glance | Creating Glance database] *************************************
skipping: [de-1-node-1]

TASK: [glance | Reading json from variable] ***********************************
skipping: [de-1-node-2]
skipping: [de-1-node-1]
skipping: [de-1-node-3]
fatal: [de-1-controller-1] => One or more undefined variables: 'dict object' has no attribute 'stdout'
fatal: [de-1-controller-2] => One or more undefined variables: 'dict object' has no attribute 'stdout'
fatal: [de-1-controller-3] => One or more undefined variables: 'dict object' has no attribute 'stdout'

OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla (master)

Fix proposed to branch: master
Review: https://review.openstack.org/321241

OpenStack Infra (hudson-openstack) wrote : Change abandoned on kolla (master)

Change abandoned by Christian Berendt (<email address hidden>) on branch: master
Review: https://review.openstack.org/321241
Reason: issues solved with ansible >= 2

Changed in kolla:
status: Fix Committed → Fix Released