Server manager provision stuck in storage

Bug #1461398 reported by wenqing liang
This bug affects 1 person
Affects             Status          Importance   Assigned to
Juniper Openstack   (status tracked in Trunk)
  R2.20             Fix Committed   High         Dheeraj Gautam
  Trunk             Fix Committed   High         Dheeraj Gautam

Bug Description

The issue with server manager provisioning stuck in storage is seen since R2.20 build 38 (both icehouse and juno) albeit intermittently. Not seen in prior R2.20 build (#37).

Note only perf1 is the storage-master.

root@cmbu-auto-esx1-lnx02:~# server-manager status server
{
    "server": [
        {
            "id": "cmbu-ceph-perf1",
            "ip_address": "10.87.140.197",
            "mac_address": "00:25:90:AB:9C:88",
            "status": "storage-compute_completed"
        },
        {
            "id": "cmbu-ceph-perf2",
            "ip_address": "10.87.140.198",
            "mac_address": "00:25:90:92:0D:54",
            "status": "storage-master_completed"
        },
        {
            "id": "cmbu-ceph-perf3",
            "ip_address": "10.87.140.199",
            "mac_address": "00:25:90:92:0E:6C",
            "status": "storage-master_completed"
        },
        {
            "id": "cmbu-ceph-perf4",
            "ip_address": "10.87.140.200",
            "mac_address": "00:25:90:92:0D:F2",
            "status": "storage_started"
        }
    ]
}
root@cmbu-auto-esx1-lnx02:~# server-manager show cluster --detail
{
    "cluster": [
        {
            "base_image_id": "",
            "email": "",
            "id": "test-cluster",
            "package_image_id": "",
            "parameters": {
                "admin_key": "AQDIgtNTgPLWARAAK6gs/fj8m88LnY9DwxJdYA==",
                "analytics_data_ttl": "168",
                "database_dir": "/home/cassandra",
                "database_token": "0",
                "domain": "englab.juniper.net",
                "encapsulation_priority": "MPLSoUDP,MPLSoGRE,VXLAN",
                "external_bgp": "",
                "gateway": "10.87.159.254",
                "haproxy": "disable",
                "internal_vip": "",
                "keystone_password": "contrail123",
                "keystone_tenant": "admin",
                "keystone_username": "admin",
                "live_migration": "enable",
                "live_migration_nfs_vm_host": "cmbu-ceph-perf2",
                "live_migration_storage_scope": "global",
                "multi_tenancy": "False",
                "openstack_mgmt_ip": "",
                "osd_bootstrap_key": "AQCq7NFTeJUoBhAAlTVpxwWQJtBej/JDNhT6+Q==",
                "password": "c0ntrail123",
                "router_asn": "64512",
                "service_token": "contrail123",
                "storage_fsid": "10acd23e-86da-4fc8-84cb-720c43a51b46",
                "storage_mon_secret": "AQBM78tTEMz+GhAA3WiOXQI7UVdIy0YFFuTGdw==",
                "storage_virsh_uuid": "e0be2131-609a-49fb-9ffd-6ace64ceee61",
                "subnet_mask": "255.255.224.0",
                "use_certificates": "False",
                "uuid": "8371e031-50f0-4f99-a755-188817294764"
            },
            "provision_role_sequence": "{'completed': [('cmbu-ceph-perf1', 'haproxy', '2015_06_02__14_10_26'), ('cmbu-ceph-perf1', 'database', '2015_06_02__14_12_02'), ('cmbu-ceph-perf1', 'openstack', '2015_06_02__14_20_07'), ('cmbu-ceph-perf1', 'config', '2015_06_02__14_22_17'), ('cmbu-ceph-perf1', 'control', '2015_06_02__14_22_56'), ('cmbu-ceph-perf1', 'collector', '2015_06_02__14_23_32'), ('cmbu-ceph-perf1', 'webui', '2015_06_02__14_23_52'), ('cmbu-ceph-perf2', 'compute', '2015_06_02__14_28_36'), ('cmbu-ceph-perf3', 'compute', '2015_06_02__14_29_19'), ('cmbu-ceph-perf4', 'compute', '2015_06_02__14_29_28')], 'steps': [[(u'cmbu-ceph-perf2', 'post_provision'), (u'cmbu-ceph-perf3', 'post_provision'), (u'cmbu-ceph-perf4', 'post_provision'), (u'cmbu-ceph-perf2', 'storage-compute'), (u'cmbu-ceph-perf3', 'storage-compute'), (u'cmbu-ceph-perf4', 'storage-compute'), (u'cmbu-ceph-perf1', 'storage-master'), (u'cmbu-ceph-perf1', 'post_provision')]]}",
            "provisioned_id": null
        }
    ]
}
root@cmbu-auto-esx1-lnx02:~#

compute syslog:

Jun 2 16:38:18 cmbu-ceph-perf2 puppet-agent[2851]: (/Stage[storage]/Contrail::Profile::Storage/Contrail::Storage/Contrail::Lib::Storage_common[storage-compute]/Contrail::Lib::Report_status[storage-compute_completed]/Exec[contrail-status-storage-compute_completed]) Dependency Exec[setup-config-storage-compute-live-migration] has failures: true
Jun 2 16:38:18 cmbu-ceph-perf2 puppet-agent[2851]: (/Stage[storage]/Contrail::Profile::Storage/Contrail::Storage/Contrail::Lib::Storage_common[storage-compute]/Contrail::Lib::Report_status[storage-compute_completed]/Exec[contrail-status-storage-compute_completed]) Skipping because of failed dependencies
Jun 2 16:38:18 cmbu-ceph-perf2 puppet-agent[2851]: (/Stage[post]/Contrail::Provision_complete/Contrail::Lib::Report_status[post_provision_completed]/Exec[contrail-status-post_provision_completed]) Dependency Exec[setup-config-storage-compute-live-migration] has failures: true
Jun 2 16:38:18 cmbu-ceph-perf2 puppet-agent[2851]: (/Stage[post]/Contrail::Provision_complete/Contrail::Lib::Report_status[post_provision_completed]/Exec[contrail-status-post_provision_completed]) Skipping because of failed dependencies
Jun 2 16:38:18 cmbu-ceph-perf2 puppet-agent[2851]: (/Stage[post]/Contrail::Provision_complete/Exec[do-reboot-server]) Dependency Exec[setup-config-storage-compute-live-migration] has failures: true
Jun 2 16:38:18 cmbu-ceph-perf2 puppet-agent[2851]: (/Stage[post]/Contrail::Provision_complete/Exec[do-reboot-server]) Skipping because of failed dependencies
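The cascade in the syslog above follows from Puppet's dependency semantics: when one resource fails, every resource that requires it is skipped rather than applied. A minimal Python model of that skip propagation (illustrative only, not Puppet code; the resource names are taken from the log above):

```python
# Minimal model of Puppet's skip-on-failed-dependency behavior
# (illustrative only, not Puppet code). Each resource lists the
# resources it requires; if any requirement failed or was skipped,
# the resource is skipped instead of applied.

def apply(resources, failed):
    """resources: ordered {name: [requirements]}; failed: names whose run failed."""
    status = {}
    for name, reqs in resources.items():
        if any(status.get(r) in ('failed', 'skipped') for r in reqs):
            status[name] = 'skipped'
        else:
            status[name] = 'failed' if name in failed else 'applied'
    return status

resources = {
    'setup-config-storage-compute-live-migration': [],
    'contrail-status-storage-compute_completed':
        ['setup-config-storage-compute-live-migration'],
    'contrail-status-post_provision_completed':
        ['setup-config-storage-compute-live-migration'],
    'do-reboot-server':
        ['setup-config-storage-compute-live-migration'],
}

print(apply(resources, failed={'setup-config-storage-compute-live-migration'}))
# one failing Exec leaves every dependent Exec skipped, which is why the
# server never reports storage-compute_completed and never reboots
```

This is why a single failing `Exec[setup-config-storage-compute-live-migration]` is enough to leave the server stuck in `storage_started`.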

Revision history for this message
Dheeraj Gautam (dgautam) wrote :

This is failing because the base contrail-openstack installation did not install correctly: none of the computes registered themselves.

root@cmbu-ceph-perf1:~# nova-manage host list
host zone
2015-06-02 23:48:24.979 24328 DEBUG oslo.db.sqlalchemy.session [req-c18f4f7c-37e3-4322-b172-9d256dbfeac6 ] MySQL server mode set to STRICT_TRANS_TABLES,STRICT_ALL_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,TRADITIONAL,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION _init_events /usr/lib/python2.7/dist-packages/oslo/db/sqlalchemy/session.py:461
cmbu-ceph-perf1 internal
root@cmbu-ceph-perf1:~# nova host-list
+-----------------+-------------+----------+
| host_name | service | zone |
+-----------------+-------------+----------+
| cmbu-ceph-perf1 | consoleauth | internal |
| cmbu-ceph-perf1 | conductor | internal |
| cmbu-ceph-perf1 | conductor | internal |
| cmbu-ceph-perf1 | scheduler | internal |
+-----------------+-------------+----------+

Changed in juniperopenstack:
assignee: Dheeraj Gautam (dgautam) → Thilak Raj (tsurendra)
information type: Proprietary → Public
Revision history for this message
Thilak Raj (tsurendra) wrote :

Wenqing, please try later builds and let us know if you still see this issue.

Revision history for this message
Thilak Raj (tsurendra) wrote :

Wenqing, I hit this issue again today on build 47, with the patch from
https://review.opencontrail.org/#/c/11442/

Below is my analysis.

I think the sequence for storage is wrong.

root@cmbu-auto-esx1-lnx02:~# server-manager status server
{
    "server": [
        {
            "id": "cmbu-ceph-perf1",
            "ip_address": "10.87.140.197",
            "mac_address": "00:25:90:AB:9C:88",
            "status": "storage_started"
        },
        {
            "id": "cmbu-ceph-perf2",
            "ip_address": "10.87.140.198",
            "mac_address": "00:25:90:92:0D:54",
            "status": "storage_started"
        },
        {
            "id": "cmbu-ceph-perf3",
            "ip_address": "10.87.140.199",
            "mac_address": "00:25:90:92:0E:6C",
            "status": "storage-master_completed"
        },
        {
            "id": "cmbu-ceph-perf4",
            "ip_address": "10.87.140.200",
            "mac_address": "00:25:90:92:0D:F2",
            "status": "storage_started"
        }
    ]
}
root@cmbu-auto-esx1-lnx02:~# server-manager show server --select "id,ip_address,roles"
{
    "server": [
        {
            "id": "cmbu-ceph-perf1",
            "ip_address": "10.87.140.197",
            "roles": [
                "config",
                "openstack",
                "control",
                "collector",
                "webui",
                "database",
                "storage-master"
            ]
        },
        {
            "id": "cmbu-ceph-perf2",
            "ip_address": "10.87.140.198",
            "roles": [
                "compute",
                "storage-compute"
            ]
        },
        {
            "id": "cmbu-ceph-perf3",
            "ip_address": "10.87.140.199",
            "roles": [
                "compute",
                "storage-compute"
            ]
        },
        {
            "id": "cmbu-ceph-perf4",
            "ip_address": "10.87.140.200",
            "roles": [
                "compute",
                "storage-compute"
            ]
        }
    ]
}
===============
Not sure why storage-master is reported completed on "cmbu-ceph-perf3" when that server has no
storage-master role, only storage-compute.

Also, I am not sure the role sequence is right:

            "provision_role_sequence": "{'completed': [('cmbu-ceph-perf1', 'haproxy', '2015_06_10__11_33_28'), ('cmbu-ceph-perf1', 'database', '2015_06_10__11_35_06'), ('cmbu-ceph-perf1', 'openstack', '2015_06_10__11_43_45'), ('cmbu-ceph-perf1', 'config', '2015_06_10__11_45_57'), ('cmbu-ceph-perf1', 'control', '2015_06_10__11_46_34'), ('cmbu-ceph-perf1', 'collector', '2015_06_10__11_47_08'), ('cmbu-ceph-perf1', 'webui', '2015_06_10__11_47_27'), ('cmbu-ceph-perf2', 'compute', '2015_06_10__11_52_48'), ('cmbu-ceph-perf3', 'compute', '2015_06_10__11_53_00'), ('cmbu-ceph-perf4', 'compute', '2015_06_10__11_53_02')], 'steps': [[(u'cmbu-ceph-perf2', 'post_provision'), (u'cmbu-ceph-perf3', 'post_provision'), (u'cmbu-ceph-perf4', 'post_provision'), (u'cmbu-ceph-perf2', 'storage-compute'), (u'cmbu-ceph-perf3', 'storage-compute'), (u'cmbu-ceph-perf4', 'storage-compute'), (u'cmbu-ceph-perf1', 'storage-master'), (u'cmbu-ceph-perf1...


Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.20

Review in progress for https://review.opencontrail.org/11504
Submitter: Dheeraj Gautam (<email address hidden>)

Revision history for this message
Abhay Joshi (abhayj) wrote :

Why did you assign this to Thilak? If it is for review, the bug should remain with you. Please reassign it to yourself.

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/11504
Committed: http://github.com/Juniper/contrail-server-manager/commit/7c29330c68c3228613c22d3d229b799f04626214
Submitter: Zuul
Branch: R2.20

commit 7c29330c68c3228613c22d3d229b799f04626214
Author: root <email address hidden>
Date: Thu Jun 11 08:43:22 2015 -0700

SM-Sequencing: ensure that post_provision is
enabled after all the other roles have been enabled

Partial-Bug: #1461398

There is a possibility of a race condition while enabling post_provision and the
other roles (storage, tsn, toragent). Currently SM may enable post_provision
first and write it into the hieradata file, then on the next iteration pick the
other roles, enable them, and update the file again. Effectively, the hieradata
file changes are not correctly sequenced. Between those writes, puppet may read
the file and skip the other roles.

FIX:
===
Ensure post_provision is last in the iteration list. This guarantees that the
post_provision changes are enabled last, after all the other roles have been
enabled.

TESTING:
1. Prasad: provisioned base contrail and verified post_provision is at the end
of the role sequence. Verified provisioning is successful.
2. Provisioned storage roles and verified post_provision is at the end of the
role sequence. Verified provisioning is successful.

Change-Id: I8e83abfabff1cc69f70ebfd2b7b6d9eccfc31205
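The sequencing fix described in the commit message can be sketched as follows. This is an illustrative Python fragment, not the actual contrail-server-manager code (the helper name `order_roles` is hypothetical), using host/role pairs from this bug's role sequence:

```python
# Illustrative sketch only -- not the actual contrail-server-manager code.
# The fix guarantees that 'post_provision' entries are moved to the end of
# the role iteration order, so their hieradata update is written only after
# every other role (storage-compute, storage-master, tsn, toragent, ...)
# has been enabled.

def order_roles(steps):
    """Return (host, role) steps with every post_provision entry moved last."""
    normal = [s for s in steps if s[1] != 'post_provision']
    post = [s for s in steps if s[1] == 'post_provision']
    return normal + post

steps = [
    ('cmbu-ceph-perf2', 'post_provision'),
    ('cmbu-ceph-perf2', 'storage-compute'),
    ('cmbu-ceph-perf1', 'storage-master'),
    ('cmbu-ceph-perf1', 'post_provision'),
]

print(order_roles(steps))
# the storage roles now precede all post_provision entries, so puppet can
# never read a hieradata state where post_provision is enabled while a
# storage role is not
```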

wenqing liang (wliang)
description: updated
tags: added: blocker
description: updated
Revision history for this message
Dheeraj Gautam (dgautam) wrote :

The original issue may still happen:

root@cmbu-ceph-perf1:~# . /etc/contrail/openstackrc
root@cmbu-ceph-perf1:~# nova list
ERROR (ClientException): The server has either erred or is incapable of performing the requested operation. (HTTP 500) (Request-ID: req-d0a291f2-a884-465b-99e4-fdd7d8f534d7)
root@cmbu-ceph-perf1:~# nova-manage host list
host zone
2015-06-16 08:27:49.101 19033 DEBUG oslo.db.sqlalchemy.session [req-4071216c-2cf9-48a7-96b1-2c24cc4618c8 ] MySQL server mode set to STRICT_TRANS_TABLES,STRICT_ALL_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,TRADITIONAL,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION _init_events /usr/lib/python2.7/dist-packages/oslo/db/sqlalchemy/session.py:461
cmbu-ceph-perf1 internal
root@cmbu-ceph-perf1:~# nova host-list
+-----------------+-------------+----------+
| host_name | service | zone |
+-----------------+-------------+----------+
| cmbu-ceph-perf1 | consoleauth | internal |
| cmbu-ceph-perf1 | conductor | internal |
| cmbu-ceph-perf1 | conductor | internal |
| cmbu-ceph-perf1 | scheduler | internal |
+-----------------+-------------+----------+
root@cmbu-ceph-perf1:~#

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.20

Review in progress for https://review.opencontrail.org/11776
Submitter: Dheeraj Gautam (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/11776
Committed: http://github.com/Juniper/contrail-puppet/commit/f5a5276402cf3d7ab8268072b830900a7d624bed
Submitter: Zuul
Branch: R2.20

commit f5a5276402cf3d7ab8268072b830900a7d624bed
Author: root <email address hidden>
Date: Thu Jun 18 00:20:35 2015 -0700

Server-Manager: pulled bug fixes from
openstack-puppet modules

Closes-Bug: #1461398

Based on bug https://bugs.launchpad.net/fuel/+bug/1335804, the following changes
were pulled into our repo.

https://review.openstack.org/#/c/118267/
https://review.openstack.org/#/c/106785/

This orders nova-db-sync before the services are started.

TESTING:
Tried 5 times on the same problematic setup and the issue did not come up.
Previously the issue was observed 2-3 times out of 5.

Change-Id: I38a8a3c172ffbc3444a85a9e0a5dba7475d9ea03
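The pulled change enforces that the nova database sync completes before the nova services start, which is why the compute hosts previously failed to register. As a rough shell-level illustration in Python (a hypothetical `provision_nova` helper, not the actual puppet resources):

```python
# Hypothetical sketch of the ordering constraint, not the actual puppet
# change: the nova database sync must complete successfully before the
# nova services are (re)started, otherwise the compute hosts cannot
# register and 'nova-manage host list' shows only the controller.

import subprocess

def provision_nova(run=subprocess.check_call):
    # Step 1: sync the database schema. check_call raises on failure,
    # so a failed sync aborts provisioning instead of racing ahead.
    run(['nova-manage', 'db', 'sync'])
    # Step 2: only now start the services that depend on the schema.
    for svc in ('nova-api', 'nova-conductor', 'nova-scheduler'):
        run(['service', svc, 'restart'])
```

Passing a fake `run` callable makes the ordering easy to verify without touching a real host.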

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/12391
Submitter: Dheeraj Gautam (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/12391
Committed: http://github.com/Juniper/contrail-puppet/commit/852d5e234e4486a7f4e55bef03e4353833fb694d
Submitter: Zuul
Branch: master

commit 852d5e234e4486a7f4e55bef03e4353833fb694d
Author: root <email address hidden>
Date: Thu Jun 18 00:20:35 2015 -0700

Server-Manager: pulled bug fixes from
openstack-puppet modules

Closes-Bug: #1461398

Based on bug https://bugs.launchpad.net/fuel/+bug/1335804, the following changes
were pulled into our repo.

https://review.openstack.org/#/c/118267/
https://review.openstack.org/#/c/106785/

This orders nova-db-sync before the services are started.

TESTING:
Tried 5 times on the same problematic setup and the issue did not come up.
Previously the issue was observed 2-3 times out of 5.

Change-Id: I38a8a3c172ffbc3444a85a9e0a5dba7475d9ea03

Revision history for this message
wenqing liang (wliang) wrote :

Seen again on R2.20-97, hence reopening the bug.
