ServerManager :: Provisioning Fails on R3.1.1 for multi node/interface setups

Bug #1638814 reported by Ritam Gangopadhyay
This bug affects 1 person
Affects             Status          Importance   Assigned to             Milestone
Juniper Openstack   (status tracked in Trunk)
  R3.1              Fix Committed   Critical     Nitish Krishna Kaveri
  R3.2              Fix Committed   Critical     Nitish Krishna Kaveri
  Trunk             Fix Committed   Critical     Nitish Krishna Kaveri

Bug Description

R3.1.1 build 37: On a single-node setup, provisioning goes through in spite of the repeated termination of the config process, but on multi-node/multi-interface setups the same terminations stall provisioning completely.

SM - nodej5

host1 = 'root@10.204.221.58'
host2 = 'root@10.204.221.59'
host3 = 'root@10.204.221.60'
host4 = 'root@10.204.221.61'

env.roledefs = {
    'all': [host1, host2, host3, host4],
    'cfgm': [host1, host2],
    'openstack': [host1],
    'control': [host1,host2],
    'compute': [host3,host4],
    'collector': [host2],
    'webui': [host2],
    'database': [host2],
    'build': [host_build],
}

Nov 2 01:31:27 nodec35 puppet-agent[20722]: (/Stage[config]/Contrail::Config::Service/Service[supervisor-support-service]) Triggered 'refresh' from 1 events
Nov 2 01:31:27 nodec35 puppet-agent[20722]: contrail contrail_exec_provision_control is python exec_provision_control.py --api_server_ip "192.168.100.3" --api_server_port 8082 --host_name_list "nodec33,nodec35" --host_ip_list "192.168.100.3,192.168.100.4" --router_asn "64512" --mt_options "admin,contrail123,admin" && echo exec-provision-control >> /etc/contrail/contrail_config_exec.out
Nov 2 01:31:27 nodec35 puppet-agent[20722]: (/Stage[config]/Contrail::Exec_provision_control/Notify[contrail contrail_exec_provision_control is python exec_provision_control.py --api_server_ip "192.168.100.3" --api_server_port 8082 --host_name_list "nodec33,nodec35" --host_ip_list "192.168.100.3,192.168.100.4" --router_asn "64512" --mt_options "admin,contrail123,admin" && echo exec-provision-control >> /etc/contrail/contrail_config_exec.out]/message) defined 'message' as 'contrail contrail_exec_provision_control is python exec_provision_control.py --api_server_ip "192.168.100.3" --api_server_port 8082 --host_name_list "nodec33,nodec35" --host_ip_list "192.168.100.3,192.168.100.4" --router_asn "64512" --mt_options "admin,contrail123,admin" && echo exec-provision-control >> /etc/contrail/contrail_config_exec.out'
Nov 2 01:31:43 nodec35 kernel: [ 3332.251126] init: supervisor-config main process (23426) killed by TERM signal
Nov 2 01:32:00 nodec35 kernel: [ 3349.896544] init: supervisor-config main process (23710) killed by TERM signal
Nov 2 01:32:18 nodec35 kernel: [ 3367.621155] init: supervisor-config main process (23856) killed by TERM signal
Nov 2 01:32:37 nodec35 kernel: [ 3386.652796] init: supervisor-config main process (23999) killed by TERM signal
Nov 2 01:32:55 nodec35 kernel: [ 3404.709807] init: supervisor-config main process (24146) killed by TERM signal
Nov 2 01:33:13 nodec35 kernel: [ 3422.397463] init: supervisor-config main process (24292) killed by TERM signal
Nov 2 01:33:31 nodec35 kernel: [ 3439.998079] init: supervisor-config main process (24638) killed by TERM signal
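
To quantify the restart loop from a log like the one above, a small parser can count the kills and measure the gap between them. This is only a sketch: the regular expression is written against the kernel/upstart line format shown in this report, and the function and variable names are made up for illustration.

```python
import re

# Matches upstart "killed by TERM signal" kernel lines in the format shown above.
KILL_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\] init: (?P<job>\S+) main process "
    r"\((?P<pid>\d+)\) killed by TERM signal"
)


def kill_intervals(lines, job="supervisor-config"):
    """Return (kill_count, gaps_in_seconds) for one upstart job."""
    uptimes = []
    for line in lines:
        match = KILL_RE.search(line)
        if match and match.group("job") == job:
            # The bracketed value is the kernel uptime in seconds.
            uptimes.append(float(match.group("uptime")))
    gaps = [later - earlier for earlier, later in zip(uptimes, uptimes[1:])]
    return len(uptimes), gaps


# Three lines copied from the log above.
SAMPLE = """\
Nov 2 01:31:43 nodec35 kernel: [ 3332.251126] init: supervisor-config main process (23426) killed by TERM signal
Nov 2 01:32:00 nodec35 kernel: [ 3349.896544] init: supervisor-config main process (23710) killed by TERM signal
Nov 2 01:32:18 nodec35 kernel: [ 3367.621155] init: supervisor-config main process (23856) killed by TERM signal
"""

count, gaps = kill_intervals(SAMPLE.splitlines())
print(count, [round(gap, 1) for gap in gaps])
```

On the excerpt in this report the gaps come out at roughly 17 to 19 seconds, which suggests a periodic restart rather than random crashes.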

Revision history for this message
sundarkh (sundar-kh) wrote :

The same issue is seen with CentOS build 37 on a multi-interface setup, where the supervisor-config process gets killed continuously.

/var/log/messages/

Nov 3 02:05:05 nodec35 systemd: Started SYSV: Supervisord instance for Contrail Config Package.
Nov 3 02:05:09 nodec35 systemd: Stopping SYSV: Supervisord instance for Contrail Config Package...
Nov 3 02:05:20 nodec35 supervisor-config: contrail-config-nodemgr: stopped
Nov 3 02:05:20 nodec35 supervisor-config: ifmap: stopped
Nov 3 02:05:20 nodec35 supervisor-config: contrail-discovery: stopped
Nov 3 02:05:20 nodec35 supervisor-config: contrail-api:0: stopped
Nov 3 02:05:20 nodec35 supervisor-config: contrail-device-manager: stopped
Nov 3 02:05:20 nodec35 supervisor-config: contrail-schema: stopped
Nov 3 02:05:20 nodec35 supervisor-config: contrail-svc-monitor: stopped
Nov 3 02:05:20 nodec35 supervisor-config: Shut down
Nov 3 02:05:20 nodec35 systemd: Starting SYSV: Supervisord instance for Contrail Config Package...
Nov 3 02:05:20 nodec35 supervisor-config: Starting Supervisor Daemon for Contrail Config Package
Nov 3 02:05:20 nodec35 systemd: Started SYSV: Supervisord instance for Contrail Config Package.
Nov 3 02:05:27 nodec35 systemd: Stopping SYSV: Supervisord instance for Contrail Config Package...
Nov 3 02:05:37 nodec35 supervisor-config: contrail-config-nodemgr: stopped
Nov 3 02:05:37 nodec35 supervisor-config: ifmap: stopped
Nov 3 02:05:37 nodec35 supervisor-config: contrail-discovery: stopped
Nov 3 02:05:37 nodec35 supervisor-config: contrail-api:0: stopped
Nov 3 02:05:37 nodec35 supervisor-config: contrail-device-manager: stopped
Nov 3 02:05:37 nodec35 supervisor-config: contrail-schema: stopped
Nov 3 02:05:37 nodec35 supervisor-config: contrail-svc-monitor: stopped
Nov 3 02:05:38 nodec35 supervisor-config: Shut down
Nov 3 02:05:38 nodec35 systemd: Starting SYSV: Supervisord instance for Contrail Config Package...
Nov 3 02:05:38 nodec35 supervisor-config: Starting Supervisor Daemon for Contrail Config Package
Nov 3 02:05:38 nodec35 systemd: Started SYSV: Supervisord instance for Contrail Config Package.
Nov 3 02:05:41 nodec35 systemd: Stopping SYSV: Supervisord instance for Contrail Config Package...
Nov 3 02:05:44 nodec35 puppet-agent[2363]: python exec_provision_control.py --api_server_ip "192.168.100.3" --api_server_port 8082 --host_name_list "nodec33,nodec35" --host_ip_list "192.168.100.3,192.168.100.4" --router_asn "64512" --mt_options "admin,contrail123,admin" && echo exec-provision-control >> /etc/contrail/contrail_config_exec.out returned 1 instead of one of [0]
Nov 3 02:05:44 nodec35 puppet-agent[2363]: (/Stage[config]/Contrail::Exec_provision_control/Exec[exec-provision-control]/returns) change from notrun to 0 failed: python exec_provision_control.py --api_server_ip "192.168.100.3" --api_server_port 8082 --host_name_list "nodec33,nodec35" --host_ip_list "192.168.100.3,192.168.100.4" --router_asn "64512" --mt_options "admin,contrail123,admin" && echo exec-provision-control >> /etc/contrail/contrail_config_exec.out returned 1 instead of one of [0]
Nov 3 02:05:44 nodec35 puppet-agent[2363]: (/Stage[config]/Contrail::Exec_provision_control/Notify[execut...


information type: Proprietary → Public
Changed in juniperopenstack:
milestone: r3.1.1.0 → none
Abhay Joshi (abhayj)
Changed in juniperopenstack:
assignee: Abhay Joshi (abhayj) → Nitish Krishna Kaveri (nitishk)
Revision history for this message
Nitish Krishna Kaveri (nitishk) wrote :

The config DB separation feature is not new, but as of build 37 it has been made the DEFAULT behavior.

There seems to be an issue with the distribution of roles in the cluster. According to the bug, testbed.py has the following info:
host1 = 'root@10.204.221.58'
host2 = 'root@10.204.221.59'
host3 = 'root@10.204.221.60'
host4 = 'root@10.204.221.61'

env.roledefs = {
    'all': [host1, host2, host3, host4],
    'cfgm': [host1, host2],
    'openstack': [host1],
    'control': [host1,host2],
    'compute': [host3,host4],
    'collector': [host2],
    'webui': [host2],
    'database': [host2],
    'build': [host_build],
}

There are two config nodes:

One config node host1: There is only config role

While on config node host2: There is config, analytics, webui and database roles

According to Ignatious, this role distribution is NOT a supported scenario, and this is also why provisioning was failing. I had made the assumption in the provisioning code that any config node without a DB role defined would get a DB provisioned on it. This logic kicked in on one of the config nodes but not the other, so the DB list looked different on the two nodes.

The two supported scenarios are:

ALL config nodes also have the database, webui and control roles. There also has to be an ODD number of config nodes (for Zookeeper).
NONE of the config nodes have the database or analytics roles. Again, there should be an ODD number of config nodes and an odd number of analytics nodes.

Can you please change the test setup to reflect this supported distribution?
The roledefs for the two supported scenarios are pasted below:

1.

env.roledefs = {
    'all': [host1, host2, host3, host4],
    'cfgm': [host2],
    'openstack': [host1],
    'control': [host2],
    'compute': [host3,host4],
    'collector': [host2],
    'webui': [host2],
    'database': [host2],
    'build': [host_build],
}

2.

env.roledefs = {
    'all': [host1, host2, host3, host4],
    'cfgm': [host2],
    'openstack': [host1],
    'control': [host2],
    'compute': [host3,host4],
    'collector': [host1],
    'webui': [host2],
    'database': [host1],
    'build': [host_build],
}
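
The two supported scenarios described above can be checked mechanically before provisioning. The sketch below is one simplified reading of those rules and is not Server Manager's own validation: the function name is hypothetical, it only looks at the 'cfgm', 'database' and 'collector' roles, and it does not cover the external-database case. The 'build' role is left out because host_build is not defined in this report.

```python
def check_role_distribution(roledefs):
    """Return a list of problems; an empty list means the layout looks supported."""
    cfgm = set(roledefs.get("cfgm", []))
    database = set(roledefs.get("database", []))
    collector = set(roledefs.get("collector", []))
    problems = []

    # Both scenarios require an odd number of config nodes.
    if len(cfgm) % 2 == 0:
        problems.append("config node count must be odd, got %d" % len(cfgm))

    # A role must be on all config nodes or on none of them, never on a subset.
    for role, members in (("database", database), ("collector", collector)):
        shared = cfgm & members
        if shared and shared != cfgm:
            problems.append("%s role is on only some config nodes: missing on %s"
                            % (role, sorted(cfgm - members)))

    # Fully separated layout: analytics node count must be odd as well.
    if not (cfgm & database) and not (cfgm & collector) and len(collector) % 2 == 0:
        problems.append("analytics node count must be odd, got %d" % len(collector))
    return problems


host1, host2 = "root@10.204.221.58", "root@10.204.221.59"

reported = {"cfgm": [host1, host2], "database": [host2], "collector": [host2]}
scenario_1 = {"cfgm": [host2], "database": [host2], "collector": [host2]}
scenario_2 = {"cfgm": [host2], "database": [host1], "collector": [host1]}

for name, layout in (("reported", reported), ("1", scenario_1), ("2", scenario_2)):
    print(name, check_role_distribution(layout))
```

With these rules the layout from the bug description is flagged, while both pasted roledefs pass.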

Revision history for this message
Nitish Krishna Kaveri (nitishk) wrote :

If you want to have an even number of config nodes, you can point to an external database node (again, there needs to be an odd number of these database nodes).
To do this, simply set config::manage_db to false in the cluster parameters, at the line linked below:
https://github.com/Juniper/contrail-server-manager/blob/master/src/client/new-cluster.json#L52
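
As an illustration of flipping that flag in a cluster definition before submitting it, the helper below sets a nested entry addressed by a '::'-separated key. It is a sketch only: the helper name and the sample dictionary are hypothetical, and the exact nesting of the parameter inside the cluster JSON should be taken from the linked new-cluster.json rather than from this example.

```python
import json


def set_param(params, path, value, sep="::"):
    """Set a nested dict entry addressed by an 'a::b' style key, creating levels as needed."""
    keys = path.split(sep)
    node = params
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return params


# Hypothetical parameter fragment; the real file has more levels and more keys.
contrail_params = {"config": {"manage_db": True}}

set_param(contrail_params, "config::manage_db", False)
print(json.dumps(contrail_params, sort_keys=True))
```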

Revision history for this message
Abhay Joshi (abhayj) wrote :

Sundar,

Reassigning the bug to you to verify the two combinations that Nitish has listed as valid. The configuration you used is not valid. Please close the bug, or report back if you see the issue with the valid combinations.

Changed in juniperopenstack:
assignee: Nitish Krishna Kaveri (nitishk) → Ritam Gangopadhyay (ritam)
Revision history for this message
Ritam Gangopadhyay (ritam) wrote :

Both the inclusive and the exclusive config/DB combinations work.
We still need a resolution for the supervisor-config process kills seen in the single-node setup.
Provisioning time has increased from 2800-3000 sec to close to an hour for the multi-interface setup. A possible solution for this would be good.
Removing the blocker status.

tags: removed: blocker
Changed in juniperopenstack:
assignee: Ritam Gangopadhyay (ritam) → Nitish Krishna Kaveri (nitishk)
Revision history for this message
Nitish Krishna Kaveri (nitishk) wrote :

In single-node provisioning:

I see provisioning being slowed down by the ceilometer agent central and ceilometer collector processes failing continuously and restarting.

I believe this is due to some missing config in the sequence in which the ceilometer components are provisioned.

I am not seeing the error below:

Nov 2 01:31:43 nodec35 kernel: [ 3332.251126] init: supervisor-config main process (23426) killed by TERM signal
Nov 2 01:32:00 nodec35 kernel: [ 3349.896544] init: supervisor-config main process (23710) killed by TERM signal
Nov 2 01:32:18 nodec35 kernel: [ 3367.621155] init: supervisor-config main process (23856) killed by TERM signal
Nov 2 01:32:37 nodec35 kernel: [ 3386.652796] init: supervisor-config main process (23999) killed by TERM signal
Nov 2 01:32:55 nodec35 kernel: [ 3404.709807] init: supervisor-config main process (24146) killed by TERM signal
Nov 2 01:33:13 nodec35 kernel: [ 3422.397463] init: supervisor-config main process (24292) killed by TERM signal
Nov 2 01:33:31 nodec35 kernel: [ 3439.998079] init: supervisor-config main process (24638) killed by TERM signal

Debugging further with Megh.

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R3.1

Review in progress for https://review.opencontrail.org/25749
Submitter: Nitish Krishna Kaveri (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R3.2

Review in progress for https://review.opencontrail.org/25750
Submitter: Nitish Krishna Kaveri (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote :

Review in progress for https://review.opencontrail.org/25751
Submitter: Nitish Krishna Kaveri (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/25752
Submitter: Nitish Krishna Kaveri (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/25749
Committed: http://github.org/Juniper/contrail-puppet/commit/a951d15596c3e54dc7c02f9c5b6500452fbca35b
Submitter: Zuul
Branch: R3.1

commit a951d15596c3e54dc7c02f9c5b6500452fbca35b
Author: nitishkrishna <email address hidden>
Date: Fri Nov 4 17:47:40 2016 -0700

Closes-Bug: #1638814 - Ceilometer processes restart too much

The ordering was missing and rpc_backend being removed also caused too many restarts

Closes-Bug: #1612774 - Ceilometer HA support

Needs integration with tooz + couple missing params

Change-Id: I8640c01881836b8c61c4943dd8003d08def6b647

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/25752
Committed: http://github.org/Juniper/contrail-puppet/commit/3a91ab74e0882bd6d5a016e452965d9184a7c309
Submitter: Zuul
Branch: master

commit 3a91ab74e0882bd6d5a016e452965d9184a7c309
Author: nitishkrishna <email address hidden>
Date: Fri Nov 4 18:02:28 2016 -0700

Closes-Bug: #1638814 - Ceilometer processes restart too much

The ordering was missing and rpc_backend being removed also caused too many restarts

Change-Id: Ib185435f2e3bc2be35e403c9387763956387c26e

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/25751
Committed: http://github.org/Juniper/contrail-puppet/commit/260e1939e76d7d779ab1535eb3f897209d98f106
Submitter: Zuul
Branch: R3.2

commit 260e1939e76d7d779ab1535eb3f897209d98f106
Author: nitishkrishna <email address hidden>
Date: Fri Nov 4 18:02:28 2016 -0700

Closes-Bug: #1638814 - Ceilometer processes restart too much

The ordering was missing and rpc_backend being removed also caused too many restarts

Change-Id: Ib185435f2e3bc2be35e403c9387763956387c26e
