Vcenter-as-compute: R4.1 ocata provisioning fails due to issue installation fluentd container

Bug #1731596 reported by Sarath
This bug affects 1 person
Affects             Status          Importance  Assigned to   Milestone
Juniper Openstack   (status tracked in Trunk)
  R4.0              Fix Committed   Critical    Ramprakash R
  R4.1              Fix Committed   Critical    Ramprakash R
  Trunk             Fix Committed   Critical    Ramprakash R

Bug Description

Version: 4.1.0.0-37-ocata
Topology: 3-node HA with multiple computes (multi-cluster ESXi) and KVM

Showed the problem setup to Ram; he triaged it and found it to be related to the issue below:
https://ask.openstack.org/en/question/111005/kolla-ansible-pike-installation-fluentd-container-in-restarting-mode/

>>> Failures from the SM debug.log are below

"2017-11-10 17:22:46,634-INFO-sm_ansible_callback.py:53-append(): TASK [mariadb : Waiting for MariaDB service to be ready]"
"2017-11-10 17:23:12,523-DEBUG-server_mgr_main.py:894-validate_smgr_get(): match key returned: {'cluster_id': ['cluster-vcenter-compute']}"
"2017-11-10 17:23:12,524-DEBUG-server_mgr_main.py:896-validate_smgr_get(): select keys: None"
"2017-11-10 17:23:12,905-DEBUG-server_mgr_main.py:894-validate_smgr_get(): match key returned: {'cluster_id': ['cluster-vcenter-compute']}"
"2017-11-10 17:23:12,906-DEBUG-server_mgr_main.py:896-validate_smgr_get(): select keys: None"
"2017-11-10 17:23:20,882-INFO-server_mgr_ssh_client.py:65-connect(): CONNECT SUCCESS: Host => 10.87.36.15, option => key"
"2017-11-10 17:23:47,665-INFO-server_mgr_ssh_client.py:60-connect(): CONNECT FAILED: Host => 10.87.36.19, option => password, ERROR => [Errno None] Unable to connect to port 22 on 10.87.36.19"
"2017-11-10 17:23:47,666-ERROR-server_mgr_mon_base_plugin.py:465-copy_ssh_keys_to_server(): COPY-KEYS: Host : contrailvm-5a10s27 Try: 59: ERROR Copying Keys: [Errno None] Unable to connect to port 22 on 10.87.36.19"
"2017-11-10 17:23:48,721-INFO-server_mgr_ssh_client.py:60-connect(): CONNECT FAILED: Host => 10.87.36.21, option => password, ERROR => [Errno None] Unable to connect to port 22 on 10.87.36.21"
"2017-11-10 17:23:48,721-ERROR-server_mgr_mon_base_plugin.py:465-copy_ssh_keys_to_server(): COPY-KEYS: Host : contrailvm-5a10s31 Try: 59: ERROR Copying Keys: [Errno None] Unable to connect to port 22 on 10.87.36.21"
"2017-11-10 17:23:49,721-INFO-server_mgr_ssh_client.py:60-connect(): CONNECT FAILED: Host => 10.87.36.20, option => password, ERROR => [Errno None] Unable to connect to port 22 on 10.87.36.20"
"2017-11-10 17:23:49,722-ERROR-server_mgr_mon_base_plugin.py:465-copy_ssh_keys_to_server(): COPY-KEYS: Host : contrailvm-5a10s25 Try: 59: ERROR Copying Keys: [Errno None] Unable to connect to port 22 on 10.87.36.20"
"2017-11-10 17:23:58,891-INFO-server_mgr_ssh_client.py:65-connect(): CONNECT SUCCESS: Host => 10.87.36.12, option => key"
"2017-11-10 17:24:39,741-INFO-server_mgr_ssh_client.py:60-connect(): CONNECT FAILED: Host => 10.87.36.21, option => key, ERROR => [Errno None] Unable to connect to port 22 on 10.87.36.21"
"2017-11-10 17:24:53,319-DEBUG-server_mgr_main.py:894-validate_smgr_get(): match key returned: {'cluster_id': ['cluster-vcenter-compute']}"
"2017-11-10 17:24:53,319-DEBUG-server_mgr_main.py:896-validate_smgr_get(): select keys: None"
"2017-11-10 17:24:53,711-DEBUG-server_mgr_main.py:894-validate_smgr_get(): match key returned: {'cluster_id': ['cluster-vcenter-compute']}"
"2017-11-10 17:24:53,711-DEBUG-server_mgr_main.py:896-validate_smgr_get(): select keys: None"
"2017-11-10 17:25:14,894-INFO-server_mgr_ssh_client.py:65-connect(): CONNECT SUCCESS: Host => 10.87.36.11, option => key"
"2017-11-10 17:25:50,665-INFO-server_mgr_ssh_client.py:60-connect(): CONNECT FAILED: Host => 10.87.36.19, option => password, ERROR => [Errno None] Unable to connect to port 22 on 10.87.36.19"
"2017-11-10 17:25:50,666-ERROR-server_mgr_mon_base_plugin.py:465-copy_ssh_keys_to_server(): COPY-KEYS: Host : contrailvm-5a10s27 Try: 60: ERROR Copying Keys: [Errno None] Unable to connect to port 22 on 10.87.36.19"
"2017-11-10 17:25:51,721-INFO-server_mgr_ssh_client.py:60-connect(): CONNECT FAILED: Host => 10.87.36.21, option => password, ERROR => [Errno None] Unable to connect to port 22 on 10.87.36.21"
"2017-11-10 17:25:51,722-ERROR-server_mgr_mon_base_plugin.py:465-copy_ssh_keys_to_server(): COPY-KEYS: Host : contrailvm-5a10s31 Try: 60: ERROR Copying Keys: [Errno None] Unable to connect to port 22 on 10.87.36.21"
"2017-11-10 17:25:52,721-INFO-server_mgr_ssh_client.py:60-connect(): CONNECT FAILED: Host => 10.87.36.20, option => password, ERROR => [Errno None] Unable to connect to port 22 on 10.87.36.20"
"2017-11-10 17:25:52,721-ERROR-server_mgr_mon_base_plugin.py:465-copy_ssh_keys_to_server(): COPY-KEYS: Host : contrailvm-5a10s25 Try: 60: ERROR Copying Keys: [Errno None] Unable to connect to port 22 on 10.87.36.20"
"2017-11-10 17:25:52,894-INFO-server_mgr_ssh_client.py:65-connect(): CONNECT SUCCESS: Host => 10.87.36.10, option => key"
"2017-11-10 17:26:33,745-INFO-server_mgr_ssh_client.py:60-connect(): CONNECT FAILED: Host => 10.87.36.20, option => key, ERROR => [Errno None] Unable to connect to port 22 on 10.87.36.20"
"2017-11-10 17:26:34,093-DEBUG-server_mgr_main.py:894-validate_smgr_get(): match key returned: {'cluster_id': ['cluster-vcenter-compute']}"
"2017-11-10 17:26:34,094-DEBUG-server_mgr_main.py:896-validate_smgr_get(): select keys: None"
"2017-11-10 17:26:34,464-DEBUG-server_mgr_main.py:894-validate_smgr_get(): match key returned: {'cluster_id': ['cluster-vcenter-compute']}"
"2017-11-10 17:26:34,466-DEBUG-server_mgr_main.py:896-validate_smgr_get(): select keys: None"
"2017-11-10 17:27:11,745-INFO-server_mgr_ssh_client.py:60-connect(): CONNECT FAILED: Host => 10.87.36.19, option => key, ERROR => [Errno None] Unable to connect to port 22 on 10.87.36.19"
"2017-11-10 17:27:46,898-INFO-server_mgr_ssh_client.py:65-connect(): CONNECT SUCCESS: Host => 10.87.36.18, option => key"
"2017-11-10 17:27:53,665-INFO-server_mgr_ssh_client.py:60-connect(): CONNECT FAILED: Host => 10.87.36.19, option => password, ERROR => [Errno None] Unable to connect to port 22 on 10.87.36.19"
"2017-11-10 17:27:53,666-ERROR-server_mgr_mon_base_plugin.py:465-copy_ssh_keys_to_server(): COPY-KEYS: Host : contrailvm-5a10s27 Try: 61: ERROR Copying Keys: [Errno None] Unable to connect to port 22 on 10.87.36.19"
"2017-11-10 17:27:54,721-INFO-server_mgr_ssh_client.py:60-connect(): CONNECT FAILED: Host => 10.87.36.21, option => password, ERROR => [Errno None] Unable to connect to port 22 on 10.87.36.21"
"2017-11-10 17:27:54,722-ERROR-server_mgr_mon_base_plugin.py:465-copy_ssh_keys_to_server(): COPY-KEYS: Host : contrailvm-5a10s31 Try: 61: ERROR Copying Keys: [Errno None] Unable to connect to port 22 on 10.87.36.21"
"2017-11-10 17:27:55,721-INFO-server_mgr_ssh_client.py:60-connect(): CONNECT FAILED: Host => 10.87.36.20, option => password, ERROR => [Errno None] Unable to connect to port 22 on 10.87.36.20"
"2017-11-10 17:27:55,722-ERROR-server_mgr_mon_base_plugin.py:465-copy_ssh_keys_to_server(): COPY-KEYS: Host : contrailvm-5a10s25 Try: 61: ERROR Copying Keys: [Errno None] Unable to connect to port 22 on 10.87.36.20"
"2017-11-10 17:28:14,894-DEBUG-server_mgr_main.py:894-validate_smgr_get(): match key returned: {'cluster_id': ['cluster-vcenter-compute']}"
"2017-11-10 17:28:14,895-DEBUG-server_mgr_main.py:896-validate_smgr_get(): select keys: None"
"2017-11-10 17:28:15,262-DEBUG-server_mgr_main.py:894-validate_smgr_get(): match key returned: {'cluster_id': ['cluster-vcenter-compute']}"
"2017-11-10 17:28:15,263-DEBUG-server_mgr_main.py:896-validate_smgr_get(): select keys: None"
"2017-11-10 17:28:24,900-INFO-server_mgr_ssh_client.py:65-connect(): CONNECT SUCCESS: Host => 10.87.36.15, option => key"
"2017-11-10 17:28:51,676-INFO-sm_ansible_callback.py:53-append(): fatal: [10.87.36.11]: FAILED! => (item - None) {"attempts": 10, "changed": false, "failed": true, "module_stderr": "Shared connection to 10.87.36.11 closed.\r\n", "module_stdout": "Traceback (most recent call last):\r\n File \"/tmp/ansible_bhgfLW/ansible_module_wait_for.py\", line 585, in <module>\r\n main()\r\n File \"/tmp/ansible_bhgfLW/ansible_module_wait_for.py\", line 525, in main\r\n response = s.recv(1024)\r\nsocket.error: [Errno 104] Connection reset by peer\r\n", "msg": "MODULE FAILURE", "rc": 0}"
"2017-11-10 17:28:51,921-INFO-sm_ansible_callback.py:53-append(): fatal: [10.87.36.12]: FAILED! => (item - None) {"attempts": 10, "changed": false, "failed": true, "module_stderr": "Shared connection to 10.87.36.12 closed.\r\n", "module_stdout": "Traceback (most recent call last):\r\n File \"/tmp/ansible_JRKczT/ansible_module_wait_for.py\", line 585, in <module>\r\n main()\r\n File \"/tmp/ansible_JRKczT/ansible_module_wait_for.py\", line 525, in main\r\n response = s.recv(1024)\r\nsocket.error: [Errno 104] Connection reset by peer\r\n", "msg": "MODULE FAILURE", "rc": 0}"
"2017-11-10 17:28:51,924-INFO-sm_ansible_utils.py:496-send_REST_request(): Sending post request to http://10.87.36.10:9002/ansible_status?server_id=10.87.36.10&state=provision_failed"
"2017-11-10 17:28:51,926-DEBUG-server_mgr_status.py:134-put_ansible_status(): Server status Data 5a10s31 provision_failed 2017_11_10__17_28_51"
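
The excerpt above mixes repeated SSH connection failures with the Ansible task failures that end the provisioning run. As a rough triage aid, the sketch below counts both per host. It assumes only the message shapes visible in this excerpt ("CONNECT SUCCESS/FAILED: Host => ..." and "fatal: [host]: FAILED!"); the function name `summarize` and the returned keys are made up for illustration and are not part of Server Manager.

```python
import re
from collections import Counter

# Patterns taken from the message shapes seen in the excerpt above.
CONNECT = re.compile(r"CONNECT (SUCCESS|FAILED): Host => (\S+?),")
FATAL = re.compile(r"fatal: \[([^\]]+)\]: FAILED!")


def summarize(lines):
    """Group SSH connect results and Ansible fatal errors by host."""
    connect_failed = Counter()  # host -> number of failed SSH attempts
    connect_ok = set()          # hosts with at least one successful SSH connect
    ansible_fatal = []          # hosts reported as fatal by Ansible, in log order
    for line in lines:
        connect = CONNECT.search(line)
        if connect:
            result, host = connect.groups()
            if result == "FAILED":
                connect_failed[host] += 1
            else:
                connect_ok.add(host)
        fatal = FATAL.search(line)
        if fatal:
            ansible_fatal.append(fatal.group(1))
    return {
        "connect_failed": dict(connect_failed),
        "connect_ok": sorted(connect_ok),
        "ansible_fatal": ansible_fatal,
    }


if __name__ == "__main__":
    # "debug.log" is an assumed local copy of the SM log.
    with open("debug.log") as handle:
        print(summarize(handle))
```

Run against this excerpt, such a summary keeps the hosts that refuse SSH on port 22 apart from the hosts where the MariaDB wait task failed, which makes it easier to see which errors belong to the provisioning failure.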

Sarath (nsarath)
summary:
- Vcenter-as-compute: R4.1 ocata provisioing fails due to issue installation fluentd container
+ Vcenter-as-compute: R4.1 ocata provisioning fails due to issue installation fluentd container
Sarath (nsarath) wrote :

/auto/cores/1731596$ ls
log.tar

Ramprakash R (ramprakash) wrote :

This looks like an issue with the upstream fluentd package. It should be resolved once we start building containers from packages in the contrail snapshot repos rather than pulling them from upstream. We can check builds made after the fix for bug #1729460 is merged. Please reopen if the issue is still seen on builds that include that fix.

Sarath (nsarath) wrote :

Ram, this is also seen on the latest R4.0 ocata build, and we had this working on vcenter-compute in R4.0.1#32.

Sure, let me verify with the latest build that has the fix for bug #1729460 merged.

Sarath (nsarath) wrote :

From: Ramprakash Ram Mohan
Sent: Wednesday, November 15, 2017 7:53 PM
To: Abhay Joshi <email address hidden>; Rudra Rugge <email address hidden>; Sachchidanand Vaidya <email address hidden>; Suresh Balineni <email address hidden>; Kiran KN <email address hidden>; Arvind Viswanathan <email address hidden>; Sarathbabu Narasimhan <email address hidden>; Ashok Singh R <email address hidden>; Vinod Nair <email address hidden>
Cc: Sachin Bansal <email address hidden>; Jeba Paulaiyan <email address hidden>; Sudheendra Rao <email address hidden>; Anish Mehta <email address hidden>; Contrail Eng Managers <email address hidden>; Sarathbabu Narasimhan <email address hidden>
Subject: Re: 4.1 release blockers

Update on bug #1731596 (Vcenter-as-compute: R4.1 ocata provisioning fails due to issue installation fluentd container):

We tested this bug after fixing bug #1729460, which we thought would also fix this issue, and found that the fluentd issue is still present.
It looks like it is caused by pulling in the package td-agent from a proprietary repo (hosted by treasuredata.com). This is done by the upstream kolla containers’ Dockerfile. I am looking into options to overcome this issue and will update with the findings.

Thanks,
Ram

Ramprakash R (ramprakash) wrote :

WORKAROUND:
~~~~~~~~~~~
This happens because the upstream package "td-agent", which is installed in the fluentd container, dropped support for some configuration that the kolla playbooks were using.

Please use the following value in the "kolla_globals" section of the cluster.json as a workaround until we fix the configuration to be compatible with the new version of "td-agent":

...
"kolla_globals": {
...
...
   "fluentd_image_full": "kolla/ubuntu-binary-fluentd:4.0.0"
...
...
}

NOTE: You will need internet access on the OpenStack nodes for this workaround to work.
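
For setups where cluster.json is edited by script, the workaround can be sketched as below. This is only an illustration: the thread does not show where "kolla_globals" sits inside cluster.json, so the helper searches the parsed document recursively instead of assuming a path. The function name `pin_fluentd_image` and the trimmed sample document are hypothetical.

```python
import json

# Image value taken from the workaround above.
FLUENTD_IMAGE = "kolla/ubuntu-binary-fluentd:4.0.0"


def pin_fluentd_image(node, image=FLUENTD_IMAGE):
    """Set fluentd_image_full in every "kolla_globals" dict under node.

    Returns the number of "kolla_globals" sections that were updated.
    """
    updated = 0
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "kolla_globals" and isinstance(value, dict):
                value["fluentd_image_full"] = image
                updated += 1
            else:
                updated += pin_fluentd_image(value, image)
    elif isinstance(node, list):
        for item in node:
            updated += pin_fluentd_image(item, image)
    return updated


if __name__ == "__main__":
    # Hypothetical, heavily trimmed cluster definition; real files hold more keys.
    sample = {"cluster": [{"id": "cluster-vcenter-compute",
                           "parameters": {"kolla_globals": {}}}]}
    if pin_fluentd_image(sample) == 0:
        raise SystemExit("no kolla_globals section found")
    print(json.dumps(sample, indent=2))
```

A return value of 0 means no "kolla_globals" section exists yet, in which case the section has to be added by hand in the place the cluster definition expects it.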

Abhay Joshi (abhayj) wrote :

The above workaround has been successfully verified in multiple setups. In 4.1.0 we will document the workaround. A proper fix for the bug will be worked on for 4.1.1.

IMPORTANT: For 4.1.0 the bug should be release-noted with the workaround specified in comment #5.

Jeba Paulaiyan (jebap)
tags: added: releasenote
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/37749
Submitter: Ramprakash R (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : [Review update] R4.0

Review in progress for https://review.opencontrail.org/37750
Submitter: Ramprakash R (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : [Review update] R4.1

Review in progress for https://review.opencontrail.org/37751
Submitter: Ramprakash R (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/37749
Committed: http://github.com/Juniper/contrail-ansible/commit/961db4b85114e0b9bb1f68670de96577a47dae3f
Submitter: Zuul (<email address hidden>)
Branch: master

commit 961db4b85114e0b9bb1f68670de96577a47dae3f
Author: Ramprakash Ram Mohan <email address hidden>
Date: Tue Nov 21 14:32:02 2017 -0800

Fix the deprecated syntax in 01-rewrite.conf

The current syntax of the Ubuntu fluentd rewrite rules is not supported
anymore; see commit [1].
According to the build of patch [2], CentOS has no such issue.
Only Ubuntu needs to be upgraded to use the <rule> section.

* CentOS uses 01-rewrite-0.12.conf.j2
* Ubuntu uses 01-rewrite-0.14.conf.j2

[1] fluent/fluent-plugin-rewrite-tag-filter@248ed8e
[2] https://review.openstack.org/#/c/517907

Manually patched from commit 419a2fc9fd206447f37fbcf62da8e15ab96f8530 of
github.com/openstack/kolla-ansible

Change-Id: I2e250a6c4f257e6ce6af7d13f518e282407d0ead
Closes-bug: #1731596
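
To make the syntax change concrete, here is a minimal sketch that rewrites one legacy "rewriteruleN <key> <pattern> <tag>" line of the rewrite_tag_filter plugin into the newer <rule> section with "key", "pattern" and "tag" entries. It is not the code from this commit, which swaps Jinja2 templates instead. The helper name `to_rule_section` and the sample rule are hypothetical, and negated patterns (a leading "!") are deliberately rejected rather than converted.

```python
import re

# Legacy form: rewriterule<N> <key> <pattern> <tag>
LEGACY_RULE = re.compile(r"^\s*rewriterule\d+\s+(\S+)\s+(.+?)\s+(\S+)\s*$")


def to_rule_section(line, indent="  "):
    """Convert one legacy rewriterule line into a <rule> section."""
    match = LEGACY_RULE.match(line)
    if match is None:
        raise ValueError("not a rewriterule line: %r" % line)
    key, pattern, tag = match.groups()
    if pattern.startswith("!"):
        # Negated patterns need different handling; out of scope for this sketch.
        raise ValueError("negated patterns are not handled in this sketch")
    return "\n".join([
        indent + "<rule>",
        indent * 2 + "key " + key,
        indent * 2 + "pattern " + pattern,
        indent * 2 + "tag " + tag,
        indent + "</rule>",
    ])


if __name__ == "__main__":
    # Hypothetical rule, not copied from the kolla templates.
    print(to_rule_section("rewriterule1 programname ^(nova-api)$ openstack.nova"))
```

Each converted section replaces the matching rewriterule line inside the same <match> block.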

OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/37751
Committed: http://github.com/Juniper/contrail-ansible/commit/913c36c9979ae4ff9f1bd98917652c6c90d397db
Submitter: Zuul (<email address hidden>)
Branch: R4.1

commit 913c36c9979ae4ff9f1bd98917652c6c90d397db
Author: Ramprakash Ram Mohan <email address hidden>
Date: Tue Nov 21 14:32:02 2017 -0800

Fix the deprecated syntax in 01-rewrite.conf

The current syntax of the Ubuntu fluentd rewrite rules is not supported
anymore; see commit [1].
According to the build of patch [2], CentOS has no such issue.
Only Ubuntu needs to be upgraded to use the <rule> section.

* CentOS uses 01-rewrite-0.12.conf.j2
* Ubuntu uses 01-rewrite-0.14.conf.j2

[1] fluent/fluent-plugin-rewrite-tag-filter@248ed8e
[2] https://review.openstack.org/#/c/517907

Manually patched from commit 419a2fc9fd206447f37fbcf62da8e15ab96f8530 of
github.com/openstack/kolla-ansible

Change-Id: I2e250a6c4f257e6ce6af7d13f518e282407d0ead
Closes-bug: #1731596

OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/37750
Committed: http://github.com/Juniper/contrail-ansible/commit/0dcc3b8287d0cd9ddf1d82a7bcb3a8a9576a3411
Submitter: Zuul (<email address hidden>)
Branch: R4.0

commit 0dcc3b8287d0cd9ddf1d82a7bcb3a8a9576a3411
Author: Ramprakash Ram Mohan <email address hidden>
Date: Tue Nov 21 14:32:02 2017 -0800

Fix the deprecated syntax in 01-rewrite.conf

The current syntax of the Ubuntu fluentd rewrite rules is not supported
anymore; see commit [1].
According to the build of patch [2], CentOS has no such issue.
Only Ubuntu needs to be upgraded to use the <rule> section.

* CentOS uses 01-rewrite-0.12.conf.j2
* Ubuntu uses 01-rewrite-0.14.conf.j2

[1] fluent/fluent-plugin-rewrite-tag-filter@248ed8e
[2] https://review.openstack.org/#/c/517907

Manually patched from commit 419a2fc9fd206447f37fbcf62da8e15ab96f8530 of
github.com/openstack/kolla-ansible

Change-Id: I2e250a6c4f257e6ce6af7d13f518e282407d0ead
Closes-bug: #1731596
