containers: WebUI shows two servers are down

Bug #1694048 reported by Andrey Pavlov
This bug affects 1 person
Affects              Status      Importance   Assigned to   Milestone
Juniper Openstack    (Status tracked in Trunk)
  R4.0               Won't Fix   Low          Unassigned
  Trunk              Invalid     Low          Unassigned

Bug Description

build 14, mitaka + trusty, deployed by Juju

analytics and analyticsdb each have an alarm that says 'ContrailConfig is missing or incorrect'.

A screenshot is attached.

Tags: provisioning
Andrey Pavlov (apavlov-e) wrote :
Jeba Paulaiyan (jebap)
tags: added: provisioning
Andrey Pavlov (apavlov-e) wrote :

There are no warnings/errors when the deployment has 3 controllers/analytics/analyticsdb.

Andrey Pavlov (apavlov-e) wrote :

Comment from Gokul:

1. Analytics and analyticsdb: the nodes are not registered with contrail and the status shows “Missing Contrail Config”. I tried registering them manually and they are properly displayed on the UI now:

curl -u admin:password http://localhost:8082/analytics-nodes

{"analytics-nodes": []}

curl -u admin:password http://localhost:8082/database-nodes

{"database-nodes": []}

python /usr/share/contrail-utils/provision_analytics_node.py --api_server_ip 172.31.15.186 --host_name ip-172-31-15-186 --host_ip 172.31.15.186 --oper add --admin_tenant_name admin --admin_password password --openstack_ip 18.220.180.11

python /usr/share/contrail-utils/provision_analytics_database.py --api_server_ip 172.31.15.186 --host_name ip-172-31-15-186 --host_ip 172.31.15.186 --oper add --admin_tenant_name admin --admin_password password --openstack_ip 18.220.180.11
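
The manual check above can be scripted. This is a minimal sketch, assuming the config API list responses have the shape shown in the curl output (a single key mapping to a list). The function name is hypothetical and is not part of contrail-utils.

```python
import json

# Collections queried in the curl commands above; an empty list means
# no node of that type is registered with the config API.
COLLECTIONS = ("analytics-nodes", "database-nodes")


def unregistered_collections(responses):
    """Return the collection names whose list response is empty or missing.

    `responses` maps a collection name to the raw JSON body returned by
    GET http://<api-server>:8082/<collection>.
    """
    missing = []
    for name in COLLECTIONS:
        body = json.loads(responses.get(name, "{}"))
        if not body.get(name):
            missing.append(name)
    return missing
```

Fed the two responses shown above, the function reports both collections as unregistered, which matches the "Missing Contrail Config" state seen in the UI.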

Santosh Gupta (sangupta) wrote :

Could you check whether it's provisioned correctly? This looks similar to bug 1716799.
Please check whether ContrailConfig is present in the raw data; see comment #1 in bug 1716799.

Andrey Pavlov (apavlov-e) wrote :

Hi Santosh,

As posted in comment #3, analytics and analyticsdb are not provisioned during deployment. I think this should be done by ansible-internal in all cases (not only for non-openstack deployments).

So I think that it's provisioned correctly and doesn't look similar to bug 1716799.

Santosh Gupta (sangupta) wrote :

Hi Andrey,
Which branch/image is this? I couldn't tell from build #14.
Also, when you try next, please use the latest build on that branch.
Comment #3 mentions that the analytics and analyticsdb nodes are not provisioned.
Did provisioning reach that step and bail out on an error? Please provide this info and the error message.

I need some more info to debug this:
combined json file
List of juju installation steps followed
Debug logs of the juju installation scripts
docker ps -a
docker logs controller
docker logs analytics
docker logs analyticsdb
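
The docker outputs requested above could be gathered in one pass. This is only a sketch: the container names are taken as written in the list and may differ on the actual host (check `docker ps -a`), and the helper names are hypothetical.

```python
import subprocess

# Container names as written in the request above; adjust them to the
# names that `docker ps -a` reports on the actual host.
CONTAINERS = ("controller", "analytics", "analyticsdb")


def diagnostic_commands(containers=CONTAINERS):
    """Build the requested docker commands as argv lists."""
    commands = [["docker", "ps", "-a"]]
    commands.extend(["docker", "logs", name] for name in containers)
    return commands


def collect(commands, run=subprocess.run):
    """Run each command and return {command string: combined output}."""
    results = {}
    for argv in commands:
        # `docker logs` replays the container's stdout and stderr, so
        # both streams are captured together.
        proc = run(argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                   universal_newlines=True)
        results[" ".join(argv)] = proc.stdout
    return results
```

Calling `collect(diagnostic_commands())` on the deployed host returns a dictionary that can be written to a file and attached to the bug.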

Could you please leave the box in that state and send me the login details?
Thanks

Andrey Pavlov (apavlov-e) wrote :

Hi Santosh,

I tried this on the latest release build 32 (R4.0) for Ubuntu 14.04 and saw the same behaviour.

I mentioned that analytics/analyticsdb is not provisioned. By that I meant that the ansible-internal code doesn't even try to provision them:
https://github.com/Juniper/contrail-ansible-internal/blob/R4.0/playbooks/roles/contrail/analytics/tasks/provision.yml#L10
So there is no error because there is no attempt.

Do you still need the environment?

Santosh Gupta (sangupta) wrote :

https://github.com/Juniper/contrail-ansible-internal/blob/R4.0/playbooks/roles/contrail/analytics/tasks/main.yml#L28

Hi Andrey,
Since you have the openstack orchestrator, we don't need to run it. You can see in "docker logs" that it's evaluated and skipped. You don't need to run this command manually.

ubuntu@ip-172-31-8-29:/etc/contrailctl$ sudo docker logs contrail-analytics | grep -A5 -B5 "register analytics"
ok: [localhost]

TASK [contrail/analytics : Wait till config api server answers] ****************
skipping: [localhost]

TASK [contrail/analytics : register analytics to config api server (non-openstack)] ***
skipping: [localhost]

PLAY RECAP *********************************************************************
localhost : ok=29 changed=3 unreachable=0 failed=0

ubuntu@ip-172-31-8-29:/etc/contrailctl$ sudo docker logs contrail-analyticsdb | grep -A5 -B5 "register analyticsdb"
ok: [localhost]

TASK [contrail/analyticsdb : Wait till config api server answers] **************
skipping: [localhost]

TASK [contrail/analyticsdb : register analyticsdb to config api server (non-openstack)] ***
skipping: [localhost]

PLAY RECAP *********************************************************************
localhost : ok=31 changed=2 unreachable=0 failed=0
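
The same information can be pulled out of a saved log without grep. This is a rough sketch, assuming single-host output in the format shown above, where each `TASK [...]` header is followed directly by its result line. The function name is hypothetical.

```python
import re

# Matches Ansible task headers such as:
# TASK [contrail/analytics : Wait till config api server answers] ****
TASK_RE = re.compile(r"^TASK \[(?P<name>.+?)\] \*+\s*$")


def skipped_tasks(log_text):
    """Return names of tasks whose first result line starts with 'skipping:'."""
    skipped = []
    current = None
    for line in log_text.splitlines():
        match = TASK_RE.match(line)
        if match:
            current = match.group("name")
        elif current and line.startswith("skipping:"):
            skipped.append(current)
            current = None
        elif line.strip():
            # Any other non-blank line ends the current task's result.
            current = None
    return skipped
```

Run against the two excerpts above, this would list both the "Wait till config api server answers" and the "register ... (non-openstack)" tasks as skipped.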

The alarm that you see is another manifestation of bug #1716799. I see the same on an ubuntu14 setup using ansible/openstack. We suspect there is a race condition where alarmgen gets notified of a UVE via kafka, but the UVE is not yet present in redis when alarmgen tries to read it. Restarting alarmgen should resolve it.
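
To illustrate the suspected race, a reader that is notified before the data is visible can tolerate the gap by retrying the read. The sketch below is not alarmgen's code. `read` is a hypothetical callable standing in for the redis lookup, and the retry counts are arbitrary.

```python
import time


def read_with_retry(read, key, attempts=5, delay=0.2):
    """Call read(key) until it returns a non-None value or attempts run out.

    Returns the value, or None if every attempt came back empty.
    """
    for attempt in range(attempts):
        value = read(key)
        if value is not None:
            return value
        if attempt < attempts - 1:
            # Give the writer time to make the data visible.
            time.sleep(delay)
    return None
```

If the data never appears within the retry window, the caller still gets `None` and would raise the same "missing" alarm, so this only narrows the window and does not remove the race.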

Andrey Pavlov (apavlov-e) wrote :

Hi Santosh,

I disagree that this bug is a duplicate. I don't see timing issues here: the servers are always down, in all 4.X builds.

I don't see the containers restarting alarm-gen.

Andrey Pavlov (apavlov-e) wrote :

Same thing on an HA setup with the latest release build (4.0.2 33); a screenshot is attached.

Another point: TripleO and OSPd do not have this problem because they call provision_analytics/provision_analyticsdb after deployment.

Santosh Gupta (sangupta) wrote :

Hi Andrey,
The setup that you provided me for debugging had all servers up.
The only issue it had was the “ContrailConfig missing or incorrect” alarm on 2 nodes. Restarting alarmgen removed the alarms on your box. Hence I marked it as a duplicate.
That is the only issue stated in your initial bug report.
If you see other issues, please open a separate bug so that we don’t chase different issues in one bug.

>>> I don't see timing issues here - Servers are down always. In all 4.X builds.
In my setup, SM and ansible based all-in-one systems come up fine for 4.0.2.0-33.
Please bring up systems using ansible and juju on ubuntu14.04 and then we can debug your system.

>>> Another point - TripleO and OSPd do not have such problem cause it calls provision_analytics/provision_analyticsdb after deployment.
Openstack orchestrator based handling is different from non-openstack orchestrators, as pointed out in comment #8.
But I use openstack based provisioning using SM/ansible and it comes up fine. I don’t see the servers down.
Please have the systems up and we can look into it. Thanks.

Andrey Pavlov (apavlov-e) wrote :

Hi Santosh,
setup is ready. WebUI - https://13.58.222.139:8143/
build 4.0.2.0-34
nodes are present but down.

Some time ago Gokul (if I remember the person correctly) told me that this issue happens because nothing calls provision_analytics and provision_analyticsdb.

P.S. Juju can't restart alarm-gen inside container.

Santosh Gupta (sangupta) wrote :

As discussed, the only issue we have is the 'ContrailConfig is missing or incorrect' alarm. We shall wait and verify it once bug #1716799 is resolved.

Santosh Gupta (sangupta) wrote :

bug#1716799 has some debugging code added. Please see if you can reproduce the issue on your setup.

Andrey Pavlov (apavlov-e) wrote :

@Santosh - which build can I use for it?

Santosh Gupta (sangupta) wrote :

vi /auto/cs-build/jenkins-jobs/CB-mainline-ubuntu14-mitaka/builds/83/archive/sandbox/controller/src/opserver/alarmgen.py
You can pick this build; the file has the changes made for bug #1716799.

Andrey Pavlov (apavlov-e) wrote :

Tried build 84 from mainline.
The analyticsdb container doesn't work.

TASK [cassandra : Configure Datastax-agent] ************************************
fatal: [localhost]: FAILED! => {"failed": true, "msg": "{{ opscenter_ip }}: 'opscenter_ip' is undefined"}
        to retry, use: --limit @/contrail-ansible-internal/playbooks/contrail_analyticsdb.retry

PLAY RECAP *********************************************************************
localhost : ok=34 changed=9 unreachable=0 failed=1
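
Ansible output captured from a container can carry ANSI colour sequences, which get in the way when checking the recap for failures. Below is a small sketch that strips them and reads the counters from one `PLAY RECAP` host line. The function name is hypothetical, and matching the literal text `ESC` is an assumption to cover logs where the escape byte was rendered as text.

```python
import re

# Colour sequences, either with a real escape byte or with the escape
# byte rendered as the literal text "ESC" by a log viewer.
ANSI_RE = re.compile(r"(?:\x1b|ESC)\[[0-9;]*m")
RECAP_RE = re.compile(r"^(?P<host>\S+)\s*:\s*(?P<counters>(?:\w+=\d+\s*)+)$")


def parse_recap_line(line):
    """Return (host, {counter: value}) for a PLAY RECAP host line, else None."""
    clean = ANSI_RE.sub("", line).strip()
    match = RECAP_RE.match(clean)
    if not match:
        return None
    counters = {}
    for pair in match.group("counters").split():
        key, value = pair.split("=")
        counters[key] = int(value)
    return match.group("host"), counters
```

A deployment check could then flag any host whose `failed` or `unreachable` counter is non-zero.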

Waiting for a new build.

Andrey Pavlov (apavlov-e) wrote :
information type: Proprietary → Public
Andrey Pavlov (apavlov-e) wrote :

This is not applicable to the master branch (trunk).
