Contrail Ansible Deployer: schema remains in initializing state, (Database:Cassandra[] connection down)

Bug #1801474 reported by Prachi Yadav
This bug affects 2 people
Affects              Status            Importance   Assigned to        Milestone
Juniper Openstack    Status tracked in Trunk
  R5.0               Fix Committed     High         Nagendra Prasath
  Trunk              Fix Committed     High         Nagendra Prasath

Bug Description

Observed on builds: 5.0-322, 5.0-326 and 5.0-330

Controller node IPs : 10.87.118.141, 10.87.118.142 and 10.87.118.144

Contrail-status output:

== Contrail control ==
control: active
nodemgr: active
named: active
dns: active

== Contrail config-database ==
nodemgr: active
zookeeper: active
rabbitmq: active
cassandra: active

== Contrail database ==
kafka: active
nodemgr: active
zookeeper: active
cassandra: active

== Contrail analytics ==
snmp-collector: active
query-engine: active
api: active
alarm-gen: active
nodemgr: active
collector: active
topology: active

== Contrail webui ==
web: active
job: active

== Contrail config ==
svc-monitor: backup
nodemgr: active
device-manager: backup
api: active
schema: initializing (Database:Cassandra[] connection down)

Prachi Yadav (pryadav)
tags: added: contrail-networking sanity
Jeba Paulaiyan (jebap)
tags: added: blocker
Nagendra Prasath (npchandran) wrote :

Restart contrail-schema to proceed. This shouldn't be a blocker.

Prachi Yadav (pryadav) wrote :

Build: 5.0-333
sku: queens

The provisioning looks good on the latest build.

[root@controller-0-de42d58b7729422095b955676d449053 logs]# contrail-status
Pod              Service         Original Name                          State    Status
                 redis           contrail-external-redis                running  Up 13 hours
analytics        alarm-gen       contrail-analytics-alarm-gen           running  Up 13 hours
analytics        api             contrail-analytics-api                 running  Up 13 hours
analytics        collector       contrail-analytics-collector           running  Up 13 hours
analytics        nodemgr         contrail-nodemgr                       running  Up 13 hours
analytics        query-engine    contrail-analytics-query-engine        running  Up 13 hours
analytics        snmp-collector  contrail-analytics-snmp-collector      running  Up 13 hours
analytics        topology        contrail-analytics-topology            running  Up 13 hours
config           api             contrail-controller-config-api         running  Up 13 hours
config           device-manager  contrail-controller-config-devicemgr   running  Up 13 hours
config           nodemgr         contrail-nodemgr                       running  Up 13 hours
config           schema          contrail-controller-config-schema      running  Up 13 hours
config           svc-monitor     contrail-controller-config-svcmonitor  running  Up 13 hours
config-database  cassandra       contrail-external-cassandra            running  Up 13 hours
config-database  nodemgr         contrail-nodemgr                       running  Up 13 hours
config-database  rabbitmq        contrail-external-rabbitmq             running  Up 13 hours
config-database  zookeeper       contrail-external-zookeeper            running  Up 13 hours
control          control         contrail-controller-control-control    running  Up 13 hours
control          dns             contrail-controller-control-dns        running  Up 13 hours
control          named           contrail-controller-control-named      running  Up 13 hours
control          nodemgr         contrail-nodemgr                       running  Up 13 hours
database         cassandra       contrail-external-cassandra            running  Up 13 hours
database         kafka           contrail-external-kafka                running  Up 13 hours
database         nodemgr         contrail-nodemgr                       running  Up 13 hours
database         zookeeper       contrail-external-zookeeper            running  Up 13 hours
webui            job             contrail-controller-webui-job          running  Up 13 hours
webui            web             contrail-controller-webui-web          running  Up 13 hours

WARNING: container with original name 'contrail-external-redis' have Pod or Service empty. Pod: '' / Service: 'redis'. Please pass NODE_TYPE with pod name to container's env

== Contrail control ==
control: active
nodemgr: active
named: active
dns: active

== Contrail config-database ==
nodemgr: active
zookeeper: active
rabbitmq: active
cassandra: active

== Contrail database ==
kaf...


Jeba Paulaiyan (jebap)
tags: removed: blocker
tags: added: releasenote
manishkn (manishkn) wrote :

Hi, which is the latest good build to use for fabric?

Thanks
Manish Krishnan

OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/47491
Submitter: Shivayogi Ugaji (<email address hidden>)

alok kumar (kalok) wrote :

This is also seen on an RHOSP13 setup with 5.0.2 build 341; restarting schema fixed the issue.

Jeba Paulaiyan (jebap) wrote :

Notes:

After provisioning, if contrail-status shows schema in the initializing state with a "Database:Cassandra[] connection down" error, restarting the schema service recovers from the error.
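The check described in this note can be scripted. The following is a minimal sketch, not part of Contrail: it parses only the "== Contrail <role> ==" summary blocks in the format pasted in this report, and the names find_unhealthy, SECTION_RE and SERVICE_RE are hypothetical. It assumes "active" and "backup" are the only healthy states, which matches the output shown here but is not confirmed for every release.

```python
import re

# Matches section headers such as "== Contrail config ==".
SECTION_RE = re.compile(r"^== Contrail (?P<role>[\w-]+) ==$")
# Matches "name: state" with an optional "(detail)" suffix.
SERVICE_RE = re.compile(
    r"^(?P<name>[\w-]+): (?P<state>\w+)(?: \((?P<detail>.*)\))?$"
)


def find_unhealthy(status_text, healthy=("active", "backup")):
    """Return (role, service, state, detail) for services not in a healthy state.

    Assumption: 'active' and 'backup' are treated as healthy, as in the
    contrail-status output pasted in this report.
    """
    role = None
    problems = []
    for raw in status_text.splitlines():
        line = raw.strip()
        section = SECTION_RE.match(line)
        if section:
            role = section.group("role")
            continue
        service = SERVICE_RE.match(line)
        # Ignore lines before the first section header (for example the
        # per-container table, which has no "name: state" shape).
        if service and role and service.group("state") not in healthy:
            problems.append(
                (role, service.group("name"),
                 service.group("state"), service.group("detail"))
            )
    return problems


SAMPLE = """\
== Contrail config ==
svc-monitor: backup
nodemgr: active
device-manager: backup
api: active
schema: initializing (Database:Cassandra[] connection down)
"""

if __name__ == "__main__":
    for role, name, state, detail in find_unhealthy(SAMPLE):
        print(f"{role}/{name}: {state} ({detail})")
```

If the schema entry is reported, restart the schema service as described above; how the service is restarted depends on the deployment and is intentionally left out of the sketch.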

Jeba Paulaiyan (jebap)
information type: Proprietary → Public
tags: added: sanityblocker
removed: sanity
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/47491
Committed: http://github.com/Juniper/contrail-controller/commit/6a29730679db2375c0eee7006de2665b53cf979d
Submitter: Zuul v3 CI (<email address hidden>)
Branch: master

commit 6a29730679db2375c0eee7006de2665b53cf979d
Author: Shivayogi Ugaji <email address hidden>
Date: Mon Nov 5 22:07:18 2018 -0800

The db_resync_done lock tells the amqp thread to wait for resync to
complete. When SchemaTransformer.destroy_instance() is called because of
a Cassandra connection failure, this lock remains held and blocks
destroy_instance: destroy_instance calls _vnc_subscribe_callback to drain
the amqp queue, which waits indefinitely for db_resync_done to be released.
This fix releases the db_resync_done lock so that destroy_instance doesn't
get blocked.

Change-Id: Ic81fd0acda0fd4d43b3bfd061b47527b150b9096
Closes-Bug: #1801474
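
The hang described in this commit message can be reproduced in miniature. The sketch below is an illustration only, not the contrail-controller code: the class TransformerSketch, the CassandraDown exception and the release_on_failure flag are hypothetical, and it uses the standard library's threading.Lock where the real service may use a different lock type. A timeout stands in for the indefinite wait so the example terminates.

```python
import threading


class CassandraDown(Exception):
    """Hypothetical stand-in for a database connection failure."""


class TransformerSketch:
    """Toy model of the resync lock and the teardown path."""

    def __init__(self, release_on_failure):
        # release_on_failure=True models the behaviour after the fix.
        self.release_on_failure = release_on_failure
        self.db_resync_done = threading.Lock()
        self.drained = None

    def _resync(self):
        # Assumption for the sketch: the database is always unreachable.
        raise CassandraDown("connection down")

    def start(self):
        # The lock is held while resync is in progress so that the
        # message-queue consumer waits for it.
        self.db_resync_done.acquire()
        try:
            self._resync()
            self.db_resync_done.release()
        except CassandraDown:
            if self.release_on_failure:
                # The fix: give the lock back before tearing down.
                self.db_resync_done.release()
            self.destroy_instance()

    def _drain_queue(self, timeout):
        # The real callback waits without a timeout; the timeout here
        # only lets the sketch report the hang instead of hanging.
        if not self.db_resync_done.acquire(timeout=timeout):
            return False
        self.db_resync_done.release()
        return True

    def destroy_instance(self, timeout=0.2):
        self.drained = self._drain_queue(timeout)
        return self.drained


if __name__ == "__main__":
    for fixed in (False, True):
        sketch = TransformerSketch(release_on_failure=fixed)
        sketch.start()
        print("fixed" if fixed else "unfixed", "-> drained:", sketch.drained)
```

Without the release, the drain step never obtains the lock, which is consistent with the schema service staying in the initializing state until it is restarted.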

OpenContrail Admin (ci-admin-f) wrote : [Review update] R5.0

Review in progress for https://review.opencontrail.org/48899
Submitter: Nagendra Prasath (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/48899
Committed: http://github.com/Juniper/contrail-controller/commit/f4fdbf871d088907475c88069d46a3e23aadbf3d
Submitter: Zuul v3 CI (<email address hidden>)
Branch: R5.0

commit f4fdbf871d088907475c88069d46a3e23aadbf3d
Author: Shivayogi Ugaji <email address hidden>
Date: Mon Nov 5 22:07:18 2018 -0800

The db_resync_done lock tells the amqp thread to wait for resync to
complete. When SchemaTransformer.destroy_instance() is called because of
a Cassandra connection failure, this lock remains held and blocks
destroy_instance: destroy_instance calls _vnc_subscribe_callback to drain
the amqp queue, which waits indefinitely for db_resync_done to be released.
This fix releases the db_resync_done lock so that destroy_instance doesn't
get blocked.

Change-Id: Ic81fd0acda0fd4d43b3bfd061b47527b150b9096
Closes-Bug: #1801474
(cherry picked from commit 6a29730679db2375c0eee7006de2665b53cf979d)

OpenContrail Admin (ci-admin-f) wrote : [Review update] R6.0-WIP

Review in progress for https://review.opencontrail.org/49393
Submitter: Mateusz Neumann (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/49393
Committed: http://github.com/Juniper/contrail-controller/commit/77df3b58265b3fab414dfbc00e1ff39d19f0a99c
Submitter: Zuul v3 CI (<email address hidden>)
Branch: R6.0-WIP

commit 77df3b58265b3fab414dfbc00e1ff39d19f0a99c
Author: Shivayogi Ugaji <email address hidden>
Date: Mon Nov 5 22:07:18 2018 -0800

Apply commits from master onto R6.0-WIP

The db_resync_done lock tells the amqp thread to wait for resync to
complete. When SchemaTransformer.destroy_instance() is called because of
a Cassandra connection failure, this lock remains held and blocks
destroy_instance: destroy_instance calls _vnc_subscribe_callback to drain
the amqp queue, which waits indefinitely for db_resync_done to be released.
This fix releases the db_resync_done lock so that destroy_instance doesn't
get blocked.
Closes-Bug: #1801474

[DM] Hitless image upgrade implementation
Closes-Bug: #1799322

Provisioner for the devicemanager node.
usage:
from /opt/contrail/utils
python provision_devicemgr_node.py --host_name aio --host_ip 10.87.82.2
--oper add --admin_user admin --admin_password contrail123 --admin_tenant_name
admin --openstack_ip 10.87.82.2 --api_server_ip 10.87.82.2
Closes-Bug: #1805303

CFM: Changes for onboarding L3PNF
- Add new platform SRX240
- Add L3PNF subnet in schema
- Add new namespace, VN and IPAM for L3PNF during brownfield onboarding
Closes-Bug: 1800701

Add entrypoint to vrouter-agent service on Windows
Introduce entrypoint for agent similar in design to that from
microservice deployment. For now it will only start agent,
actual features will be added in following changes.
Partial-Bug: #1806677

Check build dependencies for tbb, SimpleAmqpClient and rabbitmq
Closes-Bug: #1806719

Make agent's entrypoint update agent's config on Windows
In future we will generate the whole config from scratch
as on Linux, but for now we only update the vhost's ifname.
It's the only field that can change upon restart.
Partial-Bug: #1806677

bgp-peer selection support for bgpaas
1. Listener BgpRouterConfig is added for BgpRouter and ControlNodeZone
2. BgpRouterConfig builds BgpRouterTree and ControlNodeZoneTree
from IFMapNode
3. BGPaaS gets BgpRouter for configured ControlNodeZone from
BgpRouterConfig and Updates bgp-peer-ip and bgp-peer-port in
the flow.
4. Step 3 is followed for xmpp based peer-selection also.
5. BGPaaS sandesh is updated with primary_control_node_zone,
secondary_control_node_zone, bgp_peer_ip and bgp_peer_port
Partial-bug: #1775872

[DM] Inside-outside workflow - lag/mH
1. Change the existing business logic to adhere to the new data model for lag/mH workflow
2. Multi-vlan support
Partial-Bug: #1799329

Rework nodemgr before fixing ntp issue
- move Windows/Linux code to separate classes instead of repeating the same condition throughout the code
- simplify main.py
- remove duplicated code
Closes-Bug: 1800704

[fabric] Added playbook retry support to job manager
1) When a playbook returns retry_devices in its output, the job manager retries the playbook against those devices
2) remove obsolete playbooks from 5.0
3) remove obsolete ansible roles from 5.0
4) added a warning log on missing loopback interface when ...
