Process failure seen on 2 of 3 controllers when deleting 2K VNs: contrail-device-manager, contrail-schema, and contrail-svc-monitor failed on the 2 controllers

Bug #1656115 reported by Arun Paul
Affects: OpenContrail
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

I have the following controllers and TSNs in the cluster, with HA enabled (I have attached the testbed.py):

host1 = 'root@10.94.63.102' >> Controller (leader)
host2 = 'root@10.94.63.103' >> Backup
host3 = 'root@10.94.63.133' >> Backup
host4 = 'root@10.94.191.150' >> TSN
host5 = 'root@10.94.191.151' >> TSN

Before the deletion, the controllers had 4K VNs:

VNs 1-2000 were configured under the alpha naming convention, and VNs 2001-4000 under the bravo naming convention.

I then deleted VNs 2001-4000. I started this on the evening of Jan 10 and saw process crashes on controller 1 and controller 2.

Below is the contrail-status output before and after the crash on all three controllers.

Before crash on controller 1

root@NTTC-Contrail-1:/opt/contrail/utils/fabfile/testbeds# contrail-status
== Contrail Control ==
supervisor-control: active
contrail-control active
contrail-control-nodemgr active
contrail-dns active
contrail-named active

== Contrail Analytics ==
supervisor-analytics: active
contrail-alarm-gen:0 active
contrail-analytics-api active
contrail-analytics-nodemgr active
contrail-collector active
contrail-query-engine active
contrail-snmp-collector active
contrail-topology active

== Contrail Config ==
supervisor-config: active
contrail-api:0 active
contrail-config-nodemgr active
contrail-device-manager backup
contrail-discovery:0 active
contrail-schema backup
contrail-svc-monitor backup
ifmap active

== Contrail Web UI ==
supervisor-webui: active
contrail-webui active
contrail-webui-middleware active

== Contrail Database ==
contrail-database: active

== Contrail Supervisor Database ==
supervisor-database: active
contrail-database-nodemgr active
kafka active

== Contrail Support Services ==
supervisor-support-service: active
rabbitmq-server active

After crash on controller 1

root@NTTC-Contrail-1:/opt/contrail/utils/fabfile/testbeds# contrail-status
== Contrail Control ==
supervisor-control: active
contrail-control active
contrail-control-nodemgr active
contrail-dns active
contrail-named active

== Contrail Analytics ==
supervisor-analytics: active
contrail-alarm-gen:0 active
contrail-analytics-api active
contrail-analytics-nodemgr active
contrail-collector active
contrail-query-engine active
contrail-snmp-collector active
contrail-topology active

== Contrail Config ==
supervisor-config: active
contrail-api:0 active
contrail-config-nodemgr active
contrail-device-manager failed
contrail-discovery:0 active
contrail-schema failed
contrail-svc-monitor failed
ifmap active

== Contrail Web UI ==
supervisor-webui: active
contrail-webui active
contrail-webui-middleware active

== Contrail Database ==
contrail-database: active

== Contrail Supervisor Database ==
supervisor-database: active
contrail-database-nodemgr active
kafka active

== Contrail Support Services ==
supervisor-support-service: active
rabbitmq-server active
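
The two controller 1 captures above differ only in the Contrail Config section. As a rough aid for comparing such captures, here is a minimal Python 3 sketch that parses two `contrail-status` text dumps and lists the services whose state changed. The function names `parse_status` and `changed_services` are hypothetical helpers, not part of Contrail, and the sample text is a shortened copy of the controller 1 output.

```python
def parse_status(text):
    """Map service name -> state from `contrail-status` text output."""
    states = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, "== Section ==" headers, and shell prompt lines.
        if not line or line.startswith("==") or "#" in line:
            continue
        parts = line.split()
        if len(parts) == 2:
            name, state = parts
            # "supervisor-config:" -> "supervisor-config"; "contrail-api:0" is kept as-is.
            states[name.rstrip(":")] = state
    return states


def changed_services(before, after):
    """Return {service: (old_state, new_state)} for services whose state differs."""
    old, new = parse_status(before), parse_status(after)
    return {s: (old[s], new[s]) for s in old if s in new and old[s] != new[s]}


# Shortened copies of the controller 1 "Contrail Config" sections.
BEFORE = """\
== Contrail Config ==
supervisor-config: active
contrail-api:0 active
contrail-device-manager backup
contrail-schema backup
contrail-svc-monitor backup
"""

AFTER = """\
== Contrail Config ==
supervisor-config: active
contrail-api:0 active
contrail-device-manager failed
contrail-schema failed
contrail-svc-monitor failed
"""

for service, (was, now) in sorted(changed_services(BEFORE, AFTER).items()):
    print("%s: %s -> %s" % (service, was, now))
```

Run against full captures, the same helper would also skip the "Run time service failures" banner and the core file paths, since those lines do not have the two-column "name state" shape.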

Before crash on Controller 2

root@NTTC-Contrail-2:/opt/contrail/utils/fabfile/testbeds# contrail-status
== Contrail Control ==
supervisor-control: active
contrail-control active
contrail-control-nodemgr active
contrail-dns active
contrail-named active

== Contrail Analytics ==
supervisor-analytics: active
contrail-alarm-gen:0 active
contrail-analytics-api active
contrail-analytics-nodemgr active
contrail-collector active
contrail-query-engine active
contrail-snmp-collector active
contrail-topology active

== Contrail Config ==
supervisor-config: active
contrail-api:0 active
contrail-config-nodemgr active
contrail-device-manager backup
contrail-discovery:0 active
contrail-schema backup
contrail-svc-monitor backup
ifmap active

== Contrail Web UI ==
supervisor-webui: active
contrail-webui active
contrail-webui-middleware active

== Contrail Database ==
contrail-database: active

== Contrail Supervisor Database ==
supervisor-database: active
contrail-database-nodemgr active
kafka active

== Contrail Support Services ==
supervisor-support-service: active
rabbitmq-server active

After crash on Controller 2

root@NTTC-Contrail-2:/opt/contrail/utils/fabfile/testbeds# contrail-status
== Contrail Control ==
supervisor-control: active
contrail-control active
contrail-control-nodemgr active
contrail-dns active
contrail-named active

== Contrail Analytics ==
supervisor-analytics: active
contrail-alarm-gen:0 active
contrail-analytics-api active
contrail-analytics-nodemgr active
contrail-collector active
contrail-query-engine active
contrail-snmp-collector active
contrail-topology active

== Contrail Config ==
supervisor-config: active
contrail-api:0 active
contrail-config-nodemgr active
contrail-device-manager backup
contrail-discovery:0 active
contrail-schema backup
contrail-svc-monitor failed
ifmap active

== Contrail Web UI ==
supervisor-webui: active
contrail-webui active
contrail-webui-middleware active

== Contrail Database ==
contrail-database: active

== Contrail Supervisor Database ==
supervisor-database: active
contrail-database-nodemgr active
kafka active

== Contrail Support Services ==
supervisor-support-service: active
rabbitmq-server active

No failure seen on controller 3

root@NTTC-Contrail-3:~# contrail-status
== Contrail Control ==
supervisor-control: active
contrail-control active
contrail-control-nodemgr active
contrail-dns active
contrail-named active

== Contrail Analytics ==
supervisor-analytics: active
contrail-alarm-gen:0 active
contrail-analytics-api active
contrail-analytics-nodemgr active
contrail-collector active
contrail-query-engine active
contrail-snmp-collector active
contrail-topology active

== Contrail Config ==
supervisor-config: active
contrail-api:0 active
contrail-config-nodemgr active
contrail-device-manager active
contrail-discovery:0 active
contrail-schema active
contrail-svc-monitor active
ifmap active

== Contrail Web UI ==
supervisor-webui: active
contrail-webui active
contrail-webui-middleware active

== Contrail Database ==
contrail-database: active

== Contrail Supervisor Database ==
supervisor-database: active
contrail-database-nodemgr active
kafka active

== Contrail Support Services ==
supervisor-support-service: active
rabbitmq-server active

========Run time service failures=============
/var/crashes/core.contrail-collec.1794.NTTC-Contrail-3.1483717342
/var/crashes/core.contrail-contro.1795.NTTC-Contrail-3.1483733782

root@NTTC-Contrail-3:/var/log/contrail# cd /var/crashes/
root@NTTC-Contrail-3:/var/crashes# ls -ltr
total 1206696
-rw------- 1 contrail contrail 1993863168 Jan 6 07:42 core.contrail-collec.1794.NTTC-Contrail-3.1483717342
-rw------- 1 contrail contrail 69795840 Jan 6 12:16 core.contrail-contro.1795.NTTC-Contrail-3.1483733782
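
To relate the core files to the timeline, here is a minimal Python 3 sketch that decodes the last field of each core file name. It assumes the name layout is `core.<process>.<pid>.<host>.<epoch>` and that the trailing number is a Unix timestamp in seconds; both are assumptions inferred from the names above, and `core_time` is a hypothetical helper.

```python
from datetime import datetime, timezone


def core_time(path):
    """Return (process, host, UTC datetime) parsed from a core file name."""
    name = path.rsplit("/", 1)[-1]
    # Assumed layout: core.<process>.<pid>.<host>.<epoch>
    _, process, _pid, host, epoch = name.split(".")
    return process, host, datetime.fromtimestamp(int(epoch), tz=timezone.utc)


CORES = [
    "/var/crashes/core.contrail-collec.1794.NTTC-Contrail-3.1483717342",
    "/var/crashes/core.contrail-contro.1795.NTTC-Contrail-3.1483733782",
]

for path in CORES:
    process, host, when = core_time(path)
    print(process, host, when.isoformat())
```

Under that assumption both timestamps fall on Jan 6 (15:42:22 and 20:16:22 UTC), which matches the Jan 6 dates in the `ls -ltr` listing if the host clock is at UTC-8, and is several days before the VN deletion started on Jan 10.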

The logs will be copied to

/volume/dcg-systest/PRS/PR###

Arun Paul (ampul) wrote :

The logs are copied to

@sp-ulnx2:/volume/dcg-systest/PRS/PR1656115> ls
controller1 controller2 controller3
@sp-ulnx2:/volume/dcg-systest/PRS/PR1656115>

The controller ip address is https://10.94.63.102:8143

I started deleting the VNs on Jan 10 at 17:07. The crash happened during the night of Jan 10, so I have copied the logs as of Jan 10 20:06 from the /var/log/contrail folder.

If you need any other logs, please let me know. The setup is still in the state where it shows process failures on 2 CNs.

chhandak (chhandak) wrote :

Observed the following trace in the schema log:

Traceback (most recent call last):
  File "/usr/bin/contrail-schema", line 9, in <module>
    load_entry_point('schema-transformer==0.1dev', 'console_scripts', 'contrail-schema')()
  File "/usr/lib/python2.7/dist-packages/schema_transformer/to_bgp.py", line 874, in server_main
    main()
  File "/usr/lib/python2.7/dist-packages/schema_transformer/to_bgp.py", line 868, in main
    args)
  File "/usr/lib/python2.7/dist-packages/cfgm_common/zkclient.py", line 346, in master_election
    self._election.run(func, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/kazoo/recipe/election.py", line 53, in run
    func(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/schema_transformer/to_bgp.py", line 849, in run_schema_transformer
    transformer = SchemaTransformer(args)
  File "/usr/lib/python2.7/dist-packages/schema_transformer/to_bgp.py", line 243, in __init__
    raise e
NotFoundException: NotFoundException(_message=None, why='Column family route_target_table not found.')

Sachin Bansal (sbansal) wrote :

This looks like a Cassandra issue. Did you check the Cassandra logs? Also, how do I access sp-ulnx2?

Changed in opencontrail:
assignee: nobody → Sachin Bansal (sbansal)
Sachin Bansal (sbansal) wrote :

I logged in to the setup and I found that autorestart is set to false on controller-1, while it is set to true in controller-2 and controller-3. As a result, once the process fails, it is not restarted by supervisord on controller-1. Did someone change this setting on controller-1?

root@NTTC-Contrail-1:/var/log/cassandra# cat /etc/contrail/supervisord_config_files/contrail-schema.ini
[program:contrail-schema]
command=/usr/bin/contrail-schema --conf_file /etc/contrail/contrail-schema.conf --conf_file /etc/contrail/contrail-keystone-auth.conf --conf_file /etc/contrail/contrail-database.conf
priority=450
autostart=false
autorestart=false
killasgroup=true
stopsignal=TERM
redirect_stderr=true
stdout_logfile=/var/log/contrail/contrail-schema-stdout.log
stderr_logfile=/dev/null
exitcodes=0 ; 'expected' exit codes for process (default 0,2)
user=contrail
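
To compare this setting across the three controllers, here is a minimal Python 3 sketch that reads one option from a supervisord program section using the standard `configparser` module. `supervisor_option` is a hypothetical helper, and the sample text is a shortened copy of the file above; in practice the text would be read from each controller's copy of the ini file.

```python
import configparser


def supervisor_option(ini_text, program, option):
    """Return the raw string value of `option` in the [program:<program>] section."""
    parser = configparser.ConfigParser(
        interpolation=None,             # leave any '%' in command lines untouched
        inline_comment_prefixes=(";",)  # strip trailing "; comment" text
    )
    parser.read_string(ini_text)
    return parser.get("program:" + program, option)


# Shortened copy of contrail-schema.ini as shown on controller-1.
SAMPLE = """\
[program:contrail-schema]
priority=450
autostart=false
autorestart=false
exitcodes=0 ; 'expected' exit codes for process (default 0,2)
user=contrail
"""

print(supervisor_option(SAMPLE, "contrail-schema", "autorestart"))
```

The value is returned as a string rather than a boolean because supervisord also accepts `unexpected` for `autorestart`.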

Changed in opencontrail:
assignee: Sachin Bansal (sbansal) → Arun Paul (ampul)
Arun Paul (ampul) wrote :

Hi Sachin

I am not aware of the settings. However, this bug is about the process failure that occurred, not about why the process on controller-1 was not restarted after the failure.

Arun Paul (ampul)
Changed in opencontrail:
assignee: Arun Paul (ampul) → nobody
vivekananda shenoy (vshenoy83) wrote :

Hi Sachin,

Any updates on this bug?

Regards,
Vivek
