[k8s-R5.0]: Default NW FW policy does not get created for non-default projects

Bug #1764493 reported by Pulkit Tandon on 2018-04-16
Affects: Juniper Openstack (status tracked in Trunk)
  R5.0:  Fix Released / Importance: Medium / Assigned to: Dinesh Bakiaraj
  Trunk: Fix Released / Importance: Medium / Assigned to: Dinesh Bakiaraj

Bug Description

Bug Template:

Configuration:
K8s 1.9.2
ocata-5.0-15
Centos-7.4

Setup:
5-node setup:
1 kube master, 3 controllers,
2 agents + k8s slaves

The issue was observed in a k8s sanity run:
LogsLocation : http://10.204.216.50/Docs/logs/5.0-15_2018_04_16_17_17_20_1523889114.59/logs/
Report : http://10.204.216.50/Docs/logs/5.0-15_2018_04_16_17_17_20_1523889114.59/junit-noframes.html

Description:
1. Created a few pods in namespace "default"
2. Created a few pods in namespace "non-default"

Pods in the "non-default" namespace can't ping pods in the same namespace.
At the same time, they can ping pods in the "default" namespace.

FWD flow:
```
337876<=>24624       10.47.255.241:63744  1 (2)
                     10.47.255.240:0
(Gen: 1, K(nh):29, Action:D(FwPolicy), Flags:, QOS:-1, S(nh):29, Stats:3/294,
 SPort 52395, TTL 0, Sinfo 4.0.0.0)
```

Rev flow:
```
24624<=>337876       10.47.255.240:63744  1 (2)
                     10.47.255.241:0
(Gen: 1, K(nh):29, Action:D(Unknown), Flags:, QOS:-1, S(nh):26, Stats:0/0,
 SPort 57821, TTL 0, Sinfo 0.0.0.0)
```

Found that the allow-all FW policy rules are not present for non-default namespaces.
This might not be a problem on a fresh setup; it may have been triggered during sanity by some specific step (a restart or something else).
Not sure on this.
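A minimal sketch of the check described above: scan each namespace's firewall policies for a match-anything "pass" rule and flag namespaces where it is missing. The policy/rule field names here are illustrative assumptions, not the real VNC API schema; adapt them to the actual firewall-policy objects.

```python
# Sketch: flag namespaces (projects) whose firewall policies lack an
# allow-all rule, as observed for the "non-default" namespace in this bug.
# The dict shapes below are hypothetical, not the Contrail VNC API schema.

def has_allow_all(policies):
    """Return True if any policy carries a match-anything 'pass' rule."""
    for policy in policies:
        for rule in policy.get("rules", []):
            if (rule.get("action") == "pass"
                    and rule.get("src") == "any"
                    and rule.get("dst") == "any"):
                return True
    return False

def namespaces_missing_allow_all(policies_by_ns):
    """List namespaces with no allow-all rule, like 'non-default' here."""
    return [ns for ns, pols in policies_by_ns.items()
            if not has_allow_all(pols)]

if __name__ == "__main__":
    state = {
        "default": [{"rules": [{"action": "pass",
                                "src": "any", "dst": "any"}]}],
        "non-default": [],  # default policy never created -- the bug
    }
    print(namespaces_missing_allow_all(state))  # -> ['non-default']
```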

Review in progress for https://review.opencontrail.org/41980
Submitter: Dinesh Bakiaraj (<email address hidden>)

Review in progress for https://review.opencontrail.org/41981
Submitter: Dinesh Bakiaraj (<email address hidden>)

Dinesh Bakiaraj (dineshb) wrote :

RabbitMQ config is not provisioned properly per the HA config.
Hence the RabbitMQ cluster is not formed.

[root@nodeg12 ~]# docker exec -it configdatabase_rabbitmq_1 bash
root@nodeg12:/# cat /etc/rabbitmq/rabbitmq.config
[ { rabbit, [
    { loopback_users, [ ] },
    { tcp_listeners, [ 5672 ] },
    { ssl_listeners, [ ] },
    { hipe_compile, false }
] } ].
root@nodeg12:/#
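For comparison, a clustered deployment would normally declare its membership in the config file. A sketch of what the HA rabbitmq.config might look like: `cluster_nodes` and `cluster_partition_handling` are standard RabbitMQ classic-config keys, the node names are taken from this setup, and the exact values are illustrative, not what the provisioning tooling actually emits.

```erlang
[ { rabbit, [
    { loopback_users, [ ] },
    { tcp_listeners, [ 5672 ] },
    { ssl_listeners, [ ] },
    { hipe_compile, false },
    %% Missing from the config above: static cluster membership
    { cluster_nodes, { [ 'contrail@nodeg12',
                         'contrail@nodeg31',
                         'contrail@nodec58' ], disc } },
    { cluster_partition_handling, autoheal }
] } ].
```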


Hi All,

I need your help and expertise debugging the k8s sanity setup, which is in a really bad state. Things got messier starting with build 15.
I observed multiple problems on the current attempt. Not sure if they are linked or all different.
I have left the setup as-is so that you can debug the failures on the live setup.

K8s HA Setup details:
3 Controller+kube managers:
10.204.217.52(nodeg12)
10.204.217.71(nodeg31)
10.204.217.98(nodec58)
2 Agents/ k8s slave:
10.204.217.100(nodec60)
10.204.217.101(nodec61)
Multi interface setup

Following are key observations:

1. RabbitMQ cluster formed between nodeg12 and nodeg31. Nodec58 has rabbitmq as inactive.

rabbitmq: inactive

Docker logs for rabbitmq container on nodec58:

{"init terminating in do_boot",{error,{inconsistent_cluster,"Node contrail@nodec58 thinks it's clustered with node contrail@nodeg31, but contrail@nodeg31 disagrees"}}}

2. On all 3 controllers, the Cassandra connection was not established for 2 hours after provisioning. The issue seems to flap with time; sometimes I see the services as active too:
control: initializing (Database:Cassandra connection down)
collector: initializing (Database:Cassandra connection down)

3. If I create a k8s Pod, it often results in Pod creation failure, and a vrouter crash happens instantly.
The trace is below.
Whether or not the crash happens, Pod creation fails.

4. On the CNI of both agents, seeing this error:
I : 24646 : 2018/04/17 17:35:44 vrouter.go:79: VRouter request. Operation : GET Url : http://127.0.0.1:9091/vm/7a271412-4237-11e8-8997-002590c55f6a

E : 24646 : 2018/04/17 17:35:44 vrouter.go:147: Failed HTTP Get operation. Return code 404

I : 24646 : 2018/04/17 17:35:44 vrouter.go:181: Iteration 14 : Get vrouter failed

E : 24633 : 2018/04/17 17:35:49 vrouter.go:287: Error in polling VRouter

I : 24633 : 2018/04/17 17:35:49 cni.go:175: Error in Add to VRouter

E : 24633 : 2018/04/17 17:35:49 contrail-kube-cni.go:67: Failed processing Add command.

E : 24646 : 2018/04/17 17:35:49 vrouter.go:287: Error in polling VRouter

I : 24646 : 2018/04/17 17:35:49 cni.go:175: Error in Add to VRouter

E : 24646 : 2018/04/17 17:35:49 contrail-kube-cni.go:67: Failed processing Add command.
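The log lines above reflect a poll-and-retry pattern: the CNI plugin repeatedly GETs `/vm/<uuid>` from the vrouter agent (port 9091) until the port shows up or retries are exhausted, then fails the Add. A simplified sketch of that pattern; the function names, retry limit, and interval are illustrative, not the actual contrail-kube-cni implementation.

```python
# Sketch of the CNI's vrouter polling loop: retry a GET until the agent
# reports the VM, or give up with the "Error in polling VRouter" failure.
import time

class VRouterPollError(Exception):
    pass

def poll_vrouter(get_vm, retries=15, interval=1.0):
    """Call get_vm() until it returns (200, body); raise after `retries`."""
    for attempt in range(1, retries + 1):
        status, body = get_vm()
        if status == 200:
            return body
        # Mirrors "Iteration N : Get vrouter failed" in the log above
        print(f"Iteration {attempt} : Get vrouter failed (HTTP {status})")
        time.sleep(interval)
    raise VRouterPollError("Error in polling VRouter")

if __name__ == "__main__":
    # Simulate two 404s before the agent learns about the port.
    responses = iter([(404, None), (404, None), (200, {"id": "7a271412"})])
    print(poll_vrouter(lambda: next(responses), retries=5, interval=0))
```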

NOTE: Most of the issues observed are on the k8s HA multi-interface setup.
      Things are better with a non-HA / single-interface setup.

Agent crash trace:
(gdb) bt full
#0 0x00007fb9817761f7 in raise () from /lib64/libc.so.6
No symbol table info available.
#1 0x00007fb9817778e8 in abort () from /lib64/libc.so.6
No symbol table info available.
#2 0x00007fb98176f266 in __assert_fail_base () from /lib64/libc.so.6
No symbol table info available.
#3 0x00007fb98176f312 in __assert_fail () from /lib64/libc.so.6
No symbol table info available.
#4 0x0000000000c15440 in AgentOperDBTable::ConfigEventHandler(IFMapNode*, DBEntry*) ()
No symbol table info available.
#5 0x0000000000c41714 in IFMapDependencyManager::ProcessChangeList() ()
No symbol table info available.
#6 0x0000000000ea4a57 in TaskTrigger::WorkerTask::Run() ()
No symbol table info available.
#7 0x00000000...


Michael Henkel (mhenkel-3) wrote :

Pulkit,

How many resources did you assign to your instances?

Regards,
Michael


Andrey Pavlov (apavlov-e) wrote :

Hey Michael,

I have similar problems in my 3-nodes setup:

== Contrail control ==
control: active
nodemgr: active
named: active
dns: active

== Contrail analytics ==
snmp-collector: initializing (Database:Cassandra[] connection down)
query-engine: active
api: active
alarm-gen: initializing (Database:Cassandra[] connection down)
nodemgr: active
collector: initializing (Database:Cassandra connection down)
topology: initializing (Database:Cassandra[] connection down)

== Contrail config ==
api: initializing (Database:Cassandra[] connection down)
zookeeper: active
svc-monitor: backup
nodemgr: active
device-manager: backup
cassandra: active
rabbitmq: active
schema: backup

== Contrail webui ==
web: active
job: active

== Contrail database ==
kafka: active
nodemgr: active
zookeeper: active
cassandra: active

[root@node-10-1-56-124 ~]# free -hw
       total  used  free  shared  buffers  cache  available
Mem:     15G   11G  3.3G     28M       0B   892M       3.7G
Swap:     0B    0B    0B

Regards,
Andrey Pavlov.


Michael Henkel (mhenkel-3) wrote :

Hi Andrey, did you check nodetool status?

Regards,
Michael


Pulkit Tandon (pulkitt) wrote :

Hi Michael,

I did not explicitly assign any resources to the instances.

Following is my instances.yaml:

global_configuration:
   REGISTRY_PRIVATE_INSECURE: True
   CONTAINER_REGISTRY: 10.204.217.152:5000
provider_config:
  bms:
    domainsuffix: englab.juniper.net
    ntpserver: 10.204.217.158
    ssh_pwd: c0ntrail123
    ssh_user: root

instances:
  nodec58:
      ip: 10.204.217.98
      provider: bms
      roles:
          analytics: null
          analytics_database: null
          config: null
          config_database: null
          control: null
          k8s_master: null
          kubemanager: null
          webui: null
  nodec60:
      ip: 10.204.217.100
      provider: bms
      roles:
          k8s_node: null
          vrouter:
              PHYSICAL_INTERFACE: bond0
  nodec61:
      ip: 10.204.217.101
      provider: bms
      roles:
          k8s_node: null
          vrouter:
              PHYSICAL_INTERFACE: enp2s0f1
  nodeg12:
      ip: 10.204.217.52
      provider: bms
      roles:
          analytics: null
          analytics_database: null
          config: null
          config_database: null
          control: null
          k8s_master: null
          kubemanager: null
          webui: null
  nodeg31:
      ip: 10.204.217.71
      provider: bms
      roles:
          analytics: null
          analytics_database: null
          config: null
          config_database: null
          control: null
          k8s_master: null
          kubemanager: null
          webui: null

contrail_configuration:
  CONTRAIL_VERSION: ocata-5.0-15
  CLOUD_ORCHESTRATOR: kubernetes
  METADATA_PROXY_SECRET: c0ntrail123
  CLOUD_ORCHESTRATOR: kubernetes
  CONTAINER_REGISTRY: 10.204.217.152:5000
  CONTRAIL_VERSION: ocata-5.0-15
  CONTROL_DATA_NET_LIST: 77.77.1.0/24
  CONTROLLER_NODES: 77.77.1.20,77.77.1.30,77.77.1.11
  KUBERNETES_PUBLIC_FIP_POOL:
      domain: default-domain
      name: __fip_pool_public__
      network: __public__
      project: default
  RABBITMQ_NODE_PORT: 5673
  REGISTRY_PRIVATE_INSECURE: true
  VROUTER_GATEWAY: 77.77.1.100
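As an aside, the contrail_configuration block above repeats several keys (CLOUD_ORCHESTRATOR, CONTAINER_REGISTRY, CONTRAIL_VERSION); YAML loaders typically keep only the last value silently. A quick line-based sketch for flagging such duplicates in a flat mapping block (not a full YAML parser; the indent convention is an assumption):

```python
# Sketch: detect keys repeated at one indentation level of a YAML mapping,
# e.g. the doubled CONTRAIL_VERSION / CLOUD_ORCHESTRATOR entries above.
from collections import Counter

def duplicate_keys(block, indent=2):
    """Keys that occur more than once at exactly `indent` spaces of depth."""
    keys = []
    prefix = " " * indent
    for line in block.splitlines():
        if (line.startswith(prefix) and not line.startswith(prefix + " ")
                and ":" in line):
            keys.append(line.strip().split(":", 1)[0])
    return sorted(k for k, n in Counter(keys).items() if n > 1)

if __name__ == "__main__":
    cfg = """\
contrail_configuration:
  CONTRAIL_VERSION: ocata-5.0-15
  CLOUD_ORCHESTRATOR: kubernetes
  CLOUD_ORCHESTRATOR: kubernetes
  CONTRAIL_VERSION: ocata-5.0-15
"""
    print(duplicate_keys(cfg))  # -> ['CLOUD_ORCHESTRATOR', 'CONTRAIL_VERSION']
```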


Andrey Pavlov (apavlov-e) wrote :

root@node-10-1-56-124:/# nodetool -p 7200 status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load      Tokens  Owns (effective)  Host ID                               Rack
UN  10.1.56.125  3.11 MiB  256     68.5%             468a1809-53ee-4242-971f-3015ccedc6c2  rack1
UN  10.1.56.124  1.89 MiB  256     72.2%             9aa41a48-3e9c-417d-b25c-7abf5e1f94aa  rack1
UN  10.1.56.126  3.63 MiB  256     59.3%             33e498c9-f3e2-4430-86b4-261b0ffbaa0e  rack1

root@node-10-1-56-124:/# nodetool -p 7200 statusgossip
running
root@node-10-1-56-124:/# nodetool -p 7200 statusthrift
running
root@node-10-1-56-124:/# nodetool -p 7200 statusbinary
running
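One sanity check worth noting on that status output: when nodetool reports effective ownership for a keyspace, the percentages sum to replication_factor x 100%. Here 68.5 + 72.2 + 59.3 = 200, suggesting RF 2 across the three nodes (a rough inference, assuming a keyspace was in scope):

```python
# Infer the replication factor from nodetool's "Owns (effective)" column:
# effective ownership sums to RF * 100% when a keyspace is selected.
def inferred_rf(ownership_percents):
    return round(sum(ownership_percents) / 100)

print(inferred_rf([68.5, 72.2, 59.3]))  # -> 2
```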

Regards,
Andrey Pavlov.


Andrey Pavlov (apavlov-e) wrote :

BTW, a memory change for Cassandra was merged recently:
https://review.opencontrail.org/#/c/41767/1/containers/external/cassandra/contrail-entrypoint.sh

Regards,
Andrey Pavlov.


Michael Henkel (mhenkel-3) wrote :

And since then we have had the Cassandra problems? The symptoms clearly point towards memory shortage.
We have to expose the heap size as a parameter; otherwise Java runs wild.
Regards,
Michael


Andrey Pavlov (apavlov-e) wrote :

Alexey added JVM_EXTRA_OPTS to Cassandra's container here:
https://review.opencontrail.org/#/c/41928/1/containers/external/cassandra/contrail-entrypoint.sh
Now I'm checking it this way:
https://github.com/cloudscaling/juniper-ci/blob/master/contrail-containers/ansible/instances.yaml.tmpl#L69
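With that change, the heap can presumably be pinned per deployment via JVM_EXTRA_OPTS in instances.yaml. A hypothetical fragment; the placement under contrail_configuration and the sizes are assumptions for illustration, not the linked template's exact contents:

```yaml
contrail_configuration:
  # Hypothetical sizing: pin min == max heap to avoid resize churn.
  JVM_EXTRA_OPTS: "-Xms8g -Xmx8g"
```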

Regards,
Andrey Pavlov.


Michael Henkel (mhenkel-3) wrote :

ok, let me know how it goes.
Regards,
Michael

> On Apr 17, 2018, at 8:05 AM, Andrey Pavlov <email address hidden> wrote:
>
> Alexey added JVM_EXTRA_OPTS to cassandra's container here https://review.opencontrail.org/#/c/41928/1/containers/external/cassandra/contrail-entrypoint.sh
> Now I'm checking this way https://github.com/cloudscaling/juniper-ci/blob/master/contrail-containers/ansible/instances.yaml.tmpl#L69
>
>
> Regards,
> Andrey Pavlov.
>
> On Tue, Apr 17, 2018 at 6:02 PM, Michael Henkel <email address hidden> wrote:
> And since then we have the cassandra problems? The symptoms clearly point towards memory shortage.
> We have to expose the heap size as a parameter, otherwise Java is running crazy.
> Regards,
> Michael
>
> > On Apr 17, 2018, at 7:21 AM, Andrey Pavlov <email address hidden> wrote:
> >
> > btw, memory change for cassandra was merged recently - https://review.opencontrail.org/#/c/41767/1/containers/external/cassandra/contrail-entrypoint.sh
> >
> > Regards,
> > Andrey Pavlov.
> >
> > On Tue, Apr 17, 2018 at 4:19 PM, Andrey Pavlov <email address hidden> wrote:
> > root@node-10-1-56-124:/# nodetool -p 7200 status
> > Datacenter: datacenter1
> > =======================
> > Status=Up/Down
> > |/ State=Normal/Leaving/Joining/Moving
> > -- Address Load Tokens Owns (effective) Host ID Rack
> > UN 10.1.56.125 3.11 MiB 256 68.5% 468a1809-53ee-4242-971f-3015ccedc6c2 rack1
> > UN 10.1.56.124 1.89 MiB 256 72.2% 9aa41a48-3e9c-417d-b25c-7abf5e1f94aa rack1
> > UN 10.1.56.126 3.63 MiB 256 59.3% 33e498c9-f3e2-4430-86b4-261b0ffbaa0e rack1
> >
> > root@node-10-1-56-124:/# nodetool -p 7200 statusgossip
> > running
> > root@node-10-1-56-124:/# nodetool -p 7200 statusthrift
> > running
> > root@node-10-1-56-124:/# nodetool -p 7200 statusbinary
> > running
> >
> >
> > Regards,
> > Andrey Pavlov.
> >
> > On Tue, Apr 17, 2018 at 4:16 PM, Michael Henkel <email address hidden> wrote:
> > Hi Andrey, did you check nodetool status?
> >
> > Regards,
> > Michael
> >
> > On 17.04.2018 at 06:04, Andrey Pavlov <email address hidden> wrote:
> >
> >> Hey Michael,
> >>
> >> I have similar problems in my 3-nodes setup:
> >>
> >> == Contrail control ==
> >> control: active
> >> nodemgr: active
> >> named: active
> >> dns: active
> >>
> >> == Contrail analytics ==
> >> snmp-collector: initializing (Database:Cassandra[] connection down)
> >> query-engine: active
> >> api: active
> >> alarm-gen: initializing (Database:Cassandra[] connection down)
> >> nodemgr: active
> >> collector: initializing (Database:Cassandra connection down)
> >> topology: initializing (Database:Cassandra[] connection down)
> >>
> >> == Contrail config ==
> >> api: initializing (Database:Cassandra[] connection down)
> >> zookeeper: active
> >> svc-monitor: backup
> >> nodemgr: active
> >> device-manager: backup
> >> cassandra: active
> >> rabbitmq: active
> >> schema: backup
> >>
> >> == Contrail webui ==
> >> web: active
> >> job: active
> >>
> >> == Contrail database ==
> >> kafka: active
> >> nodemgr: active
> >> zookeeper: active
> >> cassandra: a...


Sundaresan Rajangam (srajanga) wrote :

@Michael, Andrey, setting -Xms1g -Xmx2g is not appropriate: Xms and Xmx should be set to the same value, and analytics Cassandra requires at least 8g of heap, which cassandra-env.sh computes based on the available memory. I made this change https://review.opencontrail.org/#/c/41767/1/containers/external/cassandra/contrail-entrypoint.sh
so that Xms and Xmx are no longer hardcoded to 1g and 2g. Many folks reported Cassandra raising OutOfMemoryError: Java heap space precisely because Xms and Xmx were hardcoded to 1g and 2g.
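For context, the stock sizing rule in cassandra-env.sh (which the comment above refers to) can be sketched as below. The function name is mine and this is only an approximation of the upstream default; the Contrail entrypoint may override it via JVM_EXTRA_OPTS:

```python
def cassandra_default_max_heap_mb(system_memory_mb: int) -> int:
    """Approximate cassandra-env.sh's default MAX_HEAP_SIZE rule:
    max(min(1/2 * RAM, 1024 MB), min(1/4 * RAM, 8192 MB))."""
    half = min(system_memory_mb // 2, 1024)     # half of RAM, capped at 1 GB
    quarter = min(system_memory_mb // 4, 8192)  # quarter of RAM, capped at 8 GB
    return max(half, quarter)
```

On the 15G node shown in the `free -hw` output above, this rule would pick roughly a 3.75 GB heap, so forcing `-Xmx2g` cuts the heap well below what cassandra-env.sh would otherwise choose.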

Andrey Pavlov (apavlov-e) wrote :

Michael, it helped for me.

Regards,
Andrey.


Pulkit Tandon (pulkitt) wrote :

Assuming that setting JVM_EXTRA_OPTS: "-Xms1g -Xmx2g" under contrail_configuration will help resolve the Cassandra issue,
I will set it in the next run.
Can you please explain what this value is and how it helps?

Apart from this, I am facing rabbitMQ issues, crash issues, POD creation issues and CNI issues.
Can anyone help debug those?

From: Andrey Pavlov <email address hidden>
Date: Tuesday, April 17, 2018 at 10:46 PM
To: Michael Henkel <email address hidden>
Cc: Pulkit Tandon <email address hidden>, Sachchidanand Vaidya <email address hidden>, Dinesh Bakiaraj <email address hidden>, Prasanna Mucharikar <email address hidden>, Yuvaraja Mariappan <email address hidden>, Aniket Gawade <email address hidden>, Sudheendra Rao <email address hidden>, Venkatesh Velpula <email address hidden>, Rudra Rugge <email address hidden>, Ignatious Johnson <email address hidden>
Subject: Re: Debugging required on k8s sanity setup which failed for R5.0-16


Michael Henkel (mhenkel-3) wrote :

the setting limits the amount of memory Java can grab. If you don’t provide sufficient resources and do not make that setting, it can have all sorts of side effects. Solve the memory consumption problem first, and then check the other issues.

Regards,
Michael
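
For reference, a sketch of where such a heap cap would go in the ansible-deployer's instances.yaml, following the juniper-ci template linked earlier in the thread. The key placement is an assumption based on that template, and the 1g/2g values are the ones from this thread, not a recommendation:

```yaml
contrail_configuration:
  # Cap the Cassandra JVM heap so Java cannot grab all host memory.
  # Xms/Xmx are normally set equal in production; 1g/2g are only
  # the values under discussion in this thread.
  JVM_EXTRA_OPTS: "-Xms1g -Xmx2g"
```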


Pulkit Tandon (pulkitt) wrote :

Just an update:
For the vrouter crash, Hari has a fix which has already been verified on my setup.
Post fix, the Pod creation issue is also resolved.

https://bugs.launchpad.net/juniperopenstack/+bug/1764821

Thanks!
Pulkit Tandon


Dinesh Bakiaraj (dineshb) wrote :

This issue was a side effect of the missing Java memory configuration for Cassandra.
That was addressed by a change in the provisioning options.
Once that is done, there is no functional issue.
There is, however, a defensive check we could add that will prevent a backtrace in kube-manager when such issues show up.
This has no functional impact, so I am reducing the severity of the bug.
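The defensive check described in the commits below can be sketched as follows. The function and parameter names (delete_firewall_rules, vnc_lib, firewall_rule_delete) are hypothetical placeholders illustrating the guard-clause pattern, not the actual kube-manager code:

```python
def delete_firewall_rules(policy_info, rule_uuids, vnc_lib):
    """Delete firewall rules via the VNC API, guarding against
    missing input instead of letting the caller hit a backtrace."""
    # Defensive check: if policy or rule info is not provided,
    # do not initiate any firewall rule delete VNC calls.
    if not policy_info or not rule_uuids:
        return []
    deleted = []
    for rule_uuid in rule_uuids:
        # vnc_lib stands in for the VNC API client; the method name
        # mirrors VNC naming conventions but is illustrative here.
        vnc_lib.firewall_rule_delete(id=rule_uuid)
        deleted.append(rule_uuid)
    return deleted
```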

Reviewed: https://review.opencontrail.org/41981
Committed: http://github.com/Juniper/contrail-controller/commit/7e7b2fd13e54415d2840f0ee5467ab32d9649368
Submitter: Zuul v3 CI (<email address hidden>)
Branch: R5.0

commit 7e7b2fd13e54415d2840f0ee5467ab32d9649368
Author: dineshb-jnpr <email address hidden>
Date: Mon Apr 16 15:04:01 2018 -0700

Defensive check to handled invalid input.

Defensive check to not initiate any firewall rule delete VNC calls,
if policy or rule info is not provided.

Change-Id: I220f607a766abae1325a4c0d67c9a1c80fe75ce7
Partial-Bug: #1764493

Reviewed: https://review.opencontrail.org/41980
Committed: http://github.com/Juniper/contrail-controller/commit/98168542767a115284254932f8d576e5d947a9d0
Submitter: Zuul v3 CI (<email address hidden>)
Branch: master

commit 98168542767a115284254932f8d576e5d947a9d0
Author: dineshb-jnpr <email address hidden>
Date: Mon Apr 16 15:04:01 2018 -0700

Defensive check to handled invalid input.

Defensive check to not initiate any firewall rule delete VNC calls,
if policy or rule info is not provided.

Change-Id: I220f607a766abae1325a4c0d67c9a1c80fe75ce7
Partial-Bug: #1764493

Pulkit Tandon (pulkitt) wrote :

No related issues have been observed across many recent sanity runs.
The most recent sanity run was on R5.0-50.
Hence, closing the bug.
