[k8s-R5.0]: Default NW FW policy does not get created for non-default projects
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Juniper Openstack | Status tracked in Trunk | | |
R5.0 | Fix Released | Medium | Dinesh Bakiaraj |
Trunk | Fix Released | Medium | Dinesh Bakiaraj |
Bug Description
Bug Template:
Configuration:
K8s 1.9.2
ocata-5.0-15
Centos-7.4
Setup:
5 node setup.
1 Kube master, 3 Controllers.
2 Agent + K8s slave nodes.
The issue was observed in a k8s sanity run:
LogsLocation : http://
Report : http://
Description:
1. Created a few pods in namespace "default".
2. Created a few pods in namespace "non-default".
Pods in the "non-default" namespace cannot ping pods in the same namespace,
yet at the same time they can ping pods in the "default" namespace.
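A minimal reproduction sketch of the steps above, assuming plain busybox pods; the pod names here are hypothetical:
```
# Create a non-default namespace and two pods in it (names are illustrative).
kubectl create namespace non-default
kubectl run pod-a --image=busybox --restart=Never -n non-default -- sleep 3600
kubectl run pod-b --image=busybox --restart=Never -n non-default -- sleep 3600

# Ping between pods in the same non-default namespace -- this is what fails.
POD_B_IP=$(kubectl get pod pod-b -n non-default -o jsonpath='{.status.podIP}')
kubectl exec -n non-default pod-a -- ping -c 3 "$POD_B_IP"

# Ping from the non-default namespace to a pod in "default" -- this still works.
kubectl run pod-c --image=busybox --restart=Never -n default -- sleep 3600
POD_C_IP=$(kubectl get pod pod-c -n default -o jsonpath='{.status.podIP}')
kubectl exec -n non-default pod-a -- ping -c 3 "$POD_C_IP"
```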
FWD flow:
```
337876<
(Gen: 1, K(nh):29, Action:D(FwPolicy), Flags:, QOS:-1, S(nh):29, Stats:3/294,
 SPort 52395, TTL 0, Sinfo 4.0.0.0)
```
Rev Flow:
```
(Gen: 1, K(nh):29, Action:D(Unknown), Flags:, QOS:-1, S(nh):26, Stats:0/0,
 SPort 57821, TTL 0, Sinfo 0.0.0.0)
```
Found that the allow-all FW policy rules are not present for non-default namespaces.
This might not be a problem with a fresh setup; it may have been triggered during the sanity run by a specific step such as a restart, but this is not certain.
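One way to confirm the missing allow-all rules is to list firewall policies through the Contrail config API and check which project each policy belongs to. This is only a sketch: it assumes the config API is reachable on port 8082, that the collection endpoint follows the usual plural naming (`firewall-policys`), and that the k8s namespace maps to a project of the same name.
```
# List all firewall policies known to the config API (host is an assumption).
CONFIG_API=http://<config-node-ip>:8082
curl -s "$CONFIG_API/firewall-policys" | python -m json.tool

# Fetch one policy to see its fq_name (project scope) and rule references.
curl -s "$CONFIG_API/firewall-policy/<policy-uuid>" | python -m json.tool
```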
OpenContrail Admin (ci-admin-f) wrote : [Review update] master | #1 |
OpenContrail Admin (ci-admin-f) wrote : [Review update] R5.0 | #2 |
Review in progress for https:/
Submitter: Dinesh Bakiaraj (<email address hidden>)
Dinesh Bakiaraj (dineshb) wrote : | #3 |
The RabbitMQ config is not provisioned properly for the HA configuration, hence the rabbitmq cluster is not formed.
```
[root@nodeg12 ~]# docker exec -it configdatabase_
root@nodeg12:/# cat /etc/rabbitmq/
[ { rabbit, [
    { loopback_users, [ ] },
    { tcp_listeners, [ 5672 ] },
    { ssl_listeners, [ ] },
    { hipe_compile, false }
] } ].
root@nodeg12:/#
```
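A quick way to verify whether the cluster actually formed is to query rabbitmqctl from inside the container on each controller; the container name below is an assumption, check `docker ps` for the real one.
```
# A correctly formed HA cluster should list all three controllers
# (e.g. contrail@nodeg12, contrail@nodeg31, contrail@nodec58) under running_nodes.
docker exec -it configdatabase_rabbitmq_1 rabbitmqctl cluster_status
```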
Pulkit Tandon (pulkitt) wrote : Debugging required on k8s sanity setup which failed for R5.0-16 | #4 |
Hi All,
I need your help and expertise debugging the k8s sanity setup, which is in a really bad state. Things have become messier starting with build 15.
I observed multiple problems on the current attempt; I am not sure whether they are linked or all distinct.
I have kept the setup in the same state so that you can debug the failures live.
K8s HA Setup details:
3 Controller+kube managers:
10.204.
10.204.
10.204.
2 Agents/ k8s slave:
10.204.
10.204.
Multi interface setup
Following are key observations:
1. The RabbitMQ cluster formed only between nodeg12 and nodeg31; nodec58 has rabbitmq as inactive.
rabbitmq: inactive
Docker logs for rabbitmq container on nodec58:
{"init terminating in do_boot"
2. On all 3 controllers, the Cassandra connection was not established for 2 hours after provisioning. The issue flaps with time; sometimes I see the services as active too:
control: initializing (Database:Cassandra connection down)
collector: initializing (Database:Cassandra connection down)
3. If I create a k8s Pod, it often results in Pod creation failure and an immediate vrouter crash.
The trace is below.
Whether or not the crash happens, Pod creation fails.
4. On the CNI of both agents, seeing this error:
I : 24646 : 2018/04/17 17:35:44 vrouter.go:79: VRouter request. Operation : GET Url : http://
E : 24646 : 2018/04/17 17:35:44 vrouter.go:147: Failed HTTP Get operation. Return code 404
I : 24646 : 2018/04/17 17:35:44 vrouter.go:181: Iteration 14 : Get vrouter failed
E : 24633 : 2018/04/17 17:35:49 vrouter.go:287: Error in polling VRouter
I : 24633 : 2018/04/17 17:35:49 cni.go:175: Error in Add to VRouter
E : 24633 : 2018/04/17 17:35:49 contrail-
E : 24646 : 2018/04/17 17:35:49 vrouter.go:287: Error in polling VRouter
I : 24646 : 2018/04/17 17:35:49 cni.go:175: Error in Add to VRouter
E : 24646 : 2018/04/17 17:35:49 contrail-
NOTE: Most of the issues were observed on the k8s HA multi-interface setup.
Things are better with the non-HA/single-interface setup.
Agent crash trace:
```
(gdb) bt full
#0 0x00007fb9817761f7 in raise () from /lib64/libc.so.6
No symbol table info available.
#1 0x00007fb9817778e8 in abort () from /lib64/libc.so.6
No symbol table info available.
#2 0x00007fb98176f266 in __assert_fail_base () from /lib64/libc.so.6
No symbol table info available.
#3 0x00007fb98176f312 in __assert_fail () from /lib64/libc.so.6
No symbol table info available.
#4 0x0000000000c15440 in AgentOperDBTabl
No symbol table info available.
#5 0x0000000000c41714 in IFMapDependency
No symbol table info available.
#6 0x0000000000ea4a57 in TaskTrigger:
No symbol table info available.
#7 0x00000000...
```
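For reference, a backtrace like the one above can be pulled from the agent core file with a command along these lines; the binary and core paths are assumptions and vary per setup.
```
# Load the vrouter agent core into gdb and dump the full backtrace (paths assumed).
gdb /usr/bin/contrail-vrouter-agent /var/crashes/core.<pid> \
    -ex 'set pagination off' -ex 'bt full' -ex 'quit' > agent_bt_full.txt
```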
Michael Henkel (mhenkel-3) wrote : | #5 |
Pulkit,
How many resources did you assign to your instances?
Regards,
Michael
Andrey Pavlov (apavlov-e) wrote : | #6 |
Hey Michael,
I have similar problems in my 3-node setup:
== Contrail control ==
control: active
nodemgr: active
named: active
dns: active
== Contrail analytics ==
snmp-collector: initializing (Database:
query-engine: active
api: active
alarm-gen: initializing (Database:
nodemgr: active
collector: initializing (Database:Cassandra connection down)
topology: initializing (Database:
== Contrail config ==
api: initializing (Database:
zookeeper: active
svc-monitor: backup
nodemgr: active
device-manager: backup
cassandra: active
rabbitmq: active
schema: backup
== Contrail webui ==
web: active
job: active
== Contrail database ==
kafka: active
nodemgr: active
zookeeper: active
cassandra: active
```
[root@node-
              total   used   free  shared  buffers   cache  available
Mem:            15G    11G   3.3G     28M       0B    892M       3.7G
Swap:            0B     0B     0B
```
Regards,
Andrey Pavlov.
Michael Henkel (mhenkel-3) wrote : | #7 |
Hi Andrey, did you check nodetool status?
Regards,
Michael
Pulkit Tandon (pulkitt) wrote : | #8 |
Hi Michael,
I did not explicitly assign any resources to the instances.
Following is my instances.yaml:
```
global_
  REGISTRY_
  CONTAINER_
provider_config:
  bms:
    domainsuffix: englab.juniper.net
    ntpserver: 10.204.217.158
    ssh_pwd: c0ntrail123
    ssh_user: root
instances:
  nodec58:
    ip: 10.204.217.98
    provider: bms
    roles:
      config: null
      control: null
      webui: null
  nodec60:
    ip: 10.204.217.100
    provider: bms
    roles:
      k8s_node: null
      vrouter:
  nodec61:
    ip: 10.204.217.101
    provider: bms
    roles:
      k8s_node: null
      vrouter:
  nodeg12:
    ip: 10.204.217.52
    provider: bms
    roles:
      config: null
      control: null
      webui: null
  nodeg31:
    ip: 10.204.217.71
    provider: bms
    roles:
      config: null
      control: null
      webui: null
contrail_
  CONTRAIL_VERSION: ocata-5.0-15
  CLOUD_
  METADATA_
  CLOUD_
  CONTAINER_
  CONTRAIL_VERSION: ocata-5.0-15
  CONTROL_
  CONTROLLER_NODES: 77.77.1.
  KUBERNETES_
    domain: default-domain
    name: __fip_pool_public__
    network: __public__
    project: default
  RABBITMQ_
  REGISTRY_
  VROUTER_GATEWAY: 77.77.1.100
```
Andrey Pavlov (apavlov-e) wrote : | #9 |
```
root@node-
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/
--  Address      Load      Tokens  Owns (effective)  Host ID
UN  10.1.56.125  3.11 MiB  256     68.5%             468a1809-
UN  10.1.56.124  1.89 MiB  256     72.2%             9aa41a48-
UN  10.1.56.126  3.63 MiB  256     59.3%             33e498c9-

root@node-
running
root@node-
running
root@node-
running
```
Regards,
Andrey Pavlov.
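Since the thread is converging on Cassandra heap sizing, it may also help to check what heap each node is actually running with; the container name here is an assumption.
```
# Show configured vs. used heap on a Cassandra node.
# Expected output includes a line like "Heap Memory (MB) : <used> / <max>".
docker exec -it config_database_cassandra_1 nodetool info | grep -i heap
```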
On Tue, Apr 17, 2018 at 4:16 PM, Michael Henkel <email address hidden> wrote:
> Hi Andrey, did you check nodetool status?
>
> Regards,
> Michael
>
> Am 17.04.2018 um 06:04 schrieb Andrey Pavlov <email address hidden>:
>
> Hey Michael,
>
> I have similar problems in my 3-nodes setup:
>
> == Contrail control ==
> control: active
> nodemgr: active
> named: active
> dns: active
>
> == Contrail analytics ==
> snmp-collector: initializing (Database:
> query-engine: active
> api: active
> alarm-gen: initializing (Database:
> nodemgr: active
> collector: initializing (Database:Cassandra connection down)
> topology: initializing (Database:
>
> == Contrail config ==
> api: initializing (Database:
> zookeeper: active
> svc-monitor: backup
> nodemgr: active
> device-manager: backup
> cassandra: active
> rabbitmq: active
> schema: backup
>
> == Contrail webui ==
> web: active
> job: active
>
> == Contrail database ==
> kafka: active
> nodemgr: active
> zookeeper: active
> cassandra: active
>
> [root@node-
> total used free shared buffers
> cache available
> Mem: 15G 11G 3.3G 28M 0B
> 892M 3.7G
> Swap: 0B 0B 0B
>
>
> Regards,
> Andrey Pavlov.
>
> On Tue, Apr 17, 2018 at 3:57 PM, Michael Henkel <email address hidden>
> wrote:
>
>> Pulkit,
>>
>> How many resources did you assign to your instances?
>>
>> Regards,
>> Michael
>>
>> Am 17.04.2018 um 05:37 schrieb Pulkit Tandon <email address hidden>:
>>
>> Hi All,
>>
>>
>>
>> I need your help and expertise debugging the k8s sanity setup which is in
>> really bad state. Things are messier starting build 15.
>>
>> I observed multiple problems on current attempt. Not sure if they are
>> linked or all are different.
>>
>> Kept the setup in same setup so that you can debug the failures on live
>> setup.
>>
>>
>>
>> *K8s HA Setup details:*
>>
>> 3 Controller+kube managers:
>>
>> 10.204.
>>
>> 10.204.
>>
>> 10.204.
>>
>> 2 Agents/ k8s slave:
>>
>> 10.204.
>>
>> 10.204.
>>
>> Multi interface setup
>>
>>
>>
>> Following are key observations:
>>
>> 1. RabbitMQ cluster formed between nodeg12 and nodeg31. Nodec58
>> has rabbitmq as inactive.
>>
>> rabbitmq: inactive
>>
>> ...
Andrey Pavlov (apavlov-e) wrote : | #10 |
btw, a memory change for cassandra was merged recently -
https:/
Regards,
Andrey Pavlov.
Michael Henkel (mhenkel-3) wrote : | #11 |
And since then we have had the cassandra problems? The symptoms clearly point towards a memory shortage.
We have to expose the heap size as a parameter, otherwise Java will consume memory unchecked.
Regards,
Michael
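To confirm whether any heap-size parameter is actually being passed into the Cassandra container, its environment can be inspected; the container name is again an assumption.
```
# Dump the container environment and look for JVM/heap related settings.
docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' \
    config_database_cassandra_1 | grep -Ei 'xms|xmx|heap|jvm'
```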
Andrey Pavlov (apavlov-e) wrote : | #12 |
Alexey added JVM_EXTRA_OPTS to cassandra's container here
https:/
Now I'm checking this way
https:/
Regards,
Andrey Pavlov.
Michael Henkel (mhenkel-3) wrote : | #13 |
ok, let me know how it goes.
Regards,
Michael
Sundaresan Rajangam (srajanga) wrote : | #14 |
@Michael, Andrey, setting -Xms1g -Xmx2g is not appropriate. Xms and Xmx should be set to the same value, and analytics cassandra requires at least an 8g heap; it is computed by cassandra-env.sh based on the available memory. I made this change https:/
to avoid hardcoding Xms and Xmx to 1g and 2g respectively. Many folks reported cassandra raising an OutOfMemoryError: Java heap space exception after setting Xms and Xmx to 1g and 2g respectively.
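For context, cassandra-env.sh only auto-computes the heap when MAX_HEAP_SIZE / HEAP_NEWSIZE are left unset; a sketch of pinning them explicitly for an analytics Cassandra node, with illustrative values (cassandra-env.sh requires both to be set together):
```
# MAX_HEAP_SIZE sets -Xms and -Xmx to the same value, as recommended above.
export MAX_HEAP_SIZE="8G"
export HEAP_NEWSIZE="2G"
```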
Andrey Pavlov (apavlov-e) wrote : | #15 |
Michael, it helped for me.
Regards,
Andrey.
Pulkit Tandon (pulkitt) wrote : | #16 |
Assuming that setting JVM_EXTRA_OPTS: "-Xms1g -Xmx2g" under contrail_
I will set it in the next run.
Can you please explain what this value is and how it helps?
Apart from this, I am facing RabbitMQ issues, crash issues, Pod creation issues and CNI issues.
Can anyone help debug those?
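A sketch of where such a setting would sit in the instances.yaml shown earlier; the expansion of the truncated `contrail_` key to `contrail_configuration` and the heap values themselves are assumptions, not a confirmed recommendation.
```
contrail_configuration:            # assumed expansion of the truncated "contrail_" key above
  CONTRAIL_VERSION: ocata-5.0-15
  JVM_EXTRA_OPTS: "-Xms1g -Xmx2g"  # value under discussion; equal and larger Xms/Xmx may be preferable
```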
Michael Henkel (mhenkel-3) wrote : | #17 |
The setting limits the amount of memory Java can grab. If you do not provide sufficient resources and do not make that setting, it can have all sorts of side effects. Solve the memory consumption problem first and then check the other issues.
Regards,
Michael
Pulkit Tandon (pulkitt) wrote : | #18 |
Just an update:
For the vrouter crash, Hari has a fix which has already been verified on my setup.
Post fix, the Pod creation issue is also resolved.
https:/
Thanks!
Pulkit Tandon
Dinesh Bakiaraj (dineshb) wrote : | #19 |
This issue was a side effect of the missing Java memory config for Cassandra.
That was addressed by a change in the provisioning options.
Once that is done, there is no functional issue.
But there is a defensive check we could add that will prevent a backtrace in kube-manager when such issues show up.
This has no functional impact, so I am reducing the severity of the bug.
OpenContrail Admin (ci-admin-f) wrote : A change has been merged | #20 |
Reviewed: https:/
Committed: http://
Submitter: Zuul v3 CI (<email address hidden>)
Branch: R5.0
commit 7e7b2fd13e54415
Author: dineshb-jnpr <email address hidden>
Date: Mon Apr 16 15:04:01 2018 -0700
Defensive check to handled invalid input.
Defensive check to not initiate any firewall rule delete VNC calls,
if policy or rule info is not provided.
Change-Id: I220f607a766aba
Partial-Bug: #1764493
OpenContrail Admin (ci-admin-f) wrote : | #21 |
Reviewed: https:/
Committed: http://
Submitter: Zuul v3 CI (<email address hidden>)
Branch: master
commit 98168542767a115
Author: dineshb-jnpr <email address hidden>
Date: Mon Apr 16 15:04:01 2018 -0700
Defensive check to handled invalid input.
Defensive check to not initiate any firewall rule delete VNC calls,
if policy or rule info is not provided.
Change-Id: I220f607a766aba
Partial-Bug: #1764493
Pulkit Tandon (pulkitt) wrote : | #22 |
No related issues have been observed over the past many sanity runs.
The most recent sanity run was on R5.0-50.
Hence, closing the bug.
Review in progress for https://review.opencontrail.org/41980
Submitter: Dinesh Bakiaraj (<email address hidden>)