[k8s-R5.0]: Default NW FW policy does not get created for non-default projects
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Juniper Openstack | Status tracked in Trunk | | |
R5.0 | Fix Released | Medium | Dinesh Bakiaraj |
Trunk | Fix Released | Medium | Dinesh Bakiaraj |
Bug Description
Bug Template:
Configuration:
K8s 1.9.2
ocata-5.0-15
Centos-7.4
Setup:
5 node setup.
1 Kube master, 3 Controllers.
2 Agent + K8s slave nodes.
The issue was observed in a k8s sanity run:
LogsLocation : http://
Report : http://
Description:
1. Created a few pods in namespace "default".
2. Created a few pods in namespace "non-default".
Pods in the "non-default" namespace cannot ping pods in the same namespace,
yet at the same time they can ping pods in the "default" namespace.
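A minimal reproduction sketch of the steps above, assuming plain busybox pods; the pod names here are hypothetical:
```
# Create a non-default namespace and two pods in it (names are illustrative).
kubectl create namespace non-default
kubectl run pod-a --image=busybox --restart=Never -n non-default -- sleep 3600
kubectl run pod-b --image=busybox --restart=Never -n non-default -- sleep 3600

# Ping between pods in the same non-default namespace -- this is what fails.
POD_B_IP=$(kubectl get pod pod-b -n non-default -o jsonpath='{.status.podIP}')
kubectl exec -n non-default pod-a -- ping -c 3 "$POD_B_IP"

# Ping from the non-default namespace to a pod in "default" -- this still works.
kubectl run pod-c --image=busybox --restart=Never -n default -- sleep 3600
POD_C_IP=$(kubectl get pod pod-c -n default -o jsonpath='{.status.podIP}')
kubectl exec -n non-default pod-a -- ping -c 3 "$POD_C_IP"
```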
FWD flow:
```
337876<
(Gen: 1, K(nh):29, Action:D(FwPolicy), Flags:, QOS:-1, S(nh):29, Stats:3/294,
 SPort 52395, TTL 0, Sinfo 4.0.0.0)
```
Rev Flow:
```
(Gen: 1, K(nh):29, Action:D(Unknown), Flags:, QOS:-1, S(nh):26, Stats:0/0,
 SPort 57821, TTL 0, Sinfo 0.0.0.0)
```
Found that the allow-all FW policy rules are not present for non-default namespaces.
This might not be a problem with a fresh setup; it may have been triggered during the sanity run by a specific step such as a restart, but this is not certain.
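One way to confirm the missing allow-all rules is to list firewall policies through the Contrail config API and check which project each policy belongs to. This is only a sketch: it assumes the config API is reachable on port 8082, that the collection endpoint follows the usual plural naming (`firewall-policys`), and that the k8s namespace maps to a project of the same name.
```
# List all firewall policies known to the config API (host is an assumption).
CONFIG_API=http://<config-node-ip>:8082
curl -s "$CONFIG_API/firewall-policys" | python -m json.tool

# Fetch one policy to see its fq_name (project scope) and rule references.
curl -s "$CONFIG_API/firewall-policy/<policy-uuid>" | python -m json.tool
```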
OpenContrail Admin (ci-admin-f) wrote : [Review update] master | #1 |
OpenContrail Admin (ci-admin-f) wrote : [Review update] R5.0 | #2 |
Review in progress for https:/
Submitter: Dinesh Bakiaraj (<email address hidden>)
Dinesh Bakiaraj (dineshb) wrote : | #3 |
The RabbitMQ config is not provisioned properly for the HA configuration, hence the rabbitmq cluster is not formed.
```
[root@nodeg12 ~]# docker exec -it configdatabase_
root@nodeg12:/# cat /etc/rabbitmq/
[ { rabbit, [
    { loopback_users, [ ] },
    { tcp_listeners, [ 5672 ] },
    { ssl_listeners, [ ] },
    { hipe_compile, false }
] } ].
root@nodeg12:/#
```
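A quick way to verify whether the cluster actually formed is to query rabbitmqctl from inside the container on each controller; the container name below is an assumption, check `docker ps` for the real one.
```
# A correctly formed HA cluster should list all three controllers
# (e.g. contrail@nodeg12, contrail@nodeg31, contrail@nodec58) under running_nodes.
docker exec -it configdatabase_rabbitmq_1 rabbitmqctl cluster_status
```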
Pulkit Tandon (pulkitt) wrote : Debugging required on k8s sanity setup which failed for R5.0-16 | #4 |
Hi All,
I need your help and expertise debugging the k8s sanity setup, which is in a really bad state. Things have become messier starting with build 15.
I observed multiple problems on the current attempt; I am not sure whether they are linked or all distinct.
I have kept the setup in the same state so that you can debug the failures live.
K8s HA Setup details:
3 Controller+kube managers:
10.204.
10.204.
10.204.
2 Agents/ k8s slave:
10.204.
10.204.
Multi interface setup
Following are key observations:
1. The RabbitMQ cluster formed only between nodeg12 and nodeg31; nodec58 has rabbitmq as inactive.
rabbitmq: inactive
Docker logs for rabbitmq container on nodec58:
{"init terminating in do_boot"
2. On all 3 controllers, the Cassandra connection was not established for 2 hours after provisioning. The issue flaps with time; sometimes I see the services as active too:
control: initializing (Database:Cassandra connection down)
collector: initializing (Database:Cassandra connection down)
3. If I create a k8s Pod, it often results in Pod creation failure and an immediate vrouter crash.
The trace is below.
Whether or not the crash happens, Pod creation fails.
4. On the CNI of both agents, seeing this error:
I : 24646 : 2018/04/17 17:35:44 vrouter.go:79: VRouter request. Operation : GET Url : http://
E : 24646 : 2018/04/17 17:35:44 vrouter.go:147: Failed HTTP Get operation. Return code 404
I : 24646 : 2018/04/17 17:35:44 vrouter.go:181: Iteration 14 : Get vrouter failed
E : 24633 : 2018/04/17 17:35:49 vrouter.go:287: Error in polling VRouter
I : 24633 : 2018/04/17 17:35:49 cni.go:175: Error in Add to VRouter
E : 24633 : 2018/04/17 17:35:49 contrail-
E : 24646 : 2018/04/17 17:35:49 vrouter.go:287: Error in polling VRouter
I : 24646 : 2018/04/17 17:35:49 cni.go:175: Error in Add to VRouter
E : 24646 : 2018/04/17 17:35:49 contrail-
NOTE: Most of the issues were observed on the k8s HA multi-interface setup.
Things are better with the non-HA/single-interface setup.
Agent crash trace:
```
(gdb) bt full
#0 0x00007fb9817761f7 in raise () from /lib64/libc.so.6
No symbol table info available.
#1 0x00007fb9817778e8 in abort () from /lib64/libc.so.6
No symbol table info available.
#2 0x00007fb98176f266 in __assert_fail_base () from /lib64/libc.so.6
No symbol table info available.
#3 0x00007fb98176f312 in __assert_fail () from /lib64/libc.so.6
No symbol table info available.
#4 0x0000000000c15440 in AgentOperDBTabl
No symbol table info available.
#5 0x0000000000c41714 in IFMapDependency
No symbol table info available.
#6 0x0000000000ea4a57 in TaskTrigger:
No symbol table info available.
#7 0x00000000...
```
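For reference, a backtrace like the one above can be pulled from the agent core file with a command along these lines; the binary and core paths are assumptions and vary per setup.
```
# Load the vrouter agent core into gdb and dump the full backtrace (paths assumed).
gdb /usr/bin/contrail-vrouter-agent /var/crashes/core.<pid> \
    -ex 'set pagination off' -ex 'bt full' -ex 'quit' > agent_bt_full.txt
```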
Michael Henkel (mhenkel-3) wrote : | #5 |
Pulkit,
How many resources did you assign to your instances?
Regards,
Michael
Andrey Pavlov (apavlov-e) wrote : | #6 |
Hey Michael,
I have similar problems in my 3-node setup:
== Contrail control ==
control: active
nodemgr: active
named: active
dns: active
== Contrail analytics ==
snmp-collector: initializing (Database:
query-engine: active
api: active
alarm-gen: initializing (Database:
nodemgr: active
collector: initializing (Database:Cassandra connection down)
topology: initializing (Database:
== Contrail config ==
api: initializing (Database:
zookeeper: active
svc-monitor: backup
nodemgr: active
device-manager: backup
cassandra: active
rabbitmq: active
schema: backup
== Contrail webui ==
web: active
job: active
== Contrail database ==
kafka: active
nodemgr: active
zookeeper: active
cassandra: active
```
[root@node-
              total   used   free  shared  buffers   cache  available
Mem:            15G    11G   3.3G     28M       0B    892M       3.7G
Swap:            0B     0B     0B
```
Regards,
Andrey Pavlov.
Michael Henkel (mhenkel-3) wrote : | #7 |
Hi Andrey, did you check nodetool status?
Regards,
Michael
Pulkit Tandon (pulkitt) wrote : | #8 |
Hi Michael,
I did not explicitly assign any resources to the instances.
Following is my instances.yaml:
```
global_
  REGISTRY_
  CONTAINER_
provider_config:
  bms:
    domainsuffix: englab.juniper.net
    ntpserver: 10.204.217.158
    ssh_pwd: c0ntrail123
    ssh_user: root
instances:
  nodec58:
    ip: 10.204.217.98
    provider: bms
    roles:
      config: null
      control: null
      webui: null
  nodec60:
    ip: 10.204.217.100
    provider: bms
    roles:
      k8s_node: null
      vrouter:
  nodec61:
    ip: 10.204.217.101
    provider: bms
    roles:
      k8s_node: null
      vrouter:
  nodeg12:
    ip: 10.204.217.52
    provider: bms
    roles:
      config: null
      control: null
      webui: null
  nodeg31:
    ip: 10.204.217.71
    provider: bms
    roles:
      config: null
      control: null
      webui: null
contrail_
  CONTRAIL_VERSION: ocata-5.0-15
  CLOUD_
  METADATA_
  CLOUD_
  CONTAINER_
  CONTRAIL_VERSION: ocata-5.0-15
  CONTROL_
  CONTROLLER_NODES: 77.77.1.
  KUBERNETES_
    domain: default-domain
    name: __fip_pool_public__
    network: __public__
    project: default
  RABBITMQ_
  REGISTRY_
  VROUTER_GATEWAY: 77.77.1.100
```
Andrey Pavlov (apavlov-e) wrote : | #9 |
```
root@node-
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/
--  Address      Load      Tokens  Owns (effective)  Host ID
UN  10.1.56.125  3.11 MiB  256     68.5%             468a1809-
UN  10.1.56.124  1.89 MiB  256     72.2%             9aa41a48-
UN  10.1.56.126  3.63 MiB  256     59.3%             33e498c9-

root@node-
running
root@node-
running
root@node-
running
```
Regards,
Andrey Pavlov.
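Since the thread is converging on Cassandra heap sizing, it may also help to check what heap each node is actually running with; the container name here is an assumption.
```
# Show configured vs. used heap on a Cassandra node.
# Expected output includes a line like "Heap Memory (MB) : <used> / <max>".
docker exec -it config_database_cassandra_1 nodetool info | grep -i heap
```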
On Tue, Apr 17, 2018 at 4:16 PM, Michael Henkel <email address hidden> wrote:
> Hi Andrey, did you check nodetool status?
>
> Regards,
> Michael
>
> Am 17.04.2018 um 06:04 schrieb Andrey Pavlov <email address hidden>:
>
> Hey Michael,
>
> I have similar problems in my 3-nodes setup:
>
> == Contrail control ==
> control: active
> nodemgr: active
> named: active
> dns: active
>
> == Contrail analytics ==
> snmp-collector: initializing (Database:
> query-engine: active
> api: active
> alarm-gen: initializing (Database:
> nodemgr: active
> collector: initializing (Database:Cassandra connection down)
> topology: initializing (Database:
>
> == Contrail config ==
> api: initializing (Database:
> zookeeper: active
> svc-monitor: backup
> nodemgr: active
> device-manager: backup
> cassandra: active
> rabbitmq: active
> schema: backup
>
> == Contrail webui ==
> web: active
> job: active
>
> == Contrail database ==
> kafka: active
> nodemgr: active
> zookeeper: active
> cassandra: active
>
> [root@node-
> total used free shared buffers
> cache available
> Mem: 15G 11G 3.3G 28M 0B
> 892M 3.7G
> Swap: 0B 0B 0B
>
>
> Regards,
> Andrey Pavlov.
>
> On Tue, Apr 17, 2018 at 3:57 PM, Michael Henkel <email address hidden>
> wrote:
>
>> Pulkit,
>>
>> How many resources did you assign to your instances?
>>
>> Regards,
>> Michael
>>
>> Am 17.04.2018 um 05:37 schrieb Pulkit Tandon <email address hidden>:
>>
>> Hi All,
>>
>>
>>
>> I need your help and expertise debugging the k8s sanity setup which is in
>> really bad state. Things are messier starting build 15.
>>
>> I observed multiple problems on current attempt. Not sure if they are
>> linked or all are different.
>>
>> Kept the setup in same setup so that you can debug the failures on live
>> setup.
>>
>>
>>
>> *K8s HA Setup details:*
>>
>> 3 Controller+kube managers:
>>
>> 10.204.
>>
>> 10.204.
>>
>> 10.204.
>>
>> 2 Agents/ k8s slave:
>>
>> 10.204.
>>
>> 10.204.
>>
>> Multi interface setup
>>
>>
>>
>> Following are key observations:
>>
>> 1. RabbitMQ cluster formed between nodeg12 and nodeg31. Nodec58
>> has rabbitmq as inactive.
>>
>> rabbitmq: inactive
>>
>> ...
Andrey Pavlov (apavlov-e) wrote : | #10 |
btw, a memory change for cassandra was merged recently -
https:/
Regards,
Andrey Pavlov.
Michael Henkel (mhenkel-3) wrote : | #11 |
And since then we have had the cassandra problems? The symptoms clearly point towards a memory shortage.
We have to expose the heap size as a parameter, otherwise Java will consume memory unchecked.
Regards,
Michael
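To confirm whether any heap-size parameter is actually being passed into the Cassandra container, its environment can be inspected; the container name is again an assumption.
```
# Dump the container environment and look for JVM/heap related settings.
docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' \
    config_database_cassandra_1 | grep -Ei 'xms|xmx|heap|jvm'
```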
Andrey Pavlov (apavlov-e) wrote : | #12 |
Alexey added JVM_EXTRA_OPTS to cassandra's container here
https:/
Now I'm checking this way
https:/
Regards,
Andrey Pavlov.
Michael Henkel (mhenkel-3) wrote : | #13 |
ok, let me know how it goes.
Regards,
Michael
Sundaresan Rajangam (srajanga) wrote : | #14 |
@Michael, Andrey, setting -Xms1g -Xmx2g is not appropriate. Xms and Xmx should be set to the same value, and analytics cassandra requires at least an 8g heap; it is computed by cassandra-env.sh based on the available memory. I made this change https:/
to avoid hardcoding Xms and Xmx to 1g and 2g respectively. Many folks reported cassandra raising an OutOfMemoryError: Java heap space exception after setting Xms and Xmx to 1g and 2g respectively.
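For context, cassandra-env.sh only auto-computes the heap when MAX_HEAP_SIZE / HEAP_NEWSIZE are left unset; a sketch of pinning them explicitly for an analytics Cassandra node, with illustrative values (cassandra-env.sh requires both to be set together):
```
# MAX_HEAP_SIZE sets -Xms and -Xmx to the same value, as recommended above.
export MAX_HEAP_SIZE="8G"
export HEAP_NEWSIZE="2G"
```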
Andrey Pavlov (apavlov-e) wrote : | #15 |
Michael, it helped for me.
Regards,
Andrey.
Pulkit Tandon (pulkitt) wrote : | #16 |
Assuming that setting JVM_EXTRA_OPTS: "-Xms1g -Xmx2g" under contrail_
I will set it in the next run.
Can you please explain what this value is and how it helps?
Apart from this, I am facing RabbitMQ issues, crash issues, Pod creation issues and CNI issues.
Can anyone help debug those?
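A sketch of where such a setting would sit in the instances.yaml shown earlier; the expansion of the truncated `contrail_` key to `contrail_configuration` and the heap values themselves are assumptions, not a confirmed recommendation.
```
contrail_configuration:            # assumed expansion of the truncated "contrail_" key above
  CONTRAIL_VERSION: ocata-5.0-15
  JVM_EXTRA_OPTS: "-Xms1g -Xmx2g"  # value under discussion; equal and larger Xms/Xmx may be preferable
```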
Michael Henkel (mhenkel-3) wrote : | #17 |
The setting limits the amount of memory Java can grab. If you do not provide sufficient resources and do not make that setting, it can have all sorts of side effects. Solve the memory consumption problem first and then check the other issues.
Regards,
Michael
Pulkit Tandon (pulkitt) wrote : | #18 |
Just an update:
For the vrouter crash, Hari has a fix which has already been verified on my setup.
Post fix, the Pod creation issue is also resolved.
https:/
Thanks!
Pulkit Tandon
Dinesh Bakiaraj (dineshb) wrote : | #19 |
This issue was a side effect of the missing Java memory config for Cassandra.
That was addressed by a change in the provisioning options.
Once that is done, there is no functional issue.
But there is a defensive check we could add that will prevent a backtrace in kube-manager when such issues show up.
This has no functional impact, so I am reducing the severity of the bug.
OpenContrail Admin (ci-admin-f) wrote : A change has been merged | #20 |
Reviewed: https:/
Committed: http://
Submitter: Zuul v3 CI (<email address hidden>)
Branch: R5.0
commit 7e7b2fd13e54415
Author: dineshb-jnpr <email address hidden>
Date: Mon Apr 16 15:04:01 2018 -0700
Defensive check to handled invalid input.
Defensive check to not initiate any firewall rule delete VNC calls,
if policy or rule info is not provided.
Change-Id: I220f607a766aba
Partial-Bug: #1764493
OpenContrail Admin (ci-admin-f) wrote : | #21 |
Reviewed: https:/
Committed: http://
Submitter: Zuul v3 CI (<email address hidden>)
Branch: master
commit 98168542767a115
Author: dineshb-jnpr <email address hidden>
Date: Mon Apr 16 15:04:01 2018 -0700
Defensive check to handled invalid input.
Defensive check to not initiate any firewall rule delete VNC calls,
if policy or rule info is not provided.
Change-Id: I220f607a766aba
Partial-Bug: #1764493
Pulkit Tandon (pulkitt) wrote : | #22 |
No related issues have been observed over the past many sanity runs.
The most recent sanity run was on R5.0-50.
Hence, closing the bug.
Review in progress for https://review.opencontrail.org/41980
Submitter: Dinesh Bakiaraj (<email address hidden>)