Default Kafka replication factor is always 1

Bug #1888522 reported by Doug Szumski
This bug affects 2 people
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| kolla-ansible | Fix Released | Medium | Doug Szumski | |
| Rocky | New | Medium | Unassigned | |
| Stein | Fix Released | Medium | Mark Goddard | |
| Train | Fix Released | Medium | Mark Goddard | |
| Ussuri | Fix Released | Medium | Mark Goddard | |
| Victoria | Fix Released | Medium | Doug Szumski | |

Bug Description

By default, the replication factor for topics automatically created in Kafka is 1. Monasca relies on this automatic topic creation.

This means that there is only ever one 'in-sync replica': the leader of each partition. When Kafka is deployed in a clustered configuration, the replication factor should be increased so that every partition in the topic has at least one additional replica, allowing a single node in the cluster to fail without the topic becoming unavailable.
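As an illustration, the broker settings involved are `default.replication.factor` and `min.insync.replicas`. The values below are a minimal sketch assuming a three-node cluster; they are not necessarily the exact values kolla-ansible applies:

```
# Illustrative server.properties settings for a 3-node Kafka cluster
# (example values, not the exact defaults applied by the fix below).

# New auto-created topics get this many replicas per partition.
default.replication.factor=3

# Writes with acks=all succeed only while at least this many replicas are in sync,
# so one broker out of three can fail without the topic rejecting writes.
min.insync.replicas=2
```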

This problem didn't show up before because the Kafka client previously used by Monasca ignored the minimum in-sync replicas setting. Now that we use the Confluent Kafka client, we see errors like this in the Monasca logs when deploying Monasca in a clustered configuration (3 nodes):

2020-07-22 13:54:04.446 40 ERROR monasca_api.common.messaging.kafka_publisher Traceback (most recent call last):
2020-07-22 13:54:04.446 40 ERROR monasca_api.common.messaging.kafka_publisher File "/var/lib/kolla/venv/lib/python3.6/site-packages/monasca_api/common/messaging/kafka_publisher.py", line 56, in send_message
2020-07-22 13:54:04.446 40 ERROR monasca_api.common.messaging.kafka_publisher self._producer.publish(self.topic, message)
2020-07-22 13:54:04.446 40 ERROR monasca_api.common.messaging.kafka_publisher File "/var/lib/kolla/venv/lib/python3.6/site-packages/monasca_common/confluent_kafka/producer.py", line 78, in publish
2020-07-22 13:54:04.446 40 ERROR monasca_api.common.messaging.kafka_publisher self._producer.poll(0)
2020-07-22 13:54:04.446 40 ERROR monasca_api.common.messaging.kafka_publisher File "/var/lib/kolla/venv/lib/python3.6/site-packages/monasca_common/confluent_kafka/producer.py", line 53, in delivery_report
2020-07-22 13:54:04.446 40 ERROR monasca_api.common.messaging.kafka_publisher raise confluent_kafka.KafkaException(err)
2020-07-22 13:54:04.446 40 ERROR monasca_api.common.messaging.kafka_publisher cimpl.KafkaException: KafkaError{code=NOT_ENOUGH_REPLICAS,val=19,str="Broker: Not enough in-sync replicas"}
2020-07-22 13:54:04.446 40 ERROR monasca_api.common.messaging.kafka_publisher

Doug Szumski (dszumski)
Changed in kolla-ansible:
assignee: nobody → Doug Szumski (dszumski)
Revision history for this message
Doug Szumski (dszumski) wrote :

Example of how a one-node failure in a three-node cluster could look (the cluster should keep working):

```
(kafka)[kafka@control02 /opt/kafka/bin]$ ./kafka-topics.sh --describe --zookeeper localhost --topic metrics
Topic:metrics PartitionCount:30 ReplicationFactor:2 Configs:
        Topic: metrics Partition: 0 Leader: 1002 Replicas: 1002,1003 Isr: 1002
        Topic: metrics Partition: 1 Leader: 1001 Replicas: 1003,1001 Isr: 1001
        Topic: metrics Partition: 2 Leader: 1001 Replicas: 1001,1002 Isr: 1001,1002
        Topic: metrics Partition: 3 Leader: 1002 Replicas: 1002,1001 Isr: 1002,1001
        Topic: metrics Partition: 4 Leader: 1002 Replicas: 1003,1002 Isr: 1002
        Topic: metrics Partition: 5 Leader: 1001 Replicas: 1001,1003 Isr: 1001
        Topic: metrics Partition: 6 Leader: 1002 Replicas: 1002,1003 Isr: 1002
        Topic: metrics Partition: 7 Leader: 1001 Replicas: 1003,1001 Isr: 1001
        Topic: metrics Partition: 8 Leader: 1001 Replicas: 1001,1002 Isr: 1001,1002
        Topic: metrics Partition: 9 Leader: 1002 Replicas: 1002,1001 Isr: 1002,1001
        Topic: metrics Partition: 10 Leader: 1002 Replicas: 1003,1002 Isr: 1002
        Topic: metrics Partition: 11 Leader: 1001 Replicas: 1001,1003 Isr: 1001
        Topic: metrics Partition: 12 Leader: 1002 Replicas: 1002,1003 Isr: 1002
        Topic: metrics Partition: 13 Leader: 1001 Replicas: 1003,1001 Isr: 1001
        Topic: metrics Partition: 14 Leader: 1001 Replicas: 1001,1002 Isr: 1001,1002
        Topic: metrics Partition: 15 Leader: 1002 Replicas: 1002,1001 Isr: 1002,1001
        Topic: metrics Partition: 16 Leader: 1002 Replicas: 1003,1002 Isr: 1002
        Topic: metrics Partition: 17 Leader: 1001 Replicas: 1001,1003 Isr: 1001
        Topic: metrics Partition: 18 Leader: 1002 Replicas: 1002,1003 Isr: 1002
        Topic: metrics Partition: 19 Leader: 1001 Replicas: 1003,1001 Isr: 1001
        Topic: metrics Partition: 20 Leader: 1001 Replicas: 1001,1002 Isr: 1001,1002
        Topic: metrics Partition: 21 Leader: 1002 Replicas: 1002,1001 Isr: 1002,1001
        Topic: metrics Partition: 22 Leader: 1002 Replicas: 1003,1002 Isr: 1002
        Topic: metrics Partition: 23 Leader: 1001 Replicas: 1001,1003 Isr: 1001
        Topic: metrics Partition: 24 Leader: 1002 Replicas: 1002,1003 Isr: 1002
        Topic: metrics Partition: 25 Leader: 1001 Replicas: 1003,1001 Isr: 1001
        Topic: metrics Partition: 26 Leader: 1001 Replicas: 1001,1002 I...
```

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (master)

Fix proposed to branch: master
Review: https://review.opendev.org/742479

Changed in kolla-ansible:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (master)

Reviewed: https://review.opendev.org/742479
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=a273e28e208eaf7c3d607bff220309ca3b3b0bd7
Submitter: Zuul
Branch: master

commit a273e28e208eaf7c3d607bff220309ca3b3b0bd7
Author: Doug Szumski <email address hidden>
Date: Wed Jul 22 17:18:26 2020 +0100

    Set Kafka default replication factor

    This ensures that when using automatic Kafka topic creation, with more than one
    node in the Kafka cluster, all partitions in the topic are automatically
    replicated. When a single node goes down in a >=3 node cluster, these topics will
    continue to accept writes providing there are at least two insync replicas.

    In a two node cluster, no failures are tolerated. In a three node cluster, only a
    single node failure is tolerated. In a larger cluster the configuration may need
    manual tuning.

    This configuration follows advice given here:

    [1] https://docs.cloudera.com/documentation/kafka/1-2-x/topics/kafka_ha.html#xd_583c10bfdbd326ba-590cb1d1-149e9ca9886--6fec__section_d2t_ff2_lq

    Closes-Bug: #1888522

    Change-Id: I7d38c6ccb22061aa88d9ac6e2e25c3e095fdb8c3
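
For reference, once this change is deployed, a freshly auto-created topic on a three-node cluster should describe with more than one replica and in-sync replica per partition. The output below is illustrative, not captured from a real deployment; broker IDs and values will vary:

```
(kafka)[kafka@control02 /opt/kafka/bin]$ ./kafka-topics.sh --describe --zookeeper localhost --topic metrics
Topic:metrics PartitionCount:30 ReplicationFactor:3 Configs:
        Topic: metrics Partition: 0 Leader: 1002 Replicas: 1002,1003,1001 Isr: 1002,1003,1001
        ...
```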

Changed in kolla-ansible:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/743296

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/743297

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/743298

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/ussuri)

Reviewed: https://review.opendev.org/743296
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=a82e233706e2da87c9ef613eaa04559b66089230
Submitter: Zuul
Branch: stable/ussuri

commit a82e233706e2da87c9ef613eaa04559b66089230
Author: Doug Szumski <email address hidden>
Date: Wed Jul 22 17:18:26 2020 +0100

    Set Kafka default replication factor

    This ensures that when using automatic Kafka topic creation, with more than one
    node in the Kafka cluster, all partitions in the topic are automatically
    replicated. When a single node goes down in a >=3 node cluster, these topics will
    continue to accept writes providing there are at least two insync replicas.

    In a two node cluster, no failures are tolerated. In a three node cluster, only a
    single node failure is tolerated. In a larger cluster the configuration may need
    manual tuning.

    This configuration follows advice given here:

    [1] https://docs.cloudera.com/documentation/kafka/1-2-x/topics/kafka_ha.html#xd_583c10bfdbd326ba-590cb1d1-149e9ca9886--6fec__section_d2t_ff2_lq

    Closes-Bug: #1888522

    Change-Id: I7d38c6ccb22061aa88d9ac6e2e25c3e095fdb8c3
    (cherry picked from commit a273e28e208eaf7c3d607bff220309ca3b3b0bd7)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/train)

Reviewed: https://review.opendev.org/743297
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=410e66eec7dbfcc2b099f015a6d89b5aea38ec1d
Submitter: Zuul
Branch: stable/train

commit 410e66eec7dbfcc2b099f015a6d89b5aea38ec1d
Author: Doug Szumski <email address hidden>
Date: Wed Jul 22 17:18:26 2020 +0100

    Set Kafka default replication factor

    This ensures that when using automatic Kafka topic creation, with more than one
    node in the Kafka cluster, all partitions in the topic are automatically
    replicated. When a single node goes down in a >=3 node cluster, these topics will
    continue to accept writes providing there are at least two insync replicas.

    In a two node cluster, no failures are tolerated. In a three node cluster, only a
    single node failure is tolerated. In a larger cluster the configuration may need
    manual tuning.

    This configuration follows advice given here:

    [1] https://docs.cloudera.com/documentation/kafka/1-2-x/topics/kafka_ha.html#xd_583c10bfdbd326ba-590cb1d1-149e9ca9886--6fec__section_d2t_ff2_lq

    Closes-Bug: #1888522

    Change-Id: I7d38c6ccb22061aa88d9ac6e2e25c3e095fdb8c3
    (cherry picked from commit a273e28e208eaf7c3d607bff220309ca3b3b0bd7)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/stein)

Reviewed: https://review.opendev.org/743298
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=af744bbdbcf62519437a8f59cf50e9543b5d8a5a
Submitter: Zuul
Branch: stable/stein

commit af744bbdbcf62519437a8f59cf50e9543b5d8a5a
Author: Doug Szumski <email address hidden>
Date: Wed Jul 22 17:18:26 2020 +0100

    Set Kafka default replication factor

    This ensures that when using automatic Kafka topic creation, with more than one
    node in the Kafka cluster, all partitions in the topic are automatically
    replicated. When a single node goes down in a >=3 node cluster, these topics will
    continue to accept writes providing there are at least two insync replicas.

    In a two node cluster, no failures are tolerated. In a three node cluster, only a
    single node failure is tolerated. In a larger cluster the configuration may need
    manual tuning.

    This configuration follows advice given here:

    [1] https://docs.cloudera.com/documentation/kafka/1-2-x/topics/kafka_ha.html#xd_583c10bfdbd326ba-590cb1d1-149e9ca9886--6fec__section_d2t_ff2_lq

    Closes-Bug: #1888522

    Change-Id: I7d38c6ccb22061aa88d9ac6e2e25c3e095fdb8c3
    (cherry picked from commit a273e28e208eaf7c3d607bff220309ca3b3b0bd7)

Revision history for this message
Dheeraj Reddy Gruddanti (tui-dheeraj) wrote :

I tried applying the proposed fix for Ussuri/CentOS 8/kolla-ansible 10.0.0, but I'm still seeing the exact same issue.

We have three controller nodes and are deploying Kafka in clustered mode.

ERROR monasca_api.common.messaging.kafka_publisher [req-caf13c3d-9244-44fd-a29b-1345011c505f 3012aa888e2a406886638b91a0724652 d93adba42b7947aaa919bcad9fd1416e - default default] Unknown error.: cimpl.KafkaException: KafkaError{code=NOT_ENOUGH_REPLICAS,val=19,str="Broker: Not enough in-sync replicas"}
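
Possibly relevant here (an assumption about the cause, not a confirmed diagnosis): `default.replication.factor` only affects topics created after the setting is in place, so topics that were auto-created before the fix keep a single replica. Existing topics can be expanded with a partition reassignment, roughly along these lines:

```
# Sketch: raise the replica assignment of an already-created topic.
# Only partition 0 is shown; a real plan would list every partition of the topic.
cat > /tmp/increase-rf.json <<'EOF'
{"version": 1,
 "partitions": [
   {"topic": "metrics", "partition": 0, "replicas": [1001, 1002, 1003]}
 ]}
EOF

# Apply the plan (newer Kafka releases use --bootstrap-server instead of --zookeeper).
./kafka-reassign-partitions.sh --zookeeper localhost \
    --reassignment-json-file /tmp/increase-rf.json --execute
```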

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 8.3.0

This issue was fixed in the openstack/kolla-ansible 8.3.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 10.2.0

This issue was fixed in the openstack/kolla-ansible 10.2.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 9.3.0

This issue was fixed in the openstack/kolla-ansible 9.3.0 release.
