Default Kafka replication factor is always 1

Bug #1888522 reported by Doug Szumski
This bug affects 2 people
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| kolla-ansible | Fix Released | Medium | Doug Szumski | |
| Rocky | New | Medium | Unassigned | |
| Stein | Fix Released | Medium | Mark Goddard | |
| Train | Fix Released | Medium | Mark Goddard | |
| Ussuri | Fix Released | Medium | Mark Goddard | |
| Victoria | Fix Released | Medium | Doug Szumski | |

Bug Description

By default, the replication factor for topics automatically created in Kafka is 1. Monasca relies on this automatic topic creation.

This means that there is only ever one 'in-sync replica': the leader of each partition. When Kafka is deployed in a clustered configuration, the replication factor should be increased so that every partition in the topic has at least one additional replica, allowing a single node in the cluster to fail without the topic becoming unavailable.
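As an illustration, the broker settings involved are `default.replication.factor` and `min.insync.replicas`. The values below are a minimal sketch assuming a three-node cluster; they are not necessarily the exact values kolla-ansible applies:

```
# Illustrative server.properties settings for a 3-node Kafka cluster
# (example values, not the exact defaults applied by the fix below).

# New auto-created topics get this many replicas per partition.
default.replication.factor=3

# Writes with acks=all succeed only while at least this many replicas are in sync,
# so one broker out of three can fail without the topic rejecting writes.
min.insync.replicas=2
```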

This problem didn't show up before because the Kafka client previously used by Monasca ignored the minimum in-sync replicas setting. Now that we use the Confluent Kafka client, we see errors like this in the Monasca logs when deploying Monasca in a clustered configuration (3 nodes):

2020-07-22 13:54:04.446 40 ERROR monasca_api.common.messaging.kafka_publisher Traceback (most recent call last):
2020-07-22 13:54:04.446 40 ERROR monasca_api.common.messaging.kafka_publisher File "/var/lib/kolla/venv/lib/python3.6/site-packages/monasca_api/common/messaging/kafka_publisher.py", line 56, in send_message
2020-07-22 13:54:04.446 40 ERROR monasca_api.common.messaging.kafka_publisher self._producer.publish(self.topic, message)
2020-07-22 13:54:04.446 40 ERROR monasca_api.common.messaging.kafka_publisher File "/var/lib/kolla/venv/lib/python3.6/site-packages/monasca_common/confluent_kafka/producer.py", line 78, in publish
2020-07-22 13:54:04.446 40 ERROR monasca_api.common.messaging.kafka_publisher self._producer.poll(0)
2020-07-22 13:54:04.446 40 ERROR monasca_api.common.messaging.kafka_publisher File "/var/lib/kolla/venv/lib/python3.6/site-packages/monasca_common/confluent_kafka/producer.py", line 53, in delivery_report
2020-07-22 13:54:04.446 40 ERROR monasca_api.common.messaging.kafka_publisher raise confluent_kafka.KafkaException(err)
2020-07-22 13:54:04.446 40 ERROR monasca_api.common.messaging.kafka_publisher cimpl.KafkaException: KafkaError{code=NOT_ENOUGH_REPLICAS,val=19,str="Broker: Not enough in-sync replicas"}
2020-07-22 13:54:04.446 40 ERROR monasca_api.common.messaging.kafka_publisher

Doug Szumski (dszumski)
Changed in kolla-ansible:
assignee: nobody → Doug Szumski (dszumski)
Revision history for this message
Doug Szumski (dszumski) wrote :

Example of how a one-node failure in a three-node cluster could look (the cluster should keep working):

```
(kafka)[kafka@control02 /opt/kafka/bin]$ ./kafka-topics.sh --describe --zookeeper localhost --topic metrics
Topic:metrics PartitionCount:30 ReplicationFactor:2 Configs:
        Topic: metrics Partition: 0 Leader: 1002 Replicas: 1002,1003 Isr: 1002
        Topic: metrics Partition: 1 Leader: 1001 Replicas: 1003,1001 Isr: 1001
        Topic: metrics Partition: 2 Leader: 1001 Replicas: 1001,1002 Isr: 1001,1002
        Topic: metrics Partition: 3 Leader: 1002 Replicas: 1002,1001 Isr: 1002,1001
        Topic: metrics Partition: 4 Leader: 1002 Replicas: 1003,1002 Isr: 1002
        Topic: metrics Partition: 5 Leader: 1001 Replicas: 1001,1003 Isr: 1001
        Topic: metrics Partition: 6 Leader: 1002 Replicas: 1002,1003 Isr: 1002
        Topic: metrics Partition: 7 Leader: 1001 Replicas: 1003,1001 Isr: 1001
        Topic: metrics Partition: 8 Leader: 1001 Replicas: 1001,1002 Isr: 1001,1002
        Topic: metrics Partition: 9 Leader: 1002 Replicas: 1002,1001 Isr: 1002,1001
        Topic: metrics Partition: 10 Leader: 1002 Replicas: 1003,1002 Isr: 1002
        Topic: metrics Partition: 11 Leader: 1001 Replicas: 1001,1003 Isr: 1001
        Topic: metrics Partition: 12 Leader: 1002 Replicas: 1002,1003 Isr: 1002
        Topic: metrics Partition: 13 Leader: 1001 Replicas: 1003,1001 Isr: 1001
        Topic: metrics Partition: 14 Leader: 1001 Replicas: 1001,1002 Isr: 1001,1002
        Topic: metrics Partition: 15 Leader: 1002 Replicas: 1002,1001 Isr: 1002,1001
        Topic: metrics Partition: 16 Leader: 1002 Replicas: 1003,1002 Isr: 1002
        Topic: metrics Partition: 17 Leader: 1001 Replicas: 1001,1003 Isr: 1001
        Topic: metrics Partition: 18 Leader: 1002 Replicas: 1002,1003 Isr: 1002
        Topic: metrics Partition: 19 Leader: 1001 Replicas: 1003,1001 Isr: 1001
        Topic: metrics Partition: 20 Leader: 1001 Replicas: 1001,1002 Isr: 1001,1002
        Topic: metrics Partition: 21 Leader: 1002 Replicas: 1002,1001 Isr: 1002,1001
        Topic: metrics Partition: 22 Leader: 1002 Replicas: 1003,1002 Isr: 1002
        Topic: metrics Partition: 23 Leader: 1001 Replicas: 1001,1003 Isr: 1001
        Topic: metrics Partition: 24 Leader: 1002 Replicas: 1002,1003 Isr: 1002
        Topic: metrics Partition: 25 Leader: 1001 Replicas: 1003,1001 Isr: 1001
        Topic: metrics Partition: 26 Leader: 1001 Replicas: 1001,1002 I...
```

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (master)

Fix proposed to branch: master
Review: https://review.opendev.org/742479

Changed in kolla-ansible:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (master)

Reviewed: https://review.opendev.org/742479
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=a273e28e208eaf7c3d607bff220309ca3b3b0bd7
Submitter: Zuul
Branch: master

commit a273e28e208eaf7c3d607bff220309ca3b3b0bd7
Author: Doug Szumski <email address hidden>
Date: Wed Jul 22 17:18:26 2020 +0100

    Set Kafka default replication factor

    This ensures that when using automatic Kafka topic creation, with more than one
    node in the Kafka cluster, all partitions in the topic are automatically
    replicated. When a single node goes down in a >=3 node cluster, these topics will
    continue to accept writes providing there are at least two insync replicas.

    In a two node cluster, no failures are tolerated. In a three node cluster, only a
    single node failure is tolerated. In a larger cluster the configuration may need
    manual tuning.

    This configuration follows advice given here:

    [1] https://docs.cloudera.com/documentation/kafka/1-2-x/topics/kafka_ha.html#xd_583c10bfdbd326ba-590cb1d1-149e9ca9886--6fec__section_d2t_ff2_lq

    Closes-Bug: #1888522

    Change-Id: I7d38c6ccb22061aa88d9ac6e2e25c3e095fdb8c3
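
For reference, once this change is deployed, a freshly auto-created topic on a three-node cluster should describe with more than one replica and in-sync replica per partition. The output below is illustrative, not captured from a real deployment; broker IDs and values will vary:

```
(kafka)[kafka@control02 /opt/kafka/bin]$ ./kafka-topics.sh --describe --zookeeper localhost --topic metrics
Topic:metrics PartitionCount:30 ReplicationFactor:3 Configs:
        Topic: metrics Partition: 0 Leader: 1002 Replicas: 1002,1003,1001 Isr: 1002,1003,1001
        ...
```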

Changed in kolla-ansible:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/743296

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/743297

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/743298

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/ussuri)

Reviewed: https://review.opendev.org/743296
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=a82e233706e2da87c9ef613eaa04559b66089230
Submitter: Zuul
Branch: stable/ussuri

commit a82e233706e2da87c9ef613eaa04559b66089230
Author: Doug Szumski <email address hidden>
Date: Wed Jul 22 17:18:26 2020 +0100

    Set Kafka default replication factor

    This ensures that when using automatic Kafka topic creation, with more than one
    node in the Kafka cluster, all partitions in the topic are automatically
    replicated. When a single node goes down in a >=3 node cluster, these topics will
    continue to accept writes providing there are at least two insync replicas.

    In a two node cluster, no failures are tolerated. In a three node cluster, only a
    single node failure is tolerated. In a larger cluster the configuration may need
    manual tuning.

    This configuration follows advice given here:

    [1] https://docs.cloudera.com/documentation/kafka/1-2-x/topics/kafka_ha.html#xd_583c10bfdbd326ba-590cb1d1-149e9ca9886--6fec__section_d2t_ff2_lq

    Closes-Bug: #1888522

    Change-Id: I7d38c6ccb22061aa88d9ac6e2e25c3e095fdb8c3
    (cherry picked from commit a273e28e208eaf7c3d607bff220309ca3b3b0bd7)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/train)

Reviewed: https://review.opendev.org/743297
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=410e66eec7dbfcc2b099f015a6d89b5aea38ec1d
Submitter: Zuul
Branch: stable/train

commit 410e66eec7dbfcc2b099f015a6d89b5aea38ec1d
Author: Doug Szumski <email address hidden>
Date: Wed Jul 22 17:18:26 2020 +0100

    Set Kafka default replication factor

    This ensures that when using automatic Kafka topic creation, with more than one
    node in the Kafka cluster, all partitions in the topic are automatically
    replicated. When a single node goes down in a >=3 node cluster, these topics will
    continue to accept writes providing there are at least two insync replicas.

    In a two node cluster, no failures are tolerated. In a three node cluster, only a
    single node failure is tolerated. In a larger cluster the configuration may need
    manual tuning.

    This configuration follows advice given here:

    [1] https://docs.cloudera.com/documentation/kafka/1-2-x/topics/kafka_ha.html#xd_583c10bfdbd326ba-590cb1d1-149e9ca9886--6fec__section_d2t_ff2_lq

    Closes-Bug: #1888522

    Change-Id: I7d38c6ccb22061aa88d9ac6e2e25c3e095fdb8c3
    (cherry picked from commit a273e28e208eaf7c3d607bff220309ca3b3b0bd7)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/stein)

Reviewed: https://review.opendev.org/743298
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=af744bbdbcf62519437a8f59cf50e9543b5d8a5a
Submitter: Zuul
Branch: stable/stein

commit af744bbdbcf62519437a8f59cf50e9543b5d8a5a
Author: Doug Szumski <email address hidden>
Date: Wed Jul 22 17:18:26 2020 +0100

    Set Kafka default replication factor

    This ensures that when using automatic Kafka topic creation, with more than one
    node in the Kafka cluster, all partitions in the topic are automatically
    replicated. When a single node goes down in a >=3 node cluster, these topics will
    continue to accept writes providing there are at least two insync replicas.

    In a two node cluster, no failures are tolerated. In a three node cluster, only a
    single node failure is tolerated. In a larger cluster the configuration may need
    manual tuning.

    This configuration follows advice given here:

    [1] https://docs.cloudera.com/documentation/kafka/1-2-x/topics/kafka_ha.html#xd_583c10bfdbd326ba-590cb1d1-149e9ca9886--6fec__section_d2t_ff2_lq

    Closes-Bug: #1888522

    Change-Id: I7d38c6ccb22061aa88d9ac6e2e25c3e095fdb8c3
    (cherry picked from commit a273e28e208eaf7c3d607bff220309ca3b3b0bd7)

Revision history for this message
Dheeraj Reddy Gruddanti (tui-dheeraj) wrote :

I tried applying the proposed fix for Ussuri/CentOS 8/kolla-ansible 10.0.0, but I'm still seeing the exact same issue.

We have three controller nodes and are deploying Kafka in clustered mode.

ERROR monasca_api.common.messaging.kafka_publisher [req-caf13c3d-9244-44fd-a29b-1345011c505f 3012aa888e2a406886638b91a0724652 d93adba42b7947aaa919bcad9fd1416e - default default] Unknown error.: cimpl.KafkaException: KafkaError{code=NOT_ENOUGH_REPLICAS,val=19,str="Broker: Not enough in-sync replicas"}
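
Possibly relevant here (an assumption about the cause, not a confirmed diagnosis): `default.replication.factor` only affects topics created after the setting is in place, so topics that were auto-created before the fix keep a single replica. Existing topics can be expanded with a partition reassignment, roughly along these lines:

```
# Sketch: raise the replica assignment of an already-created topic.
# Only partition 0 is shown; a real plan would list every partition of the topic.
cat > /tmp/increase-rf.json <<'EOF'
{"version": 1,
 "partitions": [
   {"topic": "metrics", "partition": 0, "replicas": [1001, 1002, 1003]}
 ]}
EOF

# Apply the plan (newer Kafka releases use --bootstrap-server instead of --zookeeper).
./kafka-reassign-partitions.sh --zookeeper localhost \
    --reassignment-json-file /tmp/increase-rf.json --execute
```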

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 8.3.0

This issue was fixed in the openstack/kolla-ansible 8.3.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 10.2.0

This issue was fixed in the openstack/kolla-ansible 10.2.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 9.3.0

This issue was fixed in the openstack/kolla-ansible 9.3.0 release.
