Tune mon/osd heartbeat timeouts

Bug #1446391 reported by Jeya ganesh babu J
Affects            Status         Importance  Assigned to         Milestone
Juniper Openstack  (status tracked in Trunk)
  R2.1             Fix Committed  High        Jeya ganesh babu J
  R2.20            Fix Committed  High        Jeya ganesh babu J
  Trunk            Fix Committed  High        Jeya ganesh babu J

Bug Description

The mon/OSD heartbeat timeouts need to be tuned for larger clusters.
The issue was seen in a cluster with 300+ OSDs, where the 20-second heartbeat timeout was not sufficient.
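
As a rough illustration of the kind of tuning involved, the sketch below renders a ceph.conf fragment that raises "osd heartbeat grace" above Ceph's 20-second default. The scaling rule, the cap, and the function names are hypothetical and chosen only for illustration; the values actually used by the fix are in the linked commits and are not reproduced here.

```python
# Hypothetical sketch: NOT the formula used by the actual fix.
DEFAULT_GRACE_SECS = 20  # Ceph's default "osd heartbeat grace"


def heartbeat_grace(num_osds, replica_size, cap_secs=120):
    """Return a heartbeat grace in seconds that grows with cluster size."""
    if num_osds < 1 or replica_size < 1:
        raise ValueError("num_osds and replica_size must be positive")
    # Assumed rule: 10 s extra per full 100 OSDs, 5 s extra per replica
    # beyond the first, capped at cap_secs.
    extra = 10 * (num_osds // 100) + 5 * (replica_size - 1)
    return min(DEFAULT_GRACE_SECS + extra, cap_secs)


def render_ceph_conf(num_osds, replica_size):
    """Render a [global] fragment so both mons and OSDs see the same grace."""
    grace = heartbeat_grace(num_osds, replica_size)
    return "[global]\nosd heartbeat grace = %d\n" % grace


if __name__ == "__main__":
    print(render_ceph_conf(300, 2))
```

The setting is placed under [global] because both the monitors and the OSDs read the grace value, so setting it only in one section can leave the two sides disagreeing.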

Tags: storage ganges
tags: added: ganges
information type: Proprietary → Public
Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : R2.20

Review in progress for https://review.opencontrail.org/10249
Submitter: Jeya ganesh babu (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote :

Review in progress for https://review.opencontrail.org/10250
Submitter: Jeya ganesh babu (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/10249
Committed: http://github.org/Juniper/contrail-provisioning/commit/0e7a9220eeb53d8149362d395145596f57e5e5f7
Submitter: Zuul
Branch: R2.20

commit 0e7a9220eeb53d8149362d395145596f57e5e5f7
Author: Jeya ganesh babu J <email address hidden>
Date: Tue May 12 12:25:38 2015 -0700

Provision fixes for heartbeat and replica

Closes-Bug: #1446391
Closes-Bug: #1447707
Closes-Bug: #1388449
Closes-Bug: #1446396
Closes-Bug: #1454898
Issues:
  OSDs flap because of an insufficient heartbeat timeout on large clusters.
  The configured replica is overwritten when upgrade or setup_storage is run again.
  Live migration provisioning doesn't work if there are multiple subnets.
  Upgrade or setup storage creates new mons when the storage-compute order changes in testbed.py.
  If only ssd-disks is specified, the PGs are stuck.
Fix:
  Configured the heartbeat based on the replica size.
  Added a configuration variable 'storage_replica_size' in testbed.py to specify the replica.
  Added a fix to support multiple subnets for live migration.
  The current monitors were not taken into account for the total monitors; fix added to take existing monitors into account.
  If there is only 'ssd-disks', code added to treat it as 'disks'.

Change-Id: I6a373416209756e14242ca437ede32db03d9d785
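
A minimal sketch of how a provisioning script might read the replica setting, assuming 'storage_replica_size' is a plain module-level variable in testbed.py. The helper name, the fallback default of 2, and the validation are assumptions made for illustration, not the code from the commit above.

```python
import types

# Assumed fallback when testbed.py does not define the variable.
DEFAULT_REPLICA_SIZE = 2


def get_replica_size(testbed):
    """Return the replica size from a testbed-like object, or the default."""
    value = getattr(testbed, "storage_replica_size", None)
    if value is None:
        # No replica configured: fall back instead of failing.
        return DEFAULT_REPLICA_SIZE
    size = int(value)
    if size < 1:
        raise ValueError("storage_replica_size must be at least 1")
    return size


if __name__ == "__main__":
    # Stand-in for an imported testbed.py module.
    testbed = types.SimpleNamespace(storage_replica_size=3)
    print(get_replica_size(testbed))
```

Reading the value from testbed.py on every run, rather than resetting it to a built-in default, is what keeps a repeated upgrade or setup_storage from overwriting the operator's choice.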

OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/10250
Committed: http://github.org/Juniper/contrail-fabric-utils/commit/134b9ed5d17a3e701cf5471b52206c413549e4df
Submitter: Zuul
Branch: R2.20

commit 134b9ed5d17a3e701cf5471b52206c413549e4df
Author: Jeya ganesh babu J <email address hidden>
Date: Tue May 12 12:34:21 2015 -0700

Provision fixes for heartbeat and replica

Closes-Bug: #1446391
Closes-Bug: #1447707
Issues:
  OSDs flap because of an insufficient heartbeat timeout on large clusters.
  The configured replica is overwritten when upgrade or setup_storage is run again.
Fix:
  Configured the heartbeat based on the replica size.
  Added a configuration variable 'storage_replica_size' in testbed.py to specify the replica.

Change-Id: I9aac6eebee5e50ca5370b2d49e950e1ee159bbb9

OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/11200
Submitter: Jeya ganesh babu (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote :

Review in progress for https://review.opencontrail.org/11201
Submitter: Jeya ganesh babu (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/11200
Committed: http://github.org/Juniper/contrail-fabric-utils/commit/637303e2f12adcc40f5ccd28451ea5a5e403681a
Submitter: Zuul
Branch: master

commit 637303e2f12adcc40f5ccd28451ea5a5e403681a
Author: Jeya ganesh babu J <email address hidden>
Date: Tue Jun 2 14:27:18 2015 -0700

Storage provision fix merge

Closes-Bug: #1446391
Closes-Bug: #1447707
Closes-Bug: #1456880
Closes-Bug: #1455259
Closes-Bug: #1458281
Closes-Bug: #1459485
Issues:
  OSDs flap because of an insufficient heartbeat timeout on large clusters.
  The configured replica is overwritten when upgrade or setup_storage is run again.
  If no replica configuration is provided in the testbed, setup storage fails.
  The current code only checks the release numbers; it doesn't check the build numbers.
Fix:
  Configured the heartbeat based on the replica size.
  Added a configuration variable 'storage_replica_size' in testbed.py to specify the replica.
  Fixed the replica parse logic.
  Added checks to avoid provisioning live migration on nodes that have a hypervisor other than kvm.
  Added a check for release number and build number, and used storage-package instead of contrail-install-package for the versions.
  Added a separate command for base openstack nova live-migration.

Change-Id: I06279b0d1b25ac0888fab9e6831533a58976eb68
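
The release-versus-build distinction mentioned in the commit above can be sketched as follows. The "release-build" string format (for example "2.20-64") and the function names are assumptions for illustration; the actual package version format and comparison code are in the linked commit.

```python
def parse_version(version):
    """Split an assumed 'release-build' string into comparable parts."""
    release, _, build = version.partition("-")
    release_parts = tuple(int(part) for part in release.split("."))
    # A missing build number is treated as build 0.
    return release_parts, int(build or 0)


def is_newer(candidate, installed):
    """True if candidate is a later release, or the same release with a later build."""
    return parse_version(candidate) > parse_version(installed)


if __name__ == "__main__":
    # Same release, different builds: a release-only check would call these equal.
    print(is_newer("2.20-64", "2.20-53"))
```

Comparing the release as a tuple of integers also avoids the string-comparison trap where "2.20" would sort before "2.3".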

OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/11201
Committed: http://github.org/Juniper/contrail-provisioning/commit/cadb798e1cd4a38cf3c9f3088ce6da784fb0cda7
Submitter: Zuul
Branch: master

commit cadb798e1cd4a38cf3c9f3088ce6da784fb0cda7
Author: Jeya ganesh babu J <email address hidden>
Date: Tue Jun 2 14:32:43 2015 -0700

Storage provision fix merge

Closes-Bug: #1446391
Closes-Bug: #1447707
Closes-Bug: #1388449
Closes-Bug: #1446396
Closes-Bug: #1454898
Closes-Bug: #1457704
Closes-Bug: #1459835
Closes-Bug: #1460730
Issues:
  OSDs flap because of an insufficient heartbeat timeout on large clusters.
  The configured replica is overwritten when upgrade or setup_storage is run again.
  Live migration provisioning doesn't work if there are multiple subnets.
  Upgrade or setup storage creates new mons when the storage-compute order changes in testbed.py.
  If only ssd-disks is specified, the PGs are stuck.
  When an image is added with the http client, glance add fails.
  If the OSD is not running, remove disk fails because it tries to stop the OSD.
Fix:
  Configured the heartbeat based on the replica size.
  Added a configuration variable 'storage_replica_size' in testbed.py to specify the replica.
  Added a fix to support multiple subnets for live migration.
  The current monitors were not taken into account for the total monitors; fix added to take existing monitors into account.
  If there is only 'ssd-disks', code added to treat it as 'disks'.
  The known store configuration is set to use only rbd. This causes even the glance client to use only rbd, blocking http access.
  The quota for cinder is to be set based on the total space and not the current available space.
  Check added to stop the OSD only if it is running.

Change-Id: I96a9a070eea1e0461c71566a3889a76f59828ef3

OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.1

Review in progress for https://review.opencontrail.org/12112
Submitter: Jeya ganesh babu (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote :

Review in progress for https://review.opencontrail.org/12113
Submitter: Jeya ganesh babu (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/12112
Committed: http://github.org/Juniper/contrail-fabric-utils/commit/4c1bd8f9ebe35b899140929fb235386e45485af5
Submitter: Zuul
Branch: R2.1

commit 4c1bd8f9ebe35b899140929fb235386e45485af5
Author: Jeya ganesh babu J <email address hidden>
Date: Tue Jun 30 13:52:21 2015 -0700

Storage provision fix merge

Closes-Bug: #1446391
Closes-Bug: #1447707
Closes-Bug: #1456880
Closes-Bug: #1458281
Closes-Bug: #1463170
Issues:
  OSDs flap because of an insufficient heartbeat timeout on large clusters.
  The configured replica is overwritten when upgrade or setup_storage is run again.
  If no replica configuration is provided in the testbed, setup storage fails.
  The current code only checks the release numbers; it doesn't check the build numbers.
  When provisioning external NFS based live-migration, the provision script still checks for the live migration VM image that is used for image transfer in ceph based provisioning.
Fix:
  Configured the heartbeat based on the replica size.
  Added a configuration variable 'storage_replica_size' in testbed.py to specify the replica.
  Fixed the replica parse logic.
  Added a check for release number and build number, and used storage-package instead of contrail-install-package for the versions.
  Check added to ignore the VM image transfer in case of non-NFS based live-migration or external NFS based live-migration.

Change-Id: I845536a161d883d9ed62b0340c0465d8f17923ce

OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/12113
Committed: http://github.org/Juniper/contrail-provisioning/commit/da76a8e21b91029acb3f12c31e34384b04dc890f
Submitter: Zuul
Branch: R2.1

commit da76a8e21b91029acb3f12c31e34384b04dc890f
Author: Jeya ganesh babu J <email address hidden>
Date: Tue Jun 30 13:56:14 2015 -0700

Storage provision fix merge

Closes-Bug: #1446391
Closes-Bug: #1447707
Closes-Bug: #1388449
Closes-Bug: #1446396
Closes-Bug: #1454898
Closes-Bug: #1459835
Closes-Bug: #1460730
Issues:
  OSDs flap because of an insufficient heartbeat timeout on large clusters.
  The configured replica is overwritten when upgrade or setup_storage is run again.
  Live migration provisioning doesn't work if there are multiple subnets.
  Upgrade or setup storage creates new mons when the storage-compute order changes in testbed.py.
  If only ssd-disks is specified, the PGs are stuck.
  If the OSD is not running, remove disk fails because it tries to stop the OSD.
Fix:
  Configured the heartbeat based on the replica size.
  Added a configuration variable 'storage_replica_size' in testbed.py to specify the replica.
  Added a fix to support multiple subnets for live migration.
  The current monitors were not taken into account for the total monitors; fix added to take existing monitors into account.
  If there is only 'ssd-disks', code added to treat it as 'disks'.
  The quota for cinder is to be set based on the total space and not the current available space.
  Check added to stop the OSD only if it is running.

Change-Id: Ic2555991dc2d1597b867117b7229a7218857b1b9
