[library] HA deployment has failed on 3-rd controller with error: Timeout of deployment is exceeded (GRE Latency)

Bug #1275754 reported by Anastasia Palkina on 2014-02-03
This bug affects 2 people
Affects | Status | Importance | Assigned to | Milestone
Fuel for OpenStack | | Critical | Fuel Library (Deprecated) |
4.1.x | | Critical | Fuel Library (Deprecated) |
5.0.x | | Critical | Fuel Library (Deprecated) |

Bug Description

"build_id": "2014-02-03_01-17-30",
"ostf_sha": "338ddf840c229918d1df8c6597588b853d02de4c",
"build_number": "72",
"nailgun_sha": "04f17482e97c1c7ee12f7f99bafc2dc9dbfc9a95",
"fuelmain_sha": "cb36a45a6148f742665f4b0b426d69350a5f2243",
"astute_sha": "d002c3bf626cff96a1d4aec9eb92fc4d5f4542c4",
"release": "4.1",
"fuellib_sha": "cd12860ec234260b50f98963a78257b3759441f6"

1. Create new environment (CentOS, HA mode)
2. Add 3 controllers and 2 compute nodes
3. Choose VLAN Manager and 4 networks with size 64
4. Start deployment. It fails on the third controller with the error: Timeout of deployment is exceeded

Anastasia Palkina (apalkina) wrote :
Vladimir Kuklin (vkuklin) wrote :

Analysis of the puppet apply logs shows that a lot of time is spent on tasks that read from disk, and these tasks are extremely slow.
We need logs from another environment to confirm this bug.

Top lines:
01 - 94.18 sec (1.57 min) 4001 (/Stage[main]/Heat::Engine/Service[heat-engine]) Triggered 'refresh' from 36 events
02 - 81.11 sec (1.35 min) 120 (/Stage[netconfig]/L23network::L2/Package[kmod-openvswitch]/ensure) created
03 - 59.20 sec (0.99 min) 3744 (/Stage[main]/Galera/Service[mysql]/enable) enable changed 'true' to 'true'
04 - 56.11 sec (0.94 min) 4000 (/Stage[main]/Heat::Engine/Service[heat-engine]/enable) enable changed 'true' to 'true'
05 - 46.71 sec (0.78 min) 209 (/Firewall[105 nova ]/ensure) created
06 - 38.82 sec (0.65 min) 3591 (/Stage[main]/Rabbitmq::Server/Package[rabbitmq-server]/ensure) created
07 - 31.81 sec (0.53 min) 150 (L3_if_downup[eth2](provider=ruby)) Interface 'eth2' up.
08 - 30.03 sec (0.50 min) 189 (/Stage[corosync_setup]/Osnailyfacter::Cluster_ha::Virtual_ips/Cluster::Virtual_ips[management_old]/Cluster::Virtual_ip[management_old]/Cs_shadow[vip__management_old]/cib) defined 'cib' as 'vip__management_old'
09 - 28.58 sec (0.48 min) 3916 (/Stage[main]/Glance::Api/Glance_cache_config[DEFAULT/auth_url]/ensure) created
10 - 27.07 sec (0.45 min) 3738 (/Stage[main]/Galera/Package[MySQL-server]/ensure) created
Sum Total: 1161.86 sec (19.36 min)
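
To put the ranking above in perspective, here is a minimal Python sketch that adds up the ten listed durations and compares them with the "Sum Total" line. The event descriptions are trimmed for brevity, and the names TOP_LINES, SUM_TOTAL_SEC and top_durations are illustrative only; they are not part of Fuel or Puppet.

```python
import re

# The ten slowest entries from the report above (event descriptions trimmed).
TOP_LINES = """\
01 - 94.18 sec (1.57 min) 4001 (/Stage[main]/Heat::Engine/Service[heat-engine])
02 - 81.11 sec (1.35 min) 120 (/Stage[netconfig]/L23network::L2/Package[kmod-openvswitch]/ensure)
03 - 59.20 sec (0.99 min) 3744 (/Stage[main]/Galera/Service[mysql]/enable)
04 - 56.11 sec (0.94 min) 4000 (/Stage[main]/Heat::Engine/Service[heat-engine]/enable)
05 - 46.71 sec (0.78 min) 209 (/Firewall[105 nova ]/ensure)
06 - 38.82 sec (0.65 min) 3591 (/Stage[main]/Rabbitmq::Server/Package[rabbitmq-server]/ensure)
07 - 31.81 sec (0.53 min) 150 (L3_if_downup[eth2](provider=ruby))
08 - 30.03 sec (0.50 min) 189 (.../Cs_shadow[vip__management_old]/cib)
09 - 28.58 sec (0.48 min) 3916 (/Stage[main]/Glance::Api/Glance_cache_config[DEFAULT/auth_url]/ensure)
10 - 27.07 sec (0.45 min) 3738 (/Stage[main]/Galera/Package[MySQL-server]/ensure)
"""
SUM_TOTAL_SEC = 1161.86  # value of the "Sum Total" line

# Matches the leading "NN - SS.SS sec " part of each ranked line.
LINE_RE = re.compile(r"^(\d+) - ([\d.]+) sec ")


def top_durations(text):
    """Return the duration in seconds of every ranked line in `text`."""
    matches = (LINE_RE.match(line) for line in text.splitlines())
    return [float(m.group(2)) for m in matches if m]


durations = top_durations(TOP_LINES)
top_sum = sum(durations)
share = top_sum / SUM_TOTAL_SEC
print(f"top {len(durations)}: {top_sum:.2f} sec, {share:.1%} of the total")
```

The ten entries add up to roughly 494 seconds, a bit over 40% of the reported total, so the remaining time is spread across many smaller tasks.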

Changed in fuel:
status: New → Incomplete
Anastasia Palkina (apalkina) wrote :

I have one more environment from ISO 72.
Deployment failed on the third controller again.

Anastasia Palkina (apalkina) wrote :
Mike Scherbakov (mihgen) wrote :

Let's analyse where the issue is.

Changed in fuel:
status: Incomplete → Confirmed
Vladimir Kuklin (vkuklin) wrote :

Again, according to the logs, the issue is the performance of the reporter's environment: even the simplest tasks, e.g. running the 'depmod' command, which simply reads a bunch of kernel modules and generates a file, take several minutes. That clearly indicates performance problems. We therefore need logs from another environment, not the reporter's, so I am marking this bug as Incomplete.

Changed in fuel:
status: Confirmed → Incomplete

We are facing similar issues as well, although we are using the GRE implementation.

I found that communication between controller-1 and controller-2 is very fast, whereas communication between 1 and 3 and between 2 and 3 is very slow. Because of this, ceph-deploy times out, and we then see issues with missing keystone tables. Communication from the puppet master to all the controllers is instant. Controller-3 itself does not appear to be slow. Ping between the nodes also works fine, with no packet loss.

See a timeout after 18 minutes:

2014-03-18 23:50:19,123 [ceph_deploy.cli][INFO ] Invoked (1.2.7): /usr/bin/ceph-deploy --overwrite-conf config pull node-1
2014-03-18 23:50:19,123 [ceph_deploy.config][DEBUG ] Checking node-1 for /etc/ceph/ceph.conf
2014-03-18 23:50:19,124 [ceph_deploy.sudo_pushy][DEBUG ] will use a remote connection without sudo
2014-03-19 00:08:09,975 [ceph_deploy.config][ERROR ] Unable to pull /etc/ceph/ceph.conf from node-1
2014-03-19 00:08:09,976 [ceph_deploy][ERROR ] GenericError: Failed to fetch config from 1 hosts
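
The length of the stall can be read directly from the timestamps of the last DEBUG line and the first ERROR line. A small Python sketch of that arithmetic follows; the helper name parse_ts and the variable names are illustrative, and the only assumption is the 'YYYY-MM-DD HH:MM:SS,mmm' timestamp layout visible in the log lines above.

```python
from datetime import datetime

# Timestamp layout used at the start of each ceph-deploy log line above.
FMT = "%Y-%m-%d %H:%M:%S,%f"


def parse_ts(line):
    """Parse the leading 23-character timestamp of a log line."""
    return datetime.strptime(line[:23], FMT)


last_debug = ("2014-03-18 23:50:19,124 [ceph_deploy.sudo_pushy][DEBUG ] "
              "will use a remote connection without sudo")
first_error = ("2014-03-19 00:08:09,975 [ceph_deploy.config][ERROR ] "
               "Unable to pull /etc/ceph/ceph.conf from node-1")

elapsed = parse_ts(first_error) - parse_ts(last_debug)
minutes, seconds = divmod(elapsed.total_seconds(), 60)
print(f"stalled for {int(minutes)} min {seconds:.3f} sec")
```

This works out to just under 18 minutes between the connection attempt and the error.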

SSH between 1 <-> 3 and 2 <-> 3 takes more than 2 minutes, whereas 1 <-> 2 and fuel <-> controllers are instant.

See samples below, where node-1, node-2 & node-4 are controller-1, 2 & 3 respectively.

=====Fuel======
[root@fuel ~]# time ssh node-1 exit
real 0m0.103s
user 0m0.032s
sys 0m0.003s
[root@fuel ~]# time ssh node-2 exit
real 0m0.102s
user 0m0.030s
sys 0m0.004s
[root@fuel ~]# time ssh node-4 exit
real 0m0.103s
user 0m0.029s
sys 0m0.005s

=====Controller-1======
[root@node-1 ~]# time ssh fuel exit
root@fuel's password:
real 0m2.938s
user 0m0.004s
sys 0m0.007s
[root@node-1 ~]# time ssh node-2 exit
real 0m0.083s
user 0m0.013s
sys 0m0.002s
[root@node-1 ~]# time ssh node-4 exit
real 2m7.389s
user 0m0.010s
sys 0m0.005s

=====Controller-2======
[root@node-2 ~]# time ssh fuel exit
root@fuel's password:
real 0m2.997s
user 0m0.006s
sys 0m0.004s
[root@node-2 ~]# time ssh node-1 exit
real 0m0.084s
user 0m0.009s
sys 0m0.006s
[root@node-2 ~]# time ssh node-4 exit
real 2m7.060s
user 0m0.012s
sys 0m0.004s

=====Controller-3======
[root@node-4 ~]# time ssh fuel exit
root@fuel's password:
real 0m2.652s
user 0m0.007s
sys 0m0.003s
[root@node-4 ~]# time ssh node-1 exit
real 2m9.584s
user 0m0.012s
sys 0m0.003s
[root@node-4 ~]# time ssh node-2 exit
real 2m9.586s
user 0m0.013s
sys 0m0.002s
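
The timings above can be reduced to a single suspect with a few lines of Python. This is only a sketch: the MEASURED table is copied from the "real" values in the controller-to-controller transcripts, the 1-second threshold is an arbitrary assumption, and time_ssh and slow_nodes are hypothetical helper names. time_ssh shows how the measurement could be repeated from a node, but it is not called here.

```python
import subprocess
import time


def time_ssh(host, timeout=300):
    """Wall-clock seconds for `ssh <host> exit`, run from the current node."""
    start = time.monotonic()
    # BatchMode=yes makes ssh fail instead of waiting for a password prompt.
    subprocess.run(["ssh", "-o", "BatchMode=yes", host, "exit"],
                   check=False, timeout=timeout)
    return time.monotonic() - start


# (source, destination) -> "real" time in seconds, from the transcripts above.
MEASURED = {
    ("node-1", "node-2"): 0.083,
    ("node-1", "node-4"): 127.389,
    ("node-2", "node-1"): 0.084,
    ("node-2", "node-4"): 127.060,
    ("node-4", "node-1"): 129.584,
    ("node-4", "node-2"): 129.586,
}


def slow_nodes(timings, threshold=1.0):
    """Return the nodes that take part in every pair slower than `threshold`."""
    slow = [set(pair) for pair, secs in timings.items() if secs > threshold]
    return set.intersection(*slow) if slow else set()


print(slow_nodes(MEASURED))
```

With the values from this report, every slow pair involves node-4 (controller-3), while the node-1/node-2 pair stays under a tenth of a second.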

Changed in fuel:
status: Incomplete → Confirmed
Mike Scherbakov (mihgen) on 2014-03-19
Changed in fuel:
milestone: 4.1 → 5.0
Vladimir Kuklin (vkuklin) wrote :

Arminder, could you please collect a diagnostic snapshot and your Fuel version and post them in a separate bug, as your case is not related to this one.

Changed in fuel:
status: Confirmed → Incomplete
Andrew Woodward (xarses) wrote :

It sounds like Neutron VLAN. If you have VLAN-tagged networks, specifically the management network, you need to test with one of the VLAN splinters modes when using CentOS.

Mike Scherbakov (mihgen) wrote :

Please reopen this if you see this again, and ideally get someone in #fuel-dev to debug your env. I'm closing this as Invalid for now.

Changed in fuel:
status: Incomplete → Invalid
Amrita Mande (amande) wrote :

I have got exactly the same thing in my setup with fuel-5.0.

Are there any fixes or patches available for this that I can apply on my system?

Changed in fuel:
status: Invalid → New
Changed in fuel:
milestone: 5.0 → 5.0.1
Changed in fuel:
status: New → Incomplete
status: Incomplete → Confirmed

Amrita, please describe your environment and, if possible, attach a support bundle.

summary: HA deployment has failed on 3-rd controller with error: Timeout of
- deployment is exceeded
+ deployment is exceeded (GRE Latency)
Changed in fuel:
status: Confirmed → Incomplete
Bogdan Dobrelya (bogdando) wrote :

Since this issue initially affected 4.1 and later 5.0, perhaps it should be nominated for all related series as well.

Changed in fuel:
milestone: 5.0.1 → 5.1
status: Incomplete → New
Changed in fuel:
status: New → Incomplete
Dmitry Ilyin (idv1985) on 2014-07-15
summary: - HA deployment has failed on 3-rd controller with error: Timeout of
- deployment is exceeded (GRE Latency)
+ [puppet] HA deployment has failed on 3-rd controller with error: Timeout
+ of deployment is exceeded (GRE Latency)
Dmitry Ilyin (idv1985) on 2014-07-23
summary: - [puppet] HA deployment has failed on 3-rd controller with error: Timeout
- of deployment is exceeded (GRE Latency)
+ [library] HA deployment has failed on 3-rd controller with error:
+ Timeout of deployment is exceeded (GRE Latency)
tags: added: library
Stanislaw Bogatkin (sbogatkin) wrote :

Cannot reproduce this on Fuel 5.1. Deployment was OK. OSTF sanity tests were also OK.
Fuel version:

{"build_id": "2014-08-11_13-56-51",
"ostf_sha": "acf52a59e04fa74d2ed2b68ea225f4d24403b264",
"build_number": "423",
"auth_required": true,
"api": "1.0",
"nailgun_sha": "2741cdc0f0615263db2f176899d406207ec4ac04",
"production": "docker",
"fuelmain_sha": "9d4463400b4924159c978af43855e48bcf2a84b2",
"astute_sha": "b52910642d6de941444901b0f20e95ebbcb2b2e9",
"feature_groups": ["mirantis"],
"release": "5.1",
"fuellib_sha": "d9b93edb53c44900bd5bc2c25e7c8af0a1310645"}

Dmitry Borodaenko (angdraug) wrote :

No reproducer for 2 months (since June 6), back to Invalid.

Changed in fuel:
status: Incomplete → Invalid