Fuel for OpenStack

[library] HA deployment has failed on 3-rd controller with error: Timeout of deployment is exceeded (GRE Latency)

Bug #1275754 reported by Anastasia Palkina on 2014-02-03

This bug affects 2 people

	Status	Importance	Assigned to	Milestone
Fuel for OpenStack	Invalid	Critical	Fuel Library (Deprecated)	Fuel for OpenStack 5.1
4.1.x	Invalid	Critical	Fuel Library (Deprecated)	Fuel for OpenStack 4.1.1-updates
5.0.x	Invalid	Critical	Fuel Library (Deprecated)	Fuel for OpenStack 5.0.2

Bug Description

"build_id": "2014-02-03_01-17-30",
"ostf_sha": "338ddf840c229918d1df8c6597588b853d02de4c",
"build_number": "72",
"nailgun_sha": "04f17482e97c1c7ee12f7f99bafc2dc9dbfc9a95",
"fuelmain_sha": "cb36a45a6148f742665f4b0b426d69350a5f2243",
"astute_sha": "d002c3bf626cff96a1d4aec9eb92fc4d5f4542c4",
"release": "4.1",
"fuellib_sha": "cd12860ec234260b50f98963a78257b3759441f6"

1. Create new environment (CentOS, HA mode)
2. Add 3 controllers and 2 compute nodes
3. Choose VLAN Manager and 4 networks with size 64
4. Start deployment. It fails on the 3rd controller with the error: Timeout of deployment is exceeded

Tags:

Revision history for this message

Anastasia Palkina (apalkina) wrote on 2014-02-03:

fuel-snapshot-2014-02-03_12-11-59.tgz (5.9 MiB, application/x-tar)

Vladimir Kuklin (vkuklin) wrote on 2014-02-03:

Analysis of the puppet apply logs shows that a lot of time is spent on tasks that read from disk, and these are extremely slow.
We need logs from another env to confirm this bug.

Top lines:
01 - 94.18 sec (1.57 min) 4001 (/Stage[main]/Heat::Engine/Service[heat-engine]) Triggered 'refresh' from 36 events
02 - 81.11 sec (1.35 min) 120 (/Stage[netconfig]/L23network::L2/Package[kmod-openvswitch]/ensure) created
03 - 59.20 sec (0.99 min) 3744 (/Stage[main]/Galera/Service[mysql]/enable) enable changed 'true' to 'true'
04 - 56.11 sec (0.94 min) 4000 (/Stage[main]/Heat::Engine/Service[heat-engine]/enable) enable changed 'true' to 'true'
05 - 46.71 sec (0.78 min) 209 (/Firewall[105 nova ]/ensure) created
06 - 38.82 sec (0.65 min) 3591 (/Stage[main]/Rabbitmq::Server/Package[rabbitmq-server]/ensure) created
07 - 31.81 sec (0.53 min) 150 (L3_if_downup[eth2](provider=ruby)) Interface 'eth2' up.
08 - 30.03 sec (0.50 min) 189 (/Stage[corosync_setup]/Osnailyfacter::Cluster_ha::Virtual_ips/Cluster::Virtual_ips[management_old]/Cluster::Virtual_ip[management_old]/Cs_shadow[vip__management_old]/cib) defined 'cib' as 'vip__management_old'
09 - 28.58 sec (0.48 min) 3916 (/Stage[main]/Glance::Api/Glance_cache_config[DEFAULT/auth_url]/ensure) created
10 - 27.07 sec (0.45 min) 3738 (/Stage[main]/Galera/Package[MySQL-server]/ensure) created
Sum Total: 1161.86 sec (19.36 min)
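
To put the ten entries above in context, their durations can be summed and compared with the reported total. The following is a minimal Python sketch, not part of any Fuel or Puppet tool: the regular expression and the names `TOP_LINES`, `REPORTED_TOTAL_SEC` and `top_durations` are illustrative assumptions based only on the line format quoted above.

```python
import re

# Leading fields of the ten "Top lines" entries quoted above
# (the Puppet resource paths are trimmed for brevity).
TOP_LINES = """\
01 - 94.18 sec (1.57 min) 4001
02 - 81.11 sec (1.35 min) 120
03 - 59.20 sec (0.99 min) 3744
04 - 56.11 sec (0.94 min) 4000
05 - 46.71 sec (0.78 min) 209
06 - 38.82 sec (0.65 min) 3591
07 - 31.81 sec (0.53 min) 150
08 - 30.03 sec (0.50 min) 189
09 - 28.58 sec (0.48 min) 3916
10 - 27.07 sec (0.45 min) 3738
"""

# "Sum Total" value reported in the comment above.
REPORTED_TOTAL_SEC = 1161.86

# Assumed line shape: "<rank> - <seconds> sec ..."
LINE_RE = re.compile(r"^(\d+) - (\d+\.\d+) sec")


def top_durations(text):
    """Return the per-entry durations (in seconds) found in the text."""
    durations = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            durations.append(float(match.group(2)))
    return durations


durations = top_durations(TOP_LINES)
top_sum = sum(durations)
share = top_sum / REPORTED_TOTAL_SEC
print(f"top {len(durations)}: {top_sum:.2f} sec ({share:.1%} of reported total)")
```

By this arithmetic the ten slowest entries account for a bit under half of the reported 19.36 minutes, so the remaining time is spread over many smaller tasks.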

Changed in fuel:
status:	New → Incomplete

Anastasia Palkina (apalkina) wrote on 2014-02-04:

I have one more environment from ISO 72.
Deployment has failed on 3-rd controller again.

Anastasia Palkina (apalkina) wrote on 2014-02-04:

fuel-snapshot-2014-02-04_08-18-21.tgz (32.4 MiB, application/x-tar)

Mike Scherbakov (mihgen) wrote on 2014-02-08:

Let's analyse where the issue is.

Changed in fuel:
status:	Incomplete → Confirmed

Vladimir Kuklin (vkuklin) wrote on 2014-02-08:

Again, according to the logs, the issue is with the performance of the reporter's environment: even the simplest tasks, e.g. running the 'depmod' command, which simply reads a bunch of kernel modules and generates a file, take several minutes. That clearly indicates performance problems. Thus we need logs from another env, not the reporter's one, and I am marking this bug as Incomplete.

Changed in fuel:
status:	Confirmed → Incomplete

Arminder Singh Girgla (arminder) wrote on 2014-03-19:

We are facing similar issues too, though we are using the GRE implementation.

I found that communication between controller-1 and controller-2 is very fast, whereas communication between 1 and 3 and between 2 and 3 is very slow. Because of this, ceph-deploy times out, and then there are issues with missing keystone tables. Communication from the puppet master is instant for all the controllers. There isn't any slowness as such with controller-3 itself. Ping between the nodes also works fine, with no packet loss.

See a timeout after 18 minutes:

2014-03-18 23:50:19,123 [ceph_deploy.cli][INFO ] Invoked (1.2.7): /usr/bin/ceph-deploy --overwrite-conf config pull node-1
2014-03-18 23:50:19,123 [ceph_deploy.config][DEBUG ] Checking node-1 for /etc/ceph/ceph.conf
2014-03-18 23:50:19,124 [ceph_deploy.sudo_pushy][DEBUG ] will use a remote connection without sudo
2014-03-19 00:08:09,975 [ceph_deploy.config][ERROR ] Unable to pull /etc/ceph/ceph.conf from node-1
2014-03-19 00:08:09,976 [ceph_deploy][ERROR ] GenericError: Failed to fetch config from 1 hosts
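
The "18 minutes" figure can be checked from the two timestamps on either side of the stall. A small Python sketch, assuming the `YYYY-MM-DD HH:MM:SS,mmm` timestamp format seen in the log above (the variable names are illustrative):

```python
from datetime import datetime

# Timestamp layout used by the ceph-deploy log lines quoted above.
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"

# Last DEBUG line before the stall and the first ERROR line after it.
last_debug = datetime.strptime("2014-03-18 23:50:19,124", LOG_TS_FORMAT)
first_error = datetime.strptime("2014-03-19 00:08:09,975", LOG_TS_FORMAT)

# Subtracting datetimes handles the midnight rollover correctly.
stall = first_error - last_debug
stall_sec = stall.total_seconds()
print(f"stalled for {stall_sec:.3f} sec ({stall_sec / 60:.1f} min)")
```

The gap comes out just under 18 minutes, all of it spent between opening the remote connection and giving up on the config pull.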

SSH between 1 <-> 3 and 2 <-> 3 takes >2 minutes, whereas 1 <-> 2 and fuel <-> controllers are instant.

See samples below, where node-1, node-2 & node-4 are controller-1, 2 & 3 respectively.

=====Fuel======
[root@fuel ~]# time ssh node-1 exit
real 0m0.103s
user 0m0.032s
sys 0m0.003s
[root@fuel ~]# time ssh node-2 exit
real 0m0.102s
user 0m0.030s
sys 0m0.004s
[root@fuel ~]# time ssh node-4 exit
real 0m0.103s
user 0m0.029s
sys 0m0.005s

=====Controller-1======
[root@node-1 ~]# time ssh fuel exit
root@fuel's password:
real 0m2.938s
user 0m0.004s
sys 0m0.007s
[root@node-1 ~]# time ssh node-2 exit
real 0m0.083s
user 0m0.013s
sys 0m0.002s
[root@node-1 ~]# time ssh node-4 exit
real 2m7.389s
user 0m0.010s
sys 0m0.005s

=====Controller-2======
[root@node-2 ~]# time ssh fuel exit
root@fuel's password:
real 0m2.997s
user 0m0.006s
sys 0m0.004s
[root@node-2 ~]# time ssh node-1 exit
real 0m0.084s
user 0m0.009s
sys 0m0.006s
[root@node-2 ~]# time ssh node-4 exit
real 2m7.060s
user 0m0.012s
sys 0m0.004s

=====Controller-3======
[root@node-4 ~]# time ssh fuel exit
root@fuel's password:
real 0m2.652s
user 0m0.007s
sys 0m0.003s
[root@node-4 ~]# time ssh node-1 exit
real 2m9.584s
user 0m0.012s
sys 0m0.003s
[root@node-4 ~]# time ssh node-2 exit
real 2m9.586s
user 0m0.013s
sys 0m0.002s
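
The pattern in the samples above is easier to see once the `real` values are converted to seconds. This is a rough Python sketch that hard-codes the measurements quoted above; the names `SAMPLES`, `real_to_seconds` and `SLOW_THRESHOLD_SEC`, and the 10-second threshold itself, are arbitrary assumptions for illustration. The `ssh fuel` runs are left out because their times include an interactive password prompt.

```python
import re

# (source, destination) -> "real" value copied from the samples above.
SAMPLES = {
    ("fuel", "node-1"): "0m0.103s",
    ("fuel", "node-2"): "0m0.102s",
    ("fuel", "node-4"): "0m0.103s",
    ("node-1", "node-2"): "0m0.083s",
    ("node-1", "node-4"): "2m7.389s",
    ("node-2", "node-1"): "0m0.084s",
    ("node-2", "node-4"): "2m7.060s",
    ("node-4", "node-1"): "2m9.584s",
    ("node-4", "node-2"): "2m9.586s",
}

# Matches the "<minutes>m<seconds>s" form printed by the shell's `time`.
REAL_RE = re.compile(r"^(\d+)m(\d+(?:\.\d+)?)s$")

# Arbitrary cut-off chosen for this illustration only.
SLOW_THRESHOLD_SEC = 10.0


def real_to_seconds(value):
    """Convert a value such as '2m7.389s' to seconds as a float."""
    match = REAL_RE.match(value)
    if match is None:
        raise ValueError(f"unrecognised time value: {value!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


slow_links = sorted(
    pair for pair, real in SAMPLES.items()
    if real_to_seconds(real) > SLOW_THRESHOLD_SEC
)
for src, dst in slow_links:
    print(f"{src} -> {dst}: {real_to_seconds(SAMPLES[(src, dst)]):.3f} sec")
```

Every link flagged this way has node-4 (controller-3) at one end, while the Fuel master reaches node-4 as quickly as the other controllers.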

Changed in fuel:
status:	Incomplete → Confirmed

Mike Scherbakov (mihgen) on 2014-03-19

Changed in fuel:
milestone:	4.1 → 5.0

Vladimir Kuklin (vkuklin) wrote on 2014-03-19:

Arminder, could you please collect a diagnostic snapshot and your Fuel version and post them in a separate bug, as your case is not related to this bug.

Changed in fuel:
status:	Confirmed → Incomplete

Andrew Woodward (xarses) wrote on 2014-03-20:

It sounds like Neutron VLAN. If you have VLAN-tagged networks, specifically the management network, you need to test with one of the VLAN splinters modes when using CentOS.

Mike Scherbakov (mihgen) wrote on 2014-04-17:

#10

Please reopen this if you see this again, and ideally get someone in #fuel-dev to debug your env. I'm closing this as Invalid for now.

Changed in fuel:
status:	Incomplete → Invalid

Amrita Mande (amande) wrote on 2014-06-06:

#11

I have got exactly the same thing in my setup with fuel-5.0.

Is there any fix or patch available for this that I can apply on my system?

Changed in fuel:
status:	Invalid → New

Dmitry Borodaenko (angdraug) on 2014-06-17

Changed in fuel:
milestone:	5.0 → 5.0.1

Sergii Golovatiuk (sgolovatiuk) on 2014-06-17

Changed in fuel:
status:	New → Incomplete
status:	Incomplete → Confirmed

Andrew Woodward (xarses) wrote on 2014-06-24: Re: HA deployment has failed on 3-rd controller with error: Timeout of deployment is exceeded (GRE Latency)

#12

Amrita, please describe your env and, if possible, attach a support bundle.

summary:

  HA deployment has failed on 3-rd controller with error: Timeout of
- deployment is exceeded
+ deployment is exceeded (GRE Latency)
Changed in fuel:
status:	Confirmed → Incomplete

Bogdan Dobrelya (bogdando) wrote on 2014-06-25:

#13

Since this issue initially affected 4.1 and later 5.0, perhaps it should be nominated for all related series as well.

Changed in fuel:
milestone:	5.0.1 → 5.1
status:	Incomplete → New

Dmitry Borodaenko (angdraug) on 2014-06-25

Changed in fuel:
status:	New → Incomplete

Dmitry Ilyin (idv1985) on 2014-07-15

summary:

- HA deployment has failed on 3-rd controller with error: Timeout of
- deployment is exceeded (GRE Latency)
+ [puppet] HA deployment has failed on 3-rd controller with error: Timeout
+ of deployment is exceeded (GRE Latency)

Dmitry Ilyin (idv1985) on 2014-07-23

summary:

- [puppet] HA deployment has failed on 3-rd controller with error: Timeout
- of deployment is exceeded (GRE Latency)
+ [library] HA deployment has failed on 3-rd controller with error:
+ Timeout of deployment is exceeded (GRE Latency)
tags:	added: library

Stanislaw Bogatkin (sbogatkin) wrote on 2014-08-12:

#14

Cannot reproduce this on Fuel 5.1. Deployment was OK. OSTF sanity tests were also OK.
Fuel version:

{"build_id": "2014-08-11_13-56-51",
"ostf_sha": "acf52a59e04fa74d2ed2b68ea225f4d24403b264",
"build_number": "423",
"auth_required": true,
"api": "1.0",
"nailgun_sha": "2741cdc0f0615263db2f176899d406207ec4ac04",
"production": "docker",
"fuelmain_sha": "9d4463400b4924159c978af43855e48bcf2a84b2",
"astute_sha": "b52910642d6de941444901b0f20e95ebbcb2b2e9",
"feature_groups": ["mirantis"],
"release": "5.1",
"fuellib_sha": "d9b93edb53c44900bd5bc2c25e7c8af0a1310645"}

Dmitry Borodaenko (angdraug) wrote on 2014-08-12:

#15

No reproducer for 2 months (since June 6), back to Invalid.

Changed in fuel:
status:	Incomplete → Invalid
