postci timeouts on ovb-ha and ovb-updates

Bug #1649742 reported by Emilien Macchi
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned

Bug Description

http://logs.openstack.org/01/410301/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-ha/8087c38/console.html#_2016-12-13_20_04_50_625658

pingtest works fine, but something times out during the postci scripts, making the ovb-ha jobs fail.

Tags: ci
Revision history for this message
Gabriele Cerami (gcerami) wrote :

The postci function completes correctly and returns exit code 0, as seen in http://logs.openstack.org/01/410301/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-ha/8087c38/logs/postci.txt.gz

So something after postci must be failing to catch the exit and waiting indefinitely. But there is nothing after postci; control should return to the testenv client.

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

It seems to be dstat that hangs; trying to kill it in this patch: https://review.openstack.org/#/c/410738/
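A sketch of the kind of cleanup such a patch would do (the dstat process name comes from the comment above; the rest is illustrative, not the actual tripleo-ci script):

```shell
# Hypothetical postci cleanup step: dstat keeps running after the ci
# scripts finish and can hold the job open, so kill it explicitly.
# "|| true" swallows pkill's non-zero exit when no dstat is running,
# so the cleanup itself can't fail the job.
pkill -x dstat || true
```

The `|| true` guard matters in CI scripts that run under `set -e`, where an unguarded pkill with nothing to kill would abort the script.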

Changed in tripleo:
milestone: ocata-2 → ocata-3
Revision history for this message
Ben Nemec (bnemec) wrote :

I think we've pretty much concluded that it isn't anything in the ci scripts causing this. We've started looking at the testenv bits to see if something is malfunctioning there. One potentially interesting thing I noticed was this:

DEBUG:gear.Server:Received error event on <gear.ServerConnection 0x7fc7280105d0 name: None host: 192.168.100.88 port: 43948>: 29

If we're losing our Gearman connection then the testenv-client might never return, which could explain the hang. Tracing the testenv-worker logs for an ha environment suggests that the environment gets created, used, and then destroyed as expected, but if there's a communication failure sending that result back to the testenv-client, maybe that would cause this behavior?
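A minimal illustration of why a silently lost connection could hang the client forever (plain sockets, not the actual gear library or testenv-client code): a stand-in server accepts the TCP connection but never replies, and only a socket timeout lets the client notice instead of blocking indefinitely.

```python
import socket

# Stand-in for a server that accepts a connection but never answers
# (port 0 asks the OS for any free port; nothing here is geard itself).
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
host, port = server.getsockname()

# Without timeout=, recv() below would block forever.
client = socket.create_connection((host, port), timeout=0.5)
try:
    client.recv(4096)
    lost = False
except socket.timeout:
    lost = True  # treat the silence as a lost connection and bail out
print("connection lost:", lost)  # → connection lost: True
```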

I have no explanation for why this is specific to the ha and updates jobs (the latter also appears to be affected), though. Something to do with net-iso maybe? Could the net-iso config on the undercloud be breaking networking when we delete the extra interfaces during the testenv teardown? I've pushed a test patch that disables net-iso in ha on my big chain of test patches, so we'll see: https://review.openstack.org/410998 . I also tried dropping pacemaker from the updates job, since that is also a feature both jobs use, although I can't really see how it would be relevant here.

Anyway, that's where I'm at with this right now. I guess we'll see if any of the test patches tell us anything interesting.

summary: - postci timeouts on ovb-ha
+ postci timeouts on ovb-ha and ovb-updates
Revision history for this message
Derek Higgins (derekh) wrote :

I haven't much info to add here, but I have observed two things:

1. The last successful HA job I can find was on CentOS 7.2
   http://logs.openstack.org/97/409697/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-ha/5ac4a16/
   The first job that has this particular failure was on CentOS 7.3
   http://logs.openstack.org/09/409809/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-ha/56c5a8e/

2. I ran a HA job and tcpdumped the gearman connection on both sides
   tcpdump on the testenv-client
   http://logs.openstack.org/11/111011/82/check-tripleo/gate-tripleo-ci-centos-7-ovb-ha/4c53429/logs/dump.tcp
   tcpdump on the te-broker (I didn't get the start of the dump)
   http://chunk.io/f/399cf0a7855d42b9ab4b84ed21d5fad2

The sequence numbers look strange to me: on the te-broker side we see a bunch of packets with "seq 0:35, ack 1", but the corresponding packets on the testenv-client have "seq 35:70, ack 174". Shouldn't they be the same?

Revision history for this message
Gabriele Cerami (gcerami) wrote :

The sequence numbers are ok; the seq ranges are not for "corresponding" packets, each side numbers them independently. The really weird thing is that 192.168.101.125 stops answering after a while. At least one of those seq 0:35 packets should receive an ack, and that never happens, so 192.168.103.254 keeps retransmitting the same sequence hoping the other side will receive it. At this point it may well be a firewall rule blocking the traffic.

Revision history for this message
Ben Nemec (bnemec) wrote :

https://review.openstack.org/#/c/411514 seems to get us around the problem on newton and above. Mitaka still appears to be broken, but we made significant changes to undercloud ssl since then, so I'm not entirely surprised. My best guess at this point is that one of the extra iptables rules we now configure for undercloud ssl is somehow unbreaking this.

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

https://review.openstack.org/#/c/410470/6 is not ideal, but could also be a solution. It stops the testenv when a timeout expires; the timeout is calculated from the zuul job timeout and the time spent getting the environment. It will always stop the job, but in the ha case the job would otherwise just sit idle until the timeout is reached.
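The budget arithmetic described above might look roughly like this (all numbers and variable names are illustrative, not the actual tripleo-ci values):

```python
# Hypothetical timeout budget for stopping the testenv before zuul
# kills the whole job: start from the zuul job timeout, subtract the
# time already spent waiting for an environment, and keep a margin
# for log collection.
zuul_job_timeout = 170 * 60   # total job budget, seconds
env_wait = 25 * 60            # time spent acquiring the environment
log_margin = 5 * 60           # reserve for collecting logs at the end

testenv_timeout = zuul_job_timeout - env_wait - log_margin
print(testenv_timeout)  # → 8400 seconds (140 minutes)
```

The point of subtracting the margin is that a forced stop from inside the job can still upload logs, whereas hitting the zuul timeout loses them.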

tags: removed: alert
Changed in tripleo:
status: Triaged → Fix Released
Revision history for this message
Gabriele Cerami (gcerami) wrote :

This is definitely a firewall problem: this change https://review.openstack.org/411189 demonstrates that adding the debug option (which disables the final DROP rule in the chain, the one that drops every unmatched packet) fixes the problem with master HA. We'll have to add rules for the geard ports somewhere in puppet-tripleo (the rules in the change get overridden).
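In TripleO's firewall model, such a rule would be expressed roughly like this hiera fragment (the rule name, number, and exact hiera key are illustrative; 4730 is geard's default port, and the real fix belongs in puppet-tripleo rather than in this snippet):

```yaml
# Illustrative only: open the gearman/geard port so the chain's final
# catch-all DROP rule no longer swallows testenv traffic.
tripleo::firewall::firewall_rules:
  '140 gearman':
    proto: tcp
    dport: 4730
```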
