postci timeouts on ovb-ha and ovb-updates

Bug #1649742 reported by Emilien Macchi
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned

Bug Description

http://logs.openstack.org/01/410301/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-ha/8087c38/console.html#_2016-12-13_20_04_50_625658

pingtest works fine, but something times out during the postci scripts, making the ovb-ha jobs fail.

Tags: ci
Revision history for this message
Gabriele Cerami (gcerami) wrote :

The postci function completes correctly and returns exit code 0, as seen in http://logs.openstack.org/01/410301/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-ha/8087c38/logs/postci.txt.gz

So something after postci must be failing to catch the exit and waiting indefinitely. But there is nothing after postci; control should return to the testenv client.

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

It seems to be dstat that hangs; trying to kill it in this patch: https://review.openstack.org/#/c/410738/
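A sketch of the kind of cleanup such a patch would do (the dstat process name comes from the comment above; the rest is illustrative, not the actual tripleo-ci script):

```shell
# Hypothetical postci cleanup step: dstat keeps running after the ci
# scripts finish and can hold the job open, so kill it explicitly.
# "|| true" swallows pkill's non-zero exit when no dstat is running,
# so the cleanup itself can't fail the job.
pkill -x dstat || true
```

The `|| true` guard matters in CI scripts that run under `set -e`, where an unguarded pkill with nothing to kill would abort the script.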

Changed in tripleo:
milestone: ocata-2 → ocata-3
Revision history for this message
Ben Nemec (bnemec) wrote :

I think we've pretty much concluded that it isn't anything in the ci scripts causing this. We've started looking at the testenv bits to see if something is malfunctioning there. One potentially interesting thing I noticed was this:

DEBUG:gear.Server:Received error event on <gear.ServerConnection 0x7fc7280105d0 name: None host: 192.168.100.88 port: 43948>: 29

If we're losing our Gearman connection then the testenv-client might never return, which could explain the hang. Tracing the testenv-worker logs for an ha environment suggests that the environment gets created, used, and then destroyed as expected, but if there's a communication failure sending that result back to the testenv-client, maybe that would cause this behavior?
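A minimal illustration of why a silently lost connection could hang the client forever (plain sockets, not the actual gear library or testenv-client code): a stand-in server accepts the TCP connection but never replies, and only a socket timeout lets the client notice instead of blocking indefinitely.

```python
import socket

# Stand-in for a server that accepts a connection but never answers
# (port 0 asks the OS for any free port; nothing here is geard itself).
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
host, port = server.getsockname()

# Without timeout=, recv() below would block forever.
client = socket.create_connection((host, port), timeout=0.5)
try:
    client.recv(4096)
    lost = False
except socket.timeout:
    lost = True  # treat the silence as a lost connection and bail out
print("connection lost:", lost)  # → connection lost: True
```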

I have no explanation for why this is specific to the ha and updates jobs (the latter also appears to be affected), though. Something to do with net-iso maybe? Could the net-iso config on the undercloud be breaking networking when we delete the extra interfaces during the testenv teardown? I've pushed a test patch that disables net-iso in ha on my big chain of test patches, so we'll see: https://review.openstack.org/410998 . I also tried dropping pacemaker from the updates job, since that is also a feature both jobs use, although I can't really see how it would be relevant here.

Anyway, that's where I'm at with this right now. I guess we'll see if any of the test patches tell us anything interesting.

summary: - postci timeouts on ovb-ha
+ postci timeouts on ovb-ha and ovb-updates
Revision history for this message
Derek Higgins (derekh) wrote :

I haven't much info to add here, but I have observed two things:

1. The last successful HA job I can find was on CentOS 7.2
   http://logs.openstack.org/97/409697/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-ha/5ac4a16/
   The first job that has this particular failure was on CentOS 7.3
   http://logs.openstack.org/09/409809/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-ha/56c5a8e/

2. I ran a HA job and tcpdumped the gearman connection on both sides
   tcpdump on the testenv-client
   http://logs.openstack.org/11/111011/82/check-tripleo/gate-tripleo-ci-centos-7-ovb-ha/4c53429/logs/dump.tcp
   tcpdump on the te-broker (I didn't get the start of the dump)
   http://chunk.io/f/399cf0a7855d42b9ab4b84ed21d5fad2

The sequence numbers look strange to me: on the te-broker side we see a bunch of packets with "seq 0:35, ack 1", but the corresponding packets on the testenv-client have "seq 35:70, ack 174". Shouldn't they be the same?

Revision history for this message
Gabriele Cerami (gcerami) wrote :

The sequence numbers are ok; the seq ranges are not for "corresponding" packets, each side numbers them independently. The really weird thing is that 192.168.101.125 stops answering after a while. At least one of those seq 0:35 packets should receive an ack, and that never happens, so 192.168.103.254 keeps retransmitting the same sequence hoping the other side will receive it. At this point it may well be a firewall rule blocking the traffic.

Revision history for this message
Ben Nemec (bnemec) wrote :

https://review.openstack.org/#/c/411514 seems to get us around the problem on newton and above. Mitaka still appears to be broken, but we made significant changes to undercloud ssl since then, so I'm not entirely surprised. My best guess at this point is that one of the extra iptables rules we now configure for undercloud ssl is somehow unbreaking this.

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

https://review.openstack.org/#/c/410470/6 is not ideal, but could also be a solution. It stops the testenv when a timeout expires; the timeout is calculated from the zuul job timeout and the time spent getting the environment. It will always stop the job, but in the ha case the job would otherwise just sit idle until the timeout is reached.
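The budget arithmetic described above might look roughly like this (all numbers and variable names are illustrative, not the actual tripleo-ci values):

```python
# Hypothetical timeout budget for stopping the testenv before zuul
# kills the whole job: start from the zuul job timeout, subtract the
# time already spent waiting for an environment, and keep a margin
# for log collection.
zuul_job_timeout = 170 * 60   # total job budget, seconds
env_wait = 25 * 60            # time spent acquiring the environment
log_margin = 5 * 60           # reserve for collecting logs at the end

testenv_timeout = zuul_job_timeout - env_wait - log_margin
print(testenv_timeout)  # → 8400 seconds (140 minutes)
```

The point of subtracting the margin is that a forced stop from inside the job can still upload logs, whereas hitting the zuul timeout loses them.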

tags: removed: alert
Changed in tripleo:
status: Triaged → Fix Released
Revision history for this message
Gabriele Cerami (gcerami) wrote :

This is definitely a firewall problem: this change https://review.openstack.org/411189 demonstrates that adding the debug option (which disables the final DROP rule in the chain, the one that drops every unmatched packet) fixes the problem with master HA. We'll have to add rules for the geard ports somewhere in puppet-tripleo (the rules in the change get overridden).
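In TripleO's firewall model, such a rule would be expressed roughly like this hiera fragment (the rule name, number, and exact hiera key are illustrative; 4730 is geard's default port, and the real fix belongs in puppet-tripleo rather than in this snippet):

```yaml
# Illustrative only: open the gearman/geard port so the chain's final
# catch-all DROP rule no longer swallows testenv traffic.
tripleo::firewall::firewall_rules:
  '140 gearman':
    proto: tcp
    dport: 4730
```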
