Comment 1 for bug 1718898

Revision history for this message
Raoul Scarazzini (rasca) wrote :

I worked on isolating the problem, without any big success.
But I was able to increase the debug level and take the logs from the last SUCCESSFUL deployment [1] and the next FAILURE [2]. These logs includes also tcpdump pcap file coming from the introspection process.

Now, what we know for sure is that:

1) If I continuously deploy an env until the introspection (so once the introspection finishes I just restart from scratch), the problem does not happen;
2) If I continuously deploy a complete env, including deleting the overcloud stack before restarting from scratch, then the problem happens every 3 or 4 deployments;
3) The nodes timing out are not always the same: this is extremely racy;

That said, I'm still not able to reproduce the problem, and neither use a workaround to avoid it.

[1] http://file.rdu.redhat.com/~rscarazz/LP1718898/collect-logs_SUCCESS.tar.bz2
[2] http://file.rdu.redhat.com/~rscarazz/LP1718898/collect-logs_FAILURE.tar.bz2