I worked on isolating the problem, without much success.
However, I was able to increase the debug level and collect the logs from the last SUCCESSFUL deployment [1] and the next FAILURE [2]. These logs also include a tcpdump pcap file from the introspection process.
Now, what we know for sure is that:
1) If I repeatedly deploy an environment only up to introspection (restarting from scratch as soon as introspection finishes), the problem does not happen;
2) If I repeatedly deploy a complete environment, deleting the overcloud stack before restarting from scratch, the problem happens every 3 or 4 deployments;
3) The nodes that time out are not always the same: this is extremely racy.
That said, I'm still not able to reproduce the problem reliably, nor have I found a workaround to avoid it.
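The two loops described in 1) and 2) could be driven by a small harness along these lines. This is only a sketch: the function names and the placeholder script paths in the comments are illustrative assumptions, not taken from the actual deployment tooling.

```python
import subprocess
from typing import Callable, List, Sequence


def run_cycle(commands: Sequence[Sequence[str]]) -> bool:
    """Run each command in order; return True only if every one exits with 0."""
    for cmd in commands:
        if subprocess.run(cmd).returncode != 0:
            return False  # stop at the first failing step
    return True


def repeat_cycles(cycle: Callable[[], bool], attempts: int) -> List[int]:
    """Call cycle() `attempts` times; return the 1-based indexes that failed."""
    failures = []
    for i in range(1, attempts + 1):
        if not cycle():
            failures.append(i)
    return failures


# Example usage (hypothetical script names, replace with the real steps):
#   full_cycle = [["./deploy-full-env.sh"], ["./delete-overcloud-stack.sh"]]
#   failed = repeat_cycles(lambda: run_cycle(full_cycle), attempts=12)
#   print("failed runs:", failed)
```

Recording which iterations fail, rather than stopping at the first failure, makes it easier to check whether the "every 3 or 4 deployments" pattern holds over a longer run.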
[1] http://file.rdu.redhat.com/~rscarazz/LP1718898/collect-logs_SUCCESS.tar.bz2
[2] http://file.rdu.redhat.com/~rscarazz/LP1718898/collect-logs_FAILURE.tar.bz2