Comment 5 for bug 1950382

John Eckersberg (jeckersb) wrote :

Oh duh I see you already mentioned that at the very end of the original comment. Sorry, not having a good reading comprehension day :)

This one is almost certainly something to do with TLS, based on this line from the rabbit stdout.log linked above:

(log_op_output) notice: rabbitmq_stop_0[1386] error output [ * TCP connection succeeded but Erlang distribution failed ]

So the CLI can locate rabbit and open a TCP connection to it, but can't start Erlang distribution (the handshake, basically). At that point either (1) there is an Erlang cookie mismatch, or (2) the TLS handshake failed for some reason (probably certificate verification).

It's probably not (1), since this is all contained within the same node and both the server and the CLI share the same cookie file. The only way it could be a cookie mismatch is if rabbit starts, then something in tripleo changes the cookie out from underneath it, and then the CLI tries to use the new cookie. This has happened repeatedly in the past during upgrades, but I wouldn't expect to see it show up in a more straightforward CI run.
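To rule out (1) on a held node, something like the following could help. This is only a rough sketch: the helper names are made up for illustration, the cookie path and the server start time have to be supplied by whoever runs it, and a newer mtime only shows the file was rewritten, not that its contents actually differ.

```python
import hashlib
import os
from pathlib import Path


def cookie_fingerprint(cookie_path):
    """Return a short SHA-256 fingerprint of the cookie contents.

    Lets you compare the cookie seen by two processes or containers
    without pasting the secret itself into a bug report.
    """
    data = Path(cookie_path).read_bytes().strip()
    return hashlib.sha256(data).hexdigest()[:16]


def cookie_changed_since(cookie_path, server_start_epoch):
    """Return True if the cookie file was modified after the server started.

    server_start_epoch is the rabbit start time in seconds since the
    epoch (an input you have to obtain yourself, e.g. from process info).
    """
    return os.stat(cookie_path).st_mtime > server_start_epoch
```

If cookie_changed_since() comes back True for the cookie file rabbit is using, the "cookie swapped out from underneath it" theory is worth a second look; if it is False, that points back at TLS.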

Michele and I did a rather significant overhaul of the rabbit TLS bits here recently:

https://review.opendev.org/c/openstack/puppet-tripleo/+/812401
https://review.opendev.org/c/openstack/tripleo-heat-templates/+/812390

A lot of that was specifically to improve FIPS support by removing hard-coded ciphers and forcing everything to use only TLS 1.2 or TLS 1.3. In addition, newer Erlang started logging errors about certificate verification, and these tweaks removed those errors by either actually verifying the certificate or explicitly disabling verification in the cases that don't require it.

With all of that said, it's still not obvious why you would be hitting this only intermittently. I am always suspicious of name resolution, and maybe there is a mismatch with the name(s) in the certificate that only shows up when something resolves in a particular manner. If we can hold a node once this reproduces, it would be a huge help to poke at it with the Erlang CLI as well as openssl s_client and see if we can get some idea of why cert verification might or might not be failing.
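For the name resolution theory, a quick comparison between the names the node resolves to and the DNS names in the certificate might narrow things down. This is a minimal sketch, not a full RFC 6125 implementation: the function names are hypothetical, the wildcard handling only covers a leftmost "*" label, and the certificate's DNS names are assumed to be copied in by hand (for example from the subjectAltName section of openssl x509 -text output).

```python
import socket


def dns_name_matches(hostname, pattern):
    """Match a hostname against one certificate DNS name.

    Comparison is case-insensitive and label by label; a "*" is only
    honoured as the entire leftmost label of the pattern.
    """
    host_labels = hostname.lower().rstrip(".").split(".")
    pat_labels = pattern.lower().rstrip(".").split(".")
    if len(host_labels) != len(pat_labels):
        return False
    if pat_labels[0] == "*":
        return host_labels[1:] == pat_labels[1:]
    return host_labels == pat_labels


def names_not_covered(candidate_names, cert_dns_names):
    """Return the candidate names that no certificate DNS name covers."""
    return [
        name for name in candidate_names
        if not any(dns_name_matches(name, pat) for pat in cert_dns_names)
    ]


def local_name_candidates():
    """Collect names this host may resolve to right now.

    The result depends on resolver state, which is the point: run it
    several times on the held node and see whether the set changes.
    """
    short = socket.gethostname()
    names = {short, socket.getfqdn()}
    try:
        canonical, aliases, _addrs = socket.gethostbyname_ex(short)
        names.update([canonical, *aliases])
    except OSError:
        pass  # a resolution failure is itself a useful data point
    return sorted(n for n in names if n)
```

Running names_not_covered(local_name_candidates(), [...]) with the certificate's DNS names filled in would list any resolved name the certificate does not cover. A short hostname showing up there on some runs but not others would fit the intermittent failures.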