tripleo

Bug #1741850
Comment #3

Comment 3 for bug 1741850

Revision history for this message

Alan Bishop (alan-bishop) wrote on 2018-01-08:

I analyzed another failure over the weekend [1], and it's the same failure I see here. For reasons I don't understand, it looks like the iscsid container is regenerating its IQN on startup, essentially overwriting the IQN initialization done by puppet. This creates a situation where the iSCSI subsystem thinks that Nova is trying to access a volume using the wrong IQN, which results in an iSCSI login failure. This failure can be seen in the log journal [2], where this appears:

Jan 06 11:48:13 centos-7-ovh-gra1-0001714610 kernel: iSCSI Initiator Node: iqn.1994-05.com.redhat:755d7cc2a05d is not authorized to access iSCSI target portal group: 1.

[1] https://review.openstack.org/530559
[2] http://logs.openstack.org/82/531182/1/gate/tripleo-ci-centos-7-scenario002-multinode-oooq-container/c99bd6b/logs/subnode-2/var/log/journal.txt.gz#_Jan_06_11_48_13

The IQN is supposed to be initialized once, during the iscsid container's puppet_config step. The same IQN value is shared by baremetal services [3] and containers [4]. That value is iqn.1994-05.com.redhat:dc122233454b, which differs from the iqn.1994-05.com.redhat:755d7cc2a05d value flagged in the login failure message. This is what causes the tempest tests to fail!

[3] http://logs.openstack.org/82/531182/1/gate/tripleo-ci-centos-7-scenario002-multinode-oooq-container/c99bd6b/logs/subnode-2/etc/iscsi/initiatorname.iscsi.gz
[4] http://logs.openstack.org/82/531182/1/gate/tripleo-ci-centos-7-scenario002-multinode-oooq-container/c99bd6b/logs/subnode-2/var/log/config-data/iscsid/etc/iscsi/initiatorname.iscsi.gz

I tracked the problem further, to the iscsid container's stdout [5]. I don't know if I'm reading this correctly, but it suggests the IQN initialization code is running twice. The logs in [5] for "Copying service configuration files" don't look correct (the source and destination directory of the files look wrong), and there's no mention of the .initiator_reset file that's supposed to ensure the IQN is only reset once [6].

[5] http://logs.openstack.org/82/531182/1/gate/tripleo-ci-centos-7-scenario002-multinode-oooq-container/c99bd6b/logs/subnode-2/var/log/extra/docker/containers/iscsid/stdout.log.txt.gz
[6] https://github.com/openstack/puppet-tripleo/blob/master/manifests/profile/base/iscsid.pp#L36

Also, from the log journal [2], you see the iscsid container's puppet_config performing the IQN reset at Jan 06 11:16:21:

Jan 06 11:16:21 centos-7-ovh-gra1-0001714610 puppet-user[35337]: (/Stage[main]/Tripleo::Profile::Base::Iscsid/Exec[reset-iscsi-initiator-name]/returns) executed successfully

But, at Jan 06 11:29:44 (13 minutes later), you see it running the commands that generate the IQN that triggers the login failure:

Jan 06 11:29:44 centos-7-ovh-gra1-0001714610 dockerd-current[25619]: Generating new iSCSI initiator name
Jan 06 11:29:44 centos-7-ovh-gra1-0001714610 dockerd-current[25619]: ++ echo 'Generating new iSCSI initiator name'
Jan 06 11:29:44 centos-7-ovh-gra1-0001714610 dockerd-current[25619]: +++ /sbin/iscsi-iname
Jan 06 11:29:44 centos-7-ovh-gra1-0001714610 dockerd-current[25619]: ++ echo InitiatorName=iqn.1994-05.com.redhat:755d7cc2a05d

We need to take a closer look at how the iscsid container is being launched to determine why the puppet code is executing more than once.

We also need to investigate how the puppet-generated configuration files are saved, and then restored when the container is launched. That code may not be properly handling dot-files such as the ".initiator_reset" file mentioned above.

Jan 06 11:48:13 centos-7-ovh-gra1-0001714610 kernel: iSCSI Initiator Node: iqn.1994-05.com.redhat:755d7cc2a05d is not authorized to access iSCSI target portal group: 1.

Also, from the log journal [2], you see the iscsid container's puppet_config performing the IQN reset at Jan 06 11:16:21:

Jan 06 11:16:21 centos-7-ovh-gra1-0001714610 puppet-user[35337]: (/Stage[main]/Tripleo::Profile::Base::Iscsid/Exec[reset-iscsi-initiator-name]/returns) executed successfully

But, at Jan 06 11:29:44 (13 minutes later), you see it running the commands that generate the IQN that triggers the login failure:

We need to take a closer look at how the iscsid container is being launched to determine why the puppet code is executing more than once.