DataSourceSmartOS needs locking
| Affects | Status | Importance | Assigned to | Milestone | |
|---|---|---|---|---|---|
| | cloud-init |
Medium
|
Mike Gerdts | ||
| | cloud-init (Ubuntu) |
Undecided
|
Chad Smith | ||
Bug Description
I'm seeing the following while trying to read meta-data from SmartOS.
2018-01-31 21:36:03,554 - DataSourceSmart
" to metadata transport.
2018-01-31 21:36:03,995 - DataSourceSmart
2018-01-31 21:36:03,996 - handlers.py[DEBUG]: finish: init-local/
2018-01-31 21:36:03,996 - util.py[WARNING]: Getting data from <class 'cloudinit.
2018-01-31 21:36:03,996 - util.py[DEBUG]: Getting data from <class 'cloudinit.
Traceback (most recent call last):
File "/usr/lib/
if s.get_data():
File "/usr/lib/
return_value = self._get_data()
File "/usr/lib/
md[ci_noun] = self.md_
File "/usr/lib/
result = self.get(key, default=default)
File "/usr/lib/
val = self._get(key, strip=False, default=mdefault)
File "/usr/lib/
get(key, default=default, strip=strip))
File "/usr/lib/
result = self.request(
File "/usr/lib/
value = self._get_
File "/usr/lib/
frame_data = self.line_
AttributeError: 'NoneType' object has no attribute 'groupdict'
2018-01-31 21:36:04,004 - main.py[DEBUG]: No local datasource found
[root@7180e700-
sdc:routes
[root@7180e700-
[]
This seems to cause DataSourceSmartOS to fail completely, then it goes on to time out on EC2 and CloudStack.
This is using my own build with this changeset at HEAD.
$ git log -n 1
commit f7deaf15acf382d
Related branches
- Server Team CI bot: Approve (continuous-integration) on 2018-04-26
- cloud-init Commiters: Pending requested 2018-04-26
-
Diff: 2991 lines (+1186/-256)75 files modified.pylintrc (+1/-1)
cloudinit/analyze/dump.py (+1/-1)
cloudinit/cmd/tests/test_main.py (+3/-3)
cloudinit/config/cc_apt_configure.py (+1/-1)
cloudinit/config/cc_bootcmd.py (+0/-1)
cloudinit/config/cc_disk_setup.py (+4/-8)
cloudinit/config/cc_emit_upstart.py (+1/-1)
cloudinit/config/cc_resizefs.py (+3/-5)
cloudinit/config/cc_rh_subscription.py (+8/-10)
cloudinit/config/cc_runcmd.py (+0/-1)
cloudinit/config/cc_set_passwords.py (+45/-60)
cloudinit/config/cc_snap.py (+2/-3)
cloudinit/config/cc_snappy.py (+2/-2)
cloudinit/config/cc_ubuntu_advantage.py (+2/-3)
cloudinit/config/schema.py (+2/-2)
cloudinit/config/tests/test_set_passwords.py (+71/-0)
cloudinit/config/tests/test_snap.py (+27/-2)
cloudinit/config/tests/test_ubuntu_advantage.py (+28/-2)
cloudinit/distros/freebsd.py (+1/-1)
cloudinit/distros/ubuntu.py (+1/-1)
cloudinit/net/__init__.py (+27/-1)
cloudinit/net/cmdline.py (+1/-1)
cloudinit/net/dhcp.py (+1/-1)
cloudinit/net/sysconfig.py (+1/-2)
cloudinit/net/tests/test_init.py (+1/-0)
cloudinit/reporting/events.py (+1/-1)
cloudinit/sources/DataSourceAliYun.py (+1/-1)
cloudinit/sources/DataSourceAltCloud.py (+1/-4)
cloudinit/sources/DataSourceAzure.py (+13/-20)
cloudinit/sources/DataSourceIBMCloud.py (+92/-14)
cloudinit/sources/DataSourceMAAS.py (+1/-1)
cloudinit/sources/DataSourceOVF.py (+1/-1)
cloudinit/sources/DataSourceOpenStack.py (+2/-2)
cloudinit/sources/DataSourceSmartOS.py (+13/-5)
cloudinit/sources/helpers/digitalocean.py (+3/-4)
cloudinit/sources/helpers/openstack.py (+1/-1)
cloudinit/sources/helpers/vmware/imc/config_nic.py (+1/-1)
cloudinit/sources/helpers/vmware/imc/config_passwd.py (+2/-2)
cloudinit/sources/helpers/vmware/imc/guestcust_util.py (+2/-2)
cloudinit/sources/tests/test_init.py (+1/-1)
cloudinit/ssh_util.py (+63/-7)
cloudinit/templater.py (+1/-1)
cloudinit/tests/helpers.py (+32/-2)
cloudinit/tests/test_util.py (+50/-1)
cloudinit/url_helper.py (+1/-1)
cloudinit/util.py (+16/-1)
debian/changelog (+25/-0)
doc/rtd/topics/datasources.rst (+1/-0)
doc/rtd/topics/datasources/aliyun.rst (+74/-0)
packages/debian/control.in (+1/-0)
tests/cloud_tests/bddeb.py (+1/-1)
tests/cloud_tests/collect.py (+2/-1)
tests/cloud_tests/platforms/instances.py (+1/-1)
tests/cloud_tests/platforms/lxd/instance.py (+4/-6)
tests/cloud_tests/setup_image.py (+5/-6)
tests/cloud_tests/testcases/base.py (+1/-1)
tests/cloud_tests/testcases/examples/including_user_groups.py (+1/-1)
tests/cloud_tests/testcases/modules/user_groups.py (+1/-1)
tests/cloud_tests/util.py (+1/-1)
tests/unittests/test__init__.py (+1/-1)
tests/unittests/test_datasource/test_azure.py (+2/-2)
tests/unittests/test_datasource/test_ibmcloud.py (+50/-0)
tests/unittests/test_datasource/test_maas.py (+2/-2)
tests/unittests/test_datasource/test_nocloud.py (+0/-3)
tests/unittests/test_datasource/test_smartos.py (+117/-4)
tests/unittests/test_ds_identify.py (+64/-8)
tests/unittests/test_handler/test_handler_apt_source_v3.py (+1/-1)
tests/unittests/test_handler/test_handler_bootcmd.py (+26/-8)
tests/unittests/test_handler/test_handler_ntp.py (+1/-1)
tests/unittests/test_handler/test_handler_runcmd.py (+26/-7)
tests/unittests/test_net.py (+127/-5)
tests/unittests/test_sshutil.py (+94/-3)
tests/unittests/test_templating.py (+2/-2)
tests/unittests/test_util.py (+3/-3)
tools/ds-identify (+20/-1)
- Chad Smith: Pending requested 2018-04-24
- Douglas Jordan: Pending requested 2018-04-24
-
Diff: 170 lines (+86/-8)2 files modifiedcloudinit/sources/DataSourceAzure.py (+14/-6)
tests/unittests/test_datasource/test_azure.py (+72/-2)
- Scott Moser: Approve on 2018-04-20
- Server Team CI bot: Approve (continuous-integration) on 2018-04-19
-
Diff: 118 lines (+67/-2)2 files modifiedcloudinit/sources/DataSourceSmartOS.py (+2/-0)
tests/unittests/test_datasource/test_smartos.py (+65/-2)
| Mike Gerdts (mgerdts) wrote : | #1 |
| Mike Gerdts (mgerdts) wrote : | #2 |
The two stack traces seem to be unrelated. I've reproduced the first one with this patch applied.
diff --git a/cloudinit/
index 86bfa5d..76720f0 100644
--- a/cloudinit/
+++ b/cloudinit/
@@ -339,7 +339,11 @@ class JoyentMetadataC
def _get_value_
- frame_data = self.line_
+ match = self.line_
+ if match is None:
+ LOG.warning("no match for frame '%s'", frame)
+ return None
+ frame_data = match.groupdict()
if int(frame_
raise JoyentMetadataF
I then see the following in the log. I've trimmed the first part to avoid line wrap.
[DEBUG]: Writing "V2 29 b949ac51 8bb85dc9 GET c2RjOnJvdXRlcw==
" to metadata transport.
[DEBUG]: Read "nnV2 21 ff55207a 8bb85dc9 SUCCESS W10=" from metadata transport.
[WARNING]: no match for frame 'nnV2 21 ff55207a 8bb85dc9 SUCCESS W10='
$ echo c2RjOnJvdXRlcw== | base64 -d; echo
sdc:routes
$ echo W10= | base64 -d; echo
[]
| Mike Gerdts (mgerdts) wrote : | #3 |
I think I've sorted out what is going wrong here. I modified JoyentMetadataL
On a particular boot, cloud-init logged:
Writing "V2 29 98686c19 c36fd530 GET c2RjOnJvdXRlcw==\n" to metadata transport.
Read "lV2 21 1f8b55fb c36fd530 SUCCESS W10=" from metadata transport.
^ Extra 'l' carachter
Writing "V2 29 e2c45a1f 9492baa0 GET c2RjOnJvdXRlcw==\n" to metadata transport.
Read "V2 21 c1a6af86 9492baa0 SUCCESS W10=" from metadata transport.
Value "[]" found.
Notice my annotation of the extra character in the input stream.
In the host I was tracing the reads and writes by the metadata server. I saw:
R 'V2 29 98686c19 c36fd530 GET c2RjOnJvdXRlcw=
W 'invalid command\n'
^^^ this, perhaps with the exception of a single l, does not appear in guest
cloud-init log.
W 'V2 21 1f8b55fb c36fd530 SUCCESS W10=\n'
R 'V2 29 e2c45a1f 9492baa0 GET c2RjOnJvdXRlcw==\n'
W 'V2 21 c1a6af86 9492baa0 SUCCESS W10=\n'
On a related note, mdata-get flushes the serial port before it does its work. That is, reads any pending data, it writes a newline, and then reads a line. It discards whatever was read, then sends the real request and reads the real reply. My test system is one that has typically done a bunch of configuration using mdata-get from rc.local and perhaps other places. I think I have pretty strong evidence that there's a race between cloud-init and these other bits that are also trying to use the metadata service.
The extra newline that is seen by the host would correspond to the flush performed by mdata-get. The fact that cloud-init sees an extra character (l) but not the other chacaraters in "invalid command" suggests that mdata-get and cloud-init are both reading from the serial port at the same time.
The first stack trace that I posted showed:
Read "aV2 21 0e6e7ec8 d133c055 SUCCESS W10=" from metadata transport.
Notice the extra 'a' that it read. Looking through my terminal history, I see this.
Read "nnV2 21 ff55207a 8bb85dc9 SUCCESS W10=" from metadata transport.
In each case where there are extra characters, those extra characters are some subset of "invalid command".
The second stack trace that I posted timed out because it never got a newline. That could also be explained by a rogue mdata-get reading a line from the serial port.
There are a several of approaches to dealing this:
1) If it hurts, don't do it. (close as not-a-bug)
2) Do some sort of locking on /dev/ttyS1 to prevent crosstalk
3) When a command fails, retry a time or two.
Option 2 requires that everyone that may open /dev/ttyS1 agree on the locking strategy and implement it correctly. mdata-get tries to take a lock, but clearly that's not working out so well.
# strace mdata-get sdc:hostname
...
open("/dev/ttyS1", O_RDWR|
fcntl(5, F_SETLKW, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = 0
Linux seems to be working as designed. http://
Since there's no way to keep other actors from opening the serial device that cloud-init is using, ...
| Mike Gerdts (mgerdts) wrote : | #4 |
Related bug for mdata-get:
| summary: |
- stack trace when sdc:* not defined + DataSourceSmartOS needs locking and retries |
A tentative fix is at:
https:/
Test results for the new unit test:
https:/
Once the contributor agreement for Joyent is in place I'll continue working on this.
| Changed in cloud-init: | |
| assignee: | nobody → Mike Gerdts (mgerdts) |
| Changed in cloud-init: | |
| importance: | Undecided → Medium |
| status: | New → Confirmed |
| Mike Gerdts (mgerdts) wrote : | #6 |
The original approach for the fix of this involved locking and retries. On further consideration, it seems as though the proper approach is to use locking and rely on a more complete protocol implementation to handle initial noise on the line. The fix for bug #1667735 fixes the protocol implementation by adding an initial flush of input and protocol negotiation.
| summary: |
- DataSourceSmartOS needs locking and retries + DataSourceSmartOS needs locking |
| Scott Moser (smoser) wrote : | #7 |
An upstream commit landed for this bug.
To view that commit see the following URL:
https:/
| Changed in cloud-init: | |
| status: | Confirmed → Fix Committed |
This bug is believed to be fixed in cloud-init in version 18.2-27-
Thank you.
| Changed in cloud-init (Ubuntu): | |
| status: | New → Fix Released |
| assignee: | nobody → Chad Smith (chad.smith) |
| Changed in cloud-init: | |
| status: | Fix Committed → Fix Released |
This bug is believed to be fixed in cloud-init in version 18.3. If this is still a problem for you, please make a comment and set the state back to New
Thank you.
| Changed in cloud-init: | |
| status: | Fix Released → Fix Committed |
| status: | Fix Committed → Fix Released |


In a subsequent boot, I noticed a different stack:
2018-01-31 21:56:17,517 - DataSourceSmart OS.py[DEBUG] : Writing "V2 29 ae9f5912 d2a2a4ca GET c2RjOnJvdXRlcw== search- SmartOS: FAIL: no local data found from DataSourceSmartOS sources. DataSourceSmart OS.DataSourceSm artOS'> failed sources. DataSourceSmart OS.DataSourceSm artOS'> failed python2. 7/site- packages/ cloudinit/ sources/ __init_ _.py", line 434, in find_source python2. 7/site- packages/ cloudinit/ sources/ __init_ _.py", line 121, in get_data python2. 7/site- packages/ cloudinit/ sources/ DataSourceSmart OS.py", line 237, in _get_data client. get_json( smartos_ noun) python2. 7/site- packages/ cloudinit/ sources/ DataSourceSmart OS.py", line 406, in get_json python2. 7/site- packages/ cloudinit/ sources/ DataSourceSmart OS.py", line 552, in get python2. 7/site- packages/ cloudinit/ sources/ DataSourceSmart OS.py", line 538, in _get python2. 7/site- packages/ cloudinit/ sources/ DataSourceSmart OS.py", line 398, in get rtype=' GET', param=key) python2. 7/site- packages/ cloudinit/ sources/ DataSourceSmart OS.py", line 383, in request extend( self.fp. read(1) ) python2. 7/site- packages/ serial/ serialposix. py", line 453, in read
" to metadata transport.
2018-01-31 21:56:17,825 - handlers.py[DEBUG]: finish: init-local/
2018-01-31 21:56:17,825 - util.py[WARNING]: Getting data from <class 'cloudinit.
2018-01-31 21:56:17,826 - util.py[DEBUG]: Getting data from <class 'cloudinit.
Traceback (most recent call last):
File "/usr/lib/
if s.get_data():
File "/usr/lib/
return_value = self._get_data()
File "/usr/lib/
md[ci_noun] = self.md_
File "/usr/lib/
result = self.get(key, default=default)
File "/usr/lib/
val = self._get(key, strip=False, default=mdefault)
File "/usr/lib/
get(key, default=default, strip=strip))
File "/usr/lib/
result = self.request(
File "/usr/lib/
response.
File "/usr/lib/
buf = os.read(self.fd, size-len(read))
OSError: [Errno 11] Resource temporarily unavailable
If I understand it correctly, this time around there was a timeout while reading from the serial port. The timeout is 60 seconds and the machine was very lightly loaded. This is starting to feel more like there is something else going on. This is on bhyve, which is being ported to SmartOS. As such, there is a distinct possibility I am hitting a host or hypervisor issue.