DataSourceSmartOS needs locking

Bug #1746605 reported by Mike Gerdts on 2018-01-31
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
cloud-init
Medium
Mike Gerdts
cloud-init (Ubuntu)
Undecided
Chad Smith
Nominated for Bionic by Chad Smith
Nominated for Cosmic by Chad Smith

Bug Description

I'm seeing the following while trying to read meta-data from SmartOS.

2018-01-31 21:36:03,554 - DataSourceSmartOS.py[DEBUG]: Writing "V2 29 459961e2 d133c055 GET c2RjOnJvdXRlcw==
" to metadata transport.
2018-01-31 21:36:03,995 - DataSourceSmartOS.py[DEBUG]: Read "aV2 21 0e6e7ec8 d133c055 SUCCESS W10=" from metadata transport.
2018-01-31 21:36:03,996 - handlers.py[DEBUG]: finish: init-local/search-SmartOS: FAIL: no local data found from DataSourceSmartOS
2018-01-31 21:36:03,996 - util.py[WARNING]: Getting data from <class 'cloudinit.sources.DataSourceSmartOS.DataSourceSmartOS'> failed
2018-01-31 21:36:03,996 - util.py[DEBUG]: Getting data from <class 'cloudinit.sources.DataSourceSmartOS.DataSourceSmartOS'> failed
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/cloudinit/sources/__init__.py", line 434, in find_source
    if s.get_data():
  File "/usr/lib/python2.7/site-packages/cloudinit/sources/__init__.py", line 121, in get_data
    return_value = self._get_data()
  File "/usr/lib/python2.7/site-packages/cloudinit/sources/DataSourceSmartOS.py", line 237, in _get_data
    md[ci_noun] = self.md_client.get_json(smartos_noun)
  File "/usr/lib/python2.7/site-packages/cloudinit/sources/DataSourceSmartOS.py", line 406, in get_json
    result = self.get(key, default=default)
  File "/usr/lib/python2.7/site-packages/cloudinit/sources/DataSourceSmartOS.py", line 559, in get
    val = self._get(key, strip=False, default=mdefault)
  File "/usr/lib/python2.7/site-packages/cloudinit/sources/DataSourceSmartOS.py", line 544, in _get
    get(key, default=default, strip=strip))
  File "/usr/lib/python2.7/site-packages/cloudinit/sources/DataSourceSmartOS.py", line 398, in get
    result = self.request(rtype='GET', param=key)
  File "/usr/lib/python2.7/site-packages/cloudinit/sources/DataSourceSmartOS.py", line 394, in request
    value = self._get_value_from_frame(request_id, response)
  File "/usr/lib/python2.7/site-packages/cloudinit/sources/DataSourceSmartOS.py", line 342, in _get_value_from_frame
    frame_data = self.line_regex.match(frame).groupdict()
AttributeError: 'NoneType' object has no attribute 'groupdict'
2018-01-31 21:36:04,004 - main.py[DEBUG]: No local datasource found

[root@7180e700-3cba-cb89-eb82-ff14a51a62b2 ~]# echo c2RjOnJvdXRlcw== | base64 -d; echo
sdc:routes
[root@7180e700-3cba-cb89-eb82-ff14a51a62b2 ~]# mdata-get sdc:routes
[]

This seems to cause DataSourceSmartOS to fail completely, then it goes on to time out on EC2 and CloudStack.

This is using my own build with this changeset at HEAD.

$ git log -n 1
commit f7deaf15acf382d62554e2b1d70daa9a9109d542

Related branches

Mike Gerdts (mgerdts) wrote :

In a subsequent boot, I noticed a different stack:

2018-01-31 21:56:17,517 - DataSourceSmartOS.py[DEBUG]: Writing "V2 29 ae9f5912 d2a2a4ca GET c2RjOnJvdXRlcw==
" to metadata transport.
2018-01-31 21:56:17,825 - handlers.py[DEBUG]: finish: init-local/search-SmartOS: FAIL: no local data found from DataSourceSmartOS
2018-01-31 21:56:17,825 - util.py[WARNING]: Getting data from <class 'cloudinit.sources.DataSourceSmartOS.DataSourceSmartOS'> failed
2018-01-31 21:56:17,826 - util.py[DEBUG]: Getting data from <class 'cloudinit.sources.DataSourceSmartOS.DataSourceSmartOS'> failed
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/cloudinit/sources/__init__.py", line 434, in find_source
    if s.get_data():
  File "/usr/lib/python2.7/site-packages/cloudinit/sources/__init__.py", line 121, in get_data
    return_value = self._get_data()
  File "/usr/lib/python2.7/site-packages/cloudinit/sources/DataSourceSmartOS.py", line 237, in _get_data
    md[ci_noun] = self.md_client.get_json(smartos_noun)
  File "/usr/lib/python2.7/site-packages/cloudinit/sources/DataSourceSmartOS.py", line 406, in get_json
    result = self.get(key, default=default)
  File "/usr/lib/python2.7/site-packages/cloudinit/sources/DataSourceSmartOS.py", line 552, in get
    val = self._get(key, strip=False, default=mdefault)
  File "/usr/lib/python2.7/site-packages/cloudinit/sources/DataSourceSmartOS.py", line 538, in _get
    get(key, default=default, strip=strip))
  File "/usr/lib/python2.7/site-packages/cloudinit/sources/DataSourceSmartOS.py", line 398, in get
    result = self.request(rtype='GET', param=key)
  File "/usr/lib/python2.7/site-packages/cloudinit/sources/DataSourceSmartOS.py", line 383, in request
    response.extend(self.fp.read(1))
  File "/usr/lib/python2.7/site-packages/serial/serialposix.py", line 453, in read
    buf = os.read(self.fd, size-len(read))
OSError: [Errno 11] Resource temporarily unavailable

If I understand it correctly, this time around there was a timeout while reading from the serial port. The timeout is 60 seconds and the machine was very lightly loaded. This is starting to feel more like there is something else going on. This is on bhyve, which is being ported to SmartOS. As such, there is a distinct possibility I am hitting a host or hypervisor issue.

Mike Gerdts (mgerdts) wrote :

The two stack traces seem to be unrelated. I've reproduced the first one with this patch applied.

diff --git a/cloudinit/sources/DataSourceSmartOS.py b/cloudinit/sources/DataSourceSmart
index 86bfa5d..76720f0 100644
--- a/cloudinit/sources/DataSourceSmartOS.py
+++ b/cloudinit/sources/DataSourceSmartOS.py
@@ -339,7 +339,11 @@ class JoyentMetadataClient(object):
             binascii.crc32(body.encode('utf-8')) & 0xffffffff)

     def _get_value_from_frame(self, expected_request_id, frame):
- frame_data = self.line_regex.match(frame).groupdict()
+ match = self.line_regex.match(frame)
+ if match is None:
+ LOG.warning("no match for frame '%s'", frame)
+ return None
+ frame_data = match.groupdict()
         if int(frame_data['length']) != len(frame_data['body']):
             raise JoyentMetadataFetchException(
                 'Incorrect frame length given ({0} != {1}).'.format(

I then see the following in the log. I've trimmed the first part to avoid line wrap.

[DEBUG]: Writing "V2 29 b949ac51 8bb85dc9 GET c2RjOnJvdXRlcw==
" to metadata transport.
[DEBUG]: Read "nnV2 21 ff55207a 8bb85dc9 SUCCESS W10=" from metadata transport.
[WARNING]: no match for frame 'nnV2 21 ff55207a 8bb85dc9 SUCCESS W10='

$ echo c2RjOnJvdXRlcw== | base64 -d; echo
sdc:routes
$ echo W10= | base64 -d; echo
[]

Mike Gerdts (mgerdts) wrote :
Download full text (3.2 KiB)

I think I've sorted out what is going wrong here. I modified JoyentMetadataLcient in DataSourceSmartOS.py so that it retries when an exchange is corrupted.

On a particular boot, cloud-init logged:

Writing "V2 29 98686c19 c36fd530 GET c2RjOnJvdXRlcw==\n" to metadata transport.
Read "lV2 21 1f8b55fb c36fd530 SUCCESS W10=" from metadata transport.
      ^ Extra 'l' carachter
Writing "V2 29 e2c45a1f 9492baa0 GET c2RjOnJvdXRlcw==\n" to metadata transport.
Read "V2 21 c1a6af86 9492baa0 SUCCESS W10=" from metadata transport.
Value "[]" found.

Notice my annotation of the extra character in the input stream.

In the host I was tracing the reads and writes by the metadata server. I saw:

R 'V2 29 98686c19 c36fd530 GET c2RjOnJvdXRlcw==\n\n\0'
                                                ^ ^ two newlines
W 'invalid command\n'
   ^^^ this, perhaps with the exception of a single l, does not appear in guest
       cloud-init log.
W 'V2 21 1f8b55fb c36fd530 SUCCESS W10=\n'
R 'V2 29 e2c45a1f 9492baa0 GET c2RjOnJvdXRlcw==\n'
W 'V2 21 c1a6af86 9492baa0 SUCCESS W10=\n'

On a related note, mdata-get flushes the serial port before it does its work. That is, reads any pending data, it writes a newline, and then reads a line. It discards whatever was read, then sends the real request and reads the real reply. My test system is one that has typically done a bunch of configuration using mdata-get from rc.local and perhaps other places. I think I have pretty strong evidence that there's a race between cloud-init and these other bits that are also trying to use the metadata service.

The extra newline that is seen by the host would correspond to the flush performed by mdata-get. The fact that cloud-init sees an extra character (l) but not the other chacaraters in "invalid command" suggests that mdata-get and cloud-init are both reading from the serial port at the same time.

The first stack trace that I posted showed:

Read "aV2 21 0e6e7ec8 d133c055 SUCCESS W10=" from metadata transport.

Notice the extra 'a' that it read. Looking through my terminal history, I see this.

Read "nnV2 21 ff55207a 8bb85dc9 SUCCESS W10=" from metadata transport.

In each case where there are extra characters, those extra characters are some subset of "invalid command".

The second stack trace that I posted timed out because it never got a newline. That could also be explained by a rogue mdata-get reading a line from the serial port.

There are a several of approaches to dealing this:

1) If it hurts, don't do it. (close as not-a-bug)
2) Do some sort of locking on /dev/ttyS1 to prevent crosstalk
3) When a command fails, retry a time or two.

Option 2 requires that everyone that may open /dev/ttyS1 agree on the locking strategy and implement it correctly. mdata-get tries to take a lock, but clearly that's not working out so well.

# strace mdata-get sdc:hostname
...
open("/dev/ttyS1", O_RDWR|O_EXCL|O_NOCTTY) = 5
fcntl(5, F_SETLKW, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = 0

Linux seems to be working as designed. http://www.tldp.org/HOWTO/Serial-HOWTO-13.html

Since there's no way to keep other actors from opening the serial device that cloud-init is using, ...

Read more...

Mike Gerdts (mgerdts) wrote :
Mike Gerdts (mgerdts) on 2018-02-01
summary: - stack trace when sdc:* not defined
+ DataSourceSmartOS needs locking and retries

A tentative fix is at:

https://github.com/mgerdts/cloud-init/tree/bug-1746605

Test results for the new unit test:

https://gist.github.com/c79383050eda90cb008497c11e1aa82b

Once the contributor agreement for Joyent is in place I'll continue working on this.

Changed in cloud-init:
assignee: nobody → Mike Gerdts (mgerdts)
Scott Moser (smoser) on 2018-02-12
Changed in cloud-init:
importance: Undecided → Medium
status: New → Confirmed
Mike Gerdts (mgerdts) wrote :

The original approach for the fix of this involved locking and retries. On further consideration, it seems as though the proper approach is to use locking and rely on a more complete protocol implementation to handle initial noise on the line. The fix for bug #1667735 fixes the protocol implementation by adding an initial flush of input and protocol negotiation.

summary: - DataSourceSmartOS needs locking and retries
+ DataSourceSmartOS needs locking
Scott Moser (smoser) wrote :

An upstream commit landed for this bug.

To view that commit see the following URL:
https://git.launchpad.net/cloud-init/commit/?id=4ed16459

Changed in cloud-init:
status: Confirmed → Fix Committed

This bug is believed to be fixed in cloud-init in version 18.2-27-g6ef92c98-0ubuntu1~18.04.1. If this is still a problem for you, please make a comment and set the state back to New

Thank you.

Changed in cloud-init (Ubuntu):
status: New → Fix Released
assignee: nobody → Chad Smith (chad.smith)
Changed in cloud-init:
status: Fix Committed → Fix Released

This bug is believed to be fixed in cloud-init in version 18.3. If this is still a problem for you, please make a comment and set the state back to New

Thank you.

Changed in cloud-init:
status: Fix Released → Fix Committed
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.