cloudinit.sources.DataSourceNotFoundException: Did not find any data source, searched classes

Bug #1960944 reported by Patricia Domingues
This bug affects 1 person
Affects        Status        Importance  Assigned to  Milestone
MAAS           Incomplete    Undecided   Unassigned
cloud-init     Fix Released  Unknown
simplestreams  New           Undecided   Unassigned

Bug Description

Unable to deploy bare metal machines (arm64 and amd64) on a
snap-based MAAS 3.1.0 (maas 3.1.0-10901-g.f1f8f1505 18199 3.1/stable).

From the MAAS event log:
```
Tue, 15 Feb. 2022 17:35:33 Node changed status - From 'Deploying' to 'Failed deployment'
 Tue, 15 Feb. 2022 17:35:33 Marking node failed - Node operation 'Deploying' timed out after 30 minutes.
 Tue, 15 Feb. 2022 17:07:44 Node installation - 'cloudinit' searching for network data from DataSourceMAAS
 Tue, 15 Feb. 2022 17:06:44 Node installation - 'cloudinit' attempting to read from cache [trust]
 Tue, 15 Feb. 2022 17:06:42 Node installation - 'cloudinit' attempting to read from cache [check]
 Tue, 15 Feb. 2022 17:05:29 Performing PXE boot
 Tue, 15 Feb. 2022 17:05:29 PXE Request - installation
 Tue, 15 Feb. 2022 17:03:52 Node powered on
```

Server console log shows:

```
ubuntu login: Starting Message of the Day...
[ OK ] Listening on Socket unix for snap application lxd.daemon.
         Starting Service for snap application lxd.activate...
[ OK ] Finished Service for snap application lxd.activate.
[ OK ] Started snap.lxd.hook.conf…-4400-96a8-0c5c9e438c51.scope.
         Starting Time & Date Service...
[ OK ] Started Time & Date Service.
[ OK ] Finished Wait until snapd is fully seeded.
         Starting Apply the settings specified in cloud-config...
[ OK ] Reached target Multi-User System.
[ OK ] Reached target Graphical Interface.
         Starting Update UTMP about System Runlevel Changes...
[ OK ] Finished Update UTMP about System Runlevel Changes.
[ 322.036861] cloud-init[2034]: Can not apply stage config, no datasource found! Likely bad things to come!
[ 322.037477] cloud-init[2034]: ------------------------------------------------------------
[ 322.037907] cloud-init[2034]: Traceback (most recent call last):
[ 322.038341] cloud-init[2034]: File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 521, in main_modules
[ 322.038783] cloud-init[2034]: init.fetch(existing="trust")
[ 322.039181] cloud-init[2034]: File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 411, in fetch
[ 322.039584] cloud-init[2034]: return self._get_data_source(existing=existing)
[ 322.040031] cloud-init[2034]: File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 311, in _get_data_source
[ 322.040381] cloud-init[2034]: (ds, dsname) = sources.find_source(self.cfg,
[ 322.040807] cloud-init[2034]: File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 854, in find_source
[ 322.041221] cloud-init[2034]: raise DataSourceNotFoundException(msg)
[ 322.041659] cloud-init[2034]: cloudinit.sources.DataSourceNotFoundException: Did not find any data source, searched classes: ()
[ 322.042076] cloud-init[2034]: ------------------------------------------------------------
[FAILED] Failed to start Apply the …ngs specified in cloud-config.
See 'systemctl status cloud-config.service' for details.
         Starting Execute cloud user/final scripts...
[ 324.029369] cloud-init[2042]: Can not apply stage final, no datasource found! Likely bad things to come!
[ 324.029979] cloud-init[2042]: ------------------------------------------------------------
[ 324.030420] cloud-init[2042]: Traceback (most recent call last):
[ 324.030835] cloud-init[2042]: File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 521, in main_modules
[ 324.031258] cloud-init[2042]: init.fetch(existing="trust")
[ 324.031647] cloud-init[2042]: File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 411, in fetch
[ 324.032111] cloud-init[2042]: return self._get_data_source(existing=existing)
[ 324.032483] cloud-init[2042]: File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 311, in _get_data_source
[ 324.032865] cloud-init[2042]: (ds, dsname) = sources.find_source(self.cfg,
[ 324.033292] cloud-init[2042]: File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 854, in find_source
[ 324.033690] cloud-init[2042]: raise DataSourceNotFoundException(msg)
[ 324.034112] cloud-init[2042]: cloudinit.sources.DataSourceNotFoundException: Did not find any data source, searched classes: ()
[ 324.034559] cloud-init[2042]: ------------------------------------------------------------
[FAILED] Failed to start Execute cloud user/final scripts.
See 'systemctl status cloud-final.service' for details.
[ OK ] Reached target Cloud-init target.
[ OK ] Finished Message of the Day.
```
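For context, the empty tuple in "searched classes: ()" means cloud-init had no datasource classes left to try. A simplified sketch of the search it performs (not the actual cloud-init code) looks like this:

```python
# Simplified sketch (not the real cloudinit.sources.find_source): try each
# configured datasource class in order and raise if none can provide data.
# An empty candidate list is exactly what produces "searched classes: ()".
class DataSourceNotFoundException(Exception):
    pass

def find_source(ds_classes):
    for cls in ds_classes:
        ds = cls()
        if ds.get_data():  # e.g. DataSourceMAAS fetching metadata over HTTP
            return ds
    raise DataSourceNotFoundException(
        "Did not find any data source, searched classes: %r" % (tuple(ds_classes),)
    )

# With nothing configured or detected, the search fails immediately:
# find_source([])  ->  DataSourceNotFoundException: ... searched classes: ()
```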

Revision history for this message
Adam Collard (adam-collard) wrote :

This can occur when the machine in question can't reach the MAAS controller; are you able to confirm connectivity from this machine to the rack controller?
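One quick way to check that (a sketch, not a MAAS tool) is to probe the metadata endpoint that cloud-init's DataSourceMAAS talks to from the deploying machine; the URL below is a placeholder, so substitute your rack/region controller address:

```python
# Minimal connectivity probe: even an HTTP 401/404 response proves the
# controller is reachable, while a timeout or connection error points at a
# network problem between the machine and the rack controller.
import urllib.error
import urllib.request

METADATA_URL = "http://10.229.32.21:5248/MAAS/metadata/"  # placeholder address

try:
    with urllib.request.urlopen(METADATA_URL, timeout=10) as resp:
        print("reachable:", resp.status)
except urllib.error.HTTPError as err:
    print("reachable (HTTP error %s), endpoint is up" % err.code)
except Exception as exc:
    print("NOT reachable:", exc)
```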

Revision history for this message
Patricia Domingues (patriciasd) wrote :

We have the rack and region controllers on the same instance, and the machine can PXE boot:

```
>Checking Media Presence......
 Downloading NBP file...

  Succeed to download NBP file.
Fetching Netboot Image
Booting under MAAS direction...
```
So I'd say it can reach MAAS, but let me paste other logs here.

Revision history for this message
Sean Feole (sfeole) wrote :

The Kernel team runs a variety of MAAS servers and can confirm this traceback has been seen in some deployments as of yesterday. Our MAAS versions vary from 2.9 to 3.1.

[ 285.240905] cloud-init[3728]: ------------------------------------------------------------
[ 285.727357] cloud-init[3733]: Can not apply stage final, no datasource found! Likely bad things to come!
[ 285.727555] cloud-init[3733]: ------------------------------------------------------------
[ 285.727658] cloud-init[3733]: Traceback (most recent call last):
[ 285.727723] cloud-init[3733]: File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 521, in main_modules
[ 285.727794] cloud-init[3733]: init.fetch(existing="trust")
[ 285.727862] cloud-init[3733]: File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 411, in fetch
[ 285.727943] cloud-init[3733]: return self._get_data_source(existing=existing)
[ 285.728034] cloud-init[3733]: File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 311, in _get_data_source
[ 285.728110] cloud-init[3733]: (ds, dsname) = sources.find_source(self.cfg,
[ 285.728178] cloud-init[3733]: File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 854, in find_source
[ 285.728244] cloud-init[3733]: raise DataSourceNotFoundException(msg)
[ 285.728308] cloud-init[3733]: cloudinit.sources.DataSourceNotFoundException: Did not find any data source, searched classes: ()
[ 285.728374] cloud-init[3733]: ------------------------------------------------------------

Revision history for this message
dann frazier (dannf) wrote :

Please let us know if there are logs we can provide to help diagnose.

Revision history for this message
Patricia Domingues (patriciasd) wrote :

The last two successful deployments:
```
2022-02-15T01:24:19.976092+00:00 maas maas.node: [info] bizzy: Status transition from DEPLOYING to DEPLOYED
2022-02-15T01:36:11.593020+00:00 maas maas.node: [info] howzit-kernel: Status transition from DEPLOYING to DEPLOYED
```
and then after that, in rackd.log I see:

```
2022-02-15 01:54:57 provisioningserver.rpc.clusterservice: [critical] Failed to contact region. (While requesting RPC info at http://localhost:5240/MAAS).
...
2022-02-15 01:48:22 provisioningserver.rpc.clusterservice: [critical] Failed to contact region. (While requesting RPC info at http://localhost:5240/MAAS).
...
2022-02-15 01:54:57 provisioningserver.rpc.clusterservice: [critical] Failed to contact region. (While requesting RPC info at http://localhost:5240/MAAS).
...
        twisted.internet.error.ConnectingCancelledError: HostnameAddress(hostname=b'localhost', port=5240)
```
These error messages repeat in a loop; I'm attaching the full log.

Revision history for this message
dann frazier (dannf) wrote :

FYI, these nginx logs are contemporaneous, though I have no idea what they indicate:

2022/02/15 21:45:28 [error] 3341359#3341359: *1934369 upstream timed out (110: Connection timed out) while connecting to upstream, client: 10.229.193.154, server: , request: "POST /MAAS/metadata/status/ynet4x HTTP/1.1", upstream: "http://10.155.212.2:5240/MAAS/metadata/status/ynet4x", host: "10.229.32.21:5248"
2022/02/15 21:46:48 [error] 3341359#3341359: *1934377 upstream timed out (110: Connection timed out) while connecting to upstream, client: 10.229.193.154, server: , request: "POST /MAAS/metadata/status/ynet4x HTTP/1.1", upstream: "http://10.155.212.2:5240/MAAS/metadata/status/ynet4x", host: "10.229.32.21:5248"
2022/02/15 21:47:53 [error] 3341359#3341359: *1934395 upstream timed out (110: Connection timed out) while connecting to upstream, client: 10.229.193.154, server: , request: "POST /MAAS/metadata/status/ynet4x HTTP/1.1", upstream: "http://10.155.212.2:5240/MAAS/metadata/status/ynet4x", host: "10.229.32.21:5248"
2022/02/15 21:48:53 [error] 3341359#3341359: *1934399 upstream timed out (110: Connection timed out) while connecting to upstream, client: 10.229.193.154, server: , request: "POST /MAAS/metadata/status/ynet4x HTTP/1.1", upstream: "http://10.155.212.2:5240/MAAS/metadata/status/ynet4x", host: "10.229.32.21:5248"

Revision history for this message
dann frazier (dannf) wrote :

While we do have sporadic messages like this in our nginx error.log, they started piling up around the time this issue was reported to us, starting with this message:

2022/02/15 01:49:24 [error] 3341359#3341359: *1929977 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.229.95.139, server: , request: "POST /MAAS/metadata/status/ww4mgk HTTP/1.1", upstream: "http://10.155.212.2:5240/MAAS/metadata/status/ww4mgk", host: "10.229.32.21:5248"

Around this time we started seeing these pile up in rackd.log:
2022-02-15 01:40:07 provisioningserver.rpc.clusterservice: [critical] Failed to contact region. (While requesting RPC info at http://localhost:5240/MAAS).

Our regiond processes are running, and I don't see anything that seems abnormal in the regiond log around this time. However, these symptoms reminded me of a similar issue in bug 1908452, so I started debugging it similarly. Like bug 1908452, I see one regiond process stuck in a recv call:

root@maas:/var/snap/maas/common/log# strace -p 3340720
strace: Process 3340720 attached
recvfrom(23,

All the other regiond processes are making progress, but not this one.

The server it is talking to appears to be this Canonical server, whose name I currently can't resolve:

root@maas:/var/snap/maas/common/log# lsof -i -a -p 3340720 | grep 23
python3 3340720 root 23u IPv4 3487880288 0t0 TCP maas:42848->https-services.aerodent.canonical.com:http (ESTABLISHED)
root@maas:/var/snap/maas/common/log# host https-services.aerodent.canonical.com
Host https-services.aerodent.canonical.com not found: 3(NXDOMAIN)

However, I suspect it may be related to image fetching again. In our regiond logs, I see that the last log entry related to images appears to have been about an hour before things locked up:

root@maas:/var/snap/maas/common/log# grep image regiond.log | tail -1
2022-02-15 00:38:51 regiond: [info] 127.0.0.1 GET /MAAS/images-stream/streams/v1/maas:v2:download.json HTTP/1.1 --> 200 OK (referrer: -; agent: python-simplestreams/0.1)

Prior to that, we have log entries every hour, but none after. So maybe simplestreams has other places that need a timeout?
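If that is the cause, one defensive approach (a sketch only, not the actual simplestreams fix) is to make sure the process doing the fetch cannot block forever on a socket read, for example with a process-wide socket timeout:

```python
# Defensive sketch: turn a silent hang in recvfrom() during an image-stream
# fetch into a catchable error. The URL is the default MAAS image stream;
# adjust it to your configured boot source. The timeout value is an assumption.
import socket
import urllib.request

socket.setdefaulttimeout(60)  # seconds; applies to sockets created afterwards

try:
    with urllib.request.urlopen(
        "https://images.maas.io/ephemeral-v3/stable/streams/v1/index.json"
    ) as resp:
        index = resp.read()
        print("fetched %d bytes" % len(index))
except socket.timeout:
    print("stream fetch timed out instead of hanging forever")
```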

Changed in cloud-init:
status: New → Invalid
Revision history for this message
Diego Mascialino (dmascialino) wrote :

Hi, this looks like a connection issue between the region and the rack.

Could you attach a report from "sos report -o maas"?

Revision history for this message
Sean Feole (sfeole) wrote :

The Kernel team experienced this problem yesterday, Feb 15th.
MAAS version: 2.9.2 (9164-g.ac176b5c4-0ubuntu1~20.04.1), installed from deb packages.

Here are the logs from the rackd controller at the time this occurred.
There were no obvious warnings/errors/tracebacks in regiond.log.
I restarted the maas-rackd and maas-regiond services and updated the boot sources, and the problem appears to have resolved itself. Still, it would be nice to know why it happened in the first place.

2022-02-15 15:31:44 provisioningserver.rpc.clusterservice: [critical] Failed to contact region. (While requesting RPC info at http://10.246.72.3:5240/MAAS).
 Traceback (most recent call last):
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 460, in callback
     self._startRunCallbacks(result)
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 568, in _startRunCallbacks
     self._runCallbacks()
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
     current.result = callback(current.result, *args, **kw)
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1475, in gotResult
     _inlineCallbacks(r, g, status)
 --- <exception caught here> ---
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1367, in _doUpdate
     eventloops, maas_url = yield self._get_rpc_info(urls)
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1631, in _get_rpc_info
     raise config_exc
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1602, in _get_rpc_info
     eventloops, maas_url = yield self._parallel_fetch_rpc_info(urls)
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
     current.result = callback(current.result, *args, **kw)
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1576, in handle_responses
     errors[0].raiseException()
   File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 467, in raiseException
     raise self.value.with_traceback(self.tb)
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1537, in _serial_fetch_rpc_info
     raise last_exc
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1529, in _serial_fetch_rpc_info
     response = yield self._fetch_rpc_info(url, orig_url)
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
     result = result.throwExceptionIntoGenerator(g)
   File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
     return g.throw(self.type, self.value, self.tb)
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1631, in _get_rpc_info
     raise config_exc
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1602, in _get_rpc_info
     eventloops, maas_url = yield self._parallel_fetch_rpc_info(urls)
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCal...


Revision history for this message
dann frazier (dannf) wrote :

Given the hypothesis stated in comment #7, I've restarted our MAAS server with PYTHONFAULTHANDLER=1 set so that, if it happens again, we can get a backtrace from regiond.
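For reference, PYTHONFAULTHANDLER=1 enables Python's faulthandler at interpreter startup; a process can also arm it explicitly so a hung daemon can be made to dump all thread stacks on demand without being killed. A minimal sketch (not what regiond does today):

```python
# Sketch: what PYTHONFAULTHANDLER=1 enables, plus an explicit handler so that
# "kill -USR1 <pid>" dumps every thread's stack to stderr without terminating
# the process. The environment variable alone only dumps on fatal signals
# (SIGSEGV, SIGABRT, ...), so inspecting a live hang otherwise means killing it.
import faulthandler
import signal

faulthandler.enable()                                    # same as PYTHONFAULTHANDLER=1
faulthandler.register(signal.SIGUSR1, all_threads=True)  # non-fatal, on-demand dump
```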

Revision history for this message
dann frazier (dannf) wrote :

I've automated monitoring for this hang with the attached script, to try to collect a backtrace and kill/respawn the hung daemon, hopefully avoiding another outage.
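The attached script isn't reproduced here; a minimal sketch of the idea, with all names, matching rules, and thresholds as assumptions, might look like the following:

```python
#!/usr/bin/env python3
# Watchdog sketch for a hung regiond worker: if a worker accrues no CPU time
# across a check interval, send SIGABRT so faulthandler (PYTHONFAULTHANDLER=1)
# dumps a traceback, then let the supervisor respawn it.
import os
import signal
import time

import psutil  # third-party: pip install psutil

CHECK_INTERVAL = 300  # seconds between samples (assumed value)

def regiond_workers():
    procs = []
    for p in psutil.process_iter(["cmdline"]):
        cmdline = " ".join(p.info["cmdline"] or [])
        if "regiond" in cmdline:
            procs.append(p)
    return procs

before = {p.pid: sum(p.cpu_times()[:2]) for p in regiond_workers()}
time.sleep(CHECK_INTERVAL)
for p in regiond_workers():
    if p.pid in before and sum(p.cpu_times()[:2]) == before[p.pid]:
        print(f"regiond pid {p.pid} made no progress; sending SIGABRT for a traceback")
        os.kill(p.pid, signal.SIGABRT)
```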

Revision history for this message
Bill Wear (billwear) wrote :

Moving this to Incomplete until we can get a backtrace.

Changed in maas:
status: New → Incomplete
James Falcon (falcojr)
Changed in simplestreams:
status: New → Invalid
Revision history for this message
dann frazier (dannf) wrote :

@James, can you clarify why you believe this bug is Invalid for simplestreams?

Revision history for this message
James Falcon (falcojr) wrote (last edit ):

Whoops, my apologies. I misread a previous comment and assumed the status just hadn't been updated.

Changed in simplestreams:
status: Invalid → New
Alberto Donato (ack)
Changed in maas:
status: Incomplete → New
status: New → Incomplete
Revision history for this message
James Falcon (falcojr) wrote :
Changed in maas:
status: Incomplete → New
status: New → Incomplete
Changed in cloud-init:
importance: Undecided → Unknown
status: Invalid → Unknown
Changed in cloud-init:
status: Unknown → Fix Released