cloudinit.sources.DataSourceNotFoundException: Did not find any data source, searched classes

Bug #1960944 reported by Patricia Domingues
This bug affects 1 person
Affects        Status        Importance  Assigned to  Milestone
MAAS           Incomplete    Undecided   Unassigned
cloud-init     Fix Released  Unknown
simplestreams  New           Undecided   Unassigned

Bug Description

Unable to deploy bare metal machines (arm64 and amd64) on a
snap-based MAAS 3.1.0 (maas 3.1.0-10901-g.f1f8f1505 18199 3.1/stable).

From the MAAS event log:
```
Tue, 15 Feb. 2022 17:35:33 Node changed status - From 'Deploying' to 'Failed deployment'
 Tue, 15 Feb. 2022 17:35:33 Marking node failed - Node operation 'Deploying' timed out after 30 minutes.
 Tue, 15 Feb. 2022 17:07:44 Node installation - 'cloudinit' searching for network data from DataSourceMAAS
 Tue, 15 Feb. 2022 17:06:44 Node installation - 'cloudinit' attempting to read from cache [trust]
 Tue, 15 Feb. 2022 17:06:42 Node installation - 'cloudinit' attempting to read from cache [check]
 Tue, 15 Feb. 2022 17:05:29 Performing PXE boot
 Tue, 15 Feb. 2022 17:05:29 PXE Request - installation
 Tue, 15 Feb. 2022 17:03:52 Node powered on
```

Server console log shows:

```
ubuntu login: Starting Message of the Day...
[ OK ] Listening on Socket unix for snap application lxd.daemon.
         Starting Service for snap application lxd.activate...
[ OK ] Finished Service for snap application lxd.activate.
[ OK ] Started snap.lxd.hook.conf…-4400-96a8-0c5c9e438c51.scope.
         Starting Time & Date Service...
[ OK ] Started Time & Date Service.
[ OK ] Finished Wait until snapd is fully seeded.
         Starting Apply the settings specified in cloud-config...
[ OK ] Reached target Multi-User System.
[ OK ] Reached target Graphical Interface.
         Starting Update UTMP about System Runlevel Changes...
[ OK ] Finished Update UTMP about System Runlevel Changes.
[ 322.036861] cloud-init[2034]: Can not apply stage config, no datasource found! Likely bad things to come!
[ 322.037477] cloud-init[2034]: ------------------------------------------------------------
[ 322.037907] cloud-init[2034]: Traceback (most recent call last):
[ 322.038341] cloud-init[2034]: File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 521, in main_modules
[ 322.038783] cloud-init[2034]: init.fetch(existing="trust")
[ 322.039181] cloud-init[2034]: File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 411, in fetch
[ 322.039584] cloud-init[2034]: return self._get_data_source(existing=existing)
[ 322.040031] cloud-init[2034]: File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 311, in _get_data_source
[ 322.040381] cloud-init[2034]: (ds, dsname) = sources.find_source(self.cfg,
[ 322.040807] cloud-init[2034]: File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 854, in find_source
[ 322.041221] cloud-init[2034]: raise DataSourceNotFoundException(msg)
[ 322.041659] cloud-init[2034]: cloudinit.sources.DataSourceNotFoundException: Did not find any data source, searched classes: ()
[ 322.042076] cloud-init[2034]: ------------------------------------------------------------
[FAILED] Failed to start Apply the …ngs specified in cloud-config.
See 'systemctl status cloud-config.service' for details.
         Starting Execute cloud user/final scripts...
[ 324.029369] cloud-init[2042]: Can not apply stage final, no datasource found! Likely bad things to come!
[ 324.029979] cloud-init[2042]: ------------------------------------------------------------
[ 324.030420] cloud-init[2042]: Traceback (most recent call last):
[ 324.030835] cloud-init[2042]: File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 521, in main_modules
[ 324.031258] cloud-init[2042]: init.fetch(existing="trust")
[ 324.031647] cloud-init[2042]: File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 411, in fetch
[ 324.032111] cloud-init[2042]: return self._get_data_source(existing=existing)
[ 324.032483] cloud-init[2042]: File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 311, in _get_data_source
[ 324.032865] cloud-init[2042]: (ds, dsname) = sources.find_source(self.cfg,
[ 324.033292] cloud-init[2042]: File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 854, in find_source
[ 324.033690] cloud-init[2042]: raise DataSourceNotFoundException(msg)
[ 324.034112] cloud-init[2042]: cloudinit.sources.DataSourceNotFoundException: Did not find any data source, searched classes: ()
[ 324.034559] cloud-init[2042]: ------------------------------------------------------------
[FAILED] Failed to start Execute cloud user/final scripts.
See 'systemctl status cloud-final.service' for details.
[ OK ] Reached target Cloud-init target.
[ OK ] Finished Message of the Day.
```
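For context, the empty tuple in "searched classes: ()" means cloud-init had no datasource classes left to try. A simplified sketch of the search it performs (not the actual cloud-init code) looks like this:

```python
# Simplified sketch (not the real cloudinit.sources.find_source): try each
# configured datasource class in order and raise if none can provide data.
# An empty candidate list is exactly what produces "searched classes: ()".
class DataSourceNotFoundException(Exception):
    pass

def find_source(ds_classes):
    for cls in ds_classes:
        ds = cls()
        if ds.get_data():  # e.g. DataSourceMAAS fetching metadata over HTTP
            return ds
    raise DataSourceNotFoundException(
        "Did not find any data source, searched classes: %r" % (tuple(ds_classes),)
    )

# With nothing configured or detected, the search fails immediately:
# find_source([])  ->  DataSourceNotFoundException: ... searched classes: ()
```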

Revision history for this message
Adam Collard (adam-collard) wrote :

This can occur when the machine in question can't reach the MAAS controller; are you able to confirm connectivity from this machine to the rack controller?
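One quick way to check that (a sketch, not a MAAS tool) is to probe the metadata endpoint that cloud-init's DataSourceMAAS talks to from the deploying machine; the URL below is a placeholder, so substitute your rack/region controller address:

```python
# Minimal connectivity probe: even an HTTP 401/404 response proves the
# controller is reachable, while a timeout or connection error points at a
# network problem between the machine and the rack controller.
import urllib.error
import urllib.request

METADATA_URL = "http://10.229.32.21:5248/MAAS/metadata/"  # placeholder address

try:
    with urllib.request.urlopen(METADATA_URL, timeout=10) as resp:
        print("reachable:", resp.status)
except urllib.error.HTTPError as err:
    print("reachable (HTTP error %s), endpoint is up" % err.code)
except Exception as exc:
    print("NOT reachable:", exc)
```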

Revision history for this message
Patricia Domingues (patriciasd) wrote :

We have the rack and region controllers on the same instance, and the machine can PXE boot:

```
>Checking Media Presence......
 Downloading NBP file...

  Succeed to download NBP file.
Fetching Netboot Image
Booting under MAAS direction...
```
So I'd say it can reach MAAS, but let me paste other logs here.

Revision history for this message
Sean Feole (sfeole) wrote :

The Kernel team runs a variety of MAAS servers and can confirm this traceback has been seen in some deployments as of yesterday. Our MAAS versions vary from 2.9 to 3.1.

[ 285.240905] cloud-init[3728]: ------------------------------------------------------------
[ 285.727357] cloud-init[3733]: Can not apply stage final, no datasource found! Likely bad things to come!
[ 285.727555] cloud-init[3733]: ------------------------------------------------------------
[ 285.727658] cloud-init[3733]: Traceback (most recent call last):
[ 285.727723] cloud-init[3733]: File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 521, in main_modules
[ 285.727794] cloud-init[3733]: init.fetch(existing="trust")
[ 285.727862] cloud-init[3733]: File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 411, in fetch
[ 285.727943] cloud-init[3733]: return self._get_data_source(existing=existing)
[ 285.728034] cloud-init[3733]: File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 311, in _get_data_source
[ 285.728110] cloud-init[3733]: (ds, dsname) = sources.find_source(self.cfg,
[ 285.728178] cloud-init[3733]: File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 854, in find_source
[ 285.728244] cloud-init[3733]: raise DataSourceNotFoundException(msg)
[ 285.728308] cloud-init[3733]: cloudinit.sources.DataSourceNotFoundException: Did not find any data source, searched classes: ()
[ 285.728374] cloud-init[3733]: ------------------------------------------------------------

Revision history for this message
dann frazier (dannf) wrote :

Please let us know if there are logs we can provide to help diagnose.

Revision history for this message
Patricia Domingues (patriciasd) wrote :

The last two successful deployments:
```
2022-02-15T01:24:19.976092+00:00 maas maas.node: [info] bizzy: Status transition from DEPLOYING to DEPLOYED
2022-02-15T01:36:11.593020+00:00 maas maas.node: [info] howzit-kernel: Status transition from DEPLOYING to DEPLOYED
```
and then after that, in rackd.log I see:

```
2022-02-15 01:54:57 provisioningserver.rpc.clusterservice: [critical] Failed to contact region. (While requesting RPC info at http://localhost:5240/MAAS).
...
2022-02-15 01:48:22 provisioningserver.rpc.clusterservice: [critical] Failed to contact region. (While requesting RPC info at http://localhost:5240/MAAS).
...
2022-02-15 01:54:57 provisioningserver.rpc.clusterservice: [critical] Failed to contact region. (While requesting RPC info at http://localhost:5240/MAAS).
...
        twisted.internet.error.ConnectingCancelledError: HostnameAddress(hostname=b'localhost', port=5240)
```
These error messages repeat in a loop; I'm attaching the full log.

Revision history for this message
dann frazier (dannf) wrote :

FYI, these nginx logs are contemporaneous, though I have no idea what they indicate:

2022/02/15 21:45:28 [error] 3341359#3341359: *1934369 upstream timed out (110: Connection timed out) while connecting to upstream, client: 10.229.193.154, server: , request: "POST /MAAS/metadata/status/ynet4x HTTP/1.1", upstream: "http://10.155.212.2:5240/MAAS/metadata/status/ynet4x", host: "10.229.32.21:5248"
2022/02/15 21:46:48 [error] 3341359#3341359: *1934377 upstream timed out (110: Connection timed out) while connecting to upstream, client: 10.229.193.154, server: , request: "POST /MAAS/metadata/status/ynet4x HTTP/1.1", upstream: "http://10.155.212.2:5240/MAAS/metadata/status/ynet4x", host: "10.229.32.21:5248"
2022/02/15 21:47:53 [error] 3341359#3341359: *1934395 upstream timed out (110: Connection timed out) while connecting to upstream, client: 10.229.193.154, server: , request: "POST /MAAS/metadata/status/ynet4x HTTP/1.1", upstream: "http://10.155.212.2:5240/MAAS/metadata/status/ynet4x", host: "10.229.32.21:5248"
2022/02/15 21:48:53 [error] 3341359#3341359: *1934399 upstream timed out (110: Connection timed out) while connecting to upstream, client: 10.229.193.154, server: , request: "POST /MAAS/metadata/status/ynet4x HTTP/1.1", upstream: "http://10.155.212.2:5240/MAAS/metadata/status/ynet4x", host: "10.229.32.21:5248"

Revision history for this message
dann frazier (dannf) wrote :

While we do have sporadic messages like this in our nginx error.log, they started piling up around the time this issue was reported to us, starting with this message:

2022/02/15 01:49:24 [error] 3341359#3341359: *1929977 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.229.95.139, server: , request: "POST /MAAS/metadata/status/ww4mgk HTTP/1.1", upstream: "http://10.155.212.2:5240/MAAS/metadata/status/ww4mgk", host: "10.229.32.21:5248"

Around this time we started seeing these pile up in rackd.log:
2022-02-15 01:40:07 provisioningserver.rpc.clusterservice: [critical] Failed to contact region. (While requesting RPC info at http://localhost:5240/MAAS).

Our regiond processes are running, and I don't see anything that seems abnormal in the regiond log around this time. However, these symptoms reminded me of a similar issue in bug 1908452, so I started debugging it similarly. Like bug 1908452, I see one regiond process stuck in a recv call:

root@maas:/var/snap/maas/common/log# strace -p 3340720
strace: Process 3340720 attached
recvfrom(23,

All the other regiond processes are making progress, but not this one.

The server it is talking to appears to be this Canonical server, whose name I currently can't resolve:

root@maas:/var/snap/maas/common/log# lsof -i -a -p 3340720 | grep 23
python3 3340720 root 23u IPv4 3487880288 0t0 TCP maas:42848->https-services.aerodent.canonical.com:http (ESTABLISHED)
root@maas:/var/snap/maas/common/log# host https-services.aerodent.canonical.com
Host https-services.aerodent.canonical.com not found: 3(NXDOMAIN)

However, I suspect it may be related to image fetching again. In our regiond logs, I see that the last log entry related to images appears to have been about an hour before things locked up:

root@maas:/var/snap/maas/common/log# grep image regiond.log | tail -1
2022-02-15 00:38:51 regiond: [info] 127.0.0.1 GET /MAAS/images-stream/streams/v1/maas:v2:download.json HTTP/1.1 --> 200 OK (referrer: -; agent: python-simplestreams/0.1)

Prior to that, we have log entries every hour, but none after. So maybe simplestreams has other places that need a timeout?
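If that is the cause, one defensive approach (a sketch only, not the actual simplestreams fix) is to make sure the process doing the fetch cannot block forever on a socket read, for example with a process-wide socket timeout:

```python
# Defensive sketch: turn a silent hang in recvfrom() during an image-stream
# fetch into a catchable error. The URL is the default MAAS image stream;
# adjust it to your configured boot source. The timeout value is an assumption.
import socket
import urllib.request

socket.setdefaulttimeout(60)  # seconds; applies to sockets created afterwards

try:
    with urllib.request.urlopen(
        "https://images.maas.io/ephemeral-v3/stable/streams/v1/index.json"
    ) as resp:
        index = resp.read()
        print("fetched %d bytes" % len(index))
except socket.timeout:
    print("stream fetch timed out instead of hanging forever")
```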

Changed in cloud-init:
status: New → Invalid
Revision history for this message
Diego Mascialino (dmascialino) wrote :

Hi, this looks like a connection issue between the region and the rack.

Could you attach a report from "sos report -o maas"?

Revision history for this message
Sean Feole (sfeole) wrote :

The Kernel team experienced this problem yesterday, Feb 15th.
MAAS version: 2.9.2 (9164-g.ac176b5c4-0ubuntu1~20.04.1), installed from deb packages.

Here are the logs from the rackd controller at the time this occurred.
There were no obvious warnings/errors/tracebacks in regiond.log.
I restarted the maas-rackd and maas-regiond services and updated the boot sources, and the problem appears to have resolved itself. Still, it would be nice to know why it happened in the first place.

2022-02-15 15:31:44 provisioningserver.rpc.clusterservice: [critical] Failed to contact region. (While requesting RPC info at http://10.246.72.3:5240/MAAS).
 Traceback (most recent call last):
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 460, in callback
     self._startRunCallbacks(result)
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 568, in _startRunCallbacks
     self._runCallbacks()
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
     current.result = callback(current.result, *args, **kw)
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1475, in gotResult
     _inlineCallbacks(r, g, status)
 --- <exception caught here> ---
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1367, in _doUpdate
     eventloops, maas_url = yield self._get_rpc_info(urls)
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1631, in _get_rpc_info
     raise config_exc
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1602, in _get_rpc_info
     eventloops, maas_url = yield self._parallel_fetch_rpc_info(urls)
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
     current.result = callback(current.result, *args, **kw)
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1576, in handle_responses
     errors[0].raiseException()
   File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 467, in raiseException
     raise self.value.with_traceback(self.tb)
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1537, in _serial_fetch_rpc_info
     raise last_exc
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1529, in _serial_fetch_rpc_info
     response = yield self._fetch_rpc_info(url, orig_url)
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
     result = result.throwExceptionIntoGenerator(g)
   File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
     return g.throw(self.type, self.value, self.tb)
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1631, in _get_rpc_info
     raise config_exc
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1602, in _get_rpc_info
     eventloops, maas_url = yield self._parallel_fetch_rpc_info(urls)
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCal...


Revision history for this message
dann frazier (dannf) wrote :

Given the hypothesis stated in comment #7, I've restarted our MAAS server with PYTHONFAULTHANDLER=1 set so that, if it happens again, we can get a backtrace from regiond.
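For reference, PYTHONFAULTHANDLER=1 enables Python's faulthandler at interpreter startup; a process can also arm it explicitly so a hung daemon can be made to dump all thread stacks on demand without being killed. A minimal sketch (not what regiond does today):

```python
# Sketch: what PYTHONFAULTHANDLER=1 enables, plus an explicit handler so that
# "kill -USR1 <pid>" dumps every thread's stack to stderr without terminating
# the process. The environment variable alone only dumps on fatal signals
# (SIGSEGV, SIGABRT, ...), so inspecting a live hang otherwise means killing it.
import faulthandler
import signal

faulthandler.enable()                                    # same as PYTHONFAULTHANDLER=1
faulthandler.register(signal.SIGUSR1, all_threads=True)  # non-fatal, on-demand dump
```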

Revision history for this message
dann frazier (dannf) wrote :

I've automated monitoring for this hang with the attached script, to try to collect a backtrace and kill/respawn the hung daemon, hopefully avoiding another outage.
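The attached script isn't reproduced here; a minimal sketch of the idea, with all names, matching rules, and thresholds as assumptions, might look like the following:

```python
#!/usr/bin/env python3
# Watchdog sketch for a hung regiond worker: if a worker accrues no CPU time
# across a check interval, send SIGABRT so faulthandler (PYTHONFAULTHANDLER=1)
# dumps a traceback, then let the supervisor respawn it.
import os
import signal
import time

import psutil  # third-party: pip install psutil

CHECK_INTERVAL = 300  # seconds between samples (assumed value)

def regiond_workers():
    procs = []
    for p in psutil.process_iter(["cmdline"]):
        cmdline = " ".join(p.info["cmdline"] or [])
        if "regiond" in cmdline:
            procs.append(p)
    return procs

before = {p.pid: sum(p.cpu_times()[:2]) for p in regiond_workers()}
time.sleep(CHECK_INTERVAL)
for p in regiond_workers():
    if p.pid in before and sum(p.cpu_times()[:2]) == before[p.pid]:
        print(f"regiond pid {p.pid} made no progress; sending SIGABRT for a traceback")
        os.kill(p.pid, signal.SIGABRT)
```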

Revision history for this message
Bill Wear (billwear) wrote :

Moving this to Incomplete until we can get a backtrace.

Changed in maas:
status: New → Incomplete
James Falcon (falcojr)
Changed in simplestreams:
status: New → Invalid
Revision history for this message
dann frazier (dannf) wrote :

@James, can you clarify why you believe this bug is Invalid for simplestreams?

Revision history for this message
James Falcon (falcojr) wrote (last edit ):

Whoops, my apologies. I misread a previous comment and assumed the status just hadn't been updated.

Changed in simplestreams:
status: Invalid → New
Alberto Donato (ack)
Changed in maas:
status: Incomplete → New
status: New → Incomplete
Revision history for this message
James Falcon (falcojr) wrote :
Changed in maas:
status: Incomplete → New
status: New → Incomplete
Changed in cloud-init:
importance: Undecided → Unknown
status: Invalid → Unknown
Changed in cloud-init:
status: Unknown → Fix Released