Failed Commissioning smartctl-validate

Bug #1746817 reported by Stefano Coronado
This bug affects 4 people
Affects    Status       Importance   Assigned to   Milestone
MAAS       Invalid      High         Unassigned
2.3        Won't Fix    High         Unassigned

Bug Description

When we commission our node with the smartctl-validate hardware test, we get the following error:

modprobe: ERROR: could not insert 'ipmi_si': No such device
Exception in thread smartctl-validate (id: 455, script_version_id: 1):
Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/tmp/user_data.sh.xLppTV/bin/maas-run-remote-scripts", line 328, in run_script
    script_arguments = parse_parameters(script, scripts_dir)
  File "/tmp/user_data.sh.xLppTV/bin/maas-run-remote-scripts", line 277, in parse_parameters
    model = value.get('model')
AttributeError: 'str' object has no attribute 'get'

Relevant details of our dmidecode output are included in dmi-info.txt to give a fuller picture of the nodes (or motherboards) we are using. We are using Dell OptiPlex 990s.

Thanks

Revision history for this message
Andres Rodriguez (andreserl) wrote :

@Stefano,

Do you have any output stored in MAAS itself (e.g. if you open the script result and look at the different outputs it provides)?

Thanks

Changed in maas:
milestone: none → 2.4.x
importance: Undecided → High
Revision history for this message
Stefano Coronado (stefanoume) wrote :

Hi @Andres,

In the MAAS interface, the runtime timer continues past the timeout with no output.

Revision history for this message
Lee Trager (ltrager) wrote :

How did you start commissioning? Was it through the API or UI? Did you pass any parameters?

Can you reproduce with SSH enabled and post all the files in /tmp/user_data.sh*?

Changed in maas:
status: New → Incomplete
Revision history for this message
Stefano Coronado (stefanoume) wrote :

It is through the UI. We passed no additional parameters.

Attached are the contents of the user_data directory.

Revision history for this message
Lee Trager (ltrager) wrote :

In the UI, on the test tab, is the name of the storage device shown above the test? Which version of MAAS are you using?

Revision history for this message
Stefano Coronado (stefanoume) wrote :

@ltrager

A screenshot of what we are seeing is attached.

We are using: MAAS version: 2.3.0 (6434-gd354690-0ubuntu1~16.04.1)

Revision history for this message
Lee Trager (ltrager) wrote :

It looks like a ScriptResult for the disk on the system was never created.

Was this the first time this machine has been commissioned?
Can you confirm a disk has been created for this machine?

Could you post the output of:
maas <profile> node-script-results read <system_id> type=testing
maas <profile> machine read <system_id>
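
Once the `machine read` output is saved, the disk question can be checked mechanically. The sketch below is only an illustration: it assumes the JSON carries a `blockdevice_set` list whose entries have a `name` field, as MAAS 2.x output usually does. The function name `block_device_names` and the sample data are made up for this example.

```python
import json


def block_device_names(machine_json):
    """Return names of the block devices listed in `machine read` output.

    Assumes the JSON object has a "blockdevice_set" list whose entries
    have a "name" key; a missing or null list counts as no disks.
    """
    machine = json.loads(machine_json)
    devices = machine.get("blockdevice_set") or []
    return [device.get("name") for device in devices]


# Illustrative samples only, not real MAAS output.
with_disk = '{"system_id": "abc123", "blockdevice_set": [{"name": "sda"}]}'
diskless = '{"system_id": "abc123", "blockdevice_set": []}'

print(block_device_names(with_disk))
print(block_device_names(diskless))
```

An empty list for the affected machine would confirm that MAAS has not recorded any disk for it.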

Revision history for this message
Stefano Coronado (stefanoume) wrote :

@ltrager

We tried commissioning this machine multiple times but it failed testing.

There are no disks under the storage tab in the interface. Attached is the output you asked for.

Revision history for this message
Lee Trager (ltrager) wrote :

When a new machine is tested, MAAS uses a placeholder for storage tests until commissioning runs, at which point it creates a result per disk. MAAS is not creating any disks for this machine, so it doesn't know what to test. This currently causes an exception to be raised in the script runner. There are two bugs here:

1. MAAS isn't discovering block devices for this machine.
2. Running a test script on a diskless system causes the script runner to crash.
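
The second bug matches the traceback in the description: `value.get('model')` fails because `value` is a string rather than a dict. A minimal Python sketch of that failure mode and one possible guard follows. The parameter layout and the names `storage_models` and `PLACEHOLDER` are assumptions made for illustration, not the actual maas-run-remote-scripts code.

```python
# Hypothetical stand-in for the placeholder left in a storage parameter
# when no disk has been created for the machine.
PLACEHOLDER = "all"


def storage_models(parameters):
    """Return the disk models named by storage parameters.

    Each value is expected to be a dict such as
    {"model": "...", "serial": "..."}. Values of any other type
    (for example an unresolved placeholder string) are skipped
    instead of raising AttributeError.
    """
    models = []
    for value in parameters.values():
        if not isinstance(value, dict):
            # A plain string has no .get(); this is the case that
            # crashed the script runner in the traceback.
            continue
        models.append(value.get("model"))
    return models


# Reproduce the original failure on a diskless machine.
diskless = {"storage": PLACEHOLDER}
try:
    diskless["storage"].get("model")
except AttributeError as exc:
    print(exc)

# With the guard, the diskless case yields no models to test.
print(storage_models(diskless))
print(storage_models({"storage": {"model": "ExampleDisk", "serial": "0001"}}))
```

Skipping the unresolved placeholder is only one option; a runner could equally mark the test as failed with a clear "no disks found" message instead of hanging until the timeout.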

@stefanoume
Did any commissioning scripts fail to run? If so, can you post the output of the failed ones?

Does commissioning fail quickly, or do you have to wait for it to time out (20 minutes)?

Revision history for this message
Andres Rodriguez (andreserl) wrote :

The bug here is number 2. It is not surprising that no disks were found,
given that this hardware may need a newer kernel.

Have you tried commissioning with a newer kernel?

Revision history for this message
Stefano Coronado (stefanoume) wrote :

@ltrager

All the commissioning scripts pass. It fails at the testing portion with smartctl-validate.
It doesn't fail quickly; it times out.

Revision history for this message
Stefano Coronado (stefanoume) wrote :

@andreserl

I was originally commissioning with the Ubuntu 16.04 kernel (ga-16.04). I tried it with the hwe kernel and it failed the same way.

For the sake of completeness I tried the Trusty Tahr 14.04 ga kernel, and that failed in the same way.

The Ubuntu Bionic kernel (ga-18.04) failed the testing portion in a different way. The GUI shows it stuck on "Installing Dependencies". Checking the output on the node's screen, there seems to be a loop with NTP:

Stopping Network Time Service...
Stopped Network Time Service.
Starting Network Time Service...
Started Network Time Service.
Started ntp-systemd-netif.service
Stopping Network Time Service.
Stopped Network Time Service.
Starting Network Time Service...
Started Network Time Service
Started ntp-systemd-netif.service

This went on until I shut down the node.

Revision history for this message
Andres Rodriguez (andreserl) wrote :

Stefano, in Settings > Commissioning set the kernel to hwe-16.04-edge and
make sure that the machine itself doesn't have a pinned minimum kernel.

Revision history for this message
Stefano Coronado (stefanoume) wrote :

@andreserl

I've set the release to xenial and the default minimum kernel version to hwe-16.04-edge in the general MAAS settings.

For the node itself, I set no minimum kernel (assuming that is what you mean by a pinned minimum kernel).

We still got the same Python error with smartctl-validate.

Revision history for this message
Stefano Coronado (stefanoume) wrote :

@andreserl

BTW: I'm noticing that the minimum kernel in the node settings keeps switching itself back to "xenial hwe-16.04-edge"

Revision history for this message
Andres Rodriguez (andreserl) wrote :

What is the output of the following commissioning script?
00-maas-07-block-devices

Nodes > <your machine> > Commissioning > 00-maas-07-block-devices > View log > [ combined | stdout | stderr | yaml ]

Revision history for this message
Stefano Coronado (stefanoume) wrote :

@andreserl

The output of 00-maas-07-block-devices is attached. It is the combined output.

Revision history for this message
Stefano Coronado (stefanoume) wrote :

@andreserl

Do you think we will find a resolution to this? Our issue has not been fixed yet.

Revision history for this message
Lee Trager (ltrager) wrote :

@stefanoume can you try rerunning commissioning with no test scripts selected?

Revision history for this message
Stefano Coronado (stefanoume) wrote :

@ltrager

Without any scripts selected, it passes commissioning.

However, we can't deploy it, because MAAS reports that no disks were found.

Revision history for this message
Lee Trager (ltrager) wrote :

@stefanoume could you try deleting the machine, re-adding it, and commissioning again?

Revision history for this message
yinxingpan (yinxingpan) wrote :

I got this error too: HP ProLiant DL360 Gen9, Ubuntu 18.04, MAAS 2.4.0~beta2.

The HP server reports these errors:
firmware does not respond

fatal hardware error
Hardware error from APEI generic hardware error source

Revision history for this message
appu (appu1491) wrote :

Where can I find the smartctl-validate script? In which directory?

Revision history for this message
Andres Rodriguez (andreserl) wrote :

Hi All,

We believe this issue has now been fixed in MAAS 2.4.2. Please upgrade and re-test. If you are still experiencing this, please re-open this bug report.

Changed in maas:
status: Incomplete → Invalid