[2.2] rackd errors after fresh install

Bug #1705594 reported by John George
This bug affects 2 people
Affects  Status        Importance  Assigned to  Milestone
MAAS     Fix Released  Critical    Blake Rouse
2.2      Fix Released  Critical    Blake Rouse

Bug Description

Solutions QA CI failure caused by unavailable rack controller after fresh install of maas_2.2.1-6078-g2a6d96e-0ubuntu1~16.04.1

https://solutions.qa.canonical.com/#/qa/testRun/dd7949c8-0ce9-4f2b-a817-1a24c997aa12

rackd.log has errors.

Logs available at:
http://10.245.161.162/swift/v1/solutions-qa/dd7949c8-0ce9-4f2b-a817-1a24c997aa12/cloud_stack_417/maas-logs.tgz

Changed in maas:
milestone: none → 2.3.0
importance: Undecided → Critical
Lee Trager (ltrager) wrote :

According to your logs the rack is trying to connect to the region at 127.0.0.1, is that correct? Could you please post the value of maas_url from /etc/maas/rackd.conf and /etc/maas/regiond.conf?

Changed in maas:
status: New → Incomplete
John George (jog) wrote :

This was a transient CI failure and did not reproduce on the next attempt. I'll capture the configs from /etc when we hit it again.

Chris Gregan (cgregan)
tags: added: cdo-qa-blocker
Andres Rodriguez (andreserl) wrote :

@John, Chris,

Is this still an issue/transient failure?

It seems that with 2.2.2 the issue is not present.

John George (jog) wrote :

We have not hit it again yet.

MAAS deployed in the same environment has these maas_url values, requested in an earlier comment:
/etc/maas/rackd.conf:maas_url: http://10.245.208.33/MAAS
/etc/maas/regiond.conf:maas_url: http://10.245.208.33/MAAS
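
The check requested earlier (does maas_url point at a loopback address?) can be sketched as below. This is a minimal illustration, assuming the config files hold a plain `maas_url: <value>` line as quoted above; the helper names `read_maas_url` and `points_at_loopback` are hypothetical and are not part of MAAS.

```python
import ipaddress
from urllib.parse import urlparse


def read_maas_url(conf_text):
    """Return the maas_url value from rackd.conf/regiond.conf text, or None."""
    for line in conf_text.splitlines():
        # Split at the first colon only, so the URL's own colons survive.
        key, sep, value = line.partition(":")
        if sep and key.strip() == "maas_url":
            return value.strip().strip("'\"")
    return None


def points_at_loopback(url):
    """True if the URL's host is 'localhost' or a loopback IP literal."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # A DNS name rather than an IP literal; this sketch does not resolve it.
        return False


# Value taken from the comment above.
rackd_conf = "maas_url: http://10.245.208.33/MAAS\n"
url = read_maas_url(rackd_conf)
print(url, points_at_loopback(url))
```

With the values posted here the check reports no loopback address, which suggests the 127.0.0.1 connection seen in the logs did not come from these config files as captured.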

Andres Rodriguez (andreserl) wrote :

Thanks for the update. I'll mark this as invalid. If it comes up again, please re-open.

Changed in maas:
status: Incomplete → Invalid
Jason Hobbs (jason-hobbs) wrote :

Here are logs, including /etc, from a reproduction.

Changed in maas:
status: Invalid → New
Greg Lutostanski (lutostag) wrote :

Hit against maas_2.2.3-6106-g314b2b2-0ubuntu1~16.04.1

Greg Lutostanski (lutostag) wrote :
Blake Rouse (blake-rouse) wrote :

Can you provide a full log of all the SQL queries? I know that will be a lot of information. But from the logs I can tell that the rack controller registered at one point and it was given that system_id, but then somehow that system_id is deleted or no longer a rack controller.

So I think that somehow the machine gets converted back to just a region controller instead of a region and rack controller.

Maybe the output of `nodes read` would help provide more information as well.
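
As a rough sketch of what to look for in that output: the snippet below scans a JSON node list and reports whether any entry is still a rack controller. It assumes each entry carries `system_id`, `hostname` and `node_type_name` fields and that rack-capable types contain the word "rack" in `node_type_name`; the sample data and the helper names are made up for illustration.

```python
import json

# Assumption: rack-capable node types are labelled with "rack" in their name,
# e.g. "Rack controller" or "Region and rack controller".
RACK_MARKER = "rack"


def summarize_nodes(nodes_json):
    """Return (system_id, hostname, node_type_name, is_rack) for each node."""
    rows = []
    for node in json.loads(nodes_json):
        type_name = str(node.get("node_type_name", ""))
        rows.append((
            node.get("system_id"),
            node.get("hostname"),
            type_name,
            RACK_MARKER in type_name.lower(),
        ))
    return rows


def missing_rack(nodes_json):
    """True when no node in the listing is a rack controller."""
    return not any(is_rack for _, _, _, is_rack in summarize_nodes(nodes_json))


# Illustrative sample only; not taken from the attached logs.
sample = json.dumps([
    {"system_id": "abc123", "hostname": "maas-host",
     "node_type_name": "Region controller"},
])
for row in summarize_nodes(sample):
    print(row)
print("no rack controller present:", missing_rack(sample))
```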

Changed in maas:
status: New → Incomplete
Jason Hobbs (jason-hobbs) wrote :

We hit this again last night; I've attached logs.

There is a rack controller error in regiond shortly before we hit this:

http://paste.ubuntu.com/25616610/

Jason Hobbs (jason-hobbs) wrote :

Blake, we hit this intermittently on a setup that runs lots of tests every day. We could turn on SQL query logging but it would add a very large load in terms of disk space and I'm not sure how well that would work out.
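
One way to bound the disk cost, sketched under assumptions: Django emits executed SQL on the `django.db.backends` logger at DEBUG level (and only when its debug mode is on), so a rotating file handler caps how much is kept. How MAAS itself wires up logging is not covered here; the function names, file path and size limits below are hypothetical.

```python
def bounded_sql_logging(path, max_bytes=50 * 1024 * 1024, backups=4):
    """Build a logging dictConfig that keeps SQL logs within a fixed budget."""
    return {
        "version": 1,
        "disable_existing_loggers": False,
        "formatters": {"plain": {"format": "%(asctime)s %(message)s"}},
        "handlers": {
            "sql_file": {
                # Rolls over at max_bytes and keeps at most `backups` old files.
                "class": "logging.handlers.RotatingFileHandler",
                "filename": path,
                "maxBytes": max_bytes,
                "backupCount": backups,
                "formatter": "plain",
            },
        },
        "loggers": {
            "django.db.backends": {
                "handlers": ["sql_file"],
                "level": "DEBUG",
                "propagate": False,
            },
        },
    }


def worst_case_bytes(config):
    """Upper bound on disk use: the live file plus every retained backup."""
    handler = config["handlers"]["sql_file"]
    return handler["maxBytes"] * (handler["backupCount"] + 1)


# Hypothetical path, for illustration only.
config = bounded_sql_logging("/tmp/maas-sql.log")
print(worst_case_bytes(config) // (1024 * 1024), "MiB at most")
```

The resulting dictionary can be passed to `logging.config.dictConfig`. The trade-off is that older queries are discarded, so this only helps if the logs are collected soon after the failure.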

Is there additional logging that can be added to MAAS to help identify the source of the failure?

Changed in maas:
status: Incomplete → New
Jason Hobbs (jason-hobbs) wrote :

If the theory is that a controller is turning from a rack controller into just a region controller, can you add logging everywhere that happens? That seems important enough to log, and it doesn't seem like it would happen often enough to add much noise/size to the logs.
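
A minimal sketch of the kind of logging being asked for, not MAAS code: the `NodeType` enum is a hypothetical stand-in for however MAAS represents node types, and `record_type_change` is an invented helper that would be called wherever a node's type is reassigned.

```python
import logging
from enum import IntEnum

log = logging.getLogger("controller-type")


class NodeType(IntEnum):
    # Hypothetical stand-in values; not copied from the MAAS source.
    MACHINE = 0
    DEVICE = 1
    RACK_CONTROLLER = 2
    REGION_CONTROLLER = 3
    REGION_AND_RACK_CONTROLLER = 4


RACK_TYPES = {NodeType.RACK_CONTROLLER, NodeType.REGION_AND_RACK_CONTROLLER}


def record_type_change(system_id, old, new):
    """Log a node type transition; return the message, or None if unchanged."""
    if old == new:
        return None
    message = "node %s changed type: %s -> %s" % (system_id, old.name, new.name)
    if old in RACK_TYPES and new not in RACK_TYPES:
        # Losing rack duties is the suspected failure, so make it stand out.
        log.warning(message)
    else:
        log.info(message)
    return message


print(record_type_change(
    "abc123",  # illustrative system_id
    NodeType.REGION_AND_RACK_CONTROLLER,
    NodeType.REGION_CONTROLLER,
))
```

Because only actual transitions are logged, the volume stays proportional to how often a type changes, which should be rare.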

Chris Gregan (cgregan) wrote :

@Blake
Perhaps a more reasonable fix here is to add logging to MAAS to announce when controllers change spec? Our testing only exposes this failure about once a week, if that. Turning on global SQL logging would likely have other resource impacts and greatly affect our other activities.

Andres Rodriguez (andreserl) wrote :

Hi guys!

What do you mean by "turning from a rack controller into just a region controller" and "announce when controllers change spec"?

Changed in maas:
milestone: 2.3.0 → 2.3.0beta2
Chris Gregan (cgregan) wrote :

@andres

"Announce when controllers change spec" refers to the comment Blake made about what seems to have happened: "from the logs I can tell that the rack controller registered at one point and it was given that system_id, but then somehow that system_id is deleted or no longer a rack controller".

If we had more detailed logging here, it would allow us to properly triage this issue.

Chris Gregan (cgregan) wrote :

Attaching the cdoqa-system-test code for this process.

Chris Gregan (cgregan) wrote :

@Andres

Per your request, this is the script running when the error occurs: https://bazaar.launchpad.net/~cdo-qa/cdoqa-system-tests/trunk/view/head:/cdoqa/set_maas_dhcp.py

Greg Lutostanski (lutostag) wrote :

When this occurs, there is no rack controller at all, and there is no retry once this error path is hit. (A re-occurrence was misfiled as a duplicate: https://bugs.launchpad.net/maas/+bug/1721302).

We can offer the MAAS team the chance to poke at an active reproduction (after it has occurred).
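
To illustrate the missing retry, here is a generic retry-with-backoff sketch. It is not the fix that was released for this bug; `retry_with_backoff` and the `flaky_register` stand-in are hypothetical, and the delays are arbitrary.

```python
import time


def retry_with_backoff(action, attempts=5, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep):
    """Call action() until it succeeds, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                # Out of attempts: surface the last error to the caller.
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay)


# Stand-in for a registration call that fails twice before succeeding.
calls = {"count": 0}


def flaky_register():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("region not reachable yet")
    return "registered"


print(retry_with_backoff(flaky_register, sleep=lambda seconds: None))
```

Passing `sleep` as a parameter keeps the example fast to run and makes the waiting behaviour easy to test.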

Changed in maas:
status: New → Triaged
tags: added: internal
tags: added: foundations-engine
Changed in maas:
milestone: 2.3.0beta2 → 2.3.0beta3
Changed in maas:
assignee: nobody → Blake Rouse (blake-rouse)
Changed in maas:
status: Triaged → In Progress
status: In Progress → Triaged
status: Triaged → In Progress
Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
status: Fix Committed → Fix Released