MAAS

[2.2] Duplicate communication is occurring to the same rack controller

Bug #1704489 reported by Jason Hobbs on 2017-07-14

This bug affects 2 people

Affects		Status	Importance	Assigned to	Milestone
	MAAS	Fix Released	Critical	Blake Rouse	MAAS 2.3.0alpha1
	2.2	Fix Released	Critical	Blake Rouse	MAAS 2.2.2

Bug Description

I'm adding a machine, which appears to work fine, but MAAS returns a 401 error:

Command failed: maas-cli root machines create hostname=taillow power_type=ipmi architecture=amd64/generic mac_addresses=ec:b1:d7:80:90:ac power_parameters_power_address=192.168.246.113 power_parameters_power_user=Administrator power_parameters_power_pass=insecure power_parameters_power_driver=LAN_2_0
b"Authorization Error: 'Nonce already used: afe9692868de46b3b163622f0c38ff09'"

from regiond.log:
2017-07-14 21:20:13 regiond: [info] 10.245.208.32 POST /MAAS/api/2.0/machines/ HTTP/1.1 --> 401 UNAUTHORIZED (referrer: -; agent: Python-httplib2/0.9.1 (gzip))

from maas.log:
Jul 14 21:19:43 lairon maas.api: [info] taillow: Enlisted new machine
Jul 14 21:19:43 lairon maas.node: [info] taillow: Status transition from NEW to COMMISSIONING
Jul 14 21:19:47 lairon maas.power: [info] Changed power state (on) of node: spinda (xneqy8)
Jul 14 21:20:14 lairon maas.power: [info] Changing power state (on) of node: taillow (6gtqax)
Jul 14 21:20:14 lairon maas.node: [info] taillow: Commissioning started
Jul 14 21:20:27 lairon maas.power: [info] Changed power state (on) of node: taillow (6gtqax)

This is very similar to bug 1702751 except the node is in commissioning state in this one, and I don't see any errors anywhere.

It's worth pointing out we're using HA rack controllers. I get the feeling that this is somehow related to it taking a while to contact the BMC for the first time.

See original description

Tags:

Related branches

~blake-rouse/maas:get-all-rack-conns-2.2

Merged into maas:2.2

Blake Rouse (community): Approve on 2017-07-18

~blake-rouse/maas:get-all-rack-conns

Merged into maas:master

Mike Pontillo (community): Approve on 2017-07-18

Jason Hobbs (jason-hobbs) on 2017-07-17

description:

updated

Revision history for this message

Jason Hobbs (jason-hobbs) wrote on 2017-07-17:

newell is helping me look at this. At his suggestion, I stopped the maas-rackd service on my second rack controller and tried the add node over. It proceeds much quicker - nodes are added faster and it succeeds every time. It seems like having the second rack controller enabled causes a delay that causes this to fail.

Chris Gregan (cgregan) on 2017-07-17

tags:

added: cdo-qa-blocker

Revision history for this message

Jason Hobbs (jason-hobbs) wrote on 2017-07-17:

The 401 is happening 30 seconds after the request is made. I've added some logging and it looks like all of the time is being spent in power_query_all.

Also, here is a pastebin of the traceback I got once I enabled logging of tracebacks:
http://paste.ubuntu.com/25115008/

Revision history for this message

Jason Hobbs (jason-hobbs) wrote on 2017-07-17:

I added some logging in the ipmi power code and in power_query_all() where it's building its request list. It's sending the power query request once for each RPC client it is - which is quite a few. Here's some output from that:

http://paste.ubuntu.com/25115285/

Jul 17 23:36:08 lairon maas.power: [error] making query for client yb87bd
Jul 17 23:36:08 lairon maas.power: message repeated 11 times: [ [error] making query for client yb87bd]

That means controller yb87bd will make 12 ipmi power queries, all to the same node, and power_query_all() will wait for them all to complete.

You can see the 8 requests issued on the rackd that's on the same node as the region log:
Jul 17 23:36:08 lairon maas.drivers.power.ipmi: [error] JASON: issuing command to ('/usr/sbin/ipmipower', '-W', 'opensesspriv', '--driver-type', 'LAN_2_0', '-h', '192.168.246.112', '-u', 'Administrator', '-p', 'insecure', '--stat'): 192.168.246.112
Jul 17 23:36:08 lairon maas.drivers.power.ipmi: message repeated 7 times: [ [error] JASON: issuing command to ('/usr/sbin/ipmipower', '-W', 'opensesspriv', '--driver-type', 'LAN_2_0', '-h', '192.168.246.112', '-u', 'Administrator', '-p', 'insecure', '--stat'): 192.168.246.112]

So for one node, that's 20 ipmi power queries. That's probably why it's taking so long.

It should only be doing one ipmi power query per rack controller. I'm not sure if it's right for there to be multiple RPC clients per rack controller, but if so, the code in power_query_all shouldn't be using each of them.

Revision history for this message

Jason Hobbs (jason-hobbs) wrote on 2017-07-17:

There are at least a couple of issues here - one is the multiple ipmi requests per controller thing, the other is how the error is being suppressed/turned into a Nonce error.

I made a PR with code that I've tested and know fixes the multiple IPMI requests per controller issue:
https://github.com/maas/maas/pull/1

Revision history for this message

Blake Rouse (blake-rouse) wrote on 2017-07-18:

I am going to target this bug ore towards the Critical issue of communicating to the same rack controller, multiple times.

This is just a broader issue in getAllClients and affects code through out MAAS.

summary:	- nonce has already been used error when adding a machine + [2.2] Duplicate communication is occurring to the same rack controller
Changed in maas:
status:	New → In Progress
importance:	Undecided → Critical
assignee:	nobody → Blake Rouse (blake-rouse)
milestone:	none → 2.3.0

MAAS Lander (maas-lander) on 2017-07-18

Changed in maas:
status:	In Progress → Fix Committed

Andres Rodriguez (andreserl) on 2017-07-28

Changed in maas:
milestone:	2.3.0 → 2.3.0alpha1

Andres Rodriguez (andreserl) on 2017-08-02

Changed in maas:
status:	Fix Committed → Fix Released

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.