[2.2] Duplicate communication is occurring to the same rack controller

Bug #1704489 reported by Jason Hobbs on 2017-07-14
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
MAAS
Critical
Blake Rouse
2.2
Critical
Blake Rouse

Bug Description

I'm adding a machine, which appears to work fine, but MAAS returns a 401 error:

Command failed: maas-cli root machines create hostname=taillow power_type=ipmi architecture=amd64/generic mac_addresses=ec:b1:d7:80:90:ac power_parameters_power_address=192.168.246.113 power_parameters_power_user=Administrator power_parameters_power_pass=insecure power_parameters_power_driver=LAN_2_0
b"Authorization Error: 'Nonce already used: afe9692868de46b3b163622f0c38ff09'"

from regiond.log:
2017-07-14 21:20:13 regiond: [info] 10.245.208.32 POST /MAAS/api/2.0/machines/ HTTP/1.1 --> 401 UNAUTHORIZED (referrer: -; agent: Python-httplib2/0.9.1 (gzip))

from maas.log:
Jul 14 21:19:43 lairon maas.api: [info] taillow: Enlisted new machine
Jul 14 21:19:43 lairon maas.node: [info] taillow: Status transition from NEW to COMMISSIONING
Jul 14 21:19:47 lairon maas.power: [info] Changed power state (on) of node: spinda (xneqy8)
Jul 14 21:20:14 lairon maas.power: [info] Changing power state (on) of node: taillow (6gtqax)
Jul 14 21:20:14 lairon maas.node: [info] taillow: Commissioning started
Jul 14 21:20:27 lairon maas.power: [info] Changed power state (on) of node: taillow (6gtqax)

This is very similar to bug 1702751 except the node is in commissioning state in this one, and I don't see any errors anywhere.

It's worth pointing out we're using HA rack controllers. I get the feeling that this is somehow related to it taking a while to contact the BMC for the first time.

Related branches

description: updated
Jason Hobbs (jason-hobbs) wrote :

newell is helping me look at this. At his suggestion, I stopped the maas-rackd service on my second rack controller and tried the add node over. It proceeds much quicker - nodes are added faster and it succeeds every time. It seems like having the second rack controller enabled causes a delay that causes this to fail.

Chris Gregan (cgregan) on 2017-07-17
tags: added: cdo-qa-blocker
Jason Hobbs (jason-hobbs) wrote :

The 401 is happening 30 seconds after the request is made. I've added some logging and it looks like all of the time is being spent in power_query_all.

Also, here is a pastebin of the traceback I got once I enabled logging of tracebacks:
http://paste.ubuntu.com/25115008/

Jason Hobbs (jason-hobbs) wrote :

I added some logging in the ipmi power code and in power_query_all() where it's building its request list. It's sending the power query request once for each RPC client it is - which is quite a few. Here's some output from that:

http://paste.ubuntu.com/25115285/

Jul 17 23:36:08 lairon maas.power: [error] making query for client yb87bd
Jul 17 23:36:08 lairon maas.power: message repeated 11 times: [ [error] making query for client yb87bd]

That means controller yb87bd will make 12 ipmi power queries, all to the same node, and power_query_all() will wait for them all to complete.

You can see the 8 requests issued on the rackd that's on the same node as the region log:
Jul 17 23:36:08 lairon maas.drivers.power.ipmi: [error] JASON: issuing command to ('/usr/sbin/ipmipower', '-W', 'opensesspriv', '--driver-type', 'LAN_2_0', '-h', '192.168.246.112', '-u', 'Administrator', '-p', 'insecure', '--stat'): 192.168.246.112
Jul 17 23:36:08 lairon maas.drivers.power.ipmi: message repeated 7 times: [ [error] JASON: issuing command to ('/usr/sbin/ipmipower', '-W', 'opensesspriv', '--driver-type', 'LAN_2_0', '-h', '192.168.246.112', '-u', 'Administrator', '-p', 'insecure', '--stat'): 192.168.246.112]

So for one node, that's 20 ipmi power queries. That's probably why it's taking so long.

It should only be doing one ipmi power query per rack controller. I'm not sure if it's right for there to be multiple RPC clients per rack controller, but if so, the code in power_query_all shouldn't be using each of them.

Jason Hobbs (jason-hobbs) wrote :

There are at least a couple of issues here - one is the multiple ipmi requests per controller thing, the other is how the error is being suppressed/turned into a Nonce error.

I made a PR with code that I've tested and know fixes the multiple IPMI requests per controller issue:
https://github.com/maas/maas/pull/1

Blake Rouse (blake-rouse) wrote :

I am going to target this bug ore towards the Critical issue of communicating to the same rack controller, multiple times.

This is just a broader issue in getAllClients and affects code through out MAAS.

summary: - nonce has already been used error when adding a machine
+ [2.2] Duplicate communication is occurring to the same rack controller
Changed in maas:
status: New → In Progress
importance: Undecided → Critical
assignee: nobody → Blake Rouse (blake-rouse)
milestone: none → 2.3.0
Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
milestone: 2.3.0 → 2.3.0alpha1
Changed in maas:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers