MAAS sometimes gets into a state where machine tags aren't updated from their definitions

Bug #1845351 reported by Björn Tillenius on 2019-09-25
Affects  Importance  Assigned to
MAAS     High        Björn Tillenius
2.3      Medium      Björn Tillenius
2.4      High        Björn Tillenius
2.6      High        Björn Tillenius

Bug Description

Now and then we get a failure in CI where a test adds a machine tag with definition=true(), which should apply to all machines, but the tag never gets applied.

I stopped the CI when the tags test failed and debugged what was going on. regiond sends an RPC request, EvaluateTag, to rackd. rackd then uses the MAAS API to get the node details, but gets a 503 from the node GET request.

https://pastebin.ubuntu.com/p/BmyJ6Dh3nR/

There is nothing logged in regiond.log about the 503, but trying to add more tags results in the same error. The API itself is working fine, though, since I can use it to add the tags.

Another odd thing is that restarting maas-rackd makes the problem go away. After restarting maas-rackd I couldn't reproduce it, so I couldn't add any extra debug logging in the code.

Changed in maas:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Björn Tillenius (bjornt)
milestone: none → 2.7.0alpha1
Björn Tillenius (bjornt) wrote :

The 2.4 tests also fail with this failure now and then.

Björn Tillenius (bjornt) wrote :

BTW, I marked this as High, since it breaks the CI runs. From the looks of it, it doesn't seem to affect many users, since it's an old bug and no one has reported it before.

Björn Tillenius (bjornt) wrote :

There are two reasons why this bug is happening. The first is that something is changing the proxy environment variables. When rackd starts up, it has http_proxy and no_proxy defined, and updating the tags works in our CI environment, since the region IP is in no_proxy. But then at some point, at least the no_proxy variable is gone. From the looks of it http_proxy was also gone, but somehow the API call still went through the proxy, which doesn't allow access to the region.
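To illustrate why losing no_proxy matters, here is a minimal sketch using Python's standard urllib helpers. It is not the rackd code; the region IP and proxy URL are made-up values, and `routed_via_proxy` is a hypothetical helper. It only shows that, with http_proxy set, a host is bypassed as long as it is listed in no_proxy, and goes through the proxy once that entry disappears.

```python
import urllib.request

# Hypothetical values for illustration only.
REGION_IP = "10.0.0.1"
PROXY_URL = "http://proxy.example:3128"


def routed_via_proxy(host, proxies):
    """Return True if an http request to `host` would use the proxy.

    `proxies` has the shape returned by urllib.request.getproxies_environment(),
    which maps e.g. http_proxy -> "http" and no_proxy -> "no".
    """
    if "http" not in proxies:
        # No proxy configured at all: direct connection.
        return False
    # proxy_bypass_environment() checks the host against the "no" entry.
    return not urllib.request.proxy_bypass_environment(host, proxies)


# State at rackd startup in CI: region IP is listed in no_proxy.
at_startup = {"http": PROXY_URL, "no": REGION_IP}
# Later state: no_proxy has gone missing, http_proxy is still in effect.
later = {"http": PROXY_URL}

print(routed_via_proxy(REGION_IP, at_startup))
print(routed_via_proxy(REGION_IP, later))
```

In the second case the request would be sent to the proxy, which matches the symptom of the 503 if the proxy refuses to reach the region.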

The second is that rackd uses the proxy settings for its API calls to the region in the first place. Whatever is changing the variables, the best fix is not to use any proxies for the API communication. It's documented that rackd needs TCP access to the region on port 5240. If we for some reason want to support a proxy for rack-region communication, we need to add explicit support for it.
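As a rough sketch of what "not using any proxies" can look like with the standard library (this is an assumption about one possible approach, not the actual MAAS change), an opener can be built with an empty ProxyHandler so that http_proxy and no_proxy are never consulted. `make_direct_opener` is a hypothetical name.

```python
import urllib.request


def make_direct_opener():
    """Build a urllib opener that never uses a proxy.

    Passing an empty mapping to ProxyHandler replaces the default handler,
    which would otherwise read http_proxy/no_proxy from the environment.
    """
    return urllib.request.build_opener(urllib.request.ProxyHandler({}))


opener = make_direct_opener()
# opener.open(url) would now connect straight to the host in `url`,
# regardless of what the proxy environment variables contain.
```

With an opener like this, changes to the proxy variables during the lifetime of the process no longer affect where the API requests are sent.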

Changed in maas:
status: Triaged → In Progress
Changed in maas:
status: In Progress → Fix Committed
Björn Tillenius (bjornt) wrote :

I'm targeting this to 2.3 as well, since it affects systemtests, which fail for the 2.3 branch due to this bug.
