Cloud provider authentication failures resulting in memory/goroutine leak

Bug #1833155 reported by Joel Sing on 2019-06-18
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
juju
High
Simon Richardson

Bug Description

The logs from a Juju 2.5.4 HA controller used for JAAS repeatedly logs:

2019-06-18 01:43:52 ERROR juju.worker.dependency engine.go:636 "compute-provisioner" manifold worker returned unexpected error: googleapi: Error 403: Access Not Configured. Compute Engine API has not been used in project 123456789 before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/compute.googleapis.com/overview?project=123456789 then retry. If you enabled this API recently, wait a few minutes for the action to propagate to
our systems and retry., accessNotConfigured
2019-06-18 01:43:52 ERROR juju.worker.dependency engine.go:636 "compute-provisioner" manifold worker returned unexpected error: googleapi: Error 403: Access Not Configured. Compute Engine API has not been used in project 123456789 before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/compute.googleapis.com/overview?project=123456789 then retry. If you enabled this API recently, wait a few minutes for the action to propagate to
our systems and retry., accessNotConfigured
2019-06-18 01:43:52 ERROR juju.worker.dependency engine.go:636 "compute-provisioner" manifold worker returned unexpected error: Get https://www.googleapis.com/compute/v1/projects/clean-algebra-123456789/aggregated/instances?alt=json&filter=name+eq+juju-abcdef-.%2A: private key should be a PEM or plain PKSC1 or PKCS8; parse error: asn1: structure error: tags don't match (16 vs {class:1 tag:1 length:73 isCompound:false}) {optional:false explicit:false application:false private:false defaultValue:<nil> tag:<nil> stringType:0 timeType:0 set:false omitEmpty:false} pkcs1PrivateKey @2

This is presumably due to the GCE account and/or project being removed outside of JAAS. While Juju should detect this case and disable the model, the bigger issue is that memory consumption appears to continually grow linearly on the controller where these messages are logged. When the controller stops logging these messages the memory consumption appears to drop and it will then start to grow on another controller.

Joel Sing (jsing) wrote :

Juju heap profile and goroutines are available at https://private-fileshare.canonical.com/~jsing/lp1833155/ - it is worth noting that there is minimal heap in use, however a significant amount of stack appears to be allocated.

Joel Sing (jsing) on 2019-06-18
tags: added: canonical-is
Joel Sing (jsing) wrote :

On the Juju controller in question:

$ for i in {1..10}; do juju_heap_profile | grep '# Stack = '; sleep 120; done
Querying @jujud-machine-1 introspection socket: /debug/pprof/heap?debug=1
# Stack = 1344143360 / 1344143360
Querying @jujud-machine-1 introspection socket: /debug/pprof/heap?debug=1
# Stack = 1344274432 / 1344274432
Querying @jujud-machine-1 introspection socket: /debug/pprof/heap?debug=1
# Stack = 1347649536 / 1347649536
Querying @jujud-machine-1 introspection socket: /debug/pprof/heap?debug=1
# Stack = 1351712768 / 1351712768
Querying @jujud-machine-1 introspection socket: /debug/pprof/heap?debug=1
# Stack = 1353416704 / 1353416704
Querying @jujud-machine-1 introspection socket: /debug/pprof/heap?debug=1
# Stack = 1357807616 / 1357807616
Querying @jujud-machine-1 introspection socket: /debug/pprof/heap?debug=1
# Stack = 1358757888 / 1358757888
Querying @jujud-machine-1 introspection socket: /debug/pprof/heap?debug=1
# Stack = 1363542016 / 1363542016
Querying @jujud-machine-1 introspection socket: /debug/pprof/heap?debug=1
# Stack = 1368031232 / 1368031232
Querying @jujud-machine-1 introspection socket: /debug/pprof/heap?debug=1
# Stack = 1370947584 / 1370947584

Tim Penhey (thumper) on 2019-06-18
Changed in juju:
status: New → Triaged
importance: Undecided → High
assignee: nobody → John A Meinel (jameinel)
Joel Sing (jsing) wrote :

Same from another Juju controller in another region that is logging similar messages:

$ for i in {1..10}; do juju_heap_profile | grep '# Stack = '; sleep 120; done
Querying @jujud-machine-2 introspection socket: /debug/pprof/heap?debug=1
# Stack = 1437859840 / 1437859840
Querying @jujud-machine-2 introspection socket: /debug/pprof/heap?debug=1
# Stack = 1440972800 / 1440972800
Querying @jujud-machine-2 introspection socket: /debug/pprof/heap?debug=1
# Stack = 1443987456 / 1443987456
Querying @jujud-machine-2 introspection socket: /debug/pprof/heap?debug=1
# Stack = 1447100416 / 1447100416
Querying @jujud-machine-2 introspection socket: /debug/pprof/heap?debug=1
# Stack = 1449558016 / 1449558016
Querying @jujud-machine-2 introspection socket: /debug/pprof/heap?debug=1
# Stack = 1452703744 / 1452703744
Querying @jujud-machine-2 introspection socket: /debug/pprof/heap?debug=1
# Stack = 1456078848 / 1456078848
Querying @jujud-machine-2 introspection socket: /debug/pprof/heap?debug=1
# Stack = 1458470912 / 1458470912
Querying @jujud-machine-2 introspection socket: /debug/pprof/heap?debug=1
# Stack = 1461190656 / 1461190656
Querying @jujud-machine-2 introspection socket: /debug/pprof/heap?debug=1
# Stack = 1465057280 / 1465057280

Tim Penhey (thumper) wrote :

https://pastebin.ubuntu.com/p/KJjPpD4HWD/

The apiserver is gaining in goroutines causing the stack growth. It appears that the someone is aksing for a watcher, but not closing it when it hits the error.

We should target 2.5.8 for this fix.

Joel Sing (jsing) wrote :

Also seeing the same/similar on Azure:

2019-06-18 04:33:30 ERROR juju.worker.dependency engine.go:636 "compute-provisioner" manifold worker returned unexpected error: failed to process updated machines: failed to get all instances from broker: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/<>/providers/Microsoft.Resources/deployments/?api-version=2018-05-01: StatusCode=400 -- Original Error: adal: Refresh request failed. Status Code = '400'. Response body: {"error":"unauthorized_client","error_description":": Application with identifier 'http://Juju' was not found in the directory '<>'. This can happen if the application has not been installed by the administrator of the tenant or consented to by any user in the tenant. You may have sent your authentication request to the wrong tenant.\r\nTrace ID: <>\r\nTimestamp: 2019-06-18 04:33:30Z","error_codes":[700016],"timestamp":"2019-06-18 04:33:30Z","trace_id":"<>","correlation_id":"<>","error_uri":"https://login.windows.net/error?code=700016"}
2019-06-18 04:33:31 ERROR juju.worker.dependency engine.go:636 "firewaller" manifold worker returned unexpected error: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/<>/providers/Microsoft.Resources/deployments/?api-version=2018-05-01: StatusCode=400 -- Original Error: adal: Refresh request failed. Status Code = '400'. Response body: {"error":"unauthorized_client","error_description":"<>: Application with identifier '<>' was not found in the directory '<>'. This can happen if the application has not been installed by the administrator of the tenant or consented to by any user in the tenant. You may have sent your authentication request to the wrong tenant.\r\nTrace ID: <>\r\nTimestamp: 2019-06-18 04:33:31Z","error_codes":[700016],"timestamp":"2019-06-18 04:33:31Z","trace_id":"<>","correlation_id":"<>","error_uri":"https://login.windows.net/error?code=700016"}

Querying @jujud-machine-0 introspection socket: /debug/pprof/heap?debug=1
# Stack = 146276352 / 146276352
Querying @jujud-machine-0 introspection socket: /debug/pprof/heap?debug=1
# Stack = 147324928 / 147324928
Querying @jujud-machine-0 introspection socket: /debug/pprof/heap?debug=1
# Stack = 148340736 / 148340736

Joel Sing (jsing) wrote :

As suggested by thumper, this appears to be caused by a goroutine leak:

$ for i in 1 2 3 ; do juju_goroutines | head -n 1; sleep 60; done
Querying @jujud-machine-0 introspection socket: /debug/pprof/goroutine?debug=1
goroutine profile: total 34295
Querying @jujud-machine-0 introspection socket: /debug/pprof/goroutine?debug=1
goroutine profile: total 34499
Querying @jujud-machine-0 introspection socket: /debug/pprof/goroutine?debug=1
goroutine profile: total 34725

summary: - GCE access failures resulting in memory leak
+ Cloud provider authentication failures resulting in memory/goroutine
+ leak
John A Meinel (jameinel) wrote :

One of the items in the goroutine dump:
64875 @ 0x43159b 0x441606 0x17a7335 0x1805908 0xa6051b 0x45fa61
# 0x17a7334 github.com/juju/juju/state.(*modelFieldChangeWatcher).loop+0x294 /workspace/_build/src/github.com/juju/juju/state/watcher.go:986

modelFieldChangeWatcher was removed in the 2.6 series. It was being used by the lxd profile code, but they changed how that was being done.

John A Meinel (jameinel) wrote :

In talking with Simon, this is all code that was removed as part of the rework in 2.6 for how charm profiles are managed. However, in 2.5.7, the code around:

func (p *provisioner) getStartTask(harvestMode config.HarvestMode) (ProvisionerTask, error) {

has a series of:
 machineWatcher, err := p.getMachineWatcher()
...
 retryWatcher, err := p.getRetryWatcher()
...
 profileWatcher, err := p.getProfileWatcher()
...
 modelCfg, err := p.st.ModelConfig()
...
 controllerCfg, err := p.st.ControllerConfig()

All of which ended up handed to:
 task, err := NewProvisionerTask(
  controllerCfg.ControllerUUID(),
  machineTag,
  harvestMode,
  p.st,
  p.distributionGroupFinder,
  p.toolsFinder,
  machineWatcher,
  retryWatcher,
  profileWatcher,
  p.broker,
  auth,
  modelCfg.ImageStream(),
  RetryStrategy{retryDelay: retryStrategyDelay, retryCount: retryStrategyCount},
  p.callContext,
 )

However, NewProvisionerTask sets up:
 workers := []worker.Worker{machineWatcher}
 var retryChanges watcher.NotifyChannel
 if retryWatcher != nil {
  retryChanges = retryWatcher.Changes()
  workers = append(workers, retryWatcher)
 }
 profileChanges := profileWatcher.Changes()

And then passes:
 err := catacomb.Invoke(catacomb.Plan{
  Site: &task.catacomb,
  Work: task.loop,
  Init: workers,
 })

So the catacomb will call stopWorkers(plan.Init).
However, we missed the step to add profileWatcher to the slice of workers (so it doesn't get the lifetime properly associated with the task.)

So this bug should already be fixed in 2.6 because we got rid of that watcher entirely. But if we want to do a 2.5.X release we could do something like:
diff --git a/worker/provisioner/provisioner_task.go b/worker/provisioner/provisioner_task.go
index b93e4fbc26..fd35e50a51 100644
--- a/worker/provisioner/provisioner_task.go
+++ b/worker/provisioner/provisioner_task.go
@@ -95,6 +95,7 @@ func NewProvisionerTask(
                workers = append(workers, retryWatcher)
        }
        profileChanges := profileWatcher.Changes()
+ workers = append(workers, profileWatcher)
        task := &provisionerTask{
                controllerUUID: controllerUUID,
                machineTag: machineTag,

Note that this code is executed in the environProvisioner, which means it affects any cloud, it isn't specific to GCE.

Changed in juju:
status: Triaged → In Progress
assignee: John A Meinel (jameinel) → Simon Richardson (simonrichardson)
status: In Progress → Fix Released
milestone: none → 2.6.4

Patch to the 2.5 branch https://github.com/juju/juju/pull/10341

The patch ensures we tie the watcher to the catacomb of the provisioner task. Additionally I've added the watcher to the start/stop provisioner task test, so that we can verify the fix.

Tim Penhey (thumper) on 2019-06-20
Changed in juju:
milestone: 2.6.4 → 2.6.5
status: Fix Released → Fix Committed
Anastasia (anastasia-macmood) wrote :

Since this fix is only required for 2.5, I'll remove 2.6 from the milestone. (Noticed while merging 2.5 into 2.6)

Changed in juju:
milestone: 2.6.5 → 2.5.8
no longer affects: juju/2.5
Changed in juju:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers