Juju controller loses connectivity with vSphere cloud, repeatedly needs credentials uploaded

Bug #1831244 reported by Johan Hallbäck on 2019-05-31

This bug affects 2 people
Affects  Status      Importance  Assigned to  Milestone
juju     Incomplete  High        Unassigned

Bug Description

PROBLEM: Juju suddenly and silently stops communicating with the vCenter server. No errors appear in the logs, and all deploy/destroy operations stay pending. The only visible failure is “datacenter not found” when trying to upgrade the controller (juju upgrade-juju in the controller model).

WORKAROUND: Updating the credentials on the controller, even to the same values, resolves the problem. It affected all models on one controller.

Scenario:
=========

Juju version 2.6.1 on the controller and other models in this scenario.
ESXi 6.5, build 11925212
vCenter 6.7.0 build 10244857

This has been described here:

https://discourse.jujucharms.com/t/juju-on-vsphere-datacenter-and-credentials-lost-workaround/1350

When the problem occurs, machines cannot be removed (machines 0-3 below), and nothing happens after issuing "juju add-unit" (machines 4-5 remain pending).

$ juju status
Model    Controller   Cloud/Region                  Version  SLA          Timestamp
slurm-a  iuba-vmware  vmware01-prod/Sodertalje-HPC  2.5.1    unsupported  13:57:02+02:00

Machine  State    DNS             Inst id        Series  AZ  Message
0        stopped  10.104.129.171  juju-3ae1b0-0  xenial      poweredOn
1        stopped  10.104.129.44   juju-3ae1b0-1  disco       poweredOn
2        stopped  10.104.129.45   juju-3ae1b0-2  disco       poweredOn
3        stopped  10.104.129.46   juju-3ae1b0-3  disco       poweredOn
4        pending                  pending        disco
5        pending                  pending        disco

Inside vSphere, machines 0-3 are still up and running, and machines 4-5 have not been created at all. Nothing in /var/log/juju on the controller indicates a credential-related problem.
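
A minimal sketch for spotting stuck machines from a script, assuming the plain-text "juju status" layout shown above (a "Machine" header row followed by one row per machine, with the state in the second column). The function name pending_machines is hypothetical, and the tabular layout is not a stable interface, so treat this as an illustration only:

```shell
#!/bin/sh
# Hypothetical helper: read `juju status` text on stdin and print the IDs
# of machines whose State column is "pending".
pending_machines() {
  awk '
    # The machine table starts at the header row "Machine State ...".
    $1 == "Machine" && $2 == "State" { in_machines = 1; next }
    # A blank line ends the machine table.
    in_machines && NF == 0 { in_machines = 0 }
    # Inside the table, column 2 holds the machine state.
    in_machines && $2 == "pending" { print $1 }
  '
}
```

It could be used as "juju status | pending_machines". Machines that stay in this list for a long time, while nothing shows up in vSphere, would match the symptom described here.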

When the controller is in this state, the only operation that produces an error is an attempt to upgrade Juju in the controller model:

$ juju upgrade-juju
best version:
    2.5.4
ERROR cannot make API call to provider: datacenter 'Sodertalje-HPC' not found
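
Since this error line is the only visible signal, a script could look for it in captured command output. The sketch below assumes the error text keeps exactly the wording shown above; the function name missing_datacenter is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: read captured juju output on stdin and print the
# datacenter name from any "datacenter '<name>' not found" line.
# Prints nothing when no such line is present.
missing_datacenter() {
  sed -n "s/.*datacenter '\([^']*\)' not found.*/\1/p"
}
```

Feeding it the output above would print Sodertalje-HPC, and empty output means the error was not seen.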

There are two ways to mitigate this situation. Assume two valid vSphere credentials, called "cred1" and "cred2", exist both locally on the Juju client and on the controller.

Fix 1 - Update the credential in use on the controller
======================================================

$ juju update-credential vmware01-prod cred1

Fix 2 - Change to another valid credential on a model
=====================================================

$ juju show-credentials
controller-credentials:
  vmware01-prod:
    cred1:
      content:
        auth-type: userpass
        user: <email address hidden>
      models:
        slurm-a: admin
        slurm-b: admin
        slurm-c: admin
    cred2:
      content:
        auth-type: userpass
        user: <email address hidden>
      models: {}

$ juju set-credential -m slurm-a vmware01-prod cred2
Found credential remotely, on the controller. Not looking locally...
Changed cloud credential on model "slurm-a" to "cred2".

Result of fixes 1 & 2:
======================

All pending operations issued within the model will immediately be carried out in vSphere.

To reproduce:
=============

Stop interacting with vSphere from Juju for some time, usually hours. When juju status no longer reflects the VMs actually present in vSphere, or deploy/destroy operations hang, apply fix 1 or 2.
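
Until the root cause is known, fix 1 could be applied automatically whenever the error appears. The following is only a sketch under stated assumptions: refresh_if_stale and the JUJU override variable are hypothetical names, the caller supplies the captured output of a failed command, and the only juju invocation used is the "juju update-credential <cloud> <credential>" form from fix 1:

```shell
#!/bin/sh
# JUJU is an assumed override for the juju binary; setting JUJU=echo
# gives a dry run that only prints the command that would be executed.
JUJU="${JUJU:-juju}"

# Hypothetical helper: re-upload the credential (fix 1) only when the
# captured output contains the "datacenter ... not found" error.
# Arguments: <cloud> <credential> <captured output>
refresh_if_stale() {
  cloud="$1"
  cred="$2"
  probe_output="$3"
  case "$probe_output" in
    *"cannot make API call to provider: datacenter "*" not found"*)
      "$JUJU" update-credential "$cloud" "$cred"
      ;;
    *)
      echo "no datacenter error seen; leaving credential untouched"
      ;;
  esac
}
```

For example, with JUJU=echo, calling refresh_if_stale vmware01-prod cred1 with the error text from this report would print the update-credential command instead of running it. This masks the symptom rather than explaining why the controller loses its session.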

Additional comments:
====================

Between Juju 2.4.7 and 2.5.1, some vSphere improvements related to logins and credentials were made. In 2.4.7 and earlier, invalid vSphere credentials flooded the vCenter server with failed logins. We have not seen that behavior in newer versions. Is this related?

John A Meinel (jameinel) wrote :

Is vSphere configured to have credentials expire after a period of time, so that "juju update-credential" causes us to refresh that token?
If this is happening within 'hours' that sounds much shorter than I would expect.

Changed in juju:
importance: Undecided → High
status: New → Incomplete