"Invalid entity name or password" error with valid credentials.

Bug #1514874 reported by Jorge Niedbalski
This bug affects 2 people
Affects          Status        Importance  Assigned to  Milestone
Canonical Juju   Fix Released  Critical    Anastasia
juju-core        Fix Released  Critical    Anastasia
juju-core 1.25   Fix Released  Critical    Anastasia

Bug Description

(agent uninstall addressed in lp:1580233 and lp:1580221)

[Environment]

Juju-core 1.24.5 - 1.25.5 (Still reproducible with 1.24.7)
Trusty 14.04

[Description]

We noticed this behavior on multiple machines.

First, for some unknown reason, the agent fails to authenticate against the state servers:

184176-2015-10-22 09:21:50 INFO juju.api apiclient.go:263 connection established to "wss://172.30.8.124:17070/environment/cb173249-d3a0-4667-8833-93bf8b804731/api"
184177-2015-10-22 09:21:50 ERROR juju.cmd.jujud agent.go:298 agent terminating due to error returned during API open: invalid entity name or password

Then, because of this code, the workers are stopped and the agent is uninstalled:

https://github.com/juju/juju/blob/4e3972b33f1fc9db13d0779a5a6b0ff7fab9cdb0/worker/apicaller/open.go#L66
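
For context, here is a rough sketch of the runner behaviour visible in the log below: transient errors ("try again") restart the "api" worker after a delay, while a fatal error stops everything. This is an illustrative sketch only, not the actual juju/worker implementation:

package main

import (
	"errors"
	"log"
	"time"
)

var (
	errTryAgain       = errors.New("try again")                  // transient
	errTerminateAgent = errors.New("agent should be terminated") // fatal
)

// runWithRestarts mimics the runner behaviour in the log: transient errors
// restart the worker after a delay, a fatal error stops the loop for good.
func runWithRestarts(start func() error) error {
	for {
		err := start()
		if errors.Is(err, errTerminateAgent) {
			log.Printf("fatal %q: %v", "api", err)
			return err
		}
		log.Printf("restarting %q in 3s (err: %v)", "api", err)
		time.Sleep(3 * time.Second)
	}
}

func main() {
	attempts := 0
	_ = runWithRestarts(func() error {
		attempts++
		if attempts < 3 {
			return errTryAgain // like the "try again" lines in the log
		}
		return errTerminateAgent // like "agent should be terminated"
	})
}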

184168-2015-10-22 09:06:02 INFO juju.worker runner.go:275 stopped "api", err: try again
184169-2015-10-22 09:06:02 DEBUG juju.worker runner.go:203 "api" done: try again
184170-2015-10-22 09:06:02 ERROR juju.worker runner.go:223 exited "api": try again
184171-2015-10-22 09:06:02 INFO juju.worker runner.go:261 restarting "api" in 3s
184172-2015-10-22 09:06:05 INFO juju.worker runner.go:269 start "api"
184173-2015-10-22 09:06:05 INFO juju.api apiclient.go:331 dialing "wss://172.30.8.124:17070/environment/cb173249-d3a0-4667-8833-93bf8b804731/api"
184174-2015-10-22 09:06:05 INFO juju.api apiclient.go:263 connection established to "wss://172.30.8.124:17070/environment/cb173249-d3a0-4667-8833-93bf8b804731/api"
184175-2015-10-22 09:21:49 INFO juju.api apiclient.go:331 dialing "wss://172.30.8.124:17070/environment/cb173249-d3a0-4667-8833-93bf8b804731/api"
184176-2015-10-22 09:21:50 INFO juju.api apiclient.go:263 connection established to "wss://172.30.8.124:17070/environment/cb173249-d3a0-4667-8833-93bf8b804731/api"
184177-2015-10-22 09:21:50 ERROR juju.cmd.jujud agent.go:298 agent terminating due to error returned during API open: invalid entity name or password
184178:2015-10-22 09:21:50 INFO juju.worker runner.go:275 stopped "api", err: agent should be terminated
184179:2015-10-22 09:21:50 DEBUG juju.worker runner.go:203 "api" done: agent should be terminated
184180:2015-10-22 09:21:50 ERROR juju.worker runner.go:212 fatal "api": agent should be terminated
184181-2015-10-22 09:21:50 DEBUG juju.worker runner.go:248 killing "statestarter"
184182-2015-10-22 09:21:50 DEBUG juju.worker runner.go:248 killing "termination"
184183-2015-10-22 09:21:50 INFO juju.worker runner.go:275 stopped "statestarter", err: <nil>
184184-2015-10-22 09:21:50 INFO juju.worker runner.go:275 stopped "termination", err: <nil>
184185-2015-10-22 09:21:50 DEBUG juju.worker runner.go:203 "statestarter" done: <nil>
184186-2015-10-22 09:21:50 DEBUG juju.worker runner.go:227 no restart, removing "statestarter" from known workers
184187-2015-10-22 09:21:50 DEBUG juju.worker runner.go:203 "termination" done: <nil>
184188-2015-10-22 09:21:50 DEBUG juju.worker runner.go:227 no restart, removing "termination" from known workers
184189-2015-10-22 09:21:50 DEBUG juju.service discovery.go:115 discovered init system "upstart" from local host
184190-2015-10-22 09:21:51 DEBUG juju.service discovery.go:115 discovered init system "upstart" from local host
184191-2015-10-22 09:21:51 INFO juju.cmd supercommand.go:436 command finished

After this point, you can observe messages such as:

/proc/self/fd/9: 9: exec: /var/lib/juju/tools/unit-nova-cloud-controller-0/jujud: not found
/proc/self/fd/9: 9: exec: /var/lib/juju/tools/unit-nova-cloud-controller-0/jujud: not found

[Other ways to reproduce]

1) Modify the apipassword in /var/lib/juju/agents/machine-8/agent.conf
2) Restart jujud-machine-8

The following error is printed:

2015-11-10 14:14:28 ERROR juju.worker.apicaller open.go:169 agent terminating due to error returned during API open: invalid entity name or password
2015-11-10 14:14:28 INFO juju.worker runner.go:275 stopped "api", err: agent should be terminated
2015-11-10 14:14:28 DEBUG juju.worker runner.go:203 "api" done: agent should be terminated
2015-11-10 14:14:28 ERROR juju.worker runner.go:212 fatal "api": agent should be terminated

3) /var/lib/juju disappears
ls -lh /var/lib/juju
ls: cannot access /var/lib/juju: No such file or directory

Note:

Please note that we still don't know exactly what caused the authentication to fail, but even if someone is manually modifying the agent.conf, I don't think wiping the juju agent from the machine is a correct reaction.

description: updated
Revision history for this message
Cheryl Jennings (cherylj) wrote :

I agree that it is a bit of an overreaction to wipe the juju agent from the machine.

I am concerned about the initial authentication error, though. Could you upload logs from the state servers and the machine which failed to authenticate?

Changed in juju-core:
status: New → Triaged
importance: Undecided → High
tags: added: sts-needs-review
tags: removed: sts sts-needs-review
Revision history for this message
Jorge Niedbalski (niedbalski) wrote :

After reading through the logs, I can see that 2 units (machine-10-lxc-4 and machine-10-lxc-6)
are possibly manifesting bug LP: 1514874.

The sequence of events is described below:

The unit fails to authenticate on the API because an 'upgrade' is in progress or has not completed. This
happened around 2016-04-10 10:18:55:

2016-04-10 10:18:55 ERROR juju.worker runner.go:223 exited "api": login for "machine-10-lxc-4" blocked because upgrade in progress
2016-04-10 10:18:58 ERROR juju.worker runner.go:223 exited "api": login for "machine-10-lxc-4" blocked because upgrade in progress
2016-04-10 10:19:01 ERROR juju.worker runner.go:223 exited "api": try again

Then there is a gap in the logs for both affected units (was the machine rebooted/halted for a couple of days?),
followed by a subsequent failure to connect to the state server around 05:39:02 local time on 2016-04-12.

154936-2016-04-12 05:39:02 ERROR juju.api.watcher watcher.go:84 error trying to stop watcher: connection is shut down
154937-2016-04-12 05:39:02 ERROR juju.api.watcher watcher.go:84 error trying to stop watcher: connection is shut down
154938-2016-04-12 05:39:02 ERROR juju.api.watcher watcher.go:84 error trying to stop watcher: connection is shut down
154939-2016-04-12 05:39:02 ERROR juju.api.watcher watcher.go:84 error trying to stop watcher: connection is shut down

Then the agent tries to reconnect to the state server several times.

154953-2016-04-12 05:39:02 ERROR juju.worker runner.go:223 exited "api": watcher has been stopped
[...]
154959-2016-04-12 05:39:47 ERROR juju.worker runner.go:223 exited "api": try again
154960-2016-04-12 05:39:55 ERROR juju.worker runner.go:223 exited "api": try again

Around 10 minutes later, the state server API is back online, but the agent cannot authenticate against the
state server, which replies with "unauthorized access" or "not provisioned":

1) https://github.com/juju/juju/blob/e5a77909f70c8d0b5bbc0bac9b2bf18546744614/apiserver/params/apierror.go#L165
2) https://github.com/juju/juju/blob/e5a77909f70c8d0b5bbc0bac9b2bf18546744614/apiserver/params/apierror.go#L113

At this point the unit hits this check:

if params.IsCodeNotProvisioned(err) || params.IsCodeUnauthorized(err) {
    logger.Errorf("agent terminating due to error returned during API open: %v", err)
    return nil, false, worker.ErrTerminateAgent
}

Reference: https://github.com/juju/juju/blob/1.25/worker/apicaller/open.go#L170

This error is evident in both units, as you can see in the log:

machine-10-lxc-4.log
154961:2016-04-12 05:48:50 ERROR juju.worker.apicaller open.go:169 agent terminating due to error returned during API open: invalid entity name or password

This validation error causes worker.ErrTerminateAgent to be returned, which triggers this signal handling https://github.com/juju/juju/blob/1.25/worker/terminationworker/worker.go#L32
that instructs the machine agent to wipe itself from the machine.
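
To make the chain easier to follow, here is a rough sketch of how such a fatal worker error reaches the agent's uninstall path. The type and function names (machineAgent, waitForWorkers, uninstall) are simplified stand-ins for illustration, not the actual machine-agent code:

package main

import (
	"errors"
	"fmt"
)

// ErrTerminateAgent stands in for worker.ErrTerminateAgent.
var ErrTerminateAgent = errors.New("agent should be terminated")

// machineAgent is a hypothetical stand-in for the real machine agent.
type machineAgent struct {
	waitForWorkers func() error // blocks until the runner stops with an error
	uninstall      func()       // removes init services and the agent data dir
}

// run sketches the path described above: the fatal sentinel returned by the
// apicaller is what ultimately triggers the agent wiping itself.
func (a machineAgent) run() error {
	err := a.waitForWorkers()
	if errors.Is(err, ErrTerminateAgent) {
		a.uninstall()
		return nil
	}
	return err
}

func main() {
	a := machineAgent{
		waitForWorkers: func() error { return ErrTerminateAgent },
		uninstall: func() {
			fmt.Println("removing services and /var/lib/juju (sketch only)")
		},
	}
	_ = a.run()
}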

Revision history for this message
Eric Snow (ericsnowcurrently) wrote :

related: lp:1464304

Revision history for this message
Jorge Niedbalski (niedbalski) wrote :

@eric

I don't think they are related at all. I am observing this with a fresh 1.25.5; agents are being uninstalled with the exact same problem:

In this case, I just restarted machine-19 (among others):

ubuntu@niedbalski-sec-bastion:~$ juju status --format=tabular |grep -i lost
cinder/1 unknown lost 1.25.5.1 6 10.5.1.203 agent is lost, sorry! See 'juju status-history cinder/1'
  cinder-ceph/1 unknown lost 1.25.5.1 10.5.1.203 agent is lost, sorry! See 'juju status-history cinder-ceph/1'
mysql/0 unknown lost 1.25.5.1 12 10.5.1.209 agent is lost, sorry! See 'juju status-history mysql/0'
openstack-dashboard/0 unknown lost 1.25.5.1 19 80/tcp,443/tcp 10.5.1.216 agent is lost, sorry! See 'juju status-history openstack-dashboard/0'

2016-05-03 15:32:31 ERROR juju.worker.apicaller open.go:169 agent terminating due to error returned during API open: invalid entity name or password

At this point juju was wiped from the system, which I still think is a complete over-reaction.

root@juju-niedbalski-sec-machine-19:/var/log/juju# more machine-19.log
2016-05-03 15:32:30 INFO juju.cmd supercommand.go:37 running jujud [1.25.5.1-trusty-amd64 gc]
2016-05-03 15:32:30 DEBUG juju.agent agent.go:482 read agent config, format "1.18"
2016-05-03 15:32:30 INFO juju.cmd.jujud machine.go:418 machine agent machine-19 start (1.25.5.1-trusty-amd64 [gc])
2016-05-03 15:32:30 DEBUG juju.wrench wrench.go:112 couldn't read wrench directory: stat /var/lib/juju/wrench: no such file or directory
2016-05-03 15:32:30 INFO juju.cmd.jujud upgrade.go:88 no upgrade steps required or upgrade steps for 1.25.5.1 have already been run.
2016-05-03 15:32:30 INFO juju.network network.go:242 setting prefer-ipv6 to false
2016-05-03 15:32:30 INFO juju.worker runner.go:269 start "api"
2016-05-03 15:32:30 INFO juju.worker runner.go:269 start "statestarter"
2016-05-03 15:32:30 INFO juju.worker runner.go:269 start "termination"
2016-05-03 15:32:30 INFO juju.api apiclient.go:262 dialing "wss://10.5.1.197:17070/environment/34927576-dbfb-459d-8f82-b2c1c0dcd12c/api"
2016-05-03 15:32:30 DEBUG juju.worker runner.go:196 "statestarter" started
2016-05-03 15:32:30 DEBUG juju.worker runner.go:196 "termination" started
2016-05-03 15:32:30 DEBUG juju.worker runner.go:191 stop "state"
2016-05-03 15:32:31 INFO juju.api apiclient.go:194 connection established to "wss://10.5.1.197:17070/environment/34927576-dbfb-459d-8f82-b2c1c0dcd12c/api"
2016-05-03 15:32:31 INFO juju.api apiclient.go:262 dialing "wss://10.5.1.197:17070/environment/34927576-dbfb-459d-8f82-b2c1c0dcd12c/api"
2016-05-03 15:32:31 INFO juju.api apiclient.go:194 connection established to "wss://10.5.1.197:17070/environment/34927576-dbfb-459d-8f82-b2c1c0dcd12c/api"
2016-05-03 15:32:31 ERROR juju.worker.apicaller open.go:169 agent terminating due to error returned during API open: invalid entity name or password
2016-05-03 15:32:31 INFO juju.work...


description: updated
Revision history for this message
Eric Snow (ericsnowcurrently) wrote :

Thanks for the update, Jorge. What you're saying makes sense. Here's a breakdown of what's going on:

1. when the machine agent starts, it creates the connection to the API server that it will use for its entire life
  https://github.com/juju/juju/blob/1.25/cmd/jujud/agent/machine.go#L444

2. the connection to the API server can't be made due to credentials, so worker.ErrTerminateAgent is returned
  https://github.com/juju/juju/blob/1.25/worker/apicaller/open.go#L170

3. the machine agent's runner treats the errors as fatal and stops

4. the machine agent waits for its runner to stop

5. the machine agent handles the worker.ErrTerminateAgent by "uninstalling" the agent
  https://github.com/juju/juju/blob/1.25/cmd/jujud/agent/machine.go#L455

6. the init services for Juju and mongod are removed and the agent data dir is deleted
  https://github.com/juju/juju/blob/1.25/cmd/jujud/agent/machine.go#L1737

Steps 1-4 are correct behavior since the agent cannot run without an API connection. The question is whether or not 5 and 6 are desirable. Let's condense that to what happens in 6 and break that down further:

A. the jujud init service is removed
B. the mongod service installed for jujud is removed
C. the agent data dir is deleted

A and B kind of make sense. If the agent was terminated then it should stay off. Uninstalling the services achieves that. Regardless, we don't want jujud to try starting back up on its own, since we expect it will fail in the same way. If that were not the case then we would certainly want to find a solution for juju to fix itself. As things are now, at this point the machine is in a "lost" state and will stay that way until it gets manual intervention.

For C it's kind of the same story. We want to avoid wasting instance resources. The files in the agent data dir are for the sake of the agent, so we clean them up since the agent is dead. If they are otherwise useful then that should be addressed.
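
Purely as an illustration of steps A-C, the uninstall amounts to something like the following sketch; the service names and the removeService helper are assumptions, and the real code lives in cmd/jujud/agent/machine.go as linked in step 6:

package main

import (
	"fmt"
	"os"
)

// uninstallAgent is a hypothetical sketch of steps A-C above: remove the
// jujud init service (A), remove the mongod service installed for jujud (B),
// and delete the agent data directory (C). Service names are examples only.
func uninstallAgent(dataDir string, removeService func(name string) error) error {
	if err := removeService("jujud-machine-0"); err != nil { // A
		return err
	}
	if err := removeService("juju-db"); err != nil { // B
		return err
	}
	return os.RemoveAll(dataDir) // C: e.g. /var/lib/juju
}

func main() {
	err := uninstallAgent("/tmp/example-agent-datadir", func(name string) error {
		fmt.Println("would remove init service:", name)
		return nil
	})
	fmt.Println("uninstall result:", err)
}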

I suppose it boils down to why you want A, B, or C to not happen. Is it so that you have a chance to manually fix the problem and then revive the "dead" agent? Given your bug report here, I expect that is the case.

First of all, I have yet to determine the history behind why we uninstall and clean up when we get worker.ErrTerminateAgent. The original rationale should be considered before we make any changes here.

That said, here are some options:

1. support a per-machine DO_NOT_UNINSTALL or DO_NOT_CLEAN_UP setting and respect that in the machine agent code (step 5/6); a sketch of this option follows the list
2. don't ever clean up
3. don't delete the data dir but do uninstall the services
4. "disable" the agent
  - disable but do not uninstall the services
  - move the data dir out of the way (e.g. into /home/ubuntu/...)
5. like 4, but also add a "juju enable-agent" command that will ssh to the instance, rebuild the agent dir, and re-enable the services
    (or add an "enable-juju" command on the machine to do that)
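
As a rough sketch of what option 1 could look like, the uninstall could be gated on an explicit per-machine opt-in. The marker-file approach and the "uninstall-agent" file name below are assumptions for illustration, not a description of shipped behaviour:

package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// ErrTerminateAgent stands in for worker.ErrTerminateAgent.
var ErrTerminateAgent = errors.New("agent should be terminated")

// shouldUninstall sketches option 1: only wipe the agent when the operator
// has explicitly opted in, here via a hypothetical marker file in the agent
// data dir rather than an environment setting.
func shouldUninstall(dataDir string) bool {
	_, err := os.Stat(filepath.Join(dataDir, "uninstall-agent"))
	return err == nil
}

func handleFatal(dataDir string, err error) {
	if errors.Is(err, ErrTerminateAgent) && shouldUninstall(dataDir) {
		fmt.Println("uninstalling agent (operator opted in)")
		return
	}
	fmt.Println("leaving agent installed for manual recovery:", err)
}

func main() {
	handleFatal("/var/lib/juju", ErrTerminateAgent)
}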

These are just some ideas and would need more thought before we could move forward. #1 would probably be a less invasive fix so it might be easier to justify targeting in the short term, even if we want some other solution long-ter...


Revision history for this message
Eric Snow (ericsnowcurrently) wrote :

Also, from what I can tell this behavior is the same on 2.0. Have you had a chance to try it?

Revision history for this message
Eric Snow (ericsnowcurrently) wrote :

Another possibility is for the agent to try to fix itself or the controller to do likewise. However, those would be even bigger changes requiring stronger justification for something that shouldn't ever be necessary and certainly not often.

Revision history for this message
Jorge Niedbalski (niedbalski) wrote :

Eric,

Thanks for your comments on this.

Well, please consider one important fact: we didn't make any specific change in the 1.25.5 environment, including state or units; the only action triggered by the operator was to reboot some of the units.

You can easily reproduce this behavior by altering the agent.conf file, but as I mentioned before, this was not the case in the 2 environments experiencing this issue; it just occurs on some units after rebooting them. So some condition is causing
the API authentication to fail, and this is causing the agent termination to take place.

What I would really like to understand first is why the "unauthorized access" or "not provisioned" error is
returned at all, without anyone intervening in the environment.

- Could this error be wrapping/hiding another error code or a poorly handled exception?

The discussion around whether worker.ErrTerminateAgent should cause the agent to uninstall obviously requires tracking down
that code's history in order to understand the rationale behind this decision. In my opinion, this should never be a decision taken by a piece
of software unless directly instructed by the machine operator, so having a flag such as DO_NOT_UNINSTALL enabled by default makes a lot of sense to me.

Some of the given suggestions make a lot of sense, especially option 2. You can retry the connection a few times,
and after that disable the agent without uninstalling; at least this gives us some possibility to recover or inspect the unit manually.

At the moment we are afraid of this behavior being repeated on other units, leaving them inoperable by
juju; any mechanism that allows re-taking a specific unit after the uninstall would also be a great addition.

For the moment, we are looking for a way to prevent this from happening in 1.25.5 and to understand its root causes.

Revision history for this message
Eric Snow (ericsnowcurrently) wrote :

To be clear, I agree that a bug (e.g. reboots breaking API authentication) should not result in the agent being uninstalled. This is doubly true if a retry (or several over the next minute) would resolve the issue.

Personally I would favor option 1 for a quick solution and then 4 (or 5) for a long-term one. Option 2 isn't really feasible since it means the agent would try to restart over and over. Option 3 would leave files in the same directory that the agent would use later, potentially causing conflicts.

Revision history for this message
Jorge Niedbalski (niedbalski) wrote :

Eric,

What about having a longer retry/catch period (3 retries / 10 secs delay), plus uninstalling only if the DO_NOT_UNINSTALL option permits it?
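
As a rough illustration of that proposal, assuming 3 retries with a 10-second delay (retryOpen and its wiring are hypothetical names, not juju API):

package main

import (
	"errors"
	"fmt"
	"time"
)

var errUnauthorized = errors.New("invalid entity name or password")

// retryOpen sketches the proposal above: retry the API open a few times with
// a delay, and only report the failure as fatal once the retries are
// exhausted. The caller would then decide whether an uninstall is permitted.
func retryOpen(open func() error, attempts int, delay time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = open(); err == nil {
			return nil
		}
		fmt.Printf("API open failed (attempt %d/%d): %v\n", i+1, attempts, err)
		time.Sleep(delay)
	}
	return err
}

func main() {
	err := retryOpen(func() error { return errUnauthorized }, 3, 10*time.Second)
	fmt.Println("final result:", err)
}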

Changed in juju-core:
assignee: nobody → Eric Snow (ericsnowcurrently)
Revision history for this message
Eric Snow (ericsnowcurrently) wrote :

I've sent an email to the juju-dev list soliciting feedback before we do anything about the uninstall issue.

Having a retry around the API connect failure is a reasonable idea that we can pursue separately.

Revision history for this message
Eric Snow (ericsnowcurrently) wrote :

@Jorge: would you mind opening a separate issue to track the bug of authentication issues due to rebooting while an upgrade is in progress? It should include pretty much everything in comment #2.

That is a separate bug and I don't want it to get lost while we sort out the uninstall issue.

Thanks!

Changed in juju-core:
status: Triaged → In Progress
Revision history for this message
Eric Snow (ericsnowcurrently) wrote :

I've opened several related bugs:

lp:1580233 (1.25) Machine agent is uninstalled even when "uninstall-agent" file does not exist.
lp:1580221 Auto-uninstall of machine agent should be more strictly limited.
lp:1580231 (2.0) Machine agent is uninstalled when it shouldn't be.

Let's keep this one focused on the matter of why auth is failing. Per comment #2, this may be related to a reboot during an upgrade.

Changed in juju-core:
assignee: Eric Snow (ericsnowcurrently) → nobody
status: In Progress → Triaged
summary: - Invalid entity name or password error, causes Juju to uninstall
+ "Invalid entity name or password" error with valid credentials.
description: updated
Changed in juju-core:
milestone: none → 2.0-beta7
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta7 → 2.0-beta8
Revision history for this message
Anastasia (anastasia-macmood) wrote :
tags: added: blocker
Changed in juju-core:
assignee: nobody → Anastasia (anastasia-macmood)
status: Triaged → In Progress
tags: removed: blocker
Revision history for this message
Anastasia (anastasia-macmood) wrote :
Changed in juju-core:
importance: High → Critical
tags: added: blocker
Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
tags: removed: blocker
Changed in juju-core:
status: Fix Committed → In Progress
importance: Critical → High
Revision history for this message
Anastasia (anastasia-macmood) wrote :

I am working on this.

My repro steps, at this stage on 1.25:

1. bootstrap
(check that the logs have no auth issues: "invalid entity..." entries are not present)

2. deploy ubuntu
(monitor the logs; the "invalid entity..." message starts coming up before the unit starts)

Something is happening during the provisioning and starting stage. I am investigating further.

Once I figure out what's happening, I will forward-port to master.

Revision history for this message
Anastasia (anastasia-macmood) wrote :

The reason we were seeing "invalid entity..." was that, the first time around, we would try to log in with an empty password, knowing in advance that it is invalid.

This PR proposes the fix for 1.25: https://github.com/juju/juju/pull/5506
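
For illustration, the kind of login sequence that produces a one-time failure like this looks roughly as follows; this is a sketch of the behaviour described above, not the code in the PR:

package main

import (
	"errors"
	"fmt"
)

var errInvalidEntity = errors.New("invalid entity name or password")

// login sketches the behaviour described above: the agent first attempts a
// login with a password it already knows may be invalid (here, an empty one)
// and only then falls back to its current password, so one spurious
// "invalid entity name or password" shows up per connection attempt.
func login(tryLogin func(password string) error, oldPassword, password string) error {
	if err := tryLogin(oldPassword); err == nil {
		return nil
	}
	return tryLogin(password) // second attempt with the valid credentials
}

func main() {
	const valid = "s3cret"
	err := login(func(p string) error {
		if p != valid {
			fmt.Println("server:", errInvalidEntity)
			return errInvalidEntity
		}
		return nil
	}, "", valid)
	fmt.Println("login result:", err)
}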

tags: added: blocker
Revision history for this message
Anastasia (anastasia-macmood) wrote :

The fix has landed on 1.25.

However, I will keep this bug open until Jorge confirms that the auth failures are gone.

Revision history for this message
Anastasia (anastasia-macmood) wrote :
Changed in juju-core:
importance: High → Critical
Revision history for this message
Anastasia (anastasia-macmood) wrote :

I am assuming that with the fix this bug is no longer present, i.e. fixed.

Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
tags: removed: blocker
affects: juju-core → juju
Changed in juju:
milestone: 2.0-beta8 → none
milestone: none → 2.0-beta8
Changed in juju-core:
assignee: nobody → Anastasia (anastasia-macmood)
importance: Undecided → Critical
status: New → Fix Released