Canonical Juju

Juju HA upgrade 2.1.x -> 2.2.x never finishes.

Bug #1717911 reported by José Pekkarinen on 2017-09-18

This bug affects 1 person

Affects	Status	Importance	Assigned to
Canonical Juju	Fix Released	Undecided	Unassigned
2.1	Won't Fix	Undecided	Unassigned
2.2	Fix Released	Undecided	Unassigned

Bug Description

Hi,

On a Juju HA deployment, I tried to upgrade the controller model using juju upgrade-juju:

juju upgrade-juju --agent-version 2.2.4 -m controller

while controlling the traffic with iptables rules such as the following on the primary
node:

-A INPUT -s <juju_client_ip>/32 -p tcp -m tcp --dport 17070 -j ACCEPT
-A INPUT -s <controller_2>/32 -p tcp -m tcp --dport 17070 -j ACCEPT
-A INPUT -s <controller_3>/32 -p tcp -m tcp --dport 17070 -j ACCEPT
-A INPUT -p tcp -m tcp --dport 17070 -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -p tcp -m tcp --dport 17070 -j DROP

And similar rules on the other controllers:

-A INPUT -s <controller_1>/32 -p tcp -m tcp --dport 17070 -j ACCEPT
-A INPUT -s <controller_3>/32 -p tcp -m tcp --dport 17070 -j ACCEPT
-A INPUT -p tcp -m tcp --dport 17070 -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -p tcp -m tcp --dport 17070 -j DROP

-A INPUT -s <controller_2>/32 -p tcp -m tcp --dport 17070 -j ACCEPT
-A INPUT -s <controller_3>/32 -p tcp -m tcp --dport 17070 -j ACCEPT
-A INPUT -p tcp -m tcp --dport 17070 -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -p tcp -m tcp --dport 17070 -j DROP
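
For anyone reproducing this setup, the rule sets above can be generated rather than typed by hand, which makes it easier to confirm that every controller admits both of its peers. This is only a sketch: the function name `api_port_rules` is hypothetical, it prints rules without applying them, and it assumes each controller should accept connections on port 17070 from every address passed to it.

```shell
# Print iptables INPUT rules that admit the listed source addresses to the
# Juju API port (17070, as in the rules above) and drop everything else.
# Nothing is applied; the output is meant to be reviewed first.
api_port_rules() {
    port=17070
    # One ACCEPT rule per allowed source address.
    for src in "$@"; do
        echo "-A INPUT -s ${src}/32 -p tcp -m tcp --dport ${port} -j ACCEPT"
    done
    # Keep already-established connections, then drop the rest.
    echo "-A INPUT -p tcp -m tcp --dport ${port} -m state --state RELATED,ESTABLISHED -j ACCEPT"
    echo "-A INPUT -p tcp -m tcp --dport ${port} -j DROP"
}
```

For the primary node this would be called with the client address followed by the two peer controllers, for example `api_port_rules 192.0.2.10 192.0.2.2 192.0.2.3` (placeholder addresses).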

juju status then reports all nodes as down, and nothing changes for nearly 12 hours.
The logs show that all controllers keep trying to reach each other without success,
because each one answers that an upgrade is in progress:

2017-09-16 04:03:44 DEBUG juju.apiserver request_notifier.go:93 [B3E5] API connection from $controller3:34268
2017-09-16 04:03:44 DEBUG juju.apiserver request_notifier.go:93 [B3E6] API connection from $controller3:52890
2017-09-16 04:03:44 DEBUG juju.apiserver request_notifier.go:145 <- [B3E5] {"request-id":1,"type":"Admin","version":3,"request":"Login","params":"'params redacted'"}
2017-09-16 04:03:44 DEBUG juju.apiserver request_notifier.go:106 [B3E6] API connection terminated after 785.69µs
2017-09-16 04:03:44 DEBUG juju.apiserver request_notifier.go:171 -> [B3E5] 263.965µs {"request-id":1,"error":"login for \"machine-9\" blocked because upgrade in progress","response":"'body redacted'"} Admin[""].Login
2017-09-16 04:03:44 DEBUG juju.apiserver request_notifier.go:106 [B3E5] API connection terminated after 1.781122ms

....

2017-09-16 04:03:44 DEBUG juju.apiserver request_notifier.go:93 [B3E9] API connection from $controller2:56498
2017-09-16 04:03:44 DEBUG juju.apiserver request_notifier.go:93 [B3EA] API connection from $controller2:54582
2017-09-16 04:03:44 DEBUG juju.apiserver request_notifier.go:145 <- [B3E9] {"request-id":1,"type":"Admin","version":3,"request":"Login","params":"'params redacted'"}
2017-09-16 04:03:44 DEBUG juju.apiserver request_notifier.go:171 -> [B3E9] 242.092µs {"request-id":1,"error":"login for \"machine-5\" blocked because upgrade in progress","response":"'body redacted'"} Admin[""].Login
2017-09-16 04:03:44 DEBUG juju.apiserver request_notifier.go:106 [B3EA] API connection terminated after 715.978µs
2017-09-16 04:03:44 DEBUG juju.apiserver request_notifier.go:93 [B3EB] API connection from $controller2:46728
2017-09-16 04:03:44 DEBUG juju.apiserver request_notifier.go:106 [B3E9] API connection terminated after 1.577115ms
2017-09-16 04:03:44 DEBUG juju.apiserver request_notifier.go:106 [B3EB] API connection terminated after 648.661µs

This kind of output can be seen on all controllers, and it is not limited to connections
between controllers: units that try to log in get the same answer in the log. Please let
me know what further output you need to move this forward.
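
To see at a glance which agents are being turned away, the rejected logins in a log like the one above can be counted per agent. A minimal sketch, assuming the log lines keep the escaped `\"machine-N\"` form shown above; the function name `count_blocked_logins` is made up for illustration.

```shell
# Count "upgrade in progress" login rejections per agent.
# Reads Juju log text on stdin and prints "<count> <agent>" lines.
count_blocked_logins() {
    # Extract the agent name from between the escaped quotes, e.g. machine-9.
    sed -n 's/.*login for \\"\([^"\\]*\)\\" blocked because upgrade in progress.*/\1/p' \
        | sort | uniq -c | awk '{print $1, $2}'
}
```

It would be fed a controller's machine log on stdin, for example `count_blocked_logins < machine-0.log`, where the file name is a placeholder for wherever the log was saved.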

Best regards.

José.

Tags: 4010 cpe-onsite

John A Meinel (jameinel) wrote on 2017-09-19:

This sounds like a repeat of
https://bugs.launchpad.net/bugs/1697956

which we believe is fixed in 2.2.1.

Presumably a couple of your controllers have come up and are waiting for
the third, but it is failing because of the "upgrade in progress" bug.

There is a manual fix for this if you just want things working, but we've
never managed to reproduce the failure directly, so it would be good to
know if we really did fix the underlying issue.

If you just want to restore the system, you can look in /var/lib/juju/tools
(I believe). There should be a symlink for the machine agent pointing
(currently) to the old agent version. Controllers wait for an indication
that all controllers are ready to move to the next upgrade state.

I'm hesitant to give off-the-top-of-my-head advice in case I get something
wrong. Are you able to talk more later today?

John
=:->


José Pekkarinen (koalinux) wrote on 2017-09-19:

They are similar, but this upgrade was from 2.1.3 to 2.2.4, so either it's not fixed or we have a regression.

José Pekkarinen (koalinux) wrote on 2017-09-19:

The tools seem to be synchronised properly, but I cannot test the workaround right away; I have
to schedule a time slot to try it:

juju run -m controller --machine 0,5,9 "ls -l /var/lib/juju/tools"
- MachineId: "0"
  Stdout: |
    total 12
    drwxr-xr-x 2 root root 4096 Apr 16 02:11 2.1.2.1-xenial-amd64
    drwxr-xr-x 2 root root 4096 Jun 30 20:54 2.1.3.1-xenial-amd64
    drwxr-xr-x 2 root root 4096 Sep 15 14:21 2.2.4-xenial-amd64
    lrwxrwxrwx 1 root root 40 Sep 16 04:16 machine-0 -> /var/lib/juju/tools/2.1.3.1-xenial-amd64
- MachineId: "5"
  Stdout: |
    total 12
    drwxr-xr-x 2 root root 4096 Apr 19 09:23 2.1.2.1-xenial-amd64
    drwxr-xr-x 2 root root 4096 Jun 30 20:54 2.1.3.1-xenial-amd64
    drwxr-xr-x 2 root root 4096 Sep 15 14:21 2.2.4-xenial-amd64
    lrwxrwxrwx 1 root root 40 Sep 16 04:17 machine-5 -> /var/lib/juju/tools/2.1.3.1-xenial-amd64
- MachineId: "9"
  Stdout: |
    total 12
    drwxr-xr-x 2 root root 4096 Apr 21 14:17 2.1.2.1-xenial-amd64
    drwxr-xr-x 2 root root 4096 Jun 30 20:54 2.1.3.1-xenial-amd64
    drwxr-xr-x 2 root root 4096 Sep 15 14:21 2.2.4-xenial-amd64
    lrwxrwxrwx 1 root root 40 Sep 16 04:17 machine-9 -> /var/lib/juju/tools/2.1.3.1-xenial-amd64
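
A listing like this can also be checked mechanically for machine symlinks that still point at an older tools directory. The following is a rough sketch, not part of Juju: `stale_tool_links` is a hypothetical helper that reads `ls -l` output on stdin and prints every symlink whose target directory name differs from the expected one.

```shell
# Print each symlink whose target does not end in the expected tools
# directory. Reads `ls -l /var/lib/juju/tools` output on stdin.
stale_tool_links() {
    expected="$1"   # e.g. 2.2.4-xenial-amd64
    awk -v want="$expected" '
        # Symlink lines start with "l" and end in "name -> target".
        $1 ~ /^l/ && $(NF-1) == "->" {
            n = split($NF, parts, "/")
            if (parts[n] != want) print $(NF-2), "->", $NF
        }'
}
```

Piping the command above into it, as in `juju run -m controller --machine 0,5,9 "ls -l /var/lib/juju/tools" | stale_tool_links 2.2.4-xenial-amd64`, would list all three machine links for the output shown here.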

John A Meinel (jameinel) wrote on 2017-09-19:

So I believe that if you stop the jujud-machine-X service on each machine,
and then update the symlink, and then start them again, they will all end
up happy on the new version.

John
=:->
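
The stop / relink / start sequence described above could be written out as a dry run before touching anything. This is a sketch under assumptions, not a tested procedure: `relink_plan` is a made-up name, it only prints commands, and it assumes the agents run as systemd units named `jujud-machine-N` and that the symlinks live in /var/lib/juju/tools as in the listing earlier in this report. Each printed command would have to be run on the machine it names.

```shell
# Print (do not run) the commands for the stop / relink / start sequence.
# Usage: relink_plan <tools-dir-name> <machine-number>...
relink_plan() {
    version="$1"; shift
    tools=/var/lib/juju/tools
    # Phase 1: stop the machine agent on every controller.
    for m in "$@"; do
        echo "machine $m: sudo systemctl stop jujud-machine-$m"
    done
    # Phase 2: repoint each machine symlink at the new tools directory.
    for m in "$@"; do
        echo "machine $m: sudo ln -sfn $tools/$version $tools/machine-$m"
    done
    # Phase 3: start the agents again.
    for m in "$@"; do
        echo "machine $m: sudo systemctl start jujud-machine-$m"
    done
}
```

For the controllers in this report the call would be `relink_plan 2.2.4-xenial-amd64 0 5 9`. Stopping all agents before relinking any of them follows the order given in the comment above.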

José Pekkarinen (koalinux) wrote on 2017-09-19:

Last time I tried something similar, Juju took the information from the database, figured out
the link was wrong, and recreated it. Are you sure you want me to give it a go?

John A Meinel (jameinel) wrote on 2017-09-19:

If they are at the stage you mention, they're waiting for each controller
agent to check in, and then they will do exactly that. I would expect it to
be enough to kick them over the line and actually switch to 2.2.

José Pekkarinen (koalinux) wrote on 2017-09-19:

So there is no need to fix upgrade-juju?

John A Meinel (jameinel) wrote on 2017-09-21:

So I actually think it is fixed in 2.2, but it's 2.1 that you're trying to
upgrade.

Anastasia (anastasia-macmood) on 2017-09-21

Changed in juju:
status:	New → Fix Released

José Pekkarinen (koalinux) wrote on 2017-09-22:

I got some time to work on this just now, and I see the upgrade still locks up even with the
links properly set:

lrwxrwxrwx 1 root root 38 Sep 22 15:09 machine-0 -> /var/lib/juju/tools/2.2.4-xenial-amd64
lrwxrwxrwx 1 root root 38 Sep 22 15:09 machine-5 -> /var/lib/juju/tools/2.2.4-xenial-amd64
lrwxrwxrwx 1 root root 38 Sep 22 15:09 machine-9 -> /var/lib/juju/tools/2.2.4-xenial-amd64

So if the fix is released as stated here, there is a regression.

Ante Karamatić (ivoks) on 2017-09-27

tags:	added: cpe-onsite
