Relation data lost after juju kills the upgrader/uniter when the state connection is lost.

Bug #1356500 reported by Jorge Niedbalski
Affects: juju-core
Status: Invalid
Importance: Medium
Assigned to: Unassigned

Bug Description

[Environment details]

Juju-version: juju-1.18.1.1-trusty-amd64
Ubuntu version: Trusty

[Reproduction steps]

I was unable to reproduce this exact behavior on juju-core 1.20.1

$ juju --show-log bootstrap -e local -v
$ juju deploy keystone
$ juju deploy mysql
$ juju add-relation keystone mysql
$ juju ssh keystone/0

ubuntu# grep connection /etc/keystone/keystone.conf
connection = mysql://keystone:gsoDGPC6paDLQ8RE@10.0.3.246/keystone

Then force the jujud teardown:

$ service juju-agent-local stop
$[...]

The shutdown is noticed by the unit:

unit-mysql-0: 2014-08-13 19:52:36 ERROR juju.state.api.watcher watcher.go:68 error trying to stop watcher: connection is shut down
unit-mysql-0: message repeated 10 times: [2014-08-13 19:52:36 ERROR juju.state.api.watcher watcher.go:68 error trying to stop watcher: connection is shut down]
unit-mysql-0: 2014-08-13 19:52:36 ERROR juju.worker.uniter.filter filter.go:116 connection is shut down
unit-mysql-0: 2014-08-13 19:52:36 ERROR juju.worker runner.go:207 fatal "logger": connection is shut down
unit-mysql-0: 2014-08-13 19:52:36 ERROR juju.worker runner.go:207 fatal "upgrader": connection is shut down
unit-mysql-0: 2014-08-13 19:52:36 ERROR juju.worker runner.go:207 fatal "apiaddressupdater": connection is shut down
unit-mysql-0: 2014-08-13 19:52:36 ERROR juju.worker runner.go:207 fatal "uniter": connection is shut down
unit-mysql-0: 2014-08-13 19:52:36 ERROR juju.worker runner.go:207 fatal "rsyslog": connection is shut down
unit-mysql-0: 2014-08-13 19:52:36 ERROR juju.worker runner.go:218 exited "api": connection is shut down
unit-mysql-0: 2014-08-13 19:52:39 ERROR juju.worker runner.go:218 exited "api": unable to connect to "wss://10.0.3.1:17070/"
unit-mysql-0: 2014-08-13 19:52:42 ERROR juju.worker runner.go:218 exited "api": unable to connect to "wss://10.0.3.1:17070/"
unit-mysql-0: 2014-08-13 19:52:45 ERROR juju.worker runner.go:218 exited "api": unable to connect to "wss://10.0.3.1:17070/"
unit-mysql-0: 2014-08-13 19:52:48 ERROR juju.worker runner.go:218 exited "api": unable to connect to "wss://10.0.3.1:17070/"

Then the workers are restarted and the config-changed event is triggered:

unit-keystone-0: 2014-08-13 19:52:59 INFO juju-log Migrating the keystone database.
unit-keystone-0: 2014-08-13 19:52:59 INFO config-changed keystone stop/waiting
unit-keystone-0: 2014-08-13 19:52:36 ERROR juju.worker runner.go:207 fatal "uniter": connection is shut down
unit-keystone-0: 2014-08-13 19:52:36 ERROR juju.worker runner.go:207 fatal "rsyslog": connection is shut down
unit-keystone-0: 2014-08-13 19:52:36 ERROR juju.worker runner.go:218 exited "api": connection is shut down
unit-keystone-0: 2014-08-13 19:52:39 ERROR juju.worker runner.go:218 exited "api": unable to connect to "wss://10.0.3.1:17070/"
unit-keystone-0: 2014-08-13 19:52:42 ERROR juju.worker runner.go:218 exited "api": unable to connect to "wss://10.0.3.1:17070/"
unit-keystone-0: 2014-08-13 19:52:45 ERROR juju.worker runner.go:218 exited "api": unable to connect to "wss://10.0.3.1:17070/"
unit-keystone-0: 2014-08-13 19:52:48 ERROR juju.worker runner.go:218 exited "api": unable to connect to "wss://10.0.3.1:17070/"
unit-keystone-0: 2014-08-13 19:53:00 INFO config-changed keystone start/running, process 4972

$ grep connection /etc/keystone/keystone.conf
connection = mysql://keystone:gsoDGPC6paDLQ8RE@10.0.3.246/keystone

==================

On juju-core 1.18.1, this last step returns None. The relation data is populated by the keystone charm using:

'database_host': relation_get('host', rid=rid, unit=unit)

[Summary]

1) For some undetermined reason the agent lost its connection to the state server.
    *juju.cmd.jujud agent.go:137 error pinging *api.State: connection is shut down

2) Then an 'environment change' event is triggered
     * juju.worker.uniter uniter.go:747 new environment change

3) Juju kills uniter, upgrader, logger workers

4) Then the connection is re-established:
    * juju apiclient.go:124 state/api: connection established

5) Then juju restarts the upgrader/uniter workers
     * runner.go:262 worker: start "uniter" [..]

6) All the relations are joined again
      * 2014-08-12 11:05:15 INFO juju.worker.uniter uniter.go:690 joining relation "glance:cluster"

7) And finally a config-changed event is triggered
       * 2014-08-12 11:05:18 INFO juju.worker.uniter modes.go:382 ModeConfigChanged starting

[Issues]

- After this final config-changed event, the relation data that had been set previously (the database credentials) has been lost, so the configuration files written by the charms (which use relation_get) contain erroneous values (empty strings).

[ Logs ]

2014-08-12 10:50:58 INFO juju.worker.upgrader upgrader.go:121 desired tool version: 1.18.1.11
2014-08-12 11:05:00 ERROR juju watcher.go:66 state/api: error trying to stop watcher connection is shut down
message repeated 17 times: [2014-08-12 11:05:00 ERROR juju watcher.go:66 state/api: error trying to stop watcher connection is shut down]
2014-08-12 11:05:00 DEBUG juju.worker.uniter uniter.go:747 new environment change
2014-08-12 11:05:00 DEBUG juju.worker.uniter.filter filter.go:325 got service change
2014-08-12 11:05:00 ERROR juju.worker.uniter.filter filter.go:116 watcher iteration error: EOF
2014-08-12 11:05:00 INFO juju.cmd.jujud agent.go:137 error pinging *api.State: connection is shut down
2014-08-12 11:05:00 ERROR juju runner.go:209 worker: fatal "upgrader": watcher iteration error: EOF
2014-08-12 11:05:00 DEBUG juju runner.go:243 worker: killing "logger"
2014-08-12 11:05:00 DEBUG juju runner.go:243 worker: killing "uniter"
2014-08-12 11:05:00 DEBUG juju runner.go:243 worker: killing "rsyslog"
2014-08-12 11:05:00 INFO juju.cmd.jujud agent.go:137 error pinging *api.State: connection is shut down
2014-08-12 11:05:00 ERROR juju runner.go:209 worker: fatal "logger": watcher iteration error: EOF
2014-08-12 11:05:00 DEBUG juju.worker.uniter modes.go:384 ModeAbide exiting
2014-08-12 11:05:00 INFO juju.worker.uniter uniter.go:141 unit "glance/1" shutting down: tomb: dying
2014-08-12 11:05:00 DEBUG juju.worker.uniter runlistener.go:97 juju-run listener stopping
2014-08-12 11:05:00 DEBUG juju.worker.uniter runlistener.go:117 juju-run listener stopped
2014-08-12 11:05:00 INFO juju.cmd.jujud agent.go:137 error pinging *api.State: connection is shut down
2014-08-12 11:05:00 ERROR juju runner.go:209 worker: fatal "uniter": watcher iteration error: EOF
2014-08-12 11:05:00 INFO juju.cmd.jujud agent.go:137 error pinging *api.State: connection is shut down
2014-08-12 11:05:00 ERROR juju runner.go:209 worker: fatal "rsyslog": watcher iteration error: EOF
2014-08-12 11:05:00 INFO juju.rpc server.go:295 error closing codec: write tcp 10.30.0.101:17070: connection reset by peer
2014-08-12 11:05:00 ERROR juju runner.go:220 worker: exited "api": watcher iteration error: EOF
2014-08-12 11:05:00 INFO juju runner.go:254 worker: restarting "api" in 3s
2014-08-12 11:05:03 INFO juju runner.go:262 worker: start "api"
2014-08-12 11:05:03 INFO juju apiclient.go:114 state/api: dialing "wss://10.30.0.101:17070/"
2014-08-12 11:05:03 ERROR juju apiclient.go:119 state/api: websocket.Dial wss://10.30.0.101:17070/: dial tcp 10.30.0.101:17070: connection refused
2014-08-12 11:05:03 ERROR juju runner.go:220 worker: exited "api": websocket.Dial wss://10.30.0.101:17070/: dial tcp 10.30.0.101:17070: connection refused
2014-08-12 11:05:03 INFO juju runner.go:254 worker: restarting "api" in 3s
2014-08-12 11:05:06 INFO juju runner.go:262 worker: start "api"
2014-08-12 11:05:06 INFO juju apiclient.go:114 state/api: dialing "wss://10.30.0.101:17070/"
2014-08-12 11:05:06 INFO juju apiclient.go:124 state/api: connection established
2014-08-12 11:05:08 INFO juju runner.go:262 worker: start "upgrader"
2014-08-12 11:05:08 INFO juju runner.go:262 worker: start "logger"
2014-08-12 11:05:08 DEBUG juju.worker.logger logger.go:35 initial log config: "<root>=DEBUG;unit=DEBUG"
2014-08-12 11:05:08 INFO juju runner.go:262 worker: start "uniter"
2014-08-12 11:05:08 DEBUG juju.worker.logger logger.go:60 logger setup
2014-08-12 11:05:08 INFO juju runner.go:262 worker: start "rsyslog"
2014-08-12 11:05:08 DEBUG juju.worker.rsyslog worker.go:76 starting rsyslog worker mode 1 for "unit-glance-1" ""
2014-08-12 11:05:09 DEBUG juju.worker.rsyslog worker.go:168 Reloading rsyslog configuration
2014-08-12 11:05:12 INFO juju.worker.upgrader upgrader.go:121 desired tool version: 1.18.1.1
2014-08-12 11:05:15 INFO juju.worker.upgrader upgrader.go:121 desired tool version: 1.18.1.1
2014-08-12 11:05:15 INFO juju.worker.uniter uniter.go:690 joining relation "glance:cluster"
2014-08-12 11:05:16 INFO juju.worker.uniter uniter.go:716 joined relation "glance:cluster"
2014-08-12 11:05:16 INFO juju.worker.uniter uniter.go:690 joining relation "r2-compute:image-service glance:image-service"
2014-08-12 11:05:16 INFO juju.worker.uniter uniter.go:716 joined relation "r2-compute:image-service glance:image-service"
2014-08-12 11:05:16 INFO juju.worker.uniter uniter.go:690 joining relation "nova-cloud-controller:image-service glance:image-service"
2014-08-12 11:05:16 INFO juju.worker.uniter uniter.go:716 joined relation "nova-cloud-controller:image-service glance:image-service"
2014-08-12 11:05:16 INFO juju.worker.uniter uniter.go:690 joining relation "glance:ceph r1-mon:client"
2014-08-12 11:05:16 INFO juju.worker.uniter uniter.go:716 joined relation "glance:ceph r1-mon:client"
2014-08-12 11:05:16 INFO juju.worker.uniter uniter.go:690 joining relation "cinder:image-service glance:image-service"
2014-08-12 11:05:16 INFO juju.worker.uniter uniter.go:716 joined relation "cinder:image-service glance:image-service"
2014-08-12 11:05:16 INFO juju.worker.uniter uniter.go:690 joining relation "glance:shared-db mysql:shared-db"
2014-08-12 11:05:16 INFO juju.worker.uniter uniter.go:716 joined relation "glance:shared-db mysql:shared-db"
2014-08-12 11:05:16 INFO juju.worker.uniter uniter.go:690 joining relation "glance:amqp rabbitmq-server:amqp"
2014-08-12 11:05:16 INFO juju.worker.uniter uniter.go:716 joined relation "glance:amqp rabbitmq-server:amqp"
2014-08-12 11:05:16 INFO juju.worker.uniter uniter.go:690 joining relation "r1-compute:image-service glance:image-service"
2014-08-12 11:05:17 INFO juju.worker.uniter uniter.go:716 joined relation "r1-compute:image-service glance:image-service"
2014-08-12 11:05:17 INFO juju.worker.uniter uniter.go:690 joining relation "glance:ha hacluster-glance:ha"
2014-08-12 11:05:17 INFO juju.worker.uniter uniter.go:716 joined relation "glance:ha hacluster-glance:ha"
2014-08-12 11:05:17 INFO juju.worker.uniter uniter.go:690 joining relation "glance:identity-service keystone:identity-service"
2014-08-12 11:05:17 INFO juju.worker.uniter uniter.go:716 joined relation "glance:identity-service keystone:identity-service"
2014-08-12 11:05:17 DEBUG juju.worker.uniter uniter.go:217 starting juju-run listener on unix:/var/lib/juju/agents/unit-glance-1/run.socket
2014-08-12 11:05:17 INFO juju.worker.uniter uniter.go:112 unit "glance/1" started
2014-08-12 11:05:17 DEBUG juju.worker.uniter runlistener.go:84 juju-run listener running
2014-08-12 11:05:18 INFO juju.worker.uniter modes.go:382 ModeContinue starting
2014-08-12 11:05:18 INFO juju.worker.uniter modes.go:31 loading uniter state
2014-08-12 11:05:18 INFO juju.worker.uniter modes.go:48 continuing after "config-changed" hook
2014-08-12 11:05:18 DEBUG juju.worker.uniter modes.go:384 ModeContinue exiting
2014-08-12 11:05:18 INFO juju.worker.uniter modes.go:382 ModeConfigChanged starting
2014-08-12 11:05:18 DEBUG juju.worker.uniter uniter.go:747 new environment change
2014-08-12 11:05:18 DEBUG juju.worker.uniter.filter filter.go:507 charm check skipped, not yet installed.
2014-08-12 11:05:18 DEBUG juju.worker.uniter.filter filter.go:317 got unit change
2014-08-12 11:05:18 DEBUG juju.worker.uniter.filter filter.go:333 got config change
2014-08-12 11:05:18 DEBUG juju.worker.uniter.filter filter.go:337 preparing new config event
2014-08-12 11:05:18 DEBUG juju.worker.uniter.filter filter.go:440 discarded config event
2014-08-12 11:05:18 DEBUG juju.worker.uniter.filter filter.go:341 got relations change
2014-08-12 11:05:18 INFO juju.utils.fslock fslock.go:146 attempted lock failed "uniter-hook-execution", glance/1: running hook "config-changed", currently held: hacluster-glance/5: running hook "config-changed"
2014-08-12 11:05:18 DEBUG juju.worker.uniter.filter filter.go:325 got service change
2014-08-12 11:05:18 DEBUG juju.worker.uniter.filter filter.go:521 no new charm event
2014-08-12 11:05:19 INFO juju.worker.uniter uniter.go:480 running "config-changed" hook
2014-08-12 11:05:19 INFO config-changed

Tags: sts
Revision history for this message
Curtis Hovey (sinzui) wrote :

This issue may be fixed. We will want to review this bug again to determine if it no longer affects 1.20 and 1.21.

Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → next-stable
Revision history for this message
Jorge Niedbalski (niedbalski) wrote :

Hello Guys,

Here are the exported JSON files; the settings entry for mysql:

    {
        "_id": "r#157#provider#mysql/0",
        "db_host": "10.30.0.XX",
        "password": "xxxx",
        "private-address": "10.30.0.XX",
        "txn-queue": [],
        "txn-revno": 3
    },

Relations.json

[
    {
        "_id": "r1-mon:mon",
        "endpoints": [
            {
                "servicename": "r1-mon",
                "relation": {
                    "name": "mon",
                    "role": "peer",
                    "interface": "ceph",
                    "optional": false,
                    "limit": 1,
                    "scope": "global"
                }
            }
        ],
        "id": 0,
        "life": 0,
        "txn-queue": [],
        "txn-revno": 5,
        "unitcount": 3
    },
    {
        "_id": "r2-mon:mon",
        "endpoints": [
            {
                "servicename": "r2-mon",
                "relation": {
                    "name": "mon",
                    "role": "peer",
                    "interface": "ceph",
                    "optional": false,
                    "limit": 1,
                    "scope": "global"
                }
            }
        ],
        "id": 1,
        "life": 0,
        "txn-queue": [],
        "txn-revno": 5,
        "unitcount": 3
    },
    {
        "_id": "r2-compute:compute-peer",
        "endpoints": [
            {
                "servicename": "r2-compute",
                "relation": {
                    "name": "compute-peer",
                    "role": "peer",
                    "interface": "nova",
                    "optional": false,
                    "limit": 1,
                    "scope": "global"
                }
            }
        ],
        "id": 3,
        "life": 0,
        "txn-queue": [],
        "txn-revno": 8,
        "unitcount": 4
    },
    {
        "_id": "rabbitmq-server:cluster",
        "endpoints": [
            {
                "servicename": "rabbitmq-server",
                "relation": {
                    "name": "cluster",
                    "role": "peer",
                    "interface": "rabbitmq-ha",
                    "optional": false,
                    "limit": 1,
                    "scope": "global"
                }
            }
        ],
        "id": 5,
        "life": 0,
        "txn-queue": [],
        "txn-revno": 4,
        "unitcount": 2
    },
    {
        "_id": "nova-cloud-controller:cluster",
        "endpoints": [
            {
                "servicename": "nova-cloud-controller",
                "relation": {
                    "name": "cluster",
                    "role": "peer",
                    "interface": "nova-ha",
                    "optional": false,
                    "limit": 1,
                    "scope": "global"
                }
            }
        ],
        "id": 7,
        "life": 0,
        "txn-queue": [],
        "txn-revno": 4,
        "unitcount": 2
    },
    {
        "_id": "openstack-dashboard:cluster",
        "endpoints": [
            {
                "servicename": "openstack-dashboard",
                "relation": {
  ...

Revision history for this message
Jorge Niedbalski (niedbalski) wrote :

All the relations.json data appears to be OK, and since that stage worked fine after the state server reconnection, I don't think the problem is there.

Revision history for this message
Jorge Niedbalski (niedbalski) wrote :

Hello,

Attached are the mongod exported JSON files. Am I missing something? All the relations
data appears to be OK, and that specific part worked fine after the outage.

I am not sure if all the settings data for each relation are still there.

I took a look into the settings.json file and all I can find referencing the mysql settings
is:

    {
        "_id": "r#157#provider#mysql/0",
        "db_host": "xxxxx",
        "password": "xxxx",
        "private-address": "xxxx",
        "txn-queue": [],
        "txn-revno": 3
    },

I am unsure whether this entry is specific to relation id 157 (the gateway); coincidence
or not, this was the only service re-deployed correctly after the outage.

Any ideas ?

Revision history for this message
Jorge Niedbalski (niedbalski) wrote :

OK, this is becoming interesting: I found a previously unnoticed panic that is raised after a relation_set command
is performed. Since the relation_set command doesn't return a non-zero exit code, the charm installation succeeds,
which explains why the relation data is missing.
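The failure mode can be sketched as follows; this is a minimal stand-alone illustration, not juju-core code, and `store` and `relationSet` are hypothetical stand-ins for the relation settings and the jujuc handler. The write panics, the panic is swallowed, and the caller still sees a zero exit code while no data was stored.

```go
package main

import "fmt"

// store stands in for the relation settings document; leaving it nil
// simulates the failed settings lookup described in this report.
var store map[string]string

// relationSet mimics a relation_set handler whose internal panic is
// recovered without changing the exit code, so the charm hook that
// invoked it believes the write succeeded.
func relationSet(k, v string) (exitCode int) {
	defer func() {
		if recover() != nil {
			// panic is only visible in the server log; exitCode stays 0
		}
	}()
	store[k] = v // assignment to a nil map panics
	return 0
}

func main() {
	fmt.Println(relationSet("db_host", "10.0.3.246")) // 0: looks like success
	fmt.Println(len(store))                           // 0: nothing was stored
}
```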

Panic traceback

2014-08-12 11:06:09 DEBUG worker.uniter.jujuc server.go:104 hook context id "keystone/0:config-changed:6969413113383759108"; dir "/var/lib/juju/agents/unit-keystone-0/charm"
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x20 pc=0x5bd970]

goroutine 15439 [running]:
runtime.panic(0xcdad60, 0x1a86648)
 /usr/lib/go/src/pkg/runtime/panic.c:266 +0xb6
launchpad.net/juju-core/worker/uniter/jujuc.(*RelationSetCommand).Run(0xc210253750, 0xc210203050, 0x7fc83a9dbe80, 0xc21008d4d0)
 /build/buildd/juju-core-1.18.1/src/launchpad.net/juju-core/worker/uniter/jujuc/relation-set.go:68 +0x340
launchpad.net/juju-core/cmd.Main(0x7fc83a9efe58, 0xc210253750, 0xc210203050, 0xc2102255a0, 0x11, ...)
 /build/buildd/juju-core-1.18.1/src/launchpad.net/juju-core/cmd/cmd.go:244 +0x283
launchpad.net/juju-core/worker/uniter/jujuc.(*Jujuc).Main(0xc21029ab50, 0xc2102536c0, 0x2d, 0xc2102536f0, 0x2a, ...)
 /build/buildd/juju-core-1.18.1/src/launchpad.net/juju-core/worker/uniter/jujuc/server.go:105 +0x5fd
reflect.Value.call(0xc59b80, 0xc87a70, 0x130, 0xe019d0, 0x4, ...)
 /usr/lib/go/src/pkg/reflect/value.go:474 +0xe0b
reflect.Value.Call(0xc59b80, 0xc87a70, 0x130, 0x7fc83a848ee8, 0x3, ...)
 /usr/lib/go/src/pkg/reflect/value.go:345 +0x9d
net/rpc.(*service).call(0xc210090bc0, 0xc210090b80, 0xc210174b00, 0xc2101e7800, 0xc210247920, ...)
 /usr/lib/go/src/pkg/net/rpc/server.go:381 +0x159
created by net/rpc.(*Server).ServeCodec
 /usr/lib/go/src/pkg/net/rpc/server.go:452 +0x3bb

Revision history for this message
Ian Booth (wallyworld) wrote :

The panic occurs because there is a bug in RelationSetCommand.Run().
 settings, err := r.Settings()
 for k, v := range c.Settings {
  if v != "" {
   settings.Set(k, v)
  } else {
   settings.Delete(k)
  }
 }

The error from r.Settings() is not checked.
If the error were checked, the Run() method would have returned an error instead of panicking. This would then have printed the real reason for the underlying issue rather than logging a panic.
Bug 1358585 has been raised for the fix to error checking.
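The corrected pattern Ian describes can be sketched like this; it is a simplified stand-in, not the actual juju-core fix, and `getSettings`/`runChecked` are hypothetical names. Checking the error from the settings lookup turns the nil-pointer panic into an ordinary error the caller can report.

```go
package main

import (
	"errors"
	"fmt"
)

type settings map[string]string

// getSettings stands in for r.Settings(); here it fails, as it
// evidently did during the outage described in this report.
func getSettings() (settings, error) {
	return nil, errors.New("relation state unavailable")
}

// runChecked mirrors the loop from RelationSetCommand.Run, but with
// the previously missing error check: on failure the real error is
// returned instead of dereferencing a nil settings object.
func runChecked(pairs map[string]string) error {
	s, err := getSettings()
	if err != nil {
		return err
	}
	for k, v := range pairs {
		if v != "" {
			s[k] = v
		} else {
			delete(s, k)
		}
	}
	return nil
}

func main() {
	err := runChecked(map[string]string{"db_host": "10.0.3.246"})
	fmt.Println(err != nil) // true: the failure is surfaced, not a panic
}
```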

Curtis Hovey (sinzui)
Changed in juju-core:
importance: High → Medium
milestone: next-stable → none
Felipe Reyes (freyes)
tags: added: sts
Revision history for this message
Jorge Niedbalski (niedbalski) wrote :

Fixed in 1.20; a fix for 1.18 is no longer needed.

Changed in juju-core:
status: Triaged → Invalid