[2.2.2][2.2.b4] model deletion hung, unable to completely destroy the model or create a new one with the same name, errors in logsink.log

Bug #1695894 reported by Dmitrii Shcherbakov
This bug affects 2 people
Affects                     Status        Importance  Assigned to     Milestone
Canonical Juju              Fix Released  High        Andrew Wilkins
OpenStack Charm Test Infra  Fix Released  Undecided   Unassigned

Bug Description

It's hard to tell what got it into that state, but model deletion hung for a while, so I sent SIGINT (Ctrl-C) to the client and tried to delete the model again. I've noticed that sometimes the Juju client just doesn't exit on its own, even when the controller has already finished deleting the model.

Here, however, I encountered an entirely different case (which I have actually reproduced twice on this controller):

http://paste.ubuntu.com/24783149/
"...
Waiting on model to be removed...
^C
(openstack-client) ubuntu@maas:~/cpe-foundation/templates/bundles$ juju destroy-model os
WARNING! This command will destroy the "os" model.
This includes all machines, applications, data and other resources.
Continue [y/N]? y
ERROR cannot connect to API: model 5fecf1b0-93df-42d0-8b61-c454825aa720 has been removed"

logsink.log is full of errors:

root@controller-juju:/var/log/juju# tail -f logsink.log
d123bde1-c80a-487f-8922-14c5cdf4de02: machine-0 2017-06-05 12:36:32 ERROR juju.api.watcher watcher.go:86 error trying to stop watcher: read tcp 10.232.3.216:48638->10.232.3.216:37017: i/o timeout
d123bde1-c80a-487f-8922-14c5cdf4de02: machine-0 2017-06-05 12:36:32 ERROR juju.worker runner.go:392 exited "txnlog": read tcp 127.0.0.1:59074->127.0.0.1:37017: i/o timeout
d123bde1-c80a-487f-8922-14c5cdf4de02: machine-0 2017-06-05 12:36:32 ERROR juju.worker.dependency engine.go:546 "not-dead-flag" manifold worker returned unexpected error: read tcp 10.232.3.216:48638->10.232.3.216:37017: i/o timeout
d123bde1-c80a-487f-8922-14c5cdf4de02: machine-0 2017-06-05 12:36:32 ERROR juju.worker runner.go:392 exited "txnlog": read tcp 10.232.3.216:48638->10.232.3.216:37017: i/o timeout
d123bde1-c80a-487f-8922-14c5cdf4de02: machine-0 2017-06-05 12:36:32 ERROR juju.worker.dependency engine.go:546 "environ-tracker" manifold worker returned unexpected error: cannot create environ: settings not found (not found)
d123bde1-c80a-487f-8922-14c5cdf4de02: machine-0 2017-06-05 12:36:32 ERROR juju.worker.dependency engine.go:546 "environ-tracker" manifold worker returned unexpected error: cannot create environ: settings not found (not found)
d123bde1-c80a-487f-8922-14c5cdf4de02: machine-0 2017-06-05 12:36:32 ERROR juju.api.watcher watcher.go:86 error trying to stop watcher: read tcp 127.0.0.1:59074->127.0.0.1:37017: i/o timeout
d123bde1-c80a-487f-8922-14c5cdf4de02: machine-0 2017-06-05 12:36:32 ERROR juju.worker.dependency engine.go:546 "not-dead-flag" manifold worker returned unexpected error: read tcp 127.0.0.1:59074->127.0.0.1:37017: i/o timeout

10.232.3.216 is the address of the controller machine itself:

ubuntu@maas:~$ maas maas machines read | jq '.[] | select( .ip_addresses[] | contains("10.232.3.216") ) | .system_id + " " + .fqdn + " " + (.tag_names | join(","))'
"8tk7tc controller-juju.maas virtual,bootstrap,controller"

root@controller-juju:/var/log/juju# ip -4 -o a s
1: lo inet 127.0.0.1/8 scope host lo\ valid_lft forever preferred_lft forever
2: ens3 inet 10.232.3.216/21 brd 10.232.7.255 scope global ens3\ valid_lft forever preferred_lft forever

jujud seems pretty busy (for a 12-core VM):

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
 5884 root 20 0 1702468 224928 44604 S 12.3 0.7 1482:13 jujud
 5708 root 20 0 2384288 1.247g 27836 S 3.7 4.0 565:46.65 mongod

root@controller-juju:/var/log/juju# cat /proc/cpuinfo | grep pro
processor : 0
processor : 1
processor : 2
processor : 3
processor : 4
processor : 5
processor : 6
processor : 7
processor : 8
processor : 9
processor : 10
processor : 11

ubuntu@maas:~$ juju status
ERROR model 5fecf1b0-93df-42d0-8b61-c454825aa720 has been removed
ubuntu@maas:~$ juju models
Controller: samaas

Model Cloud/Region Status Machines Cores Access Last connection
controller samaas available 1 12 admin just now

description: updated
Revision history for this message
Paul Gear (paulgear) wrote : Re: [2.2.b4] model deletion hung, unable to completely destroy the model or create a new one with the same name, errors in logsink.log

I'm getting a number of similar errors on a controller I just upgraded to 2.2.2. Getting status for a hosted model hangs (http://pastebin.ubuntu.com/25160416/). mongodb on the server is running and responds to commands normally.

summary: - [2.2.b4] model deletion hanged, unable to completely destroy the model
- or create a new one with the same name, errors in logsink.log
+ [2.2.b4] model deletion hung, unable to completely destroy the model or
+ create a new one with the same name, errors in logsink.log
tags: added: canonical-is
Revision history for this message
Paul Gear (paulgear) wrote :

The above should have read: "Getting status for a hosted model hangs (mine has been running for 10 minutes without producing output), and logsink.log is filled with messages similar to the above (http://pastebin.ubuntu.com/25160416/)."

Revision history for this message
Anastasia (anastasia-macmood) wrote :

@Dmitrii Shcherbakov (dmitriis) and Paul Gear (paulgear),

We either need a reproducible scenario or access to an already broken environment that we can poke at.

It is hard to say what got your systems into this state. Obviously, from the logs, something drastic has happened... but what?

Changed in juju:
status: New → Incomplete
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

paulgear, what was the original version prior to 2.2.2?

Revision history for this message
Paul Gear (paulgear) wrote :

@anastasia-macmood: https://private-fileshare.canonical.com/~paulgear/lp1695894/ has logs; @2-xtian was looking at them a couple of days back.

@dmitriis: I upgraded from 2.1.3.

Changed in juju:
status: Incomplete → New
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :
summary: - [2.2.b4] model deletion hung, unable to completely destroy the model or
- create a new one with the same name, errors in logsink.log
+ [2.2.2][2.2.b4] model deletion hung, unable to completely destroy the
+ model or create a new one with the same name, errors in logsink.log
tags: added: cdo-qa-blocker cpec
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

In addition to #6 and #7: http://paste.ubuntu.com/25178603/ - the CLI steps performed.

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

I am getting this very reliably on 2.2.2 with HA after several days of completely idle operation (only regular update-status):

root@controller-juju:/home/ubuntu# grep ERROR /var/log/juju/logsink.log | tail -n1000
http://paste.ubuntu.com/25218947/

Revision history for this message
Ryan Beisner (1chb1n) wrote :

See related (possibly duplicate): https://bugs.launchpad.net/bugs/1709324

Revision history for this message
Ryan Beisner (1chb1n) wrote :

After upgrading from Juju 2.2.1 to 2.2.2, we started having the following issue with iterative deploys using the Juju OpenStack Provider.

"ERROR model f66bb9f7-fef3-4f7b-82dd-c22a13388b8a has been removed"

+ juju switch auto-osci-sv00
auto-osci-sv00 (controller) (no change)
+ juju switch auto-osci-sv00:auto-osci-sv00
auto-osci-sv00 (controller) -> auto-osci-sv00:admin/auto-osci-sv00
+ juju set-model-constraints -m auto-osci-sv00 virt-type=kvm
ERROR model f66bb9f7-fef3-4f7b-82dd-c22a13388b8a has been removed

https://openstack-ci-reports.ubuntu.com/artifacts/test_charm_pipeline_amulet_full/openstack/charm-neutron-api/482995/7/365/consoleText.test_charm_single_1264.txt

I also destroyed the controllers and re-bootstrapped with 2.2.2, and still see the same behavior intermittently. That confirms the issue is not specific to an upgraded controller.

Changed in juju:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Nicholas Skaggs (nskaggs) wrote :

I believe this to be fallout from our server-side rate-limiting. A large deployment on teardown is likely to be hit by this. It seems it's being overly aggressive, and our fix https://github.com/juju/juju/pull/7617 perhaps isn't completely solving the issue. Possibly related https://github.com/juju/juju/pull/7613.

Revision history for this message
Andrew Wilkins (axwalk) wrote :

Nicholas, what led you to that conclusion? I can't see a link. Looks to me like it's an issue on the controller, and not between the client and the controller, or agents and controller. Maybe I'm missing something?

It's *possible* that this was caused by a bug in the "undertaker" worker -- the worker that transitions models from Dying to Dead, destroys environ cloud resources, and then finally removes the model.

The undertaker is ignoring all errors when attempting to transition from Dying to Dead. See: https://github.com/juju/juju/blob/develop/worker/undertaker/undertaker.go#L172; note the TODO in particular. It's possible for the ProcessDyingModel call to fail, and for there to be no further changes to model entity references to trigger it to reattempt.

I'm still looking for a smoking gun, but in the mean time I'll look at fixing that issue.
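
For illustration only, here is a minimal Go sketch of the failure mode described above. It is not the actual Juju undertaker code; processDyingModel and the retry loop are hypothetical stand-ins, with only the ProcessDyingModel name taken from the comment above. The point is that silently dropping the error from the Dying-to-Dead transition can leave a model stuck in Dying if nothing ever re-triggers the worker, whereas surfacing the error lets a retrying caller eventually finish the transition.

// Hedged sketch only -- not the real undertaker worker. processDyingModel is a
// hypothetical stand-in for the state call that moves a model from Dying to Dead.
package main

import (
	"errors"
	"fmt"
)

var errMachinesRemain = errors.New("model still has machines")

// processDyingModel fails until its preconditions are met (here: on the
// first two attempts), just as the real transition can legitimately fail.
func processDyingModel(attempt int) error {
	if attempt < 2 {
		return errMachinesRemain
	}
	return nil
}

func main() {
	// Buggy pattern: the error is ignored, so if no further change ever
	// re-triggers the worker, the model stays Dying forever.
	if err := processDyingModel(0); err != nil {
		_ = err // dropped on the floor, nothing will retry
	}

	// Safer pattern: report the error to a retrying caller so the
	// Dying-to-Dead transition is eventually completed.
	for attempt := 0; ; attempt++ {
		if err := processDyingModel(attempt); err != nil {
			fmt.Println("retrying after:", err)
			continue
		}
		fmt.Println("model marked Dead")
		break
	}
}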

Revision history for this message
Nicholas Skaggs (nskaggs) wrote :

Andrew, I wasn't able to substantiate my claim after digging in. 2.2.2 didn't contain the PR I thought it would have. Anyway, thanks for having a look. I think my original theory is bunk.

Revision history for this message
Andrew Wilkins (axwalk) wrote :

This fixes the undertaker issue: https://github.com/juju/juju/pull/7761
(but again, may not fix this bug)

Ian Booth (wallyworld)
Changed in juju:
milestone: none → 2.2.3
assignee: nobody → Andrew Wilkins (axwalk)
status: Triaged → In Progress
Ian Booth (wallyworld)
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released
Ante Karamatić (ivoks)
tags: added: cpe-onsite
removed: cpec
Revision history for this message
James Page (james-page) wrote :

Marking the charm-test-infra bug task as Fix Released as well, since we are using 2.2.4 in CI now.

Changed in charm-test-infra:
status: New → Fix Released