[2.2.2][2.2.b4] model deletion hung, unable to completely destroy the model or create a new one with the same name, errors in logsink.log

Bug #1695894 reported by Dmitrii Shcherbakov
This bug affects 2 people
Affects                     Status        Importance  Assigned to     Milestone
Canonical Juju              Fix Released  High        Andrew Wilkins
OpenStack Charm Test Infra  Fix Released  Undecided   Unassigned

Bug Description

It's hard to tell what got it into that state, but model deletion hung for a while, so I sent SIGINT (Ctrl-C) to the client and tried to delete the model again. I've noticed that sometimes the Juju client just doesn't exit on its own, even when the controller has already finished deleting the model.

Here, however, I encountered an entirely different case (which I have actually reproduced twice on this controller):

http://paste.ubuntu.com/24783149/
"...
Waiting on model to be removed...
^C
(openstack-client) ubuntu@maas:~/cpe-foundation/templates/bundles$ juju destroy-model os
WARNING! This command will destroy the "os" model.
This includes all machines, applications, data and other resources.
Continue [y/N]? y
ERROR cannot connect to API: model 5fecf1b0-93df-42d0-8b61-c454825aa720 has been removed"

logsink.log is full of errors:

root@controller-juju:/var/log/juju# tail -f logsink.log
d123bde1-c80a-487f-8922-14c5cdf4de02: machine-0 2017-06-05 12:36:32 ERROR juju.api.watcher watcher.go:86 error trying to stop watcher: read tcp 10.232.3.216:48638->10.232.3.216:37017: i/o timeout
d123bde1-c80a-487f-8922-14c5cdf4de02: machine-0 2017-06-05 12:36:32 ERROR juju.worker runner.go:392 exited "txnlog": read tcp 127.0.0.1:59074->127.0.0.1:37017: i/o timeout
d123bde1-c80a-487f-8922-14c5cdf4de02: machine-0 2017-06-05 12:36:32 ERROR juju.worker.dependency engine.go:546 "not-dead-flag" manifold worker returned unexpected error: read tcp 10.232.3.216:48638->10.232.3.216:37017: i/o timeout
d123bde1-c80a-487f-8922-14c5cdf4de02: machine-0 2017-06-05 12:36:32 ERROR juju.worker runner.go:392 exited "txnlog": read tcp 10.232.3.216:48638->10.232.3.216:37017: i/o timeout
d123bde1-c80a-487f-8922-14c5cdf4de02: machine-0 2017-06-05 12:36:32 ERROR juju.worker.dependency engine.go:546 "environ-tracker" manifold worker returned unexpected error: cannot create environ: settings not found (not found)
d123bde1-c80a-487f-8922-14c5cdf4de02: machine-0 2017-06-05 12:36:32 ERROR juju.worker.dependency engine.go:546 "environ-tracker" manifold worker returned unexpected error: cannot create environ: settings not found (not found)
d123bde1-c80a-487f-8922-14c5cdf4de02: machine-0 2017-06-05 12:36:32 ERROR juju.api.watcher watcher.go:86 error trying to stop watcher: read tcp 127.0.0.1:59074->127.0.0.1:37017: i/o timeout
d123bde1-c80a-487f-8922-14c5cdf4de02: machine-0 2017-06-05 12:36:32 ERROR juju.worker.dependency engine.go:546 "not-dead-flag" manifold worker returned unexpected error: read tcp 127.0.0.1:59074->127.0.0.1:37017: i/o timeout

10.232.3.216 is the address of the controller machine itself:

ubuntu@maas:~$ maas maas machines read | jq '.[] | select( .ip_addresses[] | contains("10.232.3.216") ) | .system_id + " " + .fqdn + " " + (.tag_names | join(","))'
"8tk7tc controller-juju.maas virtual,bootstrap,controller"

root@controller-juju:/var/log/juju# ip -4 -o a s
1: lo inet 127.0.0.1/8 scope host lo\ valid_lft forever preferred_lft forever
2: ens3 inet 10.232.3.216/21 brd 10.232.7.255 scope global ens3\ valid_lft forever preferred_lft forever

jujud seems pretty busy (for a 12-core VM):

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
 5884 root 20 0 1702468 224928 44604 S 12.3 0.7 1482:13 jujud
 5708 root 20 0 2384288 1.247g 27836 S 3.7 4.0 565:46.65 mongod

root@controller-juju:/var/log/juju# cat /proc/cpuinfo | grep pro
processor : 0
processor : 1
processor : 2
processor : 3
processor : 4
processor : 5
processor : 6
processor : 7
processor : 8
processor : 9
processor : 10
processor : 11

ubuntu@maas:~$ juju status
ERROR model 5fecf1b0-93df-42d0-8b61-c454825aa720 has been removed
ubuntu@maas:~$ juju models
Controller: samaas

Model Cloud/Region Status Machines Cores Access Last connection
controller samaas available 1 12 admin just now

description: updated
Revision history for this message
Paul Gear (paulgear) wrote : Re: [2.2.b4] model deletion hung, unable to completely destroy the model or create a new one with the same name, errors in logsink.log

I'm getting a number of similar errors on a controller I just upgraded to 2.2.2. Getting status for a hosted model hangs (http://pastebin.ubuntu.com/25160416/). mongodb on the server is running and responds to commands normally.

summary: - [2.2.b4] model deletion hanged, unable to completely destroy the model
- or create a new one with the same name, errors in logsink.log
+ [2.2.b4] model deletion hung, unable to completely destroy the model or
+ create a new one with the same name, errors in logsink.log
tags: added: canonical-is
Revision history for this message
Paul Gear (paulgear) wrote :

The above should have read: "Getting status for a hosted model hangs (mine has been running for 10 minutes without producing output), and logsink.log is filled with messages similar to the above (http://pastebin.ubuntu.com/25160416/)."

Revision history for this message
Anastasia (anastasia-macmood) wrote :

@Dmitrii Shcherbakov (dmitriis) and Paul Gear (paulgear),

We either need a reproducible scenario or access to an already broken environment that we can poke at.

It is hard to say what got your systems into this state. Obviously, from the logs, something drastic has happened... but what?

Changed in juju:
status: New → Incomplete
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

paulgear, what was the original version prior to 2.2.2?

Revision history for this message
Paul Gear (paulgear) wrote :

@anastasia-macmood: https://private-fileshare.canonical.com/~paulgear/lp1695894/ has logs; @2-xtian was looking at them a couple of days back.

@dmitriis: I upgraded from 2.1.3.

Changed in juju:
status: Incomplete → New
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :
summary: - [2.2.b4] model deletion hung, unable to completely destroy the model or
- create a new one with the same name, errors in logsink.log
+ [2.2.2][2.2.b4] model deletion hung, unable to completely destroy the
+ model or create a new one with the same name, errors in logsink.log
tags: added: cdo-qa-blocker cpec
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

In addition to #6 and #7: http://paste.ubuntu.com/25178603/ - the CLI steps performed.

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

I am getting this very reliably on 2.2.2 with HA after several days of completely idle operation (only regular update-status):

root@controller-juju:/home/ubuntu# grep ERROR /var/log/juju/logsink.log | tail -n1000
http://paste.ubuntu.com/25218947/

Revision history for this message
Ryan Beisner (1chb1n) wrote :

See related (possibly duplicate): https://bugs.launchpad.net/bugs/1709324

Revision history for this message
Ryan Beisner (1chb1n) wrote :

After upgrading from Juju 2.2.1 to 2.2.2, we started having the following issue with iterative deploys using the Juju OpenStack Provider.

"ERROR model f66bb9f7-fef3-4f7b-82dd-c22a13388b8a has been removed"

+ juju switch auto-osci-sv00
auto-osci-sv00 (controller) (no change)
+ juju switch auto-osci-sv00:auto-osci-sv00
auto-osci-sv00 (controller) -> auto-osci-sv00:admin/auto-osci-sv00
+ juju set-model-constraints -m auto-osci-sv00 virt-type=kvm
ERROR model f66bb9f7-fef3-4f7b-82dd-c22a13388b8a has been removed

https://openstack-ci-reports.ubuntu.com/artifacts/test_charm_pipeline_amulet_full/openstack/charm-neutron-api/482995/7/365/consoleText.test_charm_single_1264.txt

I also destroyed the controllers and re-bootstrapped with 2.2.2, and still see the same behavior intermittently. That confirms the issue is not specific to an upgraded controller.

Changed in juju:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Nicholas Skaggs (nskaggs) wrote :

I believe this to be fallout from our server-side rate-limiting. A large deployment on teardown is likely to be hit by this. It seems it's being overly aggressive, and our fix https://github.com/juju/juju/pull/7617 perhaps isn't completely solving the issue. Possibly related https://github.com/juju/juju/pull/7613.

Revision history for this message
Andrew Wilkins (axwalk) wrote :

Nicholas, what led you to that conclusion? I can't see a link. Looks to me like it's an issue on the controller, and not between the client and the controller, or agents and controller. Maybe I'm missing something?

It's *possible* that this was caused by a bug in the "undertaker" worker -- the worker that transitions models from Dying to Dead, destroys environ cloud resources, and then finally removes the model.

The undertaker is ignoring all errors when attempting to transition from Dying to Dead. See: https://github.com/juju/juju/blob/develop/worker/undertaker/undertaker.go#L172; note the TODO in particular. It's possible for the ProcessDyingModel call to fail, and for there to be no further changes to model entity references to trigger it to reattempt.

I'm still looking for a smoking gun, but in the mean time I'll look at fixing that issue.
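
For illustration only, here is a minimal Go sketch of the failure mode described above. It is not the actual Juju undertaker code; processDyingModel and the retry loop are hypothetical stand-ins, with only the ProcessDyingModel name taken from the comment above. The point is that silently dropping the error from the Dying-to-Dead transition can leave a model stuck in Dying if nothing ever re-triggers the worker, whereas surfacing the error lets a retrying caller eventually finish the transition.

// Hedged sketch only -- not the real undertaker worker. processDyingModel is a
// hypothetical stand-in for the state call that moves a model from Dying to Dead.
package main

import (
	"errors"
	"fmt"
)

var errMachinesRemain = errors.New("model still has machines")

// processDyingModel fails until its preconditions are met (here: on the
// first two attempts), just as the real transition can legitimately fail.
func processDyingModel(attempt int) error {
	if attempt < 2 {
		return errMachinesRemain
	}
	return nil
}

func main() {
	// Buggy pattern: the error is ignored, so if no further change ever
	// re-triggers the worker, the model stays Dying forever.
	if err := processDyingModel(0); err != nil {
		_ = err // dropped on the floor, nothing will retry
	}

	// Safer pattern: report the error to a retrying caller so the
	// Dying-to-Dead transition is eventually completed.
	for attempt := 0; ; attempt++ {
		if err := processDyingModel(attempt); err != nil {
			fmt.Println("retrying after:", err)
			continue
		}
		fmt.Println("model marked Dead")
		break
	}
}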

Revision history for this message
Nicholas Skaggs (nskaggs) wrote :

Andrew, I wasn't able to substantiate my claim after digging in. 2.2.2 didn't contain the PR I thought it would have. Anyway, thanks for having a look. I think my original theory is bunk.

Revision history for this message
Andrew Wilkins (axwalk) wrote :

This fixes the undertaker issue: https://github.com/juju/juju/pull/7761
(but again, may not fix this bug)

Ian Booth (wallyworld)
Changed in juju:
milestone: none → 2.2.3
assignee: nobody → Andrew Wilkins (axwalk)
status: Triaged → In Progress
Ian Booth (wallyworld)
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released
Ante Karamatić (ivoks)
tags: added: cpe-onsite
removed: cpec
Revision history for this message
James Page (james-page) wrote :

Marking the charm-test-infra bug task as Fix Released as well, since we are using 2.2.4 in CI now.

Changed in charm-test-infra:
status: New → Fix Released