read: connection reset by peer during action

Bug #1899657 reported by Marian Gasparovic
This bug affects 3 people
Affects:     Canonical Juju
Status:      Fix Released
Importance:  High
Assigned to: Unassigned
Milestone:   2.9.1

Bug Description

While testing openstack-charms-next, the ceph-mon charm could not communicate with Juju while executing a ceph-mon action.
This is the first time we have seen this.

2020-10-13 14:48:06 DEBUG juju.worker.dependency engine.go:598 "hook-retry-strategy" manifold worker stopped: restart immediately
2020-10-13 14:48:06 DEBUG juju.worker.uniter runlistener.go:129 juju-run listener stopping
2020-10-13 14:48:06 DEBUG juju.worker.uniter runlistener.go:148 juju-run listener stopped
2020-10-13 14:48:07 DEBUG juju.worker.dependency engine.go:564 "hook-retry-strategy" manifold worker started at 2020-10-13 14:48:07.465930028 +0000 UTC
2020-10-13 14:48:09 DEBUG juju.worker.proxyupdater proxyupdater.go:168 applying in-process juju proxy settings proxy.Settings{Http:"http://squid.internal:3128/", Https:"http://squid.internal:3128/", Ftp:"", NoProxy:"10.0.0.0/8,10.246.64.200,10.246.64.201,10.246.64.202,127.0.0.1,172.16.0.0/12,192.168.0.0/16,localhost", AutoNoProxy:""}
2020-10-13 14:48:11 DEBUG juju.worker.uniter.operation executor.go:130 executing operation "run action 48"
2020-10-13 14:48:11 DEBUG juju.worker.uniter agent.go:20 [AGENT-STATUS] executing: running action juju-run
2020-10-13 14:48:12 DEBUG juju.machinelock machinelock.go:186 machine lock released for uniter (run action 48)
2020-10-13 14:48:12 DEBUG juju.worker.uniter.operation executor.go:113 lock released
2020-10-13 14:48:12 ERROR juju.worker.uniter agent.go:31 resolver loop error: executing operation "run action 48": read tcp 10.246.64.202:36182->10.246.64.200:37017: read: connection reset by peer
2020-10-13 14:48:12 DEBUG juju.worker.uniter agent.go:20 [AGENT-STATUS] failed: resolver loop error
2020-10-13 14:48:12 DEBUG juju.worker.dependency engine.go:598 "migration-minion" manifold worker stopped: txn watcher sync error
2020-10-13 14:48:12 ERROR juju.worker.dependency engine.go:671 "migration-minion" manifold worker returned unexpected error: txn watcher sync error
2020-10-13 14:48:12 DEBUG juju.worker.dependency engine.go:673 stack trace:
txn watcher sync error
/workspace/_build/src/github.com/juju/juju/rpc/client.go:178:
/workspace/_build/src/github.com/juju/juju/api/apiclient.go:1200:
2020-10-13 14:48:12 DEBUG juju.worker.dependency engine.go:598 "log-sender" manifold worker stopped: cannot send log message: write tcp 10.246.65.65:36142->10.246.64.202:17070: write: connection reset by peer
2020-10-13 14:48:12 ERROR juju.worker.dependency engine.go:671 "log-sender" manifold worker returned unexpected error: cannot send log message: write tcp 10.246.65.65:36142->10.246.64.202:17070: write: connection reset by peer
2020-10-13 14:48:12 DEBUG juju.worker.dependency engine.go:673 stack trace:
write tcp 10.246.65.65:36142->10.246.64.202:17070: write: connection reset by peer
/workspace/_build/src/github.com/juju/juju/api/client.go:683:
/workspace/_build/src/github.com/juju/juju/api/logsender/logsender.go:75: cannot send log message
/workspace/_build/src/github.com/juju/juju/worker/logsender/worker.go:69:
2020-10-13 14:48:12 DEBUG juju.worker.leadership tracker.go:227 ceph-mon/0 waiting for ceph-mon leadership release gave err: error blocking on leadership release: lease manager stopped
2020-10-13 14:48:12 DEBUG juju.worker.dependency engine.go:598 "leadership-tracker" manifold worker stopped: error while ceph-mon/0 waiting for ceph-mon leadership release: error blocking on leadership release: lease manager stopped
2020-10-13 14:48:12 ERROR juju.worker.dependency engine.go:671 "leadership-tracker" manifold worker returned unexpected error: error while ceph-mon/0 waiting for ceph-mon leadership release: error blocking on leadership release: lease manager stopped
2020-10-13 14:48:12 DEBUG juju.worker.dependency engine.go:673 stack trace:
lease manager stopped
/workspace/_build/src/github.com/juju/juju/api/leadership/client.go:61: error blocking on leadership release
/workspace/_build/src/github.com/juju/juju/worker/leadership/tracker.go:140: error while ceph-mon/0 waiting for ceph-mon leadership release
2020-10-13 14:48:12 DEBUG juju.worker.dependency engine.go:598 "hook-retry-strategy" manifold worker stopped: txn watcher sync error
2020-10-13 14:48:12 ERROR juju.worker.dependency engine.go:671 "hook-retry-strategy" manifold worker returned unexpected error: txn watcher sync error
2020-10-13 14:48:12 DEBUG juju.worker.dependency engine.go:673 stack trace:
txn watcher sync error
/workspace/_build/src/github.com/juju/juju/rpc/client.go:178:
/workspace/_build/src/github.com/juju/juju/api/apiclient.go:1200:
2020-10-13 14:48:12 DEBUG juju.worker.dependency engine.go:598 "meter-status" manifold worker stopped: txn watcher sync error
2020-10-13 14:48:12 ERROR juju.worker.dependency engine.go:671 "meter-status" manifold worker returned unexpected error: txn watcher sync error
2020-10-13 14:48:12 DEBUG juju.worker.dependency engine.go:673 stack trace:
txn watcher sync error
/workspace/_build/src/github.com/juju/juju/rpc/client.go:178:
/workspace/_build/src/github.com/juju/juju/api/apiclient.go:1200:
2020-10-13 14:48:12 DEBUG juju.worker.dependency engine.go:598 "api-address-updater" manifold worker stopped: txn watcher sync error
2020-10-13 14:48:12 ERROR juju.worker.dependency engine.go:671 "api-address-updater" manifold worker returned unexpected error: txn watcher sync error
2020-10-13 14:48:12 DEBUG juju.worker.dependency engine.go:673 stack trace:
txn watcher sync error
/workspace/_build/src/github.com/juju/juju/rpc/client.go:178:
/workspace/_build/src/github.com/juju/juju/api/apiclient.go:1200:
2020-10-13 14:48:12 DEBUG juju.worker.dependency engine.go:598 "upgrader" manifold worker stopped: txn watcher sync error
2020-10-13 14:48:12 ERROR juju.worker.dependency engine.go:671 "upgrader" manifold worker returned unexpected error: txn watcher sync error

Logs and crashdumps at https://oil-jenkins.canonical.com/artifacts/f8e0ec98-ccf6-4ecc-afb4-86e7ca491528/index.html

Tags: cdo-qa
Pen Gale (pengale) wrote:

This looks like it might be a transient networking issue. Juju should probably be handling it better, though.

Work here involves digging through the logs to figure out what happened, and determining whether a retry or better error messaging would be appropriate.

Changed in juju:
status: New → Triaged
importance: Undecided → Medium
Marian Gasparovic (marosg) wrote:

We had 4 more hits on this during the weekend.

Michael Skalka (mskalka) wrote:

Given the history of this bug [0] and the disruption it is causing to release and day-to-day testing, I am subscribing it to field-high.

0. https://solutions.qa.canonical.com/bugs/bugs/bug/1853047

Pen Gale (pengale)
Changed in juju:
importance: Medium → High
milestone: none → 2.8-next
John A Meinel (jameinel) wrote:

2020-10-13 14:48:12 ERROR juju.worker.dependency engine.go:671 "migration-minion" manifold worker returned unexpected error: txn watcher sync error
2020-10-13 14:48:12 DEBUG juju.worker.dependency engine.go:673 stack trace:
txn watcher sync error

^- This is a rather serious error: while Juju is trying to read the active transactions going on in Mongo, it gets disconnected. This sort of error generally causes a full restart of everything going on in Juju (all the workers get restarted, because Juju has lost track of what is going on in the model at that point).

While "txn watcher sync" gets all the noise in the logs, somewhere else there should be a separate log line indicating the underlying error. This might be something like EOF (which just indicates that Mongo hung up on us).

  added, err := w.sync()
  if err != nil {
          // Notify subscribers that the sync failed, then propagate the
          // traced underlying error up to the dependency engine.
          w.hub.Publish(txnWatcherSyncErr, nil)
          return errors.Trace(err)
  }

That should be in the controller logs (vs the unit agent logs).
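
A note on reading those "stack trace:" lines: github.com/juju/errors keeps the original cause while adding file:line annotations each time the error is wrapped, so the controller log should still show the low-level error (for example EOF). A minimal, self-contained sketch, not Juju's code, with io.EOF standing in for Mongo hanging up:

package main

import (
        "fmt"
        "io"

        "github.com/juju/errors"
)

// sync stands in for the txn watcher's sync step failing.
func sync() error {
        return errors.Trace(io.EOF)
}

func main() {
        err := errors.Annotate(sync(), "txn watcher sync error")
        fmt.Println(errors.Cause(err))      // the underlying error: "EOF"
        fmt.Println(errors.ErrorStack(err)) // file:line annotated trace, like the log excerpts above
}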

John A Meinel (jameinel) wrote:

Looking at one of the crashdumps, I see:
2020-10-18 15:49:35 WARNING juju.mongo open.go:166 TLS handshake failed: read tcp 10.246.64.200:39736->10.246.64.200:37017: read: connection reset by peer
2020-10-18 15:49:35 WARNING juju.mongo open.go:166 TLS handshake failed: read tcp 10.246.64.200:39740->10.246.64.200:37017: read: connection reset by peer
2020-10-18 15:49:35 INFO juju.state.pool.txnwatcher txnwatcher.go:163 watcher loop failed: watcher iteration error: Closed explicitly

That error seems to come from:
<email address hidden>+incompatible/socket.go

// Close terminates the socket use.
func (socket *mongoSocket) Close() {
        socket.kill(errors.New("Closed explicitly"), false)
}

It is unclear why we would be closing the socket that we are using for queries. It feels like a case where we might be reusing a Mongo connection for multiple things when we should be using it only for this watcher.
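
To illustrate the "dedicated connection for this watcher" idea: a sketch, not Juju's code, assuming the gopkg.in/mgo.v2 driver, with placeholder database, collection, and address names. Copying the session gives the tailing iterator its own socket, so another component closing the shared session no longer shows up here as "Closed explicitly":

package main

import (
        "log"

        mgo "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/bson"
)

// watchTxns tails a (capped) transaction log on its own copied session, so a
// Close() on the root session elsewhere cannot kill the socket it iterates on.
func watchTxns(root *mgo.Session) error {
        session := root.Copy()
        defer session.Close()

        // Placeholder names; the real watcher tails Juju's capped txn log.
        coll := session.DB("juju").C("txns.log")
        iter := coll.Find(nil).Sort("$natural").Tail(-1)
        defer iter.Close()

        var doc bson.M
        for iter.Next(&doc) {
                log.Printf("txn event: %v", doc["_id"])
        }
        return iter.Err()
}

func main() {
        root, err := mgo.Dial("localhost:37017") // illustrative address
        if err != nil {
                log.Fatal(err)
        }
        defer root.Close()
        if err := watchTxns(root); err != nil {
                log.Fatal(err)
        }
}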

Pen Gale (pengale)
Changed in juju:
milestone: 2.8-next → 2.8.7
Pen Gale (pengale) wrote:

This should be addressed by the other txn watcher changes.

Changed in juju:
status: Triaged → Fix Committed
status: Fix Committed → In Progress
Pen Gale (pengale) wrote:

... though the above hasn't been ported to 2.8 just yet.

Ian Booth (wallyworld) wrote:

We were reluctant to make such a low-level change in a 2.8 point release.
We can definitely backport it once 2.9.0 is released and we have confidence that the fix works well there.

Changed in juju:
assignee: nobody → Ian Booth (wallyworld)
Ian Booth (wallyworld)
Changed in juju:
status: In Progress → Triaged
assignee: Ian Booth (wallyworld) → nobody
Changed in juju:
milestone: 2.8.7 → 2.8.8
Changed in juju:
milestone: 2.8.8 → 2.8.9
Changed in juju:
milestone: 2.8.9 → 2.8.10
Changed in juju:
milestone: 2.8.10 → 2.8.11
John A Meinel (jameinel) wrote:

This was fixed in the 2.9 beta stages by better handling of the txn watcher sync failures.
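
The specific patch is not linked here (see the question further down), but the shape of "better handling" described would be to retry a failed sync a few times before treating it as fatal and bouncing every worker. A minimal sketch under that assumption; syncOnce and isTransient are hypothetical stand-ins, not Juju APIs:

package main

import (
        "fmt"
        "io"
        "time"
)

// syncWithRetry retries a failing sync with exponential backoff instead of
// immediately returning a fatal error to the dependency engine.
func syncWithRetry(syncOnce func() error, isTransient func(error) bool) error {
        delay := 100 * time.Millisecond
        var err error
        for attempt := 0; attempt < 3; attempt++ {
                if err = syncOnce(); err == nil {
                        return nil
                }
                if !isTransient(err) {
                        return err
                }
                time.Sleep(delay)
                delay *= 2
        }
        return err
}

func main() {
        calls := 0
        err := syncWithRetry(
                func() error { // hypothetical sync: fails twice, then succeeds
                        calls++
                        if calls < 3 {
                                return io.EOF
                        }
                        return nil
                },
                func(err error) bool { return err == io.EOF },
        )
        fmt.Println("calls:", calls, "err:", err)
}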

Changed in juju:
milestone: 2.8.11 → 2.9.1
status: Triaged → Fix Released
Heitor (heitorpbittencourt) wrote:

Could this be backported to 2.8.x? This is hitting us hard.

John A Meinel (jameinel) wrote: Re: [Bug 1899657] Re: read: connection reset by peer during action

Is there a reason you cannot upgrade to 2.9? Any release on 2.8 is still
going to require that you do an upgrade anyway.

Can you also confirm where/when you are encountering this?

Heitor (heitorpbittencourt) wrote:

We don't know how to proceed with the 2.9 upgrade.

The environment in question uses jimm/candid and has HA controllers for both vSphere and MAAS clouds, as well as a top-level controller stack which houses all of the controllers for the respective clouds, plus candid, monitoring infrastructure, and other top-level infrastructure. We are unsure about upgrading controllers in this environment.

Do you think we are OK to let it rip?

Heitor (heitorpbittencourt) wrote:

We initially opened bug https://bugs.launchpad.net/juju/+bug/1930899, but we are not sure which bug (this one or that one) is what we are hitting on our infrastructure.

We have a support ticket open, number 00312211.

mengalong (mengalong) wrote:

@John A Meinel (jameinel), where can I view the patch for this bug? I haven't found it.
