i/o timeout errors can cause non-atomic service deploys

Bug #1486553 reported by Данило Шеган
This bug affects 2 people.

Affects          Status        Importance  Assigned to  Milestone
Canonical Juju   Fix Released  High        Nate Finch   -
juju-core        Fix Released  High        Nate Finch   -
juju-core 1.24   Fix Released  Critical    Nate Finch   -
juju-core 1.25   Fix Released  High        Nate Finch   -

Bug Description

We've recently started seeing "i/o timeout" errors when issuing serviceDeploy API calls from Landscape:

EDIT
##########
To clarify:

landscape ---------> |17070:state server -> mongo:37017|

The timeout is happening inside the juju state server box, between the API server and mongo.

It's NOT happening between landscape and juju. It's inside juju.
##########

Aug 7 20:45:06 job-handler-1 INFO Traceback (failure with no frames): <class 'canonical.juju.errors.RequestError'>: cannot add service "mongodb": read tcp 127.0.0.1:37017: i/o timeout
...
Aug 7 20:45:06 job-handler-1 INFO Traceback (failure with no frames): <class 'canonical.juju.errors.RequestError'>: cannot add service "hacluster-mysql": read tcp 192.168.216.21:37017: i/o timeout
...

On retry, we hit the "service already exists" error, so the service has actually been deployed.

However, we've recently also hit a case where the service was only partially deployed:

Aug 17 16:22:18 job-handler-1 INFO Traceback (failure with no frames): <class 'canonical.juju.errors.RequestError'>: cannot add service "ceph-radosgw": read tcp 127.0.0.1:37017: i/o timeout

We emit a service deploy call together with the full service configuration, but no service configuration was applied in this case. Service deploy and configuration appear not to be atomic: we'd expect the call either to fail entirely (so we don't hit the service-already-exists error later) or to succeed *fully* (including configuration), even if it fails to report back to the client due to network issues.

See bug 1482791 for reference.
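
To illustrate the failure mode, the defensive pattern it currently forces on clients looks roughly like this Go sketch; the Client interface and the error-string matching are illustrative assumptions, not the real Landscape or Juju client API:

    // Hypothetical sketch: not the real Landscape or Juju client API.
    package deployfix

    import "strings"

    type Client interface {
        Deploy(service string, cfg map[string]string) error
        SetConfig(service string, cfg map[string]string) error
    }

    // deployEnsuringConfig treats "i/o timeout" and "already exists" as
    // ambiguous outcomes: the service may exist with partial or no
    // configuration, so the settings are re-applied explicitly.
    func deployEnsuringConfig(c Client, service string, cfg map[string]string) error {
        if err := c.Deploy(service, cfg); err != nil {
            msg := err.Error()
            if !strings.Contains(msg, "i/o timeout") &&
                !strings.Contains(msg, "already exists") {
                return err
            }
        }
        return c.SetConfig(service, cfg)
    }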

Revision history for this message
Dimiter Naydenov (dimitern) wrote :

Indeed, it is not atomic: it can fail midway for various reasons, such as "charm not found" (which leaves the service existing but "empty").

Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 1.26.0
Revision history for this message
Curtis Hovey (sinzui) wrote :

1.22.8 and 1.22.6 were tested and failed.

tags: added: cisco landscape
Changed in juju-core:
milestone: 1.26.0 → none
Revision history for this message
Dimiter Naydenov (dimitern) wrote :

Dropped the milestone as it was not set correctly: there are no plans (that I know of) to fix it by the 1.26.0 release.
Consider that to fix the issue properly we need a (cooperating) persistency layer with reliable support for transactions across multiple collections, including the ability to roll back such transactions.

Revision history for this message
Free Ekanayaka (free.ekanayaka) wrote :

@Dimiter: out of curiosity, I guess the "cooperating persistency layer" you are alluding to is the juju/state package?

Because the underlying technologies (mongodb + mgo/txn) do seem to have reliable support for transactions across multiple collections, afaiu. The problem seems to be a "factoring" one in juju/state, which afaics basically ends up doing:

RunTransaction(<ADD SERVICE OPS>)
RunTransaction(<ATTACH FETCHED CHARM TO SERVICE OPS>)
RunTransaction(<SET SERVICE CONFIG OPS>)

as opposed to:

RunTransaction(
    <ADD SERVICE OPS> + <ATTACH FETCHED CHARM TO SERVICE OPS> + <SET SERVICE CONFIG OPS>)

Is this the case?
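
For reference, a minimal mgo/txn sketch of that combined single-transaction shape might look like the following; the collection names and document fields are assumptions for illustration, not Juju's actual state schema:

    // Minimal sketch of a single multi-collection transaction with
    // mgo/txn; collection names and fields are illustrative only.
    package deploytxn

    import (
        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/bson"
        "gopkg.in/mgo.v2/txn"
    )

    func deployOps(service, charmURL string, settings bson.M) []txn.Op {
        return []txn.Op{{
            C:      "services",
            Id:     service,
            Assert: txn.DocMissing, // abort cleanly if it already exists
            Insert: bson.M{"charmurl": charmURL},
        }, {
            C:      "settings",
            Id:     service,
            Assert: txn.DocMissing,
            Insert: settings,
        }}
    }

    // runDeploy queues all ops as one transaction: if any assertion
    // fails, none of the inserts are applied.
    func runDeploy(db *mgo.Database, ops []txn.Op) error {
        return txn.NewRunner(db.C("txns")).Run(ops, "", nil)
    }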

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: none → 1.26.0
Revision history for this message
Nate Finch (natefinch) wrote :

I'm looking into this one now.

Changed in juju-core:
assignee: nobody → Nate Finch (natefinch)
Nate Finch (natefinch)
Changed in juju-core:
status: Triaged → In Progress
description: updated
Revision history for this message
Nate Finch (natefinch) wrote :

Note that users have sometimes seen the timeout cause the settings supplied at deploy time never to be applied at all.

Changed in juju-core:
status: In Progress → Triaged
assignee: Nate Finch (natefinch) → nobody
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

It's worth noting that MongoDB i/o timeout errors always happen after ensure-availability has been issued. As each new member of the replicaset comes up, MongoDB breaks all existing connections, causing these errors. Juju should handle this better.

Could that be what's happening here? Was the replicaset coming up when the deploy was requested?
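
If that were the cause, one mitigation on the Juju side would be a retry wrapper around its mongo calls, roughly like this sketch (an assumption, not existing Juju code):

    // Rough sketch, not Juju code: retry a mongo operation when the
    // connection is torn down, e.g. during replica set changes.
    package retrysketch

    import (
        "strings"
        "time"

        "gopkg.in/mgo.v2"
    )

    func isIOTimeout(err error) bool {
        return err != nil && strings.Contains(err.Error(), "i/o timeout")
    }

    func runWithRetry(session *mgo.Session, op func() error) error {
        var err error
        for attempt := 0; attempt < 3; attempt++ {
            if err = op(); err == nil || !isIOTimeout(err) {
                return err
            }
            // mgo keeps the dead socket on the session after a broken
            // connection; Refresh discards it so the next attempt can
            // reconnect to the (possibly new) primary.
            session.Refresh()
            time.Sleep(time.Second)
        }
        return err
    }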

Revision history for this message
David Britton (dpb) wrote : Re: [Bug 1486553] [NEW] i/o timeout errors can cause non-atomic service deploys

On Sunday, September 6, 2015, Menno Smits <email address hidden> wrote:
>
> Could that be what's happening here? Was the replicaset coming up when
> the deploy was requested?
>
>
In this case, no. Non-HA Juju.

--
David Britton <email address hidden>

Revision history for this message
Nate Finch (natefinch) wrote :

FYI, I have a change that adds service settings to the same transaction as the service creation, so there's no possible failure between the two. I'm now working on getting unit creation into the same transaction. Assigning units to machines will be done asynchronously by a worker, to ensure retries in the case of timeouts etc.
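
The asynchronous part could be a worker along these (heavily simplified, assumed) lines, retrying each pending assignment until it sticks:

    // Heavily simplified sketch, not Juju's actual worker: pending
    // unit-to-machine assignments are retried until they succeed, so a
    // transient i/o timeout cannot drop them.
    package assignsketch

    import "time"

    func assignLoop(pending <-chan string, assign func(unit string) error) {
        for unit := range pending {
            for {
                if err := assign(unit); err == nil {
                    break
                }
                // Transient failure (e.g. mongo i/o timeout): back off
                // and retry the same unit.
                time.Sleep(5 * time.Second)
            }
        }
    }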

Revision history for this message
Free Ekanayaka (free.ekanayaka) wrote :

Great to see service creation and config made atomic. FWIW, Landscape sets NumUnits to 0 when deploying a service (and then adds units later with a separate API call), so atomicity of the unit creation part is not critical for our use case.

Nate Finch (natefinch)
Changed in juju-core:
assignee: nobody → Nate Finch (natefinch)
status: Triaged → In Progress
Revision history for this message
Nate Finch (natefinch) wrote :

The first half of this bug is landing in 1.24 today (the part about making sure the settings get applied to the service atomically with service deployment).

Revision history for this message
Nate Finch (natefinch) wrote :

And this has now landed in 1.24.

Revision history for this message
Nate Finch (natefinch) wrote :

The second half of this bug - making unit assignment atomic(ish) is being tracked here: https://bugs.launchpad.net/juju-core/+bug/1497312

David Britton (dpb)
tags: added: kanban-cross-team
tags: removed: kanban-cross-team
Revision history for this message
Cheryl Jennings (cherylj) wrote :

Nate - has this landed in master?

Changed in juju-core:
milestone: 1.26.0 → 2.0-alpha1
Nate Finch (natefinch)
Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
affects: juju-core → juju
Changed in juju:
milestone: 2.0-alpha1 → none
milestone: none → 2.0-alpha1
Changed in juju-core:
assignee: nobody → Nate Finch (natefinch)
importance: Undecided → High
status: New → Fix Released