i/o timeout errors can cause non-atomic service deploys
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| juju | | High | Nate Finch | |
| juju-core | | High | Nate Finch | |
| 1.24 | | Critical | Nate Finch | |
| 1.25 | | High | Nate Finch | |
Bug Description
We've recently started seeing "i/o timeout errors" when issuing serviceDeploy API calls from Landscape:
EDIT
##########
To clarify:
landscape ---------> |17070:state server -> mongo:37017|
The timeout is happening inside the juju state server box, between the api and mongo.
It's NOT happening between landscape and juju. It's inside juju.
##########
Aug 7 20:45:06 job-handler-1 INFO Traceback (failure with no frames): <class 'canonical.
...
Aug 7 20:45:06 job-handler-1 INFO Traceback (failure with no frames): <class 'canonical.
...
On retry, we hit the "service already exists" error, so the service has actually been deployed.
However, recently, we've also hit a case where the service was only partially deployed:
Aug 17 16:22:18 job-handler-1 INFO Traceback (failure with no frames): <class 'canonical.
service "ceph-radosgw": read tcp 127.0.0.1:37017: i/o timeout
We issue a service deploy call together with the full service configuration, but in this case no service configuration was applied. It seems that service deploy and configuration are not atomic: we'd expect the call to either fail entirely (so we don't hit "service already exists" on retry) or fully succeed, including configuration.
See bug 1482791 for reference.
| Dimiter Naydenov (dimitern) wrote : | #1 |
| Changed in juju-core: | |
| status: | New → Triaged |
| importance: | Undecided → High |
| milestone: | none → 1.26.0 |
| Curtis Hovey (sinzui) wrote : | #2 |
1.22.8 and 1.22.6 were tested and failed.
| tags: | added: cisco landscape |
| Changed in juju-core: | |
| milestone: | 1.26.0 → none |
| Dimiter Naydenov (dimitern) wrote : | #3 |
Dropped the milestone as it was not set correctly: there are no plans (that I know of) to fix it by the 1.26.0 release.
Consider that to fix the issue properly we need to have a (cooperating) persistency layer with reliable support for transactions across multiple collections, including the ability to roll back such transactions.
| Free Ekanayaka (free.ekanayaka) wrote : | #4 |
@Dimiter: out of curiosity, I guess the "cooperating persistency layer" you are alluding to is the juju/state package?
Because the underlying technologies (mongodb + mgo/txn) do seem to have reliable support for transactions across multiple collections, afaiu. The problem seems to be a "factoring" one in juju/state, which afaics basically ends up doing:
RunTransaction(<ADD SERVICE OPS>)
RunTransaction(<ATTACH FETCHED CHARM TO SERVICE OPS>)
RunTransaction(<SET SERVICE CONFIG OPS>)
as opposed to:
RunTransaction(
<ADD SERVICE OPS> + <ATTACH FETCHED CHARM TO SERVICE OPS> + <SET SERVICE CONFIG OPS>)
Is this the case?
| Changed in juju-core: | |
| milestone: | none → 1.26.0 |
| Nate Finch (natefinch) wrote : | #5 |
I'm looking into this one now.
| Changed in juju-core: | |
| assignee: | nobody → Nate Finch (natefinch) |
| Changed in juju-core: | |
| status: | Triaged → In Progress |
| description: | updated |
| Nate Finch (natefinch) wrote : | #6 |
Note that users have sometimes experienced the timeout causing the service's settings set at deploy time to not actually get set.
| Changed in juju-core: | |
| status: | In Progress → Triaged |
| assignee: | Nate Finch (natefinch) → nobody |
| Menno Finlay-Smits (menno.smits) wrote : | #7 |
It's worth noting that MongoDB i/o timeout errors always happen after an ensure-availability command has been issued. As each new member of the replica set comes up, MongoDB breaks all connections to it, causing these errors. Juju should handle this better.
Could that be what's happening here? Was the replicaset coming up when the deploy was requested?
| David Britton (davidpbritton) wrote : Re: [Bug 1486553] [NEW] i/o timeout errors can cause non-atomic service deploys | #8 |
On Sunday, September 6, 2015, Menno Smits <email address hidden> wrote:
>
> Could that be what's happening here? Was the replicaset coming up when
> the deploy was requested?
>
>
In this case, no. Non HA juju.
--
David Britton <email address hidden>
| Nate Finch (natefinch) wrote : | #9 |
FYI, I have a change that adds service settings to the same transaction as the service creation, so there's no possible failure in between the two. Now working on getting unit creation into the same transaction. Assigning units to machines is going to be done asynchronously by a worker to ensure retries in the case of timeouts etc.
| Free Ekanayaka (free.ekanayaka) wrote : | #10 |
Great to see service creation and config made atomic. FWIW Landscape sets NumUnits to 0 when deploying a service (and then adds units later with a separate API call), so atomicity of the unit creation part is not critical for our use case.
| Changed in juju-core: | |
| assignee: | nobody → Nate Finch (natefinch) |
| status: | Triaged → In Progress |
| Nate Finch (natefinch) wrote : | #11 |
The first half of this bug is landing in 1.24 today (the part about making sure the settings get applied to the service atomically with service deployment).
| Nate Finch (natefinch) wrote : | #12 |
And this has now landed in 1.24.
| Nate Finch (natefinch) wrote : | #13 |
The second half of this bug - making unit assignment atomic(ish) is being tracked here: https:/
| tags: | added: kanban-cross-team |
| tags: | removed: kanban-cross-team |
| Cheryl Jennings (cherylj) wrote : | #14 |
Nate - has this landed in master?
| Changed in juju-core: | |
| milestone: | 1.26.0 → 2.0-alpha1 |
| Changed in juju-core: | |
| status: | In Progress → Fix Committed |
| Changed in juju-core: | |
| status: | Fix Committed → Fix Released |
| affects: | juju-core → juju |
| Changed in juju: | |
| milestone: | 2.0-alpha1 → none |
| milestone: | none → 2.0-alpha1 |
| Changed in juju-core: | |
| assignee: | nobody → Nate Finch (natefinch) |
| importance: | Undecided → High |
| status: | New → Fix Released |
It is not atomic indeed - it can fail mid-way for various reasons, like "charm not found" (which leaves the service "empty" but existing).