i/o timeout errors can cause non-atomic service deploys

Bug #1486553 reported by Данило Шеган
This bug affects 2 people.

Affects          Status        Importance  Assigned to  Milestone
Canonical Juju   Fix Released  High        Nate Finch   -
juju-core        Fix Released  High        Nate Finch   -
juju-core 1.24   Fix Released  Critical    Nate Finch   -
juju-core 1.25   Fix Released  High        Nate Finch   -

Bug Description

We've recently started seeing "i/o timeout" errors when issuing serviceDeploy API calls from Landscape:

EDIT
##########
To clarify:

landscape ---------> |17070:state server -> mongo:37017|

The timeout is happening inside the juju state server box, between the API server and mongo.

It's NOT happening between landscape and juju. It's inside juju.
##########

Aug 7 20:45:06 job-handler-1 INFO Traceback (failure with no frames): <class 'canonical.juju.errors.RequestError'>: cannot add service "mongodb": read tcp 127.0.0.1:37017: i/o timeout
...
Aug 7 20:45:06 job-handler-1 INFO Traceback (failure with no frames): <class 'canonical.juju.errors.RequestError'>: cannot add service "hacluster-mysql": read tcp 192.168.216.21:37017: i/o timeout
...

On retry, we hit the "service already exists" error, so the service has actually been deployed.

However, we've recently also hit a case where the service was only partially deployed:

Aug 17 16:22:18 job-handler-1 INFO Traceback (failure with no frames): <class 'canonical.juju.errors.RequestError'>: cannot add service "ceph-radosgw": read tcp 127.0.0.1:37017: i/o timeout

We emit a service deploy call together with the full service configuration, but no service configuration was applied in this case. Service deploy and configuration appear not to be atomic: we'd expect the call either to fail entirely (so we don't hit the service-already-exists error later) or to succeed *fully* (including configuration), even if it fails to report back to the client due to network issues.

See bug 1482791 for reference.
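
To illustrate the failure mode, the defensive pattern it currently forces on clients looks roughly like this Go sketch; the Client interface and the error-string matching are illustrative assumptions, not the real Landscape or Juju client API:

    // Hypothetical sketch: not the real Landscape or Juju client API.
    package deployfix

    import "strings"

    type Client interface {
        Deploy(service string, cfg map[string]string) error
        SetConfig(service string, cfg map[string]string) error
    }

    // deployEnsuringConfig treats "i/o timeout" and "already exists" as
    // ambiguous outcomes: the service may exist with partial or no
    // configuration, so the settings are re-applied explicitly.
    func deployEnsuringConfig(c Client, service string, cfg map[string]string) error {
        if err := c.Deploy(service, cfg); err != nil {
            msg := err.Error()
            if !strings.Contains(msg, "i/o timeout") &&
                !strings.Contains(msg, "already exists") {
                return err
            }
        }
        return c.SetConfig(service, cfg)
    }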

Revision history for this message
Dimiter Naydenov (dimitern) wrote :

Indeed, it is not atomic: it can fail midway for various reasons, such as "charm not found" (which leaves the service existing but "empty").

Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 1.26.0
Revision history for this message
Curtis Hovey (sinzui) wrote :

1.22.8 and 1.22.6 were tested and failed.

tags: added: cisco landscape
Changed in juju-core:
milestone: 1.26.0 → none
Revision history for this message
Dimiter Naydenov (dimitern) wrote :

Dropped the milestone as it was not set correctly: there are no plans (that I know of) to fix it by the 1.26.0 release.
Consider that to fix the issue properly we need a (cooperating) persistency layer with reliable support for transactions across multiple collections, including the ability to roll back such transactions.

Revision history for this message
Free Ekanayaka (free.ekanayaka) wrote :

@Dimiter: out of curiosity, I guess the "cooperating persistency layer" you are alluding to is the juju/state package?

Because the underlying technologies (mongodb + mgo/txn) do seem to have reliable support for transactions across multiple collections, afaiu. The problem seems to be a "factoring" one in juju/state, which afaics basically ends up doing:

RunTransaction(<ADD SERVICE OPS>)
RunTransaction(<ATTACH FETCHED CHARM TO SERVICE OPS>)
RunTransaction(<SET SERVICE CONFIG OPS>)

as opposed to:

RunTransaction(
    <ADD SERVICE OPS> + <ATTACH FETCHED CHARM TO SERVICE OPS> + <SET SERVICE CONFIG OPS>)

Is this the case?
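
For reference, a minimal mgo/txn sketch of that combined single-transaction shape might look like the following; the collection names and document fields are assumptions for illustration, not Juju's actual state schema:

    // Minimal sketch of a single multi-collection transaction with
    // mgo/txn; collection names and fields are illustrative only.
    package deploytxn

    import (
        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/bson"
        "gopkg.in/mgo.v2/txn"
    )

    func deployOps(service, charmURL string, settings bson.M) []txn.Op {
        return []txn.Op{{
            C:      "services",
            Id:     service,
            Assert: txn.DocMissing, // abort cleanly if it already exists
            Insert: bson.M{"charmurl": charmURL},
        }, {
            C:      "settings",
            Id:     service,
            Assert: txn.DocMissing,
            Insert: settings,
        }}
    }

    // runDeploy queues all ops as one transaction: if any assertion
    // fails, none of the inserts are applied.
    func runDeploy(db *mgo.Database, ops []txn.Op) error {
        return txn.NewRunner(db.C("txns")).Run(ops, "", nil)
    }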

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: none → 1.26.0
Revision history for this message
Nate Finch (natefinch) wrote :

I'm looking into this one now.

Changed in juju-core:
assignee: nobody → Nate Finch (natefinch)
Nate Finch (natefinch)
Changed in juju-core:
status: Triaged → In Progress
description: updated
Revision history for this message
Nate Finch (natefinch) wrote :

Note that users have sometimes seen the timeout cause the settings supplied at deploy time never to be applied at all.

Changed in juju-core:
status: In Progress → Triaged
assignee: Nate Finch (natefinch) → nobody
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

It's worth noting that MongoDB i/o timeout errors always happen after ensure-availability has been issued. As each new member of the replicaset comes up, MongoDB breaks all existing connections, causing these errors. Juju should handle this better.

Could that be what's happening here? Was the replicaset coming up when the deploy was requested?
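
If that were the cause, one mitigation on the Juju side would be a retry wrapper around its mongo calls, roughly like this sketch (an assumption, not existing Juju code):

    // Rough sketch, not Juju code: retry a mongo operation when the
    // connection is torn down, e.g. during replica set changes.
    package retrysketch

    import (
        "strings"
        "time"

        "gopkg.in/mgo.v2"
    )

    func isIOTimeout(err error) bool {
        return err != nil && strings.Contains(err.Error(), "i/o timeout")
    }

    func runWithRetry(session *mgo.Session, op func() error) error {
        var err error
        for attempt := 0; attempt < 3; attempt++ {
            if err = op(); err == nil || !isIOTimeout(err) {
                return err
            }
            // mgo keeps the dead socket on the session after a broken
            // connection; Refresh discards it so the next attempt can
            // reconnect to the (possibly new) primary.
            session.Refresh()
            time.Sleep(time.Second)
        }
        return err
    }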

Revision history for this message
David Britton (dpb) wrote : Re: [Bug 1486553] [NEW] i/o timeout errors can cause non-atomic service deploys

On Sunday, September 6, 2015, Menno Smits <email address hidden> wrote:
>
> Could that be what's happening here? Was the replicaset coming up when
> the deploy was requested?
>
>
In this case, no. Non-HA Juju.

--
David Britton <email address hidden>

Revision history for this message
Nate Finch (natefinch) wrote :

FYI, I have a change that adds service settings to the same transaction as the service creation, so there's no possible failure between the two. I'm now working on getting unit creation into the same transaction. Assigning units to machines will be done asynchronously by a worker, to ensure retries in the case of timeouts etc.
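
The asynchronous part could be a worker along these (heavily simplified, assumed) lines, retrying each pending assignment until it sticks:

    // Heavily simplified sketch, not Juju's actual worker: pending
    // unit-to-machine assignments are retried until they succeed, so a
    // transient i/o timeout cannot drop them.
    package assignsketch

    import "time"

    func assignLoop(pending <-chan string, assign func(unit string) error) {
        for unit := range pending {
            for {
                if err := assign(unit); err == nil {
                    break
                }
                // Transient failure (e.g. mongo i/o timeout): back off
                // and retry the same unit.
                time.Sleep(5 * time.Second)
            }
        }
    }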

Revision history for this message
Free Ekanayaka (free.ekanayaka) wrote :

Great to see service creation and config made atomic. FWIW, Landscape sets NumUnits to 0 when deploying a service (and then adds units later with a separate API call), so atomicity of the unit creation part is not critical for our use case.

Nate Finch (natefinch)
Changed in juju-core:
assignee: nobody → Nate Finch (natefinch)
status: Triaged → In Progress
Revision history for this message
Nate Finch (natefinch) wrote :

The first half of this bug is landing in 1.24 today (the part about making sure the settings get applied to the service atomically with service deployment).

Revision history for this message
Nate Finch (natefinch) wrote :

And this has now landed in 1.24.

Revision history for this message
Nate Finch (natefinch) wrote :

The second half of this bug - making unit assignment atomic(ish) is being tracked here: https://bugs.launchpad.net/juju-core/+bug/1497312

David Britton (dpb)
tags: added: kanban-cross-team
tags: removed: kanban-cross-team
Revision history for this message
Cheryl Jennings (cherylj) wrote :

Nate - has this landed in master?

Changed in juju-core:
milestone: 1.26.0 → 2.0-alpha1
Nate Finch (natefinch)
Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
affects: juju-core → juju
Changed in juju:
milestone: 2.0-alpha1 → none
milestone: none → 2.0-alpha1
Changed in juju-core:
assignee: nobody → Nate Finch (natefinch)
importance: Undecided → High
status: New → Fix Released