[RFE] Upgrade controllers with no API downtime

Bug #1566520 reported by Ihar Hrachyshka
This bug affects 2 people
Affects Status Importance Assigned to Milestone
neutron
Won't Fix
Wishlist
Unassigned

Bug Description

Currently pretty much every major upgrade requires a full shutdown of all neutron-server instances while the upgrade process is running. The downtime is due to the need to run alembic scripts that modify the schema and transform data. Neutron-server instances are currently not resilient to working with an older schema. We also make no effort to avoid 'contract' migrations.

The goal of the RFE is to allow upgrading controller services one by one, without a full shutdown of all of them in an HA setup. This will avoid a public API outage during rolling upgrades.

The RFE involves:
- adopting object facades for all interaction with database models (see the sketch below);
- forbidding contract migrations in alembic;
- implementing new contract migrations in a backwards-compatible way at runtime.
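
A minimal sketch of what such an object facade could look like, using oslo.versionedobjects conventions (illustrative only, not actual Neutron code; the 'Network' object, its fields and the fake DB layer are assumptions made for the example):

import uuid

from oslo_versionedobjects import base as obj_base
from oslo_versionedobjects import fields as obj_fields

# Stand-in for the real database layer, only so the sketch is runnable.
_FAKE_DB = {
    'net-1': {'id': str(uuid.uuid4()), 'name': 'private'},
}


@obj_base.VersionedObjectRegistry.register
class Network(obj_base.VersionedObject):
    # Callers only ever see this object; the schema behind it can change
    # (expand now, contract later) without touching them.
    VERSION = '1.0'

    fields = {
        'id': obj_fields.UUIDField(),
        'name': obj_fields.StringField(nullable=True),
    }

    @classmethod
    def get_object(cls, context, key):
        # All database access is funnelled through the facade, never
        # through raw SQLAlchemy models spread across the code base.
        db_row = _FAKE_DB[key]
        obj = cls(context, id=db_row['id'], name=db_row['name'])
        obj.obj_reset_changes()
        return obj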

Tags: rfe-approved
Changed in neutron:
importance: Undecided → Wishlist
tags: added: rfe
Changed in neutron:
status: New → Confirmed
summary: - Upgrade controllers with no API downtime
+ [RFE]Upgrade controllers with no API downtime
summary: - [RFE]Upgrade controllers with no API downtime
+ [RFE] Upgrade controllers with no API downtime
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

I was under the impression that the EXPAND branch could already be executed without shutting down the services. That said, the problem certainly lies with the CONTRACT branches. This feels like a rehash of parts of [1], and as such is part of a much larger solution required to accommodate a no-downtime upgrade for the neutron servers. Looking at the DB alone won't suffice to ensure that requests are handled correctly during the upgrade. Please provide a more detailed plan, and break it down into chewable pieces for Newton. I suspect that once you do that, the RFE title won't sound like 'Upgrade controllers with no API downtime'.

[1] https://blueprints.launchpad.net/neutron/+spec/online-schema-migrations

Revision history for this message
Artur Korzeniewski (artur-korzeniewski) wrote :

The EXPAND branch can be executed at runtime; that improvement was introduced in Liberty.
The spec [1] introduces the expand/contract migrations but does not address the full story for rolling upgrades. Missing pieces include oslo.versionedobjects, moving contract migrations to later releases, and the coexistence of N and N+1 neutron servers accessing the DB at the same time.

I guess that this BP is an umbrella for finer-grained RFEs, as listed in the description:
- adopting object facades for all interaction with database models - already existing [2]
- moving contract migrations to later releases - a new RFE is needed
- coexistence of mixed versions of neutron servers at the same time - a new RFE is needed

[1] https://github.com/openstack/neutron-specs/blob/master/specs/liberty/online-schema-migrations.rst
[2] https://bugs.launchpad.net/neutron/+bug/1541928
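
To make the expand/contract split concrete, here is a rough sketch of the two kinds of operations (revision ids, table and column names are made up; in a real tree each would live in its own alembic script under the expand or contract branch, with a plain upgrade() function):

from alembic import op
import sqlalchemy as sa


def upgrade_expand():
    # EXPAND: purely additive, safe to run while older neutron-server
    # instances are still up, since code that predates the column never
    # reads it.
    op.add_column('networks',
                  sa.Column('new_attr', sa.String(255), nullable=True))


def upgrade_contract():
    # CONTRACT: destructive; dropping a column breaks any server that
    # still selects it, which is why it currently forces a full shutdown
    # (and why this RFE wants to avoid such migrations altogether).
    op.drop_column('networks', 'old_attr')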

Revision history for this message
Artur Korzeniewski (artur-korzeniewski) wrote :

And there is also the topic of online data migration, which can be done partially at runtime by accessing the values through OVO; for data that is not touched that way, a background Python script should be added to migrate the data in small chunks. This also requires a separate RFE.

But all the work on rolling upgrades depends on finishing the OVO implementation first...
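
For illustration, a chunked background migration could look roughly like this (the helpers and batch size are hypothetical, not actual Neutron code):

import time

BATCH_SIZE = 50


def migrate_in_chunks(context, load_unmigrated_batch, save_migrated,
                      pause_seconds=0.1):
    """Migrate data a little at a time so the API stays responsive."""
    while True:
        # Hypothetical helper that fetches rows still in the old format,
        # ideally through the object facade rather than raw models.
        batch = load_unmigrated_batch(context, limit=BATCH_SIZE)
        if not batch:
            break  # nothing left to convert
        for obj in batch:
            # Hypothetical helper that persists the new representation.
            save_migrated(context, obj)
        # Yield between batches to avoid hammering the database.
        time.sleep(pause_seconds)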

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

I tried to map the effort around upgrades (in the .svg attachment). The sections in the black box define the pieces required to deliver the 'no-downtime' controller upgrade experience.

Legend:
- green: completed
- yellow: in progress
- orange: not started
- arrow: depends on
- dotted arrow: would benefit from

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

I don't understand what the right side of the picture means. Can you elaborate?

Having said that, it seems to me that addressing this need is mostly procedural (e.g. the left side of the diagram), once hurdles like versioning of objects and API endpoints are put in place. As for the former there's been an ongoing plan, but what about the latter? How can we force API request handling to be at the lowest supportable version without microversioning? Have you given this any thought?

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Ping?

Assaf Muller (amuller)
Changed in neutron:
status: Confirmed → Triaged
Revision history for this message
Assaf Muller (amuller) wrote :

http://eavesdrop.openstack.org/irclogs/%23openstack-meeting/%23openstack-meeting.2016-06-16.log.html#t2016-06-16T22:09:21

Ihar to supply more information next week; hopefully the work scoped for N will be clearer then.

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

Armando, sorry for a really late response. I somehow missed your pings in the whirl of other stuff.

> I don't understand what the right side of the picture means. Can you elaborate?

The .svg image maps out the effort that is being covered by the upgrades subteam right now or planned for the future. The right side of the picture is part of that effort, but is not directly related to this RFE. Sorry for putting it there and probably misleading readers.

What we should care about in the context of the RFE is the left side, contoured by a solid line. Note that API versioning/pinning is outside the contour. This is because the RFE is about the technical ability to run mixed versions of the controller service, without considering usability limitations like inconsistent replies from different load-balanced controllers. Those are indeed put into the scheme to show the next steps, but they are beyond the scope of this RFE.

What this RFE is to cover is getting to the point where:
- we have the code supporting the new mode of operation;
- ...and it's proven through targeted gating jobs that it's working;
- we provide a framework to avoid data migrations in alembic;
- ...and we actually forbid data migrations in alembic (a sketch of how such a check could look follows this list).
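
As a sketch of how the 'forbid' step could be enforced in the gate (the banned call list and directory layout are assumptions, not the actual Neutron mechanism):

import ast
import pathlib

# op.* calls that manipulate data rather than schema; assumed list.
BANNED_CALLS = {'execute', 'bulk_insert'}


def find_data_migrations(migrations_dir):
    """Return (file, line) for every banned op.* call in alembic scripts."""
    offences = []
    for path in pathlib.Path(migrations_dir).rglob('*.py'):
        tree = ast.parse(path.read_text())
        for node in ast.walk(tree):
            if (isinstance(node, ast.Call)
                    and isinstance(node.func, ast.Attribute)
                    and isinstance(node.func.value, ast.Name)
                    and node.func.value.id == 'op'
                    and node.func.attr in BANNED_CALLS):
                offences.append((str(path), node.lineno))
    return offences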

Once all this is covered, I claim that neutron-server indeed supports the new model of operation. With that in place, we can start looking into tackling usability limitations we spot.

One of those limitations is indeed the potentially inconsistent behaviour of neutron-server between different major versions. I actually believe that we will need to return to the microversioning idea in the next cycle. And if we manage to achieve most of the things that I mapped above for Newton/the start of Ocata, I would love to look at a detailed plan for API versioning/pinning.

That said, I don't believe we need such a detailed plan in place right now, while we haven't yet laid down the foundation for the mixed-versions mode with the objects work and proper gating for the feature.

So, tl;dr: I believe it's only the contoured blocks that should be tracked by this RFE; the other blocks will require a separate discussion.

Revision history for this message
Akihiro Motoki (amotoki) wrote :

My understanding is that this RFE covers the handling of data migrations and contracting schema changes via oslo.versionedobjects.

Before applying a contract migration, the DB schema is still at the older version. The N+1 version of neutron-server needs to be aware that the older version N of the schema is still in use, and needs to generate the new version of a versioned object (and also convert the data model to the new one [version N+1]). This needs to be done in the running neutron-server.
By doing so, the N+1 version of neutron-server can work correctly even with the N version of the DB schema.

Is my understanding correct?

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

Akihiro, that's correct. This RFE requires three pieces: 1) the newer neutron-server being aware of the older schema; 2) no data migration in alembic (moving it into the neutron-server process through objects); 3) postponing contract schema changes until all servers are upgraded. The complete process may span more than two cycles, because the new server will still access the old schema even after all services are upgraded. So you can drop the old schema in a subsequent cycle only AFTER you have started rolling the data update into the database.
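
A minimal sketch of piece 1) in oslo.versionedobjects terms (illustrative only, not actual Neutron objects): the N+1 object carries the new field, and can drop it again when producing the representation an older consumer or the old schema expects.

from oslo_versionedobjects import base as obj_base
from oslo_versionedobjects import fields as obj_fields


@obj_base.VersionedObjectRegistry.register
class Port(obj_base.VersionedObject):
    # 1.1 added 'new_attr'; the backing column only appears once the
    # (postponed) contract migration has finally run.
    VERSION = '1.1'

    fields = {
        'id': obj_fields.UUIDField(),
        'new_attr': obj_fields.StringField(nullable=True),
    }

    def obj_make_compatible(self, primitive, target_version):
        super(Port, self).obj_make_compatible(primitive, target_version)
        # When serving a 1.0 consumer (or writing to the old schema),
        # drop the field it does not know about.  A real implementation
        # would compare version tuples instead of strings.
        if target_version == '1.0':
            primitive.pop('new_attr', None)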

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Ok, I think I have a better understanding of the scope of the effort: the ultimate goal is that you'd like to be in a position of running X replicas of the Neutron servers (just the servers) at any given time, where X1 run on version N and X2 run on version N-1 (X1+X2=X). All the agents are still at version N-1.

Most importantly, you want this to work, and you want to test this in the gate. Correct?

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :
tags: added: rfe-approved
removed: rfe
Changed in neutron:
milestone: none → ocata-rc1
milestone: ocata-rc1 → pike-1
Changed in neutron:
assignee: nobody → Ihar Hrachyshka (ihar-hrachyshka)
status: Triaged → In Progress
Changed in neutron:
milestone: pike-1 → pike-2
Changed in neutron:
assignee: Ihar Hrachyshka (ihar-hrachyshka) → nobody
Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Bug closed due to lack of activity, please feel free to reopen if needed.

Changed in neutron:
status: In Progress → Won't Fix