buildlogs.centos.org CDN issues

Bug #1674681 reported by Emilien Macchi on 2017-03-21
This bug affects 5 people
Affects: tripleo
Importance: Critical
Assigned to: Unassigned

Bug Description

We have hit, 138 times, a situation where a package couldn't be downloaded from the CentOS repositories:
http://logstash.openstack.org/#dashboard/file/logstash.json?query=build_name%3A%20*tripleo-ci*%20AND%20build_status%3A%20FAILURE%20AND%20message%3A%20%5C%22No%20more%20mirrors%20to%20try%5C%22

That's far too many, and our CI is becoming really unstable because of it. We need to find a solution.

Note: it happens on all the cloud providers used by OpenStack Infra.

Tags: ci
Changed in tripleo:
milestone: none → pike-1
Bogdan Dobrelya (bogdando) wrote :

On failure, infra/CI scripts should run "yum clean expire-cache" and then retry the failed step, with "yum history redo last" or the like.
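A rough sketch of what such a retry wrapper could look like; the helper name, retry count, and delay are made up for illustration, and only the `yum clean expire-cache` step comes from the suggestion above:

```shell
# Hypothetical retry helper for CI scripts. Runs a command; on
# failure, expires yum's cached repodata (a common cause of
# "No more mirrors to try") and tries again a few times.
retry_yum() {
    attempts=3
    while [ "$attempts" -gt 0 ]; do
        if "$@"; then
            return 0
        fi
        # Stale cached metadata can leave yum asking mirrors for
        # files they no longer have, so expire the cache first.
        yum clean expire-cache || true
        attempts=$((attempts - 1))
        sleep "${RETRY_DELAY:-5}"
    done
    return 1
}

# Example: retry_yum yum -y install python-whatever
```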

Alfredo Moralejo (amoralej) wrote :

These issues shouldn't be related to bad cached metadata from buildlogs.centos.org, as it keeps old package versions even when newer versions are added to the repos. In fact, the URL of the package works properly.

Retrying could help if it's just a networking glitch, but the proper solution would be to get it fixed in the infra. It seems we are only getting that error when downloading packages from buildlogs, not from the Delorean repos, which points to a problem in the CentOS infra. It's weird, though, that we are not hitting it in puppet jobs or in RDO CI. I'll try to dig a bit more.

Alan Pevec (apevec) wrote :

I could not reproduce this locally, and from logstash it looks like this started happening recently.
buildlogs.centos.org is backed by a donated CDN network that is not under the control of the CentOS team, so it could be speculated that upstream infra is hitting bad edge nodes.

Alan Pevec (apevec) wrote :

> started happening recently

== last week, see screenshot https://apevec.fedorapeople.org/openstack/openstack-infra-buildlogs.centos-history.png

Anssi Johansson (t-launchpad-n) wrote :

Echoing what I wrote in the bugs.centos.org bug entry:

It would help tremendously if you could trigger this issue with debugging output enabled -- e.g. "URLGRABBER_DEBUG=1 yum install python-whatever". That would show all the HTTP requests that get sent, and the IP address of each connection.
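Concretely, the debug run could be wrapped in a small helper like this. The helper name, log path, and package name are made up for illustration; this assumes urlgrabber's debug output goes to stderr, which is where it lands by default:

```shell
# Run a command with urlgrabber debugging enabled, saving stderr
# (where urlgrabber logs each HTTP request and peer address) to a file.
debug_yum() {
    logfile=$1; shift
    URLGRABBER_DEBUG=1 "$@" 2> "$logfile"
}

# Typical use on a failing node:
#   debug_yum /tmp/urlgrabber.log yum install -y python-whatever
```

The saved log should show which IP address buildlogs.centos.org resolved to for each transfer, i.e. which CDN edge node answered.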

Alan Pevec (apevec) wrote :

@Michele the issue is not the load on the CDN; we cannot reproduce this issue outside of upstream infra, so it would help to capture the requested debug output from one of the rh1/rh2 nodes.

Michele Baldessari (michele) wrote :

Not entirely sure it will work, but I am trying something around URLGRABBER_DEBUG in https://review.openstack.org/448982. I will update here if it does indeed give us extra logs.

Alan Pevec (apevec) on 2017-03-23
summary: - centos.org repositories are unreliable
+ buildlogs.centos.org CDN issues

@Alan, most of the problems happen on hosts of various cloud providers, not on the rh1 cloud. (Multinode jobs run on nodepool images of OpenStack infra, which are hosted on OVH, OSIC, and other clouds.)

As you can see from Kibana[1] or the picture in this thread, the problems in jobs all started on March 14. That's exactly when the CentOS infra was moving hardware, according to [2].

[1] http://logstash.openstack.org/#/dashboard/file/logstash.json?query=build_name:%20*tripleo-ci*%20AND%20build_status:%20FAILURE%20AND%20message:%20%5C%22No%20more%20mirrors%20to%20try%5C%22
[2] https://seven.centos.org/2017/03/infra-scheduled-major-outage-for-several-services/

Alan Pevec (apevec) wrote :

The CDN backing[1] buildlogs.centos.org did not change; it's external to the CentOS infra, which was moved on March 14th.

[1] https://lists.centos.org/pipermail/centos-devel/2016-March/014552.html

David Moreau Simard (dmsimard) wrote :

There's a couple different things going on here.

We need to switch everything that uses "http://buildlogs" to "https://buildlogs".
This is happening here: https://review.openstack.org/#/q/topic:cbs/https

We're seeing "No more mirrors to try" errors on some jobs.
This seems to be because TripleO-Quickstart uses trunk.rdoproject.org/current instead of resolving the hash behind "current" and then using trunk.rdoproject.org/<hash>.
Since "current" is bound to change midway through a job as new builds come in, jobs can fail on packages that are no longer there.

This is being fixed by (as per Sagi):
- https://review.openstack.org/#/c/448511/
- https://review.openstack.org/#/c/448512/
- https://review.openstack.org/#/c/449004

Let's get all this merged and see if we're still experiencing problems.
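The hash-pinning idea described above can be sketched roughly as follows. This is an illustration, not the actual patches: it assumes the usual DLRN repo-file layout, where the `delorean.repo` served under `current/` carries a single `baseurl=` line pointing at the hashed directory:

```shell
# Print the baseurl found in a downloaded delorean.repo file.
# DLRN repo files normally carry the hashed repo path, so grabbing
# it once at job start pins the job to a fixed package set even if
# "current" is promoted mid-run. (Assumes a single baseurl= line.)
pin_dlrn_repo() {
    sed -n 's/^baseurl=//p' "$1" | head -n 1
}

# Typical use (network step shown for context only):
#   curl -sfL https://trunk.rdoproject.org/centos7/current/delorean.repo \
#        -o /etc/yum.repos.d/delorean.repo
#   hashed_url=$(pin_dlrn_repo /etc/yum.repos.d/delorean.repo)
```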

Alfredo Moralejo (amoralej) wrote :

Note that we already changed to https://buildlogs and it didn't help (yum can properly follow redirects), and it created other issues (especially in jobs running in ci.centos, as the mirror inside doesn't support https).

I agree we need to fix the issue with the "current" repo too. It's a totally different issue than the one with buildlogs, but as the same "No more mirrors to try" message appears, it may lead us to mix the two problems.

Paul Belanger (pabelanger) wrote :

If you can enable rsync from buildlogs.centos.org, we can mirror this infrastructure into openstack. Then you'll be hitting our AFS backend. I've mentioned this a few times before, but I'm documenting it here for this issue.

Alan Pevec (apevec) wrote :

"No more mirrors to try" has not appeared in the last 24h, I think we can de-escalate this issue and try to come up with the theory of the root cause to ensure it does not happen again.

tags: removed: alert

It didn't appear because we still haven't added quickstart logs to logstash indexing. The non-quickstart jobs (OVB) are stuck now because of some rh1 problems. It's still a big problem.

tags: added: alert
David Moreau Simard (dmsimard) wrote :

Just as complementary information...

I've reached out to the following communities:
- openstack-ansible
- chef-openstack
- puppet-openstack
- kolla

They all rely on RDO's buildlogs repositories as part of their gates and, besides the python-vine issue that happened today, none report any unusual instability.

Alan Pevec (apevec) wrote :

@Sagi, can we get yum failure stats for the non-oooq tripleo scenario jobs? I assume some of those are still around?

Alfredo Moralejo (amoralej) wrote :

I've created https://review.openstack.org/#/c/449548/ to add debug in oooq jobs.

tags: removed: alert
Changed in tripleo:
status: Triaged → Fix Released
Alan Pevec (apevec) wrote :

@Emilien what was the fix?

David Moreau Simard (dmsimard) wrote :

We did not merge "one" fix for everything, but after a lot of work we managed to land a series of patches meant to improve the state of CI: https://review.openstack.org/#/q/topic:tripleo/outstanding

The situation should be much, much better now but we should monitor if we are still experiencing problems.

Paul Belanger (pabelanger) wrote :

So, I am pretty sure this isn't fixed, and that it wasn't the real issue. Based on comments dmsimard made previously, I started looking into this more.

A few issues:

1) We do appear to be having networking issues in osic-cloud1. From what I can see, we are getting packet loss within the Rackspace networks. Surprisingly, this may be limited to centos-7 hosts, which might explain why other projects haven't seen this before. I'll be looking into it on Monday.

2) TripleO should take this opportunity to better understand that jobs should not depend on the internet. Specifically, TripleO goes out to the network way too much for my comfort. We make a big effort to keep devstack from going out to the public web for this exact reason. TripleO is now at the point where a significant number of jobs are failing because of networking blips.

I have started working on patching rubygems-mirror to allow openstack-infra to mirror gems, but things like DLRN and buildlogs.centos.org should also be mirrored.

github.com is a bigger issue, but I am going to bring it up at our Tuesday openstack-infra meeting.

3) Add IPv6 support for trunk.rdoproject.org. This is specific to potential issues in osic-cloud, but if rdoproject.org provided IPv6 support, it might fix some of the networking issues we are seeing. Otherwise, traffic from osic-cloud1 needs to go through an IPv4 NAT.

So, with all of that, we should reopen this and use it to help track the issues above.

Changed in tripleo:
assignee: nobody → Paul Belanger (pabelanger)
Changed in tripleo:
status: Fix Released → In Progress
Changed in tripleo:
milestone: pike-1 → pike-2
Changed in tripleo:
milestone: pike-2 → pike-3
Ben Nemec (bnemec) wrote :

We haven't been seeing mirror issues AFAIK, but I guess I'll leave it open per the previous comments. Dropping the priority though since it's clearly not critical anymore.

Changed in tripleo:
importance: Critical → Medium
Changed in tripleo:
milestone: pike-3 → pike-rc1
Changed in tripleo:
milestone: pike-rc1 → queens-1
Dan Trainor (dtrainor) wrote :

I think I'm seeing this creep up again when trying to install in a new environment (rdo-cloud). Multiple failures indicating:

delorean-pike-testing/x86_64/p FAILED
https://buildlogs.centos.org/centos/7/cloud/x86_64/openstack-pike/repodata/700104f6690b05335dd322398eb584976bdc1a300bf1a5bc560930e58fa61fa3-primary.sqlite.bz2: [Errno 12] Timeout on https://buildlogs.centos.org/centos/7/cloud/x86_64/openstack-pike/repodata/700104f6690b05335dd322398eb584976bdc1a300bf1a5bc560930e58fa61fa3-primary.sqlite.bz2: (28, 'Operation timed out after 30001 milliseconds with 0 out of 0 bytes received')
Trying other mirror.
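For what it's worth, the "30001 milliseconds" in the error above matches yum's default 30-second per-connection timeout, so one generic mitigation for flaky CDN edges is loosening the download settings in yum.conf. These values are illustrative knobs, not the settings TripleO actually used:

```ini
[main]
# Per-connection timeout; yum's default is 30s, which is exactly
# what timed out in the error above.
timeout=120
# Number of times to retry a failed download before giving up.
retries=10
```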

Paul Belanger (pabelanger) wrote :

You'll need to set up mirror infrastructure in RDO cloud, like we do in openstack-infra. This can be done pretty easily if you join our AFS cell. You could also run puppet from system-config for the mirror server; then you don't actually need to manage anything.

Paul Belanger (pabelanger) wrote :

You could set up a crontab on the mirror in RDO using the following:
  http://git.openstack.org/cgit/openstack-infra/system-config/tree/manifests/site.pp#n616

Paul Belanger (pabelanger) wrote :

Also, we don't have buildlogs.centos.org in our reverse proxy cache. If you are depending on it, we can add it to avoid hitting it directly.
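An illustrative sketch of what adding buildlogs.centos.org to an Apache reverse-proxy cache could look like, in the spirit of the existing openstack-infra mirror proxies. The hostname, port, path, and cache location are assumptions, not the real system-config values:

```apache
<VirtualHost *:8080>
    ServerName mirror.regionone.example.openstack.org

    # Cache fetched objects on local disk so repeated repodata and
    # package downloads don't hit the CDN again.
    CacheRoot /var/cache/apache2/proxy
    CacheEnable disk /buildlogs

    # buildlogs.centos.org 302-redirects plain http to https, so
    # proxy straight to the https origin.
    SSLProxyEngine on
    ProxyPass /buildlogs/ https://buildlogs.centos.org/
    ProxyPassReverse /buildlogs/ https://buildlogs.centos.org/
</VirtualHost>
```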

David Moreau Simard (dmsimard) wrote :

There has been various reports of issues with buildlogs.centos.org today, the core CentOS infrastructure team was notified.

David Moreau Simard (dmsimard) wrote :

@Paul, setting up a reverse proxy to buildlogs.centos.org could be a good idea.

Dan Trainor (dtrainor) wrote :

Excellent, thank you Paul for your feedback.

tags: added: alert
Changed in tripleo:
milestone: queens-1 → pike-rc1
importance: Medium → Critical
assignee: Paul Belanger (pabelanger) → nobody
Ian Wienand (iwienand) wrote :

One thing I noticed is that http://buildlogs.centos.org 302-redirects to https://, which might throw a bit of a spanner in the works for a proxy.

Changed in tripleo:
status: In Progress → Triaged
Emilien Macchi (emilienm) wrote :

Sounds like it's back. I'll keep it open until the morning and see how it goes. Feel free to update it as well if you think it's fixed.

Emilien Macchi (emilienm) wrote :

Paul is working on mirroring buildlogs: https://review.openstack.org/#/c/489462/

tags: removed: alert
Changed in tripleo:
status: Triaged → In Progress
Changed in tripleo:
milestone: pike-rc1 → pike-rc2
Changed in tripleo:
milestone: pike-rc2 → queens-1

Not using buildlogs anymore.

Changed in tripleo:
status: In Progress → Fix Released