buildlogs.centos.org CDN issues

Bug #1674681 reported by Emilien Macchi
This bug affects 5 people
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned

Bug Description

We have hit a situation 138 times where a package could not be downloaded from the CentOS repositories:
http://logstash.openstack.org/#dashboard/file/logstash.json?query=build_name%3A%20*tripleo-ci*%20AND%20build_status%3A%20FAILURE%20AND%20message%3A%20%5C%22No%20more%20mirrors%20to%20try%5C%22

That's far too many failures, and it makes our CI really unstable. We need to find a solution.

Note: it happens on all of the cloud providers used by OpenStack Infra.

Tags: ci
Changed in tripleo:
milestone: none → pike-1
Revision history for this message
Emilien Macchi (emilienm) wrote :
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

If a download fails, the infra/CI scripts should run "yum clean expire-cache" and then retry the failed step, e.g. with "yum history redo last" or the like.
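
A minimal sketch of what such a retry wrapper could look like (not the actual CI code; the package name and retry count are just placeholders):

    # Flush cached repo metadata and retry the failing yum step a few times.
    for attempt in 1 2 3; do
        yum -y install python-whatever && break
        echo "yum failed (attempt ${attempt}), refreshing metadata and retrying" >&2
        yum clean expire-cache
        sleep 30
    done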

Revision history for this message
Emilien Macchi (emilienm) wrote :
Revision history for this message
Alfredo Moralejo (amoralej) wrote :

These issues shouldn't be related to bad cached metadata from buildlogs.centos.org, as it keeps old package versions even when newer versions are added to the repos. In fact, the URL of the package works properly.

Retrying could help if it's just a networking glitch, but the proper solution would be to get it fixed in the infra. We only seem to be getting that error when downloading packages from buildlogs, not from the delorean repos, which points to a problem in the CentOS infra; it's weird, though, that we are not hitting it in the puppet jobs or in RDO CI. I'll try to dig a bit more.

Revision history for this message
Alan Pevec (apevec) wrote :

I could not reproduce this locally, and from Logstash it looks like it started happening recently.
buildlogs.centos.org is backed by a donated CDN network that is not under the control of the CentOS team, so one could speculate that upstream infra is hitting bad edge nodes.

Revision history for this message
Alan Pevec (apevec) wrote :

> started happening recently

== last week, see screenshot https://apevec.fedorapeople.org/openstack/openstack-infra-buildlogs.centos-history.png

Revision history for this message
Michele Baldessari (michele) wrote :
Revision history for this message
Anssi Johansson (t-launchpad-n) wrote :

Echoing what I wrote in the bugs.centos.org bug entry:

It would help tremendously if you could trigger this issue with debugging output enabled -- like "URLGRABBER_DEBUG=1 yum install python-whatever". That would show all the HTTP requests that get sent, and the IP addresses of each connection.
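
For example, a one-off run in a job could keep the full output so it can be attached here (the package name is a placeholder, as above):

    # Capture yum's output, including urlgrabber's HTTP-level debug trace,
    # so the requests and per-connection IP addresses can be inspected.
    URLGRABBER_DEBUG=1 yum -y install python-whatever > /tmp/yum-urlgrabber.log 2>&1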

Revision history for this message
Alan Pevec (apevec) wrote :

@Michele the issue is not load on the CDN; we cannot reproduce this issue outside of upstream infra, so it would help to get the requested debug output from one of the rh1/rh2 nodes.

Revision history for this message
Michele Baldessari (michele) wrote :

Not entirely sure it will work, but I am trying something around URLGRABBER_DEBUG in https://review.openstack.org/448982. I will update here if it does indeed give us extra logs.

Alan Pevec (apevec)
summary: - centos.org repositories are unreliable
+ buildlogs.centos.org CDN issues
Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

@Alan, most of the problems happen on hosts of various cloud providers, not on the rh1 cloud. (Multinode jobs run on OpenStack Infra nodepool images, which are hosted on OVH, OSIC, and other clouds.)

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

As you can see from Kibana [1] or the picture in this thread, all of the job failures started on 14 March. That is exactly when the CentOS infra was moving hardware, according to [2].

[1] http://logstash.openstack.org/#/dashboard/file/logstash.json?query=build_name:%20*tripleo-ci*%20AND%20build_status:%20FAILURE%20AND%20message:%20%5C%22No%20more%20mirrors%20to%20try%5C%22
[2] https://seven.centos.org/2017/03/infra-scheduled-major-outage-for-several-services/

Revision history for this message
Alan Pevec (apevec) wrote :

The CDN backing [1] buildlogs.centos.org did not change; it is external to the CentOS infra that was moved on March 14th.

[1] https://lists.centos.org/pipermail/centos-devel/2016-March/014552.html

Revision history for this message
David Moreau Simard (dmsimard) wrote :

There are a couple of different things going on here.

We need to switch everything that uses "http://buildlogs" to "https://buildlogs".
This is happening here: https://review.openstack.org/#/q/topic:cbs/https

We're seeing "No more mirrors to try" errors on some jobs.
This seems to be because TripleO-Quickstart uses trunk.rdoproject.org/current instead of resolving the hash behind "current" and then using trunk.rdoproject.org/<hash>.
Since "current" is bound to change midway through a job as new builds land, jobs can fail on packages that are no longer found (a rough sketch of the pinning approach is at the end of this comment).

This is being fixed by (as per Sagi):
- https://review.openstack.org/#/c/448511/
- https://review.openstack.org/#/c/448512/
- https://review.openstack.org/#/c/449004

Let's get all this merged and see if we're still experiencing problems.
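
For reference, a rough sketch of the pinning idea mentioned above (the actual patches linked here may do it differently; this assumes the repo file under current/ carries the hashed baseurl, as DLRN published it at the time):

    # Resolve "current" to its hashed snapshot once, at the start of the job,
    # so the repo contents cannot change mid-run.
    CURRENT_REPO=https://trunk.rdoproject.org/centos7/current/delorean.repo
    HASHED_BASEURL=$(curl -sf "$CURRENT_REPO" | sed -n 's/^baseurl=//p')
    echo "Pinning DLRN repo to: ${HASHED_BASEURL}"
    # Installing the fetched repo file pins yum to that hashed snapshot.
    curl -sf "$CURRENT_REPO" | sudo tee /etc/yum.repos.d/delorean.repo > /dev/null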

Revision history for this message
Alfredo Moralejo (amoralej) wrote :

Note that we already changed to https://buildlogs and it didn't help (yum can properly follow redirects), and it created other issues (especially in jobs running in ci.centos, as the internal mirror doesn't support https).

I agree we need to fix the issue with the "current" repo too. It's a totally different issue than the one with buildlogs, but since it produces the same "No more mirrors to try" message, it may lead us to mix up the two problems.

Revision history for this message
Paul Belanger (pabelanger) wrote :

If you can enable rsync from buildlogs.centos.org, we can mirror this infrastructure into OpenStack; then you'll be hitting our AFS backend. I've mentioned it a few times, but I'm documenting it here for this issue.
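
Purely hypothetical sketch of what the pull side could look like; buildlogs.centos.org would first need to expose an rsync module, and the module name and AFS path below are invented:

    # Pull one repo tree from a (hypothetical) rsync module into an AFS mirror volume.
    rsync -rltvz --delete \
        rsync://buildlogs.centos.org/buildlogs/centos/7/cloud/x86_64/ \
        /afs/.openstack.org/mirror/centos-buildlogs/7/cloud/x86_64/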

Revision history for this message
Alan Pevec (apevec) wrote :

"No more mirrors to try" has not appeared in the last 24h, I think we can de-escalate this issue and try to come up with the theory of the root cause to ensure it does not happen again.

tags: removed: alert
Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

It didn't appear because we still haven't added the quickstart logs to Logstash indexing. The non-quickstart (OVB) jobs are stuck right now because of some rh1 problems. It's still a big problem.

tags: added: alert
Revision history for this message
David Moreau Simard (dmsimard) wrote :

Just as complementary information...

I've reached out to the following communities:
- openstack-ansible
- chef-openstack
- puppet-openstack
- kolla

They all rely on RDO's buildlogs repositories as part of their gate and besides the python-vine issue that happened today, none report any unusual instability.

Revision history for this message
Alan Pevec (apevec) wrote :

@Sagi, can we get yum failure stats for the non-oooq tripleo scenario jobs? I think some of those are still around.

Revision history for this message
Alfredo Moralejo (amoralej) wrote :

I've created https://review.openstack.org/#/c/449548/ to add debug in oooq jobs.

tags: removed: alert
Changed in tripleo:
status: Triaged → Fix Released
Revision history for this message
Alan Pevec (apevec) wrote :

@Emilien what was the fix?

Revision history for this message
David Moreau Simard (dmsimard) wrote :

We did not merge "one" fix that fixes everything, but after a lot of work we managed to land a series of patches meant to improve the state of CI: https://review.openstack.org/#/q/topic:tripleo/outstanding

The situation should be much, much better now but we should monitor if we are still experiencing problems.

Revision history for this message
Paul Belanger (pabelanger) wrote :

So, I am pretty sure this isn't fixed, and that it wasn't the real issue. Based on the comments dmsimard made previously, I started looking into this some more.

A few issues:

1) We do appear to be having networking issues in osic-cloud1. From what I see, we are getting packet loss within the Rackspace networks. Surprisingly, this may be limited to centos-7 hosts, which might explain why other projects haven't seen it before. I'll be looking into this on Monday.

2) TripleO should take this opportunity to understand that jobs should not depend on the internet. Specifically, TripleO goes out to the network way too much for my comfort. We make a big effort to keep devstack from going out to the public web for this exact reason. TripleO is now at the point where a significant number of jobs are failing because of networking blips.

I have started working on patching rubygems-mirror to allow openstack-infra to mirror gems, but things like DLRN and buildlogs.centos.org should also be mirrored.

github.com is a bigger issue, but I am going to bring it up at our Tuesday openstack-infra meeting.

3) Add IPv6 support for trunk.rdoproject.org. This is specific to potential issues in osic-cloud, but if rdoproject.org provided IPv6 support, it might fix some of the networking issues we are seeing (a quick way to check is sketched at the end of this comment). Otherwise, traffic from osic-cloud1 needs to go through an IPv4 NAT.

So, with all of that, we should reopen this bug and use it to help track the issues above.
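
Regarding 3), a quick check that anyone can run (example only):

    # Does trunk.rdoproject.org publish an AAAA record, and is it reachable over IPv6?
    dig +short AAAA trunk.rdoproject.org
    curl -6 -sI https://trunk.rdoproject.org/ | head -n 1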

Changed in tripleo:
assignee: nobody → Paul Belanger (pabelanger)
Changed in tripleo:
status: Fix Released → In Progress
Changed in tripleo:
milestone: pike-1 → pike-2
Changed in tripleo:
milestone: pike-2 → pike-3
Revision history for this message
Ben Nemec (bnemec) wrote :

We haven't been seeing mirror issues AFAIK, but I guess I'll leave it open per the previous comments. Dropping the priority though since it's clearly not critical anymore.

Changed in tripleo:
importance: Critical → Medium
Changed in tripleo:
milestone: pike-3 → pike-rc1
Changed in tripleo:
milestone: pike-rc1 → queens-1
Revision history for this message
Dan Trainor (dtrainor) wrote :

I think I'm seeing this creep up again when trying to install in a new environment (rdo-cloud). Multiple failures indicating:

delorean-pike-testing/x86_64/p FAILED
https://buildlogs.centos.org/centos/7/cloud/x86_64/openstack-pike/repodata/700104f6690b05335dd322398eb584976bdc1a300bf1a5bc560930e58fa61fa3-primary.sqlite.bz2: [Errno 12] Timeout on https://buildlogs.centos.org/centos/7/cloud/x86_64/openstack-pike/repodata/700104f6690b05335dd322398eb584976bdc1a300bf1a5bc560930e58fa61fa3-primary.sqlite.bz2: (28, 'Operation timed out after 30001 milliseconds with 0 out of 0 bytes received')
Trying other mirror.

Revision history for this message
Paul Belanger (pabelanger) wrote :

You'll need to set up mirror infrastructure in RDO cloud, like we do in openstack-infra. This can be done pretty easily if you join our AFS cell. You could also run puppet from system-config for the mirror server; then you don't actually need to manage anything.

Revision history for this message
Paul Belanger (pabelanger) wrote :

You could set up a crontab on the mirror in RDO cloud using the following:
  http://git.openstack.org/cgit/openstack-infra/system-config/tree/manifests/site.pp#n616
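
Something along these lines, for illustration only; the authoritative entries are in the site.pp linked above, and the script name, schedule, and log path here are placeholders:

    # /etc/cron.d/mirror-centos (hypothetical)
    */30 * * * * root flock -n /var/run/mirror-centos.lock \
        /usr/local/bin/mirror-update centos >> /var/log/mirror-centos.log 2>&1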

Revision history for this message
Paul Belanger (pabelanger) wrote :

Also, we don't have buildlogs.centos.org in our reverse proxy cache. If you are depending on it, we can add it to avoid hitting it directly.
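
Roughly, that would mean adding something like the following to the mirror's Apache config (illustrative only; the real config is managed in system-config, and the vhost, cache root, and size limits here are made up):

    <VirtualHost *:8080>
        ServerName mirror.example.openstack.org
        SSLProxyEngine on
        # Cache proxied responses on disk so repeated package fetches stay local.
        CacheRoot /var/cache/apache2/proxy
        CacheEnable disk /buildlogs
        CacheMaxFileSize 1073741824
        ProxyPass        /buildlogs/ https://buildlogs.centos.org/ ttl=120 keepalive=On
        ProxyPassReverse /buildlogs/ https://buildlogs.centos.org/
    </VirtualHost>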

Revision history for this message
David Moreau Simard (dmsimard) wrote :

There have been various reports of issues with buildlogs.centos.org today; the core CentOS infrastructure team was notified.

Revision history for this message
David Moreau Simard (dmsimard) wrote :

@Paul, setting up a reverse proxy to buildlogs.centos.org could be a good idea.

Revision history for this message
Dan Trainor (dtrainor) wrote :

Excellent, thank you Paul for your feedback.

tags: added: alert
Changed in tripleo:
milestone: queens-1 → pike-rc1
importance: Medium → Critical
assignee: Paul Belanger (pabelanger) → nobody
Revision history for this message
Ian Wienand (iwienand) wrote :

One thing I noticed is that http://buildlogs.centos.org 302-redirects to https://, which might throw a bit of a spanner in the works for a proxy.
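
The redirect is easy to confirm from a test node, e.g.:

    # -I fetches headers only; look for the 302 status and the Location header.
    curl -sI http://buildlogs.centos.org/ | grep -iE '^(HTTP|Location)'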

Changed in tripleo:
status: In Progress → Triaged
Revision history for this message
Emilien Macchi (emilienm) wrote :

Sounds like it's back. I'll keep it open until my morning and see how it goes. Feel free to update it as well if you think it's fixed.

Revision history for this message
Emilien Macchi (emilienm) wrote :
Revision history for this message
Emilien Macchi (emilienm) wrote :

Paul is working on mirroring buildlogs: https://review.openstack.org/#/c/489462/

tags: removed: alert
Changed in tripleo:
status: Triaged → In Progress
Changed in tripleo:
milestone: pike-rc1 → pike-rc2
Changed in tripleo:
milestone: pike-rc2 → queens-1
Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

Not using buildlogs anymore.

Changed in tripleo:
status: In Progress → Fix Released