Server Guide draft has higher Google rank than released version

Bug #122297 reported by Hippu
Affects | Status | Importance | Assigned to | Milestone
Ubuntu Documentation | Fix Released | Medium | Unassigned |
Ubuntu Website - OBSOLETE | Invalid | Undecided | Unassigned |

Bug Description

1. Do a Google, Yahoo, or Live search for "ubuntu server guide".

What should happen: The Server Guide for Ubuntu's latest release <https://help.ubuntu.com/ubuntu/serverguide/C/> is the first result.
What actually happens: On all three search engines, the draft Server Guide for the *next* release is the first result.

This can be fixed with a robots.txt instructing search engines not to index doc.ubuntu.com.

Tags: serverguide
Hippu (teemu-heinamaki) wrote :

OK, stupid me: I was browsing the draft files because they seem to come up first in Google.

See: http://www.google.fi/search?q=ubuntu+server+guide&ie=utf-8&oe=utf-8&aq=t&rls=com.ubuntu:en-US:official&client=firefox-a

Changed in ubuntu-doc:
status: New → Invalid
description: updated
Changed in ubuntu-doc:
importance: Undecided → Medium
status: Invalid → Confirmed
LaserJock (laserjock) wrote :

I don't think I like the idea of not indexing doc.ubuntu.com at all; there are docs on there that are often referenced (although I suppose that could be a bug in itself), like the Packaging Guide and Edubuntu Handbook. I think I prefer using the Draft watermark that we used to have to show that the docs are WIP.

Matthew Paul Thomas (mpt) wrote :

The Edubuntu Handbook at least should be on help.ubuntu.com. The Packaging Guide could perhaps be on wiki.ubuntu.com (though I don't know how well that would fit with its development model).

Dean Sas (dsas) wrote :

Is the Edubuntu Handbook still something that's used/updated? The Packaging Guide at least has moved to the wiki as far as I know (and so should probably be permanently redirected there).

If there are still docs in there that we wish to be crawled, then use Disallow rules in the robots.txt file that don't match their directory names. (robots.txt rules match URL path prefixes rather than regular expressions.)
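As a rough illustration of selective blocking, here is a minimal Python sketch that checks a set of rules with the standard library's `urllib.robotparser`. The rule set and the `/kubuntu/` and `/edubuntu/` directory names are assumptions made up for the example, not the real layout of doc.ubuntu.com; only the `/ubuntu/serverguide/C/` path comes from this report.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block only the draft trees and leave the rest crawlable.
# Directory names other than /ubuntu/ are assumptions for illustration.
SELECTIVE_RULES = """\
User-agent: *
Disallow: /ubuntu/
Disallow: /kubuntu/
""".splitlines()


def is_crawlable(rules, url, agent="*"):
    """Return True if the given robots.txt lines let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(rules)
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in (
        "http://doc.ubuntu.com/ubuntu/serverguide/C/index.html",
        "http://doc.ubuntu.com/edubuntu/handbook/C/index.html",
    ):
        print(url, is_crawlable(SELECTIVE_RULES, url))
```

Each Disallow line is compared against the start of the URL path, so a directory that has no matching line stays open to crawlers.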

Neal McBurnett (nealmcb) wrote :

Re:
 http://doc.ubuntu.com/ubuntu/serverguide/C/index.html
 https://help.ubuntu.com/ubuntu/serverguide/C/index.html

Shouldn't these two redirect to something closer to
 https://help.ubuntu.com/7.10/server/C/

(which needs a 2007 copyright....)

I gather that part of the problem is that the URL changed from
"serverguide" to "server" at some point, and the former is probably
favored by google's heuristics.

Perhaps a google site map would help:
 https://www.google.com/webmasters/tools/docs/en/protocol.html
 http://en.wikipedia.org/wiki/Sitemaps
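
For readers unfamiliar with the format, a sitemap is a small XML file listing the URLs a site would like crawled. The following is only a sketch of generating one in Python; the `build_sitemap` helper is a hypothetical name, and the single URL is the released guide mentioned above. A sitemap on help.ubuntu.com would advertise the released pages but would not by itself demote drafts hosted elsewhere.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a minimal sitemap document (as a string) listing `urls`."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc in urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_sitemap(["https://help.ubuntu.com/7.10/server/C/"]))
```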

Matthew East (mdke) wrote :

ubuntu-doc was the appropriate project.

Changed in ubuntu-website:
status: New → Invalid
Matthew East (mdke) wrote :

I think the appropriate solution to this bug is as stated by the original report - not indexing doc.ubuntu.com. I don't see any disadvantage to doing so - the site is for the documentation team only (in fact I think it would be clearer if the url were docteam.ubuntu.com) and is linked from the relevant team pages on the wiki.

There is now a draft watermark on all pages but it doesn't solve the google problem.

Matthew East (mdke) wrote :

So, how does one stop google from indexing a site?

Jim Campbell (jwcampbell) wrote :

You can do this through a robots.txt file, through the meta tags on your
site... I think you can even do it through modifications to htaccess.

ubuntu.com already has a robots.txt file in place, but I'm not sure how
robots.txt files apply to subdomains. I also do not know what kind of
control we have over the meta tags in the draft documentation. Are the meta
tags auto-generated as part of the page creation process?
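
On the subdomain question: crawlers look for robots.txt at the root of each host separately, so a file served for ubuntu.com does not cover doc.ubuntu.com. A small Python sketch of that lookup rule follows; `robots_txt_url` is a hypothetical helper name, and the www.ubuntu.com page path is made up for the example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt location that governs `page_url`.

    Only the scheme and host are kept: the file always sits at the
    root of the same host as the page being fetched.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_txt_url("http://doc.ubuntu.com/ubuntu/serverguide/C/index.html"))
    print(robots_txt_url("http://www.ubuntu.com/some/page"))
```

The two calls produce different robots.txt locations, which is why doc.ubuntu.com needs its own file.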

On 2/29/08, Matthew East <email address hidden> wrote:
>
> So, how does one stop google from indexing a site?

Matthew East (mdke) wrote :

Hi,

On Fri, Feb 29, 2008 at 10:15 PM, Jim Campbell <email address hidden> wrote:
> You can do this through a robots.txt file, through the meta tags on your
> site... I think you can even do it through modifications to htaccess.
>
> ubuntu.com already has a robots.txt file in place, but I'm not sure how
> robots.txt files apply to subdomains. I also do not know what kind of
> control we have over the meta tags in the draft documentation. Are the meta
> tags auto-generated as part of the page creation process?

Yes, although no doubt it is possible to customise them if necessary.
http://www.sagehill.net/docbookxsl/HtmlHead.html looks like it has the
relevant instructions and I could take care of that aspect of it. But
I'm not familiar with robots.txt files.

--
Matthew East
http://www.mdke.org
gnupg pub 1024D/0E6B06FF

Dean Sas (dsas) wrote :

Matthew East wrote:
> Hi,
>
> On Fri, Feb 29, 2008 at 10:15 PM, Jim Campbell <email address hidden> wrote:
>> You can do this through a robots.txt file, through the meta tags on your
>> site... I think you can even do it through modifications to htaccess.
>>
>> ubuntu.com already has a robots.txt file in place, but I'm not sure how
>> robots.txt files apply to subdomains. I also do not know what kind of
>> control we have over the meta tags in the draft documentation. Are the meta
>> tags auto-generated as part of the page creation process?
>
> Yes, although no doubt it is possible to customise them if necessary.
> http://www.sagehill.net/docbookxsl/HtmlHead.html looks like it has the
> relevant instructions and I could take care of that aspect of it. But
> I'm not familiar with robots.txt files.

Either:
doc.ubuntu.com should have a file called 'robots.txt' in the site root
containing the following two lines:
User-agent: *
Disallow: /

(disallow all bots access to all pages)

Or:
The HTML head tag needs to contain a meta tag like so:
<meta name="robots" content="noindex, nofollow">
(noindex means don't index this page, and nofollow means don't crawl any
links on this page)

This should be added to every html page.
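
If the meta tag route is taken, the generated pages could be spot-checked with something like the Python sketch below. It uses the standard library's `html.parser`; the `RobotsMetaParser` class and `robots_directives` function are hypothetical names, and the sample HTML is invented for the example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma separated, e.g. "noindex, nofollow".
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html):
    """Return the set of robots meta directives found in `html`."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
    print(robots_directives(page))
```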

http://www.robotstxt.org is a good resource.
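
Before deploying, the two-line robots.txt above can be sanity-checked locally. This is a minimal sketch using Python's `urllib.robotparser`; the `allowed` helper is a hypothetical name and the user-agent strings are just examples. Keep in mind that robots.txt controls crawling, so it is worth confirming afterwards that the draft pages really drop out of search results.

```python
from urllib.robotparser import RobotFileParser

# The two-line file proposed above.
ROBOTS_TXT = "User-agent: *\nDisallow: /\n"


def allowed(robots_txt, agent, url):
    """Return True if `robots_txt` lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    draft = "http://doc.ubuntu.com/ubuntu/serverguide/C/index.html"
    for agent in ("Googlebot", "Slurp"):
        print(agent, allowed(ROBOTS_TXT, agent, draft))
```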

Neal McBurnett (nealmcb) wrote :

I see no reason to use robots.txt to shroud this valuable documentation completely from the testing, documenting, developer and user communities. Sometimes the best documentation on features of previous versions is in updated documentation. Google is often easier to use to find pages than other methods, and robots.txt would also prevent or complicate "internal" indexing with other tools. I think links to other versions, a site map, and labeling of the pages with version numbers would be much better.
The site map spec is at https://www.google.com/webmasters/tools/docs/en/protocol.html, and see also http://en.wikipedia.org/wiki/Site_map

Neal McBurnett (nealmcb) wrote :

Another problem we have with Ubuntu documentation is that actually the server guide in general has a lower rank for many queries than many other less authoritative sources out there. So rather than killing off one of our best google hits (doc.ubuntu.com) via robots.txt, we should make sure that if people get to it, they know what it is for and can navigate easily to what they want if this isn't it.

Matthew East (mdke) wrote :

Neal,

As you'll probably have gathered by now from the discussion on the mailing list, I simply disagree. There is a good reason to hide this information from users: they shouldn't *ever* be using it. There simply isn't a good reason for them to, because the preview material on doc.ubuntu.com is work in progress and applies only to the development version of Ubuntu.

The only reason for people to read the material is if they are assisting with development of the documentation: that doesn't require google. It just requires us to ensure that the site can be clearly found by those interested in helping out in that area: by linking it from the appropriate pages at https://wiki.ubuntu.com/DocumentationTeam.

I agree that it's in our interest to raise the google rank of official documentation - we should address this by seeking to raise the google rank of help.ubuntu.com, not by promoting unstable documentation for the development release of Ubuntu.

What do other people think?

Matthew East (mdke) wrote :

There hasn't been much discussion on this for a while. To see how it goes, I've implemented a robots.txt which will stop search engines from indexing doc.ubuntu.com. If anyone has any new comments on this subject please post here and we'll keep considering alternative solutions if appropriate.

Changed in ubuntu-doc:
status: Confirmed → Fix Released