Optimisation

The good news is that if you are using one of the popular repository software packages such as DSpace or EPrints, you probably do not need to worry about optimising your repository for search engine robots, because their default set-ups are fine. This section will therefore mainly be of interest if you have developed your own software or site, used an uncommon package, or heavily customised your installation. See Jody L. DeRidder (2008) for a case study, and SHERPA's Ways to snatch Defeat from the Jaws of Victory for a list of gotchas.

How Web Crawlers Work

Search engines index websites (including OA repositories) using special programs called 'robots' (or 'bots' for short). These robots usually have their own names; for instance, Google's main robot is called 'Googlebot'. Other terms you may encounter are 'web crawler' and 'spider', which reflect the way these programs work. Starting from a specific web page, the crawler follows all of that page's hyperlinks to index other pages on the website, and often external pages too. In this way the whole website is covered, although it may take some time to reach the lower levels of the page hierarchy, and some search engines do not guarantee to index every page. Some technical features of your website could prevent pages from being crawled, and there are also ways of deliberately blocking certain pages from being indexed if you wish. Both are discussed later.

Ensuring Browsability

Links are the key to successful web crawling – more specifically, static links. Dynamic links, such as those generated by an interactive search, are likely to be unreachable by a robot. A related point is that URLs containing arguments (i.e. URLs with a '?') may be bypassed by some robots, which may regard them as transient dynamic content. Ideally, links should be text links, although linked images are usually also followed. Links that use buttons to run JavaScript, PHP or other programmed functions will normally be ignored.

For effective web crawling, it must be possible to visit every page and document in your repository just by clicking on hyperlinks – without ever needing to type in text or to use buttons.
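
To make the distinction concrete, here is a short illustrative sketch (the URLs and the showPaper() function are hypothetical): the first link is a plain static text link that a crawler can follow, the second uses URL arguments that some robots may bypass, and the third only works by running a script, so it will normally be ignored.

    <!-- Crawler-friendly: a static text link to an item's metadata page -->
    <a href="/items/2007/paper-123.html">Paper 123 (2007)</a>

    <!-- May be bypassed: the '?' arguments suggest transient dynamic content -->
    <a href="/search.php?year=2007&amp;id=123">Paper 123 (2007)</a>

    <!-- Normally ignored: the link only exists when JavaScript runs -->
    <button onclick="showPaper(123)">Paper 123 (2007)</button>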

Website Structure

As mentioned earlier, web crawlers may take a while to reach the lower levels of your web page hierarchy, and some robots may only dig down so far. It therefore helps if you can keep your hierarchy as shallow as possible. (This also helps usability for humans.) A typical structure would have a set of 'Browse by...' options on the home page that link to lists of documents, thence to metadata pages for individual items, and finally to the full text.

e.g. Browse by Year > 2007 > [List of titles] > [Metadata page] > [Full text PDF]

This is still more than the 'three clicks' ideal for human usability. Crawling can be partially improved by listing some items directly on the home page – typically 'Recent Additions', 'Popular Papers', etc. Such lists primarily provide a useful 'come back' feature that encourages people to revisit the repository, but they also cause the listed papers to be indexed sooner by web crawlers.

General advice on making websites friendly to search engines can be found in Google's Webmaster Guidelines, and Peter Suber has prepared more specific advice on how to optimise repositories for Google crawling. Particular advice for DSpace users can be found in the DSpace Wiki page 'Ensuring your instance is indexed'.

Blocking Robots

There are cases where you may wish to prevent a robot from indexing a particular page or a group of pages. A depositors' login page would be a typical example. There are two methods of doing this that are accommodated by nearly all the reputable search engine spiders:

  • 'robots.txt' Files – This approach is the best method for blocking groups of pages, although it can also be used to block single pages. A plain text file named robots.txt, containing a set of instructions about which robots are to be excluded and/or which pages are to be ignored, is placed in the website's root directory. Each block of instructions starts with a line specifying the 'User-agent' to which the block applies, followed by one or more lines indicating the files or directories that are to be 'Disallowed'. e.g.

    # The * indicates that the instructions apply to all robots
    User-agent: *
    # Robots should not index the file login.php
    Disallow: /login.php
    # Ignore all files in the /restricted/ directory tree
    Disallow: /restricted/
    More information is available in the Web Server Administrator's Guide to the Robots Exclusion Protocol and in the relevant Google Help Page.
  • 'robots' Meta Tags – This method can be used to block a single web page, although it may not be as reliable as the previous approach. Meta tag elements in a page's HTML <head> section provide information such as author names, keywords and a description, and are not displayed on the visible web page. One meta tag – 'robots' – controls whether or not a web crawler (a) indexes the page, and/or (b) follows the links on the page. The example below blocks both indexing and link-following:

    <meta name="robots" content="noindex,nofollow" />

    …which can be shortened to…

    <meta name="robots" content="none" />

    See the HTML Author's Guide to the Robots META tag or Google's Help Page for more information on 'robots' meta tag options.

Unfortunately, there are some doubts about whether all web crawlers respect either of these blocking methods, although the robots.txt approach has a reputation for being the more reliable. Consequently, for pages where blocking is particularly important, it is probably advisable to use both methods.

Sitemaps.org

'Sitemaps' turn the traditional relationship between search engines and websites on its head: rather than a search engine having to crawl the whole of a website to find new content, the website itself tells the search engine what pages it contains.

A sitemap is an XML file (or collection of XML files) which, in its simplest form, tells search engines what pages exist on a website and when they were last updated. These files are typically rebuilt each night, and compliant search engines can then be 'pinged' (by visiting a specially formed URL) to inform them that the sitemap has been updated.
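
As an illustration, here is a minimal sitemap in the sitemaps.org XML format (the repository URL is hypothetical). The sitemaps.org protocol also defines the 'ping' as a simple HTTP request to a URL of the form <search engine URL>/ping?sitemap=<full URL of the sitemap>.

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <!-- One <url> entry per page; <lastmod> records when it last changed -->
      <url>
        <loc>http://repository.example.ac.uk/items/123/</loc>
        <lastmod>2007-11-01</lastmod>
      </url>
    </urlset>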

To date (November 2007) only DSpace has implemented sitemaps: they are available as an optional feature in version 1.4 and will be included in version 1.5.

References

Find out more from the JISC Digital Repositories infoKit