What’s Important for a Robot

Designing a website nowadays always includes optimizing it for search engines. Otherwise you might have designed a brilliant website, but nobody will be able to find it! Ideally, your site will appear in the top 10 search results, i.e. on the first page. This blog series by Marcus Günther and Oliver Schmidt describes how to attain this goal. The first lesson is to master the art of being crawled by a search engine robot.

Search engine robots are programs that automatically browse websites on the Internet and download new or modified webpages for later indexing by the search engines. Other terms for search engine robots are bots, crawlers, spiders, etc., and the task itself is often called crawling or spidering. In order to get a robot to spider your website, you either need links from websites already known to the search engine, or you need to submit your website to the search engines, which provide special pages for that.
Before robots start crawling your website, all well-behaved robots (e.g. from Bing, Google, Yahoo, etc.) check for the existence and contents of the file /robots.txt.
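
Seen from the crawler’s side, that check can look like the following minimal sketch, which uses only Python’s standard library; the domain www.example.org, the sample paths and the user agent name “MyCrawler/1.0” are just placeholders:

# Sketch: how a well-behaved crawler consults robots.txt before fetching pages.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.org/robots.txt")
rp.read()  # download and parse /robots.txt

for path in ("/", "/errors/404.html"):
    url = "http://www.example.org" + path
    if rp.can_fetch("MyCrawler/1.0", url):
        print("allowed to crawl:", url)
    else:
        print("must skip:", url)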

Guiding Robots with robots.txt

The robots.txt file is a plain text file in which you can specify which content of your website may or may not be indexed by search engine robots. The file’s structure is standardized by the Robots Exclusion Protocol (REP), also known as the Robots Exclusion Standard.

You can include or exclude single files or folders as well as file and directory name patterns. For some search engine robots, you can even create a rule to exclude or include URLs which contain specific characters; for example, a rule like “Disallow: /*?” excludes all URLs that contain a question mark (“?”).

However, it’s not absolutely necessary to have a robots.txt file. If your website is well-organized, chances are very high that bots will crawl all your content successfully even without that file.

But what if you do not want all of your pages to be indexed? Maybe you provide pages which are useful for visitors but do not need to be indexed at all, e.g. error pages. These pages provide no further information about you, your services or the content of your website, and they might even look strange on result pages, so it’s better to exclude them from indexing with a record like the following in robots.txt:

User-Agent: *
Disallow: /errors

Another scenario you might want to anticipate: Someone copying your website using tools like wget, mechanize, etc. With a robots.txt file, you can make it harder for them (but beware, these tools have options for ignoring robots.txt):

User-Agent: Wget
User-Agent: Wget/1.6
User-Agent: Wget/1.5.3
Disallow: /

There are more ways to tell search engine robots not to crawl specific pages; they will be covered in the next part of this blog series.

However, the robots.txt file is only a hint for robots. If some robots are not well-behaved, ignore the Robots Exclusion Standard and try to spider your website anyway, your robots.txt file will not stop them. To completely exclude such bots, you have to know their “user agent string” and instruct the webserver not to serve any content to them.
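
As a rough illustration of that idea, here is a minimal sketch (Python standard library only, with “badBot” as a placeholder name) of a tiny WSGI application that refuses requests based on the User-Agent header; in practice you would usually configure such a block directly in your web server:

# Sketch: refuse to serve content to requests whose User-Agent contains a
# blocked string. "badbot" is a placeholder, matched case-insensitively.
from wsgiref.simple_server import make_server

BLOCKED_AGENTS = ("badbot",)

def app(environ, start_response):
    agent = environ.get("HTTP_USER_AGENT", "").lower()
    if any(bad in agent for bad in BLOCKED_AGENTS):
        # Known bad bot: send an empty-handed refusal instead of content.
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"Forbidden"]
    start_response("200 OK", [("Content-Type", "text/html")])
    return [b"<html><body>Welcome, friendly visitor!</body></html>"]

if __name__ == "__main__":
    make_server("", 8000, app).serve_forever()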

Nevertheless, you can provide the robots.txt file for all “friendly” search engine robots to specify how they should index your website. That makes their life and yours easier. Here’s another example of a short robots.txt file:

# robots.txt file for www.example.org
Sitemap: http://www.example.org/sitemap.xml

# First part – bad bots
User-Agent: badBot
User-Agent: badBot/2.1
Disallow: /

# Second part – Google image bot
User-Agent: googlebot-image
Disallow: /errors
Disallow: /contact
Disallow: /disclaimers
Crawl-delay: 180

# Third part – all other bots
User-Agent: *
Disallow: /errors/
Disallow: /*.pdf$

In the above example, we used some non-standard directives, such as the “$” sign, which marks the end of a URL. Google, Bing and Yahoo! can interpret these useful extensions, however.

The first part of the example forbids a bot called “badBot” (as well as another version of that bot) access to all files and directories on the domain http://www.example.org/.

The second part addresses the user agent “googlebot-image”, which is Google’s robot for the image search: the example disallows access to some directories where there are no relevant images. The “Crawl-delay” directive asks the bot to wait 180 seconds between successive requests (note that not all search engines honor this directive).

The third part of the example forbids all other bots access to everything under the /errors/ directory and to all PDF documents on the domain.

At the beginning of the example, there is a statement which has not been discussed yet in this article: with the Sitemap statement, you can tell search engine robots where to find your sitemap.xml file. The following section describes what the sitemap.xml is all about.

Describing Your Site’s Structure

The sitemap.xml file is simply a file which resides on your website and lists links to all of the assets that robots should spider and index. That’s important because some of your pages might only be reachable via Flash or JavaScript links, which can easily be missed by a crawler.

Below is a simple example of a sitemap.xml:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://www.nonexistent.com/</loc>
      <lastmod>2010-01-01</lastmod>
      <changefreq>weekly</changefreq>
      <priority>1.0</priority>
   </url>
   <url>
      <loc>http://www.nonexistent.com/content/</loc>
      <changefreq>daily</changefreq>
   </url>
</urlset>

For every page that you want to be indexed, you need to add a <url> section. Within that section, you must specify the URL of the page within the <loc></loc> tags. You may add optional information, like

  • the date of last modification <lastmod>,
  • the change frequency of that page <changefreq> or
  • the priority of that page relative to all other pages within your website <priority>.

As the specification for sitemap.xml is quite complicated and offers far more possibilities than, e.g., robots.txt, we only explain the most important tags here; for more information, please read the specification.

You might guess that creating a sitemap.xml file is not a difficult but a time-consuming task. Google, who invented the sitemap.xml standard, helps you out with Google’s Sitemap Generator, a module for the Apache web server that can parse Apache’s log files, scan the website root directory and filter URLs in order to create a complete sitemap.xml specific to your website. Additionally, it can automatically submit the newest version of your sitemap to Google, Bing, Yahoo! and Ask.

However, if you have any chance of creating that file on your own, that is the better alternative, as all automated tools have a hard time distinguishing between real content and pages which should not be indexed, e.g. search result pages.
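
If you do build the file yourself, a small script is often all you need. The following is a minimal sketch using Python’s standard library; the page list and the domain www.example.org are merely illustrative:

# Sketch: write a simple sitemap.xml from a hand-maintained list of pages.
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
pages = [
    {"loc": "http://www.example.org/", "changefreq": "weekly", "priority": "1.0"},
    {"loc": "http://www.example.org/content/", "changefreq": "daily"},
]

urlset = ET.Element("urlset", xmlns=NS)
for page in pages:
    url = ET.SubElement(urlset, "url")
    for tag, value in page.items():
        ET.SubElement(url, tag).text = value

ET.ElementTree(urlset).write("sitemap.xml", encoding="UTF-8",
                             xml_declaration=True)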

[Figure: The Google Sitemap Generator module for the Apache web server]

Most blog/CMS frameworks as well as enterprise portals also generate the sitemap.xml file automatically as you add more and more pages.

Once you have created and adjusted the sitemap.xml for your website, there are several ways to publish it. You can

  1. store it anywhere on your webserver and provide its exact location in the robots.txt file (see above),
  2. submit it in the search engine provider’s webmaster tools – such tools exist for several providers, for example Google, Bing and Yahoo!, or
  3. submit it via your search engine provider’s ping service (a small Python sketch follows this list):
    1. Bing sitemap ping service http://www.bing.com/webmaster/ping.aspx?siteMap=http%3A%2F%2Fwww.yoursite.com%2Fsitemap.gz
    2. Yahoo! Site Explorer
    3. Google sitemap ping service for re-submission http://www.google.com/webmasters/tools/ping?sitemap=http%3A%2F%2Fwww.yoursite.com%2Fsitemap.gz
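
As a minimal sketch (assuming the Google ping URL quoted above is available to you; www.yoursite.com is a placeholder), such a ping can be sent with a few lines of Python:

# Sketch: notify a search engine of the sitemap location via its ping URL.
from urllib.parse import quote
from urllib.request import urlopen

sitemap_url = "http://www.yoursite.com/sitemap.xml"
ping_url = ("http://www.google.com/webmasters/tools/ping?sitemap="
            + quote(sitemap_url, safe=""))

with urlopen(ping_url) as response:
    print(response.status)  # 200 means the ping was received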

Sitemap files can get very large; therefore most search engines support adding more than one sitemap.xml. Another useful case for multiple sitemaps is websites which use different software for different sections (like shop, CMS, blog, etc.).

In order to use multiple sitemaps, you have to submit a sitemap index file to your search engine provider and refer to the individual sitemaps from there. Google supports adding more than one sitemap.xml in their Webmaster Central to make that a bit easier.
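
The index file itself is a small XML document that lists the locations of the individual sitemaps. Below is a minimal sketch that writes one using Python’s standard library; the two sitemap URLs are placeholders:

# Sketch: write a sitemap index file that points to several sitemaps.
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
sitemaps = [
    "http://www.example.org/sitemap-shop.xml",
    "http://www.example.org/sitemap-blog.xml",
]

index = ET.Element("sitemapindex", xmlns=NS)
for loc in sitemaps:
    ET.SubElement(ET.SubElement(index, "sitemap"), "loc").text = loc

ET.ElementTree(index).write("sitemap_index.xml", encoding="UTF-8",
                            xml_declaration=True)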

[Figure: Submitting a sitemap.xml file to Bing Webmaster Tools]

The next articles in this series will cover how to structure your website and your URLs, and everything about keywords: how to find them, which kinds to focus on, where to place them, and so on.

2 Responses to “What’s Important for a Robot”

  1. Jenn@FFP says:

    I’m trying to figure out the best way to block bots from crawling paid text links throughout my blog (to remain in compliance w/ Google’s TOS). I always forget to use the rel="nofollow" attribute and thought maybe blocking the bot was the way to go, but after reading this, I may have to add the link attribute, since blocking the bot would exclude the entire article in that category instead of just the outbound link. However, it does give me a few ideas for other pages where I should use this. Thanks for the great info.

    • Oliver Schmidt says:

      Hey Jenn,
      you’re welcome! Thanks to Google, the rel="nofollow" attribute is a great help in preventing robots from following (unwanted) outbound links.

      I also use other microformats like hcard, hcalendar or rel-home to add semantic value to my pages.

      Cheers
      Oliver