The three largest search engines (Google, Yahoo!, Microsoft Live) have just simultaneously announced a common standard for the Robots Exclusion Protocol (REP), which governs the content of the "robots.txt" file placed at the root of a site (www.votresite.com/robots.txt) and tells search engine spiders what they must not do on a website, as well as the use of META tags (noindex, nofollow, noarchive, nosnippet, noodp, etc.) for the same purpose.

Here are the various directives, taken from Google's webmaster blog (note, however, that each engine also supports some syntaxes of its own in this area; they are listed on each engine's blog, see the addresses below):

1. Robots.txt Directives

- Disallow
  Impact: Tells a crawler not to index your site; your site's robots.txt file still needs to be crawled to find this directive, but disallowed pages will not be crawled.
  Use cases: 'No Crawl' pages from a site. This directive in the default syntax prevents specific path(s) of a site from being crawled.

- Allow
  Impact: Tells a crawler the specific pages on your site you want indexed, so you can use this in combination with Disallow.
  Use cases: Useful in particular in conjunction with Disallow clauses, where a large section of a site is disallowed except for a small section within it.

- $ Wildcard Support
  Impact: Tells a crawler to match everything from the end of a URL, covering a large number of directories without specifying specific pages.
  Use cases: 'No Crawl' files with specific patterns, for example files with certain filetypes that always have a certain extension, say pdf.

- * Wildcard Support
  Impact: Tells a crawler to match a sequence of characters.
  Use cases: 'No Crawl' URLs with certain patterns, for example disallow URLs with session ids or other extraneous parameters.

- Sitemaps Location
  Impact: Tells a crawler where it can find your Sitemaps.
  Use cases: Point to other locations where feeds exist to help crawlers find URLs on a site.
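To make these directives concrete, here is a minimal robots.txt sketch combining them. The paths, the session-id parameter name and the Sitemap URL are hypothetical placeholders, and each engine may support slightly different wildcard variants (see their respective blogs below):

  # Hypothetical example: block a directory, re-allow one page inside it,
  # block PDF files and URLs carrying a session id, and declare a Sitemap.
  User-agent: *
  Disallow: /private/
  Allow: /private/public-page.html
  Disallow: /*.pdf$
  Disallow: /*sessionid=
  Sitemap: http://www.votresite.com/sitemap.xml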

2. HTML META Directives

- NOINDEX META Tag
  Impact: Tells a crawler not to index a given page.
  Use cases: Don't index the page. This allows pages that are crawled to be kept out of the index.

- NOFOLLOW META Tag
  Impact: Tells a crawler not to follow a link to other content on a given page.
  Use cases: Prevent publicly writeable areas from being abused by spammers looking for link credit. By using NOFOLLOW you let the robot know that you are discounting all outgoing links from this page.

- NOSNIPPET META Tag
  Impact: Tells a crawler not to display snippets in the search results for a given page.
  Use cases: Present no snippet for the page in the search results.

- NOARCHIVE META Tag
  Impact: Tells a search engine not to show a "cached" link for a given page.
  Use cases: Do not make a copy of the page from the search engine cache available to users.

- NOODP META Tag
  Impact: Tells a crawler not to use a title and snippet from the Open Directory Project for a given page.
  Use cases: Do not use the ODP (Open Directory Project) title and snippet for this page.
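As an illustration, here is a minimal sketch of how these directives are typically placed in a page's head via the standard "robots" META tag. The page and the particular combination of values shown are purely hypothetical; engine-specific variants (such as META tags addressed to a particular crawler) are documented on each engine's blog:

  <!-- Hypothetical page: keep it out of the index, do not follow its links,
       and show neither a snippet nor a cached copy in the results. -->
  <html>
  <head>
    <title>Example page</title>
    <meta name="robots" content="noindex, nofollow, nosnippet, noarchive, noodp">
  </head>
  <body>...</body>
  </html>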

[Image: Robot Spider. Image source: R-geek]

http://www.robotstxt.org/

Source(s):
- One Standard Fits All: Robots Exclusion Protocol for Yahoo!, Google and Microsoft (Yahoo!)
- More on Robots Exclusion Protocol (REP) (Google)
- Improving on Robots Exclusion Protocol (Google)
- Robots Exclusion Protocol: Joining Together to Provide Better Documentation (Microsoft)

Related articles on this site:

- Google offers a robots.txt file generator (March 28, 2008)
- Robots.txt files: advantage Google (November 16, 2007)
- Google analyzes your robots.txt file (February 7, 2006)
- Google tests the "Noindex" directive in robots.txt files (November 26, 2007)
- The 4 major engines agree on a new feature of the Sitemaps standard (April 13, 2007)

All pages on the Abondance network for the query robots.txt...