Every web site you publish should have a robots.txt file in its root folder, to suggest to search engines which folders you do not want scanned or indexed.
You can create this file in any text editor (but not a word processor; it must be plain text).
Comments start with #, so the first line is for your own information: it records which web site the file is for. A comment can also follow any directive.
The Sitemap line (it can appear anywhere in the file, but I put it at the top) tells search engines the full URL of your site map.
The “User-agent: *” line means “the following rules apply to all user agents.” (A User-agent is normally a browser, e.g. Firefox, Chrome, or IE, but search engine crawlers also identify themselves to web servers with a User-agent string.)
# robots.txt for yoursite.com

Sitemap: http://www.yoursite.com/sitemap.xml

User-agent: *
# all search engines should ignore the folders Disallowed below
# Maybe have one folder for site images that shouldn't be indexed,
# another for photographs you want indexed by search engines
Disallow: /images/
Allow: /photos/
# Here are more examples, use the folders that apply to your site:
Disallow: /cgi-bin/
Disallow: /freecontactform/
Disallow: /junk/
Disallow: /newsfeeds/
Disallow: /oldboring/
You can exclude individual files this way (but it is probably simpler for you to put such files into a separate directory):
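A sketch of what that looks like (these file names are made up; substitute your own):

```
User-agent: *
Disallow: /oldpricelist.html
Disallow: /draft-article.html
```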
If you have WordPress on your site, include these lines:
# WordPress
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Disallow: /category/*/*
Disallow: /comments
Disallow: */comments
Disallow: /feed
Disallow: */feed
Disallow: /manager
Disallow: /trackback
Disallow: */trackback
Allow: /wp-content/uploads
You can give a specific search engine different permissions if you want. (I don’t; they are all commented out in the example below.)
Don’t use Allow: /*
(I actually saw this on a robots.txt teaching site!) Google correctly reads an Allow: /* in a bot-specific section as overriding the Disallow folders set for “any search engine” (the User-agent: * line above): in effect, “everybody but me should skip /youtubevideosite/ and the rest.” Google then crawls those folders and reports hundreds of errors, because your YouTube, Amazon, and similar site widgets contain links it cannot find on your site. (Of course not; those links are on YouTube, on Amazon, or on the comment writer’s own site.)
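Concretely, the mistake looks like this sketch (the folder name is just an example):

```
User-agent: *
Disallow: /youtubevideosite/

# Don't do this:
User-agent: Googlebot
Allow: /*
# Googlebot now ignores the Disallow above and crawls /youtubevideosite/
```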
User-agent: Googlebot
# Disallow:
# Allow: /images/

# Google Images
User-agent: Googlebot-Image
# Disallow:
# Allow: /images/

# Google AdSense
User-agent: Mediapartners-Google*
# Disallow:
# Allow: /ads/

# digg mirror
User-agent: duggmirror
Disallow: /
See Google’s test page to make sure Google is reading your robots.txt the way you expect: https://www.google.com/webmasters/tools/ (under Crawl, click on Blocked URLs).
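You can also sanity-check your rules locally before uploading. A quick sketch using Python’s standard urllib.robotparser (the rules and file paths below are placeholders, not from a real site):

```python
from urllib.robotparser import RobotFileParser

# Parse rules directly from a list of lines, as if read from robots.txt
rules = [
    "User-agent: *",
    "Disallow: /images/",
    "Allow: /photos/",
]
rp = RobotFileParser()
rp.parse(rules)

# can_fetch(user_agent, path) -> True if that agent may crawl the path
print(rp.can_fetch("*", "/images/logo.png"))  # blocked by the Disallow line
print(rp.can_fetch("*", "/photos/cat.jpg"))   # permitted by the Allow line
```

This checks the basic directives only; engine-specific extensions may be interpreted differently by the engines themselves.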
Robots.txt does not provide ANY security. Use .htaccess for that. Robots.txt is a suggestion that well-behaved search bots, e.g. Google or Bing, not bother indexing certain files. Do NOT mention private areas in robots.txt; you have just told bad bots where to go looking! And don’t bother filling your robots.txt with lines like the following, since bad bots don’t even read your robots.txt:
# waste of time, bad bots ignore it
User-agent: Flaming AttackBot
Disallow: /
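For actual protection, here is a sketch of an Apache .htaccess that password-protects a folder (the AuthUserFile path and the realm name are examples; adjust them for your server, and create the .htpasswd file with the htpasswd utility):

```
# .htaccess placed inside the folder you want protected
AuthType Basic
AuthName "Private area"
AuthUserFile /home/yoursite/.htpasswd
Require valid-user
```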
There is no way in robots.txt to say “index everything except” a single file. The original robots.txt standard has no wild cards in file paths at all (the major engines such as Google and Bing do support * and $ as extensions, which is why the WordPress lines above work with them, but you cannot count on every bot honoring those). Best is to put the files you prefer search engines ignore into separate directories.
There is a way to specify inside an HTML page that search engines shouldn’t bother indexing it. However, this is only seen once the search engine has loaded the page (robots.txt is generally read by search engines before loading any pages in that folder). Simply put a “noindex” meta tag in the <head> section of the page:
<meta name="robots" content="NOINDEX, NOFOLLOW, NOARCHIVE" />. I use this on the “thank you” page after a form was submitted, and on my site error pages.
Here is my robots.txt in one piece:
# robots.txt for lernerconsult.com

Sitemap: http://www.lernerconsult.com/sitemap.xml

User-agent: *
Disallow: /htm/
Disallow: /lists/
Disallow: /sitespecific/
Disallow: /images/
Disallow: /cgi-bin/
Disallow: /cgi-bin/apf4/
Disallow: /ddtabmenufiles/
Disallow: /feedforall/
Disallow: /newsfeeds/
Disallow: /utils/
Disallow: /shared/
Disallow: /shared/youtubevideosite/
Disallow: /socialbookmarkscript/
Disallow: /styles/
Disallow: /youtubevideosite/

# WordPress
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Disallow: /category/*/*
Disallow: /comments
Disallow: */comments
Disallow: /feed
Disallow: */feed
Disallow: /manager
Disallow: /trackback
Disallow: */trackback
Allow: /wp-content/uploads

# See test page https://www.google.com/webmasters/tools/ then under Crawl click on Blocked URLs

User-agent: Googlebot
# Disallow:
# Allow: /googlefolder/

# Google Image
User-agent: Googlebot-Image
# Disallow:
# Allow: /googlefolder/

# Google AdSense
User-agent: Mediapartners-Google*
# Disallow:
# Allow: /googlefolder/

# digg mirror
User-agent: duggmirror
Disallow: /