Every web site you publish should have a robots.txt file in its root folder, to suggest to search engines which folders you do not want scanned or indexed.
You can create this file in any text editor (but not a word processor; it must be plain text).
Comments start with #, so the first line is for your own information: it records which web site the file is for. A comment can also follow any directive.
The Sitemap line (it can appear anywhere in the file, but I put it at the top) tells search engines the full URL of your site map.
The “User-agent: *” line means “the following rules apply to all user agents.” (A User-agent is normally a browser, e.g. Firefox, Chrome, or IE, but search engine crawlers also identify themselves to web servers with a User-agent string.)
# robots.txt for yoursite.com

Sitemap: http://www.yoursite.com/sitemap.xml

User-agent: *
# all search engines should ignore the folders Disallowed below
# Maybe have one folder for site images that shouldn't be indexed,
# another for photographs you want indexed by search engines
Disallow: /images/
Allow: /photos/
# Here are more examples, use the folders that apply to your site:
Disallow: /cgi-bin/
Disallow: /freecontactform/
Disallow: /junk/
Disallow: /newsfeeds/
Disallow: /oldboring/
You can exclude individual files this way (but it is probably simpler for you to put such files into a separate directory):
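A sketch of what that looks like (these file names are made up; substitute your own):

```
User-agent: *
Disallow: /oldpricelist.html
Disallow: /draft-article.html
```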
If you have WordPress on your site, include these lines:
# WordPress
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Disallow: /category/*/*
Disallow: /comments
Disallow: */comments
Disallow: /feed
Disallow: */feed
Disallow: /manager
Disallow: /trackback
Disallow: */trackback
Allow: /wp-content/uploads
You can give a specific search engine different permissions if you want. (I don’t; they are all commented out in the example below.)
Don’t use Allow: /*
(I actually saw this on a robots.txt teaching site!) Google correctly reads an Allow: /* in a bot-specific section as overriding the Disallow folders set for “any search engine” (the User-agent: * line above): in effect, “everybody but me should skip /youtubevideosite/ and the rest.” Google then crawls those folders and reports hundreds of errors, because your YouTube, Amazon, and similar site widgets contain links it cannot find on your site. (Of course not; those links are on YouTube, on Amazon, or on the comment writer’s own site.)
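Concretely, the mistake looks like this sketch (the folder name is just an example):

```
User-agent: *
Disallow: /youtubevideosite/

# Don't do this:
User-agent: Googlebot
Allow: /*
# Googlebot now ignores the Disallow above and crawls /youtubevideosite/
```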
User-agent: Googlebot
# Disallow:
# Allow: /images/

# Google Images
User-agent: Googlebot-Image
# Disallow:
# Allow: /images/

# Google AdSense
User-agent: Mediapartners-Google*
# Disallow:
# Allow: /ads/

# digg mirror
User-agent: duggmirror
Disallow: /
See Google’s test page to make sure Google is reading your robots.txt the way you expect: https://www.google.com/webmasters/tools/ (under Crawl, click on Blocked URLs).
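You can also sanity-check your rules locally before uploading. A quick sketch using Python’s standard urllib.robotparser (the rules and file paths below are placeholders, not from a real site):

```python
from urllib.robotparser import RobotFileParser

# Parse rules directly from a list of lines, as if read from robots.txt
rules = [
    "User-agent: *",
    "Disallow: /images/",
    "Allow: /photos/",
]
rp = RobotFileParser()
rp.parse(rules)

# can_fetch(user_agent, path) -> True if that agent may crawl the path
print(rp.can_fetch("*", "/images/logo.png"))  # blocked by the Disallow line
print(rp.can_fetch("*", "/photos/cat.jpg"))   # permitted by the Allow line
```

This checks the basic directives only; engine-specific extensions may be interpreted differently by the engines themselves.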
Robots.txt does not provide ANY security. Use .htaccess for that. Robots.txt is a suggestion that well-behaved search bots, e.g. Google or Bing, not bother indexing certain files. Do NOT mention private areas in robots.txt; you have just told bad bots where to go looking! And don’t bother filling your robots.txt with lines like the following, since bad bots don’t even read your robots.txt:
# waste of time, bad bots ignore it
User-agent: Flaming AttackBot
Disallow: /
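For actual protection, here is a sketch of an Apache .htaccess that password-protects a folder (the AuthUserFile path and the realm name are examples; adjust them for your server, and create the .htpasswd file with the htpasswd utility):

```
# .htaccess placed inside the folder you want protected
AuthType Basic
AuthName "Private area"
AuthUserFile /home/yoursite/.htpasswd
Require valid-user
```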
There is no way in robots.txt to say “index everything except” a single file. The original robots.txt standard has no wild cards in file paths at all (the major engines such as Google and Bing do support * and $ as extensions, which is why the WordPress lines above work with them, but you cannot count on every bot honoring those). Best is to put the files you prefer search engines ignore into separate directories.
There is a way to specify inside an HTML page that search engines shouldn’t bother indexing it. However, this is only seen once the search engine has loaded the page (robots.txt is generally read by search engines before loading any pages in that folder). Simply put a “noindex” meta tag in the <head> section of the page:
<meta name="robots" content="NOINDEX, NOFOLLOW, NOARCHIVE" />. I use this on the “thank you” page after a form was submitted, and on my site error pages.
Here is my robots.txt in one piece:
# robots.txt for lernerconsult.com

Sitemap: http://www.lernerconsult.com/sitemap.xml

User-agent: *
Disallow: /htm/
Disallow: /lists/
Disallow: /sitespecific/
Disallow: /images/
Disallow: /cgi-bin/
Disallow: /cgi-bin/apf4/
Disallow: /ddtabmenufiles/
Disallow: /feedforall/
Disallow: /newsfeeds/
Disallow: /utils/
Disallow: /shared/
Disallow: /shared/youtubevideosite/
Disallow: /socialbookmarkscript/
Disallow: /styles/
Disallow: /youtubevideosite/

# WordPress
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Disallow: /category/*/*
Disallow: /comments
Disallow: */comments
Disallow: /feed
Disallow: */feed
Disallow: /manager
Disallow: /trackback
Disallow: */trackback
Allow: /wp-content/uploads

# See test page https://www.google.com/webmasters/tools/ then under Crawl click on Blocked URLs

User-agent: Googlebot
# Disallow:
# Allow: /googlefolder/

# Google Image
User-agent: Googlebot-Image
# Disallow:
# Allow: /googlefolder/

# Google AdSense
User-agent: Mediapartners-Google*
# Disallow:
# Allow: /googlefolder/

# digg mirror
User-agent: duggmirror
Disallow: /