Sometimes, as webmasters, we want to keep individual pages or even entire directories of our website out of search engines. There can be many reasons for that:
– Password protected pages.
– System management pages.
– Private content.
– Duplicate content pages.
– Pages we don’t want indexed for any other reason.
There are two main techniques for blocking bots from crawling and/or indexing our pages: a robots.txt file or the robots meta tag.
Robots.txt
Robots.txt is a file located at the root of our domain (domain.com/robots.txt), and it has a very specific job: telling search bots what they may crawl and what they may not.
The syntax is pretty simple. First we name the bot with a User-agent line, usually set to an asterisk (*), which matches all bots. Then we list a page or directory after the word Allow or Disallow. Here are the most common examples:
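Since the file always sits at the root of the host, its address can be derived from any page URL on the site. The following is a minimal sketch of that idea; the function name robots_txt_url and the sample URL are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page URL, used only as an example.
print(robots_txt_url("https://domain.com/directory/page.html?ref=1"))
```

Whatever the depth of the page, the result points at the root of the same host, which is where bots look for the file.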
Allowing bots to crawl the whole domain:
User-agent: *
Allow: /
Blocking the whole domain:
User-agent: *
Disallow: /
Blocking a specific directory:
User-agent: *
Disallow: /directory/
Blocking a specific page:
User-agent: *
Disallow: /page.html
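Before publishing rules like these, it can help to check how a parser reads them. Here is a minimal sketch using Python's standard urllib.robotparser module. The combined rule set, the bot name "ExampleBot" and the helper build_parser are assumptions made for the example, and real search bots may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Example rules: the directory and page blocks from above, combined.
RULES = """\
User-agent: *
Disallow: /directory/
Disallow: /page.html
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content given as a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(RULES)

# "ExampleBot" is a hypothetical bot name; it falls under "User-agent: *".
for path in ("/", "/directory/file.html", "/page.html", "/other.html"):
    allowed = parser.can_fetch("ExampleBot", "https://domain.com" + path)
    print(path, "allowed" if allowed else "blocked")
```

Feeding the rules in as a string keeps the check offline, so a draft file can be tested before it is uploaded to the server.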
Meta Tag Robots
Another way to keep search bots from indexing our pages is to insert this tag into the page source code, inside the head section. Because it is inserted into each page individually, this method cannot block an entire domain or directory in one step.
The robots meta tag has two main directives:
noindex- When we set this, the bot can crawl the page but may not index it, unlike robots.txt, where the bot can’t enter the page at all. This means the bot can still learn what’s on the page and follow its links.
nofollow- When we set this, every link on the page is treated as a nofollow link. We rarely use it at the page level; more often we tag specific links with rel="nofollow" instead.
The syntax of the robots meta tag, placed inside the head section (between <head> and </head>):
noindex+nofollow:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
noindex:
<META NAME="ROBOTS" CONTENT="NOINDEX">
nofollow:
<META NAME="ROBOTS" CONTENT="NOFOLLOW">
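To confirm that a page really carries the tag, its HTML can be scanned for robots directives. Below is a minimal sketch using Python's standard html.parser module. The class RobotsMetaParser, the helper robots_directives and the sample page are made up for illustration, and the sketch only looks at meta tags whose name is "robots".

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <META NAME=...>
        # arrives here as tag "meta" with attribute "name".
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for item in attr_map.get("content", "").split(","):
            if item.strip():
                self.directives.add(item.strip().lower())


def robots_directives(html: str) -> set:
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page source, used only as an example.
PAGE = (
    '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">'
    "</head><body></body></html>"
)
print(sorted(robots_directives(PAGE)))
```

An empty result means the page has no robots meta tag, so bots fall back to their default behavior of indexing the page and following its links.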