Friday, October 18, 2013

How to disallow crawling on certain parts of your website

Have you ever searched your own website on Google or another search engine and found that pages like your calendar, admin page or other pages are appearing in the search results? You usually do not want search engines to crawl and index every single page. The reasons can be many, like security, or simply that certain pages are not meant to be found through search. There is a way to fix this and stop Google or other search engines from crawling those pages, or in simple terms, to "disallow crawling".




All that you need to do is update your robots.txt file. You can view this file by typing "yourwebsite.com/robots.txt" in the URL field, and if you want to edit it, log in to your hosting server and you will find the robots.txt file there. If you are using WordPress, then by default your robots.txt file will look something like this:



User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/

As you can see, the admin and includes directories are blocked here. Similarly, you can disallow crawling of any other page by adding it to the list. The reason you need to enter these details in robots.txt is that Googlebot fetches this file before the crawl begins, to check if there is anything that should be excluded from crawling. If Googlebot knows your site has a robots.txt file but cannot reach or read it, that is, the request returns neither a 200 nor a 404 HTTP status code, then Google will delay the crawl and wait until the robots.txt file becomes readable again.
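
To make that status-code behaviour concrete, here is a rough sketch in Python of how a polite crawler might decide what to do. The yourwebsite.com address is just a placeholder, and this is only an illustration of the rules described above, not Google's actual implementation:

import urllib.error
import urllib.request
from urllib import robotparser

ROBOTS_URL = "https://yourwebsite.com/robots.txt"  # placeholder domain

def load_rules(robots_url):
    parser = robotparser.RobotFileParser()
    try:
        with urllib.request.urlopen(robots_url) as response:
            # 200: the file was read fine, so obey whatever it says.
            parser.parse(response.read().decode("utf-8").splitlines())
            return parser
    except urllib.error.HTTPError as err:
        if err.code == 404:
            # 404: no robots.txt, so nothing is excluded from crawling.
            parser.parse([])
            return parser
        # Any other status (e.g. a 5xx server error): the file exists
        # but cannot be read, so postpone the crawl.
        return None
    except urllib.error.URLError:
        # Server unreachable: also postpone.
        return None

rules = load_rules(ROBOTS_URL)
if rules is None:
    print("robots.txt unreachable, postponing the crawl")
else:
    print(rules.can_fetch("Googlebot", "https://yourwebsite.com/wp-admin/"))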


It is not mandatory to include a robots.txt file. But if you want search engines to skip some of your content, then you must include one. If you want everything on your website to be indexed, you can leave the robots.txt file empty or have no file at all. Googlebot will request this file; if it is missing, the server returns a 404 HTTP status code, and if it is empty, there are simply no rules to apply. In either case Googlebot treats nothing as excluded and continues crawling your whole website.


If you want Googlebot to crawl your entire website, this is what your robots.txt will look like:



User-agent: *
Disallow:

Alternatively, you don't need to include a robots.txt file at all, or you can use:



User-agent: *
Allow: /

To disallow crawling of your whole website, use this:



User-agent: *
Disallow: /

And if you want to disallow crawling of only some parts, like the admin or calendar pages, then add each one under its own Disallow line as shown:



User-agent: *
Disallow: /wp-admin/
Disallow: /calendar/
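
Before uploading your file, you can sanity-check the rules with Python's built-in robots.txt parser. A minimal sketch, assuming the example rules above and a placeholder domain:

from urllib import robotparser

RULES = """User-agent: *
Disallow: /wp-admin/
Disallow: /calendar/"""

parser = robotparser.RobotFileParser()
parser.parse(RULES.splitlines())

# /wp-admin/ and /calendar/ should be skipped, everything else crawled.
for path in ("/", "/blog/some-post/", "/wp-admin/", "/calendar/2013/10/"):
    allowed = parser.can_fetch("Googlebot", "https://yourwebsite.com" + path)
    print(path, "->", "crawl" if allowed else "skip")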

Note that this robots.txt file can be viewed by anyone. So if you have private content you want hidden, this is not the place to list it; use proper authentication instead, or else the location of your private content will be public. Also, adding pages to the robots.txt file will disallow crawling, but it may not always stop Google from indexing them: if a blocked page is linked from elsewhere, it can still appear in search results. To keep a page out of the index entirely, allow it to be crawled and add a robots meta tag like <meta name="robots" content="noindex"> to the page itself.
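
To see just how public the file is: anyone with Python installed can dump your rules in a couple of lines (again with a placeholder domain):

import urllib.request

# Anyone can read your robots.txt, so never list secret URLs in it.
url = "https://yourwebsite.com/robots.txt"
print(urllib.request.urlopen(url).read().decode("utf-8"))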
