There are two main reasons why you need to tell search engine robots not to crawl certain sections of your website's directories and files.
Besides, when a search engine robot crawls your website, drilling down into every directory, sub-directory and file, it consumes bandwidth. Bandwidth is money, because you pay your hosting provider for the bandwidth you consume. Naturally, you would like this precious bandwidth to be available for your site visitors rather than for search engine robots.
Hence, the need to tell search engines what not to crawl.
Fortunately, the world wide web provides a Robots Exclusion Protocol (REP) for communicating this to search engines. When an REP-compliant robot (most search engine robots are) visits your website, it first looks for a robots.txt file in your website's home directory. If this file exists, the robot reads through all the instructions it contains and accordingly decides what not to crawl. So it is in this file that you tell search engines not to crawl certain sections of your website's directory and file structure.
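You can see this decision process for yourself with Python's standard-library robots.txt parser, which applies the same REP rules a compliant crawler would. The robots.txt content and URLs below are illustrative, not taken from any real site:

```python
from urllib import robotparser

# A hypothetical robots.txt, given here as a string instead of
# being fetched from a live site.
rules = """
User-agent: *
Disallow: /private/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# The parser answers the same question a crawler asks:
# "Am I allowed to fetch this URL?"
print(rp.can_fetch("*", "http://www.example.com/private/data.html"))  # False
print(rp.can_fetch("*", "http://www.example.com/index.html"))         # True
```

A disallowed path blocks everything beneath it, while all other URLs remain crawlable.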
Below is a sample robots.txt file.
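A file along these lines, which excludes four directories and one specific file (the directory and file names here are illustrative), might look like this:

```
User-agent: *
Sitemap: http://www.how2lab.com/sitemap.xml

Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /admin/
Disallow: /private/
Disallow: /doc/internal-notes.html
```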
In the above example, all files in four directories, and additionally one specific file, are excluded. Note that you need a separate Disallow line for every directory or specific file that you want excluded. Also, you cannot use regular expressions to specify multiple directories; an entry like Disallow: /doc/*.pvt would be invalid. The only place where a * character is allowed is the User-agent field, where it indicates all robots.
When a search engine robot finds a robots.txt file like the one above, it will crawl through all directories and files except the ones listed against the Disallow fields.
A point of caution: The robots.txt file is publicly viewable, thereby exposing the internal structure of your website directories and files. Hence, you should organize your directories and files in such a manner that you do not have to expose too much. See below for a suggested structure.
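One way to limit this exposure (the directory names below are illustrative) is to group everything you want hidden under a single parent directory:

```
/                     website root
├── index.html
├── images/
├── articles/         crawlable content
└── internal/         single excluded parent
    ├── admin/
    ├── drafts/
    └── reports/
```

With a layout like this, a single Disallow: /internal/ line covers all the sensitive sections, and your robots.txt reveals only one directory name instead of your whole internal structure.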
If you do not wish to exercise any restrictions on search engine robots and want to allow access to your entire website, you can do one of the following:
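For instance, you can create a robots.txt file with an empty Disallow field, which means nothing is excluded:

```
User-agent: *
Disallow:
```

Creating a completely empty robots.txt file, or having no robots.txt file at all, has the same effect: compliant robots will crawl everything.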
Disallow all search engines completely
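To block every compliant robot from the entire site, use a single slash in the Disallow field; the slash matches every path on the site:

```
User-agent: *
Disallow: /
```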
Disallow all search engines completely, and only allow Google to crawl your website
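Since a robot obeys the most specific User-agent record that matches it, you can give Google's crawler (Googlebot) its own permissive record while blocking everyone else:

```
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
```

Googlebot matches the first record, whose empty Disallow permits everything; all other robots fall through to the * record and are excluded from the whole site.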
You may have noticed that the sample robots.txt above contains a sitemap entry: Sitemap: http://www.how2lab.com/sitemap.xml. This entry lets you tell search engines that you have an XML sitemap prepared especially for them, and where it is located. (Read more about How to build a search engine friendly sitemap.)
Note that this directive is independent of the User-agent line, hence it does not matter where you place it in your robots.txt file.
CEO, Computer Solutions
Rajeev Kumar is the primary author of How2Lab. He is a B.Tech. from IIT Kanpur with several years of experience in IT education and software development. He has taught a wide spectrum of people, including fresh young talents, students of XLRI, industry professionals, and government officials.
Rajeev founded Computer Solutions & WebServicesWorldwide.com, and has hands-on experience building a variety of web applications and portals, including SaaS-based ERP and e-commerce systems, independent B2B, B2C, matrimonial and job portals, and many more.