One of the concerns facing most website owners today is how to make search engines find all the pages of their website. Search engines use fairly smart spiders that can crawl through your entire website and extract all its links. However, for large websites consisting of several hundred or perhaps a few thousand web pages, search engines might miss some deeper-level pages, especially if those pages are linked only from inner pages and have no place in your main navigation menu tree.
Hence, it is always a good idea to present your entire list of links to search engines in an easy way, so that the search engine can find all your links in one place. Of course, there is more to a sitemap than merely presenting a list of links, as you will learn as you read on.
What is a sitemap?
A sitemap for a website is analogous to the index page of a book. Normally, when you build your website you would provide an easy-to-navigate multi-level menu bar at the top, so that visitors can quickly find what they are looking for and jump to that page by clicking on the appropriate link in your menu tree.
You may ask: if I have already created a multi-level menu tree for ease of navigation, why do I need a second index in the form of a sitemap? The answer is that while your menu tree is useful for your human site visitors, a sitemap file is more meaningful for search engine crawlers.
Normally, a sitemap would just be a single file containing your entire list of links along with other meaningful information for the crawler. Naturally, this file must be written in a program-friendly format, and that format is XML. The file is conventionally named sitemap.xml (all lower case), although the sitemap protocol does not mandate a particular name. Nearly all search engine crawlers today support the XML sitemap format, so one file serves all search engines.
Note that providing a sitemap XML file does not guarantee that search engines will index all your pages. Ultimately, it is the prerogative of the crawler to decide which pages to ignore, based upon several other factors that fall under the subject of SEO.
Purposes that a sitemap file serves:
Lists all links of your website providing the crawler with absolute urls of all your pages.
Tells the crawler when the page was last updated.
Tells the crawler how frequently the page content is likely to change.
Tells the crawler how important or relevant the link is with respect to the other listed links.
Before we delve into this further, let us first take a look at a typical sitemap file. Check the sitemap.xml file for this website to get an idea of a real sitemap file.
Below is an example of a very basic sitemap file with just 3 links. Note that the file must be saved in UTF-8 encoding.
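A minimal sketch of what such a three-link sitemap might look like is given here, held as a string inside a short Python script that also confirms the file is well-formed XML. The example.com urls, the dates, and the changefreq and priority values are placeholders assumed for illustration, and list_locations is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Illustrative sitemap with three links. Every url, date and value below is
# a placeholder chosen for this sketch, not taken from a real website.
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc><![CDATA[http://www.example.com/]]></loc>
    <lastmod>2024-01-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc><![CDATA[http://www.example.com/about.html]]></loc>
    <lastmod>2023-11-02</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc><![CDATA[http://www.example.com/products.php?cat=1&page=2]]></loc>
    <lastmod>2024-02-20</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>
"""

# Elements of a sitemap live in this XML namespace.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def list_locations(xml_text):
    """Parse sitemap text and return the url held in each <loc> element."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]


if __name__ == "__main__":
    for url in list_locations(SITEMAP_XML):
        print(url)
```

If the XML were malformed, ET.fromstring would raise a ParseError, so running the script is a quick sanity check before uploading the file.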
Now let us explain the tags
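The sitemap.xml file must begin with an opening <urlset> tag and end with </urlset>. The opening tag must also declare the sitemap protocol namespace, xmlns="http://www.sitemaps.org/schemas/sitemap/0.9".
Each link entry begins with a <url> tag and ends with </url>.
Within the <url> </url> tags you will have the following child entries, of which only <loc> is required: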
<loc> </loc> : This will encapsulate the absolute url of the link, i.e. the full url starting with the protocol (http:// or https://). The url string must be shorter than 2048 characters.
<lastmod> </lastmod> : This will encapsulate the link file's last modified date in W3C Datetime format. The simplest form is YYYY-MM-DD; a time portion may optionally be added. It is not mandatory to modify this date every time you make changes to your link page, as search engines can also read the page's own last-modified timestamp when crawling. However, it is good practice to keep the dates accurate.
<changefreq> </changefreq> : This tag pair will encapsulate one of the following words: always, hourly, daily, weekly, monthly, yearly, never. It suggests to the crawler how frequently the page is modified and thus how often it should be re-crawled. Crawlers treat this as a hint only and may follow their own schedule for re-visiting pages. Note that always is meant for pages that change on every access, such as dynamically generated pages, while never is meant for archived pages. Even so, search engine crawlers may still re-visit a page marked never from time to time.
<priority> </priority> : This will encapsulate a value between 0.0 and 1.0, telling the crawler the relative importance/priority of the link with respect to the other links on your own website. 1.0 is the highest priority and 0.0 is the lowest. If the tag is omitted, the default priority is 0.5.
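To see how the four child tags fit together, here is a rough sketch that builds a sitemap from a list of page records using Python's standard xml.etree.ElementTree module. The function name build_sitemap, the dictionary keys, and the sample page are assumptions made for this illustration; the checks simply mirror the rules described above.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
CHANGEFREQ_VALUES = {
    "always", "hourly", "daily", "weekly", "monthly", "yearly", "never",
}


def build_sitemap(pages):
    """Return sitemap XML as UTF-8 bytes for a list of page dicts.

    Each dict needs a "loc" key; "lastmod", "changefreq" and "priority"
    are optional (hypothetical key names chosen for this sketch).
    """
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for page in pages:
        loc = page["loc"]
        if len(loc) >= 2048:
            raise ValueError("loc must be shorter than 2048 characters")
        url = ET.SubElement(urlset, "url")
        # ElementTree entity-escapes &, < and > when it writes the text out.
        ET.SubElement(url, "loc").text = loc
        if "lastmod" in page:
            ET.SubElement(url, "lastmod").text = page["lastmod"]
        if "changefreq" in page:
            if page["changefreq"] not in CHANGEFREQ_VALUES:
                raise ValueError("unknown changefreq: %r" % page["changefreq"])
            ET.SubElement(url, "changefreq").text = page["changefreq"]
        if "priority" in page:
            priority = float(page["priority"])
            if not 0.0 <= priority <= 1.0:
                raise ValueError("priority must lie between 0.0 and 1.0")
            ET.SubElement(url, "priority").text = f"{priority:.1f}"
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)


if __name__ == "__main__":
    sample_pages = [
        {
            "loc": "http://www.example.com/list.php?a=1&b=2",
            "lastmod": "2024-01-15",
            "changefreq": "weekly",
            "priority": 0.8,
        },
    ]
    print(build_sitemap(sample_pages).decode("utf-8"))
```

Generating the file this way, rather than typing it by hand, keeps the tags balanced and the special characters escaped automatically.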
Where should sitemap.xml reside, and how do you tell the search engine where it is?
The sitemap.xml file should reside in the root directory of your website, which is usually the public_html directory (in case of a linux system) and the httpdocs directory in case of a windows system. A sitemap placed in a sub-directory can only list urls that lie within that sub-directory, which is why the root directory is the recommended location.
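Tell all search engines the location of your XML sitemap by adding a line to your robots.txt file that consists of the word Sitemap, a colon, and the full url of the sitemap file, for example Sitemap: http://www.example.com/sitemap.xml.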
Here is a typical example of a robots.txt with the sitemap entry. The robots.txt file must reside in the root directory of your website.
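As a sketch of such a robots.txt, the Python script below holds a sample file as a string and reads the sitemap entry back with the standard urllib.robotparser module (the site_maps method needs Python 3.8 or later). The Disallow paths and the example.com url are placeholders assumed for illustration, and sitemap_urls is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content. The Disallow paths and the sitemap url are
# placeholders for this sketch; use the values that apply to your own site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/

Sitemap: http://www.example.com/sitemap.xml
"""


def sitemap_urls(robots_text):
    """Return the sitemap urls declared in the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


if __name__ == "__main__":
    print(sitemap_urls(ROBOTS_TXT))
```

The Sitemap line is independent of the User-agent sections, so it can be placed anywhere in the file.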
Points to Note
You must have noticed that for the <loc> tag we have enclosed the url string within a CDATA section. A CDATA section starts with <![CDATA[ and ends with ]]>. This is done so that special characters contained in your link url, such as & (ampersand), >, <, ' (single quote), and " (double quote), are read as plain text rather than as XML markup. The alternative, which is what the sitemap protocol itself describes, is to entity-escape these characters as &amp;, &gt;, &lt;, &apos; and &quot;. Enclosing all url strings within a CDATA section saves you from having to escape each special character by hand.
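The two ways of protecting special characters can be compared with a small sketch. It assumes a made-up url containing an ampersand, wraps it once with entity escaping and once in a CDATA section, and checks that an XML parser recovers the same url from both. The helper names entity_escape and loc_text are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# escape() handles &, < and > by default; the two quote characters are
# added here through its optional entities mapping.
QUOTE_ENTITIES = {"'": "&apos;", '"': "&quot;"}


def entity_escape(url):
    """Return the url with the five special XML characters entity-escaped."""
    return escape(url, QUOTE_ENTITIES)


def loc_text(loc_xml):
    """Parse a single <loc> element and return the url it contains."""
    return ET.fromstring(loc_xml).text


if __name__ == "__main__":
    url = "http://www.example.com/view.php?id=7&lang=en"  # placeholder url
    escaped_loc = "<loc>" + entity_escape(url) + "</loc>"
    cdata_loc = "<loc><![CDATA[" + url + "]]></loc>"
    print(escaped_loc)
    print(cdata_loc)
    # Both forms decode to the same url when parsed.
    print(loc_text(escaped_loc) == loc_text(cdata_loc) == url)
```

Either form is read identically by an XML parser, so the choice is mainly a matter of convenience when the file is written by hand.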
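Make sure that your sitemap.xml file does not exceed the limits set by the sitemap protocol: 50,000 urls and 50 MB (uncompressed) per file. For very large websites, where this may be unavoidable, there is provision to create multiple sitemap files and list them in a sitemap index file.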
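For the multiple-file case, the sitemap protocol defines a sitemap index file whose root tag is <sitemapindex> and which lists each sitemap file in a <sitemap><loc> entry; the protocol caps a single sitemap at 50,000 urls. A rough sketch of splitting a long url list and building such an index follows. The function names and the sitemap1.xml style file names are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50000  # per-file url cap in the sitemap protocol


def split_urls(urls, chunk_size=MAX_URLS_PER_FILE):
    """Split a list of urls into consecutive chunks of at most chunk_size."""
    return [urls[i:i + chunk_size] for i in range(0, len(urls), chunk_size)]


def build_sitemap_index(sitemap_file_urls):
    """Return sitemap index XML (UTF-8 bytes) listing the given sitemap files."""
    index = ET.Element("sitemapindex", {"xmlns": SITEMAP_NS})
    for file_url in sitemap_file_urls:
        entry = ET.SubElement(index, "sitemap")
        ET.SubElement(entry, "loc").text = file_url
    return ET.tostring(index, encoding="UTF-8", xml_declaration=True)


if __name__ == "__main__":
    # 120,000 placeholder urls need three sitemap files.
    all_urls = ["http://www.example.com/page%d.html" % n for n in range(120000)]
    chunks = split_urls(all_urls)
    # Hypothetical file naming scheme: sitemap1.xml, sitemap2.xml, ...
    file_urls = [
        "http://www.example.com/sitemap%d.xml" % (n + 1)
        for n in range(len(chunks))
    ]
    print(build_sitemap_index(file_urls).decode("utf-8"))
```

Each chunk would then be written out as its own sitemap file, and the index file is the one you point search engines to.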
If you have SSL implemented in your website and you have a situation where some urls begin with http:// while others begin with https://, you should not include both url versions in the same sitemap file. Use only one of the two versions, whichever is suitable for all your website pages.
The relative order of urls in your sitemap.xml file is immaterial. You can place them in any order.
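The point about not mixing http:// and https:// urls can be checked mechanically before you publish the file. The sketch below assumes you already have the list of urls destined for one sitemap; schemes_used and has_mixed_schemes are hypothetical helper names.

```python
from urllib.parse import urlsplit


def schemes_used(urls):
    """Return the set of url schemes (e.g. {"https"}) found in the list."""
    return {urlsplit(url).scheme.lower() for url in urls}


def has_mixed_schemes(urls):
    """Return True when the list mixes more than one scheme."""
    return len(schemes_used(urls)) > 1


if __name__ == "__main__":
    # Placeholder urls for illustration only.
    urls = [
        "https://www.example.com/",
        "http://www.example.com/contact.html",
    ]
    if has_mixed_schemes(urls):
        print("Mixed schemes found:", sorted(schemes_used(urls)))
```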
Rajeev Kumar CEO, Computer Solutions Jamshedpur, India
Rajeev Kumar is the primary author of How2Lab. He is a B.Tech. from IIT Kanpur with several years of experience in IT education and Software development. He has taught a wide spectrum of people including fresh young talents, students of XLRI, industry professionals, and govt. officials.
Rajeev founded Computer Solutions & WebServicesWorldwide.com, and has hands-on experience of building a variety of web applications and portals, including SaaS-based ERP & e-commerce systems, independent B2B, B2C, Matrimonial & Job portals, and many more.