Thursday, December 13, 2007

Sitemap for a Large Site

If you're running a site with a large number of articles, it is advisable to use sitemaps and submit them to Google for better coverage of your content. In theory you should be able to submit your sitemaps to Yahoo! and MSN as well (more on this later).

What is a sitemap?
Taken from www.sitemaps.org:
Sitemaps are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling. In its simplest form, a Sitemap is an XML file that lists URLs for a site along with additional metadata about each URL (when it was last updated, how often it usually changes, and how important it is, relative to other URLs in the site) so that search engines can more intelligently crawl the site.
A typical sitemap file that contains links you wish to submit to a search engine will look like:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.example.com/dir/p1.html</loc>
<lastmod>2004-06-07</lastmod>
<changefreq>monthly</changefreq>
<priority>1</priority>
</url>
<url>
<loc>http://www.example.com/dir/p2.html</loc>
<lastmod>2004-06-07</lastmod>
<changefreq>monthly</changefreq>
<priority>1</priority>
</url>
<!-- ... up to 50,000 links like these in a single file ... -->
</urlset>
Please read through the official documentation on how to submit your sitemap to search engines.
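For illustration, here is a minimal sketch of producing a file like the one above from a script. It uses Python's standard xml.etree.ElementTree module (requires Python 3.8 or later for the xml_declaration argument); the function name build_sitemap and the tuple layout are my own assumptions, not part of the sitemap protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return sitemap XML as bytes.

    `entries` is an iterable of (loc, lastmod, changefreq, priority)
    string tuples. This layout is an assumption made for the example.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod, changefreq, priority in entries:
        url = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as "&" in the URL for us.
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
        ET.SubElement(url, "changefreq").text = changefreq
        ET.SubElement(url, "priority").text = priority
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)


if __name__ == "__main__":
    xml_bytes = build_sitemap([
        ("http://www.example.com/dir/p1.html", "2004-06-07", "monthly", "1"),
        ("http://www.example.com/dir/p2.html", "2004-06-07", "monthly", "1"),
    ])
    print(xml_bytes.decode("utf-8"))
```

Letting the XML library do the escaping matters once your URLs contain query strings, because a raw "&" makes the file invalid.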

How to submit a very large number of URLs?
You'll have to submit multiple sitemap files. The trick is simple: build individual sitemap files full of links, then create a main "directory" (the sitemap index) that lists all of those files by URL, and submit the index file to the search engine. The XML for the sitemap index is slightly different. Below is a sample of what a sitemap index file might look like:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>http://www.example.com/sitemap-1.xml.gz</loc>
<lastmod>2007-12-21</lastmod>
</sitemap>
<sitemap>
<loc>http://www.example.com/sitemap-2.xml.gz</loc>
<lastmod>2007-12-21</lastmod>
</sitemap>
</sitemapindex>
Notice that the sitemap index refers to two separate sitemap files, which are the ones that contain the actual links. In the above example, these are called sitemap-1.xml.gz and sitemap-2.xml.gz. Also notice that each reference is the full URL of the gzipped sitemap file, not of the uncompressed sitemap file (sitemap-1.xml).
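A sitemap index like the one above can be generated the same way as an ordinary sitemap. The following is only a sketch using Python's standard library (Python 3.8 or later); build_sitemap_index, base_url and the file names are hypothetical names chosen for this example.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(base_url, filenames, lastmod):
    """Return sitemap index XML (bytes) listing each file under base_url."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for name in filenames:
        sitemap = ET.SubElement(index, "sitemap")
        # The index must carry the full URL of each (gzipped) sitemap file.
        ET.SubElement(sitemap, "loc").text = base_url.rstrip("/") + "/" + name
        ET.SubElement(sitemap, "lastmod").text = lastmod
    return ET.tostring(index, encoding="UTF-8", xml_declaration=True)


if __name__ == "__main__":
    xml_bytes = build_sitemap_index(
        "http://www.example.com",
        ["sitemap-1.xml.gz", "sitemap-2.xml.gz"],
        "2007-12-21",
    )
    print(xml_bytes.decode("utf-8"))
```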

Google officially accepts up to 50,000 URLs per single sitemap file AND (this is quite important) the sitemap file, once unzipped, cannot be bigger than 10MB. Whichever limit you hit first is the one that applies. The second condition was added a few months back.
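One way to respect both limits is to start a new file as soon as either would be exceeded. The sketch below is an assumption-laden illustration: split_urls is a hypothetical helper, and the size estimate assumes each entry is rendered as a single <url><loc>...</loc></url> line with no other tags, so adjust it if you also write lastmod, changefreq or priority.

```python
from xml.sax.saxutils import escape

MAX_URLS = 50_000
MAX_BYTES = 10 * 1024 * 1024  # 10MB, measured on the uncompressed file

HEADER = ('<?xml version="1.0" encoding="UTF-8"?>\n'
          '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
FOOTER = "</urlset>\n"


def split_urls(urls, max_urls=MAX_URLS, max_bytes=MAX_BYTES):
    """Yield lists of URLs, each small enough for one sitemap file."""
    overhead = len((HEADER + FOOTER).encode("utf-8"))
    batch, size = [], overhead
    for url in urls:
        # Estimated rendered size of this entry (assumed layout, see above).
        entry = "<url><loc>%s</loc></url>\n" % escape(url)
        entry_size = len(entry.encode("utf-8"))
        # Close the current batch if adding this URL would break a limit.
        if batch and (len(batch) >= max_urls or size + entry_size > max_bytes):
            yield batch
            batch, size = [], overhead
        batch.append(url)
        size += entry_size
    if batch:
        yield batch


if __name__ == "__main__":
    urls = ["http://www.example.com/article-%d.html" % i for i in range(120_000)]
    for number, batch in enumerate(split_urls(urls), start=1):
        print("sitemap-%d.xml: %d URLs" % (number, len(batch)))
```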

Finally, you should compress these files before submitting them, which means your sitemap index file should reference the zipped files. Google works with gzip, so find a way to gzip the individual sitemap files containing all those links. In my testing, compression brought a 50,000-URL file (fairly long URLs, I'd say) down to about 400KB; in raw format it's about 10MB. This was done with CFMX 7.1 on Windows Server 2003 with the help of the excellent Zip CFC.
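I did the compression in ColdFusion, but the same step is easy in most languages. As a rough sketch, here is how it might look with Python's standard gzip module; compress_sitemap and write_gzipped are hypothetical helper names, and the compression ratio you get will depend on your URLs.

```python
import gzip


def compress_sitemap(xml_bytes):
    """Return the gzip-compressed form of a sitemap given as bytes."""
    return gzip.compress(xml_bytes)


def write_gzipped(xml_bytes, path):
    """Write the sitemap to `path` (for example sitemap-1.xml.gz) as gzip."""
    with gzip.open(path, "wb") as handle:
        handle.write(xml_bytes)


if __name__ == "__main__":
    # Repetitive sample data standing in for a real sitemap body.
    sample = b"<url><loc>http://www.example.com/dir/p1.html</loc></url>\n" * 50000
    packed = compress_sitemap(sample)
    print("raw bytes:", len(sample), "gzipped bytes:", len(packed))
```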

Notes
A few things to remember: Google seems to be the only one that happily works its way through a submitted sitemap file. Yahoo! and MSN do not, at least from what I've noticed so far. Truth be told, Google is the only one that provides useful webmaster tools that you can use to see how your site fares with the search engine.

It may be useful to monitor whether Google is picking up your sitemap. The first thing to do is log in to the webmaster tools (click Sitemap on the main menu once you log in) and submit your sitemap. Google is quite quick to pick up the change, so check back shortly and it will report the number of URLs it picked up from the submitted sitemaps. Somewhere inside the webmaster tools is a page that shows the crawl rate. In my experience it takes about 7 to 10 days for Google to start seriously reporting something in that graph. However, by the time you can see a graph there, your web server log should already show quite a lively interaction with the Google bot.
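If you want to watch that interaction in your own logs, something like the following sketch can count Googlebot requests per day. It assumes an Apache-style access log where the timestamp looks like [13/Dec/2007:10:00:00 +0000] and the user agent contains "Googlebot"; googlebot_hits_per_day is a hypothetical helper, and since user agents can be faked, treat the numbers as a rough indication only.

```python
import re
from collections import Counter

# Matches the date part of an Apache-style timestamp, e.g. [13/Dec/2007:
DATE_RE = re.compile(r"\[(\d{2}/[A-Za-z]{3}/\d{4}):")


def googlebot_hits_per_day(lines):
    """Count log lines per day whose text mentions Googlebot."""
    hits = Counter()
    for line in lines:
        if "Googlebot" not in line:
            continue
        match = DATE_RE.search(line)
        if match:
            hits[match.group(1)] += 1
    return hits


if __name__ == "__main__":
    import sys

    # Usage (assumed): python googlebot_hits.py /path/to/access.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for day, count in googlebot_hits_per_day(log).items():
            print(day, count)
```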

Useful Links for Submitting Your Website to Search Engines
  1. Sitemap main site (http://www.sitemaps.org/)
  2. Search engines unite on sitemap discovery (article available at http://searchengineland.com/070411-080716.php)
  3. Google sitemap-related pages
  4. Yahoo Site Explorer Search Documentation (http://developer.yahoo.com/search/siteexplorer/V1/updateNotification.html)
  5. Google sitemap ping URL:

  6. MSN seems to work through MOREOVER (I couldn't find any info on submitting directly to MSN, but I did find references to MOREOVER in that context, so I'm providing it here too). Their sitemap ping URL is:
    http://api.moreover.com/ping?u=http://www.example.com/sitemap.xml
  7. Yahoo! sitemap ping URL:
    http://search.yahooapis.com/SiteExplorerService/V1/updateNotification?appid=YahooDemo&url=http://www.example.com/sitemap.xml
    (You'd need to get a Yahoo! AppID before you can submit this)
  8. ASK.com sitemap ping URL:
    http://submissions.ask.com/ping?sitemap=http%3A//www.example.com/sitemap.xml
(For all of the URLs above, please change "www.example.com" to the domain you're trying to manage.)
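Since the sitemap address is passed as a query parameter, it is safest to URL-encode it when building these ping URLs. The sketch below only constructs the URLs for the MOREOVER, Yahoo! and ASK.com endpoints listed above and does not send anything; ping_urls is a hypothetical helper, and "YahooDemo" stands in for your own Yahoo! AppID.

```python
from urllib.parse import urlencode


def ping_urls(sitemap_url, yahoo_appid):
    """Return a dict of ping URLs with the sitemap address URL-encoded."""
    return {
        "moreover": "http://api.moreover.com/ping?"
                    + urlencode({"u": sitemap_url}),
        "yahoo": "http://search.yahooapis.com/SiteExplorerService/V1/"
                 "updateNotification?"
                 + urlencode({"appid": yahoo_appid, "url": sitemap_url}),
        "ask": "http://submissions.ask.com/ping?"
               + urlencode({"sitemap": sitemap_url}),
    }


if __name__ == "__main__":
    urls = ping_urls("http://www.example.com/sitemap.xml", "YahooDemo")
    for engine, url in urls.items():
        print(engine, url)
```

Requesting each URL (from a browser or a script) is then what actually notifies the search engine.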

Final note: I have not really gotten great results with MSN and Yahoo! so perhaps I'm missing something in this post.
