Interesting. The documentation shows a very simple format:
<?xml...
<urlset...
<url>
<loc>http://www...
<lastmod>2007-0...
<changefreq>monthly...
<priority>0.8...
</url>
<url for next page...
</urlset>
Is it really that simple? If it is, what I'll do is write a little utility to read a list of pages I want in the sitemap and spit out both my Sitemap.php page and the Google sitemap. Have cron kick it off early each morning, and I'm in business!
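Here is a rough sketch of the XML half of that utility, in Python, just to convince myself it really is that simple. The function names (build_sitemap, write_gzipped), the tuple layout of the page list, and the example.com URL in the usage note are all placeholders I made up. The namespace is my assumption too: I'm using the sitemaps.org 0.9 one, so swap in whatever the documentation you're following specifies.

```python
import gzip
from xml.sax.saxutils import escape

# Assumption: the sitemaps.org 0.9 namespace. Change it if your docs say otherwise.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return sitemap XML as a string.

    `pages` is a list of (loc, lastmod, changefreq, priority) tuples,
    all given as strings. This layout is hypothetical, not from the docs.
    """
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="%s">' % SITEMAP_NS,
    ]
    for loc, lastmod, changefreq, priority in pages:
        lines.append("  <url>")
        # escape() turns &, < and > into entities, which matters for
        # URLs that carry query strings.
        lines.append("    <loc>%s</loc>" % escape(loc))
        lines.append("    <lastmod>%s</lastmod>" % escape(lastmod))
        lines.append("    <changefreq>%s</changefreq>" % escape(changefreq))
        lines.append("    <priority>%s</priority>" % escape(priority))
        lines.append("  </url>")
    lines.append("</urlset>")
    return "\n".join(lines) + "\n"


def write_gzipped(xml_text, path):
    """Write the XML to `path` as a gzip-compressed UTF-8 file."""
    with gzip.open(path, "wb") as fh:
        fh.write(xml_text.encode("utf-8"))
```

Usage would be something like write_gzipped(build_sitemap([("http://www.example.com/", "2007-01-01", "monthly", "0.8")]), "sitemap.xml.gz"), with the page list read from a text file or, later, the product database. Using Python's own gzip module also sidesteps the question of whether the gzip command is installed on the host.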
Several questions for Mitch:
What is the name of the file? Is it sitemap.gz or sitemap.xml.gz? They show both. Make up your minds, guys! I'm assuming that gzip is available -- I can't sign on to cPanel right now to check.
The instructions say, "After you produce your Sitemap, you will need to notify search engines of the Sitemap's location." Huh? I have to do manual submissions? They aren't going to look for public_html/sitemap.whatever.gz on the next crawl? That bites.
Is there a validation site, something like the W3C's page validators? Google gives instructions on some convoluted set of XML/XSL/Xwhatever tools and schemas to download and run in some manner. Come on! I just want to feed my sitemap file to something that will tell me if it's properly formed -- bonus points if it can compare it against my site and point out discrepancies.
I will be adding an e-store. Is it good practice to put individual items in the sitemap? Usually this involves a URL query string with item numbers, etc. From the FAQ, it sounded like that must be what people do (put individual item pages in the sitemap). How else would you approach the 50,000 page limit for a sitemap? I'll have to look into monkeying with my utility to read my product database, rather than manually updating the page list.
What do I gain (or lose) by providing this sitemap, over just letting Google et al. crawl my site?
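In the meantime, on the validation and page-limit questions above, this is the kind of minimal check I could run myself. It is only a sketch: check_sitemap is a name I made up, it tests nothing beyond whether the file parses as XML and stays under the 50,000-entry limit, and it does not validate against the schema or compare anything against the live site.

```python
import gzip
import xml.etree.ElementTree as ET

MAX_URLS = 50000  # the per-sitemap limit mentioned in the FAQ


def check_sitemap(path):
    """Return the number of <url> entries in the sitemap at `path`.

    Raises ET.ParseError if the file is not well-formed XML and
    ValueError if it holds more than MAX_URLS entries.
    Hypothetical helper: not a schema validator.
    """
    # Handle both the plain and the gzipped file.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as fh:
        root = ET.parse(fh).getroot()

    # Compare local tag names so the count works whatever namespace is declared.
    count = sum(1 for child in root if child.tag.rsplit("}", 1)[-1] == "url")
    if count > MAX_URLS:
        raise ValueError("%d URLs exceeds the limit of %d" % (count, MAX_URLS))
    return count
```

Calling check_sitemap("sitemap.xml.gz") from the same cron job would at least catch a mangled file before any search engine sees it. A real validator would still be better.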