Bacon and I are always working on search engine optimization (SEO) for PersistentFan.com. Like many sites, we have a monthly spend for acquiring traffic in addition to ad buys, viral features, and so on, and like many sites, we constantly tweak that spend to optimize it.
Over the last few months, we added features to improve the visibility of a given topic (like Justin Bieber): tweeting new topics from our @PersistentFan account, making it easier to share a link to a specific video (e.g. Jimmy Fallon), and adding Facebook/Digg sharing. Search engine crawlers are fairly latent; it can take upwards of a month to see the impact of a given change. As we introduced these features, we began to see an uptick in both eCPM and search engine traffic/visibility, but the progress was fairly muted overall. Before the holidays, we decided to go back to the beginning and analyze how search engines were crawling our site. I spent a lot of time doing this:
tail -f /var/log/httpd/access_log
One thing that struck me almost immediately was that I saw a ton of crawlers (Tweetmeme, etc.) arriving from the @PersistentFan tweets, but not many of the ones that matter most, like Googlebot. After some analysis, we noticed the obvious: the front door of PersistentFan.com showed the most popular topics of the last few hours, along with a scrolling list of recently viewed videos. At any given time, we were exposing about 15 or 20 topics to bots, but no more. The many thousands of topics we track for persistent fans were effectively hidden from bots because we provided no direct linkage or navigation to discover them. (Doh!)
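Eyeballing `tail -f` works, but the comparison is easier to see with a quick tally of crawler user-agents. A minimal sketch (the sample log lines and the bot list are illustrative, not our real traffic):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Quick-and-dirty tally of crawler hits in Apache combined-format log lines,
// matched by user-agent substring.
public class CrawlerTally {
    private static final List<String> BOTS = List.of("Googlebot", "bingbot", "TweetmemeBot");

    public static Map<String, Integer> tally(List<String> logLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String bot : BOTS) counts.put(bot, 0);
        for (String line : logLines) {
            for (String bot : BOTS) {
                if (line.contains(bot)) counts.merge(bot, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "127.0.0.1 - - [01/Dec/2010:10:00:00 -0500] \"GET / HTTP/1.1\" 200 512 \"-\" \"Mozilla/5.0 (compatible; TweetmemeBot/2.0)\"",
            "127.0.0.1 - - [01/Dec/2010:10:00:01 -0500] \"GET /topic/justin-bieber HTTP/1.1\" 200 512 \"-\" \"Mozilla/5.0 (compatible; Googlebot/2.1)\""
        );
        tally(sample).forEach((bot, n) -> System.out.println(bot + ": " + n));
    }
}
```

Run against a day's worth of the access log, a skew like ours (lots of tweet-driven bots, almost no Googlebot) jumps right out.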
The solution to this was to provide a sitemap. Since our topics grow dynamically, from user-generated searches, topic creation, and the various feeds we use on PersistentFan, we needed a solution that regenerated the sitemap on a regular basis. I looked around but couldn't find much in the way of libraries, so in the end I wrote a Java app that grabs all of the topics in our database, then builds and serializes a DOM to produce a sitemap nightly. Once this was complete, a simple change to our robots.txt let the crawlers of the world know where to go:
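A minimal sketch of that nightly build, using only the JDK's built-in DOM and transform APIs. The class, the base URL, and the hardcoded topic slugs are illustrative; in the real app the slugs come from a database query:

```java
import java.io.StringWriter;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Builds a sitemap document in the sitemaps.org namespace, one <url>/<loc>
// entry per topic, then serializes the DOM to XML text.
public class SitemapBuilder {
    private static final String NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

    public static String build(String baseUrl, List<String> topicSlugs) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element urlset = doc.createElementNS(NS, "urlset");
        doc.appendChild(urlset);
        for (String slug : topicSlugs) {
            Element url = doc.createElementNS(NS, "url");
            Element loc = doc.createElementNS(NS, "loc");
            loc.setTextContent(baseUrl + "/topic/" + slug);
            url.appendChild(loc);
            urlset.appendChild(url);
        }
        // Identity transform serializes the DOM tree to a string.
        StringWriter out = new StringWriter();
        var transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(build("http://persistentfan.com",
                List.of("justin-bieber", "jimmy-fallon")));
    }
}
```

The robots.txt side is a single directive pointing at wherever the nightly job drops the file, e.g. `Sitemap: http://persistentfan.com/sitemap.xml` (the exact path here is illustrative).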
Within a few hours, we began to see crawlers performing GETs on URLs that had never previously been requested. Sitemaps, FTW!