Dec 10

For a few weeks now, we have been running a bot identified by this User-Agent string:

Mozilla/5.0 (compatible; Yoono; http://www.yoono.com/)

Some people have asked us what exactly this bot is doing. Yoono is a social search engine built on the bookmarks that our users choose to share, so why would we need a bot to crawl the web?

Well, our bot does a few things that we need to do ourselves:

  1. We fetch the URLs that are referenced in our users’ bookmarks in order to retrieve an official title for each page. Some people like to customize the titles associated with their bookmarks, so we cannot systematically trust the title they provide when sharing a URL with other users. Therefore, our bot regularly checks the “true” title of the web page.
  2. When we fetch a web page, we look for related RSS / Atom feeds and cross-index them, in order to give you the soon-to-be-released “blogsearch” feature. Whenever you are visiting a web site or blog article, the blogsearch feature gives you related blog articles, so that you can immediately discover the blogosphere buzz around what you are currently reading. For this, we need to regularly fetch the entire RSS / Atom feeds.
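
To make the two steps above more concrete, here is a minimal sketch of how a page's title and its advertised feeds could be extracted. It is not our actual crawler code: the class and function names are made up for illustration, and it assumes feeds are advertised through the usual `<link rel="alternate">` autodiscovery tags.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types commonly used for feed autodiscovery (an assumption of this sketch).
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class TitleAndFeedParser(HTMLParser):
    """Hypothetical helper: collects the first <title> and any feed links."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.feeds = []
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title" and self.title is None:
            self._in_title = True
        elif tag == "link":
            rel = (attrs.get("rel") or "").lower().split()
            mime = (attrs.get("type") or "").lower()
            href = attrs.get("href")
            if "alternate" in rel and mime in FEED_TYPES and href:
                self.feeds.append(href)

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            # Collapse runs of whitespace so titles compare cleanly.
            self.title = " ".join("".join(self._title_parts).split())


def extract_title_and_feeds(html, base_url):
    """Return (title, [absolute feed URLs]) for an HTML document."""
    parser = TitleAndFeedParser()
    parser.feed(html)
    parser.close()
    return parser.title, [urljoin(base_url, href) for href in parser.feeds]
```

Relative feed links are resolved against the page URL, since many blogs advertise their feed as a path rather than a full address.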

The bot refreshes the web page titles once a week, and RSS / Atom feeds once an hour. We implemented HTTP conditional GET (with Last-Modified and ETag), so if your server supports it, the impact on your bandwidth and CPU will be minimal.
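
For the curious, a conditional GET looks roughly like the sketch below. This is an illustration using Python's standard library rather than our production code, and the function names are hypothetical. The client sends back the `ETag` and `Last-Modified` values it saw last time; a server that supports conditional requests answers `304 Not Modified` with no body when nothing has changed.

```python
import urllib.error
import urllib.request

USER_AGENT = "Mozilla/5.0 (compatible; Yoono; http://www.yoono.com/)"


def build_conditional_request(url, etag=None, last_modified=None):
    """Build a GET request carrying the validators from the previous fetch."""
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    return req


def fetch_if_changed(url, etag=None, last_modified=None, timeout=10):
    """Return (body, etag, last_modified); body is None if unchanged."""
    req = build_conditional_request(url, etag, last_modified)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return (
                resp.read(),
                resp.headers.get("ETag"),
                resp.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Not modified: keep the cached copy and the old validators.
            return None, etag, last_modified
        raise
```

The caller stores the returned validators alongside the cached copy and passes them back on the next refresh.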

Writing a crawler is not the easiest thing on Earth (if you think it is, you should read what Larry and Sergey wrote about it). There are millions of strange web servers out there, and one of the most annoying things we had to cope with was strange session systems that would trap our crawler. Everything is sorted out now (well, until the next weird thing), but during the development and test phase, we may have been, ahem, a little heavy on some particular web servers. Please accept our apologies if we haven’t already contacted you.


  • Not bad, it really can occur
  • Spammers suck a lot
  • No more spam, man!
  • Yeah I'm getting more and more impressed as well as I find out more and more each second :D
  • We are impressed with your concept and
    wish you with the warmest regards to continue
    doing this great service .

    Always willing to support your NOBLE WORK .

    http://blogs.mindbodynsoul.com
  • Bananeweizen
    Hi. Today your bot crashed our web server by submitting around 5000 requests for the same feed (with different order of parameters and with the parameters encoded multiple times). And it's the recent changes feed from a standard MediaWiki installation, so I hope that you fix this bug really fast.
    Additionally I hope that your bot respects the robots.txt, as I don't want to have it index our page again.

    Ciao, Michael.
  • tm, TTL support is coming up in the next version of our crawler. We are of course interested in reducing our side of the bandwidth as much as your side.

    In any case, please note that if your server supports conditional GETs (most of them do if your RSS file is a regularly generated static file), then we should be pretty much painless - and if it doesn't, you'll only get 24 requests a day in the current instance of the crawler, instead of 200 000 requests from our users. Could you please send me the address of your RSS feed in a private e-mail, so that I can check what we do with it?

    Regards,
    Nicolas
  • tm
    You had better support TTL in rss feeds because mine are only to be loaded once a day.