Open Source Web Crawling is About Ten to Fifteen Years Behind Google

In 1999, it took Google one month to crawl and build an index of about 50 million pages. In 2012, the same task was accomplished in less than one minute. The 2012 capability is about 50,000 times faster. This is slightly better than doubling the speed every year for 14 years.
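As a rough check on the "doubling every year" claim, the sketch below converts a total speedup into a constant yearly multiplier. The function names are illustrative, and the calculation assumes the improvement was spread evenly across the period, which the source does not claim.

```python
import math


def annual_factor(total_speedup, years):
    """Constant yearly multiplier that compounds to total_speedup over `years`."""
    return total_speedup ** (1.0 / years)


def doubling_time_months(total_speedup, years):
    """Months needed for one doubling at that constant rate."""
    return 12.0 * years * math.log(2) / math.log(total_speedup)


if __name__ == "__main__":
    speedup = 50_000  # figure quoted in the text
    # 1999 to 2012 spans 13 years; the text counts 14, so show both.
    for years in (13, 14):
        print(years,
              round(annual_factor(speedup, years), 2),
              round(doubling_time_months(speedup, years), 1))
```

Either way of counting gives a yearly multiplier of roughly 2.2 to 2.3, or one doubling about every ten to eleven months, which is consistent with "slightly better than doubling."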

In 2016, a new open-source web crawler, BUbiNG, was announced that can crawl around 12,000 pages per second on a relatively slow connection. That could be about 1 billion pages per day, at a cost of about $40 per day. It is described in a 2016 arXiv article, “BUbiNG: Massive Crawling for the Masses.” This is roughly the capability that Google had ten to fifteen years ago.

BUbiNG is available on GitHub.

On a 64-core, 64 GB workstation, it can download hundreds of millions of pages at more than 10,000 pages per second while respecting politeness both by host and by IP, and analyzing, compressing and storing more than 160 MB/s of data.
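Politeness "both by host and by IP" means a crawler waits between requests to the same hostname and also between requests to the same IP address, since many hostnames can share one server. The following is a minimal Python sketch of that idea, not BUbiNG's actual implementation; the class name, the delay values and the injectable clock are all hypothetical.

```python
import time
from urllib.parse import urlsplit


class PolitenessGate:
    """Tracks when each host and each IP may next be contacted."""

    def __init__(self, host_delay=4.0, ip_delay=1.0, clock=time.monotonic):
        self.host_delay = host_delay  # assumed seconds between hits on one host
        self.ip_delay = ip_delay      # assumed seconds between hits on one IP
        self.clock = clock            # injectable so the gate can be tested
        self.next_ok_host = {}
        self.next_ok_ip = {}

    def wait_time(self, url, ip):
        """Seconds until `url` (resolved to `ip`) may be fetched; 0.0 if now."""
        host = urlsplit(url).hostname
        now = self.clock()
        ready = max(self.next_ok_host.get(host, now),
                    self.next_ok_ip.get(ip, now))
        return max(0.0, ready - now)

    def record_fetch(self, url, ip):
        """Note that a request was just sent, pushing back both deadlines."""
        host = urlsplit(url).hostname
        now = self.clock()
        self.next_ok_host[host] = now + self.host_delay
        self.next_ok_ip[ip] = now + self.ip_delay
```

A fetch loop would call `wait_time` before each request, skip or requeue URLs whose wait is non-zero, and call `record_fetch` once the request is sent. Taking the later of the two deadlines is what makes the gate respect both limits at once.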

A 10-terabyte hard drive costs about $200. At the 160 MB/s data rate quoted above, such a drive would fill in roughly 17 hours of crawling.
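As a rough check on drive capacity, the sketch below computes how long a drive lasts at a given sustained write rate. It assumes decimal units (1 TB = 10^12 bytes) and that the 160 MB/s figure quoted above is what actually reaches the disk; both are assumptions, and the function names are illustrative.

```python
def hours_to_fill(capacity_tb, write_mb_per_s):
    """Hours until a drive of capacity_tb terabytes is full at write_mb_per_s."""
    capacity_bytes = capacity_tb * 1e12      # decimal terabytes
    bytes_per_second = write_mb_per_s * 1e6  # decimal megabytes
    return capacity_bytes / bytes_per_second / 3600.0


def mb_per_s_to_fill_in(capacity_tb, hours):
    """Sustained write rate (MB/s) that fills the drive in `hours`."""
    return capacity_tb * 1e12 / (hours * 3600.0) / 1e6


if __name__ == "__main__":
    print(round(hours_to_fill(10, 160), 1))     # 10 TB drive at 160 MB/s
    print(round(mb_per_s_to_fill_in(10, 1.0)))  # rate needed to fill it in 1 hour
```

At 160 MB/s a 10 TB drive lasts roughly 17 hours; filling it in a single hour would take a sustained rate of close to 2.8 GB/s.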

Borislav Agapiev is a web crawling expert who described what was needed to run BUbiNG at roughly Google’s 2005 capability.

He used Amazon AWS spot instances, in particular hi1.4xlarge, which is a good machine with dual Xeon E5620 CPUs and 60 GB of RAM, sitting on a 10 Gbit network. Borislav verified hundreds of Mbps of incoming crawling bandwidth, which is great.

Spot pricing is $0.16/hr, which is awesome, especially considering the regular price is $3/hr. Of course, spot machines can be taken out from under you at any moment, but it is good practice to be able to handle that anyway.
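Handling a spot instance disappearing mostly comes down to persisting crawl state often enough to resume elsewhere. Here is a minimal sketch under stated assumptions: the frontier is a plain list of URLs, it fits in a JSON file, and `save_checkpoint` and `load_checkpoint` are hypothetical helper names, not part of any crawler mentioned here.

```python
import json
import os
import tempfile


def save_checkpoint(path, pending_urls):
    """Atomically write the pending URL list so a kill mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(list(pending_urls), f)
            f.flush()
            os.fsync(f.fileno())
        # Rename over the old file; atomic when both are on one filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise


def load_checkpoint(path):
    """Return the saved URL list, or an empty list on a fresh start."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return []
```

Writing to a temporary file and renaming it means a machine that vanishes mid-write leaves the previous checkpoint intact. On a spot instance the checkpoint would also need to live on storage that outlives the machine, which this sketch does not cover.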

With a simple setup, he was able to sustain 25 MB/s (200 Mbps), with almost 1,200 requests/sec on a single machine, which is awesome performance. That could be close to 100M pages/day for under $4/day, which is spectacular. Of course, you will need to process and offload the data you need, as the full archive piles up very quickly.
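The throughput and cost figures in this article can be checked with a few lines of arithmetic. The sketch below uses only numbers quoted in the text (12,000 pages/sec at $40/day for BUbiNG; 1,200 requests/sec, 25 MB/s and $0.16/hr for the single spot machine). The function names are illustrative, and it assumes the rates are sustained around the clock with every request yielding one page.

```python
SECONDS_PER_DAY = 86_400


def pages_per_day(pages_per_second):
    """Pages fetched in 24 hours at a sustained rate."""
    return pages_per_second * SECONDS_PER_DAY


def daily_cost(hourly_price):
    """Cost of running one machine for 24 hours."""
    return hourly_price * 24


def cost_per_million_pages(cost_per_day, pages_per_second):
    """Dollars spent per one million pages fetched."""
    return cost_per_day / (pages_per_day(pages_per_second) / 1e6)


def mbps_from_mb_per_s(mb_per_s):
    """Convert megabytes per second to megabits per second."""
    return mb_per_s * 8


def avg_kb_per_request(mb_per_s, requests_per_second):
    """Average transfer size per request, in decimal kilobytes."""
    return mb_per_s * 1000 / requests_per_second


if __name__ == "__main__":
    print(pages_per_day(12_000))                  # BUbiNG figure
    print(pages_per_day(1_200))                   # single spot machine
    print(daily_cost(0.16))                       # spot price per day
    print(cost_per_million_pages(40.0, 12_000))
    print(cost_per_million_pages(daily_cost(0.16), 1_200))
    print(avg_kb_per_request(25, 1_200))
```

Both setups work out to a little under 4 cents per million pages, and the single-machine numbers imply an average transfer of about 21 KB per request.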

Web Crawling Versus Search Request Handling

Google had far more massive growth in search request volume. The number of search requests Google handled increased 17,000% year over year between 1998 and 1999, 1,000% between 1999 and 2000, and 200% between 2000 and 2001. Google search then continued to grow at 40% to 60% per year between 2001 and 2009. Growth has since slowed, stabilizing at 10% to 15% per year in recent years.
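To see what those yearly percentages add up to, the sketch below compounds them into a single multiplier. It reads "increased X%" as growth of X percent on top of the previous year (so 200% means tripling); that reading is an assumption, and the function name is illustrative.

```python
def compound(percent_increases):
    """Overall multiplier after applying each yearly percentage increase in turn."""
    factor = 1.0
    for pct in percent_increases:
        factor *= 1.0 + pct / 100.0
    return factor


if __name__ == "__main__":
    print(compound([17_000, 1_000, 200]))  # 1998 to 2001
    print(compound([40] * 8))              # 2001 to 2009 at the low end
    print(compound([60] * 8))              # 2001 to 2009 at the high end
```

Under that reading, search volume grew by a factor of several thousand between 1998 and 2001, and by roughly another 15 to 43 times between 2001 and 2009.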

SOURCES: Quora (Borislav Agapiev), arXiv, BUbiNG, GitHub, Internet Live Stats
Written By Brian Wang, Nextbigfuture.com

6 thoughts on “Open Source Web Crawling is About Ten to Fifteen Years Behind Google”

  1. “Where typing in a search term gives you the most linked sites for that term.”

    If naive keyword matching is your definition of a good search engine then the top 5 isn’t for you.

    “Not the site that has paid the most money to Google. Not the site that has ensured they hire “representative” proportions of every known, guessed at and computer predicted “gender”. Not the websites that make sure they have banned anyone who once linked to a joke made by someone else who was later found to have donated money to a political campaign that stopped being socially acceptable 2 years after they donated..”

    Care to provide some support for that, or, like Bellmore, do you just instinctively know it’s true?

    If you want low quality loony tunes content at the top of your results, you will need to add loony tunes key words to all your keyword searches.

  2. Yes, Google has more capability, but they are using that capability to extract rent not to provide a better service.

    I’m saying that if a competitor offered the service that you could get from 2010 Google then that would be a better service than Google offers now.
    And switching costs are zero. So they could take a big market share and quickly.

    Now Google may well respond by improving their product. Which could well crush the startup, or at least keep it small.

    But either way the customers win.

  3. They have had (and still have) terrific engineers, working on these technologies over many years and accumulating an aeon’s worth of man-hours of intellectual property.

  4. But isn’t it generally accepted that Google of 10 years ago was actually functionally superior to 2019 Google?

    Not from the point of view of being a profitable business, but from the point of view of the consumer experience. Where typing in a search term gives you the most linked sites for that term.

    Not the site that has paid the most money to Google. Not the site that has ensured they hire “representative” proportions of every known, guessed at and computer predicted “gender”. Not the websites that make sure they have banned anyone who once linked to a joke made by someone else who was later found to have donated money to a political campaign that stopped being socially acceptable 2 years after they donated…

  5. Google’s intelligence on its customers was a Stasi’s wet dream. Understandable that they’re willing to work with the Chicoms over our own govt.
