I was randomly surfing the interwebs and came across this blog post by Google: http://googlepublicpolicy.blogspot.com/2010/02/this-stuff-is-tough.html . It got me wondering what, exactly, the Google search algorithm looks like and how they managed to scale it so well that it returns millions of potential results in a fraction of a second. Any ideas?
20 years of spaghetti code!
Google is nuts. Recently I did some light EULA violation and scraped a certain large social networking site for some very targeted information. The URLs alone were 10 GB and were a bitch to work with. I couldn't even imagine trying to _store_ the data from the site, not to mention search it in any reasonable timeframe. And that's only one site out of millions!
Google has, like, an entire custom operating system that their software runs on, I think. (Somebody does, maybe Facebook?)
I also bet high-quality hard drives would help a ton in this scenario. Drives with very fast read speeds would speed up queries quite a bit, I think.
A lot of their speed comes from some very specifically tailored hardware architecture modifications. I wrote a paper/presentation on it when I was still an undergrad and could dig it up if anyone is interested.
Yes, that would be quite interesting, I think!
Quote from: Blaze on July 17, 2010, 02:07:36 AM
Yes, that would be quite interesting, I think!
Yes, I think that would be quite interesting.
Quote from: Chavo on July 16, 2010, 05:40:36 PM
A lot of their speed comes from some very specifically tailored hardware architecture modifications. I wrote a paper/presentation on it when I was still an undergrad and could dig it up if anyone is interested.
I'm interested.
It sounds interesting.
Google's distribution algorithm is called MapReduce (http://en.wikipedia.org/wiki/MapReduce). It's basically a massive parallelization algorithm.
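For anyone who hasn't seen it, the idea fits in a few lines of Python. This is just a toy word count on one machine; the real system runs the map and reduce steps in parallel across thousands of machines, with the shuffle happening over the network.

Code:
from collections import defaultdict

def map_phase(documents):
    # "map": emit (word, 1) pairs from every document.
    # In a real MapReduce these calls run in parallel on many machines.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # "shuffle" + "reduce": group the pairs by key and sum the counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["the quick brown fox", "the lazy dog", "the fox"]
print(reduce_phase(map_phase(docs)))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}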
Quote from: MyndFyre on August 12, 2010, 12:31:09 AM
Google's distribution algorithm is called MapReduce (http://en.wikipedia.org/wiki/MapReduce). It's basically a massive parallelization algorithm.
Combo breaker!
I've always thought of MapReduce more as a framework. I don't think it's really an algorithm, is it?
Quote from: Sidoh on August 12, 2010, 06:22:41 AM
I've always thought of MapReduce more as a framework. I don't think it's really an algorithm, is it?
Quote from: Wikipedia
MapReduce is a software framework
I never remember to go look when I'm at home and have access to my backup server. However, here is a crappy article about it from people who don't know what they're talking about and are impressed by things that are actually pretty common in enterprise environments:
http://news.cnet.com/8301-1001_3-10209580-92.html
What it doesn't talk about is the multi-tiered structure in which search requests are actually handled (at the hardware level). Each cluster group has nodes dedicated to handling requests, nodes that route requests to the servers most likely to have results cached, and servers that do nothing but optimize what is currently cached in their huge memory banks from disk. Essentially, they replace a typical SAN environment with a distributed cache/routing/control cluster.
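To make the routing tier concrete, here's a toy sketch in Python of the idea as I understand it: pin each query to a node deterministically, so repeated queries land on a server whose in-memory cache probably already holds the results. The node names are made up, and hash-based affinity is my illustration of the concept, not necessarily what Google actually does.

Code:
import hashlib

# Hypothetical cache nodes -- names invented for illustration.
NODES = ["cache-node-a", "cache-node-b", "cache-node-c", "cache-node-d"]

def route(query):
    # Deterministically map a query to one node, so the same query
    # always hits the same node's in-memory cache.
    h = int(hashlib.md5(query.encode("utf-8")).hexdigest(), 16)
    return NODES[h % len(NODES)]

print(route("foo bar"))          # same node every time for this query
print(route("another search"))   # a different query may go elsewhere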