A regular Internet Outsider reader roasted me for embracing dark-fiber conspiracy theories as an explanation for where all that Google CAPEX is going and provided the explanation below. I have heard counter-arguments to this, which suggest that Google could not possibly require twice as much hardware as Yahoo! just for search alone, but said arguments were not nearly so detailed.
Thanks again to the anonymous contributor (and to anyone else who cares to weigh in). I'm happy to keep tossing out boneheaded theories as long as they prompt better-informed folks to share their knowledge.
Google's CAPEX is higher than Yahoo!'s because of spending on hardware (i.e., machines, CPUs, memory, motherboards and hard drives). Yahoo!'s search traffic is much smaller than Google's. In the US, Yahoo! is about half Google's size, and internationally (save Japan) it is further behind. There is a (near) linear relationship between the number of searches served and the number of machines needed to serve them.
Secondly, Google's search index is larger than Yahoo!'s (about three times as large). This does not mean that Google needs three times as many machines (there are tricks that can be done, as not all searches need the full index), but it does nearly double Google's hardware needs.
Thirdly, Yahoo!'s non-search properties need fewer machines per serving event than a search does. Serving a webpage requires bothering one machine, and each machine can probably deal with hundreds of requests a second. Doing a search requires bothering thousands of machines. Thus, although Yahoo! has more total traffic, the hardware needed to serve it is much less.
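The argument so far can be put into a rough back-of-the-envelope sketch. Only the linear relationship and the "one machine versus thousands" contrast come from the explanation above; the function name and every traffic figure below are hypothetical placeholders chosen to show the shape of the argument, not real numbers from Google or Yahoo!.

```python
import math

def machines_needed(requests_per_sec, machines_per_request, requests_per_machine_per_sec):
    """Machines required if hardware scales linearly with traffic.

    Each request occupies `machines_per_request` machines, and each machine
    can take part in `requests_per_machine_per_sec` requests every second.
    """
    machine_requests = requests_per_sec * machines_per_request
    return math.ceil(machine_requests / requests_per_machine_per_sec)

# Hypothetical inputs: lots of page views that each touch one machine,
# versus fewer searches that each fan out to 1,000 machines.
page_view_machines = machines_needed(100_000, 1, 200)
search_machines = machines_needed(10_000, 1_000, 200)

print(page_view_machines)  # 500
print(search_machines)     # 50000
```

Under these made-up inputs the site with ten times the traffic needs a hundredth of the hardware, and doubling the search rate doubles the search machine count.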
Fourthly, Google's AdSense product requires a lot of machines. In Google's last analyst presentation they disclosed how many AdSense impressions they get (128 per user per month for 68% of internet users, which works out to about 1B impressions a day). AdSense serving is similar to search (with less data to index, but more bookkeeping work required).
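As a quick sanity check on those AdSense figures, here is the arithmetic run backwards to see what audience size they imply. The 30-day month is an assumption, and the resulting user counts are derived from the three quoted numbers only; they are not figures from the presentation.

```python
IMPRESSIONS_PER_USER_PER_MONTH = 128   # quoted above
REACH = 0.68                           # share of internet users, quoted above
IMPRESSIONS_PER_DAY = 1_000_000_000    # quoted above (rounded)
DAYS_PER_MONTH = 30                    # assumption

# Users AdSense would have to reach for the quoted numbers to agree.
users_reached = IMPRESSIONS_PER_DAY * DAYS_PER_MONTH / IMPRESSIONS_PER_USER_PER_MONTH

# Total internet users implied by the 68% reach figure.
implied_internet_users = users_reached / REACH

print(round(users_reached / 1e6))           # 234 (million)
print(round(implied_internet_users / 1e6))  # 345 (million)
```

So 1B impressions a day is consistent with roughly 234 million users reached, out of an implied base of about 345 million.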
Finally, in Google's analyst presentation, they say "All webpages included in the Google index and searched all the time". This suggests that they intend to increase their index size.
Henry, I have been a fan since your Amazon.com call. I was about 17 at the time. I'm just busting your balls about having more content bro. You have a good site. Not as good as guckedgoogle.com, but still good.
I found a new one today called something Bastille's Search blog. Good site there too.
Posted by: King Troll | April 13, 2006 at 08:01 PM
Not to perpetuate the dark fiber conspiracy, but there is an interesting theory played out in Robert Cringely's weekly PBS column. He says that Google IS buying fiber, but not to enter the ISP market. Instead, Google is building the infrastructure to support intense data feeds from video downloads, IPTV, VOIP, and other goodies expected from web 2.0. For those interested, check out the November 17th and November 24th posts. http://www.pbs.org/cringely/archive/2005.html
Posted by: Michael | April 13, 2006 at 11:20 PM
There are a few small inaccuracies I'd like to address here.
>> Secondly, Google's search index is larger than Yahoo!'s, (about three times as large).
Not true. Overall the two have about the same total index size, Google having a slight edge. The apparent massive disparity is due to the two using different rules to decide what to dump from the total index when constructing the searchable index. Yahoo claimed to have indexed a total of 20 billion items in August '05, although much of it is clearly spam, so doesn't make it into the main search index.
>> To do a search requires bothering thousands of machines
Not true. Search engines do a lot of preprocessing of their data, and break the index up into chunks (I think Google refer to them as "shards") which contain topically related datasets. When you enter a search at google.com, you first hit a load-balancing bank, which assigns your query to the datacentre likely to be able to answer you most quickly, based on your location, and current datacentre loading.
You then hit the query parser at the DC, which looks at your query and decides which shard or shards are most likely to contain your answer. The query is sent to the relevant machine(s), and a SERP is served. Additionally, popular searches are run ahead of time and cached, which is why popular terms / searches often come back with a sub-second response time. A given query might impact on tens of machines, but not thousands. It is still a much more expensive process than serving a web page though, and Google do serve more searches than anyone else.
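A toy sketch of the flow described in this comment: check a cache of precomputed popular queries first, otherwise send the query only to the shards likely to hold an answer. The shard names, documents and function names are all hypothetical, and this is an illustration of the idea rather than how Google actually routes queries.

```python
# Hypothetical topic shards: shard name -> {term: [document ids]}
SHARDS = {
    "travel": {"flights": ["doc1", "doc2"], "hotels": ["doc3"]},
    "tech": {"python": ["doc4"], "flights": ["doc5"]},
}

# Hypothetical results for popular queries, computed ahead of time.
CACHE = {"hotels": ["doc3"]}

def pick_shards(query):
    """Return the names of shards whose index contains any query term."""
    terms = query.lower().split()
    return [name for name, index in SHARDS.items()
            if any(term in index for term in terms)]

def search(query):
    """Return (results, shards touched) for a query."""
    key = query.lower().strip()
    if key in CACHE:
        return CACHE[key], []  # served from cache, no shard is bothered
    touched = pick_shards(key)
    results = []
    for name in touched:
        for term in key.split():
            results.extend(SHARDS[name].get(term, []))
    return results, touched

print(search("hotels"))   # (['doc3'], [])
print(search("python"))   # (['doc4'], ['tech'])
print(search("flights"))  # (['doc1', 'doc2', 'doc5'], ['travel', 'tech'])
```

The point is that the number of machines a query touches depends on how many shards the parser selects, and a cache hit touches none.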
>> Fourthly, Google's Adsense product requires a lot of machines.
And Yahoo's equivalent product, Yahoo Publisher Network, is run by ad fairies? AdSense is bigger because it's older, but YPN got a flying start out of the blocks due to all the "banned from AdSense" sites out there. I know a couple of major players (in the millions of contextual ad impressions served a day over their networks) who prefer YPN simply because it keeps them off Google's radar.
I just can't believe that ALL the money is going on search. Yes, Google have a higher search volume to service, but Yahoo aren't slouches either. Google are also very good at keeping their unit costs down on the machines they use. They created a custom OS and filesystem specifically to allow them to use cheap, off-the-shelf components in their DC machines: $1000 / unit to run the biggest SE on the planet.
Even with the current upgrade cycle, I just can't see how they could spend that much cash on search hardware. Maybe it's not all going on dark fibre, but it isn't going on search...
Posted by: TallTroll | April 14, 2006 at 07:09 AM
Apparently all that money is going to fridges - I got a small fridge that says "Google AdWords: Cooler Thinking" on it - you think this might be their alternative to dark fiber? I know what you're thinking - this couldn't even put a dent in all that capital they have available. Before you come to that conclusion, though, keep in mind that it also comes with a connector to plug into a cigarette lighter (hello road trip) and a button to make it a food warmer instead of a food cooler....
Posted by: Preston | April 14, 2006 at 11:11 AM
This is a wonderful insight into the CAPEX discrepancy Henry, and it makes a lot of sense. Thanks for the detective work!
Posted by: Victor | April 14, 2006 at 07:08 PM
yes.ok
Posted by: music | April 15, 2006 at 01:54 AM
A common trade-off in computer science is between speed and storage space. Undoubtedly, one of the ways Google has made searches fast is by storing data in less space-efficient ways (for instance, redundant copies of data). It wouldn't surprise me if Google requires several times the storage space that its competitors require for an equivalent amount of data.
Posted by: Roger | April 15, 2006 at 05:48 PM
Henry -
Perhaps Goog is spending all their - I mean investor - money on building a V network.... http://biz.yahoo.com/fool/060417/114529297707.html?.v=2
Posted by: bellerose | April 17, 2006 at 04:59 PM
good.
Posted by: mtv200 | April 19, 2006 at 09:25 AM
*sigh* mtv200 is another forum spammer from China. Cleanup on aisle 10!
Posted by: TallTroll | April 19, 2006 at 09:39 AM