Unique Pages and Google Indexing

The web is a big place. Google's first index, built in 1998, already held 26 million pages. By 2000, the index had passed the one billion mark. Over the last nine years, the amount of content Google handles has kept growing. Recently, even Google's search engineers were amazed at how big the web has become: the systems that process links to find new content reached 1 trillion unique URLs. In other words, Google has seen 1 trillion unique URLs on the web at once, and the number continues to grow at a fast pace.

Now the question is how to find these pages. The crawl starts with a set of well-connected pages and follows their links to new pages. The links on those pages lead to still more pages, and so on, until the crawler has built up a huge list of links. Not every link leads to a unique page: many URLs point to duplicate content, and much of that duplication is auto-generated. Even after the duplicate URLs are removed, about a trillion unique URLs remain, and the number of unique web pages is increasing day by day.
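The crawl described above is essentially a breadth-first traversal of the link graph. Here is a minimal sketch of that idea, assuming a tiny in-memory "web" (the `TOY_WEB` dictionary and its example URLs are made up for illustration) instead of real HTTP requests. It is not Google's crawler, only the general technique.

```python
from collections import deque

# Hypothetical toy web: each URL maps to the list of URLs it links to.
TOY_WEB = {
    "a.example/": ["b.example/", "c.example/"],
    "b.example/": ["c.example/", "d.example/"],
    "c.example/": ["a.example/"],
    "d.example/": [],
}


def crawl(seeds, links):
    """Breadth-first crawl: follow links from the seeds, visiting each URL once."""
    seeds = list(dict.fromkeys(seeds))  # drop repeated seeds, keep their order
    seen = set(seeds)
    queue = deque(seeds)
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for target in links.get(url, []):
            if target not in seen:  # skip URLs that were already discovered
                seen.add(target)
                queue.append(target)
    return order


discovered = crawl(["a.example/"], TOY_WEB)
print(discovered)
```

The `seen` set is what keeps the crawl from looping forever when pages link back to each other, as `c.example/` does here.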

How many web pages or unique URLs does the web really have? That is very difficult to answer, because strictly speaking the number of web pages is infinite. Take a web calendar, for instance. You can follow the link to the next day, and from there go on through an endless sequence of days, each one a "new" page. Crawling like that brings very little benefit. So the size of the web really depends on what you are looking for.
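The calendar example can be sketched in a few lines. The code below assumes a hypothetical calendar site where every day links to the following day, and shows one common way a crawler might protect itself: a fixed page budget. The function names and the budget value are illustrative assumptions, not a description of how Google handles such sites.

```python
from datetime import date, timedelta


def next_day_link(day):
    """Hypothetical calendar page: every day links to the following day."""
    return day + timedelta(days=1)


def crawl_calendar(start, max_pages):
    """Follow 'next day' links, stopping once the page budget is spent."""
    visited = []
    day = start
    while len(visited) < max_pages:
        visited.append(day)
        day = next_day_link(day)
    return visited


pages = crawl_calendar(date(2009, 1, 1), max_pages=5)
print(pages[-1])
```

Without the `max_pages` limit the loop would keep producing new pages for as long as it ran, which is the sense in which the web has no fixed size.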

The next problem is how to index the web pages. Not all of the trillion pages on the web are indexed, because many of them have content that is auto-generated or duplicated. Even so, Google describes its index as the most comprehensive of any search engine, and its stated goal is to index all the data on the internet.
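One simple way to keep exact duplicates out of an index is to hash each page body and keep only the first URL seen for each hash. The sketch below assumes a made-up set of crawl results (`PAGES`) and a hypothetical helper `select_for_index`. It catches only byte-for-byte duplicates, and the source does not say which technique Google actually uses.

```python
import hashlib

# Hypothetical crawl results: URL -> page body.
PAGES = {
    "shop.example/item?id=1": "Red widget, $5",
    "shop.example/item?id=1&ref=home": "Red widget, $5",
    "shop.example/item?id=2": "Blue widget, $7",
}


def select_for_index(pages):
    """Keep the first URL seen for each distinct page body."""
    seen_digests = set()
    kept = []
    for url, body in pages.items():
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if digest not in seen_digests:  # first time this exact content appears
            seen_digests.add(digest)
            kept.append(url)
    return kept


indexed = select_for_index(PAGES)
print(indexed)
```

Pages that differ only slightly, such as auto-generated variants with a changed date, would pass this check, so a real system would need some form of near-duplicate detection as well.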

Initially, Google processed a fixed set of web data to answer queries, and the work was done in batches. For example, one workstation could compute the PageRank graph of 26 million pages in a matter of hours, and that set of pages would then serve as Google's index for a fixed period of time. Now the process is different. Google downloads the web continuously, updates its page information, and reprocesses the entire web link graph several times a day. The link graph is like a road map in which every URL is an intersection: reprocessing it is comparable to exploring every road and every intersection in the United States, several times a day.
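To give a feel for what computing PageRank over a link graph involves, here is a minimal sketch of the textbook power-iteration version on a four-page graph. The graph, the damping factor of 0.85, and the iteration count are illustrative assumptions, and the code assumes every link target also appears as a key in the dictionary. It is not Google's production algorithm.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of page -> outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, targets in links.items():
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical link graph: page -> pages it links to.
GRAPH = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(GRAPH)
print(max(ranks, key=ranks.get))
```

Each pass touches every link once, which is why recomputing the graph for a trillion URLs several times a day calls for a distributed system rather than a single workstation.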

All of this work serves one purpose: answering your Google search. The link graph has a trillion connections, and Google's distributed infrastructure makes it possible to traverse that graph and to sort petabytes of data.

Tags: Google, unique web pages, unique URLS
Author: Sone Selva