Google built a crawler called Googlebot, a robot that indexes Web pages (and now other types of documents). Its principle is simple (though not its implementation!): when it reads a page, it adds every page linked from the one it is currently processing to its list of pages to visit.
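As an illustration, here is a minimal sketch of that crawl loop in Python, assuming a plain breadth-first traversal with urllib and html.parser; it is only a toy, nothing like Google's distributed implementation.

```python
# A minimal sketch of the crawl loop described above (not Google's actual code):
# fetch a page, extract its links, and enqueue every link not yet seen.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Collects the href targets of the <a> tags found on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl: index each page, then visit the pages it links to."""
    frontier = deque(seed_urls)   # pages still to visit
    seen = set(seed_urls)         # pages already discovered
    index = {}                    # url -> outgoing links (stand-in for a real index)
    while frontier and len(index) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url).read().decode("utf-8", errors="replace")
        except OSError:
            continue              # unreachable page: skip it
        parser = LinkCollector()
        parser.feed(html)
        links = [urljoin(url, href) for href in parser.links]
        index[url] = links
        for link in links:
            if link not in seen:  # orphan pages never show up here: nothing links to them
                seen.add(link)
                frontier.append(link)
    return index
```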
In theory, it should therefore be able to discover most of the pages on the Web, namely all those that are not orphans (a page is said to be an orphan if no other page links to it). Given the sheer volume of data to process, the robot runs as a program distributed across hundreds of servers.
Beyond discovering as many pages as possible, Google also wants to re-index them regularly, since many pages are updated from time to time. How often Googlebot visits a given page depends on its PageRank: the higher it is, the more frequently the page is re-indexed. From one pass to the next, Googlebot can also detect that a page no longer exists ("error 404").
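As a rough illustration of such a revisit policy, the sketch below simply assumes that the revisit interval is inversely proportional to PageRank and that a page answering with HTTP 404 should be dropped from the index; the names and the constant are illustrative, not Google's.

```python
# Illustrative revisit policy: higher PageRank -> shorter interval between visits,
# and a 404 response means the page has disappeared and can be removed.
import urllib.error
import urllib.request

BASE_INTERVAL = 7 * 24 * 3600  # assumed: seconds between visits for a PageRank of 1.0

def revisit_interval(pagerank):
    """Higher PageRank gives a shorter interval, so the page is re-indexed more often."""
    return BASE_INTERVAL / max(pagerank, 0.01)

def still_exists(url):
    """Re-fetch a known page; a 404 response means it should be dropped from the index."""
    try:
        urllib.request.urlopen(url)
        return True
    except urllib.error.HTTPError as err:
        return err.code != 404
```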