Google built a crawler called Googlebot, a robot that indexes Web pages (and now other types of documents). Its principle is simple (though not its implementation!): when it reads a page, it adds every page linked from the one it is currently processing to its list of pages to visit.
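As an illustration, here is a minimal sketch of that crawl loop in Python, assuming a plain breadth-first traversal with urllib and html.parser; it is only a toy, nothing like Google's distributed implementation.

```python
# A minimal sketch of the crawl loop described above (not Google's actual code):
# fetch a page, extract its links, and enqueue every link not yet seen.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Collects the href targets of the <a> tags found on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl: index each page, then visit the pages it links to."""
    frontier = deque(seed_urls)   # pages still to visit
    seen = set(seed_urls)         # pages already discovered
    index = {}                    # url -> outgoing links (stand-in for a real index)
    while frontier and len(index) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url).read().decode("utf-8", errors="replace")
        except OSError:
            continue              # unreachable page: skip it
        parser = LinkCollector()
        parser.feed(html)
        links = [urljoin(url, href) for href in parser.links]
        index[url] = links
        for link in links:
            if link not in seen:  # orphan pages never show up here: nothing links to them
                seen.add(link)
                frontier.append(link)
    return index
```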
In theory, it should therefore be able to discover most of the pages on the Web, namely all those that are not orphans (a page is said to be an orphan if no other page links to it). Given the sheer volume of data to process, the robot runs as a program distributed across hundreds of servers.
Beyond discovering as many pages as possible, Google also wants to re-index them regularly, since many pages are updated from time to time. How often Googlebot visits a given page depends on its PageRank: the higher it is, the more frequently the page is re-indexed. From one pass to the next, Googlebot can also detect that a page no longer exists ("error 404").
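As a rough illustration of such a revisit policy, the sketch below simply assumes that the revisit interval is inversely proportional to PageRank and that a page answering with HTTP 404 should be dropped from the index; the names and the constant are illustrative, not Google's.

```python
# Illustrative revisit policy: higher PageRank -> shorter interval between visits,
# and a 404 response means the page has disappeared and can be removed.
import urllib.error
import urllib.request

BASE_INTERVAL = 7 * 24 * 3600  # assumed: seconds between visits for a PageRank of 1.0

def revisit_interval(pagerank):
    """Higher PageRank gives a shorter interval, so the page is re-indexed more often."""
    return BASE_INTERVAL / max(pagerank, 0.01)

def still_exists(url):
    """Re-fetch a known page; a 404 response means it should be dropped from the index."""
    try:
        urllib.request.urlopen(url)
        return True
    except urllib.error.HTTPError as err:
        return err.code != 404
```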