CS100 Lecture Notes - Lecture 10: Bow Tie
Document Summary
- Crawl: to follow hyperlinks automatically on the World Wide Web or a particular web site and retrieve documents, typically for the purpose of indexing.
- After a page's postings list is created, its hyperlinks are extracted so the linked pages can be read and indexed in turn.
- Crawlers maintain queues of discovered links to make sure no hyperlinks are missed.
- Crawlers start from pages that are known to be good seeds (e.g. www.yahoo.com).
- According to Web Dragons, the structure of the web resembles a bow tie.
- Deep web: a collection of data stored on pages without HTML, not accessible to many web crawlers.
- Sink page: a page that has no links to other pages.
- The size of the web is measured by the number of indexed pages; the total size of the web is estimated by comparing coverage and overlaps.
- Search engines employ multiple crawlers simultaneously to gather more pages per hour.
- Crawlers revisit web pages periodically so that indexes remain correct.
- For a web page to be discovered, it needs to be linked from an existing (already known) page.
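The queue-based crawling idea above can be sketched in a few lines. The sketch below simulates the web as an in-memory dictionary of hypothetical page names (a real crawler would fetch and parse HTML instead); it shows how a queue plus a visited set ensures every reachable link is followed exactly once, and why a page with no inbound links is never discovered.

```python
from collections import deque

# Hypothetical simulated web: page -> hyperlinks found on that page.
# (Assumed example data; real crawlers extract links from fetched HTML.)
WEB = {
    "seed.example": ["a.example", "b.example"],
    "a.example":    ["b.example", "sink.example"],
    "b.example":    ["seed.example", "a.example"],
    "sink.example": [],                 # a sink page: no outgoing links
    "orphan.example": ["a.example"],    # never linked from anywhere, so never found
}

def crawl(seed):
    """Breadth-first crawl: queue newly discovered links so none are missed."""
    queue = deque([seed])
    visited = set()
    order = []
    while queue:
        page = queue.popleft()
        if page in visited:             # skip pages already processed
            continue
        visited.add(page)
        order.append(page)              # stand-in for "fetch and index the page"
        for link in WEB.get(page, []):  # extract hyperlinks from the page
            if link not in visited:
                queue.append(link)      # enqueue for a later visit
    return order

# Starting from the seed, the crawl reaches every linked page,
# but orphan.example is never discovered: nothing links to it.
print(crawl("seed.example"))
```

Note how `orphan.example` never appears in the output even though it exists in `WEB`: this illustrates the last point above, that a page must be linked from an already known page to be discovered.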