CS100 Lecture Notes - Lecture 10: Regular Graph, Directed Graph, Financial Institution
Document Summary
Spiders (also called crawlers or robots) are computer programs that start at one website and follow its links to other websites, exploring page after page and collecting information into an index that users can then search.

- Indexing the web: a spider can be general-purpose, or it can be focused on a particular topic.
- Subscriptions: some websites require a subscription to access; business relationships between those companies and search engines can let a spider (e.g. Google's) see page content that an ordinary visitor cannot.
- Dynamic content: a page may change its content based on who is viewing it, so the spider may see the page differently than a user does.
- Query strings: because many URL variations can point at similar content, it is unclear whether the spider needs to check every page on every website.
- Accents: the index should include all variations of a word, with or without accents.
- Stop words: words such as "the", "it", and "is" occur too frequently to be useful; compiling every occurrence of them would be overwhelming, so they are not indexed.
- Word variants: forms such as sell, sells, selling, sold, resell, resold, unsold, etc. need to be treated as the same word.
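The link-following behavior described above can be sketched as a breadth-first traversal of the web graph. This is a minimal sketch, not a real crawler: the page contents and URLs below are made up, and a real spider would fetch each page over HTTP instead of reading from an in-memory dictionary.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
# A real spider would download the page and extract these links from its HTML.
PAGES = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example", "d.example"],
    "d.example": [],
}

def crawl(start):
    """Visit every page reachable from `start`, each exactly once."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)          # here a real crawler would index the page
        for link in PAGES.get(url, []):
            if link not in seen:   # the web graph has cycles; avoid revisiting
                seen.add(link)
                queue.append(link)
    return order

print(crawl("a.example"))  # visits a, b, c, d once each despite the c -> a cycle
```

The `seen` set is the essential piece: without it, the cycle between `a.example` and `c.example` would make the spider loop forever.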
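The indexing issues in the notes (accents, stop words, word variants) can be illustrated with a small inverted-index sketch. The stop-word list and the suffix-stripping rule here are toy assumptions for illustration: real engines use much larger stop lists and a proper stemmer (e.g. Porter's), which also handles irregular forms like "sold" that simple suffix stripping cannot.

```python
import unicodedata

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "it", "is", "a", "of"}

def normalize(word):
    """Strip accents so e.g. 'café' and 'cafe' index to the same term."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")  # drop combining marks

def stem(word):
    """Crude suffix stripping: 'sells' and 'selling' both become 'sell'.
    Irregular variants like 'sold' would need a real stemmer or a lookup table."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_page(url, text, index):
    """Add each indexable word of `text` to the inverted index for `url`."""
    for raw in text.split():
        word = stem(normalize(raw))
        if word in STOP_WORDS:
            continue  # too frequent to be worth recording
        index.setdefault(word, set()).add(url)
    return index

index = {}
index_page("page1.example", "The café sells coffee", index)
print(sorted(index))  # stop word 'the' is gone; 'café'/'sells' stored as 'cafe'/'sell'
```

A search for "selling cafe" would then stem and normalize the query the same way, so it matches the page even though neither exact word appears in the text.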