Please refer to the syllabus for more information about the course.
The focus of this homework assignment is crawling. Write a program that crawls the Internet starting from a given seed URL. Your crawler program will need crawling guidance.
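As a rough illustration of crawling from a seed URL, here is a minimal breadth-first sketch using only the Python standard library. The names `LinkParser`, `fetch`, `crawl`, and the `max_pages` and `fetch_page` parameters are assumptions made for this example, not requirements of the assignment. The page limit and the http/https filter stand in for one possible form of crawling guidance.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page and return its body as text (needs network access)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seed, max_pages=10, fetch_page=fetch):
    """Breadth-first crawl from seed; returns {url: html} for fetched pages."""
    queue = deque([seed])
    seen = {seed}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to download
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            link, _ = urldefrag(urljoin(url, href))
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Passing a different `fetch_page` function makes the crawl logic testable without network access. A real crawler would also need politeness rules, such as honoring robots.txt (the standard library provides `urllib.robotparser`) and rate limiting, which this sketch omits.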
Once the information has been crawled and stored locally, meta information must be extracted from it. This assignment therefore focuses on extraction. Write a program that extracts information from the data crawled in assignment 1.
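One possible starting point for extraction, assuming the crawled data is HTML, is to pull the page title and the `<meta name=... content=...>` tags. The `MetaExtractor` class and `extract` function below are hypothetical names for this sketch. Which fields to extract is left to the assignment.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collects the <title> text and named <meta> tags from one HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            # Meta names are case-insensitive, so normalize to lowercase.
            self.meta[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract(html):
    """Return {"title": str, "meta": {name: content}} for one page."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "meta": parser.meta}
```

Applied to each stored page, `extract` yields a small record that can be saved next to the raw HTML for later indexing.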
The process of indexing and ranking involves taking explicitly and implicitly obtained metadata and storing it in a "database" for faster recall. Most importantly, we are concerned with how to organize information so that the "intent" of the user is correctly met. Write a program that ranks the data collected in assignment 1 and the meta information extracted in assignment 2.
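To make the idea of indexing for faster recall concrete, the sketch below builds an in-memory inverted index and scores documents with a simple TF-IDF sum. TF-IDF is chosen here only as a familiar example and is not prescribed by the assignment. The names `tokenize`, `build_index`, and `rank` are assumptions for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """docs: {doc_id: text}. Returns an inverted index {term: {doc_id: count}}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index


def rank(query, index, num_docs):
    """Score documents by summing tf * idf over the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in any document
        idf = math.log(num_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties are broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

With this scoring, a term that appears in every document has an IDF of zero and contributes nothing, so rarer terms dominate the ranking. Capturing user intent more closely would require signals beyond term counts.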
Building a "Query Interface" means providing a simple mechanism to search and present the hierarchical information retrieved. Write a command-line interface and a web-enabled interface based on the indexed and ranked information from assignment 3.
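The command-line half of the query interface might look something like the following sketch, built on `argparse`. The `RANKED` dictionary is a hypothetical stand-in for the index produced in assignment 3, and the URLs, scores, and option names are made up for illustration. The web-enabled interface is not covered here.

```python
import argparse

# Hypothetical stand-in for the ranked index from assignment 3:
# each term maps to (url, score) pairs already sorted by score.
RANKED = {
    "crawler": [("http://example.com/a", 0.9), ("http://example.com/b", 0.4)],
}


def search(term, limit):
    """Return up to `limit` ranked results for a single term."""
    return RANKED.get(term.lower(), [])[:limit]


def format_results(term, results):
    """Turn (url, score) pairs into numbered lines for display."""
    if not results:
        return [f"No results for '{term}'"]
    return [
        f"{i}. {url} (score {score:.2f})"
        for i, (url, score) in enumerate(results, 1)
    ]


def main(argv=None):
    """Parse the command line, run the search, and print the results."""
    parser = argparse.ArgumentParser(description="Query the local search index")
    parser.add_argument("term", help="search term")
    parser.add_argument("-n", "--limit", type=int, default=10,
                        help="maximum number of results")
    args = parser.parse_args(argv)
    lines = format_results(args.term, search(args.term, args.limit))
    print("\n".join(lines))
    return lines

# To use this as a script, call main() under an
# `if __name__ == "__main__":` guard at the end of the file.
```

Keeping `search` and `format_results` separate from argument parsing means a web interface could reuse the same two functions and differ only in how the query arrives and how the lines are rendered.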
Your final project consists of putting together a complete information retrieval system. This search engine combines all of your previous programming assignments into a suite of applications. Note that a search engine system is not necessarily a single program.