This week we are combining several themes from the last few labs to implement the HITS algorithm, which is more or less a Google-lite algorithm. Since this is the 10th week and this lab is more a challenge of putting past work together, I am demanding that you work no more than 3 hours on it. Get as far as you can, but do not let it dominate your studying.
Either use your web spider or a list of generated webpages to get a set of pages. From these pages, use Professor Levow's IR notes to get the list of pages that match a query. Use the matching pages as your set of pages for the next step.
With your webpages, create a node for each page, with neighbors such that webpage q is a neighbor of webpage p if and only if p contains a hyperlink to q. Notice that just because q is a neighbor of p, this does not mean that p is a neighbor of q.
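Here is a minimal sketch of one way to build that neighbor structure, assuming your pages are already in memory as a dictionary from page name to HTML text. The names LinkCollector, build_graph, and the three sample pages are made up for illustration, and links are matched by exact name only, so your spider's output may need some cleanup first.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def build_graph(pages):
    """pages: dict of page name -> HTML text (an assumed input format).

    Returns a dict of page name -> set of neighbors, where q is a
    neighbor of p if and only if p contains a hyperlink to q.
    Links to pages outside the set are ignored.
    """
    graph = {}
    for name, html in pages.items():
        collector = LinkCollector()
        collector.feed(html)
        graph[name] = {link for link in collector.links if link in pages}
    return graph


# Hypothetical sample pages, just to show the one-way neighbor relation.
pages = {
    "a.html": '<a href="b.html">B</a> <a href="c.html">C</a>',
    "b.html": '<a href="c.html">C</a>',
    "c.html": "<p>no links here</p>",
}
graph = build_graph(pages)
```

In this sample, c.html is a neighbor of a.html, but a.html is not a neighbor of c.html.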
A webpage is an authority if many pages link to it, and a webpage is a hub if it links to an authority. You might notice that this is a bit of a catch-22. How do you know whether a webpage is a hub if you don't know which pages are authorities? We discover this by iterating!
We will use a vector x, with entry x_p for webpage p, to hold the authority weights, and a vector y, with entry y_q for webpage q, to hold the hub weights. Initially let all values of x and y be 1. Then iterate through the following process a number of times (let's just say 3):
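A minimal sketch of the iteration, assuming the standard HITS update: each authority weight x_p becomes the sum of the hub weights y_q of the pages q that link to p, each hub weight y_p becomes the sum of the new authority weights of the pages p links to, and both vectors are then rescaled so their squares sum to 1. The function names, the rescaling helper, and the sample graph are my own choices for illustration.

```python
import math


def normalize(weights):
    """Rescale so the squares of the weights sum to 1 (no-op if all are 0)."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)
    return {p: w / norm for p, w in weights.items()}


def hits(graph, iterations=3):
    """graph: dict of page -> set of neighbors (pages it links to).

    Every neighbor is assumed to also be a key of graph.
    Returns (x, y): authority weights and hub weights.
    """
    x = {p: 1.0 for p in graph}  # authority weights
    y = {p: 1.0 for p in graph}  # hub weights
    for _ in range(iterations):
        # Authority update: add up the hub weights of pages linking to p.
        new_x = {p: 0.0 for p in graph}
        for q, neighbors in graph.items():
            for p in neighbors:
                new_x[p] += y[q]
        # Hub update: add up the new authority weights of pages p links to.
        new_y = {p: sum(new_x[q] for q in graph[p]) for p in graph}
        x, y = normalize(new_x), normalize(new_y)
    return x, y


# Hypothetical three-page graph: a links to b and c, b links to c.
graph = {"a": {"b", "c"}, "b": {"c"}, "c": set()}
x, y = hits(graph)
```

On this small graph, page c ends up with the largest authority weight and page a with the largest hub weight.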
The pages with the highest x weights are your "authorities" and the pages with the highest y weights are your "hubs". This is typical of the information used by internet search algorithms.
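Picking out the top pages can be sketched as a simple sort, assuming the weights are stored in a dictionary from page name to weight. The helper top_pages and the sample numbers are made up for illustration and are not results from a real crawl.

```python
def top_pages(weights, k):
    """Return the k page names with the largest weights, highest first."""
    return sorted(weights, key=weights.get, reverse=True)[:k]


# Hypothetical authority weights for three pages.
x = {"a.html": 0.0, "b.html": 0.45, "c.html": 0.89}
authorities = top_pages(x, 2)
```

Calling the same helper on the y weights gives your hubs.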