Andy R Terrel - CMSC 16100 - Escaping the Minotaur: Graph Game

Introduction

This week we combine several themes from the last few labs to implement the HITS algorithm, which is more or less a Google-lite algorithm. Since this is 10th week and this lab is mostly a challenge of putting past work together, I am demanding that you work no more than 3 hours on it. Get as far as you can, but do not let it dominate your studying.

HITS Algorithm

The algorithm has three sections; complete each section as you please and document what you accomplish.

Determine a list of associated webpages
Apply a graph structure to these webpages based upon links
Compute the "authorities" and "hubs" based upon the graph

Associated webpages

Use either your web spider or a list of generated webpages to get a set of pages. Then use Professor Levow's IR notes to select the pages that match a query. These matching pages are your set of pages for the next step.

Graph structure

With your webpages, create a node for each page, with neighbors such that webpage q is a neighbor of webpage p if and only if p contains a hyperlink to q. Notice that just because q is a neighbor of p, this does not mean that p is a neighbor of q: the graph is directed.
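
As a rough illustration of the neighbor rule, here is a minimal Python sketch. The function name build_graph and the input format (a dictionary from each page to the list of hyperlinks found on it) are assumptions made for this example, not part of the lab. It also assumes that links pointing outside your set of pages should simply be dropped.

```python
def build_graph(links):
    """Return a dict mapping each page p to the set of its neighbors.

    `links` (hypothetical input format) maps each page to the list of
    hyperlinks found on that page. q is a neighbor of p exactly when
    p contains a hyperlink to q and q is in our set of pages.
    """
    pages = set(links)
    # Assumption: links leaving the page set are ignored.
    return {p: {q for q in targets if q in pages}
            for p, targets in links.items()}


# Made-up example pages, for illustration only.
example_links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a", "http://elsewhere.example/"],
}
example_graph = build_graph(example_links)
# "c" is a neighbor of "b", but "b" is not a neighbor of "c".
```

A dictionary of sets is only one possible representation; a node type with a neighbor list from an earlier lab would work just as well.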

Computation

A webpage is an authority if many pages link to it, and a webpage is a hub if it links to authorities. You might notice that this is a bit of a catch-22 statement. How do you know if a webpage is a hub if you don't know which pages are authorities? We discover this by iterating!

We will use x_p to represent the authority weight of webpage p and y_p to represent its hub weight. Initially let all values of x and y be 1. Then iterate through the following process a number of times (let's just say 3):

For all webpages p, set x_p = the sum of y_q where q are webpages linking to p
For all webpages p, set y_p = the sum of x_q where q are webpages linked to by p
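
The two update steps can be sketched in Python as follows. This is only one reading of the process: the name hits and the graph format (a dictionary from each page to the set of pages it links to, with every neighbor also appearing as a key) are assumptions for this example, and the sketch assumes that the hub update uses the authority weights just computed in the same round. It also does no normalization, so the weights grow with each iteration.

```python
def hits(graph, iterations=3):
    """Return (x, y): authority and hub weights after some iterations.

    `graph` (hypothetical format) maps each page p to the set of pages
    that p links to. Every linked page must itself be a key of `graph`.
    """
    pages = list(graph)
    x = {p: 1 for p in pages}  # authority weights, all start at 1
    y = {p: 1 for p in pages}  # hub weights, all start at 1
    for _ in range(iterations):
        # x_p = sum of y_q over the pages q that link to p
        x = {p: sum(y[q] for q in pages if p in graph[q]) for p in pages}
        # y_p = sum of x_q over the pages q that p links to
        # (assumption: this uses the x values computed just above)
        y = {p: sum(x[q] for q in graph[p]) for p in pages}
    return x, y


# Made-up three-page graph: a links to b and c, b links to c.
x, y = hits({"a": {"b", "c"}, "b": {"c"}, "c": set()})
# x == {"a": 0, "b": 8, "c": 13} and y == {"a": 21, "b": 13, "c": 0}
```

In this small example page c, which both other pages link to, ends up with the largest authority weight, and page a, which links to both of the others, ends up with the largest hub weight.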

The pages with the highest x weights are your "authorities" and the pages with the highest y weights are your "hubs". This is typical of the information used by internet search algorithms.
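
Picking out the top pages is then a matter of sorting by weight. A minimal Python sketch, assuming the weights are stored in a dictionary from page to weight (the name top_pages, the parameter k, and the sample weights are made up for this example):

```python
def top_pages(weights, k=3):
    """Return the k pages with the largest weights, largest first."""
    return sorted(weights, key=weights.get, reverse=True)[:k]


# Made-up authority weights, for illustration only.
sample_x = {"a": 0, "b": 8, "c": 13}
best_authorities = top_pages(sample_x, k=2)  # ["c", "b"]
```

The same function applied to the y weights gives the top hubs.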