Andy R Terrel - CMSC 16100 - Creating an Internet Spider

Introduction

Scheme is a scripting language, which means that it can do very high-level operations in very few lines of code. This week we are going to take advantage of this feature and create a spider that crawls webpages and extracts email addresses. Getting a webpage in Scheme is fairly easy using the url.ss library, so the challenging part of this lab is deciding how to get the next webpage and how to extract the correct information.

HTML Description

For those who are not familiar with HTML, it is incredibly easy to learn. HTML is a markup language, which means it uses marks (tags) to determine what to do with a specific bit of text. For the purpose of this lab, you will only need to know about the href attribute of an A element. It is how a webpage links to another webpage, an email address, any other type of file, and sometimes just javascript code. If the href attribute links to an email address, then it will start with "mailto:".
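As a minimal sketch of that check, you could test an href string as shown here. The helper names starts-with?, mailto-link?, and mailto->address are made up for illustration; they are not part of url.ss or the lab template. The sketch assumes the href is already a string and that "mailto:" is written in lower case.

```scheme
;; Contract: string string -> boolean
;; Purpose: returns true if str begins with prefix
(define (starts-with? prefix str)
  (let ((n (string-length prefix)))
    (and (>= (string-length str) n)
         (string=? prefix (substring str 0 n)))))

;; Contract: string -> boolean
;; Purpose: returns true if href points at an email address
(define (mailto-link? href)
  (starts-with? "mailto:" href))

;; Contract: string -> string
;; Purpose: strips the leading "mailto:" (7 characters) from an email href
;; Assumes (mailto-link? href) is true.
(define (mailto->address href)
  (substring href 7 (string-length href)))

;; Test cases
(mailto-link? "mailto:someone@example.com")    ; true
(mailto-link? "http://www.example.com/")       ; false
(mailto->address "mailto:someone@example.com") ; "someone@example.com"
```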

For a rough sketch of my functions see the template for this lab. Make sure all your functions have test cases and appropriate documentation.

Using url.ss

The function below will give you a list of HTML elements, where each element is a list of symbols and strings built from the HTML.

(require (lib "url.ss" "net") (lib "html.ss" "html") (lib "xml.ss" "xml"))

;; Contract: string -> list of html elements
;; Purpose: given a string, get-web-page-exp will return a list of html elements
(define (get-web-page-exp url-string)
  (display "Fetching: ")
  (display url-string)
  (newline)
  (map xml->xexpr
       (read-html-as-xml (get-pure-port (string->url url-string)))))

First you will want to write some functions that process the list by pulling out all the links to webpages and all the email addresses you can find. Remember, you probably don't care about .ico, .css, .js, .pdf, .mov, and other such file extensions if you are looking for webpages.
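One possible way to start is sketched here; it is not the template's solution. It assumes each element comes back from xml->xexpr in the form (tag ((attribute "value") ...) child ...), so a link looks like (a ((href "page.html")) "text"). The names collect-hrefs, element-href, ends-with?, skipped-extensions, and skipped-file? are made up for this sketch, and the extension check is case sensitive.

```scheme
;; Contract: list -> list of strings
;; Purpose: given everything after the tag of an a element, returns a
;; one-item list holding its href value, or empty if it has none
(define (element-href rest)
  (if (and (pair? rest) (pair? (car rest)) (pair? (car (car rest))))
      (let ((entry (assq 'href (car rest))))
        (if (and entry (pair? (cdr entry)) (string? (cadr entry)))
            (list (cadr entry))
            '()))
      '()))

;; Contract: x-expression -> list of strings
;; Purpose: collects the href of every a element inside x, in page order
(define (collect-hrefs x)
  (if (and (pair? x) (symbol? (car x)))          ; x is an element
      (let ((from-children (apply append (map collect-hrefs (cdr x)))))
        (if (eq? (car x) 'a)
            (append (element-href (cdr x)) from-children)
            from-children))
      '()))                                      ; text, entity, attribute list

;; Contract: string string -> boolean
;; Purpose: returns true if str ends with suffix
(define (ends-with? suffix str)
  (let ((n (string-length suffix))
        (len (string-length str)))
    (and (>= len n)
         (string=? suffix (substring str (- len n) len)))))

;; File extensions the spider should not treat as webpages
(define skipped-extensions '(".ico" ".css" ".js" ".pdf" ".mov"))

;; Contract: string -> boolean
;; Purpose: returns true if href ends with one of skipped-extensions
(define (skipped-file? href)
  (let loop ((exts skipped-extensions))
    (cond ((null? exts) #f)
          ((ends-with? (car exts) href) #t)
          (else (loop (cdr exts))))))

;; Test cases
(collect-hrefs '(p () "Mail " (a ((href "mailto:me@example.com")) "me")))
;; returns ("mailto:me@example.com")
(skipped-file? "style.css")   ; true
(skipped-file? "index.html")  ; false
```

Since get-web-page-exp returns a list of elements, you would apply collect-hrefs to each one and append the results.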

Making the spider

So the spider should be fairly simple. It should process a webpage, add any links it finds to the list of pages it still has to process, and add any emails it finds to the list of emails found so far. At the end it should display the emails it found.
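The loop described above could be sketched roughly as follows. This is only one possible shape, not the template's. The names crawl, keep, and email-href? are made up, and fetch-hrefs stands for a function you write yourself that takes a URL string and returns the href strings found on that page. The max-pages limit is an assumption added here so the loop always stops; the sketch also ignores relative links and revisits pages it has already seen.

```scheme
;; Contract: string -> boolean
;; Purpose: returns true if href starts with "mailto:"
(define (email-href? href)
  (and (>= (string-length href) 7)
       (string=? "mailto:" (substring href 0 7))))

;; Contract: (any -> boolean) list -> list
;; Purpose: keeps the items for which pred returns true
(define (keep pred items)
  (cond ((null? items) '())
        ((pred (car items)) (cons (car items) (keep pred (cdr items))))
        (else (keep pred (cdr items)))))

;; Contract: (string -> list of strings) list-of-strings number
;;           -> list of strings
;; Purpose: visits at most max-pages pages, starting from start-urls,
;; and returns every mailto href found, in the order it was found
(define (crawl fetch-hrefs start-urls max-pages)
  (let loop ((to-visit start-urls) (emails '()) (pages-left max-pages))
    (if (or (null? to-visit) (= pages-left 0))
        emails
        (let ((hrefs (fetch-hrefs (car to-visit))))
          (loop (append (cdr to-visit)
                        (keep (lambda (h) (not (email-href? h))) hrefs))
                (append emails (keep email-href? hrefs))
                (- pages-left 1))))))

;; Test case with a pretend two-page web, so no server is contacted
(define (fake-fetch url)
  (cond ((string=? url "http://a.example/")
         '("http://b.example/" "mailto:x@a.example"))
        ((string=? url "http://b.example/")
         '("mailto:y@b.example"))
        (else '())))

(crawl fake-fetch '("http://a.example/") 5)
;; returns ("mailto:x@a.example" "mailto:y@b.example")
```

Passing the fetching function in as an argument lets you test the loop on a pretend web before pointing it at real servers.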

The net is not so standard

You might find that you get a large number of errors depending on the webpages you traverse. For example, if you try to get http://www.uchicago.edu with the get-web-page-exp function, you will find it gives a Bad Request 400 error page. Many pages will give you errors, whether because the HTTP request of url.ss is not the best (it puts an extra carriage return in the request and gets stuck on a few pages) or because the page is not even responding. So you should find some pages that work and put them in your test cases. Please don't just do what your neighbor is doing, because we don't want to give any one server a flash crowd.
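If a page that does not respond stops your whole spider with an error, one option is to catch the error and treat the page as empty. The sketch below assumes a version of PLT Scheme that provides with-handlers and exn:fail?; the name fetch-or-empty is made up. Note that an error page such as the 400 page above is still a page, so it is not caught here and will be processed like any other page.

```scheme
;; Contract: (string -> list) string -> list
;; Purpose: calls fetch on url-string; if fetch raises an error,
;; returns the empty list instead of stopping the program
(define (fetch-or-empty fetch url-string)
  (with-handlers ((exn:fail? (lambda (e) '())))
    (fetch url-string)))

;; Test cases, using pretend fetch functions so no server is contacted
(fetch-or-empty (lambda (u) (list u)) "http://a.example/")
;; returns ("http://a.example/")
(fetch-or-empty (lambda (u) (error "no response")) "http://a.example/")
;; returns ()
```

With the function from this handout you would call (fetch-or-empty get-web-page-exp url-string).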

Isn't this malware?

Some of you might have concerns about using a program to strip the net of its information. I would like to point out that a large percentage of traffic on the internet is actually bots and not humans. You will also find that many of the emails you get are unusable (try getting some from craigslist). That's because so many people have made such programs and filled our inboxes with junk.

Even more important is the law on stripping content. There are several major court cases, spanning all levels of the courts, about websites that strip information from another site and publish it. The program you made is not a crime, but it could be used to republish content in ways whose legality under copyright law is not a completely settled issue. Don't worry, I am sure you will learn all about how to make denial of service attacks and buffer overflow attacks in more advanced classes, so you will get to hear about committing crimes with your computer again.

Extra Credit

So the spider can be pretty basic, but it has some flaws. How do you know if you have visited the site before? How do you know whether you have gotten the email before? Use sets or some other mechanism to improve your spider.
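As a starting point, a set can be sketched as a plain list with no repeated items. The names set-add, set-add-all, and remove-seen are made up for this sketch, and member makes each lookup walk the whole list, which is fine for a small crawl but slow for a large one.

```scheme
;; Contract: any list -> list
;; Purpose: adds item to set unless it is already there
(define (set-add item set)
  (if (member item set)
      set
      (cons item set)))

;; Contract: list list -> list
;; Purpose: adds every item in items to set, skipping repeats
(define (set-add-all items set)
  (if (null? items)
      set
      (set-add-all (cdr items) (set-add (car items) set))))

;; Contract: list list -> list
;; Purpose: returns the items that do not appear in seen
(define (remove-seen items seen)
  (cond ((null? items) '())
        ((member (car items) seen) (remove-seen (cdr items) seen))
        (else (cons (car items) (remove-seen (cdr items) seen)))))

;; Test cases
(set-add "a" '("a" "b"))              ; returns ("a" "b")
(set-add-all '("x" "y" "x") '())      ; returns ("y" "x")
(remove-seen '("a" "c" "b") '("a" "b")) ; returns ("c")
```

The spider could keep one such set of visited pages and one of emails, use remove-seen on each new batch of links before queueing them, and use set-add-all when recording emails.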

Turning Everything In

Add appropriate test cases to the end of your file as usual. Save the definitions window as (your_cnet_id)-Lab7.scm and submit it to Chalk. For example, I would save my file as aterrel-Lab7.scm. Also, do yourself a favour and save your file somewhere you can look back at it (email it to yourself or store it on your own media).