Andy R Terrel - CMSC 16100 Lab 6 - Sorting and Counting Shakespeare

Introduction

Today we will be working with some common algorithms for sorting things. At the end of the lab, you will then use these algorithms to sort and count the words in some Shakespeare plays. The first three sorting algorithms that we will use are insertion sort, quicksort, and mergesort. In effort to make your code as reusable as possible you will provide the algorithm a comparison function and list to sort. The second part of the lab, we will use a binary search tree to count the number of unique words in some of Shakespeare's plays. This could certainly be done other ways, but this exercise will give you some experience dealing with a binary search tree data structure.

I am providing you with a couple of files to help you out along the way. First is a template file for the lab. I have made a template to use if you so choose that includes exactly what I expect you to write before you start writing code: comments and test cases. To use the file notice that the first test case is uncommented, this should be the first test case you get working. You should be able to move right down the list. Once you uncomment a test case don't comment it back, this allows you to make sure that future changes are not breaking your code. Let me venture a guess and say that coding my way will probably make the experience faster and more enjoyable, but by all means do not feel that my invitation to a more structured style of coding is manditory. Even if you do not use this template file you must include comments and test cases for all your functions as is done in the template.

Second I am giving you a module to read the text files from. This module provides the function file->string_list which takes the path of a file as an argument and returns a list of strings (the strings are the strings in the read file such that they are delimited by whitespace).

Part I - Sorting Algorithms

Below the insertion sort, quicksort, and mergesort algorithms are explained. For each one, make a function that will take a comparison function and a list and return a sorted list with the order determined by the comparison function. Your comparison function should take two elements from the list and return whether the first element is greater than, equal, or less than the second element. For each algorithm, make a test case that will convince you the algorithm is working correctly. (For sorting strings, look up string>? in the Help Desk. You should be able to find some other string functions to use as well.)

Insertion sort

Insertion sort takes an element out of a list, sorts the rest of the list, and then inserts the element back into the list. The intuition is just like sorting a rollerdex of cards -- it is easier to sort a smaller number of cards, and it is easier to insert into a sorted rollerdex.

Quicksort

Quicksort is a divide-and-conquer algorithm, meaning it divides its work somehow and performs its work on two smaller sets. For quicksort, you pick an element from a list, then divide the list into two smaller list, things that are "smaller" and things that are "bigger". You sort these two lists and append everything to return a sorted list. The efficiency of quicksort is highly dependent on how you pick the element to divide your sets, for this lab you can just use the first element of the list but just know that it is not the most efficient way of doing things.

Mergesort

Mergesort is another divide-and-conquer algorithm. In mergesort, you divide your list into two equal parts, sort those parts, and then merge the two sorted lists. The idea here is that merging two sorted lists is easy, and sorting smaller lists is also easier.

Part II - Counting words with Binary Search Tree

As you learned in class, a binary search tree is a binary tree such that the value of a node is greater than the values of its left predecessors and less than the values or its right predecessors. For this exercise you will need to build a binary search tree such that the value at each node contains a word and a count. When you insert a new item into the tree, if the word is already in the tree you will increment the count, otherwise you will add it in the appropriate spot. To order the words by count, you need to build a function that will take the binary tree to a list and then you can send it to one of the sorting algorithms with an appropriate compare function. Make some small datasets to determine if your tree is working before trying to build one with one of the Shakespeare files.

It is arguable that this binary tree method might not save anytime over just building a random list and checking the whole list on every word. As you have learned in class, you can build a balanced binary search tree, such as a red-black tree, but doing so is a complex process which would take more time than this lab allows. Thus this exercise is not quite what you would want to do if you were hired to do the job.

Now using the module, use the file->string_list function to read Hamlet. Ideally you would have some function to manipulate the punctuation so that words such as 'Jessica', 'Jessica.', and 'Jessica's' are not counted as unique, but for this lab this is optional. You may use other plays if you wish.

Now to see how your different sort algorithms compare, do some timing with the time function. For example, I used these commands:

(time (car (insertionsort hamlet-list entry-count-compare)))
(time (car (quicksort hamlet-list entry-count-compare)))
(time (car (mergesort hamlet-list entry-count-compare)))

Extra Credit -- Removing Punctuation

Even though our binary search algorithm is sorting by unique words, we are still over counting. For example, in the above algorithm the following words are counted as distinct 'HAMLET', 'Hamlet', 'Hamlet's'. Create an preprocessor for the word list that you feed to list->blist that will remove capitalizations and punctuation marks that cause this to happen. My suggestion is to use the string->list function to convert each string to a list of characters. Look up in the help desk for functions that manipulate strings and characters.

Turning Everything in

Add appropriate test cases to the end of your file as usual. Save the definitions window as (your_cnet_id)-Lab06.scm and submit it to chalk.