Andrew's Engineering Blog: 11. WebSpider

Link

Note that this does not include the jar file, even though the build script does produce it. I figure it can be regenerated, so I left it out.

This assignment seemed really simple in the beginning - just build a hash table and store the counts in there. But there were many hitches along the way.

The first involved the build.xml files. A lot of editing had to be done to get them working the way I wanted. I don't think Dr. Johnson did this on purpose, but then again maybe he did, to give us an interesting exercise to work with. The issues I can recall offhand were:

base build.xml needed to be updated to include JUnit's jar file
dist.build.xml needed to include my name in the generated filename
emma.build.xml needed its paths adjusted for generating the HTML report (there's no way I want to read the XML every time I want to see the coverage reports)
javadoc.build.xml needed the overview.html path set (it initially said stack)

All in all, transitioning the system over from stack probably wasn't as smooth as Dr. Johnson wanted ^^

On to the actual work now...

Part 0: Package creation + tests

It was initially painless... until the missing JUnit jar became an issue, and it took me a while to realize that was the problem.

Part 1: Totallinks implementation
Part 2: Mostpopular implementation

I realized parts 1 and 2 were very similar, so I decided to try to make their implementations similar, with the final result differing. However...

HttpUnit does not like parsing JavaScript, and many pages contain it. Kevin English did have a nice way to stop the exceptions from being thrown (I just caught them all), but it certainly complicated matters. The one thing I am unsure of is whether HttpUnit still processes those pages or merely suppresses the exceptions.

Also, I initially used a HashMap as the data structure. Then I realized I would need to traverse the structure in order, so I changed it to a TreeMap. After that, though, I realized I would want a queue to determine which URL to visit next, so I ended up with two separate data structures - a queue of URLs waiting to be processed, and a TreeMap to keep track of the counts.

Luckily, because I had declared things by their parent types, changing data structures was easy.
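To make the queue-plus-TreeMap idea concrete, here is a minimal sketch. It is not my actual WebSpider code: the class and method names are made up for illustration, the link graph is an in-memory Map standing in for pages fetched over HTTP, and I am assuming that "counts" means how many links point at each URL. The variables are declared by their parent types (Queue, Map), which is what makes swapping a HashMap for a TreeMap a one-line change.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical sketch: crawl an in-memory link graph with a queue and a TreeMap. */
class LinkCounterSketch {

  /**
   * Visits up to maxPages pages starting at start. The graph parameter is an
   * assumption standing in for real fetching: page URL -> links found on it.
   * Returns how many links pointed at each URL, sorted by URL.
   */
  static Map<String, Integer> countInboundLinks(
      String start, Map<String, List<String>> graph, int maxPages) {
    Queue<String> toVisit = new ArrayDeque<>();    // which URL to process next
    Set<String> seen = new HashSet<>();            // avoids queueing a URL twice
    Map<String, Integer> counts = new TreeMap<>(); // declared as Map, easy to swap
    int visited = 0;
    toVisit.add(start);
    seen.add(start);
    while (!toVisit.isEmpty() && visited < maxPages) {
      String url = toVisit.remove();
      visited++;
      for (String link : graph.getOrDefault(url, Collections.<String>emptyList())) {
        counts.merge(link, 1, Integer::sum);
        if (seen.add(link)) {
          toVisit.add(link);
        }
      }
    }
    return counts;
  }

  /** Part 1 style result: the sum of all counts. */
  static int totalLinks(Map<String, Integer> counts) {
    int total = 0;
    for (int c : counts.values()) {
      total += c;
    }
    return total;
  }

  /** Part 2 style result: the URL with the highest count, or null if there are none. */
  static String mostPopular(Map<String, Integer> counts) {
    String best = null;
    int bestCount = 0;
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      if (e.getValue() > bestCount) {
        best = e.getKey();
        bestCount = e.getValue();
      }
    }
    return best;
  }
}
```

Both results come from the same traversal and the same map, which is why parts 1 and 2 can share nearly all of their implementation and differ only in the final step.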

Part 3: Logging

Easy. I'll just use System.out.println... or can't I? Then I noticed this line in the assignment:

You can implement logging using System.out.println, but that's lame.

Well, there goes that idea. However, the HackystatLogger class fit the description of what I needed for this task very well, so I just used that class (and attributed it in the JavaDoc). Then I built a WebSpiderLog on top of it, which is enabled only if logging is requested at the command line.
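As a rough illustration of that wrapper idea, here is a sketch that uses java.util.logging in place of HackystatLogger, since I am not reproducing that class's API here. The class name, the "-logging" flag, and the message text are all assumptions for the example, not my actual WebSpiderLog.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical sketch of a log wrapper that is switched on by a command-line flag. */
class SpiderLogSketch {
  private final Logger logger;

  /** Creates a named logger; when enabled is false, every message is dropped. */
  SpiderLogSketch(String name, boolean enabled) {
    this.logger = Logger.getLogger(name);
    this.logger.setLevel(enabled ? Level.INFO : Level.OFF);
  }

  /** True if INFO messages would actually be written. */
  boolean isEnabled() {
    return logger.isLoggable(Level.INFO);
  }

  /** Records that a page is about to be fetched. */
  void retrieving(String url) {
    logger.info("Retrieving " + url);
  }

  /** Looks for the (assumed) -logging flag among the command-line arguments. */
  static boolean loggingRequested(String[] args) {
    for (String arg : args) {
      if ("-logging".equals(arg)) {
        return true;
      }
    }
    return false;
  }
}
```

The rest of the spider can then call the log unconditionally; whether anything is written is decided once, at startup, from the command-line arguments.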

Part 4: Extra Credit

It's supposed to be a separate entry, but I didn't attempt it.

Conclusions

What an annoying assignment. I initially thought it would be a simple task, and then found all these wonderful humps that slowed down progress. In the end... it's a good thing I started on it moderately early (around Wednesday), because otherwise I'd be burning the midnight oil into the early hours tonight.

Monday, September 24, 2007