Monday, September 24, 2007

11. WebSpider

Link

Note that this does not include the jar file, even though the build script does generate it. I figure it can be regenerated, so I excluded it.

This assignment seemed really simple in the beginning - just build a hash table and store the counts in there. But there were many hitches along the way.

The first involved the build.xml files. A lot of editing had to be done to get them to work the way I wanted them to. I don't think Dr. Johnson did this on purpose, but then again maybe he did, to give us an interesting exercise to work with. The issues I can recall offhand were:
  • base build.xml needed to be updated to include junit's jar file
  • dist.build.xml needed to include my name in the generated filename
  • emma.build.xml needed its paths adjusted for generating the html (there's no way I want to read the xml every time I want to see the coverage reports)
  • javadoc.build.xml needed the overview.html path set (it initially said stack)
All in all, transitioning the system over from stack probably wasn't as smooth as Dr. Johnson wanted ^^

On to the actual work now...

Part 0: Package creation + tests

It was initially painless... until adding the JUnit jar became an issue. That one took me a while to figure out.

Part 1: Totallinks implementation
Part 2: Mostpopular implementation

I realized parts 1 and 2 were very similar, so I decided to try to make their implementations similar, with the final result differing. However...

HttpUnit does not like parsing JavaScript, and many links do contain it. Kevin English did have a nice way to stop the exceptions from being thrown (I just caught them all), but that certainly complicated matters. The only thing is that I am unsure whether it still processes the pages or just suppresses the exceptions.

Also, I initially used a HashMap as the data structure. Then I realized I would need to traverse the structure, so I changed it to a TreeMap. After that, though, I realized I would want a queue to determine which URL to process next, so I actually ended up with two separate data structures - a queue for the URLs waiting to be processed, and a TreeMap to keep track of the counts.

Using parent classes made changing data structures easy, luckily.
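To make the queue-plus-TreeMap idea concrete, here is a rough sketch of how it might look. This is not my actual WebSpider code: the class name LinkCountSketch, the method names, and the in-memory link graph (standing in for real page fetches) are all made up for illustration, and the extra "seen" set is an assumption I added so the sketch never queues a page twice. Note the variables are declared as Queue and Map, which is what makes swapping the concrete class painless.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch; names and the in-memory graph are assumptions, not the assignment code.
class LinkCountSketch {

  /**
   * Breadth-first walk over a map of page -> links found on that page.
   * Returns how many times each URL was linked to, sorted by URL.
   */
  static Map<String, Integer> crawl(Map<String, List<String>> linksByPage,
                                    String start, int maxPages) {
    Queue<String> toVisit = new ArrayDeque<>();     // which URL gets processed next
    Set<String> seen = new HashSet<>();             // assumption: avoid queueing a URL twice
    Map<String, Integer> counts = new TreeMap<>();  // link counts, traversable in key order

    toVisit.add(start);
    seen.add(start);
    int processed = 0;
    while (!toVisit.isEmpty() && processed < maxPages) {
      String page = toVisit.remove();
      processed++;
      List<String> links = linksByPage.getOrDefault(page, Collections.<String>emptyList());
      for (String target : links) {
        counts.merge(target, 1, Integer::sum);      // one more link pointing at target
        if (seen.add(target)) {                     // true only the first time we see it
          toVisit.add(target);
        }
      }
    }
    return counts;
  }

  /** Total number of links encountered on the processed pages. */
  static int totalLinks(Map<String, Integer> counts) {
    int total = 0;
    for (int count : counts.values()) {
      total += count;
    }
    return total;
  }

  /** URL with the highest count; ties go to the first URL in sorted order. */
  static String mostPopular(Map<String, Integer> counts) {
    String best = null;
    int bestCount = 0;
    for (Map.Entry<String, Integer> entry : counts.entrySet()) {
      if (entry.getValue() > bestCount) {
        best = entry.getKey();
        bestCount = entry.getValue();
      }
    }
    return best;
  }
}
```

Both parts 1 and 2 can then share the same crawl and differ only in the last step: one calls totalLinks on the counts, the other calls mostPopular.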

Part 3: Logging

Easy. I just use System.out.println.... I can't? Then I notice the line in the assignment:

You can implement logging using System.out.println, but that's lame.

Well, there goes that idea. However, the HackystatLogger class fit very well with what I needed for this task, so I just used that class (and attributed it in the Javadoc). Then I built a WebSpiderLog on top of that, which is turned on if logging is enabled at the command line.
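For anyone curious what a switchable log wrapper like that could look like, here is a minimal sketch built directly on java.util.logging. It is not my WebSpiderLog and it does not use HackystatLogger at all: the class name SpiderLogSketch, the logger name, the message text, and the "-logging" flag are all assumptions made up for this example.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch of a command-line-switchable log wrapper.
// Uses plain java.util.logging; the real code used HackystatLogger instead.
class SpiderLogSketch {
  private final Logger logger;

  SpiderLogSketch(boolean enabled) {
    // The logger name is an arbitrary choice for this sketch.
    this.logger = Logger.getLogger("webspider.sketch");
    // Level.OFF silences everything; Level.INFO lets info() messages through.
    this.logger.setLevel(enabled ? Level.INFO : Level.OFF);
  }

  /** True if info-level messages would currently be published. */
  boolean isEnabled() {
    return logger.isLoggable(Level.INFO);
  }

  /** Logs that a page is about to be fetched (a no-op when logging is off). */
  void retrieving(String url) {
    logger.info("Retrieving " + url);
  }

  /** Checks the command line for the (assumed) "-logging" flag. */
  static boolean loggingRequested(String[] args) {
    for (String arg : args) {
      if ("-logging".equals(arg)) {
        return true;
      }
    }
    return false;
  }
}
```

The spider would then create one instance with new SpiderLogSketch(SpiderLogSketch.loggingRequested(args)) and call retrieving(url) unconditionally, so the rest of the code never has to check whether logging is on.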

Part 4: Extra Credit

It's supposed to be a separate entry, but I didn't attempt it.

Conclusions

What an annoying assignment. I initially thought it would be a simple task, and then found all these wonderful humps that slowed down progress. In the end... it's a good thing I started moderately early on it (around Wednesday), because otherwise I'd be burning the midnight oil into the early hours tonight.
