Click to See Complete Forum and Search --> : Removing duplicate items from a list?


Fryguy8
09-04-2004, 09:16 PM
I'm writing a link validator, and one of the requirements is for it to not check the same link 2x. I thought about this for a while, and I think the easiest way to do this is to add the URLs to a linked list, and then later on run a removeDupes function to do it's namesake.

First of all, if you have a better idea, I'm open to suggestions :)

So I implemented the linked list, works great, I can traverse the link and stuff fine, however my removeDupes is hacked together and not working, and I'm looking for some input to clean up the code:


public static void removeDupes(LinkedList urlList)
{
URL firstCompare, secondCompare;
int listSize = urlList.size();
ListIterator firstIterator = urlList.listIterator(0);
int counter = 0;
while (firstIterator.hasNext())
{
firstCompare = (URL) firstIterator.next();
counter++;
if (firstIterator.hasNext())
{
ListIterator secondIterator = urlList.listIterator(counter);
while (secondIterator.hasNext())
{
secondCompare = (URL) secondIterator.next();
if (firstCompare.sameFile(secondCompare))
secondIterator.remove();
}
}
}


Assume the comparisons are working correctly, and the List is properly generated (it is). The problem occurs when a duplicate actually gets removed. counter gets screwed up and I think it tries to traverse an item in the list that isn't there, here's the error:

Exception in thread "main" java.util.ConcurrentModificationException
at java.util.LinkedList$ListItr.checkForComodificatio n(Unknown Source)
at java.util.LinkedList$ListItr.next(Unknown Source)
at LinkValidator.removeDupes(LinkValidator.java:151)
at LinkValidator.main(LinkValidator.java:67)


If anyone has an easy fix for this, an easier way to write removeDupes, or a better data structure to use, I'm up for suggestions

bwkaz
09-04-2004, 09:37 PM
Collections almost always invalidate their iterators when you add or remove items from them. If you don't reinitialize the iterator, you'll get an exception thrown if you try to use it.

This is the case in Java (which I assume is the language you used, right?), the .Net Framework, C++'s STL (I think), and probably a few other languages, if the language in question has an "exception" concept.

The first alternate implementation that comes to mind is to add the URL that you're going to remove to another collection in your main loop, and then run through that collection later and remove all its items from the main collection. But it doesn't look like that will work, due to the way you're doing the removals, so never mind.

My second thought is to use a hashtable. Where your code adds the URL to the list now, have it instead insert it into the hashtable (use the String class's hash function, called on the URL's string, as the hash for your URLs). But before adding it, have your code check whether it's already there, and if so, have it do nothing.

The "check whether it's already there" and the "insert it into the hashtable" operations are both O(1) in the normal case, so it's probably faster than your O(n^2) double-search routine to remove duplicates... ;)

thaddaeus
09-04-2004, 09:54 PM
Another way is to traverse one list, adding values to another list and skipping over duplicates.


....if (first != next)
add value to list2
}

Fryguy8
09-04-2004, 10:26 PM
the hashtable idea would work fine. For some reason when I was planning the implementation I was dead set on not having to extend standard Data Structure functions (I had to write them all from scratch last year in my Data Structures class, this year I am determined to use what is given to me by the java library). And for some reason, I thought I was going to have to override insert, when instead I can just check before I add.

Switching the code over shouldn't be overly hard either.

Thanks for the help.

and for the person who suggested just compare it to the next, the list isn't ordered in anyway.

thaddaeus
09-04-2004, 10:50 PM
What i ment was to grab a value from the list, compare it to the new list if it dosn't exist in the new list then add otherwise get the next value and so forth, i m,ay not have explained it well enough, i seem to do that a lot. sorry

mwinterberg
09-05-2004, 02:30 AM
http://java.sun.com/j2se/1.4.2/docs/api/java/util/HashSet.html
http://java.sun.com/j2se/1.4.2/docs/api/java/util/TreeSet.html
Either will suit you fine, though as bwkaz stated, the HashSet will have better performance on lookups and insertions than the TreeSet.