Monday, May 25, 2009

Using BitSet or BitVector to store a set of integers

When programming with Java, we typically use HashSet to store a set of Integer, if we want to keep track of which Integer has been used and which hasn't. When we need to store millions of Integers, HashSet can be too slow, and it may take up too much memory.

Java's BitSet and BitVector from COLT are great alternative; they are MUCH faster and consumes way less memory. For example, to store 20 million integers, both BitSet and BitVector only use about 2.5MB and take only, respectively, 0.67s and 0.27s (which means BitVector is the fastest) to set and get all of the values once. using HashSet, it takes ~962MB, and 6.3s. This is more than 380 times of memory saving, and ~20 times speed up.

One "disadvantage" for BitVector is that it's size can't dynamically scale, but you can always fake it by manually growing or shrinking the set via setSize(...)

I can't figure out why Java's HashSet implementation uses so much RAM; I tried to reduce the size to 2 million, and that uses 100MB, approximately 1/10 of the size when having 20 millions. So this means indeed the Java implementation is memory-hungry. I'm sure there's some way to cut down the memory consumption, but I haven't spent the time to investigate.