Monday, May 25, 2009

Using BitSet or BitVector to store a set of integers

When programming with Java, we typically use HashSet to store a set of Integer, if we want to keep track of which Integer has been used and which hasn't. When we need to store millions of Integers, HashSet can be too slow, and it may take up too much memory.

Java's BitSet and BitVector from COLT are great alternative; they are MUCH faster and consumes way less memory. For example, to store 20 million integers, both BitSet and BitVector only use about 2.5MB and take only, respectively, 0.67s and 0.27s (which means BitVector is the fastest) to set and get all of the values once. using HashSet, it takes ~962MB, and 6.3s. This is more than 380 times of memory saving, and ~20 times speed up.

One "disadvantage" for BitVector is that it's size can't dynamically scale, but you can always fake it by manually growing or shrinking the set via setSize(...)

I can't figure out why Java's HashSet implementation uses so much RAM; I tried to reduce the size to 2 million, and that uses 100MB, approximately 1/10 of the size when having 20 millions. So this means indeed the Java implementation is memory-hungry. I'm sure there's some way to cut down the memory consumption, but I haven't spent the time to investigate.

Saturday, May 9, 2009

Speed Comparison of: (1) Java's built-in HashMap, (2) Trove's TIntIntHashMap, and (3) Colt's OpenIntIntHashMap

Finally get to do a comparison among several popular Java HashMap implementations:
  • Java's built-in HashMap
  • Trove's implementation for primitive types
  • Colt's implementation for primitive types
Not too surprisingly, Java's built-in implementation is the slowest, and consumes the largest amount of memory among the three, since it stores objects instead of primitive types (e.g., instead of storing the number 3 as an int, it stores it as an Integer). What WAS surprisingly, however, was that Colt's implementation was SO MUCH faster than Java's implementation. For example, for a int->int hashmap, running 10,000,000 puts() with Colt took less than 1/3 of the time. Running 10,000,000 gets() took less than 1/6, and 10,000,000 containsKey() less than 1/5! This is some substantial timing saving. This time-saving advantage is still there, even when we use a int->float[] hashmap.

Memory-wise, the primitive implementations from Trove and Colt don't save us that much. For int->int hashmap, both Trove and Colt save us about 30% of memory, but for int->float[] (each float[] value has length 5) save us less than 10%. Nevertheless, there's still some saving.

Since I need to deal with large data sets, where it's common to use hashmaps with millions of entries, I'll use the Colt implementation from now on, for both it's impressive time saving and (slighly) lower memory consumption.



Here's the code I used for evaluating int->float[] hashmap. To get the timing and memory consumption for one implementation, comment out the other two that you DON'T want. You will also need to pass in these arguments to the VM, so enough heap memory will be alocated "-Xms1200M -Xmx1200M".


import java.util.HashMap;

import cern.colt.map.OpenIntIntHashMap;
import cern.colt.map.OpenIntObjectHashMap;
import gnu.trove.TIntIntHashMap;
import gnu.trove.TIntObjectHashMap;


public class Compare {

public static void main(String args[]){

System.out.println("1st line: time used(s)\n2nd line: heap memory used so far(MB)");

int n = 10000000;

long startTime = System.nanoTime();
long startHeapSize = Runtime.getRuntime().freeMemory();


// BEGIN: benchmark for Java's built-in hashmap
System.out.println("\n===== Java's built-in HashMap =====");
HashMap jIntIntMap = new HashMap();

System.out.println("\n-- " + n + " puts(key, value) --");
startTime = System.nanoTime();
for (int i = 0; i < n; i++) { jIntIntMap.put(i,new float[]{0f,1f,2f,3f,4f}); }
System.out.println( (System.nanoTime() - startTime) / 1000000000.0 );
System.out.println( (startHeapSize - Runtime.getRuntime().freeMemory()) /1048576.0 );

System.out.println("\n-- " + n + " gets(key) --");
startTime = System.nanoTime();
for (int i = 0; i < n; i++) { jIntIntMap.get(i); }
System.out.println( (System.nanoTime() - startTime) / 1000000000.0 );
System.out.println( (startHeapSize - Runtime.getRuntime().freeMemory()) /1048576.0 );

System.out.println("\n-- " + n + " containsKey(key) --");
startTime = System.nanoTime();
for (int i = 0; i < n; i++) { jIntIntMap.containsKey(i); }
System.out.println( (System.nanoTime() - startTime) / 1000000000.0 );
System.out.println( (startHeapSize - Runtime.getRuntime().freeMemory()) /1048576.0 );
// END


// BEGIN: benchmark for Trove's TIntIntHashMap
System.out.println("\n===== Trove's TIntIntHashMap =====");
TIntObjectHashMap tIntIntMap = new TIntObjectHashMap();

System.out.println("\n-- " + n + " puts(key, value) --");
startTime = System.nanoTime();
for (int i = 0; i < n; i++) { tIntIntMap.put(i,new float[]{0f,1f,2f,3f,4f}); }
System.out.println( (System.nanoTime() - startTime) / 1000000000.0 );
System.out.println( (startHeapSize - Runtime.getRuntime().freeMemory()) /1048576.0 );

System.out.println("\n-- " + n + " gets(key) --");
startTime = System.nanoTime();
for (int i = 0; i < n; i++) { tIntIntMap.get(i); }
System.out.println( (System.nanoTime() - startTime) / 1000000000.0 );
System.out.println( (startHeapSize - Runtime.getRuntime().freeMemory()) /1048576.0 );

System.out.println("\n-- " + n + " containsKey(key) --");
startTime = System.nanoTime();
for (int i = 0; i < n; i++) { tIntIntMap.containsKey(i); }
System.out.println( (System.nanoTime() - startTime) / 1000000000.0 );
System.out.println( (startHeapSize - Runtime.getRuntime().freeMemory()) /1048576.0 );
// END

// BEGIN: benchmark for Colt's OpenIntIntHashMap
System.out.println("\n===== Colt's OpenIntIntHashMap =====");
OpenIntObjectHashMap cIntIntMap = new OpenIntObjectHashMap();

System.out.println("\n-- " + n + " puts(key, value) --");
startTime = System.nanoTime();
for (int i = 0; i < n; i++) { cIntIntMap.put(i,new float[]{0f,1f,2f,3f,4f}); }
System.out.println( (System.nanoTime() - startTime) / 1000000000.0 );
System.out.println( (startHeapSize - Runtime.getRuntime().freeMemory()) /1048576.0 );

System.out.println("\n-- " + n + " gets(key) --");
startTime = System.nanoTime();
for (int i = 0; i < n; i++) { cIntIntMap.get(i); }
System.out.println( (System.nanoTime() - startTime) / 1000000000.0 );
System.out.println( (startHeapSize - Runtime.getRuntime().freeMemory()) /1048576.0 );

System.out.println("\n-- " + n + " containsKey(key) --");
startTime = System.nanoTime();
for (int i = 0; i < n; i++) { cIntIntMap.containsKey(i); }
System.out.println( (System.nanoTime() - startTime) / 1000000000.0 );
System.out.println( (startHeapSize - Runtime.getRuntime().freeMemory()) /1048576.0 );
// END


}

}




Friday, January 23, 2009

How to write a great research paper

A great guide by Simon Peyton Jones

Wednesday, December 3, 2008

Embedding web browser in Java application

I found this NativeSwing library an excellent library for embedding a web browser in a Java application, seemingly easier to use and much better documented that JDIC.

Tuesday, November 25, 2008

Primitive Java Collections (primitive hashset, etc.)

Java's built-in collections such as hashmap, hashset, etc are not efficient when you have a lot of data (e.g., 1 million of items), it's because everything in the hashmap (i.e. its keys and values) and hashset (i.e. its values) are stored as objects, which are expensive to store and access. So if you want to store, say, 1 million int in a hashset, use a primitive hashset instead, such as Trove, or COLT. They seem to be some of the best primitive collections. But I haven't compared their performace (memory consumption and access speed) so I don't know which one is better.

Tuesday, November 18, 2008

Downloading full CiteSeerX data

Here is, in my opinion, the easiest way to download the full dataset from CiteSeerX. (Note that CiteSeer is the older version, which is no longer updated.)

Steps for downloading the full dataset from CiteSeerX:
  1. Download and extract the "Demo" from http://www.oclc.org/research/software/oai/harvester.htm
  2. Go to the directory of the extracted files, type the following command to download the full dataset of CiteSeerX to the file "citeseerx_alldata.xml"
    java -classpath .;oaiharvester.jar;xerces.jar org.acme.oai.OAIReaderRawDump http://citeseerx.ist.psu.edu/oai2 -o citeseerx_alldata.xml
I also wanted to thank Dr. Lee Giles from the CiteSeerX project for pointing me at the right directions to obtain the data and recommending me to try some Perl harvesters (though I didn't eventually use them).

Friday, February 22, 2008

People won't buy what they don't understand

"People won't buy what they don't understand"
A great sentence! That's why we need to write plain English. :-)

Polo's Blog

About Me