Thursday, August 13, 2009

Creating decorators in the Prefuse framework


I'm working on some graph visualization using Prefuse, and I want to allow the use to select nodes in the graph by dragging a marquee. In addition, I want to show a tiny black square indicator at the top-left corner of a selected node. The image on the left is what I eventually got.

I wanted to use the Decorator design pattern provided by Prefuse to add the square indicators. Conceptually, the steps are:
  1. programatically selects a subset of the nodes in the graph
  2. applies the desired decoration on those subset of nodes
But in the actual implementation, these basic steps unfold into much more complex steps:
  1. creates a "focus group", of type "DefaultTupleSet", to the "visualization" object, using the method visualization.addFocusGroup(); you'll need to give this newly created group a name. This group (which is a "tuple set") is precisely the data structure that you use to store the selected nodes
  2. look at this example on how to create a control that lets you draw the rectangle marquee, which will put the selected nodes into the "focus group" that you created in step 1.
  3. now comes the nontrivial part -- creating the "decorators", which are objects created to decorate the selected nodes. Programmatically, you call visualization.addDecorators() to create this new group of decorators. Again, you give this decorator group a name.
  4. creates a layout action to continously relocate the indicator so that it always stay at the top-left of the node
  5. creates a renderer to render the tiny square. I think we can use a ShapeRenderer instead so this step could be skipped.
NEVER USE "graph.nodes.xxx.yyy.zzz" AS A GROUP'S NAME. IT BREAKS PREFUSE. I haven't had the time to figure out why, probably because of some of its internal parsing error, but this kind of names break Prefuse. I learned about this the hard way; wasted 4+ hours.

Monday, May 25, 2009

Using BitSet or BitVector to store a set of integers

When programming with Java, we typically use HashSet to store a set of Integer, if we want to keep track of which Integer has been used and which hasn't. When we need to store millions of Integers, HashSet can be too slow, and it may take up too much memory.

Java's BitSet and BitVector from COLT are great alternative; they are MUCH faster and consumes way less memory. For example, to store 20 million integers, both BitSet and BitVector only use about 2.5MB and take only, respectively, 0.67s and 0.27s (which means BitVector is the fastest) to set and get all of the values once. using HashSet, it takes ~962MB, and 6.3s. This is more than 380 times of memory saving, and ~20 times speed up.

One "disadvantage" for BitVector is that it's size can't dynamically scale, but you can always fake it by manually growing or shrinking the set via setSize(...)

I can't figure out why Java's HashSet implementation uses so much RAM; I tried to reduce the size to 2 million, and that uses 100MB, approximately 1/10 of the size when having 20 millions. So this means indeed the Java implementation is memory-hungry. I'm sure there's some way to cut down the memory consumption, but I haven't spent the time to investigate.

Saturday, May 9, 2009

Speed Comparison of: (1) Java's built-in HashMap, (2) Trove's TIntIntHashMap, and (3) Colt's OpenIntIntHashMap

Finally get to do a comparison among several popular Java HashMap implementations:
  • Java's built-in HashMap
  • Trove's implementation for primitive types
  • Colt's implementation for primitive types
Not too surprisingly, Java's built-in implementation is the slowest, and consumes the largest amount of memory among the three, since it stores objects instead of primitive types (e.g., instead of storing the number 3 as an int, it stores it as an Integer). What WAS surprisingly, however, was that Colt's implementation was SO MUCH faster than Java's implementation. For example, for a int->int hashmap, running 10,000,000 puts() with Colt took less than 1/3 of the time. Running 10,000,000 gets() took less than 1/6, and 10,000,000 containsKey() less than 1/5! This is some substantial timing saving. This time-saving advantage is still there, even when we use a int->float[] hashmap.

Memory-wise, the primitive implementations from Trove and Colt don't save us that much. For int->int hashmap, both Trove and Colt save us about 30% of memory, but for int->float[] (each float[] value has length 5) save us less than 10%. Nevertheless, there's still some saving.

Since I need to deal with large data sets, where it's common to use hashmaps with millions of entries, I'll use the Colt implementation from now on, for both it's impressive time saving and (slighly) lower memory consumption.



Here's the code I used for evaluating int->float[] hashmap. To get the timing and memory consumption for one implementation, comment out the other two that you DON'T want. You will also need to pass in these arguments to the VM, so enough heap memory will be alocated "-Xms1200M -Xmx1200M".


import java.util.HashMap;

import cern.colt.map.OpenIntIntHashMap;
import cern.colt.map.OpenIntObjectHashMap;
import gnu.trove.TIntIntHashMap;
import gnu.trove.TIntObjectHashMap;


public class Compare {

public static void main(String args[]){

System.out.println("1st line: time used(s)\n2nd line: heap memory used so far(MB)");

int n = 10000000;

long startTime = System.nanoTime();
long startHeapSize = Runtime.getRuntime().freeMemory();


// BEGIN: benchmark for Java's built-in hashmap
System.out.println("\n===== Java's built-in HashMap =====");
HashMap jIntIntMap = new HashMap();

System.out.println("\n-- " + n + " puts(key, value) --");
startTime = System.nanoTime();
for (int i = 0; i < n; i++) { jIntIntMap.put(i,new float[]{0f,1f,2f,3f,4f}); }
System.out.println( (System.nanoTime() - startTime) / 1000000000.0 );
System.out.println( (startHeapSize - Runtime.getRuntime().freeMemory()) /1048576.0 );

System.out.println("\n-- " + n + " gets(key) --");
startTime = System.nanoTime();
for (int i = 0; i < n; i++) { jIntIntMap.get(i); }
System.out.println( (System.nanoTime() - startTime) / 1000000000.0 );
System.out.println( (startHeapSize - Runtime.getRuntime().freeMemory()) /1048576.0 );

System.out.println("\n-- " + n + " containsKey(key) --");
startTime = System.nanoTime();
for (int i = 0; i < n; i++) { jIntIntMap.containsKey(i); }
System.out.println( (System.nanoTime() - startTime) / 1000000000.0 );
System.out.println( (startHeapSize - Runtime.getRuntime().freeMemory()) /1048576.0 );
// END


// BEGIN: benchmark for Trove's TIntIntHashMap
System.out.println("\n===== Trove's TIntIntHashMap =====");
TIntObjectHashMap tIntIntMap = new TIntObjectHashMap();

System.out.println("\n-- " + n + " puts(key, value) --");
startTime = System.nanoTime();
for (int i = 0; i < n; i++) { tIntIntMap.put(i,new float[]{0f,1f,2f,3f,4f}); }
System.out.println( (System.nanoTime() - startTime) / 1000000000.0 );
System.out.println( (startHeapSize - Runtime.getRuntime().freeMemory()) /1048576.0 );

System.out.println("\n-- " + n + " gets(key) --");
startTime = System.nanoTime();
for (int i = 0; i < n; i++) { tIntIntMap.get(i); }
System.out.println( (System.nanoTime() - startTime) / 1000000000.0 );
System.out.println( (startHeapSize - Runtime.getRuntime().freeMemory()) /1048576.0 );

System.out.println("\n-- " + n + " containsKey(key) --");
startTime = System.nanoTime();
for (int i = 0; i < n; i++) { tIntIntMap.containsKey(i); }
System.out.println( (System.nanoTime() - startTime) / 1000000000.0 );
System.out.println( (startHeapSize - Runtime.getRuntime().freeMemory()) /1048576.0 );
// END

// BEGIN: benchmark for Colt's OpenIntIntHashMap
System.out.println("\n===== Colt's OpenIntIntHashMap =====");
OpenIntObjectHashMap cIntIntMap = new OpenIntObjectHashMap();

System.out.println("\n-- " + n + " puts(key, value) --");
startTime = System.nanoTime();
for (int i = 0; i < n; i++) { cIntIntMap.put(i,new float[]{0f,1f,2f,3f,4f}); }
System.out.println( (System.nanoTime() - startTime) / 1000000000.0 );
System.out.println( (startHeapSize - Runtime.getRuntime().freeMemory()) /1048576.0 );

System.out.println("\n-- " + n + " gets(key) --");
startTime = System.nanoTime();
for (int i = 0; i < n; i++) { cIntIntMap.get(i); }
System.out.println( (System.nanoTime() - startTime) / 1000000000.0 );
System.out.println( (startHeapSize - Runtime.getRuntime().freeMemory()) /1048576.0 );

System.out.println("\n-- " + n + " containsKey(key) --");
startTime = System.nanoTime();
for (int i = 0; i < n; i++) { cIntIntMap.containsKey(i); }
System.out.println( (System.nanoTime() - startTime) / 1000000000.0 );
System.out.println( (startHeapSize - Runtime.getRuntime().freeMemory()) /1048576.0 );
// END


}

}




Friday, January 23, 2009

How to write a great research paper

A great guide by Simon Peyton Jones

Wednesday, December 3, 2008

Embedding web browser in Java application

I found this NativeSwing library an excellent library for embedding a web browser in a Java application, seemingly easier to use and much better documented that JDIC.

Tuesday, November 25, 2008

Primitive Java Collections (primitive hashset, etc.)

Java's built-in collections such as hashmap, hashset, etc are not efficient when you have a lot of data (e.g., 1 million of items), it's because everything in the hashmap (i.e. its keys and values) and hashset (i.e. its values) are stored as objects, which are expensive to store and access. So if you want to store, say, 1 million int in a hashset, use a primitive hashset instead, such as Trove, or COLT. They seem to be some of the best primitive collections. But I haven't compared their performace (memory consumption and access speed) so I don't know which one is better.

Tuesday, November 18, 2008

Downloading full CiteSeerX data

Here is, in my opinion, the easiest way to download the full dataset from CiteSeerX. (Note that CiteSeer is the older version, which is no longer updated.)

Steps for downloading the full dataset from CiteSeerX:
  1. Download and extract the "Demo" from http://www.oclc.org/research/software/oai/harvester.htm
  2. Go to the directory of the extracted files, type the following command to download the full dataset of CiteSeerX to the file "citeseerx_alldata.xml"
    java -classpath .;oaiharvester.jar;xerces.jar org.acme.oai.OAIReaderRawDump http://citeseerx.ist.psu.edu/oai2 -o citeseerx_alldata.xml
I also wanted to thank Dr. Lee Giles from the CiteSeerX project for pointing me at the right directions to obtain the data and recommending me to try some Perl harvesters (though I didn't eventually use them).

Polo's Blog

About Me