Tuesday, November 25, 2008

Primitive Java Collections (primitive hashset, etc.)

Java's built-in collections, such as HashMap and HashSet, are not efficient when you have a lot of data (e.g., 1 million items). This is because everything in a HashMap (its keys and values) and in a HashSet (its elements) is stored as an object, and objects are expensive to store and access. So if you want to store, say, 1 million ints in a set, use a primitive hash set instead, such as the ones in Trove or COLT. They seem to be some of the best primitive collections, but I haven't compared their performance (memory consumption and access speed), so I don't know which one is better.
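
To make the idea concrete, here is a rough sketch of what a primitive hash set does internally: it keeps the values in a plain int[] (open addressing with linear probing) instead of wrapping each one in an Integer object. This is not Trove's or COLT's actual code. The class name IntOpenHashSet, the hash-mixing constant, and the 0.5 load factor are my own choices for illustration.

```java
// Illustrative only: a minimal int set backed by arrays, with no boxing.
// Class name, mixing function, and load factor are assumptions, not Trove/COLT code.
public class IntOpenHashSet {
    private int[] slots;      // stored values
    private boolean[] used;   // marks which slots hold a value
    private int size;

    public IntOpenHashSet(int expected) {
        int cap = 16;
        while (cap < expected * 2L) cap <<= 1;  // power of two, at most half full
        slots = new int[cap];
        used = new boolean[cap];
    }

    /** Adds the value; returns false if it was already present. */
    public boolean add(int value) {
        if ((size + 1) * 2 > slots.length) resize();
        int mask = slots.length - 1;
        int i = mix(value) & mask;
        while (used[i]) {                 // linear probing
            if (slots[i] == value) return false;
            i = (i + 1) & mask;
        }
        used[i] = true;
        slots[i] = value;
        size++;
        return true;
    }

    public boolean contains(int value) {
        int mask = slots.length - 1;
        int i = mix(value) & mask;
        while (used[i]) {
            if (slots[i] == value) return true;
            i = (i + 1) & mask;
        }
        return false;
    }

    public int size() { return size; }

    // Spreads the bits so that consecutive ints do not cluster.
    private static int mix(int x) {
        x *= 0x9E3779B9;
        return x ^ (x >>> 16);
    }

    // Doubles the capacity and re-inserts every stored value.
    private void resize() {
        int[] oldSlots = slots;
        boolean[] oldUsed = used;
        slots = new int[oldSlots.length * 2];
        used = new boolean[slots.length];
        size = 0;
        for (int i = 0; i < oldSlots.length; i++) {
            if (oldUsed[i]) add(oldSlots[i]);
        }
    }

    public static void main(String[] args) {
        IntOpenHashSet set = new IntOpenHashSet(1000000);
        for (int i = 0; i < 1000000; i++) set.add(i);
        System.out.println(set.size() + " " + set.contains(999999) + " " + set.contains(-1));
    }
}
```

The point of the design is that one million ints live in two flat arrays, whereas a HashSet<Integer> allocates an Integer plus a hash table entry object for every element. In practice I would still use a tested library rather than this sketch, since it has no remove() and no iteration.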

Tuesday, November 18, 2008

Downloading full CiteSeerX data

Here is, in my opinion, the easiest way to download the full dataset from CiteSeerX. (Note that CiteSeer is the older version, which is no longer updated.)

Steps for downloading the full dataset from CiteSeerX:
  1. Download and extract the "Demo" from http://www.oclc.org/research/software/oai/harvester.htm
  2. Go to the directory of the extracted files and run the following command to download the full CiteSeerX dataset to the file "citeseerx_alldata.xml". (The ";" classpath separator is for Windows; on Linux or Mac OS X, use ":" instead.)
    java -classpath .;oaiharvester.jar;xerces.jar org.acme.oai.OAIReaderRawDump http://citeseerx.ist.psu.edu/oai2 -o citeseerx_alldata.xml
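
For anyone curious what a harvester like this does against the endpoint, here is a small sketch of the OAI-PMH request sequence as I understand it from the protocol: a first ListRecords request with a metadataPrefix, then repeated requests that pass back the resumptionToken from each response until none is returned. This is not the OCLC harvester's code. The class and method names are made up, "oai_dc" is assumed as the metadata format, and the token is pulled out with a crude regex rather than a real XML parser.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: builds OAI-PMH ListRecords URLs and finds the resumption token.
// Names are hypothetical; no network access happens here.
public class OaiRequestSketch {
    // Crude match for <resumptionToken ...>TOKEN</resumptionToken>.
    private static final Pattern TOKEN =
        Pattern.compile("<resumptionToken[^>]*>([^<]*)</resumptionToken>");

    /** URL of the first request in a harvest. */
    static String firstRequest(String baseUrl, String metadataPrefix) {
        return baseUrl + "?verb=ListRecords&metadataPrefix=" + encode(metadataPrefix);
    }

    /** URL of a follow-up request that continues from the given token. */
    static String nextRequest(String baseUrl, String token) {
        return baseUrl + "?verb=ListRecords&resumptionToken=" + encode(token);
    }

    /** Returns the token in a response, or null when the list is complete. */
    static String extractResumptionToken(String responseXml) {
        Matcher m = TOKEN.matcher(responseXml);
        if (!m.find()) return null;
        String token = m.group(1).trim();
        return token.isEmpty() ? null : token;
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);  // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        String base = "http://citeseerx.ist.psu.edu/oai2";
        System.out.println(firstRequest(base, "oai_dc"));
        // A made-up response fragment, only to show the token handling.
        String fake = "<resumptionToken>tok123</resumptionToken>";
        System.out.println(nextRequest(base, extractResumptionToken(fake)));
    }
}
```

A real harvester would fetch each URL, append the returned records to the output file, and loop until extractResumptionToken returns null.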
I also want to thank Dr. Lee Giles from the CiteSeerX project for pointing me in the right direction to obtain the data and for recommending that I try some Perl harvesters (though I didn't end up using them).
