Tuesday, November 18, 2008

Downloading full CiteSeerX data

Here is, in my opinion, the easiest way to download the full dataset from CiteSeerX. (Note that CiteSeer is the older version, which is no longer updated.)

Steps for downloading the full dataset from CiteSeerX:
  1. Download and extract the "Demo" from http://www.oclc.org/research/software/oai/harvester.htm
  2. In the directory of the extracted files, run the following command to download the full CiteSeerX dataset to the file "citeseerx_alldata.xml" (the ";" classpath separator is for Windows; on Linux or macOS use ":" instead):
    java -classpath .;oaiharvester.jar;xerces.jar org.acme.oai.OAIReaderRawDump http://citeseerx.ist.psu.edu/oai2 -o citeseerx_alldata.xml
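
For readers curious about what a harvester like this does, here is a minimal Python sketch of an OAI-PMH ListRecords loop. It is not the OCLC tool's code, only an illustration under stated assumptions: the function names (build_url, next_token, harvest) are hypothetical, the "oai_dc" metadata prefix is assumed, and the endpoint is the one from the command above, which may no longer be available. The protocol returns records in pages, each ending with a resumptionToken that is sent back to request the next page.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 response elements.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def build_url(base_url, token=None, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL (hypothetical helper).

    When a resumption token is given, it is the only argument
    sent besides the verb, as the protocol requires.
    """
    if token:
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urllib.parse.urlencode(params)


def next_token(xml_bytes):
    """Return the resumptionToken of a response page, or None if it is the last page."""
    root = ET.fromstring(xml_bytes)
    elem = root.find(f".//{OAI_NS}resumptionToken")
    if elem is None or not (elem.text or "").strip():
        return None
    return elem.text.strip()


def harvest(base_url, out_path):
    """Fetch every page and append the raw responses to out_path.

    Note: the output is a concatenation of XML documents,
    one per page, not a single well-formed XML file.
    """
    token = None
    with open(out_path, "wb") as out:
        while True:
            with urllib.request.urlopen(build_url(base_url, token)) as resp:
                page = resp.read()
            out.write(page)
            out.write(b"\n")
            token = next_token(page)
            if token is None:
                break
```

Calling harvest("http://citeseerx.ist.psu.edu/oai2", "citeseerx_alldata.xml") would roughly mirror the command above, assuming the server still answers OAI-PMH requests at that address. A real harvest would also want retries and a pause between requests.
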
I would also like to thank Dr. Lee Giles of the CiteSeerX project for pointing me in the right direction to obtain the data and for recommending that I try some Perl harvesters (though in the end I didn't use them).