Working with WARC Files
Data Format on Disk
Web pages are stored in gzipped files in the WARC format. The WARC formatting used conforms to the WARC ISO 28500 final draft, version 018 (as of June 18th, 2008). Specifications for the format can be found at:
- http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
- http://www.scribd.com/doc/4303719/WARC-ISO-28500-final-draft-v018-Zentveld-080618
One custom field, named "WARC-TREC-ID", is added to the WARC response header information. This is a globally unique identifier that describes the location of the individual record within the entire ClueWeb09 dataset. See the Dataset Information page for more information.
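For illustration, the header of a response record in these files looks roughly like the following. The field values here are made up, and the exact set of fields can vary by record:

    WARC/0.18
    WARC-Type: response
    WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>
    WARC-Date: 2009-01-20T12:00:00Z
    WARC-Target-URI: http://www.example.com/page.html
    WARC-TREC-ID: clueweb09-en0000-00-00000
    Content-Type: application/http;msgtype=response
    Content-Length: 12345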
Linebreaks within the GZipped Files
Note that the WARC files were generated on a Linux-based system, and as such, the line breaks are LF only (\n). If the original gzip files are opened and re-saved on a Windows system, the Windows system may transform the line breaks in the file to CR/LF (\r\n). If this occurs, the content lengths of the HTML content as well as the WARC record content lengths may be incorrect, and your code will have to recalculate the content lengths or account for them in some fashion.
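If you suspect that a copy of a file has been rewritten this way, a rough sanity check is to count the two kinds of line breaks after decompression. This is a minimal sketch, not part of the distributed classes; note that some CR/LF pairs occur legitimately inside the stored HTTP headers, so the signal to look for is a wholesale shift toward CR/LF rather than a nonzero count:

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.zip.GZIPInputStream;

    // Counts LF-only versus CR/LF line breaks in a gzipped WARC file.
    // A file dominated by CR/LF breaks has likely been re-saved on Windows.
    public class LineEndingCheck {
        public static void main(String[] args) throws IOException {
            InputStream in = new BufferedInputStream(
                new GZIPInputStream(new FileInputStream(args[0])));
            long loneLf = 0, crLf = 0;
            int prev = -1, b;
            while ((b = in.read()) != -1) {
                if (b == '\n') {
                    if (prev == '\r') crLf++; else loneLf++;
                }
                prev = b;
            }
            in.close();
            System.out.println("LF-only line breaks: " + loneLf);
            System.out.println("CR/LF line breaks: " + crLf);
        }
    }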
File Classes for Java
Below are two Java classes that may help with processing the data: WarcRecord and WarcHTMLResponseRecord, in the edu.cmu.lemurproject package.
Classes for Hadoop
The supplemental classes below introduce a FileInputFormat for reading WarcRecords. The record tuple returned is a LongWritable (a sequential record number) and a WritableWarcRecord object containing the actual WARC record.
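As a rough sketch, a mapper built on these classes might emit each TREC ID and target URI as follows. The WritableWarcRecord.getRecord() accessor and the WarcFileInputFormat class name in the driver comment are assumptions; verify them against the supplemental classes themselves:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import edu.cmu.lemurproject.WarcHTMLResponseRecord;
    import edu.cmu.lemurproject.WarcRecord;
    import edu.cmu.lemurproject.WritableWarcRecord;

    // The key is the sequential record number; the value wraps the WARC record.
    public class WarcTrecIdMapper
            extends Mapper<LongWritable, WritableWarcRecord, Text, Text> {
        @Override
        public void map(LongWritable key, WritableWarcRecord value, Context context)
                throws IOException, InterruptedException {
            WarcRecord record = value.getRecord(); // assumed accessor name
            // only response records carry a TREC ID and target URI
            if (record.getHeaderRecordType().equals("response")) {
                WarcHTMLResponseRecord html = new WarcHTMLResponseRecord(record);
                context.write(new Text(html.getTargetTrecID()),
                              new Text(html.getTargetURI()));
            }
        }
    }
    // In the job driver, point the job at the WARC input format, e.g.:
    //   job.setInputFormatClass(WarcFileInputFormat.class); // assumed class name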
Example Code
The example code below takes as input the full path to a gzipped WARC file on the command line, iterates through the file, and prints out each TREC ID in the WARC file as well as the Target URI for that TREC ID:

    import java.io.DataInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.zip.GZIPInputStream;
    import edu.cmu.lemurproject.WarcRecord;
    import edu.cmu.lemurproject.WarcHTMLResponseRecord;

    public class ReadWarcSample {
        public static void main(String[] args) throws IOException {
            String inputWarcFile = args[0];
            // open our gzip input stream
            GZIPInputStream gzInputStream =
                new GZIPInputStream(new FileInputStream(inputWarcFile));
            // wrap it in a data input stream
            DataInputStream inStream = new DataInputStream(gzInputStream);
            // iterate through our stream
            WarcRecord thisWarcRecord;
            while ((thisWarcRecord = WarcRecord.readNextWarcRecord(inStream)) != null) {
                // see if it's a response record
                if (thisWarcRecord.getHeaderRecordType().equals("response")) {
                    // it is - create a WarcHTML record
                    WarcHTMLResponseRecord htmlRecord =
                        new WarcHTMLResponseRecord(thisWarcRecord);
                    // get our TREC ID and target URI
                    String thisTRECID = htmlRecord.getTargetTrecID();
                    String thisTargetURI = htmlRecord.getTargetURI();
                    // print our data
                    System.out.println(thisTRECID + " : " + thisTargetURI);
                }
            }
            inStream.close();
        }
    }
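Run against a single gzipped WARC file, the program prints one line per response record. With made-up identifiers (the exact IDs and URIs depend on the file), the output looks something like:

    clueweb09-en0000-23-00150 : http://www.example.com/page.html
    clueweb09-en0000-23-00151 : http://www.example.com/other.html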
University of Maryland's Experiences (Jimmy Lin's group)
Jimmy Lin (UMD) has written an excellent page on his group's experiences working with the ClueWeb09 English collection. Jimmy Lin and his group also have a web page describing random access within the ClueWeb09 dataset under Hadoop using the Cloud9 MapReduce library.
One major item to note is the list of doc IDs that they have found to have malformed data. If you are experiencing problems with the dataset, you may want to check this list.