Working with WARC Files
Data Format on Disk
Web pages are stored in gzipped files in the WARC format. The WARC formatting used conforms to the WARC ISO 28500 final draft, version 018 (as of June 18th, 2008). Specifications for the format can be found at:
- http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
- http://www.scribd.com/doc/4303719/WARC-ISO-28500-final-draft-v018-Zentveld-080618
One custom field, named "WARC-TREC-ID", is added to the WARC response header information. This is a globally unique identifier that describes the location of the individual record within the entire ClueWeb09 dataset. See the Dataset Information page for more information.
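For illustration, the header of a response record in these files looks roughly like the following. The field values here are made up, and the exact set of fields can vary by record:

    WARC/0.18
    WARC-Type: response
    WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>
    WARC-Date: 2009-01-20T12:00:00Z
    WARC-Target-URI: http://www.example.com/page.html
    WARC-TREC-ID: clueweb09-en0000-00-00000
    Content-Type: application/http;msgtype=response
    Content-Length: 12345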
Linebreaks within the GZipped Files
Note that the WARC files were generated on a Linux-based system, and as such, the line breaks are LF only (\n). If the original gzip files are opened and re-saved on a Windows system, the Windows system may transform the line breaks in the file to CR/LF (\r\n). If this occurs, the content lengths of the HTML content as well as the WARC record content lengths may be incorrect, and your code will have to recalculate the content lengths or account for them in some fashion.
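If you suspect that a copy of a file has been rewritten this way, a rough sanity check is to count the two kinds of line breaks after decompression. This is a minimal sketch, not part of the distributed classes; note that some CR/LF pairs occur legitimately inside the stored HTTP headers, so the signal to look for is a wholesale shift toward CR/LF rather than a nonzero count:

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.zip.GZIPInputStream;

    // Counts LF-only versus CR/LF line breaks in a gzipped WARC file.
    // A file dominated by CR/LF breaks has likely been re-saved on Windows.
    public class LineEndingCheck {
        public static void main(String[] args) throws IOException {
            InputStream in = new BufferedInputStream(
                new GZIPInputStream(new FileInputStream(args[0])));
            long loneLf = 0, crLf = 0;
            int prev = -1, b;
            while ((b = in.read()) != -1) {
                if (b == '\n') {
                    if (prev == '\r') crLf++; else loneLf++;
                }
                prev = b;
            }
            in.close();
            System.out.println("LF-only line breaks: " + loneLf);
            System.out.println("CR/LF line breaks: " + crLf);
        }
    }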
File Classes for Java
Below are two Java classes that may help with processing the data: WarcRecord and WarcHTMLResponseRecord, in the edu.cmu.lemurproject package.
Classes for Hadoop
The supplemental classes below introduce a FileInputFormat for reading WarcRecords. The record tuple returned is a LongWritable (a sequential record number) and a WritableWarcRecord object containing the actual WARC record.
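As a rough sketch, a mapper built on these classes might emit each TREC ID and target URI as follows. The WritableWarcRecord.getRecord() accessor and the WarcFileInputFormat class name in the driver comment are assumptions; verify them against the supplemental classes themselves:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import edu.cmu.lemurproject.WarcHTMLResponseRecord;
    import edu.cmu.lemurproject.WarcRecord;
    import edu.cmu.lemurproject.WritableWarcRecord;

    // The key is the sequential record number; the value wraps the WARC record.
    public class WarcTrecIdMapper
            extends Mapper<LongWritable, WritableWarcRecord, Text, Text> {
        @Override
        public void map(LongWritable key, WritableWarcRecord value, Context context)
                throws IOException, InterruptedException {
            WarcRecord record = value.getRecord(); // assumed accessor name
            // only response records carry a TREC ID and target URI
            if (record.getHeaderRecordType().equals("response")) {
                WarcHTMLResponseRecord html = new WarcHTMLResponseRecord(record);
                context.write(new Text(html.getTargetTrecID()),
                              new Text(html.getTargetURI()));
            }
        }
    }
    // In the job driver, point the job at the WARC input format, e.g.:
    //   job.setInputFormatClass(WarcFileInputFormat.class); // assumed class name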
Example Code
The example code below takes as input the full path to a gzipped WARC file on the command line, iterates through the file, and prints out each TREC ID in the WARC file as well as the Target URI for that TREC ID:

    import java.io.DataInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.zip.GZIPInputStream;
    import edu.cmu.lemurproject.WarcRecord;
    import edu.cmu.lemurproject.WarcHTMLResponseRecord;

    public class ReadWarcSample {
        public static void main(String[] args) throws IOException {
            String inputWarcFile = args[0];
            // open our gzip input stream
            GZIPInputStream gzInputStream =
                new GZIPInputStream(new FileInputStream(inputWarcFile));
            // wrap it in a data input stream
            DataInputStream inStream = new DataInputStream(gzInputStream);
            // iterate through our stream
            WarcRecord thisWarcRecord;
            while ((thisWarcRecord = WarcRecord.readNextWarcRecord(inStream)) != null) {
                // see if it's a response record
                if (thisWarcRecord.getHeaderRecordType().equals("response")) {
                    // it is - create a WarcHTML record
                    WarcHTMLResponseRecord htmlRecord =
                        new WarcHTMLResponseRecord(thisWarcRecord);
                    // get our TREC ID and target URI
                    String thisTRECID = htmlRecord.getTargetTrecID();
                    String thisTargetURI = htmlRecord.getTargetURI();
                    // print our data
                    System.out.println(thisTRECID + " : " + thisTargetURI);
                }
            }
            inStream.close();
        }
    }
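Run against a single gzipped WARC file, the program prints one line per response record. With made-up identifiers (the exact IDs and URIs depend on the file), the output looks something like:

    clueweb09-en0000-23-00150 : http://www.example.com/page.html
    clueweb09-en0000-23-00151 : http://www.example.com/other.html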
University of Maryland's Experiences (Jimmy Lin's group)
Jimmy Lin (UMD) has written an excellent page on his group's experiences working with the ClueWeb09 English collection. Jimmy Lin and his group also have a web page describing random access within the ClueWeb09 dataset under Hadoop using the Cloud9 MapReduce library.
One major item to note is the list of doc IDs that they have found to have malformed data. If you are experiencing problems with the dataset, you may want to check this list.