Language Identification
For the ClueWeb09 crawl, we performed language identification on the fly as we crawled, so that we could estimate the overall language distribution and focus crawling on the poorly covered languages.
Language identification for a crawled page
After Nutch downloads a page, it guesses the page's encoding based on the HTTP headers sent by the server and the metadata of the HTML file. Nutch then extracts the content of the HTML page, removing the HTML tags. Our customized Nutch crawler decodes the content of every page out of its original encoding, re-encodes it in a uniform UTF-8 encoding, performs language identification on the extracted UTF-8 content, and stores the prediction.

We used TextCat, an open-source statistical language id tool, to classify every page. For the training data, we simply encoded the training texts for the 10 target languages in UTF-8 and trained TextCat models on the UTF-8 byte stream. The training encoding has to be consistent with the encoding used at test time; because a uniform UTF-8 encoding is used at test time, the same encoding is used for the training data. Training data for the EU languages came from the EU Parliament dataset. For English, Japanese, and Korean, standard newswire collections were used to generate the training data; for Chinese, the Gigaword collection was used. All training data were down-sampled to about 7 MB to allow fast training of the TextCat models. Using 600 features during training and given only 400 bytes of test text, test precision for all 10 target languages is above 99.7%, and recall for most languages is above 98.5%. With shorter test texts (160 bytes), precision and recall are still above 98.3%. TextCat confuses Japanese with Chinese because of the shared characters; Japanese has 100% precision and 95.7% recall.
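TextCat is based on the n-gram ranking method of Cavnar and Trenkle: each language model keeps its most frequent n-grams in rank order, and a test document is scored by how far each of its own top n-grams is displaced in a model's ranking. The Java sketch below illustrates that scoring over a UTF-8 byte stream; it is not TextCat's code, and the profile size of 600 simply mirrors the feature count mentioned above.

    import java.nio.charset.StandardCharsets;
    import java.util.*;

    /** Sketch of Cavnar-Trenkle style n-gram ranking, the method TextCat is based on.
     *  Names and constants are illustrative, not TextCat's actual implementation. */
    public class NGramLanguageGuesser {
        private static final int PROFILE_SIZE = 600;   // top n-grams kept per language model
        private static final int MAX_N = 5;            // 1- to 5-grams over the UTF-8 byte stream

        // language -> (n-gram -> rank in that language's frequency ordering)
        private final Map<String, Map<String, Integer>> languageProfiles;

        public NGramLanguageGuesser(Map<String, Map<String, Integer>> languageProfiles) {
            this.languageProfiles = languageProfiles;
        }

        /** Build a ranked profile (n-gram -> rank) from UTF-8 encoded training or test bytes. */
        static Map<String, Integer> profile(byte[] utf8Bytes) {
            Map<String, Long> counts = new HashMap<>();
            for (int n = 1; n <= MAX_N; n++) {
                for (int i = 0; i + n <= utf8Bytes.length; i++) {
                    // ISO-8859-1 maps each byte to one char, so byte n-grams can be used as map keys
                    String gram = new String(utf8Bytes, i, n, StandardCharsets.ISO_8859_1);
                    counts.merge(gram, 1L, Long::sum);
                }
            }
            List<String> ranked = new ArrayList<>(counts.keySet());
            ranked.sort(Comparator.comparingLong((String g) -> counts.get(g)).reversed());
            Map<String, Integer> ranks = new HashMap<>();
            for (int r = 0; r < ranked.size() && r < PROFILE_SIZE; r++) ranks.put(ranked.get(r), r);
            return ranks;
        }

        /** Return the language whose profile has the smallest total "out-of-place" distance. */
        public String classify(byte[] utf8Bytes) {
            Map<String, Integer> doc = profile(utf8Bytes);
            String best = null;
            long bestScore = Long.MAX_VALUE;
            for (Map.Entry<String, Map<String, Integer>> lang : languageProfiles.entrySet()) {
                long score = 0;
                for (Map.Entry<String, Integer> g : doc.entrySet()) {
                    Integer rank = lang.getValue().get(g.getKey());
                    score += (rank == null) ? PROFILE_SIZE : Math.abs(rank - g.getValue());
                }
                if (score < bestScore) { bestScore = score; best = lang.getKey(); }
            }
            return best;
        }
    }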
The uniform UTF-8 encoding is desirable because a Web page can have numerous possible (language, encoding) combinations, and we cannot afford to train so many models for all possible combinations.
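The normalization step itself is small. The following sketch assumes the charset has already been guessed by Nutch from the HTTP headers or HTML metadata; it decodes the raw content bytes and re-encodes them as UTF-8, so a single model per language suffices regardless of the page's original encoding.

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    /** Sketch of normalizing extracted page content to UTF-8 before language id.
     *  The charset name is assumed to come from Nutch's encoding detection. */
    public class Utf8Normalizer {
        /** Decode raw content bytes with the guessed charset and re-encode as UTF-8. */
        public static byte[] toUtf8(byte[] rawContent, String guessedCharset) {
            Charset source = Charset.forName(guessedCharset);   // e.g. "GB2312", "Shift_JIS", "ISO-8859-1"
            String decoded = new String(rawContent, source);    // decode out of the original encoding
            return decoded.getBytes(StandardCharsets.UTF_8);    // uniform UTF-8 byte stream for TextCat
        }
    }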
Language guessing for a URL to be crawled
One reason to do language id is to know the language distribution of the crawl. A more important reason is to guide the crawl toward poorly crawled languages. We did this by terminating the crawl for a language once a per-language limit was reached.

For efficiency, we did not want to crawl a page, run language id, and then discard the page if it belongs to a saturated language class: had we done so, the majority of the crawling time would have been spent crawling and discarding pages from the already saturated major languages.
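One simple way to realize the per-language limit is a counter that marks a language as saturated once its quota is reached. The sketch below is illustrative only; the quota value and the class and method names are ours, not those used in the actual crawl.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicLong;

    /** Sketch of per-language quota tracking; the quota value is illustrative. */
    public class LanguageQuota {
        private final long pagesPerLanguage;   // fixed page budget per language (assumed)
        private final Map<String, AtomicLong> crawled = new ConcurrentHashMap<>();

        public LanguageQuota(long pagesPerLanguage) {
            this.pagesPerLanguage = pagesPerLanguage;
        }

        /** Record a crawled page's predicted language. */
        public void record(String language) {
            crawled.computeIfAbsent(language, k -> new AtomicLong()).incrementAndGet();
        }

        /** A language is saturated once its page count reaches the limit. */
        public boolean isSaturated(String language) {
            AtomicLong count = crawled.get(language);
            return count != null && count.get() >= pagesPerLanguage;
        }
    }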
The crawler therefore has to guess, before crawling, which URLs in its job queue might belong to a poorly covered language. Our customized Nutch crawler uses the referring pages' language ids as guesses for the referred URL. If multiple pages point to the same target URL, all of their language ids are kept as possible guesses. The crawler skips a URL when no referring page is in a poorly crawled language.
All statistical language id tools perform worse when the target text is short. In the ClueWeb09 crawl, pages with fewer than 400 bytes of content were tagged as "short" along with the language id, and these "short" referring pages were not used to assign a guess to the URLs they point to.
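Putting the two rules together, a URL inherits the language ids of its non-"short" referring pages and is crawled only if at least one inherited guess is a language that is not yet saturated. The sketch below is a minimal illustration under those assumptions; the names are ours, not Nutch's, the handling of a URL with no usable referrers is a guess since the text above does not specify it, and the saturated-language set could come from a quota tracker like the one sketched earlier.

    import java.util.HashSet;
    import java.util.Set;

    /** Sketch of the pre-crawl language guess for a URL; names are illustrative, not Nutch's. */
    public class UrlLanguageGuess {
        private static final int SHORT_PAGE_BYTES = 400;      // content below this is tagged "short"
        private final Set<String> guesses = new HashSet<>();  // language ids inherited from referring pages

        /** Propagate a referring page's language id, unless that page was tagged "short". */
        public void addReferrer(String referrerLanguage, int referrerContentBytes) {
            if (referrerContentBytes >= SHORT_PAGE_BYTES) {
                guesses.add(referrerLanguage);
            }
        }

        /** Crawl only if some referring page suggests a language that is not yet saturated. */
        public boolean shouldCrawl(Set<String> saturatedLanguages) {
            for (String language : guesses) {
                if (!saturatedLanguages.contains(language)) {
                    return true;
                }
            }
            return false;  // every guess saturated, or no usable referrer (a modeling assumption here)
        }
    }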