The ClueWeb22 Dataset:
Query Details
ClueWeb22 includes a large set of crowdsourced queries and relevance assessments created by the Lemur Project. These queries are intended to support the training and evaluation of ranking algorithms.
This work is in progress. The initial sets of queries and relevance assessments are planned for release in November 2022.
Queries and Relevance Assessments
Queries and relevance assessments were collected from crowd workers. Workers were required to be from an English-speaking country, to have a good worker score, and to pass a qualification test in which they correctly selected relevant documents and avoided selecting non-relevant documents.
The process for a single query was as follows.
- Obtain a query
  - Show the worker a list of five categories selected randomly from a pool of about 50 categories, for example "sports", "politics", "entertainment", "health", and "other".
  - Ask the worker to select a category and enter a query for that category.
- Rank documents (a sketch of this pipeline follows the list)
  - Use BM25 to retrieve 1,000 documents.
  - Use BERT to rerank the top 500 documents.
  - Prune the ranking to eliminate duplicates and to restrict each web domain to two documents.
  - We hope eventually to select a ranker randomly from a set of rankers. If you are interested in contributing a ranker, please contact us.
- Obtain assessments
  - Select the top ten documents and three additional documents that are probably not relevant. Randomize the order of the combined list (a sketch of this pooling step also follows the list).
  - Show the worker snippets for these thirteen results.
  - Ask the worker to identify which results are good search results.
- Validate assessments
  - Discard the query if the query or assessments do not meet quality requirements.
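In outline, the ranking step can be sketched as below. This is an illustration only, not the Lemur Project's code: the Doc record and the bm25_score and bert_score callables are hypothetical stand-ins for real BM25 and BERT scorers, and detecting duplicates by exact text match is an assumption.

    from dataclasses import dataclass
    from typing import Callable, List
    from urllib.parse import urlparse

    @dataclass
    class Doc:
        doc_id: str
        url: str
        text: str

    def prune(ranking: List[Doc], max_per_domain: int = 2) -> List[Doc]:
        """Drop duplicate documents and keep at most two per web domain."""
        seen_texts = set()
        per_domain = {}
        kept = []
        for doc in ranking:
            if doc.text in seen_texts:
                continue  # duplicate content (exact text match, an assumption)
            domain = urlparse(doc.url).netloc
            if per_domain.get(domain, 0) >= max_per_domain:
                continue  # this domain already contributes two documents
            seen_texts.add(doc.text)
            per_domain[domain] = per_domain.get(domain, 0) + 1
            kept.append(doc)
        return kept

    def rank(query: str,
             corpus: List[Doc],
             bm25_score: Callable[[str, str], float],
             bert_score: Callable[[str, str], float],
             retrieve_k: int = 1000,
             rerank_k: int = 500) -> List[Doc]:
        """Retrieve 1,000 documents with BM25, rerank the top 500 with
        BERT, then prune the result, as in the list above."""
        # Stage 1: first-pass retrieval by BM25 score.
        candidates = sorted(corpus,
                            key=lambda d: bm25_score(query, d.text),
                            reverse=True)[:retrieve_k]
        # Stage 2: rerank only the head; the tail keeps its BM25 order.
        head = sorted(candidates[:rerank_k],
                      key=lambda d: bert_score(query, d.text),
                      reverse=True)
        return prune(head + candidates[rerank_k:])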
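The assessment pool from the third step can be sketched in the same way, reusing the Doc record above. How the three probably-non-relevant documents are chosen is not specified here, so the likely_nonrelevant argument is an assumption; documents from deep in the ranking would be one plausible source.

    import random
    from typing import List, Optional

    def build_assessment_pool(ranking: List[Doc],
                              likely_nonrelevant: List[Doc],
                              seed: Optional[int] = None) -> List[Doc]:
        """Pool the top ten results with three documents that are probably
        not relevant, then shuffle so workers cannot infer the ranking."""
        pool = ranking[:10] + likely_nonrelevant[:3]
        random.Random(seed).shuffle(pool)
        return pool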
A single worker may contribute no more than ten queries per day.