The ClueWeb22 Dataset:
VDOM Information

Visual Rendering Annotations

Below is the list of visual rendering features with brief explanations. Most, if not all, are standard DOM properties that are also documented on sites such as https://developer.mozilla.org/ and https://www.w3schools.com/, in case the descriptions below do not give enough context.

| Feature Name | Variable Name | Description | Range |
|---|---|---|---|
| X position | position_x | the initial horizontal position | |
| Y position | position_y | the initial vertical position | |
| Width | position_w | element's width | |
| Height | position_h | element's height | |
| Offset left | offset_left | left position relative to the parent | |
| Offset top | offset_top | top position relative to the parent | |
| Offset width | offset_w | width of an element, including padding, border and scrollbar | |
| Offset height | offset_h | height of an element, including padding, border and scrollbar | |
| Client left | client_left | width of the left border of an element | |
| Client top | client_top | width of the top border of an element | |
| Client width | client_w | width of an element in pixels, including padding | |
| Client height | client_h | height of an element, including padding | |
| Font color Alpha | font_color_a | alpha value of font color | |
| Font color Red | font_color_r | red value of font color | [0, 255] |
| Font color Blue | font_color_b | blue value of font color | [0, 255] |
| Font color Green | font_color_g | green value of font color | [0, 255] |
| Font weight | font_weight | the weight (or boldness) of the font | |
| Font size | font_size | the size of the font | |
| Font italic style | font_italic | whether the font is in italic style | |
| Text decoration style | font_decoration | the decoration added to text, such as underline or overline | |
| List style type | list_style | the type of list-item marker in a list | |
| Display | display_style | display behavior (the type of rendering box) of an element, such as none (invisible), inline, block | |
| Cursor | cursor_style | the mouse cursor displayed when pointing over an element | |
| Line Height | line_height | the height of a line | |
| Text transform | text_transform | controls the capitalization of text | |
| Opacity | opacity | the opacity level for an element | [0, 10] |
| Border style Left | border_style_left | the style of an element's left border | |
| Border style Top | border_style_top | the style of an element's top border | |
| Border style Right | border_style_right | the style of an element's right border | |
| Border style Bottom | border_style_bottom | the style of an element's bottom border | |
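
To illustrate how a few of these fields might be combined, here is a minimal sketch in Python. It assumes that one node's features have already been loaded into a plain dict keyed by the variable names listed above. This document does not describe how the features are stored, so the dict layout, the helper names bounding_box and font_color_hex, and the reading of position_x / position_y as the top-left corner of the box are all assumptions made for illustration.

```python
from typing import Mapping, Tuple


def bounding_box(features: Mapping[str, float]) -> Tuple[float, float, float, float]:
    """Return (left, top, right, bottom) for one node.

    Assumption: position_x / position_y give the top-left corner and
    position_w / position_h the box size, all in the same unit.
    """
    left = features["position_x"]
    top = features["position_y"]
    return (left, top, left + features["position_w"], top + features["position_h"])


def font_color_hex(features: Mapping[str, float]) -> str:
    """Format the red, green and blue font color channels as a CSS hex string.

    Assumption: each channel is an integer in [0, 255], as listed in the Range column.
    """
    channels = (
        features["font_color_r"],
        features["font_color_g"],
        features["font_color_b"],
    )
    return "#" + "".join(f"{int(c):02x}" for c in channels)


# Hypothetical node; the values are made up for illustration only.
node = {
    "position_x": 10, "position_y": 20, "position_w": 100, "position_h": 50,
    "font_color_r": 255, "font_color_g": 0, "font_color_b": 128,
}
print(bounding_box(node))    # (10, 20, 110, 70)
print(font_color_hex(node))  # #ff0080
```

The alpha channel is left out of the hex string because no range is given for font_color_a above.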

Semantic Annotations

Below is the list of semantic annotations assigned to DOM tree nodes. A classifier assigns these annotations using the Visual Rendering Annotations above.

| Feature Name | Variable Name | Description | Range |
|---|---|---|---|
| Heading | | | |
| List | | | |
| Paragraph | | | |
| Primary Content | | | |
| Table | | | |
| Title | | | |

All other content is implicitly considered secondary content.
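
The implicit rule above can be sketched as a small Python check. This is only an illustration under stated assumptions: it reads "all other content" as nodes that carry none of the six annotations, and it takes a node's annotations as an iterable of label strings. Neither the label encoding nor the helper name is_secondary comes from this document. If "all other content" instead means nodes without the Primary Content annotation, the set would shrink to that single label.

```python
from typing import Iterable

# The six annotation names listed above.
SEMANTIC_LABELS = frozenset(
    {"Heading", "List", "Paragraph", "Primary Content", "Table", "Title"}
)


def is_secondary(labels: Iterable[str]) -> bool:
    """Return True when a node carries none of the six semantic annotations.

    Assumption: an unannotated node is what the text calls secondary content.
    """
    return SEMANTIC_LABELS.isdisjoint(labels)
```

For example, is_secondary([]) is true under this reading, while is_secondary(["Paragraph"]) is false.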