The ClueWeb22 Dataset:
VDOM Information
Visual Rendering Annotations
Below is the list of visual rendering features with brief explanations. Most, if not all, are standard DOM properties that are also documented on sites like https://developer.mozilla.org/ and https://www.w3schools.com/, which can be consulted when the descriptions below do not give enough context.
Feature Name | Variable Name | Description | Range |
---|---|---|---|
X position | position_x | the initial horizontal position | |
Y position | position_y | the initial vertical position | |
Width | position_w | element's width | |
Height | position_h | element's height | |
Offset left | offset_left | left position relative to the parent | |
Offset top | offset_top | top position relative to the parent | |
Offset width | offset_w | width of an element, including padding, border and scrollbar | |
Offset height | offset_h | height of an element, including padding, border and scrollbar | |
Client left | client_left | width of the left border of an element | |
Client top | client_top | width of the top border of an element | |
Client width | client_w | width of an element in pixels, including padding | |
Client height | client_h | height of an element, including padding | |
Font color Alpha | font_color_a | alpha value of font color | |
Font color Red | font_color_r | red value of font color | [0, 255] |
Font color Blue | font_color_b | blue value of font color | [0, 255] |
Font color Green | font_color_g | green value of font color | [0, 255] |
Font weight | font_weight | the weight (or boldness) of the font | |
Font size | font_size | the size of font | |
Font italic style | font_italic | font in italic style or not | |
Text decoration style | font_decoration | specifies the decoration added to text, such as underline or overline | |
List style type | list_style | the type of list-item marker in a list | |
Display | display_style | display behavior (the type of rendering box) of an element, such as none (invisible), inline, or block | |
Cursor | cursor_style | specifies the mouse cursor to be displayed when pointing over an element | |
Line Height | line_height | specifies the height of a line | |
Text transform | text_transform | controls the capitalization of text | |
Opacity | opacity | the opacity level for an element | [0, 10] |
Border style Left | border_style_left | the style of an element's left border | |
Border style Top | border_style_top | the style of an element's top border | |
Border style Right | border_style_right | the style of an element's right border | |
Border style Bottom | border_style_bottom | the style of an element's bottom border | |
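
As an illustration of how the variables in the table might be consumed, the sketch below combines the font color channels into a hex string and derives a bounding box from the position fields. It assumes, hypothetically, that a node's annotations are available as a plain mapping keyed by the variable names above, and that `position_x` / `position_y` give the top-left corner in the same units as `position_w` / `position_h`. Neither assumption is confirmed by this page, and the helper names `font_color_hex` and `bounding_box` are invented for the example.

```python
from typing import Mapping, Tuple


def font_color_hex(node: Mapping[str, float]) -> str:
    """Combine the red, green, and blue font color channels into '#rrggbb'.

    Assumption: `node` is a mapping keyed by the variable names in the table.
    """
    channels = []
    for key in ("font_color_r", "font_color_g", "font_color_b"):
        value = int(node[key])
        # The table documents [0, 255] for the three color channels.
        if not 0 <= value <= 255:
            raise ValueError(f"{key}={value} is outside the documented range [0, 255]")
        channels.append(value)
    r, g, b = channels
    return f"#{r:02x}{g:02x}{b:02x}"


def bounding_box(node: Mapping[str, float]) -> Tuple[float, float, float, float]:
    """Return (left, top, right, bottom) for a node.

    Assumption: position_x / position_y are the top-left corner and share
    units with position_w / position_h.
    """
    left, top = node["position_x"], node["position_y"]
    return (left, top, left + node["position_w"], top + node["position_h"])


# Hypothetical node record, made up for illustration only.
example_node = {
    "position_x": 10, "position_y": 20, "position_w": 100, "position_h": 50,
    "font_color_r": 255, "font_color_g": 0, "font_color_b": 16,
}
print(font_color_hex(example_node))  # '#ff0010'
print(bounding_box(example_node))    # (10, 20, 110, 70)
```

The range check is limited to the three channels for which the table gives a range; fields without a documented range are passed through unchecked.
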
Semantic Annotations
Below is the list of semantic annotations assigned to DOM tree nodes. These annotations are assigned by a classifier, using the Visual Rendering Annotations above.
Feature Name | Variable Name | Description | Range |
---|---|---|---|
Heading | |||
List | |||
Paragraph | |||
Primary Content | |||
Table | |||
Title | |||
All other content is implicitly considered secondary content.
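
The implicit rule above can be sketched as a small helper. This is only one reading of it: a node counts as secondary whenever it does not carry the Primary Content annotation. The representation of a node's annotations as a collection of label strings, and the names `SEMANTIC_LABELS` and `is_secondary`, are assumptions made for the example rather than part of the dataset's documented format.

```python
from typing import Iterable

# The six annotation names listed in the table above.
SEMANTIC_LABELS = frozenset(
    {"Heading", "List", "Paragraph", "Primary Content", "Table", "Title"}
)


def is_secondary(labels: Iterable[str]) -> bool:
    """Return True if a node is secondary content under the assumed reading.

    Assumption: `labels` holds the semantic annotation names attached to one
    DOM node, and "secondary" means "not annotated as Primary Content".
    """
    label_set = set(labels)
    unknown = label_set - SEMANTIC_LABELS
    if unknown:
        raise ValueError(f"unrecognized annotation(s): {sorted(unknown)}")
    return "Primary Content" not in label_set


print(is_secondary(["Paragraph", "Primary Content"]))  # False
print(is_secondary(["List"]))                          # True
print(is_secondary([]))                                # True
```
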