Date: 30.01.2018

Content extraction is the process of determining what parts of a webpage contain the main textual content, thus ignoring additional context such as:

menus,
status bars,
advertisements,
sponsored information,
etc.

Component reuse. Web developers can automatically extract components from a webpage.
Enhancing indexers and text analyzers to increase their performance by only processing relevant information.

It has been measured that roughly 40-50% of the components of a webpage belong to the template.

Extraction of the main content of a webpage so that it can be suitably displayed on a small device such as a PDA or a mobile phone.
Extraction of the relevant content to make the webpage more accessible for visually impaired or blind users.

There are three main ways to solve the problem:
Using the textual information of the webpage (i.e., the HTML code)
Using the rendered image of the webpage in the browser
Using the DOM tree of the webpage

Some are based on the assumption that the webpage has a particular structure (e.g., based on table markup tags) [10].
Some assume that the main content text is continuous [11].
Some assume that the system knows a priori the format of the webpage [10].
Some need to (randomly) load many webpages (several dozen) to compare them.
Some assume that the whole website to which the webpage belongs is based on a template that is repeated across its pages.

The main problem with these approaches is a significant loss of generality.
They require the webpages to be known or parsed in advance, or they require the webpage to have a particular structure.
This is very inconvenient because modern webpages are mainly based on <div> tags, which need not be hierarchically organized (unlike in table-based design).
Moreover, nowadays, many webpages are automatically and dynamically generated and thus it is often impossible to analyze the webpages a priori.

The Document Object Model (DOM)
The DOM is an API that provides programmers with a standard set of objects for representing HTML and XML documents.
Given a webpage, its associated DOM structure can be produced completely automatically, and vice versa.
The DOM structure of a given webpage is a tree in which all the elements of the webpage (including scripts and CSS styles) are represented hierarchically.

Nodes in the DOM tree can be of two types: tag nodes and text nodes.
Tag nodes represent the HTML tags of an HTML document and contain all the information associated with the tags (e.g., their attributes).
Text nodes are always leaves in the DOM tree because they cannot contain other nodes.
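
To make the two node types concrete, the following is a minimal sketch that builds a DOM-like tree with Python's standard html.parser module. The names Node, TreeBuilder and parse_html are hypothetical, chosen for illustration; a real browser DOM carries far more information, and this sketch ignores comments, doctype declarations and whitespace-only text.

```python
from html.parser import HTMLParser

# Void elements never have children or a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """Hypothetical DOM-like node: either a tag node or a text node."""

    def __init__(self, tag=None, attrs=None, text=None):
        self.tag = tag                  # None for text nodes
        self.attrs = dict(attrs or [])  # attributes of a tag node
        self.text = text                # None for tag nodes
        self.children = []              # always empty for text nodes

    @property
    def is_text(self):
        return self.tag is None


class TreeBuilder(HTMLParser):
    """Builds a tree of Node objects while the parser reads the HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node(tag="#document")
        self.stack = [self.root]        # chain of currently open elements

    def handle_starttag(self, tag, attrs):
        node = Node(tag=tag, attrs=attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Text nodes are appended as leaves; they never go on the stack.
        if data.strip():
            self.stack[-1].children.append(Node(text=data.strip()))


def parse_html(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def describe(node, depth=0):
    """Return one indented line per node, in document order."""
    label = repr(node.text) if node.is_text else f"<{node.tag}> {node.attrs}"
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(describe(child, depth + 1))
    return lines
```

Calling describe(parse_html(page_source)) on any HTML string gives an indented listing in which every text node appears as a leaf and every tag node carries its attributes.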

Our method for template extraction in a nutshell:

Identify a set of webpages in the website topology.
Select those nodes that belong to the menu.
Use a complete subdigraph.

Resolve conflicts between those webpages that implement different templates.
Establish a voting system between the webpages.

The template is the intersection between the initial webpage and the DOM trees in the subdigraph.
The intersection is computed with an Equal Top-Down Mapping between the DOM trees.

The three steps can be performed with linear cost with respect to the size of the DOM trees.

Mapping:

Top-Down Mapping:

Equal Top-Down Mapping:
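
The formal definitions that accompanied these three headings are not available in this text. The sketch below therefore works under stated assumptions rather than the authors' definitions: a mapping pairs nodes of two trees one-to-one, it is top-down when two nodes may be paired only if their parents are paired, and it is "equal" when paired nodes must be equal, which is taken here to mean the same label and the same attributes. The greedy left-to-right matching of children, and the names Node, equal_top_down_mapping and template_nodes, are illustrative choices. The sketch does not attempt to achieve linear cost.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                  # tag name, or the text of a text node
    attrs: tuple = ()           # sorted (name, value) pairs
    children: list = field(default_factory=list)


def nodes_equal(a, b):
    # Assumed equality test: same label and same attributes.
    return a.label == b.label and a.attrs == b.attrs


def equal_top_down_mapping(a, b):
    """Return pairs (node_a, node_b); children are paired only under paired parents."""
    if not nodes_equal(a, b):
        return []
    pairs = [(a, b)]
    used = set()                # indices of b's children already paired
    for child_a in a.children:
        # Greedy choice (an assumption): take the first unused equal child.
        for j, child_b in enumerate(b.children):
            if j not in used and nodes_equal(child_a, child_b):
                used.add(j)
                pairs.extend(equal_top_down_mapping(child_a, child_b))
                break
    return pairs


def preorder(node):
    yield node
    for child in node.children:
        yield from preorder(child)


def template_nodes(key_page, other_pages):
    """Keep the key-page nodes that are mapped in every other page."""
    kept = {id(n) for n in preorder(key_page)}
    for other in other_pages:
        kept &= {id(a) for a, _ in equal_top_down_mapping(key_page, other)}
    return [n for n in preorder(key_page) if id(n) in kept]


# Two hypothetical pages sharing a menu but differing in their main text.
page_a = Node("html", children=[
    Node("div", (("id", "menu"),), [Node("a")]),
    Node("p", children=[Node("News text")]),
])
page_b = Node("html", children=[
    Node("div", (("id", "menu"),), [Node("a")]),
    Node("p", children=[Node("Other text")]),
])

shared = [n.label for n in template_nodes(page_a, [page_b])]
# shared == ["html", "div", "a", "p"]
```

Because each mapping is closed under the parent relation, the nodes that survive the intersection still form a connected tree rooted at the root of the key page.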

Benchmarks: online heterogeneous webpages

Domains with different layouts and page structures
Company websites, news articles, forums, etc.

The final evaluation set was randomly selected.
We determined the actual template of each webpage by downloading it and manually selecting the template.
The DOM tree of the selected elements was then produced and used later for comparison during evaluation.
The F1 metric is computed as (2*P*R)/(P+R), where P is the precision and R is the recall.
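
As a worked illustration of the formula, the sketch below computes precision, recall and F1 from two sets of node identifiers. It assumes the usual definitions: precision is the fraction of retrieved nodes that are correct, and recall is the fraction of correct nodes that were retrieved. The function name and the integer identifiers are hypothetical.

```python
def precision_recall_f1(retrieved, relevant):
    """Compare the nodes a tool returned with the nodes of the gold standard."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = (2 * precision * recall) / (precision + recall)
    return precision, recall, f1


# Hypothetical example: the tool returns 4 nodes, 3 of which are among
# the 5 nodes of the gold-standard template.
p, r, f1 = precision_recall_f1({1, 2, 3, 4}, {2, 3, 4, 5, 6})
# p == 0.75, r == 0.6, f1 is approximately 0.667
```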

GOLD STANDARD
Downloading the complete website of each benchmark.

Four different engineers did the following independently:

Manually exploring the original page and the webpages accessible from it to decide what part of the webpage is the template.
Printing the key page on paper and marking the template.

The four engineers met and together decided what the template was.
Each element marked in the printed page was mapped to the DOM tree of the initial page.

All elements in the DOM tree that did not belong to the template were assigned the HTML class non-template (i.e., we enriched the HTML code of the key page with a new class).
This class was later used by an algorithm that we programmed to evaluate the results obtained by our tool.
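
The evaluation algorithm itself is not described in this text. As a rough sketch of how such an enriched page could be read back, the code below collects the gold-standard template elements, assuming that every element carrying the class non-template is outside the template and every other element is inside it. Identifying elements by their position in document order, and the names GoldCollector and gold_template, are assumptions made for illustration.

```python
from html.parser import HTMLParser


class GoldCollector(HTMLParser):
    """Numbers elements in document order and splits them by the marker class."""

    def __init__(self):
        super().__init__()
        self.count = 0
        self.template = set()      # indices of elements without the class
        self.non_template = set()  # indices of elements with the class

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "non-template" in classes:
            self.non_template.add(self.count)
        else:
            self.template.add(self.count)
        self.count += 1


def gold_template(html):
    """Return the positions of the elements that belong to the template."""
    collector = GoldCollector()
    collector.feed(html)
    collector.close()
    return collector.template


# Hypothetical key page: the menu is template, the article is not.
key_page = (
    '<html><body>'
    '<div id="menu"><a href="/">Home</a></div>'
    '<div class="article non-template"><p class="non-template">Story</p></div>'
    '</body></html>'
)
gold = gold_template(key_page)
# gold == {0, 1, 2, 3}: html, body, the menu div and its link
```

The resulting set can be compared with the set of elements returned by a tool to obtain precision and recall.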
