Content extraction is the process of determining what parts of a webpage contain the main textual content, thus ignoring additional context such as:

  • menus,
  • status bars,
  • advertisements,
  • sponsored information,
  • etc.
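As a naive illustration of the idea (not the method described later), the sketch below collects visible text while skipping elements commonly used for menus, ads, and page chrome. The set of skipped tags is a heuristic assumption for this example only.

```python
from html.parser import HTMLParser

# Heuristic: tags that typically hold boilerplate rather than main content.
SKIP = {"nav", "aside", "footer", "header", "script", "style"}

class MainTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0          # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside any skipped element.
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    parser = MainTextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

For example, `extract_text("<html><nav>Home About</nav><p>Main story.</p><footer>ads</footer></html>")` keeps only the paragraph text.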








  • Component reuse: web developers can automatically extract components from a webpage.

  • Enhancing indexers and text analyzers to increase their performance by processing only relevant information.

    • It has been measured that roughly 40-50% of the components of a webpage belong to the template.
  • Extraction of the main content of a webpage so that it can be suitably displayed on a small device such as a PDA or a mobile phone.

  • Extraction of the relevant content to make the webpage more accessible for visually impaired or blind users.







Three main different ways to solve the problem:

  • Using the textual information of the webpage (i.e., the HTML code)

  • Using the rendered image of the webpage in the browser

  • Using the DOM tree of the webpage



  • Some are based on the assumption that the webpage has a particular structure (e.g., based on table markup tags) [10].

  • Some assume that the main content text is continuous [11].

  • Some assume that the system knows a priori the format of the webpage [10].

  • Some assume that the whole website to which the webpage belongs is based on the use of some template that is repeated.

  • Some need to (randomly) load many webpages (several dozens) to compare them.



The main problem of these approaches is a big loss of generality.

  • They require previous knowledge of the webpages, parsing them in advance, or a particular structure in the webpage.

  • This is very inconvenient because modern webpages are mainly based on <div> tags, which do not need to be hierarchically organized (as in table-based design).

  • Moreover, nowadays, many webpages are automatically and dynamically generated, and thus it is often impossible to analyze them a priori.







The Document Object Model (DOM)

  • An API that provides programmers with a standard set of objects for the representation of HTML and XML documents.

  • Given a webpage, it is completely automatic to produce its associated DOM structure and vice versa.

  • The DOM structure of a given webpage is a tree where all the elements of the webpage (including scripts and CSS styles) are represented hierarchically.



  • Nodes in the DOM tree can be of two types: tag nodes and text nodes.

  • Tag nodes represent the HTML tags of an HTML document and contain all the information associated with the tags (e.g., their attributes).

  • Text nodes are always leaves in the DOM tree because they cannot contain other nodes.
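The two node kinds can be made concrete with a minimal DOM-like tree built on Python's standard html.parser. The class names (TagNode, TextNode, DOMBuilder) are illustrative, not part of any standard API; the sketch ignores malformed nesting.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID = {"br", "hr", "img", "input", "link", "meta"}

class TagNode:
    def __init__(self, tag, attrs):
        self.tag = tag
        self.attrs = dict(attrs)   # tag nodes carry the tag's attributes
        self.children = []

class TextNode:
    def __init__(self, text):
        self.text = text           # text nodes are always leaves

class DOMBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = TagNode("document", [])
        self.stack = [self.root]   # path from root to the open element

    def handle_starttag(self, tag, attrs):
        node = TagNode(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag not in VOID and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].children.append(TextNode(data.strip()))

def parse(html):
    builder = DOMBuilder()
    builder.feed(html)
    return builder.root
```

Parsing `<html><body><p class='a'>Hello</p></body></html>` yields a tree whose `p` tag node holds its `class` attribute and a single text-node leaf.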





Our method for template extraction in a nutshell:

  • Identify a set of webpages in the website topology.

    • Select those nodes that belong to the menu.
    • Use a complete subdigraph.
  • Solve conflicts between those webpages that implement different templates.

    • Establish a voting system between the webpages.
  • The template is the intersection between the initial webpage and the DOM trees in the subdigraph.

    • The intersection is computed with an Equal Top-Down Mapping between the DOM trees.
  • The three steps can be done with a linear cost with respect to the size of the DOM trees.
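The first step exploits the fact that pages reachable from a site menu typically all link to each other, so in the directed link graph they form a complete subdigraph. The sketch below (illustrative helper names, greedy growth from a seed page) shows the idea on a link graph given as a dictionary.

```python
def is_complete_subdigraph(links, nodes):
    """links: dict url -> set of outgoing urls; nodes: candidate page set.
    True when every page in `nodes` links to every other page in `nodes`."""
    return all(v in links.get(u, set())
               for u in nodes for v in nodes if u != v)

def grow_complete_subdigraph(links, seed):
    """Greedily extend {seed} with pages that keep the set complete.
    A simplification of the selection step, not the full algorithm."""
    group = {seed}
    for cand in sorted(links.get(seed, set())):
        if is_complete_subdigraph(links, group | {cand}):
            group.add(cand)
    return group
```

On a small site where home, about, and contact all link to each other but a blog post links nowhere, growing from "home" recovers exactly the menu pages.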





Mapping: a correspondence between the nodes of two trees in which each node is paired with at most one node of the other tree.

Top-Down Mapping: a mapping in which, whenever a pair of nodes belongs to the mapping, the pair formed by their parents also belongs to the mapping.

Equal Top-Down Mapping: a top-down mapping in which every pair of mapped nodes is equal (same tag, or text nodes with the same text).
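A minimal sketch of computing the intersection of two DOM trees under an equal top-down mapping: two nodes can be kept only if they are equal and their parents were kept. The Node class is illustrative, and pairing children positionally with zip is a simplification of the full mapping; each node pair is visited at most once, so the cost is linear in the size of the trees.

```python
class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def intersect(a, b):
    """Return the common top part of trees a and b, or None."""
    if a.label != b.label:      # roots must be equal to map anything below
        return None
    # Pair children positionally (a simplification); keep equal subtops.
    kept = []
    for ca, cb in zip(a.children, b.children):
        sub = intersect(ca, cb)
        if sub is not None:
            kept.append(sub)
    return Node(a.label, kept)
```

Intersecting two pages that share a header and menu but differ in their body text keeps exactly the shared top structure, which is the template candidate.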





Benchmarks: online heterogeneous webpages

  • Final evaluation set randomly selected.

  • We determined the actual template of each webpage by downloading it and manually selecting the template.

  • The DOM tree of the selected elements was then produced and used later for comparison in the evaluation.

  • The F1 metric is computed as (2*P*R)/(P+R), where P is the precision and R is the recall.
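The metric is straightforward to compute over sets of DOM nodes; in the sketch below, `retrieved` stands for the nodes the extractor returned and `relevant` for the gold-standard template (both names are illustrative).

```python
def f1_score(retrieved, relevant):
    """F1 = 2*P*R/(P+R) over two sets of DOM node identifiers."""
    if not retrieved or not relevant:
        return 0.0
    tp = len(retrieved & relevant)  # nodes correctly kept
    p = tp / len(retrieved)         # precision
    r = tp / len(relevant)          # recall
    return 2 * p * r / (p + r) if p + r else 0.0
```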





GOLD STANDARD

  • Downloading the complete website of each benchmark.

    • Companies' websites, news articles, forums, etc.
  • Four different engineers did the following independently:

    • Manually exploring the original page and the webpages accessible from it to decide what part of the webpage is the template.
    • Printing the key page on paper and marking the template.
  • The four engineers then met and decided together what the template was.

  • Each element marked on the printed page was mapped to the DOM tree of the initial page.

    • All elements in the DOM tree that did not belong to the template were assigned an HTML class non-template (i.e., we enriched the HTML code of the key page with a new class).
    • This class was later used by an algorithm that we programmed to evaluate the results obtained by our tool.









