Date: 02.11.2017





Content extraction is the process of determining what parts of a webpage contain the main textual content, thus ignoring additional context such as:

    • menus,
    • status bars,
    • advertisements,
    • sponsored information,
    • etc.








  • Component reuse: web developers can automatically extract components from a webpage.

  • Enhancing indexers and text analyzers, which perform better when they process only relevant information.

    • Measurements indicate that about 40-50% of the components of a webpage belong to the template.
  • Extracting the main content of a webpage so that it displays well on a small device such as a PDA or a mobile phone.

  • Extracting the relevant content to make the webpage more accessible to visually impaired or blind users.





There are three main ways to solve the problem:

  • Using the textual information of the webpage (i.e., the HTML code)

  • Using the rendered image of the webpage in the browser

  • Using the DOM tree of the webpage







  • Some are based on the assumption that the webpage has a particular structure (e.g., based on table markup tags) [10].

  • Some assume that the main content text is continuous [11].

  • Some assume that the system knows a priori the format of the webpage [10].

  • Some assume that the whole website to which the webpage belongs uses a template that is repeated across its pages [12].

  • Some need to (randomly) load many webpages (several dozen) in order to compare them [15].



The main problem with these approaches is a substantial loss of generality:

  • They need to know or parse the webpages in advance, or they require the webpage to have a particular structure.

  • This is very inconvenient because modern webpages are mainly built with <div> tags, which need not be hierarchically organized (unlike table-based designs).

  • Moreover, many webpages are now generated automatically and dynamically, so it is often impossible to analyze them a priori.







Using a content code vector (CCV) [13]

  • The CCV represents every character in a document, marking each one as either content or code.

  • From the CCV, the authors compute a content code ratio that measures the amount of code and content surrounding each element of the CCV.

  • With this information, they determine which parts of the document contain the main content.
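
The idea of a content code vector can be illustrated with a minimal sketch. It is not the algorithm of [13]: it assumes that "code" simply means any character between `<` and `>`, and the function names and the `radius` window parameter are hypothetical choices made for this example.

```python
def content_code_vector(html):
    """Return one entry per character: 1 for content, 0 for code.

    Assumption: a character is code when it lies inside a tag,
    i.e. between '<' and '>' inclusive.
    """
    vector, in_tag = [], False
    for ch in html:
        if ch == "<":
            in_tag = True
        vector.append(0 if in_tag else 1)
        if ch == ">":
            in_tag = False
    return vector


def content_code_ratio(vector, radius=3):
    """Fraction of content characters in a window around each position.

    `radius` is a hypothetical parameter: the window covers up to
    `radius` entries on each side of the current one.
    """
    ratios = []
    for i in range(len(vector)):
        window = vector[max(0, i - radius): i + radius + 1]
        ratios.append(sum(window) / len(window))
    return ratios


vector = content_code_vector("<b>Hi</b>")
ratios = content_code_ratio(vector, radius=1)
```

Positions whose ratio stays high over a long run are surrounded mostly by content, which is the kind of signal used to locate the main content.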



Using the tag ratios (TR) [14]

  • Given a webpage, the TR of each line is the number of non-HTML-tag characters divided by the number of HTML tags.

  • The main problem of approaches based on characters or lines (such as these two), or on words such as [16], is that they completely ignore the structure of the webpage.

  • The distribution of the code across the lines of a webpage is not necessarily the one the user expects (e.g., Google).

  • The formatting of the HTML code can be completely unbalanced (i.e., without tabs, spaces or even carriage returns), especially when it is generated automatically rather than written by a person.
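
The per-line ratio described above can be sketched as follows. This is an illustration, not the implementation of [14]: the regular expression is a rough approximation of an HTML tag, and treating a line without tags as if it had one tag (to avoid dividing by zero) is an assumption of this sketch.

```python
import re

# Rough approximation of an HTML tag; real HTML needs a proper parser.
TAG = re.compile(r"<[^>]*>")


def tag_ratios(html):
    """Return the tag ratio of each line of `html`."""
    ratios = []
    for line in html.splitlines():
        tags = len(TAG.findall(line))
        text_chars = len(TAG.sub("", line).strip())
        # Assumption: a line with no tags counts as having one tag.
        ratios.append(text_chars / max(tags, 1))
    return ratios


one_line = tag_ratios("<p>a</p><p>b</p>")
two_lines = tag_ratios("<p>a</p>\n<p>b</p>")
```

The same markup produces one ratio when it is written on a single line and two ratios when it is split, which illustrates why line-based measures depend on how the HTML happens to be formatted.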





The Document Object Model (DOM) [17]

  • An API that provides programmers with a standard set of objects for representing HTML and XML documents.

  • Given a webpage, its DOM structure can be produced fully automatically, and vice versa.

  • The DOM structure of a webpage is a tree in which all the elements of the webpage (including scripts and CSS styles) are represented hierarchically.

  • Nodes in the DOM tree are of two types, tag nodes and text nodes:

    • Tag nodes represent the HTML tags of an HTML document and contain all the information associated with the tags (e.g., their attributes).
    • Text nodes are always leaves in the DOM tree because they cannot contain other nodes.
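
The two node types can be made concrete with a small sketch built on Python's standard `html.parser` module. It is a simplified stand-in for a real DOM implementation: the `Node` and `TreeBuilder` classes, the `build_tree` helper, and the short list of void elements are assumptions of this example, and whitespace-only text is dropped for brevity.

```python
from html.parser import HTMLParser

# Elements that never have children (abbreviated list, an assumption).
VOID = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    def __init__(self, kind, value, attrs=None):
        self.kind = kind                 # "tag" or "text"
        self.value = value               # tag name or text content
        self.attrs = dict(attrs or [])   # attributes (tag nodes only)
        self.children = []               # always empty for text nodes


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("tag", "#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node("tag", tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].value == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].children.append(Node("text", data.strip()))


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


root = build_tree('<div id="menu"><a href="/">Home</a><br>Text</div>')
```

Here `root.children[0]` is the tag node for `div`, carrying its `id` attribute, and the text nodes for "Home" and "Text" are leaves.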





Our method for template extraction in a nutshell:

  • Identify a set of webpages in the website topology.

    • Select those nodes that belong to the menu.
    • Use a complete subdigraph.
  • The template is the intersection between the initial webpage and all DOM trees in the subdigraph.

    • The intersection is computed with a Top-Down Exact Mapping between the DOM trees.
  • Both steps can be done with a cost linear in the size of the DOM trees.
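
The first step can be pictured with a small sketch. The slides do not say how the complete subdigraph is found, so the greedy strategy below is only an assumption, as are the function name, the `links` dictionary (page to set of linked pages) and the `max_size` limit. A set of pages forms a complete subdigraph when every page in it links to every other one.

```python
def complete_subdigraph(links, key, max_size=4):
    """Greedily grow a set of mutually linked pages that contains `key`.

    links    -- dict mapping each page to the set of pages it links to
    key      -- the initial webpage
    max_size -- hypothetical cap on the number of pages collected
    """
    chosen = [key]
    for candidate in sorted(links.get(key, ())):
        if len(chosen) >= max_size:
            break
        if candidate == key:
            continue
        # Keep the candidate only if it links to, and is linked from,
        # every page selected so far.
        if all(candidate in links.get(page, ()) and
               page in links.get(candidate, ())
               for page in chosen):
            chosen.append(candidate)
    return chosen


links = {
    "index": {"about", "news", "ext"},
    "about": {"index", "news"},
    "news": {"index", "about"},
    "ext": set(),
}
pages = complete_subdigraph(links, "index")
```

In this toy site, "ext" is linked from the index but does not link back, so it is left out, while "about" and "news" are kept because all three pages link to one another.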



Mapping:

Top-Down Mapping:

Top-Down Exact Mapping:
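
The three notions named above can be illustrated with a minimal sketch, assuming the usual reading from the tree-mapping literature: a mapping pairs nodes of two trees one-to-one while preserving their order, a top-down mapping also requires that the parents of any paired nodes are paired, and an exact mapping only pairs nodes with identical labels. The `Node` class and both function names are hypothetical, and matching children greedily from left to right is a simplification of this sketch, not necessarily the authors' algorithm.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    label: str                       # tag name, or the text of a text node
    children: list = field(default_factory=list)


def top_down_exact_mapping(a, b):
    """Pair nodes of `a` and `b` that have equal labels and paired parents."""
    if a.label != b.label:
        return []
    pairs = [(a, b)]
    start = 0  # children of b before this index are already used
    for child in a.children:
        for j in range(start, len(b.children)):
            sub = top_down_exact_mapping(child, b.children[j])
            if sub:
                pairs.extend(sub)
                start = j + 1
                break
    return pairs


def template_of(key, others):
    """Ids of the nodes of `key` that are mapped in every other tree."""
    kept = None
    for other in others:
        mapped = {id(x) for x, _ in top_down_exact_mapping(key, other)}
        kept = mapped if kept is None else kept & mapped
    return kept or set()


def labels_in_preorder(node, kept):
    """Labels of the kept nodes, listed in document order."""
    own = [node.label] if id(node) in kept else []
    return own + [l for c in node.children for l in labels_in_preorder(c, kept)]


key = Node("html", [Node("nav", [Node("Home")]),
                    Node("main", [Node("Article A")]),
                    Node("footer")])
other = Node("html", [Node("nav", [Node("Home")]),
                      Node("main", [Node("Article B")]),
                      Node("footer")])
template = labels_in_preorder(key, template_of(key, [other]))
```

The two pages share everything except the article text, so the text node "Article A" is the only node of the key page left out of the intersection.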



Experiments

  • Benchmarks: heterogeneous online webpages

    • Domains with different layouts and page structures
    • Company websites, news articles, forums, etc.
  • The final evaluation set was selected randomly.

  • We determined the actual template of each webpage by downloading it and manually selecting the template.

  • The DOM tree of the selected elements was then produced and used later for comparison during evaluation.

  • The F1 metric is computed as (2*P*R)/(P+R), where P is the precision and R is the recall.
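
As a worked example of the F1 formula, the sketch below computes precision, recall and F1 from two sets of node identifiers. The function name and the representation of nodes as set elements are assumptions made for illustration.

```python
def precision_recall_f1(retrieved, gold):
    """Compare the nodes returned by a tool with the manually marked ones."""
    retrieved, gold = set(retrieved), set(gold)
    hits = len(retrieved & gold)
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# The tool returns four nodes; three of them are in the gold standard.
p, r, f1 = precision_recall_f1({"nav", "footer", "header", "ad"},
                               {"nav", "footer", "header"})
```

Here P = 0.75 and R = 1.0, so F1 = (2 * 0.75 * 1.0) / 1.75, which is about 0.857.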




GOLD STANDARD

  • The complete website of each benchmark was downloaded.

    • Company websites, news articles, forums, etc.
  • Four engineers independently did the following:

    • Manually explored the original page and the webpages accessible from it to decide which part of the webpage is the template.
    • Printed the key page on paper and marked the template.
  • The four engineers then met and jointly decided what the template was.

  • Each element marked on the printed page was mapped to the DOM tree of the initial page.

    • All elements in the DOM tree that did not belong to the template were assigned an HTML class, non-template (i.e., we enriched the HTML code of the key page with a new class).
    • This class was later used by an algorithm that we programmed to evaluate the results obtained by our tool.


Experiments

  • Using CNR:

    • The average recall is 94.39 and the average precision is 74.08.
  • Using tag ratios:

    • The average recall is 92.72 and the average precision is 71.93.
  • Interesting phenomenon (property): either the recall, the precision, or both are 100%.







Template extraction from Wikipedia

  • Recall 100%, precision 100% (50% of the time).



Template extraction from FilmAffinity

  • Recall >100% (sometimes forced by the designer).



Template extraction from FilmAffinity

  • Recall <100% (6% of the time).























