Boilerplate removal - existing approaches
For a given Web page, we first wish to determine whether it contains meaningful content (i.e., longer, informative, not necessarily contiguous text, resembling a newspaper article). Then, we wish to extract the main content without the surrounding or interleaved boilerplate. Besides the main goal of separating the main content from the boilerplate, we would also like to distinguish between various subtypes of the main content. These can be headlines, user comments, related content, supplemental content, and the like. The extracted text should retain all of the original formatting and punctuation marks, with the exception of HTML tags, both for displaying the content to the user and for information extraction (many information extraction steps, e.g., sentence splitting, rely on punctuation marks).
The first method that comes to mind when skimming through the HTML of a group of Web pages from the same source is to handcraft a rule that separates the meaningful text from the boilerplate. Such a rule may provide the desired accuracy for a single Web page template, but it quickly becomes obsolete when the page template changes or when dealing with many different Web sources. Besides being impractical in the long term, this manual task is also relatively expensive.
To overcome these issues, Web pages from many different sources should be used as a training dataset for machine learning methods, in order to automatically discover rules (models) that are accurate yet general enough to suit various Web sources.
Looking at several news article Web pages, it becomes obvious that different semantic parts usually occupy the same place. The main article content is in the middle, headlines are above the main content, user comments are at the end, and the unwanted advertisements are on the sides. The Vision-based Page Segmentation (VIPS) technique (Cai, Yu, Wen, & Ma, 2003) makes use of page layout features to obtain a partitioning of the Web page. A tree of HTML blocks is built according to their position in the Web page.
Most methods for boilerplate removal rely on dividing the Web page into contiguous blocks. This division is implied by the HTML tree structure, where textual content is enclosed in blocks by the tags. On such a sequence of blocks, features can be constructed and existing methods for finding and labelling sequences can be applied.
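The division into blocks described above can be sketched with Python's standard `html.parser` module. This is only a minimal illustration, not the implementation of any cited method: the names `BlockSplitter` and `split_blocks` are hypothetical, every tag is assumed to end the current block, and the contents of `script` and `style` elements are assumed to be unwanted.

```python
from html.parser import HTMLParser
from typing import List


class BlockSplitter(HTMLParser):
    """Collects the text runs found between consecutive HTML tags."""

    def __init__(self) -> None:
        super().__init__()
        self.blocks: List[str] = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Normalise whitespace; keep only non-empty, visible text.
        text = " ".join(data.split())
        if text and self._skip_depth == 0:
            self.blocks.append(text)


def split_blocks(html: str) -> List[str]:
    """Return the sequence of text blocks in document order."""
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.blocks
```

The resulting list preserves document order, so it can serve directly as the sequence on which per-block features are computed.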
The most basic way of annotating the main content in a Web page is to mark its beginning and its end. The method of maximum subsequence segmentation (Pasternack & Roth, 2009) finds this beginning and end by maximising the sum of the probabilities assigned to individual tokens (i.e., words, symbols, and tags). The probability that a token belongs to the article is estimated by a local (Naive Bayes) classifier trained on a dataset of HTML Web pages in which the starts and ends of the news articles are marked. To be extracted accurately, the article text should be contiguous, coherent, and more than about eight sentences in length. Undesired content inside the identified article block is removed using simple heuristics. The article boundary detection technique proved too coarse for content other than contiguous article text. Besides its fairly good accuracy, a notable advantage of the method is its linear time complexity, which makes it suitable for a fast real-time pipeline. An on-line demonstration of this method is available at http://took.cs.uiuc.edu/MSS/default.aspx.
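The core search step of maximum subsequence segmentation can be illustrated with a linear-time scan (Kadane's algorithm). The sketch below assumes that the per-token probabilities are already available and that each token's score is its probability minus a threshold of 0.5, so that unlikely tokens lower the sum; without such a shift, the whole page would trivially maximise the sum. The function name `max_subsequence` and the `threshold` parameter are hypothetical, and the classifier itself is not shown.

```python
from typing import List, Tuple


def max_subsequence(probs: List[float], threshold: float = 0.5) -> Tuple[int, int]:
    """Return the half-open token range (start, end) whose summed
    scores (probability - threshold) are maximal.

    Returns (0, 0) if no token scores above the threshold.
    """
    best_sum = 0.0
    best_range = (0, 0)
    current_sum = 0.0
    current_start = 0
    for i, p in enumerate(probs):
        # A non-positive running sum can never help a later range,
        # so start a fresh candidate range at this token.
        if current_sum <= 0.0:
            current_sum = 0.0
            current_start = i
        current_sum += p - threshold
        if current_sum > best_sum:
            best_sum = current_sum
            best_range = (current_start, i + 1)
    return best_range
```

Each token is visited once, which matches the linear complexity mentioned above. A short low-probability gap inside the article (for example an inline advertisement) does not necessarily split the range, as long as the tokens after it outweigh the gap.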
More elaborate text extraction can be achieved by dividing a Web page into smaller blocks and classifying each block separately. The method based on shallow text features (Kohlschütter, Fankhauser, & Nejdl, 2010) extracts each block of text bounded by an opening or closing HTML tag. Such granularity, with a proper choice of features, allows rather accurate extraction not only of the main article text, which is not necessarily contiguous, but also of other valuable content such as headlines and user comments, which differ subtly. The article promotes rather simple features for boilerplate detection, namely text block features such as the number of words per block, text density, and link density. This choice is backed up by accurate classification results obtained with a simple decision tree model. This is the method we chose to implement. An on-line demonstration of this method is available at http://boilerpipe-web.appspot.com/.
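To make the three block features concrete, the following sketch computes them for a single text block. It is a simplified reading rather than the published implementation: link density is taken as the fraction of words that lie inside anchor tags, and text density as words per line after wrapping the text at 80 characters. The `TextBlock` type, the function names, and the thresholds in `looks_like_content` are hypothetical and chosen only for illustration; in the method itself the decision rule is learned as a decision tree from annotated data.

```python
import textwrap
from dataclasses import dataclass


@dataclass
class TextBlock:
    text: str          # visible text of the block
    anchor_words: int  # number of words that appear inside <a> tags


def num_words(block: TextBlock) -> int:
    return len(block.text.split())


def link_density(block: TextBlock) -> float:
    """Fraction of the block's words that are link text."""
    n = num_words(block)
    return block.anchor_words / n if n else 0.0


def text_density(block: TextBlock, width: int = 80) -> float:
    """Average number of words per line after wrapping at `width` columns."""
    lines = textwrap.wrap(block.text, width=width)
    return num_words(block) / len(lines) if lines else 0.0


def looks_like_content(block: TextBlock,
                       min_words: int = 10,
                       max_link_density: float = 0.33) -> bool:
    # Illustrative hand-set thresholds, NOT the learned decision tree.
    return (num_words(block) >= min_words
            and link_density(block) <= max_link_density)
```

A navigation bar typically yields few words and a link density close to 1, whereas a paragraph of article text yields many words and a link density close to 0, which is why such simple features already separate the two fairly well.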