Existing relevant semantic resources

There are very few existing ontologies in the area of finance. Probably the best-known example is Eddy Vanderlinden's ontology on financial instruments, involved parties, processes and procedures in securities handling (available from http://www.fadyart.com/ontologies/data/Finance.owl). Descriptions of ontologies containing financial instruments can be found in works by Thomas Locke Hobbs (description available from http://www.isi.edu/~hobbs/open-domain/) and Mike Bennett of Hypercube Ltd (description available from www.hypercube.co.uk/docs/ontologyexploration.doc).

Ontology learning

An ontology is a formal representation of a set of concepts within a domain and the relationships between those concepts. It is used to reason about the properties of that domain, and may be used to define the domain. In the project, an ontology will be employed for supporting the information extraction task (i.e., the construction of high-level features used in decision support models).
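To make the idea of concepts, relationships, and reasoning concrete, here is a minimal sketch in Python. It is not the project's ontology: the concept names (Stock, Equity, Bond, FinancialInstrument) and the helper functions are hypothetical, and a real ontology would normally be expressed in a language such as OWL rather than a dictionary.

```python
# Hypothetical toy ontology: each concept maps to its direct parent concept.
SUBCLASS_OF = {
    "Stock": "Equity",
    "Equity": "FinancialInstrument",
    "Bond": "FinancialInstrument",
}


def ancestors(concept):
    """Return all broader concepts, nearest first, by following subclass links."""
    result = []
    while concept in SUBCLASS_OF:
        concept = SUBCLASS_OF[concept]
        result.append(concept)
    return result


def is_a(concept, candidate):
    """Simple reasoning step: is `concept` the same as, or a kind of, `candidate`?"""
    return concept == candidate or candidate in ancestors(concept)
```

An information extraction step could use a check such as is_a("Stock", "FinancialInstrument") to turn a mention of a specific term into a higher-level feature.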

I wonder who will read this article: a person or a machine :-). It is amazing how much can be achieved with computer technology and algorithms when you feed them the right information.

 

Spam and opinion spam detection

There is little quality control over the rapidly increasing amount of information available on the Internet, especially over the user-generated content found on forums, blogs, and other Web sites where users can post comments. Although it is recognised that user-generated content contains valuable information for a variety of applications, the lack of quality control attracts spammers, who have found many ways to profit from spamming. Some even make a living from it.

Detecting near-duplicates in document streams

Much of the relevant Web content is duplicated. News stories are an everyday example. Exact duplicates can be identified by relatively simple hash-like methods. More problematic is near-duplicate content, which differs only in subtle details (such as copyright notices and advertisements) that are irrelevant for most further text processing. The first issue to consider in designing a system for near-duplicate detection arises from the ever-growing size of the Web. Such a system should scale to several billion indexed Web pages and also support a high throughput rate in a stream-based setting.
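The difference between exact and near-duplicate detection can be sketched as follows. This is an illustrative example, not the method used in the project: it assumes word 3-shingles compared with Jaccard similarity, and all function names are hypothetical. Pairwise comparison like this does not scale to billions of pages; large systems typically replace it with compact fingerprints such as MinHash or SimHash.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting changes are ignored."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def exact_fingerprint(text):
    """Hash of the normalized text; equal fingerprints indicate exact duplicates."""
    return hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()


def shingles(text, k=3):
    """Set of overlapping k-word sequences (shingles) in the text."""
    words = normalize(text).split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of the shingle sets of two texts, between 0.0 and 1.0."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)
```

Two copies of a news story that differ only in an appended copyright notice get different hashes but a high Jaccard score, so a similarity threshold (chosen empirically) can flag them as near-duplicates.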

 

Language detection

Most text mining and natural-language processing (NLP) tools are language-specific. In text mining, stemming (or lemmatisation) and lists of stop words depend on the language, and in NLP, POS tagging, chunking, and deep parsing are all language-dependent technologies. The first stage in Web content mining is usually gathering HTML pages from the Web (e.g., Web crawling or fetching pages through RSS feeds). The cleaning steps that follow need to take care of boilerplate removal, language detection, and code-page detection. This ensures that irrelevant content and HTML tags are removed (boilerplate removal), special characters are encoded correctly (code-page detection), and documents which cannot be handled by the selected language-dependent analysis tools are removed from the corpus (language detection and filtering).
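The language detection and filtering step can be illustrated with a small sketch. It assumes a simple stop-word counting heuristic, which is only one of several possible techniques (character n-gram models are a common alternative). The tiny stop-word lists and the function names are illustrative assumptions, not the project's resources.

```python
import re

# Tiny illustrative stop-word lists; real lists are far longer.
STOP_WORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that", "it"},
    "de": {"der", "die", "und", "das", "ist", "nicht", "zu", "ein"},
    "es": {"el", "la", "los", "y", "de", "que", "es", "en"},
}


def detect_language(text):
    """Return the language whose stop words occur most often, or None if none match."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    if not tokens:
        return None
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOP_WORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def filter_corpus(documents, keep="en"):
    """Keep only the documents detected as the language the downstream tools support."""
    return [doc for doc in documents if detect_language(doc) == keep]
```

Short texts and closely related languages are the weak spot of this heuristic, so a production pipeline would likely need a more robust detector.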

Features for classifying text blocks extracted from Web pages

Features for classifying text blocks extracted from Web pages can be defined on several different levels. Site features are specific to all documents originating from the same Web source. These are usually omitted as they may lead to over-fitting to the specific source. Structural features originate from the HTML structure, more specifically from the HTML tags preceding and following the text block. Specific CSS classes and sequences of HTML tags may lead to over-fitting and are therefore not considered. When examining text blocks, we extract language-independent, higher-level shallow text features. These features are word-oriented and also include simple heuristics such as counts of certain types of characters (for example, digits and uppercase letters). Nearly as significant are densitometric features, primarily link density and text density.
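A few of the shallow text and densitometric features can be sketched as follows. The exact definitions are assumptions made for illustration: link density is taken as the share of words inside anchor tags, and text density as the number of words per line after wrapping the block at a fixed width. The function and parameter names are hypothetical.

```python
import math


def block_features(text, anchor_text="", wrap_width=80):
    """Compute simple language-independent features for one text block.

    `anchor_text` is the part of the block that sits inside <a> tags.
    """
    words = text.split()
    n_words = len(words)
    n_chars = len(text)
    n_anchor_words = len(anchor_text.split())
    # Number of lines the block would occupy if wrapped at `wrap_width` characters.
    n_lines = max(1, math.ceil(n_chars / wrap_width))
    return {
        "num_words": n_words,
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "digit_ratio": sum(ch.isdigit() for ch in text) / n_chars if n_chars else 0.0,
        "uppercase_ratio": sum(ch.isupper() for ch in text) / n_chars if n_chars else 0.0,
        "link_density": n_anchor_words / n_words if n_words else 0.0,
        "text_density": n_words / n_lines,
    }
```

A navigation menu typically yields a link density close to 1 and few words, whereas article text yields many words and a link density close to 0.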

Boilerplate removal - existing approaches

For a given Web page, we first wish to determine whether it contains some meaningful content (i.e., longer, informative, not necessarily contiguous text, resembling a newspaper article). Then, we wish to extract the main content without the surrounding or interleaving boilerplate. Besides the main goal of separating the main content from the boilerplate, we would also like to differentiate between various subtypes of the main content. These can be headlines, user comments, related content, supplemental content, and the like.
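The two steps described above (deciding whether a page has meaningful content, then extracting it) can be sketched with a simple rule-based filter. This is an illustrative baseline, not one of the surveyed approaches: the thresholds and function names are hypothetical, and each block is assumed to arrive as a (text, link_density) pair.

```python
def is_content_block(text, link_density, min_words=15, max_link_density=0.3):
    """Heuristic: content blocks are reasonably long and contain few links."""
    return len(text.split()) >= min_words and link_density <= max_link_density


def main_content(blocks, min_total_words=50):
    """Return the content blocks of a page, or [] if the page has too little text.

    `blocks` is a list of (text, link_density) pairs in document order.
    """
    kept = [text for text, density in blocks if is_content_block(text, density)]
    total_words = sum(len(text.split()) for text in kept)
    return kept if total_words >= min_total_words else []
```

Because the kept blocks need not be adjacent, this also handles main content that is interleaved with boilerplate. Distinguishing subtypes such as headlines or user comments would require a classifier with more than two classes.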

Every research effort in text classification must demonstrate that it improves classification accuracy and performance in real-world settings. The challenge for sentiment analysis in FIRST is that real-world blogs cannot be used directly for evaluation purposes: they lack labels and therefore do not constitute a “gold standard” corpus. There are two ways to overcome this problem. First, we could search for a suitable existing corpus. Unfortunately, such a corpus does not exist for the use cases of FIRST. Second, we could create a new corpus.

Boilerplate removal

The Web offers a freely available, almost unlimited amount of heterogeneous data. Different kinds of information can be extracted from an average HTML page. Among the most informative types of pages are news articles and blog posts. Most often, it is the main content (i.e., article text, or any meaningful text) of the HTML page that we are interested in. The undesired content of the Web page is called boilerplate (a reusable text or layout formulation commonly found in newspaper articles) and consists mostly of scripts, styles, advertisements, and similar elements. It is also desirable to distinguish between the different types of relevant content, such as the article body, user comments, and headlines.
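As a first step towards separating main content from boilerplate, the scripts and styles can be dropped and the remaining visible text split into blocks. The sketch below uses Python's standard html.parser module. The class name, the choice of block-level tags, and the helper function are illustrative assumptions, and removing advertisements or navigation would still require a classification step on the resulting blocks.

```python
from html.parser import HTMLParser


class TextBlockExtractor(HTMLParser):
    """Collect visible text blocks, ignoring the contents of <script> and <style>."""

    SKIP_TAGS = {"script", "style"}
    BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "li", "td", "article", "section"}

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.flush()  # a new block-level element ends the previous block

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._buffer.append(data)

    def flush(self):
        """Store the buffered text as one block, with whitespace collapsed."""
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []


def extract_blocks(html):
    """Return the list of visible text blocks found in an HTML string."""
    parser = TextBlockExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing text outside block-level tags
    return parser.blocks
```

Each returned block can then be labelled, for example as article body, headline, user comment, or boilerplate.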