Spam and opinion spam detection
There is little quality control over the rapidly growing amount of information available on the Internet, especially over the user-generated content found on forums, blogs, and other Web sites where users can post comments. Although user-generated content is recognised as a valuable source of information for a variety of applications, the lack of quality control attracts spammers, who have found many ways to profit from spamming; some even make a living from it.
There are different types of spam, each with its own target group and goals. The best known is e-mail spam, which takes the form of unsolicited e-mail messages, often of a commercial nature, advertising products or broadcasting political or social commentary. Spam filters are widely used: they are built into users' e-mail programs and/or mail servers and incorporate techniques for detecting keywords, templates, sentence structures, suspicious attachments, etc., that are typical of spam e-mails. Because spammers continuously find ways to bypass spam filters, ongoing research in e-mail spam filtering is necessary (Sahami, Dumais, Heckerman, & Horvitz, 1998; Li, Zhong, & Liu, 2006; Fette, Sadeh-Koniecpol, & Tomasic, 2007).
Another type of spam is Web spam, whose objective is to make search engines rank certain Web pages higher. This objective is mainly achieved in two ways: content spam and link spam. Link spam is frequent on forums and on Web sites that allow users to leave comments.
Content spam adds irrelevant or only remotely relevant words to target pages in order to fool search engines into ranking those pages higher. Research papers dealing with Web spam include (Gyongyi & Garcia-Molina, 2004; Ntoulas, Najork, Manasse, & Fetterly, 2006; Wu, Goel, & Davison, 2006; Castillo et al., 2006; Wu & Davison, 2006).
Opinion spam, on the other hand, gives an untruthful opinion on a certain topic or product. It can be found among reviews and commentaries on e-commerce Web sites, news Web sites, review Web sites, etc. The spammers try to promote or damage the reputation of people, businesses, products, or services by posting untruthful opinions. A lot of work has been done on analysing the sentiment of user-generated online content, but its focus has been only on whether the user's opinion is positive or negative (Dave, Lawrence, & Pennock, 2003; Pang, Lee, & Vaithyanathan, 2002; Popescu & Etzioni, 2005; Hu & Liu, 2004). Opinion spam, however, has not yet been studied extensively. Existing studies focus on consumer reviews of products as a place to look for opinion spam.
One of the research papers on this topic (Jindal & Liu, 2008) divides spam reviews into three types. Type 1 reviews are untruthful reviews, which deliberately give undeserved positive reviews to a product in order to promote it and/or unjust or malicious negative reviews to other products in order to damage their reputation. Type 2 reviews are reviews of brands only, i.e., reviews that do not comment on a specific product but only on the brand, manufacturer, or seller of the product. Although reviews of this type may be useful, the authors consider them spam because they do not target specific products and are often biased. Type 3 reviews are non-reviews, which fall roughly into two main subtypes: (1) advertisements and (2) other irrelevant texts containing no opinions (e.g., questions, answers, and random texts).
Type 2 and Type 3 reviews can be detected with standard machine learning techniques for classification, trained on manually labelled spam and non-spam reviews, because these two types of spam can be recognised by a human reader. The problem of detecting them therefore reduces to the task of finding effective features for constructing the classification model.
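To make the supervised setting concrete, the following is a minimal sketch of a classifier trained on manually labelled reviews. It assumes bag-of-words features and a multinomial naive Bayes model with add-one smoothing; both are illustrative choices and not the features or learner used in the cited work. The class name `NaiveBayesReviewClassifier`, the labels, and the toy training reviews are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesReviewClassifier:
    """Multinomial naive Bayes over bag-of-words features (illustrative only)."""

    def fit(self, texts, labels):
        # Count documents per class and word occurrences per class.
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            # Add-one smoothing keeps unseen words from zeroing the score.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical, hand-labelled toy data (Type 3 advertisements vs. real reviews).
TRAIN = [
    ("Buy cheap watches now, visit our shop for discount offers", "spam"),
    ("Click this link for free discount coupons and offers", "spam"),
    ("The battery lasts two days and the screen is sharp", "non-spam"),
    ("The camera takes blurry photos and the battery drains fast", "non-spam"),
]
texts, labels = zip(*TRAIN)
clf = NaiveBayesReviewClassifier().fit(texts, labels)
```

In practice the word counts would be replaced or extended by whatever features prove effective, which is exactly the feature-engineering task described above.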
Detecting Type 1 spam, on the other hand, proves much harder, since manual labelling by simply reading reviews is very hard, if not impossible. However, duplicate and near-duplicate reviews can be used as guidance for labelling a review as spam, and spam detection models built from data labelled in this way predict likely harmful reviews to a good extent.
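One common way to find duplicate and near-duplicate reviews is to compare their sets of word n-grams (shingles) with the Jaccard similarity. The sketch below illustrates this idea only; the bigram size, the 0.9 threshold, and the function names are assumptions made for illustration and are not taken from the text above.

```python
import re
from itertools import combinations


def shingles(text, n=2):
    """Return the set of word n-grams (shingles) of a review."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined here as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(reviews, threshold=0.9, n=2):
    """Return (i, j, similarity) for review pairs whose similarity reaches the threshold."""
    sets = [shingles(r, n) for r in reviews]
    pairs = []
    # Pairwise comparison is quadratic; acceptable for a sketch, not for large corpora.
    for i, j in combinations(range(len(reviews)), 2):
        sim = jaccard(sets[i], sets[j])
        if sim >= threshold:
            pairs.append((i, j, sim))
    return pairs
```

Pairs returned by `near_duplicate_pairs` could then serve as candidate spam labels for training, subject to a manual check of whether the duplication is actually spam or legitimate reuse.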
Another approach to opinion spam detection in consumer reviews uses language modelling techniques. For example, (Lai, Xu, Lau, Li, & Jing, 2010) present a computational model based on KL divergence and probabilistic language modelling as an efficient approach to the detection of untruthful reviews. Similarly, (Lai, Xu, Lau, Li, & Song, 2010) propose an inferential language model equipped with high-order concept association knowledge and report that it detects untruthful reviews effectively when compared with baseline methods.
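As an illustration of the measure itself, the sketch below computes the KL divergence between unigram language models of two reviews. How the cited papers build and apply their models is not described here, so the unigram assumption, the additive smoothing parameter `alpha`, and all function names are hypothetical. A low divergence indicates that two reviews use very similar wording.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def unigram_model(text, vocab, alpha=1.0):
    """Additively smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(tokenize(text))
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions defined over the same keys."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def review_divergence(review_a, review_b, alpha=1.0):
    """KL divergence from the model of review_a to the model of review_b."""
    # A shared vocabulary plus smoothing guarantees q[w] > 0 for every word.
    vocab = set(tokenize(review_a)) | set(tokenize(review_b))
    p = unigram_model(review_a, vocab, alpha)
    q = unigram_model(review_b, vocab, alpha)
    return kl_divergence(p, q)
```

Note that KL divergence is asymmetric, so `review_divergence(a, b)` and `review_divergence(b, a)` generally differ; a symmetric variant would be needed if the order of the two reviews should not matter.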
An empirical study of online consumer review spam is presented in (Lau, Liao, & Xu, 2010). The study proposes an effective methodology for detecting untruthful consumer reviews, which in turn enables an econometric analysis of the impact of fake reviews on product sales.
A further group of research papers focuses instead on detecting suspicious reviewers who are likely to produce untruthful reviews. In (Lim, Nguyen, Jindal, Liu, & Lauw, 2010), product review spammers are detected by a scoring method that measures the degree of spam for each reviewer. In (Jindal, Liu, & Lim, 2010), unusual review patterns are identified that can represent atypical behaviour of reviewers. The task is to find unexpected rules or rule groups; these rules describe behavioural patterns of reviewers that deviate from what is expected of a truthful reviewer and thus indicate spam activities.
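The scoring methods of the cited papers are not specified here. As a stand-in, the following sketch scores each reviewer by a single hypothetical behavioural signal: how far, on average, the reviewer's ratings deviate from the mean rating of each reviewed product. The function name, the 1 to 5 rating scale implied by `max_deviation`, and the choice of signal are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def rating_deviation_scores(reviews, max_deviation=4.0):
    """Score each reviewer in [0, 1] by mean absolute deviation from product averages.

    `reviews` is a list of (reviewer, product, rating) tuples.
    """
    # Average rating per product (the reviewer's own rating is included).
    by_product = defaultdict(list)
    for _, product, rating in reviews:
        by_product[product].append(rating)
    product_mean = {p: mean(r) for p, r in by_product.items()}

    # Absolute deviation of each review from its product's average.
    deviations = defaultdict(list)
    for reviewer, product, rating in reviews:
        deviations[reviewer].append(abs(rating - product_mean[product]))

    # Normalise by the largest possible deviation on the assumed rating scale.
    return {r: min(mean(d) / max_deviation, 1.0) for r, d in deviations.items()}
```

A high score only marks a reviewer as atypical; a realistic scoring method would combine several such signals before treating a reviewer as a likely spammer.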
All in all, not much work has been done in the area of opinion spam detection, and it is not yet clear which approach will be adopted in FIRST. Most likely, we will first analyse duplicates in the acquired data in order to better understand their nature, frequency, quantity, and purpose (e.g., spamming vs. "adoption" of content by other sources).