Detecting near-duplicates in document streams
Much of the relevant Web content is duplicated; news stories are an everyday example. Exact duplicates can be identified by relatively simple hashing methods. More problematic is near-duplicate content, which differs only in subtle details (such as copyright notices and advertisements) that are irrelevant for most further text processing. The first issue to consider when designing a system for near-duplicate detection arises from the ever-growing size of the Web: such a system should scale to several billion indexed Web pages and sustain a high throughput rate in a stream-based setting.
Ideally, there would be a mapping (hashing) from a document, viewed as an ordered list of words, to a unique, small, easily comparable number or symbol, such that the original inter-document similarity is retained among the mapped values. The simhash fingerprinting technique (Charikar, 2002) has exactly this locality-sensitivity property. It maps a document to a 64-bit fingerprint, with the Hamming distance used as the distance measure: the Hamming distance between the fingerprints of two completely different documents is large, while for two similar documents it is small. A near-duplicate of a document is any document whose fingerprint differs from the given one in at most k bits (e.g., k = 3). The problem of looking up existing near-duplicates is thus reduced to solving the Hamming distance problem: quickly finding at least one fingerprint that differs from the given one in at most k bits.
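As a minimal sketch of the near-duplicate test on fingerprints, the Hamming distance is simply the popcount of the XOR of the two 64-bit values (function names here are illustrative, not from the cited papers):

```python
def hamming_distance(a: int, b: int) -> int:
    # Number of bit positions in which the two fingerprints differ
    return bin(a ^ b).count("1")

def is_near_duplicate(fp1: int, fp2: int, k: int = 3) -> bool:
    # Two documents are near-duplicates if their fingerprints
    # differ in at most k bits
    return hamming_distance(fp1, fp2) <= k
```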
In our preliminary implementation, we compute a 64-bit simhash fingerprint from the bag-of-words representation of a document; in general, other document-specific features could be used instead of words. The final 64-bit document fingerprint is obtained by combining the 64-bit hash codes of the individual words, as described in (Manku, Jain, & Sarma, 2007).
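A sketch of this combining step, following Charikar's scheme (the choice of MD5-based word hashing and unit word weights are assumptions for illustration, not details from the implementation):

```python
import hashlib

def word_hash(word: str) -> int:
    # Hypothetical 64-bit word hash; any well-mixed 64-bit hash would do
    return int.from_bytes(hashlib.md5(word.encode()).digest()[:8], "big")

def simhash(words: list[str]) -> int:
    # For each of the 64 bit positions, sum +1 when the word's hash has
    # that bit set and -1 otherwise; the fingerprint bit is 1 where the
    # running sum ends up positive.
    v = [0] * 64
    for w in words:
        h = word_hash(w)
        for i in range(64):
            v[i] += 1 if (h >> i) & 1 else -1
    fp = 0
    for i in range(64):
        if v[i] > 0:
            fp |= 1 << i
    return fp
```

Because similar bags of words push the 64 running sums in mostly the same directions, their fingerprints agree on most bit positions, which is precisely the locality-sensitivity property exploited above.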
Because of the large number of crawled documents (often in the billions), we aim to look up near-duplicates (i.e., solve the Hamming distance problem) in sublinear time. By storing permuted fingerprints in several look-up tables, fast look-ups can be achieved without significantly increasing space complexity. For example, for k = 3, each 64-bit fingerprint is split into 5 blocks of 13, 13, 13, 13, and 12 bits, respectively. Since at most k = 3 differing bits can fall into at most 3 blocks, any near-duplicate must agree with the query fingerprint exactly on at least 2 of the 5 blocks. There are 10 ways to choose 2 blocks (where 2 = 5 − k) out of these 5, so we build 10 look-up tables, each indexing the fingerprints by the corresponding two blocks. The tables are implemented as binary trees with 25 or 26 levels each (i.e., the combined size of the two blocks). When searching the existing fingerprints for possible near-duplicates, each of these ten binary trees is traversed. Candidates for near-duplicates are discovered quickly in this way and then verified, i.e., checked whether they indeed differ from the query in at most 3 bits. It can be shown that the number of identified candidates remains manageably small even with billions of fingerprints in the tables.
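The block-based indexing scheme can be sketched as follows. This is a simplified illustration that replaces the binary trees with hash tables keyed on the concatenated block bits; the block boundaries and class names are assumptions, and the candidate-then-verify logic mirrors the description above:

```python
from collections import defaultdict
from itertools import combinations

# Block boundaries for a 64-bit fingerprint: 13 + 13 + 13 + 13 + 12 bits
BLOCKS = [(0, 13), (13, 26), (26, 39), (39, 52), (52, 64)]

def block_key(fp: int, blocks) -> int:
    # Concatenate the bits of the chosen blocks into one table key
    key = 0
    for start, end in blocks:
        width = end - start
        key = (key << width) | ((fp >> start) & ((1 << width) - 1))
    return key

class NearDuplicateIndex:
    def __init__(self, k: int = 3):
        self.k = k
        # One table per way of choosing 5 - k = 2 blocks out of 5: C(5, 2) = 10
        self.block_pairs = list(combinations(BLOCKS, 5 - k))
        self.tables = [defaultdict(list) for _ in self.block_pairs]

    def add(self, fp: int) -> None:
        for table, pair in zip(self.tables, self.block_pairs):
            table[block_key(fp, pair)].append(fp)

    def query(self, fp: int) -> set:
        # A fingerprint within k bits of fp agrees with fp exactly on at
        # least 5 - k blocks, so some table lists it as a candidate;
        # candidates are then verified by an exact Hamming-distance check.
        found = set()
        for table, pair in zip(self.tables, self.block_pairs):
            for cand in table.get(block_key(fp, pair), []):
                if cand not in found and bin(cand ^ fp).count("1") <= self.k:
                    found.add(cand)
        return found
```

Each query probes only 10 tables by exact key, so the work per look-up is dominated by verifying a small candidate set rather than scanning all stored fingerprints.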
The paper by Manku, Jain, and Sarma (2007) also describes other configurations of the look-up tables, as well as a fingerprint compression technique and batch queries.