Thursday, April 24, 2014
Web Spam
These are web spam-related research papers.

"Adversarial Information Retrieval addresses tasks such as gathering, indexing, filtering, retrieving and ranking information from collections wherein a subset has been manipulated maliciously. On the Web, the predominant form of such manipulation is "search engine spamming" or spamdexing, i.e., malicious attempts to influence the outcome of ranking algorithms, aimed at getting an undeserved high ranking for some items in the collection. There is an economic incentive to rank higher in search engines, considering that a good ranking on them is strongly correlated with more traffic, which often translates to more revenue."
Microsoft - In this paper, we continue our investigations of "web spam": the injection of artificially-created pages into the web in order to influence the results from search engines, to drive traffic to certain pages for fun or profit. This paper considers some previously-undescribed techniques for automatically detecting spam pages, and examines the effectiveness of these techniques in isolation and when aggregated using classification algorithms. When combined, our heuristics correctly identify 2,037 (86.2%) of the 2,364 spam pages (13.8%) in our judged collection of 17,168 pages, while misidentifying 526 spam and non-spam pages (3.1%).
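The aggregation idea described in this abstract can be sketched as follows. This is a hypothetical toy example, not the paper's actual method: the feature names, thresholds, and the simple voting rule standing in for a learned classifier are all illustrative.

```python
# Hypothetical sketch: combine several content heuristics into one spam verdict.
# Feature names and thresholds are illustrative, not the paper's actual values.

def heuristic_votes(page):
    """Return one boolean vote per heuristic for a page (a dict of raw stats)."""
    return [
        page["word_count"] > 1500,            # unusually long page
        page["popular_word_fraction"] > 0.7,  # stuffed with popular query terms
        page["title_word_count"] > 15,        # keyword-stuffed title
    ]

def is_spam(page, min_votes=2):
    """Aggregate heuristics: flag as spam when enough of them agree."""
    return sum(heuristic_votes(page)) >= min_votes

stuffed = {"word_count": 4000, "popular_word_fraction": 0.9, "title_word_count": 22}
normal  = {"word_count": 600,  "popular_word_fraction": 0.2, "title_word_count": 6}
print(is_spam(stuffed), is_spam(normal))  # → True False
```

In the paper this aggregation is done by trained classifiers rather than a fixed vote; the sketch only shows how individually weak heuristics can be combined into a single decision.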
We present an algorithm, WITCH, that learns to detect spam hosts or pages on the Web. Unlike most other approaches, it simultaneously exploits the structure of the Web graph as well as page contents and features. The method is efficient, scalable, and provides state-of-the-art accuracy on a standard Web spam benchmark.
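The core idea — fitting a scoring function to labeled hosts while a graph term pulls linked hosts' scores together — can be illustrated with a toy gradient-descent sketch. This is not the paper's exact WITCH formulation; the features, labels, edges, loss, and learning rate below are all made up for illustration.

```python
# Illustrative sketch (not the actual WITCH algorithm): learn a linear spam
# score from host features, with a graph regularizer that penalizes score
# differences across hyperlinked hosts. All data here is a made-up toy example.

hosts    = ["a", "b", "c", "d"]
features = {"a": [1.0, 0.1], "b": [0.9, 0.2], "c": [0.1, 0.9], "d": [0.2, 1.0]}
labels   = {"a": 1.0, "c": -1.0}        # +1 spam, -1 non-spam; b, d unlabeled
edges    = [("a", "b"), ("c", "d")]     # hyperlinks between hosts
lam, lr  = 0.5, 0.1                     # regularization weight, learning rate
w = [0.0, 0.0]

def score(h):
    return sum(wi * xi for wi, xi in zip(w, features[h]))

for _ in range(200):
    grad = [0.0, 0.0]
    # squared hinge loss on the labeled hosts
    for h, y in labels.items():
        margin = 1.0 - y * score(h)
        if margin > 0:
            for i, xi in enumerate(features[h]):
                grad[i] += -2.0 * margin * y * xi
    # graph regularizer: (score(u) - score(v))^2 for each hyperlink
    for u, v in edges:
        diff = score(u) - score(v)
        for i in range(2):
            grad[i] += 2.0 * lam * diff * (features[u][i] - features[v][i])
    w = [wi - lr * gi for wi, gi in zip(w, grad)]

# unlabeled hosts get scores consistent with their features and neighbors
print(score("b") > 0, score("d") > 0)  # → True False
```

The point of the sketch is the joint objective: label fit plus graph smoothness, so unlabeled hosts linked to labeled ones inherit sensible scores.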
The main contributions of our work are: (1) User visiting patterns of spam pages are studied and three user behavior features are proposed to separate Web spam pages from ordinary ones. (2) A novel spam detection framework is proposed that can detect unknown spam types and newly-appeared spam with the help of user behavior analysis. Preliminary experiments on large scale Web access log data (containing over 2.74 billion user clicks) show the effectiveness of the proposed features and detection framework.
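User-behavior features of this kind can be sketched from an access log as follows. The two features computed here (share of visits arriving via search results, share of very short stays) are illustrative stand-ins, not the three features the paper actually proposes, and the log records are made up.

```python
# Hypothetical user-behavior features from a toy access log.
from collections import defaultdict

# (site, came_from_search_result, seconds_on_site) — made-up log records
log = [
    ("spammy.example", True, 3), ("spammy.example", True, 2),
    ("spammy.example", True, 4), ("news.example", False, 120),
    ("news.example", True, 90),  ("news.example", False, 45),
]

def behavior_features(records):
    by_site = defaultdict(list)
    for site, from_search, secs in records:
        by_site[site].append((from_search, secs))
    feats = {}
    for site, visits in by_site.items():
        n = len(visits)
        feats[site] = {
            # spam pages often draw nearly all traffic via search results...
            "search_visit_ratio": sum(f for f, _ in visits) / n,
            # ...and users leave them almost immediately
            "short_stay_ratio": sum(s < 10 for _, s in visits) / n,
        }
    return feats

feats = behavior_features(log)
print(feats["spammy.example"])  # → {'search_visit_ratio': 1.0, 'short_stay_ratio': 1.0}
```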
Every day millions of users search for information on the web via search engines, and provide implicit feedback on the results shown for their queries by clicking or not clicking on them. This feedback is encoded in the form of a query log that consists of a sequence of search actions, one per user query, each describing the following information: (i) terms composing the query, (ii) documents returned by the search engine, (iii) documents that have been clicked, (iv) the rank of those documents in the list of results, (v) date and time of the search action/click, (vi) an anonymous identifier for each session, and more. In this work, we investigate the idea of characterizing the documents and the queries belonging to a given query log with the goal of improving algorithms for detecting spam, both at the document level and at the query level.
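A search action with the fields listed in this abstract can be represented roughly as below, with a simple per-document click count derived from it. The field names and log records are illustrative, not a real query-log schema.

```python
# Minimal sketch of one query-log record and a per-document click count.
from dataclasses import dataclass
from collections import Counter

@dataclass
class SearchAction:
    terms: list        # (i) query terms
    results: list      # (ii) document ids returned, in rank order
    clicked: list      # (iii) ids of clicked documents; (iv) their rank is the
                       #       index in `results`
    timestamp: str     # (v) date/time of the action
    session_id: str    # (vi) anonymous session identifier

log = [
    SearchAction(["cheap", "pills"], ["d1", "d2", "d3"], ["d1"],
                 "2014-04-24T10:00", "s1"),
    SearchAction(["cheap", "pills"], ["d1", "d2", "d3"], ["d1", "d3"],
                 "2014-04-24T10:05", "s2"),
]

# a basic document-level signal: how often each result was clicked
clicks = Counter(doc for action in log for doc in action.clicked)
print(clicks["d1"], clicks["d2"], clicks["d3"])  # → 2 0 1
```

Aggregates of this kind, per document and per query, are the raw material for the click-based spam features the paper studies.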
In this paper we apply a modification of LDA, the novel multi-corpus LDA technique, for web spam classification. We create a bag-of-words document for every Web site and run LDA both on the corpus of sites labeled as spam and on the corpus labeled as non-spam. In this way, collections of spam and non-spam topics are created in the training phase. In the test phase we take the union of these collections, and an unseen site is deemed spam if its total spam topic probability is above a threshold. As far as we know, this is the first web retrieval application of LDA.
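The test-phase decision rule described here can be sketched directly: sum an unseen site's probability mass on the spam-corpus topics and compare to a threshold. The topic names, probabilities, and threshold below are made-up stand-ins for real LDA inference output.

```python
# Sketch of the multi-corpus LDA decision rule with made-up topic distributions.
spam_topics = {"t_pharma", "t_casino"}  # topics learned on the spam corpus

def is_spam(topic_probs, threshold=0.5):
    """Deem a site spam if its total spam-topic probability exceeds a threshold."""
    spam_mass = sum(p for t, p in topic_probs.items() if t in spam_topics)
    return spam_mass > threshold

# topic distributions over the union of spam and non-spam topics
site_a = {"t_pharma": 0.6, "t_casino": 0.1, "t_sports": 0.3}
site_b = {"t_pharma": 0.05, "t_news": 0.7, "t_sports": 0.25}
print(is_spam(site_a), is_spam(site_b))  # → True False
```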
Microsoft - In this paper, we describe several link-based spam-detection features, both supervised and unsupervised, that can be derived from these approximate supporting sets. In particular, we examine the size of a node's supporting sets and the approximate l2 norm of the PageRank contributions from other nodes. As a supervised feature, we examine the composition of a node's supporting sets. We perform experiments on two labeled real data sets to demonstrate the effectiveness of these features for spam detection, and demonstrate that these features can be computed efficiently.
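The underlying quantity can be illustrated on a tiny graph: under uniform teleportation, node u's contribution to node v's PageRank is u's personalized-PageRank mass at v scaled by 1/n, and the l2 norm of v's contribution vector is one of the features mentioned above. Exact power iteration here stands in for the paper's approximation algorithms, and the graph is made up.

```python
# Illustrative PageRank-contribution computation on a made-up four-node graph.
import math

graph = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}  # toy link graph
nodes = sorted(graph)
alpha = 0.15                                              # teleportation probability

def personalized_pagerank(source, iters=100):
    """Power iteration with restarts concentrated on `source`."""
    pr = {v: 1.0 if v == source else 0.0 for v in nodes}
    for _ in range(iters):
        nxt = {v: (alpha if v == source else 0.0) for v in nodes}
        for u in nodes:
            share = (1 - alpha) * pr[u] / len(graph[u])
            for v in graph[u]:
                nxt[v] += share
        pr = nxt
    return pr

n = len(nodes)
# contribution[u][v]: how much of v's global PageRank comes from u
contribution = {u: {v: personalized_pagerank(u)[v] / n for v in nodes}
                for u in nodes}
# the unsupervised feature: l2 norm of node c's contribution vector
l2_norm_c = math.sqrt(sum(contribution[u]["c"] ** 2 for u in nodes))
print(0 < l2_norm_c < 1)  # → True
```

A node whose PageRank comes from a few heavy contributors (a small supporting set, a large l2 norm relative to its PageRank) looks more like link spam than one supported broadly.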
In this paper, we extend the work reported in Sydow et al. [12] by introducing more linguistic-based features and studying their potential usability for Web spam classification. Our effort is complementary to the work on content-based features reported by others (see Section 1.1). The main contributions are: (1) Computing over 200 new linguistic-based attributes; in order to get a better, less biased insight, we tested various NLP tools and two Web spam corpora, together with 3 different document length-restriction modes.
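Linguistic attributes of the general kind such classifiers use can be sketched as follows; these three particular features are illustrative examples, not any of the paper's 200+ actual attributes.

```python
# Hypothetical linguistic-style features of a page's text (illustrative only).
STOPWORDS = {"the", "a", "of", "to", "and", "in", "is"}

def linguistic_features(text):
    words = text.lower().split()
    n = len(words)
    return {
        "avg_word_length": sum(len(w) for w in words) / n,
        "stopword_ratio": sum(w in STOPWORDS for w in words) / n,
        "type_token_ratio": len(set(words)) / n,  # vocabulary diversity
    }

# keyword-stuffed spam text tends to have low vocabulary diversity
feats = linguistic_features("buy cheap pills buy cheap pills buy cheap pills")
print(round(feats["type_token_ratio"], 2))  # → 0.33
```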