These are research papers related to web spam.
Adversarial Information Retrieval on the Web 02/08/2009 Hits: 699
"Adversarial Information Retrieval addresses tasks such as gathering, indexing, filtering, retrieving and ranking information from collections wherein a subset has been manipulated maliciously. On the Web, the predominant form of such manipulation is "search engine spamming" or spamdexing, i.e., malicious attempts to influence the outcome of ranking algorithms, aimed at getting an undeserved high ranking for some items in the collection. There is an economic incentive to rank higher in search engines, considering that a good ranking on them is strongly correlated with more traffic, which often translates to more revenue."
Detecting Spam in Social Bookmark Systems 01/29/2009 Hits: 346
In this paper we will transfer this approach to a social bookmarking setting to identify spammers. We will present features considering the topological, semantic and profile-based information which people make public when using the system. The dataset used is a snapshot of the social bookmarking system BibSonomy and was built over the course of several months while cleaning spam from the system.
Detecting Spam Web Pages through Content Analysis 02/05/2009 Hits: 668
Microsoft - In this paper, we continue our investigations of “web spam”: the injection of artificially-created pages into the web in order to influence the results from search engines, to drive traffic to certain pages for fun or profit. This paper considers some previously-undescribed techniques for automatically detecting spam pages, examines the effectiveness of these techniques in isolation and when aggregated using classification algorithms. When combined, our heuristics correctly identify 2,037 (86.2%) of the 2,364 spam pages (13.8%) in our judged collection of 17,168 pages, while misidentifying 526 spam and non-spam pages (3.1%).
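The percentages quoted in this abstract can be re-derived from the four counts it gives. The following is only an arithmetic check; the variable names are illustrative, and the overall accuracy figure is computed here from the quoted counts rather than taken from the paper.

```python
# Counts quoted in the abstract above.
total_pages = 17_168     # judged collection
spam_pages = 2_364       # pages judged to be spam
spam_caught = 2_037      # spam pages correctly identified
misidentified = 526      # spam and non-spam pages classified wrongly


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


spam_share = percent(spam_pages, total_pages)         # share of spam in the collection
detection_rate = percent(spam_caught, spam_pages)     # share of spam that was caught
error_rate = percent(misidentified, total_pages)      # share of all pages misclassified
# Derived here, not quoted in the abstract: pages classified correctly.
accuracy = percent(total_pages - misidentified, total_pages)

print(spam_share, detection_rate, error_rate, accuracy)
```

The first three values match the 13.8%, 86.2% and 3.1% stated in the abstract.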
Exploring Linguistic Features for Web Spam Detection: A Preliminary Study 01/29/2009 Hits: 668
In this paper, we extend the work reported in Sydow et al. by introducing more linguistic-based features and studying their potential usability for Web spam classification. Our effort is complementary to the work on content-based features reported by others (see Section 1.1). The main contributions are: (1) Computing over 200 new linguistic-based attributes; in order to get better, less biased insight, we tested various NLP tools and two Web spam corpora together with 3 different document length-restriction modes.
Identifying Web Spam with User Behavior Analysis 01/29/2009 Hits: 659
The main contributions of our work are: (1) User visiting patterns of spam pages are studied and three user behavior features are proposed to separate Web spam pages from ordinary ones. (2) A novel spam detection framework is proposed that can detect unknown spam types and newly-appeared spam with the help of user behavior analysis. Preliminary experiments on large-scale Web access log data (containing over 2.74 billion user clicks) show the effectiveness of the proposed features and detection framework.
Latent Dirichlet Allocation in Web Spam Filtering 01/29/2009 Hits: 663
In this paper we apply a modification of LDA, the novel multi-corpus LDA technique for web spam classification. We create a bag-of-words document for every Web site and run LDA both on the corpus of sites labeled as spam and as non-spam. In this way collections of spam and non-spam topics are created in the training phase. In the test phase we take the union of these collections, and an unseen site is deemed spam if its total spam topic probability is above a threshold. As far as we know, this is the first web retrieval application of LDA.
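The test-phase rule described in this abstract can be sketched as follows. This is a minimal illustration of the thresholding step only: the topic identifiers, the example probabilities and the threshold of 0.5 are assumptions made for the example, and the LDA inference that would produce the topic distribution is not shown.

```python
def total_spam_probability(topic_probs, spam_topic_ids):
    """Sum the probability mass a site assigns to topics learned from the spam corpus.

    topic_probs maps a topic id to the site's inferred probability for that topic,
    over the union of spam and non-spam topics.
    """
    return sum(p for topic, p in topic_probs.items() if topic in spam_topic_ids)


def is_spam(topic_probs, spam_topic_ids, threshold=0.5):
    """Deem a site spam when its total spam topic probability exceeds the threshold.

    The default threshold is a placeholder, not a value taken from the paper.
    """
    return total_spam_probability(topic_probs, spam_topic_ids) > threshold


# Hypothetical topic distribution for one unseen site.
spam_ids = {"spam_0", "spam_1"}
site_topics = {"spam_0": 0.45, "spam_1": 0.25, "ham_0": 0.20, "ham_1": 0.10}
clean_topics = {"spam_0": 0.10, "ham_0": 0.90}

print(is_spam(site_topics, spam_ids))
print(is_spam(clean_topics, spam_ids))
```

In practice the threshold would be tuned on labeled data; it is exposed as a parameter here for that reason.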
Query-log mining for detecting spam 01/29/2009 Hits: 673
Every day millions of users search for information on the web via search engines, and provide implicit feedback on the results shown for their queries by clicking, or not clicking, on them. This feedback is encoded in the form of a query log that consists of a sequence of search actions, one per user query, each describing the following information: (i) terms composing a query, (ii) documents returned by the search engine, (iii) documents that have been clicked, (iv) the rank of those documents in the list of results, (v) date and time of the search action/click, (vi) an anonymous identifier for each session, and more. In this work, we investigate the idea of characterizing the documents and the queries belonging to a given query log with the goal of improving algorithms for detecting spam, both at the document level and at the query level.
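The six fields listed in this abstract suggest a simple record structure for one search action. The sketch below is an assumption about how such a log entry might be represented; the class, field and function names are hypothetical, and the grouping helper only illustrates a document-level view of a log, not the characterization used in the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Set


@dataclass
class SearchAction:
    """One query-log entry, following fields (i) to (vi) of the abstract."""
    session_id: str            # (vi) anonymous session identifier
    timestamp: datetime        # (v) date and time of the search action
    query_terms: List[str]     # (i) terms composing the query
    returned_docs: List[str]   # (ii) returned documents, in rank order
    clicked_docs: List[str]    # (iii) documents that were clicked

    def clicked_ranks(self) -> Dict[str, int]:
        """(iv) 1-based rank of each clicked document in the result list."""
        return {
            doc: self.returned_docs.index(doc) + 1
            for doc in self.clicked_docs
            if doc in self.returned_docs
        }


def queries_per_document(log: List[SearchAction]) -> Dict[str, Set[str]]:
    """Group the queries that led to a click on each document."""
    result: Dict[str, Set[str]] = defaultdict(set)
    for action in log:
        for doc in action.clicked_docs:
            result[doc].add(" ".join(action.query_terms))
    return dict(result)


# Hypothetical two-entry log.
log = [
    SearchAction("s1", datetime(2009, 1, 29, 10, 0), ["cheap", "pills"],
                 ["a.example", "b.example", "c.example"], ["b.example"]),
    SearchAction("s2", datetime(2009, 1, 29, 10, 5), ["free", "ringtones"],
                 ["b.example", "d.example"], ["b.example"]),
]
print(log[0].clicked_ranks())
print(queries_per_document(log))
```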
Robust PageRank and Locally Computable Spam Detection Features 01/29/2009 Hits: 595
Microsoft - In this paper, we describe several link-based spam-detection features, both supervised and unsupervised, that can be derived from these approximate supporting sets. In particular, we examine the size of a node's supporting sets and the approximate l2 norm of the PageRank contributions from other nodes. As a supervised feature, we examine the composition of a node's supporting sets. We perform experiments on two labeled real data sets to demonstrate the effectiveness of these features for spam detection, and demonstrate that these features can be computed efficiently.
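The two unsupervised features named in this abstract can be illustrated on a toy graph. The excerpt does not define a supporting set, so the sketch assumes one common formulation: the contribution of node u to a target is u's personalized PageRank at the target divided by the number of nodes, and the supporting set holds every node whose contribution is at least a fraction delta of the target's total. It also computes contributions exactly by brute force, whereas the paper works with local approximations. The graph, alpha, delta and all function names are illustrative.

```python
import math


def personalized_pagerank(graph, seed, alpha=0.15, iterations=100):
    """Power iteration for a random walk that restarts at `seed` with probability alpha."""
    rank = {node: 0.0 for node in graph}
    rank[seed] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in graph}
        nxt[seed] += alpha
        for node, outlinks in graph.items():
            if not outlinks:
                raise ValueError("this sketch assumes every node has an out-link")
            share = (1.0 - alpha) * rank[node] / len(outlinks)
            for target in outlinks:
                nxt[target] += share
        rank = nxt
    return rank


def contributions_to(graph, target, alpha=0.15):
    """Contribution of every node u to the PageRank of `target` (assumed definition)."""
    n = len(graph)
    return {u: personalized_pagerank(graph, u, alpha)[target] / n for u in graph}


def supporting_set(contrib, delta):
    """Nodes contributing at least a fraction delta of the target's total."""
    total = sum(contrib.values())
    return {u for u, c in contrib.items() if c >= delta * total}


def normalized_l2_norm(contrib):
    """l2 norm of the contribution vector after scaling it to sum to 1."""
    total = sum(contrib.values())
    return math.sqrt(sum((c / total) ** 2 for c in contrib.values()))


# Toy graph: a, b and c all link to t; x and y only link to each other.
graph = {"a": ["t"], "b": ["t"], "c": ["t"], "t": ["a"], "x": ["y"], "y": ["x"]}
contrib = contributions_to(graph, "t")
support = supporting_set(contrib, delta=0.1)
norm = normalized_l2_norm(contrib)
print(sorted(support), round(norm, 3))
```

Under this definition the contributions sum to the target's global PageRank with uniform teleportation, which is why the supporting set is expressed as a fraction of that total.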