Wednesday, April 23, 2014
Web Spam
These are web spam related research papers

In this paper we will transfer this approach to a social bookmarking setting to identify spammers. We will present features considering the topological, semantic and profile-based information which people make public when using the system. The dataset used is a snapshot of the social bookmarking system BibSonomy and was built over the course of several months when cleaning the system of spam.
In this paper, we extend the work reported in Sydow et al. [12] by introducing more linguistic-based features and studying their potential usability for Web spam classification. Our effort is complementary to the work on content-based features reported by others (see Section 1.1). The main contributions are: (1) Computing over 200 new linguistic-based attributes; in order to get a better, less biased insight, we tested various NLP tools and two Web spam corpora together with 3 different document length-restriction modes.
Microsoft - In this paper, we describe several link-based spam-detection features, both supervised and unsupervised, that can be derived from these approximate supporting sets. In particular, we examine the size of a node's supporting sets and the approximate l2 norm of the PageRank contributions from other nodes. As a supervised feature, we examine the composition of a node's supporting sets. We perform experiments on two labeled real data sets to demonstrate the effectiveness of these features for spam detection, and demonstrate that these features can be computed efficiently.
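As a rough illustration of the two unsupervised features named in the abstract above, the sketch below computes a supporting-set size and an l2 norm from a vector of PageRank contributions. The contribution values, the cutoff `delta`, and the rule "a node supports the target if its contribution is at least `delta`" are assumptions made for this example; the paper's exact definitions and its approximation algorithm are not reproduced here.

```python
import math
from typing import Dict

def supporting_set_size(contributions: Dict[str, float], delta: float) -> int:
    """Count the source nodes whose contribution is at least `delta`.

    `delta` is a hypothetical cutoff chosen for illustration only.
    """
    return sum(1 for c in contributions.values() if c >= delta)

def l2_norm(contributions: Dict[str, float]) -> float:
    """Euclidean (l2) norm of the contribution vector."""
    return math.sqrt(sum(c * c for c in contributions.values()))

# Hypothetical contributions of three source nodes to one target node.
example_contributions = {"a": 0.3, "b": 0.4, "c": 0.0}

size = supporting_set_size(example_contributions, delta=0.1)
norm = l2_norm(example_contributions)
```

For a fixed total contribution, support spread over many nodes gives a smaller l2 norm than support concentrated in a few nodes, which is one reason the norm can serve as a per-node feature.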
In this paper we apply a modification of LDA, the novel multi-corpus LDA technique for web spam classification. We create a bag-of-words document for every Web site and run LDA both on the corpus of sites labeled as spam and as non-spam. In this way collections of spam and non-spam topics are created in the training phase. In the test phase we take the union of these collections, and an unseen site is deemed spam if its total spam topic probability is above a threshold. As far as we know, this is the first web retrieval application of LDA.
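The decision rule in the test phase described above can be sketched as follows. This is only the final thresholding step, assuming the topic distribution of an unseen site over the union of spam and non-spam topics has already been inferred; the topic indices, the probabilities, and the threshold of 0.5 are hypothetical values, not figures from the paper.

```python
from typing import Sequence, Set

def total_spam_topic_probability(topic_probs: Sequence[float],
                                 spam_topics: Set[int]) -> float:
    """Sum the probability mass that falls on topics learned from the spam corpus."""
    return sum(p for i, p in enumerate(topic_probs) if i in spam_topics)

def is_spam(topic_probs: Sequence[float],
            spam_topics: Set[int],
            threshold: float = 0.5) -> bool:
    """Deem a site spam if its total spam topic probability exceeds the threshold.

    The default threshold is an assumption; in practice it would be tuned.
    """
    return total_spam_topic_probability(topic_probs, spam_topics) > threshold

# Hypothetical union of four topics: 0 and 1 from the non-spam corpus,
# 2 and 3 from the spam corpus.
spam_topic_ids = {2, 3}
suspicious_site = [0.1, 0.2, 0.4, 0.3]
ordinary_site = [0.5, 0.3, 0.1, 0.1]
```

Calling `is_spam(suspicious_site, spam_topic_ids)` flags the first site, since most of its probability mass lies on spam topics, while `is_spam(ordinary_site, spam_topic_ids)` does not.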
Every day millions of users search for information on the web via search engines, and provide implicit feedback to the results shown for their queries by clicking or not onto them. This feedback is encoded in the form of a query log that consists of a sequence of search actions, one per user query, each describing the following information: (i) terms composing a query, (ii) documents returned by the search engine, (iii) documents that have been clicked, (iv) the rank of those documents in the list of results, (v) date and time of the search action/click, (vi) an anonymous identifier for each session, and more. In this work, we investigate the idea of characterizing the documents and the queries belonging to a given query log with the goal of improving algorithms for detecting spam, both at the document level and at the query level.
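One way to picture a search action with the fields (i) to (vi) listed above is a small record type. The field names, the example log, and the click-counting helper below are all assumptions made for illustration; they do not reflect the schema or the features used in the paper.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List

@dataclass
class SearchAction:
    terms: List[str]               # (i) terms composing the query
    returned_docs: List[str]       # (ii) documents returned by the engine
    clicked_ranks: Dict[str, int]  # (iii) + (iv) clicked document -> rank in results
    timestamp: datetime            # (v) date and time of the action
    session_id: str                # (vi) anonymous session identifier

def clicks_per_document(log: List[SearchAction]) -> Counter:
    """Aggregate how often each document was clicked across the log."""
    counts: Counter = Counter()
    for action in log:
        counts.update(action.clicked_ranks.keys())
    return counts

# Hypothetical two-entry query log.
example_log = [
    SearchAction(["cheap", "pills"], ["doc-a", "doc-b"], {"doc-b": 2},
                 datetime(2014, 4, 23, 10, 0), "s1"),
    SearchAction(["pills", "online"], ["doc-b", "doc-c"], {"doc-b": 1, "doc-c": 2},
                 datetime(2014, 4, 23, 10, 5), "s2"),
]

click_counts = clicks_per_document(example_log)
```

Per-document aggregates of this kind are a plausible starting point for the document-level characterization the abstract mentions; grouping by `terms` instead would give the query-level view.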
The main contributions of our work are: (1) User visiting patterns of spam pages are studied and three user behavior features are proposed to separate Web spam from ordinary ones. (2) A novel spam detection framework is proposed that can detect unknown spam types and newly-appeared spam with the help of user behavior analysis. Preliminary experiments on large scale Web access log data (containing over 2.74 billion user clicks) show the effectiveness of the proposed features and detection framework.
We present an algorithm, witch, that learns to detect spam hosts or pages on the Web. Unlike most other approaches, it simultaneously exploits the structure of the Web graph as well as page contents and features. The method is efficient, scalable, and provides state-of-the-art accuracy on a standard Web spam benchmark.
Microsoft - In this paper, we continue our investigations of “web spam”: the injection of artificially-created pages into the web in order to influence the results from search engines, to drive traffic to certain pages for fun or profit. This paper considers some previously-undescribed techniques for automatically detecting spam pages, examines the effectiveness of these techniques in isolation and when aggregated using classification algorithms. When combined, our heuristics correctly identify 2,037 (86.2%) of the 2,364 spam pages (13.8%) in our judged collection of 17,168 pages, while misidentifying 526 spam and non-spam pages (3.1%).
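The percentages quoted in the abstract above can be checked directly from its counts. The variable names are chosen for this example only; the numbers are the ones stated in the abstract.

```python
# Counts quoted in the abstract.
total_pages = 17_168    # judged collection
spam_pages = 2_364      # pages judged to be spam
spam_detected = 2_037   # spam pages correctly identified
misidentified = 526     # spam and non-spam pages misidentified

def pct(fraction: float) -> float:
    """Express a fraction as a percentage rounded to one decimal place."""
    return round(100 * fraction, 1)

detection_rate = pct(spam_detected / spam_pages)  # share of spam pages caught
spam_share = pct(spam_pages / total_pages)        # share of spam in the collection
error_rate = pct(misidentified / total_pages)     # share of pages misidentified
```

The three values come out to 86.2, 13.8 and 3.1, matching the figures in the abstract.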
