Data Mining Approaches for Web Spam Detection
Abstract
Web spam is a serious problem for search engines because the presence of spam pages can severely degrade the quality of their results. In this paper, we present an efficient spam detection system that combines new link-based features with language-model (LM)-based ones. Specifically, we apply the Kullback–Leibler divergence to different combinations of these sources of information in order to characterize the relationship between two linked pages. The system uses a hybrid clustering approach that combines K-means and SVM, followed by classification with C4.5 over the qualified link-based features and the LM-based ones. The result is an accurate system for detecting Web spam using fewer features.
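As an illustration of how the Kullback–Leibler divergence can characterize the relationship between two linked pages, the following minimal sketch compares unigram language models built from a source-side text and two candidate target texts. The abstract does not specify the text sources, the smoothing, or the vocabulary handling, so the add-alpha smoothing, the toy strings, and the helper names `unigram_lm` and `kl_divergence` are assumptions made only for this example.

```python
import math
from collections import Counter


def unigram_lm(text, vocab, alpha=1.0):
    """Add-alpha smoothed unigram language model over a fixed vocabulary.

    The smoothing scheme is an assumption for this sketch; it keeps every
    probability strictly positive so the divergence below is finite.
    """
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) in bits for two distributions over the same vocabulary."""
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p)


# Toy texts (hypothetical): one source-side text and two linked targets.
anchor_text = "cheap flights hotel deals"
target_text = "buy pills online casino poker pills"
related_text = "cheap hotel deals and cheap flights"

vocab = sorted(set(" ".join([anchor_text, target_text, related_text]).lower().split()))
source_lm = unigram_lm(anchor_text, vocab)

# A larger divergence suggests the linked page talks about something else.
kl_spam = kl_divergence(source_lm, unigram_lm(target_text, vocab))
kl_related = kl_divergence(source_lm, unigram_lm(related_text, vocab))

print(f"KL(source || unrelated target) = {kl_spam:.3f}")
print(f"KL(source || related target)   = {kl_related:.3f}")
```

In a setup like this, each divergence value would become one numeric feature of a link, to be passed on to the clustering and classification stages alongside the link-based features.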
This work is licensed under a Creative Commons Attribution 3.0 License.