This book introduces the reader to methods of data mining on the web, including uncovering patterns in web content (classification, clustering, language processing), structure (graphs, hubs, metrics), and usage (modeling, sequence analysis, performance).
"...it has to be noted that this book is an excellent resource for conducting Web mining lectures or single units within Data mining class. The data can be used for small as well as quite comprehensive business intelligence projects. The book's content is easy to access; even students with very basic statistical skills can get the flavor of the intriguing aspects of Web mining." (Journal of Statistical Software, April 2008)
"...highlight[s] the exciting research related to data mining the Web...a detailed summary of the current state of the art." (CHOICE, December 2007)
"I can say I really enjoyed reading this book...a great educational resource for students and teachers." (Information Retrieval, 2008)
PART I: WEB STRUCTURE MINING.
1 INFORMATION RETRIEVAL AND WEB SEARCH.
Web Search Engines.
Crawling the Web.
Indexing and Keyword Search.
Advanced Text Search.
Using the HTML Structure in Keyword Search.
Evaluating Search Quality.
2 HYPERLINK-BASED RANKING.
Social Networks Analysis.
Authorities and Hubs.
Link-Based Similarity Search.
Enhanced Techniques for Page Ranking.
PART II: WEB CONTENT MINING.
Hierarchical Agglomerative Clustering.
Finite Mixture Problem.
Collaborative Filtering (Recommender Systems).
4 EVALUATING CLUSTERING.
Approaches to Evaluating Clustering.
Similarity-Based Criterion Functions.
Probabilistic Criterion Functions.
MDL-Based Model and Feature Evaluation.
Minimum Description Length Principle.
MDL-Based Model Evaluation.
Precision, Recall, and F-Measure.
General Setting and Evaluation Techniques.
Naive Bayes Algorithm.
PART III: WEB USAGE MINING.
6 INTRODUCTION TO WEB USAGE MINING.
Definition of Web Usage Mining.
Cross-Industry Standard Process for Data Mining.
Web Server Log Files.
Remote Host Field.
HTTP Request Field.
Status Code Field.
Transfer Volume (Bytes) Field.
Common Log Format.
Extended Common Log Format.
User Agent Field.
Example of a Web Log Record.
Microsoft IIS Log Format.
7 PREPROCESSING FOR WEB USAGE MINING.
Need for Preprocessing the Data.
Data Cleaning and Filtering.
Page Extension Exploration and Filtering.
De-Spidering the Web Log File.
Directories and the Basket Transformation.
Further Data Preprocessing Steps.
8 EXPLORATORY DATA ANALYSIS FOR WEB USAGE MINING.
Number of Visit Actions.
Relationship between Visit Actions and Session Duration.
Average Time per Page.
Duration for Individual Pages.
9 MODELING FOR WEB USAGE MINING: CLUSTERING, ASSOCIATION, AND CLASSIFICATION.
Definition of Clustering.
The BIRCH Clustering Algorithm.
Affinity Analysis and the A Priori Algorithm.
Discretizing the Numerical Variables: Binning.
Applying the A Priori Algorithm to the CCSU Web Log Data.
Classification and Regression Trees.
The C4.5 Algorithm.