
A General Introduction to Data Analytics
By: João Moreira, Andre Carvalho, Tomás Horvath
Hardcover | 29 June 2018 | Edition Number 1
At a Glance
352 Pages
22.61 x 15.49 x 2.29 cm
Hardcover
RRP $188.05
$187.75
Ships in 5 to 7 business days
Describes the principles and methods of data analysis in an approach that can be understood by readers without specific knowledge of statistics or programming
This book teaches readers without specific knowledge of statistics or programming how to understand and use data analytics. The authors focus on explaining the intuition behind basic data analytics techniques, using easy-to-use tools to present and illustrate the examples. The book contains four parts. The first part motivates the need to analyze data. The second covers visualizing data and finding natural groups in data. Predicting the unknown is the subject of the third part, in which the authors discuss classification, regression, and advanced predictive methods. The last part discusses mining the web and covers topics such as information retrieval, social network analysis, working with text, and recommender systems. At the end of parts 2, 3, and 4 there is a project following the CRISP-DM methodology that shows how to develop a project in the area of that part. Readers are encouraged to develop their own project with their own dataset or with a dataset from a public repository. This book will be of interest to non-mathematicians, non-statisticians, and non-computer scientists interested in getting an introduction to data science.
- Explains the reasoning behind the given data mining techniques
- Uses freely available software packages to show readers how to perform data analysis
- Expands upon a unique illustrative example throughout all chapters
- Contains exercises at the end of each chapter, and larger projects at the end of each part
- Supplementary material includes presentation slides available to instructors
A General Introduction to Data Analytics is a text for upper-level undergraduates or first-year graduate students in areas that use quantitative methods but lie outside mathematics and computer science.
João Moreira is a professor in the Department of Computer Engineering at the University of Porto, Porto, Portugal. He received his Ph.D. from the University of Porto. Moreira won the Best Paper Award at the 2014 International Conference on Advanced Data Mining and Applications, Guilin, China.
Andre Carvalho is a professor in the Department of Computer Science at the University of Sao Paulo, Brazil. He received his Ph.D. from the University of Kent at Canterbury, United Kingdom. Carvalho is one of the founding and first chief editors of the International Journal of Computational Intelligence and Applications, Imperial College Press and World Scientific.
Tomás Horvath is an assistant professor at Pavol Jozef Safarik University in Kosice, Slovakia. He received his Ph.D. from the Institute of Computer Science at Pavol Jozef Safarik University.
Preface xiii
Acknowledgments xv
Presentational Conventions xvii
About the Companion Website xix
Part I Introductory Background 1
1 What Can We Do With Data? 3
1.1 Big Data and Data Science 4
1.2 Big Data Architectures 5
1.3 Small Data 6
1.4 What is Data? 7
1.5 A Short Taxonomy of Data Analytics 9
1.6 Examples of Data Use 10
1.6.1 Breast Cancer in Wisconsin 11
1.6.2 Polish Company Insolvency Data 11
1.7 A Project on Data Analytics 12
1.7.1 A Little History on Methodologies for Data Analytics 12
1.7.2 The KDD Process 14
1.7.3 The CRISP-DM Methodology 15
1.8 How this Book is Organized 16
1.9 Who Should Read this Book 18
Part II Getting Insights from Data 19
2 Descriptive Statistics 21
2.1 Scale Types 22
2.2 Descriptive Univariate Analysis 25
2.2.1 Univariate Frequencies 25
2.2.2 Univariate Data Visualization 27
2.2.3 Univariate Statistics 32
2.2.4 Common Univariate Probability Distributions 38
2.3 Descriptive Bivariate Analysis 40
2.3.1 Two Quantitative Attributes 41
2.3.2 Two Qualitative Attributes, at Least one of them Nominal 45
2.3.3 Two Ordinal Attributes 46
2.4 Final Remarks 47
2.5 Exercises 47
3 Descriptive Multivariate Analysis 49
3.1 Multivariate Frequencies 49
3.2 Multivariate Data Visualization 50
3.3 Multivariate Statistics 59
3.3.1 Location Multivariate Statistics 59
3.3.2 Dispersion Multivariate Statistics 60
3.4 Infographics and Word Clouds 66
3.4.1 Infographics 66
3.4.2 Word Clouds 67
3.5 Final Remarks 67
3.6 Exercises 68
4 Data Quality and Preprocessing 71
4.1 Data Quality 71
4.1.1 Missing Values 72
4.1.2 Redundant Data 74
4.1.3 Inconsistent Data 75
4.1.4 Noisy Data 76
4.1.5 Outliers 77
4.2 Converting to a Different Scale Type 77
4.2.1 Converting Nominal to Relative 78
4.2.2 Converting Ordinal to Relative or Absolute 81
4.2.3 Converting Relative or Absolute to Ordinal or Nominal 82
4.3 Converting to a Different Scale 83
4.4 Data Transformation 85
4.5 Dimensionality Reduction 86
4.5.1 Attribute Aggregation 88
4.5.1.1 Principal Component Analysis 88
4.5.1.2 Independent Component Analysis 91
4.5.1.3 Multidimensional Scaling 91
4.5.2 Attribute Selection 92
4.5.2.1 Filters 92
4.5.2.2 Wrappers 93
4.5.2.3 Embedded 94
4.5.2.4 Search Strategies 95
4.6 Final Remarks 96
4.7 Exercises 96
5 Clustering 99
5.1 Distance Measures 100
5.1.1 Differences between Values of Common Attribute Types 101
5.1.2 Distance Measures for Objects with Quantitative Attributes 103
5.1.3 Distance Measures for Non-conventional Attributes 104
5.2 Clustering Validation 107
5.3 Clustering Techniques 108
5.3.1 K-means 110
5.3.1.1 Centroids and Distance Measures 110
5.3.1.2 How K-means Works 111
5.3.2 DBSCAN 115
5.3.3 Agglomerative Hierarchical Clustering Technique 117
5.3.3.1 Linkage Criterion 119
5.3.3.2 Dendrograms 120
5.4 Final Remarks 122
5.5 Exercises 123
6 Frequent Pattern Mining 125
6.1 Frequent Itemsets 127
6.1.1 Setting the min_sup Threshold 128
6.1.2 Apriori – a Join-based Method 131
6.1.3 Eclat 133
6.1.4 FP-Growth 134
6.1.5 Maximal and Closed Frequent Itemsets 138
6.2 Association Rules 139
6.3 Behind Support and Confidence 142
6.3.1 Cross-support Patterns 143
6.3.2 Lift 144
6.3.3 Simpson's Paradox 145
6.4 Other Types of Pattern 147
6.4.1 Sequential patterns 147
6.4.2 Frequent Sequence Mining 148
6.4.3 Closed and Maximal Sequences 148
6.5 Final Remarks 149
6.6 Exercises 149
7 Cheat Sheet and Project on Descriptive Analytics 151
7.1 Cheat Sheet of Descriptive Analytics 151
7.1.1 On Data Summarization 151
7.1.2 On Clustering 151
7.1.3 On Frequent Pattern Mining 153
7.2 Project on Descriptive Analytics 154
7.2.1 Business Understanding 154
7.2.2 Data Understanding 155
7.2.3 Data Preparation 155
7.2.4 Modeling 157
7.2.5 Evaluation 158
7.2.6 Deployment 158
Part III Predicting the Unknown 159
8 Regression 161
8.1 Predictive Performance Estimation 164
8.1.1 Generalization 164
8.1.2 Model Validation 165
8.1.3 Predictive Performance Measures for Regression 169
8.2 Finding the Parameters of the Model 171
8.2.1 Linear Regression 171
8.2.1.1 Empirical Error 173
8.2.2 The Bias-variance Trade-off 175
8.2.3 Shrinkage Methods 177
8.2.3.1 Ridge Regression 179
8.2.3.2 Lasso Regression 180
8.2.4 Methods that use Linear Combinations of Attributes 181
8.2.4.1 Principal Components Regression 181
8.2.4.2 Partial Least Squares Regression 182
8.3 Technique and Model Selection 182
8.4 Final Remarks 183
8.5 Exercises 184
9 Classification 187
9.1 Binary Classification 188
9.2 Predictive Performance Measures for Classification 192
9.3 Distance-based Learning Algorithms 199
9.3.1 K-nearest Neighbor Algorithms 199
9.3.2 Case-based Reasoning 202
9.4 Probabilistic Classification Algorithms 203
9.4.1 Logistic Regression Algorithm 205
9.4.2 Naive Bayes Algorithm 207
9.5 Final Remarks 208
9.6 Exercises 210
10 Additional Predictive Methods 211
10.1 Search-based Algorithms 211
10.1.1 Decision Tree Induction Algorithms 212
10.1.2 Decision Trees for Regression 217
10.1.2.1 Model Trees 218
10.1.2.2 Multivariate Adaptive Regression Splines 219
10.2 Optimization-based Algorithms 221
10.2.1 Artificial Neural Networks 222
10.2.1.1 Backpropagation 224
10.2.1.2 Deep Networks and Deep Learning Algorithms 230
10.2.2 Support Vector Machines 233
10.2.2.1 SVM for Regression 237
10.3 Final Remarks 238
10.4 Exercises 239
11 Advanced Predictive Topics 241
11.1 Ensemble Learning 241
11.1.1 Bagging 243
11.1.2 Random Forests 244
11.1.3 AdaBoost 245
11.2 Algorithm Bias 246
11.3 Non-binary Classification Tasks 248
11.3.1 One-class Classification 248
11.3.2 Multi-class Classification 249
11.3.3 Ranking Classification 250
11.3.4 Multi-label Classification 251
11.3.5 Hierarchical Classification 252
11.4 Advanced Data Preparation Techniques for Prediction 253
11.4.1 Imbalanced Data Classification 253
11.4.2 For Incomplete Target Labeling 254
11.4.2.1 Semi-supervised Learning 254
11.4.2.2 Active Learning 255
11.5 Description and Prediction with Supervised Interpretable Techniques 255
11.6 Exercises 256
12 Cheat Sheet and Project on Predictive Analytics 259
12.1 Cheat Sheet on Predictive Analytics 259
12.2 Project on Predictive Analytics 259
12.2.1 Business Understanding 260
12.2.2 Data Understanding 260
12.2.3 Data Preparation 265
12.2.4 Modeling 265
12.2.5 Evaluation 265
12.2.6 Deployment 266
Part IV Popular Data Analytics Applications 267
13 Applications for Text, Web and Social Media 269
13.1 Working with Texts 269
13.1.1 Data Acquisition 271
13.1.2 Feature Extraction 271
13.1.2.1 Tokenization 272
13.1.2.2 Stemming 272
13.1.2.3 Conversion to Structured Data 275
13.1.2.4 Is the Bag of Words Enough? 276
13.1.3 Remaining Phases 277
13.1.4 Trends 277
13.1.4.1 Sentiment Analysis 278
13.1.4.2 Web Mining 278
13.2 Recommender Systems 278
13.2.1 Feedback 279
13.2.2 Recommendation Tasks 280
13.2.3 Recommendation Techniques 281
13.2.3.1 Knowledge-based Techniques 281
13.2.3.2 Content-based Techniques 282
13.2.3.3 Collaborative Filtering Techniques 282
13.2.4 Final Remarks 289
13.3 Social Network Analysis 291
13.3.1 Representing Social Networks 291
13.3.2 Basic Properties of Nodes 294
13.3.2.1 Degree 294
13.3.2.2 Distance 294
13.3.2.3 Closeness 295
13.3.2.4 Betweenness 296
13.3.2.5 Clustering Coefficient 297
13.3.3 Basic and Structural Properties of Networks 297
13.3.3.1 Diameter 297
13.3.3.2 Centralization 297
13.3.3.3 Cliques 299
13.3.3.4 Clustering Coefficient 299
13.3.3.5 Modularity 299
13.3.4 Trends and Final Remarks 299
13.4 Exercises 300
Appendix A: Comprehensive Description of the CRISP-DM Methodology 303
References 311
Index 315
ISBN: 9781119296249
ISBN-10: 1119296242
Published: 29th June 2018
Format: Hardcover
Language: English
Number of Pages: 352
Audience: Professional and Scholarly
Publisher: Wiley
Country of Publication: US
Edition Number: 1
Dimensions (cm): 22.61 x 15.49 x 2.29
Weight (kg): 0.73
Shipping
| | Standard Shipping | Express Shipping |
|---|---|---|
| Metro postcodes: | $9.99 | $14.95 |
| Regional postcodes: | $9.99 | $14.95 |
| Rural postcodes: | $9.99 | $14.95 |
Orders over $79.00 qualify for free shipping.
How to return your order
At Booktopia, we offer hassle-free returns in accordance with our returns policy. If you wish to return an item, please get in touch with Booktopia Customer Care.
Additional postage charges may be applicable.
Defective items
If there is a problem with any of the items received for your order then the Booktopia Customer Care team is ready to assist you.
For more info please visit our Help Centre.
