Full Description
The Complete Guide to Data Science with Hadoop—For Technical Professionals, Businesspeople, and Students
 Demand is soaring for professionals who can solve real data science problems with Hadoop and Spark. Practical Data Science with Hadoop® and Spark is your complete guide to doing just that. Drawing on immense experience with Hadoop and big data, three leading experts bring together everything you need: high-level concepts, deep-dive techniques, real-world use cases, practical applications, and hands-on tutorials.
 The authors introduce the essentials of data science and the modern Hadoop ecosystem, explaining how Hadoop and Spark have evolved into an effective platform for solving data science problems at scale. In addition to comprehensive application coverage, the authors also provide useful guidance on the important steps of data ingestion, data munging, and visualization.
 Once the groundwork is in place, the authors focus on specific applications, including machine learning, predictive modeling for sentiment analysis, clustering for document analysis, anomaly detection, and natural language processing (NLP).
 This guide provides a strong technical foundation for those who want to do practical data science, and also presents business-driven guidance on how to apply Hadoop and Spark to optimize ROI of data science initiatives.
 Learn
 
What data science is, how it has evolved, and how to plan a data science career
How data volume, variety, and velocity shape data science use cases
Hadoop and its ecosystem, including HDFS, MapReduce, YARN, and Spark
Data importation with Hive and Spark
Data quality, preprocessing, preparation, and modeling
Visualization: surfacing insights from huge data sets
Machine learning: classification, regression, clustering, and anomaly detection
Algorithms and Hadoop tools for predictive modeling
Cluster analysis and similarity functions
Large-scale anomaly detection
NLP: applying data science to human language
Contents
Foreword
 Preface
 Acknowledgments
 About the Authors
 Part I: Data Science with Hadoop—An Overview
 Chapter 1: Introduction to Data Science
 What Is Data Science?
 Example: Search Advertising
 A Bit of Data Science History
 Becoming a Data Scientist
 Building a Data Science Team
 The Data Science Project Life Cycle
 Managing a Data Science Project
 Summary
 Chapter 2: Use Cases for Data Science
 Big Data—A Driver of Change
 Business Use Cases
 Summary
 Chapter 3: Hadoop and Data Science
 What Is Hadoop?
 Hadoop's Evolution
 Hadoop Tools for Data Science
 Why Hadoop Is Useful to Data Scientists
 Summary
 Part II: Preparing and Visualizing Data with Hadoop
 Chapter 4: Getting Data into Hadoop
 Hadoop as a Data Lake
 The Hadoop Distributed File System (HDFS)
 Direct File Transfer to Hadoop HDFS
 Importing Data from Files into Hive Tables
 Importing Data into Hive Tables Using Spark
 Using Apache Sqoop to Acquire Relational Data
 Using Apache Flume to Acquire Data Streams
 Manage Hadoop Work and Data Flows with Apache Oozie
 Apache Falcon
 What's Next in Data Ingestion?
 Summary
 Chapter 5: Data Munging with Hadoop
 Why Hadoop for Data Munging?
 Data Quality
 The Feature Matrix
 Summary
 Chapter 6: Exploring and Visualizing Data
 Why Visualize Data?
 Creating Visualizations
 Using Visualization for Data Science
 Popular Visualization Tools
 Visualizing Big Data with Hadoop
 Summary
 Part III: Applying Data Modeling with Hadoop
 Chapter 7: Machine Learning with Hadoop
 Overview of Machine Learning
 Terminology
 Task Types in Machine Learning
 Big Data and Machine Learning
 Tools for Machine Learning
 The Future of Machine Learning and Artificial Intelligence
 Summary
 Chapter 8: Predictive Modeling
 Overview of Predictive Modeling
 Classification Versus Regression
 Evaluating Predictive Models
 Supervised Learning Algorithms
 Building Big Data Predictive Model Solutions
 Example: Sentiment Analysis
 Summary
 Chapter 9: Clustering
 Overview of Clustering
 Uses of Clustering
 Designing a Similarity Measure
 Clustering Algorithms
 Example: Clustering Algorithms
 Evaluating the Clusters and Choosing the Number of Clusters
 Building Big Data Clustering Solutions
 Example: Topic Modeling with Latent Dirichlet Allocation
 Summary
 Chapter 10: Anomaly Detection with Hadoop
 Overview
 Uses of Anomaly Detection
 Types of Anomalies in Data
 Approaches to Anomaly Detection
 Tuning Anomaly Detection Systems
 Building a Big Data Anomaly Detection Solution with Hadoop
 Example: Detecting Network Intrusions
 Summary
 Chapter 11: Natural Language Processing
 Natural Language Processing
 Tooling for NLP in Hadoop
 Textual Representations
 Sentiment Analysis Example
 Summary
 Chapter 12: Data Science with Hadoop—The Next Frontier
 Automated Data Discovery
 Deep Learning
 Summary
 Appendix A: Book Web Page and Code Download
 Appendix B: HDFS Quick Start
 Quick Command Dereference
 Appendix C: Additional Background on Data Science and Apache Hadoop and Spark
 General Hadoop/Spark Information
 Hadoop/Spark Installation Recipes
 HDFS
 MapReduce
 Spark
 Essential Tools
 Machine Learning
 Index