Full Description
This open access book aims to synthesize and integrate the research challenges in data science and data engineering. It offers a comprehensive survey of the entire data management stack, from scalable and explainable data analytics to traceable data workflows. By providing a consistent framework, it facilitates a thorough understanding of the data science lifecycle, from basic definitions to state-of-the-art concepts and techniques.
The book is divided into four parts, each focusing on a different aspect of the data management and science lifecycle: governance, storage and processing, preparation, and analysis. Each part is organized to provide a coherent conceptual framework and is divided into multiple chapters, each focusing on a specific topic but together offering a comprehensive overview of the state of the art and the key challenges in the respective areas. While the parts and chapters follow a logical sequence, each chapter is designed to be self-contained and can be read independently. Chapters include references for further reading and deeper exploration, and often also provide concrete examples or use cases to make the material more accessible. In addition, many chapters introduce a taxonomy to break down complex research areas into manageable components, highlighting the core directions and developments within each domain.
The book is designed to be a valuable resource for both researchers and practitioners seeking to leverage data engineering for data science applications. For seasoned experts and budding professionals alike, it provides the tools and knowledge needed to stay at the forefront of data-driven advancements.
Contents
Part I. Governance and Integration
Chapter 1. Text Data Integration
Chapter 2. Exploring the Landscape of Data Fusion
Chapter 3. Scalable and Privacy-aware Relational Data Synthesis

Part II. Storage and Processing
Chapter 4. Comprehensive Approach to Feature Selection
Chapter 5. Current Systems for Managing Massive High Frequency Time Series
Chapter 6. MLOps Systems for Developing ML Pipelines
Chapter 7. Workload Placement and Scheduling on Heterogeneous CPU-GPU Architectures

Part III. Preparation
Chapter 8. Privacy-Preserving Blockchain-Based Federated Learning
Chapter 9. Example-Based Explainability in Machine Learning
Chapter 10. Table Search in Data Lakes: Methods, Indexing Techniques, and Research Challenges

Part IV. Analysis
Chapter 11. Adversarial Learning for Fraud Detection
Chapter 12. Approximate and Adaptive Methods for Inference
Chapter 13. Analysis of Unconstrained Trajectories, the Case of AIS
Chapter 14. Network-constrained Trajectory Data for Traffic Analytics



