Description
This open access book aims to synthesize and integrate the research challenges in data science and data engineering. It offers a comprehensive survey of the entire data management stack, from scalable and explainable data analytics to traceable data workflows. By providing a consistent framework, it facilitates a thorough understanding of the data science lifecycle, from basic definitions to state-of-the-art concepts and techniques.
The book is divided into four parts, each focusing on a different aspect of the data management and science lifecycle: governance, storage and processing, preparation, and analysis. Each part is organized to provide a coherent conceptual framework and is divided into multiple chapters, each focusing on a specific topic but together offering a comprehensive overview of the state of the art and the key challenges in the respective areas. While the parts and chapters follow a logical sequence, each chapter is designed to be self-contained and can be read independently. Chapters include references for further reading and deeper exploration, and often also provide concrete examples or use cases to make the material more accessible. In addition, many chapters introduce a taxonomy to break down complex research areas into manageable components, highlighting the core directions and developments within each domain.
The book is designed to be a valuable resource for both researchers and practitioners seeking to leverage data engineering for data science applications. For seasoned experts and budding professionals alike, it provides the tools and knowledge needed to stay at the forefront of data-driven advancements.
Part I. Governance and Integration
Chapter 1. Text Data Integration
Chapter 2. Exploring the Landscape of Data Fusion
Chapter 3. Scalable and Privacy-aware Relational Data Synthesis

Part II. Storage and Processing
Chapter 4. Comprehensive Approach to Feature Selection
Chapter 5. Current Systems for Managing Massive High Frequency Time Series
Chapter 6. MLOps Systems for Developing ML Pipelines
Chapter 7. Workload Placement and Scheduling on Heterogeneous CPU-GPU Architectures

Part III. Preparation
Chapter 8. Privacy-Preserving Blockchain-Based Federated Learning
Chapter 9. Example-Based Explainability in Machine Learning
Chapter 10. Table Search in Data Lakes: Methods, Indexing Techniques, and Research Challenges

Part IV. Analysis
Chapter 11. Adversarial Learning for Fraud Detection
Chapter 12. Approximate and Adaptive Methods for Inference
Chapter 13. Analysis of Unconstrained Trajectories, the Case of AIS
Chapter 14. Network-constrained Trajectory Data for Traffic Analytics

Gilles Dejaegere is a postdoctoral researcher at the DES-Lab of the Université libre de Bruxelles (ULB). He holds a PhD in Engineering Sciences and Technology, and his research lies at the intersection of operational research and data management, with a particular focus on mobility data and the application of AI techniques to mobility analytics. He has been actively involved in several European research initiatives and projects bridging data engineering and decision support. Gilles has also contributed to the organization of top-tier international scientific conferences and summer schools in data management and analytics.
Alberto Abello is Full Professor at Universitat Politecnica de Catalunya (UPC), Barcelona. At UPC, he has coordinated European Erasmus Mundus programmes at both the master and PhD levels, an MSCA-ITN-EJD, H2020 projects, and R+D agreements with organizations such as Hewlett Packard, Zurich Insurance, SAP, and the World Health Organization. He has also participated in more than ten national research projects or networks of excellence. He has successfully advised ten PhD theses, whose research results contributed to more than 45 journal articles, 90 conference articles, and 10 book chapters. He has led several development cooperation projects, including a collaboration with the Department of Neglected Tropical Diseases of the World Health Organization and the Probitas Foundation.
Kristian Torp is Full Professor in the Department of Computer Science at Aalborg University. His research focuses on managing and analyzing data generated by moving objects, including bicycles, cars, and ships. A key aspect of his work involves developing software prototypes to experimentally validate the findings. He contributes actively to the scientific community and serves on program committees for top-tier international conferences.
Alkis Simitsis is a Research Director at Athena Research Center. He draws on over 20 years of experience in both academic and industry settings, developing innovative data management technologies and enterprise-grade products. His work spans a broad range of areas, including scalable big data infrastructure, data-intensive analytics, information management, business intelligence, massively parallel processing, distributed and columnar databases, graph databases, security analytics, and cloud computing. He holds 46 U.S. and European patents, has authored over 130 papers in leading international journals and conferences, and contributes actively to the scientific community by regularly serving on program committees and in organizational roles for top-tier international scientific conferences.



