
2020 | Book

Principles of Data Science

Edited by: Dr. Hamid R. Arabnia, Kevin Daimi, Robert Stahlbock, Cristina Soviany, Leonard Heilig, Kai Brüssau

Publisher: Springer International Publishing

Book series: Transactions on Computational Science and Computational Intelligence


About this Book

This book provides readers with a thorough understanding of various research areas within the field of data science. It introduces readers to techniques for data acquisition, extraction, and cleaning; data summarizing and modeling; data analysis and communication; data science tools; deep learning; and various data science applications. Researchers can draw from it future ideas and topics that could lead to potential publications or theses. Furthermore, this book contributes to the preparation of data scientists and to enhancing their knowledge of the field. The book provides a rich collection of manuscripts on highly regarded data science topics, edited by professors with long experience in the field of data science.

- Introduces various techniques, methods, and algorithms adopted by data science experts
- Provides a detailed explanation of data science perceptions, reinforced by practical examples
- Presents a road map of future trends suitable for innovative data science research and practice

Table of Contents

Frontmatter
Simulation-Based Data Acquisition
Abstract
In data science, the application of most approaches requires the existence of big data from a real-world system. Due to access limitations, nonexistence of the system, or temporal as well as economic restrictions, such data might not be accessible or available. To overcome a lack of real-world data, this chapter introduces simulation-based data acquisition as a method for generating artificial data that serves as a substitute when applying data science techniques. Instead of gathering data from the real-world system, computer simulation is used to model and execute artificial systems that can provide a more accessible, economic, and robust source of big data. To this end, it is outlined how data science can benefit from simulation and vice versa. Specific approaches are introduced for the design and execution of experiments, and a selection of simulation frameworks is presented that facilitate the conduct of simulation studies by novice and professional users.
Fabian Lorig, Ingo J. Timm
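As a minimal illustration of the idea (not code from the chapter), the following Python sketch uses a simple discrete-event simulation of a single-server queue to generate an artificial dataset of waiting times; the arrival and service rates, the number of customers, and the output file name are all assumptions chosen for the example.

    import csv
    import random

    def simulate_mm1(arrival_rate=1.0, service_rate=1.5, n_customers=1000, seed=42):
        """Discrete-event simulation of an M/M/1 queue; returns synthetic records."""
        rng = random.Random(seed)
        records = []
        arrival_time = 0.0
        server_free_at = 0.0
        for i in range(n_customers):
            # Exponential inter-arrival and service times (Markovian assumptions).
            arrival_time += rng.expovariate(arrival_rate)
            service_time = rng.expovariate(service_rate)
            start_service = max(arrival_time, server_free_at)
            server_free_at = start_service + service_time
            records.append({
                "customer": i,
                "arrival_time": round(arrival_time, 4),
                "waiting_time": round(start_service - arrival_time, 4),
                "service_time": round(service_time, 4),
            })
        return records

    if __name__ == "__main__":
        data = simulate_mm1()
        # Persist the artificial data so that downstream data science tools can use it.
        with open("synthetic_queue_data.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=data[0].keys())
            writer.writeheader()
            writer.writerows(data)

The resulting CSV can then be treated like data collected from a real system, which is the substitution the chapter argues for.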
Coding of Bits for Entities by Means of Discrete Events (CBEDE): A Method of Compression and Transmission of Data
Abstract
Every day the world produces, stores, and analyzes ever larger volumes of data, whether in telecommunications, medicine, business, or management, among other areas. In this context, computational simulations are an important ally both for analysis and for the creation of new methodologies that aid in the management of different types of data. The objective of this research is to improve the transmission of content in wireless telecommunication systems by proposing, in a simulation environment, a precoding process for bits based on the application of discrete-event techniques to the signal before the modulation process. The signal transmission on the channel occurs in the discrete domain through the implementation of discrete entities in the bit-generation process. The methodology developed is named CBEDE (coding of bits for entities by means of discrete events) and implements a proposal based on discrete events in a wireless telecommunication system. For this, the advanced modulation format DBPSK was considered for signal transmission over an AWGN channel. The results show improved memory utilization, related to information compression, in addition to a low level of abstraction, which facilitates the use of the CBEDE methodology in several areas.
Reinaldo Padilha França, Yuzo Iano, Ana Carolina Borges Monteiro, Rangel Arthur
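The CBEDE precoding itself is not reproduced here, but as context the following Python sketch simulates the baseline the chapter builds on: DBPSK transmission over an AWGN channel with differential detection. The bit count, Eb/N0 value, and seed are assumptions for the example.

    import numpy as np

    def dbpsk_ber(num_bits=100_000, ebn0_db=6.0, seed=0):
        """Simulate DBPSK over an AWGN channel and return the bit error rate."""
        rng = np.random.default_rng(seed)
        bits = rng.integers(0, 2, num_bits)

        # Differential encoding: each coded bit depends on the previous coded bit.
        coded = np.empty(num_bits + 1, dtype=int)
        coded[0] = 0  # reference bit
        for k in range(num_bits):
            coded[k + 1] = coded[k] ^ bits[k]

        # BPSK mapping (0 -> +1, 1 -> -1) and AWGN with the requested Eb/N0.
        symbols = 1.0 - 2.0 * coded
        ebn0 = 10 ** (ebn0_db / 10)
        noise_std = np.sqrt(1.0 / (2.0 * ebn0))
        received = symbols + noise_std * rng.standard_normal(symbols.shape)

        # Differential detection: a sign change between consecutive samples means bit 1.
        decisions = (received[1:] * received[:-1] < 0).astype(int)
        return np.mean(decisions != bits)

    if __name__ == "__main__":
        print(f"DBPSK BER at 6 dB Eb/N0: {dbpsk_ber():.4e}")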
Big Biomedical Data Engineering
Abstract
Big Data, a massive amount of data, is a popular buzzword and a paradigm capable of changing the game in any data-intensive field. The engagement of Big Data technology gives an organization a new direction, and Big Data gives a vision to biomedical data engineering. Numerous data-intensive fields engage Big Data technology to achieve their vision. Interestingly, Big Data plays a crucial role in Big Biomedical Data Engineering (BBDE). The massive amount of biomedical data becomes a dilemma in terms of analysis, diagnosis, and prediction. Besides, large-scale medical data cannot be stored and processed without employing Big Data technology. The deployment of Big Data technology can change the game of biomedical engineering. This chapter explores the role of Big Data in biomedical data engineering and its storage dilemma.
Ripon Patgiri, Sabuzima Nayak
Big Data Preprocessing: An Application on Online Social Networks
Abstract
The mass adoption of social network services has made online social networks a big data source. Machine learning and statistical analysis results are highly dependent on data preprocessing tasks. The purpose of data preprocessing is to convert the data into a format suitable for analysis and to ensure high data quality. However, not only do management aspects for unstructured or semi-structured data remain largely unexplored, but new preprocessing techniques are also required for addressing big data. In this chapter, the data preprocessing stages for big data sources, with an emphasis on online social networks, are investigated. Special attention is paid to practical questions regarding low-quality data, including incomplete, imbalanced, and noisy data. Furthermore, challenges and potential solutions of statistical and rule-based analysis for data cleansing are overviewed. The contribution of natural language processing, feature engineering, and machine learning methods is explored. Online social networks are investigated in terms of (i) context, (ii) analysis practices, (iii) low-quality data, and, most importantly, (iv) how the latter are being addressed by techniques and frameworks. Last but not least, preprocessing in the broader field of distributed infrastructures is briefly overviewed.
Androniki Sapountzi, Kostas E. Psannis
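As a minimal, generic illustration of such preprocessing (not the chapter's own pipeline), the sketch below cleans a handful of hypothetical social media posts: dropping incomplete records, stripping URLs and user mentions, and normalizing whitespace. The example data, user names, and column names are assumptions.

    import re
    import pandas as pd

    # Hypothetical raw posts, including missing and noisy entries.
    raw = pd.DataFrame({
        "user": ["alice", "bob", "carol", "dave"],
        "text": [
            "Check this out! https://example.com/article  #datascience",
            None,                                  # incomplete record
            "@alice totally agree!!!",
            "   too   much    whitespace   ",
        ],
    })

    def clean_text(text: str) -> str:
        """Remove URLs and mentions, collapse whitespace, lowercase the post."""
        text = re.sub(r"https?://\S+", "", text)   # strip URLs
        text = re.sub(r"@\w+", "", text)           # strip user mentions
        text = re.sub(r"\s+", " ", text).strip()   # normalize whitespace
        return text.lower()

    cleaned = (
        raw.dropna(subset=["text"])                # discard incomplete posts
           .assign(text=lambda df: df["text"].map(clean_text))
    )
    print(cleaned)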
Feature Engineering
Abstract
Feature engineering is the methodological framework for designing and generating informative and discriminant feature sets for machine learning algorithms. These design and development tasks exploit the information and knowledge belonging to the application-specific domain in order to properly define, extract, and evaluate the most informative sets of variables, which are further processed and sometimes transformed during the more advanced stages of running the machine learning algorithms. Feature engineering includes feature design, feature transformation, and feature selection, with various algorithms that can be used according to the application requirements concerning discriminant performance, execution speed, and application-specific constraints.
Sorin Soviany, Cristina Soviany
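As a small, generic sketch of the feature transformation and selection steps mentioned above (not code from the chapter), the following assumes scikit-learn and its bundled breast cancer dataset; the choice of scaler, scoring function, number of features, and classifier are assumptions for illustration.

    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)

    # Feature transformation (scaling) followed by filter-based feature selection.
    pipeline = Pipeline([
        ("scale", StandardScaler()),               # transform features to zero mean, unit variance
        ("select", SelectKBest(f_classif, k=10)),  # keep the 10 most discriminant features
        ("model", LogisticRegression(max_iter=1000)),
    ])

    scores = cross_val_score(pipeline, X, y, cv=5)
    print(f"Mean cross-validated accuracy: {scores.mean():.3f}")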
Data Summarization Using Sampling Algorithms: Data Stream Case Study
Abstract
Data streams represent a challenge to data processing operations such as query execution and information retrieval. They pose many constraints in terms of memory space and execution time for the computation process. This is mainly due to the huge volume of the data and their high arrival rate. Generating approximate answers by using a small proportion of the data stream, called a "summary," is acceptable for many applications. Sampling algorithms are used to construct a data stream summary. The purpose of sampling algorithms is to provide information concerning a large set of data from a representative sample extracted from it. An effective summary of a data stream must have the ability to respond, in an approximate manner, to any query, whatever the period of time under investigation. In this chapter, we present a survey of these algorithms. First, we introduce the basic concepts of data streams, windowing models, as well as data stream applications. Next, we introduce the state of the art of different sampling algorithms used in data stream environments. We classify these algorithms according to the following metrics: number of passes over the data, memory consumption, and skewing ability. Finally, we evaluate the performance of three sampling algorithms according to their execution time and accuracy.
Rayane El Sibai, Jacques Bou Abdo, Yousra Chabchoub, Jacques Demerjian, Raja Chiky, Kablan Barbar
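One classic one-pass, fixed-memory algorithm of the kind surveyed here is reservoir sampling. The Python sketch below (an illustration, not the chapter's evaluation code) maintains a uniform random sample of size k over a stream of unknown length; the stream and sample size are assumptions.

    import random

    def reservoir_sample(stream, k, seed=0):
        """Return a uniform random sample of k items from a stream in a single pass."""
        rng = random.Random(seed)
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                reservoir.append(item)   # fill the reservoir with the first k items
            else:
                j = rng.randint(0, i)    # item i replaces a reservoir slot with probability k/(i+1)
                if j < k:
                    reservoir[j] = item
        return reservoir

    if __name__ == "__main__":
        sample = reservoir_sample(range(1_000_000), k=10)
        print(sample)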
Fast Imputation: An Algorithmic Formalism
Abstract
Handling missing data has been a challenge for researchers. The foundations of data science adhere to decisive imputation, with correct and definitive values in the dataset post-imputation. Filling the missing values in a dataset should make it functional for information exploration and exploitable for learning data patterns and performing complex tasks such as predictive analysis. The objective of the chapter is to fill missing values intelligently so that the dataset remains equitable and gives correct results for predictions. The chapter presents an inspection of missing value imputation techniques and their usage for time series prediction and pattern analysis, illustrating the deficits that arise in datasets post-imputation due to inefficient imputation techniques and the improper selection of imputation methods, that is, failing to match a dataset's composition to the imputation method, which disrupts its depth-wise and breadth-wise statistics and homogeneity. The current work is a comprehensive investigation of the theoretical underpinnings of the key properties of data homogeneity and its statistics post-imputation. The major contribution of this chapter is a balance factor for the homogeneity and heterogeneity statistics induced post-imputation, incorporating speedy and composed imputation. Since recommender systems are the most suitable platform on which the theoretical findings of this chapter can be practically evaluated, well-known datasets typically used for evaluating recommender systems, of large sizes of 1 M, 2 M, and 10 M ratings, have been exploited for simulations.
Devisha Arunadevi Tiwari
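As a minimal, generic illustration of imputation (not the chapter's balance-factor method), the sketch below fills missing entries in a small rating-style matrix with column means using scikit-learn; the matrix values are made up for the example.

    import numpy as np
    from sklearn.impute import SimpleImputer

    # Hypothetical user-item rating matrix with missing entries (np.nan).
    ratings = np.array([
        [5.0, np.nan, 3.0],
        [4.0, 2.0, np.nan],
        [np.nan, 1.0, 4.0],
        [3.0, 2.0, 5.0],
    ])

    # Mean imputation: replace each missing value with the mean of its column (item).
    imputer = SimpleImputer(strategy="mean")
    filled = imputer.fit_transform(ratings)

    # A simple before/after check of the column statistics affected by imputation.
    print("Column means before:", np.nanmean(ratings, axis=0))
    print("Column means after: ", filled.mean(axis=0))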
A Scientific Perspective on Big Data in Earth Observation
Abstract
The Earth is facing unprecedented climatic, geomorphologic, environmental, and anthropogenic changes, which require global-scale observation and monitoring. The interest lies in a global understanding involving surveillance of large extended areas over long periods of time, using a broad variety of Earth Observation (EO) sensors, complemented by a multitude of ground-based measurements and additional information. The challenge is the exploration of these data and the timely delivery of focused information and knowledge in a simple, understandable format. In this context, the world is witnessing the rise of a new field, EO Big Data, which will revolutionize the perception of our surroundings and provide new insight into and understanding of our planet. The emerging opportunities are amplified by algorithmic and technological breakthroughs successfully demonstrated so far in both the EO and non-EO domains. In addition, open science initiatives are propelling the scientific evolution, as presented during the last edition of the Conference on Big Data from Space. This new paradigm is driving a record level of innovation in terms of exploitation platforms, which enable a virtual, open, and collaborative environment bringing together EO and non-EO data, dedicated software, and ICT resources on the one hand and researchers, engineers, end users, infrastructures, and service providers on the other. With a pool of resources at hand, the collaboration within the EO community is progressing towards a universal framework that provides consistency for learning generalized models from live archives of Big Data.
Corina Vaduva, Michele Iapaolo, Mihai Datcu
Visualizing High-Dimensional Data Using t-Distributed Stochastic Neighbor Embedding Algorithm
Abstract
Data visualization is a powerful tool, widely adopted by organizations for its effectiveness in abstracting the right information and in understanding and interpreting results clearly and easily. The real challenge in any data science exploration is to visualize the data. Visualizing a discrete, categorical data attribute using bar plots or pie charts is one of the effective ways of performing data exploration. However, most datasets have a large number of features; in other words, the data are distributed across a high number of dimensions. Visually exploring such high-dimensional data can become challenging and even practically impossible to do manually. Hence, it is essential to understand how to visualize high-dimensional datasets. t-Distributed stochastic neighbor embedding (t-SNE) is a technique for dimensionality reduction that is explicitly applicable to the visualization of high-dimensional datasets.
Jayesh Soni, Nagarajan Prabakar, Himanshu Upadhyay
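For reference (not code from the chapter), a minimal scikit-learn and matplotlib sketch of the technique on the bundled handwritten digits dataset could look as follows; the perplexity and other hyperparameter values are assumptions.

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_digits
    from sklearn.manifold import TSNE

    # 64-dimensional digit images, reduced to 2 dimensions for plotting.
    X, y = load_digits(return_X_y=True)
    embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

    plt.figure(figsize=(6, 5))
    scatter = plt.scatter(embedding[:, 0], embedding[:, 1], c=y, cmap="tab10", s=8)
    plt.legend(*scatter.legend_elements(), title="digit", loc="best", fontsize="small")
    plt.title("t-SNE embedding of the digits dataset")
    plt.tight_layout()
    plt.show()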
Active and Machine Learning for Earth Observation Image Analysis with Traditional and Innovative Approaches
Abstract
Today we are faced with impressive progress in machine learning and artificial intelligence. This applies not only to autonomous driving for car manufacturers but also to Earth observation, where we need reliable and efficient techniques for the automated analysis and understanding of remote sensing data.
While automated classification of satellite images dates back more than 50 years, many recently published deep learning concepts aim at still more reliable and user-oriented image analysis tools. On the other hand, we should also be continuously interested in innovative data analysis approaches that have not yet reached widespread use.
We demonstrate how established applications and tools for image classification and change detection can profit from advanced information theory together with automated quality control strategies. As a typical example, we deal with the task of coastline detection in satellite images; here, rapid and correct image interpretation is of utmost importance for riskless shipping and accurate event monitoring.
If we combine current machine learning algorithms with new approaches, we can see how current deep learning concepts can still be enhanced. Here, information theory paves the way toward interesting innovative solutions.
The validation of the proposed methods will be demonstrated on two target areas. The first one is the Danube Delta, which is the second largest river delta in Europe and the best preserved one on the continent; since 1991, the Danube Delta has been inscribed on the UNESCO World Heritage List due to its biological uniqueness. The second one is Belgica Bank in the northeast of Greenland, an area of extensive land-locked fast ice that is ideal for monitoring seasonal variations of the ice cover and icebergs.
To analyze these two areas, we selected Synthetic Aperture Radar (SAR) images provided by Sentinel-1, a European twin-satellite mission (Taini G et al., SENTINEL-1 satellite system architecture: design, performances and operations. IEEE International Geoscience and Remote Sensing Symposium, Munich, pp 1722–1725, 2012), which has an observation rate of one image every 6 days in the case of the Danube Delta and of at least two images per day in the case of Belgica Bank.
Corneliu Octavian Dumitru, Gottfried Schwarz, Gabriel Dax, Vlad Andrei, Dongyang Ao, Mihai Datcu
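The chapter's information-theoretic and deep learning methods are not reproduced here, but as a toy illustration of the coastline detection task, the following sketch thresholds a synthetic "SAR-like" intensity image with Otsu's method and traces the land-water boundary. The image, noise level, and region layout are all assumptions; real SAR coastline extraction is far more involved.

    import numpy as np
    from skimage.filters import threshold_otsu
    from skimage.measure import find_contours

    # Synthetic intensity image: bright "land" half, dark "water" half, plus noise.
    rng = np.random.default_rng(0)
    image = np.zeros((200, 200))
    image[:, 100:] = 1.0                       # land region
    image += 0.3 * rng.standard_normal(image.shape)

    # A global Otsu threshold separates land from water; contours of the resulting
    # binary mask approximate the coastline.
    level = threshold_otsu(image)
    coastline = find_contours(image > level, 0.5)
    print(f"Extracted {len(coastline)} contour(s); "
          f"longest has {max(len(c) for c in coastline)} points.")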
Applications in Financial Industry: Use-Case for Fraud Management
Abstract
The current major issues for application development in various domains, including the financial industry and its associated use-cases, concern the ways in which big data can be approached in order to meet real-case constraints. A lot of data analytics tools are already available that can be successfully applied by financial organizations in order to perform specific tasks in their relationships with partners and customers. A challenging use-case of data science in the financial industry is fraud management, in which the design solutions are based on supervised and unsupervised learning in order to avoid the drawbacks of legacy rule-based solutions. Innovative solutions also include anomaly detection in order to efficiently handle new cases that cannot currently be learned with supervised predictive modeling-based approaches.
Sorin Soviany, Cristina Soviany
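As a minimal, generic sketch of the unsupervised anomaly detection idea mentioned above (not the chapter's solution), the following flags unusual transactions in made-up data with scikit-learn's IsolationForest; the feature choices, values, and contamination rate are assumptions.

    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.default_rng(0)

    # Hypothetical transaction features: [amount, hour of day].
    normal = np.column_stack([rng.normal(50, 15, 500), rng.normal(14, 3, 500)])
    suspicious = np.array([[950.0, 3.0], [1200.0, 4.0]])   # unusually large, late-night transactions
    transactions = np.vstack([normal, suspicious])

    # Unsupervised anomaly detection: -1 marks predicted outliers, +1 marks inliers.
    detector = IsolationForest(contamination=0.01, random_state=0).fit(transactions)
    labels = detector.predict(transactions)
    print("Flagged transactions:", transactions[labels == -1])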
Stochastic Analysis for Short- and Long-Term Forecasting of Latin American Country Risk Indexes
Abstract
Given that the evolution of the Emerging Markets Bond Index (EMBI) can have a great impact on investors' decision-making, the forecasted values can give the business plan or the investment portfolio an idea of what the trend of the index will be, based on its historical values. This chapter presents a new method to provide short- and long-term forecasts of EMBI measurements from Latin American countries by using artificial neural networks to model the behavior of the underlying process. Motivated by the risk inherent in decision-making, the algorithm can effectively forecast the time-series data by stochastic analysis of its future behavior using fractional Gaussian noise. The relative advantages and limitations of the algorithm are highlighted by showing the behavior of the roughness of the series (Hurst parameter) in its statistical sense.
Julián Pucheta, Gustavo Alasino, Carlos Salas, Martín Herrera, Cristian Rodriguez Rivero
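The chapter's neural network forecaster is not reproduced here, but as a small illustration of the role of the Hurst parameter mentioned above, the following sketch estimates H for a synthetic series with the aggregated-variance method. The series is plain Gaussian noise and the block sizes are assumptions, so the estimate should come out near 0.5.

    import numpy as np

    def hurst_aggregated_variance(series, block_sizes=(4, 8, 16, 32, 64)):
        """Estimate the Hurst parameter H via the aggregated-variance method."""
        series = np.asarray(series, dtype=float)
        log_m, log_var = [], []
        for m in block_sizes:
            n_blocks = len(series) // m
            # Means of non-overlapping blocks of size m.
            block_means = series[: n_blocks * m].reshape(n_blocks, m).mean(axis=1)
            log_m.append(np.log(m))
            log_var.append(np.log(block_means.var()))
        slope, _ = np.polyfit(log_m, log_var, 1)   # variance of block means scales as m^(2H - 2)
        return 1.0 + slope / 2.0

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        white_noise = rng.standard_normal(10_000)  # uncorrelated noise: H should be close to 0.5
        print(f"Estimated Hurst parameter: {hurst_aggregated_variance(white_noise):.2f}")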
Correction to: Principles of Data Science
Hamid R. Arabnia, Kevin Daimi, Robert Stahlbock, Cristina Soviany, Leonard Heilig, Kai Brüssau
Backmatter
Metadata
Title
Principles of Data Science
Edited by
Dr. Hamid R. Arabnia
Kevin Daimi
Robert Stahlbock
Cristina Soviany
Leonard Heilig
Kai Brüssau
Copyright Year
2020
Electronic ISBN
978-3-030-43981-1
Print ISBN
978-3-030-43980-4
DOI
https://doi.org/10.1007/978-3-030-43981-1