
2021 | Book

Advances in Data Science and Information Engineering

Proceedings from ICDATA 2020 and IKE 2020

Edited by: Robert Stahlbock, Gary M. Weiss, Mahmoud Abou-Nasr, Cheng-Ying Yang, Dr. Hamid R. Arabnia, Leonidas Deligiannidis

Publisher: Springer International Publishing

Book series: Transactions on Computational Science and Computational Intelligence


About this book

The book presents the proceedings of two conferences: the 16th International Conference on Data Science (ICDATA 2020) and the 19th International Conference on Information & Knowledge Engineering (IKE 2020), which took place in Las Vegas, NV, USA, July 27-30, 2020. The conferences are part of the larger 2020 World Congress in Computer Science, Computer Engineering, & Applied Computing (CSCE'20), which features 20 major tracks. Papers cover all aspects of Data Science, Data Mining, Machine Learning, Artificial and Computational Intelligence (ICDATA) and Information Retrieval Systems, Information & Knowledge Engineering, Management and Cyber-Learning (IKE). Authors include academics, researchers, professionals, and students.

Presents the proceedings of the 16th International Conference on Data Science (ICDATA 2020) and the 19th International Conference on Information & Knowledge Engineering (IKE 2020); Includes papers on topics from data mining to machine learning to information retrieval systems; Authors include academics, researchers, professionals and students.

Table of Contents

Frontmatter

Graph Algorithms, Clustering, and Applications

Frontmatter
Phoenix: A Scalable Streaming Hypergraph Analysis Framework

We present Phoenix, a scalable hypergraph analytics framework for data analytics and knowledge discovery, implemented on the leadership-class computing platforms at Oak Ridge National Laboratory (ORNL). Our software framework comprises a distributed implementation of a streaming server architecture that acts as a gateway for various hypergraph generators/external sources to connect. Phoenix can utilize diverse hypergraph generators, including HyGen, a very large-scale hypergraph generator developed by ORNL. Phoenix incorporates specific algorithms for efficient data representation by exploiting hidden structures of the hypergraphs. Our experimental results demonstrate Phoenix's scalable and stable performance on massively parallel computing platforms. Phoenix's superior performance is due to the merging of high-performance computing with data analytics.

Kuldeep Kurte, Neena Imam, S. M. Shamimul Hasan, Ramakrishnan Kannan
Revealing the Relation Between Students’ Reading Notes and Examination Scores with NLP Features

Predicting students’ exam scores has been a popular topic in both educational psychology and data mining for many years. Currently, many researchers devote efforts to predicting exam scores precisely using student behavior data and exercise content data. In this paper, we present the Topic-Based Latent Variable Model (TB-LVM) to predict midterm and final scores from students’ textbook reading notes. We compare the Topic-Based Latent Variable Model with the Two-Step LDA model. For TB-LVM, the standard deviations of the maximum likelihood estimation and the method of moments for the midterm exam are 7.79 and 7.63, respectively. The two standard deviations for the final exam are 8.68 and 7.72, respectively. These results are much better than those of the Two-Step LDA model, which are 14.38 for the midterm exam and 16.55 for the final exam. Finally, we also compare with a knowledge graph embedding method for predicting exam scores.

Zhenyu Pan, Yang Gao, Tingjian Ge
Deep Metric Similarity Clustering

Effective data similarity measures are essential in data clustering. This paper proposes a novel deep metric clustering method with simultaneous non-linear similarity learning and clustering. Unlike pre-defined similarity measures, this deep metric enables more effective clustering of high-dimensional data with various non-linear similarities. In the proposed method, a similarity function is first approximated by a deep metric network. The graph Laplacian matrix is introduced to make data cluster assignments. A stochastic optimization is then proposed to efficiently construct the optimal deep metric network and calculate data similarity and cluster assignments on large-scale data sets. For N data samples, the proposed optimization effectively reduces the computation from N² pairs of data to M² (M ≪ N) pairs at each step of the approximation. A co-training method is further introduced to optimize the deep metric network on a portion of semi-supervised data for clustering with targeted purposes. Finally, this paper shows theoretical connections between the proposed method and spectral clustering in subspace learning. This method achieves ∼20% higher accuracies than the best existing multi-view and subspace clustering methods on the Caltech and MSRCV1 object recognition data sets. Further results on benchmark and real-world visual data show competitive performance of the proposed method relative to the deep subspace clustering network and many related state-of-the-art methods.
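A minimal sketch of the sampling idea described here: train a deep metric network while only ever forming an M × M similarity matrix per step instead of the full N × N one. The architecture, objective, and all data below are illustrative assumptions, not the chapter's code (the paper optimizes a graph-Laplacian clustering objective, which is not reproduced).

```python
# Hypothetical sketch: cut the N^2 pairwise-similarity cost to M^2 per step
# by sampling M << N points. Everything here is synthetic/illustrative.
import torch
import torch.nn as nn

class DeepMetric(nn.Module):
    """Maps raw features to an embedding; similarity = cosine of embeddings."""
    def __init__(self, in_dim, emb_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, emb_dim),
        )

    def forward(self, x):
        return nn.functional.normalize(self.net(x), dim=1)

X = torch.randn(10_000, 64)              # N = 10,000 synthetic samples
model = DeepMetric(in_dim=64)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

M = 256                                   # M^2 instead of N^2 pairs per step
for step in range(100):
    idx = torch.randint(0, X.size(0), (M,))
    z = model(X[idx])
    sim = z @ z.T                         # M x M cosine-similarity matrix
    # Toy objective for illustration only (maximizes average similarity);
    # the paper instead optimizes a graph-Laplacian clustering objective.
    loss = -sim.mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```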

Shuanglu Dai, Pengyu Su, Hong Man
Estimating the Effective Topics of Articles and Journals Abstract Using LDA and K-Means Clustering Algorithm

Analyzing journal and article abstracts with topic modeling and text clustering has become a modern solution to the growing number of text documents. Topic modeling and text clustering are both intensive tasks that can benefit one another, and their algorithms are used to manage massive collections of text documents. In this study, we used LDA, K-means clustering, and the lexical database WordNet for keyphrase extraction from our text documents. The K-means clustering and LDA algorithms achieved the most reliable keyphrase-extraction performance on our text documents. This study will help researchers construct search strings for journals and articles while avoiding misunderstandings.
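A small sketch of the kind of pipeline the abstract names, combining LDA topics with K-means clusters over a handful of invented stand-in abstracts; it is not the authors' implementation.

```python
# Minimal LDA + K-means sketch over toy "abstract" texts.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

docs = [
    "deep learning improves image classification",
    "topic models summarize large document collections",
    "clustering groups similar research abstracts",
]

# LDA works on raw term counts
counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
doc_topics = lda.transform(counts)        # per-document topic mixture

# K-means works better on TF-IDF vectors
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf)
print(doc_topics.round(2), labels)
```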

Shadikur Rahman, Umme Ayman Koana, Aras M. Ismael, Karmand Hussein Abdalla

Data Science, Social Science, Social Media, and Social Networks

Frontmatter
Modelling and Analysis of Network Information Data for Product Purchasing Decisions

Technology has enabled consumers to gain product information from different online platforms such as social networks, online product reviews and other digital media. Large manufacturers and retailers can use this network information to forecast accurately, to manage demand and thereby to improve profit margin, efficiency, etc. This paper proposes a novel framework to model and analyse consumers' purchase decisions for product choices based on information obtained from two different information networks. The model also takes into account variables such as socio-economic and demographic characteristics. We develop a utility-based discrete choice model (DCM) to quantify the effect of consumers' attitudinal factors from two different information networks, namely, a social network and a product information network. The network information modelling and analysis are discussed in detail, taking into account the model complexity, heterogeneity and asymmetry due to the dimension, layer and scale of information in each type of network. The likelihood function, parameter estimation and inference procedures are also derived for the full model. Finally, extensive numerical investigations were carried out to establish the model framework.
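For orientation, here is a hedged sketch of the model family the abstract builds on: a utility-based multinomial logit choice model with its log-likelihood. All data, dimensions, and coefficients are synthetic placeholders, not the paper's specification.

```python
# Illustrative utility-based discrete choice (multinomial logit) sketch.
import numpy as np

rng = np.random.default_rng(0)
n, k, alts = 500, 3, 4                  # consumers, features, product choices
X = rng.normal(size=(n, alts, k))       # attributes incl. network-information scores
beta = np.array([0.8, -0.3, 0.5])       # taste coefficients to be estimated

V = X @ beta                            # deterministic utility of each alternative
P = np.exp(V) / np.exp(V).sum(axis=1, keepdims=True)   # choice probabilities
choice = np.array([rng.choice(alts, p=p) for p in P])  # simulated decisions

# Log-likelihood used for parameter estimation
ll = np.log(P[np.arange(n), choice]).sum()
print(f"log-likelihood at true beta: {ll:.1f}")
```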

Md Asaduzzaman, Uchitha Jayawickrama, Samanthika Gallage
Novel Community Detection and Ranking Approaches for Social Network Analysis

Enterprises are collecting, procuring, storing, and processing increasing quantities of Big Data. This facilitates the detection of new insights that are capable of driving more efficient and effective operations, and it provides management with the ability to steer the business proactively. Identifying the crucial nodes and related communities in a network can help in target marketing. Such analyses utilize the concepts of the shortest path, closeness centrality, and clustering coefficient. In this research, we propose a novel community detection algorithm based on local centrality and node closeness. Exploratory analysis, such as the graphical representation of data to depict an interconnected collection of entities among people, groups, or products, is performed. We also performed network analysis (community detection and ranking algorithms) to analyze the relationships among the entities. The proposed algorithms are applied to multiple datasets to identify hidden patterns. Among the benchmark datasets, the algorithms were implemented on the American College Football, Dolphin Community, Les Miserables, and Karate Club datasets. We were able to predict the next matches, the most popular member of the club, and their relevant connections with high accuracy as compared to the ground truths. Moreover, these algorithms encompass all the features and predict the importance of the community leader, which is a key differentiating factor for the proposed algorithms. Modularity is used as the metric to compare the effectiveness of the proposed methods with state-of-the-art frameworks. The proposed community detection and community ranking algorithms performed well on scale-free networks. We can also identify hidden patterns of friendships on social media and frequent itemsets purchased together using the ranking and community detection methodologies, which can help improve recommendation systems.
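A minimal sketch of modularity-based community detection and closeness-based ranking on the Karate Club graph, one of the benchmark datasets named above; the paper's own local-centrality algorithm is not public, so a standard networkx baseline stands in here.

```python
# Baseline community detection + leader ranking on a benchmark graph.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

G = nx.karate_club_graph()
communities = greedy_modularity_communities(G)
print(f"{len(communities)} communities found")
print(f"modularity: {modularity(G, communities):.3f}")

# Rank members within each community by closeness centrality
closeness = nx.closeness_centrality(G)
for i, com in enumerate(communities):
    leader = max(com, key=closeness.get)
    print(f"community {i}: leader node {leader}")
```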

Pujitha Reddy, Matin Pirouz
How Is Twitter Talking About COVID-19?

The novel coronavirus (COVID-19, caused by the SARS-CoV-2 virus) spread rapidly, both as a pandemic and as a viral topic of conversation. Social networks, especially after the boom of smartphones, completely revolutionised the speed at which, and the channels through which, information spreads. A clear example of this is how fast information spreads across Twitter, a platform famous for creating trends and spreading news. This work focuses on the analysis of the overall opinion of the COVID-19 pandemic on Twitter. We studied the polarity and emotional impressions of the public by applying a series of natural language processing techniques to a total of 270,000 tweets identified as related to COVID-19.
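As a sketch of the kind of polarity analysis described, here is NLTK's VADER sentiment analyzer applied to two invented tweets; the authors' exact NLP stack is not specified in the abstract.

```python
# Hedged sketch: tweet polarity scoring with NLTK VADER.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

tweets = [  # stand-in examples, not the study's data
    "Grateful for healthcare workers fighting COVID-19 every day!",
    "Another lockdown announced. This is exhausting and scary.",
]
for t in tweets:
    scores = sia.polarity_scores(t)      # neg/neu/pos plus compound in [-1, 1]
    print(f"{scores['compound']:+.2f}  {t}")
```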

Jesus L. Llano, Héctor G. Ceballos, Francisco J. Cantú
Detecting Asian Values in Asian News via Machine Learning Text Classification

The study is aimed at developing supervised machine learning models to automatically detect the Asian values of harmony and support in Asian English-language newspaper articles. Harmony has two classes (“harmony” vs. “conflict”), with “harmony” defined as a void of conflict. Support has two classes (“supportive” vs. “critical”), with “supportive” defined as supporting economic, political, and communal unity and strength at home. Nine algorithms, with their parameters tuned, were compared for performance. Results showed that logistic regression is the top performer for both the “harmony” and “support” models, with 93.0% and 91.2% accuracy rates, respectively. Logistic regression models were then deployed through web pages to make classifications of unseen, unlabeled data. The testing of the deployed models demonstrated the utility of the developed models.

Li-jing Arthur Chang

Recommendation Systems, Prediction Methods, and Applications

Frontmatter
The Evaluation of Rating Systems in Online Free-for-All Games

Online competitive games have become increasingly popular. To ensure an exciting and competitive environment, these games routinely attempt to match players with similar skill levels. Matching players is often accomplished through a rating system. There has been an increasing amount of research on developing such rating systems. However, less attention has been given to the evaluation metrics of these systems. In this paper, we present an exhaustive analysis of six metrics for evaluating rating systems in online competitive games. We compare traditional metrics such as accuracy. We then introduce other metrics adapted from the field of information retrieval. We evaluate these metrics against several well-known rating systems on a large real-world dataset of over 100,000 free-for-all matches. Our results show stark differences in their utility. Some metrics do not consider deviations between two ranks. Others are inordinately impacted by new players. Many do not capture the importance of distinguishing between errors in higher ranks and lower ranks. Among all metrics studied, we recommend normalized discounted cumulative gain (NDCG) because not only does it resolve the issues faced by other metrics, but it also offers flexibility to adjust the evaluations based on the goals of the system.
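To make the recommended metric concrete, here is a small NDCG computation for one hypothetical free-for-all match, using a simple reciprocal-rank gain; the paper's exact gain and normalization choices are not reproduced, so treat this as an assumption-laden illustration.

```python
# NDCG sketch for evaluating a rating system on one match.
import numpy as np

def ndcg(true_ranks, predicted_scores):
    """true_ranks: 1 = winner; predicted_scores: higher = predicted better."""
    order = np.argsort(-np.asarray(predicted_scores))   # predicted ranking
    gains = 1.0 / np.asarray(true_ranks)                # simple rank-based gain
    discounts = 1.0 / np.log2(np.arange(2, len(order) + 2))
    dcg = float((gains[order] * discounts).sum())
    idcg = float((np.sort(gains)[::-1] * discounts).sum())
    return dcg / idcg

true_ranks = [3, 1, 4, 2]               # actual match outcome for 4 players
predicted = [0.9, 2.1, 0.3, 1.4]        # rating-system skill estimates
print(f"NDCG = {ndcg(true_ranks, predicted):.3f}")   # 1.0 = perfect ordering
```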

Arman Dehpanah, Muheeb Faizan Ghori, Jonathan Gemmell, Bamshad Mobasher
A Holistic Analytics Approach for Determining Effective Promotional Product Groupings

With companies across industries continually striving to get ahead of the competition, product pricing can be the deciding factor in driving or destroying the margins of a company. Promotional grouping of products is an effective pricing strategy used across multiple industries such as retail, healthcare, and many more. Promotional product groupings, or bundling, can be seen everywhere, from buffets served at restaurants to the suite of products sold together by MS Office. The fact that the component products are readily available means that bundling is one of the most flexible elements of product strategy. However, some caveats come with bundling, most of which stem from inadequate planning. Bundling could lead to the cannibalization of products that are not present in bundles. Furthermore, it could lead to customers not buying the desired product because they would have to buy the other product bundled with it.

The study encapsulates the selection and creation of labels for promotional product groupings for individual SKUs of a consumer product goods company. The groupings are based on historical data of the company's incremental sales and competitors' sales data collected in the same time frame. Currently, product grouping analysis is done manually, which can be compromised by human error and by an individual's unique decision framework that may be biased. A more pertinent issue is that the company may fail to recognize the life of a successful promotion. Failure to do so could lead to stagnant promotional groupings that not only fail to gain traction with customers but also siphon off already existing sales, eventually leading to the company being overtaken by its competitors and losing market share.

In order to develop recommendations for an ideal product grouping strategy, the study initially delves into the existing promotional groupings of the company and compares them with those of its competitors. Detailed competitive analysis provides an idea of the company's success with its past bundling strategy. The study uses machine learning models to identify the drivers of a successful promotion and finally uses optimization to suggest an ideal bundling strategy that maximizes revenue.

Mehul Zawar, Siddharth Harisankar, Xuanming Hu, Rahul Raj, Vinitha Ravindran, Matthew A. Lanham
Hierarchical POI Attention Model for Successive POI Recommendation

The rapid growth of location-based social networks has produced a large number of points of interest (POIs). The POI recommendation task aims to predict users' successive POIs, and it has attracted more and more research interest recently. POI recommendation is achieved based on POI context, which contains a variety of information, including check-in sequence patterns, POIs' textual contents, and temporal characteristics. Existing efforts model only part of this information and lose valuable information from the other aspects.

In this paper, we propose a hierarchical POI attention model (HPAM), which jointly takes advantage of POIs' text contents, temporal characteristics, and sequential patterns. Specifically, HPAM uses a lower-level POI representation layer to explore textual content with a word attention mechanism, and a higher-level contextual sequence layer to depict temporal characteristics with a temporal-level attention mechanism. Experimental results on a public dataset show that HPAM consistently outperforms state-of-the-art methods. Experimental results on HPAM variants demonstrate the effectiveness of the proposed multiple attention mechanisms.

Lishan Li
A Comparison of Important Features for Predicting Polish and Chinese Corporate Bankruptcies

This study generates data mining models to predict corporate bankruptcy in Poland and China, and then examines these models to determine the financial characteristics that are of the greatest predictive value. These financial features are then compared for the two countries. The study finds that while there are some common financial indicators for bankruptcy between the two diverse financial markets, there are also key differences. In particular, asset-related features play a much larger role in predicting bankruptcy in China, while operations-related features play a larger role in predicting bankruptcy in Poland.

Yifan Ren, Gary M. Weiss
Using Matrix Factorization and Evolutionary Strategy to Develop a Latent Factor Recommendation System for an Offline Retailer

Recommendation systems have been developed for online services with a simple product mix. This study extends their application to an offline retailer with a more complex product mix. Purchasing records of two thousand members within one year from an offline retailer in Taiwan were used as the dataset for the study. Datasets of the first 9 months were used for training, and the models were tested on the records of the last 3 months. This study developed a recommendation system by integrating matrix factorization, to uncover latent factors from both customers and items, with an evolutionary program that optimizes the parameter settings of duration adjustment functions; these functions assign weights so that past purchasing records closer to the testing period receive higher weights. The objective of the system is to predict the likelihood of customers' purchasing behavior toward items they never purchased during the training period. Measured by the average percentage ranking of items for the two thousand members, the recommendation system developed in this study outperformed two other approaches, popularity-based and item-based nearest-neighborhood systems. In addition, academic and practical contributions are also discussed.
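A toy sketch of the core idea: weighted matrix factorization where older purchases carry exponentially smaller weight. The decay constant and SGD settings are invented for illustration; the paper tunes its duration adjustment functions with an evolutionary strategy rather than fixing them as done here.

```python
# Weighted implicit-feedback matrix factorization with time-decay weights.
import numpy as np

rng = np.random.default_rng(1)
n_users, n_items, k = 50, 40, 8
# (user, item, months_before_test) purchase triples, synthetic
events = [(rng.integers(n_users), rng.integers(n_items), rng.integers(1, 10))
          for _ in range(500)]

P = rng.normal(scale=0.1, size=(n_users, k))   # user latent factors
Q = rng.normal(scale=0.1, size=(n_items, k))   # item latent factors
lr, reg, decay = 0.05, 0.01, 0.2

for epoch in range(20):
    for u, i, age in events:
        w = np.exp(-decay * age)               # recent purchases weigh more
        err = 1.0 - P[u] @ Q[i]                # implicit feedback target = 1
        P[u] += lr * (w * err * Q[i] - reg * P[u])
        Q[i] += lr * (w * err * P[u] - reg * Q[i])

scores = P @ Q.T                               # predicted purchase affinity
print("top-5 items for user 0:", np.argsort(-scores[0])[:5])
```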

Y. Y. Chang, S. M. Horng, C. L. Chao
Dynamic Pricing for Sports Tickets

This research studies the market demand for sports tickets of a major NFL team and develops a dynamic pricing model for the price of the tickets based on the understanding of the market demand. The authors utilized R together with packages like h2o and ggplot2 to develop predictive models that could reflect future demand of tickets and then developed an optimization strategy based on this model for the use of dynamic pricing. A Tableau dashboard was also created using simulation data from one of the previous games to demonstrate the potential revenue increase of this model.

Ziyun Huang, Wenying Huang, Wei-Cheng Chen, Matthew A. Lanham
Virtual Machine Performance Prediction Based on Transfer Learning of Bayesian Network

The performance of virtual machines in the cloud is fraught with uncertainty due to the complexity of the environment, which poses a challenge for the accurate prediction of virtual machine performance. In addition, a well-performing virtual machine performance prediction model cannot be reused in either the temporal or the spatial dimension. In this paper, we build a virtual machine performance prediction model based on a Bayesian network to solve the problem of accurately predicting virtual machine performance. Furthermore, to achieve reuse of the performance prediction model in both temporal and spatial dimensions, we propose a Bayesian network transfer learning approach. Experiments show that with our transfer learning approach, in contrast with reconstructing the model, the amount of data in the training set was reduced by 90% and the training time was reduced by 75%, while the macro-average precision was maintained at 79%.

Wang Bobo
A Personalized Recommender System Using Real-Time Search Data Integrated with Historical Data

With companies focusing intensively on customer experience, personalization and platform usability have become crucial for a company's success. Hence, providing appropriate recommendations to users is a challenging problem in various industries. We work toward enhancing the recommendation system of a timeshare exchange platform by leveraging real-time search data. Previously, the recommendation model utilized historical data to recommend resorts to users and was deployed online once a day. The limitation of this model was that it did not consider the real-time searches of the user, hence losing context. This directly impacted the click-through rate of the recommendations, and users had to navigate the website excessively to find a satisfactory resort. We build a model that utilizes not only historical transactional and master data but also real-time search data to provide multiple relevant resort recommendations within 5 seconds.

Hemanya Tyagi, Mohinder Pal Goyal, Robin Jindal, Matthew A. Lanham, Dibyamshu Shrestha
Automated Prediction of Voter’s Party Affiliation Using AI

The goal of this research is to develop the foundation of a cross-platform app, Litics360, that helps political election campaigns (PECs) utilize data-driven methods and high-performance prediction models to align candidates with voters who share similar socio-political ambitions. To attain this goal, the first step is to determine a voter's currently aligned political party affiliation, based primarily on historical records of their turnout at previous elections and basic demographic information. This chapter aims to solve this first step by comparing varied performance measures to find a reliable prediction model among learning algorithms, including decision tree, random forest, and gradient boosting machine (XGBoost) binary classifiers. Significant correlations between independent variables and the target prediction class, i.e., the voter's registered party affiliation, contribute toward the development of an automated predictive ML model. The Ohio Secretary of State public voter database was used to collect voter demographics and election turnout data, which were then prepared using preprocessing methods and finally used to identify the best performing ML model. Hyperparameter grid search with XGBoost proved to be the superior binary logistic classifier, producing a nearly perfect model. Tracking the alignment between voters and PEC candidates is the proposed future of Litics360, i.e., to develop an application that promotes a healthy and transparent platform for voters to communicate their socio-political grievances to PECs, enabling efficient appropriation of a PEC's funds and resources to engineer successful marketing campaigns.
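A minimal sketch of the winning setup named above, an XGBoost binary classifier tuned by grid search; the features here are random stand-ins for voter demographics and turnout history, not the Ohio voter data.

```python
# XGBoost + hyperparameter grid search on synthetic voter-like features.
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 6))            # e.g., age, turnout counts, etc.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

grid = GridSearchCV(
    XGBClassifier(objective="binary:logistic", eval_metric="logloss"),
    param_grid={"max_depth": [3, 5], "n_estimators": [100, 300],
                "learning_rate": [0.05, 0.1]},
    cv=3,
)
grid.fit(X_tr, y_tr)
print(grid.best_params_, f"test accuracy: {grid.score(X_te, y_te):.3f}")
```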

Sabiha Mahmud Sumi

Data Science, Deep Learning, and CNN

Frontmatter
Deep Ensemble Learning for Early-Stage Churn Management in Subscription-Based Business

Churn prediction provides the opportunity to improve customer retention via early intervention. Previous research on churn prediction focused on two types of methods, classification and survival analysis. The comparison and combination of algorithms in these two types have not been fully explored. In this paper, we explore two stacking models to combine predictive capabilities of XGBoost, RNN-DNN classifiers, and survival analysis. We first apply a standard stacking model, where predictions from base learners are fed into a meta-classifier. Furthermore, we propose a novel ensemble model, Deep Stacking, that integrates neural networks with other models. We evaluate the stacking models for early-stage churn prediction at Ancestry®, the global leader in family history and consumer genomics, with metrics dictated by business needs.
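For reference, a hedged sketch of the "standard stacking" baseline described here: base learners' predictions feed a meta-classifier. XGBoost and a small neural network stand in for the paper's base models; the survival-analysis component and the proposed Deep Stacking model are not reproduced.

```python
# Standard stacking: base learners feed a logistic-regression meta-classifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("xgb", XGBClassifier(eval_metric="logloss")),
        ("nn", MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)),
    ],
    final_estimator=LogisticRegression(),  # meta-classifier over base outputs
    cv=5,
)
stack.fit(X, y)
print(f"training accuracy: {stack.score(X, y):.3f}")
```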

Sijia Zhang, Peng Jiang, Azadeh Moghtaderi, Alexander Liss
Extending Micromobility Deployments: A Concept and Local Case Study

Micromobility is a recent phenomenon that refers to the use of small human- or electric-powered vehicles such as scooters and bikes to travel short distances, and sometimes to connect with other modes of transportation such as bus, train, or car. Deployments in major cities of the world have been both successful and challenging. This paper reviews the evolution of micromobility services, from shared bicycles to dockless systems and shared electric scooters. The authors evaluated benefits, deficiencies, and factors in adoption to inform more rigorous and extensive geospatial analysis that will examine intersections with land use, public transit, socio-economic demographics, road networks, and traffic. This work conducted exploratory spatial data analysis and correlation of publicly available datasets on land use, trip production, traffic, and travel behavior. Data from Washington D.C. served as a case study of best practices for scaling deployments to meet the social, economic, and mobility needs of the city.

Zhila Dehdari Ebrahimi, Raj Bridgelall, Mohsen Momenitabar
Real-Time Spatiotemporal Air Pollution Prediction with Deep Convolutional LSTM Through Satellite Image Analysis

Air pollution is responsible for the early deaths of seven million people every year worldwide. The first and most important step in mitigating air pollution risks is to understand it, discover the patterns and sources, and predict it in advance. Real-time air pollution prediction requires a highly complex model that can solve this spatiotemporal problem in multiple dimensions. Using a combination of spatial predictive models (deep convolutional neural networks) and temporal predictive models (deep long short-term memory), we utilized the convolutional LSTM structure, which learns correlations between various points in location and time. We created a sequential encoder-decoder network that allows for accurate air pollution prediction 10 days in advance using the previous 10 days of data for the county of Los Angeles on a nitrogen dioxide metric. Through a 5D tensor reformatting of air quality satellite image data, we provide a prediction for nitrogen dioxide in various areas of Los Angeles over various time periods.
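A hedged sketch of a ConvLSTM encoder-decoder of the kind named here, with toy grid sizes standing in for the paper's 5D satellite-image tensors; it shows the shape of the approach rather than the authors' network.

```python
# ConvLSTM encoder-decoder sketch for spatiotemporal sequence prediction.
import numpy as np
from tensorflow.keras import layers, models

T, H, W, C = 10, 32, 32, 1               # 10 past days of 32x32 NO2 grids (toy)
model = models.Sequential([
    layers.Input(shape=(T, H, W, C)),
    layers.ConvLSTM2D(16, kernel_size=3, padding="same",
                      return_sequences=True),   # encoder over time
    layers.BatchNormalization(),
    layers.ConvLSTM2D(16, kernel_size=3, padding="same",
                      return_sequences=True),   # decoder emitting 10 frames
    layers.Conv3D(1, kernel_size=3, padding="same", activation="linear"),
])
model.compile(optimizer="adam", loss="mse")

X = np.random.rand(8, T, H, W, C).astype("float32")   # synthetic sequences
Y = np.random.rand(8, T, H, W, C).astype("float32")   # next-10-day targets
model.fit(X, Y, epochs=1, verbose=0)
print(model.output_shape)                 # (None, 10, 32, 32, 1)
```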

Pratyush Muthukumar, Emmanuel Cocom, Jeanne Holm, Dawn Comer, Anthony Lyons, Irene Burga, Christa Hasenkopf, Mohammad Pourhomayoun
Performance Analysis of Deep Neural Maps

Deep neural maps are unsupervised learning and visualization methods that combine autoencoders with self-organizing maps. An autoencoder is a deep artificial neural network that is widely used for dimension reduction and feature extraction in machine learning tasks. The self-organizing map is a neural network for unsupervised learning often used for clustering and the representation of high-dimensional data on a 2D grid. Deep neural maps have shown improvements in performance compared to standalone self-organizing maps on clustering tasks. The key idea is that a deep neural map outperforms a standalone self-organizing map in two respects: (1) better convergence behavior, by removing noisy/superfluous dimensions from the input data, and (2) faster training, because the cluster detection part of the DNM deals with a lower-dimensional latent space. Traditionally, only the basic autoencoder has been considered for use in deep neural maps. However, many different kinds of autoencoders exist, such as the convolutional and the denoising autoencoder, and here we examine the effects of various autoencoders on the performance of the resulting deep neural maps. We investigate five types of autoencoders as part of our deep neural maps using three different data sets. Overall, we show that deep neural maps perform better than standalone self-organizing maps both in terms of improved convergence behavior and faster training. Additionally, we show that deep neural maps using the basic autoencoder outperform deep neural maps based on other autoencoders on non-image data. To our surprise, we found that deep neural maps based on contractive autoencoders outperformed deep neural maps based on convolutional autoencoders on image data.
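A minimal deep-neural-map-style sketch under stated assumptions: a basic autoencoder compresses the data, then a self-organizing map (the third-party `minisom` package here, not the authors' SOM) clusters the latent codes. Sizes and epochs are illustrative.

```python
# Basic autoencoder + SOM over the latent space, as a DNM-style sketch.
import numpy as np
from tensorflow.keras import layers, models
from minisom import MiniSom

X = np.random.rand(1000, 64).astype("float32")      # synthetic high-dim data

# Basic autoencoder: 64 -> 8 -> 64
enc = models.Sequential([layers.Input(shape=(64,)),
                         layers.Dense(32, activation="relu"),
                         layers.Dense(8)])
dec = models.Sequential([layers.Input(shape=(8,)),
                         layers.Dense(32, activation="relu"),
                         layers.Dense(64)])
ae = models.Sequential([enc, dec])
ae.compile(optimizer="adam", loss="mse")
ae.fit(X, X, epochs=5, verbose=0)

# SOM over the 8-dim latent space instead of the raw 64-dim input
Z = enc.predict(X, verbose=0)
som = MiniSom(10, 10, input_len=8, sigma=1.0, learning_rate=0.5)
som.train_random(Z, 1000)
print("BMU of first sample:", som.winner(Z[0]))
```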

Boren Zheng, Lutz Hamel
Implicit Dedupe Learning Method on Contextual Data Quality Problems

A variety of applications, such as information extraction, data mining, e-learning, and web applications, use heterogeneous and distributed data. As a result, the usage of such data is challenged by deduplication issues. To address this issue, the present study proposes a novel dedupe learning method (DLM) and other algorithms to detect and correct contextual data quality anomalies. The method was created and implemented on structured data. Our methods succeeded in identifying and correcting more data anomalies than current taxonomy techniques. Consequently, these proposed methods would be important in detecting and correcting errors in broad contextual data (big data).

Alladoumbaye Ngueilbaye, Hongzhi Wang, Daouda Ahmat Mahamat, Roland Madadjim
Deep Learning Approach to Extract Geometric Features of Bacterial Cells in Biofilms

We develop a deep learning approach to estimate the geometric characteristics of bacterial biofilms grown on metal surfaces. Specifically, we focus on sulfate-reducing bacteria (SRB) which are widely implicated in microbiologically influenced corrosion (MIC) of metals, costing billions of dollars annually. Understanding the growth characteristics of biofilms is important for designing and developing protective coatings that effectively combat MIC. Our goal here is to automate the extraction of the shape and size characteristics of SRB bacterial cells from the scanning electron microscope (SEM) images of a biofilm generated at various growth stages. Typically, these geometric features are measured using laborious and manual methods. To automate this process, we use a combination of image processing and deep learning approaches to determine the geometric properties. This is done via image segmentation of SEM images using deep convolutional neural networks. To address the challenges associated with detection of individual cells that form clusters, we apply a modified watershed algorithm to separate the cells from the cluster. Finally, we estimate the number of cells as well as their length and width distributions.

Md Hafizur Rahman, Jamison Duckworth, Shankarachary Ragi, Parvathi Chundi, Venkata R. Gadhamshetty, Govinda Chilkoor
GFDLECG: PAC Classification for ECG Signals Using Gradient Features and Deep Learning

ECG signal classification is a popular topic in healthcare for arrhythmia detection. Recently, ECG signal analysis using supervised learning has been investigated with the goal of helping physicians automatically identify premature atrial complex (PAC) heartbeats. PAC may be a sign of underlying heart conditions and may progress to supraventricular tachycardia, which increases the possibility of sudden death. In this paper, we propose a data-driven approach, GFDLECG, which is based on ECG behavior, to detect abnormal beats. We extract further features from the ECG using a gradient feature generation algorithm. We also build the classification model by utilizing gated recurrent units (GRUs) and residual fully convolutional networks with GRU to learn long short-term patterns of ECG behavior.
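A toy sketch of the recurrent component named here, a GRU-based classifier over heartbeat windows; the full GFDLECG model also adds gradient features and residual fully convolutional layers, which are omitted, and the data below is random.

```python
# GRU sequence classifier sketch for normal-vs-PAC heartbeat windows.
import numpy as np
from tensorflow.keras import layers, models

L = 180                                   # samples per heartbeat window (toy)
model = models.Sequential([
    layers.Input(shape=(L, 1)),
    layers.GRU(32),                       # long short-term pattern encoder
    layers.Dense(1, activation="sigmoid") # normal vs. PAC beat
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

X = np.random.randn(256, L, 1).astype("float32")   # synthetic ECG windows
y = np.random.randint(0, 2, size=(256, 1))
model.fit(X, y, epochs=1, verbose=0)
```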

Hashim Abu-gellban, Long Nguyen, Fang Jin
Tornado Storm Data Synthesization Using Deep Convolutional Generative Adversarial Network

Predicting violent storms and dangerous weather conditions with current models can take a long time due to the immense complexity associated with weather simulation. Machine learning has the potential to classify tornadic weather patterns much more rapidly, thus allowing for more timely alerts to the public. A challenge in applying machine learning to tornado prediction is the imbalance between tornadic data and non-tornadic data. To obtain more balanced data, in this work we created a new data synthesization system that augments tornado storm data by implementing a deep convolutional generative adversarial network (DCGAN), and we qualitatively compare its output to natural data.

Carlos A. Barajas, Matthias K. Gobbert, Jianwu Wang
Integrated Plant Growth and Disease Monitoring with IoT and Deep Learning Technology

At present, the time, labor, and inaccuracies in plant and seedling care make feasibility a major concern in large-scale agricultural operations. Developments in Internet of Things (IoT) technology and image classification by deep learning have made it possible to monitor various aspects of plant conditions, but an integrated solution that combines IoT sensor data, high-resolution imagery, and manual intervention data in a synchronized time-series database environment has not yet been brought to market. In this paper, we propose such an integrated solution. The overall system architecture is outlined, as well as the individual components including sensors, drone imagery, image processing, database framework, and alerting mechanism. These components are brought together and synchronized in a time-series database. By synchronizing all the variables, this solution presents a comprehensive view and better means for intervention. Finally, opportunities for research and specific component improvements are identified.

Jonathan Fowler, Soheyla Amirian

Data Analytics, Mining, Machine Learning, Information Retrieval, and Applications

Frontmatter
Meta-Learning for Industrial System Monitoring via Multi-Objective Optimization

The complexity of data analysis systems utilizing machine learning for industrial processes necessitates going beyond model selection and learning over the entire system space. It posits learning over the various algorithms considered for featurization and feature-based learning, over the interplay of these algorithms, and over the model space for each algorithm. This problem, often referred to as meta-learning, has not been addressed in industrial monitoring. In this paper, motivated by a real-world problem of quantifying actuations of an industrial robot, we address meta-learning. Our contribution generalizes beyond the specific application; we propose a Pareto-based, multi-objective optimization approach that can be easily generalized to system diagnostics. A detailed evaluation compares solutions of this approach with those of other existing approaches and shows the proposed solutions to be superior in distinguishing movements of a robot from recorded acoustic signals.
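A small illustration of the Pareto idea behind such a search: keep only non-dominated pipeline configurations when scoring them on two objectives. The objectives and data are invented; the paper's actual objectives and search procedure are not reproduced.

```python
# Non-dominated (Pareto-front) filter over candidate configurations.
import numpy as np

def pareto_front(points):
    """points[i] = (objective1, objective2), both to maximize;
    returns indices of non-dominated rows."""
    keep = []
    for i, p in enumerate(points):
        dominated = any(np.all(q >= p) and np.any(q > p) for q in points)
        if not dominated:
            keep.append(i)
    return keep

rng = np.random.default_rng(3)
configs = rng.random((20, 2))            # 20 candidate pipeline configurations
print("non-dominated configs:", pareto_front(configs))
```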

Parastoo Kamranfar, Jeff Bynum, David Lattanzi, Amarda Shehu
Leveraging Insights from “Buy-Online Pickup-in-Store” Data to Improve On-Shelf Availability

This research provides insights on how to leverage buy-online pickup-in-store data to understand customer preferences, demand patterns, and when products go out of stock (OOS) to improve replenishment decisions for grocery chains. The motivation for this study is to reduce lost sales opportunities by improving on-shelf availability (OSA), subsequently improving the overall revenue and profitability of the retail store. In collaboration with a national grocery chain having over 240 stores in the USA, our team developed and assessed different predictive models to improve on-shelf availability rate. The solution uses various product categories based on the grocery business segments, and then specific predictive models are implemented to predict stockouts for each category. While some research has been performed in this area, our work is novel in how OOS data from brick-and-click is utilized to advise the grocery stores on timely replenishment of stock to reduce overall lost sales. This research aims to evaluate and compare multiple classification algorithms for predicting OOS at a store-product level. Subsequently, the study performed an in-depth analysis to ascertain which business segments rendered better prediction accuracy.

Sushree S. Patra, Pranav Saboo, Sachin U. Arakeri, Shantam D. Mogali, Zaid Ahmed, Matthew A. Lanham
Analyzing the Impact of Foursquare and Streetlight Data with Human Demographics on Future Crime Prediction

Finding the factors contributing to criminal activities and their consequences is essential to improve quantitative crime research. To respond to this concern, we examine an extensive set of features from different perspectives and explanations. Our study aims to build data-driven models for predicting future crime occurrences. In this paper, we propose the use of streetlight infrastructure and Foursquare data along with demographic characteristics for improving future crime incident prediction. We evaluate the classification performance based on various feature combinations as well as with the baseline model. Our proposed model was tested on each smallest geographic region in Halifax, Canada. Our findings demonstrate the effectiveness of integrating diverse sources of data to gain satisfactory classification performance.

Fateha Khanam Bappee, Lucas May Petry, Amilcar Soares, Stan Matwin
Nested Named Sets in Information Retrieval

An important problem for databases is improving the efficiency of human-database interaction. This goal can be achieved on different levels, where queries form the top level and data structures create the basic level, which has an essential impact on all higher levels. The main efforts in this area have been directed at the development of logical systems for database management, which became a leading application of logic within computer science. However, the efficiency of human-database interaction depends not only on logic and the corresponding query languages but, to a great extent, on the structuration of data and databases. The purpose of this chapter is to develop novel mathematical tools for data structuration and describe their applications to data visualization and querying. To achieve this goal, we utilize the theory of named sets, introducing nested named sets, studying their properties, and applying these results to the development of efficient tools for database management.

Mark Burgin, H. Paul Zellweger
Obstacle Detection via Air Disturbance in Autonomous Quadcopters

Autonomous drones can detect and avoid walls as the technology stands today, but they can only do this with camera, ultrasonic, or laser sensors. This paper highlights how data mining classification techniques can be used to predict which side of a drone an object is located on, from the air disturbance created by the drone flying near such an object. Data was collected from the drone's IMU while it flew near a wall to its immediate left, right, and front. The IMU data includes gyroscope, accelerometer, roll, pitch, and yaw readings. Position and barometer data were also collected. The data was then fed to NearestNeighbor, GradientBoosting, and RandomForest classifiers.
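A sketch of the classification setup described: IMU feature vectors labeled by wall side, fed to the three classifier families named above. The readings here are random placeholders, so the scores are meaningless beyond showing the pipeline.

```python
# Compare the three classifier families on synthetic IMU-like features.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(7)
X = rng.normal(size=(600, 9))        # gyro, accel, roll/pitch/yaw features
y = rng.integers(0, 3, size=600)     # wall to the left, right, or front

for clf in (KNeighborsClassifier(),
            GradientBoostingClassifier(),
            RandomForestClassifier()):
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{type(clf).__name__}: {acc:.2f}")
```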

Jason Hughes, Damian Lyons
Comprehensive Performance Comparison Between Flink and Spark Streaming for Real-Time Health Score Service in Manufacturing

We investigated two powerful stream processing engines, Apache Flink and Apache Spark Streaming, for a real-time health score service from the perspective of computational performance. A health score is an important index because it represents the machine's lifetime, a significant measurement for detecting machine failures. Many services have attempted to adopt stream processing engines to compute health scores in real time, but there is limited literature studying streaming engines' performance for computing health scores in terms of computational resources. In this chapter, we extensively studied the two main open-source streaming projects, Apache Flink and Apache Spark Streaming, for obtaining health scores. To obtain the health scores, we equipped each streaming engine with deep-learning models and evaluated the model calculation time for datasets arriving at millisecond intervals. In particular, we tested our service with a dataset consisting of 10,000 assets to verify that the two streaming engines could process data at large scale, demonstrating the processing of massive sensor data records in real time. We anticipate that our study will serve as a good reference for selecting streaming engines for computing health scores.

Seungchul Lee, Donghwan Kim, Daeyoung Kim
Discovery of Urban Mobility Patterns

The detection of mobility patterns is crucial for the development of urban planning policies and the design of business strategies. Some of the proposed approaches to carry out this task use surveys, registration data from social networks, or mobile phone data. Although it is possible to infer through the latter the place of residence of clients and estimate the probability of visiting a store, it cannot be guaranteed, based on this information, that a purchase was actually made. This paper develops a proposal for the discovery of urban mobility patterns by adapting the trade areas approach using bank transaction data. The main advantages of our approach are the estimation of the probability of purchasing in a shop, showing the importance of taking into account the business category, and the inclusion of individuals of all social classes. Likewise, different metrics were used to determine commercial attractiveness according to the category to which a business belongs. Some of the real-world benefits of this work include, but are not limited to, serving as a tool to facilitate business decision-making, such as the location of a new retail store or the design of marketing campaigns.

Iván Darío Peñaranda Arenas, Hugo Alatrista-Salas, Miguel Núñez-del-Prado Cortez
Improving Model Accuracy with Probability Scoring Machine Learning Models

Binary classification problems are exceedingly common across corporations, regardless of industry, with examples including predicting attrition or classifying patients as high-risk vs. low-risk. The motivation for this research is to determine techniques that improve prediction accuracy for operationalized models. Collaborating with a national partner, we conducted feature experiments to isolate the industry-agnostic factors with the most significant impact on conversion rate. We also used probability scoring to highlight incremental changes in accuracy while applying several improvement techniques, to determine which would significantly increase a model's predictive power. We compared five algorithms: XGBoost, LGBoost, CatBoost, MLP, and an ensemble. Our results highlight the superior accuracy of the ensemble, with a final log loss value of 0.5784. We also note that the greatest improvements in log loss occur at the beginning of the process, after downsampling and using engineered custom metrics as inputs to the models.
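A quick illustration of probability scoring with log loss, the metric tracked here: better-calibrated probabilities lower the score. The values below are made up for demonstration.

```python
# Log loss rewards well-calibrated predicted probabilities.
from sklearn.metrics import log_loss

y_true = [0, 1, 1, 0, 1]
overconfident = [0.01, 0.99, 0.40, 0.30, 0.55]
calibrated    = [0.10, 0.90, 0.60, 0.20, 0.70]

print(f"overconfident model: {log_loss(y_true, overconfident):.4f}")
print(f"calibrated model:    {log_loss(y_true, calibrated):.4f}")
```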

Juily Vasandani, Saumya Bharti, Deepankar Singh, Shreeansh Priyadarshi
Ensemble Learning for Early Identification of Students at Risk from Online Learning Platforms

Online learning platforms have made knowledge easily and readily accessible, yet the ratio of students withdrawing from or failing a course is relatively high compared to in-class learning, as students do not get enough attention from the instructors. We propose an ensemble learning framework for the early identification of students who are at risk of dropping or failing a course. The framework fuses student demographics, assessment results, and daily activities as the total learning statistics and considers the slicing of data by timestamp. A stacking ensemble classifier is then built upon eight base machine learning classification algorithms. Results show that the proposed model outperforms the base classifiers. The framework enables the early identification of possible failures halfway through a course with 85% accuracy; with full data incorporated, an accuracy of 94.5% is achieved. The framework shows great promise for instructors and online platforms to design interventions before it is too late to help students pass their courses.

Li Yu, Tongan Cai
An Improved Oversampling Method Based on Neighborhood Kernel Density Estimation for Imbalanced Emotion Dataset

The classification of imbalanced datasets is one of the main research topics. An imbalanced dataset, in which the majority class outnumbers the minority class, is more difficult to handle than a balanced dataset. The ADASYN approach tries to solve this problem by generating more minority class samples around the few samples near the border between the two classes. However, it is difficult to expect good classification with ADASYN when the imbalanced dataset contains noise samples, rather than real minority class samples, around the border. In this study, to overcome this problem, a new oversampling approach uses kernel density estimation to model the probability that a minority class sample belongs to the danger set rather than being a noise sample. The proposed method generates appropriate synthetic samples to train the learning model well on minority class samples. Experiments are performed on an ECG dataset collected for emotion classification. The experimental results show that our method improves the overall classification accuracy as well as the recall rate for the minority class.
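A sketch of KDE-based oversampling in the spirit of this abstract: fit a kernel density estimate to the minority class and draw synthetic points from it. The border/danger-set weighting the paper adds is omitted, and the data is synthetic.

```python
# KDE oversampling sketch for an imbalanced minority class.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
minority = rng.normal(loc=[2, 2], scale=0.5, size=(30, 2))   # few samples

kde = gaussian_kde(minority.T)           # density over minority class
synthetic = kde.resample(200, seed=1).T  # 200 new synthetic minority samples
print(synthetic.shape)                   # (200, 2)
```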

Gague Kim, Seungeun Jung, Jiyoun Lim, Kyoung Ju Noh, Hyuntae Jeong
Time Series Modelling Strategies for Road Traffic Accident and Injury Data: A Case Study

The paper aims to provide insights into choosing suitable time series models for analysing road traffic accidents and injuries, taking road traffic accident (RTA) and injury (RTI) data in Oman as a case study, as the country faces one of the highest numbers of road accidents per year. Data from January 2000 to June 2019 were gathered from several secondary sources. Time series decomposition and stationarity and seasonality checks were performed to identify appropriate models for RTA and RTI. Comparing many different models, SARIMA(3,1,1)(2,0,0)₁₂ and SARIMA(0,1,1)(1,0,2)₁₂ models were found to be the best for the road traffic accident and injury data, respectively. AIC, BIC and other error values were used to choose the best model. Model diagnostics were also performed to confirm the statistical assumptions, and 2-year forecasting was carried out. The analyses in this paper will help government departments, academic researchers and decision-makers generate policies to reduce accidents and injuries.
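A hedged sketch of fitting the selected model class with statsmodels: a SARIMA(3,1,1)(2,0,0)₁₂ on a synthetic monthly series standing in for the Oman RTA data, followed by the paper's 2-year forecast horizon.

```python
# SARIMA fit + 24-month forecast on a synthetic monthly series.
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(0)
idx = pd.date_range("2000-01", periods=234, freq="MS")   # Jan 2000 - Jun 2019
y = pd.Series(1000 + 50 * np.sin(np.arange(234) * 2 * np.pi / 12)
              + rng.normal(scale=30, size=234), index=idx)

model = SARIMAX(y, order=(3, 1, 1), seasonal_order=(2, 0, 0, 12))
res = model.fit(disp=False)
print(f"AIC: {res.aic:.1f}")
forecast = res.forecast(steps=24)        # 2-year forecast as in the paper
print(forecast.head())
```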

Ghanim Al-Hasani, Md. Asaduzzaman, Abdel-Hamid Soliman
Toward a Reference Model for Artificial Intelligence Supporting Big Data Analysis

This publication introduces the reference model AI2VIS4BigData for the application domains of Big Data analysis, AI, and visualization. Without a reference model, developing a software system and other scientific and industrial activities in this topic field lack a common specification and a common basis for discussion, and thus pose a high risk of inefficiency, reinventing the wheel, and solving problems that have already been solved elsewhere. To prevent these disadvantages, this publication systematically derives the reference model AI2VIS4BigData with special focus on use cases where Big Data analysis, artificial intelligence (AI), and visualization mutually support each other: AI-powered algorithms empower data scientists to analyze Big Data and thereby exploit its full potential. Big Data enables AI specialists to comfortably design, validate, and deploy AI models. In addition, AI's algorithms and methods offer the opportunity to make Big Data exploration more efficient for both the involved users and the computing and storage resources. Visualization of data, algorithms, and processing steps improves comprehension and lowers entry barriers for all user stereotypes involved in these use cases.

Thoralf Reis, Marco X. Bornschlegl, Matthias L. Hemmje
Improving Physician Decision-Making and Patient Outcomes Using Analytics: A Case Study with the World’s Leading Knee Replacement Surgeon

Every year in the United States more than 300,000 knee replacements are performed. According to Time magazine, this number is expected to increase by 525% by the year 2030. Although knee surgeries are a highly effective treatment, patients are still prone to post-surgery complications, which patients, physicians, and insurance companies all hope to minimize. In collaboration with one of the world's leading knee replacement surgeons, we address this problem by combining their domain expertise with our analytics capabilities. We show how analysis of unstructured data together with patient demographics, patient health data, and insurance codes can better support physicians in the diagnosis phase by assessing a patient's risk of developing complications or the risk of total knee replacement surgery failure. We identified the factors that led to successful knee surgeries (minimal complications and visits) by utilizing various classification algorithms such as random forest and logistic regression. We use these predictive models to provide a recommender system that supports the interests of the patient, the hospital, and the insurance company, helping find the right balance of post-operative patient success and total post-operative treatment costs to minimize the rate of relapse and additional physician visits.

In the recent past, various studies have been carried out to predict the outcomes of total knee replacement surgeries, but most, if not all, of these studies have used similar parameters, like the pain score or functional score of knees, to characterize surgeries as a failure or success. In our study, we have created a new parameter based on three different conditions (number of post-op visits, direct complications from ICD codes for total knee replacement surgery complications, and whether a revision surgery was carried out). We show that factors such as BMI, smoking, blood pressure, and age were statistically significant parameters for a surgery outcome. The surgeon performing the surgery was also a significant factor determining the outcome, which could be due to the different techniques used by different surgeons.

Our model could save millions of dollars per year by detecting two-thirds of the actual complications that would occur. We believe healthcare providers and consulting firms who are developing analytics-driven solutions for their clients in the healthcare industry will find our study novel and inspiring.

Anish Pahwa, Shikhar Jamuar, Varun Kumar Singh, Matthew A. Lanham
Optimizing Network Intrusion Detection Using Machine Learning

Machine learning (ML) techniques are essential for detecting network attacks and enhancing network security. An intrusion detection system (IDS) is a device or software that recognizes any unusual pattern in the network and alerts the user. In this chapter, we describe the use of ML classification algorithms on the UNSW-NB15 dataset, leading to a network intrusion detection model that classifies incoming traffic as malicious or non-malicious and issues an alert to the user. We implemented the following ML algorithms – support vector machine, artificial neural network, and one-class support vector machine – with average accuracies of 89.25%, 91.54%, and 93.05%, respectively. Two graphical user interfaces (online and offline versions) have been developed for the system. Thus, the chapter proposes an optimized intrusion detection system that improves upon existing intrusion detection systems for detecting malicious packets in the network.
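A minimal sketch of the best-performing model type reported here, a one-class SVM trained on normal traffic only; the features are random stand-ins for UNSW-NB15 flow statistics, not the actual dataset.

```python
# One-class SVM anomaly detector trained on benign traffic only.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(500, 10))        # benign flow features
attack = rng.normal(4, 1, size=(50, 10))         # anomalous flows

ocsvm = OneClassSVM(kernel="rbf", nu=0.05).fit(normal)
pred = ocsvm.predict(np.vstack([normal[:5], attack[:5]]))
print(pred)   # +1 = classified benign, -1 = flagged as intrusion
```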

Sara Nayak, Anushka Atul Patil, Reethika Renganathan, K. Lakshmisudha
Hyperparameter Optimization Algorithms for Gaussian Process Regression of Brain Tissue Compressive Stress

Traumatic brain injury (TBI) is modeled using in vitro mechanical testing on excised brain tissue samples. While such testing is essential for understanding the mechanics of TBI, the results can vary by orders of magnitude due to the varying testing condition protocols. Gaussian process regression (GPR) provides good predictive accuracy of the compressive stress state. Here, the efficacy of different search algorithms in optimizing GPR hyperparameters was evaluated. Bayesian optimization, grid search, and random search were compared. Grid search reached the minimum objective function in fewer iterations, and the final regression model was comparable to that of Bayesian optimization in terms of RMSE and log likelihood in the prediction of compressive stress.
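As a sketch of one of the compared strategies, here is a grid search over GPR hyperparameters (RBF kernel length scale and noise level alpha) on synthetic stress-strain-like data; the study's actual objective function, search spaces, and Bayesian/random search runs are not reproduced.

```python
# Grid search over GaussianProcessRegressor hyperparameters (toy data).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(80, 1))              # strain (toy)
y = np.sin(4 * X[:, 0]) + rng.normal(scale=0.1, size=80)   # stress (toy)

grid = GridSearchCV(
    GaussianProcessRegressor(),
    param_grid={
        "kernel": [RBF(length_scale=s) for s in (0.1, 0.3, 1.0)],
        "alpha": [1e-4, 1e-2, 1e-1],
    },
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```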

Folly Patterson, Osama Abuomar, R. K. Prabhu
Competitive Pokémon Usage Tier Classification

This chapter investigates competitive Pokémon usage tier classification given a Pokémon’s stats and typing. Pokémon were classified into the usage tiers defined by the competitive battling website Pokémon Showdown based on their individual base stats, the sum of all their base stats (BST), and their number of type weaknesses and type resists. Classifications were done using Weka’s J48 decision tree, lazy IBk 1-nearest neighbor, and logistic regression algorithm. The algorithms were evaluated by the metrics of accuracy and precision. The results of this study give insight into what factors most impact a Pokémon’s use on competitive teams, and could give future insights on how Pokémon may perform as other Pokémon fall in and out of popularity, and as more Pokémon are added in future games.

Devin Navas, Dylan Donohue
Mining Modern Music: The Classification of Popular Songs

The rising popularity of streaming services has made in-depth musical data more accessible than ever before and has created new opportunities for data mining. This project utilizes data from 19,000 songs made available by Spotify. Several data mining algorithms (including J48 decision trees, Random Forest, Simple K Means, NaiveBayes, ZeroR, and JRip) were used to treat the data as a classification task with popularity as the target class. The data was pre-processed, and the popularity class was split into two different schemes, both of which were used to train the aforementioned algorithms with the goal of attaining the highest possible classification accuracy. Once reliable models were produced, the best performing algorithms were used in conjunction with association algorithms and Information Gain evaluation to assess the importance of features such as key, acousticness, tempo, and instrumentalness in the prediction of the popularity class. Through this lens, certain groups of attributes emerged as indicators of what makes a song popular or unpopular, and relationships between the attributes themselves were revealed as well. Overall, it was concluded that popular music does in fact have patterns and a formulaic nature, making the "art" of creating music seem more like a science. However, within those patterns there is enough variation to account for the different genres and musical moods that still persist in this era of pop music, supporting the idea that as a modern musical community we still maintain some diversity.

Caitlin Genna
The Effectiveness of Pre-trained Code Embeddings

Few machine learning applications in the domain of programming languages make use of transfer learning. It has been shown in other domains, such as natural language processing, that transfer learning improves performance on various tasks. We investigate the use of transfer learning for programming languages, focusing on two tasks: method name prediction and code retrieval. We find that transfer learning provides improved performance, as it does for natural languages. We also find that these models can be pre-trained on programming languages that are different from the downstream task language, and that even pre-training on English language data is sufficient to provide similar performance to pre-training on programming languages.

Ben Trevett, Donald Reay, Nick K. Taylor
An Analysis of Flight Delays at Taoyuan Airport

This chapter is a study that aims to find trends and probabilities of factors resulting in on-time performance. The study uses two models to address the factors affecting flight delays from two different managerial viewpoints. First, using flight data at Taoyuan Airport from 2014 to 2016, a linear regression is used to analyze delays in a detailed way, which allows airlines to draw comparisons with their peers. Second, a data-mining model uses association rules to find probabilities of flight delays that can be used from an airport's perspective to improve on-time performance. The models applied in this study show that operational factors, such as flight origin and turnaround times, are related to and can affect delays. Regardless of which method is employed, results show that low-cost carrier business models have successfully undercut their full-service carrier peers, even in a primary airport setting, to produce better on-time performance.
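A hedged sketch of the association-rule side of such a study using mlxtend's apriori; the one-hot flight attributes below are invented examples, not the Taoyuan data.

```python
# Association rules over toy one-hot flight attributes.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

flights = pd.DataFrame({          # each row is one flight
    "low_cost_carrier": [1, 1, 0, 0, 1, 0],
    "short_turnaround": [1, 0, 1, 1, 0, 0],
    "delayed":          [1, 0, 1, 1, 0, 0],
}).astype(bool)

itemsets = apriori(flights, min_support=0.3, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```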

S. K. Hwang, S. M. Horng, C. L. Chao
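
The association-rule idea can be reduced to a support/confidence calculation over flight records, as in the sketch below. The file name, the column names, and the 15-minute on-time cutoff are assumptions; the chapter's actual rule mining may differ.

```python
# Hypothetical sketch of the association-rule idea: estimate the
# confidence of "carrier type -> on-time" rules from flight records.
import pandas as pd

flights = pd.read_csv("taoyuan_flights.csv")   # assumed schema
flights["on_time"] = flights["delay_minutes"] <= 15  # assumed cutoff

for carrier_type, group in flights.groupby("carrier_type"):
    support = len(group) / len(flights)
    confidence = group["on_time"].mean()       # P(on_time | carrier_type)
    print(f"{carrier_type}: support={support:.2f}, "
          f"confidence(on-time)={confidence:.2f}")
```
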
Data Analysis for Supporting Cleaning Schedule of Photovoltaic Power Plants

To reduce dependence on nuclear and thermal power, the government of Taiwan has aggressively promoted green energy such as solar and wind power in recent years, and solar energy has become an indispensable part of daily life in Taiwan. One critical issue in photovoltaic (PV) power plant operation is determining when to clean solar panels dirtied by dust or other pollutants. Overly frequent cleaning incurs excessive cleaning fees, while insufficient cleaning reduces production. With a tropical island climate, Taiwan sees frequent rain in some seasons, which cleans dirty panels naturally, in contrast to manual cleaning by maintenance personnel. In this chapter, we investigate panel cleaning in Taiwan under an uncontrolled, operational configuration. We propose methods to estimate solar power loss due to dust on panels and, from it, the cumulative revenue loss. When the loss exceeds the cleaning fee, manual cleaning is scheduled. Preliminary results demonstrate that the proposed approach is promising.

Chung-Chian Hsu, Shi-Mai Fang, Yu-Sheng Chen, Arthur Chang
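
The scheduling rule described in this abstract reduces to a simple threshold test: accumulate the estimated revenue loss and clean when it exceeds the fee, with rain resetting the accumulator. The sketch below illustrates that logic; all numbers are invented for the example, not taken from the chapter.

```python
# Hypothetical sketch of the scheduling rule described above: accumulate
# estimated daily revenue loss from soiling and trigger manual cleaning
# once the cumulative loss exceeds the cleaning fee.
def schedule_cleaning(daily_loss_kwh, price_per_kwh, cleaning_fee, rain_days):
    """Return the days on which manual cleaning would be scheduled."""
    cleanings, cumulative_loss = [], 0.0
    for day, loss_kwh in enumerate(daily_loss_kwh):
        if day in rain_days:          # rain acts as natural cleaning
            cumulative_loss = 0.0
            continue
        cumulative_loss += loss_kwh * price_per_kwh
        if cumulative_loss > cleaning_fee:
            cleanings.append(day)     # manual cleaning scheduled
            cumulative_loss = 0.0
    return cleanings

# Illustrative numbers: 2 kWh lost per dusty day, rain on days 10 and 25.
print(schedule_cleaning([2.0] * 60, price_per_kwh=0.12,
                        cleaning_fee=3.0, rain_days={10, 25}))
```
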

Information & Knowledge Engineering Methodologies, Frameworks, and Applications

Frontmatter
Concept into Architecture: A Pragmatic Modeling Method for the Acquisition and Representation of Information

Process models are very important in a wide range of application areas, for example, software development and operational management. The first step in modeling a process is acquiring information, which is knowledge engineering. This information, documented as process models, is a very important part of Enterprise Architectures (EAs). All NATO forces use the NATO Architecture Framework (NAF), in which process models are represented as part of the operational view, in the sub-view NOV-5. Process models are often the starting point for modeling or creating an EA. According to the principles of proper modeling, not only correct use of the syntax is necessary but also the relevance of a model, which is inseparable from the underlying information from which the model is created. This chapter deals with the creation of a modeling method that allows subject matter experts (SMEs) in the area to be modeled to get started with EA. The aim is to use the method to obtain the information necessary for creating the process model and to represent that information in a human- and machine-readable form. A further goal of this contribution is to transform the information, syntactically correctly, into a NOV-5 process model using software developed as part of this contribution. The transformed NOV-5 model is similar to the original representation of the gathered information, enabling the SME to check the created NAF-compliant model for correctness of content and to use the model without knowledge of the NAF.

Sebastian Jahnen, Stefan Pickl, Wolfgang Bein
Improving Knowledge Engineering Through Inter-Organisational Architecture, Culture, Agility and Change in E-Learning Project Teams

Project management is conducted systematically in order to achieve a deliverable outcome. In the case of information technology projects, project failure is very common; this is also true of IT projects that involve the development of e-learning systems, whether they make minimal or intensive use of ICTs. The aim of this study, therefore, is to propose an approach to project management built around a toolkit, so that people without a background in project management or e-learning can more successfully run projects within their organisation. The toolkit enhances this approach to knowledge engineering by tailoring project management methodologies to an organisation and its partner organisations.

Jonathan Bishop, Kamal Bechkoum
Do Sarcastic News and Online Comments Make Readers Happier?

News vendors cultivate online communities to encourage online users’ comments worldwide. However, the incivility of online commenting is an important issue for both researchers and practitioners. This study focuses on the impact of news, with and without online comments, on readers’ emotions. An online experiment was designed crossing news sarcasm (sarcastic vs. neutral) with comments (civil, uncivil, and none) to examine participants’ emotions. Two pretests were administered to determine the target news and the incivility of the online comments. Five hundred twenty-nine subjects took part in the formal online experiment, and the results demonstrated that both sarcasm in news and incivility in comments made readers significantly unhappier. The interaction effect between sarcasm in news and incivility in comments was also significant, implying that news may form a “frame” in readers’ minds and influence how they judge comments and emotions. Implications and discussion are also included.

Jih-Hsin Tang, Chih-Fen Wei, Ming-Chun Chen, Chih-Shi Chang
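
A 2 x 3 factorial design with a significant interaction, as reported here, is conventionally tested with a two-way ANOVA. The sketch below shows one standard way to run that test in Python; the file name, column names, and coding are assumptions, and the study's actual statistical procedure is not specified in the abstract.

```python
# Hypothetical sketch: test the sarcasm x comment-incivility interaction
# on a happiness score with a two-way ANOVA (statsmodels).
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_csv("experiment_responses.csv")   # assumed: one row per subject
model = ols("happiness ~ C(sarcasm) * C(comment_type)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))         # main effects + interaction
```
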
GeoDataLinks: A Suggestion for a Replacement for the ESRI Shapefile

The ESRI Shapefile system of geographical information storage was essential to the development of GIS near the end of the previous millennium. However, Shapefile is now frustrating to many GIS professionals because its simple array-based data structures lack the organizational and expressive power of modern object-oriented programming technology. Alternatives to Shapefile have been proposed; however, even the best of the extant proposals are based on older technologies such as relational database systems. In this paper, a new system of geographical information storage is proposed that is superior to Shapefile, as well as to the best extant alternatives, in important ways such as better data organization and the ability to securely and naturally express more complex relationships among geographic entities. The new geographical data storage system, called GeoDataLinks, is based on the author’s Intentionally-Linked Entities (ILE) database system.

Vitit Kantabutra
Nutrition Intake and Emotions Logging System

An unbalanced diet is a major risk factor for chronic diseases such as cardiovascular, metabolic, kidney, neurodegenerative diseases, and cancer. Current methods to capture dietary intake include food-frequency questionnaires, 7-day food records, and 24-hour recalls. These methods are expensive to conduct, cumbersome for participants, and prone to reporting errors. This chapter discusses the development of a personal Nutrition Intake and Emotions logging system, covering the following: establishing requirements for and designing a simple interactive system, implementing it, performing data analysis, and evaluating the system.

Tony Anusic, Suhair Amer
Geographical Labeling of Web Objects Through Maximum Marginal Classification

Web search engines have become extremely popular for providing requested information to users, and the effectiveness of their result sets has continuously improved over the years. However, the documents in a result set may also contain irrelevant information of no importance to the user, who then has to spend effort searching those documents for relevant information. To overcome this searching overhead, Web object search engines have been proposed. Such systems are built by extracting object information from various Web documents and integrating it into an object repository; the user can then submit object search queries and retrieve the required object information. Unlike for Web search engines, providing results to geography-specific queries is still at a nascent stage for Web object search engines. Recently, a Gaussian Mixture Model-based technique for geographical labeling of Web objects was proposed in the literature; however, there is significant scope to improve its labeling accuracy. In this chapter, a maximum marginal classifier-based technique for geographical labeling of Web objects is proposed. Its advantages are exhibited empirically on a real-world data set: the proposed technique outperforms the contemporary technique by at least 40% in labeling accuracy and is twice as efficient in execution.

K. N. Anjan Kumar, T. Satish Kumar, J. Reshma
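
A maximum-margin labeler can be illustrated with a linear SVM over text features of a Web object, as sketched below. The toy training strings, the TF-IDF features, and the specific classifier are assumptions; the chapter's actual features and formulation are not given in the abstract.

```python
# Hypothetical sketch of a maximum-margin geographical labeler: a linear
# SVM over bag-of-words features of a Web object's text, predicting a
# geographic label.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

objects = ["grand central terminal 89 e 42nd st subway",
           "eiffel tower champ de mars paris tickets"]
labels = ["New York", "Paris"]                 # toy training data

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(objects, labels)
print(clf.predict(["statue of liberty ferry new york harbor"]))
```
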
Automatic Brand Name Translation Based on Hexagonal Pyramid Model

Brand name translation is of great importance to international corporations when their goods enter foreign markets. In this chapter, we investigate the strategies and methods of brand name translation from Western languages to Chinese and propose a hexagonal pyramid brand name translation model, which provides a comprehensive summary of brand name translation methods and sharpens the classification of some translated brand names from vague to clear. Based on the model and on similarity calculation, an efficient automatic translation method is proposed to help find adequate translated words in Chinese. An experiment conducted with a dedicated program yielded clusters of recommended Chinese brand words with good potential for use.

Yangli Jia, Zhenling Zhang, Haitao Wang, Xinyu Cao
A Human Resources Competence Actualization Approach for Expert Networks

Expert networks are popular and useful tools for many organizations. They organize the storage and use of information about employees and their skills. Correspondence between the data stored in expert networks and experts’ real competencies is crucial for project management, as irrelevant information may lead to unexpected project performance results; analysis of such results can help keep the information up to date. The approach presented in this paper uses information about a project’s results and its participants’ competencies to update experts’ competencies. The reference model and the algorithm used in the approach are described in this paper.

Mikhail Petrov
Smart Health Emotions Tracking System

Many diseases, such as mental health disorders, can be linked to variable mood. Smart health applications are important because they allow patients with abnormal health conditions to be monitored and given rapid help. Advances in IoT play a major role in the field of health care by empowering people to connect their health and wealth in a smart way. A simple smart health emotion-tracking system is developed, which records scheduled activities and student emotions over a period of time and later gives users access to this data for analysis. Emotions are measured on a scale ranging from 0 to 10.

Geetika Koneru, Suhair Amer

Video Processing, Imaging Science, and Applications

Frontmatter
Content-Based Image Retrieval Using Deep Learning

The problems of content-based image retrieval (CBIR) and analysis are explored in this paper, with a focus on the design and implementation of machine learning and image processing techniques that can be used to build a scalable application to assist with indexing large image datasets. The CBIR application can search large image datasets to retrieve digital images similar to predefined specifications, such as a given digital image or a given image type. The search is based on the actual contents of images rather than their metadata. Feature extraction techniques are used in this research project to analyze images and extract their important features, which reflect the content-related characteristics (such as colors, shapes, edges, and textures) that can identify the image type. Supervised machine learning, in the form of a convolutional neural network, is used to analyze the extracted features and retrieve similar images: the network classifies images, and using the statistics behind these classifications, similarities can be drawn between the query image and entities within the database. The developed CBIR algorithms were able to analyze and classify images based on their contents.

Tristan Jordan, Heba Elgazzar
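
The retrieval step this abstract describes can be sketched as a similarity ranking over feature vectors. Below, the CNN feature extraction is assumed to have been done upstream (the random vectors stand in for real features), and cosine similarity is one common choice; the paper's exact similarity measure is not stated.

```python
# Hypothetical sketch of the retrieval step: given feature vectors already
# extracted by a CNN for a database of images, rank them by cosine
# similarity to the query image's feature vector.
import numpy as np

def retrieve(query_vec, db_vecs, k=5):
    """Return indices of the k most similar database images."""
    q = query_vec / np.linalg.norm(query_vec)
    db = db_vecs / np.linalg.norm(db_vecs, axis=1, keepdims=True)
    sims = db @ q                            # cosine similarity per image
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(0)
db_vecs = rng.normal(size=(1000, 512))       # stand-in for CNN features
query = rng.normal(size=512)
print(retrieve(query, db_vecs))
```
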
Human–Computer Interaction Interface for Driver Suspicious Action Analysis in Vehicle Cabin

The paper presents an approach for monitoring in-vehicle driver behavior, focused on finding vulnerabilities in the interaction interfaces between humans and AI-based systems in a transport environment. We propose a reference model for analyzing suspicious driver actions, comprising explicit intentional driving behavior and implicit unintentional behavior. The former refers to human attitudes held at a conscious level that are easy to self-report, while the latter are attitudes at the unconscious level that are involuntarily formed and generally unknown. We develop a prototype software application for detecting suspicious actions in driver behavior; it analyzes the video stream recorded by a camera located in the vehicle cabin.

Igor Lashkov, Alexey Kashevnik
Image Resizing in DCT Domain

There is a high demand for effective resizing of images in order to preserve the region of interest as much as possible on various screens with different dimensions and aspect ratios. In addition, image compression is important in many applications. This paper presents a compressed domain image-resizing algorithm by converting the discrete cosine transform (DCT) blocks of the original image into the DCT blocks of the resized one. Experimental results show that the proposed method outperforms other existing algorithms.

Hsi-Chin Hsin, Cheng-Ying Yang, Chien-Kun Su
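
A minimal sketch of the DCT-domain idea: keep only the low-frequency block of an image's DCT and inverse-transform at the smaller size, with an orthonormal scaling factor to preserve overall brightness. This global, whole-image toy is an assumption for illustration; the chapter's block-wise DCT conversion is more elaborate.

```python
# Hypothetical sketch of DCT-domain downscaling via coefficient truncation.
import numpy as np
from scipy.fft import dctn, idctn

def dct_downscale(img, out_h, out_w):
    h, w = img.shape
    coeffs = dctn(img, norm="ortho")
    kept = coeffs[:out_h, :out_w]               # low-frequency block
    scale = np.sqrt((out_h * out_w) / (h * w))  # amplitude correction
    return idctn(kept * scale, norm="ortho")

img = np.random.default_rng(0).random((256, 256))
small = dct_downscale(img, 128, 128)
print(small.shape, img.mean(), small.mean())    # means roughly match
```
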

Data Science and Information & Knowledge Engineering

Frontmatter
Comparative Analysis of Sampling Methods for Data Quality Assessment

Data quality assessment is an integral part of maintaining the quality of a system. Its purpose is to make data easier to read and use, and it is also a prerequisite for data analysis. Low-quality data is an obstacle to fast analysis and one of the causes of financial losses in countries, companies, and hospitals: Jesmeen et al. [6] and Laranjeiro et al. [7] point out that poor-quality data costs approximately $13.3 million per organization and $3 trillion per year for the entire US economy. Poor data negatively impacts not only financial resources but also efficiency, productivity, and credibility. Data quality assessment has therefore become one of the most attractive technologies for improving data quality. In this work, we use data samples to assess overall data quality. First, the sample size is determined and a control dataset is created to validate the proposed method. Next, four sampling methods are used to obtain samples of different sizes. The sampling results are then compared across three quality dimensions using various statistical methods.

Sameer Karali, Hong Liu, Jongyeop Kim
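
Four common sampling schemes that such a comparison could draw on are sketched below; the abstract does not name the chapter's four methods, so this selection, along with the file name, sample size, and stratum/cluster columns, is an assumption.

```python
# Hypothetical sketch of four sampling schemes over a records dataset.
import numpy as np
import pandas as pd

df = pd.read_csv("records.csv")                     # assumed dataset
n = 500                                             # assumed sample size

simple = df.sample(n=n, random_state=0)             # simple random
step = len(df) // n
systematic = df.iloc[::step][:n]                    # systematic
stratified = df.groupby("department", group_keys=False).apply(
    lambda g: g.sample(frac=n / len(df), random_state=0))  # stratified
picked = np.random.default_rng(0).choice(df["site"].unique(), size=3)
cluster = df[df["site"].isin(picked)]               # cluster sampling

print(len(simple), len(systematic), len(stratified), len(cluster))
```
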
A Resampling Based Semi-supervised Learning Analysis for Identifying School Needs of Backpack Programs

School-based backpack programs, which supply children with food to take home on weekends and holidays, have shown positive effects in reports from families and schools. Selection priority becomes a major issue when such programs plan to expand their service to other schools. To aid in the appropriate design of a program, this study focuses on statistical modeling of the limited available data to identify schools with true need for backpack programs. The data in this study was collected from a backpack program organization and from Guilford County, North Carolina websites. Utilizing classification techniques such as logistic regression, Naïve Bayes, decision trees, random forests, and support vector machines, a resampling-based semi-supervised learning (RSSL) method was developed and employed. Through exhaustive numerical simulations, the proposed RSSL created several ranking systems of food backpack need for county schools not currently served by the program. Random forest in the proposed RSSL outperformed the other selected classifiers in reporting probabilities useful for decision making. The RSSL method can be implemented to analyze the same backpack programs in other regions, and it can be applied, with little or no modification and limited available data, to similar problems involving the potential identification of misclassified labels.

Tahir Bashir, Seong-Tae Kim, Liping Liu, Lauren Davis
Data-Driven Environmental Management: A Digital Prototype Dashboard to Analyze and Monitor the Precipitation on Susquehanna River Basin

Streamlined access to data makes forecasting, monitoring, and timely action much easier for any organization. Whether in business, education, or environmental protection, quick access to data defines the difference between achieving and not achieving a central goal. For the Susquehanna River Basin Commission (SRBC), data is key to its core mission: enhancing public welfare through comprehensive planning, water supply allocation, and management of the water resources of the Susquehanna River Basin. River basin management involves multiple stakeholders, and scientific management requires significant forecasting capabilities. This research built the requisite digital prototype dashboard to monitor precipitation as one environmental feature. We applied several machine learning, visualization, and data mining techniques to identify relationships among related environmental parameters, as well as indicators, to facilitate better decision making. This will help develop decision-support tools and methods that help governments and businesses make better, more informed, more accurate, and more pragmatic decisions.

Siamak Aram, Maria H. Rivero, Nikesh K. Pahuja, Roozbeh Sadeghian, Joshua L. Ramirez Paulino, Michael Meyer, James Shallenberger
Viability of Water Making from Air in Jazan, Saudi Arabia

Water is one of the most important resources in people’s daily lives; it is needed every day by individuals, industry, and farms. Unfortunately, many regions of the world do not have sufficient water. Some areas, like Jazan, Saudi Arabia, lack water, and the necessary water is generated by desalination of sea water and supplied to cities and farms. However, a desalination plant must be located near the seashore, and supplying water inland incurs transportation costs. In this paper, water from air, a new water generation method, is proposed. The Jazan area lacks rainfall but has relatively high humidity and constant wind. Using this wind and sunlight, electricity can be generated and then used to extract water from the air. The goal of this paper is to assess the viability of generating water from air.

Fathe Jeribi, Sungchul Hong
A Dynamic Data and Information Processing Model for Unmanned Aircraft Systems

Dynamic Data and Information Processing (DDIP) involves symbiotic control feedback through sensor reconfiguration to integrate real-time data into a predictive method for system behavior. Fundamentally, DDIP presents opportunities to advance understanding and analysis of activities, operations, and transformations that contribute to system performance, thereby aiding in decision-making and event prediction. Previously examined DDIP application domains include weather monitoring and forecasting, supply chain system analysis, power system and energy analysis, and structural health monitoring. Currently, there is limited existing work that implements DDIP in support of Unmanned Aircraft Systems (UAS). Developing a DDIP model for a UAS application could further demonstrate DDIP capabilities and provide information relevant to improving maintenance operations and resource management.

Mikaela D. Dimaapi, Ryan D. L. Engle, Brent T. Langhals, Michael R. Grimaila, Douglas D. Hodson
Utilizing Economic Activity and Data Science to Predict and Mediate Global Conflict

The year 2020 has left many individuals finding that their lives are continually being changed by the state of global circumstances. Some believe that these changes have given many citizens the opportunity to understand the interconnected nature of global actions and domestic consequences. Our preliminary hypothesis and research center on the belief that an informed global population produces a safer and better prepared global society. It is our understanding that when individuals are able to reasonably prepare for or expect conflict, early mediation and resource management can save not only tremendous funds but also numerous lives. We believe that creating a source of accessible predictive models is not only possible but can be done without tremendous resource demand, by tracking key indicators within the global economy and historic conflict triggers.

Kaylee-Anna Jayaweera, Caitlin Garcia, Quinn Vinlove, Jens Mache

Machine Learning, Information & Knowledge Engineering, and Pattern Recognition

Frontmatter
A Brief Review of Domain Adaptation

Classical machine learning assumes that the training and test sets come from the same distribution, so a model learned from labeled training data is expected to perform well on the test data. However, this assumption may not hold in real-world applications, where the training and test data come from different distributions due to factors such as collecting the two sets from different sources or having an outdated training set because the data changed over time. In such cases there is a discrepancy across domain distributions, and naively applying the trained model to the new dataset may degrade performance. Domain adaptation is a subfield of machine learning that aims to cope with these problems by aligning the disparity between domains so that the trained model generalizes to the domain of interest. This paper focuses on unsupervised domain adaptation, where labels are available only in the source domain. It addresses the categorization of domain adaptation from different viewpoints and presents some successful shallow and deep domain adaptation approaches for dealing with domain adaptation problems.

Abolfazl Farahani, Sahar Voghoei, Khaled Rasheed, Hamid R. Arabnia
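
One well-known shallow approach from this literature is CORAL (correlation alignment), which re-colors source features so their covariance matches the target's before a classifier is trained. The sketch below is a minimal illustration with synthetic data; whether this review covers CORAL specifically is not stated in the abstract.

```python
# Hypothetical sketch of CORAL: align source feature covariance to the
# target's, then train a classifier on the aligned source data.
import numpy as np

def _sym_power(mat, p):
    """Power of a symmetric positive-definite matrix via eigh."""
    vals, vecs = np.linalg.eigh(mat)
    return (vecs * vals ** p) @ vecs.T

def coral(source, target, eps=1.0):
    """Return source features re-colored to the target covariance."""
    d = source.shape[1]
    cs = np.cov(source, rowvar=False) + eps * np.eye(d)  # regularized cov
    ct = np.cov(target, rowvar=False) + eps * np.eye(d)
    return source @ _sym_power(cs, -0.5) @ _sym_power(ct, 0.5)

rng = np.random.default_rng(0)
src = rng.normal(size=(200, 8))          # labeled source features
tgt = rng.normal(size=(300, 8)) * 2.0    # unlabeled target features
src_aligned = coral(src, tgt)            # train any classifier on this
print(src_aligned.shape)
```
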
Fake News Detection Through Topic Modeling and Optimized Deep Learning with Multi-Domain Knowledge Sources

Increased internet access has exacerbated the severity of fake news on social media, leading to advanced deep learning methods that use large-scale data. Most of these methods rely on supervised models, which demand a large volume of training data to avoid overfitting. This paper presents Fake news Identification using a Bidirectional Encoder Representations from Transformers (BERT) model with optimal Neurons and Domain knowledge (FIND), a two-step automatic fake news detection model. For accurate detection, the FIND approach applies a deep transformer model, BERT, pre-trained on a large-scale unlabeled text corpus, to drive the classification model, and a Latent Dirichlet Allocation (LDA) topic detection model to examine the influence of an article’s headline and body, individually and jointly. The proposed FIND approach outperforms the existing exBAKE approach, achieving an F-score 10.78% higher.

Vian Sabeeh, Mohammed Zohdy, Rasha Al Bashaireh
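
The LDA component alone can be sketched as a headline/body topic-consistency check, as below. The file name, column names, topic count, and the cosine comparison are all assumptions; the full FIND pipeline (BERT fine-tuning and neuron optimization) is not shown.

```python
# Hypothetical sketch of the topic-modeling component only: fit LDA on
# article bodies, then compare each headline's topic mixture with its
# body's as a headline/body consistency signal.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

df = pd.read_csv("news.csv")      # assumed columns: headline, body, label
vec = CountVectorizer(stop_words="english")
X_body = vec.fit_transform(df["body"])
lda = LatentDirichletAllocation(n_components=10, random_state=0).fit(X_body)

topics_body = lda.transform(X_body)
topics_head = lda.transform(vec.transform(df["headline"]))
consistency = cosine_similarity(topics_head, topics_body).diagonal()
print(consistency)                # low values may flag headline/body mismatch
```
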
Accuracy Evaluation: Applying Different Classification Methods for COVID-19 Data

A coronavirus is a type of contagious virus that can infect the nose, upper throat, and/or sinuses (Mcnulty et al., Vet Microbiol 9:425–434, 1984). It can spread just like any other viral infection. The Chinese branch of the World Health Organization (WHO) received the first evidence of an unidentified virus behind several cases of pneumonia in Wuhan, later identified as COVID-19. The virus started an outbreak that at first was primarily confined to China but became a worldwide pandemic. Johns Hopkins University’s COVID-19 website, which collects information from domestic and global health agencies, states that there have been over 2 million reported cases and 357,736 deaths. More than 200 countries and territories have reported the epidemic, with the USA, Italy, and Spain suffering the most acute cases outside of China. To better understand the characteristics of the COVID-19 data, we propose in this work a framework for classifying patients and evaluating accuracy. We conducted data transformation, descriptive analysis, and inference analysis to understand the data characteristics of COVID-19. In addition, we evaluated three methods: K-nearest neighbors (KNN), K-means, and decision tree algorithms.

Sameer Karali, Hong Liu
Clearview, an Improved Temporal GIS Viewer and Its Use in Discovering Spatiotemporal Patterns

A GIS must have correct, efficient temporal functionality to be an effective visualization tool for spatiotemporal processes. Nevertheless, the major commercial GIS software providers have yet to produce GIS software packages that consistently function correctly with temporal data. This paper explains some of the serious errors in these existing software packages. The paper then shows how an efficient and correct temporal GIS viewer called Clearview can be designed and implemented. We also look at how Clearview can uncover new spatiotemporal patterns and knowledge in an HGIS (Historical GIS) database, Harvard and Fudan Universities’ CHGIS (China Historical GIS).

Vitit Kantabutra
Using Entropy Measures for Evaluating the Quality of Entity Resolution

This research describes some results from an unsupervised entity resolution (ER) process that uses cluster entropy to self-regulate linking. The experiments were performed on synthetic person references of varying quality. The process achieved a linking accuracy of 93% for samples of moderate to high data quality. Results for low-quality references were much lower, with the best overall result just over 50% linking accuracy, but there are many possible avenues of research that could further improve the process. The purpose of this research is to allow ER processes to self-regulate linking based on cluster entropy; the results are very promising for entity references of relatively high quality, while the use of this process for low-quality data needs further improvement.

Awaad Al Sarkhi, John R. Talburt
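
A cluster-entropy score of the kind this abstract describes can be computed as the Shannon entropy of an attribute's values within one linked cluster: homogeneous clusters score near zero, and rising entropy can signal bad links. The function below is a minimal sketch of that measure, not the paper's exact formulation.

```python
# Hypothetical sketch of a per-attribute cluster-entropy score.
from collections import Counter
from math import log2

def cluster_entropy(values):
    """Shannon entropy (bits) of one attribute across a cluster."""
    counts = Counter(values)
    total = sum(counts.values())
    ent = -sum((c / total) * log2(c / total) for c in counts.values())
    return ent + 0.0  # normalize -0.0 to 0.0 for the uniform case

print(cluster_entropy(["SMITH", "SMITH", "SMITH"]))   # 0.0, consistent
print(cluster_entropy(["SMITH", "SMYTH", "JONES"]))   # ~1.58, suspicious
```
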
Improving Performance of Machine Learning on Prediction of Breast Cancer Over a Small Sample Dataset

The application of machine learning (ML) algorithms aims to develop prognostic tools that can be trained on routinely collected data. In a typical scenario, an ML-based prognostic tool searches through large volumes of training data for complex relationships. However, little attention has been devoted to small sample datasets, which are a widespread occurrence in research areas involving human participants such as clinical trials, genetics, and neuroimaging. In this research, we study the impact of sample dataset size on the performance of different ML algorithms, comparing model fitting and prediction performance on the original small dataset and on an augmented dataset. We found that a model fitted on the small dataset exhibits severe overfitting at the testing stage, which is reduced when the model is trained on the augmented dataset; however, the improvement in model performance from training on the augmented dataset varies across ML algorithms.

Neetu Sangari, Yanzhen Qu
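
The shape of such an experiment can be sketched with a generic noise-based augmentation of a small tabular sample, as below. The chapter's actual augmentation method is not specified in the abstract, so the Gaussian-jitter scheme, the 60-row subsample, and the noise scale are all assumptions.

```python
# Hypothetical sketch: jitter a small tabular dataset with Gaussian noise
# and compare cross-validated scores before and after augmentation.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
X_small, y_small = X[:60], y[:60]                  # simulate a small sample

rng = np.random.default_rng(0)
noise = rng.normal(scale=0.05 * X_small.std(axis=0),
                   size=(5,) + X_small.shape)      # 5 jittered copies
X_aug = np.vstack([X_small] + [X_small + n for n in noise])
y_aug = np.tile(y_small, 6)

clf = RandomForestClassifier(random_state=0)
print("small:", cross_val_score(clf, X_small, y_small, cv=5).mean())
# Caution: naive CV over augmented copies leaks near-duplicates across
# folds and inflates scores; shown here only to illustrate the workflow.
print("augmented:", cross_val_score(clf, X_aug, y_aug, cv=5).mean())
```
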
Development and Evaluation of a Machine Learning-Based Value Investing Methodology

The majority of approaches to using computers for fundamental analysis in stock investing are plagued by scalability and profitability issues. The present work tested four machine learning algorithms to overcome them. Random Forest and a soft-voting ensemble obtained strong risk–reward ratios over 12 test years. An innovative process of picking exclusively the top 20 stocks, ranked by the algorithms’ softmax confidence values, enhanced returns by approximately 35%. Methods such as comparing in-sample and out-of-sample precision, as well as the distributions of test and training data, suggested that the algorithm can be used on out-of-sample data.

Jun Yi Derek He, Joseph Ewbank
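
The top-20 selection rule reduces to ranking stocks by the classifier's predicted probability and keeping the most confident picks, as in the sketch below. The file names, feature columns, and the "outperformed" label are assumptions standing in for the chapter's fundamentals data.

```python
# Hypothetical sketch of the selection rule: rank stocks by the
# classifier's predicted probability of outperforming and keep only the
# 20 most confident picks.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train = pd.read_csv("fundamentals_train.csv")      # assumed schemas
test = pd.read_csv("fundamentals_test.csv")
features = [c for c in train.columns if c not in ("ticker", "outperformed")]

clf = RandomForestClassifier(random_state=0)
clf.fit(train[features], train["outperformed"])

conf = clf.predict_proba(test[features])[:, 1]     # confidence per stock
top20 = test.assign(confidence=conf).nlargest(20, "confidence")
print(top20[["ticker", "confidence"]])
```
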
Backmatter
Metadata
Title
Advances in Data Science and Information Engineering
Edited by
Robert Stahlbock
Gary M. Weiss
Mahmoud Abou-Nasr
Cheng-Ying Yang
Dr. Hamid R. Arabnia
Leonidas Deligiannidis
Copyright Year
2021
Electronic ISBN
978-3-030-71704-9
Print ISBN
978-3-030-71703-2
DOI
https://doi.org/10.1007/978-3-030-71704-9