
This book constitutes the refereed proceedings of five workshops that were held in conjunction with the 25th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2021, in Delhi, India, in May 2021.

The 17 revised full papers presented were carefully reviewed and selected from a total of 39 submissions.

The five workshops were as follows:

Workshop on Smart and Precise Agriculture (WSPA 2021)

PAKDD 2021 Workshop on Machine Learning for Measurement Informatics (MLMEIN 2021)

The First Workshop and Shared Task on Scope Detection of the Peer Review Articles (SDPRA 2021)

The First International Workshop on Data Assessment and Readiness for AI (DARAI 2021)

The First International Workshop on Artificial Intelligence for Enterprise Process Transformation (AI4EPT 2021)

### Identification of Harvesting Year of Barley Seeds Using Near-Infrared Hyperspectral Imaging Combined with Convolutional Neural Network

Abstract
To evaluate the quality and safety of seeds, the harvesting year is one of the main parameters, as seed quality deteriorates during storage due to seed aging. In this study, hyperspectral imaging in the near-infrared range of 900–1700 nm was used to non-destructively identify the harvesting year of barley seeds. Seed samples harvested in three years, from 2017 to 2019, were collected. An end-to-end convolutional neural network (CNN) model was developed using the mean spectra extracted from the ventral and dorsal sides of the seeds. The CNN model outperformed other classification models (K-nearest neighbors and support vector machines, with and without spectral preprocessing), achieving a test accuracy of 97.25%. This indicates that near-infrared hyperspectral imaging combined with a CNN can be used to rapidly and non-destructively identify the harvesting year of barley seeds.
Tarandeep Singh, Neerja Mittal Garg, S. R. S. Iyengar
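The per-seed feature described above is the mean spectrum over a seed's pixels. A minimal numpy sketch of that extraction step (the function name, array shapes, and toy values are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def mean_seed_spectrum(hypercube, mask):
    """Average the per-pixel spectra inside a binary seed mask.

    hypercube : (H, W, B) array of reflectance values over B wavelength bands
    mask      : (H, W) boolean array marking pixels belonging to one seed
    Returns a length-B mean spectrum, the per-seed feature vector
    that would be fed to a classifier.
    """
    pixels = hypercube[mask]          # (n_pixels, B) spectra of masked pixels
    return pixels.mean(axis=0)

# Toy example: a 4x4 image with 3 bands, seed occupying the top-left 2x2 corner
cube = np.zeros((4, 4, 3))
cube[:2, :2] = [0.2, 0.5, 0.8]        # all seed pixels share one spectrum
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True
spectrum = mean_seed_spectrum(cube, mask)   # -> [0.2, 0.5, 0.8]
```

In the paper this averaging is done separately for the ventral and dorsal sides before the spectra are passed to the CNN.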

### Plant Leaf Disease Segmentation Using Compressed UNet Architecture

Abstract
In the proposed work, a compressed version of UNet has been developed using Differential Evolution for segmenting diseased regions in leaf images. The compressed model has been evaluated on potato late blight leaf images from the PlantVillage dataset. It needs only 6.8% of the space required by the original UNet architecture, and its inference time for disease classification is twice as fast, without any loss in the performance metric of mean Intersection over Union (IoU).
Mohit Agarwal, Suneet Kr. Gupta, K. K. Biswas
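The evaluation metric mentioned above, mean IoU, averages the per-class overlap ratio between predicted and ground-truth label maps. A minimal numpy sketch (toy label maps are illustrative):

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union across classes for integer label maps."""
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:                 # class absent from both maps: skip it
            continue
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

pred   = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
score = mean_iou(pred, target, num_classes=2)   # class 0: 1/2, class 1: 2/3
```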

### Hierarchical Topic Model for Tensor Data and Extraction of Weekly and Daily Patterns from Activity Monitor Records

Abstract
Latent Dirichlet allocation (LDA) is a popular topic model for extracting common patterns from discrete datasets. It has been extended to the pachinko allocation model (PAM), which has a hierarchical topic structure. This paper presents the combination meal allocation (CMA) model, a topic model that further extends the PAM to have both hierarchical categories and hierarchical topics. We consider count datasets in multiway arrays, i.e., tensors, and introduce a set of topics for each mode of the tensors. The topics in each mode are interpreted as patterns in the topics and categories of the next mode. Despite the vast number of combinations in multilevel categories, our model provides simple and interpretable patterns by sharing the topics in each mode. Latent topics and their membership are estimated using Markov chain Monte Carlo (MCMC) methods. We apply the proposed model to step-count data recorded by activity monitors to extract common activity patterns exhibited by the users. Our model identifies four daily patterns of ambulatory activities (commuting, daytime, nighttime, and early-bird activities) as sub-topics, and six weekly activity patterns as super-topics. We also investigate how the amount of activity in each pattern dynamically affects body weight changes.
Shunichi Nomura, Michiko Watanabe, Yuko Oguma

Open Access

### Convolutional Neural Network to Detect Deep Low-Frequency Tremors from Seismic Waveform Images

Abstract
The installation of dense seismometer arrays in Japan approximately 20 years ago led to the discovery of deep low-frequency tremors, which are oscillations clearly different from ordinary earthquakes. As such tremors may be related to large earthquakes, it is an important issue in seismology to investigate tremors that occurred before such arrays were established. We use deep learning to detect evidence of tremors in seismic data from more than 50 years ago, when seismic waveforms were printed on paper. First, we construct a convolutional neural network (CNN) based on the ResNet architecture to extract tremors from seismic waveform images. Experiments applying the CNN to synthetic images generated according to seismograph paper records show that the trained model can correctly determine the presence of tremors in the seismic waveforms. In addition, gradient-weighted class activation mapping clearly indicates the tremor location on each image. Thus, the proposed CNN has strong potential for detecting tremors on numerous paper records, which can deepen our understanding of the relations between tremors and earthquakes.
Ryosuke Kaneko, Hiromichi Nagao, Shin-ichi Ito, Kazushige Obara, Hiroshi Tsuruoka
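Gradient-weighted class activation mapping (Grad-CAM), used above to localize tremors, weights each convolutional feature map by the spatial mean of the class-score gradient and takes the ReLU of the weighted sum. A minimal numpy sketch of that computation, with toy activations and gradients standing in for a real network's tensors:

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Gradient-weighted class activation map.

    feature_maps : (K, H, W) activations of the last conv layer
    gradients    : (K, H, W) gradients of the class score w.r.t. those activations
    Channel weights are the spatial means of the gradients; the map is the
    ReLU of the weighted sum of feature maps.
    """
    weights = gradients.mean(axis=(1, 2))                # (K,) one weight per channel
    cam = np.tensordot(weights, feature_maps, axes=1)    # (H, W) weighted sum
    return np.maximum(cam, 0)                            # ReLU keeps positive evidence

fmaps = np.array([[[1.0, -1.0], [0.0, 2.0]],
                  [[0.5,  0.5], [0.5, 0.5]]])            # K=2 channels, 2x2 maps
grads = np.array([[[1.0,  1.0], [1.0, 1.0]],
                  [[-2.0, -2.0], [-2.0, -2.0]]])
cam = grad_cam(fmaps, grads)                             # -> [[0, 0], [0, 1]]
```

In the paper, high-response regions of such a map indicate where on the waveform image the tremor signal lies.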

### Unsupervised Noise Reduction for Nanochannel Measurement Using Noise2Noise Deep Learning

Abstract
Noise reduction is an important issue in measurement. A difficulty in training a noise reduction model with machine learning is that the clean signal of the measurement object, needed for supervised training, is rarely available in most advanced measurement problems. Recently, an unsupervised technique for training noise reduction models called Noise2Noise has been proposed, and a deep learning model named U-Net trained with this technique has demonstrated promising performance on some measurement problems. In this study, we applied the technique to highly noisy electric current waveforms obtained by measuring nanoparticle passages through a multistage narrowing nanochannel. We found that a convolutional autoencoder (CAE) was more suitable than the U-Net for Noise2Noise-based noise reduction in this nanochannel measurement problem.
Takayuki Takaai, Makusu Tsutsui
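The key idea behind Noise2Noise is that fitting a denoiser against a second, independently noisy copy of the same signal has (in expectation) the same optimum as fitting against the unavailable clean signal. A minimal numpy sketch of that principle, using a closed-form scalar "denoiser" instead of a deep network (all names and distributions are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
signal = rng.normal(0.0, 1.0, n)          # latent clean signal (variance 1)
x1 = signal + rng.normal(0.0, 1.0, n)     # two independent noisy measurements
x2 = signal + rng.normal(0.0, 1.0, n)     # of the same signal (noise variance 1)

# Fit a scalar "denoiser" a*x by least squares against a noisy target (Noise2Noise)
a_n2n = (x1 @ x2) / (x1 @ x1)
# ...and against the clean target, which is unavailable in practice
a_clean = (x1 @ signal) / (x1 @ x1)
# Both estimates approach the Wiener shrinkage 1 / (1 + 1) = 0.5,
# illustrating why noisy targets suffice for training a denoiser.
```

In the paper, the same principle trains a convolutional autoencoder on pairs of noisy nanochannel current waveforms.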

### Classification Bandits: Classification Using Expected Rewards as Imperfect Discriminators

Abstract
A classification bandits problem is a new class of multi-armed bandit problems in which an agent must classify a given set of arms as positive or negative depending on whether the number of bad arms is at least $$N_2$$ or at most $$N_1(<N_2)$$, drawing as few arms as possible. In our problem setting, bad arms are imperfectly characterized as arms whose expected rewards (losses) are above a threshold. We develop a method for reducing classification bandits to simpler one-threshold classification bandits and propose an algorithm that classifies a given set of arms correctly with a specified confidence. Our numerical experiments demonstrate the effectiveness of the proposed method.
Koji Tabata, Atsuyoshi Nakamura, Tamiki Komatsuzaki
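The building block of such problems is deciding, with confidence, whether an arm's expected reward is above or below a threshold. A minimal sketch (not the paper's algorithm) that draws each arm until a Hoeffding confidence interval excludes the threshold; the function and parameter names are illustrative:

```python
import math, random

def classify_arms(pulls, theta, delta=0.05, max_pulls=5000):
    """Decide for each arm whether its mean reward is above threshold `theta`.

    `pulls` is a list of zero-argument samplers, one per arm, with rewards
    in [0, 1].  Each arm is drawn until its Hoeffding confidence interval
    excludes the threshold (or `max_pulls` is reached).
    """
    labels = []
    for pull in pulls:
        total, n = 0.0, 0
        while n < max_pulls:
            total += pull()
            n += 1
            radius = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
            mean = total / n
            if mean - radius > theta:      # confidently above the threshold
                labels.append("above")
                break
            if mean + radius < theta:      # confidently below the threshold
                labels.append("below")
                break
        else:                              # budget exhausted: use the point estimate
            labels.append("above" if total / n > theta else "below")
    return labels

random.seed(1)
arms = [lambda: float(random.random() < 0.8),   # Bernoulli(0.8) arm
        lambda: float(random.random() < 0.2)]   # Bernoulli(0.2) arm
labels = classify_arms(arms, theta=0.5)
```

Counting how many arms are labeled "above" then supports the positive/negative set-level decision described in the abstract.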

### Overview and Insights from Scope Detection of the Peer Review Articles Shared Tasks 2021

Abstract
In this paper, we present the results of our shared task at The First Workshop & Shared Task on Scope Detection of the Peer Review Articles (SDPRA), collocated with PAKDD 2021. The task aims to develop systems that can assist the initial screening step of the peer-review process, usually performed by the editors. We received four submissions in total: three from academic institutions and one from industry. The quality of the submissions shows growing interest in the task within the research community.
Saichethan Miriyala Reddy, Naveen Saini

### Scholarly Text Classification with Sentence BERT and Entity Embeddings

Abstract
This paper summarizes our solution for the shared task on text classification (scope detection) of peer review articles at the SDPRA (Scope Detection of the Peer Review Articles) workshop at PAKDD 2021. By participating in this challenge, we are particularly interested in how well pre-trained word embeddings from different neural models, specifically transformer models such as BERT, perform on this text classification task. Additionally, we are interested in whether utilizing entity embeddings can further improve classification performance. Our main finding is that using SciBERT to obtain sentence embeddings provides the best performance of any individual model. In addition, combining sentence embeddings with entity embeddings for the entities mentioned in each text can further improve a classifier’s performance. Finally, a hard-voting ensemble of seven classifiers achieves over 92% accuracy on our local test set as well as on the final one released by the organizers of the task. The source code is publicly available at https://github.com/parklize/pakdd2021-SDPRA-sharedtask.
Guangyuan Piao
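The hard-voting ensemble mentioned above takes, for each sample, the label predicted by the majority of the classifiers. A minimal stdlib sketch (class names and predictions are illustrative):

```python
from collections import Counter

def hard_vote(predictions):
    """Majority vote across classifiers.

    predictions : list of per-classifier label lists, all of equal length.
    Returns one label per sample; ties are broken by the first-seen label.
    """
    n_samples = len(predictions[0])
    voted = []
    for i in range(n_samples):
        votes = [clf[i] for clf in predictions]
        voted.append(Counter(votes).most_common(1)[0][0])
    return voted

# Three hypothetical classifiers labelling four abstracts
preds = [["CS",  "ML", "BIO", "ML"],
         ["CS",  "ML", "ML",  "ML"],
         ["BIO", "ML", "BIO", "CS"]]
labels = hard_vote(preds)   # -> ["CS", "ML", "BIO", "ML"]
```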

### Domain Identification of Scientific Articles Using Transfer Learning and Ensembles

Abstract
This paper describes our transfer learning-based approach for domain identification of scientific articles as part of the SDPRA-2021 Shared Task. We experiment with transfer learning using pre-trained language models (BERT, RoBERTa, SciBERT), which are then fine-tuned for this task. The results show that a weighted ensemble approach performs best. We propose improvements for future work. The code for the best system is published at https://github.com/SDPRA-2021/shared-task/tree/main/IIITT.

### Identifying Topics of Scientific Articles with BERT-Based Approaches and Topic Modeling

Abstract
This paper describes neural models developed for the shared task of the First Workshop on Scope Detection of the Peer Review Articles, collocated with PAKDD 2021. The aim of the task is to identify the topic or category of scientific abstracts. We investigate several fine-tuned language representation models pretrained on different large-scale corpora. In addition, we conduct experiments on combining BERT-based models with document topic vectors for scientific text classification. The topic vectors are obtained using LDA topic modeling. The topic-informed soft-voting ensemble of neural networks achieved an F1-score of 93.82%.
Anna Glazkova
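Unlike hard voting over discrete labels, the soft-voting ensemble above averages the models' class-probability outputs before taking the argmax. A minimal numpy sketch (the toy probability matrices are illustrative):

```python
import numpy as np

def soft_vote(prob_matrices, weights=None):
    """Average class-probability outputs of several models and take the argmax.

    prob_matrices : list of (n_samples, n_classes) arrays, one per model.
    weights       : optional per-model weights (normalized internally).
    """
    stacked = np.stack(prob_matrices)                      # (n_models, n, c)
    if weights is None:
        avg = stacked.mean(axis=0)
    else:
        w = np.asarray(weights, dtype=float)
        avg = np.tensordot(w / w.sum(), stacked, axes=1)   # weighted average
    return avg.argmax(axis=1)

# Two models, two samples, three classes
m1 = np.array([[0.6, 0.30, 0.10], [0.1, 0.2, 0.7]])
m2 = np.array([[0.2, 0.45, 0.35], [0.2, 0.6, 0.2]])
labels = soft_vote([m1, m2])    # averaged probabilities -> classes [0, 2]
```

Soft voting lets a confident model outweigh several uncertain ones, which hard voting cannot do.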

### Using Transformer Based Ensemble Learning to Classify Scientific Articles

Abstract
Reviewers often fail to appreciate a researcher’s novel ideas and provide generic feedback. Thus, proper assignment of reviewers based on their areas of expertise is necessary. Moreover, reading every paper end-to-end in order to assign it to a reviewer is a tedious task. In this paper, we describe the system our team FideLIPI submitted to the shared task of SDPRA-2021 (https://sdpra-2021.github.io/website/ (accessed January 25, 2021)) [14]. It comprises four independent sub-systems capable of classifying abstracts of scientific literature into one of seven given classes. The first is a RoBERTa [10] based model built over these abstracts. Adding topic model/Latent Dirichlet allocation (LDA) [2] based features to the first model yields the second sub-system. The third is a sentence-level RoBERTa [10] model. The fourth is a Logistic Regression model built using Term Frequency Inverse Document Frequency (TF-IDF) features. We ensemble the predictions of these four sub-systems using majority voting to develop the final system, which gives an F1 score of 0.93 on the test and validation sets. This outperforms the existing state-of-the-art (SOTA) model SciBERT [1] in terms of F1 score on the validation set. Our codebase is available at https://github.com/SDPRA-2021/shared-task/tree/main/FideLIPI.
Sohom Ghosh, Ankush Chopra

### 1st International Workshop on Data Assessment and Readiness for AI

Abstract
In the last several years, AI/ML technologies have become pervasive in academia and industry, finding their utility in ever newer and more challenging applications.
Bortik Bandyopadhyay, Sambaran Bandyopadhyay, Srikanta Bedathur, Nitin Gupta, Sameep Mehta, Shashank Mujumdar, Srinivasan Parthasarathy, Hima Patel

### Cooperative Monitoring of Malicious Activity in Stock Exchanges

Abstract
Stock exchanges are marketplaces to buy and sell securities such as stocks, bonds and commodities. Due to their prominence, stock exchanges are prone to a variety of attacks, which can be classified as external and internal attacks. Internal attacks aim to make profits by manipulating trading processes, e.g., spoofing, quote stuffing, layering and others, which are the specific focus of this paper. Stock exchanges deploy different types of proprietary fraudulent activity detectors that analyze the time series data of traders’ activities, or the activity on a particular stock, to flag potentially malicious transactions, while human analysts probe the flagged transactions further. The key issue is that while the number of anomalous transactions identified can run into thousands or tens of thousands, only a small fraction can realistically be probed by human analysts due to resource constraints. The issue therefore reduces to a dynamic resource allocation problem wherein alerts that represent the most malicious transactions need to be mapped to human analysts for further probing across different time intervals. To address this challenge, we encode the scenario as a Cooperative Target Observation (CTO) problem wherein the analysts (modeled as observers) perform a cooperative observation of alerts that represent potentially malicious activity (modeled as targets), and we develop multiple solution approaches to identify malicious activity.
Bhavya Kalra, Sai Krishna Munnangi, Kushal Majmundar, Naresh Manwani, Praveen Paruchuri

### Data-Debugging Through Interactive Visual Explanations

Abstract
Data readiness analysis consists of methods that profile data and flag quality issues to determine the AI readiness of a given dataset. Such methods are being increasingly used to understand, inspect and correct anomalies in data so that their impact on downstream machine learning is limited. This often requires a human in the loop for validation and application of remedial actions. In this paper, we describe a tool that assists data workers in this task by providing rich explanations of the results obtained through data readiness analysis. The aim is to allow interactive visual inspection and debugging of data issues to enhance interpretability as well as facilitate informed remediation actions by humans in the loop.
Shazia Afzal, Arunima Chaudhary, Nitin Gupta, Hima Patel, Carolina Spina, Dakuo Wang

### Data Augmentation for Fairness in Personal Knowledge Base Population

Abstract
Cold start knowledge base population (KBP) is the problem of populating a knowledge base from unstructured documents. While neural networks have led to improvements in the different tasks that are part of KBP, the overall F1 of the end-to-end system remains quite low. This problem is more acute in personal knowledge bases, which present additional challenges with regard to data protection, fairness and privacy. In this work, we use data augmentation to populate a more complete personal knowledge base from the TACRED dataset. We then use explainability techniques and representative set sampling to show that the augmented knowledge base is more fair and diverse as well.
Lingraj S. Vannur, Balaji Ganesan, Lokesh Nagalapatti, Hima Patel, M. N. Tippeswamy

### ROC Bot: Towards Designing Virtual Command Centre for Energy Management

Abstract
Domains such as energy management rely heavily on dashboards and other related interfaces to manage infrastructure and resources. Users in this domain use dashboards to manage data and extensively perform periodic analysis to save energy and cost. Creating multiple dashboards for data visualization is not user-friendly from a design perspective. This motivates the need for a single interface through which users can explore, visualize and summarize data. Combining this with features such as anomaly detection can identify various issues and assist in the day-to-day monitoring of an energy management center.
In this paper, we present ROC (Resource Optimization Center) Bot, a novel data exploration tool with a natural language interface. ROC Bot leverages recent advances in deep models to make query understanding more robust in the following ways: First, ROC Bot uses a deep model to translate natural language statements to SQL, making the translation process more robust to paraphrasing and other linguistic variations. Second, to support users in automatically summarizing data, ROC Bot provides a machine learning model that helps write natural-looking summaries of any given tabular data.
Rishi Tiwari, Mohammed Afaque, Amit Sangroya, Mrinal Rawat

### Digitize-PID: Automatic Digitization of Piping and Instrumentation Diagrams

Abstract
Digitization of scanned Piping and Instrumentation diagrams (P&ID), widely used in manufacturing and mechanical industries such as oil and gas over several decades, has become a critical bottleneck in dynamic inventory management and in the creation of smart P&IDs compatible with the latest CAD tools. Historically, P&ID sheets have been manually generated at the design stage, before being scanned and stored as PDFs. Current digitization initiatives involve manual processing and are consequently very time consuming, labour intensive and error-prone. Thanks to advances in image processing and machine and deep learning techniques, there is an emerging body of work on P&ID digitization. However, existing solutions face several challenges owing to the variation in scale, size and noise in the P&IDs, the sheer complexity and crowdedness of the drawings, the domain knowledge required to interpret them, and the very minute visual differences among symbols. This motivates our solution, Digitize-PID, which comprises an end-to-end pipeline for detection of core components from P&IDs, such as pipes, symbols and textual information, followed by their association with each other and, eventually, the validation and correction of output data based on inherent domain knowledge. A novel and efficient kernel-based line detection method and a two-step method for detection of complex symbols based on a fine-grained deep recognition technique are presented in the paper. In addition, we have created an annotated synthetic dataset, Dataset-P&ID, of 500 P&IDs incorporating different types of noise and complex symbols, which is made available for public use (currently there exists no public P&ID dataset). We evaluate our proposed method on this synthetic dataset and on a real-world anonymized private dataset of 12 P&ID sheets. Results show that Digitize-PID outperforms the existing state-of-the-art for P&ID digitization.
Shubham Paliwal, Arushi Jain, Monika Sharma, Lovekesh Vig
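The simplest form of kernel-based line detection slides a small all-ones kernel over a binarized drawing: a full response means every pixel under the kernel lies on a line segment. A minimal numpy sketch of the idea for horizontal lines (the paper's actual kernels and thresholds are not specified here; names and toy image are illustrative):

```python
import numpy as np

def detect_horizontal_lines(binary_img, min_len=3):
    """Flag pixels lying on horizontal dark runs of at least `min_len` pixels.

    A 1 x min_len kernel of ones is slid along each row; a full response
    means every pixel under the kernel belongs to the run, so the whole
    window is marked as part of a line.
    """
    h, w = binary_img.shape
    hits = np.zeros_like(binary_img, dtype=bool)
    for r in range(h):
        for c in range(w - min_len + 1):
            window = binary_img[r, c:c + min_len]
            if window.sum() == min_len:          # kernel fully matched
                hits[r, c:c + min_len] = True
    return hits

img = np.array([[1, 1, 1, 1, 0],
                [0, 1, 0, 1, 0]])                # 1 = drawn (dark) pixel
lines = detect_horizontal_lines(img, min_len=3)  # only row 0 has a long run
```

A vertical (or rotated) kernel detects lines in other orientations the same way; real P&ID pipelines then group the hit pixels into line segments.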