
2014 | Book

Advances in Artificial Intelligence

27th Canadian Conference on Artificial Intelligence, Canadian AI 2014, Montréal, QC, Canada, May 6-9, 2014. Proceedings

Edited by: Marina Sokolova, Peter van Beek

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science


About this book

This book constitutes the refereed proceedings of the 27th Canadian Conference on Artificial Intelligence, Canadian AI 2014, held in Montréal, QC, Canada, in May 2014. The 22 regular papers and 18 short papers presented together with 3 invited talks were carefully reviewed and selected from 94 submissions. The papers cover a variety of topics within AI, such as: agent systems; AI applications; automated reasoning; bioinformatics and BioNLP; case-based reasoning; cognitive models; constraint satisfaction; data mining; E-commerce; evolutionary computation; games; information retrieval; knowledge representation; machine learning; multi-media processing; natural language processing; neural nets; planning; privacy-preserving data mining; robotics; search; smart graphics; uncertainty; user modeling; web applications.

Table of contents

Frontmatter

Long Papers

A Novel Particle Swarm-Based Approach for 3D Motif Matching and Protein Structure Classification

This paper investigates the applicability of Particle Swarm Optimization (PSO) to motif matching in protein structures, which can help in protein structure classification and function annotation. A 3D motif is a spatial, local pattern in a protein structure important for its function. In this study, the problem of 3D motif matching is formulated as an optimization task with an objective function of minimizing the least Root Mean Square Deviation (lRMSD) between the query motif and target structures. Evaluation results on two protein datasets demonstrate the ability of the proposed approach to locate the true query motif of all 66 target proteins almost always (9 and 8 times, respectively, on average, out of 10 trials per target). A large-scale application of motif matching is protein classification, where the proposed approach distinguished between the positive and negative examples by consistently ranking all positive examples at the very top of the search results.

Hazem Radwan Ahmed, Janice Glasgow
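
As a rough illustration of the quantity behind the objective function, the sketch below computes a plain RMSD between two equally sized sets of 3D coordinates. It assumes the two point sets are already superposed; the least RMSD used in the paper would additionally minimize over rigid-body superpositions, which is omitted here. The coordinates and the function name `rmsd` are hypothetical and not taken from the paper.

```python
import math


def rmsd(coords_a, coords_b):
    """Plain RMSD between two equally sized lists of (x, y, z) points.

    Assumption: the points are already paired and superposed; no optimal
    rotation or translation is searched for here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Hypothetical toy coordinates: the candidate is the query shifted along z.
query = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
candidate = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0)]
print(rmsd(query, candidate))
```
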
Rhetorical Figuration as a Metric in Text Summarization

We show that surface-level markers of pragmatic intent can be used to recognize the important sentences in text and can thereby improve the performance of text summarization systems. In particular, we focus on using automated detection of rhetorical figures—characteristic syntactic patterns of persuasive language—to provide information for an additional metric to enhance the performance of the MEAD summarizer.

Mohammed Alliheedi, Chrysanne Di Marco
Empirical Evaluation of Intelligent Mobile User Interfaces in Healthcare

The users of mobile healthcare applications are not all the same, and there may be considerable variation in their requirements. In order for the application to be attractive to potential adopters, the interface should be very convenient and straightforward to use, and easy to learn. One way to accomplish this is with an intelligent mobile user interface (IMUI), so that the application can be readily adapted to suit user preferences. This paper presents the results of adapting the IMUI for the various user stereotypes and contexts in the healthcare environment. We begin with a context model of the healthcare domain and an analysis of user needs, and then proceed to solution analysis, followed by product design, development, and user testing in a real environment. In terms of IMUI design, we focus on adapting the MUI features for users in the healthcare context, either at design time or at runtime.

Reem Alnanih, Olga Ormandjieva, Thiruvengadam Radhakrishnan
Learning to Measure Influence in a Scientific Social Network

In research, influence is often synonymous with importance; the researcher that is judged to be influential is often chosen for the grants, distinctions and promotions that serve as fuel for research programs. The influence of a researcher is often measured by how often he or she is cited, yet as a measure of influence, we show that citation frequency is only weakly correlated with influence ratings collected from peers. In this paper, we use machine learning to enable a new system that provides a better measure of researcher influence. This system predicts the influence of one researcher on another via a range of novel social, linguistic, psychological, and bibliometric features. To collect data for training and testing this approach, we conducted a survey of 74 researchers in the field of computational linguistics, and collected thousands of influence ratings. Our results on this data show that our approach significantly outperforms measures based on citations alone, improving prediction accuracy by 56%. We also perform a detailed analysis of the key features in our model, and make some important observations about the scientific and non-scientific factors that most predict researcher influence.

Shane Bergsma, Regan L. Mandryk, Gordon McCalla
Filtering Personal Queries from Mixed-Use Query Logs

Queries performed against the open Web during working hours reveal missing content in the internal documentation within an organization. Mining such queries is thus advantageous but it must strictly adhere to privacy policy and meet privacy expectations of the employees. Particularly, we need to filter queries related to non-work activities. We show that, in the case of technical support agents, 78.7% of personal queries can be filtered using a words-as-features Maximum Entropy approach, while losing only 9.3% of the business related queries. Further improvements can be expected when running a data mining algorithm on the queries and when filtering private information from its output.

Ary Fagundes Bressane Neto, Philippe Desaulniers, Pablo Ariel Duboue, Alexis Smirnov
Analysis of Feature Maps Selection in Supervised Learning Using Convolutional Neural Networks

Artificial neural networks have been widely used for machine learning tasks such as object recognition. Recent developments have made use of biologically inspired architectures, such as the Convolutional Neural Network. The nature of the Convolutional Neural Network is that each convolutional layer of the network contains a certain number of feature maps or kernels. The number of these used has historically been determined on an ad-hoc basis. We propose a theoretical method for determining the optimal number of feature maps using the dimensions of the feature map or convolutional kernel. We find that the empirical data suggests that our theoretical method works for extremely small receptive fields, but doesn’t generalize as clearly to all receptive field sizes. Furthermore, we note that architectures that are pyramidal rather than equally balanced tend to make better use of computational resources.

Joseph Lin Chu, Adam Krzyżak
Inconsistency versus Accuracy of Heuristics

Many studies in heuristic search suggest that the accuracy of the heuristic used has a positive impact on improving the performance of the search. In another direction, historical research perceives that the performance of heuristic search algorithms, such as A* and IDA*, can be improved by requiring the heuristics to be consistent – a property satisfied by any perfect heuristic. However, a few recent studies show that inconsistent heuristics can also be used to achieve a large improvement in these heuristic search algorithms. These results raise a natural question: which property of heuristics, accuracy or consistency/inconsistency, should we focus on when building heuristics?

In this article, we investigate the relationship between the inconsistency and the accuracy of heuristics with A* search. Our analytical result reveals a correlation between these two properties. We then run experiments on the domain for the Knapsack problem with a family of practical heuristics. Our empirical results show that in many cases, the more accurate heuristics also have a higher level of inconsistency and result in fewer node expansions by A*.

Hang Dinh
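
To make the two properties concrete, the sketch below checks a heuristic on a tiny directed graph using the standard definitions: a heuristic is consistent if h(u) <= c(u, v) + h(v) on every edge, and admissible if it never overestimates the true cost to the goal. The graph, costs, and heuristic values are hypothetical and are not drawn from the paper's Knapsack experiments.

```python
def inconsistent_edges(edges, h):
    """Return the edges (u, v, cost) that violate h(u) <= cost + h(v)."""
    return [(u, v, c) for (u, v, c) in edges if h[u] > c + h[v]]


def is_admissible(h, true_cost):
    """True if h never overestimates the true cost-to-goal."""
    return all(h[n] <= true_cost[n] for n in h)


# Hypothetical directed graph with goal node G.
edges = [("A", "B", 1), ("B", "G", 3), ("A", "G", 5)]
true_cost = {"A": 4, "B": 3, "G": 0}  # cheapest path costs to G
h = {"A": 4, "B": 1, "G": 0}          # perfect at A, weak at B

print(is_admissible(h, true_cost))
print(inconsistent_edges(edges, h))
```

In this toy case the heuristic is admissible yet inconsistent on the edge from A to B, because its value drops by more than the edge cost.
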
VMSP: Efficient Vertical Mining of Maximal Sequential Patterns

Sequential pattern mining is a popular data mining task with wide applications. However, it may present too many sequential patterns to users, which makes it difficult for users to comprehend the results. As a solution, it was proposed to mine maximal sequential patterns, a compact representation of the set of sequential patterns, which is often several orders of magnitude smaller than the set of all sequential patterns. However, the task of mining maximal patterns remains computationally expensive. To address this problem, we introduce a vertical mining algorithm named VMSP (Vertical mining of Maximal Sequential Patterns). It is to our knowledge the first vertical mining algorithm for mining maximal sequential patterns. An experimental study on five real datasets shows that VMSP is up to two orders of magnitude faster than the current state-of-the-art algorithm.

Philippe Fournier-Viger, Cheng-Wei Wu, Antonio Gomariz, Vincent S. Tseng
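
The notion of a maximal pattern can be illustrated with a naive post-filter: a pattern is maximal if it is not a subsequence of any other pattern in the set. This is only a sketch of the definition, not the VMSP algorithm, and it assumes a simplified setting in which every sequence element is a single item rather than an itemset. The pattern set `frequent` is hypothetical.

```python
def is_subsequence(short, long):
    """True if `short` can be obtained from `long` by deleting items."""
    remaining = iter(long)
    # `item in remaining` advances the iterator, so order is respected.
    return all(item in remaining for item in short)


def maximal_patterns(patterns):
    """Keep patterns that are not a subsequence of another, different pattern."""
    unique = list(dict.fromkeys(tuple(p) for p in patterns))
    return [
        p for p in unique
        if not any(p != q and is_subsequence(p, q) for q in unique)
    ]


# Hypothetical set of frequent sequential patterns.
frequent = [("a",), ("b",), ("a", "b"), ("a", "c"), ("a", "b", "c"), ("d",)]
print(maximal_patterns(frequent))
```

The quadratic comparison used here is exactly the kind of cost a dedicated mining algorithm tries to avoid on large pattern sets.
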
A Comparison of Multi-Label Feature Selection Methods Using the Random Forest Paradigm

In this paper, we discuss three wrapper multi-label feature selection methods based on the Random Forest paradigm. These variants differ in the way they consider label dependence within the feature selection process. To assess their performance, we conduct an extensive experimental comparison of these strategies against recently proposed approaches using seven benchmark multi-label data sets from different domains. Random Forest handles feature selection accurately in the multi-label context. Surprisingly, taking into account the dependence between labels in the context of ensemble multi-label feature selection was not found to be very effective.

Ouadie Gharroudi, Haytham Elghazel, Alex Aussem
Analyzing User Trajectories from Mobile Device Data with Hierarchical Dirichlet Processes

Mobile devices have become pervasive among users in both work environments and everyday life, and they sense a wealth of information that can be exploited for a variety of tasks, such as activity recognition, security or health monitoring. In this paper, we explore the feasibility of trajectory clustering, i.e., detecting similarities between moving objects, for an application related to workplace productivity improvement. We use Hierarchical Dirichlet Processes due to their ability to automatically extract appropriate trajectory segments. The application domain is the analysis of RSSI data, where this machine learning method proves successful.

Negar Ghourchian, Doina Precup
Use of Ontology and Cluster Ensembles for Geospatial Clustering Analysis

Geospatial clustering is an important topic in spatial analysis and knowledge discovery research. However, most existing clustering methods cluster geospatial data at the data level without considering domain knowledge and users’ goals during the clustering process. In this paper, we propose an ontology-based geospatial cluster ensemble approach to produce good clustering results with the consideration of domain knowledge and users’ goals. The approach includes two components: an ontology-based expert system and a cluster ensemble method. The ontology-based expert system represents geospatial and clustering domain knowledge and identifies the appropriate clustering components (e.g., geospatial datasets, attributes of the datasets, and clustering methods) based on a specific application requirement. The cluster ensemble combines a diverse set of clustering results produced by recommended clustering components into an optimal clustering result. A real case study has been conducted to demonstrate the efficiency and practicality of the approach.

Wei Gu, Zhilin Zhang, Baijie Wang, Xin Wang
Learning Latent Factor Models of Travel Data for Travel Prediction and Analysis

We describe latent factor probability models of human travel, which we learn from data. The latent factors represent interpretable properties: travel distance cost, desirability of destinations, and affinity between locations. Individuals are clustered into distinct styles of travel. The latent factors combine in a multiplicative manner, and are learned using Maximum Likelihood.

We show that our models explain the data significantly better than histogram-based methods. We also visualize the model parameters to show information about travelers and travel patterns. We show that different individuals exhibit different propensity to travel large distances. We extract the desirability of destinations on the map, which is distinct from their popularity. We show that pairs of locations have different affinities with each other, and that these affinities are partly explained by travelers’ preference for staying within national borders and within the borders of linguistic areas. The method is demonstrated on two sources of travel data: geotags from Flickr images, and GPS tracks from Shanghai taxis.

Michael Guerzhoy, Aaron Hertzmann
Task Oriented Privacy Preserving Data Publishing Using Feature Selection

In this work we show that feature selection can be used to preserve privacy of individuals without compromising the accuracy of data classification. Furthermore, when feature selection is combined with anonymization techniques, we are able to publish privacy preserving datasets. We use several UCI data sets to empirically support our claim. The obtained results show that these privacy-preserving datasets provide classification accuracy comparable and in some cases superior to the accuracy of classification of the original datasets. We generalized the results with a paired t-test applied on different levels of anonymization.

Yasser Jafer, Stan Matwin, Marina Sokolova
A Consensus Approach for Annotation Projection in an Advanced Dialog Context

Data annotation is a common way to improve the reliability of advanced dialog applications. Unfortunately, since those annotations are highly language-dependent, the universalization can become a very lengthy process. Even though some projection methods exist, most of them require a deeper level of annotation than the one used for advanced dialogs. In this paper, we present a consensus approach that exploits the specificities of a sparse annotation in order to do the data projection.

Simon Julien, Philippe Langlais, Réal Tremblay
Partial Satisfaction Planning under Time Uncertainty with Control on When Objectives Can Be Aborted

In real world planning problems, it might not be possible for an automated agent to satisfy all the objectives assigned to it. When this situation arises, classical planning returns no plan. In partial satisfaction planning, it is possible to satisfy only a subset of the objectives. To solve this kind of problem, an agent can select a subset of objectives and return the plan that maximizes the net benefit, i.e. the sum of the utilities of the satisfied objectives minus the sum of the costs of the actions. This approach has been tested on deterministic planning. This paper extends partial satisfaction planning to problems with uncertainty on time. For problems under uncertainty, the best subset of objectives cannot be calculated at planning time. The effective duration of actions at execution time may dynamically influence the achievable subset of objectives. Our approach introduces special abort actions to explicitly abort objectives. These actions can have deadlines in order to control when objectives can be aborted.

Sylvain Labranche, Éric Beaudry
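
The net benefit criterion from the abstract, the sum of satisfied objective utilities minus the sum of action costs, can be sketched for the deterministic case as follows. The objectives, utilities, plans, and costs are hypothetical, and the sketch does not model time uncertainty or abort actions.

```python
def net_benefit(plan, utilities):
    """Sum of utilities of satisfied objectives minus the sum of action costs."""
    gained = sum(utilities[objective] for objective in plan["objectives"])
    return gained - sum(plan["action_costs"])


# Hypothetical objectives and candidate plans.
utilities = {"deliver_sample": 10.0, "take_photo": 4.0, "recharge": 3.0}
candidate_plans = {
    "plan_1": {"objectives": ["deliver_sample"], "action_costs": [2.0, 3.0]},
    "plan_2": {
        "objectives": ["deliver_sample", "take_photo"],
        "action_costs": [2.0, 3.0, 5.0],
    },
    "plan_3": {"objectives": ["take_photo", "recharge"], "action_costs": [1.0, 1.5]},
}

# Pick the plan with the highest net benefit.
best = max(candidate_plans, key=lambda name: net_benefit(candidate_plans[name], utilities))
print(best, net_benefit(candidate_plans[best], utilities))
```

Note that satisfying more objectives is not automatically better: in this toy data the plan that adds a second objective pays more in action costs than it gains in utility.
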
Active Learning Strategies for Semi-Supervised DBSCAN

The semi-supervised, density-based clustering algorithm SSDBSCAN extracts clusters of a given dataset from different density levels by using a small set of labeled objects. A critical assumption of SSDBSCAN is, however, that at least one labeled object for each natural cluster in the dataset is provided. This assumption may be unrealistic when only a very few labeled objects can be provided, for instance due to the cost associated with determining the class label of an object. In this paper, we introduce a novel active learning strategy to select “most representative” objects whose class label should be determined as input for SSDBSCAN. By incorporating a Laplacian Graph Regularizer into a Local Linear Reconstruction method, our proposed algorithm selects objects that can represent the whole data space well. Experiments on synthetic and real datasets show that using the proposed active learning strategy, SSDBSCAN is able to extract more meaningful clusters even when only very few labeled objects are provided.

Jundong Li, Jörg Sander, Ricardo Campello, Arthur Zimek
Learning How Productive and Unproductive Meetings Differ

In this work, we analyze the productivity of meetings and predict productivity levels using linguistic and structural features. This task relates to the task of automatic extractive summarization, as we define productivity in terms of the number (or percentage) of sentences from a meeting that are considered summary-worthy. We describe the traits that differentiate productive and unproductive meetings. We additionally explore how meetings begin and end, and why many meetings are slow to get going and last longer than necessary.

Gabriel Murray
Gene Functional Similarity Analysis by Definition-based Semantic Similarity Measurement of GO Terms

The rapid growth of biomedical data annotated with the Gene Ontology (GO) vocabulary demands an intelligent method for measuring semantic similarity between GO terms, which would greatly facilitate the analysis of functional similarities between genes. This paper introduces two efficient methods for measuring the semantic similarity and relatedness of GO terms. By taking the definitions of GO terms into consideration, these methods address the limitations of existing GO term similarity measures. The two developed and implemented measures are, in essence, optimized and adapted versions of the Gloss Vector semantic relatedness measure for semantic similarity/relatedness estimation between GO terms. After constructing optimized and similarity-adapted definition vectors (Gloss Vectors) of all the terms included in GO, the cosine of the angle between two terms’ definition vectors represents the degree of similarity or relatedness of the two terms. Experimental studies show that this semantic definition-based approach outperforms all existing methods in terms of the correlation with gene expression data.

Ahmad Pesaranghader, Ali Pesaranghader, Azadeh Rezaei, Danoosh Davoodi
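
The final scoring step described above, the cosine of the angle between two definition vectors, can be sketched as below. The vectors are hypothetical toy values; how the paper builds and optimizes its Gloss Vectors is not reproduced here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_u * norm_v)


# Hypothetical definition vectors for two terms.
term_a = [1.0, 2.0, 0.0, 3.0]
term_b = [2.0, 4.0, 0.0, 6.0]
print(cosine_similarity(term_a, term_b))
```
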
Text Representation Using Multi-level Latent Dirichlet Allocation

We introduce a novel text representation method to be applied on corpora containing short / medium length textual documents. The method applies Latent Dirichlet Allocation (LDA) on a corpus to infer its major topics, which will be used for document representation. The representation that we propose has multiple levels (granularities) by using different numbers of topics. We postulate that interpreting data in a more general space, with fewer dimensions, can improve the representation quality. Experimental results support the informative power of our multi-level representation vectors. We show that choosing the correct granularity of representation is an important aspect of text classification. We propose a multi-level representation, at different topical granularities, rather than choosing one level. The documents are represented by topical relevancy weights, in a low-dimensional vector representation. Finally, the proposed representation is applied to a text classification task using several well-known classification algorithms. We show that it leads to very good classification performance. Another advantage is that, with a small compromise on accuracy, our low-dimensional representation can be fed into many supervised or unsupervised machine learning algorithms that empirically cannot be applied on the conventional high-dimensional text representation methods.

Amir H. Razavi, Diana Inkpen
A Comparison of h² and MMM for Mutex Pair Detection Applied to Pattern Databases

In state space search or planning, a pair of variable-value assignments that does not occur in any reachable state is considered a mutually exclusive (mutex) pair. To improve the efficiency of planners, the problem of detecting such pairs has been addressed frequently in the planning literature. No known efficient method for detecting mutex pairs is able to find all such pairs. Hence, the number and type of mutex constraints detected by various algorithms are different from one another.

The purpose of this paper is to study the effects on search performance when errors are made by the mutex detection method that is informing the construction of a pattern database (PDB). PDBs are deployed for creating heuristic functions that are then used to guide search. We consider two mutex detection methods, h², which can fail to recognize a mutex pair but never regards a reachable pair as mutex, and the sampling-based method MMM, which makes the opposite type of error. Both methods are very often perfect, i.e. they exactly identify which pairs are mutex and which are reachable. In the cases that they err that we examine in this paper, h²’s errors cause search to be moderately slower (7%–24%) whereas MMM’s errors have very little effect on search speed or suboptimality, even when its sample size is quite small.

Mehdi Sadeqi, Robert C. Holte, Sandra Zilles
Ensemble of Multiple Kernel SVM Classifiers

Multiple kernel learning (MKL) allows the practitioner to optimize over linear combinations of kernels and shows good performance in many applications. However, many MKL algorithms require very high computational costs in real world applications. In this study, we present a framework which uses multiple kernel SVM classifiers as the base learners for stacked generalization, a general method of using a high-level model to combine lower-level models, to achieve greater computational efficiency. The experimental results show that our MKL-based stacked generalization algorithm combines advantages from both MKL and stacked generalization. Compared to other general ensemble methods tested in this paper, this method achieves greater performance on predictive accuracy.

Xiaoguang Wang, Xuan Liu, Nathalie Japkowicz, Stan Matwin
Combining Textual Pre-game Reports and Statistical Data for Predicting Success in the National Hockey League

In this paper, we create meta-classifiers to forecast success in the National Hockey League. We combine three classifiers that use various types of information. The first one uses as features numerical data and statistics collected during previous games. The last two classifiers use pre-game textual reports: one classifier uses words as features (unigrams, bigrams and trigrams) in order to detect the main ideas expressed in the texts and the second one uses features based on counts of positive and negative words in order to detect the opinions of the pre-game report writers. Our results show that meta classifiers that use the two data sources combined in various ways obtain better prediction accuracies than classifiers that use only numerical data or only textual data.

Josh Weissbock, Diana Inkpen

Short Papers

Using Ensemble of Bayesian Classifying Algorithms for Medical Systematic Reviews

Systematic reviews are considered fundamental tools for Evidence-Based Medicine. Such reviews require frequent and time-consuming updating. This study aims to compare the performance of combining relatively simple Bayesian classifiers using a fixed rule, to the relatively complex linear Support Vector Machine for medical systematic reviews. A collection of four systematic drug reviews is used to compare the performance of the classifiers in this study. Cross-validation experiments were performed to evaluate performance. We found that combining Discriminative Multinomial Naïve Bayes and Complement Naïve Bayes performs equally well or better than SVM while being about 25% faster than SVM in training time. The results support the usefulness of using an ensemble of Bayesian classifiers for machine learning-based automation of systematic reviews of medical topics, especially when datasets have a large number of abstracts. Further work is needed to integrate the powerful features of such Bayesian classifiers together.

Abdullah Aref, Thomas Tran
Complete Axiomatization and Complexity of Coalition Logic of Temporal Knowledge for Multi-agent Systems

Coalition Logic (CL) is one of the most influential logical formalisms for strategic abilities of multi-agent systems. However, CL cannot formalize the evolution of rational mental attitudes of the agents, such as knowledge. In this paper, we introduce Coalition Logic of Temporal Knowledge (CLTK), by incorporating a temporal logic of knowledge (Halpern and Vardi’s logic of CKL_n) into CL to equip CL with the power to formalize how agents’ knowledge (individual or group knowledge) evolves over time through coalitional forces, as well as the temporal properties of strategic abilities. Furthermore, we provide a complete axiomatization of CLTK, along with the complexity of the satisfiability problem, which is shown to be EXPTIME-complete.

Qingliang Chen, Kaile Su, Yong Hu, Guiwu Hu
Experts and Machines against Bullies: A Hybrid Approach to Detect Cyberbullies

Cyberbullying is becoming a major concern in online environments with troubling consequences. However, most of the technical studies have focused on the detection of cyberbullying through identifying harassing comments rather than preventing the incidents by detecting the bullies. In this work we study the automatic detection of bully users on YouTube. We compare three types of automatic detection: an expert system, supervised machine learning models, and a hybrid type combining the two. All these systems assign a score indicating the level of “bulliness” of online bullies. We demonstrate that the expert system outperforms the machine learning models. The hybrid classifier shows an even better performance.

Maral Dadvar, Dolf Trieschnigg, Franciska de Jong
Towards a Tunable Framework for Recommendation Systems Based on Pairwise Preference Mining Algorithms

In this article, we present PrefRec, a general framework for developing RS using Preference Mining and Preference Aggregation techniques. We focus on Pairwise Preference Mining techniques allowing to predict which, between two objects, is the preferred one. A preliminary empirical study for analyzing the influence of the different factors involved in each of the five modules of PrefRec is presented.

Sandra de Amo, Cleiane G. Oliveira
Belief Change and Non-deterministic Actions

Belief change refers to the process in which an agent incorporates new information together with some pre-existing set of beliefs. We are interested in the situation where an agent must incorporate new information after the execution of actions with non-deterministic effects. In this case, the observation plays two distinct roles. First, it provides information about the current state of the world. Second, it provides information about the outcomes of any actions that have previously occurred. While the literature on belief change has extensively explored the former, we suggest that existing approaches to belief change have not explicitly considered how an agent uses observed information to determine the effects of non-deterministic actions. In this paper, we propose an approach in which action effects simply progress the agent’s underlying plausibility ordering over possible states. In the case of non-deterministic actions, new possible world trajectories are created and then subsequently dismissed as dictated by observations.

Aaron Hunter
Robust Features for Detecting Evasive Spammers in Twitter

Researchers have designed features of Twitter accounts that help machine learning algorithms to detect spammers. Spammers try to evade detection by manipulating such features. This has led to the design of robust features, i.e., features that are hard to manipulate. In this paper, we propose and evaluate five new robust features.

Muhammad Rezaul Karim, Sandra Zilles
Gene Reduction for Cancer Classification Using Cascaded Neural Network with Gene Masking

This paper presents an approach to cancer classification from gene expression profiling using a cascaded neural network classifier. The method aims to reduce the number of genes required to successfully classify the small round blue cell tumours of childhood (SRBCT) into four categories. The system designed to do this consists of a feedforward neural network and is trained with a genetic algorithm. A concept of ‘gene masking’ is introduced to the system, which significantly reduces the number of genes required for producing very high accuracy classification.

Raneel Kumar, Krishnil Chand, Sunil Pranit Lal
Polynomial Multivariate Approximation with Genetic Algorithms

We discuss an algorithm which allows us to find the algebraic expression of a dependent variable as a function of an arbitrary number of independent ones where data is arbitrary, i.e. it may have arisen from experimental data. The possibility of such approximation is proved starting from the Universal Approximation Theorem (UAT). As opposed to the neural network (NN) approach to which it is frequently associated, the relationship between the independent variables is explicit, thus resolving the “black box” characteristics of NNs. It implies the use of a nonlinear function (called the activation function) such as the logistic $1/(1+e^{-x})$. Thus, any function is expressible as a combination of a set of logistics. We show that a close polynomial approximation of logistic is possible by using only a constant and monomials of odd degree. Hence, an upper bound (D) on the degree of the polynomial may be found. Furthermore, we may calculate the form of the model resulting from D. We discuss how to determine the best such set by using a genetic algorithm leading to the best L∞-L2 approximation. It allows us to find the best approximation polynomial given a selected fixed number of coefficients. It then finds the best combination of coefficients and their values. We present some experimental results.

Angel Kuri-Morales, Alejandro Cartas-Ayala
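
The claim that the logistic can be approximated with only a constant and odd-degree monomials can be checked numerically near the origin. The sketch below uses the degree-5 Taylor polynomial of the logistic around 0, which has exactly that form because the logistic minus 1/2 is an odd function. This is an assumption made for illustration: the paper fits its coefficients with a genetic algorithm, which is not reproduced here.

```python
import math


def logistic(x):
    """The logistic function 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def logistic_odd_poly(x):
    """Degree-5 Taylor polynomial around 0: a constant plus odd monomials."""
    return 0.5 + x / 4.0 - x ** 3 / 48.0 + x ** 5 / 480.0


# Compare the two on a few points in [-1, 1].
for x in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(x, logistic(x), logistic_odd_poly(x))
```

A Taylor polynomial is only accurate close to the expansion point, so a fit over a wider interval would need different coefficients.
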
Improving Word Embeddings via Combining with Complementary Languages

Word embeddings have recently demonstrated outstanding results across various NLP tasks. However, most existing methods for learning word embeddings use a monolingual corpus without exploiting the linguistic relationships among languages. In this paper, we introduce a novel CCL (Combination with Complementary Languages) method to improve word embeddings. Under this method, a word embedding is replaced by its center word embedding, which is obtained by combining it with the corresponding word embeddings in other languages. We apply our method to several baseline models and evaluate the quality of the word embeddings on a word similarity task across two benchmark datasets. Despite its simplicity, the results show that our method is surprisingly effective in capturing semantic information and outperforms baselines by a large margin, by up to 20 points of Spearman rank correlation (ρ×100).

Changliang Li, Bo Xu, Gaowei Wu, Tao Zhuang, Xiuying Wang, Wendong Ge
A Clustering Density-Based Sample Reduction Method

In this paper, we propose a new cluster-based sample reduction method which is unsupervised, geometric, and density-based. The original data is initially divided into clusters, and each cluster is divided into “portions” defined as the areas between two concentric circles. Then, using the proposed geometric-based formulas, the membership value of each sample belonging to a specific portion is calculated. Samples are then selected from the original data according to the corresponding calculated membership value. We conduct various experiments on the NSL-KDD and KDDCup99 datasets.

Mahdi Mohammadi, Bijan Raahemi, Ahmad Akbari
The Use of NLP Techniques in Static Code Analysis to Detect Weaknesses and Vulnerabilities

We employ classical NLP techniques (n-grams and various smoothing algorithms) combined with machine learning for non-NLP applications of detection, classification, and reporting of weaknesses related to vulnerabilities or bad coding practices found in artificial constrained languages, such as programming languages and their compiled counterparts. We compare and contrast the NLP approach to the signal processing approach in our results summary along with concrete promising results for specific test cases of open-source software written in C, C++, and Java. We use the open-source MARF’s NLP framework and its MARFCAT application for the task, where the latter originally was designed for the Static Analysis Tool Exposition (SATE) workshop.
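
To make the n-gram and smoothing idea concrete, here is a minimal sketch of a bigram model over source-code tokens with add-one (Laplace) smoothing. It is not MARF's or MARFCAT's implementation; the whitespace tokenization, the function names, and the choice of add-one smoothing are illustrative assumptions.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-token windows, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def add_one_bigram_prob(tokens: Sequence[str], prev: str, word: str) -> float:
    """P(word | prev) with add-one (Laplace) smoothing."""
    vocab_size = len(set(tokens))
    bigram_counts = Counter(ngrams(tokens, 2))
    # The final token never acts as a context, so it is excluded here.
    context_counts = Counter(tokens[:-1])
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + vocab_size)


# A whitespace-tokenized C fragment (toy training data).
tokens = "char * p = malloc ( n ) ; free ( p ) ;".split()
```

Smoothing gives unseen token pairs a small non-zero probability, so a classifier can score code fragments that contain sequences absent from the training files.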

Serguei A. Mokhov, Joey Paquet, Mourad Debbabi
Weka-SAT: A Hierarchical Context-Based Inference Engine to Enrich Trajectories with Semantics

A major challenge in trajectory data analysis is the definition of approaches to enrich it semantically. In this paper, we consider machine learning and context information to enrich trajectory data in three steps: (1) the definition of a context model for the trajectory domain; (2) the generation of rules based on that context model; (3) the implementation of a classification algorithm that processes these rules and adds semantics to trajectories. This approach is hierarchical and combines clustering and classification tasks to identify important parts of trajectories and to annotate them with semantics. These ideas were integrated into the Weka toolkit and evaluated on fishing vessel trajectories.

Bruno Moreno, Amílcar S. Júnior, Valéria Times, Patrícia Tedesco, Stan Matwin
Inferring Road Maps from Sparsely-Sampled GPS Traces

In this paper, we propose a new segmentation-and-grouping framework for road map inference from sparsely-sampled GPS traces. First, we extend DBSCAN with an orientation constraint to partition the whole point set of traces into clusters representing road segments. Second, we propose an adaptive k-means algorithm, in which the k value is determined by an angle threshold, to reconstruct nearly straight line segments. Third, the line segments are grouped according to the ‘Good Continuity’ principle of Gestalt Law to form a ‘Stroke’ for recovering the road map. Experimental results show that our algorithm is robust to noise and sampling rate. In comparison with previous work, our method has advantages in inferring road maps from sparsely-sampled GPS traces.
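
One way to read "DBSCAN with an orientation constraint" is that two GPS points count as neighbors only if they are close in space and also travel in a similar direction. The sketch below shows such a neighbor test under that assumption. The `GpsPoint` type, the planar coordinates in metres, and both thresholds are hypothetical, and the abstract does not say how opposite travel directions are treated.

```python
import math
from typing import NamedTuple


class GpsPoint(NamedTuple):
    x: float        # metres, in a projected coordinate system (assumed)
    y: float
    heading: float  # degrees, 0 to 360


def heading_difference(a: float, b: float) -> float:
    """Smallest absolute angle between two headings, in degrees."""
    diff = abs(a - b) % 360.0
    return min(diff, 360.0 - diff)


def is_neighbor(p: GpsPoint, q: GpsPoint, eps: float, max_angle: float) -> bool:
    """Neighbor test: within eps metres and within max_angle degrees."""
    close = math.hypot(p.x - q.x, p.y - q.y) <= eps
    aligned = heading_difference(p.heading, q.heading) <= max_angle
    return close and aligned
```

Plugging this predicate into the region query of a standard DBSCAN would keep nearby points on crossing roads in separate clusters.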

Jia Qiu, Ruisheng Wang, Xin Wang
Heterogeneous Multi-Population Cultural Algorithm with a Dynamic Dimension Decomposition Strategy

Heterogeneous Multi-Population Cultural Algorithm (HMP-CA) is one of the most recent architectures proposed to implement Multi-Population Cultural Algorithms. It incorporates a number of heterogeneous local Cultural Algorithms (CAs) communicating with each other through a shared belief space. The heterogeneous local CAs are designed to optimize different subsets of the dimensions of a given problem. In this article, two dynamic dimension decomposition techniques are proposed: a top-down and a bottom-up approach. These dynamic approaches are evaluated using a number of well-known benchmark numerical optimization functions and compared with the most effective and efficient static dimension decomposition methods. The comparison results reveal that the proposed dynamic approaches are fully effective and outperform the static approaches in terms of efficiency.

Mohammad R. Raeesi N., Ziad Kobti
Toward a Computational Model for Collective Emotion Regulation Based on Emotion Contagion Phenomenon

This paper proposes a novel computational model for the emotion regulation process which integrates the traditional appraisal approach with the dynamics of emotion contagion. The proposed model uses a fuzzy appraisal approach to analyze the influence of applying different regulation strategies as directed pro-regulation interventions to the system. Furthermore, the dynamics of changes in the population of emotional and neutral agents are modeled. The proposed model provides an effective framework to monitor and intervene as an affect regulator in catastrophic situations such as natural disasters and epidemic diseases.

Ahmad Soleimani, Ziad Kobti
Effects of Frequency-Based Inter-frame Dependencies on Automatic Speech Recognition

The hidden Markov model (HMM) is a state-of-the-art model for automatic speech recognition. However, even though it has shown good results in past experiments, it is known that the state conditional independence that arises from the HMM does not hold for speech recognition. One way to partly alleviate this problem is to concatenate each observation with its adjacent neighbors. In this article, we look at a novel way to perform this concatenation by taking into account the frequency of the features. This approach was evaluated on spoken connected digits data and the results show an absolute increase in classification of 4.63% on average for the best model.
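
For reference, the plain concatenation of each observation with its adjacent neighbors can be sketched as follows. This is the ordinary baseline, not the frequency-aware variant the article proposes. The name `stack_frames` and the choice to repeat the first and last frames at the edges are assumptions.

```python
from typing import List

Frame = List[float]


def stack_frames(frames: List[Frame], context: int = 1) -> List[Frame]:
    """Concatenate each frame with `context` neighbors on both sides."""
    if context < 0:
        raise ValueError("context must be non-negative")
    last = len(frames) - 1
    stacked: List[Frame] = []
    for t in range(len(frames)):
        window: Frame = []
        for offset in range(-context, context + 1):
            # Clamp indices so edge frames repeat the first or last frame.
            neighbor = min(max(t + offset, 0), last)
            window.extend(frames[neighbor])
        stacked.append(window)
    return stacked
```

With `context=1`, every output vector is three times as long as an input frame, which is how the concatenated observation carries some inter-frame dependency into an otherwise frame-independent model.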

Ludovic Trottier, Brahim Chaib-draa, Philippe Giguère
Social Media Corporate User Identification Using Text Classification

This paper proposes a text classification method for identifying corporate social media users. With the explosion of social media content, it is imperative to have user identification tools that distinguish personal accounts from corporate ones. In this paper, we use text data from Twitter to demonstrate an efficient corporate user identification method. This method uses text classification with simple but robust processing. Our experimental results show that our method is lightweight, efficient and accurate.

Zhishen Yang, Jacek Wołkowicz, Vlado Kešelj
Convex Cardinality Restricted Boltzmann Machine and Its Application to Pattern Recognition

The Restricted Boltzmann machine is a graphical model which has been very successful in machine learning and various applications. Recently, much attention has been devoted to sparse techniques combining a cardinality potential function and an energy function. In this paper we use a convex cardinality potential function to increase competition between hidden units so that they become sparser. Convex potential functions relax conditions on hidden units, encouraging sparsity among them. We show that combining a convex potential function with a cardinality potential produces better results in classification.

Mina Yousefi, Adam Krzyżak, Ching Y. Suen

Contributions from Graduate Student Symposium

Task Oriented Privacy (TOP) Technologies

One major shortcoming of most of the existing privacy-preserving techniques is that they do not make any assumption about the ultimate usage of the data. Therefore, they follow a ‘one-size-fits-all’ strategy which usually results in an inefficient solution and consequently leads to over-anonymization and information loss. We propose a Task Oriented Privacy (TOP) model and its corresponding software system which incorporates the ultimate usage of the data into the privacy-preserving data mining and data publishing process. Our model allows the data recipient to perform privacy-preserving data mining, including data pre-processing, using metadata. It also provides an intelligent privacy-preserving data publishing technique guided by feature selection and personalized privacy preferences.

Yasser Jafer
Graph-Based Domain-Specific Semantic Relatedness from Wikipedia

Human made ontologies and lexicons are promising resources for many text mining tasks in domain specific applications, but they do not exist for most domains. We study the suitability of Wikipedia as an alternative resource for ontologies regarding the Semantic Relatedness problem.

We focus on the biomedical domain because (1) high quality manually curated ontologies are available and (2) successful graph based methods have been proposed for semantic relatedness in this domain.

Because Wikipedia is not hierarchical and links do not convey defined semantic relationships, the same methods used on lexical resources (such as WordNet) cannot be applied here straightforwardly.

Our contributions are (1) demonstrating that Wikipedia-based methods outperform state-of-the-art ontology-based methods on most of the existing ontologies in the biomedical domain; (2) adapting and evaluating the effectiveness of a group of bibliometric methods of various degrees of sophistication on Wikipedia for the first time; and (3) proposing a new graph-based method that outperforms existing methods by considering some specific features of Wikipedia's structure.

Armin Sajadi
Semantic Management of Scholarly Literature: A Wiki-Based Approach

The abundance of available literature in online repositories poses more challenges rather than expediting the task of retrieving content pertaining to a knowledge worker’s information need. The rapid growth of the number of scientific publications has encouraged researchers from various domains to look for automatic approaches that can extract knowledge from the vast amount of available literature. Recently, a number of desktop and web-based applications have been developed to aid researchers in retrieving documents or enhancing them with semantic annotations [1,2]; yet, an integrated, collaborative environment that can encompass various activities of a researcher from assessing the writing quality of a paper to finding complementary work of a subject is not readily available. The hypothesis behind the proposed research work is that knowledge-intensive literature analysis tasks can be improved with semantic technologies. In this paper, we present the Zeeva system, as an empirical evaluation platform with integrated intelligent assistants that collaboratively work with humans on textual documents and use various techniques from the Natural Language Processing (NLP) and Semantic Web domains to manage and analyze scholarly publications.

Bahar Sateli
Backmatter
Metadata
Title
Advances in Artificial Intelligence
Edited by
Marina Sokolova
Peter van Beek
Copyright Year
2014
Publisher
Springer International Publishing
Electronic ISBN
978-3-319-06483-3
Print ISBN
978-3-319-06482-6
DOI
https://doi.org/10.1007/978-3-319-06483-3