1 Introduction
2 Related work
3 Method
3.1 Structured feature space decomposition
3.1.1 Binning values of a feature
3.1.2 Selecting additional features
3.1.3 Hyperparameters
3.2 Applications of the learned model
3.2.1 Feature selection and correlations
3.2.2 Prediction
3.2.3 Analysis and model interpretation
3.3 Comparison to state-of-the-art
4 Results
| Prediction task | Dataset | # of samples | # of features |
|---|---|---|---|
| Classification | Breast Cancer (Original) [28] | 683 | 9 |
| | Spambase | 4601 | 57 |
| | SPECTF [29] | 267 | 44 |
| | Parkinsons [30] | 195 | 22 |
| | Stack Exchange | 1,026,225 | 12 |
| | Khan | 680,551 | 17 |
| | Digg | 1,000,000 | 19 |
| | Twitter | 5,000,000 | 19 |
| | Duolingo | 767,718 | 18 |
| Regression | App Energy [31] | 19,735 | 27 |
| | Building (Sales) [32] | 372 | 103 |
| | Building (Costs) [32] | 372 | 103 |
| | Pole Telecommunication [33] | 15,000 | 48 |
| | Breast Cancer (Prognostic) [28] | 194 | 32 |
| | Boston Housing [34] | 506 | 13 |
| | Triazines [35] | 186 | 60 |
| | Parkinsons (Motor) [36] | 5875 | 16 |
| | Parkinsons (Total) [36] | 5875 | 16 |
Stack Exchange. The Q&A platform Stack Exchange enables users to ask and answer questions. Askers can also accept one of the answers as the best answer, which lets us measure answerer performance by whether or not an answer was accepted. The data we analyze is a random sample, preserving the class distribution, of all answers posted on Stack Exchange from 08/2009 until 09/2014. Each record corresponds to an answer and contains a binary outcome variable \(Y\in \{0,1\}\) (one indicates the answer was accepted; zero otherwise), along with 14 features. These include answer-based features, such as the length of the answer in words, the numbers of lines of code and hyperlinks to Web content the answer contains, the number of other answers the question already has, and the answer’s readability score, a numeric index giving the level of education needed to comprehend the answer easily. Other features include the answerer’s reputation, how long the answerer has been registered (signup duration in months) and the corresponding percentile rank (signup percentile), the number of answers they have previously written, the time since their previous answer, the number of answers written by the answerer in his or her current session, and the answer’s position within the session, i.e., whether it was the first, second, third, etc. answer the user wrote during the same session.
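The session-based features above can be derived from raw answer timestamps. The sketch below segments one user's answers into sessions and computes session position; the 30-minute inactivity threshold is an assumption for illustration, not a value taken from the paper.

```python
from datetime import datetime, timedelta

# Assumed session boundary: a gap of more than 30 minutes of inactivity
# starts a new session (the threshold is hypothetical).
SESSION_GAP = timedelta(minutes=30)

def session_features(timestamps):
    """Return (session_id, position_in_session) for each answer,
    given one user's answer timestamps in chronological order."""
    features = []
    session_id, position = 0, 0
    prev = None
    for t in timestamps:
        if prev is not None and t - prev > SESSION_GAP:
            session_id += 1   # long gap: start a new session
            position = 0
        position += 1
        features.append((session_id, position))
        prev = t
    return features

ts = [datetime(2014, 9, 1, 10, 0),
      datetime(2014, 9, 1, 10, 10),   # 10-minute gap: same session
      datetime(2014, 9, 1, 12, 0)]    # 110-minute gap: new session
print(session_features(ts))  # [(0, 1), (0, 2), (1, 1)]
```

The session count per record (session length) follows by counting answers that share a `session_id`.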
Khan Academy. The online educational platform Khan Academy enables users to learn a subject and then practice what they learned through a sequence of questions on the subject. We study performance during the practice stage by looking at whether users answered the questions correctly on their first attempt (\(Y=1\)) or not (\(Y=0\)). We analyze an anonymized sample of questions answered by adult Khan Academy users from 07/2012 to 02/2014. For each question a user answers we have 19 features; as with Stack Exchange, these include answer-based, user-based, and temporal features. The answer-based features include the amount of time it takes the user to answer the question (solve_time), the number of attempts the user made to answer the question, the time since the user’s previous answer (time_since_prev_ans), the number of questions the user answered during the current session (session_length), and the answer’s position within the session. Additional features include temporal attributes, such as the hour of the day, day of the week, month, etc. at which the question was answered; user-based features, such as the month the user signed up for Khan Academy, the number of the user’s first five questions answered correctly without hints (first_five), the time between the user’s first and last answers (signup_duration), the number of questions the user ever attempted to answer, and the total number of attempts made on all questions; and other features, such as how long the user has currently been studying.
Duolingo. The online language learning platform Duolingo is accessed through an app on a mobile device, and users are encouraged to use the app in short bursts during breaks and commutes. The data was made available as part of a previous study [2]. It contains a two-week sample (02/28/2013 through 03/12/2013) of the ongoing activity of users learning a language; all users in this data started lessons before the beginning of the data collection period. We focus on 45K users who completed at least five lessons. The median number of lessons per user was 7, although some users had as many as 639. Performance on a lesson is defined as \(Y=1\) if the user got all the words in the lesson correct, and \(Y=0\) otherwise. Features describing the user include how many lessons and sessions the user completed, how many perfect lessons the user had, the month and day of the lesson, etc.
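The label construction and user filter above amount to a simple aggregation. The sketch below is hypothetical (the record format is assumed, and `min_lessons=5` mirrors the five-lesson cutoff described in the text):

```python
from collections import defaultdict

def build_labels(records, min_lessons=5):
    """records: iterable of (user_id, lesson_id, all_words_correct) tuples,
    one per lesson attempt; the format is an assumption for illustration.
    Returns {user_id: [Y per lesson]}, keeping only users with at least
    min_lessons lessons. Y = 1 marks a perfect lesson (every word correct)."""
    per_user = defaultdict(list)
    for user_id, lesson_id, all_correct in records:
        per_user[user_id].append(1 if all_correct else 0)
    return {u: ys for u, ys in per_user.items() if len(ys) >= min_lessons}

# u1 has 7 lessons (alternating perfect/imperfect); u2 has only 1 and is dropped.
records = [("u1", i, i % 2 == 0) for i in range(7)] + [("u2", 0, True)]
print(build_labels(records))  # {'u1': [1, 0, 1, 0, 1, 0, 1]}
```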
Digg. The social news platform Digg allows users to post news stories, which their followers can like or “digg.” When a user diggs a story, that story is broadcast to his or her followers, a mechanism that allows content to diffuse through the Digg social network. A further characteristic of Digg is its front page: popular stories are promoted to the front page and thus become visible to every Digg user. We study a dataset that tracks the diffusion of 3,500 popular Digg stories from their submission by a single user to their eventual promotion to, and residence on, the Digg front page. We study information diffusion on Digg by examining whether or not (\(Y \in \{0,1\}\)) users “digg” (i.e., adopt) a story after being exposed to it by their friends, thereby sharing that information with their own followers. The features associated with adoption include user-based features, such as indegree and outdegree (the user’s numbers of followers and followees), node activity (how often the user posts), and information received (the rate at which the user receives information from all followees); dynamics-related features, such as the number of times the user was exposed to the story; story-related features, such as the story’s global popularity in the previous hour; and diurnal features, including the hour of the day and day of the week. Through this data, we can study the factors that explain the spread of information in this social system.
Twitter. On the online social network Twitter, users can post information, which is then broadcast to their followers, i.e., the other Twitter users who follow them. This dataset tracks the spread of 65,000 unique URLs through the Twitter social network during one month in 2010. As with Digg, we study social influence and information diffusion by examining whether (\(Y=1\)) or not (\(Y=0\)) a user posts a URL after being exposed to it when one of his or her friends posts it. The features associated with each exposure event are the same as those for Digg.
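The exposure-based outcome and dynamics feature shared by the Digg and Twitter datasets can be sketched as follows. The event format (numeric exposure times, an optional adoption time) is an assumption for illustration:

```python
def exposures_before_adoption(exposure_times, adoption_time=None):
    """exposure_times: times at which the user's friends posted the story/URL.
    adoption_time: when the user posted it themselves, or None if they never did.
    Returns (num_exposures, Y), where Y = 1 marks an adoption and num_exposures
    counts only exposures that occurred before the adoption."""
    if adoption_time is None:
        return len(exposure_times), 0
    return sum(1 for t in exposure_times if t < adoption_time), 1

# Adopted after the second of three exposures -> 2 prior exposures, Y = 1.
print(exposures_before_adoption([1.0, 2.5, 4.0], adoption_time=3.0))  # (2, 1)
# Never adopted -> all exposures count, Y = 0.
print(exposures_before_adoption([1.0, 2.5]))                          # (2, 0)
```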
4.1 Tuning hyperparameters
4.2 Prediction performance
4.3 Analyzing human behavior with S3D
4.3.1 Feature selection and correlations
| Stack Exchange feature | weight | Digg feature | weight |
|---|---|---|---|
| Signup Percentile | 0.148 | Activity | 0.258 |
| Signup Duration | 0.143 | Time Since Last Tweet | 0.152 |
| Reputation | 0.130 | Outdegree | 0.068 |
| # Ans. Before | 0.125 | Info Received | 0.067 |
| Time Since Prev Ans. | 0.117 | Indegree | 0.058 |
| Words | 0.114 | Meme Pop. (Current) | 0.035 |
| Readability | 0.101 | Meme Age | 0.033 |
| Code Lines | 0.066 | Neighb Indegree | 0.032 |
| Session Len. | 0.023 | Neighb Activity | 0.032 |
| URLs | 0.022 | Meme Pop. (Recent) | 0.032 |
| Ans. Position | 0.012 | Neighb Info Received | 0.031 |
| Images | 0.000 | Neighb Outdegree | 0.030 |
| | | Order | 0.030 |
| | | Inv. Exposure Rate | 0.029 |
| | | Time Last/Second to Last Exposures | 0.028 |
| | | Time Last/First Exposures | 0.026 |
| | | # Exposures | 0.024 |
| | | Hour | 0.023 |
| | | Day | 0.013 |
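Given such learned weights, inspecting the ranking is a one-liner. The sketch below uses the Stack Exchange column copied from the table; note that these weights sum to approximately one, consistent with normalized per-feature contributions, though that interpretation is an assumption here.

```python
# Stack Exchange feature weights, copied from the table above.
weights = {
    "Signup Percentile": 0.148, "Signup Duration": 0.143,
    "Reputation": 0.130, "# Ans. Before": 0.125,
    "Time Since Prev Ans.": 0.117, "Words": 0.114,
    "Readability": 0.101, "Code Lines": 0.066,
    "Session Len.": 0.023, "URLs": 0.022,
    "Ans. Position": 0.012, "Images": 0.000,
}

# Rank features by weight, highest first.
top3 = sorted(weights, key=weights.get, reverse=True)[:3]
print(top3)  # ['Signup Percentile', 'Signup Duration', 'Reputation']

# The weights sum to ~1 (1.001 after rounding), suggesting normalization.
print(round(sum(weights.values()), 3))  # 1.001
```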