1 Introduction
- Usage: What kind of usage behavior can we see in Koders?
- Search Topics: What are the users searching for?
- Query Forms: How are users expressing their information need in their queries?
2 Usage Log Data
3 Analysis of Usage Data
- Routine usage: First, we look at three variables to understand whether users search in Koders routinely and actively: the number of days users are active in Koders, the number of search activities, and the number of download activities per user.
- Analysis of sessions: Second, we analyze sessions of activities in the usage log. A session is a series of queries issued by a single user within a short interval that represents a single information need (Silverstein et al. 1999). We look at three variables in sessions: duration, activities, and page views. Duration is the length of the session in minutes, activities are either search or download activities, and page views is the count of consecutive repeated queries recorded in the log when a user navigates through multiple pages of search results for the same query.
- Analysis of queries: Third, we analyze the queries in the log to understand how users express them. We look at query length, terms shared across users' queries, the types of queries users issue, and the kinds of query operators and reformulations in the queries.
- Comparing with Web search: Finally, we compare some of our results with existing results from log analyses of Web search.
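The session grouping used in this analysis (a series of activities by a single user within a short interval) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the 5-minute inactivity cutoff and the record layout are assumptions.

```python
from datetime import datetime, timedelta

# Assumed inactivity cutoff for illustration; the actual threshold used to
# delimit sessions in the log analysis is not restated here.
SESSION_GAP = timedelta(minutes=5)

def sessionize(log):
    """Group (user_id, timestamp, activity) records, sorted by time,
    into per-user sessions separated by gaps longer than SESSION_GAP."""
    sessions = {}   # user_id -> list of sessions, each a list of records
    last_seen = {}  # user_id -> timestamp of that user's previous activity
    for user, ts, activity in log:
        if user not in sessions or ts - last_seen[user] > SESSION_GAP:
            sessions.setdefault(user, []).append([])  # open a new session
        sessions[user][-1].append((ts, activity))
        last_seen[user] = ts
    return sessions

log = [
    ("u1", datetime(2010, 1, 1, 9, 0), "search: quicksort"),
    ("u1", datetime(2010, 1, 1, 9, 2), "download: 123"),
    ("u1", datetime(2010, 1, 1, 10, 0), "search: heap"),  # long gap: new session
]
by_user = sessionize(log)
print(len(by_user["u1"]))  # 2 sessions
```

Per-session duration, activity counts, and search/download counts (the variables studied below) then fall out of each session's record list.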
3.1 Routine Usage
- Users engage in very few activities: A large percentage of the users had only a few search activities. More than 85% of users had three or fewer search activities; about 67% had only one. More than half of the users downloaded nothing after searching, and those who did download had only a few downloads. About 64% of the users who searched had no download activity, and 91% of users had three or fewer downloads.
- Most users did not use Koders again after using it for a day: About 90% of the users were active for only one day, and 98% were active for three or fewer days. Only 0.14% of users were active for more than 10 days.
#Act | Days active: # Users | % | Download activities: # Users | % | Search activities: # Users | %
---|---|---|---|---|---|---
0 | NA | NA | 1,212,666 | 64.35 | NA | NA
1 | 1,685,551 | 89.45 | 289,988 | 15.39 | 1,276,549 | 67.74
2 | 125,331 | 6.65 | 146,623 | 7.78 | 225,018 | 11.94
3 | 37,026 | 1.96 | 74,482 | 3.95 | 114,280 | 6.06
> 3 | 36,418 | 1.93 | 160,567 | 8.52 | 268,479 | 14.24
<= 3 | 1,847,908 | 98.06 | 1,723,759 | 91.48 | 1,615,847 | 85.75
3 < #Act <= 10 | 36,418 | 1.8 | 121,875 | 6.47 | 206,153 | 10.95
> 10 | 2,648 | 0.14 | 38,692 | 2.05 | 62,326 | 3.3
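The per-user counts behind Table 1 amount to grouping log records by user. As a minimal sketch under an assumed toy log of (user, activity) records:

```python
from collections import Counter

# Toy stand-in for the usage log; each record is (user_id, activity_kind).
events = [("u1", "search"), ("u1", "search"), ("u2", "search"),
          ("u2", "download"), ("u3", "search"), ("u3", "search"),
          ("u3", "search"), ("u3", "search")]

searches = Counter(u for u, a in events if a == "search")
users = {u for u, _ in events}

# Share of users with three or fewer searches (cf. the "<= 3" row of Table 1).
few = sum(1 for u in users if searches[u] <= 3) / len(users)
print(f"{few:.0%} of users issued <= 3 searches")
```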
3.2 Analysis of Sessions
- Sessions are short: About 57% of sessions had only one activity. About 84% of sessions lasted three minutes or less. Only 3.6% of sessions lasted more than 10 minutes.
- More than half of the sessions had no downloads: About 57% of sessions contained no downloads. Sessions with many downloads are very rare; less than 1% had more than ten downloads.
- There are few sessions with no search activities: Table 2 shows that about 14% of sessions had no search activities. These sessions can be described as series of isolated download activities made by a user. Sessions contain only a few search activities: about 90% of sessions have three or fewer.
Count (c) | % of sessions (c = duration) | % of sessions (c = activities) | % of sessions (c = searches) | % of sessions (c = downloads)
---|---|---|---|---
0 | 57.44 | 0.00 | 14.02 | 57.13
1 | 0.16 | 55.49 | 57.99 | 27.46
2 | 0.09 | 17.23 | 12.11 | 8.05
3 | 0.07 | 9.34 | 5.92 | 3.05
> 3 | 16.19 | 17.95 | 9.97 | 4.30
<= 3 | 83.81 | 82.05 | 90.03 | 95.70
3 < c <= 10 | 12.59 | 15.05 | 8.81 | 3.48
> 10 | 3.60 | 2.90 | 1.16 | 0.82
#Act | % of Ses. WOD (#Act = search) | % of Ses. WSD (#Act = search) | % of Ses. WOS (#Act = download)
---|---|---|---
1 | 76.95 | 48.62 | 82.23
2 | 10.90 | 20.37 | 10.24
3 | 5.01 | 10.60 | 2.88
> 3 | 7.14 | 20.40 | 4.65
<= 3 | 92.86 | 79.60 | 95.35
3 < #Act <= 10 | 6.47 | 17.72 | 3.14
> 10 | 0.67 | 2.68 | 1.51
Previous activity | # downloads | % of downloads in session |
---|---|---|
Download | 1,388,810 | 43.55 |
Search | 938,440 | 29.43 |
None | 861,768 | 27.02 |
# Screen Views (SV) | % Search (ALL) | % Search (WSD) | % Search (WOD) | % Download (F)
---|---|---|---|---
1 | 86.35 | 85.17 | 87.47 | 85.68
2 | 9.53 | 10.50 | 8.62 | 10.16
3 | 2.30 | 2.47 | 2.13 | 2.38
> 3 | 1.82 | 1.86 | 1.78 | 1.78
<= 3 | 98.18 | 98.14 | 98.22 | 98.22
3 < SV <= 10 | 1.63 | 1.72 | 1.54 | 1.65
> 10 | 0.19 | 0.14 | 0.24 | 0.13
3.3 Analysis of Queries
- Queries are very short: Table 6 shows that about 79% of the users had only one term in their queries; 97% had three or fewer terms. Only 0.05% of the users had more than ten terms in their queries. The statistics are similar across queries: more than 79% of queries had only one term, and more than 97% of queries had three or fewer terms.
- Terms in queries are quite diverse: A large percentage of the terms were unique to a single user. Table 7 shows that about 72% of the terms were used in queries by only one user, 89% of the terms were shared by at most three users, and only 3% of all the terms were common among more than ten users. The top five most common terms were (number of users in parentheses): md5 (46,433), sort (29,728), file (19,219), code (15,532), and java (15,092). Examples of (rare) terms used by only one user each are: bjc_compress, stream_update, partitioning.h, “evaluate_nbr_bits”, and ktportlet. Koders maintains a list of the most popular queries on its Web site, and the top terms listed above can be found in that list. It is possible that users look at these popular examples and try those queries themselves, thus contributing to the popularity of already popular examples. All of the unique terms appear to be names of variables and files.
# of max. terms in query (t) | # of users | % of users | # of queries | % of queries
---|---|---|---|---
1 | 1,488,364 | 78.99 | 4,147,683 | 79.64
2 | 254,823 | 13.52 | 737,565 | 14.16
3 | 85,032 | 4.51 | 205,441 | 3.94
> 3 | 56,107 | 2.98 | 117,069 | 2.24
<= 3 | 1,828,219 | 97.02 | 5,090,689 | 97.75
3 < t <= 10 | 55,211 | 2.93 | 115,652 | 2.22
> 10 | 896 | 0.05 | 1,417 | 0.03
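The query-length distribution in Table 6 can be computed directly from the query strings. In this sketch, whitespace splitting stands in for the paper's term tokenization, which is an assumption:

```python
from collections import Counter

# Toy query sample; one entry per logged query.
queries = ["md5", "quick sort", "read file java", "socket"]

# Count queries by number of terms (whitespace tokenization assumed).
lengths = Counter(len(q.split()) for q in queries)

for n in sorted(lengths):
    print(f"{n} term(s): {lengths[n] / len(queries):.0%} of queries")
```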
# of common users (u) | # of terms | % |
---|---|---|
1 | 659,401 | 72.19 |
2 | 107,881 | 11.81 |
3 | 48,029 | 5.25 |
> 3 | 98,014 | 10.73 |
<= 3 | 815,311 | 89.27 |
3 < u <= 10 | 67,001 | 7.33 |
> 10 | 31,013 | 3.40 |
Operator | Description | # Queries | % Queries |
---|---|---|---|
No operator | — | 4,853,829 | 93.20 |
“ | Use of quotes before/after a query term | 178,014 | 3.42 |
cdef: | Finding class definitions | 133,832 | 2.57 |
mdef: | Finding method definitions | 89,107 | 1.71 |
idef: | Finding interface definitions | 52,705 | 1.01 |
mcall: | Used but not defined by Koders | 11,358 | 0.22 |
+ ... | Used but not defined by Koders | 6,652 | 0.13 |
...* | * after a term, stemming operator | 5,189 | 0.1 |
− ... | − before a term, exclusion operator | 4,761 | 0.09 |
- A term is a natural term if two conditions are met: it contains only letters of the English alphabet, and it is found in a dictionary of English words. We prepared this dictionary using an exhaustive list of words from the automatically generated inflection database available on the AGID word list Web site (2010). The dictionary contained 252,379 unique English words.
- A term is a code term if it contains characters other than English letters (such as digits and symbols), or if the term is not found in the dictionary mentioned above.
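The two definitions above, together with the query types they induce (a query is Code, Natural, or Hybrid according to the types of its terms), can be sketched as follows. The tiny `DICTIONARY` here is a stand-in for the 252,379-word AGID-derived list:

```python
import re

# Stand-in for the AGID-derived English dictionary described above.
DICTIONARY = {"sort", "file", "read", "value", "code"}

def is_natural(term: str) -> bool:
    """A natural term is purely alphabetic AND found in the dictionary."""
    return bool(re.fullmatch(r"[A-Za-z]+", term)) and term.lower() in DICTIONARY

def classify_query(query: str) -> str:
    """Code / Natural / Hybrid, by the mix of term types in the query."""
    kinds = {("natural" if is_natural(t) else "code") for t in query.split()}
    if kinds == {"natural"}:
        return "Natural"
    if kinds == {"code"}:
        return "Code"
    return "Hybrid"

print(classify_query("read file"))       # Natural
print(classify_query("bjc_compress"))    # Code: underscore is not a letter
print(classify_query("sort arraylist"))  # Hybrid: 'arraylist' not in dictionary
```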
- Code queries are the most frequently used type of query. Natural queries are used less often than Code queries, and Hybrid queries are few.
- Among the three types, Code queries lead to downloads most often: about 21% of Code queries lead to a download, compared to only about 12% of Natural queries. Hybrid queries are more likely than Natural queries to be followed by a download in a session.
Query type | # Queries | % Queries | # Downloads (F) | % Q2D
---|---|---|---|---
Code (C) | 2,982,171 | 57.26 | 652,856 | 20.90
Natural (N) | 1,756,080 | 33.72 | 215,871 | 12.29
Hybrid (H) | 469,507 | 9.01 | 69,713 | 14.85
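The %Q2D measure (the fraction of queries of each type that are followed by a download later in the same session) could be computed along these lines; the session traces and query-type labels below are hypothetical:

```python
from collections import Counter

# Hypothetical sessions: each entry is ("query", type) or ("download", None),
# with query types assigned by the Code/Natural/Hybrid classification.
sessions = [
    [("query", "Code"), ("download", None)],
    [("query", "Natural")],
    [("query", "Code"), ("query", "Natural"), ("download", None)],
]

issued, followed = Counter(), Counter()
for ses in sessions:
    for i, (kind, qtype) in enumerate(ses):
        if kind != "query":
            continue
        issued[qtype] += 1
        # Q2D: is this query followed by any download in the same session?
        if any(k == "download" for k, _ in ses[i + 1:]):
            followed[qtype] += 1

for t in sorted(issued):
    print(t, f"{followed[t] / issued[t]:.0%}")
```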
QR | Description | % Q (SES) | % Q (MOD)
---|---|---|---
T | Totally changing the query | 23.35 | 76.50
A | Adding terms | 2.27 | 7.45
D | Deleting terms | 1.61 | 5.28
O | Modifying operators only | 1.31 | 4.31
M | Modifications other than those mentioned | 1.97 | 6.46
3.4 Comparison with Web search
4 Topic Modeling
4.1 Results—Latent Topics
Topic code | Description | Words
---|---|---
Audio | Working with audio and sound | Compare, control, encode, audio, decode
DataStr | Data structures | List, object, arraylist, map, vector
Network | Networking, FTP | Client, server, ftp, socket, iso
Files | Working with files | File, read, files, create, write
GUI | Swing and AWT GUI | Swing, jtable, applet, awt, window

Data structure topics | Words
---|---
AVL Tree | Tree, avl, minimum, spanning, avltree
B-Tree | Tree, b, btree, trie, suffix
Queue | Queue, priority, fifo, circular, priorityqueue
Lists | List, linked, linkedlist, sorted, lists
Heap | Max, heap, min, unix, chromaticity
Graph | Graph, vertex, dfs, edge, salvo
4.2 Topic Categories
- Three of the 50 topics contained words (such as “How”, “Source”, “Code”) that are often used in writing a verbose query. We looked at several queries belonging to these three topics and found that they start with phrases such as “How to use ..”, “Source code for ..”, etc. Based on this, we interpreted these topics as describing forms of queries in which natural language expressions were used. LDA was able to detect these topics without any preprocessing. These topics appear with the prefix “NL.” in their names in Table 19.
- Other topics related to the form of queries captured the use of the query operators “mdef:” and “cdef:”.
- We also found topics that seemed to capture the common terms used in the FQNs (fully qualified names) of Java entities. We include these topics under the category “Form Centric” with the interpretation that using FQNs is also a common technique to search for relevant source code.
- One of the topics (Topic ‘Jkw’ in Table 19) captured the use of Java language keywords to express structure in the query. For example, a query such as ‘extends iactionlistener’ uses the keyword ‘extends’, where the user is possibly trying to find interfaces that extend the interface IActionListener.
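A minimal detector for this query form might simply look for Java keywords among the query terms; the keyword set below is illustrative, not the one LDA surfaced:

```python
# A few Java keywords that plausibly appear in structure-expressing queries.
JAVA_KEYWORDS = {"extends", "implements", "throws", "new", "import"}

def uses_java_keyword(query: str) -> bool:
    """True if any query term is a Java keyword (case-insensitive)."""
    return any(t in JAVA_KEYWORDS for t in query.lower().split())

print(uses_java_keyword("extends iactionlistener"))  # True
print(uses_java_keyword("quick sort"))               # False
```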
4.3 Users and Topics
4.4 Search, Downloads and Topics
Top 10: Users | Top 10: Search | Top 10: Download | Low 10: Users | Low 10: Search | Low 10: Download
---|---|---|---|---|---
JAVA | JAVA | JAVA | BIRT | U2 | NL.HUD |
DEF | DEF | DEF | U2 | U1 | fqn.j3DSfCon |
apache | Eclipse | Eclipse | gameMob | fqn.j3DSfCon | U2 |
files | GUI | GUI | searchEng | Netbeans | U1 |
String | Network | jfreeC | U1 | jboss | NL.apps |
DataStr | Jkw | Jkw | audio | JMS | searchEng |
GUI | files | hibernate | lucene | searchEng | Netbeans |
Network | imaging | Imaging | ML | BIRT | jboss |
DateTime | hibernate | secAuth | M.GwtSecAuth | audio | calSched |
NL.HUD | sort | files | secAuth | junit | NL.SC |
4.5 Prevalence in Topic Categories
Category | # of topics | Users | Searches followed | Downloads |
---|---|---|---|---|
Applications | 4 | 51,723 | 74,828 | 20,109 |
Programming tasks | 15 | 216,989 | 315,322 | 86,035 |
Form centric | 6 | 93,695 | 142,321 | 35,011 |
Java/JDK libraries | 5 | 80,406 | 133,898 | 36,744 |
Frameworks | 15 | 205,630 | 299,281 | 89,632 |
Unknown | 5 | 65,849 | 89,455 | 26,439 |
5 Query Forms
Activities | Query | Topic assigned
---|---|---
S1 | “Date range” | Date time
S2 | How validate the number days data range | Natural language
S3 | “Date range validation” | Date time
S4 | “Date range” | Date time
S5 | Date range | Date time
→ S6 | Number days date range | Date time
D1 | Downloaded <file_id> |
5.1 Lexical Structure
Form | Examples (each query separated by a comma) |
---|---|
Acronym | emf, dao, crud |
Code | catch(sqlexception, substring(, byte[] |
Name | filewriter, tchat, javax.media.datasink |
NL | storing vector file format, example programs for datagram, read value from xml file |
Term | drag drop, string date, file reading |
5.2 Result Types
Form | Examples (each query separated by a comma) |
---|---|
Entity | hashtable, filewriter, cdef:jpegdecoder |
Feature | parse the url data, lucene search, pad |
Line | catch(sqlexception, byte[], //special values for whereclauses |
System | media player, example programs for datagram, apache torque |
Form \ Intent | Entity | Feature | Line | System | Σ | %
---|---|---|---|---|---|---
Acronym | 2 | 2 | 0 | 8 | 12 | 8.00
Code | 0 | 1 | 4 | 0 | 5 | 3.33
Name | 55 | 6 | 0 | 0 | 61 | 40.67
NL | 2 | 14 | 1 | 7 | 24 | 16.00
Term | 13 | 25 | 0 | 10 | 48 | 32.00
Σ | 72 | 48 | 5 | 25 | 150 |
% | 48.00 | 32.00 | 3.33 | 16.67 | |
5.3 Form and Relevance
- Relevant Results: A search result for a query in a search session is relevant (and possibly usable) if a download activity follows the query in the session. We associate downloads with relevance on the assumption that users download code only if they consider it usable. This is quite similar to the assumption in general-purpose search that click-through is a significant indicator of relevance (Joachims 2002; Joachims et al. 2007).
- Efficient Query: A query is efficient if it produces a download as the next immediate activity in a session.
- Effective Query: A query is effective if it produces a relevant download after the query in a session. We inspect the downloaded code to see whether the download was indeed relevant to the query in the session. An efficient query might not be effective if it results in an immediate download that is not relevant to the query.
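The efficiency test defined above is mechanical and can be sketched as follows; effectiveness additionally requires a manual relevance check of the downloaded code, so it cannot be read off the log alone. The session encoding below is hypothetical:

```python
# A session is an ordered list of ("query", text) / ("download", file_id) pairs.
def efficient_queries(session):
    """Queries whose immediately following activity is a download."""
    out = []
    for cur, nxt in zip(session, session[1:]):
        if cur[0] == "query" and nxt[0] == "download":
            out.append(cur[1])
    return out

ses = [("query", "filewriter"), ("download", "f42"),
       ("query", "read xml"), ("query", "xml parser"), ("download", "f77")]
print(efficient_queries(ses))  # ['filewriter', 'xml parser']
```

Note that "read xml" is not efficient here even though a download occurs later in the session, because another query intervenes.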
Users are mostly looking for entities defined in programs, and for features that implement some behavior. They mostly issue queries that look like the names of entities as they are defined in the code. In the absence of knowledge of the defined names, they issue short queries consisting of natural language terms. Users reach relevant results with much less effort when their queries include the names of code entities. While users do seem to look into the results they get with natural language queries, such queries mostly fail to yield relevant results compared to other forms of queries.