research-article

Guilt by association: large scale malware detection by mining file-relation graphs

Authors:
Acar Tamersoy, Georgia Institute of Technology, Atlanta, GA, USA
Kevin Roundy, Symantec Research Labs, Culver City, CA, USA
Duen Horng Chau, Georgia Institute of Technology, Atlanta, GA, USA

KDD '14: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2014
Pages 1524–1533
https://doi.org/10.1145/2623330.2623342

Published: 24 August 2014

ABSTRACT

The increasing sophistication of malicious software calls for new defensive techniques that are harder to evade, and are capable of protecting users against novel threats. We present AESOP, a scalable algorithm that identifies malicious executable files by applying Aesop's moral that "a man is known by the company he keeps." We use a large dataset voluntarily contributed by the members of Norton Community Watch, consisting of partial lists of the files that exist on their machines, to identify close relationships between files that often appear together on machines. AESOP leverages locality-sensitive hashing to measure the strength of these inter-file relationships to construct a graph, on which it performs large scale inference by propagating information from the labeled files (as benign or malicious) to the preponderance of unlabeled files. AESOP attained early labeling of 99% of benign files and 79% of malicious files, over a week before they are labeled by the state-of-the-art techniques, with a 0.9961 true positive rate at flagging malware, at 0.0001 false positive rate.
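
The idea of scoring how strongly two files are related by how often they appear on the same machines can be illustrated with a small sketch. The abstract does not spell out the hashing scheme, so the code below makes assumptions: it uses MinHash, one common locality-sensitive hashing family, to estimate the Jaccard similarity between the sets of machines on which two files appear. The file names, machine identifiers, the helper names (`seeded_hash`, `minhash_signature`, `estimate_jaccard`), and the choice of 128 hash functions are all hypothetical and are not taken from the paper.

```python
import hashlib


def seeded_hash(seed: int, item: str) -> int:
    """Deterministic 64-bit hash of `item`; each seed acts as a different hash function."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def minhash_signature(machines: set, num_hashes: int = 128) -> list:
    """For each seeded hash function, keep the minimum hash over the file's machine set."""
    if not machines:
        raise ValueError("a file must appear on at least one machine")
    return [min(seeded_hash(seed, m) for m in machines)
            for seed in range(num_hashes)]


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching signature positions approximates the Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity, used here only to sanity-check the estimate."""
    return len(a & b) / len(a | b)


# Hypothetical data: each file maps to the set of machines reporting it.
file_machines = {
    "setup.exe": {f"m{i}" for i in range(0, 60)},
    "helper.dll": {f"m{i}" for i in range(20, 80)},
    "unrelated.exe": {f"m{i}" for i in range(200, 260)},
}
signatures = {name: minhash_signature(ms) for name, ms in file_machines.items()}

# Files that share many machines get a high estimated similarity and would be
# candidates for an edge in a file-relation graph; disjoint files would not.
related = estimate_jaccard(signatures["setup.exe"], signatures["helper.dll"])
unrelated = estimate_jaccard(signatures["setup.exe"], signatures["unrelated.exe"])
```

This sketch compares signatures pairwise. A scalable pipeline would typically group signatures into bands and bucket them so that only likely-similar files are ever compared, and would then run label propagation over the resulting graph; neither step is shown here.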

Index Terms

1. Information systems
  1. Information systems applications
    1. Data mining
2. Security and privacy
  1. Systems security
    1. Operating systems security

Published in
KDD '14: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2014
2028 pages
ISBN: 9781450329569
DOI: 10.1145/2623330
General Chairs: Sofus Macskassy (Facebook), Claudia Perlich (Dstillery)
Program Chairs: Jure Leskovec (Stanford University), Wei Wang (UCLA), Rayid Ghani (University of Chicago)
Copyright © 2014 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
belief propagation
file graph
graph mining
locality sensitive hashing
malware detection
Qualifiers
- research-article
Acceptance Rates
KDD '14 Paper Acceptance Rate: 151 of 1,036 submissions, 15%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%