ABSTRACT
While the amount of data we can process and store keeps growing, our ability to find data still depends, more often than not, on our own memories. Manual metadata management is common among scientific users; it consumes their time and leaves the computing resources at hand unused. We propose a system design that gives users more powerful data-finding tools, such as unified search spaces, provenance, and ranked file system search. By returning the responsibility for file management to the file system, we enable scientists to focus on their science without having to devise a customized file organization scheme for their work.
Index Terms
- Easing the burdens of HPC file management