Keywords
DNA-storage, digital information storage in DNA, synthetic biology, glossary, communication, controlled vocabulary, short plain-language summaries, interdisciplinary collaboration
This article is included in the EMBL-EBI collection.
As we tackle increasingly complex issues throughout science, a breadth of knowledge is often necessary to devise novel solutions — something frequently achieved through interdisciplinary collaborations. The inherent diversity within interdisciplinary teams stimulates knowledge exchange, creativity or even a change in perspective; however, it can be very challenging. We work within an emerging field in synthetic biology, repurposing DNA as a storage medium for digital information. Advancing from early proof-of-principle studies in the high-throughput era1,2 (see references therein for historical perspective) towards a more reliable, refined and functional large-scale DNA storage system3,4 raises unique challenges that can only be resolved through a broad collaborative effort between biochemical and DNA sequencing specialists, computer and molecular scientists, information theorists and others. This body of research has gained considerable interest both within the research community and with the public, and this has further emphasised the need to address our communication and the presentation of our work.
Intersection between these fields is clearly beneficial. Information theory has already underpinned many advances in life sciences, from adapting Levenshtein coding to create error-correcting molecular barcodes used in multiplexed DNA sequencing5 to Burrows-Wheeler transformation of reference genomes implemented in several short read aligners6–8. A molecular biologist may see the process of storing information in DNA as a very physical process, progressing from DNA synthesis (writing) to amplification (copying) to sequencing (reading). To an information theorist, this is a noisy channel: a series of transformations through which information is transmitted and the outputs observed. Differences in the way experts in these different fields describe their data and results can hinder collaboration and restrict impact. As a result, publications have the potential to be an ineffective hybrid of accepted nomenclature and data presentation within the intersecting fields with few readers, both in the team and outside, able to fully understand the publication as a whole.
Unsurprisingly, terms shared between the intersecting disciplines can have disparate meanings. The word ‘qubit’ may lead you to believe that some DNA needs quantifying9, or that the discussion concerns quantum information or quantum field theory10. This complicates communication; misunderstandings can pass unnoticed, only becoming apparent downstream. Examples of such misunderstandings are the use of the words errors, erasures, and substitutions when retrieving data through DNA sequencing. To an information theorist, an ‘error’ refers to a falsely read symbol, for example when an A in the DNA sequence is falsely read as a C, and is distinct from an insertion or deletion. An ‘erasure’ is a read so uncertain that it is not called as an A, C, G or T; it is distinct from a ‘deletion’ in that the symbol is not simply missed, but we are made aware that a symbol is missing at this position in the DNA string. An ‘insertion’ is a symbol read where no symbol should exist. To a molecular biologist or DNA sequencing expert, all of these would be described as read ‘errors’, and errors in the information-theoretic sense would be called substitutions.
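The four event types distinguished above can be made concrete with a small sketch. This is only an illustration of the vocabulary, not a model of any sequencing platform; the function names, the `TERMS` table and the use of `N` as the placeholder for an uncalled base are assumptions introduced here.

```python
# Illustrative sketch of the read events described in the text.
# All names here are hypothetical; "N" as the erasure symbol is an assumption.

ERASURE_SYMBOL = "N"

def substitute(seq: str, pos: int, base: str) -> str:
    """A symbol is falsely read as another (an 'error' to an information theorist)."""
    return seq[:pos] + base + seq[pos + 1:]

def erase(seq: str, pos: int) -> str:
    """The symbol is too uncertain to call, but its position is still known."""
    return seq[:pos] + ERASURE_SYMBOL + seq[pos + 1:]

def delete(seq: str, pos: int) -> str:
    """The symbol is missed entirely, so the read becomes one symbol shorter."""
    return seq[:pos] + seq[pos + 1:]

def insert(seq: str, pos: int, base: str) -> str:
    """A symbol is read where none should exist, so the read becomes longer."""
    return seq[:pos] + base + seq[pos:]

# Event -> (information-theory term, term a sequencing specialist might use)
TERMS = {
    "substitute": ("error", "substitution"),
    "erase": ("erasure", "read error"),
    "delete": ("deletion", "read error"),
    "insert": ("insertion", "read error"),
}

if __name__ == "__main__":
    written = "ACGT"
    print(substitute(written, 0, "C"))
    print(erase(written, 0))
    print(delete(written, 0))
    print(insert(written, 1, "G"))
```

Note that a substitution and an erasure both preserve the length of the read, whereas a deletion and an insertion shift every subsequent position, which is one reason the distinction matters when designing a decoder.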
DNA-storage has become a popular research field, with a number of interdisciplinary teams forming and collaborating in an attempt to make viable information storage systems that capitalise on DNA’s numerous advantages11. To alleviate confusion and improve daily communication within and between these groups we propose, and have begun to implement, two measures: a glossary and a controlled vocabulary.
We have created a glossary defining basic terms from molecular biology, information theory, computer science and related fields that are relevant to DNA-storage, for those unfamiliar with one or more of these disciplines. This proved to be a useful aid in early discussions within our team and helped to identify areas of nomenclature ambiguity which, if not addressed, might have complicated communication downstream. We have already experienced the advantages of sharing it within our team and with collaborators to facilitate the exchange of ideas.
Our glossary is held on a cloud storage system, and can be found at https://goo.gl/x6B73Q or https://rebrand.ly/dna-storage-glossary. To allow an open and inclusive discussion of how we might improve communication within this emerging community, we encourage others to critique and contribute to the glossary. The document permits “Suggestions” (proposed edits) and “Comments” to be added, and we will review these regularly and update the document as a resource for our research community.
Leading on from this, we are developing an evolving controlled vocabulary allowing team members to communicate precisely. This has been particularly beneficial during technical discussions — for instance, to us data packet refers to part of a DNA sequence that decodes to digital information, and excludes parts that are designed to facilitate DNA sequencing or indexing.
Use of a controlled vocabulary is something that the community may wish to agree upon. For example, one question we pose is — what should we name these DNA sequences that encode digital information? Following the practice of genome scientists, we initially called collections of such DNA sequences libraries. However, working with such samples caused confusion with our colleagues in a molecular biology laboratory: in a Next Generation Sequencing context, the term library is commonly used to describe DNA fragments that have been prepared for DNA sequencing. We now propose to refer to DNA sequences that store digital information as inDNA (for ‘information-carrying DNA’). To refer to inDNA prepared for DNA sequencing, we can now unambiguously talk about a library of inDNA.
We would like to invite others to contribute to the development of a controlled vocabulary so that we might be able to communicate more precisely. We have included a few entries within our glossary document.
We now pose another question — how might we improve data description and presentation to increase accessibility and facilitate peer review and reproducibility? Peer review is crucial within the scientific community, but this quality improvement process may not be fully realised in interdisciplinary publications. We have experienced difficulties with peer review of publications related to DNA-storage applications, as authors of work under review, as reviewers ourselves, in our assessment of others’ reviews, and in dealings with journal editors. Often the expertise is not available, or reviewers may only evaluate limited aspects of the paper. The body of work may not be effectively reviewed as a whole, leaving authors without vital feedback and potentially leading to publication of flawed work.
The concept of standardising presentation of data and methods is not a novel idea in the life sciences, with ‘minimum information’ standards ensuring that publications contain the information necessary to interpret the experimental data. These are typically technique- or study-specific, e.g. MIAME (microarray experiments)12, MIQE (quantitative polymerase chain reaction)13 and MIFlowCyt (flow cytometry)14. Such an approach may not be appropriate to publications relating to DNA-storage applications for some time, as these typically encompass a number of disciplines, each with its own established data description standards and many of which use rapidly changing technologies. It is not appropriate or practical to standardise such a diverse range of technologies and disciplines. Rather we should respect the accepted discipline norms, blending these together to permit DNA-storage standards to evolve.
Even publications that sit predominantly within a single discipline may be of interest to those unfamiliar with that discipline and benefit from the inclusion of a whole-paper plain-language summary. As is standard with plain-language summaries, this should simply report the basic rationale, methodology and main findings. Box 1 is a whole-publication plain-language summary of reference 2 that we have written as an example.
With the amount of digital information that needs to be stored growing exponentially there is a need to develop new ways of storing information. High information capacity, longevity and constant improvements in technologies that allow writing, copying and reading make DNA an attractive medium for storing digital information. Here we present a scalable reliable method for storing digital information in DNA.
The original bytes of several computer files in various formats were encoded into DNA as follows. A Huffman code was used to compress each byte, depending upon occurrence frequency, into a block of 5–6 trits, which are the characters 0, 1 or 2 (just as bits are 0 or 1). A reference table of these blocks and corresponding nucleotide sequences was created, with each block having four possible nucleotide combination representations. Nucleotide combinations were selected depending also upon the previous block, in a manner that prevented the occurrence of any repeating nucleotides (e.g. AA), as these are known to cause downstream copying and reading problems. Following encoding the digital information was represented as 153,335 DNA sequences of length 117 nucleotides, each containing an index and a simple error checkpoint in addition to encoding part of the original digital information. These DNA sequences were printed as a pool of DNA, containing ~1.2 × 10⁷ copies of each sequence, which was copied via PCR and prepared for reading via DNA sequencing before being decoded (encoding strategy reversed).
Data totalling 739 kilobytes was successfully encoded into DNA, printed, copied, read and decoded with 100% accuracy. A storage density of ~2.2 PB g⁻¹ DNA was achieved.
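For readers who find code easier to follow than prose, the idea of mapping trits to nucleotides without ever repeating a base can be sketched as below. This is a simplified illustration under stated assumptions, not the encoding used in the summarised publication: a fixed-width six-trit representation stands in for the Huffman code, each trit simply selects one of the three bases that differ from the previous base (rather than using the reference table of blocks), the virtual starting base `A` is arbitrary, and indexing and error checking are omitted. All function names are hypothetical.

```python
# Simplified sketch: bytes -> trits -> DNA with no repeated adjacent bases.
# Assumptions (not from the summarised paper): fixed 6 trits per byte instead
# of a Huffman code, and an arbitrary virtual starting base "A".

BASES = "ACGT"
TRITS_PER_BYTE = 6  # 3**6 = 729 values, enough to cover the 256 byte values

def byte_to_trits(value: int) -> list[int]:
    """Write one byte as six base-3 digits, most significant first."""
    trits = []
    for _ in range(TRITS_PER_BYTE):
        value, trit = divmod(value, 3)
        trits.append(trit)
    return trits[::-1]

def trits_to_byte(trits: list[int]) -> int:
    """Inverse of byte_to_trits."""
    value = 0
    for trit in trits:
        value = value * 3 + trit
    return value

def encode(data: bytes, start: str = "A") -> str:
    """Each trit picks one of the three bases that differ from the previous base."""
    previous = start
    sequence = []
    for byte in data:
        for trit in byte_to_trits(byte):
            choices = [base for base in BASES if base != previous]
            previous = choices[trit]
            sequence.append(previous)
    return "".join(sequence)

def decode(dna: str, start: str = "A") -> bytes:
    """Reverse the encoding by recovering which of the three choices was taken."""
    previous = start
    trits = []
    for base in dna:
        choices = [b for b in BASES if b != previous]
        trits.append(choices.index(base))
        previous = base
    return bytes(
        trits_to_byte(trits[i:i + TRITS_PER_BYTE])
        for i in range(0, len(trits), TRITS_PER_BYTE)
    )

if __name__ == "__main__":
    dna = encode(b"DNA")
    print(dna, decode(dna))
```

Because every base is chosen from the three that differ from its predecessor, runs such as AA cannot occur by construction, which is the property the summary highlights.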
It may also be useful to provide a plain-language summary of a specific technical aspect of a publication. For example, a molecular scientist may not understand the details of a complex mathematical algorithm (and nor should the description be altered specifically to allow them to), but an appreciation of how the output impacts aspects of the project relevant to them may be sufficient. We illustrate this using a paragraph from reference 4 (p. 5, Methods, Address Design and Encoding). This was read and discussed by the first two co-authors of the present paper, EEH and JS. Figure 1 highlights terms that either EEH, a molecular biologist (purple shading), or JS, an information theorist (yellow shading), found difficult to understand. By joining forces and explaining all terms to each other, they were able to understand the paragraph in depth.
As the interdisciplinary field of DNA-storage evolves towards maturity, there will be an increasing requirement for researchers from different backgrounds to understand publications without having access to colleagues from unfamiliar subject areas. This can be achieved in part by including brief summaries, which may make use of our glossary document, in specialised sections of a publication such that they become accessible to researchers from all disciplines.
We promote the value of interdisciplinary, collaborative science to solve complex problems, including in our field of digital information storage in DNA which combines molecular biology, information theory and computer science. We note the problems that this approach can generate in communication within and between research teams, and propose to reduce these in the DNA-storage area by initiating a glossary and controlled vocabulary. These have been made available to the research community for reference and critique, and we invite contributions to extend their scope.
EEH and JS are supported by the UK’s Biotechnology and Biological Sciences Research Council (BBSRC grants BB/L023741/1 and BB/L021994/1). NG is supported by the European Molecular Biology Laboratory.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
We would like to thank all participants at the IARPA meeting in Washington D.C. on 27–28 April 2016 (https://www.src.org/calendar/e006043/) for an interdisciplinary discussion, during which the need for a unified vocabulary to foster understanding within this new field was in evidence. We thank in particular Luis Ceze who chaired this discussion. This provided additional motivation for continuing and extending the glossary we had already put together, as reported during the meeting, and for writing this paper.
Is the work clearly and accurately presented and does it cite the current literature?
Yes
Is the study design appropriate and is the work technically sound?
Yes
Are sufficient details of methods and analysis provided to allow replication by others?
Yes
If applicable, is the statistical analysis and its interpretation appropriate?
Not applicable
Are all the source data underlying the results available to ensure full reproducibility?
No source data required
Are the conclusions drawn adequately supported by the results?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Nucleic acid synthesis and measurement technologies, technology development and business strategy.
Facilitating productive exchanges is an important requirement for any interdisciplinary endeavour.
The approach to limit a vocabulary to allow specific types of communication across epistemic and cultural boundaries would benefit from a quick visit to some older literature recommending, describing and theorising similar processes.
The controlled vocabulary proposed in this paper strongly resembles discussions in social theory and social studies of science. For instance, in 1997, Galison proposed the notion of *Trading zones* and described how they work [1]. They are areas where technical or scientific practices can become collective by allowing practitioners to use so-called pidgins. A pidgin is a simplified language, one that can be used by a diverse array of practitioners and which does not require full assimilation into a knowledge culture.
Trading zones host objects or elements that matter to many (disciplines). These elements may not be seen, described, conceptualised or understood in the same way. They can be described as boundary objects [2] occupying unique spaces on the boundary between disciplines allowing some form of communication to exist through them.
[1] Galison, Peter (1997) Image and Logic: A Material Culture of Microphysics. Chicago: University of Chicago Press.
[2] Star, Susan Leigh and James R. Griesemer (1989) “Institutional Ecology, ‘Translations’ and Boundary Objects: Amateurs and Professionals in Berkeley’s Museum of Vertebrate Zoology, 1907–39.” Social Studies of Science 19: 387–420.