Elsevier

Future Generation Computer Systems

Volume 98, September 2019, Pages 233-237

Automated detection of cancerous genomic sequences using genomic signal processing and machine learning

https://doi.org/10.1016/j.future.2018.12.041

Highlights

  • Automatic identification of cancer gene mutations is proposed.

  • A DWT-based genomic signal processing technique is presented.

  • 100 percent classification accuracy was obtained.

Abstract

Missense mutations are the primary cause of cancer, and identifying mutations in gene sequences is a preliminary step in cancer diagnosis. To identify a mutation, we need to differentiate between cancerous and non-cancerous gene sequences. Sequence comparison can identify a mutation only if an existing variant repeats; when no homologous variants are present, sequence identification methods have difficulty distinguishing cancerous from non-cancerous sequences. Here we use DWT-based Genomic Signal Processing techniques to identify a pattern in the characteristics of the sequences, which can then be used with a machine learning algorithm to differentiate between cancerous and non-cancerous sequences. The cancerous and non-cancerous gene sequences for lung, breast and ovarian cancer were obtained from NCBI. After numerical mapping of the sequences, a four-level DWT is applied using the Haar wavelet, and statistical features (mean, median, standard deviation, interquartile range, skewness and kurtosis) are obtained from the wavelet domain. When applied to machine learning algorithms, these statistical values gave 100% accuracy in classifying cancerous and non-cancerous sequences with a Support Vector Machine.

Introduction

Genomic Signal Processing (GSP) covers the analysis and processing of genomic signals, which are measurable events originating from the genomic sequence, to obtain biological knowledge and then translate that knowledge into systems-based applications for diagnosing and treating genetic diseases [1]. GSP arises from Digital Signal Processing (DSP), a branch of Electronics Engineering, and uses mathematical transform techniques such as the Fast Fourier Transform (FFT) and the Discrete Wavelet Transform (DWT) to analyze genomic signals.

A Discrete Wavelet Transform (DWT) is a wavelet transform in which the wavelets are discretely sampled. It is mainly used for signal coding, i.e. to represent a discrete signal. The DWT of a signal x is calculated by passing it through a series of filters: the samples are passed through a low-pass filter, which yields the approximation coefficients, and simultaneously through a high-pass filter, which yields the detail coefficients. Each filter output is then downsampled by two. The decomposition halves the time resolution, since only half the number of samples now characterizes each output. Each output also occupies half the frequency band of its input, so the frequency resolution is doubled.
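The filter-bank description above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `haar_step` and `haar_dwt` are hypothetical, padding odd-length inputs by repeating the last sample is an assumption, and the default of four levels simply mirrors the four-level Haar decomposition mentioned in the abstract.

```python
import math

def haar_step(signal):
    """One Haar analysis level: return (approximation, detail), each half as long."""
    signal = list(signal)
    if len(signal) % 2:
        # Assumption: pad odd-length inputs by repeating the last sample.
        signal.append(signal[-1])
    s = 1 / math.sqrt(2)
    # Low-pass branch: scaled pairwise sums, kept at every second position.
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    # High-pass branch: scaled pairwise differences, kept at every second position.
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_dwt(signal, levels=4):
    """Multi-level decomposition, returned as [cA_n, cD_n, ..., cD_1]."""
    coeffs = []
    approx = list(signal)
    for _ in range(levels):
        # Only the approximation branch is decomposed further at each level.
        approx, detail = haar_step(approx)
        coeffs.insert(0, detail)
    coeffs.insert(0, approx)
    return coeffs

# Made-up 16-sample signal, for illustration only.
x = [1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1]
coeffs = haar_dwt(x, levels=4)
print([len(c) for c in coeffs])  # [1, 1, 2, 4, 8]
```

Each level halves the number of samples in the approximation branch, which is the loss of time resolution described above. Because the Haar filters are orthonormal, the sum of squared coefficients equals the sum of squared input samples.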

Bioinformatics is an interdisciplinary field in which techniques from computing, statistics and biology are combined to solve biological problems. Here we use a Genomic Signal Processing technique, the Discrete Wavelet Transform (DWT), to identify the difference between cancerous and non-cancerous gene sequences using a sequence-based approach.

Numerous methods for gene prediction are available, including the work of M. Stanke et al. on gene prediction using a Hidden Markov Model [2] and of G. Dodin et al. on gene pattern prediction using Digital Signal Processing methods [3]. Many computational methods have been developed to find cancer-causing gene sequences with sequence-based approaches, including the work of Barman et al. on prediction of cancer cells using Digital Signal Processing [4]. The use of Genomic Signal Processing (GSP) in bioinformatics was pioneered by P.P. Vaidyanathan et al. in their work on signal-processing concepts in genomics and proteomics [5]. A. Marhon et al. applied GSP techniques and compared them with a traditional machine learning technique, the Hidden Markov Model, and reported that DSP-based methods have high accuracy in gene finding compared with other methods [6].

Materials and methods

We attempt to automate the identification of genes associated with cancer using genomic signal processing, and to differentiate between different types of cancer.

Here we process gene sequences for lung, breast and ovarian cancer. The cancerous and non-cancerous gene sequences are obtained for these cancer types, converted into indicator sequences (a complex representation of the sequence that GSP techniques can recognize) and processed using GSP techniques like DWT and
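As an illustration of what a numerical mapping into indicator sequences can look like, the sketch below uses binary indicator sequences, one per nucleotide. This is an assumption: the snippet does not specify which mapping the authors used, the helper name `indicator_sequences` is hypothetical, and the example string is made up rather than taken from NCBI.

```python
def indicator_sequences(dna):
    """Map a DNA string to four binary indicator sequences, one per nucleotide."""
    dna = dna.upper()
    # Position i of the sequence for a base is 1 where that base occurs, else 0.
    return {base: [1 if ch == base else 0 for ch in dna] for base in "ACGT"}

seq = "ATGCCGTA"  # made-up example sequence
ind = indicator_sequences(seq)
print(ind["A"])  # [1, 0, 0, 0, 0, 0, 0, 1]
```

Each of the four resulting sequences is an ordinary discrete signal, so it can be passed to transforms such as the DWT.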

Results and discussions

The sequences for lung cancer, breast cancer and ovarian cancer are classified as cancerous or non-cancerous gene sequences using the statistical parameters obtained by applying the Discrete Wavelet Transform (DWT). Table 1 shows the statistical values extracted from cancerous and non-cancerous sequences for lung cancer.
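The six statistical parameters named in the abstract (mean, median, standard deviation, interquartile range, skewness and kurtosis) could be computed from a list of wavelet coefficients roughly as follows. This is a sketch under stated assumptions, not the authors' procedure: the function name `wavelet_features` is hypothetical, the population standard deviation and moment-based (non-excess) kurtosis are assumed conventions, and the quartiles use the default method of Python's `statistics.quantiles`.

```python
import statistics

def wavelet_features(coeffs):
    """Return six summary statistics for one list of coefficients (needs >= 2 values)."""
    n = len(coeffs)
    mean = statistics.fmean(coeffs)
    std = statistics.pstdev(coeffs)  # assumption: population standard deviation
    q1, _, q3 = statistics.quantiles(coeffs, n=4)
    # Central moments for skewness and kurtosis.
    m3 = sum((c - mean) ** 3 for c in coeffs) / n
    m4 = sum((c - mean) ** 4 for c in coeffs) / n
    return {
        "mean": mean,
        "median": statistics.median(coeffs),
        "std": std,
        "iqr": q3 - q1,
        # Convention chosen here: report 0.0 when the spread is zero.
        "skewness": m3 / std ** 3 if std else 0.0,
        "kurtosis": m4 / std ** 4 if std else 0.0,
    }

# Made-up coefficients, for illustration only.
features = wavelet_features([1, 2, 3, 4, 5])
print(features)
```

A feature vector of this kind, computed per sequence, is the sort of input a classifier such as a Support Vector Machine would take.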

Table 2 and Table 3 show the statistical values extracted from cancerous and non-cancerous sequences for breast cancer and ovarian cancer, respectively. The statistical

Conclusion

Genomic Signal Processing methods efficiently detect the difference between cancerous and non-cancerous gene sequences for lung, breast and ovarian cancer. Classification yielded a model with good accuracy, but an optimal model can be obtained only when the above procedure is applied to all types of cancer.

References (13)

  • G. Dodin, Fourier and wavelet transform analysis, a tool for visualizing regular patterns in DNA sequences, J. Theoret. Biol. (2000)
  • A. Tiwari, Genomic signal processing (GSP), Bioinform. Trends (2006)
  • M. Stanke, Gene prediction with a hidden Markov model and a new intron submodel, Bioinformatics (2003)
  • S. Barman, Prediction of cancer cell using digital signal processing, IJE (ISSN:...
  • P.P. Vaidyanathan, The role of signal-processing concepts in genomics and proteomics,...
  • A. Marhon, A brief comparison of DSP and HMM methods for gene finding,...

Liu Dongwei received his M.S. degree in Material from Shanghai Institute of Technology, China. His research interest is mainly in the area of Polyurethane elastomer.

Jia Runping received her Ph.D. degree in Material from TONGJI University in Shanghai, China. She is currently a professor in Shanghai Institute of Technology. Her research interest is mainly in the area of Polyurethane. She has published several research papers in scholarly journals in the above research areas and has participated in several conferences.

Wang Caifeng received her M.S. degree in Material from Shanghai Institute of Technology, China. Her research interest is mainly in the area of Polyurethane elastomer.

Arunkumar N completed his B.E., M.E. and Ph.D. in Electronics and Communication Engineering with specialization in Biomedical Engineering. He has more than 10 years of academic teaching and research experience at SASTRA University, India. He is appreciated for his innovative, research-oriented teaching, which relates practical life experiences to the principles of engineering. He is active in research and has been giving direction to active researchers across the globe. He has published more than 60 papers in peer-reviewed academic journals with high impact factors. His main areas include machine learning, artificial intelligence and IoT. He is on the editorial board of a few journals in his area of research.

K. Narasimhan received the M.Sc. degree with Electronics Specialization from Bharathidasan University, M.Tech. in Non destructive Testing from Regional Engineering College, Trichy and the Ph.D. degree from SASTRA University in the field of medical image processing. His research interests include Digital Image Processing, Medical Image analysis, Pattern Recognition, Digital Signal processing. He has published more than 50 papers in reputed international journals and conferences. He is currently working as Senior Assistant Professor in the Department of ECE, School of EEE, SASTRA Deemed University, Thanjavur. He is a Life Member of the Indian Society of Systems for Science and Engineering (ISSE).

M. Udayakumar graduated with an M.Tech. degree from the Department of Bioinformatics, SASTRA Deemed University, Thanjavur, India. He is an Assistant Professor III in the School of Chemical & Biotechnology, SASTRA Deemed University. His research work is mainly on designing tools, webserver application and database development for bioinformatics applications. His ongoing research is on structural analysis and crystallography studies on small molecules. He is a Life Member of the Indian Society of Systems for Science and Engineering (ISSE).

V. Elamaran received the B.E. degree in Electronics and Communication Engineering from Madurai Kamaraj University, and the M.E. degree in Systems Engineering and Operations Research from Anna University, India. He is currently pursuing a Ph.D. in the area of Low Power VLSI Design at SASTRA Deemed University, Thanjavur, India. His main research interests are signal, image and video processing, digital VLSI circuit design, design for testability, and FPGA-based systems. He has published more than 80 research papers in reputed international journals and conferences. He is currently working as Assistant Professor in the Department of ECE, School of EEE, SASTRA Deemed University, Thanjavur. He is a Life Member of the Indian Society for Technical Education (ISTE) and the Indian Society of Systems for Science and Engineering (ISSE).