Big data facilitates artificial intelligence (AI), and AI necessitates big data. This is true for any field of human endeavor – cancer care being no exception.
Most often, and for valid reasons, our focus is on the ‘magical’ capability of AI, while big data is taken for granted.
So, here I intend to shed some light on big data by offering a preliminary answer to the following question: where does the big data required for AI-powered cancer diagnosis and drug discovery come from?
Data type required in cancer care
Basic data types utilized in cancer diagnosis include molecular omics data, perturbation phenotypic data, molecular interaction data, and imaging data (Nature Reviews Cancer, 2022, 22:625). Here, I intend to focus on omics data – particularly that concerned with genomics and transcriptomics.
Omics data relates to genes and gene expression. In the cell of a living organism, DNA is the genetic material. DNA is made from four different kinds of deoxyribonucleotides. A nucleotide consists of three components: a pentose sugar (deoxyribose), a nitrogenous base and a phosphate group. The base can be a pyrimidine – cytosine (C) or thymine (T) – or a purine – adenine (A) or guanine (G).
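To make that alphabet concrete, here is a minimal Python sketch, purely illustrative and not drawn from any cited dataset, that represents a DNA fragment as a string over the four bases and computes its composition (the helper names and the toy fragment are my own).

```python
# A minimal, purely illustrative sketch: a DNA sequence represented as a
# string over the four-base alphabet A, C, G, T, with its composition counted.
from collections import Counter

def base_composition(seq: str) -> dict:
    """Count occurrences of each of the four bases in a DNA sequence."""
    counts = Counter(seq.upper())
    return {base: counts.get(base, 0) for base in "ACGT"}

def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C, a common summary statistic."""
    comp = base_composition(seq)
    total = sum(comp.values())
    return (comp["G"] + comp["C"]) / total if total else 0.0

fragment = "ATGGCGTACGTTAGC"   # toy fragment, not a real gene
print(base_composition(fragment))          # {'A': 3, 'C': 3, 'G': 5, 'T': 4}
print(f"GC content: {gc_content(fragment):.2f}")   # 0.53
```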
Certain segments of the DNA code for functional proteins, while several others are responsible for the regulation of gene expression.
Some DNA mutations promote cancer
All forms of cancer are correlated with DNA mutations (changes in the DNA sequence). Mutations in specific segments of the DNA give tumors a selective growth advantage and promote cancer. Such mutations can be in the coding sequence, leading to the production of a protein with altered function. Alternatively, or additionally, changes in gene expression can result from mutations in regulatory sequences.
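To illustrate the first case, here is a minimal Python sketch of a missense mutation. The codon table is deliberately tiny and the example generic, not a specific cancer mutation.

```python
# A minimal sketch of a missense mutation: one base change in a codon
# substitutes one amino acid for another. Only three codons are listed here;
# a real translation table covers all 64.
CODON_TABLE = {
    "GGT": "Gly",   # glycine
    "GAT": "Asp",   # aspartate
    "GTT": "Val",   # valine
}

def apply_point_mutation(codon: str, position: int, new_base: str) -> str:
    """Replace the base at a 0-based position with a new base."""
    bases = list(codon)
    bases[position] = new_base
    return "".join(bases)

wild_type = "GGT"                                  # encodes glycine
mutant = apply_point_mutation(wild_type, 1, "A")   # GGT -> GAT
print(f"{wild_type} ({CODON_TABLE[wild_type]}) -> {mutant} ({CODON_TABLE[mutant]})")
# GGT (Gly) -> GAT (Asp): a single base substitution alters the protein.
```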
Evidently, correlation of mutations with cancer requires DNA sequence information. In 1977, two sequencing techniques were introduced – Sanger’s chain termination and Maxam and Gilbert’s chemical cleavage.
Early Sanger sequencing projects were limited to single genes or small segments of the genomic DNA. Therefore, the clinical application of genetics to cancer research and diagnosis relied on targeted sequencing approaches to identify specific relevant biomarkers. For example, it was found that inherited mutations in the BRCA1 gene are associated with a high risk of breast and ovarian cancers in certain families (The New England Journal of Medicine, 1996, 334:137).
Genetics, which studies individual genes and their roles in inheritance, had its limitations. Studying individual genes, one or only a few at a time, gives only partial information on most of the metabolic processes and interaction networks within and across the cells of an organism. So, it became both necessary and important to focus on the entire genome, which is the complete set of DNA (including all genes) of an organism. The science of genomics emerged towards the end of the last century.
Emergence of genomics
In a true sense, genome sequencing had begun in 1977, when the 5,380-nucleotide-long, single-stranded genome of bacteriophage phiX174 was completely sequenced by a so-called ‘plus and minus’ method. With technological breakthroughs, Sanger sequencing was transformed from a manual to a semi-automated to a fully automated procedure, and the term ‘high throughput’ was attributed to it. Using the automated Sanger sequencing procedure, the first genome of a free-living organism, the bacterium Haemophilus influenzae (~1.8 × 10^6 bp), was completely sequenced.
The greatest achievement in genome sequencing was yet to come. The Human Genome Project (HGP), launched in October 1990, was completed in April 2003, and the first sequence of the entire human genome was published. Although a remarkable feat beyond any doubt, the HGP nonetheless demanded formidable time and cost.
Encouragingly, since the completion of the HGP, amazing progress has been made in genome sequencing technologies. Between 2004 and 2006, second-generation sequencing technologies, more popularly known as “next-generation sequencing (NGS)”, were introduced. Based on nanotechnology principles and innovations that facilitated massively parallel sequencing of DNA molecules, NGS led to a dramatic increase in sequence data generation and accumulation.
Next generation sequencing increases speed and reduces cost
Second-generation sequencing technologies, provided by the Illumina and Ion Torrent platforms, could read only short DNA fragments. Evidently, these technologies had to deal with the challenge of reassembling long DNA stretches. Third-generation sequencing technologies, provided by Pacific Biosciences and Oxford Nanopore Technologies since 2011, can achieve read lengths of about 10 kilobases.
NGS has led to an unimaginable increase in sequencing throughput. Whereas sequencing the first human genome with traditional methods took more than 10 years, next-generation technologies soon reduced the time to about two months at roughly one-hundredth of the cost. Now, sequencing a full diploid human genome is a matter of just a few days.
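A back-of-the-envelope calculation makes this throughput tangible. The figures below are assumed round numbers for illustration, not the specifications of any particular instrument.

```python
# Rough coverage arithmetic: total bases read per run divided by genome size.
GENOME_SIZE_BP = 3.2e9     # approximate haploid human genome size
READ_LENGTH_BP = 150       # a typical short-read length
READS_PER_RUN = 4e9        # assumed number of reads per run (illustrative)

total_bases = READ_LENGTH_BP * READS_PER_RUN
coverage = total_bases / GENOME_SIZE_BP
print(f"Total bases per run: {total_bases:.1e}")
print(f"Average coverage depth: ~{coverage:.0f}x")
# With these assumed numbers the depth is in the low hundreds, well above
# the ~30x commonly targeted for whole genome sequencing.
```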
Sequencing data explodes
As sequencing data began to pour out at an exponentially increasing pace, databases were established to archive them and make them available for research and applications. For more than 30 years, the International Nucleotide Sequence Database Collaboration (INSDC; http://www.insdc.org/) has been the principal infrastructure for collecting and providing nucleotide sequence data. INSDC comprises three partner organizations: (1) the DNA Data Bank of Japan (DDBJ; http://www.ddbj.nig.ac.jp/) at the National Institute of Genetics in Mishima, Japan, (2) the European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena/) at the European Molecular Biology Laboratory’s European Bioinformatics Institute (EMBL-EBI), and (3) GenBank (https://www.ncbi.nlm.nih.gov/genbank/) at the National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health, in Bethesda, Maryland, USA.
The most well known of the three is GenBank. Since its inception in 1982, it has grown at an exponential rate, doubling in size roughly every 18 months. According to the latest update, GenBank, a comprehensive public database, contains 25 trillion base pairs from over 3.7 billion nucleotide sequences representing 557,000 formally described species.
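As a practical illustration, GenBank records can be retrieved programmatically through NCBI’s Entrez utilities. The sketch below assumes Biopython is installed and network access is available; the accession used here is one commonly associated with a BRCA1 mRNA RefSeq record, but any accession of interest can be substituted.

```python
# A minimal sketch of fetching one nucleotide record from GenBank via
# Biopython's Entrez interface (requires Biopython and network access).
from Bio import Entrez, SeqIO

Entrez.email = "your.name@example.org"   # NCBI asks users to identify themselves

def fetch_genbank_record(accession: str):
    """Download a single GenBank record and parse it into a SeqRecord."""
    handle = Entrez.efetch(db="nucleotide", id=accession,
                           rettype="gb", retmode="text")
    record = SeqIO.read(handle, "genbank")
    handle.close()
    return record

record = fetch_genbank_record("NM_007294")   # a BRCA1 mRNA RefSeq accession
print(record.id, record.description)
print(f"Sequence length: {len(record.seq)} nucleotides")
```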
Genomics embraces cancer care
Major efforts in using genome sequencing to investigate a variety of adult and pediatric cancers began in 2005. In 2008, only six months after publication of the first human genome sequence by next-generation technologies, the first application of whole genome sequencing (WGS) to a cancer sample was reported.
Over the past decade and a half, NGS has become ever more effective, in both speed and precision, at identifying not only mutations in DNA but also changes in gene expression and post-translational modifications. The cancer hallmarks, reflected in the rapidly accumulating genomic, transcriptomic, proteomic and epigenomic data, have significantly facilitated identification and understanding of the disease. It is hard to overlook that big data has entered cancer care in a big way.
Big data sources
A remarkable example of big data is that provided by The Cancer Genome Atlas (TCGA; https://www.cancer.gov/ccg/research/genome-sequencing/tcga). It is a landmark cancer genomics program undertaken jointly by the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI), both located on the National Institutes of Health (NIH) campus in Bethesda, MD, USA.
The collaborative effort, which began in 2006, has molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types. Since then, TCGA has generated over 2.5 petabytes (peta denotes 10^15) of genomic, epigenomic, transcriptomic, and proteomic data.
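TCGA data and metadata are distributed through the NCI Genomic Data Commons (GDC), which exposes a public REST API. The sketch below queries the projects endpoint for TCGA projects; the field names follow the public GDC documentation as I understand it and should be verified against the current API reference.

```python
# A minimal sketch of listing a few TCGA projects via the GDC REST API.
import json
import requests

GDC_PROJECTS_ENDPOINT = "https://api.gdc.cancer.gov/projects"

params = {
    # Restrict results to projects belonging to the TCGA program
    "filters": json.dumps({
        "op": "=",
        "content": {"field": "program.name", "value": "TCGA"},
    }),
    "fields": "project_id,name,primary_site",
    "size": "5",
    "format": "JSON",
}

response = requests.get(GDC_PROJECTS_ENDPOINT, params=params, timeout=30)
response.raise_for_status()

for hit in response.json()["data"]["hits"]:
    print(hit["project_id"], "-", hit.get("name"))
```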
Another source of big data is the Catalogue Of Somatic Mutations In Cancer (COSMIC; https://cancer.sanger.ac.uk/cosmic) – a comprehensive resource for investigating the effects of somatic mutations in human cancer. Somatic mutations occur after conception (the first cell division) in any of the cells of the body except the germ cells (sperm and egg). Most such mutations are thought to be silent; nonetheless, a small minority can cause neoplastic as well as non-neoplastic diseases.
In 2004, COSMIC started with an initial survey of only four genes – today, it encompasses every human gene. The dataset in v98, released in May 2023, describes more than six million coding mutations across 1,520,321 samples.
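Once downloaded (COSMIC exports are distributed as tab-separated files, subject to registration and licensing), such a dataset can be summarized with a few lines of code. The file name and column header below are placeholders for illustration, not the actual COSMIC schema.

```python
# A minimal sketch: tally somatic mutations per gene from a COSMIC-style
# tab-separated export. "cosmic_mutations.tsv" and the "GENE_SYMBOL" column
# are assumptions for illustration; check the real download for exact names.
import csv
from collections import Counter

def mutations_per_gene(path: str) -> Counter:
    counts = Counter()
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            counts[row["GENE_SYMBOL"]] += 1
    return counts

for gene, n in mutations_per_gene("cosmic_mutations.tsv").most_common(10):
    print(f"{gene}\t{n}")
```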
International collaboration
Inspired by the participation and contribution of several countries around the globe, the International Cancer Genome Consortium (ICGC) (https://www.icgc-argo.org/page/65/icgc-initiatives-) was launched in 2008. ICGC coordinates large-scale genome studies in tumors from 50 cancer types and subtypes of serious concern in different countries.
Cancer genomics in the cloud
Incidentally, the ICGC genomic database is so huge that many researchers, lacking sufficient computing power, find it difficult to download and analyze the data. To address the issue, the organization has joined forces with commercial as well as academic cloud computing partners such as Amazon Web Services (AWS) and the Cancer Genome Collaboratory (CGC) (https://dcc.icgc.org/icgc-in-the-cloud).
AWS commercially offers a wide range of global cloud-based products spanning computation, storage, databases, networking, and more. The ICGC datasets are currently hosted in Northern Virginia, USA. The CGC, on the other hand, is an academic cloud resource built by the Ontario Institute for Cancer Research in Canada. Encouragingly, scientists can now access and analyze ICGC datasets through these cloud computing platforms.
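For orientation, this is roughly what programmatic access to data hosted on AWS looks like with boto3, the AWS SDK for Python. The bucket name and prefix are placeholders; real ICGC cloud access is gated by the consortium’s data access procedures, and the actual storage layout differs.

```python
# A minimal sketch of listing objects in an S3 bucket with boto3.
# The bucket and prefix below are placeholders, not real ICGC resources.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")   # Northern Virginia region

response = s3.list_objects_v2(
    Bucket="example-icgc-bucket",   # placeholder bucket name
    Prefix="donor-data/",           # placeholder prefix
    MaxKeys=10,
)

for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```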
Artificial intelligence takes over
For more than a decade, AI has contributed substantially to the resolution of multiple medical problems, including cancer. AI essentially refers to the simulation of human intelligence by a system or machine. It encompasses varied areas of investigation and application – including machine learning (ML). In other words, ML is a subset of AI.
Drawing inspiration from (human) brain science, researchers developed a powerful class of machine learning algorithms – artificial neural networks (ANNs). Neural networks form the backbone of deep learning (DL) algorithms; the word “deep” refers to the number of node layers in a deep neural network. DL, in turn, is a subset of ML.
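A small, self-contained sketch makes the “layers” idea concrete. The data here are synthetic and merely stand in for, say, gene-expression features with a binary label; the three hidden layers are what make this (modest) network “deep”.

```python
# A minimal sketch of a feed-forward deep neural network with scikit-learn.
# hidden_layer_sizes controls the number of node layers (the "deep" in DL).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for omics-style features and a binary outcome
X, y = make_classification(n_samples=500, n_features=50, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = MLPClassifier(hidden_layer_sizes=(64, 32, 16),   # three hidden layers
                      max_iter=500, random_state=0)
model.fit(X_train, y_train)
print(f"Held-out accuracy: {model.score(X_test, y_test):.2f}")
```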
It has become somewhat obvious that DL has the potential to radically transform cancer care and shape it into precision oncology (cancer treatment strategies based on the distinct molecular characteristics of a tumor). However, a challenge in applying DL to oncology is its requirement for enormous amounts of training data that are not only robust but also well phenotyped (phenotyping means clinical characterization of the traits that signify health or disease). Such data have been provided by TCGA, ICGC and COSMIC, among others.
In lieu of a conclusion
To conclude, here are a couple of examples of TCGA applications:
- In the April 26, 2019 issue of JAMA Network Open, the authors reported a machine learning method, Supervised Cancer Origin Prediction Using Expression (SCOPE), trained on whole-transcriptome TCGA data, to predict the origins of rare cancer types including treatment-resistant metastatic cancers (JAMA Network Open. 2019;2(4):e192597. doi:10.1001/jamanetworkopen.2019.2597).
- Researchers at the University of Cambridge and STORM Therapeutics, UK, used TCGA data to identify optimal therapeutic targets in oncology. They analyzed the performance of five different machine learning classifiers: random forests (RF), artificial neural networks (ANN), support vector machines (SVM), logistic regression (LR), and gradient boosting machines (GBM). Gene expression and mutation data from TCGA were used to train the models (Nature Scientific Reports, 2020, 10:10787); a minimal sketch of such a comparison follows below.
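For a flavor of what such a comparison involves, here is a minimal sketch on synthetic data using scikit-learn. It is not the published analysis; TCGA data access, feature selection, and preprocessing are omitted entirely.

```python
# A minimal sketch comparing the five classifier families named above,
# trained on synthetic data as a stand-in for TCGA-derived features.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=100, n_informative=15,
                           random_state=0)

models = {
    "Random forest (RF)":       RandomForestClassifier(random_state=0),
    "Neural network (ANN)":     MLPClassifier(hidden_layer_sizes=(50,),
                                              max_iter=1000, random_state=0),
    "Support vector machine":   SVC(),
    "Logistic regression (LR)": LogisticRegression(max_iter=1000),
    "Gradient boosting (GBM)":  GradientBoostingClassifier(random_state=0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:26s} mean CV accuracy: {scores.mean():.2f}")
```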
Some experts in the field feel that “big data” in cancer is not big enough; nonetheless, everyone agrees that it is getting bigger day by day.