Page 58 - Biennial Report 2018-20 Jun 2021
DECIPHERING THE HUMAN TISSUE SPECIFIC DYNAMICS OF PROTEOFORMS
When the human genome was completely sequenced, the total number of genes was found to
be only 30,000, baffling scientists about how the complexity of the human body is achieved
without increasing the number of genes significantly. Since then, deeper functional genomics
studies have shown that the RNA and protein products of these 30,000 genes are processed and
modified to generate a large diversity of isoforms. A leading consortium of proteomics scientists
has now defined ‘proteoform’ as all of the different molecular forms in which the protein
product of a single gene can be found, including changes due to genetic variations, alternatively
spliced RNA transcripts and post-translational modifications. As mass spectrometry-based
protein annotation becomes the mainstay of proteomics, it has become necessary to build
bioinformatics tools that allow proper identification and annotation of proteoforms. Debasis
Dash is working towards developing a human tissue-specific proteoform database by reanalyzing
public-domain data, along with a pipeline to integrate multi-omics data with a minimal false
discovery rate.
The ProMetaDB database was developed to overcome the data acquisition problem. It contains
additionally curated meta-information (phenotypes, conditions/concentrations and their
associated raw/processed files) for all existing proteomics studies. Datasets were retrieved from
PRIDE, MassIVE, PeptideAtlas, jPOST and iProX through the lab’s custom R packages; 195
MassIVE projects out of 2000 projects were re-annotated. To identify human tissue-specific
proteoforms, the project was extended to 10 different MS datasets using the proteoform
identification pipeline, which successfully identified 340 brain proteoforms. The identified
proteoforms were made publicly available in the HuBSProt database. In the proteoform
identification pipeline, the major challenge was protein inference, because proteoforms are very
similar to their parent protein and therefore share many peptides with it. Earlier, the approach
was to catalogue only those proteoforms that have at least one unique peptide along with other
shared peptides. To overcome this problem, the Protein Assembler tool, which uses a
parsimony method to collate peptides into proteins, was integrated into the pipeline. Using
the ProMetaDB database for data acquisition, the brain proteoform search was extended to 25
different brain MS datasets corresponding to 17 different brain regions, cell types or subcellular
fractions: cerebrum, substantia nigra, pituitary, olfactory bulb, temporal lobe, corpus callosum,
putamen, caudate nucleus, dendritic spine, hippocampus, nucleus basalis of Meynert,
oligodendrocytes, frontal cortex, fetal brain and any region of the brain. A search of other tissues,
such as pancreas and prostate, was also started. The above-mentioned datasets were analyzed by
first using UniProtKB/Swiss-Prot as the search database for protein identification with the
EuGenoSuite proteogenomic tool. The UniProtKB/Swiss-Prot database (version Dec 2018)
contains 42,423 proteins with their canonical isoforms. These search results also contain splice
variant isoforms as reported in UniProt. The identified proteins were passed to stage 2, in
which a customized search database containing possible variations was created and then
used to identify proteoforms of the proteins which could not be identified in the first stage. This
customized database contains (I) 42,390 sequences containing variant peptides derived from the
neXtProt database, (II) 502,629 proteins from the GENCODE transcriptome database, and (III) 42,423
proteins from the Swiss-Prot database. To reduce the search space and computational time, the
customized database was subset based on the proteins identified in the first stage and then queried