Page 58 - Biennial Report 2018-20 Jun 2021

DECIPHERING THE HUMAN TISSUE-SPECIFIC DYNAMICS OF PROTEOFORMS


                  When the human genome was completely sequenced, the total number of genes was found to
                  be only about 30,000, leaving scientists puzzled about how the complexity of the human body
                  is achieved without a significant increase in the number of genes. Since then, deeper functional
                  genomics studies have shown that the RNA and protein products of these 30,000 genes are
                  processed and modified to generate a large diversity of isoforms. A leading consortium of
                  proteomics scientists has now defined ‘proteoform’ as all of the different molecular forms in
                  which the protein product of a single gene can be found, including changes due to genetic
                  variations, alternatively spliced RNA transcripts and post-translational modifications. As mass
                  spectrometry-based protein annotation becomes the mainstay of proteomics, it has become
                  necessary to build bioinformatics tools that allow proper identification and annotation of
                  proteoforms. Debasis Dash is working towards developing a human tissue-specific proteoform
                  database by reanalyzing public domain data, along with a pipeline to integrate multi-omics data
                  with a minimum false discovery rate.
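
To make the proteoform definition above concrete, the following is a minimal sketch of how the forms of one gene product could be represented and counted. It assumes a toy model in which a proteoform is one splice isoform combined with any subset of sequence variants and any subset of post-translational modifications. The class name, function name and labels (GENE1, iso1, V1, P1) are hypothetical and are not taken from the report; the count is a combinatorial upper bound, not a claim about which forms occur in a tissue.

```python
from dataclasses import dataclass
from itertools import combinations, product


@dataclass(frozen=True)
class Proteoform:
    """One molecular form of a gene product (hypothetical toy model)."""
    gene: str
    isoform: str      # splice isoform label
    variants: tuple   # sequence variants carried by this form
    ptms: tuple       # post-translational modifications carried by this form


def subsets(items):
    """Return every subset of items as a tuple, including the empty one."""
    return [c for r in range(len(items) + 1) for c in combinations(items, r)]


def enumerate_proteoforms(gene, isoforms, variants, ptms):
    """Combine each isoform with every subset of variants and of PTMs."""
    return [
        Proteoform(gene, iso, var, ptm)
        for iso, var, ptm in product(isoforms, subsets(variants), subsets(ptms))
    ]


# Toy input: 2 isoforms, 1 variant, 2 PTM sites.
forms = enumerate_proteoforms("GENE1", ["iso1", "iso2"], ["V1"], ["P1", "P2"])
print(len(forms))  # 2 isoforms * 2 variant subsets * 4 PTM subsets = 16
```

Even this small toy input yields 16 candidate forms from a single gene, which illustrates why a modest gene count is compatible with a large diversity of protein products.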

                  To overcome the data acquisition problem, a database, ProMetaDB, was developed that contains
                  additionally curated meta-information (phenotypes, conditions/concentrations and their
                  associated raw/processed files) for all existing proteomics studies. Datasets were retrieved from
                  PRIDE, MassIVE, PeptideAtlas, jPOST and iProx through the lab’s custom R packages, with 195
                  MassIVE projects out of 2000 projects re-annotated. For the identification of human
                  tissue-specific proteoforms, the project was extended to 10 different MS datasets using the
                  proteoform identification pipeline, which successfully identified 340 brain proteoforms. The
                  identified proteoforms were made publicly available in the HuBSProt database. In the proteoform
                  identification pipeline, the major challenge was protein inference, because proteoforms are very
                  similar to their parent protein and therefore share many peptides with it. Earlier, the approach
                  was to catalogue only those proteoforms that have at least one unique peptide in addition to
                  shared peptides. To overcome this problem, the Protein Assembler tool, which uses a parsimony
                  method to collate peptides into proteins, was integrated into the pipeline. Using
                  the ProMetaDB database for data acquisition, the brain proteoform search was extended to 25
                  different brain MS datasets corresponding to 17 different brain regions, cell types or subcellular
                  fractions, including cerebrum, substantia nigra, pituitary, olfactory bulb, temporal lobe, corpus
                  callosum, putamen, caudate nucleus, dendritic spine, hippocampus, nucleus basalis of Meynert,
                  oligodendrocytes, frontal cortex, fetal brain and any region of the brain. Searches for other
                  tissues, such as pancreas and prostate, were also started. The above-mentioned datasets were analyzed by
                  first using UniProtKB/Swiss-Prot as the search database for protein identification with the
                  EuGenoSuite proteogenomic tool. The UniProtKB/Swiss-Prot database (version Dec 2018)
                  contains 42,423 proteins with their canonical isoforms. The search results also contain splice
                  variant isoforms as reported in UniProt. The identified proteins were passed to stage 2, in
                  which a customized search database containing possible variations was created and used to
                  identify proteoforms of the proteins that could not be identified in the first stage. This
                  customized database contains (I) 42,390 sequences containing variant peptides derived from
                  the neXtProt database, (II) 502,629 proteins from the GENCODE transcriptome database, and
                  (III) 42,423 proteins from the Swiss-Prot database. To reduce the search space and computational
                  time, the customized database was subset based on the proteins identified in the first stage and then queried
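
The parsimony-based protein inference step mentioned above can be illustrated with a short sketch. This is not the Protein Assembler implementation, whose internals are not described here; it is a generic greedy set-cover approximation, assuming each candidate protein or proteoform maps to the set of peptides that could originate from it. The function name and the toy accessions and peptides are hypothetical.

```python
def parsimonious_proteins(protein_to_peptides):
    """Greedily pick a small set of proteins that explains every observed peptide.

    protein_to_peptides maps a protein identifier to a set of peptide strings.
    Ties are broken alphabetically so the result is deterministic.
    """
    uncovered = set().union(*protein_to_peptides.values())
    selected = []
    while uncovered:
        # Pick the protein that explains the most still-unexplained peptides.
        best = max(
            sorted(protein_to_peptides),
            key=lambda name: len(protein_to_peptides[name] & uncovered),
        )
        selected.append(best)
        uncovered -= protein_to_peptides[best]
    return selected


# Toy example: a parent protein, a proteoform sharing two of its peptides,
# and a candidate supported only by shared peptides.
candidates = {
    "P_canonical": {"pepA", "pepB", "pepC"},
    "P_proteoform": {"pepA", "pepB", "pepD"},
    "P_fragment": {"pepA", "pepB"},
}
print(parsimonious_proteins(candidates))  # ['P_canonical', 'P_proteoform']
```

In this toy case the candidate supported only by shared peptides is dropped, while the proteoform is retained because it is needed to explain one peptide that no other selected entry accounts for.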

