Page 59 - Biennial Report 2018-20 Jun 2021

against this subset in the second stage. This strategy also helped in controlling false positives.
The protocol was then repeated using the same parameters as in stage 1. Finally, the different
proteoforms associated with the proteins were identified. These proteoforms were further
filtered for brain specificity based on the gene enrichment information from the Human Protein
Atlas. A new template was designed to accommodate various new features of proteoforms
which will be incorporated in the HuBSProt database.

A GENOMICS AND PROTEOMICS APPROACH TO UNDERSTANDING THE PITCHER
PLANT

Nepenthes khasiana, the only pitcher plant found in India, is endemic to the West and South Garo
Hills, West and East Khasi Hills and Jaintia Hills of Meghalaya. This plant has a unique combination
of biochemistry, morphology and physiology that enables prey capture and nutrient assimilation
from digested prey. Under the special twinning programme of DBT that facilitates interaction
with scientists of the Northeast region of India, it was decided to sequence the genome and to
compare the proteome and inquiline diversity of open and unopened pitchers, owing to the plant's
many medicinal values and its importance in the evolution of the genus Nepenthes in Asia. The
chloroplast (cp) and mitochondrial (mt) genomes were extracted from the whole genome data of
N. khasiana. Reads from a shotgun library with an insert size of 450 bp were used for assembling
the organelle genomes. The cp and mt genomes were assembled using NOVOPlasty. The cp genome was
annotated using DOGMA, CpGAVAS and BLAST (blastn, blastp and tblastx). The mt genome was
annotated using MITOFY and BLAST. Both annotated genomes were submitted to GenBank with
accession numbers MK330891 and MH923233 for the mitochondrial and chloroplast genomes,
respectively. The assembled cp genome is 156,914 bp long and has a quadripartite structure
with a pair of inverted repeats of 25,193 bp each, a large single copy region of 87,237 bp and a
small single copy region of 19,291 bp. A total of 87 protein coding genes, 37 tRNAs and 8 rRNAs
were annotated in the assembled cp genome. The assembled mt genome is 900,031 bp long.
A total of 50 protein coding genes, 27 tRNAs and 7 rRNAs were annotated in the assembled
mt genome.
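The region lengths reported above can be checked against the total cp genome length with a short calculation. The sketch below is only an illustrative consistency check, not part of the reported pipeline; the function and variable names are assumptions introduced here, and the numbers are the ones quoted in the text.

```python
# Region lengths (bp) as reported in the text for the N. khasiana cp genome.
LSC = 87_237             # large single copy region
SSC = 19_291             # small single copy region
IR = 25_193              # one inverted repeat; the structure holds two copies
REPORTED_TOTAL = 156_914


def quadripartite_length(lsc: int, ssc: int, ir: int) -> int:
    """Total length of a genome made of LSC, SSC and two inverted repeats.

    Hypothetical helper for illustration only.
    """
    return lsc + ssc + 2 * ir


total = quadripartite_length(LSC, SSC, IR)
# The sum of the four regions should equal the reported assembly length.
print(total, total == REPORTED_TOTAL)
```

The four regions sum to 156,914 bp, which agrees with the reported assembly length.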
The chloroplast (cp) genome was assembled from adapter-trimmed shotgun reads using
NOVOPlasty. For de novo whole genome assembly, adapter- and quality-trimmed shotgun and
mate-pair reads were assembled using ALLPATHS-LG. GapCloser and RepeatMasker were used for
closing gaps and masking repeats, respectively. The repeat-masked draft genome was used for all
further analysis. Gene prediction was done with AUGUSTUS using Arabidopsis as the training
dataset. SSRs were identified with MISA. Based on core orthologs (plant ortholog dataset), 88.6%
of the coding genome was found to be complete after final assembly. A total of 7,214 scaffolds
were assembled with a scaffold N50 of 1,163,181 bp (~1 Mb) and an average scaffold size of 120 kb.
The genome size was computed as 749,857,876 bp (~750 Mb) based on the k-mer distribution. The
draft genome was found to be richer in trinucleotide repeats than in mono- or di-nucleotide repeats.
The assembled cp genome was 156,914 bp long with a quadripartite structure (a pair of inverted
repeats, a large single copy region and a small single copy region) and included 87 PCGs, 37 tRNAs
and 8 rRNAs. The N. khasiana whole genome data accession in SRA is SRP149035; the cp genome
accession in GenBank is MH923233. The high-quality reads from six paired-end libraries and three mate-pair
