Collecting Protein and Gene Data from the UniProt Database

What is the UniProt Database?

UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature. (Source: Wikipedia)

Why is this protein list important to the current COVID-19 Dataset?

In order to gather information about biomolecular mechanisms from the scientific literature (the COVID-19 Dataset), one needs lists of the associated proteins, genes, pathways, drugs, etc. This notebook presents the steps to gather coronavirus-associated protein names, gene names and pathways from the UniProt database. These lists can be used to search the text documents during further NLP processing and to present entity relationships.

1. Getting Data

Step I

Go to the UniProt database (https://www.uniprot.org/) and select UniProtKB in the search bar. Then enter corona virus into the search bar.

Step II

After you run the search, the results are displayed as a table that spans multiple pages.

Step III

Look at the toolbar at the far right of this table. There is a pen-like icon that opens a new window, where you can select the information you want to gather (e.g., Name, Gene, Pathways).

Step IV

Once you are done selecting the information, go back to the previous table and hit the download button. You can select the format of the data; an Excel file is one option.
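
The same selection can probably be scripted instead of clicked through. The sketch below only builds a download URL and does not contact the server. It assumes UniProt's REST search endpoint (`https://rest.uniprot.org/uniprotkb/search`) with its `query`, `format`, `fields` and `size` parameters; the field identifiers are an assumption about how the columns used later in this notebook are named there, so check them against the UniProt documentation before relying on them. `build_uniprot_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumed endpoint of the UniProt REST API (not taken from this notebook)
UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"

# Assumed field identifiers for the columns used later on
FIELDS = ["accession", "id", "reviewed", "protein_name",
          "gene_names", "organism_name", "virus_hosts", "cc_pathway"]

def build_uniprot_url(query, fields=FIELDS, fmt="tsv", size=500):
    """Build a search URL (hypothetical helper; nothing is downloaded)."""
    params = {"query": query, "format": fmt,
              "fields": ",".join(fields), "size": size}
    return UNIPROT_SEARCH + "?" + urlencode(params)

url = build_uniprot_url("corona virus")
print(url)
```

The search endpoint returns results in pages, so a single request like this would not fetch every row; the manual download described above avoids that problem.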

2. Data Wrangling

What to do after getting the protein data?

Let's play around with this data.

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
file_path = "../input/corona-virus-proteins-from-uniprot-database/corona.csv"
df = pd.read_csv(file_path)
df.head(5)
  | Entry | Entry name | Status | Protein names | Gene names | Organism | Virus hosts | Pathway
0 | A0A3R5SMJ6 | A0A3R5SMJ6_WNV | unreviewed | Genome polyprotein | NaN | West Nile virus (WNV) | Aedes [TaxID: 7158]; Amblyomma variegatum (Tro... | NaN
1 | M1UFP6 | M1UFP6_9FLAV | unreviewed | Genome polyprotein | NaN | Bovine viral diarrhea virus 1b | NaN | NaN
2 | P11223 | SPIKE_IBVB | reviewed | Spike glycoprotein (S glycoprotein) (E2) (Pepl... | S 2 | Avian infectious bronchitis virus (strain Beau... | Gallus gallus (Chicken) [TaxID: 9031] | NaN
3 | P11224 | SPIKE_CVMA5 | reviewed | Spike glycoprotein (S glycoprotein) (E2) (Pepl... | S 3 | Murine coronavirus (strain A59) (MHV-A59) (Mur... | Mus musculus (Mouse) [TaxID: 10090] | NaN
4 | P0C6X9 | R1AB_CVMA5 | reviewed | Replicase polyprotein 1ab (pp1ab) (ORF1ab poly... | rep 1a-1b | Murine coronavirus (strain A59) (MHV-A59) (Mur... | Mus musculus (Mouse) [TaxID: 10090] | NaN

The dataset contains 21,876 protein entries in total, each described by 8 columns.

df.shape
(21876, 8)

Q: What are the different organisms? Can you find the top 20 organisms?

df_organism = pd.DataFrame(df.groupby("Organism").count()['Entry'])
df_organism = df_organism.sort_values(by = "Entry", ascending = False)
df_organism[0:20].plot.barh(figsize = [15,10], fontsize =20)
plt.gca().invert_yaxis()

df_organism[0:20]
Entry
Organism
Infectious bronchitis virus 8184
Porcine epidemic diarrhea virus 4657
Middle East respiratory syndrome-related coronavirus 1139
Feline coronavirus 983
Human coronavirus OC43 (HCoV-OC43) 630
Avian coronavirus 497
Canine coronavirus 463
Human coronavirus NL63 (HCoV-NL63) 333
Transmissible gastroenteritis virus 258
Porcine deltacoronavirus 235
Bovine coronavirus 225
Human coronavirus 229E (HCoV-229E) 210
Murine hepatitis virus 111
Human coronavirus HKU1 (HCoV-HKU1) 110
Porcine hemagglutinating encephalomyelitis virus 104
Porcine respiratory coronavirus 98
Alphacoronavirus sp. 90
Human immunodeficiency virus 1 81
Bat SARS-like coronavirus 66
Hepatitis B virus (HBV) 66

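Q: What are the different virus hosts? Can you find the top virus hosts?

# Shorten long host strings to 50 characters so the plot labels stay readable
# (str() also turns missing values into the string 'nan')
df['Virus hosts'] = df['Virus hosts'].apply(lambda x: str(x)[0:50] )
df_host = pd.DataFrame(df.groupby("Virus hosts").count()['Entry'])
df_host = df_host.sort_values(by = "Entry", ascending = False)
# Start at row 1 to skip the top row, which presumably collects the 'nan' entries
df_host[1:20].plot.barh(figsize = [15,10], fontsize =20)
plt.gca().invert_yaxis()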

df_host[1:20]
Entry
Virus hosts
Homo sapiens (Human) [TaxID: 9606] 1385
Bos taurus (Bovine) [TaxID: 9913] 78
Homo sapiens (Human) [TaxID: 9606]; Pan troglodyte 66
Gallus gallus (Chicken) [TaxID: 9031] 64
Sus scrofa (Pig) [TaxID: 9823] 57
Mus musculus (Mouse) [TaxID: 10090] 54
Meleagris gallopavo (Wild turkey) [TaxID: 9103] 51
Homo sapiens (Human) [TaxID: 9606]; Paguma larvata 50
Alliaria petiolata (Garlic mustard) (Arabis petiol 33
Pipistrellus abramus (Japanese pipistrelle) (Pipis 32
Tylonycteris pachypus (Lesser bamboo bat) [TaxID: 19
Canis lupus familiaris (Dog) (Canis familiaris) [T 16
Rattus norvegicus (Rat) [TaxID: 10116] 11
Rousettus leschenaultii (Leschenault's rousette) [ 11
Equus caballus (Horse) [TaxID: 9796] 9
Felidae (cat family) [TaxID: 9681] 9
Impatiens [TaxID: 35939] 9
Scotophilus kuhlii (Lesser asiatic yellow bat) [Ta 7
Rhinolophus sinicus (Chinese rufous horseshoe bat) 7
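
Truncating to 50 characters keeps the table readable, but it also counts combined values such as `Homo sapiens (Human) [TaxID: 9606]; Pan troglodyte` as hosts of their own. A possible alternative, sketched below on a few hand-written sample strings rather than the real column, is to split each value on `;` and count every host separately. `split_hosts` is a hypothetical helper, and the `;` separator is inferred from the rows shown above.

```python
from collections import Counter

def split_hosts(value):
    """Split one 'Virus hosts' cell into individual host strings."""
    # Missing cells arrive as NaN (a float) from pandas, or as the
    # string 'nan' after str(); both are treated as "no host"
    if not isinstance(value, str) or value == "nan":
        return []
    return [host.strip() for host in value.split(";") if host.strip()]

# Hand-written sample values in the same format as the column
sample = [
    "Homo sapiens (Human) [TaxID: 9606]",
    "Homo sapiens (Human) [TaxID: 9606]; Bos taurus (Bovine) [TaxID: 9913]",
    float("nan"),
    "Gallus gallus (Chicken) [TaxID: 9031]",
]

host_counts = Counter(host for cell in sample for host in split_hosts(cell))
for host, n in host_counts.most_common():
    print(n, host)
```

On the real data this would have to be applied to the untruncated column, i.e. before the `str(x)[0:50]` step.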

3. Cleaning Protein Names, Synonyms and Abbreviations

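def filter(line):
    # Note: this name shadows Python's built-in filter(); it is kept
    # because the cells below call it under this name.
    proteins = set()
    line = str(line).lower()

    # Lines without any () or [] terms: the whole line is the name
    if "(" not in line and "[" not in line:
        proteins.add(line.strip().replace(' ', '_'))

    # Lines including () terms: main name first, then each synonym
    if '(' in line:
        open_in = line.find('(')
        proteins.add(line[:open_in].strip().replace(' ', '_'))
        while open_in >= 0:
            start = open_in + 1
            end = line.find(')', start)
            if end < 0:  # unbalanced parenthesis: stop scanning
                break
            proteins.add(line[start:end].strip().replace(' ', '_'))
            open_in = line.find('(', end)

    # Lines including [] terms, e.g. "[Cleaved into: A; B (b)]"
    if '[' in line:
        if '(' not in line:
            # No synonyms, so the main name has not been added yet
            proteins.add(line[:line.find('[')].strip().replace(' ', '_'))
        raw = line[line.find('['):line.find(']')]
        # Skip the 15-character "[cleaved into: " prefix. The slice also
        # drops the final character of the last fragment, so a trailing
        # prime (as in "spike protein s2'") is lost.
        raw = raw[15:-1]
        for item in raw.split("; "):
            if '(' in item:
                # Keep only the name before the first synonym
                proteins.add(item[:item.find('(')].strip().replace(' ', '_'))
            else:
                proteins.add(item.strip().replace(' ', '_'))
    return proteins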
# Show the raw and the cleaned names for the first five entries
for u, p in zip(df['Entry'][:5], df['Protein names'][:5]):
    print(u, "|", p)
    print("------------")
    print(u, "|", filter(p))
    print("===================================================")
A0A3R5SMJ6 | Genome polyprotein
------------
A0A3R5SMJ6 | {'genome_polyprotein'}
===================================================
M1UFP6 | Genome polyprotein
------------
M1UFP6 | {'genome_polyprotein'}
===================================================
P11223 | Spike glycoprotein (S glycoprotein) (E2) 
(Peplomer protein) [Cleaved into: Spike protein S1; 
Spike protein S2; Spike protein S2']
------------
P11223 | {'peplomer_protein', 'spike_protein_s1',
'e2', 'spike_protein_s2', 'spike_glycoprotein',
's_glycoprotein'}
===================================================
P11224 | Spike glycoprotein (S glycoprotein) (E2) 
(Peplomer protein) [Cleaved into: Spike protein S1;
Spike protein S2; Spike protein S2']
------------
P11224 | {'peplomer_protein', 'spike_protein_s1',
'e2', 'spike_protein_s2', 'spike_glycoprotein',
's_glycoprotein'}
===================================================
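
The output above shows that `Spike protein S2'` loses its prime, and the fixed `raw[15:-1]` slice only fits the `[Cleaved into: ` prefix, although UniProt also uses other labels such as `[Includes: ...]`. A regex-based variant is sketched below as one possible alternative, not as a replacement tested on the full dataset. `parse_protein_names` is a hypothetical name, and the sketch assumes that parentheses are not nested.

```python
import re

def parse_protein_names(raw):
    """Return the set of normalised names in a 'Protein names' string."""
    text = str(raw).lower()
    fragments = []
    # Collect the body of every "[label: a; b; c]" section, whatever the label
    for body in re.findall(r"\[[^\]:]*:\s*([^\]]*)\]", text):
        fragments.extend(body.split(";"))
    # What remains outside the brackets is the main name plus its synonyms
    fragments.append(re.sub(r"\[[^\]]*\]", " ", text))

    names = set()
    for fragment in fragments:
        # Synonyms sit in parentheses; the main name is the rest
        synonyms = re.findall(r"\(([^()]*)\)", fragment)
        main = re.sub(r"\([^()]*\)", " ", fragment)
        for name in [main, *synonyms]:
            name = "_".join(name.split())  # collapse whitespace to "_"
            if name:
                names.add(name)
    return names

example = ("Spike glycoprotein (S glycoprotein) (E2) (Peplomer protein) "
           "[Cleaved into: Spike protein S1; Spike protein S2; Spike protein S2']")
print(sorted(parse_protein_names(example)))
```

With this variant the example keeps both `spike_protein_s2` and `spike_protein_s2'` as separate names.
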
allProteins = []
for u,p in zip(df['Entry'],df['Protein names']):
    allProteins.append({"id":u, "names":list(filter(p))})
allProteins[0:5]
[{'id': 'A0A3R5SMJ6', 'names': ['genome_polyprotein']},
 {'id': 'M1UFP6', 'names': ['genome_polyprotein']},
 {'id': 'P11223',
  'names': ['peplomer_protein',
   'spike_protein_s1',
   'e2',
   'spike_protein_s2',
   'spike_glycoprotein',
   's_glycoprotein']},
 {'id': 'P11224',
  'names': ['peplomer_protein',
   'spike_protein_s1',
   'e2',
   'spike_protein_s2',
   'spike_glycoprotein',
   's_glycoprotein']},
 {'id': 'P0C6X9',
  'names': ['m-pro',
   'nsp16',
   'nsp7',
   'exon',
   'nsp14',
   'nsp10',
   'guanine-n7_methyltransferase',
   'ec_3.4.22.-',
   'nendou',
   'non-structural_protein_3',
   'non-structural_protein_2',
   'ec_2.1.1.-',
   "2'-o-methyltransferase",
   'hel',
   'pol',
   'ec_3.6.4.13',
   'growth_factor-like_peptide',
   'non-structural_protein_7',
   'gfl',
   'ec_3.4.22.69',
   'p22',
   'p27',
   'non-structural_protein_9',
   'orf1ab_polyprotein',
   'nsp4',
   'uridylate-specific_endoribonuclease',
   '3cl-pro',
   'pp1ab',
   'p65',
   'host_translation_inhibitor_nsp1',
   'p15',
   'nsp1',
   'ec_3.6.4.12',
   'nsp2',
   'ec_2.7.7.48',
   'p67',
   'nsp12',
   'peptide_hd2',
   'nsp5',
   'p210',
   'rdrp',
   'nsp9',
   'p100',
   'nsp3',
   'nsp8',
   'non-structural_protein_6',
   'rna-directed_rna_polymerase',
   'p35',
   '3c-like_proteinase',
   'papain-like_proteinase',
   'ec_3.1.13.-',
   'non-structural_protein_4',
   'pl-pro',
   '3clp',
   'p12',
   'non-structural_protein_10',
   'p28',
   'non-structural_protein_8',
   'helicase',
   'nsp6',
   'nsp13',
   'nsp15',
   'p44',
   'ec_3.1.-.-',
   'replicase_polyprotein_1ab',
   'ec_3.4.19.12',
   'p10']}]
import json
with open("virus-proteins.json", 'w') as fn:
    json.dump(allProteins,fn)
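
As a rough idea of how the saved list could feed the NLP step mentioned at the start, the sketch below builds a lookup from each cleaned name to the UniProt ids that use it and then checks a sentence against it. The helper names `build_name_index` and `find_proteins` are hypothetical, the two records are copied from the output above with shortened name lists, and the whole-word matching is deliberately naive (punctuation is not handled).

```python
import json
from collections import defaultdict

def build_name_index(records):
    """Map each normalised name to the set of UniProt ids that use it."""
    index = defaultdict(set)
    for record in records:
        for name in record["names"]:
            index[name].add(record["id"])
    return index

def find_proteins(sentence, index):
    """Return {name: ids} for every indexed name found in the sentence."""
    # Normalise the sentence the same way as the names: lower case, "_" joins
    padded = "_" + "_".join(sentence.lower().split()) + "_"
    return {name: sorted(ids) for name, ids in index.items()
            if "_" + name + "_" in padded}

# Stand-in for json.load() on virus-proteins.json (shortened sample records)
records = json.loads("""[
  {"id": "P11223", "names": ["spike_glycoprotein", "e2", "peplomer_protein"]},
  {"id": "P11224", "names": ["spike_glycoprotein", "e2", "peplomer_protein"]}
]""")

index = build_name_index(records)
hits = find_proteins("The spike glycoprotein binds the host receptor", index)
print(hits)
```

Padding with underscores keeps short names such as `e2` from matching inside longer words, which a plain substring test would not.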