
Bioinformatics Practical Manual

Published by philip, 2021-09-19 19:39:50

Description: A step-by-step guide to using basic to advanced Bioinformatics tools.

Keywords: Bioinformatics,Genomics,Proteomics


PRACTICAL MANUAL GENOMICS PROTEOMICS & BIOINFORMATICS Prepared by Philip Litto Thomas Department of Zoology

Genomics, Proteomics & Bioinformatics: A Practical Manual

Contents

1. Biological database search & data retrieval
1.1. Downloading nucleotide sequence from NCBI database
1.2. Downloading protein sequence from NCBI database
1.3. Downloading protein sequence from ExPASy & SWISS-PROT
1.4. Downloading protein structure file from PDB
2. Primer designing: Primer-BLAST
3. Sequence alignment: BLAST-X
4. Multiple Sequence Alignment using T-Coffee
5. Gene prediction using GENSCAN
6. Promoter prediction using Promoter 2.0 Prediction Server
7. Identify conserved domains of proteins using CD-Search
8. Gene/Protein function analysis using PANTHER
9. Protein-Protein interaction studies using STRING
10. Protein structure analysis using RasMol

Department of Zoology, St. Berchmans College (Autonomous), Changanassery

1. Database search & data retrieval

1.1 Downloading nucleotide sequence from NCBI database

NCBI (National Center for Biotechnology Information) maintains the GenBank nucleotide sequence database, an annotated collection of all publicly available DNA sequences. GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at NCBI. These three organizations exchange data on a daily basis.

Nucleotide sequences from the GenBank database can be accessed and downloaded from the web site http://www.ncbi.nlm.nih.gov/nucleotide. The downloaded sequence file is called a GenBank flat file. Some of the important fields in a GenBank flat file are:

LOCUS: This field contains a number of different data elements, including locus name, sequence length, molecule type, GenBank division, and modification date.

DEFINITION: Brief description of the sequence; includes information such as source organism, gene name/protein name, or some description of the sequence's function.

ACCESSION: The unique identifier for a sequence record. An accession number applies to the complete record and is usually a combination of letters and numbers, such as two letters followed by an underscore and six or more digits (e.g., NM_203377 or NM_001164047).

FEATURES: Information about genes and gene products, including exons and CDS.

ORIGIN: The sequence data begin on the line immediately below ORIGIN.
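The field layout described above can be illustrated with a short Python sketch. It is only a minimal illustration, not NCBI's own parser: the record below is a made-up toy record (TOY_0001 is a hypothetical accession, not a real GenBank entry), and the function names parse_flat_file and to_fasta are assumptions introduced here. Continuation lines and the FEATURES table are deliberately ignored.

```python
import re

# Hypothetical toy record, laid out like a GenBank flat file (not a real entry).
TOY_RECORD = """\
LOCUS       TOY_0001                  30 bp    mRNA    linear   PRI 01-JAN-2000
DEFINITION  Hypothetical toy record for illustration only.
ACCESSION   TOY_0001
ORIGIN
        1 atggggctca gcgacgggga atggcagttg
//
"""


def parse_flat_file(text):
    """Collect top-level fields and the sequence found below ORIGIN."""
    fields = {}
    seq_parts = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if in_origin:
            # Sequence lines carry position numbers and spaces; keep letters only.
            seq_parts.append(re.sub(r"[^a-zA-Z]", "", line))
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line and not line[0].isspace():
            # Top-level keyword in column 1; indented continuation lines are skipped.
            key, _, value = line.partition(" ")
            fields[key] = value.strip()
    fields["SEQUENCE"] = "".join(seq_parts).upper()
    return fields


def to_fasta(fields, width=60):
    """Format the parsed record as FASTA: a '>' header line, then the sequence."""
    header = ">" + fields["ACCESSION"] + " " + fields["DEFINITION"]
    seq = fields["SEQUENCE"]
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([header] + lines)


record = parse_flat_file(TOY_RECORD)
print(record["LOCUS"])
print(to_fasta(record))
```

For real records, a maintained parser is a safer choice than a hand-written one like this.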

Exercise: Download the nucleotide sequence of Homo sapiens myoglobin (MB), transcript variant 2, mRNA (NM_203377) and view the sequence in FASTA format.

Procedure:

Step 1: Open the NCBI home page (http://www.ncbi.nlm.nih.gov), select Nucleotide from the drop-down menu, and search for NM_203377.

Step 2: The following details about the sequence are retrieved and noted down:

Sequence length: 1170 bp
Molecule type: mRNA
GenBank division: PRIMATES
Modification date: 14-SEP-2013
Version: NM_203377.1
GI: 44955884
No. of exons: 4
CDS (coding sequence): 173 to 637
Poly A signal: 1147 to 1152

Step 3: Click on the FASTA link on the page to obtain the sequence in FASTA format.
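As a quick check on the values noted in Step 2, the CDS coordinates can be turned into a coding length and a protein length. The sketch below is a small worked calculation in Python; the function name cds_summary is hypothetical, and it assumes the CDS coordinates are 1-based, inclusive, and include the stop codon.

```python
def cds_summary(start, end):
    """Return (nucleotides, codons, amino acids) for a 1-based inclusive CDS range.

    Assumes the range includes the stop codon, so the protein is one
    residue shorter than the number of codons.
    """
    length = end - start + 1
    if length % 3 != 0:
        raise ValueError("CDS length is not a multiple of 3")
    codons = length // 3
    return length, codons, codons - 1


# CDS of NM_203377 as noted in Step 2: 173 to 637
print(cds_summary(173, 637))  # (465, 155, 154)
```

Under these assumptions the 465-nucleotide CDS corresponds to 155 codons, that is, a 154-residue protein plus the stop codon.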

Figure: GenBank flat file for NM_203377

Figure: Nucleotide sequence in GenBank format

Figure: Nucleotide sequence in FASTA format

1.2 Downloading protein sequence from NCBI database

The Protein database is a collection of sequences from several sources, including translations from annotated coding regions in GenBank. Protein sequences from the NCBI database can be accessed and downloaded from the web site http://www.ncbi.nlm.nih.gov/protein. Similar to nucleotide sequence files, the protein sequence files have various fields:

LOCUS: This field contains a number of different data elements, including locus name, sequence length, molecule type, GenBank division, and modification date.

DEFINITION: Brief description of the sequence; includes information such as source organism, protein name, or some description of the sequence's function.

ACCESSION: The unique identifier for a sequence record. An accession number applies to the complete record and is usually a combination of letters and numbers. Protein IDs consist of three letters followed by five digits (e.g., AAA59172).

FEATURES: Information about the protein, signal peptide and CDS.

ORIGIN: The sequence data begin on the line immediately below ORIGIN.

Exercise: Download the amino acid sequence of Homo sapiens insulin (AAA59172).

Procedure:

Step 1: Search for AAA59172 in the NCBI Protein database (http://www.ncbi.nlm.nih.gov/protein).

Step 2: The following details regarding the sequence are retrieved and noted down:

Sequence length: 110 aa
GenBank division: PRIMATES
Modification date: 12-FEB-2001
Version: AAA59172.1
GI: 386828
Features:
Signal peptide: 1 to 24
Mature peptide: 25 to 110
First amino acid in mature peptide: F (Phenylalanine)

Figure: Protein sequence file definitions

Figure: Protein sequence file features table

1.3 Downloading protein sequence from ExPASy & SWISS-PROT

ExPASy (Expert Protein Analysis System) is a portal (http://www.expasy.org) provided by the Swiss Institute of Bioinformatics (SIB). It provides a single point of entry to a variety of databases and analytical tools for proteomics, genomics, phylogeny, systems biology etc. The ExPASy server provides analysis tools for specific tasks relevant to proteomics: similarity searches, pattern and profile searches, post-translational modification prediction, topology prediction, primary, secondary and tertiary structure analysis, and sequence alignment.

The Swiss Institute of Bioinformatics (SIB), the European Bioinformatics Institute (EBI) and the Protein Information Resource (PIR) joined to form the UniProt consortium. UniProt provides a comprehensive, high-quality and freely accessible resource of protein sequence and functional information. The UniProt database can be accessed via the ExPASy server. The centerpiece of the UniProt databases is the UniProt knowledgebase (UniProtKB), which comprises 2 sections: manually annotated UniProtKB/Swiss-Prot and automatically annotated UniProtKB/TrEMBL. Taken together, these 2 sections give access to all publicly available protein sequences.

The SWISS-PROT knowledgebase (http://www.expasy.org/sprot/) is a curated protein sequence database, which strives to provide high-quality annotations (such as the description of the function of a protein, its domain structure, post-translational modifications and variants), a minimal level of redundancy and a high level of integration with other databases. SWISS-PROT is supplemented by TrEMBL, which contains computer-annotated entries for all sequences not yet integrated in SWISS-PROT.

In order to have minimal redundancy and to improve sequence reliability, all protein sequences encoded by the same gene are merged into a single UniProtKB/Swiss-Prot entry. Differences found between various sequencing reports are analyzed and fully described in the feature table (alternative splicing events, polymorphisms or conflicts, for example). Once in UniProtKB/Swiss-Prot, a protein entry is removed from UniProtKB/TrEMBL.

Exercise: Download the amino acid sequence of myoglobin of Homo sapiens from the SWISS-PROT database using the ExPASy server. Display the sequence in FASTA format.

Procedure:

Step 1: Open the ExPASy portal (http://www.expasy.org).

Step 2: Select UniProtKB from the drop-down menu and search for Myoglobin + Homo sapiens. The result page will display a few sequences.

Step 3: In the result page, click on the Filter by "Reviewed" link in the left pane. (This restricts the results to those from SWISS-PROT.)

Step 4: Next, click on the entry to open the file.

Step 5: Scroll down the page to view the details. Click on the option FASTA under the subheading 'Sequences'. A new page opens with the myoglobin sequence in FASTA format.

1.4 Downloading protein structure file from PDB

The Protein Data Bank (PDB) is a worldwide repository of information about the 3D structures of large biological molecules, including proteins and nucleic acids. The 3D structures are obtained by X-ray crystallography, NMR spectroscopy and, more recently, cryo-electron microscopy. Researchers around the world deposit their structural data in PDB. These structural files are freely accessible on the Internet via www.pdb.org. The structure files may be viewed using one of several free and open source computer programs, including Jmol, Pymol, VMD, and Rasmol.

Steps to download a PDB file:

• Go to www.pdb.org
• If the PDB ID of the molecule of interest is known, enter it in the search box. Otherwise, enter a general search term, e.g., Reverse transcriptase.
• The search will return many results. Clicking on the best-suited result will show its details.
• Click on the Download Files link on the right-hand side.
• From the options, select PDB format.
• The browser will prompt to Open/Save the file. Save it in a suitable location.

2. Primer designing using Primer-BLAST

Primer design is a very important step in setting up a PCR assay. If the primers anneal poorly or to more than one sequence, this can significantly impact the quality and reliability of results. Primer-BLAST is a PCR primer design and specificity checking tool from NCBI. It picks primers using the Primer3 algorithm and then uses BLAST to screen for primers specific to the input template. As with other BLAST searches, the user can limit a Primer-BLAST search to specific taxa. The result presents candidate primers along with their alignment to targets. Primer-BLAST is a web-only application accessible at www.ncbi.nlm.nih.gov/tools/primer-blast/.

To run Primer-BLAST, either the RefSeq ID or the nucleotide sequence of the gene to be amplified should be used. If the sequence ID is not known, go to the Gene database, search for the gene of interest, and find its Reference Sequence (RefSeq), e.g. "NM_203483". The user can also provide both the primer sequences and the template sequence to check the specificity of an existing primer set.

The Primer-BLAST page consists of four sections.

1. PCR Template

Enter the target sequence in FASTA format or an accession number of an NCBI nucleotide sequence in the PCR Template section of the form. If the NCBI mRNA reference sequence accession number is used, the tool will automatically design primers that are specific to that splice variant.

2. Primer Parameters

PCR product size: Product size selection depends on the type of PCR one intends to do. For efficient amplification in real-time RT-PCR, primers should be designed so that the size of the product is <200 bp.

Number of primers to return: This is up to the user; enter at least 10.

Melting temperature: As a rule, aim for a minimum of 57°C and a maximum of 63°C; the ideal melting temperature is 60°C (with a maximum difference of 3°C between the Tm values of the two primers).

3. Exon/intron selection

To avoid amplification of contaminating genomic DNA, design primers or probes so that one half hybridizes to the 3′ end of one exon and the other half to the 5′ end of the adjacent exon. To do this, simply select "Primer must span an exon-exon junction". Other settings need not be changed.

4. Primer pair specificity checking parameters

In the Primer Pair Specificity Checking Parameters section, select the appropriate source Organism and the smallest Database that is likely to contain the target sequence. These settings give the most precise results. For broadest coverage, choose the nr database and do not specify an organism. Click the "Get Primers" button to submit the search and retrieve specific primer pairs.

Checking the output screen

The primers should end with a C or G residue, because T and A residues can bind more easily to DNA in a non-specific way. Optimal primers also have a GC content of around 50-60% to ensure maximum product stability. Under self complementarity, select the primer pair with the lower score to decrease the possibility of primer-dimer formation.
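The output checks described above (3′ end residue, GC content, melting temperature window) can be sketched in a few lines of Python. This is only an illustrative sketch: the primer sequence is made up, the function names are hypothetical, and the melting temperature uses the simple Wallace rule, Tm = 2(A+T) + 4(G+C), which is an assumption chosen for brevity. Primer-BLAST reports Tm values from a thermodynamic model, so its numbers will generally differ from this rough estimate.

```python
def gc_content(primer):
    """Percentage of G and C residues in the primer."""
    primer = primer.upper()
    return 100.0 * sum(primer.count(b) for b in "GC") / len(primer)


def wallace_tm(primer):
    """Rough Tm estimate (degrees C) by the Wallace rule; an approximation only."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def ends_with_gc(primer):
    """True if the 3' (last) residue is G or C."""
    return primer.upper()[-1] in "GC"


def tm_compatible(tm_forward, tm_reverse, max_diff=3):
    """True if the two primer Tm values differ by at most max_diff degrees."""
    return abs(tm_forward - tm_reverse) <= max_diff


primer = "AGCTGACCTGAAGGCTCTGC"  # hypothetical 20-mer, for illustration only
print(gc_content(primer), wallace_tm(primer), ends_with_gc(primer))
```

The same functions can be applied to each candidate pair copied from the Primer-BLAST result page to screen them against the guidelines listed in this section.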

3. Sequence alignment using BLAST-X

BLAST (Basic Local Alignment Search Tool) is an algorithm for searching large databases for homologs in a reasonably short amount of time. BLAST finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as to help identify members of gene families.

BLAST is used to generate alignments between a nucleotide or protein sequence, referred to as the "query", and nucleotide or protein sequences within a database, referred to as "subject" sequences. A BLAST alignment consists of a pair of sequences in which every letter in one sequence is paired with, or "aligned to", exactly one letter or a gap in the other. The alignment score is computed by assigning a value to each aligned pair of letters and then summing these values over the length of the alignment.

BLAST is accessible at the URL: http://blast.ncbi.nlm.nih.gov/

The programs available in the BLAST suite are: BLAST-N, BLAST-P, BLAST-X, TBLAST-N, TBLAST-X

Each program has the same look and feel as the others, although some programs have more options than others. If you have the coding region of a DNA sequence, it is advisable to translate it into protein before carrying out a search. Because of the redundancy of codon-to-amino-acid translation in the genetic code, a DNA sequence retains far less detectable homology after even a short period of evolutionary time than the protein sequence it encodes.
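The scoring idea described above (a value per aligned pair, summed over the alignment) can be shown with a minimal Python sketch. The match, mismatch and gap values used here are arbitrary illustrative assumptions, and the function name alignment_score is hypothetical; BLAST itself uses substitution matrices and separate gap opening and extension penalties rather than this flat scheme.

```python
def alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Sum per-column scores for two already-aligned, equal-length sequences.

    '-' marks a gap. The default scores are illustrative values only.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            score += gap        # a letter aligned to a gap
        elif a == b:
            score += match      # identical letters
        else:
            score += mismatch   # different letters
    return score


# Six matching columns and one gap column: 6 * 1 + (-2) = 4
print(alignment_score("GATTACA", "GAT-ACA"))
```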

BLAST-X searches protein databases using a translated nucleotide query. It is used for identifying potential protein products encoded by a nucleotide query.

BLAST-X Input

The BLAST-X web page has the following sections:

1. Enter Query Sequence provides a place to input or upload the query sequence, and optionally to select a query subrange. In the "Enter Query Sequence" section, a "Genetic code" field is present under the "Choose File" button to specify the codon table to be used in the translation of the input nucleotide query. Choose a code appropriate for the source of the query sequence.

2. Choose Search Set is where one selects a database and optionally limits the search by an organism or Entrez query. The default database is "Non-redundant protein sequences" (nr). An option to limit the search to any organism, or to exclude some, is also provided.
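To make the idea of a "translated nucleotide query" concrete, the sketch below translates a short DNA string in all six reading frames using the standard genetic code. It is an illustrative sketch only, not BLAST-X's implementation: the function names and the toy sequence are assumptions, and codons containing letters other than A, C, G, T are shown as X.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Return translations for frames +1..+3 and -1..-3."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames["+%d" % (offset + 1)] = translate(seq[offset:])
        frames["-%d" % (offset + 1)] = translate(rc[offset:])
    return frames


for name, protein in sorted(six_frames("ATGGCCTAA").items()):
    print(name, protein)
```

Choosing a different "Genetic code" on the BLAST-X form corresponds to swapping the codon table used in such a translation.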

3. 'Algorithm parameters' is a link to a page section that lets one change the parameters of the selected BLAST algorithm.

BLAST-X OUTPUT

Figure: Graphical representation of the BLAST-X output

Figure: Table showing the scores and coverage of alignment

Figure: Actual alignment between the query and the subject

Exercise: Download the nucleotide sequence of Homo sapiens hemoglobin, delta (NM_000519.3) and do a BLAST-X search against the nr database. Interpret your results.

Procedure:

Step 1: Search for NM_000519.3 in the NCBI Nucleotide database.

Step 2: Retrieve the file and click the BLAST link on the NCBI page.

Step 3: In the BLAST page, NM_000519.3 appears in the query box.

Step 4: Select the 'blastx' tab at the top of the search box and, keeping the defaults, click the BLAST button.

4. Multiple Sequence Alignment using T-Coffee

T-Coffee is a versatile Multiple Sequence Alignment method suitable for aligning most types of biological sequences. The T-Coffee server is hosted by the Centre for Genomic Regulation (CRG) of Barcelona and is available on the web at http://www.tcoffee.org/

1. Choose your T-Coffee flavor

T-Coffee can align protein sequences as well as DNA/RNA sequences, and different modes are available. On the main page, choose the T-Coffee mode according to your requirements.

2. Enter your sequences

Fill in the form with the required data. In the most basic case you only have to enter your sequences. When you have entered all the data, click the Submit button to process your alignment.

3. Get your alignment

Depending on how complex your input sequences are, the alignment can take a few seconds or several minutes to complete. You don't need to keep your browser open to wait for the result. You can close it or navigate away to another page.

If you have entered your email address in your request, you will be notified in your inbox when the alignment is complete. In any case, you can check the status of your alignment request through the History page available on the top navigation bar. When the alignment is complete, it will be displayed in a page like the following. The alignment result will be available on the T-Coffee server for seven days.

5. Gene prediction using GENSCAN

GENSCAN is a general-purpose gene identification program which analyzes genomic DNA sequences from a variety of organisms including human, other vertebrates, invertebrates and plants. GENSCAN identifies complete exon/intron structures of genes in genomic DNA. Novel features of the program include the capacity to predict multiple genes in a sequence, to deal with partial as well as complete genes, and to predict consistent sets of genes occurring on either or both DNA strands. This set of exons/genes is then printed to an output file (the text output) together with the corresponding predicted peptide sequences. A graphical (PostScript) output may also be created which displays the location and DNA strand of each predicted exon.

Unlike the majority of other currently available gene prediction programs, the model treats the most general case, in which the sequence may contain no genes, one gene, or multiple genes on either or both DNA strands, and partial genes as well as complete genes are considered. The most important restrictions are that only protein coding genes are considered (and not tRNA or rRNA genes, for example), and that transcription units are assumed to be non-overlapping.

Procedure

Step 1: Open the GENSCAN server at MIT (http://genes.mit.edu/GENSCAN.html). Paste the genome sequence to be analysed into the large white box.

Step 2: Select the required options from the pull-down menu. Lower values of the suboptimal exon cutoff will return irrelevant results.

Step 3: Run GENSCAN.

Exercise: Using GENSCAN, identify predicted genes within a genome sequence.

Explanation:

Column   Description
------   -------------------------------------------------------------
Gn.Ex.   gene number, exon number (for reference)

Type     Init = Initial exon
         Intr = Internal exon
         Term = Terminal exon
         Sngl = Single-exon gene
         Prom = Promoter
         PlyA = poly-A signal

S        DNA strand (+ = input strand; - = opposite strand)

Begin    beginning of exon or signal (numbered on input strand)

End      end point of exon or signal (numbered on input strand)

Len      length of exon or signal (bp)

Fr       "absolute reading frame" relative to start of sequence.
         For example, if nucleotides 1,2,3 of the sequence are read as a codon, that's called reading frame 0. If 2,3,4 are read as a codon, that's reading frame 1. If 3,4,5 are read as a codon, that's reading frame 2, and so on. This information, together with the starting and ending positions of the exon, is sufficient to give the amino acid sequence encoded by the exon. Another use of the reading frame is that if you see two adjacent predicted exons separated by a relatively short intron which share the same reading frame, it may be worth looking at the possibility that the intervening intron is not correct, i.e. that the two exons plus the intervening intron might form one long exon (assuming there are no in-frame stops in the intron, of course).

Ph       "net phase" of exon (exon length modulo 3)
         For example, an exon of length 15 bp has net phase 0 since 15 is divisible by 3, an exon of length 16 bp has net phase 1 because 16 divided by 3 leaves a remainder of 1, an exon of length 17 bp has net phase 2, and an exon of length 18 bp has net phase 0 again. The point of this is that exons whose net phase is 0 can be omitted from the gene without disrupting the reading frame: such exons are candidates for being either 1) incorrect, or 2) alternatively spliced.

I/Ac     initiation signal or acceptor splice site score (x 10)
         (If below zero, probably not a real acceptor site.)

Do/T     donor splice site or termination signal score (x 10)
         (If below zero, probably not a real donor site.)

CodRg    coding region score (x 10)
         Low coding region scores may indicate potentially incorrect predictions or genes with unusual amino acid and/or codon usage patterns.

P        probability of exon (sum over all parses containing exon)
         This quantity is the probability that the predicted exon is correct.

Tscr     exon score (depends on length, I/Ac, Do/T and CodRg scores)
         An overall measure of exon quality based on local sequence properties
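The Fr and Ph columns described above are simple modular arithmetic, which the following Python sketch reproduces. The function names are hypothetical and are not part of GENSCAN; the sketch follows the worked examples in the explanation (a codon starting at nucleotide 1 is frame 0; a 16 bp exon has net phase 1) and assumes 1-based, inclusive Begin and End coordinates.

```python
def absolute_frame(codon_start):
    """Reading frame (0, 1 or 2) of a codon whose first base is at a 1-based position."""
    return (codon_start - 1) % 3


def exon_length(begin, end):
    """Length in bp from inclusive Begin/End coordinates (order does not matter)."""
    return abs(end - begin) + 1


def net_phase(length):
    """Net phase of an exon: its length modulo 3."""
    return length % 3


# The examples from the column descriptions
print([absolute_frame(p) for p in (1, 2, 3)])      # [0, 1, 2]
print([net_phase(n) for n in (15, 16, 17, 18)])    # [0, 1, 2, 0]
```

Exons with net_phase equal to 0 are the ones the explanation flags as removable without disturbing the downstream reading frame.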

6. Promoter prediction using Promoter 2.0 Prediction Server

Promoter 2.0 predicts transcription start sites of vertebrate RNA Polymerase II promoters in DNA sequences. It has been developed as an evolution of simulated transcription factors that interact with sequences in promoter regions. It builds on principles that are common to neural networks and genetic algorithms.

INPUT

1. Specify the input sequences

The sequences intended for processing can be input in the following two ways:

• Paste a single sequence (just the nucleotides) or a number of sequences in FASTA format into the upper window of the main server page.
• Select a FASTA file on your local disk, either by typing the file name into the lower window or by browsing the disk.

Both ways can be employed at the same time: all the specified sequences will be processed. The allowed input alphabet is A, C, G, T and X (unknown); all the other symbols will be converted to X before processing.
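The alphabet rule stated above can be mimicked locally to preview how a sequence will look after conversion. The sketch below is an assumption-laden illustration, not the server's code: the function name sanitize_sequence is hypothetical, and it assumes that whitespace is dropped and that lower-case letters are treated the same as upper-case ones, neither of which is stated in the server description.

```python
import re


def sanitize_sequence(raw):
    """Keep A, C, G, T, X; convert every other symbol to X.

    Assumptions (not from the server documentation): whitespace is removed
    first and the sequence is upper-cased.
    """
    seq = re.sub(r"\s+", "", raw).upper()
    return re.sub(r"[^ACGTX]", "X", seq)


print(sanitize_sequence("acgtnRYx\nACGT"))  # ACGTXXXXACGT
```

Counting the X characters in the result gives a quick indication of how many ambiguous symbols the input contained.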

2. Select the output format

Click on the "Full output" button if you want the input sequences to be included in the server output. The default output format shows the predictions only.

3. Submit the job

Click on the "Submit" button. The status of your job (either 'queued' or 'running') will be displayed and constantly updated until it terminates and the server output appears in your browser window. At any time during the wait you may enter your e-mail address and simply leave the window. Your job will continue; you will be notified by e-mail when it has terminated. The e-mail message will contain the URL under which the results are stored; they will remain on the server for 24 hours for you to collect them.

OUTPUT

EXAMPLE OUTPUT

Promoter 2.0 Prediction Results

INPUT SEQUENCE:
>gi_209811_gb_J01917_ADRCG Adenovirus type 2, complete genome.
CATCATCATAATATACCTTATTTTGGATTGAAGCCAATATGATAATGAGGGGGTGGA
GTTTGTGACGTGGCGCGGGGCGTGGGAACGGGGCGGGTGACGTAGTAGTGTGGCGG
AAGTGTGATGTTGCAAGTGTGGCGGAACACATGTAAGCGCCGGATGTGGTAAAAGT
GACGTTTTTGGTGTGCGCCGGTGTATACGGGAAGTGACAATTTTCGCGCGGTTTTAG
GCGGATGTTGTAGTAAATTTGGGCGTAACCAAGTAATGTTTGGCCATTTTCGCGGGA
AAACTGAATAAGAGGAAGTGAAATCTGAATAATTCTGTGTTACTCATAGCGCGTAAT
ATTTGTCTAGGGCCGCGGGGACTTTGACCGTTTACGTGGAGACTCGCCCAGGTGTTT
TTCTCAGGTGTTTTCCGCGTTCCGGGTCAAAGTTGGCGTTTTATTATTATAGTCAGCT
GACGCGCAGTGTATTTATACCCGGTGAGTTCCTCAAGAGGCCACTCTTGAGTGCCAG
CGAGTAGAGTTTTCTCCTCCGAGCCGCTCCGACACCGGGACTGAAAATGAGACATAT

TATCTGCCACGGAGGTGTTATTACCGAAGAAATGGCCGCCAGTCTTTTGGACCAGCT
GATCGAAGAGGTACTGGCTGATAATCTTCCACCTCCTAGCCATTTTGAACCACCTAC
CCTTCACGAACTGTATGATTTAGACGTGACGGCCCCCGAAGATCCCAACGAGGAGG
CGGTTTCGCAGATTTTTCCCGAGTCTGTAATGTTGGCGGTGCAGGAAGGGATTGACT
TATTCACTTTTCCGCCGGCGCCCGGTTCTCCGGAGCCGCCTCACCTTTCCCGGCAGCC
CGAGCAGCCGGAGCAGAGAGCCTTGGGTCCGGTTTCTATGCCAAACCTTGTGCCGG
AGGTGATCGATCTTACCTGCCACGAGGCTGGCTTTCCACCCAGTGACGACGAGGATG
AAGAGGGTGAGGAGTTTGTGTTAGATTATGTGGAGCACCCCGGGCACGGTTGCAGG
TCTTGTCATTATCACCGGAGGAATACGGGGGACCCAGATATTATGTGTTCGCTTTGC
TATATGAGGACCTGTGGCATGTTTGTCTACAGTAAGTGAAAATTATGGGCAGTCGGT
GATAGAGTGGTGGGTTTGGTGTGGTAATTTTTTTTTAATTTTTACAGTTTTGTGGTTT
AAAGA

PREDICTED TRANSCRIPTION START SITES:
gi_209811_gb_J01917_ADRCG Adenovirus type 2, complete genome., 1200 nucleotides

Position  Score  Likelihood
600       1.063  Highly likely prediction

OUTPUT DESCRIPTION

For each input sequence the name and length are first printed, followed by a table in the form:

Position  Score  Likelihood

'Position' is a position in the sequence where a promoter has been identified.

'Score' is the prediction score for a transcription start site occurring within 100 base pairs upstream from that position.

'Likelihood' is a descriptive label associated with that score.

The scores are always positive numbers; they are labelled as follows:

Below 0.5    Ignored
0.5 - 0.8    Marginal prediction
0.8 - 1.0    Medium likely prediction
Above 1.0    Highly likely prediction

The positions scoring 0.5-0.8 (Marginal predictions) contain about 65% true transcription start sites within 100 base pairs upstream. The positions scoring 0.8-1.0 (Medium likely predictions) are about 80% true. Finally, the positions scoring above 1.0 (Highly likely predictions) are about 95% true. On average, the software picks up about 80% of all Pol II promoters. These numbers are rough estimates based on a limited test set.

The input sequence will be included in the output, preceding the predictions, if "Full output" has been selected.
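The score-to-label table above can be written as a small lookup function, which is handy when tabulating many predictions by hand. This Python sketch is illustrative only: the function name likelihood_label is hypothetical, and because the table does not say which label applies to a score of exactly 0.8, the sketch assumes that 0.8 falls in the "Medium likely" band and that exactly 1.0 is still "Medium likely" (since "Highly likely" is stated as above 1.0).

```python
def likelihood_label(score):
    """Map a Promoter 2.0 score to its descriptive label.

    Boundary handling at 0.8 is an assumption; the source table
    lists 0.8 in two adjacent ranges.
    """
    if score < 0.5:
        return "Ignored"
    if score < 0.8:
        return "Marginal prediction"
    if score <= 1.0:
        return "Medium likely prediction"
    return "Highly likely prediction"


# The example output reports position 600 with score 1.063
print(likelihood_label(1.063))  # Highly likely prediction
```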

7. Identify conserved domains of proteins using CD-Search

The CD-Search service is a web-based tool for the detection of conserved domains in protein sequences. It can therefore help to elucidate the protein's function. The CD-Search service uses RPS-BLAST (Reverse Position-Specific BLAST) to compare a query protein sequence against conserved domain models that have been collected from a number of source databases, and presents results as a concise display (default), standard display, or full display.

If CD-Search finds a specific hit, there is high confidence in the association between the protein query sequence and a conserved domain. The other types of hits that can be found also shed light on the putative function of the query protein. If conserved features are found, they are designated by small triangles in the graphical summary of the search results, indicating the specific amino acids likely involved in functions such as catalysis or binding.

The user can submit a protein or nucleotide query sequence to CD-Search, either as a sequence identifier (i.e., as an accession or GI number that is valid in the NCBI Entrez system), or as FASTA-formatted or bare sequence data. Clicking the Submit button will start CD-Search with default settings for search sensitivity and display options.

The options available in CD-Search are the following:

Database Selection: Select the database against which the search is to be conducted. CDD is the default database. It is a superset of the databases listed in the dropdown menu (Pfam, SMART, COG, PRK, and TIGRFAM), with the exception of KOG, which is the eukaryotic counterpart of the COG database.

Expect Value (E-value): A parameter that describes the number of hits one can "expect" to see by chance when searching a database of a particular size. False positive results should be very rare with the default setting of 0.01 (use a more conservative, i.e. lower, setting for more reliable results). Results with E-values of 1 and above should be considered putative false positives.

Low Complexity Filter: Filters query sequences for compositionally biased regions. If filtering is turned ON, these regions are flagged and largely ignored during the search phase (the default setting is OFF).

Force Live Search: Use this option if the query is a GI or accession number of a protein sequence already in the Entrez Protein database and you prefer to see live rather than precalculated CD-Search results.

Rescue Borderline Hits: Shows hits that have an E-value above the RPS-BLAST reporting threshold (anywhere between 0.01 and 1.0) and that are consistent with known domain architectures. A rescued hit is displayed with a dashed border, and its E-value is displayed in red.

Suppress Weak Overlapping Hits: Suppresses hits that have an E-value close to the RPS-BLAST reporting threshold (between 0.001 and 0.01) but overlap with stronger hits.

Maximum number of hits: Limits the size of the hit list produced by CD-Search. For average-sized proteins the expected number of domain hits is small, and the default setting of 500 should be more than sufficient.
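The E-value ranges quoted for these options can be read as a simple triage of hits. The following sketch applies them to made-up hits. The Hit class, the triage function and the label strings are hypothetical names introduced for illustration, and the architecture check that CD-Search performs before rescuing a borderline hit is not modelled.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    name: str
    start: int      # first query residue covered (inclusive)
    end: int        # last query residue covered (inclusive)
    evalue: float


def overlaps(a, b):
    """True if two hits cover at least one common query residue."""
    return a.start <= b.end and b.start <= a.end


def triage(hits, threshold=0.01, weak_floor=0.001, false_positive_from=1.0):
    """Label each hit using the E-value ranges quoted in the text."""
    labels = {}
    for hit in hits:
        if hit.evalue >= false_positive_from:
            labels[hit.name] = "putative false positive"
        elif hit.evalue > threshold:
            # Candidate for 'Rescue Borderline Hits' (architecture check omitted).
            labels[hit.name] = "borderline"
        elif hit.evalue > weak_floor and any(
            other is not hit and other.evalue < hit.evalue and overlaps(hit, other)
            for other in hits
        ):
            # Candidate for 'Suppress Weak Overlapping Hits'.
            labels[hit.name] = "weak overlapping"
        else:
            labels[hit.name] = "reported"
    return labels


if __name__ == "__main__":
    example = [
        Hit("A", 1, 100, 1e-30),
        Hit("B", 50, 120, 0.005),
        Hit("C", 200, 250, 0.2),
        Hit("D", 300, 350, 3.0),
    ]
    for name, label in triage(example).items():
        print(name, label)
```

The thresholds are keyword arguments so that a stricter E-value cut-off can be tried without editing the function.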

Result Mode: Allows the user to select the level of detail displayed in the search results:

• Concise mode shows only the best-scoring domain model for each region of the query sequence.
• Standard mode shows the best-scoring domain model from each source database for each region of the query sequence.
• Full mode shows all hits for each region of the query sequence.

The example below is the search result for protein GI 157830769 (Cyclodextrin Glucanotransferase), run using the default database (CDD) with the result mode set to Concise. Hit types in the concise display include specific hits, the superfamily to which the highest-ranking hit belongs, and multi-domain models.
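The difference between the three result modes can be sketched as a grouping rule applied to a hit list. The example below is a simplified illustration under stated assumptions: hits are plain dictionaries with hypothetical keys (model, region, source, score), and each hit is assumed to have already been assigned to a region of the query, which CD-Search determines itself.

```python
def select_hits(hits, mode="concise"):
    """Return the hits a given result mode would keep (illustrative only)."""
    if mode == "full":
        return list(hits)                    # every hit for every region
    if mode == "concise":
        def key(hit):
            return hit["region"]             # one best model per region
    elif mode == "standard":
        def key(hit):
            return (hit["region"], hit["source"])  # best per region and database
    else:
        raise ValueError(f"unknown mode: {mode}")

    best = {}
    for hit in hits:
        group = key(hit)
        if group not in best or hit["score"] > best[group]["score"]:
            best[group] = hit
    return list(best.values())


if __name__ == "__main__":
    # Made-up hits; the model names and scores are placeholders.
    example = [
        {"model": "m1", "region": 1, "source": "cd", "score": 300.0},
        {"model": "m2", "region": 1, "source": "pfam", "score": 250.0},
        {"model": "m3", "region": 1, "source": "pfam", "score": 120.0},
        {"model": "m4", "region": 2, "source": "smart", "score": 90.0},
    ]
    for mode in ("concise", "standard", "full"):
        print(mode, [hit["model"] for hit in select_hits(example, mode)])
```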

8. Gene/Protein function analysis using PANTHER

The PANTHER (Protein ANalysis THrough Evolutionary Relationships) classification system was designed to classify proteins (and their genes) in order to facilitate high-throughput analysis. In PANTHER, proteins are classified according to:

• Family and subfamily: families are groups of evolutionarily related proteins; subfamilies are related proteins that also have the same function.
• Molecular function: the function of the protein by itself or with directly interacting proteins at a biochemical level, e.g. a protein kinase.
• Biological process: the function of the protein in the context of a larger network of proteins that interact to accomplish a process at the level of the cell or organism, e.g. mitosis.
• Pathway: similar to biological process, but a pathway also explicitly specifies the relationships between the interacting molecules.

The PANTHER home page can be accessed at http://www.pantherdb.org/

The home page provides direct links to the most commonly used PANTHER tools and features. By clicking the folder tabs at the top, users can switch between different tools. The first tab, which is also the page shown when you first arrive at PANTHER, provides a set of tools for analyzing a list of genes or proteins. On this page you can upload a list of genes/proteins and obtain a functional classification. The results can be displayed as either a gene list page or a pie chart.

[Figure: PANTHER home page]

Functional Analysis

Step 1. Enter IDs of proteins/genes to be analyzed: You can enter gene or protein identifiers by typing them into the Enter IDs box. You can also copy and paste multiple IDs into the box, separated by spaces or commas.

Step 2. Select organism: Because some identifiers, such as gene symbols, are not organism specific, it is important to select the organism that your IDs come from. You can also select multiple organisms, especially when running the functional classification.
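When a gene list comes from a spreadsheet or a text file, it can help to tidy it into the space- or comma-separated form described in Step 1 before pasting. The sketch below is a minimal illustration. The function name parse_ids is hypothetical, the gene symbols are only examples, and dropping duplicates is an assumption made here for convenience rather than something PANTHER requires.

```python
import re


def parse_ids(text):
    """Split pasted text on commas and whitespace, keeping the first copy of each ID."""
    seen = set()
    ids = []
    for token in re.split(r"[,\s]+", text.strip()):
        if token and token not in seen:
            seen.add(token)
            ids.append(token)
    return ids


if __name__ == "__main__":
    pasted = "TP53, BRCA1 EGFR,TP53\nMYC"
    ids = parse_ids(pasted)
    print(", ".join(ids))   # a clean list ready to paste into the Enter IDs box
```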

Step 3. Select analysis: Choose one of the following tools to analyze your list.

i) Functional classification viewed in gene list: The results are displayed on a gene list page with terms from Gene Ontology, PANTHER Protein Class, and PANTHER Pathway.
ii) Functional classification viewed in pie chart: This tool is the same as above, but the results are displayed in a pie chart.

Click the Submit button to reach the results page. Depending on the analysis method you selected, you will get one of the following results pages.

Results Page

Gene list page

The gene list page contains the following information:

Gene ID - The identifier for the gene in the PANTHER library.
Gene Name/Gene Symbol - The Entrez gene definition and gene symbol.
PANTHER Family/Subfamily - The name of the family or subfamily of the PANTHER model to which the sequence belongs.
PANTHER Protein Class - The PANTHER Index terms describing protein classes.
Species - The organism the gene comes from.

The page offers a Convert List to option, with which you can convert the list to display the pathways in which the given proteins/genes are involved.

Graph and diagram page

Pie charts

A pie chart can be generated from a gene list. If it is a whole-genome pie chart, you can choose the ontology you want to display from the Select ontology drop-down menu. The page also provides links that allow you to convert the pie chart to a bar chart.
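The slices of such a pie chart are simply the share of genes falling into each category. As a rough illustration, the sketch below computes those percentages from a made-up mapping of genes to a single category each. The function name and the gene and class labels are hypothetical, and a gene that belongs to several categories (which can happen in practice) is not handled.

```python
from collections import Counter


def pie_fractions(gene_to_class):
    """Return the percentage of genes in each category (one category per gene)."""
    counts = Counter(gene_to_class.values())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: 100.0 * n / total for category, n in counts.items()}


if __name__ == "__main__":
    # Placeholder genes and categories, for illustration only.
    example = {
        "gene1": "kinase",
        "gene2": "kinase",
        "gene3": "transcription factor",
        "gene4": "unclassified",
    }
    for category, percent in pie_fractions(example).items():
        print(f"{category}: {percent:.1f}%")
```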

9. Protein-Protein interaction studies using STRING

Protein-protein interaction networks are an important ingredient for the system-level understanding of cellular processes. Such networks can be used for filtering and assessing functional genomics data and for providing a platform for annotating structural, functional and evolutionary properties of proteins. STRING is a web-based tool that helps in identifying known interactions between a set of proteins.

[Figure: STRING home page]

The STRING site can be searched using a single protein name, multiple names, or an amino acid sequence. Most commonly, a protein name or identifier is used. Select Multiple proteins from the menu on the left side of the page and paste protein identifiers or names into the search box. The organism can be selected by clicking on the arrow or by typing its name directly into the corresponding input field. Click the Search button and you will be taken to an intermediate page that lists the proteins STRING has identified as matching your input. The intermediate page shows the following message:

The following proteins in Mus musculus appear to match your input. Please review the list, then click 'Continue' to proceed.

Clicking the Continue button takes you to the results page, which shows the network of interactions between the proteins you submitted. The network has six tabs at the bottom:

1. Viewers
2. Legend
3. Settings
4. Analysis
5. Exports
6. Clusters

[Figure: STRING result page]

The network view summarizes the network of predicted associations for a particular group of proteins. The network nodes are proteins, and the edges represent the predicted functional associations. The edges are drawn according to the view settings:

• Red line - fusion evidence
• Green line - neighborhood evidence
• Blue line - cooccurrence evidence
• Purple line - experimental evidence
• Yellow line - textmining evidence
• Light blue line - database evidence
• Black line - coexpression evidence

Clicking on a node gives several details about the protein. Clicking on an edge displays a detailed evidence breakdown.
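The colour legend above can be kept as a small lookup table when annotating an exported network by hand. The sketch below is only an illustration of that idea. The dictionary and function names are hypothetical, and the channel keys are the evidence names used in the list above rather than identifiers taken from any STRING file format.

```python
# Evidence type -> line colour, as listed in the legend above.
EVIDENCE_COLOURS = {
    "fusion": "red",
    "neighborhood": "green",
    "cooccurrence": "blue",
    "experimental": "purple",
    "textmining": "yellow",
    "database": "light blue",
    "coexpression": "black",
}


def edge_lines(evidence_types):
    """Return the line colours expected for an edge, in legend order."""
    unknown = set(evidence_types) - set(EVIDENCE_COLOURS)
    if unknown:
        raise ValueError(f"unknown evidence type(s): {sorted(unknown)}")
    return [colour for kind, colour in EVIDENCE_COLOURS.items() if kind in evidence_types]


if __name__ == "__main__":
    # An edge supported by experiments and text mining would be drawn with two lines.
    print(edge_lines({"textmining", "experimental"}))
```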

Navigation Buttons

The navigation buttons take you to different aspects of the data, allowing you to change parameters and to see the different types of evidence that support the predicted associations.

Viewers: This tab shows the various lines of evidence behind the protein-protein interaction map that was generated.

Legend: This section shows a list of your input(s). Predicted associations are shown in a list immediately below your input, sorted by score.

Settings: Here you can change the parameters that influence the output. Note that parameters are only changed when you press the 'Update Settings' button.

Analysis: This section gives some brief statistics of the inferred network, such as the number of nodes and edges.

Exports: In this section you can export your current network to different formats. The two most popular are:

• bitmap image - image of the network in the PNG (Portable Network Graphics) file format.
• high-resolution bitmap - image in PNG format, at a resolution of 400 dpi.
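As a rough illustration of the kind of statistics reported in the Analysis section, the sketch below counts nodes and edges from a list of protein pairs. The function name and the protein labels are hypothetical. The average node degree is an extra figure computed here for illustration (twice the number of edges divided by the number of nodes), and no particular export file format is assumed.

```python
def network_stats(edges):
    """Count nodes and undirected edges from an iterable of (protein_a, protein_b) pairs."""
    unique_edges = {frozenset(pair) for pair in edges if pair[0] != pair[1]}
    nodes = set()
    for edge in unique_edges:
        nodes.update(edge)
    n_nodes = len(nodes)
    n_edges = len(unique_edges)
    avg_degree = 2 * n_edges / n_nodes if n_nodes else 0.0
    return {"nodes": n_nodes, "edges": n_edges, "average_degree": avg_degree}


if __name__ == "__main__":
    # Placeholder interactions; A-B is listed twice in opposite orientations.
    example = [("A", "B"), ("B", "A"), ("B", "C"), ("C", "D")]
    print(network_stats(example))
```

Treating each pair as a frozenset means that A-B and B-A are counted as the same undirected edge.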

10. Protein structure analysis using RasMol

RasMol is a molecular graphics program, originally developed by Roger Sayle, intended for the visualisation of proteins, nucleic acids and small molecules. The program reads in a molecule coordinate file and interactively displays the molecule on the screen in a variety of colour schemes and molecule representations. Currently available representations include wireframes, sticks, spacefilling (CPK) spheres, ball and stick, solid and ribbons. RasMol supports the Protein Data Bank (PDB) file format along with a few other formats.

RasMol has two windows: the main graphics window, with a black background, and a command line window, with a white background. Commands can be typed in the command line or chosen from the dropdown menus at the top of the graphics window.

Loading a PDB file

Click: File > Open > 1MBO (PDB ID for oxymyoglobin)

Controlling the molecule using the mouse

Action            Mouse Button and Key
Rotate X-Y        Left Mouse Button
Move Molecule     Right Mouse Button
Zoom In/Out       Left Mouse Button & Shift Key
Rotate Z          Right Mouse Button & Shift Key

Changing background

    background blue

This turns the default black background of the graphics window blue. Other colours such as green, red, white and yellow can also be used.

Selecting regions of interest - Select command

Once a portion of the molecule is selected, all subsequent commands apply only to that selected portion.

    select helices        Alpha helices are selected.
    color green           Only the helices turn green.
    restrict selected     All regions except the selected one disappear.

Selecting residues

To select a residue, give its three-letter code. For histidine:

    select his

Viewing H bonds

    hbonds on

Increase the size of the H bonds using:

    hbonds 20

Distance/angle - Set Picking command

    set picking angle       After clicking on 3 atoms, the command window shows the angle between them.
    set picking distance    Clicking on 2 atoms shows the distance between them.
    set picking label       Click on any atom to label it.
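The values reported when picking distances and angles are ordinary geometric quantities computed from atomic coordinates. The sketch below shows that arithmetic for made-up coordinates. The function names are hypothetical, and the angle is measured at the second of the three atoms, which is assumed here to match the order in which the atoms are clicked.

```python
import math


def distance(a, b):
    """Euclidean distance between two atoms given as (x, y, z) tuples."""
    return math.dist(a, b)


def angle(a, b, c):
    """Angle a-b-c in degrees, measured at the middle atom b."""
    v1 = [p - q for p, q in zip(a, b)]
    v2 = [p - q for p, q in zip(c, b)]
    dot = sum(x * y for x, y in zip(v1, v2))
    cos_theta = dot / (math.hypot(*v1) * math.hypot(*v2))
    # Clamp to [-1, 1] to guard against rounding error before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))


if __name__ == "__main__":
    # Placeholder coordinates in angstroms, not taken from 1MBO.
    atom1, atom2, atom3 = (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0)
    print("distance:", distance(atom1, atom2))
    print("angle:", angle(atom1, atom2, atom3))
```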

Saving an image

From the Export menu, select the file type in which to save the current image.

[Figure: RasMol output]
