Overview:In this exercise, you will install and use ncbi blast+ to search all the proteins encoded in the genome one E. coli strain against all the proteins encoded in a second E. coli strain. Then, you will use Excel to find reciprocal best hits. You will also use DAVID to look for functional enrichment in the set of genes found only in one of the two genomes.
EcoTypeCDS.faa is a fasta file of all 4871 predicted protein coding genes from strain ATCC11775
MG1655CDS.faa is a fasta file of all 4139 predicted protein coding genes from strain MG1655
Part I – Command line BLAST Download and Install ncbi blast+ from the ftp site: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
Launch your command line interface and navigate to the directory containing the lab exercise files.
Make blast databases:
Now, you will use Excel to find Reciprocal Best Hits.
Import one output file into ‘sheet1’ using the Import External Data function. Import the other output file into ‘sheet2’.
In order to create a unique identifier for each hit that will match between the two “sheets”, add an extra column that concatenates the query and subject name into a single field in the same order for both “sheets”. This is the formula I typed in the first blank cell, then I used the fill down command:
=B1&"-"&A1 (for sheet1)
=A1&”-“&B1 (for sheet2)
In the next blank column of sheet1, use the match function to check whether your concatenated identifier is also found in sheet2:
Note, the “ISNA( )” function is negating the “MATCH” function, so the answer “FALSE” means there is a match in sheet2. To figure out why I recommend this, use the next blank column to run it without the ISNA ( ):
Without adding any additional criteria, how many RBH orthologs did you find? 3641 How many proteins are found in each E. coli genome that have no orthologs in the other? Is that number the same as the number of rows in your spreadsheet that don’t correspond to reciprocal best hits? Why or why not? 498 found only in MG1655; 1230 found only in EcoType using the total number of genes in the fasta files listed above. The spreadsheet only includes genes that had a match that met the E<0.001 criteria in our BLAST analysis. Now explore some other criteria you might consider to add specificity to your ortholog predictions. First, delete all the rows that don’t have reciprocal matches or copy the ones that do to a new sheet.
Plot a histogram based on percent identity and describe the results. How could you use this to for exclusion of some RBH from your final ortholog set? Based on the distribution you would likely want to exclude RBH that fell below a reasonable threshold since the vast majority of RBH predicted orthologs are nearly identical. Sort by alignment length. How many of the alignments less than 50 amino acids? Would it be wise to eliminate the rbh orthologs that are based short matches or is there other information you would like to know first? Rewrite the BLAST command that we used above to add this information to the output file. Hint, type “blastp –help” to get a complete list of command line options including additional fields we could have included in the output file.
I think its premature to throw them out until you know the length of the query and subject sequences too. Just add qlen and slen to the –outfmt “6….” Option. Can you think of any other filters you could apply that might improve the ortholog predictions? Based on calculation using the query and subject length above, I’d exclude orthologs predicted if the match encompassed less than 60% of both proteins. Do you think you would get the same or different answer using nucleotide sequences instead of proteins? Why or why not? Would you give the same answer if you were comparing E. coli to human? In this case, you should get a very similar (though likely not exactly the same) answer, but with the much greater evolutionary distance to human, the protein searches are likely to be much better.
Part II-Functional Enrichment
Your task is to identify what type of genes in E. coli MG1655 do not have orthologs in the E. coli Type strain. Based on the ortholog analysis above, I compiled a list of EntrezIDs for MG1655 proteins. The list of 478 proteins (we lost a few in the ID mapping) is found in file:
You will use the DAVID web server to search for categories that are functionally enriched in this set relative to the background of the MG1655 genome as a whole.
Go to the DAVID web site:
Click on the ‘Functional Annotation’ shortcut near the top on the left side of the window. Paste your list of Entrez GeneIDs into the ‘Upload Gene List’ box. Select ‘ENTREZ_GENE_ID’ from the list of possible identifier types. Use the radio button to select ‘Gene List’ rather than ‘Background’ and click the ‘Submit’ button. Results should appear on the right.
Q. You started with 478 GeneIDs. How many of those mapped to DAVID IDs? Why are these numbers different? 445 and most likely the DAVID db is missing those genes. Q. The DAVID system automatically chose a background gene set to compare to your list. Was it the right choice? Yes it chose E. coli Use the +/- list expander buttons to examine the default categories that DAVID will use for the analysis. Specifically, look at the ‘Gene Ontology’ list.
Q. What are the three gene ontology selections that are active? Why are there three? What fraction of the GeneIDs you started with are found in each category? There is one for each of the ontologies, cellular component, biological process and molecular function. They must be curated subsets of the total categories for each ontoloty Click the ‘Functional Annotation Clustering’ button found below the list of categories. This will launch an analysis of categories enriched in your gene list relative to the background set.
Q. How many enriched clusters did DAVID find? 56 Take a look at the terms that are part of the first enriched cluster and answer the following questions.
Q: What is the prevailing biological theme of this cluster? Phenylalanine metabolism
Q: Are all three GO ontologies represented in this cluster? Are multiple GO terms from one ontology included? No cellular component, multiple biological process and molecular function terms
Q: How many MG1655 genes are found in this cluster? How do you get a list of MG1655 genes in a cluster? 20 reported if you click on the red G
Q: Are the p-values of all the terms associated with this cluster equally significant? Are you convinced that the cluster as a whole is enriched in your list? No, Yes. Look at a few more enriched clusters.
Q: Does DAVID help you make sense of the difference between MG1655 and the Type strain? Why or why not? I think so, I’d have had a hard time drawing comparable conclusions just by staring at the list of 478 genes. Q: Describe at least three potentially interesting biological processes or functions enriched in the MG1655 genes that have no orthologs in the Type strain.
MG1655 is likely to degrade styrene, use the ADC pathway to produce 4-amino-butanoate, and is capable of producing either more or very different fimbrae than the Type Strain. There are many alternative insights you may have found. Q: If you ran a DAVID analysis of the Type strain genes without orthologs in MG1655, is it possible that you would see the same clusters enriched? Why or why not? Sure it is, for example, I noted the enrichment for fimbrae in MG1655. I would be there are totally different ( non-orthologous) genes involved with production of fimbrae in the Type strain.