Hi Elaine,

thanks a lot for such detailed suggestions! In my case the situation might be much simpler, because I'd like to draw conclusions about structure-function relationships in a set of olfactory receptor proteins which
- are all rhodopsin-like (class A) GPCRs sharing 40-70% sequence identity,
- have the same functionally relevant motifs,
- interact with only one type of transducer, Golf,
- but (!!) can interact with a wide range of odorants (ligands).

Since the TM bundle of these proteins is highly conserved compared with the extracellular part (especially the loops), I should focus on the extracellular (less conserved) part, which is mainly involved in ligand binding (because above all I'd like to predict mutations that affect ligand binding).

BTW, given the very high level of sequence divergence among GPCRs (less than 40% sequence identity within the rhodopsin-like subfamily) but the high structural conservation of all these proteins (RMSD < 4 Å), could one conclude that these proteins are highly tolerant of mutations affecting function (the phenomenon called buffering in evolutionary biology)?

A technical question: is it possible in Chimera (using Multalign Viewer, for instance) to reduce the number of sequences used in the conservation diagrams as well as in the multiple alignment display (for instance, to hide sequences from the full set that will not be used)? Previously I've had to make new alignment.fasta files with different numbers of sequences and load one into Chimera each time I needed to compare a different number of sequences. What I'd really like is to do it within one open Chimera session, with the maximum set of sequences loaded and the unused sequences hidden.

Thanks a lot for your help,
James

2014-09-02 0:07 GMT+04:00 Elaine Meng <meng@cgl.ucsf.edu>:
Hi James, That’s a really broad question… usually that’s why people are interested in showing conservation on some structure: to highlight the important residues, and in combination with structural analysis and/or mutagenesis, to suggest their roles in structure and function. I don’t have specific papers in mind, but there should be no shortage if you try a few keyword searches.
There is a bit more discussion in my page on “sources of sequence alignments” as to choosing the set of sequences, or source (web database or web server) of a set of sequences for calculating conservation. <http://www.cgl.ucsf.edu/home/meng/sources.html> (That page is also linked to the “mapping sequence conservation” tutorial.)
However, your question is really about making judgement calls as they pertain to your specific research project, which may be beyond the scope of this Chimera list, and further, I don’t claim to be an authority on the subject. Logic would dictate that if you are looking for residues important in function X, then you would want to include only sequences of proteins that perform function X, but if you are looking for patterns of conservation among related functions X,Y,Z in some set of homologous proteins, you would use a broader set. Maybe it is known what percent IDs give the proper boundaries in your situation, but more often it is not known. By boundaries I mean how much variability can be in your set without including some proteins that don’t have the function of interest.
The percent ID filtering you mention is a little bit different - that filtering would remove sequences from the set that are similar to another within the specified cutoff, but could be considered separately from the issue of how broadly variable the set is in the first place (for example, only class A GPCRs, only amine neurotransmitter-binding GPCRs, or only Gs-stimulating GPCRs). In other words, a data set could be big for at least two different reasons: (1) it’s broad, (2) it’s redundant. For redundancy-filtering, I’d personally only filter down to get an alignment that is not too big for Chimera, and err on the side of including lots of sequences as long as the overall range of the set is appropriate for the function of interest. There are sequence-weighting options in the AL2CO conservation scoring in Chimera that may help to mitigate any under- and over-representation of subsets of the sequences.
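To make the redundancy-filtering idea concrete, here is a minimal Python sketch. It is not what Chimera does internally; it assumes a simple definition of percent identity (matches divided by the number of columns where both sequences have a residue) and a greedy keep-or-drop pass. The sequence names and the 90% cutoff are made up for illustration.

```python
def percent_identity(a, b):
    """Percent identity over columns where both aligned sequences have a residue.

    Assumption: gap columns ('-' or '.') are skipped; other tools may use a
    different denominator (e.g. the shorter sequence length).
    """
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x not in "-." and y not in "-."]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def filter_redundant(seqs, cutoff=90.0):
    """Greedy filter: keep a sequence only if it is below `cutoff` percent
    identity to every sequence already kept.

    `seqs` maps name -> aligned sequence; the result depends on input order.
    """
    kept = {}
    for name, seq in seqs.items():
        if all(percent_identity(seq, k) < cutoff for k in kept.values()):
            kept[name] = seq
    return kept


# Hypothetical toy alignment (names and sequences are illustrative only).
alignment = {
    "OR_a": "MKTLLVA-GG",
    "OR_b": "MKTLLVA-GG",  # identical to OR_a, so it is removed
    "OR_c": "MKSLIVAAGG",
    "OR_d": "MRTIMVS-AG",
}
nonredundant = filter_redundant(alignment, cutoff=90.0)
print(sorted(nonredundant))
```

Note that this only addresses reason (2), redundancy. It does nothing about reason (1): whether the set is too broad for the function of interest still has to be decided when choosing the sequences in the first place.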
Disclaimer: this is just my opinion as someone who has dabbled in the area, and others with more experience may have divergent views or better ideas of how to attack the problem!
I hope this helps,
Elaine
-----
Elaine C. Meng, Ph.D.
UCSF Computer Graphics Lab (Chimera team) and Babbitt Lab
Department of Pharmaceutical Chemistry
University of California, San Francisco
On Sep 1, 2014, at 2:55 AM, James Starlight <jmsstarlight@gmail.com> wrote:
Elaine,
I have not fully finished your tutorial, but I have one question: have you seen any interesting papers covering my problem, namely predicting the functional properties of certain amino acids (motifs) based on analysis of the 3D structure of the protein of interest together with analysis of the sequences of closely related homologues?
In particular, I'm interested in how many sequences I should include in the MSA and what threshold for sequence identity (against my target protein) should be chosen. E.g., in the case where I deal with a set of G-protein coupled receptors (which have low sequence similarity but high structural conservation), I've obtained 2 different pictures of the conserved a.a. motifs when I used i) only several templates with low (40%) sequence identity vs. ii) a lot of sequences with higher identity (up to 60%). In the latter case I obtained much higher conservation in the motifs seen in the analysis of SS (which is trivial!), whereas in case i) there were only several highly conserved motifs. Does this mean that analysis of BIG datasets with higher sequence identity produces greater uncertainty in the final results, because we cannot conclude which conserved elements are *really* functionally important?
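As a rough illustration of the comparison described above, the following Python sketch scores each alignment column by the fraction of sequences sharing the most common residue, then lists the fully conserved columns for a narrow set and a broader one. This is a deliberately simple measure, not the AL2CO scoring used in Chimera, and the toy sequences are invented.

```python
from collections import Counter


def column_conservation(seqs):
    """Fraction of the most common residue in each column, ignoring gaps.

    Columns that contain only gaps score 0.0.
    """
    scores = []
    for column in zip(*seqs):
        residues = [c.upper() for c in column if c not in "-."]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(residues))
    return scores


def conserved_columns(seqs, threshold=1.0):
    """0-based indices of columns whose conservation is at least `threshold`."""
    return [i for i, s in enumerate(column_conservation(seqs)) if s >= threshold]


# Hypothetical toy data: a narrow set of close homologues, and a broader set
# that adds two more divergent sequences.
narrow_set = ["MKTLLV", "MKTLIV", "MKTLLV"]
broad_set = narrow_set + ["MRSALV", "MQTGMV"]

print(conserved_columns(narrow_set))  # many columns look conserved
print(conserved_columns(broad_set))   # only a few columns remain conserved
```

In this toy case the narrow set reports five of six columns as conserved, while the broad set leaves only two, so the apparent number of "conserved motifs" depends strongly on how similar the chosen sequences are to each other.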
James