Restricting "select by conservation" to a subset of sequences in a multiple alignment

Hi,
I very much like the sequence tools within Chimera - I find the capacity to load in multiple alignments and select by conservation particularly handy.
I have two suggestions that I would love to see in a future version of Chimera (or maybe they are already in there and I don’t know about them!):
1. Selection of sub-groups of sequences within a multiple alignment for selecting residues by conservation. This would enable the easy selection/identification of residues that are specifically conserved within individual families with different functions or substrates - for example, a position that is an arginine, and always an arginine, in one subgroup of proteins that we know have a particular function, and always a glutamate in another group of proteins that we know possess a different but related function. It is possible to do this currently by loading in multiple different sequence alignments and saving and intersecting selections, but having a popup menu in the select/render by attribute tab where you could select or deselect sequences that contribute to the analysis of conservation would make it much easier.
2. Optional addition of sequence similarity, as well as identity, to the analysis of sequence conservation would make analysis of alignments including quite divergent sequences much more meaningful. Even better would be if the definition of similar groups for this purpose was user-adjustable and had allowance for overlapping groups of similar residues.
Cheers,
Oliver Clarke.

Hi Oliver,
There are some capabilities that at least partially cover these requests, which I'll try to outline here…
(1) Subset analysis. If you have a tree being shown with your alignment, you can click a node in the tree and choose to open that branch of the alignment in a separate window. Then you can analyze that subset of sequences using the separate Multalign Viewer window. We don't have anything to limit analysis to a subset of what's in a sequence window, though. Trees to go with alignments are available from several sources, such as PFAM. You open the alignment, then from the alignment window menu choose Tree… Load, then browse to the tree file.
<http://www.rbvi.ucsf.edu/chimera/docs/ContributedSoftware/multalignviewer/framemav.html> <http://www.rbvi.ucsf.edu/chimera/docs/ContributedSoftware/multalignviewer/multalignviewer.html#trees>
(2) Conservation calculation options. Use Multalign Viewer menu: Preferences… Headers, and set Conservation style to AL2CO. Then you have a number of sophisticated choices for calculating conservation: entropy measure, variability measure, and sum-of-pairs with many choices of which AA similarity matrix to use, with or without sequence weighting and smoothing over windows. I'm continually wishing more people would make use of these, and try to mention them in all of my tutorials! The AL2CO options come from the program of that name provided by the Grishin group.
<http://www.rbvi.ucsf.edu/chimera/docs/ContributedSoftware/multalignviewer/multalignviewer.html#mavpref-analysis>
Theoretically you could even make your own matrix and put it in the proper place for discovery by the code, although that would be a lot of work. Probably one of the existing matrices would be suitable: there are several BLOSUM, some PAM, and some structure-derived (SDM, HSDM). I added the latter fairly recently.
You can also calculate your own custom values externally from Chimera and then load them in from a "header file," to be displayed above the alignment just like the built-in Conservation.
See tutorial: Mapping Sequence Conservation on Structures with Chimera <http://www.rbvi.ucsf.edu/chimera/data/tutorials/systems/outline.html>
I hope this helps,
Elaine
----------
Elaine C. Meng, Ph.D.
UCSF Computer Graphics Lab (Chimera team) and Babbitt Lab
Department of Pharmaceutical Chemistry
University of California, San Francisco
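As a rough illustration of the "calculate your own custom values externally" route applied to the subgroup question, the sketch below computes per-column identity separately for two subsets of an alignment and reports columns that are conserved within each subset but differ between them (the arginine-versus-glutamate case). It is a minimal sketch outside Chimera, not a Chimera API: the function names, the toy sequences, and the choice to count gaps against conservation are all assumptions, and writing the values out as a Multalign Viewer header file is left out because that format is not described in this thread.

```python
from collections import Counter


def column_identity(seqs, gap_chars="-."):
    """Return, per column, (most common residue, fraction of sequences with it)."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("aligned sequences must all have the same length")
    result = []
    for i in range(length):
        residues = [s[i].upper() for s in seqs if s[i] not in gap_chars]
        if not residues:
            result.append((None, 0.0))  # all-gap column
            continue
        residue, count = Counter(residues).most_common(1)[0]
        # Assumption: gaps count against conservation, so divide by all sequences.
        result.append((residue, count / len(seqs)))
    return result


def subgroup_specific_positions(alignment, group_a, group_b, threshold=1.0):
    """List 1-based columns conserved in both subgroups but with different residues."""
    cons_a = column_identity([alignment[name] for name in group_a])
    cons_b = column_identity([alignment[name] for name in group_b])
    hits = []
    for pos, ((res_a, frac_a), (res_b, frac_b)) in enumerate(zip(cons_a, cons_b), start=1):
        if frac_a >= threshold and frac_b >= threshold and res_a != res_b:
            hits.append((pos, res_a, res_b))
    return hits


# Toy alignment with hypothetical sequence names (two families of two sequences).
alignment = {
    "famA_1": "MKRLE",
    "famA_2": "MKRIE",
    "famB_1": "MKELD",
    "famB_2": "MKEID",
}
hits = subgroup_specific_positions(
    alignment, ["famA_1", "famA_2"], ["famB_1", "famB_2"]
)
for pos, res_a, res_b in hits:
    print(f"column {pos}: {res_a} in family A, {res_b} in family B")
```

Lowering `threshold` below 1.0 would tolerate a few exceptions within each family, at the cost of more candidate positions to inspect.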

Hi Elaine,
Many thanks for the detailed explanation - it certainly looks like there are some useful features here that I didn’t know about, and I will definitely take advantage of them in future.
With regard to the conservation similarity features, I will read further, but I don’t quite know how to interpret the mavConservation attribute when conservation is calculated using the AL2CO method - it doesn’t seem to be a simple "percentage of sequences that are identical at this position" like it is otherwise - it does not range between 0 and 1, for example.
What I would like, in addition to the present options, is to have the option to take into account similarity, not just identity, when calculating mavConservation (or perhaps have a separate attribute, mavSimilar) - so that rather than mavConservation being the percentage of sequences that are identical at a given position, it would be the percentage of residues that are similar at a given position.
Cheers,
Oliver.
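To make the requested "percentage similar" measure concrete, here is a minimal sketch of how it could be computed outside Chimera with user-adjustable, overlapping similarity groups. Nothing here is an existing Chimera attribute or function: the group definitions, the name `fraction_similar`, and the rule "take the best-covered group, or the most common single residue if that covers more" are all assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical, user-editable groups; a residue may belong to more than one.
SIMILARITY_GROUPS = [
    set("KRH"),   # basic
    set("DE"),    # acidic
    set("DENQ"),  # acidic plus amides (overlaps the acidic group)
    set("ILVM"),  # aliphatic
    set("FWY"),   # aromatic
    set("ST"),    # small hydroxyl
]


def fraction_similar(column, groups=SIMILARITY_GROUPS, gap_chars="-."):
    """Fraction of sequences whose residue falls in the best-covered group.

    Plain identity is always considered too, so the result is never lower
    than the fraction of sequences sharing the most common residue.
    """
    residues = [c.upper() for c in column if c not in gap_chars]
    if not residues:
        return 0.0
    counts = Counter(residues)
    best = max(counts.values())  # identity baseline
    for group in groups:
        covered = sum(n for residue, n in counts.items() if residue in group)
        best = max(best, covered)
    # Assumption: gaps count against similarity, so divide by the full column.
    return best / len(column)


for column in ("KRKH", "DENQ", "DEKK", "AAAG", "K-RR"):
    print(column, fraction_similar(column))
```

Because the groups are plain sets in a list, redefining or overlapping them only requires editing `SIMILARITY_GROUPS` or passing a different list as `groups`. Unlike the AL2CO measures, this value stays between 0 and 1.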

Hi Oliver,
The sum-of-pairs option does represent amino acid similarity, using the values in an amino acid similarity matrix of your choice. It's literally a sum of the values for the various combinations of pairs in the column. All AL2CO conservation values are then normalized so that positive always means more conserved and the value is in standard deviations from the mean (Z-score), regardless of which option is used. That is why the values do not range between 0 and 1. There is a full explanation in the Chimera docs and in the AL2CO paper.
<http://www.rbvi.ucsf.edu/chimera/docs/ContributedSoftware/multalignviewer/multalignviewer.html#mavpref-analysis>
AL2CO: calculation of positional conservation in a protein sequence alignment. Pei J, Grishin NV. Bioinformatics. 2001 Aug;17(8):700-12. <http://www.ncbi.nlm.nih.gov/pubmed/11524371>
Best,
Elaine
----------
Elaine C. Meng, Ph.D.
UCSF Computer Graphics Lab (Chimera team) and Babbitt Lab
Department of Pharmaceutical Chemistry
University of California, San Francisco
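The two steps described above (sum the matrix values over all residue pairs in a column, then express each column as a Z-score) can be sketched as follows. This is a simplified illustration, not the AL2CO implementation: it is unweighted, skips gaps, and uses made-up scores in place of a real BLOSUM, PAM, or SDM/HSDM matrix. All names and values are assumptions.

```python
from itertools import combinations
from statistics import mean, pstdev

# Made-up toy scores standing in for a real similarity matrix.
IDENTICAL = 4
MISMATCH = -2
SIMILAR = {frozenset("KR"): 1, frozenset("DE"): 1}


def pair_score(a, b):
    """Score one residue pair using the toy values above."""
    if a == b:
        return IDENTICAL
    return SIMILAR.get(frozenset((a, b)), MISMATCH)


def sum_of_pairs(column, gap_chars="-."):
    """Sum the pair scores over every pair of non-gap residues in a column."""
    residues = [c.upper() for c in column if c not in gap_chars]
    return sum(pair_score(a, b) for a, b in combinations(residues, 2))


def z_scores(values):
    """Rescale values to standard deviations from their mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # every column scored the same
    return [(v - mu) / sigma for v in values]


columns = ["KKKK", "KRKR", "KEKE"]
raw = [sum_of_pairs(col) for col in columns]
normalized = z_scores(raw)
for col, r, z in zip(columns, raw, normalized):
    print(f"{col}: raw={r}, z={z:+.2f}")
```

In this toy case the fully conserved column gets a positive value, the K/R column sits at the mean, and the K/E column is negative, which matches the reading that positive means more conserved than average for that alignment rather than a fraction of identical sequences.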

On Feb 18, 2015, at 7:39 AM, Oliver Clarke <olibclarke@gmail.com> wrote:
1. Selection of sub-groups of sequences within a multiple alignment for selecting residues by conservation. This would enable the easy selection/identification of residues that are specifically conserved within individual families with different functions or substrates - for example, a position that is an arginine, and always an arginine, in one subgroup of proteins that we know have a particular function, and always a glutamate in another group of proteins that we know possess a different but related function. It is possible to do this currently by loading in multiple different sequence alignments and saving and intersecting selections, but having a popup menu in the select/render by attribute tab where you could select or deselect sequences that contribute to the analysis of conservation would make it much easier.
Making it easier to add conservation/consensus lines that apply to only subsets of sequences would indeed be pretty useful. It's actually already on my "to do" list (#12388 (allow conservation to be shown for sequence subsets)), but many things on my to-do list haven't been getting the attention I'd like as we work hard to try to get "Chimera 2" at least minimally up and running. As a corollary, this will probably get implemented in Chimera 2 rather than Chimera 1, which as you can imagine means it will be a while before it sees the light of day. So for now you will have to keep using the less convenient workarounds you are currently using to get your work done. I'm sorry about that.
--Eric
Eric Pettersen
UCSF Computer Graphics Lab
http://www.cgl.ucsf.edu