[chimerax-users] Re: retrieve the sequence of an atom selection

9 Jan 2024

Hi Eric,

thanks a lot for your help. That is exactly what I wanted!

By the way, I created the example DNA file in ChimeraX using build start nucleic (type DNA, form B). And yes, the two strands do have different residue names in the PDB file:

SEQRES   1 A   17  DA  DG  DG  DG  DA  DG  DT  DA  DA  DT  DC  DC  DC
SEQRES   2 A   17  DC  DT  DT  DG
SEQRES   1 B   17  C   A   A   G   G   G   G   A   T   T   A   C   T
SEQRES   2 B   17  C   C   C   T
ATOM      1  C4' DA  A   1     212.165 192.028 187.380  1.00  0.00           C
ATOM      2  O5' DA  A   1     214.218 191.381 188.348  1.00  0.00           O
ATOM      3  O3' DA  A   1     210.239 190.586 187.380  1.00  0.00           O
...

Can you reproduce this?

Regards,

Maik

________________________________
From: Eric Pettersen <pett@cgl.ucsf.edu>
Sent: Monday, January 8, 2024 9:10:05 PM
To: Engeholm, Maik
Cc: chimerax-users@cgl.ucsf.edu
Subject: Re: [chimerax-users] retrieve the sequence of an atom selection

Hi Maik,
The residue names come from the input, so chain B in your input file must have those single-letter codes.  The PDB-standard DNA residue names are two letters starting with D, whereas RNA names are one letter, but your one-letter residue names clearly are not RNA since there is a 'T' in them.  Anyway, whatever created the input file is the culprit here.
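If the mixed naming has to be tolerated in a script, one possible workaround is to normalize residue names to one-letter codes before joining them. The following is only a sketch in plain Python: `one_letter` and `DNA_NAMES` are hypothetical names introduced for illustration, not part of the ChimeraX API, and the mapping covers just the standard DNA names and bare one-letter codes.

```python
# Hypothetical helper, not part of ChimeraX: map a residue name to a
# one-letter nucleotide code, accepting both PDB-standard DNA names
# ("DA", "DC", "DG", "DT") and bare one-letter names ("A", "C", ...).
DNA_NAMES = {"DA": "A", "DC": "C", "DG": "G", "DT": "T"}
ONE_LETTER = {"A", "C", "G", "T", "U"}


def one_letter(res_name):
    name = res_name.strip().upper()
    if name in DNA_NAMES:
        return DNA_NAMES[name]
    if name in ONE_LETTER:
        return name
    return "?"  # unknown or modified residue; an arbitrary placeholder


# Residue names as reported for the selection #1/A:2-4|#1/B:6-9
names_a = ["DG", "DG", "DG"]
names_b = ["G", "G", "A", "T"]
print("".join(one_letter(n) for n in names_a))
print("".join(one_letter(n) for n in names_b))
```

This only papers over the naming difference; it does not address where the one-letter names in the file came from.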
For getting the sequences cleanly, I would use the Chain objects, which know both the residues and corresponding sequence.  Assuming your selections on the chains are contiguous, this should work:

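from chimerax.core.commands import run

selection = run(session, "select ...")
for chain in selection.chains:
    # positions in the chain whose residue exists and has at least one selected atom
    indices = [i for i, r in enumerate(chain.residues) if r and r.atoms.selecteds.any()]
    print(chain, "".join(chain.characters[min(indices):max(indices)+1]))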
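The slicing logic can be tried outside ChimeraX with plain Python stand-ins. In this sketch, `selected_subsequence` is a hypothetical helper, the string plays the role of `chain.characters`, and the list of booleans stands in for "this position has a residue with at least one selected atom". The sequence and the 6-9 range are taken from the chain B example in this thread.

```python
# Plain-Python stand-in for the min/max slicing idea; no ChimeraX needed.
def selected_subsequence(characters, selected_flags):
    """Return the span of `characters` from the first to the last selected position."""
    indices = [i for i, sel in enumerate(selected_flags) if sel]
    if not indices:
        return ""  # nothing selected in this chain
    return characters[min(indices):max(indices) + 1]


chain_b = "CAAGGGGATTACTCCCT"
# Mark residues 6-9 (1-based numbering) as selected.
flags = [6 <= pos <= 9 for pos in range(1, len(chain_b) + 1)]
print(selected_subsequence(chain_b, flags))  # GGAT
```

Because only the first and last selected positions are used, a non-contiguous selection would also return the unselected residues in between, which is why the selections are assumed to be contiguous.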

--Eric

Eric Pettersen
UCSF Computer Graphics Lab

On Jan 6, 2024, at 2:02 PM, Engeholm, Maik via ChimeraX-users <chimerax-users@cgl.ucsf.edu> wrote:

Hi there,

I'm looking for a neat way to retrieve the sequence of an atom selection (in this case part of a double-stranded DNA molecule) from within a python script.

I've been playing around with:

selection = run(session, 'select #1/A:2-4|#1/B:6-9')
for a in selection.atoms.unique_residues:
    print(a)

which gives me:

151 atoms, 169 bonds, 7 residues, 1 model selected
/A DG 2
/A DG 3
/A DG 4
/B G 6
/B G 7
/B A 8
/B T 9

This is a good start, but ideally I'd like to get back a contiguous sequence string for each chain (like (('A', 'GGG'), ('B', 'GGAT'))). Also, I do not really understand why residue names are reported as 'DG' for strand A but 'G' for strand B, which makes extracting a sequence string from the above output quite difficult.

I've also tried:

selection = run(session, 'select #1/A:2-4|#1/B:6-9')
for a in selection.atoms.by_chain:
    print(a[1])
    print(a[2].unique_residues.unique_sequences[0][1])

which gives me:

151 atoms, 169 bonds, 7 residues, 1 model selected
A
AGGGAGTAATCCCCTTG
B
CAAGGGGATTACTCCCT

This is obviously also not what I want since it returns the entire sequence of each chain.

I guess I could take the residue numbers from the first command and use those to slice the strings returned by the second, but that feels a little awkward, and so I was wondering if there was a better, more direct way to achieve this seemingly simple task.

Your help would be much appreciated.

Cheers,

Maik
