Re: [chimerax-users] label_seq_id in mmcif files for Chimera
On 7/7/2020 3:50 PM, Marcin Wojdyr wrote:
On Tue, 7 Jul 2020 at 22:38, Greg Couch gregc@cgl.ucsf.edu wrote:
As you know, the entity_poly_seq table says which polymeric residues are adjacent to each other (entity_poly_seq.num values are the atom_site.label_seq_id values).
Yes, it's an enhanced equivalent of the SEQRES record in the PDB format. Files from the wwPDB always have it, but files from other sources often lack it. I think almost all the mmCIF files currently in circulation originate from the wwPDB, but this is slowly changing as people start to use mmCIF as a working format.
The sequence information is extremely useful for eliminating ambiguities in the atom_site data. And it is frequently known by the user, even when it is missing from the mmCIF file.
Without that information, ChimeraX still needs the atom_site.label_seq_id to imply the polymeric ordering.
Ok. It would be useful, though, if ChimeraX could infer the ordering and connectivity in the same way as it does for PDB files.
Thanks, Marcin
Adding an atom_site.label_seq_id isn't different from supplying a residue number in a PDB file. When there are adjacent residues of the same type, does the PDB reader see a duplicate atom and generate a new residue? Merge the residues? Generate an error? I haven't tested the PDB reader, but a residue number helps it too.
Learning to use mmCIF effectively will take time. Programmers should use the PDB's mmCIF Dictionary Suite to validate the mmCIF their programs output -- it won't validate completely (because it's for mmCIF files served by the PDB), but it will show where the mmCIF output can be improved.
For more information about ChimeraX reading mmCIF, see https://www.rbvi.ucsf.edu/chimerax/docs/devel/bundles/mmcif/src/mmcif_guidel....
ChimeraX's mmCIF reader tries really hard to show you the data exactly as it is specified, so if there's badly specified data, you'll be able to see it. Over time, we will improve ChimeraX's ability to handle underspecified data as developers start generating mmCIF files. But we don't plan to modify the mmCIF reader if we can help it. Adding code to infer information that should be given explicitly in the mmCIF file slows down the mmCIF reader for everyone. This is especially true for large structures with over a million atoms.
-- Greg
Hi Greg,
Adding an atom_site.label_seq_id isn't different from supplying a residue number in a PDB file. When there are adjacent residues of the same type, does the PDB reader see a duplicate atom and generate a new residue? Merge the residues? Generate an error? I haven't tested the PDB reader, but a residue number helps it too.
The sequence number/id from the PDB format tells which atoms are in the same residue, but it doesn't imply connectivity between residues, because the numbers don't need to be consecutive. In the mmCIF format it is stored as _atom_site.auth_seq_id (plus pdbx_PDB_ins_code for the full id). This makes conversion from PDB to mmCIF problematic. If SEQRES is present, I do sequence alignment to determine label_seq_id. If SEQRES is missing, I could ask the user to supply the full sequence, but then the user may think that moving to mmCIF is a step backward, since this was not obligatory when working with PDB files. I could infer gaps and increase label_seq_id by 2 where there is a gap, but the resulting mmCIF file can be used for any purpose, not only by Chimera. The apparent gap may actually be caused by misplaced atoms, and the gap in numbering could then cause different problems later. It's not clear to me which is better here: leaving label_seq_id null or filling it with best guesses (which will sometimes be wrong).
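To make the "increase label_seq_id by 2" option concrete, here is a minimal Python sketch. It is only an illustration, not how any existing converter works. The function name guess_label_seq_ids is hypothetical, and it assumes a gap is detected purely from a break in the author numbering (insertion codes and geometry are ignored), which is exactly the kind of guess that can be wrong.

```python
def guess_label_seq_ids(auth_seq_ids):
    """Guess label_seq_id values for the residues of one polymer chain.

    auth_seq_ids: author residue numbers (ints) in file order, one per residue.
    Assumption: a step of exactly +1 in author numbering means the residues
    are adjacent; any other step is treated as a gap and skips one label.
    """
    labels = []
    prev_auth = None
    for auth in auth_seq_ids:
        if prev_auth is None:
            label = 1                  # first residue of the chain
        elif auth == prev_auth + 1:
            label = labels[-1] + 1     # consecutive numbering: adjacent
        else:
            label = labels[-1] + 2     # apparent gap: leave one label unused
        labels.append(label)
        prev_auth = auth
    return labels


print(guess_label_seq_ids([10, 11, 12, 20, 21]))  # [1, 2, 3, 5, 6]
```

The skipped label only records that something is missing, not how many residues are missing, so a reader should not treat the guessed values as true positions in the sequence.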
Marcin
Hi Marcin,
In the mmCIF format atom_site.label_seq_id is required as described in the mmCIF documentation
http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_atom_site.label_seq_id.html
That documentation says it is used by the 60 mmCIF tables I copied below -- if you don't include it then none of those tables can refer to a specific residue. A minimal mmCIF file can of course omit all of those 60 tables, so in principle it could work to omit label_seq_id or specify it as ".". But you are really asking for software not to work. It is like not including any atom names because you don't happen to need the atom names -- good luck getting software to work correctly when you omit basic information.
Tom
_atom_site_anisotrop.pdbx_label_seq_id http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_atom_site_anisotrop.pdbx_label_seq_id.html
_geom_angle.atom_site_label_seq_id_1 http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_geom_angle.atom_site_label_seq_id_1.html
_geom_angle.atom_site_label_seq_id_2 http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_geom_angle.atom_site_label_seq_id_2.html
_geom_angle.atom_site_label_seq_id_3 http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_geom_angle.atom_site_label_seq_id_3.html
_geom_bond.atom_site_label_seq_id_1 http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_geom_bond.atom_site_label_seq_id_1.html
_geom_bond.atom_site_label_seq_id_2 http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_geom_bond.atom_site_label_seq_id_2.html
_geom_contact.atom_site_label_seq_id_1 http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_geom_contact.atom_site_label_seq_id_1.html
_geom_contact.atom_site_label_seq_id_2 http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_geom_contact.atom_site_label_seq_id_2.html
_geom_hbond.atom_site_label_seq_id_A http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_geom_hbond.atom_site_label_seq_id_A.html
_geom_hbond.atom_site_label_seq_id_D http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_geom_hbond.atom_site_label_seq_id_D.html
_geom_hbond.atom_site_label_seq_id_H http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_geom_hbond.atom_site_label_seq_id_H.html
_geom_torsion.atom_site_label_seq_id_1 http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_geom_torsion.atom_site_label_seq_id_1.html
_geom_torsion.atom_site_label_seq_id_2 http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_geom_torsion.atom_site_label_seq_id_2.html
_geom_torsion.atom_site_label_seq_id_3 http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_geom_torsion.atom_site_label_seq_id_3.html
_geom_torsion.atom_site_label_seq_id_4 http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_geom_torsion.atom_site_label_seq_id_4.html
_ndb_struct_na_base_pair.i_label_seq_id http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_ndb_struct_na_base_pair.i_label_seq_id.html
_ndb_struct_na_base_pair.j_label_seq_id http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_ndb_struct_na_base_pair.j_label_seq_id.html
_ndb_struct_na_base_pair_step.i_label_seq_id_1 http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_ndb_struct_na_base_pair_step.i_label_seq_id_1.html
_ndb_struct_na_base_pair_step.i_label_seq_id_2 http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_ndb_struct_na_base_pair_step.i_label_seq_id_2.html
_ndb_struct_na_base_pair_step.j_label_seq_id_1 http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_ndb_struct_na_base_pair_step.j_label_seq_id_1.html
_ndb_struct_na_base_pair_step.j_label_seq_id_2 http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_ndb_struct_na_base_pair_step.j_label_seq_id_2.html
_pdbx_atom_site_aniso_tls.label_seq_id http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_pdbx_atom_site_aniso_tls.label_seq_id.html
_pdbx_domain_range.beg_label_seq_id http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_pdbx_domain_range.beg_label_seq_id.html
_pdbx_domain_range.end_label_seq_id http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_pdbx_domain_range.end_label_seq_id.html
_pdbx_feature_monomer.label_seq_id http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_pdbx_feature_monomer.label_seq_id.html
_pdbx_refine_component.label_seq_id http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_pdbx_refine_component.label_seq_id.html
_pdbx_remediation_atom_site_mapping.label_seq_id http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_pdbx_remediation_atom_site_mapping.label_seq_id.html
_pdbx_sequence_range.beg_label_seq_id http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_pdbx_sequence_range.beg_label_seq_id.html
_pdbx_sequence_range.end_label_seq_id http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_pdbx_sequence_range.end_label_seq_id.html
_pdbx_struct_chem_comp_diagnostics.seq_num http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_pdbx_struct_chem_comp_diagnostics.seq_num.html
_pdbx_struct_chem_comp_feature.seq_num http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_pdbx_struct_chem_comp_feature.seq_num.html
_pdbx_struct_conn_angle.ptnr1_label_seq_id http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_pdbx_struct_conn_angle.ptnr1_label_seq_id.html
_pdbx_struct_conn_angle.ptnr2_label_seq_id http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_pdbx_struct_conn_angle.ptnr2_label_seq_id.html
_pdbx_struct_conn_angle.ptnr3_label_seq_id http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_pdbx_struct_conn_angle.ptnr3_label_seq_id.html
_pdbx_struct_group_component_range.end_label_seq_id http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_pdbx_struct_group_component_range.end_label_seq_id.html
_pdbx_struct_group_components.label_seq_id http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_pdbx_struct_group_components.label_seq_id.html
_pdbx_struct_mod_residue.label_seq_id http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_pdbx_struct_mod_residue.label_seq_id.html
_pdbx_struct_sheet_hbond.range_1_label_seq_id http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_pdbx_struct_sheet_hbond.range_1_label_seq_id.html
_pdbx_struct_sheet_hbond.range_2_label_seq_id http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_pdbx_struct_sheet_hbond.range_2_label_seq_id.html
_struct_conf.beg_label_seq_id http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_struct_conf.beg_label_seq_id.html
_struct_conf.end_label_seq_id http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_struct_conf.end_label_seq_id.html
_struct_conn.pdbx_ptnr3_label_seq_id http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_struct_conn.pdbx_ptnr3_label_seq_id.html
_struct_conn.ptnr1_label_seq_id http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_struct_conn.ptnr1_label_seq_id.html
_struct_conn.ptnr2_label_seq_id http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_struct_conn.ptnr2_label_seq_id.html
_struct_mon_nucl.label_seq_id http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_struct_mon_nucl.label_seq_id.html
_struct_mon_prot.label_seq_id http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_struct_mon_prot.label_seq_id.html
_struct_mon_prot_cis.label_seq_id http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_struct_mon_prot_cis.label_seq_id.html
_struct_mon_prot_cis.pdbx_label_seq_id_2 http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_struct_mon_prot_cis.pdbx_label_seq_id_2.html
_struct_sheet_hbond.range_1_beg_label_seq_id http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_struct_sheet_hbond.range_1_beg_label_seq_id.html
_struct_sheet_hbond.range_1_end_label_seq_id http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_struct_sheet_hbond.range_1_end_label_seq_id.html
_struct_sheet_hbond.range_2_beg_label_seq_id http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_struct_sheet_hbond.range_2_beg_label_seq_id.html
_struct_sheet_hbond.range_2_end_label_seq_id http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_struct_sheet_hbond.range_2_end_label_seq_id.html
_struct_sheet_range.beg_label_seq_id http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_struct_sheet_range.beg_label_seq_id.html
_struct_sheet_range.end_label_seq_id http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_struct_sheet_range.end_label_seq_id.html
_struct_site_gen.label_seq_id http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_struct_site_gen.label_seq_id.html
Hi Tom,
I don't want to continue it for too long, but I need to comment on this:
That documentation says it is used by the 60 mmCIF tables I copied below -- if you don't include it then none of those tables can refer to a specific residue.
They can - by using different identifiers, the ones inherited from the PDB format. Moreover, if the residue that you refer to is HOH you have to use different identifiers, because label_seq_id is null (.) for ligands and waters. It cannot be used to refer to a specific water molecule.
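As a rough illustration of referring to a residue by the PDB-inherited identifiers when label_seq_id is null, here is a small Python sketch. The function residue_key and the dict-of-strings row layout are hypothetical, chosen only for this example; the column names are the standard atom_site items.

```python
def residue_key(row):
    """Return a tuple identifying the residue that an atom_site row belongs to.

    row: dict mapping atom_site column names to string values
    (a hypothetical layout used only for this sketch).
    """
    label_seq_id = row.get("label_seq_id", ".")
    if label_seq_id not in (".", "?"):
        # Polymer residue: the label_* identifiers are enough.
        return ("label", row["label_asym_id"], label_seq_id)
    # Water or ligand: label_seq_id is null, so fall back to the
    # identifiers inherited from the PDB format.
    return (
        "auth",
        row["auth_asym_id"],
        row["auth_seq_id"],
        row.get("pdbx_PDB_ins_code", "?"),
    )


water = {"label_asym_id": "C", "label_seq_id": ".",
         "auth_asym_id": "A", "auth_seq_id": "301"}
print(residue_key(water))  # ('auth', 'A', '301', '?')
```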
Marcin
On 7/8/2020 12:06 PM, Marcin Wojdyr wrote:
Hi Tom,
I don't want to continue it for too long, but I need to comment on this:
That documentation says it is used by the 60 mmCIF tables I copied below -- if you don't include it then none of those tables can refer to a specific residue.
They can - by using different identifiers, the ones inherited from the PDB format. Moreover, if the residue that you refer to is HOH you have to use different identifiers, because label_seq_id is null (.) for ligands and waters. It cannot be used to refer to a specific water molecule.
Marcin
Yes, that is a major failing of the mmCIF format. If the atom_site.label_seq_id were non-null for solvent, then all of the lines in the atom_site table would be uniquely keyed, with the label_* values, like a database table -- that would be fantastic for us computer science types. ChimeraX uses the atom_site.auth_seq_id to disambiguate the solvent. The PDBe is thinking of adding another column to the atom_site table for the same purpose.
It appears that it would be fairly easy for the PDB to fix this. The mmCIF dictionary description for atom_site.label_seq_id would have to be updated to allow values for solvent. The current description is:
This data item is a pointer to _entity_poly_seq.num in the ENTITY_POLY_SEQ category.
That could be amended. As far as I can tell, CIF validators do not use the description when validating CIF files, so the current limitation is not a formal requirement, just a convention. The problem is all of the programs the PDB uses.
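The keying problem with solvent can be shown with a short Python sketch. The rows below are made up for illustration (not taken from a real entry), and duplicated_keys and LABEL_KEY are hypothetical names. The sketch assumes, as is typical in wwPDB files, that several waters share one label_asym_id.

```python
from collections import Counter


def duplicated_keys(rows, columns):
    """Return the key tuples (built from `columns`) that occur more than once."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]


# Candidate key built only from label_* items.
LABEL_KEY = ("label_asym_id", "label_seq_id", "label_comp_id",
             "label_atom_id", "label_alt_id")

# Illustrative rows: one polymer atom and two waters sharing label_asym_id "B".
rows = [
    {"label_asym_id": "A", "label_seq_id": "1", "label_comp_id": "GLY",
     "label_atom_id": "CA", "label_alt_id": ".", "auth_seq_id": "5"},
    {"label_asym_id": "B", "label_seq_id": ".", "label_comp_id": "HOH",
     "label_atom_id": "O", "label_alt_id": ".", "auth_seq_id": "201"},
    {"label_asym_id": "B", "label_seq_id": ".", "label_comp_id": "HOH",
     "label_atom_id": "O", "label_alt_id": ".", "auth_seq_id": "202"},
]

# The two waters collide when only label_* values are used ...
print(duplicated_keys(rows, LABEL_KEY))
# ... and are told apart once auth_seq_id is added to the key.
print(duplicated_keys(rows, LABEL_KEY + ("auth_seq_id",)))
```

The first call reports the shared water key and the second reports no duplicates, which matches the approach of using atom_site.auth_seq_id to disambiguate the solvent.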
participants (3)
- Greg Couch
- Marcin Wojdyr
- Tom Goddard