Re: [chimerax-users] label_seq_id in mmcif files for Chimera

On 7/7/2020 3:50 PM, Marcin Wojdyr wrote:
On Tue, 7 Jul 2020 at 22:38, Greg Couch <gregc@cgl.ucsf.edu> wrote:
As you know, the entity_poly_seq table says which polymeric residues are adjacent to each other (entity_poly_seq.num values are the atom_site.label_seq_id values).
Yes, it's an enhanced equivalent of the SEQRES record in the PDB format. Files from the wwPDB always have it, but files from other sources will often miss it. I think currently almost all the mmCIF files in circulation originate from the wwPDB. But this is slowly changing and people are starting to use mmCIF as a working format.
The sequence information is extremely useful for eliminating ambiguities in the atom_site data. And it is frequently known by the user, even when it is missing from the mmCIF file. Without that information, ChimeraX still needs the atom_site.label_seq_id to imply the polymeric ordering.
Ok. It would be useful, though, if ChimeraX could infer the ordering and connectivity in the same way as it does for PDB files.
Thanks, Marcin
Adding an atom_site.label_seq_id isn't different from supplying a residue number in a PDB file. When there are adjacent residues of the same type, does the PDB reader see a duplicate atom and generate a new residue? Merge the residues? Generate an error? I haven't tested the PDB reader, but a residue number helps it too.
Learning to use mmCIF effectively will take time. Programmers should use the PDB's mmCIF Dictionary Suite to validate the mmCIF their programs output -- it won't validate completely (because it's for mmCIF files served by the PDB), but it will show where the mmCIF output can be improved. For more information about ChimeraX reading mmCIF, see https://www.rbvi.ucsf.edu/chimerax/docs/devel/bundles/mmcif/src/mmcif_guidel....
ChimeraX's mmCIF reader tries really hard to show you the data exactly as it is specified. So if there's badly specified data, you'll be able to see it. Over time, we will improve ChimeraX's ability to handle underspecified data as developers start generating mmCIF files. But we don't plan to modify the mmCIF reader if we can help it. Adding code to infer information that should be explicitly given in the mmCIF file slows down the mmCIF reader for everyone. This is especially true for large structures with over a million atoms.
-- Greg
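To illustrate the point about polymeric ordering, here is a minimal Python sketch of how consecutive label_seq_id values can imply adjacency, even for neighboring residues of the same type. It is not ChimeraX's reader; the Residue tuple and the polymer_neighbors function are hypothetical names made up for this example, and real files would of course need the values parsed from the atom_site table first.

```python
from typing import NamedTuple


class Residue(NamedTuple):
    # Hypothetical container holding three atom_site label_* values.
    label_asym_id: str
    label_seq_id: int
    label_comp_id: str


def polymer_neighbors(residues):
    """Return pairs of residues whose label_seq_id values are consecutive
    within the same chain (label_asym_id)."""
    ordered = sorted(residues, key=lambda r: (r.label_asym_id, r.label_seq_id))
    pairs = []
    for prev, curr in zip(ordered, ordered[1:]):
        same_chain = prev.label_asym_id == curr.label_asym_id
        if same_chain and curr.label_seq_id - prev.label_seq_id == 1:
            pairs.append((prev, curr))
    return pairs


# Two adjacent glycines stay distinct because their label_seq_id differs;
# residue 3 is not modelled, so residues 2 and 4 are not treated as neighbors.
chain = [
    Residue("A", 1, "GLY"),
    Residue("A", 2, "GLY"),
    Residue("A", 4, "ALA"),
]
neighbors = polymer_neighbors(chain)
```

With this input only the two glycines are reported as neighbors, which is the information a reader cannot recover from residue names alone.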

Hi Greg,
Adding an atom_site.label_seq_id isn't different from supplying a residue number in a PDB file. When there are adjacent residues of the same type, does the PDB reader see a duplicate atom and generate a new residue? Merge the residues? Generate an error? I haven't tested the PDB reader, but a residue number helps it too.
The sequence number/id from the PDB format tells which atoms are in the same residue, but it doesn't imply connectivity between residues, because the numbers don't need to be consecutive. In the mmCIF format it is stored as _atom_site.auth_seq_id (+pdbx_PDB_ins_code for the full id). So this makes conversion from PDB to mmCIF problematic. If SEQRES is present, I do sequence alignment to determine label_seq_id. If SEQRES is missing, I could ask the user to supply the full sequence, but then the user may think that (since this was not obligatory when working with PDB files) moving to mmCIF is a step backward. I could infer gaps and increase label_seq_id by 2 if there is a gap, but the resulting mmCIF file can be used for any purpose, not only by Chimera. The apparent gap may actually be caused by misplaced atoms, and it could turn out that the gap in numbering causes different problems later. It's not clear to me what's better here - leaving label_seq_id null or filling it with the best guesses (which sometimes will be wrong).
Marcin
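The "increase label_seq_id by 2 if there is a gap" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not what any converter actually does: the function name guess_label_seq_ids is hypothetical, a gap is assumed to be detected from the distance between the C atom of one residue and the N atom of the next, and the 2.0 Å cutoff is an arbitrary choice for this example (a peptide bond is about 1.33 Å).

```python
def guess_label_seq_ids(c_n_distances, max_bond_length=2.0):
    """Guess label_seq_id values for one chain.

    c_n_distances[i] is the distance (in angstroms) between the C atom of
    residue i and the N atom of residue i + 1. The first residue gets 1;
    each following residue gets +1 if it looks bonded to the previous one,
    or +2 if there appears to be a gap. The cutoff is an assumption.
    """
    seq_ids = [1]
    for distance in c_n_distances:
        step = 1 if distance <= max_bond_length else 2
        seq_ids.append(seq_ids[-1] + step)
    return seq_ids


# Five residues; the third and fourth are far apart, so a gap is guessed.
guessed = guess_label_seq_ids([1.33, 1.34, 7.9, 1.32])
```

As noted above, such a guess can be wrong: a large distance may come from misplaced atoms rather than missing residues, and the true gap may be longer than one residue.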

Hi Marcin,
In the mmCIF format atom_site.label_seq_id is required, as described in the mmCIF documentation:
http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_atom_site.labe...
That documentation says it is used by the 60 mmCIF tables I copied below -- if you don't include it then none of those tables can refer to a specific residue. A minimal mmCIF file can of course not have any of those 60 tables, so it could work in principle to omit label_seq_id or specify it as ".". But you are really asking for software not to work. It is like not including any atom names because you don't happen to need the atom names -- good luck getting software to work correctly when you omit basic information.
Tom
_atom_site_anisotrop.pdbx_label_seq_id <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_atom_site_anis...>
_geom_angle.atom_site_label_seq_id_1 <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_geom_angle.ato...>
_geom_angle.atom_site_label_seq_id_2 <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_geom_angle.ato...>
_geom_angle.atom_site_label_seq_id_3 <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_geom_angle.ato...>
_geom_bond.atom_site_label_seq_id_1 <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_geom_bond.atom...>
_geom_bond.atom_site_label_seq_id_2 <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_geom_bond.atom...>
_geom_contact.atom_site_label_seq_id_1 <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_geom_contact.a...>
_geom_contact.atom_site_label_seq_id_2 <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_geom_contact.a...>
_geom_hbond.atom_site_label_seq_id_A <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_geom_hbond.ato...>
_geom_hbond.atom_site_label_seq_id_D <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_geom_hbond.ato...>
_geom_hbond.atom_site_label_seq_id_H <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_geom_hbond.ato...>
_geom_torsion.atom_site_label_seq_id_1 <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_geom_torsion.atom_site_label_seq_id_1.html>
_geom_torsion.atom_site_label_seq_id_2 <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_geom_torsion.a...>
_geom_torsion.atom_site_label_seq_id_3 <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_geom_torsion.a...>
_geom_torsion.atom_site_label_seq_id_4 <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_geom_torsion.a...>
_ndb_struct_na_base_pair.i_label_seq_id <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_ndb_struct_na_base_pair.i_label_seq_id.html>
_ndb_struct_na_base_pair.j_label_seq_id <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_ndb_struct_na_...>
_ndb_struct_na_base_pair_step.i_label_seq_id_1 <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_ndb_struct_na_...>
_ndb_struct_na_base_pair_step.i_label_seq_id_2 <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_ndb_struct_na_...>
_ndb_struct_na_base_pair_step.j_label_seq_id_1 <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_ndb_struct_na_...>
_ndb_struct_na_base_pair_step.j_label_seq_id_2 <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_ndb_struct_na_...>
_pdbx_atom_site_aniso_tls.label_seq_id <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_pdbx_atom_site...>
_pdbx_domain_range.beg_label_seq_id <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_pdbx_domain_ra...>
_pdbx_domain_range.end_label_seq_id <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_pdbx_domain_ra...>
_pdbx_feature_monomer.label_seq_id <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_pdbx_feature_m...>
_pdbx_refine_component.label_seq_id <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_pdbx_refine_co...>
_pdbx_remediation_atom_site_mapping.label_seq_id <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_pdbx_remediati...>
_pdbx_sequence_range.beg_label_seq_id <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_pdbx_sequence_...>
_pdbx_sequence_range.end_label_seq_id <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_pdbx_sequence_...>
_pdbx_struct_chem_comp_diagnostics.seq_num <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_pdbx_struct_ch...>
_pdbx_struct_chem_comp_feature.seq_num <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_pdbx_struct_ch...>
_pdbx_struct_conn_angle.ptnr1_label_seq_id <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_pdbx_struct_co...>
_pdbx_struct_conn_angle.ptnr2_label_seq_id <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_pdbx_struct_co...>
_pdbx_struct_conn_angle.ptnr3_label_seq_id <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_pdbx_struct_co...>
_pdbx_struct_group_component_range.end_label_seq_id <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_pdbx_struct_gr...>
_pdbx_struct_group_components.label_seq_id <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_pdbx_struct_gr...>
_pdbx_struct_mod_residue.label_seq_id <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_pdbx_struct_mo...>
_pdbx_struct_sheet_hbond.range_1_label_seq_id <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_pdbx_struct_sh...>
_pdbx_struct_sheet_hbond.range_2_label_seq_id <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_pdbx_struct_sh...>
_struct_conf.beg_label_seq_id <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_struct_conf.be...>
_struct_conf.end_label_seq_id <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_struct_conf.en...>
_struct_conn.pdbx_ptnr3_label_seq_id <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_struct_conn.pd...>
_struct_conn.ptnr1_label_seq_id <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_struct_conn.pt...>
_struct_conn.ptnr2_label_seq_id <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_struct_conn.pt...>
_struct_mon_nucl.label_seq_id <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_struct_mon_nuc...>
_struct_mon_prot.label_seq_id <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_struct_mon_pro...>
_struct_mon_prot_cis.label_seq_id <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_struct_mon_pro...>
_struct_mon_prot_cis.pdbx_label_seq_id_2 <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_struct_mon_pro...>
_struct_sheet_hbond.range_1_beg_label_seq_id <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_struct_sheet_h...>
_struct_sheet_hbond.range_1_end_label_seq_id <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_struct_sheet_h...>
_struct_sheet_hbond.range_2_beg_label_seq_id <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_struct_sheet_h...>
_struct_sheet_hbond.range_2_end_label_seq_id <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_struct_sheet_h...>
_struct_sheet_range.beg_label_seq_id <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_struct_sheet_r...>
_struct_sheet_range.end_label_seq_id <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_struct_sheet_r...>
_struct_site_gen.label_seq_id <http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_struct_site_ge...>
On Jul 8, 2020, at 4:27 AM, Marcin Wojdyr <wojdyr@gmail.com> wrote:
Hi Greg,
Adding an atom_site.label_seq_id isn't different from supplying a residue number in a PDB file. When there are adjacent residues of the same type, does the PDB reader see a duplicate atom and generate a new residue? Merge the residues? Generate an error? I haven't tested the PDB reader, but a residue number helps it too.
The sequence number/id from the PDB format tells which atoms are in the same residue, but it doesn't imply connectivity between residues, because the numbers don't need to be consecutive. In the mmCIF format it is stored as _atom_site.auth_seq_id (+pdbx_PDB_ins_code for the full id). So this makes conversion from PDB to mmCIF problematic. If SEQRES is present, I do sequence alignment to determine label_seq_id. If SEQRES is missing, I could ask the user to supply the full sequence, but then the user may think that (since this was not obligatory when working with PDB files) moving to mmCIF is a step backward. I could infer gaps and increase label_seq_id by 2 if there is a gap, but the resulting mmCIF file can be used for any purpose, not only by Chimera. The apparent gap may actually be caused by misplaced atoms, and it could turn out that the gap in numbering causes different problems later. It's not clear to me what's better here - leaving label_seq_id null or filling it with the best guesses (which sometimes will be wrong).
Marcin

Hi Tom, I don't want to continue it for too long, but I need to comment on this:
That documentation says it is used by the 60 mmCIF tables I copied below -- if you don't include it then none of those tables can refer to a specific residue.
They can - by using different identifiers, the ones inherited from the PDB format. Moreover, if the residue that you refer to is HOH you have to use different identifiers, because label_seq_id is null (.) for ligands and waters. It cannot be used to refer to a specific water molecule. Marcin

On 7/8/2020 12:06 PM, Marcin Wojdyr wrote:
Hi Tom,
I don't want to continue it for too long, but I need to comment on this:
That documentation says it is used by the 60 mmCIF tables I copied below -- if you don't include it then none of those tables can refer to a specific residue.
They can - by using different identifiers, the ones inherited from the PDB format. Moreover, if the residue that you refer to is HOH you have to use different identifiers, because label_seq_id is null (.) for ligands and waters. It cannot be used to refer to a specific water molecule.
Marcin
Yes, that is a major failing of the mmCIF format. If the atom_site.label_seq_id were non-null for solvent, then all of the lines in the atom_site table would be uniquely keyed by the label_* values, like a database table -- that would be fantastic for us computer science types. ChimeraX uses the atom_site.auth_seq_id to disambiguate the solvent. The PDBe is thinking of adding another column to the atom_site table for the same purpose. It appears that it would be fairly easy for the PDB to fix this. The mmCIF dictionary description for atom_site.label_seq_id would have to be updated to allow values for solvent. The current description is:
This data item is a pointer to _entity_poly_seq.num in the ENTITY_POLY_SEQ category.
That could be amended. As far as I can tell, CIF validators do not use the description when validating CIF files, so the current limitation is not a formal requirement, just a convention. The problem is all of the programs the PDB uses.
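The keying problem can be shown with a small Python sketch. This is an illustration only and not ChimeraX's actual code: the rows are made-up data, the dictionaries stand in for parsed atom_site rows, and label_only_key and residue_key are hypothetical helper names. It assumes, as described above, that label_seq_id is "." for waters and that auth_seq_id is used as the fallback.

```python
def label_only_key(row):
    """Residue key built from label_* values only."""
    return (row["label_asym_id"], row["label_seq_id"])


def residue_key(row):
    """Residue key that falls back to auth_seq_id when label_seq_id is null."""
    key = label_only_key(row)
    if row["label_seq_id"] == ".":
        key += (row["auth_seq_id"],)
    return key


# Made-up rows: one polymer residue and two water molecules.
rows = [
    {"label_asym_id": "A", "label_seq_id": "1", "auth_seq_id": "10"},
    {"label_asym_id": "C", "label_seq_id": ".", "auth_seq_id": "301"},
    {"label_asym_id": "C", "label_seq_id": ".", "auth_seq_id": "302"},
]

# With label_* values alone the two waters collapse into one key;
# with the auth_seq_id fallback every residue gets its own key.
label_keys = {label_only_key(r) for r in rows}
full_keys = {residue_key(r) for r in rows}
```

Here label_keys holds two distinct keys for three residues, while full_keys holds three, which is why a non-null label_seq_id for solvent would make the fallback unnecessary.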
participants (3)
- Greg Couch
- Marcin Wojdyr
- Tom Goddard