Describing mutations correctly and systematically is an arcane art, but an absolutely essential skill. Internationally agreed guidelines have been developed and refined since the initial proposals in 1993.
Mutations may be recorded in terms of alterations with respect to reference genomic DNA, cDNA and protein sequences:
By way of example, there is a relatively common single-base substitution mutation in COL1A1 that results in an amino acid substitution and can be described as:
It is not sufficient to use a description only at the protein level because of the degeneracy of the genetic code. Indeed, mutations should never be described solely as protein alterations. Ideally, mutations should be described in terms of alterations to genomic DNA sequences, but it can be helpful to list all three formats, if they are known. The important point is that the reference sequences need to be clearly defined.
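The degeneracy argument can be illustrated with a minimal sketch. It assumes the standard genetic code and uses a glycine-to-arginine substitution purely as an illustration; the names GLY_CODONS, ARG_CODONS and single_base_changes are hypothetical and are not part of any guideline.

```python
from itertools import product

# Standard genetic code (DNA alphabet): codons for glycine and arginine.
GLY_CODONS = {"GGT", "GGC", "GGA", "GGG"}
ARG_CODONS = {"CGT", "CGC", "CGA", "CGG", "AGA", "AGG"}


def single_base_changes(ref_codons, alt_codons):
    """List every (ref codon, alt codon, codon position, change) pair
    in which the two codons differ at exactly one base."""
    changes = []
    for ref, alt in product(sorted(ref_codons), sorted(alt_codons)):
        diffs = [i for i in range(3) if ref[i] != alt[i]]
        if len(diffs) == 1:
            i = diffs[0]
            changes.append((ref, alt, i + 1, f"{ref[i]}>{alt[i]}"))
    return changes


GLY_TO_ARG = single_base_changes(GLY_CODONS, ARG_CODONS)

for ref, alt, pos, change in GLY_TO_ARG:
    print(f"{ref} -> {alt}: codon position {pos}, {change}")
```

Six different single-base changes, of two kinds (G>A and G>C at the first codon position), all give the same Gly-to-Arg alteration, so a description at the protein level alone cannot identify the underlying DNA change.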
The triple-helical nature of the mature molecule has tended to dominate people's perceptions of the fibrillar collagens. In addition, the majority of type I and type III collagen mutations are single-amino-acid substitutions, usually of the glycines found at every third position throughout the triple helix. The joint consequence of these issues has been the legacy amino acid numbering system for collagen chains, which designates the first glycine of the triple-helical region as amino acid 1. This, of course, ignores the fact that translation of the typical fibrillar collagen chain initiates with a signal peptide, which is then followed by a pro-peptide and a telo-peptide before the start of the triple-helical region. Numbering from the start of the triple helix provides no way to describe mutations in, say, the N-pro-peptide.
This historic numbering system is not going to disappear suddenly, however much we might want it to. For the time being, it is probably good practice to also describe mutations using the legacy system.
The triple-helical region of the alpha1 chain encoded by COL1A1 is 1014 amino acids long. These amino acids correspond to systematically numbered amino acids 179 to 1192, consistent with the combined lengths of the signal, pro- and telo-peptides being 178 for the collagen chain encoded by COL1A1.
Hence, the COL1A1 mutation described at the protein level as p.Gly257Arg would also be recorded as (Gly79Arg) with the brackets indicating the use of a legacy numbering system.
The triple-helical region of the alpha2 chain encoded by COL1A2 is 1014 amino acids long. These amino acids correspond to systematically numbered amino acids 91 to 1104, consistent with the combined lengths of the signal, pro- and telo-peptides being 90 for the collagen chain encoded by COL1A2.
Hence, the COL1A2 mutation described at the protein level as p.Gly280Ser would also be recorded as (Gly190Ser) with the brackets indicating the use of a legacy numbering system.
The triple-helical region of the alpha1 chain encoded by COL3A1 is 1029 amino acids long. These amino acids correspond to systematically numbered amino acids 168 to 1196, consistent with the combined lengths of the signal, pro- and telo-peptides being 167 for the collagen chain encoded by COL3A1.
Hence, the COL3A1 mutation described at the protein level as p.Gly489Glu would also be recorded as (Gly322Glu) with the brackets indicating the use of a legacy numbering system.
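The conversion from systematic to legacy numbering is a simple subtraction, and can be sketched as follows. The offsets and triple-helix lengths are those given above; the function names to_legacy and legacy_description are hypothetical and chosen only for illustration.

```python
# Combined length of the signal, pro- and telo-peptides (from the text).
HELIX_OFFSET = {"COL1A1": 178, "COL1A2": 90, "COL3A1": 167}

# Length of the triple-helical region in amino acids (from the text).
HELIX_LENGTH = {"COL1A1": 1014, "COL1A2": 1014, "COL3A1": 1029}


def to_legacy(gene, position):
    """Convert a systematic amino acid number to the legacy number,
    which counts the first glycine of the triple helix as 1."""
    legacy = position - HELIX_OFFSET[gene]
    if not 1 <= legacy <= HELIX_LENGTH[gene]:
        # The legacy system cannot describe residues outside the helix.
        raise ValueError(
            f"{gene} residue {position} lies outside the triple-helical region"
        )
    return legacy


def legacy_description(gene, ref, position, alt):
    """Return the bracketed legacy form, e.g. '(Gly79Arg)'."""
    return f"({ref}{to_legacy(gene, position)}{alt})"


print(legacy_description("COL1A1", "Gly", 257, "Arg"))  # (Gly79Arg)
print(legacy_description("COL1A2", "Gly", 280, "Ser"))  # (Gly190Ser)
print(legacy_description("COL3A1", "Gly", 489, "Glu"))  # (Gly322Glu)
```

Raising an error for positions outside the triple helix reflects the limitation noted earlier: the legacy system has no way to describe mutations in, say, the N-pro-peptide.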
Once the exon/intron structure had been determined for COL1A1 (51 exons), COL1A2 (52 exons) and COL3A1 (51 exons), it became obvious that these genes were evolutionarily closely related. However, it was noted that COL1A1 had a single exon of 108 bp that corresponded to exons 33 and 34 of COL1A2, each of which is 54 bp in length. This single exon in COL1A1, which encodes amino acids 746–781, is usually referred to as "exon 33/34". This designation is non-systematic but is routinely used for directly comparing the locations and effects of mutations in COL1A1 and COL1A2. Similarly, to maximally align the exons of COL3A1 with its type I counterparts, there is a single 114-bp exon designated "4/5" that corresponds to the individual exons 4 and 5 of COL1A1 and COL1A2.
There is an additional exon-numbering anomaly that might trip up the unwary. The first cDNA clones that were isolated for the individual chains of human collagen types I and III were part-length and corresponded to the 3′ end of the mRNA. These cDNAs were used as probes to screen genomic DNA libraries. Genomic DNA clones for the 3′ ends of the genes were sequenced and the locations of the exons were determined by alignment with the cDNA sequences. However, the total number of exons in the genes was not initially known, and numbering of the exons originally commenced from the 3′ end of the gene, incrementing towards the 5′ end. What we now refer to as exon 52 of COL1A2 was originally known as exon 1. This has not resulted in too much confusion, but early accounts (from the 1980s) of collagen gene mutations do refer to exon numbers that nowadays seem totally at odds with the location of the mutation relative to the amino acid sequence. For example, see figure 1B of Barsh et al. 1985.
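As a rough sketch, an exon number counted from the 3′ end can be converted to the current 5′-to-3′ number if the total number of exons is known. This assumes that the early numbering was a strict reverse count with no other discrepancies, which the text confirms only for exon 1 (now exon 52) of COL1A2; exon numbers in any particular early paper should be checked against its figures. The function name exon_from_3prime_count is hypothetical.

```python
def exon_from_3prime_count(old_number, total_exons):
    """Convert an exon number counted from the 3' end of the gene
    into the current number counted from the 5' end.

    Assumes a strict reverse count (an assumption, not a documented rule).
    """
    if not 1 <= old_number <= total_exons:
        raise ValueError("exon number out of range")
    return total_exons - old_number + 1


# COL1A2 has 52 exons; the exon originally called 1 is now exon 52.
print(exon_from_3prime_count(1, 52))  # 52
```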
The COL1A1 gene, which encodes the alpha1 chain of type I collagen, is located on chromosome 17. Full details of the gene can be found at Ensembl.
The reference genomic DNA has GenBank RefSeqGene accession number NG_007400
The reference cDNA has GenBank RefSeq accession number NM_000088
The reference protein has GenBank RefSeq accession number NP_000079
An annotated guide to the numbering of the cDNA and amino acid sequences and to the location of the exon boundaries is available for download.
The COL1A2 gene, which encodes the alpha2 chain of type I collagen, is located on chromosome 7. Full details of the gene can be found at Ensembl.
The reference genomic DNA has GenBank RefSeqGene accession number NG_007405
The reference cDNA has GenBank RefSeq accession number NM_000089
The reference protein has GenBank RefSeq accession number NP_000080
An annotated guide to the numbering of the cDNA and amino acid sequences and to the location of the exon boundaries is available for download.
The COL3A1 gene, which encodes the alpha1 chain of type III collagen, is located on chromosome 2. Full details of the gene can be found at Ensembl.
The reference genomic DNA has GenBank RefSeqGene accession number NG_007404
The reference cDNA has GenBank RefSeq accession number NM_000090
The reference protein has GenBank RefSeq accession number NP_000081
The type III collagen signal peptide has “historically” been 24 amino acids in length, but is probably actually 23 amino acids long. This shorter length is consistent with the feature table in NM_000090 and with independent analysis of the predicted cleavage point for signal peptides.
The location of the cleavage point for the C-propeptide is incorrect in the feature tables in NM_000090.3 and in NP_000081.1. The locations of these cleavage points have been corrected in the annotated guide.
An annotated guide to the numbering of the cDNA and amino acid sequences and to the location of the exon boundaries is available for download.
The CRTAP gene, which encodes cartilage associated protein, is located on chromosome 3. Full details of the gene can be found at Ensembl.
The reference genomic DNA has GenBank RefSeqGene accession number NG_008122
The reference cDNA has GenBank RefSeq accession number NM_006371
The reference protein has GenBank RefSeq accession number NP_006362
An annotated guide to the numbering of the cDNA and amino acid sequences and to the location of the exon boundaries is in preparation.
The LEPRE1 gene, which encodes leucine proline-enriched proteoglycan (leprecan) 1, is located on chromosome 1. Full details of the gene can be found at Ensembl.
The reference genomic DNA has GenBank RefSeqGene accession number NG_008123
The reference cDNA has GenBank RefSeq accession number NM_022356
The reference protein has GenBank RefSeq accession number NP_071751
An annotated guide to the numbering of the cDNA and amino acid sequences and to the location of the exon boundaries is in preparation.
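Because the reference sequences need to be clearly defined, it can help to keep the accession numbers listed above in one place and attach them to each description. The following is a minimal sketch: the accession numbers are those given in this document, without version suffixes because none are stated here for most of them, and the names REFERENCE_SEQUENCES and describe are hypothetical.

```python
# Reference sequence accessions as listed in this document (unversioned).
REFERENCE_SEQUENCES = {
    "COL1A1": {"genomic": "NG_007400", "cdna": "NM_000088", "protein": "NP_000079"},
    "COL1A2": {"genomic": "NG_007405", "cdna": "NM_000089", "protein": "NP_000080"},
    "COL3A1": {"genomic": "NG_007404", "cdna": "NM_000090", "protein": "NP_000081"},
    "CRTAP": {"genomic": "NG_008122", "cdna": "NM_006371", "protein": "NP_006362"},
    "LEPRE1": {"genomic": "NG_008123", "cdna": "NM_022356", "protein": "NP_071751"},
}


def describe(gene, level, description):
    """Prefix a mutation description with the reference sequence it
    refers to, in the form 'accession:description'."""
    accession = REFERENCE_SEQUENCES[gene][level]
    return f"{accession}:{description}"


print(describe("COL1A1", "protein", "p.Gly257Arg"))  # NP_000079:p.Gly257Arg
```

In practice a version number should be added to each accession once the exact reference sequence used has been confirmed.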
Last edited: 19 December 2008
Raymond Dalgleish
The views expressed in this document are those of the document owner.