Systematic Descriptions of Variants

Introduction

Describing variants correctly and systematically is an arcane art, but an absolutely essential skill. Internationally agreed guidelines have been developed and refined since the initial proposals in 1993.

Variants may be recorded in terms of alterations with respect to reference genomic DNA, cDNA and protein sequences:

By way of example, there is a relatively common single-base substitution mutation in COL1A1 that results in an amino acid substitution and that can be described as:

It is not sufficient to describe variants only at the protein level because of the degeneracy of the genetic code. Indeed, variants should never be described solely as protein alterations. Ideally, variants should be described in terms of alterations to genomic DNA sequences, but it can be helpful to list all three formats, if they are known. The important point is that the reference sequences need to be clearly defined.

Essential Reading

Specific Issues with Collagen Mutations

Amino Acid Numbering

The triple-helical nature of the mature molecule has tended to dominate people's perceptions of the fibrillar collagens. In addition, the majority of type I and type III collagen variants are single-amino-acid substitutions, usually of the glycines found at every third position throughout the triple helix. The joint consequence of these issues has been the legacy amino acid numbering system for collagen chains, which designates the first glycine of the triple-helical region as amino acid 1. This, of course, ignores the fact that translation of the typical fibrillar collagen chain initiates with a signal peptide, which is then followed by a pro-peptide and a telo-peptide before the start of the triple-helical region. Numbering from the start of the triple helix provides no way to describe variants in, say, the N-pro-peptide.

This historic numbering system is not going to disappear suddenly, however much we might want that. For the time being, it is probably good practice to also describe variants using the legacy system.

COL1A1

The triple-helical region of the alpha1 chain encoded by COL1A1 is 1014 amino acids long. These amino acids correspond to systematically numbered amino acids 179 to 1192, consistent with the combined lengths of the signal, pro- and telo-peptides being 178 for the collagen chain encoded by COL1A1.

Hence, the COL1A1 mutation described at the protein level as p.(Gly257Arg) would also be recorded as Gly79Arg.

COL1A2

The triple-helical region of the alpha2 chain encoded by COL1A2 is 1014 amino acids long. These amino acids correspond to systematically numbered amino acids 91 to 1104, consistent with the combined lengths of the signal, pro- and telo-peptides being 90 for the collagen chain encoded by COL1A2.

Hence, the COL1A2 mutation described at the protein level as p.(Gly280Ser) would also be recorded as Gly190Ser.

COL3A1

The triple-helical region of the alpha1 chain encoded by COL3A1 is 1029 amino acids long. These amino acids correspond to systematically numbered amino acids 168 to 1196, consistent with the combined lengths of the signal, pro- and telo-peptides being 167 for the collagen chain encoded by COL3A1.

Hence, the COL3A1 mutation described at the protein level as p.(Gly489Glu) would also be recorded as Gly322Glu.
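The conversions above all follow the same arithmetic: subtract the combined length of the signal, pro- and telo-peptides from the systematic position. A minimal sketch in Python is given here for illustration. The function name, the dictionary and the regular expression are hypothetical and not part of any published tool; the triple-helix boundaries are those quoted in the text, and only simple substitutions written in the three-letter form p.(Gly257Arg) are handled.

```python
import re

# Systematic (first, last) positions of the triple-helical region,
# as quoted in the text for each gene.
HELIX_REGIONS = {
    "COL1A1": (179, 1192),
    "COL1A2": (91, 1104),
    "COL3A1": (168, 1196),
}

# Matches simple substitutions only, e.g. "p.(Gly257Arg)".
# This pattern is an illustrative assumption, not a full parser.
_SUBSTITUTION = re.compile(r"^p\.\(([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})\)$")


def to_legacy(gene: str, description: str) -> str:
    """Convert a systematic protein-level substitution to legacy numbering."""
    match = _SUBSTITUTION.match(description)
    if match is None:
        raise ValueError(f"Unsupported description: {description!r}")
    ref, position, alt = match.group(1), int(match.group(2)), match.group(3)

    first, last = HELIX_REGIONS[gene]
    if not first <= position <= last:
        # The legacy system cannot number residues outside the triple helix.
        raise ValueError(f"Position {position} is outside the triple helix of {gene}")

    # The first residue of the triple helix is legacy amino acid 1.
    return f"{ref}{position - first + 1}{alt}"


print(to_legacy("COL1A1", "p.(Gly257Arg)"))  # Gly79Arg
print(to_legacy("COL1A2", "p.(Gly280Ser)"))  # Gly190Ser
print(to_legacy("COL3A1", "p.(Gly489Glu)"))  # Gly322Glu
```

The sketch deliberately raises an error for positions outside the triple helix, which mirrors the limitation of the legacy system noted earlier: it has no way to number residues in, say, the N-pro-peptide.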

Exon Numbering

Once the exon/intron structure had been determined for COL1A1 (51 exons), COL1A2 (52 exons) and COL3A1 (51 exons), it became obvious that these genes were evolutionarily closely related. However, it was noted that COL1A1 had a single exon of 108 bp which corresponds to exons 33 and 34 of COL1A2, each of which is 54 bp in length. This single exon in COL1A1, which encodes amino acids 746–781, is usually referred to as "exon 33/34". This designation is non-systematic but is the standard way of directly comparing the locations and effects of mutations in COL1A1 and COL1A2. Similarly, to maximally align the exons of COL3A1 with its type I counterparts, there is a single 114-bp exon designated "4/5" which corresponds to the individual exons 4 and 5 of COL1A1 and COL1A2. The evolutionary basis of this numbering scheme is described by Välkkilä et al., 2001.
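The figures quoted for "exon 33/34" can be checked with simple arithmetic: a 108-bp exon holds 36 codons, which matches the 36 residues spanned by amino acids 746–781 and the two 54-bp exons of COL1A2. The short Python sketch below is for illustration only. The helper names are hypothetical, and it assumes that each exon begins and ends on a complete codon.

```python
def codons_in_exon(length_bp: int) -> int:
    """Number of whole codons in an exon, assuming it starts in frame."""
    if length_bp % 3 != 0:
        raise ValueError("Exon length is not a multiple of three")
    return length_bp // 3


def residues_in_range(first: int, last: int) -> int:
    """Number of residues in an inclusive amino acid range."""
    return last - first + 1


# COL1A1 "exon 33/34": 108 bp encoding amino acids 746-781.
print(codons_in_exon(108), residues_in_range(746, 781))  # 36 36

# COL1A2 exons 33 and 34: 54 bp each.
print(codons_in_exon(54) + codons_in_exon(54))  # 36
```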

There is an additional exon-numbering anomaly which might trip up the unwary. The first cDNA clones that were isolated for the individual chains of human collagen types I and III were part-length and corresponded to the 3′ end of the mRNA. These cDNAs were used as probes to screen genomic DNA libraries. Genomic DNA clones for the 3′ ends of the genes were sequenced and the locations of the exons were determined by alignment with the cDNA sequences. However, the total number of exons in the genes was not initially known, and numbering of the exons originally commenced from the 3′ end of the gene, incrementing towards the 5′ end. What we now refer to as exon 52 of COL1A2 was originally known as exon 1. This has not resulted in too much confusion, but early accounts (from the 1980s) of collagen gene sequence variants do refer to exon numbers which nowadays seem totally at odds with the location of the variant relative to the reference amino acid sequence. For example, see figure 1B of Barsh et al. 1985.
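For COL1A2, the relationship between the two schemes can be sketched as a simple reversal, since exon 52 was originally exon 1. The Python function below is a hypothetical illustration. It assumes that the early 3′-to-5′ numbering counted every one of the 52 exons without gaps, which may not hold for every early publication, so any converted number should be checked against the original paper.

```python
def original_exon_number(modern_exon: int, total_exons: int = 52) -> int:
    """Map a modern (5'-to-3') exon number to the early 3'-to-5' numbering.

    Assumes a strict reversal over all exons; the default of 52 exons
    is the COL1A2 figure given in the text.
    """
    if not 1 <= modern_exon <= total_exons:
        raise ValueError(f"Exon {modern_exon} is outside 1..{total_exons}")
    return total_exons + 1 - modern_exon


print(original_exon_number(52))  # 1
```

Because the mapping is its own inverse, the same function also converts an early exon number back to the modern one under the same assumption.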

COL1A1: download the guide

The COL1A1 gene which encodes the alpha1 chain of type I collagen is located on chromosome 17. Full details of the gene can be found at Ensembl.

The reference genomic DNA has GenBank RefSeqGene accession number NG_007400.1
The reference cDNA has GenBank RefSeq accession number NM_000088.3
The reference protein has GenBank RefSeq accession number NP_000079.2

An annotated guide to the numbering of the cDNA and amino acid sequences and to the location of the exon boundaries is available for download.

COL1A2: download the guide

The COL1A2 gene which encodes the alpha2 chain of type I collagen is located on chromosome 7. Full details of the gene can be found at Ensembl.

The reference genomic DNA has GenBank RefSeqGene accession number NG_007405.1
The reference cDNA has GenBank RefSeq accession number NM_000089.3
The reference protein has GenBank RefSeq accession number NP_000080.2

An annotated guide to the numbering of the cDNA and amino acid sequences and to the location of the exon boundaries is available for download.

COL3A1: download the guide

The COL3A1 gene which encodes the alpha1 chain of type III collagen is located on chromosome 2. Full details of the gene can be found at Ensembl.

The reference genomic DNA has GenBank RefSeqGene accession number NG_007404.1
The reference cDNA has GenBank RefSeq accession number NM_000090.3
The reference protein has GenBank RefSeq accession number NP_000081.1

The type III collagen signal peptide has “historically” been 24 amino acids in length, but is probably actually 23 amino acids long. This shorter length is consistent with the feature table in NM_000090 and with independent analysis of the predicted cleavage point for signal peptides.

The location of the cleavage point for the C-propeptide is incorrect in the feature tables in NM_000090.3 and in NP_000081.1. The locations of these cleavage points have been corrected in the annotated guide.

An annotated guide to the numbering of the cDNA and amino acid sequences and to the location of the exon boundaries is available for download.

CRTAP:

The CRTAP gene which encodes cartilage associated protein is located on chromosome 3. Full details of the gene can be found at Ensembl.

The reference genomic DNA has GenBank RefSeqGene accession number NG_008122.1
The reference cDNA has GenBank RefSeq accession number NM_006371.4
The reference protein has GenBank RefSeq accession number NP_006362.1

An annotated guide to the numbering of the cDNA and amino acid sequences and to the location of the exon boundaries is in preparation.

LEPRE1:

The LEPRE1 gene which encodes leucine proline-enriched proteoglycan (leprecan) 1 is located on chromosome 1. Full details of the gene can be found at Ensembl.

The reference genomic DNA has GenBank RefSeqGene accession number NG_008123.1
The reference cDNA has GenBank RefSeq accession number NM_022356.3
The reference protein has GenBank RefSeq accession number NP_071751.3

An annotated guide to the numbering of the cDNA and amino acid sequences and to the location of the exon boundaries is in preparation.




Last edited: 12 August 2015
Raymond Dalgleish
The views expressed in this document are those of the document owner.