Introduction

Human mitochondrial DNA (mtDNA) has several characteristics that make it an invaluable tool for population genetic studies, such as a high copy number, a small size (≈16,500 bp) and a higher mutation rate than nuclear DNA. Furthermore, mtDNA is maternally inherited without recombination, which allows the evolutionary history of populations to be reconstructed (Ballard and Whitlock, 2004).

In 1981, the complete sequence of human mtDNA was published for the first time (Anderson et al, 1981). Since then, populations from almost every region of the world have been studied from the mtDNA point of view. Comparison of these large mtDNA data sets has made it possible to construct a robust phylogenetic tree (Torroni et al, 2006; van Oven and Kayser, 2009) and to estimate the global distribution and origin of each human mtDNA lineage (Cann et al, 1987; Ingman et al, 2000; Maca-Meyer et al, 2001; Richards et al, 2000).

MtDNA analysis has also become a useful tool in forensic genetics, as its mode of inheritance allows testing for a putative exclusion scenario in human identification. In addition, when only very limited or severely degraded DNA is present in a sample, mtDNA constitutes the last chance for successful DNA typing (Parson and Bandelt, 2007).

However, comparing published data is frequently complicated because mtDNA results can appear in two different formats: haplotypes (the mutations detected with respect to a reference sequence) and nucleotide sequences. Manual transformation between formats is time-consuming, complex and error-prone. Moreover, some data analyses, such as haplogroup classification or the search for matches between populations, require haplotype data, whereas others, such as genetic diversity calculations, are designed for nucleotide sequences. Although some data analysis software, such as Arlequin (Excoffier and Lischer, 2010), accepts both formats, transformation between them is usually still needed because published mtDNA results can appear in either one.

HaploSearch transforms haplotype data into sequence data, and vice versa, quickly and easily, allowing fast and reliable data comparison. The program accepts both partial and complete mtDNA sequences, and recognises substitutions (transitions and transversions), heteroplasmies and indels (insertions and deletions).

Although HaploSearch was designed to analyse mtDNA sequences, it is suitable for transforming haplotypes and sequences from any kind of DNA source. The program only requires a reference sequence against which the information is extracted, such as the revised Cambridge Reference Sequence (CRS) for mtDNA (Andrews et al, 1999).

Data Format

Sequences

Sequences must be entered in the commonly used FASTA format, following the IUPAC code (Cornish-Bowden, 1985). Using this format in HaploSearch allows you to take complete and partial mtDNA data directly from the main databases (such as GenBank, mtDB...). To be analysed correctly, all sequences must be equal in length, so they must first be aligned with the reference sequence, following the required guidelines (Bandelt and Parson, 2008). Each sequence must be written continuously, without spaces or line breaks. For this reason, it is advisable to review the sequences after performing the alignment, as some alignment programs insert line breaks into the sequence. Spaces and line breaks can easily be removed with the "Replace" tool available in most text processors.

As an example:

>CRS
AAAACCCCCTCCCC-ATGCC
>SEC1
AAAACCCCCCCCCCCATGCC
>SEC2
AAACCCCCCTCCCC-ATGCC
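
The clean-up described above (joining wrapped lines, removing spaces and checking that all aligned sequences are equally long) could be sketched in Python as follows. This is only an illustration and not part of HaploSearch; the function names read_fasta and check_equal_length are hypothetical, and the CRS record is wrapped over two lines on purpose to mimic the output of some alignment programs.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs.

    Sequence lines that were wrapped over several lines are joined,
    and spaces inside a sequence are removed.
    """
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:].strip(), ""])
        elif records:
            records[-1][1] += line.replace(" ", "")
    return [(name, seq) for name, seq in records]


def check_equal_length(records):
    """Raise ValueError if the aligned sequences differ in length."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) > 1:
        raise ValueError(f"sequences differ in length: {sorted(lengths)}")


example = """>CRS
AAAACCCCCT
CCCC-ATGCC
>SEC1
AAAACCCCCCCCCCCATGCC
>SEC2
AAACCCCCCTCCCC-ATGCC
"""

records = read_fasta(example)
check_equal_length(records)
for name, seq in records:
    print(f">{name}\n{seq}")
```

Printing the records back, as in the last two lines, gives a file in which every sequence occupies a single line.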

Haplotypes

Mutations in a haplotype must be listed in ascending order of position and separated by spaces. If a sequence is exactly the same as the CRS reference, its haplotype is CRS. This designation for non-mutated sequences can be changed when other DNA types are studied.

In HaploSearch, mutations can be written in two formats: "Population Genetics Nomenclature" and "Forensic Genetics Nomenclature" (following the DNA Commission of the International Society for Forensic Genetics recommendations as detailed in Carracedo et al. (2000)).

Point mutations

Point mutations arise when a single nucleotide is exchanged for another (Freese, 1959a) with respect to the CRS (or another reference sequence). These changes are classified as transitions or transversions (Freese, 1959b).

Transition

A transition is a mutation that changes a purine to another purine (A↔G) or a pyrimidine to another pyrimidine (C↔T). This is the most common type of mutation and, in the "Population Genetics Nomenclature", it is designated by the nucleotide position alone:

     0000000001111
     1234567890123
CRS  CGACCCCTGTATC
SEC1 CGACCCTTGTGTC

In this example, the haplotype would be SEC1: 7 11, showing that SEC1 has two transitions, at positions 7 and 11.

However, in the "Forensic Genetics Nomenclature", the haplotype must be designated by the nucleotide position and the mutated base. In this case, it would be SEC1: 7T 11G.

Transversion

A transversion is the substitution of a purine for a pyrimidine or vice versa. In both haplotype formats, transversions are designated by the nucleotide position and the changed base:

     0000000001111
     1234567890123
CRS  CGACCCCTGTATC
SEC1 CGCCCCCTTTATC

Thus, the haplotype would be SEC1: 3C 9T, showing that SEC1 has one transversion to cytosine at position 3 and one transversion to thymine at position 9.
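
The naming rules for point mutations can be summarised in a short Python sketch. It is an illustration only, not the HaploSearch implementation: the names is_transition and name_substitutions are hypothetical, and the code assumes two sequences of equal length without gaps.

```python
def is_transition(ref, alt):
    """True for purine<->purine (A, G) or pyrimidine<->pyrimidine (C, T) changes."""
    return {ref, alt} in ({"A", "G"}, {"C", "T"})


def name_substitutions(reference, sequence, start=1, forensic=False):
    """Name every point mutation of `sequence` with respect to `reference`.

    Population nomenclature: transitions are written as the position alone,
    transversions as position + new base.
    Forensic nomenclature: always position + new base.
    """
    names = []
    for offset, (ref, alt) in enumerate(zip(reference, sequence)):
        if ref == alt:
            continue
        position = start + offset
        if is_transition(ref, alt) and not forensic:
            names.append(str(position))
        else:
            names.append(f"{position}{alt}")
    return " ".join(names)


crs = "CGACCCCTGTATC"
print(name_substitutions(crs, "CGACCCTTGTGTC"))                 # 7 11
print(name_substitutions(crs, "CGACCCTTGTGTC", forensic=True))  # 7T 11G
print(name_substitutions(crs, "CGCCCCCTTTATC"))                 # 3C 9T
```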

Heteroplasmy

The presence of more than one mtDNA haplotype in a sample is referred to as heteroplasmy. This phenomenon may be due to differential segregation of pre-existing heteroplasmic variants, to the accumulation of new somatic mutations, or to a combination of both.

In these situations, a single symbol must be used to designate the set of possible nucleotides at a single position (Table 1).

Table 1 - The IUPAC nucleotide code (Cornish-Bowden, 1985)
IUPAC code   Base
A            Adenine
C            Cytosine
G            Guanine
T            Thymine
R            A or G
Y            C or T
S            G or C
W            A or T
K            G or T
M            A or C
B            C or G or T
D            A or G or T
H            A or C or T
V            A or C or G
N            any base

In both haplotype formats, heteroplasmies are designated by the nucleotide position and the corresponding IUPAC nucleotide code:

     0000000001111
     1234567890123
CRS  CGACCCCTGTATC
SEC1 CGACCCCTGTKTC

Thus, the haplotype would be SEC1: 11K, showing that SEC1 is heteroplasmic at position 11, where nucleotides G and T are both present.
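
Table 1 translates directly into a lookup table. The Python sketch below is illustrative only; the names IUPAC_CODES, bases_for and code_for are hypothetical. It shows how a code such as K can be expanded into its bases and how a set of observed bases can be turned back into a single code.

```python
# IUPAC nucleotide codes from Table 1; member bases are kept in alphabetical order.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def bases_for(code):
    """Return the set of bases represented by one IUPAC code."""
    return set(IUPAC_CODES[code])


def code_for(bases):
    """Return the IUPAC code that represents exactly the given bases."""
    wanted = "".join(sorted(set(bases)))
    for code, members in IUPAC_CODES.items():
        if members == wanted:
            return code
    raise ValueError(f"no IUPAC code for bases {wanted!r}")


print(sorted(bases_for("K")))  # ['G', 'T']
print(code_for("TG"))          # K
```

The lookup is deliberately case-sensitive, so that the upper-case code D (A, G or T) is never confused with the lower-case "d" used for deletions.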

Indels

The term indel covers insertions and deletions; these two types of mutation are often considered together because they cannot be distinguished when two sequences are simply compared with each other. This problem does not arise when sequences are compared with a reference: insertions add one or more extra nucleotides to the DNA with respect to the reference, and deletions remove one or more nucleotides from the DNA with respect to the reference. Because of indels, sequences must be aligned before using HaploSearch in order to obtain a correct haplotype. An alignment program such as ClustalW is recommended for this step.

Insertions

As insertions add one or more extra nucleotides, gaps must be introduced into the reference sequence to maintain the alignment.

     0000000--001-111
     1234567--890-123
CRS  CGACCCC--TGT-ATC
SEC1 CGACCCCCCTGTCATC

To name insertions in the "Population Genetics Nomenclature", you must give the base position after which the insertion has occurred, followed by the letter "i" and the inserted bases. In the above example, the haplotype would be "SEC1: 7iCC 10iC".

In the "Forensic Genetics Nomenclature", each inserted base is named separately by first noting the site immediately before the insertion, followed by a decimal point and a '1' (for the first inserted base), a '2' (if there is a second inserted base), and so on, and then by the nucleotide that is inserted. In the above example, the haplotype would be SEC1: 7.1C 7.2C 10.1C (Carracedo et al, 2000).
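
As a rough sketch of how insertions could be read from an aligned pair, the Python code below walks along the alignment and records the bases that fall in gap columns of the reference. It is not taken from HaploSearch; the function names find_insertions and name_insertions are hypothetical, and substitutions and deletions are ignored here.

```python
def find_insertions(reference, sequence, start=1):
    """Map each reference position to the bases inserted right after it."""
    runs = {}
    position = start - 1  # last reference position seen so far
    for ref, base in zip(reference, sequence):
        if ref != "-":
            position += 1          # ordinary column: advance along the reference
        elif base != "-":
            runs[position] = runs.get(position, "") + base
    return runs


def name_insertions(reference, sequence, start=1, forensic=False):
    """Name insertions in population (7iCC) or forensic (7.1C 7.2C) style."""
    names = []
    for position, bases in find_insertions(reference, sequence, start).items():
        if forensic:
            names += [f"{position}.{n}{b}" for n, b in enumerate(bases, 1)]
        else:
            names.append(f"{position}i{bases}")
    return " ".join(names)


crs = "CGACCCC--TGT-ATC"
sec1 = "CGACCCCCCTGTCATC"
print(name_insertions(crs, sec1))                 # 7iCC 10iC
print(name_insertions(crs, sec1, forensic=True))  # 7.1C 7.2C 10.1C
```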

Deletions

As deletions remove one or more nucleotides from the DNA, gaps must be introduced into the studied sequence to maintain the alignment.

     0000000001111
     1234567890123
CRS  CGACCCCTGTATC
SEC1 CGACC--TGT-TC

To designate deletions in the "Population Genetics Nomenclature", you must write the position of the first base of the gap, followed by the letter "d" and the deleted bases. In the above example, the haplotype would be SEC1: 6dCC 11dA.

In the "Forensic Genetics Nomenclature", deletions are recorded by listing each missing site followed by 'del'. In the example, it would be SEC1: 6del 7del 11del.
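
Deletions can be read from the alignment in a similar way, by grouping consecutive gap columns of the studied sequence. The following Python sketch is illustrative only; find_deletions, name_deletions and the suffix parameter are hypothetical names, and other mutation types are ignored.

```python
def find_deletions(reference, sequence, start=1):
    """Return [first_position, deleted_bases] for each run of deleted sites."""
    runs = []
    position = start - 1
    previous_deleted = False
    for ref, base in zip(reference, sequence):
        if ref == "-":
            continue               # gap in the reference: no reference position here
        position += 1
        if base == "-":
            if previous_deleted:
                runs[-1][1] += ref  # extend the current run
            else:
                runs.append([position, ref])
        previous_deleted = base == "-"
    return runs


def name_deletions(reference, sequence, start=1, forensic=False, suffix="del"):
    """Name deletions in population (6dCC) or forensic (6del 7del) style."""
    names = []
    for first, bases in find_deletions(reference, sequence, start):
        if forensic:
            names += [f"{first + k}{suffix}" for k in range(len(bases))]
        else:
            names.append(f"{first}d{bases}")
    return " ".join(names)


crs = "CGACCCCTGTATC"
sec1 = "CGACC--TGT-TC"
print(name_deletions(crs, sec1))                 # 6dCC 11dA
print(name_deletions(crs, sec1, forensic=True))  # 6del 7del 11del
```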

Input Data Files

Input data files can be written in any text processor, as long as the file is saved as a txt file. However, if a text processor with autocorrection tools (such as Microsoft Word or OpenOffice Writer) is used, this function must be disabled to avoid modifications that could affect HaploSearch. Indels are especially prone to being affected by autocorrection, as consecutive hyphens may be replaced by a single character. This can cause the loss of the alignment and, sometimes, introduce characters that HaploSearch does not recognise. It is therefore advisable to disable autocorrection or to use a plain-text editor.

Transforming sequences into haplotypes

The input file for transforming sequences into haplotypes must be a txt file containing the aligned sequences in FASTA format. In addition, the reference sequence and the position number of its first nucleotide must be indicated as follows:

  • The first line must give the nucleotide position of the first base of the reference sequence in the following format: START: ##. This position is 1 for complete sequences, or the corresponding position number for partial sequences.
  • The second line must contain the reference sequence in FASTA format. The reference sequence must be named ">reference_name", which would be >CRS for mtDNA or the name of any consensus sequence for other DNA types.
  • In the following lines, the studied sequences must be entered in FASTA format.

Example

START: 16180
>CRS
AAAACCCCCTCCCCATGCC
>SEC1
AAAACCCCCCCCCCATGCC
>SEC2
AAACCCCCCTCCCCATGCC

When sequences include indels, they have to be aligned for a correct HaploSearch analysis:

START: 16180
>CRS
AAAACCCCCTCCCC-ATGCC
>SEC1
AAAACCCCCCCCC--ATGCC
>SEC2
AAACCCCCCTCCCCCATGCC
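
The layout of this input file (a START line, then the reference, then the studied sequences) can be checked with a few lines of Python. This is a minimal sketch under the assumption that every sequence is written on a single line, as required above; parse_input is a hypothetical helper and not a HaploSearch function.

```python
def parse_input(text):
    """Split an input file into (start, reference, samples).

    `reference` is a (name, sequence) pair and `samples` a list of such pairs.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    label, _, value = lines[0].partition(":")
    if label.strip().upper() != "START":
        raise ValueError("first line must be 'START: <position>'")
    start = int(value)

    names, sequences = lines[1::2], lines[2::2]
    if len(names) != len(sequences) or not all(n.startswith(">") for n in names):
        raise ValueError("expected alternating '>name' and sequence lines")
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("aligned sequences must be equal in length")

    records = [(name[1:], seq) for name, seq in zip(names, sequences)]
    return start, records[0], records[1:]


example = """START: 16180
>CRS
AAAACCCCCTCCCC-ATGCC
>SEC1
AAAACCCCCCCCC--ATGCC
>SEC2
AAACCCCCCTCCCCCATGCC
"""

start, reference, samples = parse_input(example)
print(start, reference[0], [name for name, _ in samples])
```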

Transforming haplotypes into sequences

The input file for transforming haplotypes into sequences is similar to the previous one, but contains haplotype data in either the Population Genetics or the Forensic Genetics nomenclature. As in the previous file, the reference sequence and the position number of its first nucleotide must be indicated:

  • The first line must give the nucleotide position of the first base of the reference sequence in the following format: START: ##. This position is 1 for complete sequences, or the corresponding position number for partial sequences.
  • The second line must contain the reference sequence in FASTA format. The reference sequence must be named ">reference_name", which would be >CRS for mtDNA or the name of any consensus sequence for other DNA types.
  • In the following lines, the haplotypes must be written in a similar way to the FASTA format. When a sequence does not contain any mutation, its haplotype is the reference name. For example, when an mtDNA sequence is identical to the CRS, its haplotype is "CRS".

Example

START: 16180
>CRS
AAAACCCCCTCCCCATGCC
>SEC1
16189
>SEC2
16183C 16189 16193dC
>SEC3
CRS
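
To make the opposite transformation concrete, the sketch below applies a Population Genetics haplotype to the reference and returns the resulting sequence without gaps. It is only an illustration of the notation, not the HaploSearch algorithm: apply_haplotype is a hypothetical name, the reference is assumed to contain no gaps, and unusual combinations (for example a transition and an insertion named at the same position in the wrong order) are not handled.

```python
import re

TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}
# position, then either i/d + bases (indel) or a single base/IUPAC code, or nothing
TOKEN = re.compile(r"(\d+)(?:([id])([A-Z]+)|([A-Z]))?")


def apply_haplotype(reference, haplotype, start, reference_name="CRS"):
    """Rebuild the (ungapped) sequence described by a population haplotype."""
    if haplotype.strip() == reference_name:
        return reference
    slots = list(reference)  # one slot per reference position
    for token in haplotype.split():
        match = TOKEN.fullmatch(token)
        if match is None:
            raise ValueError(f"unrecognised mutation: {token!r}")
        index = int(match.group(1)) - start
        kind, run, base = match.group(2), match.group(3), match.group(4)
        if kind == "i":
            slots[index] += run              # inserted bases follow this position
        elif kind == "d":
            for k in range(len(run)):
                slots[index + k] = ""        # remove the deleted positions
        elif base:
            slots[index] = base              # transversion or heteroplasmy
        else:
            slots[index] = TRANSITION[slots[index]]  # position alone: transition
    return "".join(slots)


crs = "AAAACCCCCTCCCCATGCC"
print(apply_haplotype(crs, "16189", 16180))
print(apply_haplotype(crs, "16183C 16189 16193dC", 16180))
print(apply_haplotype(crs, "CRS", 16180))
```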

Output Data Files

HaploSearch output data files have the same format as the input data file for the opposite transformation. This feature allows you to recover the original data from the output file, and so to check whether any mistakes were introduced during data manipulation and whether HaploSearch has worked properly.

If the input file is as follows:

START: 16180
>CRS
AAAACCCCCTCCCC--ATGCTTACAAGCAAGTACAGCAATCAACCCTCAA
>SEC1
AAACCCCTCCCCCCCCATGCTTACAAGCAAGTACAGCAATCAACCTTCAA
>SEC2
AAAACCCCCTCCCC--ATGCTTACAAGCAAGTACAGCAATCAACCCCCAA
>SEC3
AAAACCCCCCCC----ATGCTTACAAGCAAGTACAGCAATCAACCCTCAA

The output file for "Population Genetics Nomenclature" will be:

START: 16180
>CRS
AAAACCCCCTCCCCATGCTTACAAGCAAGTACAGCAATCAACCCTCAA
>SEC1
16183C 16187 16189 16193iCC 16223
>SEC2
16224
>SEC3
16189 16192dCC

Or this one, for "Forensic Genetics Nomenclature":

START: 16180
>CRS
AAAACCCCCTCCCCATGCTTACAAGCAAGTACAGCAATCAACCCTCAA
>SEC1
16183C 16187T 16189C 16193.1C 16193.2C 16223T
>SEC2
16224C
>SEC3
16189C 16192d 16193d

If these output files are then used as input files, the original data are recovered.
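
The two output files above differ only in how each mutation is written, so one notation can be derived from the other when the reference is known. The Python sketch below converts a Population Genetics haplotype into the Forensic Genetics one; it is a hypothetical helper (population_to_forensic), not HaploSearch code, and it assumes an ungapped reference and "d" as the deletion suffix, as in the output shown above.

```python
import re

TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}


def population_to_forensic(haplotype, reference, start, deletion_suffix="d"):
    """Rewrite a population-style haplotype in forensic style."""
    out = []
    for token in haplotype.split():
        indel = re.fullmatch(r"(\d+)([id])([A-Z]+)", token)
        if indel:
            position, kind, bases = int(indel.group(1)), indel.group(2), indel.group(3)
            if kind == "i":
                # one entry per inserted base: 16193.1C 16193.2C ...
                out += [f"{position}.{n}{b}" for n, b in enumerate(bases, 1)]
            else:
                # one entry per deleted site: 16192d 16193d ...
                out += [f"{position + k}{deletion_suffix}" for k in range(len(bases))]
        elif token.isdigit():
            # a bare position is a transition: add the mutated base explicitly
            ref_base = reference[int(token) - start]
            out.append(token + TRANSITION[ref_base])
        else:
            out.append(token)  # already carries its base, or is the reference name
    return " ".join(out)


crs = "AAAACCCCCTCCCCATGCTTACAAGCAAGTACAGCAATCAACCCTCAA"
print(population_to_forensic("16183C 16187 16189 16193iCC 16223", crs, 16180))
print(population_to_forensic("16224", crs, 16180))
print(population_to_forensic("16189 16192dCC", crs, 16180))
```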

Important information about data format

Alignment

HaploSearch recognises the indels that are determined by the aligned input sequences. When sequences containing indels are aligned by alignment programs, the gaps are not always placed in the same position as in the commonly used nomenclature.

For instance, SEC1 has three inserted Cs between mtDNA positions 301 and 320:

START: 301
>rCRS
AACCCCCCCTCCCCCGC
>SEC1
AACCCCCCCCCTCCCCCCGC

As there are several Cs in this region, the alignment can be shown in several ways, all of them correct. However, each one gives rise to a different haplotype:

rCRS: AA--CCCCCCCT-CCCCCGC 302iCC 310iC / 302.1C 302.2C 310.1C
SEC1: AACCCCCCCCCTCCCCCCGC

rCRS: AACCCCCCC--TCCCCC-GC 309iCC 315iC / 309.1C 309.2C 315.1C
SEC1: AACCCCCCCCCTCCCCCCGC

rCRS: AAC--CCCCCCTC-CCCCGC 303iCC 311iC / 303.1C 303.2C 311.1C
SEC1: AACCCCCCCCCTCCCCCCGC

We do not know which mutational event caused these insertions, so all of these alignments are possible. However, certain indels are conventionally named in a particular way. In the above example, the accepted nomenclature would be 309iCC 315iC or 309.1C 309.2C 315.1C. This problem can be overcome by checking the alignment before the HaploSearch analysis (for instance, using a sequence editor such as BioEdit) and placing the variable indels in the most commonly used position. The output file can also be modified afterwards. For alignment guidelines see Bandelt and Parson (2008).

Partial sequences (Population Genetics nomenclature)

Sometimes, in population genetic studies in which only hypervariable region I (HVRI) is analysed (positions 16024 - 16365), the leading 16 of the 16### notation is omitted for clarity. For example, haplotype SEQ1: 16069 16126 would be written SEQ1: 069 126.

If you want to use this notation in HaploSearch, you should tick the corresponding option "I am only analysing HVRI" in the main form.

If you use the full notation, with the following input:

START: 16090
>CRS
TATTTCGTACATTACTGCCAGCCACCATGA
>SEQ1
TATCTCGTACATTACTGCCAGACACCATGA

The output would be:

START: 16090
>CRS
TATTTCGTACATTACTGCCAGCCACCATGA
>SEQ1
16093 16111A

On the other hand, if you omit the leading 16 and tick the option mentioned above, the output would be:

START: 90
>CRS
TATTTCGTACATTACTGCCAGCCACCATGA
>SEQ1
093 111A

This kind of notation is only possible for the Population Genetics Nomenclature.
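
The relation between the full and the shortened HVRI notation is a simple text substitution, as the following Python sketch shows. The helper names shorten_hvri and expand_hvri are hypothetical, and the code assumes haplotypes in the Population Genetics Nomenclature whose positions all lie in the 16### range.

```python
import re


def shorten_hvri(haplotype):
    """Drop the leading 16 from every position: '16093 16111A' -> '093 111A'."""
    short = []
    for token in haplotype.split():
        match = re.fullmatch(r"16(\d{3})(\D.*)?", token)
        short.append(match.group(1) + (match.group(2) or "") if match else token)
    return " ".join(short)


def expand_hvri(haplotype):
    """Restore the leading 16: '093 111A' -> '16093 16111A'."""
    full = []
    for token in haplotype.split():
        match = re.fullmatch(r"(\d{3})(\D.*)?", token)
        full.append("16" + token if match else token)
    return " ".join(full)


print(shorten_hvri("16093 16111A"))  # 093 111A
print(expand_hvri("093 111A"))       # 16093 16111A
```

Tokens that do not start with a position, such as the reference name CRS, are left unchanged by both functions.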

Nomenclature of deletions (Forensic Genetics nomenclature)

As recommended by the EMPOP database, deletions are named "del" in HaploSearch (see the "Indels" section). However, Carracedo et al. (2000) recommend the use of "d" instead of "del". For this reason, "d" is used by default in the output file. Be aware that it is extremely important to use "D" for the heteroplasmy consisting of a mixture of A, G and T (following the IUPAC code) and "d" for deletions.
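
The practical consequence of this rule is that forensic haplotypes must be read case-sensitively. The short Python sketch below classifies a single forensic mutation name to illustrate the distinction; classify_forensic is a hypothetical helper written for this example and does not describe how HaploSearch parses its input.

```python
import re

# position, optional ".n" insertion index, then "del", "d" or one IUPAC code
FORENSIC = re.compile(r"(\d+)(\.\d+)?(del|d|[ACGTRYSWKMBDHVN])")


def classify_forensic(token):
    """Return the kind of mutation named by one forensic token."""
    match = FORENSIC.fullmatch(token)  # matching is case-sensitive on purpose
    if match is None:
        raise ValueError(f"unrecognised mutation: {token!r}")
    suffix = match.group(3)
    if suffix in ("d", "del"):
        return "deletion"
    if match.group(2):
        return "insertion"
    if suffix in "ACGT":
        return "substitution"
    return "heteroplasmy"


for token in ("16192d", "16192del", "16192D", "16193.1C", "16189C"):
    print(token, classify_forensic(token))
```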