FAQ SeqPHASE

Frequently Asked Questions

Q1: What is SeqPHASE?

SeqPHASE is a web tool that converts FASTA sequence alignments into the input file format required for haplotyping with PHASE (http://stephenslab.uchicago.edu/software.html) and converts PHASE output files back into FASTA. SeqPHASE is particularly geared towards molecular ecologists who use PHASE to infer the haplotypes of nuclear sequence markers from direct sequencing, but it may be useful to others as well.

Q2: How does it work?

Symbols accepted in the alignments are A (adenine), T (thymine), C (cytosine), G (guanine), W (adenine or thymine), Y (cytosine or thymine), K (guanine or thymine), M (adenine or cytosine), S (cytosine or guanine), R (adenine or guanine), N and ? (missing information), and - (indel). After the user hits the Submit button on the web form, SeqPHASE starts by verifying that all sequences have the same length, that they contain only authorized symbols and that all sequences have different names. It then removes constant positions, inventories variable positions for which more than two possible nucleotides are found and creates up to three files: one .inp file necessary to run PHASE, one .known file detailing which phases are known (if any) and one .const file recording the constant positions that were removed from the alignment (if any). Since PHASE does not accept letters for multistate characters, nucleotides are written into the .inp file as numbers based on alphabetical order as a mnemonic: ? or -1 for missing information (depending on whether the position displays two or more than two different nucleotides, respectively), 0 for indel, 1 for A, 2 for C, 3 for G and 4 for T. Depending on the properties of the dataset submitted, a suitable command syntax for running PHASE is suggested on the output webpage; for more information and other possible options, please refer to the PHASE 2.1 documentation (http://stephenslab.uchicago.edu/instruct2.1.pdf).
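The checks and the numeric coding described above can be illustrated with a minimal Python sketch. This is not SeqPHASE's actual code: the function names (validate, variable_positions, encode) are hypothetical, the sketch works on fully resolved haplotypes only, and it deliberately leaves out missing data (N and ?) and IUPAC ambiguity codes.

```python
# Hypothetical sketch of the Step 1 checks and numeric coding (not SeqPHASE's code).
ALLOWED = set("ATCGWYKMSRN?-")
# Codes from the text: 0 for indel, then A, C, G, T in alphabetical order.
CODES = {"-": "0", "A": "1", "C": "2", "G": "3", "T": "4"}


def validate(records):
    """Check a list of (name, sequence) tuples the way the text describes."""
    names = [name for name, _ in records]
    if len(set(names)) != len(names):
        raise ValueError("duplicate sequence names")
    if len({len(seq) for _, seq in records}) > 1:
        raise ValueError("sequences have different lengths")
    for name, seq in records:
        bad = set(seq.upper()) - ALLOWED
        if bad:
            raise ValueError(f"{name}: unauthorized symbols {sorted(bad)}")


def variable_positions(haplotypes):
    """Return the 1-based positions at which the haplotypes differ."""
    columns = zip(*haplotypes)
    return [i for i, col in enumerate(columns, start=1) if len(set(col)) > 1]


def encode(haplotype, positions):
    """Write the symbols found at the given positions as space-separated numbers."""
    return " ".join(CODES[haplotype[p - 1]] for p in positions)


haplotypes = ["ACGT-A", "ACTTAA"]
validate([("indiv1a", haplotypes[0]), ("indiv1b", haplotypes[1])])
sites = variable_positions(haplotypes)      # positions 3 and 5 vary
print(sites, [encode(h, sites) for h in haplotypes])
```

Constant positions (1, 2, 4 and 6 in this toy alignment) are the ones that would be set aside in the .const file.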

Q3: Why does SeqPHASE take up to three alignments to generate PHASE input files?

Instead of generating PHASE input files from a single FASTA alignment, which would require users to specify manually which phases are already known (for instance from cloning) and which individuals need to be phased, SeqPHASE can take as input up to three separate FASTA files: one for homozygous individuals and heterozygotes to be phased (with one sequence per individual), one for "fake haplotypes" of individuals to be phased (with two sequences per individual, see Q9) and a third one for heterozygotes whose phases are already known (with two sequences per individual). In the alignment of phased heterozygotes, the names of the two sequences of each individual should differ only by their last character (e.g., 'indiv3a' and 'indiv3b'). Heterozygous individuals whose two haplotypes differ only by one substitution or insertion/deletion can be entered interchangeably in the first field (with one sequence per individual), in the second field (with two sequences per individual) or in the third field (with two sequences per individual), since haplotyping is trivial in such cases: this will affect neither the phasing of these individuals nor the phasing of other individuals in the dataset.

Q4: What is the purpose of Step 2?

In the PHASE output files, nucleotides are represented by numbers and constant positions are omitted: thus, using these files to work out the actual haplotype sequences can be a tedious and error-prone operation. This is why a second script was written, which takes as input the .const file generated during Step 1 and the .out or .out_pairs PHASE output file, and returns a FASTA alignment of haplotype sequences (if the .const file box is left empty, a FASTA alignment containing only the variable positions is generated). If a .out file is provided, a list of phased haplotypes is returned as FASTA with one-letter IUPAC ambiguity codes (R, W, M, Y, S or K, see above) at positions where phase certainty is below a certain threshold (90% using PHASE default running options; this probability threshold can be modified by running PHASE using the -p and -q options, see PHASE documentation). If a .out_pairs file is provided, a list of all possible haplotype pairs for each individual is returned as FASTA with their respective probabilities indicated in parentheses. Since FASTA alignments normally cannot accommodate two sequences bearing exactly the same name, the two haplotypes of each newly phased individual receive this individual's name with "a" or "b" appended.
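The idea behind Step 2 can be sketched in Python as follows. This is only an illustration under stated assumptions: it does not parse real .const, .out or .out_pairs files (their layouts are not described here), the names rebuild and to_fasta are hypothetical, and the constant positions are supplied as a full-length template string whose variable sites are overwritten. Missing-data codes are not handled.

```python
# Hypothetical sketch of Step 2 (not SeqPHASE's code): decode numbers, restore constant sites.
NUCLEOTIDES = {"0": "-", "1": "A", "2": "C", "3": "G", "4": "T"}


def rebuild(template, positions, coded_alleles):
    """Reinsert decoded variable sites into a full-length sequence.

    template      -- full-length sequence carrying the constant positions
                     (an assumed stand-in for the content of the .const file)
    positions     -- 1-based alignment positions of the variable sites
    coded_alleles -- numeric codes for one haplotype, e.g. ["3", "0"]
    """
    sequence = list(template)
    for position, code in zip(positions, coded_alleles):
        sequence[position - 1] = NUCLEOTIDES[code]
    return "".join(sequence)


def to_fasta(name, template, positions, hap_a, hap_b):
    """Return both haplotypes as FASTA, with 'a' and 'b' appended to the name."""
    records = [(name + "a", hap_a), (name + "b", hap_b)]
    return "".join(
        f">{label}\n{rebuild(template, positions, codes)}\n"
        for label, codes in records
    )


print(to_fasta("indiv1", "ACNTNA", [3, 5], ["3", "0"], ["4", "1"]))
```

The N characters in the template are placeholders only; every variable position is overwritten by the decoded allele.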

Q5: What is the new "Reduced output" option in Step 2 about?

When the "Reduced output" box is ticked, homozygous individuals are represented by only one sequence (e.g., "indiv1") in the output FASTA file (which is better for some applications, such as building phylogenetic trees); otherwise, homozygous individuals are represented by two identical sequences (e.g., "indiv1a" and "indiv1b"), as in the output of PHASE. Ticking the "Reduced output" box also slightly changes the way in which the posterior probabilities are shown in the FASTA headers of the output file (this is for compatibility with a downstream application that is not yet published).

Q6: Are there any constraints on sequence label names?

Any sequence label name is acceptable (as long as it conforms to the general FASTA format), except in the alignment of phased haplotypes: for SeqPHASE to find out which sequences describe the two haplotypes of a given individual, the label names of these two sequences should differ only by their last character (e.g., individual23a and individual23b).
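The naming rule can be illustrated with a short Python sketch that groups labels by everything except their last character. The function name pair_haplotypes is hypothetical and this is not SeqPHASE's own implementation; it only shows why labels such as individual23a and individual23b are recognised as one individual.

```python
# Hypothetical sketch of pairing haplotype labels (not SeqPHASE's code).
from collections import defaultdict


def pair_haplotypes(names):
    """Group sequence names that differ only by their last character."""
    groups = defaultdict(list)
    for name in names:
        groups[name[:-1]].append(name)   # key = label without its last character
    unpaired = sorted(key for key, members in groups.items() if len(members) != 2)
    if unpaired:
        raise ValueError(f"individuals without exactly two sequences: {unpaired}")
    return dict(groups)


print(pair_haplotypes(["individual23a", "individual23b", "indiv3a", "indiv3b"]))
```

A label with no partner (for instance indiv4a alone) raises an error in this sketch, which mirrors the requirement that each phased individual contributes exactly two sequences.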

Q7: Is it possible to specify loci positions?

SeqPHASE was created with the phasing of sequences obtained from direct sequencing of nuclear markers in mind. As a result, it assumes that all nucleotides in the input alignment are actually contiguous and considers the locus position of each variable site to be its actual position in the alignment. However, it is easy to specify different locus positions by manually editing the PHASE input file produced by SeqPHASE (the positions of the variable sites are listed in the third line of the input file, following the letter P).
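For users who prefer to script this edit rather than do it by hand, here is a minimal Python sketch. It assumes only what is stated above, namely that the third line of the .inp file starts with P followed by the positions; the function name set_locus_positions and the toy file content are hypothetical.

```python
# Hypothetical helper for rewriting the P line of a PHASE input file.
def set_locus_positions(inp_text, positions):
    """Return the .inp content with the positions on the P line replaced."""
    lines = inp_text.splitlines()
    if len(lines) < 3:
        raise ValueError("input has fewer than three lines")
    fields = lines[2].split()
    if not fields or fields[0] != "P":
        raise ValueError("third line does not start with P")
    if len(fields) - 1 != len(positions):
        raise ValueError("number of positions does not match the existing P line")
    lines[2] = "P " + " ".join(str(p) for p in positions)
    return "\n".join(lines) + "\n"


# Toy content for illustration only (not a real SeqPHASE output).
example = "1\n3\nP 12 45 78\nSSS\n#indiv1\n1 2 3\n1 4 3\n"
print(set_locus_positions(example, [1012, 1045, 2078]))
```

The sketch refuses to change the number of positions, since that number has to stay consistent with the number of variable sites listed in the rest of the file.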

Q8: How can I code the genotypes of length-variant heterozygotes?

Length-variant heterozygotes (LVHs) are individuals whose haplotypes are of different lengths due to the presence of one or several indels. As a result, chromatograms look fine until the first indel (with sometimes a few double peaks if there are SNPs), then display numerous double peaks due to the superposition of non-homologous bases from the two sequences (see Flot et al. (2006) for an explanatory figure and Figure 2 in Fontaneto et al. (2015) for an example of such a chromatogram pair). One may identify SNPs and the positions of the indels, then use SeqPHASE and PHASE to phase these polymorphisms; however, there are no IUPAC codes available to represent "A or indel", "C or indel", "G or indel" and "T or indel", which makes it impossible to represent the genotype of a length-variant heterozygote as a single string of IUPAC codes. To solve this problem, a new field was added to Step 1, in which data for individuals to be phased can be entered as pairs of fake haplotypes. Pairs of sequences entered in this field should be formatted like known phases (with names such as "indiv1a" and "indiv1b" for the two fake haplotypes of indiv1). However, since PHASE performs better when many known phases are available, the best strategy remains to phase all length-variant heterozygotes in the dataset prior to running PHASE, using programs such as Champuru, TraceHaplotyper or Indelligent that analyze the patterns of double peaks in the chromatograms obtained from direct sequencing.

Q9: How does the "fake haplotypes" field work?

Genotypes to be phased can be entered either in the 1st field of SeqPHASE (with a single sequence per individual, e.g. ATSGTCAKR) or in the 2nd field of SeqPHASE (with two "fake haplotype" sequences per individual, e.g. ATCGTCATG and ATGGTCAGA, which are equivalent to the example above). Fake haplotypes are then superposed by SeqPHASE into one genotype to be phased, so it makes no difference whether one enters ATCGTCATG and ATGGTCAGA, or ATCGTCATA and ATGGTCAGG, or ATGGTCATG and ATCGTCAGA, etc.
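The superposition step can be illustrated with a minimal Python sketch. It is not SeqPHASE's implementation: the function name superpose is hypothetical, and the sketch covers only pairs of nucleotides, since a nucleotide facing an indel has no IUPAC code (see Q8) and is therefore rejected here.

```python
# Hypothetical sketch of superposing two fake haplotypes into one IUPAC genotype.
AMBIGUITY = {"AG": "R", "CT": "Y", "GT": "K", "AC": "M", "CG": "S", "AT": "W"}


def superpose(hap_a, hap_b):
    """Collapse two equal-length haplotypes into a single genotype string."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes have different lengths")
    genotype = []
    for x, y in zip(hap_a.upper(), hap_b.upper()):
        if x == y:
            genotype.append(x)
            continue
        code = AMBIGUITY.get("".join(sorted((x, y))))
        if code is None:
            raise ValueError(f"no IUPAC code for {x} superposed with {y}")
        genotype.append(code)
    return "".join(genotype)


pairs = [
    ("ATCGTCATG", "ATGGTCAGA"),
    ("ATCGTCATA", "ATGGTCAGG"),
    ("ATGGTCATG", "ATCGTCAGA"),
]
genotypes = {superpose(a, b) for a, b in pairs}
print(len(genotypes))  # prints 1: all three pairs collapse to the same genotype
```

Because each heterozygous site is reduced to an unordered pair of nucleotides, the way the alleles are distributed between the two fake haplotypes is lost, which is exactly why the three pairs are interchangeable.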

Q10: How should I cite this program?

Flot (2010) SeqPHASE: a web tool for interconverting PHASE input/output files and FASTA sequence alignments. Molecular Ecology Resources 10 (1): 162-166.