Characterization of ancient and modern genomes by SNP detection and phylogenomic and metagenomic analysis using PALEOMIX – University of Copenhagen

25 April 2014

In a recent publication in Nature Protocols, researchers from the Paleomix Group led by Dr. Ludovic Orlando describe an automated bioinformatics protocol for the processing of Next Generation Sequencing (NGS) data from modern and ancient organisms.

The pipeline offers a largely hands-off method for dealing with large amounts of shotgun sequencing data for resequencing projects, including sequence preprocessing, alignment, SNP calling, phylogenetic inference, and metagenomic analyses of the microbial contents of shotgun sequencing libraries. In a nutshell, the pipeline automatically performs a large number of analyses common to all studies aiming at full genome-resequencing.

An earlier version of the protocol was used in the sequencing of a 700,000-year-old horse genome.

Next Generation Sequencing: Trillions of base-pairs

Until recently, whole-genome sequencing required a massive investment of time and money, the collaboration of hundreds of researchers, and thousands of sequencing machines grouped within large sequencing centres. Thanks to the development of NGS platforms, it is now possible to sequence trillions of base-pairs in a matter of weeks, giving even smaller groups the opportunity to investigate complete genomes.

However, for a typical project involving multiple individuals with large genomes, the amount of data generated typically falls in the range of tens to hundreds of gigabytes, all of which has to be processed in a uniform manner to avoid introducing biases into downstream analyses. Getting from raw sequencing reads to usable alignments involves a complex set of analytical steps, which are prone to human error if carried out by hand.

Additionally, the generation of sequence alignments against known reference genomes is merely the first step in a typical project involving variant identification, the generation of consensus sequences for multiple alignments, and phylogenetic inference.

The PALEOMIX pipeline: A user-friendly package 

The PALEOMIX pipeline solves this problem by providing a user-friendly package for carrying out end-to-end analyses of NGS data from one or more individuals. Starting with demultiplexed sequencing reads, the pipeline automatically trims adapter sequences, (optionally) collapses overlapping paired reads into a single consensus sequence, carries out quality filtering, maps reads against one or more reference genomes, filters PCR duplicates in individual libraries, analyzes and corrects for the presence of post-mortem DNA damage in ancient samples, employs the GenomeAnalysisTK to perform local realignment around indels, and generates a wealth of project statistics.
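To make the read-collapsing step more concrete, the following is a minimal Python sketch of how two overlapping paired reads could be merged into one consensus sequence. It is an illustration only and not the pipeline's own implementation: the function names, the `min_overlap` and `max_mismatches` parameters, and the rule of keeping the higher-quality base at each overlapping position are all assumptions made for this example.

```python
# Illustrative sketch only; names and thresholds are hypothetical.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def collapse_pair(seq1, qual1, seq2, qual2, min_overlap=11, max_mismatches=1):
    """Merge a forward read and its mate into a single consensus read.

    qual1 and qual2 are lists of integer base qualities.
    Returns (sequence, qualities), or None if no acceptable overlap exists.
    """
    # Bring the mate into the same orientation as the forward read.
    mate = reverse_complement(seq2)
    mate_qual = list(reversed(qual2))

    # Try the longest possible overlap first: end of seq1 vs start of mate.
    for overlap in range(min(len(seq1), len(mate)), min_overlap - 1, -1):
        tail, head = seq1[-overlap:], mate[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches > max_mismatches:
            continue

        start = len(seq1) - overlap
        merged = list(seq1[:start])
        merged_qual = list(qual1[:start])
        # Inside the overlap, keep whichever base has the higher quality.
        for i in range(overlap):
            q1, q2 = qual1[start + i], mate_qual[i]
            if q1 >= q2:
                merged.append(tail[i])
                merged_qual.append(q1)
            else:
                merged.append(head[i])
                merged_qual.append(q2)
        merged.extend(mate[overlap:])
        merged_qual.extend(mate_qual[overlap:])
        return "".join(merged), merged_qual

    return None
```

Collapsing is particularly relevant for ancient samples, where DNA fragments are often shorter than the combined length of the two reads, so both reads cover the same bases.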

From small regions to complete genomes

Subsequently, the PALEOMIX pipeline may use the alignments thus produced to carry out genotyping of regions of interest (ranging from small regions to the exome or the complete genome), automatically filter low-quality SNPs, and generate consensus sequences from the filtered bases. The resulting sets of sequences may be aligned using the MAFFT multiple sequence aligner, and a maximum likelihood phylogenetic inference may be performed using ExaML.
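The filtering and consensus step described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the pipeline's actual filters: the `Call` record, the quality and depth thresholds, and the choice to mask every failing or uncalled position with `N` are hypothetical.

```python
# Illustrative sketch only; the record layout and thresholds are hypothetical.
from dataclasses import dataclass


@dataclass
class Call:
    position: int   # 0-based position within the region of interest
    base: str       # genotyped base at this position
    quality: float  # genotype quality score
    depth: int      # number of reads covering the position


def build_consensus(region_length, calls, min_quality=30.0, min_depth=8, max_depth=100):
    """Build a consensus sequence from calls that pass all filters.

    Positions with no call, or with a call that fails a filter, are
    reported as 'N' so that they do not influence later analyses.
    """
    consensus = ["N"] * region_length
    for call in calls:
        if call.quality < min_quality:
            continue  # low-confidence genotype
        if not (min_depth <= call.depth <= max_depth):
            continue  # too little support, or suspiciously high coverage
        consensus[call.position] = call.base
    return "".join(consensus)
```

One consensus sequence built this way per sample and region would then be the input to the multiple sequence alignment and phylogenetic inference.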

Finally, the PALEOMIX pipeline includes a methodology for characterizing the metagenomic contents of the NGS libraries constructed for the samples being analysed. As shotgun sequencing of ancient samples often yields a large proportion of environmental microbial DNA, along with a minor fraction of DNA belonging to the sample of interest, this opens up a wealth of potential information to be gleaned from data produced from ancient specimens.
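As a rough illustration of the kind of summary this enables, the sketch below takes one taxonomic label per read and reports the fraction of reads from the sample of interest together with the relative abundance of each microbial group. The labels, the `host_label` parameter, and the function itself are hypothetical and only serve the example; how the pipeline assigns reads to taxa is not covered here.

```python
# Illustrative sketch only; labels and function name are hypothetical.
from collections import Counter


def summarize_library(assignments, host_label="host"):
    """Summarize per-read taxonomic labels for one sequencing library.

    Returns (endogenous_fraction, microbial_profile), where the profile
    maps each non-host taxon to its share of the non-host reads.
    """
    counts = Counter(assignments)
    total = sum(counts.values())
    if total == 0:
        return 0.0, {}

    host_reads = counts.get(host_label, 0)
    microbial_reads = total - host_reads
    profile = {}
    if microbial_reads > 0:
        profile = {
            taxon: n / microbial_reads
            for taxon, n in counts.items()
            if taxon != host_label
        }
    return host_reads / total, profile
```

For example, a library with 2 host reads, 6 reads labelled Actinobacteria and 2 labelled Proteobacteria would give an endogenous fraction of 0.2 and a microbial profile of 0.75 and 0.25.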

Consequently, the PALEOMIX pipeline offers a handy suite of tools to researchers and significantly reduces the bioinformatics workload required to implement comparative genomic analyses.

The source code and documentation for the PALEOMIX pipeline are available on GitHub:

The PALEOMIX pipeline was published online in Nature Protocols:

Schubert M, Ermini L, Sarkissian CD, Jónsson H, Ginolhac A, Schaefer R, Martin MD, Fernández R, Kircher M, McCue M, Willerslev E, and Orlando L. Characterization of ancient and modern genomes by SNP detection and phylogenomic and metagenomic analysis using PALEOMIX. Nat Protoc. 2014 May;9(5):1056-82. doi: 10.1038/nprot.2014.063. Epub 2014 Apr 10. PubMed PMID: 24722405.