nf-core/eager
A fully reproducible and state-of-the-art ancient DNA analysis pipeline
2.1.0
). The latest
stable release is
2.5.2
.
22.10.6
.
Learn more.
Introduction
nf-core/eager is a bioinformatics best-practice analysis pipeline for NGS sequencing based ancient DNA (aDNA) data analysis.
The pipeline uses Nextflow, a bioinformatics workflow tool. It pre-processes raw data from FASTQ inputs, or preprocessed BAM inputs. It can align reads and performs extensive general NGS and aDNA specific quality-control on the results. It comes with docker, singularity or conda containers making installation trivial and results highly reproducible.
Pipeline steps
Default Steps
By default the pipeline currently performs the following:
- Create reference genome indices for mapping (
bwa
,samtools
, andpicard
) - Sequencing quality control (
FastQC
) - Sequencing adapter removal and for paired end data merging (
AdapterRemoval
) - Read mapping to reference using (
bwa aln
,bwa mem
orCircularMapper
) - Post-mapping processing, statistics and conversion to bam (
samtools
) - Ancient DNA C-to-T damage pattern visualisation (
DamageProfiler
) - PCR duplicate removal (
DeDup
orMarkDuplicates
) - Post-mapping statistics and BAM quality control (
Qualimap
) - Library Complexity Estimation (
preseq
) - Overall pipeline statistics summaries (
MultiQC
)
Additional Steps
Additional functionality contained by the pipeline currently includes:
Preprocessing
- Illumina two-coloured sequencer poly-G tail removal (
fastp
) - Automatic conversion of unmapped reads to FASTQ (
samtools
) - Host DNA (mapped reads) stripping from input FASTQ files (for sensitive samples)
aDNA Damage manipulation
- Damage removal/clipping for UDG+/UDG-half treatment protocols (
BamUtil
) - Damaged reads extraction and assessment (
PMDTools
)
Genotyping
- Creation of VCF genotyping files (
GATK UnifiedGenotyper
,GATK HaplotypeCaller
andFreeBayes
) - Consensus sequence FASTA creation (
VCF2Genome
) - SNP Table generation (
MultiVCFAnalyzer
)
Biological Information
- Mitochondrial to Nuclear read ratio calculation (
MtNucRatioCalculator
) - Statistical sex determination of human individuals (
SexDetErrmine
)
Metagenomic Screening
- Taxonomic binner with alignment (
MALT
) - Taxonomic binner without alignment (
Kraken2
) - aDNA characteristic screening of taxonomically binned data from MALT (
MaltExtract
)
Quick Start
-
Install
nextflow
(>= v19.10.0) -
Install one of
docker
,singularity
orconda
-
Download the EAGER pipeline
-
Test the pipeline using the provided test data
-
Start running your own ancient DNA analysis!
-
Once your run has completed successfully, clean up the intermediate files.
NB. You can see an overview of the run in the MultiQC report located at ./results/MultiQC/multiqc_report.html
Modifications to the default pipeline are easily made using various options as described in the documentation.
Documentation
The nf-core/eager pipeline comes with documentation about the pipeline, found in the docs/
directory or on the main homepage of the nf-core project:
- Nextflow Installation
- Pipeline configuration
- Running the pipeline
- Output and how to interpret the results
- EAGER2 Code Contribution Guidelines
- nf-core/nextflow Troubleshooting
- EAGER Troubleshooting
Credits
This pipeline was mostly written by Alexander Peltzer (apeltzer), with contributions from Stephen Clayton, James A. Fellows Yates, Thiseas C. Lamnidis, Maxime Borry, Zandra Fagernäs, Aida Andrades Valtueña and Maxime Garcia. If you want to contribute, please open an issue (or even better, a pull request!) and ask to be added to the project - everyone is welcome to contribute here!
Authors (alphabetical)
- Aida Andrades Valtueña
- Alexander Peltzer
- James A. Fellows-Yates
- Judith Neukamm
- Maxime Borry
- Maxime Garcia
- Stephen Clayton
- Thiseas C. Lamnidis
- Zandra Fagernäs
Additional Contributors (alphabetical)
Those who have provided conceptual guidance, suggestions, bug reports etc.
- Arielle Munters
- Hester van Schalkwyk
- Irina Velsko
- Katerine Eaton
- Luc Venturini
- Marcel Keller
- Pierre Lindenbaum
- Pontus Skoglund
- Raphael Eisenhofer
- Torsten Günter
If you’ve contributed and you’re missing in here, please let us know and we will add you in of course!
Tool References
- EAGER v1, CircularMapper, DeDup* Peltzer, A., Jäger, G., Herbig, A., Seitz, A., Kniep, C., Krause, J., & Nieselt, K. (2016). EAGER: efficient ancient genome reconstruction. Genome Biology, 17(1), 1–14. https://doi.org/10.1186/s13059-016-0918-z Download: https://github.com/apeltzer/EAGER-GUI and https://github.com/apeltzer/EAGER-CLI
- FastQC download: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- AdapterRemoval v2 Schubert, M., Lindgreen, S., & Orlando, L. (2016). AdapterRemoval v2: rapid adapter trimming, identification, and read merging. BMC Research Notes, 9, 88. https://doi.org/10.1186/s13104-016-1900-2 Download: https://github.com/MikkelSchubert/adapterremoval
- bwa Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics , 25(14), 1754–1760. https://doi.org/10.1093/bioinformatics/btp324 Download: http://bio-bwa.sourceforge.net/bwa.shtml
- SAMtools Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., … 1000 Genome Project Data Processing Subgroup. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics , 25(16), 2078–2079. https://doi.org/10.1093/bioinformatics/btp352 Download: http://www.htslib.org/
- DamageProfiler Judith Neukamm (Unpublished). Download: https://github.com/Integrative-Transcriptomics/DamageProfiler
- QualiMap Okonechnikov, K., Conesa, A., & García-Alcalde, F. (2016). Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics , 32(2), 292–294. https://doi.org/10.1093/bioinformatics/btv566 Download: http://qualimap.bioinfo.cipf.es/
- preseq Daley, T., & Smith, A. D. (2013). Predicting the molecular complexity of sequencing libraries. Nature Methods, 10(4), 325–327. https://doi.org/10.1038/nmeth.2375. Download: http://smithlabresearch.org/software/preseq/
- PMDTools Skoglund, P., Northoff, B. H., Shunkov, M. V., Derevianko, A. P., Pääbo, S., Krause, J., & Jakobsson, M. (2014). Separating endogenous ancient DNA from modern day contamination in a Siberian Neandertal. Proceedings of the National Academy of Sciences of the United States of America, 111(6), 2229–2234. https://doi.org/10.1073/pnas.1318934111 Download: https://github.com/pontussk/PMDtools
- MultiQC Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics , 32(19), 3047–3048. https://doi.org/10.1093/bioinformatics/btw354 Download: https://multiqc.info/
- BamUtils Jun, G., Wing, M. K., Abecasis, G. R., & Kang, H. M. (2015). An efficient and scalable analysis framework for variant extraction and refinement from population-scale DNA sequence data. Genome Research, 25(6), 918–925. https://doi.org/10.1101/gr.176552.114 Download: https://genome.sph.umich.edu/wiki/BamUtil
- FastP Chen, S., Zhou, Y., Chen, Y., & Gu, J. (2018). fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics , 34(17), i884–i890. https://doi.org/10.1093/bioinformatics/bty560 Download: https://github.com/OpenGene/fastp
- GATK 3.5 DePristo, M. A., Banks, E., Poplin, R., Garimella, K. V., Maguire, J. R., Hartl, C., … Daly, M. J. (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics, 43(5), 491–498. https://doi.org/10.1038/ng.806 Download
- GATK 4.X - no citation available yet
- VCF2Genome - Alexander Herbig and Alex Peltzer (unpublished). Download: https://github.com/apeltzer/VCF2Genome
- MultiVCFAnalyzer Bos, K.I. et al., 2014. Pre-Columbian mycobacterial genomes reveal seals as a source of New World human tuberculosis. Nature, 514(7523), pp.494–497. Available at: http://dx.doi.org/10.1038/nature13591. Download: https://github.com/alexherbig/MultiVCFAnalyzer
- MTNucRatioCalculator Alex Peltzter (Unpublished). Download: https://github.com/apeltzer/MTNucRatioCalculator
- Sex.DetERRmine.py Lamnidis, T.C. et al., 2018. Ancient Fennoscandian genomes reveal origin and spread of Siberian ancestry in Europe. Nature communications, 9(1), p.5018. Available at: http://dx.doi.org/10.1038/s41467-018-07483-5. Download: https://github.com/TCLamnidis/Sex.DetERRmine.git
- ANGSD Korneliussen, T.S., Albrechtsen, A. & Nielsen, R., 2014. ANGSD: Analysis of Next Generation Sequencing Data. BMC bioinformatics, 15, p.356. Available at: http://dx.doi.org/10.1186/s12859-014-0356-4. Download: https://github.com/ANGSD/angsd
- bedtools Quinlan, A.R. & Hall, I.M., 2010. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics , 26(6), pp.841–842. Available at: http://dx.doi.org/10.1093/bioinformatics/btq033. Download: https://github.com/arq5x/bedtools2/releases
- MALT Download: https://software-ab.informatik.uni-tuebingen.de/download/malt/welcome.html
- Vågene, Å.J. et al., 2018. Salmonella enterica genomes from victims of a major sixteenth-century epidemic in Mexico. Nature ecology & evolution, 2(3), pp.520–528. Available at: http://dx.doi.org/10.1038/s41559-017-0446-6.
- Herbig, A. et al., 2016. MALT: Fast alignment and analysis of metagenomic DNA sequence data applied to the Tyrolean Iceman. bioRxiv, p.050559. Available at: http://biorxiv.org/content/early/2016/04/27/050559.
- MaltExtract Huebler, R. et al., 2019. HOPS: Automated detection and authentication of pathogen DNA in archaeological remains. bioRxiv, p.534198. Available at: https://www.biorxiv.org/content/10.1101/534198v1?rss=1. Download: https://github.com/rhuebler/MaltExtract
- Kraken2 Wood, D et al., 2019. Improved metagenomic analysis with Kraken 2. Genome Biology volume 20, Article number: 257. Available at: https://doi.org/10.1186/s13059-019-1891-0. Download: https://ccb.jhu.edu/software/kraken2/
- endorS.py Aida Andrades Valtueña (Unpublished). Download: https://github.com/aidaanva/endorS.py