Introduction

SLAM-seq

SLAMseq is a novel sequencing protocol that directly uncovers 4-thiouridine incorporation events in RNA by high-throughput sequencing. When combined with metabolic labeling protocols, SLAM-seq allows to study the intracellular RNA dynamics, from transcription, RNA processing to RNA stability.

Original publication: Herzog et al., Nature Methods, 2017; doi:10.1038/nmeth.4435

images/slam.png

SLAM-DUNK

Neumann, T., Herzog, V. A., Muhar, M., Haeseler, von, A., Zuber, J., Ameres, S. L., & Rescheneder, P. (2019). Quantification of experimentally induced nucleotide conversions in high-throughput sequencing datasets . BMC Bioinformatics, 20(1), 258. doi:10.1186/s12859-019-2849-7

Installation

This section covers the installation of slamdunk . Alternatively one could also run slamdunk from out of the box Docker containers.

There are 3 different possibilities:

  1. Installation with conda from the Bioconda channel (recommended)
  2. Installation with pip from the Python Package Index
  3. Manual installation from source-label

Bioconda

We recommend using virtual environments to manage your Python installation. Our favourite is Anaconda , a cross-platform tool to manage Python environments. You can installation instructions for Anaconda here . Then you can directly install slamdunk from the Bioconda channel .

Installation

Once Anaconda is installed, you can create an environment containing the slamdunk package with the following command:

conda create --name <name> -c bioconda slamdunk

Python Package Index

pip is the recommended tool for installing Python packages. It allows a convenient and simple installation of slamdunk from the Python Package Index PyPI .

Requirements

  • Python 2.7
  • pip
  • Java
  • cmake
  • R 3.2.2 for alleyoop and splash

Installation

# Having root permissions

pip install slamdunk

# Latest development version

pip install git+https://github.com/t-neumann/slamdunk.git

# Local user only

pip install --user slamdunk
export PATH=$PATH:$HOME/.local/bin/

Quickstart

To analyze a given SLAMSeq dataset with slamdunk using the default parameters which worked well on datasets during development, call the all dunk using the recommended minimum set of default parameters:

slamdunk all -r <reference fasta> -b <bed file> -o <output directory> -5 12 -n 100
             -t <threads> -m -rl <maximum read length> --skip-sam files [files ...]

The input data you need and parameters you need to set yourself are the following:

Parameter Description
-r The reference fasta file.
-b BED-file containing coordinates for 3’ UTRs.
-o The output directory where all output files of this dunk will be placed.
-t The number of threads to use for this dunk. NextGenMap runs multi-threaded, so it is recommended to use more threads than available samples (default: 1).
files Samplesheet (see Sample file format ) or a list of all sample BAM/FASTA(gz)/FASTQ(gz) files (wildcard * accepted).

This will run the entire slamdunk analysis with the most relevant output files being:

  • Folder count : Tab-separated tcount files containing the SLAMSeq statistics per UTR (see Tcount file format )
  • Folder filter : BAM-files with the final mapped reads for visualization and further processing

If you want to have a very quick sanity check whether your desired conversion rates have been achieved, use this alleyoop command to plot the UTR conversion rates in your sample:

alleyoop utrrates -o <output directory> -r <reference fasta> -t <threads>
                  -b <bed file> -l <maximum read length> bam [bam ...]
Parameter Description
-o The output directory where the plots will be placed.
-r The reference fasta file.
-t The number of threads to use. All tools run single-threaded, so it is recommended to use as many threads as available samples.
-b BED-file containing coordinates for 3’ UTRs.
-l Maximum read length in all samples.
bam BAM file(s) containing the final filtered reads from the filter folder (wildcard * accepted).

This will produce a plot like the following example where you can clearly see that individual conversions for a given starting base are balanced and unbiased, except for the T->C conversions in this SLAMSeq sample with longer labelling times.

Basics

Conversion rates

The individual measured conversion rates within a 3’ UTR are highly dependent on the T-content of a given UTR and the number of reads that cover those Ts. The more reads containing Ts, the more accurate the overall estimation of the conversion rate of T>Cs in this given 3’ UTR will be.

We have adressed this by using a 3’ UTR positions based view instead of a read based view: For each T-position in a given 3’ UTR, we record the number of reads covering this position and the number of reads with T>C conversion at this position - the more reads, the more accurate the T>C conversion rate estimation will be.

Both values are summed over all T-positions in the UTR - this way, less reliable T-positions with little reads will have lower impact on the overall T>C conversion rate estimation than T-positions with many reads.

images/conversionrates.png

Note: Since QuantSeq is a strand-specific assay, only sense reads will be considered for the final analysis!

Multimapper reconciliation

Definition and importance

Multimappers are reads which align to more than one position in the genome with equal best alignment scores.

Such reads make up a substantial part of the alignment and are generally ignored, but can provide valuable information if they can be unequivocally reassigned to a given location, enhancing the overall signal and thus heavily influencing the overall accuracy.

However, there are also several caveats: For genes with isoforms or homologous genes, simply taking all alignments into account will heavily distort the signal, since the reads will likely not account equally for all regions, but rather originate mainly from a single region which cannot be unambiguously assigned.

Reassignment strategy

We have settled for a very conservative multimapper reassignment strategy:

Since the QuantSeq technology specifically enriches for 3’ UTRs, we only consider alignments to annotated 3’ UTRs supplied to slamdunk as relevant.

Therefore, any multimappers with alignments to a single 3’ UTR and non-3’UTR regions (i.e. not annotated in the supplied reference) will be unequivocally assigned to the single 3’UTR. If there are multiple alignments to this single 3’UTR, one will be chosen at random.

For all other cases, were a read maps to several 3’ UTRs, we are unable to reassign the read uniquely to a given 3’UTR and thus discard it from the analysis.

In short, the procedure is as follows, illustrated by the example below:

  1. Create an Interval Tree from the supplied 3’ UTR annotation
  2. Check the overlaps of all multimapping reads with 3’ UTRs using the Interval Tree
  3. Remove any reads with alignments to more than one single 3’ UTR
  4. Remove any additional alignments to non-UTR regions
images/multimappers.png

Sample file format

Sample description files are a convenient way to provide multiple sample files as well as corresponding sample information to slamdunk . A sample description file is a tab or comma separated plain text file with the following columns:

Column Datatype Description
Filepath String Path to raw reads in BAM/fasta(gz)/fastq(gz) format
Sample description String Description of the sample
Sample type String Type of sample. Only relevant for timecourses (pulse/chase) otherwise use any value
Timepoint Integer Timepoint of the sample in minutes. Again only relevant for timecourses otherwise use any value

Tcount file format

The tcount file is the central output file of slamdunk . It contains all results, conversion rates and other statistics for each UTR which is the main starting point for any subsequent analysis that will follow e.g. transcript half-life estimates or DE analysis.

Tcount files are essentially tab-separated text files containing one line entry per 3’ UTR supplied by the user. Each entry contains the following columns:

Column Datatype Description
chromosome String Chromosome on which the 3’ UTR resides
start Integer Start position of the 3’ UTR
end Integer End position of the 3’ UTR
name String Name or ID of the 3’ UTR supplied by the user
length Integer Length of the 3’ UTR
strand String Strand of the 3’ UTR
conversionRate Float Average conversion rate (see Conversion rates )
readsCPM Float Normalized number of reads as counts per million
Tcontent Integer Number of Ts in the 3’ UTR (As for - strand UTRs)
coverageOnTs Integer Cumulative coverage on all Ts in the 3’ UTR
conversionsOnTs Integer Cumulative number of T>C conversions in the 3’ UTR
readCount Integer Number of reads aligning to the 3’ UTR
tcReadCount Integer Number of reads with T>C conversions aligning to the 3’ UTR
multimapCount Integer Number of reads considered as multimappers aligning to the 3’ UTR

Here is an example:

chr13   14197734        14199362        ENSMUSG00000039219      1628    +       0.0466045272969 4.00770947448   587     751     35      59      29      0
chr2    53217389        53219220        ENSMUSG00000026960      1831    +       0.0270802560315 28.3936027175   709     6093    165     418     138     2
chr2    53217389        53218446        ENSMUSG00000026960      1057    +       0.0268910814471 23.9783295677   407     5169    139     353     118     0
chr3    95495394        95495567        ENSMUSG00000015522      173     +       0.0290697674419 1.08683646766   53      172     5       16      5       0
chr3    95495394        95497237        ENSMUSG00000015522      1843    +       0.0253636702723 11.6834920273   584     2681    68      172     55      4
chr6    113388777       113389056       ENSMUSG00000079426      279     +       0.0168514412417 16.7780379694   71      2255    38      247     38      3

Usage

Slamdunk is a modular analysis software. Modules are dubbed dunks and each dunk builds upon the results from the previous dunks.

The idea is to always process all samples with one command line call of a given dunk. Output files typically get the same name as the input files with a certain prefix (e.g. “slamdunk_mapped”). The command line interface follows the “samtools/bwa” style. Meaning that all commands are available through the central executable/script slamdunk (located in the bin directory) Wildcard characters to supply multiple files are recognized throughout slamdunk .

Available dunks are:

'map', 'filter', 'snp', 'count', 'all'

Calling a module with –help shows all possible parameters text:

usage: slamdunk all [-h] -r REFERENCEFILE -b BED [-fb FILTERBED] -o OUTPUTDIR [-5 TRIM5]
                 [-a MAXPOLYA] [-n TOPN] [-t THREADS] [-q] [-e] [-m]
                 [-mq MQ] [-mi IDENTITY] [-nm NM] [-mc COV] [-mv VAR]
                 [-mts] [-rl MAXLENGTH] [-mbq MINQUAL] [-i SAMPLEINDEX]
                 [-ss]
                 files [files ...]

positional arguments:
   files                 Single csv/tsv file (recommended) containing all
                         sample files and sample info or a list of all sample
                         BAM/FASTA(gz)/FASTQ(gz) files

optional arguments:
   -h, --help            show this help message and exit
   -r REFERENCEFILE, --reference REFERENCEFILE
                         Reference fasta file
   -b BED, --bed BED     BED file with 3'UTR coordinates
   -fb FILTERBED, --filterbed FILTERBED
                         BED file with 3'UTR coordinates to filter multimappers
   -o OUTPUTDIR, --outputDir OUTPUTDIR
                         Output directory for slamdunk run.
   -5 TRIM5, --trim-5p TRIM5
                         Number of bp removed from 5' end of all reads
                         (default: 12)
   -a MAXPOLYA, --max-polya MAXPOLYA
                         Max number of As at the 3' end of a read (default: 4)
   -n TOPN, --topn TOPN  Max. number of alignments to report per read (default:
                         1)
   -t THREADS, --threads THREADS
                         Thread number (default: 1)
   -q, --quantseq        Run plain Quantseq alignment without SLAM-seq scoring
   -e, --endtoend        Use a end to end alignment algorithm for mapping.
   -m, --multimap        Use reference to resolve multimappers (requires -n >
                         1).
   -mq MQ, --min-mq MQ   Minimum mapping quality (default: 2)
   -mi IDENTITY, --min-identity IDENTITY
                         Minimum alignment identity (default: 0.95)
   -nm NM, --max-nm NM   Maximum NM for alignments (default: -1)
   -mc COV, --min-coverage COV
                         Minimimum coverage to call variant (default: 10)
   -mv VAR, --var-fraction VAR
                         Minimimum variant fraction to call variant (default:
                         0.8)
   -mts, --multiTCStringency
                         Multiple T>C conversion required for T>C read
   -rl MAXLENGTH, --max-read-length MAXLENGTH
                         Max read length in BAM file
   -mbq MINQUAL, --min-base-qual MINQUAL
                         Min base quality for T -> C conversions (default: 27)
   -i SAMPLEINDEX, --sample-index SAMPLEINDEX
                         Run analysis only for sample <i>. Use for distributing
                         slamdunk analysis on a cluster (index is 1-based).
   -ss, --skip-sam       Output BAM while mapping. Slower but, uses less hard
                         disk.

The flow of slamdunk is to first map your reads, filter your alignments, call variants on your final alignments and use these to calculate conversion rates, counts and various statistics for your 3’UTRs.

images/slamdunk_flow.png

All steps create a log file that has the same name as the output file. Typically there is one log file per sample and task (makes parallel execution easier). Command line output is limited to a minimum at the moment. If a sample is finished a ”.” is printed (very basic progress bar).

Dunks

map

The map dunk is used to map reads to a given genome using NextGenMap’s SLAMSeq alignment settings.

slamdunk map [-h] -r <reference fasta> -o <output directory> [-5 <bp to trim from 5' end>]
             [-n <Output up to N alignments per multimapper>] [-a <maximum number of 3' As before trimming] [-t <threads>]
             [-q] [-e] [-i <sample index>] [-ss] files [files ...]

Input

File Description
Reference fasta The reference sequence of the genome to map against in fasta format.
files Samplesheet (see Sample file format ) or a list of all sample BAM/FASTA(gz)/FASTQ(gz) files (wildcard * accepted).

Output

File Description
Mapped BAM file One BAM file per sample containing the mapped reads.
Mapped bam index file The raw unmapped reads in fastq / BAM format (multiple read files can be run at once).

Output files have the same name as the input files with the prefix “_slamdunk_mapped.bam”. For example:

34330_An312_wt-2n_mRNA-slamseq-autoquant_0h-R1.fq.gz ->
34330_An312_wt-2n_mRNA-slamseq-autoquant_0h-R1.fq_slamdunk_mapped.bam

Parameters

Parameter Required Description
-h Prints the help.
-r x The reference fasta file.
-o x The output directory where all output files of this dunk will be placed.
-5 Number of bases that will be hard-clipped from the 5’ end of each read.
-a Maximum number of As at the 3’ end of a read.
-n The maximum number of alignments that will be reported for a multi-mapping read (i.e. reads with multiple alignments of equal best scores).
-t The number of threads to use for this dunk. NextGenMap runs multi-threaded, so it is recommended to use more threads than available samples.
-q Deactivates NextGenMap’s SLAMSeq alignment settings. Can be used to align plain QuantSeq data instead of SLAMSeq data.
-e Switches to semi-global alignment instead of local alignment.
-i Run analysis only for sample <i>. Use for distributin slamdunk analysis on a cluster (index is 0-based).
-ss Output BAM while mapping. Slower, but uses less hard disk.
file x Samplesheet (see Sample file format ) or a list of all sample BAM/FASTA(gz)/FASTQ(gz) files (wildcard * accepted).

filter

The filter dunk is used to filter the raw alignments from the map dunk using multiple quality criteria to obtain the final alignments for all subsequent analyses.

slamdunk filter [-h] -o <output directory> [-b <bed file>] [-mq <MQ cutoff>] [-mi <identity cutoff>]
                [-nm <NM cutoff>] [-t <threads>] bam [bam ...]

Input

File Description
bam The raw mapped reads in BAM format from the map dunk.
-b (optional) Bed file with coordinates of 3’UTRs.

Output

File Description
Filtered BAM file One BAM file per sample containing the filtered reads.
Filtered bam index file One BAM index file accompanying each filtered BAM file.

Output files have the same name as the input files with the prefix “_filtered”. For example:

34330_An312_wt-2n_mRNA-slamseq-autoquant_0h-R1.fq_slamdunk_mapped.bam ->
34330_An312_wt-2n_mRNA-slamseq-autoquant_0h-R1.fq_slamdunk_mapped_filtered.bam

Parameters

Parameter Required Description
-h Prints the help.
-o x The output directory where all output files of this dunk will be placed.
-b BED-file containing coordinates for 3’ UTRs.
-mq Minimum mapping quality required to retain a read.
-mi Minimum alignment identity required to retain a read.
-nm Maximum number of mismatches allowed in a read.
-t The number of threads to use for this dunk. This dunk runs single-threaded so the number of threads should be equal to the number of available samples.
bam x BAM file(s) containing the raw mapped reads (wildcard * accepted).

snp

The snp dunk is used to call variants on the final filtered alignments of the filter dunk using VarScan2 . Any called T->C SNPs from this dunk will be excluded in the subsequent analyses to reduce the false-positive number.

slamdunk snp [-h] -o <output directory> -r <reference fasta> [-c <coverage cutoff>]
             [-f <variant fraction cutoff>] [-t <threads>] bam [bam ...]

Input

File Description
bam The final filtered reads in BAM format from the filter dunk.

Output

File Description
VCF file One VCF file per sample containing the called variants.

Output files have the same name as the input files with the prefix “_snp”. For example:

34330_An312_wt-2n_mRNA-slamseq-autoquant_0h-R1.fq_slamdunk_mapped_filtered.bam ->
34330_An312_wt-2n_mRNA-slamseq-autoquant_0h-R1.fq_slamdunk_mapped_filtered_snp.vcf

Parameters

Parameter Required Description
-h Prints the help.
-r x The reference fasta file.
-o x The output directory where all output files of this dunk will be placed.
-c Minimum coverage to call a variant.
-f Minimum variant fraction to call a variant.
-t The number of threads to use for this dunk. VarScan2 runs multi-threaded, so it is recommended to use more threads than available samples.
bam x BAM file(s) containing the final filtered reads (wildcard * accepted).

count

The count dunk calculates all relevant numbers on statistics of SLAMSeq reads for each given 3’ UTR. Central output will be tcount table (see Tcount file format ).

slamdunk count [-h] -o <output directory> [-s <SNP directory>] -r <reference fasta> -b <bed file> [-m]
                [-l <maximum read length>] [-q <minimum base quality>] [-t <threads] bam [bam ...]

Note: Since QuantSeq is a strand-specific assay, only sense reads will be considered for the final analysis!

Input

File Description
bam The final filtered reads in BAM format from the filter dunk.
-s (optional) The called variants from the snp dunk to filter false-positive T->C conversions.
-b Bed file with coordinates of 3’UTRs.

Output

File Description
Tcount file A tab-separated tcount file per sample containing the SLAMSeq statistics (see Tcount file format ).
Bedgraph file A bedgraph file per sample showing the T->C conversion rate on each covered reference T nucleotide.

Output files have the same name as the input files with the prefix “_tcount”. For example:

34330_An312_wt-2n_mRNA-slamseq-autoquant_0h-R1.fq_slamdunk_mapped_filtered.bam ->
34330_An312_wt-2n_mRNA-slamseq-autoquant_0h-R1.fq_slamdunk_mapped_filtered_tcount.csv

Parameters

Parameter Required Description
-h Prints the help.
-o x The output directory where all output files of this dunk will be placed.
-s The output directory of the snp dunk containing the called variants.
-r x The reference fasta file.
-b x BED-file containing coordinates for 3’ UTRs.
-l Maximum read length (will be automatically estimated if not set).
-m Flag to activate the multiple T->C conversion stringency: Only T->C conversions in reads with more than 1 T->C conversion will be counted.
-q Minimum base quality for T->C conversions to be counted.
-t The number of threads to use for this dunk. This dunk runs single-threaded so the number of threads should be equal to the number of available samples.
bam x BAM file(s) containing the final filtered reads (wildcard * accepted).

all

The all dunk is used to run an entire slamdunk run at once. It sequentially calls the map , filter , snp and count dunks and provides parameters to keep full control over all dunks.

slamdunk all [-h] -r <reference fasta> -b <bed file> [-fb <bed file>] -o <output directory> [-5 <bp to trim from 5' end>] [-a <maximum number of 3' As before trimming]
             [-n <Output up to N alignments per multimapper>] [-t <threads>] [-q] [-e] [-m] [-mq <MQ cutoff>]
             [-mi <identity cutoff>] [-nm <NM cutoff>] [-mc <coverage cutoff>] [-mv <variant fraction cutoff>] [-mts]
             [-rl <maximum read length>] [-mbq <minimum base quality>] [-i <sample index>] [-ss] files [files ...]

Input

File Description
Reference fasta The reference sequence of the genome to map against in fasta format.
-b Bed file with coordinates of 3’UTRs.
file Samplesheet (see Sample file format ) or a list of all sample BAM/FASTA(gz)/FASTQ(gz) files (wildcard * accepted).

Output

One separate directory will be created for each dunk output:

Folder Dunk
map map
filter filter
snp snp
count count

Parameters

Parameter Required Description
-h x Prints the help.
-r x The reference fasta file.
-b x BED-file containing coordinates for 3’ UTRs.
-fb BED-file used to filter multimappers.
-o x The output directory where all output files of this dunk will be placed.
-5 Number of bases that will be hard-clipped from the 5’ end of each read [map] .
-a Maximum number of A at the 3’ end of a read.
-n The maximum number of alignments that will be reported for a multi-mapping read (i.e. reads with multiple alignments of equal best scores) [map] .
-t The number of threads to use for this dunk. NextGenMap runs multi-threaded, so it is recommended to use more threads than available samples
-q Deactivates NextGenMap’s SLAMSeq alignment settings. Can be used to align plain QuantSeq data instead of SLAMSeq data [map] .
-e Switches to semi-global alignment instead of local alignment [map] .
-m Use 3’UTR annotation to filter multimappers [filter] .
-mq Minimum mapping quality required to retain a read [filter] .
-mi Minimum alignment identity required to retain a read [filter] .
-nm Maximum number of mismatches allowed in a read [filter] .
-mc Minimum coverage to call a variant [snp] .
-mv Minimum variant fraction to call a variant [snp] .
-mts Flag to activate the multiple T->C conversion stringency: Only T->C conversions in reads with more than 1 T->C conversion will be counted. [count] .
-rl Maximum read length (will be automatically estimated if not set) [count] .
-mbq Minimum base quality for T->C conversions to be counted [count] .
files x Samplesheet (see Sample file format ) or a list of all sample BAM/FASTA(gz)/FASTQ(gz) files (wildcard * accepted).

Alleyoop

Alleyoop ( A dditional s L amdunk he L p E r tools for an Y diagn O stics O r P lots) is a collection of tools for post-processing and running diagnostics on slamdunk analyses. Similar to slamdunk , the command line interface follows the “samtools/bwa” style. Meaning that all commands are available through the central executable/script alleyoop (located in the bin directory).

MultiQC:

We implemented a module for MultiQC to support summary reports of Alleyoop modules. Currently MultiQC supports the summary , rates , utrrates , tcperreadpos and tcperutrpos commands.

dedup

This tool allows you to deduplicate a given bam-file. While many tools like Picard tools already collapses reads with same start and end position on the chromosome, alleyoop only collapses reads with same start and end position, mapping to the same strand and identical read sequence.

alleyoop dedup [-h] -o <output directory> [-tc <# tc mutations>] [-t <threads>] bam [bam ...]

Input

File Description
bam Fastq/BAM file(s) containing the final filtered reads from slamdunk . Can be multiple if multiple samples are analysed simultaneously (wildcard * is recognized).

Output

File Description
Deduplicated BAM file One BAM file per sample containing the deduplicated reads.

Output files have the same name as the input files with the prefix “_dedup”. For example:

34330_An312_wt-2n_mRNA-slamseq-autoquant_0h-R1.fq_slamdunk_mapped_filtered.bam ->
34330_An312_wt-2n_mRNA-slamseq-autoquant_0h-R1.fq_slamdunk_mapped_filtered_dedup.bam

Parameters

Parameter Required Description
-h Prints the help.
-o x The output directory where all output files of this tool will be placed.
-tc Only select reads with -tc number of T>C mutations for deduplication.
-t The number of threads to use. All tools run single-threaded, so it is recommended to use as many threads as available samples.
bam x Fastq/BAM file(s) containing the final filtered reads (wildcard * accepted).

collapse

This tool allows you to collapse all 3’UTR entries of a tcount file into one single entry per 3’UTR (similar to the exons->gene relationship in a gtf-file). All entries with identical 3’UTR IDs will be merged.

alleyoop collapse [-h] -o <output directory> [-t <threads>] tcount [tcount ...]

Input

File Description
Tcount file A tab-separated tcount file per sample containing the SLAMseq statistics (see Tcount file format ).

Output

File Description
Tcount file A tab-separated tcount file per sample containing the SLAMseq statistics (see Tcount file format ) with collapsed 3’UTR entries.

Output files have the same name as the input files with the prefix “_collapsed”.

For example:

34330_An312_wt-2n_mRNA-slamseq-autoquant_0h-R1.fq_slamdunk_mapped_filtered_tcount.tsv ->
34330_An312_wt-2n_mRNA-slamseq-autoquant_0h-R1.fq_slamdunk_mapped_filtered_tcount_collapsed.tsv

Parameters

Parameter Required Description
-h Prints the help.
-o x The output directory where all output files of this tool will be placed.
-t The number of threads to use. All tools run single-threaded, so it is recommended to use as many threads as available samples.
Tcount file x A tab-separated tcount file per sample containing the SLAMseq statistics (see Tcount file format ) with collapsed 3’UTR entries.

rates

This tool computes the overall conversion rates in your reads and plots them as a barplot.

alleyoop rates [-h] -o <output directory> -r <reference fasta> [-mq <MQ cutoff>]
               [-t <threads>] bam [bam ...]

Input

File Description
Reference fasta The reference sequence of the genome to map against in fasta format.
bam Fastq/BAM file(s) containing the final filtered reads from slamdunk (wildcard * accepted).

Output

File Description
overallrates.csv Tab-separated table of the overall conversion rates.
overallrates.pdf Overall conversion rate plot file.

Below is an example plot of the overall conversion rates of the reads in a sample. One can appreciate the typical excess of T->C conversion (A->G on minus strand) of the SLAMseq technology for later labelling timepoints.

images/stats.rates.png

Parameters

Parameter Required Description
-h Prints the help.
-o x The output directory where all output files of this tool will be placed.
-r x The reference fasta file.
-mq Minimum base quality for T->C conversions to be counted.
-t The number of threads to use. All tools run single-threaded, so it is recommended to use as many threads as available samples.
bam x Fastq/BAM file(s) containing the final filtered reads. Can be multiple if multiple samples are analysed simultaneously (wildcard * is recognized).

tccontext

This tool computes the genomic context of all Ts in a read and plots them as barplot to inspect any biases in that direction.

alleyoop tccontext [-h] -o <output directory> -r <reference fasta> [-mq <MQ cutoff>]
                   [-t <threads>] bam [bam ...]

Input

File Description
Reference fasta The reference sequence of the genome to map against in fasta format.
bam BAM file(s) containing the final filtered reads from slamdunk (wildcard * accepted).

Output

File Description
tccontext.csv Tab-separated table of the 5’ and 3’ T-contexts, separated by strand.
tccontext.pdf T-context plot file.

Below is an example plot of the T-context of all reads in a sample. On top you will find the 5’ context of individual Ts, at the bottom the respective 3’ context of the individual Ts. Note, that these will not be reciprocal (see e.g. this publication ).

images/stats.TCcontext.png

Parameters

Parameter Required Description
-h Prints the help.
-o x The output directory where all output files of this tool will be placed.
-r x The reference fasta file.
-mq Minimum base quality for T->C conversions to be counted.
-t The number of threads to use. All tools run single-threaded, so it is recommended to use as many threads as available samples.
bam x BAM file(s) containing the final filtered reads (wildcard * accepted).

utrrates

This tool checks the individual conversion rates per 3’UTR and plots them as boxplots over the entire realm of 3’UTRs. Each conversion is normalized to all possible conversions from it’s starting base e.g. A->G / (A->A + A->G + A->C + A->T).

alleyoop utrrates [-h] -o <output directory> [-r <reference fasta>] [-mq <MQ cutoff>] [-m]
                  [-t <threads>] -b <bed file> -l <maximum read length> bam [bam ...]

Input

File Description
Reference fasta The reference sequence of the genome to map against in fasta format.
-b Bed file with coordinates of 3’UTRs.
bam BAM file(s) containing the final filtered reads from slamdunk (wildcard * accepted).

Output

File Description
mutationrates_utr.csv Tab-separated table with conversion reads, one UTR per line.
mutationrates_utr.pdf UTR conversion rate plot file.

Below is an example plot of conversion rates for all UTRs for a given sample. Typically, the individual conversions for a given starting base are balanced and unbiased, except for T->C conversions in SLAMseq samples with longer labelling times.

images/stats.utrrates.png

Parameters

Parameter Required Description
-h Prints the help.
-o x The output directory where all output files of this tool will be placed.
-r x The reference fasta file.
-mq Minimum base quality for T->C conversions to be counted.
-m Flag to activate the multiple T->C conversion stringency: Only T->C conversions in reads with more than 1 T->C conversion will be counted.
-t The number of threads to use. All tools run single-threaded, so it is recommended to use as many threads as available samples.
-b x Bed file with coordinates of 3’UTRs.
-l Maximum read length in all samples (will be automatically estimated if not set).
bam x BAM file(s) containing the final filtered reads (wildcard * accepted).

snpeval

This tool produces some QA about the quality of your variant calls: Ideally, your T>C SNP calls should be independently of the number of reads with T>C conversions found in an UTR. Otherwise, this would mean that you call more T>C SNPs the more T>C reads you have and thus you lose signal by falsely calling SNPs from true T>C conversions.

To assess this, the UTRs are ranked by the number of containing T>C reads and marked with a bar if also a T>C SNP was called in the respective UTR. The list is then filtered for 3’UTRs with sufficient coverage to confidently call SNPs by using only the upper quartile of the 3’UTRs according to total read coverage.

The resulting plots will show once the distribution of SNPs across ranked 3’UTRs being blind to SNP information and including SNP information. Ideally, one would see the SNPs biased towards 3’UTRs with high T>C read content in the blind situation and uniformly distributed across all 3’UTRs in the informed situation.

This difference is also quantified using a GSEA-like Mann-Whitney-U test.

alleyoop snpeval [-h] -o <output directory> -s <SNP directory> -r <reference fasta> -b <bed file> [-c <coverage cutoff>]
                 [-f <variant fraction cutoff>] [-m] [-l <maximum read length>] [-q <minimum base quality>] [-t <threads>]
                 bam [bam ...]

Input

File Description
Reference fasta The reference sequence of the genome to map against in fasta format.
-s Directory of called SNPs from snp dunk.
-b Bed file with coordinates of 3’UTRs.
bam BAM file(s) containing the final filtered reads from slamdunk (wildcard * accepted).

Output

File Description
SNPeval.csv Tab-separated table with read counts, T>C read counts and SNP indication, one UTR per line.
SNPeval.pdf SNP evaluation plot.

An example plot is coming soon!

Parameters

Parameter Required Description
-h Prints the help.
-o x The output directory where all output files of this tool will be placed.
-s x The output directory of the snp dunk containing the called variants.
-r x The reference fasta file.
-q Minimum base quality for T->C conversions to be counted.
-m Flag to activate the multiple T->C conversion stringency: Only T->C conversions in reads with more than 1 T->C conversion will be counted.
-c Minimum coverage to call a variant.
-f Minimum variant fraction to call a variant.
-t The number of threads to use. All tools run single-threaded, so it is recommended to use as many threads as available samples.
-b x Bed file with coordinates of 3’UTRs.
-l Maximum read length in all samples (will be automatically estimated if not set).
bam x BAM file(s) containing the final filtered reads (wildcard * accepted).

summary

This tool lists basic statistics of the mapping process in a text file.

alleyoop summary [-h] -o <output file> [-t <directory of tcount files>] bam [bam ...]

Input

File Description
bam BAM file(s) containing the final filtered reads from slamdunk (wildcard * accepted).
tcount file directory (optional) Directory containing the associated tcount file(s) to the input BAM file(s).

Output

File Description
outputfile Tab-separated table with mapping statistics.
outputfile_PCA.pdf PCA plot of the samples based on T>C read counts per UTR.
outputfile_PCA.txt PCA values of the samples based on T>C read counts per UTR.

The output file will be a tab-separated text file with the following columns:

Column Content
FileName Path to raw reads in BAM/fasta(gz)/fastq(gz) format.
SampleName Description of the sample.
SampleType Type of sample.
SampleTime Timepoint of the sample in minutes.
Sequenced Number of sequenced reads.
Mapped Number of mapped reads.
Deduplicated Number of deduplicated reads.
Filtered Number of retained reads after filtering.
Counted Number of counted reads within UTRs (optional: only if tcount file directory was supplied) .
Annotation Annotation used for filtering.

An example PCA plot is coming soon!

Parameters

Parameter Required Description
-h Prints the help.
-o x The output file name.
-t The directory of associated tcount file(s) to the supplied BAM file(s).
bam x BAM file(s) containing the final filtered reads (wildcard * accepted).

merge

This tool merges tcount files of multiple samples into a single table based upon an expression of columns.

alleyoop merge [-h] -o <output file> [-c <expression>] countFiles [countFiles ...]

Input

File Description
countFiles A tab-separated tcount file per sample containing the SLAMseq statistics (see Tcount file format ).

Output

File Description
outputfile Tab-separated table merged tcount information based upon expression.

Parameters

Parameter Required Description
-h Prints the help.
-o x The output file name.
-c Column or expression used to summarize files (e.g. “TcReadCount / ReadCount”)
countFiles x A tab-separated tcount file per sample containing the SLAMseq statistics (see Tcount file format ).

tcperreadpos

This tool calculates the individual mutation rates per position in a read treating T->C mutations separately. This plot can be used to search for biases along reads.

alleyoop tcperreadpos [-h] -r <reference fasta> [-s <SNP directory>]
                      [-l <maximum read length>] -o <output directory> [-mq <MQ cutoff>]
                      [-t <threads>] bam [bam ...]

Input

File Description
Reference fasta The reference sequence of the genome to map against in fasta format.
-s (optional) The called variants from the snp dunk to filter false-positive T->C conversions.
bam BAM file(s) containing the final filtered reads from slamdunk (wildcard * accepted).

Output

File Description
tcperreadpos.csv Tab-separated table with mutation rates, one line per read position.
tcperreadpos.pdf Plot of the mutation rates along the reads.

Below is an example plot of mutation rates along all reads in a sample. Typically, one will see increasing error rates towards the end of a reads, as for all Illumina reads. In addition, depending on how many bases were clipped from the 5’ end of the reads, one will also observe higher error rates in the beginning of the read as illustrated in the example plot. Finally, for SLAMseq samples with longer labelling times, the overall T->C conversions in the bottom plot will begin to increase compared to the overall background in the top plot.

images/stats.tcperreadpos.png

Parameters

Parameter Required Description
-h Prints the help.
-o x The output directory where all output files of this tool will be placed.
-r x The reference fasta file.
-mq Minimum base quality for T->C conversions to be counted.
-t The number of threads to use. All tools run single-threaded, so it is recommended to use as many threads as available samples.
-s The called variants from the snp dunk to filter false-positive T->C conversions.
-l Maximum read length in all samples (will be automatically estimated if not set).
bam x BAM file(s) containing the final filtered reads (wildcard * accepted).

tcperutrpos

This tool calculates the individual mutation rates per position in an 3’UTR treating T->C mutations separately. This plot can be used to search for biases along UTRs. Only most 3’ 200bp of each UTR will be considered because: * Quantseq fragments are estimated have an average size of ~200bp * This way, any dynamic binning biases can be avoided

alleyoop tcperutrpos [-h] -r <reference fasta> -b <bed file> [-s <SNP directory>]
                     [-l <maximum read length>] -o <output directory> [-mq <MQ cutoff>]
                     [-t <threads>] bam [bam ...]

Input

File Description
Reference fasta The reference sequence of the genome to map against in fasta format.
-s (optional) The called variants from the snp dunk to filter false-positive T->C conversions.
-b Bed file with coordinates of 3’UTRs.
bam BAM file(s) containing the final filtered reads from slamdunk (wildcard * accepted).

Output

File Description
tcperutr.csv Tab-separated table with mutation rates, one line per UTR position.
tcperutr.pdf Plot of the mutation rates along the UTRs.

Below is an example plot of mutation rates along all UTRs in a sample. Typically, one will see increasing error rates towards the end of a UTRs. For SLAMseq samples with longer labelling times, the overall T->C conversions in the bottom plot will begin to increase compared to the overall background in the top plot.

images/stats.tcperutrpos.png

Parameters

Parameter Required Description
-h Prints the help.
-o x The output directory where all output files of this tool will be placed.
-r x The reference fasta file.
-b x Bed file with coordinates of 3’UTRs.
-mq Minimum base quality for T->C conversions to be counted.
-t The number of threads to use. All tools run single-threaded, so it is recommended to use as many threads as available samples.
-s The called variants from the snp dunk to filter false-positive T->C conversions.
-l Maximum read length in all samples (will be automatically estimated if not set).
bam x BAM file(s) containing the final filtered reads (wildcard * accepted).

dump

This tool outputs all available information calculated by slamdunk for each read in a sample.

alleyoop dump [-h] -r <reference fasta> -s <SNP directory> -o <output directory>
              [-mq <MQ cutoff>] [-t <threads>] bam [bam ...]

Input

File Description
Reference fasta The reference sequence of the genome to map against in fasta format.
-s The called variants from the snp dunk to filter false-positive T->C conversions.
bam BAM file(s) containing the final filtered reads from slamdunk (wildcard * accepted).

Output

File Description
readinfo.sdunk Tab-separated table with read info, one line per read

The following columns are contained in the readinfo file:

Column Description
Name Name of the read
Direction Read was mapped on forward (1) or reverse (2) strand
Sequence Sequence of the read
Mismatches Number of mismatches in the read
tcCount Number of T->C conversion in the read
ConversionRates List of all possible conversion in the read

Parameters

Parameter Required Description
-h Prints the help.
-o x The output directory where all output files of this tool will be placed.
-r x The reference fasta file.
-mq Minimum base quality for T->C conversions to be counted.
-t The number of threads to use. All tools run single-threaded, so it is recommended to use as many threads as available samples.
-s x The called variants from the snp dunk to filter false-positive T->C conversions.
bam x BAM file(s) containing the final filtered reads (wildcard * accepted).

Docker

Docker enables you to run slamdunk out-of-the box wihtout caring about prerequisites and installation procedures. All you need is to install the Docker engine and run the slamdunk Docker image.

Docker images

You can directly use the slamdunk Docker image from Docker hub .

  • slamdunk : Build from scratch from a clean Ubuntu image

You can also build them from scratch using any of the two Dockerfiles in slamdunk/docker :

# Build Ubuntu-based image
cd slamdunk/docker/ubuntu
docker build -t <user>/<imagename> .

# View images
docker images

Running the Docker container locally

Once the container images is pulled/created, slamdunk can be run out of the box. The only difference is that the data containing directory has to be mounted on the container and all files are relative to that mounting point.

# All data is contained in current working directory and available under /data in the container

 docker run -m 8g -v $(pwd):/data tobneu/slamdunk slamdunk map -o /data/map -r /data/GRCm38.fa -5 5 -n 1 /data/testset.fq.gz

Running a Docker container on AWS

Many cloud providers provide Docker support these days. The most prominent cloud provider are the Amazon Web Services (AWS) have a nice integration into Docker machine which easily let’s you install Docker engine on AWS machines. AWS provides free t2.micro instances for free-tier accounts. These only come with 1 GiB of memory which is not enough for mammalian genomes (~ 8 GiB needed) where one would need to go for at least t2.large instances.

# Start a free t2.micro instance (small genomes only e.g. bacteria) on e.g. the us-west-1 region
docker-machine create --driver amazonec2 --amazonec2-region us-west-1 aws-slamdunk

or

# Start a paid t2.large instance (large genomes) on e.g. the us-west-1 region
docker-machine create --driver amazonec2 --amazonec2-region us-west-1 --amazonec2-instance-type t2.large aws-slamdunk

Once the instance has started, one needs to connect to it before running any Docker images.

eval $(docker-machine env aws-slamdunk)

Now everything is set up and one can run slamdunk on an AWS instance.

docker run -m 8g -v $(pwd):/data tobneu/slamdunk slamdunk map -o /data/map -r /data/GRCm38.fa -5 5 -n 1 /data/testset.fq.gz

After the work is done, remember to shut down your AWS instances unless you do not care about paying for an idle instance.

# Shut down
docker-machine stop aws-slamdunk

# Remove
docker-machine rm -y aws-slamdunk

FAQ

Authors

  • Tobias Neumann
  • Philipp Rescheneder

License

slamdunk is provided under the GNU Affero General Public License v3 .

Simulating SlamSeq data

Example

#!bash
set -x
F=simulation_2/
B=finalAnnotation_test_cut_chrM_correct_100
RL=50
COV=100
SN=pooja_UTR_annotation_examples_sample

rm -rf $F
~/bin/slamdunk/bin/slamsim preparebed -b ${B}.bed -o ${F} -l ${RL}
~/bin/slamdunk/bin/slamsim utrs -b ${F}/${B}_original.bed -o $F -r /project/ngs/philipp/slamseq/ref/GRCm38.fa

~/bin/slamdunk/bin/slamsim reads -b ${F}/${B}_original_utrs.bed -l ${RL} -o $F -cov ${COV} -t 0 --sample-name ${SN}_1_0min
~/bin/slamdunk/bin/slamsim reads -b ${F}/${B}_original_utrs.bed -l ${RL} -o $F -cov ${COV} -t 15 --sample-name ${SN}_2_15min
~/bin/slamdunk/bin/slamsim reads -b ${F}/${B}_original_utrs.bed -l ${RL} -o $F -cov ${COV} -t 30 --sample-name ${SN}_3_30min
~/bin/slamdunk/bin/slamsim reads -b ${F}/${B}_original_utrs.bed -l ${RL} -o $F -cov ${COV} -t 60 --sample-name ${SN}_4_60min
~/bin/slamdunk/bin/slamsim reads -b ${F}/${B}_original_utrs.bed -l ${RL} -o $F -cov ${COV} -t 180 --sample-name ${SN}_5_180min
~/bin/slamdunk/bin/slamsim reads -b ${F}/${B}_original_utrs.bed -l ${RL} -o $F -cov ${COV} -t 360 --sample-name ${SN}_6_360min
~/bin/slamdunk/bin/slamsim reads -b ${F}/${B}_original_utrs.bed -l ${RL} -o $F -cov ${COV} -t 720 --sample-name ${SN}_7_720min
~/bin/slamdunk/bin/slamsim reads -b ${F}/${B}_original_utrs.bed -l ${RL} -o $F -cov ${COV} -t 1440 --sample-name ${SN}_8_1440min

~/bin/slamdunk/bin/slamdunk all -r /project/ngs/philipp/slamseq/ref/GRCm38.fa -b ${F}/${B}_original_utrs.bed -rl 55 -o ${F}/slamdunk ${F}/*.bam

~/bin/slamdunk/bin/slamsim plot.conversions -sim ${F} -slam ${F}/slamdunk/count/ -o ${F}/eval/conversion_rate_eval_plots.pdf

~/bin/slamdunk/bin/slamsim plot.halflifes -sim ${F} -slam ${F}/slamdunk/count/ -t 0,15,30,60,180,360,720,1440 -o ${F}/eval/halflife_per_gene_eval_plots.pdf -b ${F}/${B}_original_utrs.bed

Changelog

Version 0.3.4

Bugfixes:

  • Reverse strand A->G SNP coordinates offset fixed.
  • Mismatch of string to int in key search fixed for summary.

Version 0.3.3

Bugfixes:

  • Mutation type bugfix in MPTagToConversion fixed
  • Deduplicated bam file properly closed for indexing
  • alleyoop snpeval also works now without snp sets
  • Chromosome retreival fix for pysam
  • dplyr dependencies removed (Bioconda)

Version 0.3.2

Changes:

  • Bioconda : Slamdunk recipe added. Slamdunk now available via conda from v0.3.2!
  • Symlink ngm -> nextgenmap/bin/ngm-0.5.2/ngm replaced by hard copy for bioconda-utils build compatibility.

Version 0.3.1

Changes:

  • Upgrade Samtools 1.3.1 -> Samtools 1.9 for upcoming bioconda build.

Bugfixes:

  • pysam-0.15.0.1 region retreival fix. (dashed chromosomes allowed now)
  • NGM permission fix to allow Singularity image build directly from Docker hub.

Version 0.3.0

Changes:

  • Travis-CI testing added.
  • Automated versionable Docker container builds.
  • Static 2 T>C conversion cutoff multiTCSTringency parameter replaced by dynamic conversion-threshold parameter
  • Positional track module added to Alleyoop to produce genome-wide positional T>C conversion rate bigWigtracks
  • Read-separator module added to Alleyoop to separated T>C reads from non T>C reads into bam-files.
  • Docker base image set to ubuntu:18.04

Bugfixes:

  • Supplementary alignment flags unset in filtered bam
  • Fixed numpy version removed to satisfy pandas dependency
  • Minimum baseQ filter now inclusive which is the expected behaviour
  • Reference check for short chromosomes introduced
  • Backwards compatibility to pysam 0.8.3 introduced

Version 0.2.4

Changes:

  • License updated to AGPLv3
  • Deduplicator now only marks duplicates instead of removing them. Useful for deduplication QA like dupRadar . Allows for deduplication of T>C-fraction of reads only.
  • Singularity build-file branch added.

Bugfixes:

  • readInRegion fixed: Switched from fetch(region=”chr:start-end” (1-based) to fetch(reference=chr, start=start, end=end) (0-based). Intervals close to the ends of a chromosome are now correctly handeled and not skipped.
  • Hotfix to fix executable permissions on some R plotting scripts
  • Fixed alleyoop collapse to ignore all MLE columns for now
  • R package dependencies fixed for IMP cluster
  • include_package_data removed for proper contrib import

Version 0.2.3

Bugfixes:

  • Hotfix to include README.md in sdist packaging
  • Hotfix to fix executable permissions on some R plotting scripts
  • Fixed alleyoop collapse to ignore all MLE columns for now

Version 0.2.2

Changes:

  • Slamdunk supports now MultiQC report creation
  • Default set to local mapping
  • Base-quality cutoff implemented throughout the entire slamdunk / alleyoop workflow
  • BAQ computation disabled for samtools mpileup
  • Dedup now supports selection of reads containing a defined # T>C conversions
  • All documentation moved to gh-pages branch for slicker repo
  • CPMs now normalized to filtered reads
  • Docker image now build automatically to head revision
  • Official webpage introduced instead of readthedocs.org hosting
  • Evaluation module for count files added to splash

Bugfixes:

  • PCAs for single sample working now
  • Dedup working again
  • Sample files may contain empty lines now

Version 0.2.1

Changes:

  • PCA will be added to alleyoop summary if count folder is provided
  • Reads in UTRs will be added to alleyoop summary if count folder is provided
  • Comment tags added to summary, rates, utrrates, tcperreadpos, tcperutrpos to be recognized by MultiQC <http://multiqc.info/>
  • NextGenMap udated to version 0.5.2
  • Slamdunk now checks NextGenMap version
  • Simulation module (splash) added: simulates slamdunk samples from a set of UTRs (BED file) and evalutes results computed by SlamDunk. See splash –help for more information
  • -i/–sample-index is now 1-based instead of 0-based
  • Columns for conversion rate confidence interval added to count files (not used yet)
  • Filter summary integrated into log
  • Added sample names now in globalRatePlotter

Bugfixes:

  • Number of sequenced reads fixed for BAM files computed with -ss/–skip-sam option
  • NextGenMap memory leak fixed

Version 0.2.0

Due to major changes v0.2.0 is not compatible with v0.1.0. Please rerun alle your samples with slamdunk 0.2.0!

Changes:

  • Option to supply sample info via samplesheet instead of plain raw reads added.
  • Added @RG tag to bam files containing sample info for best-practice and easier summary calculation.
  • Flagstat files removed.
  • –skip-sam flag added to only output bam.
  • –sample-index flag implemented for cluster distribution.
  • Multi-TC stringency now also available for utr rate plots.
  • *stats prefixes removed from alleyoop modules.
  • Alleyoop summary module implemented. Summary contains sample infos (if specified via samplesheet) and number of sequenced, mapped and filtered reads.
  • Alleyoop merge module allows now merging upon column names.
  • Sample info and reference + checksum now documented in tcount files.
  • Version.py added.

Bugfixes :

  • NGM interspersed order of multimappers fixed.
  • Random seed from filter module deleted.
  • Alleyoop rates display error of bars higher than ylim fixed.
  • Min base quality now again propagated to all modules.
  • Auto-scaling error fixed in globalRatePlotter.

Version 0.1.0

  • Initial pre-release.