<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://t-neumann.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://t-neumann.github.io/" rel="alternate" type="text/html" /><updated>2025-11-09T16:44:36+01:00</updated><id>https://t-neumann.github.io/feed.xml</id><title type="html">t-neumann.github.io</title><subtitle>Personal website of Tobias Neumann.</subtitle><author><name>Tobias Neumann</name></author><entry><title type="html">ICLR 2025 digest</title><link href="https://t-neumann.github.io/conferences/machine%20learning/iclr2025/" rel="alternate" type="text/html" title="ICLR 2025 digest" /><published>2025-04-20T14:30:00+02:00</published><updated>2025-04-20T14:30:00+02:00</updated><id>https://t-neumann.github.io/conferences/machine%20learning/iclr2025</id><content type="html" xml:base="https://t-neumann.github.io/conferences/machine%20learning/iclr2025/"><![CDATA[<p>This April, I had the opportunity to attend the International Conference on Learning Representations (ICLR) 2025 in Singapore (after missing the 2024 edition in Vienna, shame on me). ICLR has established itself as the premier gathering for professionals dedicated to representation learning and deep learning, and I needed to know what the fuss was about.</p>

<p>The conference brought together an impressive mix of academic researchers, industry practitioners from companies like Google DeepMind, Meta, and Isomorphic Labs, as well as entrepreneurs and graduate students - all converging on Singapore to discuss cutting-edge research spanning machine vision, computational biology, speech recognition, and robotics.</p>

<p>Beyond the technical content, Singapore itself provided a stunning backdrop. The tropical climate, lush greenery, and modern architecture - especially the iconic Marina Bay Sands - created a very futuristic atmosphere. The city's food scene and details like subway pop choreography performances added to the experience. But let me dive into what mattered most to me: the technical developments that are shaping the future of AI.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/ICLR-2025/singapore_architecture.jpg" alt="Singapore architecture" /></p>

<h2 id="the-rise-of-agentic-ai">The Rise of Agentic AI</h2>

<p>If there was one overarching theme at ICLR 2025, it was the evolution of AI systems from passive responders to active agents. The shift from traditional deep learning models to agentic AI represents a fundamental change in the field of artificial intelligence.</p>

<h3 id="from-models-to-agents">From Models to Agents</h3>

<p>Traditional AI systems, even sophisticated ones, essentially function as input-output machines. You provide data, they provide predictions. Agentic AI systems, by contrast, exhibit several key characteristics that make them fundamentally different:</p>

<p><strong>Goal-oriented behavior</strong>: Rather than simply responding to prompts, agentic systems can pursue complex, multi-step objectives autonomously. They don’t just answer “what should I do next?” - they actually do it.</p>

<p><strong>Reflection and adaptation</strong>: Perhaps most intriguingly, these systems can reflect upon their own work and iteratively improve their approach. This meta-cognitive capability allows them to identify failures, adjust strategies, and come up with follow-up steps without human intervention.</p>

<p><strong>Environment interaction</strong>: Agentic AI systems actively interact with their environment - whether that’s a database, a laboratory instrument, or a computational workflow. They can query information, execute commands, and observe the results of their actions.</p>

<p><strong>Tool integration</strong>: Modern agentic systems seamlessly integrate with external tools and APIs, allowing them to leverage specialized capabilities beyond their core model. For instance, TxAgent - a therapeutic reasoning agent presented at the conference - orchestrates 211 specialized tools spanning FDA drug databases, Open Targets, and the Human Phenotype Ontology. This dynamic tool selection allows agents to access verified, continually updated knowledge rather than relying solely on their training data.</p>

<p><strong>Multi-agent collaboration</strong>: Finally - and most futuristic of all - these systems can coordinate with other AI agents, dividing tasks and sharing information to solve problems that would be intractable for a single agent.</p>

<p>Several talks at ICLR covered agentic systems targeting complex scientific workflows, from literature review and hypothesis generation to experimental design and data analysis. My favourite showcase was Agentic-Tx, a therapeutics-focused system powered by Gemini 2.5, which achieved a 52.3% relative improvement over o3-mini on Humanity’s Last Exam (Chemistry &amp; Biology) and demonstrated significant gains on ChemBench and GPQA benchmarks. The key insight is that these systems aren’t just faster versions of traditional AI - they represent a qualitatively different approach to automation.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/ICLR-2025/txagent_workflow.png" alt="TxAgent workflow" /></p>

<p><em>TxAgent workflow demonstrating agentic AI capabilities: knowledge grounding through tool calls, goal-oriented tool selection, multi-step reasoning, and access to continuously updated knowledge bases. The system generates transparent reasoning traces that show each decision step. Source: Gao et al., 2025</em></p>

<h3 id="txgemma-a-case-study-in-domain-specific-agents">TxGemma: A Case Study in Domain-Specific Agents</h3>

<p>The most interesting development coming from a drug discovery company was TxGemma, a suite of efficient, domain-specific large language models for therapeutic applications. What makes TxGemma noteworthy isn’t just its performance, but its practical accessibility and adaptability for drug discovery.</p>

<p>Built on the Gemma-2 architecture, TxGemma comes in three sizes - 2B, 9B, and 27B parameters - making it dramatically more efficient than typical foundation models. The suite was fine-tuned on a comprehensive dataset of 7.08 million training samples from the Therapeutics Data Commons (TDC), covering 66 different therapeutic development tasks spanning small molecules, proteins, nucleic acids, diseases, and cell lines. This broad training enables TxGemma to handle diverse aspects of drug discovery, from early-stage target identification to late-stage clinical trial predictions.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/ICLR-2025/txgemma_overview.png" alt="TxGemma architecture" /></p>

<p><em>TxGemma model family: Three size variants (2B, 9B, 27B) trained on diverse therapeutic data from TDC, with specialized versions for prediction (TxGemma-Predict) and conversation (TxGemma-Chat). The models can be integrated as tools in agentic systems like Agentic-Tx. Source: Wang et al., 2025</em></p>

<p>The performance results are definitely useful: Across the 66 TDC tasks, TxGemma achieved superior or comparable performance to state-of-the-art models on 64 tasks (outperforming on 45), despite being orders of magnitude smaller than many competing models. On tasks involving drug-target interactions, pharmacokinetics, and toxicity prediction, TxGemma consistently matched or exceeded specialist models that were designed specifically for those narrow applications.</p>

<p>What’s particularly intriguing is TxGemma’s data efficiency. When fine-tuning for clinical trial adverse event prediction, TxGemma matched the performance of base Gemma-2 models using less than 10% of the training data. In data-scarce domains like drug discovery - where proprietary datasets are common and expensive to generate - this efficiency advantage is definitely a major plus if you want to put this in production.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/ICLR-2025/txgemma_performance.png" alt="TxGemma performance comparison" /></p>

<p><em>TxGemma-Predict demonstrates superior performance across diverse therapeutic task types, with particularly strong results on multi-instance tasks involving multiple data modalities. Median relative improvements show consistent gains over both generalist and specialist state-of-the-art models. Source: Wang et al., 2025</em></p>

<p>The real power of domain-specific foundational models lies in their ability to serve as starting points. Rather than training from scratch, researchers can fine-tune TxGemma on their specific tasks, dramatically reducing the computational resources and data required to achieve good performance. The models can even run on a single Nvidia H100 GPU, making them accessible to smaller research groups and enabling local deployment for sensitive applications.</p>

<p>TxGemma also came with a nice illustrative example of how it can be used in agentic systems - theirs being Agentic-Tx, the therapeutics-focused agentic system powered by Gemini 2.5 that extends TxGemma’s capabilities by orchestrating complex workflows. In contrast to TxGemma’s direct generation of solutions, Agentic-Tx employs a modular, tool-usage paradigm. It builds on the ReAct framework, which interleaves reasoning steps (“thoughts”) with actions (tool use): the agent receives a task or question, iteratively takes actions based on its current context, and can therefore answer questions that require multiple reasoning steps to solve. For example, “What structural modifications could improve the potency of the given drug?” requires iteratively searching the drug’s structural space and then prompting TxGemma to predict potency.</p>
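<p>The ReAct pattern described above can be sketched in a few lines of plain Python. This is a toy illustration only, not Agentic-Tx’s actual code: the tool registry and the scripted “LLM” are hypothetical stand-ins.</p>

```python
# Minimal ReAct-style loop (illustrative sketch, not Agentic-Tx's code).
# The "LLM" is scripted and the single tool is a hypothetical stand-in
# for a TxGemma-backed potency predictor.

def fake_llm(context):
    """Stand-in for an LLM: returns a (thought, action, argument) triple
    based on what is already in the accumulated context."""
    if "Observation:" not in context:
        return ("I need a potency estimate for this analog.",
                "predict_potency", "CC(=O)Oc1ccccc1C(=O)O")
    return ("I have enough information to answer.", "finish",
            "The acetyl analog is predicted to retain potency.")

# Hypothetical tool registry; in Agentic-Tx these would wrap TxGemma etc.
TOOLS = {
    "predict_potency": lambda smiles: f"potency(pIC50) ~ 6.2 for {smiles}",
}

def react_loop(question, max_steps=5):
    context = question
    for _ in range(max_steps):
        thought, action, arg = fake_llm(context)   # reason ("thought")
        if action == "finish":
            return arg
        observation = TOOLS[action](arg)           # act, then observe
        context += (f"\nThought: {thought}"
                    f"\nAction: {action}"
                    f"\nObservation: {observation}")
    return "step budget exhausted"

answer = react_loop("What structural modifications could improve potency?")
```

<p>The essential point is the alternation: each turn appends a thought, an action, and the observed tool output to the context, so later reasoning steps can build on earlier results.</p>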

<p><img src="https://t-neumann.github.io/assets/images/posts/ICLR-2025/agenticTx.png" alt="Agentic-Tx" /></p>

<p><em>Agentic-Tx in combination with the ReAct framework to interleave thought with tool-usage. In this example, Agentic-Tx uses two tools to decide which hit from a screening campaign should be prioritized: TxGemma-Chat and the clinical toxicity prediction tool based on TxGemma-Predict.</em></p>

<p>Lastly, TxGemma goes beyond prediction. Unlike traditional models that output only answers, TxGemma-Chat - the conversational variant - can explain its reasoning. When asked why a molecule crosses the blood-brain barrier, it can discuss lipophilicity, molecular weight, and hydrogen bonding based directly on the molecular structure. This explainability is a notable first in therapeutic AI and addresses one of the field’s most significant limitations: the “black box” problem.</p>

<p>TxGemma-Chat maintains this conversational ability while accepting only about a 10% performance reduction on predictive tasks compared to TxGemma-Predict. This trade-off - slightly lower raw accuracy for vastly improved interpretability and user interaction - represents an important design decision in therapeutic AI. For research applications where understanding the model’s reasoning is crucial for gaining insight into the decisive parameters behind a prediction, this trade-off is definitely worth it.</p>

<p>Moreover, TxGemma has been released as an open model specifically trained only on commercially licensed datasets. This decision recognizes the prevalence of proprietary data in pharmaceutical research and allows smaller biotech and pharmaceutical startups to adapt and validate the models on their own datasets, potentially tailoring performance to their specific research needs and real-world applications.</p>

<h2 id="multimodal-learning-connecting-different-data-types">Multimodal Learning: Connecting Different Data Types</h2>

<p>Another major theme at ICLR was the challenge of integrating different types of biological and chemical data. In drug discovery and computational biology, we often have rich datasets in different modalities - transcriptomics, proteomics, chemical structures, microscopy images - but connecting these disparate data types is challenging.</p>

<h3 id="multimodal-adapters-efficient-cross-modal-learning">Multimodal Adapters: Efficient Cross-Modal Learning</h3>

<p>One elegant solution presented at the conference was the concept of multimodal adapters - specifically, the single-cell Drug-Conditional Adapter (scDCA). The core idea is pretty simple: rather than training massive end-to-end models that try to handle all modalities simultaneously, we can train small “adapter” layers that bridge between pre-trained foundational models for each modality.</p>

<p>Here’s how it works. You might have a powerful foundational model for single-cell transcriptomics (like scGPT, trained on 33 million cells) and another for molecular structures (like ChemBERTa, trained on 77 million compounds). Rather than starting from scratch to predict how drugs affect cells, scDCA introduces lightweight adapter layers that learn to translate molecular information into adjustments of the cellular model’s internal representations.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/ICLR-2025/scdca_architecture.png" alt="scDCA architecture" /></p>

<p><em>Architecture of scDCA showing drug-conditional adapters that efficiently fine-tune single-cell foundation models. The adapter introduces molecular conditioning through dynamic bias adjustments while keeping the original transformer weights frozen, enabling training with less than 1% of the original model’s parameters. Source: Maleki et al., 2025</em></p>

<p>The advantages are pretty striking:</p>

<p><strong>Efficiency</strong>: Adapters typically involve training only 1% of the parameters compared to the original foundational models. For scDCA specifically, while the base scGPT model has millions of parameters, the adapters add only a tiny fraction. This makes them dramatically faster and cheaper to train - critical when working with limited datasets.</p>

<p><strong>Avoiding overfitting</strong>: By keeping the foundational models frozen and only training the adapter, you preserve the learned knowledge in the original models. This is particularly valuable when working with limited paired training data - a common challenge when you have only 188 compounds with cellular response data (as in the Sciplex3 dataset used for validation).</p>

<p><strong>Flexibility</strong>: You can mix and match different foundational models by simply training new adapters, without needing to retrain entire systems. Need to connect a different molecular encoder? Just train a new adapter layer.</p>

<p>Performance was pretty neat: scDCA successfully predicted cellular responses to novel drugs and - even more impressively - generalized to completely unseen cell lines in a zero-shot setting with 82% accuracy. This generalization happens because the frozen single-cell foundation model supposedly retains its understanding of gene-gene interactions and cellular states, while the adapter learns to modulate these representations based on molecular structure.</p>

<h3 id="multimodal-lego-assembling-models-like-building-blocks">Multimodal Lego: Assembling Models Like Building Blocks</h3>

<p>Taking the adapter concept even further, MM-Lego (Multimodal Lego) introduced a framework that makes any set of encoders compatible for model merging and fusion - without requiring paired training data at all.</p>

<p>The key innovation is the “LegoBlock” - a wrapper that enforces two critical properties. First, it ensures all modalities produce latent representations with the same dimensions (making them stackable, like Lego pieces). Second, and more cleverly, it learns representations in the frequency domain using Fourier transforms. Why does this matter? Frequency-domain representations are apparently less prone to signal interference when combined, making them ideal for model merging (I have to trust them on this ^^).</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/ICLR-2025/mm_lego_workflow.png" alt="MM-Lego workflow" /></p>

<p><em>The MM-Lego workflow showing how LegoBlocks enforce structural compatibility and learn frequency-domain representations that enable merging without signal interference. Models can be merged without any fine-tuning (LegoMerge) or with minimal fine-tuning for state-of-the-art performance (LegoFuse). Source: Hemker et al., 2024</em></p>

<p>MM-Lego introduced two approaches:</p>

<p><strong>LegoMerge</strong>: Combines models trained entirely separately - without any paired data or fine-tuning. The merged representation uses a harmonic mean of magnitudes and arithmetic mean of phases in the frequency domain, carefully designed to avoid one modality dominating the signal. Interestingly and counter-intuitively, this achieved competitive performance with end-to-end trained models across seven medical datasets, despite never seeing a single multimodal training sample.</p>

<p><strong>LegoFuse</strong>: Takes the merged components and fine-tunes them for just a few epochs (as little as 2) with paired data. This allows modalities to mutually contextualize each other while avoiding the computational overhead of full end-to-end training. LegoFuse achieved state-of-the-art results on 5 of 7 benchmarked tasks.</p>
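<p>The LegoMerge combination rule lends itself to a small sketch. This is my own toy reconstruction from the description above: the complex-valued latents are made up, and real LegoBlocks would produce them via a Fourier transform over learned features.</p>

```python
# Sketch of the LegoMerge combination rule as described: harmonic mean of
# magnitudes, arithmetic mean of phases, applied elementwise to
# frequency-domain latents. Toy latents only - not MM-Lego's real code.
import cmath
import math

def merge(latent_a, latent_b):
    merged = []
    for za, zb in zip(latent_a, latent_b):
        ra, rb = abs(za), abs(zb)
        r = 2 * ra * rb / (ra + rb)          # harmonic mean of magnitudes:
                                             # a huge magnitude in one modality
                                             # cannot dominate the merge
        phi = (cmath.phase(za) + cmath.phase(zb)) / 2  # arithmetic mean of
                                             # phases (naive: ignores the
                                             # wrap-around at +/- pi)
        merged.append(cmath.rect(r, phi))
    return merged

# Two hypothetical unimodal latents already in the frequency domain.
histology = [cmath.rect(1.0, 0.0), cmath.rect(2.0, math.pi / 2)]
genomics  = [cmath.rect(1.0, 0.0), cmath.rect(4.0, math.pi / 2)]

fused = merge(histology, genomics)
# second component: magnitude 2*2*4/(2+4) = 8/3, phase pi/2
```

<p>Note how the harmonic mean pulls the merged magnitude toward the smaller of the two inputs (8/3 rather than the arithmetic mean of 3) - one plausible reading of the claim that the rule is “carefully designed to avoid one modality dominating the signal.”</p>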

<p>The practical advantages are substantial: MM-Lego scales linearly with the number of modalities (not quadratically like many attention-based methods), handles missing modalities gracefully, works with non-overlapping training sets, and - critically - doesn’t require architecturally identical models. You can combine a CNN for images with a transformer for sequences simply by wrapping each in a LegoBlock.</p>

<p>One presentation demonstrated training on completely non-overlapping datasets - one set of patients with histopathology slides, a different set with genomic data, both with the same clinical outcomes. Traditional end-to-end models can’t handle this scenario at all, but MM-Lego achieved strong performance by training each modality independently and merging the results.</p>

<h3 id="why-this-matters">Why This Matters</h3>

<p>These approaches address fundamental challenges we face everyday in computational biology and drug discovery. Paired multi-modal measurements are expensive and often impossible - you might have single-cell data for some conditions and bulk sequencing for others, microscopy for some samples and proteomics for different samples. Traditional methods force you to either throw away data (using only the intersection) or impute missing values (introducing noise).</p>

<p>Adapter-based and frequency-domain approaches like scDCA and MM-Lego let you leverage all available data by training on unpaired samples and combining models afterward. As Michael Bronstein memorably put it in his panel discussion: “Everybody wants to develop the next AlphaFold, nobody the next PDB.” The bottleneck isn’t model architecture - it’s generating high-quality data at scale. Methods that work with incomplete, unpaired, and heterogeneous data are essential for making progress.</p>

<h2 id="contrastive-learning-learning-from-similarity">Contrastive Learning: Learning from Similarity</h2>

<p>Contrastive learning was also another recurring topic at ICLR, particularly for biological and chemical applications where labeled data can be scarce but unlabeled structure is abundant.</p>

<h3 id="the-core-principle">The Core Principle</h3>

<p>The fundamental idea behind contrastive learning is elegantly simple: teach a model to recognize what’s similar and what’s different. Rather than requiring explicit labels for every example, you create pairs of data points and ask the model to learn that similar pairs should have similar representations, while dissimilar pairs should be far apart in representation space.</p>

<p>In the context of biological perturbations, this might mean:</p>

<ul>
  <li><strong>Similar pairs</strong>: Unperturbed samples from the same cell line, or cells treated with the same compound</li>
  <li><strong>Dissimilar pairs</strong>: Perturbed vs. unperturbed samples, or cells treated with different compounds</li>
</ul>

<p>By training models to maximize agreement within similar pairs and maximize disagreement between dissimilar pairs, you can learn rich representations that capture meaningful biological variation.</p>
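<p>Concretely, this usually boils down to an InfoNCE-style objective. The sketch below is a generic textbook version in plain Python, not any specific paper’s loss.</p>

```python
# Bare-bones InfoNCE-style contrastive loss (generic sketch of the
# principle, not a specific paper's objective). The loss is the negative
# log-probability of picking the positive out of positive + negatives.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(anchor, positive, negatives, temperature=0.1):
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    logits = [s / temperature for s in sims]
    m = max(logits)  # subtract the max to keep the softmax numerically stable
    denom = sum(math.exp(x - m) for x in logits)
    return -math.log(math.exp(logits[0] - m) / denom)

# e.g. embeddings of cells treated with the same compound should sit together
anchor    = [1.0, 0.0]
positive  = [0.9, 0.1]                  # same perturbation, nearby embedding
negatives = [[-1.0, 0.1], [0.0, 1.0]]   # different perturbations

loss_good = info_nce(anchor, positive, negatives)
loss_bad  = info_nce(anchor, [-1.0, 0.0], [[0.9, 0.1], [0.0, 1.0]])
# the loss is far lower when the positive really is close to the anchor
```

<p>Minimizing this loss over many such triples is what pulls same-perturbation samples together and pushes different perturbations apart in the embedding space.</p>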

<h3 id="better-separation-better-biology">Better Separation, Better Biology</h3>

<p>Several presentations demonstrated how contrastive learning leads to better separation of perturbations in embedding spaces. This has practical implications for downstream analyses - better UMAPs, clearer clustering, and more interpretable representations of complex biological states.</p>

<p>One particularly clever application involved using different molecular representations (SMILES strings, graphs, 3D conformations) and creating contrastive pairs based on their chemical similarity. Despite working with relatively small domain-specific datasets (1.5 million compounds), this approach produced models with performance comparable to those trained on billions of general chemical structures.</p>

<p>The key insight is that contrastive learning allows you to leverage the structure inherent in your data - the relationships between samples - rather than requiring expensive manual annotations for every data point.</p>

<h3 id="beyond-one-dimensional-learning-langpert-and-hybrid-llm-approaches">Beyond One-Dimensional Learning: LangPert and Hybrid LLM Approaches</h3>

<p>Several talks pushed contrastive learning into more sophisticated territory, combining it with Large Language Models to predict unseen perturbations. One particularly innovative approach was <strong>LangPert</strong>, presented by researchers from Novo Nordisk, which demonstrates a clever way to leverage LLMs’ biological knowledge without falling victim to their numerical limitations.</p>

<p>The core challenge is predicting cellular responses to genetic perturbations you’ve never experimentally tested. Traditional foundation models like scGPT and graph neural networks like GEARS have tackled this, but as we have seen repeatedly, even sophisticated deep learning methods often struggle to beat simple baselines like predicting mean expression.</p>

<p>LangPert’s insight is elegantly simple: <strong>let the LLM do biological reasoning, and let traditional methods handle the numbers</strong>. LLMs have absorbed vast scientific literature and “know” about gene functions, pathways, and interactions. But they’re terrible at handling high-dimensional gene expression data - thousands of numerical values that would choke them at the tokenization stage alone.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/ICLR-2025/langpert_framework.png" alt="LangPert framework" /></p>

<p><em>The LangPert framework architecture: Instead of asking LLMs to directly predict high-dimensional gene expression vectors, the system leverages LLMs to identify biologically relevant training examples. For an unseen perturbation (x*), the LLM examines all available training perturbations and selects a small subset of functionally related genes. These LLM-selected examples then guide a k-nearest neighbors aggregator that performs the actual numerical prediction in the high-dimensional expression space. This hybrid approach combines the biological reasoning capabilities of LLMs with efficient numerical computation. Source: Märtens et al., 2025</em></p>

<p>The framework works in two steps. First, when predicting the effects of an unseen gene knockout, LangPert asks the LLM: “Which genes from my training set are most functionally similar to this target gene?” For instance, if predicting SMG5 (involved in mRNA decay), the LLM might select UPF1, UPF2, and RBM8A - all core components of the same pathway. The LLM provides biological reasoning: these genes participate in similar cellular processes and likely produce similar knockout effects.</p>

<p>Second, the system simply averages the actual experimental expression profiles of these LLM-selected genes. This is k-nearest neighbors with a twist - the “neighbors” are chosen by biological reasoning rather than numerical distance in expression space.</p>
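<p>The two steps above can be sketched in a handful of lines. Purely illustrative: the expression profiles are made-up numbers, and <code>llm_select_neighbors</code> is a hard-coded hypothetical stand-in for the actual LLM call.</p>

```python
# Sketch of LangPert's two-step recipe (my simplification of the paper's
# description). Step 1: an "LLM" picks functionally related training genes.
# Step 2: average their measured profiles - kNN with reasoned-out neighbors.

# Measured expression changes for trained perturbations
# (made-up 3-gene profiles, for illustration only).
training_profiles = {
    "UPF1":  [0.9, -0.4, 0.1],
    "UPF2":  [0.8, -0.5, 0.2],
    "RBM8A": [1.0, -0.3, 0.0],
    "GATA1": [-0.2, 1.1, 0.7],   # unrelated pathway, never selected
}

def llm_select_neighbors(target_gene):
    """Stand-in for the LLM step: in reality a frontier LLM would be
    prompted with the list of training genes and asked which are
    functionally closest to the target."""
    pathway_knowledge = {"SMG5": ["UPF1", "UPF2", "RBM8A"]}  # hypothetical
    return pathway_knowledge.get(target_gene, [])

def predict_unseen(target_gene):
    neighbors = llm_select_neighbors(target_gene)      # step 1: reasoning
    profiles = [training_profiles[g] for g in neighbors]
    n = len(profiles)
    return [sum(col) / n for col in zip(*profiles)]    # step 2: averaging

prediction = predict_unseen("SMG5")  # mean of the three NMD-pathway profiles
```

<p>The numerical part is deliberately trivial - the value added by the LLM is entirely in <em>which</em> neighbors get averaged.</p>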

<p>On the K562 benchmark, LangPert achieved substantially better performance than previous methods, and this advantage held across different data regimes. What makes it work? Unlike static embeddings, LangPert dynamically reasons about relevance for each prediction. Different frontier LLMs (Claude, OpenAI o1, o3-mini) select somewhat different gene sets yet achieve similar performance, suggesting multiple valid biological paths to good predictions. The system can even incorporate self-critique, asking the LLM to refine its initial selections.</p>

<p>This “LLM-informed contextual synthesis” represents a template for integrating LLMs into scientific workflows more broadly. Rather than forcing LLMs to handle everything, we architect systems where LLMs do conceptual reasoning while traditional methods handle precise numerical operations. The approach also maintains interpretability - you can examine which genes were selected and read the LLM’s biological rationale, crucial for scientific applications where understanding <em>why</em> matters as much as the predictions themselves.</p>

<h2 id="training-strategies-and-data-quality">Training Strategies and Data Quality</h2>

<p>Amidst all the excitement about new architectures and approaches, several sobering talks reminded attendees about fundamental challenges in model training and evaluation.</p>

<h3 id="the-train-test-split-problem">The Train-Test Split Problem</h3>

<p>One eye-opening presentation highlighted a subtle but critical issue in molecular machine learning: how we split our data for training and testing. The standard approach of random splitting can lead to severely imbalanced distributions of molecular similarity between training and test sets.</p>

<p>Here’s why this matters: if your test set happens to contain many molecules very similar to your training set, your model will appear to perform much better than it actually does. The good performance on easy (similar) test examples masks poor performance on truly novel molecules.</p>

<p>The proposed solution - Similarity-Aware Evaluation (SAE) - explicitly controls the distribution of similarities in test sets to ensure balanced evaluation. This gives a more honest assessment of how well models generalize to genuinely new chemical space.</p>
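<p>The idea can be illustrated with toy fingerprints. This is my own minimal sketch of similarity-controlled evaluation, not the exact SAE algorithm: molecules are bit sets, and each candidate test molecule is binned by its maximum Tanimoto similarity to the training set.</p>

```python
# Minimal sketch of similarity-aware evaluation (illustrative, not the
# actual SAE method). Binning test molecules by their max similarity to
# the training set lets you report performance per similarity bin instead
# of one flattering average.

def tanimoto(fp_a, fp_b):
    inter = len(fp_a & fp_b)
    union = len(fp_a | fp_b)
    return inter / union if union else 0.0

# Toy bit-fingerprints for the training set and two test candidates.
train = [{1, 2, 3, 4}, {2, 3, 5}]
candidates = {
    "near_duplicate": {1, 2, 3, 4, 5},   # very similar to the training set
    "novel_scaffold": {7, 8, 9},         # shares no bits with training
}

def max_train_similarity(fp):
    return max(tanimoto(fp, t) for t in train)

bins = {}
for name, fp in candidates.items():
    s = max_train_similarity(fp)
    label = "easy (sim >= 0.5)" if s >= 0.5 else "hard (sim < 0.5)"
    bins.setdefault(label, []).append(name)
# reporting metrics per bin exposes models that only shine on "easy" cases
```

<p>A random split would happily mix both candidates into one test set and report a single number; the binned view makes the easy/hard distinction explicit.</p>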

<p>This serves as a reminder that methodological rigor in evaluation is just as important as sophisticated model architectures. Without proper evaluation strategies, we risk fooling ourselves about our models’ capabilities.</p>

<h3 id="the-data-generation-bottleneck">The Data Generation Bottleneck</h3>

<p>Perhaps my most memorable quote from the conference came from Michael Bronstein during a panel discussion: “Everybody wants to develop the next AlphaFold, nobody wants to develop the next Protein Data Bank.”</p>

<p>This pithy observation captures a fundamental tension in computational biology and drug discovery. We’re incredibly good at building sophisticated AI models, but generating the high-quality, large-scale datasets these models need remains a major bottleneck.</p>

<p>Several speakers from both academia and industry emphasized this point. When I chatted with Google DeepMind, they also mentioned struggling with throughput in their lab automation efforts - testing only 20-30 proteins every two weeks. Meanwhile, unlearning and safety talks highlighted that even with massive datasets, ensuring data quality and removing harmful content remains challenging.</p>

<p>The message to me was clear and also somewhat encouraging when facing behemoths like OpenAI: the next major advances in AI for science won’t come from slightly better architectures, but from systematic approaches to generating better data at scale.</p>

<h2 id="industry-perspectives">Industry Perspectives</h2>

<p>One of the highlights of attending ICLR was the opportunity to mingle with researchers from major industry players. Networking events at Marina Bay Sands brought together people from Google DeepMind, Meta, Isomorphic Labs, and various biotech companies.</p>

<h3 id="isomorphic-labs">Isomorphic Labs</h3>

<p>Isomorphic Labs, built around the core IP of AlphaFold3, has made impressive progress in translating academic breakthroughs into practical drug discovery. Their partnerships with Novartis and Eli Lilly (totaling over $90M in upfront payments, with a $700M investment round) signal serious industry confidence.</p>

<p>What’s particularly interesting is their ambition to apply AI across the entire drug development process, not just structure prediction. This includes ADME (absorption, distribution, metabolism, and excretion) prediction - traditionally challenging areas that have resisted computational approaches. I have huge reservations that they can win on this turf, though, as big pharma players already have massive proprietary datasets in their hands.</p>

<p>According to conversations at the conference, access to internal pharmaceutical ADME datasets from their partners could be a game-changer, potentially allowing them to train models on data that has never been publicly available.</p>

<h3 id="google-deepmind-building-the-lab-in-the-loop">Google DeepMind: Building the Lab-in-the-Loop</h3>

<p>DeepMind’s research arm is taking a complementary approach, focusing on protein design and active learning with automated laboratories. Their vision of a fully automated “lab-in-the-loop” system - where AI designs experiments, robots execute them, and the results feed back to improve the AI - remains aspirational but compelling.</p>

<p>Interestingly, they maintain relatively small labs with simple readouts (protein binding, basic toxicity assays), suggesting that even with Google’s resources, scaling experimental throughput remains challenging. This again reinforces the data generation bottleneck theme.</p>

<h2 id="relevant-data-resources">Relevant Data Resources</h2>

<p>Throughout the conference, several valuable data resources were repeatedly mentioned. These public datasets are enabling the current wave of AI applications in biology and chemistry:</p>

<p><strong>Therapeutics Data Commons (TDC)</strong>: A comprehensive collection of datasets for therapeutic applications, spanning from molecular properties to clinical outcomes.</p>

<p><strong>PrimeKG</strong>: A holistic knowledge graph integrating 20 high-quality biomedical resources, describing over 17,000 diseases with more than 4 million relationships across biological scales.</p>

<p><strong>BioSNAP</strong>: Diverse biomedical networks including protein-protein interactions, single-cell similarity networks, and drug-drug interactions.</p>

<p><strong>Genome-wide Perturb-seq</strong>: Large-scale perturbation screens (K562 and RPE1 cell lines) enabling systematic study of gene function.</p>

<p><strong>Sciplex3</strong>: Single-cell RNA-seq data for over 100 perturbations across multiple cell lines, with dose and time resolution - particularly valuable for training models on chemical perturbations.</p>

<p><strong>Tahoe-100M</strong>: A massive dataset of 105 million single cells across 60,000 conditions in cancer cell lines.</p>

<h2 id="philosophical-reflections">Philosophical Reflections</h2>

<p>Beyond the technical talks, ICLR featured several thought-provoking discussions on AI, human intelligence, and psychology. These sessions grappled with fundamental questions about what we’re building and where it’s headed.</p>

<p>One recurring theme was the relationship between artificial and biological intelligence. As our AI systems become more capable, are they converging on similar computational strategies to human cognition, or discovering fundamentally alien approaches to problem-solving? The jury is still out, but the question itself reflects how far the field has come.</p>

<p>Another set of talks focused on AI safety and alignment, particularly “unlearning” techniques for removing harmful capabilities from trained models. As models become more powerful, ensuring they can’t be easily jailbroken to produce harmful outputs becomes increasingly critical. The technical approaches discussed ranged from improved training set filtering to post-hoc unlearning procedures, though no silver bullet has emerged.</p>

<h2 id="closing-thoughts">Closing Thoughts</h2>

<p>ICLR 2025 showcased a field in transition. The move from passive models to active agents, the increasing sophistication of multimodal integration, and the maturation of contrastive learning approaches all point toward AI systems that are more flexible, more powerful, and more practically useful than ever before.</p>

<p>Yet the conference also highlighted persistent challenges: the data quality bottleneck, the difficulty of proper evaluation, and the gap between impressive demos and production systems. The most successful applications will likely come from groups that can address both the algorithmic and the data generation sides of the equation.</p>

<p>For computational biologists and drug discovery researchers, the message is clear: sophisticated AI tools are becoming increasingly accessible (you can run TxGemma on a single GPU!), but generating the right data to train and validate these tools remains the critical challenge. The next AlphaFold won’t come from a better architecture alone - it will require the next PDB.</p>

<p>Singapore provided an awesome setting for the conference: The city’s blend of natural beauty, cutting-edge architecture, and vibrant culture somehow felt fitting for a conference looking toward the future of technology. Between sessions, I could explore hawker centers with incredible Asian cuisine, walk through tropical gardens, or simply marvel at the engineering achievement that is Marina Bay Sands.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/ICLR-2025/singapore_food.jpg" alt="Singapore food" /></p>

<p>The field of deep learning continues to move at a breathtaking pace. If ICLR 2025 is any indication, the next few years should bring AI agents that can meaningfully accelerate scientific discovery - but only if we can match our computational architectures with sophisticated new experimental approaches to generating high-quality data at scale.</p>

<p>Looking forward to my next conference already - maybe ICLR 2026!</p>]]></content><author><name>Tobias Neumann</name></author><category term="Conferences" /><category term="Machine Learning" /><category term="AI" /><category term="Deep Learning" /><category term="Agentic AI" /><category term="Multimodal Learning" /><category term="Conference" /><summary type="html"><![CDATA[Highlights from the International Conference on Learning Representations 2025 in Singapore]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://t-neumann.github.io/assets/images/categories/iclr.png" /><media:content medium="image" url="https://t-neumann.github.io/assets/images/categories/iclr.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Splice_sim - Benchmarking RNA-seq mapping in the age of nucleotide conversions</title><link href="https://t-neumann.github.io/bioinformatics/pipelines/splice-sim/" rel="alternate" type="text/html" title="Splice_sim - Benchmarking RNA-seq mapping in the age of nucleotide conversions" /><published>2024-06-27T22:47:00+02:00</published><updated>2024-06-27T22:47:00+02:00</updated><id>https://t-neumann.github.io/bioinformatics/pipelines/splice-sim</id><content type="html" xml:base="https://t-neumann.github.io/bioinformatics/pipelines/splice-sim/"><![CDATA[<p>Nucleotide conversion RNA sequencing techniques have revolutionized how we study RNA modifications, stability, and dynamics. From metabolic labeling experiments that track RNA synthesis and decay, to bisulfite sequencing that maps methylation sites - these approaches provide unprecedented insights into post-transcriptional regulation. However, they come with a substantial challenge: the very nucleotide conversions that make these experiments powerful can introduce biases in how reads map to reference genomes.</p>

<p>Our recent paper published in <a href="https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03313-8">Genome Biology</a> introduces <strong>splice_sim</strong>, a comprehensive simulation and evaluation framework designed to systematically measure and address these mapping biases. This post will walk you through what the tool does, why it matters, and how you can use it for your own projects.</p>

<h2 id="the-problem-when-conversions-confuse-mappers">The Problem: When conversions confuse mappers</h2>

<p>Nucleotide conversion (NC) RNA-seq encompasses a broad range of techniques, each introducing specific types of base changes, for instance:</p>

<ul>
  <li><strong>Metabolic labeling</strong> (e.g., SLAM-seq with 4-thiouridine): Introduces T-to-C conversions at low rates (1-5%) to distinguish newly synthesized RNA from pre-existing transcripts</li>
  <li><strong>RNA bisulfite sequencing</strong>: Creates C-to-T conversions at very high rates (&gt;98%) to identify methylated cytosines that resist conversion</li>
</ul>

<p>These mismatches to the reference genome make read mapping more challenging. The question we wanted to address is: how much does this affect our downstream biological interpretations?</p>

<p><strong>Table 1: Overview of Nucleotide Conversion RNA-seq Techniques</strong></p>

<table>
  <thead>
    <tr>
      <th>Technique</th>
      <th>Conversion Type</th>
      <th>Conversion Rate</th>
      <th>Key Metric</th>
      <th>Biological Application</th>
      <th>Example Use Case</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Metabolic Labeling</strong> (SLAM-seq, TUC-seq)</td>
      <td>T → C</td>
      <td>Low (1-5%)</td>
      <td><strong>FCR</strong> (Fraction Converted Reads)</td>
      <td>RNA synthesis, processing, decay kinetics</td>
      <td>Measuring RNA half-lives in pulse-chase experiments</td>
    </tr>
    <tr>
      <td><strong>RNA Bisulfite Sequencing</strong></td>
      <td>C → T</td>
      <td>Very High (&gt;98%)</td>
      <td><strong>metR</strong> (Methylation Rate)</td>
      <td>Post-transcriptional cytosine methylation (m5C)</td>
      <td>Identifying methylated cytosines in cellular transcripts</td>
    </tr>
    <tr>
      <td><strong>Isoform Analysis</strong> (any NC technique)</td>
      <td>Variable</td>
      <td>Variable</td>
      <td><strong>FMAT</strong> (Fraction Mature)</td>
      <td>Alternative splicing, intron retention</td>
      <td>Comparing spliced vs unspliced isoform abundances</td>
    </tr>
  </tbody>
</table>

<p><em>Table 1: Different NC RNA-seq approaches introduce specific types and rates of nucleotide conversions, each with distinct metrics and biological applications. The challenge: all of them introduce mismatches that can bias read mapping.</em></p>

<p>Consider a metabolic labeling pulse-chase experiment where you’re measuring RNA half-lives. If converted reads (labeled RNA) map with lower accuracy than unconverted reads (unlabeled RNA), your fraction of converted reads (FCR) will be biased, leading to incorrect half-life estimates. For genes involved in critical regulatory pathways, this isn’t just a technical nuisance - it leads to wrong biological conclusions.</p>
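<p>To make this concrete, here is a minimal sketch (hypothetical numbers, not from the paper) of how a two-time-point chase estimate of half-life reacts to losses of converted reads:</p>

```python
import math

def half_life_from_fcr(fcr_t0, fcr_t1, dt):
    """Half-life from exponential decay of the fraction of converted reads (FCR)."""
    k = math.log(fcr_t0 / fcr_t1) / dt  # decay rate from two chase time points
    return math.log(2) / k

true_t12 = half_life_from_fcr(0.80, 0.40, dt=4.0)  # constructed so t1/2 = 4 h

# A uniform relative loss of converted reads scales both FCR values equally
# and cancels out in the ratio ...
biased_rel = half_life_from_fcr(0.80 * 0.8, 0.40 * 0.8, dt=4.0)

# ... but an absolute loss (e.g. converted reads dropped only in hard-to-map
# regions) distorts the ratio and hence the fitted half-life:
biased_abs = half_life_from_fcr(0.80 - 0.10, 0.40 - 0.10, dt=4.0)
```

<p>In this toy example the absolute loss shrinks the estimated half-life from 4 h to roughly 3.3 h - exactly the kind of distortion splice_sim quantifies per transcript against a known ground truth.</p>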

<h2 id="enter-splice_sim">Enter splice_sim</h2>

<p>Splice_sim is a Python-based simulation and evaluation pipeline that tackles this problem head-on. It’s built around three core capabilities:</p>

<div style="text-align: center;">
  <img src="https://t-neumann.github.io/assets/images/posts/splice_sim/splice_sim-graphical_abstract.png" alt="Splice_sim workflow" style="width: 80%; max-width: 100%;" />
</div>
<p><em>Figure 1: The splice_sim analysis workflow. The pipeline simulates reads with realistic sequencing errors for premature and mature isoforms, injects nucleotide conversions at configured rates, maps reads with evaluated mappers, and compares results to ground truth to calculate TP/FP/FN counts per genomic annotation. (Figure 1A from the paper)</em></p>

<h4 id="realistic-rna-seq-simulation">Realistic RNA-seq simulation</h4>

<p>The framework simulates RNA-seq reads that mirror real experimental conditions:</p>

<ul>
  <li>Uses <a href="https://doi.org/10.1093/bioinformatics/btr708">ART</a> for realistic Illumina sequencing errors</li>
  <li>Generates reads from configurable isoform mixtures (premature unspliced and mature spliced transcripts)</li>
  <li>Introduces nucleotide conversions at user-defined rates</li>
  <li>Supports arbitrary single nucleotide variations (SNVs)</li>
</ul>

<h4 id="comprehensive-evaluation">Comprehensive evaluation</h4>

<p>Unlike generic mappability scores that operate at the genome-wide level, splice_sim evaluates mapping accuracy for biologically meaningful units:</p>

<ul>
  <li><strong>Whole transcripts</strong> - overall gene-level performance</li>
  <li><strong>Exons and introns</strong> - feature-specific accuracy</li>
  <li><strong>Splice junctions</strong> - the most challenging mapping scenario</li>
</ul>

<p>For each annotation, it calculates true positives (TP), false positives (FP), and false negatives (FN), deriving precision, recall, and F1 scores.</p>
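<p>The derived metrics follow the standard definitions; a minimal helper (not part of splice_sim itself) could look like:</p>

```python
def mapping_metrics(tp, fp, fn):
    """Precision, recall and F1 from per-annotation TP/FP/FN read counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# e.g. an exon with 980 correctly placed reads, 20 misplaced onto it, 20 lost
p, r, f1 = mapping_metrics(tp=980, fp=20, fn=20)
```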

<h4 id="nextflow-workflow">Nextflow workflow</h4>

<p>The entire pipeline is wrapped in <a href="https://www.nextflow.io/">Nextflow</a> workflows with Docker containerization, making it reproducible and easy to deploy across different computing environments - from your laptop to HPC clusters to the cloud.</p>

<h2 id="key-contributions-of-our-study">Key contributions of our study</h2>

<p>We used splice_sim to generate deep simulated datasets for mouse and human transcriptomes, evaluating popular splice-aware mappers (STAR and HISAT-3N) under various conversion rates. Here are the headline findings:</p>

<blockquote>
  <p><strong>Notable metrics</strong>
📊 <strong>High mappability regions</strong>: F1 &gt; 0.98 for both mappers
⚠️ <strong>Low mappability regions</strong>: F1 &lt; 0.55 - substantial accuracy drop
🧬 <strong>Impact on biology</strong>: &gt;120 protein-coding genes showed &gt;10% error in half-life estimates
🎯 <strong>Mosaic improvement</strong>: Reduced outliers from 239 to 95 in half-life analysis
🔬 <strong>Methylation sites</strong>: 99.2% recall but considerable FP calls in low mappability regions</p>
</blockquote>

<h3 id="mappability-dominates-but-conversions-matter">Mappability dominates, but conversions matter</h3>

<p>Mapping accuracies with and without nucleotide conversions were high (F1 &gt; 0.98) for annotations with high or medium genome mappability, but substantially lower (F1 &lt; 0.55) for low mappability regions. This isn’t surprising - repetitive sequences are inherently difficult to map.</p>

<div style="text-align: center;">
  <img src="https://t-neumann.github.io/assets/images/posts/splice_sim/mappability_distribution.png" alt="Mappability distribution" style="width: 60%; max-width: 100%;" />
</div>
<p><em>Figure 2: Distribution of analyzed annotations across high (&gt;0.9), medium, and low (&lt;0.2) mean genome mappability categories. Low mappability regions, while smaller in number, are critical for understanding mapping biases. (Figure 1B from the paper)</em></p>

<p>Also unsurprisingly, nucleotide conversions and sequencing errors increased false discovery and false negative rates for both mappers, since read alignment becomes harder as the number of mismatches to the originating genomic sequence grows.</p>

<div style="text-align: center;">
  <img src="https://t-neumann.github.io/assets/images/posts/splice_sim/fdr_fnr_by_mismatches.png" alt="FDR and FNR by number of mismatches" style="width: 90%; max-width: 100%;" />
</div>
<p><em>Figure 3: Changes in false discovery (FDR) and false negative rates (FNR) by number of mismatches compared to reads without mismatches, stratified by mappability. Note how HISAT-3N (orange) is largely unaffected by T-to-C conversions due to its 3N mapping approach, while STAR (green) shows increasing error rates with more mismatches. (Figure 1C from the paper)</em></p>

<p>More notable, though still expected: STAR’s performance degraded with increasing conversion rates, whereas HISAT-3N (a 3-nucleotide mapper that treats T and C as equivalent) was largely unaffected by T-to-C conversions but showed quirks of its own, including ~2% of reads mapping to the wrong strand.</p>

<h2 id="the-mosaic-approach-leveraging-mapper-strengths">The Mosaic approach: Leveraging mapper strengths</h2>

<p>One of the most practical contributions of our study is the “mosaic” analysis strategy. Rather than declaring one mapper superior, it builds on the observation that different mappers excel in different genomic contexts.</p>

<p>The mosaic approach works like this:</p>

<ol>
  <li>Map your data with multiple mappers (e.g., STAR and HISAT-3N)</li>
  <li>For each genomic interval, select the result from the mapper with the best accuracy</li>
  <li>Optionally filter intervals where no mapper performs adequately</li>
</ol>
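<p>Step 2 can be sketched as follows (hypothetical per-interval F1 scores - splice_sim’s pre-computed tables provide the real numbers):</p>

```python
# Per-interval F1 scores for each mapper (hypothetical values)
f1_scores = {
    "intron_1": {"STAR": 0.99, "HISAT3N": 0.97},
    "intron_2": {"STAR": 0.40, "HISAT3N": 0.85},
    "intron_3": {"STAR": 0.30, "HISAT3N": 0.35},
}

MIN_F1 = 0.5  # optional filter: drop intervals no mapper handles well

mosaic = {}
for interval, scores in f1_scores.items():
    best_mapper = max(scores, key=scores.get)
    if scores[best_mapper] >= MIN_F1:
        mosaic[interval] = best_mapper  # take this mapper's result here

# mosaic -> {"intron_1": "STAR", "intron_2": "HISAT3N"}; intron_3 is filtered out
```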

<div style="text-align: center;">
  <img src="https://t-neumann.github.io/assets/images/posts/splice_sim/mosaic_diagram.png" alt="Mosaic flow diagram" style="width: 100%; max-width: 100%;" />
</div>

<p>When combining a mosaic approach with a filtering strategy that removed transcripts for which none of the mappers returned results close to the simulation, the overall mean FCR approached the simulated true value.</p>

<p>This strategy improved FCR reconstruction, reduced half-life outliers, and recovered more accurate FMAT values. It does require running multiple alignments, but the accuracy gains can be substantial for critical analyses.</p>

<div style="text-align: center;">
  <img src="https://t-neumann.github.io/assets/images/posts/splice_sim/fcr_mosaic.png" alt="FCR reconstruction with mosaic approach" style="width: 70%; max-width: 100%;" />
</div>
<p><em>Figure 4: Mean difference to simulated exonic FCR per mapper. The mosaic approach (selecting the best mapper per interval) reduces differences to simulated values, and when combined with filtering, reconstruction is nearly perfect. (Figure 1E from the paper)</em></p>

<h3 id="biological-consequences-half-life-estimation">Biological consequences: Half-life estimation</h3>

<p>We simulated realistic pulse-chase experiments to measure RNA decay rates. Although half-life estimation was robust for most transcripts, a considerable number of outliers showed more than 10% difference to simulated half-lives for both mappers in the medium and low mappability segments. These outliers affected over 120 protein-coding genes - genes that researchers might be studying for their biological relevance.</p>

<div style="text-align: center;">
  <img src="https://t-neumann.github.io/assets/images/posts/splice_sim/halflife_reconstruction.png" alt="Half-life reconstruction" style="width: 100%; max-width: 100%;" />
</div>
<p><em>Figure 5: Effect of NC on transcript half-life reconstruction. Top left: Normalized FCR per time point showing increasing noise with decreasing mappability. Top right: Reconstructed half-lives per decay rate.
The box plots show a considerable number of outliers for both mappers; numbers of considered transcripts
are plotted below the boxes. Bottom: Reconstructed half-lives showing outliers (red triangles) that deviate &gt;10% from simulated values. The mosaic approach (bottom) combining both mappers reduces outliers considerably. (Figure 2 from the paper)</em></p>

<h3 id="isoform-quantification-challenges">Isoform quantification challenges</h3>

<p>Alignment of spliced reads is particularly difficult: it must account for potentially large gaps caused by spliced-out introns, and it requires accurate placement of the short read sub-sequences (anchors) that span these gaps.</p>

<p>We found that transcripts often contain a mosaic of high, medium, and low mappability introns. By implementing an intron filtering strategy that removes problematic introns, we improved the accuracy of mature isoform fraction (FMAT) estimates considerably.</p>
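<p>A toy example of the filtering effect (hypothetical counts; the real pipeline selects introns via its per-feature accuracy tables):</p>

```python
# Per-intron read counts supporting the mature (spliced) vs premature
# (intron-retaining) isoform, plus mappability class (hypothetical values)
introns = [
    {"mature": 90, "premature": 45, "mappability": "high"},
    {"mature": 80, "premature": 40, "mappability": "high"},
    {"mature": 10, "premature": 90, "mappability": "low"},  # mapping artifact
]

def fmat(records):
    """Fraction of the mature isoform from pooled per-intron counts."""
    mature = sum(r["mature"] for r in records)
    premature = sum(r["premature"] for r in records)
    return mature / (mature + premature)

fmat_all = fmat(introns)
fmat_filtered = fmat([r for r in introns if r["mappability"] != "low"])
```

<p>Here the single low-mappability intron drags the pooled FMAT down to ~0.51, while excluding it recovers the 2/3 supported by the trustworthy introns.</p>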

<div style="text-align: center;">
  <img src="https://t-neumann.github.io/assets/images/posts/splice_sim/fmat_reconstruction.png" alt="FMAT reconstruction" style="width: 100%; max-width: 100%;" />
</div>
<p><em>Figure 6: FMAT (fraction of mature isoform) reconstruction improves with intron filtering. Left: Median difference to simulated FMAT showing improvement with filtering (solid vs dashed lines). Right: Distribution of FMAT values for low mappability transcripts shows that both filtering and the mosaic approach recover more accurate estimates closer to the theoretical value of 1/3. (Figure 3A and 3D from the paper)</em></p>

<h3 id="methylation-site-calling-hotspots">Methylation site calling hotspots</h3>

<p>For RNA bisulfite sequencing analysis, low mappability regions are hotspots of false cytosine methylation calls. All mappers produced considerable amounts of false positive and a few false negative m5C calls, mainly in low mappability regions of protein coding genes.</p>

<div style="text-align: center;">
  <img src="https://t-neumann.github.io/assets/images/posts/splice_sim/m5c_calling.png" alt="Methylation site calling accuracy" style="width: 100%; max-width: 100%;" />
</div>
<p><em>Figure 7: Effect of mappability on methylation site reconstruction. Most m5C sites were recovered correctly, but false positives (FP) and false negatives (FN) were predominantly located in low mappability regions. The correlation plots show methylation rates for HISAT-3N, meRanGs, and Segemehl with TP calls in green and false calls in red. (Figure 4 from the paper)</em></p>

<p>This has important implications for establishing accurate m5C site catalogs, which have varied wildly in the literature (from &lt;100 to &gt;10,000 sites reported).</p>

<h3 id="pre-computed-mapping-accuracy-tables-ready-to-use-resources">Pre-computed mapping accuracy tables: Ready-to-use resources</h3>

<p>One of the most immediately useful outputs from our study is a comprehensive set of pre-computed mapping accuracy tables for mouse (mm10) and human (GRCh38) transcriptomes, covering over 50,000 transcripts each. These tables are freely available on <a href="https://doi.org/10.5281/zenodo.11196570">Zenodo</a> and provide detailed performance metrics for every annotated transcript, exon, intron, and splice junction.</p>

<p><strong>What’s included in these tables:</strong></p>

<ul>
  <li><strong>Mapping accuracy scores</strong> (F1, precision, recall) for STAR and HISAT-3N across different conversion rates (0%, 1%, 3%, 5%, 10%)</li>
  <li><strong>Mappability classifications</strong> (high, medium, low) for each genomic feature</li>
  <li><strong>FCR and FMAT reconstruction accuracy</strong> per feature</li>
  <li><strong>Recommendations</strong> for which mapper performs best for each transcript</li>
</ul>

<p><strong>How to use them:</strong></p>

<ol>
  <li><strong>Before starting an experiment</strong>: Look up your genes of interest to assess whether they have sufficient mappability for reliable NC RNA-seq analysis</li>
  <li><strong>During data analysis</strong>: Filter out problematic transcripts/introns with known mapping issues</li>
  <li><strong>Choosing a mapper</strong>: Select the optimal mapper (or apply the mosaic strategy) based on pre-computed performance for your specific genes</li>
  <li><strong>Quality control</strong>: Flag results from low-accuracy regions for additional validation</li>
</ol>
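<p>Concretely, such a lookup could look like this in pandas (the column names and values below are purely illustrative - check the table headers on Zenodo for the actual schema):</p>

```python
import io
import pandas as pd

# In practice: acc = pd.read_csv("tx.accuracy.tsv.gz", sep="\t")
# Tiny inline stand-in with illustrative column names and values:
tsv = """gene_name\tmappability\tF1_STAR_5pct\tF1_HISAT3N_5pct
Actb\thigh\t0.99\t0.98
Gapdh\tmedium\t0.95\t0.96
GeneX\tlow\t0.41\t0.52
"""
acc = pd.read_csv(io.StringIO(tsv), sep="\t")

# Keep only transcripts that at least one mapper handles reliably at 5% conversion
f1_cols = ["F1_STAR_5pct", "F1_HISAT3N_5pct"]
reliable = acc[acc[f1_cols].max(axis=1) >= 0.9]

# Per gene, which mapper performs best (the mosaic choice)
best_mapper = reliable.set_index("gene_name")[f1_cols].idxmax(axis=1)
```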

<blockquote>
  <p><strong>📊 Ready-to-Use Data</strong></p>

  <ul>
    <li><strong>Mouse (mm10)</strong>: &gt;50,000 GENCODE transcripts evaluated</li>
    <li><strong>Human (GRCh38)</strong>: &gt;50,000 Ensembl canonical transcripts evaluated</li>
    <li><strong>Conditions tested</strong>: 0%, 1%, 3%, 5%, 10% T-to-C conversion rates</li>
    <li><strong>Mappers evaluated</strong>: STAR, HISAT-3N (metabolic labeling) + meRanGs, Segemehl (BS-seq)</li>
    <li><strong>Format</strong>: TSV.gz tables ready to import into R or Python</li>
  </ul>

  <p><strong>Available at</strong>: <a href="https://doi.org/10.5281/zenodo.11196570">https://doi.org/10.5281/zenodo.11196570</a></p>
</blockquote>

<p><strong>Beyond mouse and human:</strong></p>

<p>The beauty of splice_sim is that it’s not limited to these two species. The framework is designed to work with any genome and annotation set. Want to evaluate NC mapping accuracy for:</p>

<ul>
  <li><strong>Zebrafish</strong> developmental studies?</li>
  <li><strong>Drosophila</strong> genetics experiments?</li>
  <li><strong>Arabidopsis</strong> plant biology?</li>
  <li><strong>Yeast</strong> time-course experiments?</li>
  <li><strong>Non-model organisms</strong> with custom annotations?</li>
</ul>

<p>Simply provide your reference genome FASTA, gene annotation GFF3, and run splice_sim with your experimental parameters. The pipeline will generate the same comprehensive mapping accuracy tables tailored to your organism and genes of interest.</p>

<p>This is particularly valuable for researchers working on:</p>
<ul>
  <li>Organisms with less well-characterized repetitive elements</li>
  <li>Custom or alternative gene annotations</li>
  <li>Novel transcripts or isoforms not in standard databases</li>
  <li>Organisms with different genome complexity and mappability landscapes</li>
</ul>

<p>The paper provides detailed protocols and configuration examples for running splice_sim on arbitrary species, making it a truly universal tool for the NC RNA-seq community.</p>

<h2 id="practical-applications-what-can-you-do-with-splice_sim">Practical applications: What can you do with splice_sim?</h2>

<h3 id="1-pre-experiment-planning">1. Pre-experiment planning</h3>

<p>Before committing to expensive sequencing, simulate your experimental design:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Configure your experiment in JSON
cat &gt; my_config.json &lt;&lt;'EOF'
{
  "condition": {
    "ref": "T",
    "alt": "C",
    "conversion_rates": [0.01, 0.03, 0.05],
    "base_coverage": 50
  },
  "transcript_ids": "genes_of_interest.tsv"
}
EOF

# Run the simulation
nextflow run splice_sim.nf -profile docker -c my_config.json
</code></pre></div></div>

<p>This tells you whether your genes of interest have sufficient mappability for reliable analysis, or if you need to adjust your approach (longer reads, different protocol, etc.).</p>

<h3 id="2-benchmark-your-pipeline">2. Benchmark your pipeline</h3>

<p>Evaluating a new mapper or tweaking parameters? Run it against splice_sim-generated ground truth data by adding your custom mapper to the configuration:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"mappers": {
  "MY_MAPPER": {
    "cmd": "my_mapper",
    "index": "/path/to/index",
    "options": "--custom-parameters"
  }
}
</code></pre></div></div>

<p>The evaluation module will give you detailed per-feature accuracy metrics.</p>

<h3 id="3-data-quality-control">3. Data quality control</h3>

<p>Use our pre-computed mapping accuracy tables for mouse and human transcriptomes (available on <a href="https://doi.org/10.5281/zenodo.11196570">Zenodo</a>) to:</p>

<ul>
  <li>Filter transcripts/exons/introns with poor mapping accuracy</li>
  <li>Identify which mapper works best for your genes of interest</li>
  <li>Apply targeted filtering strategies to improve downstream analyses</li>
</ul>

<h3 id="4-comparing-sequencing-strategies">4. Comparing sequencing strategies</h3>

<p>We used splice_sim to compare full-transcript sequencing with 3’ end sequencing. Simulated 3’ end sequencing data showed higher mapping accuracies across all conversion rates and mappability classes, owing to the higher mappability of 3’ ends; yet full-length sequencing showed the smallest deviation from the simulated FCR, implying that the larger mapping space of the full transcript allows for more robust FCR estimates.</p>

<p>You can set up similar comparisons for your specific use case.</p>

<h2 id="getting-started-with-splice_sim">Getting started with splice_sim</h2>

<p>The tool is available on <a href="https://github.com/popitsch/splice_sim">GitHub</a> with comprehensive documentation. Here’s a quick-start guide:</p>

<h3 id="installation">Installation</h3>

<p>The easiest way is via Docker - all dependencies are pre-packaged:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Pull the Docker image</span>
docker pull tobneu/splice_sim:release

<span class="c"># Or for HPC environments with Singularity</span>
singularity pull docker://tobneu/splice_sim:release
</code></pre></div></div>

<p>For local development, create a conda environment:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>conda <span class="nb">env </span>create <span class="nt">-f</span> environment.yml
conda activate splice_sim
</code></pre></div></div>

<h3 id="running-a-simple-simulation">Running a simple simulation</h3>

<p>The framework uses a central JSON configuration file to define all parameters. Here’s a minimal example for a small test:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"dataset_name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"my_test_run"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"splice_sim_cmd"</span><span class="p">:</span><span class="w"> </span><span class="s2">"python /path/to/splice_sim/main.py"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"gene_gff"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/references/gencode.vM21.gff3.gz"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"genome_fa"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/references/mm10.fa"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"genome_chromosome_sizes"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/references/mm10.chrom.sizes"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"transcript_ids"</span><span class="p">:</span><span class="w"> </span><span class="s2">"test_genes.tsv"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"isoform_mode"</span><span class="p">:</span><span class="w"> </span><span class="s2">"1:1"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"condition"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"ref"</span><span class="p">:</span><span class="w"> </span><span class="s2">"T"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"alt"</span><span class="p">:</span><span class="w"> </span><span class="s2">"C"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"conversion_rates"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="mf">0.02</span><span class="p">,</span><span class="w"> </span><span class="mf">0.05</span><span class="p">],</span><span class="w">
    </span><span class="nl">"base_coverage"</span><span class="p">:</span><span class="w"> </span><span class="mi">50</span><span class="w">
  </span><span class="p">},</span><span class="w">
  </span><span class="nl">"mappers"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"STAR"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"star_cmd"</span><span class="p">:</span><span class="w"> </span><span class="s2">"STAR"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"star_genome_idx"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/indices/star_index"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"star_splice_gtf"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/references/genes.gtf"</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">},</span><span class="w">
  </span><span class="nl">"readlen"</span><span class="p">:</span><span class="w"> </span><span class="mi">100</span><span class="p">,</span><span class="w">
  </span><span class="nl">"random_seed"</span><span class="p">:</span><span class="w"> </span><span class="mi">42</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>Run the complete workflow:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Simulation workflow</span>
nextflow run splice_sim.nf <span class="nt">-c</span> my_config.json <span class="nt">-profile</span> docker

<span class="c"># Evaluation workflow  </span>
nextflow run splice_sim_eva.nf <span class="nt">-c</span> my_config.json <span class="nt">-profile</span> docker
</code></pre></div></div>

<h3 id="understanding-the-output">Understanding the output</h3>

<p>Splice_sim generates several types of output:</p>

<p><strong>Count tables</strong> (<code class="language-plaintext highlighter-rouge">count/*.counts.tsv.gz</code>): Raw TP/FP/FN counts per mapper, conversion rate, and genomic feature</p>

<p><strong>Metadata tables</strong> (<code class="language-plaintext highlighter-rouge">meta/*.metadata.tsv.gz</code>): Feature characteristics including mappability, GC content, convertibility</p>

<p><strong>Performance metrics</strong>: Precision, recall, and F1 scores calculated from count tables</p>

<p><strong>Visual tracks</strong>: Optional TDF files for viewing in IGV, highlighting misaligned reads</p>
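<p>A typical first analysis step is to join the count tables with the metadata tables and aggregate, for example (illustrative column names and values - see the shipped tables for the actual headers):</p>

```python
import pandas as pd

# Illustrative stand-ins for count/*.counts.tsv.gz and meta/*.metadata.tsv.gz
counts = pd.DataFrame({
    "tid": ["tx1", "tx2", "tx3"],
    "TP": [980, 300, 950],
    "FP": [20, 150, 30],
    "FN": [20, 250, 50],
})
meta = pd.DataFrame({
    "tid": ["tx1", "tx2", "tx3"],
    "mappability": ["high", "low", "high"],
})

df = counts.merge(meta, on="tid")
df["F1"] = 2 * df["TP"] / (2 * df["TP"] + df["FP"] + df["FN"])
f1_by_mapp = df.groupby("mappability")["F1"].mean()
```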

<h4 id="output-directory-structure">Output directory structure</h4>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>results/
├── simulation/
│   ├── model/
│   │   ├── transcript_model.pkl          # Transcript isoform definitions
│   │   ├── sequences.fa                  # Simulated transcript sequences
│   │   └── model_config.json             # Model parameters
│   │
│   ├── reads/
│   │   ├── condition_0pct/               # Unconverted reads
│   │   │   ├── simulated_reads.fq.gz
│   │   │   └── truth_alignments.bam
│   │   ├── condition_2pct/               # 2% conversion rate
│   │   │   ├── simulated_reads.fq.gz
│   │   │   └── truth_alignments.bam
│   │   └── condition_5pct/               # 5% conversion rate
│   │       ├── simulated_reads.fq.gz
│   │       └── truth_alignments.bam
│   │
│   └── alignments/
│       ├── STAR/
│       │   ├── condition_0pct.bam
│       │   ├── condition_2pct.bam
│       │   └── condition_5pct.bam
│       └── HISAT3N/
│           ├── condition_0pct.bam
│           ├── condition_2pct.bam
│           └── condition_5pct.bam
│
├── evaluation/
│   ├── count/
│   │   ├── STAR_0pct_tx.counts.tsv.gz    # Transcript-level counts
│   │   ├── STAR_0pct_fx.counts.tsv.gz    # Exon/intron counts
│   │   ├── STAR_0pct_sj.counts.tsv.gz    # Splice junction counts
│   │   ├── HISAT3N_0pct_tx.counts.tsv.gz
│   │   └── ...
│   │
│   ├── meta/
│   │   ├── tx.metadata.tsv.gz            # Transcript metadata
│   │   ├── fx.metadata.tsv.gz            # Exon/intron metadata
│   │   └── sj.metadata.tsv.gz            # Splice junction metadata
│   │
│   ├── tracks/                            # Optional IGV visualization
│   │   ├── STAR_0pct_FP.tdf              # False positive tracks
│   │   ├── STAR_0pct_FN.tdf              # False negative tracks
│   │   └── ...
│   │
│   └── processed/
│       ├── combined_results.rds           # R data object
│       └── summary_stats.tsv              # Summary statistics
│
├── reports/
│   ├── execution_report.html              # Nextflow execution report
│   ├── timeline.html                      # Pipeline timeline
│   └── trace.txt                          # Resource usage
│
└── logs/
    ├── simulation.log
    ├── mapping.log
    └── evaluation.log
</code></pre></div></div>

<p><strong>Key files to focus on:</strong></p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">count/*.counts.tsv.gz</code> - TP/FP/FN counts for calculating accuracy metrics</li>
  <li><code class="language-plaintext highlighter-rouge">meta/*.metadata.tsv.gz</code> - Mappability, GC content, and other feature characteristics</li>
  <li><code class="language-plaintext highlighter-rouge">processed/combined_results.rds</code> - Pre-processed data ready for analysis in R</li>
  <li><code class="language-plaintext highlighter-rouge">tracks/*.tdf</code> - Visual tracks for exploring specific mapping errors in IGV</li>
</ul>
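<p>To turn those raw counts into the performance metrics mentioned above, precision, recall and F1 can be computed directly from the TP/FP/FN values. A minimal Python sketch - the toy numbers and the assumption of one TP/FP/FN triple per row are illustrative, not the exact splice_sim table schema:</p>

```python
# Sketch: derive precision/recall/F1 from TP/FP/FN counts as found in the
# count/*.counts.tsv.gz tables. Column layout is assumed for illustration.
def metrics(tp, fp, fn):
    """Precision, recall and F1 from raw counts (0.0 where undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy numbers standing in for one row of a counts table
p, r, f1 = metrics(950, 30, 50)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```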

<p>The evaluation results can be imported into R for detailed analysis using the provided preprocessing script:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Rscript splice_sim/src/main/R/splice_sim/preprocess_results.R <span class="se">\</span>
  my_config.json output_dir/
</code></pre></div></div>

<h2 id="advanced-use-cases">Advanced use cases</h2>

<h3 id="custom-nucleotide-conversion-models">Custom nucleotide conversion models</h3>

<p>While splice_sim uses Bernoulli processes for NC simulation (appropriate for BS-seq and SLAM-seq), you might need more sophisticated models. For example, A-to-I RNA editing by ADAR enzymes shows sequence-context dependencies.</p>

<p>The NC simulation is implemented as a Python method with access to:</p>

<ul>
  <li>Read sequence (without NC)</li>
  <li>Genomic coordinates and strand</li>
  <li>Configured reference/alternate bases</li>
  <li>Conversion rate</li>
  <li>List of convertible positions</li>
  <li>Any configured SNPs</li>
</ul>

<p>Simply modify the <code class="language-plaintext highlighter-rouge">splice_sim.simulator.modify_bases</code> method to implement your custom conversion logic.</p>
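<p>As an illustration, here is a sketch of what a sequence-context-dependent conversion function could look like. Note that this is <em>not</em> the actual <code class="language-plaintext highlighter-rouge">modify_bases</code> signature, and the context rule (boosting the rate after a 5’ T/U neighbor) is a deliberately simplified stand-in for real ADAR sequence preferences:</p>

```python
import random

# Hypothetical sketch of context-dependent A-to-I editing (read out as A-to-G).
# NOT the real splice_sim.simulator.modify_bases interface - only the kind of
# logic you could implement inside it.
def modify_bases_context(seq, conversion_rate, ref="A", alt="G", rng=None):
    """Convert ref->alt, boosting the rate when the 5' neighbor is T/U
    (a simplified stand-in for ADAR neighbor preferences)."""
    rng = rng or random.Random(42)  # fixed seed for reproducible simulation
    out = list(seq)
    for i, base in enumerate(seq):
        if base != ref:
            continue
        rate = conversion_rate
        if i > 0 and seq[i - 1] in "TU":  # boost for a 5' T/U neighbor
            rate = min(1.0, rate * 3)
        if rng.random() < rate:
            out[i] = alt
    return "".join(out)

print(modify_bases_context("TTAAGACATA", 0.2))
```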

<h3 id="paired-end-support">Paired-end support</h3>

<p>The current version focuses on single-end reads, which are cost-effective for many experiments. However, paired-end data offers advantages:</p>

<ul>
  <li>Improved mappability from both mates</li>
  <li>Error correction in overlapping regions</li>
</ul>

<p>Paired-end support is on the TODO list.</p>

<h3 id="integrating-with-aws-batch">Integrating with AWS batch</h3>

<p>Since splice_sim uses Nextflow, you can easily scale to cloud computing. Here’s a basic AWS Batch configuration:</p>

<div class="language-groovy highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">profiles</span> <span class="o">{</span>
    <span class="n">awsbatch</span> <span class="o">{</span>
        <span class="n">process</span><span class="o">.</span><span class="na">executor</span> <span class="o">=</span> <span class="s1">'awsbatch'</span>
        <span class="n">process</span><span class="o">.</span><span class="na">queue</span> <span class="o">=</span> <span class="s1">'my-batch-queue'</span>
        <span class="n">process</span><span class="o">.</span><span class="na">container</span> <span class="o">=</span> <span class="s1">'tobneu/splice_sim:release'</span>
        <span class="n">workDir</span> <span class="o">=</span> <span class="s1">'s3://my-bucket/work'</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p>This is particularly useful when simulating large datasets (full transcriptomes) or evaluating multiple mappers across many parameter combinations.</p>

<h2 id="recommended-best-practices">Recommended best practices</h2>

<p>Based on our findings, here are practical recommendations for NC RNA-seq analysis:</p>

<h3 id="for-metabolic-labeling-experiments">For metabolic labeling experiments</h3>

<ol>
  <li><strong>Always check mappability</strong>: Use our pre-computed tables or run splice_sim on your genes of interest</li>
  <li><strong>Consider a mosaic approach</strong>: If computationally feasible, map with both STAR and HISAT-3N and combine results</li>
  <li><strong>Filter carefully</strong>: Remove low-mappability transcripts or apply targeted filtering strategies</li>
  <li><strong>Validate biological findings</strong>: If a half-life estimate seems suspicious, check the mappability and conversion rate dependency</li>
</ol>

<h3 id="for-rna-bisulfite-sequencing">For RNA bisulfite sequencing</h3>

<ol>
  <li><strong>Use specialized mappers</strong>: HISAT-3N, meRanGs, or Segemehl perform better than standard mappers</li>
  <li><strong>Be extra cautious with low mappability regions</strong>: These are hotspots for false m5C calls</li>
  <li><strong>Cross-validate calls</strong>: Don’t rely solely on methylation rates for filtering - consider mappability and structural context</li>
  <li><strong>Expect protocol artifacts</strong>: Incomplete conversion and missing reads in low mappability regions are inherent to the protocol</li>
</ol>

<h3 id="for-isoform-analysis">For isoform analysis</h3>

<ol>
  <li><strong>Provide known splice sites</strong>: This dramatically improves spliced read mapping accuracy</li>
  <li><strong>Filter problematic introns</strong>: Use splice_sim’s approach to remove introns with poor FMAT reconstruction</li>
  <li><strong>Check for intron mappability mosaics</strong>: Transcripts often mix high and low mappability introns</li>
  <li><strong>Validate novel splice junctions carefully</strong>: False-positive SJ calls increase with conversion rates</li>
</ol>

<h2 id="limitations-and-future-directions">Limitations and future directions</h2>

<p>While splice_sim is powerful, it has some limitations worth noting:</p>

<p><strong>Simulation-based</strong>: Results depend on stochastic processes, though our replicate analysis showed high correlation</p>

<p><strong>Worst-case FP scenarios</strong>: Our main dataset simulates equal coverage for all transcripts, potentially inflating false positives. For your cell type, configure transcript abundances from real RNA-seq data.</p>

<p><strong>Single-end only</strong>: Currently limited to single-end reads, though paired-end support is planned</p>

<p><strong>Bernoulli NC model</strong>: May not capture sequence-context dependencies in some NC types</p>

<p>Despite these limitations, splice_sim provides an invaluable framework for understanding and mitigating NC mapping biases.</p>

<h2 id="conclusion">Conclusion</h2>

<p>Nucleotide conversion RNA-seq techniques provide powerful insights into RNA biology, but their mapping biases can lead to substantial errors in biological interpretation. Splice_sim offers both a diagnostic tool to understand these biases and practical strategies to mitigate them.</p>

<p>The framework is:</p>

<ul>
  <li><strong>Comprehensive</strong>: Evaluates mapping accuracy at multiple biologically meaningful scales</li>
  <li><strong>Flexible</strong>: Configurable for diverse experimental designs and organisms</li>
  <li><strong>Actionable</strong>: Provides concrete strategies (mosaic approach, filtering) to improve analysis</li>
  <li><strong>Accessible</strong>: Wrapped in Nextflow with Docker containerization for easy deployment</li>
</ul>

<p>Whether you’re planning a new NC RNA-seq experiment, benchmarking analysis pipelines, or trying to understand unexpected results from existing data, splice_sim can help ensure your biological conclusions rest on solid computational ground.</p>

<h2 id="resources">Resources</h2>

<ul>
  <li><strong>Paper</strong>: <a href="https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03313-8">Genome Biology publication</a></li>
  <li><strong>Source Code</strong>: <a href="https://github.com/popitsch/splice_sim">GitHub repository</a></li>
  <li><strong>Docker Image</strong>: <a href="https://hub.docker.com/repository/docker/tobneu/splice_sim">Docker Hub</a></li>
  <li><strong>Precomputed Data</strong>: <a href="https://doi.org/10.5281/zenodo.11196570">Zenodo repository</a> with mapping accuracy tables for mouse and human</li>
</ul>

<p>Have questions or want to discuss NC RNA-seq mapping challenges? Feel free to open an issue on GitHub or reach out directly. Happy simulating!</p>]]></content><author><name>Tobias Neumann</name></author><category term="Bioinformatics" /><category term="Pipelines" /><category term="RNA-seq" /><category term="Nextflow" /><category term="Python" /><category term="Docker" /><category term="Genomics" /><category term="Benchmarking" /><summary type="html"><![CDATA[A comprehensive simulation framework for evaluating mapping accuracies in nucleotide conversion RNA-seq experiments]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://t-neumann.github.io/assets/images/categories/genomebiology.png" /><media:content medium="image" url="https://t-neumann.github.io/assets/images/categories/genomebiology.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Orbital maneuvers</title><link href="https://t-neumann.github.io/space/OrbitalManeuvers/" rel="alternate" type="text/html" title="Orbital maneuvers" /><published>2019-09-08T15:29:00+02:00</published><updated>2019-09-08T15:29:00+02:00</updated><id>https://t-neumann.github.io/space/OrbitalManeuvers</id><content type="html" xml:base="https://t-neumann.github.io/space/OrbitalManeuvers/"><![CDATA[<p>From my <a href="https://t-neumann.github.io/space/OrbitalBasics/">last post</a> you should have read up on the basics of orbits and orbital parameters. Now while this is interesting by itself, changing orbits and moving to different orbits in order to dock to space stations, escape to different celestial bodies or de-orbit onto a body’s surface - this is the stuff we are actually here for. That is why this post moves into orbital mechanics and some basic maneuvers for modifying orbits.</p>

<p>Orbital mechanics is a core discipline within space-mission design and control.
It focuses on spacecraft trajectories, including orbital maneuvers, orbital plane changes, and interplanetary transfers, and is used by mission planners to predict the results of propulsive maneuvers.</p>

<p>Now let’s pretend we have some well-funded space agency, can do anything we want, and do not have to fear killing our astronauts - if only there were some simulation for this. This is where KSP comes into play.</p>

<h2 id="vessel">Vessel</h2>

<p>We do not want to simply calculate orbits, we want an actual spaceship with propulsion systems in orbit so we can see the impact of our maneuvers live. For this purpose, I have already spent endless hours building a <i class="fab fa-github" aria-hidden="true"></i> <a href="https://github.com/t-neumann/ksp-garage">huge garage</a> of more or less efficient vessels for exploring the KSP universe.</p>

<p>For this particular post, I will be using my rather tiny <a href="https://en.wikipedia.org/wiki/Single-stage-to-orbit">SSTO</a> <em>SlickOrbiter</em>, consisting of 4 rapier engines, which are hybrid engines with both air-breathing and liquid fuel modes. These I complement with an Atomic Rocket Motor engine for space maneuvers with far lower thrust but much higher efficiency (\(I_{SP}\)). I will definitely dedicate a couple of posts to propulsion systems, staging modes etc. at a later time - for now, just take it as it is.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/slickorbiter.gif" alt="Slick orbiter" width="100%" /></p>

<h2 id="spacecraft-orientation">Spacecraft orientation</h2>

<p>Now before we perform any orbit maneuvers or burns, we need to agree on the different directions in which we can point our spacecraft and perform these burns. Naturally, since we are in 3-dimensional space, we have 3 axes along which we can orient ourselves, each axis having 2 directions.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/spacecraftorientation.png" alt="Spacecraft orientation" width="100%" /></p>

<h4 id="prograde-and-retrograde">Prograde and retrograde</h4>

<p>These vectors run along the axis in which direction the spacecraft is moving along its orbit.</p>

<h4 id="normal-and-anti-normal">Normal and anti-normal</h4>

<p>The normal vectors are perpendicular to the orbital plane.</p>

<h4 id="radial-in-and-radial-out">Radial in and radial out</h4>

<p>These vectors are parallel to the orbital plane, and perpendicular to the prograde vector. The radial (or radial-in) vector points inside the orbit, towards the focus of the orbit, while the anti-radial (or radial-out) vector points outside the orbit, away from the body.</p>

<h2 id="orbital-maneuvers">Orbital maneuvers</h2>

<p>Ok now it is time to make a couple of burns into these directions and see how it affects our orbital parameters. To this end we set up maneuver nodes with directional indicators as shown below.</p>

<figure class="single ">
  
    
      <img src="/assets/images/posts/Maneuvers/orbitorientation.png" alt="Orbit orientation" />
    
  
    
      <img src="/assets/images/posts/Maneuvers/directions.png" alt="Directional markers" />
    
  
  
    <figcaption>Orbital directions and directional markers.
</figcaption>
  
</figure>

<p>I will go into more detail and Math about energy efficiency for those individual maneuvers in a later post; for now, this should only give you a first glimpse and a general understanding of how to move around in space.</p>

<h4 id="prograde-and-retrograde-maneuvers">Prograde and retrograde maneuvers</h4>

<p>So we are at the apoapsis of our nearly circular orbit perfectly aligned with the equatorial plane (0 degrees inclination). Let’s see what happens if we burn into prograde direction.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/progradeburn.gif" alt="Prograde burn" width="50%" /></p>

<p>As we can see, the apoapsis moves to the opposite end of our now elliptic orbit and we raised the orbit’s altitude on the opposite side.</p>

<p>What if we do a retrograde burn?</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/retrogradeburn.gif" alt="Retrograde burn" width="50%" /></p>

<p>As we can see, the periapsis on the opposing side is lowered until we go suborbital, meaning the spacecraft will deorbit on its way to periapsis and either burn up in the atmosphere or crash on the planet (unless a proper landing procedure is initiated).</p>

<p>In summary, burning prograde will increase orbital velocity, raising the altitude of the orbit on the other side, while burning retrograde will decrease velocity and reduce the orbit altitude on the other side.</p>

<p>This is the most efficient way to change the orbital shape (specifically the most common case, raising or lowering apsides) so whenever possible these vectors should be used.</p>

<h4 id="normal-and-anti-normal-maneuvers">Normal and anti-normal maneuvers</h4>

<p>Again we are at the apoapsis of our nearly circular orbit perfectly aligned with the equatorial plane (0 degrees inclination). Let’s see what happens if we burn into normal direction.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/normalburn.gif" alt="Normal burn" width="50%" /></p>

<p>We see that the orbital inclination (the angle between the orbital and equatorial plane) changes.</p>

<p>These vectors are generally used to match the orbital inclination of another celestial body or craft, and the only time this is possible is when the current craft’s orbit intersects the orbital plane of the target - at the ascending and descending nodes. We will get to this in a second.</p>
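<p>The cost of such a plane change grows quickly with orbital velocity: for a pure inclination change of \(\Delta i\) at speed \(v\), the required velocity change is \(\Delta v = 2 v \sin(\Delta i / 2)\). A quick Python sketch - the 1800 m/s is just an illustrative low-orbit speed, not tied to any particular body:</p>

```python
import math

# Delta-v for a pure inclination change: dv = 2 * v * sin(di / 2).
def plane_change_dv(v, di_deg):
    """Velocity change needed to rotate the orbital plane by di_deg at speed v."""
    return 2 * v * math.sin(math.radians(di_deg) / 2)

# Illustrative numbers: plane changes get expensive fast
for di in (5, 15, 45):
    print(f"{di:>2} deg at 1800 m/s -> {plane_change_dv(1800, di):.0f} m/s")
```

<p>This is also why inclination changes are cheapest where the craft moves slowest, e.g. near apoapsis.</p>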

<h4 id="radial-in-and-radial-out-maneuvers">Radial in and radial out maneuvers</h4>

<p>One last time we are at the apoapsis of our nearly circular orbit perfectly aligned with the equatorial plane (0 degrees inclination). Let’s see what happens if we burn into the radial out direction.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/radialoutburn.gif" alt="Radial out burn" width="50%" /></p>

<p>We see that the orbit starts rotating around the craft, like spinning a hula hoop with a stick. Radial burns are usually not an efficient way of adjusting one’s path - it is generally more effective to use prograde and retrograde burns.</p>

<h2 id="orbital-insertion">Orbital insertion</h2>

<p>Now let’s combine the basic orbital maneuvers of the previous section.
If a sufficient change of the orbital parameters is achieved, such a maneuver is generally described as an <strong>orbit insertion</strong>, a general term for a maneuver that is more than a small correction. It may be used for a maneuver to change a transfer orbit or an ascent orbit into a stable one, but also to change a stable orbit into a descent. The term <strong>orbit injection</strong> is also used - which I find even cooler - especially for changing a stable orbit into a transfer orbit, e.g. trans-lunar injection (TLI), trans-Mars injection (TMI) and trans-Earth injection (TEI).</p>

<p>Stable orbits have been described in the <a href="https://t-neumann.github.io/space/OrbitalBasics/">previous post</a>, but now we want to specifically look at transfer orbits which enable us to put satellites into orbits, travel to the moon and Mars and all the fancy wonderous places in our solar system and beyond.</p>

<p>So what is a <strong>transfer orbit</strong>: In orbital mechanics a transfer orbit is an intermediate elliptical orbit that is used to move a satellite or other object from one circular, or largely circular, orbit to another.</p>

<p>There are several types of transfer orbits, which vary in their energy efficiency and speed of transfer and I will quickly go over the most famous ones.</p>

<p>Again, I will go into more detail and Math about energy efficiency for those transfer orbits in a later post; for now, this should only give you a first glimpse and a general understanding of how these orbital insertions work.</p>

<h3 id="hohmann-transfer">Hohmann transfer</h3>

<p>In orbital mechanics, the Hohmann transfer orbit is an elliptical orbit used to transfer between two circular orbits of different radii around the same body in the same plane. The Hohmann transfer orbit uses the lowest possible amount of energy in traveling between these orbits.</p>

<p>The term is also used to refer to transfer orbits between different bodies (planets, moons etc.).</p>

<p>A Hohmann transfer requires that the starting and destination points be at particular locations in their orbits relative to each other. Space missions using a Hohmann transfer must wait for this required alignment to occur, which opens a so-called launch window. For a space mission between Earth and Mars, for example, these launch windows occur every 26 months. A Hohmann transfer orbit also determines a fixed time required to travel between the starting and destination points; for an Earth-Mars journey this travel time is about 9 months.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/Hohmann_transfer_orbit.svg" alt="Hohmann transfer" width="50%" /></p>

<p>The image shows a Hohmann transfer orbit to bring a spacecraft from a lower circular orbit into a higher one. It is one half of an elliptic orbit that touches both the lower circular orbit the spacecraft wishes to leave (green and labeled 1 on diagram) and the higher circular orbit that it wishes to reach (red and labeled 3 on diagram). The transfer (yellow and labeled 2 on diagram) is initiated by firing the spacecraft’s engine to accelerate prograde so that it will follow the elliptical orbit. This adds energy to the spacecraft’s orbit. When the spacecraft has reached its destination orbit, its orbital speed (and hence its orbital energy) must be increased again to change the elliptic orbit to the larger circular one which is termed <em>circularization</em>.</p>

<p>Now let’s do this in KSP. To simplify everything, assume both our starting orbit and our target orbit are already circular. Let’s say we want to reach some space station orbiting Laythe at 250 km and our <em>SlickOrbiter</em> is in a stable orbit at 100 km.</p>

<p>The first thing we have to do is match orbit inclination which is best done by a normal burn at the ascending node.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/inclinationchange.gif" alt="Orbit inclination correction" width="50%" /></p>

<p>Now that our orbital planes are synchronized, we can start with our first prograde burn of the Hohmann transfer maneuver which is raising our apoapsis to the target orbit height, effectively transforming our circular orbit into an elliptic orbit.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/HohmannBurn1.gif" alt="Hohmann transfer apoapsis change" width="50%" /></p>

<p>Now once we have reached our transfer orbit’s apoapsis, we can circularize and match our target orbit by another prograde burn.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/HohmannBurn2.gif" alt="Hohmann transfer circularization" width="50%" /></p>

<p>There it is, we have performed our first Hohmann transfer.</p>
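<p>The two burns can also be estimated on paper with the vis-viva equation \(v^2 = \mu \left( \frac{2}{r} - \frac{1}{a} \right)\). The Python sketch below uses Laythe’s gravitational parameter and radius as I know them from KSP reference material - treat both values as assumptions to double-check:</p>

```python
import math

# Back-of-the-envelope delta-v for the Hohmann transfer above, via vis-viva.
MU = 1.962e12        # m^3/s^2, Laythe gravitational parameter (assumed KSP value)
R_LAYTHE = 500_000   # m, Laythe radius (assumed KSP value)

def vis_viva(r, a):
    """Orbital speed at radius r on an orbit with semi-major axis a."""
    return math.sqrt(MU * (2 / r - 1 / a))

r1 = R_LAYTHE + 100_000     # 100 km parking orbit
r2 = R_LAYTHE + 250_000     # 250 km target orbit
a_transfer = (r1 + r2) / 2  # semi-major axis of the transfer ellipse

dv1 = vis_viva(r1, a_transfer) - vis_viva(r1, r1)  # prograde burn: raise apoapsis
dv2 = vis_viva(r2, r2) - vis_viva(r2, a_transfer)  # prograde burn: circularize
print(f"burn 1: {dv1:.1f} m/s, burn 2: {dv2:.1f} m/s, total: {dv1 + dv2:.1f} m/s")
```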

<h3 id="bi-elliptic-transfer">Bi-elliptic transfer</h3>

<p>The bi-elliptic transfer consists of two half-elliptic orbits and may, in certain situations, require less energy than a Hohmann transfer maneuver.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/Bi-elliptic_transfer.svg" alt="Bi-elliptic transfer" width="50%" /></p>

<p>From the initial orbit, a first prograde burn (1) boosts the spacecraft into the first transfer orbit with an apoapsis at some point away from the central body. At this point a second prograde burn (2) sends the spacecraft into the second elliptical orbit with periapsis at the radius of the final desired orbit, where a third retrograde burn (3) is performed, injecting the spacecraft into the desired orbit.</p>

<p>While it requires one more engine burn than a Hohmann transfer and generally a greater travel time, a bi-elliptic transfer requires less energy than a Hohmann transfer when the ratio of final to initial semi-major axis is 11.94 or greater, depending on the intermediate semi-major axis chosen.</p>

<p>Now let’s do this in KSP. To simplify everything, assume both our starting orbit and our target orbit are already circular and our orbital inclinations are already matched. Again, we want to reach some space station orbiting Laythe at 250 km and our <em>SlickOrbiter</em> is in a stable orbit at 100 km.</p>

<p>We will first raise our apoapsis above the target orbit to create an elliptic orbit with a long prograde burn.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/Bi-elliptic_burn1.gif" alt="Bi-elliptic transfer apoapsis raise" width="50%" /></p>

<p>Now we wait until we have reached the new apoapsis for another prograde burn to raise our periapsis to the level of the target orbit.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/Bi-elliptic_burn2.gif" alt="Bi-elliptic transfer periapsis raise" width="50%" /></p>

<p>Finally, we perform a retrograde burn at the new periapsis to lower our apoapsis for <em>circularizing</em> our target orbit.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/Bi-elliptic_burn3.gif" alt="Bi-elliptic transfer circularization" width="50%" /></p>

<p>There it is, we have performed our first Bi-elliptic transfer.</p>
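<p>We can also compare the two transfer strategies numerically with the vis-viva equation. Since our ratio of final to initial orbit radius (750 km over 600 km from Laythe’s center) is far below 11.94, the Hohmann transfer should come out cheaper here. The Laythe parameters and the intermediate apoapsis in this sketch are assumptions for illustration:</p>

```python
import math

# Hohmann vs. bi-elliptic delta-v for the scenario above, via vis-viva.
MU = 1.962e12   # m^3/s^2, Laythe gravitational parameter (assumed KSP value)
R = 500_000     # m, Laythe radius (assumed KSP value)

def v(r, a):
    """Orbital speed at radius r on an orbit with semi-major axis a."""
    return math.sqrt(MU * (2 / r - 1 / a))

r1, r2 = R + 100_000, R + 250_000
rb = R + 1_000_000  # assumed intermediate apoapsis for the bi-elliptic transfer

hohmann = ((v(r1, (r1 + r2) / 2) - v(r1, r1))
           + (v(r2, r2) - v(r2, (r1 + r2) / 2)))
bi = ((v(r1, (r1 + rb) / 2) - v(r1, r1))              # burn 1: raise apoapsis to rb
      + (v(rb, (rb + r2) / 2) - v(rb, (r1 + rb) / 2)) # burn 2: raise periapsis to r2
      + (v(r2, (r2 + rb) / 2) - v(r2, r2)))           # burn 3: retrograde circularization
print(f"Hohmann: {hohmann:.1f} m/s, bi-elliptic: {bi:.1f} m/s")
```

<p>For small radius ratios like ours, the bi-elliptic route costs substantially more - which is why it only pays off for very large orbit changes.</p>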

<p>Now that you have a basic overview of spacecraft orientation, burns into those directions and their impact on the spacecraft’s orbit, as well as how to combine those maneuvers into orbit insertions, we have laid the foundation to dive deeper into energy efficiency of those maneuvers, the famous <em>delta-v</em> and the Rocket equation in a later post. Until then - godspeed.</p>]]></content><author><name>Tobias Neumann</name></author><category term="Space" /><category term="Orbits" /><category term="Orbital mechanics" /><category term="Orbital parameters" /><summary type="html"><![CDATA[Changing orbital parameters using propulsion systems.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://t-neumann.github.io/assets/images/categories/OOSS.jpg" /><media:content medium="image" url="https://t-neumann.github.io/assets/images/categories/OOSS.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Square numbers proof</title><link href="https://t-neumann.github.io/mathematics/SquareNumberZeros/" rel="alternate" type="text/html" title="Square numbers proof" /><published>2019-09-02T22:05:00+02:00</published><updated>2019-09-02T22:05:00+02:00</updated><id>https://t-neumann.github.io/mathematics/SquareNumberZeros</id><content type="html" xml:base="https://t-neumann.github.io/mathematics/SquareNumberZeros/"><![CDATA[<p>I recently signed up for the <a href="http://www.vds-molecules-of-life.org/index.php?id=1350">MFPL PhD Selection</a> where we got some scientific tasks to solve. One involved proving some statement about <a href="https://en.wikipedia.org/wiki/Square_number">square numbers</a> right or wrong.</p>

<h2 id="question">Question</h2>

<blockquote>
  <p>Is any of the integer numbers, A, consisting of exactly 15 ones and 15 zeros a square-number, that is an integer B exists, such that B*B=A? The number A should always have 30 digits and also numbers with leading zeros are considered. Please explain your answer. A simple YES or NO is not sufficient.</p>
</blockquote>

<h2 id="probing-the-statement-approach">Probing the statement approach</h2>

<p>I’m definitely no Maths genius, so the first thing I did was to randomly build some numbers with 15 1s and 15 0s and calculate their square roots to get a feeling for the problem.</p>

<p>Here I already stumbled upon some misleading results: for bigger numbers - such as any number with at least 15 digits - the Apple calculator, <a href="https://www.r-project.org/">R</a> and Google tend to round and switch to scientific notation, making you believe you are looking at square numbers.</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;</span><span class="w"> </span><span class="n">a</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">000000000000001111111111111110</span><span class="w">
</span><span class="o">&gt;</span><span class="w"> </span><span class="n">a</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">1.111111e+15</span><span class="w">
</span><span class="o">&gt;</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="n">a</span><span class="p">)</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">33333333</span><span class="w">
</span><span class="o">&gt;</span><span class="w"> </span><span class="m">33333333</span><span class="o">*</span><span class="m">33333333</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">1.111111e+15</span><span class="w">
</span></code></pre></div></div>

<p><img src="https://t-neumann.github.io/assets/images/posts/SquareNumberProof/googlecalculator.png" alt="Google calculator" width="50%" /></p>

<p>As you can see, both R and Google calculator would make you believe \(33333333^2\) yields \(1111111111111110\) when in fact it does not - cross-checked with Apple calculator.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/SquareNumberProof/applecalculator.png" alt="Apple calculator" width="50%" /></p>

<p>So after a small detour I had pretty quickly found an example proving the statement above wrong - which by itself is already sufficient to disprove the initial statement - but I wanted a little more sophistication.</p>

<p>I decided to take a rather lazy approach of reading up on properties of square numbers on <a href="https://en.wikipedia.org/wiki/Square_number">Wikipedia</a> and see whether any of them proves to be an easy no go. I came across the following:</p>

<ol>
  <li>No square number ends in 2, 3, 7 or 8.</li>
  <li>The number of zeros at the end of a perfect square is always even.</li>
  <li>Squares of even numbers are always even numbers and square of odd numbers are always odd.</li>
  <li>The Square of a natural number other than one is either a multiple of 3 or exceeds a multiple of 3 by 1.</li>
  <li>The Square of a natural number other than one is either a multiple of 4 or exceeds a multiple of 4 by 1.</li>
  <li>The unit’s digit of the square of a natural number is the unit’s digit of the square of the digit at unit’s place of the given natural number.</li>
  <li>There are no natural numbers \(p\) and \(q\) such that \(p^2 = 2q^2\).</li>
  <li>For every natural number \(n\),
\((n + 1)^2 - n^2 = (n + 1) + n\).</li>
  <li>For any natural number \(m\) greater than 1,
\((2m, m^2 - 1, m^2 + 1)\) is a Pythagorean triplet.</li>
</ol>

<p>So let’s just quickly go through them:</p>

<p><strong>Property 1</strong> does not really help because we can only construct numbers ending at 0 and 1, both apparently valid digits for square numbers.</p>

<p><strong>Property 2</strong> - we already hit the jackpot. Since we can freely distribute 0s in our numbers, it is trivial to create one with an odd number of zeros at the end.</p>

<p>Alrighty, let’s formalize it.</p>

<h2 id="proof-square-numbers-ending-in-zeros-strictly-end-with-an-even-number-of-zeros">Proof: Square numbers ending in zeros strictly end with an even number of zeros</h2>

<blockquote>
  <p>Theorem: Square numbers ending in zeros strictly end with an even number of zeros.</p>
</blockquote>

<p>(1) Let \(n\) be a positive integer ending in exactly \(m\) trailing zeros, with \(m \geq 0\).</p>

<p>(2) Then \(n = 10^m \cdot j\) for some integer \(j\) that is not divisible by \(10\).</p>

<p>(3) The perfect square of \(n\) equals \(n^2 = 10^{2m} \cdot j^2\). Since \(j\) is not divisible by \(10\), it lacks the prime factor \(2\) or the prime factor \(5\), and so does \(j^2\) - hence \(j^2\) is not divisible by \(10\) either.</p>

<p>From (3) it directly follows that \(n^2\) ends in exactly \(2m\) zeros - an even number of zeros.</p>

<p>We have proven the theorem and can therefore use it to construct counter-examples with the properties given in our initial question.</p>
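<p>As a quick empirical sanity check of the theorem, we can brute-force count trailing zeros of perfect squares in plain Python:</p>

```python
def trailing_zeros(n):
    """Number of trailing decimal zeros of a positive integer."""
    count = 0
    while n % 10 == 0:
        n //= 10
        count += 1
    return count

# Every perfect square up to 2000^2 ends in an even number of zeros.
assert all(trailing_zeros(k * k) % 2 == 0 for k in range(1, 2001))
print("Theorem holds for all squares up to 2000^2")
```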

<h2 id="disprove-statement-by-counterexample">Disprove statement by counterexample</h2>

<p>It is trivial to find a number \(m\) with exactly 15 ones and 15 zeros whose count of trailing zeros is odd: simply place all 15 zeros at the end.</p>

<p>Simplest example:</p>

\[m = 111111111111111000000000000000\]

<p>Since \(m\) ends in exactly 15 zeros - an odd number - the theorem tells us that \(m\) cannot be a square number.</p>
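<p>A quick empirical sanity check (a Python sketch of mine, not part of the original proof): count the trailing zeros of squares to confirm the count is always even, then let <code>math.isqrt</code> rule out concrete candidates built from ones and zeros.</p>

```python
import math

def trailing_zeros(n):
    """Count the zeros at the end of n's decimal representation (n >= 1)."""
    count = 0
    while n % 10 == 0:
        n //= 10
        count += 1
    return count

# Theorem check: every square of n = 1..100000 ends in an even number of zeros.
assert all(trailing_zeros(n * n) % 2 == 0 for n in range(1, 100001))

def is_square(n):
    r = math.isqrt(n)  # exact integer square root, works for arbitrary-size ints
    return r * r == n

# Candidates with an odd number of trailing zeros can never be squares:
print(is_square(int("1" * 15 + "0")))       # 15 ones, one trailing zero  -> False
print(is_square(int("1" * 15 + "0" * 15)))  # 15 ones, 15 trailing zeros -> False
```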

<p>Therefore it follows that the question</p>

<blockquote>
  <p>Is any of the integer numbers A, consisting of exactly 15 ones and 15 zeros, a square number - that is, does an integer B exist such that B*B=A?</p>
</blockquote>

<p>can be answered with <strong>No</strong>:</p>

<blockquote>
  <p>Not every integer number A consisting of exactly 15 ones and 15 zeros is a square number - that is, an integer B with B*B=A does not always exist.</p>
</blockquote>]]></content><author><name>Tobias Neumann</name></author><category term="Mathematics" /><category term="Proof" /><category term="Square Number" /><summary type="html"><![CDATA[Proof that the number of zeros at the end of a perfect square is always even.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://t-neumann.github.io/assets/images/categories/maths.jpg" /><media:content medium="image" url="https://t-neumann.github.io/assets/images/categories/maths.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Orbital basics</title><link href="https://t-neumann.github.io/space/OrbitalBasics/" rel="alternate" type="text/html" title="Orbital basics" /><published>2019-08-26T13:42:00+02:00</published><updated>2019-08-26T13:42:00+02:00</updated><id>https://t-neumann.github.io/space/OrbitalBasics</id><content type="html" xml:base="https://t-neumann.github.io/space/OrbitalBasics/"><![CDATA[<p>I was always fascinated by rockets, space in general and zero-gravity environments, but the maths involved always seemed too complex for me. However, through the playful yet still complex approach of <a href="https://www.kerbalspaceprogram.com/">Kerbal Space Program</a> (KSP) - an awesome game I totally recommend to anybody remotely interested in space exploration - I lately picked up interest again and started reading into orbital mechanics, propulsion systems and related topics in more detail.</p>

<p>This blog series is dedicated to summarising basic concepts at a definitely super-simplified - and probably sometimes oversimplified, not entirely correct - level.</p>

<p>The easiest concept for me to grasp - since one can explore it quite interactively in KSP - is the concept of orbits and orbital changes through orbital maneuvers.</p>

<p>So this very first post of this series will cover my basic understanding of the concept of orbits.</p>

<h2 id="ellipse">Ellipse</h2>

<p>Let’s start off with refreshing our memory of what an ellipse is - because that is what most orbits relevant for this blog series will look like. In mathematical terms, an ellipse is a plane curve surrounding two focal points (\(F_1\) and \(F_2\)), such that for all points on the curve, the sum of the two distances \(d(F_1) + d(F_2)\) is constant.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Orbits/Ellipse-definition.png" alt="Ellipse definition" width="50%" /></p>

<p>It is a generalization of a circle, where the two focal points are the same. Yes, also circular orbits exist.</p>

<h3 id="ellipse-parameters">Ellipse parameters</h3>

<p>There are a few important parameters describing an ellipse which will be referred to throughout this blog series, so make sure you memorize and understand them - they will keep popping up again and again.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Orbits/Ellipse-param.png" alt="Ellipse parameters" width="50%" /></p>

<h6 id="semi-major-and-semi-minor-axes-a-geq-b">Semi-major and semi-minor axes \(a \geq b\)</h6>

<p>\(a\) is referred to as the semi-major axis and \(b\) as the semi-minor axis, i.e. \(a \geq b &gt; 0\).</p>

<h6 id="linear-eccentricity-c">Linear eccentricity \(c\)</h6>

<p>This is the distance from the center to any of the two foci: \(c  = \sqrt{a^2 - b^2}\).</p>

<h6 id="eccentricity-e">Eccentricity \(e\)</h6>

<p>The eccentricity is expressed as:</p>

\[e = \frac{c}{a} = \sqrt{1 - (\frac{b}{a})^{2}}\]

<p>assuming \(a &gt; b\). An ellipse with equal axes \((a = b)\) has zero eccentricity and is a circle.</p>

<h6 id="semi-latus-rectum-l">Semi-latus rectum \(l\)</h6>

<p>The length of the chord through one of the foci, perpendicular to the major axis, is called the latus rectum. One half of it is the semi-latus rectum \(l\). A calculation shows:</p>

\[l = \frac{b^2}{a} = a(1-e^2)\]

<p>The semi-latus rectum \(l\) is equal to the radius of curvature of the osculating circles at the vertices.</p>
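<p>The parameter relations above are easy to check numerically. A small sketch (function and variable names are mine, chosen for illustration):</p>

```python
import math

def ellipse_params(a, b):
    """Derive linear eccentricity c, eccentricity e and semi-latus rectum l
    from the semi-major axis a and semi-minor axis b (a >= b > 0)."""
    c = math.sqrt(a**2 - b**2)   # distance from the center to either focus
    e = c / a                    # equivalently sqrt(1 - (b/a)^2)
    l = b**2 / a                 # equivalently a * (1 - e^2)
    return c, e, l

c, e, l = ellipse_params(a=5.0, b=3.0)
print(c, e, l)                       # 4.0 0.8 1.8
# A circle (a == b) has zero eccentricity:
print(ellipse_params(2.0, 2.0)[1])   # 0.0
```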

<h2 id="orbit">Orbit</h2>

<p>Now probably everybody has some idea what an orbit is, but before going into details, let’s first summarise the definitions I found on the web.</p>

<h4 id="definition">Definition</h4>

<p>In physics, an orbit is the gravitationally curved trajectory of an object, like the trajectory of a planet around a star or a satellite around earth. Unless mentioned differently, in this blogpost orbit refers to a regularly repeating trajectory, but there are also non-repeating trajectories. To a close approximation, planets and satellites follow elliptic orbits, with the central mass being orbited at one of the two focal points of the ellipse, as described by <a href="https://en.wikipedia.org/wiki/Kepler%27s_laws_of_planetary_motion">Kepler’s laws of planetary motion</a>.</p>

<p>The post will stick to the classical Newtonian mechanics paradigm of describing orbital motion, which is an adequate approximation for most situations. However, Einstein’s general theory of relativity - which accounts for gravity as curvature of spacetime, with orbits following geodesics - provides a more accurate description of orbital motion. This is needed near very massive bodies (e.g. for Mercury’s orbit around the sun) or for extreme precision (as for GPS satellites).</p>

<h4 id="understanding-orbits">Understanding orbits</h4>

<p>There are two factors involved for understanding orbits:</p>

<ul>
  <li>Gravity pulling an object from its straight path into a curved path</li>
  <li>The velocity at which this object is trying to travel along its path</li>
</ul>

<p><img src="https://t-neumann.github.io/assets/images/posts/Orbits/tangentialvelocity.jpg" alt="Tangential velocity vs gravity" width="50%" /></p>

<p>This principle is illustrated above: gravity from a massive body in the center (green) pulls an object travelling on a straight path (pink object, black arrows), effectively bending the path with its constant pull (red) around the central body.</p>

<p>Another way to illustrate how orbits develop is the thought experiment of <a href="https://en.wikipedia.org/wiki/Newton%27s_cannonball">Newton’s cannonball</a>. Here, we imagine a cannon on top of a very high mountain which can fire at any imaginable speed.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Orbits/Newton_Cannon.png" alt="Newton cannon" width="50%" /></p>

<p>If the cannon fires its ball with a low initial speed, the trajectory of the ball curves downward and hits the ground <strong>(A)</strong>. As the firing speed is increased, the cannonball hits the ground farther away from the cannon <strong>(B)</strong>, because while the ball is still falling towards the ground, the ground is increasingly curving away from it (see first point above). All these motions are actually “orbits” in a technical sense – they describe a portion of an elliptical path around the center of gravity – but the orbits are interrupted by striking the Earth. The horizontal speed for both <strong>(A)</strong> and <strong>(B)</strong> ranges from 0 to 7,000 m/s for Earth.</p>

<p>If the cannonball is fired with sufficient speed, the ground curves away from the ball at least as much as the ball falls – so the ball never strikes the ground. It is now in what could be called a non-interrupted, or circumnavigating, orbit. For any specific combination of height above the center of gravity and mass of the planet, there is one specific firing speed (unaffected by the mass of the ball, which is assumed to be very small relative to the Earth’s mass) that produces a circular orbit, as shown in <strong>(C)</strong>.</p>

<p>As the firing speed is increased beyond this, non-interrupted elliptic orbits are produced; one is shown in <strong>(D)</strong>. If the initial firing is above the surface of the Earth as shown, there will also be non-interrupted elliptical orbits at slower firing speeds; these will come closest to the Earth half an orbit beyond - and directly opposite - the firing point, below the circular orbit. The horizontal speed for both <strong>(C)</strong> and <strong>(D)</strong> ranges from 7,300 to 10,000 m/s for Earth.</p>

<p>At a specific horizontal firing speed called escape velocity, dependent on the mass of the planet, an open orbit <strong>(E)</strong> is achieved that has a parabolic path. At even greater speeds the object will follow a range of hyperbolic trajectories. In a practical sense, both of these trajectory types mean the object is “breaking free” of the planet’s gravity, and “going off into space” never to return. This involves any horizontal speed &gt; 10,000 m/s for Earth.</p>

<figure class="third ">
  <img src="/assets/images/posts/Orbits/Newtonsmountainv=0.gif" alt="Newton's cannon v=0" />
  <img src="/assets/images/posts/Orbits/Newtonsmountainv=6000.gif" alt="Newton's cannon v=6000" />
  <img src="/assets/images/posts/Orbits/Newtonsmountainv=7300.gif" alt="Newton's cannon v=7300" />
  <img src="/assets/images/posts/Orbits/Newtonsmountainv=8000.gif" alt="Newton's cannon v=8000" />
  <img src="/assets/images/posts/Orbits/Newtonsmountainv=10000.gif" alt="Newton's cannon v=10000" />
  <figcaption>Various firing speeds of Newton’s cannon and the resulting trajectory.</figcaption>
</figure>

<p>This leads to four practical classes of moving objects:</p>

<ol>
  <li>No orbit</li>
  <li>
    <p>Suborbital trajectories</p>

    <ul>
      <li>Range of interrupted elliptical paths</li>
    </ul>
  </li>
  <li>
    <p>Orbital trajectories</p>

    <ul>
      <li>Range of elliptical paths with closest point opposite the firing point</li>
      <li>Circular path</li>
      <li>Range of elliptical paths with closest point at the firing point</li>
    </ul>
  </li>
  <li>
    <p>Open (escape) trajectories</p>

    <ul>
      <li>Parabolic paths</li>
      <li>Hyperbolic paths</li>
    </ul>
  </li>
</ol>
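<p>The cannonball picture can be reproduced with a few lines of numerical integration. The following sketch is mine, not from the original thought experiment: it uses assumed Earth-like values, fires horizontally from 200 km altitude, and reports whether the ball strikes the ground within the simulated time.</p>

```python
import math

MU = 3.986004418e14   # Earth's gravitational parameter GM [m^3/s^2] (assumed)
R_EARTH = 6.371e6     # mean Earth radius [m] (assumed)

def fire(v0, altitude=200e3, dt=1.0, steps=30000):
    """Integrate a cannonball fired horizontally with speed v0 [m/s].
    Semi-implicit Euler keeps near-circular orbits numerically stable."""
    x, y = R_EARTH + altitude, 0.0
    vx, vy = 0.0, v0
    for _ in range(steps):
        r = math.hypot(x, y)
        if r <= R_EARTH:
            return "hits the ground"          # interrupted orbits (A)/(B)
        ax, ay = -MU * x / r**3, -MU * y / r**3
        vx += ax * dt                          # update velocity first ...
        vy += ay * dt
        x += vx * dt                           # ... then position
        y += vy * dt
    return "stays up"                          # non-interrupted orbits (C)/(D)/(E)

print(fire(2000.0))   # suborbital firing speed
print(fire(7800.0))   # roughly circular speed at this altitude
```

For this altitude the circular speed is about \(\sqrt{\mu / r} \approx 7{,}800\) m/s, which matches the 7,300-10,000 m/s range quoted above.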

<h4 id="apsis">Apsis</h4>

<p>The first two terms I learned about in KSP were the two apsides - probably because a lot of orbital maneuvers happen at those, and they are pretty simple to comprehend.</p>

<p>Apsis denotes either of the two extreme points (i.e., the farthest or nearest point) in the orbit of a planetary body about its primary body.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Orbits/apsis.png" alt="Apsis" width="50%" /></p>

<p>There are two apsides in any elliptic orbit. Each is named by selecting the appropriate prefix - apo- or peri- - and joining it to the reference suffix of the “host” body being orbited. The general form is <strong>apoapsis</strong> ((1) in the figure above) for the farthest point and <strong>periapsis</strong> ((2) in the figure above) for the nearest point. Depending on which central body is orbited, these become apogee and perigee for objects orbiting earth, aphelion and perihelion for objects orbiting the sun, etc.</p>

<h4 id="orbital-elements">Orbital elements</h4>

<p>Orbital elements are the parameters required to uniquely identify a specific orbit. In celestial mechanics, usually a Kepler orbit is used. A real orbit changes over time due to gravitational perturbations by other objects and relativistic effects, so a Keplerian orbit is merely an idealized, mathematical approximation at a particular time.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Orbits/orbitalelements.png" alt="Orbital elements" width="50%" /></p>

<p>An orbit is generally defined by six elements (known as Keplerian elements) that can be computed from position and velocity:</p>

<p>Two define the size and shape of the trajectory (compare with <a href="#ellipse-parameters">ellipse parameters</a>):</p>

<ul>
  <li>
    <p>Semi-major axis \(a\)</p>
  </li>
  <li>
    <p>Eccentricity \(e\)</p>
  </li>
</ul>

<p>Two elements define the orientation of the orbital plane in which the ellipse is embedded:</p>

<ul>
  <li>
    <p>Inclination \(i\) - vertical tilt of the ellipse with respect to the reference plane (for the earth e.g. the equatorial plane), measured at the ascending node, i.e. the point where the orbit passes upwards through the reference plane. The tilt angle is measured perpendicular to the line of intersection between the orbital plane and the reference plane.</p>
  </li>
  <li>
    <p>Longitude of the ascending node \(\Omega\) - horizontally orients the ascending node of the ellipse with respect to the reference frame’s vernal point ♈.</p>
  </li>
</ul>

<p>I found it pretty hard at first to wrap my head around what the vernal point ♈ actually is - naturally it is some arbitrary reference point to fix the angle for the ascending node \(\Omega\). The vernal point ♈ is one of the equinoxes, namely the one occurring in spring in the northern hemisphere. It is regarded as the instant of time when the plane of the Earth’s equator passes through the center of the sun; at the equator, the sunrays then hit the earth perpendicularly, directly from the zenith. After passing the vernal point, the northern hemisphere receives more light - summer is here; before the vernal point, the northern hemisphere received less light - winter was coming. The same holds vice versa for the southern hemisphere.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Orbits/vernalpoint.png" alt="Vernal point" width="100%" /></p>

<p>The two remaining elements are as follows:</p>

<ul>
  <li>
    <p>Argument of periapsis \(\omega\) defines the orientation of the ellipse in the orbital plane. It is measured as the angle from the ascending node to the periapsis.</p>
  </li>
  <li>
    <p>True anomaly (\(\nu\), \(\theta\), or \(f\)) at epoch defines the position of the orbiting body along the ellipse at a specific time (the “epoch”). The true anomaly is the angle between the direction of the periapsis and the current position of the orbiting body.</p>
  </li>
</ul>

<p>Epoch sounds pretty sophisticated, but it is basically just a moment in time used as a reference point for some time-varying astronomical quantity, like the true anomaly. Still sounds complicated?</p>

<p>Let’s look at some unit indicating a specific epoch: J2000.</p>

<p>The \(J\) unit refers to Julian years, which are intervals with the length of a mean year in the Julian calendar, i.e. 365.25 days. This interval measure does not itself define an epoch - the Gregorian calendar remains in general use for dating. Thus “J2000” refers to the instant of 12:00 TT (noon) on January 1, 2000.</p>

<p>Now an arbitrary Julian epoch is therefore related to the Julian date by</p>

\[J = 2000 + \frac{\text{Julian date} - 2451545.0}{365.25}\]

<p>So in a sense everybody definitely has a feeling for an epoch, because we also structure our lives and set up meetings for certain “epochs” every day.</p>
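<p>The conversion above is easy to play with; a minimal sketch (the function name is mine):</p>

```python
def julian_epoch(jd):
    """Convert a Julian date to a Julian epoch, e.g. J2000.0."""
    return 2000.0 + (jd - 2451545.0) / 365.25

print(julian_epoch(2451545.0))           # noon TT, January 1, 2000 -> 2000.0
print(julian_epoch(2451545.0 + 365.25))  # exactly one Julian year later -> 2001.0
```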

<h4 id="orbital-period">Orbital period</h4>

<p>The orbital period is simply how long an orbiting body takes to complete one orbit.</p>
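<p>For a Kepler orbit, the period follows from the semi-major axis and the central body’s gravitational parameter alone, via Kepler’s third law \(T = 2\pi\sqrt{a^3/\mu}\) - the formula is standard textbook material, not derived in this post:</p>

```python
import math

MU_EARTH = 3.986004418e14  # Earth's gravitational parameter GM [m^3/s^2] (assumed)

def orbital_period(a, mu=MU_EARTH):
    """Period of a Kepler orbit with semi-major axis a [m], in seconds."""
    return 2 * math.pi * math.sqrt(a**3 / mu)

# ISS-like orbit at roughly 420 km altitude: about 92.8 minutes
print(round(orbital_period(6.371e6 + 420e3) / 60, 1), "minutes")
```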

<h4 id="ellipse-vs-orbits">Ellipse vs orbits</h4>

<p>For elliptical orbits, some formulas from ellipses are directly related.</p>

<p>Let \(e\) be the eccentricity, \(r_a\) the radius of the apoapsis, \(r_p\) the radius of the periapsis and \(a\) the length of the semi-major axis. Then:</p>

\[e = \frac{r_a - r_p}{r_a + r_p} = \frac{r_a - r_p}{2a}\]

\[r_a = (1 + e)a\]

\[r_p = (1 - e)a\]

<p>Interestingly, the semi-major axis \(a\) is the arithmetic mean, the semi-minor axis \(b\) the geometric mean and the semi-latus rectum \(l\) the harmonic mean of \(r_a\) and \(r_p\):</p>

\[a = \frac{r_a + r_p}{2}\]

\[b = \sqrt{r_a r_p}\]

\[l = \frac{2}{\frac{1}{r_a} + \frac{1}{r_p}} = \frac{2r_{a}r_{p}}{r_a + r_p}\]
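<p>These three means, and their consistency with the apsis formulas, are quickly verified numerically (a sketch of mine; the radii are arbitrary example values):</p>

```python
import math

r_a, r_p = 3.0, 1.0   # example apoapsis and periapsis radii

a = (r_a + r_p) / 2               # arithmetic mean -> semi-major axis
b = math.sqrt(r_a * r_p)          # geometric mean  -> semi-minor axis
l = 2 * r_a * r_p / (r_a + r_p)   # harmonic mean   -> semi-latus rectum
e = (r_a - r_p) / (r_a + r_p)     # eccentricity

print(a, e)                        # 2.0 0.5
# Consistency with the apsis formulas:
print(r_a == (1 + e) * a)          # True
print(r_p == (1 - e) * a)          # True
print(abs(l - b**2 / a) < 1e-12)   # True: l = b^2 / a
```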

<h4 id="orbits-in-ksp">Orbits in KSP</h4>

<p>Now this post should leave you with a basic idea of what an orbit is, how it is defined and which parameters are important to specify an orbit and to position a moving object in a given orbit. As a little teaser for the next post, where we will be talking about basic orbital maneuvers and mechanics, here is a first screenshot from KSP of a random orbit. What can you tell from it?</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Orbits/ksp-orbital-parameters.png" alt="KSP orbits" width="100%" /></p>

<p>Given what I have told you, you should be able to spot that it is a circular orbit (eccentricity = 0, or apoapsis \(\approx\) periapsis) and its orbital plane is perfectly aligned with the equatorial plane of the central body (inclination = 0).</p>

<p>Now you should be equipped with the basic toolset for the next post where we will be modifying orbital parameters with maneuvers.</p>]]></content><author><name>Tobias Neumann</name></author><category term="Space" /><category term="Orbits" /><category term="Orbital mechanics" /><category term="Orbital parameters" /><summary type="html"><![CDATA[Basic definition, terminology and concepts of orbits]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://t-neumann.github.io/assets/images/categories/OOSS.jpg" /><media:content medium="image" url="https://t-neumann.github.io/assets/images/categories/OOSS.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Pipelines on AWS</title><link href="https://t-neumann.github.io/pipelines/AWS-pipeline/" rel="alternate" type="text/html" title="Pipelines on AWS" /><published>2019-08-25T21:51:00+02:00</published><updated>2019-08-25T21:51:00+02:00</updated><id>https://t-neumann.github.io/pipelines/AWS-pipeline</id><content type="html" xml:base="https://t-neumann.github.io/pipelines/AWS-pipeline/"><![CDATA[<p>The prerequisite for this post is that you have a sound understanding of Nextflow and made yourself familiar with the <code class="language-plaintext highlighter-rouge">salmon-nf</code> workflow created in <a href="https://t-neumann.github.io/pipelines/Nextflow-pipeline/">this post</a>. Furthermore, you should know all the essential AWS building blocks and basic architecture of an AWS based batch scheduler as I presented in my <a href="https://t-neumann.github.io/pipelines/AWS-architecture/">previous post</a>. 
In this post, I will show you which environment and resources you have to set up on AWS to make the <a href="https://github.com/t-neumann/salmon-nf"><code class="language-plaintext highlighter-rouge">salmon-nf</code></a> example pipeline run, and then how to actually run jobs on the resulting AWS Batch queue with <a href="https://www.nextflow.io/">Nextflow</a>.</p>

<h2 id="credits">Credits</h2>

<p>Many people have done a great job setting up tutorials and blogs on this, and I would like to acknowledge a few that helped me a lot to actually make my AWS pipelines happen:</p>

<ul>
  <li><a href="https://maxulysse.github.io/">Maxime Garcia</a> and his great blog</li>
  <li><a href="https://apeltzer.github.io/">Alex Peltzer</a></li>
  <li><a href="https://github.com/pditommaso">Paolo Di Tommaso</a> for Nextflow and Gitter support</li>
</ul>

<p>There are a couple of tutorials that helped a lot:</p>

<ul>
  <li><a href="https://www.nextflow.io/docs/latest/awscloud.html#aws-batch">Nextflow documentation</a></li>
  <li><a href="https://www.nextflow.io/blog/2017/scaling-with-aws-batch.html">Nextflow blog</a></li>
</ul>

<h2 id="prerequisites">Prerequisites</h2>

<h3 id="accounts-users-roles-permissions">Accounts, users, roles, permissions</h3>

<p>Some things have to be set up before building the actual AWS compute environment - obvious things such as an <code class="language-plaintext highlighter-rouge">AWS account</code>, but also an <code class="language-plaintext highlighter-rouge">IAM user</code> and <code class="language-plaintext highlighter-rouge">Service roles</code>. All of this has to be done only once and is exhaustively covered already in several blog posts such as <a href="https://apeltzer.github.io/post/01-aws-nfcore/">this one</a> by Alex Peltzer and Tobias Koch. Therefore, I will not spend any time on this and suggest you just follow the instructions in that blog post until it is time to set up your <code class="language-plaintext highlighter-rouge">AMI</code>, which is where I will start off.</p>

<h2 id="step-1-estimate-resource-requirements">Step 1: Estimate resource requirements</h2>

<p>Appropriate resource allocation is crucial for setting up AWS workflows that are both cost-efficient and high-throughput. Therefore, I strongly advise you to take a big enough test dataset, run it in a limitless test environment - hopefully many of you have some kind of in-house HPC cluster - and take the resulting measurements of resource consumption to find optimal storage, memory and CPU sizes.</p>

<p>Conveniently, Nextflow workflows can be easily executed both on <code class="language-plaintext highlighter-rouge">AWS</code> but also in your local HPC environment by simply defining additional <a href="https://www.nextflow.io/docs/latest/config.html#config-profiles">profiles</a> for the scheduler of your choice.</p>

<p>Here is one example of a simple <code class="language-plaintext highlighter-rouge">SLURM</code> profile:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">singularity</span> <span class="o">{</span>
	<span class="n">enabled</span> <span class="o">=</span> <span class="kc">true</span>
<span class="o">}</span>

<span class="n">docker</span> <span class="o">{</span>
	<span class="n">enabled</span> <span class="o">=</span> <span class="kc">false</span>
<span class="o">}</span>

<span class="n">process</span> <span class="o">{</span>

    <span class="n">executor</span> <span class="o">=</span> <span class="err">'</span><span class="n">slurm</span><span class="err">'</span>
    <span class="n">clusterOptions</span> <span class="o">=</span> <span class="err">'</span><span class="o">--</span><span class="n">qos</span><span class="o">=</span><span class="kt">short</span><span class="err">'</span>
    <span class="n">cpus</span> <span class="o">=</span> <span class="err">'</span><span class="mi">12</span><span class="err">'</span>
    <span class="n">memory</span> <span class="o">=</span> <span class="o">{</span> <span class="mi">8</span><span class="o">.</span><span class="na">GB</span> <span class="o">*</span> <span class="n">task</span><span class="o">.</span><span class="na">attempt</span> <span class="o">}</span>
<span class="o">}</span>

<span class="n">params</span> <span class="o">{</span>

   <span class="n">salmonIndex</span> <span class="o">=</span> <span class="err">'</span><span class="o">/</span><span class="n">groups</span><span class="o">/</span><span class="nc">Software</span><span class="o">/</span><span class="n">indices</span><span class="o">/</span><span class="n">hg38</span><span class="o">/</span><span class="n">salmon</span><span class="o">/</span><span class="n">gencode</span><span class="o">.</span><span class="na">v28</span><span class="o">.</span><span class="na">IMPACT</span><span class="err">'</span>

<span class="o">}</span>
</code></pre></div></div>

<p>As you can see, usually <code class="language-plaintext highlighter-rouge">HPC</code> environments do not allow Docker containers to run, but support <a href="https://singularity.lbl.gov/">Singularity</a> containers which can be <a href="https://singularity.lbl.gov/docs-build-container#downloading-a-existing-container-from-docker-hub">easily built from Docker containers</a>.</p>

<p>The <code class="language-plaintext highlighter-rouge">process</code> section basically defines the scheduler, resources and the job queue in which the processes should run. Finally, the index files are usually stored in some globally accessible directory, similar to the <code class="language-plaintext highlighter-rouge">s3</code> storage on <code class="language-plaintext highlighter-rouge">AWS</code>.</p>

<p>Now that we are set, Nextflow has this neat option flag <code class="language-plaintext highlighter-rouge">-with-report</code> that gives you a very <a href="https://www.nextflow.io/docs/latest/tracing.html#execution-report">comprehensive overview</a> of the resources your processes consumed during execution.</p>

<p>Below are the most important excerpts of an example report from when I ran my Nextflow workflow on 1,222 breast cancer datasets from <a href="https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga">TCGA</a>:</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/nextflowreport_CPU.png" alt="Nextflow CPU consumption" /></p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/nextflowreport_memory.png" alt="Nextflow memory consumption" />
<img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/nextflowreport_time.png" alt="Nextflow time duration" /></p>

<p>On average a single task ran on <strong>6 threads</strong>, consumed <strong>8 GB of memory</strong> and ran for <strong>2:30 minutes</strong> - this is the rough framework of resources we will have to consider when allocating resources and choosing appropriate <code class="language-plaintext highlighter-rouge">EC2</code> instances.</p>
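<p>From those numbers you can do a quick back-of-the-envelope packing calculation before picking instance types. A sketch (the helper function and instance shapes are my assumptions, only loosely modeled on common EC2 sizes; the per-task requirements come from the report above):</p>

```python
def tasks_per_instance(vcpus, mem_gb, task_cpus=6, task_mem_gb=8):
    """How many tasks fit on one instance - whichever resource runs out first."""
    return min(vcpus // task_cpus, int(mem_gb // task_mem_gb))

# Assumed instance shapes (vCPUs, GB RAM):
for name, vcpus, mem in [("4xlarge", 16, 64), ("9xlarge", 36, 72), ("18xlarge", 72, 144)]:
    print(name, tasks_per_instance(vcpus, mem))  # 2, 6 and 12 tasks - CPU-bound here
```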

<h2 id="step-2-creating-a-suitable-ami">Step 2: Creating a suitable AMI</h2>

<p>I found the setup and configuration of suitable <code class="language-plaintext highlighter-rouge">AMIs</code> to be the most demanding step when creating an environment to run a pipeline on <code class="language-plaintext highlighter-rouge">AWS</code>. Several things have to be considered:</p>

<ul>
  <li>Base image: It has to be <code class="language-plaintext highlighter-rouge">ECS</code>-compatible</li>
  <li><code class="language-plaintext highlighter-rouge">EBS</code> storage: The attached volumes have to be large enough to contain all input, index, temporary and output files</li>
  <li><code class="language-plaintext highlighter-rouge">AWS CLI</code>: The <code class="language-plaintext highlighter-rouge">AMI</code> has to contain the <code class="language-plaintext highlighter-rouge">AWS CLI</code>, otherwise no files can be fetched from and copied to <code class="language-plaintext highlighter-rouge">S3</code> from the <code class="language-plaintext highlighter-rouge">EBS</code> volume</li>
  <li><code class="language-plaintext highlighter-rouge">AMIs</code> cannot be reused with less <code class="language-plaintext highlighter-rouge">EBS</code> storage than they were created with (more is possible)</li>
</ul>

<p>This section covers how you can set up your <code class="language-plaintext highlighter-rouge">AMI</code> for a given task of your pipeline and what to consider on the way.</p>

<h3 id="choose-an-amazon-machine-image-ami">Choose an Amazon Machine Image (AMI)</h3>

<p>As a first step, we want to make sure to pick a base image that supports <code class="language-plaintext highlighter-rouge">ECS</code> from the AWS Market Place. I strongly advise you to use one of the <code class="language-plaintext highlighter-rouge">Amazon ECS-Optimized Amazon Linux AMI</code> images.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/AMI-Choose-AMI.png" alt="Choose AMI" /></p>

<h3 id="choose-an-instance-type">Choose an Instance Type</h3>

<p>The <code class="language-plaintext highlighter-rouge">EC2</code> instance we want to use to create our custom <code class="language-plaintext highlighter-rouge">AMI</code> does not need to be powerful, since we won’t run any jobs on it. Therefore, a <code class="language-plaintext highlighter-rouge">t2.micro</code> instance is more than sufficient.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/AMI-Choose-Instance.png" alt="Choose Instance" /></p>

<h3 id="configure-instance-details">Configure Instance Details</h3>

<p>The instance configuration can mostly be left at the defaults. However, I would strongly advise you to set the shutdown behaviour to <code class="language-plaintext highlighter-rouge">terminate</code>, otherwise attached volumes will be kept persistent and you continue to pay unless you explicitly terminate the instance manually. I actually ran into huge costs ($300) when misconfiguring this, so <strong>watch out!</strong></p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/AMI-Configure-Instance.png" alt="Configure Instance" /></p>

<h3 id="add-storage">Add storage</h3>

<p>This is the single most important point of the entire <code class="language-plaintext highlighter-rouge">AMI</code> setup process - here you define the <strong>minimum</strong> amount of added storage for your <code class="language-plaintext highlighter-rouge">AMI</code>. This storage <strong>must</strong> be large enough to contain <strong>all</strong> input and index files for a given task as well as <strong>all</strong> temporary and final output files produced during the computation. I hope you did some thorough benchmarking and extrapolation of resources on your input dataset.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/AMI-Add-Storage.png" alt="Add Storage" /></p>

<h3 id="add-tags">Add tags</h3>

<p>Unless you want to add optional tags, nothing to do here…</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/AMI-Add-Tags.png" alt="Add Tags" /></p>

<h3 id="configure-security-group">Configure Security Group</h3>

<p>Before firing up your instance, you need to configure the associated security group. For me, letting AWS create the security group worked perfectly fine. I would still double-check that you can connect to the <code class="language-plaintext highlighter-rouge">EC2</code> instance - in case of doubt set the source to <code class="language-plaintext highlighter-rouge">0.0.0.0/0</code>, even though probably all IT security experts will kill me for that. Now you are ready to <strong>launch the instance</strong>.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/AMI-Security-Group.png" alt="Security Group" /></p>

<h3 id="ssh-connect-to-instance">SSH connect to instance</h3>

<p>Now right click and hit <em>Connect</em> to get your <code class="language-plaintext highlighter-rouge">ssh</code> connect command to your instance. You might have to change the default <code class="language-plaintext highlighter-rouge">root</code> user to <code class="language-plaintext highlighter-rouge">ec2-user</code>.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/AMI-SSH.png" alt="AMI SSH connect" /></p>

<h3 id="adjust-docker-container-size-to-ebs">Adjust Docker container size to EBS</h3>

<p>The first thing we want to check once connected to our instance is that the Docker configuration reflects the amount of added EBS storage.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>ec2-user@ip-172-31-40-128 ~]<span class="nv">$ </span>docker info | <span class="nb">grep</span> <span class="nt">-i</span> data
 Data Space Used: 309.3MB
 Data Space Total: 42.42GB
 Data Space Available: 42.11GB
 Metadata Space Used: 4.833MB
 Metadata Space Total: 46.14MB
 Metadata Space Available: 41.3MB
</code></pre></div></div>

<p>In the above example we see that Docker is indeed configured for the specified 40 GB EBS data volume.</p>

<p>By default, the maximum storage size of a single Docker container is 10 GB - independent of the data space available - so we have to adjust this.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>ec2-user@ip-172-31-40-128 ~]<span class="nv">$ </span>docker info | <span class="nb">grep</span> <span class="nt">-i</span> base
 Base Device Size: 10.74GB
</code></pre></div></div>

<p>To this end, we have to extend the options in <code class="language-plaintext highlighter-rouge">/etc/sysconfig/docker-storage</code> with the parameter <code class="language-plaintext highlighter-rouge">--storage-opt dm.basesize=40GB</code> and restart the Docker service.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>vi /etc/sysconfig/docker-storage
</code></pre></div></div>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">DOCKER_STORAGE_OPTIONS</span><span class="o">=</span><span class="s2">"--storage-driver devicemapper --storage-opt dm.thinpooldev=/dev/mapper/docker-docker--pool --storage-opt dm.use_deferred_removal=true --storage-opt dm.use_deferred_deletion=true --storage-opt dm.fs=ext4 --storage-opt dm.basesize=40GB"</span>
</code></pre></div></div>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>ec2-user@ip-172-31-40-128 ~]<span class="nv">$ </span><span class="nb">sudo </span>service docker restart
Stopping docker:                                           <span class="o">[</span>  OK  <span class="o">]</span>
Starting docker:       	<span class="nb">.</span>                                  <span class="o">[</span>  OK  <span class="o">]</span>
</code></pre></div></div>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>ec2-user@ip-172-31-40-128 ~]<span class="nv">$ </span>docker info | <span class="nb">grep</span> <span class="nt">-i</span> base
 Base Device Size: 42.95GB
</code></pre></div></div>
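<p>If you prefer a non-interactive edit over <code class="language-plaintext highlighter-rouge">vi</code>, a <code class="language-plaintext highlighter-rouge">sed</code> one-liner can append the option for you. This is only a sketch - it assumes the options line ends with a closing double quote as shown above, so back up the file and eyeball the result before restarting Docker:</p>

```shell
# Back up the storage config, then append dm.basesize before the closing quote
sudo cp /etc/sysconfig/docker-storage /etc/sysconfig/docker-storage.bak
sudo sed -i 's/"$/ --storage-opt dm.basesize=40GB"/' /etc/sysconfig/docker-storage
sudo service docker restart
```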

<h3 id="install-aws-cli">Install AWS CLI</h3>

<p><code class="language-plaintext highlighter-rouge">Nextflow</code> requires the <code class="language-plaintext highlighter-rouge">AWS CLI</code> to copy files such as input files and indices from and output files to <code class="language-plaintext highlighter-rouge">S3</code>.</p>

<p>Use the following lines to add it to your <code class="language-plaintext highlighter-rouge">AMI</code>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>yum <span class="nb">install</span> <span class="nt">-y</span> bzip2 wget
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh <span class="nt">-b</span> <span class="nt">-f</span> <span class="nt">-p</span> <span class="nv">$HOME</span>/miniconda
<span class="nv">$HOME</span>/miniconda/bin/conda <span class="nb">install</span> <span class="nt">-c</span> conda-forge <span class="nt">-y</span> awscli
<span class="nb">rm </span>Miniconda3-latest-Linux-x86_64.sh
</code></pre></div></div>

<p>Give it a quick spin to see whether everything is ok.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>ec2-user@ip-172-31-40-128 ~]<span class="nv">$ </span>./miniconda/bin/aws <span class="nt">--version</span>
aws-cli/1.16.121 Python/3.7.1 Linux/4.14.94-73.73.amzn1.x86_64 botocore/1.12.111
</code></pre></div></div>

<h3 id="save-your-ami">Save your AMI</h3>

<p>Now you can go back to your <code class="language-plaintext highlighter-rouge">EC2</code> instance dashboard and save your <code class="language-plaintext highlighter-rouge">AMI</code> by right clicking and going for <code class="language-plaintext highlighter-rouge">Image-&gt;Create Image</code>.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/AMI-Create-AMI.png" alt="Create AMI" /></p>

<p><strong>Congratulations</strong> you have created your first <code class="language-plaintext highlighter-rouge">AMI</code>!</p>

<p>Don’t forget to terminate the running <code class="language-plaintext highlighter-rouge">EC2</code> instance from which you created the <code class="language-plaintext highlighter-rouge">AMI</code> to prevent any ongoing <code class="language-plaintext highlighter-rouge">EBS</code> and <code class="language-plaintext highlighter-rouge">EC2</code> costs.</p>

<h2 id="step-3-creating-compute-environments-and-job-queues">Step 3: Creating compute environments and job queues</h2>

<p>Now it is time to create appropriate compute environments and their corresponding job queues. I usually like to create some baseline <em>workload</em> queue that should handle most of the jobs providing resources estimated from Step 1 and an <em>excess</em> queue with very extensive resources that handles the few jobs that overflow the <em>workload</em> resources, so that the entire batch is still successfully processed.</p>

<h3 id="overview">Overview</h3>

<p>First, we want to create a new compute environment upon which we can base job queues. For this, go to the <code class="language-plaintext highlighter-rouge">AWS Batch</code> dashboard -&gt; <code class="language-plaintext highlighter-rouge">Compute Environments</code>.</p>

<p>I have already created some production environments, for you this overview will probably be empty. Then go to <code class="language-plaintext highlighter-rouge">Create Environment</code>.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/ComputeEnvironment_Overview.png" alt="Compute environment overview" /></p>

<h3 id="naming-roles-and-permissions">Naming, roles and permissions</h3>

<p>First, we want a <code class="language-plaintext highlighter-rouge">managed</code> environment, so <code class="language-plaintext highlighter-rouge">AWS Batch</code> can do configuration and scaling for us. Next, we can name our compute environment. I chose to create the <code class="language-plaintext highlighter-rouge">workload</code> compute environment first, thus naming it <code class="language-plaintext highlighter-rouge">salmonWorkload</code>. Then we simply select the service and instance roles as well as the keypair we created earlier in the <code class="language-plaintext highlighter-rouge">prerequisite</code> section - there should be only one option to choose from.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/ComputeEnvironment_Names.png" alt="Compute environment naming" /></p>

<h3 id="some-words-on-instance-types-and-vcpu-limits">Some words on instance types and vCPU limits</h3>

<p>In my opinion, this part is <strong>the most crucial part</strong> of setting up an optimal environment both in terms of computation and cost efficiency. <strong>So pay special attention here!</strong></p>

<p>First of all, I hope you did a good enough job in Step 1 of estimating your resource requirements <strong>per task</strong>.</p>

<p>These are the key points you have to consider when fixing instance types and vCPU limits for your compute environment:</p>

<h4 id="fit-only-1-task-in-1-instance">Fit only <strong>1</strong> task in <strong>1</strong> instance!</h4>

<p>If you look at the instance pricing table, you will see that prices scale linearly with instance size - doubling the resources doubles the price. You will not save anything by running more jobs on a single larger instance, but you may well pay for it: in my experience, the Docker daemon sometimes gets confused and hangs when multiple tasks run on the same instance.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-architecture/EC2Instances.png" alt="EC2 instances" /></p>

<h4 id="vcpus-refers-to-the-total-number-of-vcpus-of-your-environments">vCPUs refers to the total number of vCPUs of your environments</h4>

<p>This also confused me when trying to figure out how many instances will be fired up in total. Essentially, you have to divide this number by the number of vCPUs provided by your instance type of choice, which gives you the number of instances launched at peak times.</p>

<p>So let’s say you chose <code class="language-plaintext highlighter-rouge">c5.2xlarge</code> as your instance type with 8 vCPUs and your specified <code class="language-plaintext highlighter-rouge">Maximum vCPUs</code> is 100, then 100 / 8 = 12.5, so at most 12 instances will be launched in total if the entire compute environment is utilized.</p>
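<p>The calculation can be sketched in shell arithmetic (the numbers are just the hypothetical ones from the example above):</p>

```shell
MAX_VCPUS=100          # "Maximum vCPUs" of the compute environment
VCPUS_PER_INSTANCE=8   # a c5.2xlarge provides 8 vCPUs

# Integer division - AWS Batch can only launch whole instances
echo $(( MAX_VCPUS / VCPUS_PER_INSTANCE ))   # prints 12
```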

<h4 id="keep-some-spare-memory-for-instance-services">Keep some spare memory for instance services</h4>

<p>I will address this in detail later, but keep in mind that not the entire memory listed in the instance type specification can be used, since some of it will be occupied with running basic instance services.</p>

<h4 id="keep-homogeneous-compute-environments">Keep homogeneous compute environments</h4>

<p>Since we did a careful resource requirement estimation, I find it easiest - both for keeping track of cost and for ensuring that the tasks actually finish - to have homogeneous compute environments, meaning one environment only allows one specific instance type.</p>

<h3 id="specifying-instance-types-and-vcpu-limits">Specifying instance types and vCPU limits</h3>

<p>Now let’s put it all together. First up, let’s quickly refresh the resource requirements we had per Salmon task:</p>

<ul>
  <li>We need an instance to provide 8 GB of memory to fit index + data</li>
  <li>If we run our tasks with 6 threads, each task takes roughly 2:30 minutes</li>
</ul>

<p>Now if we check the instance type table, we find there are actually two instance types that would cover these requirements:</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/ComputeEnvironment_InstanceResearch.png" alt="Potential instance types" /></p>

<p>The <code class="language-plaintext highlighter-rouge">c5.xlarge</code> comes with 8 GB of memory and 4 vCPUs, the <code class="language-plaintext highlighter-rouge">c5.2xlarge</code> with double the memory and vCPUs. So in principle, we could fit an average task into the smaller instance - but remember: first, the services running on the instance create some overhead that effectively reduces those 8 GB, and second, these are average requirements, so any task above average would fail on such an instance. Therefore, we should definitely go for a <code class="language-plaintext highlighter-rouge">c5.2xlarge</code> here.</p>

<ul>
  <li>Choose <code class="language-plaintext highlighter-rouge">c5.2xlarge</code> as your only instance type and delete <code class="language-plaintext highlighter-rouge">optimal</code></li>
  <li>Set <code class="language-plaintext highlighter-rouge">Minimum vCPUs</code> and <code class="language-plaintext highlighter-rouge">Desired vCPUs</code> both to 0 so that no idle instances run in the background</li>
  <li>Tick the <code class="language-plaintext highlighter-rouge">Enable user-specified Ami ID</code>, copy the <code class="language-plaintext highlighter-rouge">AMI ID</code> from the <code class="language-plaintext highlighter-rouge">AMI</code> we created and validate it</li>
</ul>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/ComputeEnvironment_Resources.png" alt="Compute environment resources" /></p>

<p>Everything else you can leave empty and click <code class="language-plaintext highlighter-rouge">Create</code>.</p>

<p>Congratulations, you have created your first compute environment!</p>

<h2 id="step-4-creating-job-queues">Step 4: Creating job queues</h2>

<p>Now we need to create a job queue and associate it with our compute environment. This step is actually pretty easy and straightforward.</p>

<p>First go to <code class="language-plaintext highlighter-rouge">Job queues</code> and click <code class="language-plaintext highlighter-rouge">Create Queue</code>.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/JobQueue_Overview.png" alt="Job queue overview" /></p>

<p>Now you can pick a name for your job queue - in our simple case I give it the same name as our compute environment, <code class="language-plaintext highlighter-rouge">salmonWorkload</code>. You can in principle assign multiple job queues to one compute environment and set priorities via the <code class="language-plaintext highlighter-rouge">Priority</code> field, but we can simply put <code class="language-plaintext highlighter-rouge">1</code> in there.</p>

<p>Finally, associate the job queue with our <code class="language-plaintext highlighter-rouge">salmonWorkload</code> compute environment. Note again that you can in principle assign multiple compute environments to a given job queue.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/JobQueue_Create.png" alt="Job queue creation" /></p>

<p>That’s it - click <code class="language-plaintext highlighter-rouge">Create job queue</code> and you have successfully created your first job queue!</p>

<h3 id="excess-queue">Excess queue</h3>

<p>Now that we have our workload compute environment and job queue, we want to do the same for our excess compute environment and job queue to handle any datasets with overshooting resource requirements.</p>

<p>Therefore, we repeat the steps starting from Step 3 to create a <code class="language-plaintext highlighter-rouge">salmonExcess</code> compute environment and job queue based on <code class="language-plaintext highlighter-rouge">c5.4xlarge</code> instances with double the resources compared to our <code class="language-plaintext highlighter-rouge">salmonWorkload</code> queue.</p>

<p>This should leave you with the following compute environments and job queues - finally ready to specify our resource constraints before submitting our first jobs.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/TwoQueue_environments.png" alt="Two queue environments" /></p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/TwoQueue_jobqueues.png" alt="Two queue job queues" /></p>

<h2 id="step-5-adjusting-resources">Step 5: Adjusting resources</h2>

<p>OK, now that we have set up all the compute environments with associated instance types as well as job queues on the <code class="language-plaintext highlighter-rouge">AWS</code> end, we know what resources we have available and how much of them our tasks will consume.</p>

<h3 id="resource-definition">Resource definition</h3>

<p>So naïvely we can directly enter the specifications of our <code class="language-plaintext highlighter-rouge">EC2</code> instance type of choice in the <code class="language-plaintext highlighter-rouge">awsbatch.config</code> file of our <code class="language-plaintext highlighter-rouge">salmon-nf</code> Nextflow workflow, since we know the <code class="language-plaintext highlighter-rouge">salmonWorkload</code> queue consists of <code class="language-plaintext highlighter-rouge">c5.2xlarge</code> instances with 16 GB memory and 8 vCPUs each and our <code class="language-plaintext highlighter-rouge">salmonExcess</code> queue of <code class="language-plaintext highlighter-rouge">c5.4xlarge</code> instances with 32 GB memory and 16 vCPUs each.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">aws</span><span class="o">.</span><span class="na">region</span> <span class="o">=</span> <span class="err">'</span><span class="n">eu</span><span class="o">-</span><span class="n">central</span><span class="o">-</span><span class="mi">1</span><span class="err">'</span>
<span class="n">aws</span><span class="o">.</span><span class="na">client</span><span class="o">.</span><span class="na">storageEncryption</span> <span class="o">=</span> <span class="err">'</span><span class="no">AES256</span><span class="err">'</span>
<span class="n">executor</span><span class="o">.</span><span class="na">name</span> <span class="o">=</span> <span class="err">'</span><span class="n">awsbatch</span><span class="err">'</span>
<span class="n">executor</span><span class="o">.</span><span class="na">awscli</span> <span class="o">=</span> <span class="err">'</span><span class="o">/</span><span class="n">home</span><span class="o">/</span><span class="n">ec2</span><span class="o">-</span><span class="n">user</span><span class="o">/</span><span class="n">miniconda</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">aws</span><span class="err">'</span>

<span class="n">process</span> <span class="o">{</span>
  <span class="n">queue</span> <span class="o">=</span> <span class="o">{</span> <span class="n">task</span><span class="o">.</span><span class="na">attempt</span> <span class="o">&gt;</span> <span class="mi">1</span> <span class="o">?</span> <span class="err">'</span><span class="n">salmonExcess</span><span class="err">'</span> <span class="o">:</span> <span class="err">'</span><span class="n">salmonWorkload</span><span class="err">'</span> <span class="o">}</span>
	<span class="n">memory</span> <span class="o">=</span> <span class="o">{</span> <span class="n">task</span><span class="o">.</span><span class="na">attempt</span> <span class="o">&gt;</span> <span class="mi">1</span> <span class="o">?</span> <span class="mi">32</span><span class="o">.</span><span class="na">GB</span> <span class="o">:</span> <span class="mi">16</span><span class="o">.</span><span class="na">GB</span> <span class="o">}</span>
	<span class="n">cpus</span> <span class="o">=</span> <span class="o">{</span> <span class="n">task</span><span class="o">.</span><span class="na">attempt</span> <span class="o">&gt;</span> <span class="mi">1</span> <span class="o">?</span> <span class="mi">16</span> <span class="o">:</span> <span class="mi">8</span> <span class="o">}</span>
<span class="o">}</span>

<span class="n">params</span> <span class="o">{</span>

   <span class="n">salmonIndex</span> <span class="o">=</span> <span class="err">'</span><span class="nl">s3:</span><span class="c1">//obenauflab/indices/salmon/gencode.v28.IMPACT'</span>

<span class="o">}</span>
</code></pre></div></div>
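<p>One caveat with the <code class="language-plaintext highlighter-rouge">task.attempt</code> switch: it only kicks in if failed tasks are actually retried. <code class="language-plaintext highlighter-rouge">salmon-nf</code> presumably handles this in its own configuration, but if you adapt this pattern for your own workflow, the <code class="language-plaintext highlighter-rouge">process</code> scope needs a retry error strategy along these lines (a sketch, not the author’s exact settings):</p>

```groovy
process {
    // Retry failed tasks so task.attempt > 1 can route them to the excess queue
    errorStrategy = { task.attempt < 3 ? 'retry' : 'terminate' }
    maxRetries = 2
}
```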

<p>Now let’s quickly fast-forward and see what happens if we submit our jobs like this.</p>

<p>You will notice that we have one runnable job for each task, yet no instances will fire up.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/Resources_Overflow.png" alt="Resource overflow" /></p>

<p>If we check one of the jobs, we will see that the environment requirements have been exactly set up as we specified in our Nextflow config which is also matched by the instance types of our job queue - so why does this not work?</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/Resources_OverflowJob.png" alt="Job overflow" /></p>

<h3 id="ecs-overhead-extraction">ECS overhead extraction</h3>

<p>The reason is that there are <strong>overhead container services</strong> running on your instance which consume a chunk of your total available memory. So when you ask for X GB of memory on an instance with X GB total memory, you have to be aware that Y GB is already occupied by service tasks, so your effectively available memory will be X-Y.</p>

<p>To get your jobs running on such instances, you therefore cannot request X GB of memory, but only the X-Y chunk. So how do we determine Y?</p>

<p>Let’s first fire up an instance of our compute environment by simply selecting our compute environment and clicking on <code class="language-plaintext highlighter-rouge">Edit</code>.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/Resources_edit.png" alt="Edit compute environment" /></p>

<p>Now we set both minimum and desired vCPUs to 1 to fire up one instance of the compute environment and hit <code class="language-plaintext highlighter-rouge">Save</code>.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/Resources_vCPUs.png" alt="Select 1 vCPU" /></p>

<p>Wait a couple of minutes to let the <code class="language-plaintext highlighter-rouge">EC2</code> instance fire up, then again click on your compute environment. Follow the link given in <code class="language-plaintext highlighter-rouge">ECS Cluster name</code>.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/Resources_ECS.png" alt="Follow ECS" /></p>

<p>This will bring you to the cluster overview page, where you need to click on <code class="language-plaintext highlighter-rouge">ECS instances</code>.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/Resources_Cluster.png" alt="ECS overview" /></p>

<p>Now finally we get what we want - the actual amount of memory available on a given instance on this <code class="language-plaintext highlighter-rouge">ECS</code> cluster.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/Resources_ActualMemory.png" alt="Factual available memory" /></p>

<p>According to the ECS tab, we have <strong>15,434 MB</strong> memory available on our <code class="language-plaintext highlighter-rouge">salmonWorkload</code> queue - repeat the same procedure to get the numbers for our <code class="language-plaintext highlighter-rouge">salmonExcess</code> queue.</p>
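<p>With these numbers you can back out the overhead Y and pick a safe memory request. A small sketch with the values above (the <code class="language-plaintext highlighter-rouge">c5.2xlarge</code> nominal 16 GiB is 16,384 MB):</p>

```shell
TOTAL_MB=16384      # nominal c5.2xlarge memory (16 GiB)
AVAILABLE_MB=15434  # registered memory shown in the ECS instances tab

# Overhead Y occupied by the ECS agent and other instance services
echo "overhead: $(( TOTAL_MB - AVAILABLE_MB )) MB"    # prints: overhead: 950 MB

# Request slightly below the registered amount, rounded down for safety
echo "request: $(( AVAILABLE_MB / 100 * 100 )) MB"    # prints: request: 15400 MB
```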

<h3 id="updated-resource-definition">Updated resource definition</h3>

<p>Having obtained the mysterious actual available memory X-Y on our <code class="language-plaintext highlighter-rouge">EC2</code> instances of our compute environment, we can finally enter the final numbers in our <code class="language-plaintext highlighter-rouge">awsbatch.config</code> definition of our <code class="language-plaintext highlighter-rouge">salmon-nf</code> Nextflow pipeline.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">aws</span><span class="o">.</span><span class="na">region</span> <span class="o">=</span> <span class="err">'</span><span class="n">eu</span><span class="o">-</span><span class="n">central</span><span class="o">-</span><span class="mi">1</span><span class="err">'</span>
<span class="n">aws</span><span class="o">.</span><span class="na">client</span><span class="o">.</span><span class="na">storageEncryption</span> <span class="o">=</span> <span class="err">'</span><span class="no">AES256</span><span class="err">'</span>
<span class="n">executor</span><span class="o">.</span><span class="na">name</span> <span class="o">=</span> <span class="err">'</span><span class="n">awsbatch</span><span class="err">'</span>
<span class="n">executor</span><span class="o">.</span><span class="na">awscli</span> <span class="o">=</span> <span class="err">'</span><span class="o">/</span><span class="n">home</span><span class="o">/</span><span class="n">ec2</span><span class="o">-</span><span class="n">user</span><span class="o">/</span><span class="n">miniconda</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">aws</span><span class="err">'</span>

<span class="n">process</span> <span class="o">{</span>

	<span class="n">queue</span> <span class="o">=</span> <span class="o">{</span> <span class="n">task</span><span class="o">.</span><span class="na">attempt</span> <span class="o">&gt;</span> <span class="mi">1</span> <span class="o">?</span> <span class="err">'</span><span class="n">salmonExcess</span><span class="err">'</span> <span class="o">:</span> <span class="err">'</span><span class="n">salmonWorkload</span><span class="err">'</span> <span class="o">}</span>
	<span class="n">memory</span> <span class="o">=</span> <span class="o">{</span> <span class="n">task</span><span class="o">.</span><span class="na">attempt</span> <span class="o">&gt;</span> <span class="mi">1</span> <span class="o">?</span> <span class="mi">31100</span><span class="o">.</span><span class="na">MB</span> <span class="o">:</span> <span class="mi">15400</span><span class="o">.</span><span class="na">MB</span> <span class="o">}</span>
	<span class="n">cpus</span> <span class="o">=</span> <span class="o">{</span> <span class="n">task</span><span class="o">.</span><span class="na">attempt</span> <span class="o">&gt;</span> <span class="mi">1</span> <span class="o">?</span> <span class="mi">16</span> <span class="o">:</span> <span class="mi">8</span> <span class="o">}</span>
<span class="o">}</span>

<span class="n">params</span> <span class="o">{</span>

   <span class="n">salmonIndex</span> <span class="o">=</span> <span class="err">'</span><span class="nl">s3:</span><span class="c1">//obenauflab/indices/salmon/gencode.v28.IMPACT'</span>

<span class="o">}</span>
</code></pre></div></div>

<p>Finally, we are ready to test-drive our <code class="language-plaintext highlighter-rouge">salmon-nf</code> Nextflow pipeline on our AWS job queue!</p>

<h2 id="step-6-running-jobs-with-aws-batch">Step 6: Running jobs with AWS Batch</h2>

<p>Alright, now things are getting serious - just a little more preparation is needed to finally run our <code class="language-plaintext highlighter-rouge">salmon-nf</code> Nextflow pipeline on <code class="language-plaintext highlighter-rouge">AWS</code>:</p>

<ul>
  <li>Upload our index file to <code class="language-plaintext highlighter-rouge">s3</code></li>
  <li>Upload our input <code class="language-plaintext highlighter-rouge">fastq</code> files to <code class="language-plaintext highlighter-rouge">s3</code></li>
  <li>Launch a submission <code class="language-plaintext highlighter-rouge">EC2</code> instance for running our <code class="language-plaintext highlighter-rouge">salmon-nf</code> Nextflow pipeline</li>
  <li>Enter credentials</li>
  <li>Go!</li>
</ul>

<h3 id="upload-files-to-s3">Upload files to <code class="language-plaintext highlighter-rouge">s3</code></h3>

<p>To upload files to <code class="language-plaintext highlighter-rouge">s3</code>, I recommend you to use the <a href="https://aws.amazon.com/cli/">AWS CLI</a>.</p>

<p>For installation, just follow the instructions. Afterwards, it is important to expose the <code class="language-plaintext highlighter-rouge">AWS credentials</code> you obtained when creating your <code class="language-plaintext highlighter-rouge">IAM</code> user to Nextflow, which can be done in <a href="https://www.nextflow.io/docs/latest/awscloud.html#aws-credentials">2 ways</a>:</p>

<ol>
  <li>Exporting the default <code class="language-plaintext highlighter-rouge">AWS</code> environment variables</li>
</ol>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">AWS_DEFAULT_REGION</span><span class="o">=</span>&lt;REGION IDENTIFIER&gt;
<span class="nb">export </span><span class="nv">AWS_ACCESS_KEY_ID</span><span class="o">=</span>&lt;YOUR S3 ACCESS KEY&gt;
<span class="nb">export </span><span class="nv">AWS_SECRET_ACCESS_KEY</span><span class="o">=</span>&lt;YOUR S3 SECRET KEY&gt;
</code></pre></div></div>

<ol start="2">
  <li>Specify your credentials in the Nextflow configuration file</li>
</ol>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">aws</span> <span class="o">{</span>
    <span class="n">region</span> <span class="o">=</span> <span class="err">'</span><span class="o">&lt;</span><span class="no">REGION</span> <span class="no">IDENTIFIER</span><span class="o">&gt;</span><span class="err">'</span>
    <span class="n">accessKey</span> <span class="o">=</span> <span class="err">'</span><span class="o">&lt;</span><span class="no">YOUR</span> <span class="no">S3</span> <span class="no">ACCESS</span> <span class="no">KEY</span><span class="o">&gt;</span><span class="err">'</span>
    <span class="n">secretKey</span> <span class="o">=</span> <span class="err">'</span><span class="o">&lt;</span><span class="no">YOUR</span> <span class="no">S3</span> <span class="no">SECRET</span> <span class="no">KEY</span><span class="o">&gt;</span><span class="err">'</span>
<span class="o">}</span>
</code></pre></div></div>

<p>I personally prefer option 1 to not accidentally commit and push any of my credentials to my Nextflow Github repo.</p>
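<p>One way to stick with option 1 without retyping the exports each session is to keep them in a small file outside any git repository and source it. A sketch - the file name is arbitrary and the values below are obviously placeholders, not real credentials:</p>

```shell
# Store the exports in a file in your home directory, outside the repo
cat > "$HOME/.aws_batch_env" <<'EOF'
export AWS_DEFAULT_REGION=eu-central-1
export AWS_ACCESS_KEY_ID=AKIAEXAMPLE
export AWS_SECRET_ACCESS_KEY=examplesecretkey
EOF
chmod 600 "$HOME/.aws_batch_env"   # readable only by you

# Source it before launching Nextflow
. "$HOME/.aws_batch_env"
echo "$AWS_DEFAULT_REGION"   # prints eu-central-1
```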

<p>Now we can upload our fastq files to our target destination in our <code class="language-plaintext highlighter-rouge">s3</code> bucket, assuming you are in the directory where your <code class="language-plaintext highlighter-rouge">fastq</code> files are stored:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aws s3 <span class="nb">cp</span> <span class="nb">.</span> s3://obenauflab/fastq <span class="nt">--recursive</span> <span class="nt">--exclude</span> <span class="s2">"*"</span> <span class="nt">--include</span> <span class="s2">"*.fq.gz"</span>
</code></pre></div></div>

<p>Repeat the same with your index files to your <code class="language-plaintext highlighter-rouge">s3</code> bucket destination, and now all files we need for running <code class="language-plaintext highlighter-rouge">salmon-nf</code> are ready. You can view them via numerous clients - I used <a href="https://cyberduck.io/">Cyberduck</a> for Mac. Below you will see that my 40 test samples and index files have been uploaded to the appropriate locations in my <code class="language-plaintext highlighter-rouge">s3</code> bucket.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/S3_fastqs.png" alt="S3 fastq file location" /></p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/S3_index.png" alt="S3 index file location" /></p>

<h3 id="launch-and-prepare-your-submission-instance">Launch and prepare your submission instance</h3>

<p>Finally, we need some machine to run our Nextflow master process that submits jobs to the <code class="language-plaintext highlighter-rouge">AWS Batch</code> queues. You can of course do this locally on your machine or as a long-running job in your HPC environment.</p>

<p>But for heavy, long-running workloads it definitely makes sense to have a dedicated instance to run the Nextflow process on, to avoid running into trouble.</p>

<p>Fortunately, we only need a very minimal <code class="language-plaintext highlighter-rouge">EC2</code> instance for this, which is available from <code class="language-plaintext highlighter-rouge">AWS</code> under the so-called <code class="language-plaintext highlighter-rouge">Free Tier</code> - meaning it’s free, yay!</p>

<p>So this is what we will do - first go to your <code class="language-plaintext highlighter-rouge">EC2</code> dashboard and select <code class="language-plaintext highlighter-rouge">Launch Instance</code>.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/Launch_EC2Dashboard.png" alt="EC2 Dashboard" /></p>

<p>Next up, we have to select the <code class="language-plaintext highlighter-rouge">AMI</code> we want to run on our instance. I have already precreated a <code class="language-plaintext highlighter-rouge">Nextflow AMI</code>, which is simply an <code class="language-plaintext highlighter-rouge">AMI</code> created as in Step 2, where I additionally installed <a href="http://www.oracle.com/technetwork/java/javase/downloads/index.html">Java 8</a> and <a href="https://www.nextflow.io/docs/latest/getstarted.html#installation">Nextflow</a>.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/Launch_NextflowAMI.png" alt="Nextflow AMI" /></p>

<p>For the instance type, make sure to select something labeled as <code class="language-plaintext highlighter-rouge">Free Tier eligible</code> so that this instance does not incur any costs, e.g. <code class="language-plaintext highlighter-rouge">t2.micro</code> in the example below. Then just hit <code class="language-plaintext highlighter-rouge">Review and Launch</code> and then <code class="language-plaintext highlighter-rouge">Launch</code> the instance.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/Launch_EC2Instance.png" alt="Nextflow EC2 instance" /></p>

<p>Make sure to launch it with a keypair that you have also downloaded, otherwise you will be unable to connect to the instance.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/Launch_KeyPair.png" alt="Nextflow keypair" /></p>

<p>Also, give your master instance a name, since many more instances will be launched once we fire up our <code class="language-plaintext highlighter-rouge">salmon-nf</code> Nextflow pipeline on our <code class="language-plaintext highlighter-rouge">AWS Batch</code> compute environment.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/Launch_Name.png" alt="Nextflow EC2 naming" /></p>

<p>Finally, connect to the instance as shown in Step 2. Now we can pull our <code class="language-plaintext highlighter-rouge">salmon-nf</code> Nextflow pipeline.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>ec2-user@ip-172-31-38-222 ~]<span class="nv">$ </span>nextflow pull t-neumann/salmon-nf
Checking t-neumann/salmon-nf ...
 downloaded from https://github.com/t-neumann/salmon-nf.git - revision: 6ac6e6a15a <span class="o">[</span>master]
<span class="o">[</span>ec2-user@ip-172-31-38-222 ~]<span class="err">$</span>
</code></pre></div></div>

<p>Next up, once again don’t forget to export your <code class="language-plaintext highlighter-rouge">AWS</code> credentials.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>ec2-user@ip-172-31-38-222 ~]<span class="nv">$ </span><span class="nb">export </span><span class="nv">AWS_DEFAULT_REGION</span><span class="o">=</span>&lt;REGION IDENTIFIER&gt;
<span class="o">[</span>ec2-user@ip-172-31-38-222 ~]<span class="nv">$ </span><span class="nb">export </span><span class="nv">AWS_ACCESS_KEY_ID</span><span class="o">=</span>&lt;YOUR S3 ACCESS KEY&gt;
<span class="o">[</span>ec2-user@ip-172-31-38-222 ~]<span class="nv">$ </span><span class="nb">export </span><span class="nv">AWS_SECRET_ACCESS_KEY</span><span class="o">=</span>&lt;YOUR S3 SECRET KEY&gt;
</code></pre></div></div>
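
<p>Alternatively - and a little less error-prone than exporting them in every new shell session - you can let Nextflow read the credentials from its own configuration via the documented <code class="language-plaintext highlighter-rouge">aws</code> config scope. A minimal sketch (values are placeholders):</p>

<div class="language-groovy highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// ~/.nextflow/config
aws {
    accessKey = '&lt;YOUR S3 ACCESS KEY&gt;'
    secretKey = '&lt;YOUR S3 SECRET KEY&gt;'
    region    = '&lt;REGION IDENTIFIER&gt;'
}
</code></pre></div></div>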

<p>Now there is only <strong>one last crucial</strong> step before we can actually launch our jobs on the <code class="language-plaintext highlighter-rouge">AWS Batch</code> queue: we have to create <a href="https://docs.aws.amazon.com/batch/latest/userguide/job_definitions.html">job definitions</a>. Luckily for us, Nextflow will <a href="https://www.nextflow.io/docs/latest/awscloud.html#custom-job-definition">automatically create job definitions</a> upon the first launch of a pipeline.</p>

<p>However, I found that job definitions are only created properly if the initial run contains very few samples. So <strong>always do your initial run on a SINGLE SAMPLE!</strong> If you don’t, your Nextflow submission will get stuck at the following step:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>ec2-user@ip-172-31-38-222 ~]<span class="nv">$ </span>nextflow run t-neumann/salmon-nf <span class="nt">--inputDir</span> s3://obenauflab/fastq <span class="nt">--outputDir</span> s3://obenauflab/salmon <span class="nt">-profile</span> awsbatch <span class="nt">-w</span> s3://obenauflab/work/salmon
N E X T F L O W  ~  version 18.10.1
Launching <span class="sb">`</span>t-neumann/salmon-nf<span class="sb">`</span> <span class="o">[</span>silly_mccarthy] - revision: 6ac6e6a15a <span class="o">[</span>master]

 parameters
 <span class="o">======================</span>
 input directory          : s3://obenauflab/fastq
 output directory         : s3://obenauflab/salmon
 <span class="o">======================</span>

<span class="o">[</span>warm up] executor <span class="o">&gt;</span> awsbatch
</code></pre></div></div>

<p>From there on, you will wait forever wondering what is going on, as happened to me.</p>
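
<p>For reference, the <code class="language-plaintext highlighter-rouge">awsbatch</code> profile that ties a pipeline to <code class="language-plaintext highlighter-rouge">AWS Batch</code> is just a few lines of Nextflow configuration. A rough sketch of such a profile (the queue name is a placeholder - the actual profile ships with the <code class="language-plaintext highlighter-rouge">salmon-nf</code> repository):</p>

<div class="language-groovy highlighter-rouge"><div class="highlight"><pre class="highlight"><code>profiles {
    awsbatch {
        process.executor = 'awsbatch'
        process.queue    = '&lt;YOUR BATCH QUEUE&gt;'
        aws.region       = '&lt;REGION IDENTIFIER&gt;'
    }
}
</code></pre></div></div>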

<h3 id="start-your-nextflow-run-on-aws-batch">Start your Nextflow run on AWS Batch</h3>

<p>Now the last and most rewarding step of all - you are finally ready to launch the <code class="language-plaintext highlighter-rouge">salmon-nf</code> Nextflow pipeline on <code class="language-plaintext highlighter-rouge">AWS</code>!</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>ec2-user@ip-172-31-38-222 ~]<span class="nv">$ </span>nextflow run t-neumann/salmon-nf <span class="nt">--inputDir</span> s3://obenauflab/fastq <span class="nt">--outputDir</span> s3://obenauflab/salmon <span class="nt">-profile</span> awsbatch <span class="nt">-w</span> s3://obenauflab/work/salmon
</code></pre></div></div>

<p>Notice how both the <code class="language-plaintext highlighter-rouge">inputDir</code> and <code class="language-plaintext highlighter-rouge">outputDir</code> point to an <code class="language-plaintext highlighter-rouge">s3</code> directory, and how we also have to supply a <code class="language-plaintext highlighter-rouge">work directory</code> on <code class="language-plaintext highlighter-rouge">s3</code> with <code class="language-plaintext highlighter-rouge">-w</code>. Now hit <code class="language-plaintext highlighter-rouge">Enter</code> and watch the beauty unfold on <code class="language-plaintext highlighter-rouge">AWS</code>.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>ec2-user@ip-172-31-38-222 ~]<span class="nv">$ </span>nextflow run t-neumann/salmon-nf <span class="nt">--inputDir</span> s3://obenauflab/fastq <span class="nt">--outputDir</span> s3://obenauflab/salmon <span class="nt">-profile</span> awsbatch <span class="nt">-w</span> s3://obenauflab/work/salmon
N E X T F L O W  ~  version 18.10.1
Launching <span class="sb">`</span>t-neumann/salmon-nf<span class="sb">`</span> <span class="o">[</span>silly_mccarthy] - revision: 6ac6e6a15a <span class="o">[</span>master]

 parameters
 <span class="o">======================</span>
 input directory          : s3://obenauflab/fastq
 output directory         : s3://obenauflab/salmon
 <span class="o">======================</span>

<span class="o">[</span>warm up] executor <span class="o">&gt;</span> awsbatch
<span class="o">[</span>4a/72c0f7] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>d1ada222-b67f-47c0-b380-091eaab093b4_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>f2/f8d97a] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>e46e4f3a-62f8-4bd1-a143-f384e219d6af_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>90/35eb4d] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>1672de07-77db-4817-9c7f-f201c25e8132_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>81/c47fe3] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>741fbacf-3694-46ef-b16f-66bac6ee0452_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>f1/bc3afc] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>db18dd75-3b48-4c21-aa68-58b1cf37c8c2_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>a8/88095d] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>0ac6634e-00b0-4107-a5d6-db8ffc602645_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>a6/36e366] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>9fa785f2-1dcb-4966-a5fa-fe75d327cb81_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>7d/5ae2b0] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>5b3c329a-aa14-4965-8d13-f508f4390eaf_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>d9/3ec3fc] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>6cf08e2b-7e59-4537-b1c3-1c5b3838ab95_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>19/d7d441] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>9c714c63-ee50-4385-9e25-09f940f5f902_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>71/ff40cf] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>17686cd5-271a-4e24-9746-f93334fb86b5_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>66/aaa185] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>0399ad16-816f-4824-ae28-7b82e006e7b7_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>67/ccd647] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>1916abcd-61c0-4f23-96ac-be70aacb8dc1_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>7d/0a090b] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>e1a4167d-b4ca-405c-8550-cc32bb1b1d09_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>3b/a9972e] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>876a9725-34c1-4a23-a3fe-58a860d0f0c5_gdc_realn_rehead<span class="o">)</span>
</code></pre></div></div>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/AWSBatch_Dashboard.png" alt="AWS Batch dashboard" /></p>

<p>Note how <code class="language-plaintext highlighter-rouge">AWS Batch</code> automatically scales up the desired number of vCPUs of your compute environment once the jobs are submitted.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/AWSBatch_MultiInstances.png" alt="AWS Batch EC2 instances" /></p>

<p>Watch in awe how <code class="language-plaintext highlighter-rouge">AWS Batch</code> fires up multiple <code class="language-plaintext highlighter-rouge">EC2</code> instances automatically in your <code class="language-plaintext highlighter-rouge">EC2</code> dashboard.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/AWSBatch_JobTransition.png" alt="AWS Batch Job transition" /></p>

<p>Watch how jobs transition from <code class="language-plaintext highlighter-rouge">Runnable</code> to <code class="language-plaintext highlighter-rouge">Starting</code> to <code class="language-plaintext highlighter-rouge">Running</code> to <code class="language-plaintext highlighter-rouge">Succeeded</code> state until all your samples have been processed.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>47/c580b5] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>2864cbe8-4d77-4477-ac84-791004e42237_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>8c/84bc14] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>0fdb3d0e-e405-4e8d-8897-4a90ea4fe00c_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>1d/3f6ec6] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>7ed99d57-f199-4dac-87a8-62393f5e0aea_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>a9/330e5d] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>825daddc-a89a-483b-947e-74cc12ba013c_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>98/33bed5] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>c3588f96-95c6-4008-bda2-502ceb963adb_gdc_realn_rehead<span class="o">)</span>

t-neumann/salmon-nf has finished.
Status:   SUCCESS
Time:     Sun Aug 25 11:20:13 UTC 2019
Duration: 10m 22s

<span class="o">[</span>ec2-user@ip-172-31-38-222 ~]<span class="err">$</span>
</code></pre></div></div>

<p>Now let’s check whether the results were produced in the correct <code class="language-plaintext highlighter-rouge">s3</code> output directory.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/AWSBatch_Success.png" alt="AWS Batch Job success" /></p>
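
<p>You can run the same check from the command line of the submission instance (assuming the AWS CLI is installed there; bucket and prefix as used above):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># list everything the pipeline published to the output directory
aws s3 ls s3://obenauflab/salmon/ --recursive
</code></pre></div></div>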

<p>Congratulations! You did it! It took a long time, the setup was quite tedious, and numerous steps were frustrating for me - but with amazing help from the community and Boehringer-Ingelheim, plus quite some trial-and-error, I got it to work, and hopefully so did you, with much less hassle!</p>

<p>Happy pipeline building and number crunching with <code class="language-plaintext highlighter-rouge">AWS</code> and Nextflow!</p>]]></content><author><name>Tobias Neumann</name></author><category term="Pipelines" /><category term="AMI" /><category term="AWS" /><category term="Containers" /><category term="Docker" /><category term="Nextflow" /><summary type="html"><![CDATA[Setting up and running a pipeline on AWS]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://t-neumann.github.io/assets/images/categories/aws.svg" /><media:content medium="image" url="https://t-neumann.github.io/assets/images/categories/aws.svg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Slamdunk paper</title><link href="https://t-neumann.github.io/pipelines/Slamdunk/" rel="alternate" type="text/html" title="Slamdunk paper" /><published>2019-06-28T13:42:00+02:00</published><updated>2019-06-28T13:42:00+02:00</updated><id>https://t-neumann.github.io/pipelines/Slamdunk</id><content type="html" xml:base="https://t-neumann.github.io/pipelines/Slamdunk/"><![CDATA[<p>For the past couple of years I was involved in the development of <a href="http://doi.org/10.1038/nmeth.4435">SLAMseq</a>, a sequencing technology for time-resolved measurement of newly synthesized and existing RNA in cultured cells. Originally developed by the lab of <a href="https://www.imba.oeaw.ac.at/research/stefan-ameres/">Stefan Ameres</a>, the lab of my boss <a href="https://www.imp.ac.at/groups/johannes-zuber/">Johannes Zuber</a> extended the approach with pharmacological and chemical-genetic perturbations in order to identify direct transcriptional targets of any gene or pathway (<a href="http://doi.org/10.1126/science.aao2793">Muhar et al, Science 2018</a>).</p>

<p>Processing and interpreting this data required novel analysis methods, so I was given the opportunity to team up with a good friend of mine - <a href="https://github.com/philres">Philipp Rescheneder</a> - to develop <a href="https://t-neumann.github.io/slamdunk/">Slamdunk</a> which we recently published in <a href="http://doi.org/10.1186/s12859-019-2849-7">BMC Bioinformatics</a> and is generally applicable to any nucleotide-conversion containing dataset.</p>

<p>This post will quickly highlight the main functionality, findings and features.</p>

<h2 id="slamdunk-workflow">Slamdunk workflow</h2>

<p><img src="https://t-neumann.github.io/assets/images/posts/Slamdunk/slamdunk_outline.png" alt="Slamdunk outline" /></p>

<p>Slamdunk differs from naive read processing in 4 ways:</p>

<ul>
  <li>It maps with a nucleotide-conversion aware scoring scheme since in the example of SLAMseq data, T&gt;C mismatches are expected and identify reads from labelled transcripts</li>
  <li>Since QuantSeq processes smaller, more repetitive regions of transcripts - namely the 3’ ends - Slamdunk cannot simply discard all multimappers, but utilizes a strategy to recover them</li>
  <li>Genuine T&gt;C SNPs would contribute greatly to false-positive conversion-quantifications and have to be excluded during the quantification step</li>
  <li>Depending on coverage and T-content in the 3’ end, observing T&gt;C reads will have a different likelihood which has to be corrected for during conversion quantification</li>
</ul>
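
<p>To make this concrete, a typical invocation covering all of these steps looks roughly like the following (flag names taken from the Slamdunk documentation; file paths are placeholders):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># map, filter, call SNPs and quantify conversions in one go
slamdunk all -r genome.fa -b 3utrs.bed -o results/ sample1.fastq.gz
</code></pre></div></div>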

<h2 id="features">Features</h2>

<h3 id="conversion-aware-mapping">Conversion-aware mapping</h3>

<p><img src="https://t-neumann.github.io/assets/images/posts/Slamdunk/slamdunk_mapping.png" alt="Slamdunk mapping" /></p>

<p>Slamdunk utilizes a conversion-aware scoring scheme implemented in the mapper <a href="http://cibiv.github.io/NextGenMap/">NextGenMap</a>.
Using this scoring scheme, we could demonstrate the following:</p>

<ul>
  <li>We can map reads independent of the inherent conversion-rates in the respective datasets (see top Figure a)</li>
  <li>With commonly observed conversion rates (0-7%), we consistently map &gt;90% of reads at 100-150bp read length and &gt;80% of reads at the shorter 50bp read length.</li>
</ul>

<h3 id="multimapper-recovery">Multimapper recovery</h3>

<p>We devised a multimapper recovery strategy to deal with repetitive 3’ UTR regions of transcripts. To this end, multimapping reads that still map uniquely to annotated 3’ UTRs are recovered and only reads with alignments to several annotated 3’ UTRs are discarded.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Slamdunk/multimappers.png" alt="Multimapper recovery strategy" /></p>

<p>Using this strategy, we are able to recover valuable signal in genes with 3’ UTRs with low mappability and increase overall correlation of QuantSeq datasets to corresponding RNA-seq datasets.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Slamdunk/rnaseqcorrelation.png" alt="RNA-seq correlation" /></p>

<h3 id="conversion-quantification">Conversion quantification</h3>

<p>Plain quantification of the number of T&gt;C conversion-containing reads in a given interval is biased towards intervals with higher T-content and higher coverage, since the probability of observing a T&gt;C conversion in these intervals is increased. To address this issue, we devised a T-content and coverage-aware nucleotide-conversion quantification within intervals that is clearly superior in terms of error rates (see bottom Figure left). Overall, the variance of the relative error decreases with higher coverage, and while the method slightly underestimates the true conversion rate with short reads (50bp), it accurately estimates conversion rates for reads of 100bp and longer (bottom Figure right).</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Slamdunk/tcontentquantification.png" alt="T-content coverage aware quantification" /></p>

<h3 id="multiqc-report">MultiQC report</h3>

<p>Visualization of results and quality control is an important aspect of every analysis. To this end, with lots of help from <a href="https://phil.ewels.co.uk">Phil Ewels</a>, we developed a plugin for <a href="https://multiqc.info/">MultiQC</a> to facilitate quality control of SLAMseq datasets. Using this plugin, we can visualize conversion rates within samples (bottom Figure a), display the principal components of samples based on T&gt;C containing reads (bottom Figure b), plot non-T&gt;C mismatches across read positions to identify problematic read positions (bottom Figure c), or plot T&gt;C conversions at 3’ ends (bottom Figure d) to check for base-composition biases.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Slamdunk/multiqc.png" alt="MultiQC module" /></p>

<h2 id="documentation">Documentation</h2>

<p>A thorough documentation is available from the main website:</p>

<ul>
  <li><a href="https://t-neumann.github.io/slamdunk/">https://t-neumann.github.io/slamdunk/</a></li>
</ul>

<h2 id="availability">Availability</h2>

<p>Slamdunk is available from several platforms:</p>

<ul>
  <li><a href="https://bioconda.github.io/recipes/slamdunk/README.html">BioConda</a></li>
  <li><a href="https://galaxyproject.eu/posts/2019/08/17/Slamdunk/">Galaxy</a></li>
  <li><a href="https://hub.docker.com/r/tobneu/slamdunk">Docker <i class="fab fa-docker" aria-hidden="true"></i></a></li>
  <li><a href="https://pypi.org/project/slamdunk/">PyPI <i class="fab fa-python" aria-hidden="true"></i></a></li>
  <li><a href="https://github.com/t-neumann/slamdunk">GitHub <i class="fab fa-github" aria-hidden="true"></i></a></li>
</ul>
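
<p>Installation from most of these channels is a one-liner, for example:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># via BioConda
conda install -c bioconda slamdunk

# via PyPI
pip install slamdunk

# via Docker
docker pull tobneu/slamdunk
</code></pre></div></div>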

<embed src="https://bmcbioinformatics.biomedcentral.com/track/pdf/10.1186/s12859-019-2849-7" width="100%" height="700" type="application/pdf" />]]></content><author><name>Tobias Neumann</name></author><category term="Pipelines" /><category term="SLAMseq" /><category term="Slamdunk" /><category term="Containers" /><category term="Bioconda" /><category term="PyPI" /><category term="Docker" /><summary type="html"><![CDATA[SLAMseq analysis using Slamdunk for nucleotide-conversion sequencing datasets]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://t-neumann.github.io/assets/images/categories/logo_slamdunk_rgb_72.jpg" /><media:content medium="image" url="https://t-neumann.github.io/assets/images/categories/logo_slamdunk_rgb_72.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Pipelines with Nextflow</title><link href="https://t-neumann.github.io/pipelines/Nextflow-pipeline/" rel="alternate" type="text/html" title="Pipelines with Nextflow" /><published>2019-03-03T21:51:00+01:00</published><updated>2019-03-03T21:51:00+01:00</updated><id>https://t-neumann.github.io/pipelines/Nextflow-pipeline</id><content type="html" xml:base="https://t-neumann.github.io/pipelines/Nextflow-pipeline/"><![CDATA[<p>Nowadays, workflow management systems have become an integral part of large-scale analysis of biological datasets with multiple software packages and multi-platform language support. These systems enable the rapid prototyping and deployment of pipelines that combine complementary software packages.
Several such systems are already available, such as <a href="https://snakemake.readthedocs.io/en/stable/">Snakemake</a> and <a href="https://www.commonwl.org/">CWL</a>.</p>

<p>This post will give you an overview of my favourite workflow building system - <a href="https://www.nextflow.io/">Nextflow</a> - and look at one toy workflow implementation example that will also be used in later posts.</p>

<h2 id="nextflow">Nextflow</h2>

<p>Here, I will more or less shamelessly copy large parts of the description from Nextflow’s <a href="https://www.nextflow.io/">website</a>, since it summarises the main features quite neatly.</p>

<p>Up front, the most severe disadvantage for me: Nextflow is written in <a href="https://groovy-lang.org/">Groovy</a>, which is kind of a pain for me, since I am mostly Python, R, C/C++ and Java based and had never needed to touch any Groovy.</p>

<p>However, with some fiddling around and especially a lot of low-latency community support via the <a href="https://gitter.im/nextflow-io/nextflow">Nextflow Gitter channel</a>, these are hurdles that can be overcome.</p>

<p>Once you have lost your fear of Groovy, the advantages of Nextflow are quite convincing.</p>

<p>If you want to read more about Nextflow, <a href="https://www.nextflow.io/docs/latest/index.html">here is the documentation</a> and <a href="https://www.nature.com/articles/nbt.3820">here is the original paper</a>.</p>

<h4 id="fast-prototyping">Fast prototyping</h4>

<p>Nextflow allows you to write a computational pipeline by making it simpler to put together many different tasks.</p>

<p>You may reuse your existing scripts and tools and you don’t need to learn a new language or API to start using it.</p>

<p>As an example, look at how easy it is to run code from different languages within Nextflow processes out of the box.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">process</span> <span class="n">perlStuff</span> <span class="o">{</span>

    <span class="s">"""
    #!/usr/bin/perl

    print 'Hi there!' . '\n';
    """</span>

<span class="o">}</span>

<span class="n">process</span> <span class="n">pyStuff</span> <span class="o">{</span>

    <span class="s">"""
    #!/usr/bin/python

    x = 'Hello'
    y = 'world!'
    print "</span><span class="o">%</span><span class="n">s</span> <span class="o">-</span> <span class="o">%</span><span class="n">s</span><span class="s">" % (x,y)
    """</span>

<span class="o">}</span>
</code></pre></div></div>

<h4 id="portable">Portable</h4>

<p>Nextflow provides an abstraction layer between your pipeline’s logic and the execution layer, so that it can be executed on multiple platforms without it changing.</p>

<p>It provides out of the box executors for SGE, LSF, SLURM, PBS and HTCondor batch schedulers and for Kubernetes, Amazon AWS and Google Cloud platforms.</p>

<p>Again, check out the so-called profile configurations one can quite easily set up to enable support for yet another scheduler.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">profiles</span> <span class="o">{</span>

    <span class="n">standard</span> <span class="o">{</span>
        <span class="n">process</span><span class="o">.</span><span class="na">executor</span> <span class="o">=</span> <span class="err">'</span><span class="n">local</span><span class="err">'</span>
    <span class="o">}</span>

    <span class="n">cluster_sge</span> <span class="o">{</span>
        <span class="n">process</span><span class="o">.</span><span class="na">executor</span> <span class="o">=</span> <span class="err">'</span><span class="n">sge</span><span class="err">'</span>
        <span class="n">process</span><span class="o">.</span><span class="na">penv</span> <span class="o">=</span> <span class="err">'</span><span class="n">smp</span><span class="err">'</span>
        <span class="n">process</span><span class="o">.</span><span class="na">cpus</span> <span class="o">=</span> <span class="mi">20</span>
        <span class="n">process</span><span class="o">.</span><span class="na">queue</span> <span class="o">=</span> <span class="err">'</span><span class="kd">public</span><span class="o">.</span><span class="na">q</span><span class="err">'</span>
        <span class="n">process</span><span class="o">.</span><span class="na">memory</span> <span class="o">=</span> <span class="err">'</span><span class="mi">10</span><span class="no">GB</span><span class="err">'</span>
    <span class="o">}</span>

    <span class="n">cluster_slurm</span> <span class="o">{</span>
        <span class="n">process</span><span class="o">.</span><span class="na">executor</span> <span class="o">=</span> <span class="err">'</span><span class="n">slurm</span><span class="err">'</span>
        <span class="n">process</span><span class="o">.</span><span class="na">cpus</span> <span class="o">=</span> <span class="mi">20</span>
        <span class="n">process</span><span class="o">.</span><span class="na">queue</span> <span class="o">=</span> <span class="err">'</span><span class="n">work</span><span class="err">'</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p>With these few lines of code, you can now seamlessly execute your pipeline on your local machine, on PBS and SLURM, even with customized resource settings.</p>
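
<p>Selecting one of these profiles at runtime is then a single flag:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># run the same pipeline on the SLURM cluster instead of locally
nextflow run main.nf -profile cluster_slurm
</code></pre></div></div>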

<h4 id="reproducibility">Reproducibility</h4>

<p>Nextflow supports <a href="https://www.docker.com/">Docker</a> and <a href="https://singularity.lbl.gov/">Singularity</a> containers technology.</p>

<p>This, along with the integration of the GitHub code sharing platform, allows you to write self-contained pipelines, manage versions and to rapidly reproduce any former configuration.</p>

<p>This is an especially nice feature, since it also allows to run Nextflow workflows on cloud based platforms such as <a href="https://aws.amazon.com/">Amazon Web Services</a> which strictly require all software environments supplied in a public <a href="https://www.nextflow.io/docs/latest/awscloud.html#awscloud-batch-config">Docker registry</a> reachable by ECS batch.</p>
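
<p>Enabling a container for a pipeline is again just configuration. A minimal sketch using the documented config options (the image name here is the public Salmon container we will use later):</p>

<div class="language-groovy highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// nextflow.config
process.container = 'combinelab/salmon'
docker.enabled    = true
// or, on an HPC system:
// singularity.enabled = true
</code></pre></div></div>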

<h4 id="unified-parallelism">Unified parallelism</h4>

<p>Nextflow is based on the dataflow programming model which greatly simplifies writing complex distributed pipelines.</p>

<p>Parallelisation is implicitly defined by the processes’ input and output declarations. The resulting applications are inherently parallel and can scale up or scale out, transparently, without having to adapt to a specific platform architecture.</p>

<h4 id="continuous-checkpoints">Continuous checkpoints</h4>

<p>All the intermediate results produced during the pipeline execution are automatically tracked.</p>

<p>This allows you to resume its execution, from the last successfully executed step, no matter what the reason was for it stopping.</p>
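
<p>In practice this means that after fixing whatever went wrong, you simply relaunch with the <code class="language-plaintext highlighter-rouge">-resume</code> flag and all cached, already-completed steps are skipped:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nextflow run main.nf -resume
</code></pre></div></div>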

<h4 id="stream-oriented">Stream oriented</h4>

<p>Nextflow extends the Unix pipes model with a fluent DSL, allowing you to handle complex stream interactions easily.</p>

<p>It promotes a programming approach, based on functional composition, that results in resilient and easily reproducible pipelines.</p>

<h2 id="salmon">Salmon</h2>

<p>Our first small toy Nextflow workflow will be based upon <a href="https://combine-lab.github.io/salmon/">Salmon</a>.</p>

<p>Salmon is a tool for quantifying the expression of transcripts using RNA-seq data. Salmon uses the concept of quasi-mapping coupled with a two-phase inference procedure to provide accurate expression estimates very quickly (i.e. wicked-fast) and while using little memory. Salmon performs its inference using an expressive and realistic model of RNA-seq data that takes into account experimental attributes and biases commonly observed in real RNA-seq data.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Salmon-pipeline/salmon.png" alt="Salmon overview" /></p>

<p>Essentially, Salmon will create a transcript index which it then uses to quantify expression estimates for each of the transcripts from raw fastq reads.</p>

<p>Our goal:</p>

<ul>
  <li>Obtain those transcript expression estimates for our samples</li>
  <li>Obtain reads mapping to these transcripts via the <code class="language-plaintext highlighter-rouge">--writeMappings</code> flag as pseudo-bam</li>
</ul>

<p>If you want to read more on Salmon, <a href="https://www.nature.com/articles/nmeth.4197">here is the paper</a>.</p>
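
<p>Translated into plain Salmon commands, the two goals above boil down to an indexing step and a quantification step, roughly like this (paths are placeholders; see the Salmon documentation for the full set of options):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># build the transcriptome index once
salmon index -t transcripts.fa -i transcripts_index

# quantify one sample, writing the quasi-mappings in SAM format
salmon quant -i transcripts_index -l A -r sample1.fastq.gz \
    --writeMappings=pseudo.sam -o sample1_out
</code></pre></div></div>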

<h2 id="salmon-nf">salmon-nf</h2>

<p>The Nextflow pipeline we will create during this exercise will be called <code class="language-plaintext highlighter-rouge">salmon-nf</code>, and it can be found as a fully functional repository on my <a href="https://github.com/t-neumann/salmon-nf">GitHub page</a>.</p>

<p>Any standalone Nextflow pipeline will need 2 files to be executable out of the box and also directly <a href="https://www.nextflow.io/docs/latest/sharing.html#running-a-pipeline">from GitHub</a>:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">main.nf</code> - This file contains the individual processes and channels</li>
  <li><code class="language-plaintext highlighter-rouge">nextflow.config</code> - The configuration file for parameters, profiles etc. For more info read <a href="https://www.nextflow.io/docs/latest/config.html#configuration-file">here</a></li>
</ul>

<h3 id="workflow-layout">Workflow layout</h3>

<p>First, we need to get an idea of what the data flow will be and what software and scripts will run on it. I have outlined the basic workflow of <code class="language-plaintext highlighter-rouge">salmon-nf</code> below:</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Salmon-pipeline/salmon-nf.png" alt="salmon-nf" width="50%" /></p>

<p>We will only have one single process <code class="language-plaintext highlighter-rouge">salmon</code> which will use the input <code class="language-plaintext highlighter-rouge">fastq</code> files and the respective transcriptome <code class="language-plaintext highlighter-rouge">index</code> file to produce our expression estimates and the pseudo-bam files of aligning reads.</p>

<p>So for our <code class="language-plaintext highlighter-rouge">salmon</code> process we will have 2 input channels:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">fastqChannel</code> - feeding in our raw reads in <code class="language-plaintext highlighter-rouge">fastq</code> format</li>
  <li><code class="language-plaintext highlighter-rouge">indexChannel</code> - providing our transcriptome <code class="language-plaintext highlighter-rouge">index</code> to which we align the reads</li>
</ul>

<p>Our <code class="language-plaintext highlighter-rouge">salmon</code> process will produce several output files, of which we feed two file types into output channels as our final results:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">quant.sf</code> files via the <code class="language-plaintext highlighter-rouge">salmonChannel</code> output channel</li>
  <li><code class="language-plaintext highlighter-rouge">pseudo.bam</code> files via the <code class="language-plaintext highlighter-rouge">pseudoBamChannel</code> output channel</li>
</ul>

<p>Now let’s look at how to actually implement this in code.</p>

<h3 id="docker-container">Docker container</h3>

<p>Before we can run anything, we need to provide the software environment containing <strong>all</strong> dependencies and software packages our <code class="language-plaintext highlighter-rouge">salmon</code> process requires. These days, this is usually done via a <a href="https://www.docker.com/">Docker</a> container, or a <a href="https://singularity.lbl.gov/">Singularity</a> container on HPC environments.</p>

<p>Many software packages - including Salmon in our case - already provide ready-to-use Docker containers (<code class="language-plaintext highlighter-rouge">combinelab/salmon</code>). But even if they don’t, do not despair and blindly jump into creating your own container: if the package was provided via <a href="https://bioconda.github.io/">BioConda</a>, you will find a Docker container on <a href="https://quay.io/organization/biocontainers">BioContainers</a>. I found this last resort to work in many cases.</p>

<p>Either way, since I wanted to convert the raw <code class="language-plaintext highlighter-rouge">SAM</code> output from <code class="language-plaintext highlighter-rouge">salmon</code> into a compressed <code class="language-plaintext highlighter-rouge">BAM</code> file, I chose to extend their Docker image by adding <code class="language-plaintext highlighter-rouge">samtools</code>, as shown in the <a href="https://docs.docker.com/engine/reference/builder/">Dockerfile</a> below.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Copyright (c) 2019 Tobias Neumann.</span>
<span class="c">#</span>
<span class="c"># You should have received a copy of the GNU Affero General Public License</span>
<span class="c"># along with this program.  If not, see &lt;http://www.gnu.org/licenses/&gt;.</span>

FROM combinelab/salmon:0.12.0

MAINTAINER Tobias Neumann &lt;tobias.neumann.at@gmail.com&gt;

RUN <span class="nv">buildDeps</span><span class="o">=</span><span class="s1">'wget ca-certificates make g++'</span> <span class="se">\</span>
    <span class="nv">runDeps</span><span class="o">=</span><span class="s1">'zlib1g-dev libncurses5-dev unzip gcc'</span> <span class="se">\</span>
    <span class="o">&amp;&amp;</span> <span class="nb">set</span> <span class="nt">-x</span> <span class="se">\</span>
    <span class="o">&amp;&amp;</span> apt-get update <span class="se">\</span>
    <span class="o">&amp;&amp;</span> apt-get <span class="nb">install</span> <span class="nt">-y</span> <span class="nv">$buildDeps</span> <span class="nv">$runDeps</span> <span class="nt">--no-install-recommends</span> <span class="se">\</span>
    <span class="o">&amp;&amp;</span> <span class="nb">rm</span> <span class="nt">-rf</span> /var/lib/apt/lists/<span class="k">*</span> <span class="se">\</span>
    <span class="o">&amp;&amp;</span> wget https://github.com/samtools/samtools/releases/download/1.9/samtools-1.9.tar.bz2 <span class="se">\</span>
    <span class="o">&amp;&amp;</span> <span class="nb">tar </span>xvfj samtools-1.9.tar.bz2 <span class="se">\</span>
    <span class="o">&amp;&amp;</span> <span class="nb">cd </span>samtools-1.9 <span class="se">\</span>
    <span class="o">&amp;&amp;</span> ./configure <span class="nt">--prefix</span><span class="o">=</span>/usr/local/ <span class="se">\</span>
    <span class="o">&amp;&amp;</span> make <span class="se">\</span>
    <span class="o">&amp;&amp;</span> make <span class="nb">install</span> <span class="se">\</span>
    <span class="o">&amp;&amp;</span> apt-get purge <span class="nt">-y</span> <span class="nt">--auto-remove</span> <span class="nv">$buildDeps</span>
</code></pre></div></div>

<p>The resulting Docker image was pushed to <a href="https://hub.docker.com/">Docker Hub</a> and can be pulled via <code class="language-plaintext highlighter-rouge">docker pull obenauflab/salmon:latest</code>.</p>

<h3 id="mainnf">main.nf</h3>

<p>Now we are ready to create the central <code class="language-plaintext highlighter-rouge">main.nf</code> file, which contains all processes as well as channels. As mentioned before, you will find the entire code on <a href="https://github.com/t-neumann/salmon-nf">GitHub</a>, so here is an excerpt of the important sections.</p>

<h5 id="fastqchannel"><code class="language-plaintext highlighter-rouge">fastqChannel</code></h5>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pairedEndRegex</span> <span class="o">=</span> <span class="n">params</span><span class="o">.</span><span class="na">inputDir</span> <span class="o">+</span> <span class="s">"/*_{1,2}.fq.gz"</span>
<span class="nc">SERegex</span> <span class="o">=</span> <span class="n">params</span><span class="o">.</span><span class="na">inputDir</span> <span class="o">+</span> <span class="s">"/*[!12].fq.gz"</span>

<span class="n">pairFiles</span> <span class="o">=</span> <span class="nc">Channel</span><span class="o">.</span><span class="na">fromFilePairs</span><span class="o">(</span><span class="n">pairedEndRegex</span><span class="o">)</span>
<span class="n">singleFiles</span> <span class="o">=</span> <span class="nc">Channel</span><span class="o">.</span><span class="na">fromFilePairs</span><span class="o">(</span><span class="nc">SERegex</span><span class="o">,</span> <span class="nl">size:</span> <span class="mi">1</span><span class="o">){</span> <span class="n">file</span> <span class="o">-&gt;</span> <span class="n">file</span><span class="o">.</span><span class="na">baseName</span><span class="o">.</span><span class="na">replaceAll</span><span class="o">(/.</span><span class="na">fq</span><span class="o">/,</span><span class="s">""</span><span class="o">)</span> <span class="o">}</span>

<span class="n">singleFiles</span><span class="o">.</span><span class="na">mix</span><span class="o">(</span><span class="n">pairFiles</span><span class="o">)</span>
<span class="o">.</span><span class="na">set</span> <span class="o">{</span> <span class="n">fastqChannel</span> <span class="o">}</span>
</code></pre></div></div>

<p>This elaborate chunk of code is needed so that the <code class="language-plaintext highlighter-rouge">fastqChannel</code> input channel of our <code class="language-plaintext highlighter-rouge">salmon</code> process can handle both single- and paired-end <code class="language-plaintext highlighter-rouge">fastq</code> files. As you can see, we create a <code class="language-plaintext highlighter-rouge">pairFiles</code> channel with a paired-end glob pattern, assuming that our read pairs are named <code class="language-plaintext highlighter-rouge">*_1.fq.gz</code> and <code class="language-plaintext highlighter-rouge">*_2.fq.gz</code>. In addition, we have a <code class="language-plaintext highlighter-rouge">singleFiles</code> channel that picks up all <code class="language-plaintext highlighter-rouge">fastq</code> files not following the <code class="language-plaintext highlighter-rouge">_1</code>/<code class="language-plaintext highlighter-rouge">_2</code> naming convention and assumes they are single-end read files.</p>
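<p>A quick way to sanity-check the two patterns is to try them in bash, whose globbing behaves similarly here (the file names below are made up for illustration):</p>

```bash
#!/usr/bin/env bash
# Demo of the two glob patterns from fastqChannel, on made-up file names.
dir=$(mktemp -d)
touch "$dir/sampleA_1.fq.gz" "$dir/sampleA_2.fq.gz" "$dir/sampleB.fq.gz"

shopt -s nullglob
pe=( "$dir"/*_{1,2}.fq.gz )   # paired-end pattern: matches sampleA_1 and sampleA_2
se=( "$dir"/*[!12].fq.gz )    # single-end pattern: matches only sampleB

echo "paired-end matches: ${#pe[@]}"
echo "single-end matches: ${#se[@]}"
rm -r "$dir"
```

Note that <code class="language-plaintext highlighter-rouge">*[!12].fq.gz</code> only excludes files whose name ends in <code class="language-plaintext highlighter-rouge">1.fq.gz</code> or <code class="language-plaintext highlighter-rouge">2.fq.gz</code>, so unconventionally named paired files would leak into the single-end channel.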

<p>The <code class="language-plaintext highlighter-rouge">fromFilePairs</code> method creates a channel emitting the file pairs matching the regex we provided. The matching files are emitted as tuples in which the first element is the grouping key of the matching pair and the second element is the list of files (sorted in lexicographical order). For example:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span><span class="mo">03</span><span class="mi">99</span><span class="n">ad16</span><span class="o">-</span><span class="mi">816</span><span class="n">f</span><span class="o">-</span><span class="mi">4824</span><span class="o">-</span><span class="n">ae28</span><span class="o">-</span><span class="mi">7</span><span class="n">b82e006e7b7_gdc_realn_rehead</span><span class="o">,</span> <span class="o">[</span><span class="mo">03</span><span class="mi">99</span><span class="n">ad16</span><span class="o">-</span><span class="mi">816</span><span class="n">f</span><span class="o">-</span><span class="mi">4824</span><span class="o">-</span><span class="n">ae28</span><span class="o">-</span><span class="mi">7</span><span class="n">b82e006e7b7_gdc_realn_rehead_1</span><span class="o">.</span><span class="na">fq</span><span class="o">.</span><span class="na">gz</span><span class="o">,</span> <span class="mo">03</span><span class="mi">99</span><span class="n">ad16</span><span class="o">-</span><span class="mi">816</span><span class="n">f</span><span class="o">-</span><span class="mi">4824</span><span class="o">-</span><span class="n">ae28</span><span class="o">-</span><span class="mi">7</span><span class="n">b82e006e7b7_gdc_realn_rehead_2</span><span class="o">.</span><span class="na">fq</span><span class="o">.</span><span class="na">gz</span><span class="o">]]</span>
<span class="o">[</span><span class="mi">0</span><span class="n">ac6634e</span><span class="o">-</span><span class="mo">00</span><span class="n">b0</span><span class="o">-</span><span class="mi">4107</span><span class="o">-</span><span class="n">a5d6</span><span class="o">-</span><span class="n">db8ffc602645_gdc_realn_rehead</span><span class="o">,</span> <span class="o">[</span><span class="mi">0</span><span class="n">ac6634e</span><span class="o">-</span><span class="mo">00</span><span class="n">b0</span><span class="o">-</span><span class="mi">4107</span><span class="o">-</span><span class="n">a5d6</span><span class="o">-</span><span class="n">db8ffc602645_gdc_realn_rehead_1</span><span class="o">.</span><span class="na">fq</span><span class="o">.</span><span class="na">gz</span><span class="o">,</span> <span class="mi">0</span><span class="n">ac6634e</span><span class="o">-</span><span class="mo">00</span><span class="n">b0</span><span class="o">-</span><span class="mi">4107</span><span class="o">-</span><span class="n">a5d6</span><span class="o">-</span><span class="n">db8ffc602645_gdc_realn_rehead_2</span><span class="o">.</span><span class="na">fq</span><span class="o">.</span><span class="na">gz</span><span class="o">]]</span>
</code></pre></div></div>

<p>As you can see, for the single-end reads channel <code class="language-plaintext highlighter-rouge">singleFiles</code>, the method is slightly extended:</p>

<p>First, we set an additional parameter <code class="language-plaintext highlighter-rouge">size: 1</code> to declare that each emitted item is expected to hold exactly one file. In addition, we manually provide a custom grouping strategy via the closure, which, given the current file as parameter, returns the grouping key. In our case, we simply strip the trailing <code class="language-plaintext highlighter-rouge">.fq</code> from the file’s base name and use this as our grouping key. For example:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span><span class="mi">0</span><span class="n">fdb3d0e</span><span class="o">-</span><span class="n">e405</span><span class="o">-</span><span class="mi">4</span><span class="n">e8d</span><span class="o">-</span><span class="mi">8897</span><span class="o">-</span><span class="mi">4</span><span class="n">a90ea4fe00c_gdc_realn_rehead</span><span class="o">,</span> <span class="o">[</span><span class="mi">0</span><span class="n">fdb3d0e</span><span class="o">-</span><span class="n">e405</span><span class="o">-</span><span class="mi">4</span><span class="n">e8d</span><span class="o">-</span><span class="mi">8897</span><span class="o">-</span><span class="mi">4</span><span class="n">a90ea4fe00c_gdc_realn_rehead</span><span class="o">.</span><span class="na">fq</span><span class="o">.</span><span class="na">gz</span><span class="o">]]</span>
<span class="o">[</span><span class="mi">1916</span><span class="n">abcd</span><span class="o">-</span><span class="mi">61</span><span class="n">c0</span><span class="o">-</span><span class="mi">4</span><span class="n">f23</span><span class="o">-</span><span class="mi">96</span><span class="n">ac</span><span class="o">-</span><span class="n">be70aacb8dc1_gdc_realn_rehead</span><span class="o">,</span> <span class="o">[</span><span class="mi">1916</span><span class="n">abcd</span><span class="o">-</span><span class="mi">61</span><span class="n">c0</span><span class="o">-</span><span class="mi">4</span><span class="n">f23</span><span class="o">-</span><span class="mi">96</span><span class="n">ac</span><span class="o">-</span><span class="n">be70aacb8dc1_gdc_realn_rehead</span><span class="o">.</span><span class="na">fq</span><span class="o">.</span><span class="na">gz</span><span class="o">]]</span>
</code></pre></div></div>
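<p>The effect of the closure can be mimicked in plain bash with parameter expansion (hypothetical file name):</p>

```bash
# Bash analogue of the Groovy closure file.baseName.replaceAll(/.fq/, ""):
# derive the single-end grouping key from a (made-up) file name.
f="sample_gdc_realn_rehead.fq.gz"
base="${f%.gz}"     # baseName: drop the final .gz -> sample_gdc_realn_rehead.fq
key="${base%.fq}"   # drop the trailing .fq        -> sample_gdc_realn_rehead
echo "$key"
```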

<p>Finally, we combine both channels via the <code class="language-plaintext highlighter-rouge">mix</code> operator into our final <code class="language-plaintext highlighter-rouge">fastqChannel</code> input channel of our <code class="language-plaintext highlighter-rouge">salmon</code> process.</p>

<h5 id="indexchannel"><code class="language-plaintext highlighter-rouge">indexChannel</code></h5>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">indexChannel</span> <span class="o">=</span> <span class="nc">Channel</span>
	<span class="o">.</span><span class="na">fromPath</span><span class="o">(</span><span class="n">params</span><span class="o">.</span><span class="na">salmonIndex</span><span class="o">)</span>
	<span class="o">.</span><span class="na">ifEmpty</span> <span class="o">{</span> <span class="n">exit</span> <span class="mi">1</span><span class="o">,</span> <span class="s">"Salmon index not found: ${params.salmonIndex}"</span> <span class="o">}</span>
</code></pre></div></div>

<p>This input channel is pretty straightforward to set up. The only thing we need to do is create our Salmon index beforehand (read how to do this <a href="https://salmon.readthedocs.io/en/latest/salmon.html#preparing-transcriptome-indices-mapping-based-mode">here</a>) and supply it via the <code class="language-plaintext highlighter-rouge">salmonIndex</code> parameter - how this is done will follow later.</p>

<h5 id="process-salmon">Process <code class="language-plaintext highlighter-rouge">salmon</code></h5>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">process</span> <span class="n">salmon</span> <span class="o">{</span>

	<span class="n">tag</span> <span class="o">{</span> <span class="n">lane</span> <span class="o">}</span>

    <span class="nl">input:</span>
    <span class="n">set</span> <span class="nf">val</span><span class="o">(</span><span class="n">lane</span><span class="o">),</span> <span class="n">file</span><span class="o">(</span><span class="n">reads</span><span class="o">)</span> <span class="n">from</span> <span class="n">fastqChannel</span>
    <span class="n">file</span> <span class="n">index</span> <span class="n">from</span> <span class="n">indexChannel</span><span class="o">.</span><span class="na">first</span><span class="o">()</span>

    <span class="nl">output:</span>
    <span class="n">file</span> <span class="o">(</span><span class="s">"${lane}_salmon/quant.sf"</span><span class="o">)</span> <span class="n">into</span> <span class="n">salmonChannel</span>
    <span class="nf">file</span> <span class="o">(</span><span class="s">"${lane}_pseudo.bam"</span><span class="o">)</span> <span class="n">into</span> <span class="n">pseudoBamChannel</span>

    <span class="nl">shell:</span>

    <span class="n">def</span> <span class="n">single</span> <span class="o">=</span> <span class="n">reads</span> <span class="k">instanceof</span> <span class="nc">Path</span>

    <span class="nf">if</span> <span class="o">(!</span><span class="n">single</span><span class="o">)</span>

      <span class="sc">'''</span>
      <span class="n">salmon</span> <span class="n">quant</span> <span class="o">-</span><span class="n">i</span> <span class="o">!{</span><span class="n">index</span><span class="o">}</span> <span class="o">-</span><span class="n">l</span> <span class="no">A</span> <span class="o">-</span><span class="mi">1</span> <span class="o">!{</span><span class="n">reads</span><span class="o">[</span><span class="mi">0</span><span class="o">]}</span> <span class="o">-</span><span class="mi">2</span> <span class="o">!{</span><span class="n">reads</span><span class="o">[</span><span class="mi">1</span><span class="o">]}</span> <span class="o">-</span><span class="n">o</span> <span class="o">!{</span><span class="n">lane</span><span class="o">}</span><span class="n">_salmon</span> <span class="o">-</span><span class="n">p</span> <span class="o">!{</span><span class="n">task</span><span class="o">.</span><span class="na">cpus</span><span class="o">}</span> <span class="o">--</span><span class="n">validateMappings</span> <span class="o">--</span><span class="n">no</span><span class="o">-</span><span class="n">version</span><span class="o">-</span><span class="n">check</span> <span class="o">-</span><span class="n">z</span> <span class="o">|</span> <span class="n">samtools</span> <span class="n">view</span> <span class="o">-</span><span class="nc">Sb</span> <span class="o">-</span><span class="no">F</span> <span class="mi">256</span> <span class="o">-</span> <span class="o">&gt;</span> <span class="o">!{</span><span class="n">lane</span><span class="o">}</span><span class="n">_pseudo</span><span class="o">.</span><span class="na">bam</span>
	    <span class="sc">'''</span>
    <span class="k">else</span>
      <span class="sc">'''</span>
      <span class="n">salmon</span> <span class="n">quant</span> <span class="o">-</span><span class="n">i</span> <span class="o">!{</span><span class="n">index</span><span class="o">}</span> <span class="o">-</span><span class="n">l</span> <span class="no">A</span> <span class="o">-</span><span class="n">r</span> <span class="o">!{</span><span class="n">reads</span><span class="o">}</span> <span class="o">-</span><span class="n">o</span> <span class="o">!{</span><span class="n">lane</span><span class="o">}</span><span class="n">_salmon</span> <span class="o">-</span><span class="n">p</span> <span class="o">!{</span><span class="n">task</span><span class="o">.</span><span class="na">cpus</span><span class="o">}</span> <span class="o">--</span><span class="n">validateMappings</span> <span class="o">--</span><span class="n">no</span><span class="o">-</span><span class="n">version</span><span class="o">-</span><span class="n">check</span> <span class="o">-</span><span class="n">z</span> <span class="o">|</span> <span class="n">samtools</span> <span class="n">view</span> <span class="o">-</span><span class="nc">Sb</span> <span class="o">-</span><span class="no">F</span> <span class="mi">256</span> <span class="o">-</span> <span class="o">&gt;</span> <span class="o">!{</span><span class="n">lane</span><span class="o">}</span><span class="n">_pseudo</span><span class="o">.</span><span class="na">bam</span>
	    <span class="sc">'''</span>

<span class="o">}</span>
</code></pre></div></div>

<p>Our only process for the <code class="language-plaintext highlighter-rouge">salmon-nf</code> workflow is the <code class="language-plaintext highlighter-rouge">salmon</code> process.</p>

<p>You will notice that it has the 2 input channels we previously defined - <code class="language-plaintext highlighter-rouge">fastqChannel</code> and <code class="language-plaintext highlighter-rouge">indexChannel</code>. Note how we use the <code class="language-plaintext highlighter-rouge">.first()</code> method on the <code class="language-plaintext highlighter-rouge">indexChannel</code>: it turns the channel into a value channel whose single item - the index folder - can be reused for every incoming sample.</p>

<p>In addition, we have defined 2 output channels - <code class="language-plaintext highlighter-rouge">salmonChannel</code> outputting all <code class="language-plaintext highlighter-rouge">quant.sf</code> files and <code class="language-plaintext highlighter-rouge">pseudoBamChannel</code> outputting the corresponding <code class="language-plaintext highlighter-rouge">pseudo.bam</code> files.</p>

<p>The actual script that is run is a plain conditional bash script. An initial condition checks whether single-end read files or paired-end reads are coming in from the <code class="language-plaintext highlighter-rouge">fastqChannel</code> - and based on this evaluation, one or the other script branch is run.</p>
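<p>The check <code class="language-plaintext highlighter-rouge">reads instanceof Path</code> works because a single-end item carries one file while a paired-end item carries a list of two. A bash sketch of the same branching logic, using the file count per sample item and made-up file names:</p>

```bash
# Sketch of the single- vs paired-end branching: here we count the
# files per sample item instead of testing `instanceof Path`.
reads=("sample_1.fq.gz" "sample_2.fq.gz")

if [ "${#reads[@]}" -eq 2 ]; then
  mode="paired-end"   # would run: salmon quant ... -1 <reads[0]> -2 <reads[1]>
else
  mode="single-end"   # would run: salmon quant ... -r <reads>
fi
echo "$mode"
```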

<p>The bash script itself is then essentially a single <code class="language-plaintext highlighter-rouge">salmon quant</code> call on the respective input files, with the SAM output piped through <code class="language-plaintext highlighter-rouge">samtools view</code> (dropping secondary alignments via <code class="language-plaintext highlighter-rouge">-F 256</code>) to produce the compressed <code class="language-plaintext highlighter-rouge">pseudo.bam</code> file.</p>


<h3 id="nextflowconfig">nextflow.config</h3>

<p>Nextflow configuration files contain directives for parameter definitions, profile definitions and many other settings.</p>

<p>In our particular example of <code class="language-plaintext highlighter-rouge">salmon-nf</code>, we will keep the master <code class="language-plaintext highlighter-rouge">nextflow.config</code> tidy and include additional configs for each section.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">includeConfig</span> <span class="err">'</span><span class="n">config</span><span class="o">/</span><span class="n">general</span><span class="o">.</span><span class="na">config</span><span class="err">'</span>
<span class="n">includeConfig</span> <span class="err">'</span><span class="n">config</span><span class="o">/</span><span class="n">docker</span><span class="o">.</span><span class="na">config</span><span class="err">'</span>

<span class="n">profiles</span> <span class="o">{</span>
    <span class="n">standard</span> <span class="o">{</span>
        <span class="n">process</span><span class="o">.</span><span class="na">executor</span> <span class="o">=</span> <span class="err">'</span><span class="n">local</span><span class="err">'</span>
        <span class="n">process</span><span class="o">.</span><span class="na">maxForks</span> <span class="o">=</span> <span class="mi">3</span>
    <span class="o">}</span>

    <span class="n">slurm</span> <span class="o">{</span>
    	<span class="n">includeConfig</span> <span class="err">'</span><span class="n">config</span><span class="o">/</span><span class="n">slurm</span><span class="o">.</span><span class="na">config</span><span class="err">'</span>
    <span class="o">}</span>

    <span class="n">awsbatch</span> <span class="o">{</span>
        <span class="n">includeConfig</span> <span class="err">'</span><span class="n">config</span><span class="o">/</span><span class="n">awsbatch</span><span class="o">.</span><span class="na">config</span><span class="err">'</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p>As you can see, we have simply included some more config files plus a barebones definition of profiles. Let’s look at the sub-config files.</p>

<h5 id="generalconfig">general.config</h5>

<p>This holds general configurations, parameters and definitions that are applicable to any of our run profiles.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">params</span> <span class="o">{</span>

   <span class="n">outputDir</span> <span class="o">=</span> <span class="err">'</span><span class="o">./</span><span class="n">results</span><span class="err">'</span>
<span class="o">}</span>

<span class="n">process</span> <span class="o">{</span>

	<span class="n">publishDir</span> <span class="o">=</span> <span class="o">[</span>
      <span class="o">[</span><span class="nl">path:</span> <span class="n">params</span><span class="o">.</span><span class="na">outputDir</span><span class="o">,</span> <span class="nl">mode:</span> <span class="err">'</span><span class="n">copy</span><span class="err">'</span><span class="o">,</span> <span class="nl">overwrite:</span> <span class="err">'</span><span class="kc">true</span><span class="err">'</span><span class="o">,</span> <span class="nl">pattern:</span> <span class="s">"*/quant.sf"</span><span class="o">],</span>
      <span class="o">[</span><span class="nl">path:</span> <span class="n">params</span><span class="o">.</span><span class="na">outputDir</span><span class="o">,</span> <span class="nl">mode:</span> <span class="err">'</span><span class="n">copy</span><span class="err">'</span><span class="o">,</span> <span class="nl">overwrite:</span> <span class="err">'</span><span class="kc">true</span><span class="err">'</span><span class="o">,</span> <span class="nl">pattern:</span> <span class="s">"*pseudo.bam"</span><span class="o">]</span>
  	<span class="o">]</span>

	<span class="n">errorStrategy</span> <span class="o">=</span> <span class="err">'</span><span class="n">retry</span><span class="err">'</span>
	<span class="n">maxRetries</span> <span class="o">=</span> <span class="mi">3</span>
	<span class="n">maxForks</span> <span class="o">=</span> <span class="mi">100</span>

<span class="o">}</span>


<span class="n">cloud</span> <span class="o">{</span>
    <span class="n">imageId</span> <span class="o">=</span> <span class="err">'</span><span class="n">ami</span><span class="o">-</span><span class="mi">0</span><span class="n">f99d00928be3a282</span><span class="err">'</span>
    <span class="n">instanceType</span> <span class="o">=</span> <span class="err">'</span><span class="n">t2</span><span class="o">.</span><span class="na">micro</span><span class="err">'</span>
    <span class="n">userName</span> <span class="o">=</span> <span class="err">'</span><span class="n">ec2</span><span class="o">-</span><span class="n">user</span><span class="err">'</span>
    <span class="n">keyName</span> <span class="o">=</span> <span class="err">'</span><span class="n">awsbatch</span><span class="err">'</span>
    <span class="c1">// Type: SSH, Protocol: TCP, Port: 22, Source IP: 0.0.0.0/0</span>
    <span class="n">securityGroup</span> <span class="o">=</span> <span class="err">'</span><span class="n">sg</span><span class="o">-</span><span class="mo">0307</span><span class="n">dbec406526c14</span><span class="err">'</span>
<span class="o">}</span>


<span class="n">timeline</span> <span class="o">{</span>
	<span class="n">enabled</span> <span class="o">=</span> <span class="kc">true</span>
<span class="o">}</span>

<span class="n">report</span> <span class="o">{</span>
	<span class="n">enabled</span> <span class="o">=</span> <span class="kc">true</span>
<span class="o">}</span>
</code></pre></div></div>

<p>We set a default output directory in the <code class="language-plaintext highlighter-rouge">params</code> section, copy the <code class="language-plaintext highlighter-rouge">quant.sf</code> and <code class="language-plaintext highlighter-rouge">pseudo.bam</code> files to a dedicated publish directory, set our error strategy, define a basic cloud profile for starting up instances on <a href="https://aws.amazon.com">AWS</a>, and enable <a href="https://www.nextflow.io/docs/latest/tracing.html#timeline-report">timeline</a> and <a href="https://www.nextflow.io/docs/latest/tracing.html#execution-report">execution</a> reports by default.</p>

<h5 id="dockerconfig">docker.config</h5>

<p>With this configuration file, we enable Docker support per default and supply the Docker image to use with our <code class="language-plaintext highlighter-rouge">salmon</code> process.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">docker</span> <span class="o">{</span>
    <span class="n">enabled</span> <span class="o">=</span> <span class="kc">true</span>
<span class="o">}</span>

<span class="n">process</span> <span class="o">{</span>
    <span class="c1">// Process-specific docker containers</span>
    <span class="nl">withName:</span><span class="n">salmon</span> <span class="o">{</span>
        <span class="n">container</span> <span class="o">=</span> <span class="err">'</span><span class="n">obenauflab</span><span class="o">/</span><span class="nl">salmon:</span><span class="n">latest</span><span class="err">'</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<h5 id="slurmconfig">slurm.config</h5>

<p>This configuration file defines a profile for the <a href="https://slurm.schedmd.com/documentation.html">SLURM</a> scheduler that runs on our HPC system. Our cluster only supports Singularity, so we disable Docker and enable Singularity instead, define basic resource constraints and queues on our HPC system where our tasks run - and finally also supply the location of the <code class="language-plaintext highlighter-rouge">salmonIndex</code> on our file system.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">singularity</span> <span class="o">{</span>
	<span class="n">enabled</span> <span class="o">=</span> <span class="kc">true</span>
<span class="o">}</span>

<span class="n">docker</span> <span class="o">{</span>
	<span class="n">enabled</span> <span class="o">=</span> <span class="kc">false</span>
<span class="o">}</span>

<span class="n">process</span> <span class="o">{</span>

    <span class="n">executor</span> <span class="o">=</span> <span class="err">'</span><span class="n">slurm</span><span class="err">'</span>
    <span class="n">clusterOptions</span> <span class="o">=</span> <span class="err">'</span><span class="o">--</span><span class="n">qos</span><span class="o">=</span><span class="kt">short</span><span class="err">'</span>
    <span class="n">cpus</span> <span class="o">=</span> <span class="err">'</span><span class="mi">12</span><span class="err">'</span>
    <span class="n">memory</span> <span class="o">=</span> <span class="o">{</span> <span class="mi">8</span><span class="o">.</span><span class="na">GB</span> <span class="o">*</span> <span class="n">task</span><span class="o">.</span><span class="na">attempt</span> <span class="o">}</span>
<span class="o">}</span>

<span class="n">params</span> <span class="o">{</span>

   <span class="n">salmonIndex</span> <span class="o">=</span> <span class="err">'</span><span class="o">/</span><span class="n">groups</span><span class="o">/</span><span class="nc">Software</span><span class="o">/</span><span class="n">indices</span><span class="o">/</span><span class="n">hg38</span><span class="o">/</span><span class="n">salmon</span><span class="o">/</span><span class="n">gencode</span><span class="o">.</span><span class="na">v28</span><span class="o">.</span><span class="na">IMPACT</span><span class="err">'</span>

<span class="o">}</span>
</code></pre></div></div>

<h5 id="awsbatchconfig">awsbatch.config</h5>

<p>This configuration file will be explained in detail in a later post - in brief, it enables execution of tasks in the cloud using <a href="https://aws.amazon.com/batch/">AWS Batch</a>, but it requires extensive additional configuration before it is usable.</p>
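<p>To make this less abstract, here is a minimal sketch of what such a profile could look like - every name in it (bucket, region, queue, container image) is a hypothetical placeholder, and a working setup additionally needs AWS credentials plus a configured Batch compute environment and job queue:</p>

```groovy
// Hypothetical Nextflow profile for AWS Batch - all names are placeholders
process {
    executor  = 'awsbatch'
    queue     = 'my-batch-queue'           // an existing AWS Batch job queue
    container = 'combinelab/salmon:latest' // Docker image pulled by ECS
}

aws {
    region = 'eu-central-1'
}

// The work directory must live on S3 so tasks can stage files in and out
workDir = 's3://my-pipeline-bucket/work'
```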

<h2 id="running-the-salmon-nf-nextflow-workflow">Running the <code class="language-plaintext highlighter-rouge">salmon-nf</code> Nextflow workflow</h2>

<p>Now that we have written our code and committed everything to GitHub, we can finally test-drive our workflow on some actual data.</p>

<p>First, let’s pull in our workflow:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tobias.neumann@login-01 <span class="o">[</span>BIO] <span class="nv">$ </span>nextflow pull t-neumann/salmon-nf
Picked up _JAVA_OPTIONS: <span class="nt">-Djava</span>.io.tmpdir<span class="o">=</span>/tmp
Checking t-neumann/salmon-nf ...
 downloaded from https://github.com/t-neumann/salmon-nf.git - revision: 4fbaea7165 <span class="o">[</span>master]
tobias.neumann@login-01 <span class="o">[</span>BIO] <span class="err">$</span>
</code></pre></div></div>

<p>Now we are ready to run our workflow. Make sure to select the profile you desire - for this example I will run it on our in-house cluster with SLURM:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tobias.neumann@login-01 <span class="o">[</span>BIO] <span class="nv">$ </span>nextflow run t-neumann/salmon-nf <span class="nt">--inputDir</span> /tmp/data <span class="nt">--outputDir</span> results <span class="nt">-profile</span> slurm <span class="nt">-resume</span>
Picked up _JAVA_OPTIONS: <span class="nt">-Djava</span>.io.tmpdir<span class="o">=</span>/tmp
N E X T F L O W  ~  version 19.01.0
Launching <span class="sb">`</span>t-neumann/salmon-nf<span class="sb">`</span> <span class="o">[</span>maniac_poisson] - revision: 4fbaea7165 <span class="o">[</span>master]

 parameters
 <span class="o">======================</span>
 input directory          : /tmp/data
 output directory         : results
 <span class="o">======================</span>

<span class="o">[</span>warm up] executor <span class="o">&gt;</span> slurm
<span class="o">[</span>fb/20d1dc] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>8cec7235-3572-460c-b1d7-efe7961988e1_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>e9/6f6404] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>5e18b02d-7e56-4f0d-b892-e7798eee5205_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>f9/509312] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>d1ada222-b67f-47c0-b380-091eaab093b4_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>6d/30354f] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>3783843f-c4fa-4aab-8f5b-e0749763164e_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>9b/2a81e9] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>0fdb3d0e-e405-4e8d-8897-4a90ea4fe00c_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>de/418130] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>383e3574-d22c-4dd6-842f-656ee2ab3b32_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>c1/e00c04] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>1916abcd-61c0-4f23-96ac-be70aacb8dc1_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>63/6a2e93] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>30fe4005-f4f2-41ce-bb1a-4830f3959ab7_gdc_realn_rehead<span class="o">)</span>
</code></pre></div></div>

<p>Now we just have to wait till our workflow has successfully finished processing all our samples.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>76/67754e] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>0399ad16-816f-4824-ae28-7b82e006e7b7_gdc_realn_rehead<span class="o">)</span>

t-neumann/salmon-nf has finished.
Status:   SUCCESS
Time:     Sun Aug 25 23:35:49 CEST 2019
Duration: 2m

tobias.neumann@login-01 <span class="o">[</span>BIO] <span class="err">$</span>
</code></pre></div></div>

<p>If we now check our results and execution folders, we will find all the files we asked for - Nextflow is awesome!</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tobias.neumann@login-01 <span class="o">[</span>BIO] <span class="nv">$ </span><span class="nb">ls
</span>report.html  results  timeline.html
tobias.neumann@login-01 <span class="o">[</span>BIO] <span class="nv">$ </span><span class="nb">ls </span>results
0399ad16-816f-4824-ae28-7b82e006e7b7_gdc_realn_rehead_pseudo.bam  0399ad16-816f-4824-ae28-7b82e006e7b7_gdc_realn_rehead_salmon
</code></pre></div></div>

<p>Have fun building workflows on your own - it pays off, especially for larger samples and heterogeneous computing environments!</p>]]></content><author><name>Tobias Neumann</name></author><category term="Pipelines" /><category term="Containers" /><category term="Docker" /><category term="Nextflow" /><summary type="html"><![CDATA[Setting up and running a pipeline with Nextflow]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://t-neumann.github.io/assets/images/categories/nextflow.png" /><media:content medium="image" url="https://t-neumann.github.io/assets/images/categories/nextflow.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">AWS architecture outline</title><link href="https://t-neumann.github.io/pipelines/AWS-architecture/" rel="alternate" type="text/html" title="AWS architecture outline" /><published>2019-02-10T09:45:00+01:00</published><updated>2019-02-10T09:45:00+01:00</updated><id>https://t-neumann.github.io/pipelines/AWS-architecture</id><content type="html" xml:base="https://t-neumann.github.io/pipelines/AWS-architecture/"><![CDATA[<p>If you talk about the omni-present buzzword <strong>cloud computing</strong>, you will inevitably stumble over <a href="https://aws.amazon.com">Amazon Web Services <i class="fab fa-aws" aria-hidden="true"></i></a>. Sounds super cool and everybody gets excited about it, but I for my part was simply overwhelmed by the amount of services and products available from the platform.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-architecture/AWSServices.png" alt="AWS Services" /></p>

<p>The good news for us bioinformaticians is - and probably all cloud computing professionals working on enterprise solutions are going to beat me up for this statement - that for setting up a proper and failsafe analysis pipeline with AWS, you only need a tiny fraction of those services and can ignore the rest. In this post, I will walk you through the essential AWS building blocks I deem necessary for a basic bioinformatics processing pipeline, their characteristics and caveats, and how they play together.</p>

<h1 id="aws-building-blocks">AWS building blocks</h1>

<p>If you are familiar with cluster computing environments, you should have no trouble recognizing the same architectural principles when building your own custom cluster computing environment in the cloud with AWS. I will elaborate on the pieces I encountered when building up a basic processing pipeline:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">S3</code> for storage of input and auxiliary (e.g. index) files</li>
  <li><code class="language-plaintext highlighter-rouge">EBS</code> as local compute storage</li>
  <li><code class="language-plaintext highlighter-rouge">AMI</code> Machine image (the operating system) to be run on your instances</li>
  <li><code class="language-plaintext highlighter-rouge">EC2</code> instances that do the actual computation</li>
  <li><code class="language-plaintext highlighter-rouge">ECS</code> to create your “software” from Docker containers to run on your instances</li>
  <li><code class="language-plaintext highlighter-rouge">AWS Batch</code> that handles everything from submission to scaling and proper finalization of your individual jobs</li>
</ul>

<p>In the limited number of pipelines I have set up to run on AWS (they can also run in any other compute environment, but that is a story for a later post), I have never used any services beyond these. For anything that involves reading e.g. raw read files, processing them and retrieving the output, one should be able to make do with a combination of those. This setup can probably be optimized or done more elegantly with different services, but I have discussed it with various people and we have not come across a solution that could do it at a lower cost.</p>

<h2 id="s3---simple-storage-service">S3 - Simple Storage Service</h2>

<p>This is the long-term storage solution from AWS. If you are familiar with a cluster compute environment, this would be your globally accessible file system where you store all your important files, reference genomes, alignment indices - you name it. Contrary to the storage you are used to (unless you copy files to your node’s local temporary storage for fast I/O), none of the files on <code class="language-plaintext highlighter-rouge">S3</code> are directly read or written when utilizing <code class="language-plaintext highlighter-rouge">EC2</code> instances for computational tasks. Before any pipeline starts, all of the necessary files have to be present in <code class="language-plaintext highlighter-rouge">S3</code>, such as:</p>

<ul>
  <li>Input files:
    <ul>
      <li>Raw read files (<code class="language-plaintext highlighter-rouge">fastq</code>, <code class="language-plaintext highlighter-rouge">bam</code>,…)</li>
      <li>Quantification tables (<code class="language-plaintext highlighter-rouge">txt</code>, <code class="language-plaintext highlighter-rouge">tsv</code>, <code class="language-plaintext highlighter-rouge">csv</code>,…)</li>
    </ul>
  </li>
  <li>Reference files:
    <ul>
      <li>Genome sequence (<code class="language-plaintext highlighter-rouge">fasta</code>)</li>
      <li>Feature annotations (<code class="language-plaintext highlighter-rouge">gtf</code>, <code class="language-plaintext highlighter-rouge">bed</code>, …)</li>
    </ul>
  </li>
  <li>Index files:
    <ul>
      <li>Alignment indices (<code class="language-plaintext highlighter-rouge">bwa</code>, <code class="language-plaintext highlighter-rouge">bowtie</code>, <code class="language-plaintext highlighter-rouge">STAR</code>,…)</li>
      <li>Exon junction annotations (<code class="language-plaintext highlighter-rouge">gtf</code>, …)</li>
      <li>Transcriptome indices (<code class="language-plaintext highlighter-rouge">callisto</code>, <code class="language-plaintext highlighter-rouge">salmon</code>, …)</li>
    </ul>
  </li>
</ul>
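<p>Getting those files into <code class="language-plaintext highlighter-rouge">S3</code> in the first place is typically done with the <a href="https://aws.amazon.com/cli">AWS Command Line Interface</a> - a quick sketch, with the bucket name being a made-up placeholder:</p>

```shell
# Upload a single raw read file (bucket name is hypothetical)
aws s3 cp sample1.fastq.gz s3://my-pipeline-bucket/input/

# Recursively sync a whole index directory
aws s3 sync ./salmon-index s3://my-pipeline-bucket/indices/salmon/
```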

<p><code class="language-plaintext highlighter-rouge">S3</code> will also be the final storage location where any of the output files produced by your pipeline end up. Since only <code class="language-plaintext highlighter-rouge">S3</code> is long-term storage, you usually don’t have to worry about deleting intermediate or temporary files produced by your pipeline - they will be discarded after your instance has finished processing a given task.</p>

<p>Uploading to <code class="language-plaintext highlighter-rouge">S3</code> does not come at any cost; downloading data from <code class="language-plaintext highlighter-rouge">S3</code>, however, is charged at around 10 cents/GB. Storage on <code class="language-plaintext highlighter-rouge">S3</code> is charged on a per-GB, per-month basis. My guess is that downloads are charged to keep you from shuttling data in and out of <code class="language-plaintext highlighter-rouge">S3</code> for free and thereby circumventing the storage cost.</p>
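<p>To put those rates into perspective, here is a back-of-the-envelope calculation in plain shell arithmetic - the storage rate of ~2 cents/GB/month is an assumption on my part (check the current AWS pricing page), only the ~10 cents/GB download figure is the one quoted above:</p>

```shell
STORAGE_GB=500          # data kept in S3
MONTHS=12               # stored for one year
STORAGE_CENTS_PER_GB=2  # ~2 cents/GB/month - assumed standard-tier rate
EGRESS_GB=500           # data downloaded once
EGRESS_CENTS_PER_GB=10  # ~10 cents/GB download charge

STORAGE_USD=$(( STORAGE_GB * MONTHS * STORAGE_CENTS_PER_GB / 100 ))
EGRESS_USD=$(( EGRESS_GB * EGRESS_CENTS_PER_GB / 100 ))
echo "Storage: ${STORAGE_USD} USD, egress: ${EGRESS_USD} USD"
```

Even at these rough rates, storage over time dominates one-off downloads - another reason to clean up buckets you no longer need.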

<h2 id="ebs---elastic-block-store">EBS - Elastic Block Store</h2>

<p>Every launched instance comes with a root volume of limited size (8 GB) where all the OS and service files required to start up an instance are located. To each instance, you can (and often <strong>must</strong>) attach additional volumes - <code class="language-plaintext highlighter-rouge">EBS</code> volumes - of configurable size where your data goes.</p>

<p>There are three things to consider when choosing your <code class="language-plaintext highlighter-rouge">EBS</code> size:</p>

<ul>
  <li>It needs to be large enough to store all input files for a given job
    <ul>
      <li>This includes <strong>all</strong> auxiliary files such as index files!</li>
    </ul>
  </li>
  <li>It needs to be large enough to store <strong>all</strong> intermediate files for a given job</li>
  <li>It needs to be large enough to store <strong>all</strong> output files from a given job</li>
</ul>

<p>Remember - <code class="language-plaintext highlighter-rouge">S3</code> data is never directly accessed from your instance, but always copied to your local <code class="language-plaintext highlighter-rouge">EBS</code> volume!</p>

<p>Estimating <code class="language-plaintext highlighter-rouge">EBS</code> volume sizes gave me a hard time initially, and I did a lot of benchmarking runs - if the volume is too small, your jobs will crash. In practice, I found that <code class="language-plaintext highlighter-rouge">EBS</code> cost is a negligible fraction of your overall cost, so in the end I ended up being very generous with <code class="language-plaintext highlighter-rouge">EBS</code> volume sizes.</p>
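<p>My own sizing eventually boiled down to summing everything a job touches and applying a generous headroom factor - the numbers below are purely illustrative assumptions, not measurements:</p>

```shell
INPUT_GB=30         # raw reads copied in from S3
INDEX_GB=25         # auxiliary files such as an alignment index
INTERMEDIATE_GB=60  # temporary/unsorted files created mid-job
OUTPUT_GB=40        # final results copied back to S3
HEADROOM=2          # double everything - EBS is cheap, crashed jobs are not

EBS_GB=$(( (INPUT_GB + INDEX_GB + INTERMEDIATE_GB + OUTPUT_GB) * HEADROOM ))
echo "Provision at least ${EBS_GB} GB of EBS"
```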

<h2 id="ami---amazon-machine-image">AMI - Amazon Machine Image</h2>

<p>The <code class="language-plaintext highlighter-rouge">AMI</code> is basically Amazon’s version of an image, similar to virtual machine images. Amazon offers quite a variety of OS base versions in their store (Linux, Windows etc.), but what you usually want is to extend one of those base images yourself with all the software you need during your pipeline run. These days, with <a href="https://www.docker.com">Docker <i class="fab fa-docker" aria-hidden="true"></i></a>, setting up your software environment takes very little effort, but even then you will in most cases have to install at least the <a href="https://aws.amazon.com/cli">AWS Command Line Interface</a> to copy files from and to <code class="language-plaintext highlighter-rouge">S3</code>.</p>
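<p>In practice, extending a base image often amounts to a handful of provisioning commands run once before the customized <code class="language-plaintext highlighter-rouge">AMI</code> is saved - a sketch assuming an Amazon Linux base image (package names will differ on other distributions, and the bucket name is a placeholder):</p>

```shell
# Install Docker and the AWS CLI on top of an Amazon Linux base image
sudo yum update -y
sudo yum install -y docker
sudo service docker start
pip install --user awscli

# Sanity check that staging data from S3 works (bucket is hypothetical)
aws s3 ls s3://my-pipeline-bucket/
```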

<h2 id="ec2---elastic-compute-cloud">EC2 - Elastic Compute Cloud</h2>

<p><code class="language-plaintext highlighter-rouge">EC2</code> is where you bring the computing heat: these are the instances upon which you launch your <code class="language-plaintext highlighter-rouge">AMI</code>s, attach your <code class="language-plaintext highlighter-rouge">EBS</code> volumes and then do some heavy computation. <code class="language-plaintext highlighter-rouge">EC2</code> instances come in all shapes and sizes, depending on your demands. Below is an excerpt of compute-optimized instance types, but depending on the application you might go for memory-optimized or storage-optimized instances, GPUs, you name it.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-architecture/EC2Instances.png" alt="EC2 instances" /></p>

<p>The cool thing about them - as you probably noticed already if you did the math - is that in terms of cost, it does not matter whether you pick a smaller or a larger instance. The price scales exactly linearly, meaning you don’t necessarily need to squeeze two jobs into an instance twice the size - which will become important at a later point.</p>

<h2 id="ecs---elastic-container-service">ECS - Elastic Container Service</h2>

<p>This definition, and especially its distinction from <code class="language-plaintext highlighter-rouge">AWS Batch</code>, was the hardest for me to grasp - I found the most helpful explanation <a href="https://medium.freecodecamp.org/amazon-ecs-terms-and-architecture-807d8c4960fd">here</a> and summarized it below.</p>

<p>According to Amazon,</p>
<blockquote>
  <p>Amazon Elastic Container Service (Amazon ECS) is a highly scalable, high-performance container orchestration service that supports Docker containers and allows you to easily run and scale containerized applications on AWS.</p>
</blockquote>

<p>With <code class="language-plaintext highlighter-rouge">ECS</code> you can run Docker containers on <code class="language-plaintext highlighter-rouge">EC2</code> instances whose <code class="language-plaintext highlighter-rouge">AMIs</code> come pre-installed with Docker. <code class="language-plaintext highlighter-rouge">ECS</code> handles the installation of containers and the scaling, monitoring and management of the <code class="language-plaintext highlighter-rouge">EC2</code> instances through an API or the AWS Management Console. An <code class="language-plaintext highlighter-rouge">ECS</code> instance has Docker and an <code class="language-plaintext highlighter-rouge">ECS</code> Container Agent running on it. A container instance can run many tasks; the agent handles the communication between <code class="language-plaintext highlighter-rouge">ECS</code> and the instance, reporting the status of running containers and starting new ones.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-architecture/ECS.png" alt="ECS" /></p>

<p>Several <code class="language-plaintext highlighter-rouge">ECS</code> container instances can be combined into an <code class="language-plaintext highlighter-rouge">ECS</code> cluster: Amazon ECS handles the logic of scheduling, maintaining, and handling scaling requests to these instances. It also takes away the work of finding the optimal placement of each Task based on your CPU and memory needs.</p>

<h2 id="aws-batch">AWS Batch</h2>

<p>The separation of <code class="language-plaintext highlighter-rouge">AWS Batch</code> from <code class="language-plaintext highlighter-rouge">ECS</code> was the most blurry to me. Essentially, <code class="language-plaintext highlighter-rouge">AWS Batch</code> is built on top of regular <code class="language-plaintext highlighter-rouge">ECS</code> and comes with additional features such as:</p>

<ul>
  <li>Managed compute environment: AWS handles cluster scaling in response to workload.</li>
  <li>Heterogeneous instance types: useful when outlier jobs take up large amounts of resources</li>
  <li>Spot instances: save money compared to on-demand instances</li>
  <li>Easy integration with <code class="language-plaintext highlighter-rouge">Cloudwatch</code> logs (<code class="language-plaintext highlighter-rouge">stdout</code> and <code class="language-plaintext highlighter-rouge">stderr</code> captured automatically). This can also lead to insane cost, so <strong>watch out</strong>. More on that later.</li>
</ul>

<p><code class="language-plaintext highlighter-rouge">AWS Batch</code> will effectively take care of firing up instances to handle your workload and then let <code class="language-plaintext highlighter-rouge">ECS</code> handle the Docker orchestration and job execution.</p>
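<p>Once a compute environment, job queue and job definition exist (all names below are hypothetical placeholders), submitting an individual job is a single CLI call:</p>

```shell
# Submit one job to an existing AWS Batch queue - names are made up
aws batch submit-job \
    --job-name salmon-sample1 \
    --job-queue my-batch-queue \
    --job-definition salmon-jobdef:1 \
    --container-overrides '{"command":["salmon","quant","--help"]}'
```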

<h1 id="putting-it-all-together">Putting it all together</h1>

<figure style="width: 500px" class="align-right">
  <img src="https://t-neumann.github.io/assets/images/posts/AWS-architecture/AWSArchitecture.png" alt="AWS Architecture" />
</figure>

<p>So how do all the AWS building blocks we just discussed fit together to process jobs? Let’s walk through it and conclude this post:</p>

<ul>
  <li>All jobs we want to be processed are sent to <code class="language-plaintext highlighter-rouge">AWS Batch</code>, which will assess the resources needed and fire up <code class="language-plaintext highlighter-rouge">ECS</code> instances accordingly.</li>
  <li><code class="language-plaintext highlighter-rouge">ECS</code> will take care of pulling the Docker images needed from a container registry (usually Docker hub) and fire up containers on the <code class="language-plaintext highlighter-rouge">EC2</code> instances using the pre-installed Docker daemon.</li>
  <li>These <code class="language-plaintext highlighter-rouge">EC2</code> instances have been initialized with custom <code class="language-plaintext highlighter-rouge">AMIs</code> on startup, providing all <code class="language-plaintext highlighter-rouge">ECS</code> prerequisites plus customized additions such as the <code class="language-plaintext highlighter-rouge">AWS CLI</code> and extra <code class="language-plaintext highlighter-rouge">EBS</code> volume space.</li>
  <li>All data required for this job is fetched from their long-term storage in <code class="language-plaintext highlighter-rouge">S3</code> to the local <code class="language-plaintext highlighter-rouge">EBS</code> storage of the respective <code class="language-plaintext highlighter-rouge">EC2</code> instance.</li>
</ul>

<p>Now the job has everything it needs to run and will be processed.
After reading this post, you should have a basic understanding of the AWS building blocks an AWS Batch scheduling system comprises. The next step is to actually build the architecture for such a pipeline, to which I will dedicate another comprehensive post.</p>]]></content><author><name>Tobias Neumann</name></author><category term="Pipelines" /><category term="AMI" /><category term="AWS" /><category term="Containers" /><category term="Docker" /><category term="Nextflow" /><summary type="html"><![CDATA[Resources to consider for engineering pipelines on AWS]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://t-neumann.github.io/assets/images/categories/aws.svg" /><media:content medium="image" url="https://t-neumann.github.io/assets/images/categories/aws.svg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Welcome to my website!</title><link href="https://t-neumann.github.io/general/intro/" rel="alternate" type="text/html" title="Welcome to my website!" /><published>2019-01-17T21:10:00+01:00</published><updated>2019-01-17T21:10:00+01:00</updated><id>https://t-neumann.github.io/general/intro</id><content type="html" xml:base="https://t-neumann.github.io/general/intro/"><![CDATA[<h1 id="hello-world">Hello world!</h1>

<p>I was repeatedly, gently pushed towards writing a couple of blog posts about all the obstacles I bothered people on various <a href="https://gitter.im">Gitter channels <i class="fab fa-gitter" aria-hidden="true"></i></a> with, so I finally made it happen.</p>

<p>Since I hate anything related to web development, HTML, CSS, JS - you name it - hosting Jekyll on GitHub is the most I can reasonably do. I’m actually quite happy that it requires little CSS and HTML and can be mostly put together via Markdown.</p>

<p>To glue this minimal website together, I shamelessly forked the <a href="https://github.com/mmistakes/minimal-mistakes">Minimal Mistakes <i class="fab fa-github" aria-hidden="true"></i></a> template and borrowed code from <a href="https://github.com/maxulysse/maxulysse.github.io">Maxime Garcia <i class="fab fa-github" aria-hidden="true"></i></a> for some things I liked from the blogs I looked at.</p>

<p>The plan is to put up posts here on anything related to bioinformatics, reproducible pipeline engineering, and occasionally rocket science and orbital mechanics.</p>

<p>Cheers</p>]]></content><author><name>Tobias Neumann</name></author><category term="general" /><category term="update" /><summary type="html"><![CDATA[Bioinformatician's life hacks and more]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://t-neumann.github.io/assets/images/categories/OOSS.jpg" /><media:content medium="image" url="https://t-neumann.github.io/assets/images/categories/OOSS.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>