Jekyll2023-12-14T13:13:31+01:00https://t-neumann.github.io/feed.xmlt-neumann.github.ioPersonal website of Tobias Neumann.Tobias NeumannOrbital maneuvers2019-09-08T15:29:00+02:002019-09-08T15:29:00+02:00https://t-neumann.github.io/space/OrbitalManeuvers<p>From my <a href="https://t-neumann.github.io/space/OrbitalBasics/">last post</a> you should have read up on the basics of orbits and orbital parameters. While this is interesting by itself, changing orbits and moving between orbits in order to dock to space stations, escape to different celestial bodies or de-orbit onto a body’s surface - this is why we are actually doing all of this. So this post moves on to orbital mechanics and some basic maneuvers for modifying orbits.</p>
<p>Orbital mechanics is a core discipline within space-mission design and control.
It focuses on spacecraft trajectories, including orbital maneuvers, orbital plane changes, and interplanetary transfers, and is used by mission planners to predict the results of propulsive maneuvers.</p>
<p>Now let’s pretend we have some well-funded space agency, can do anything we want and do not have to fear killing our astronauts - if only there were some simulation to do this. This is where KSP comes into play.</p>
<h2 id="vessel">Vessel</h2>
<p>We do not want to simply calculate orbits, we want an actual space ship with propulsion systems in orbit so we can see the impact of our maneuvers live. For this purpose, I have already spent endless hours creating a <i class="fab fa-github" aria-hidden="true"></i> <a href="https://github.com/t-neumann/ksp-garage">huge garage</a> of different, more or less efficient vessels for exploring the KSP universe.</p>
<p>For this particular post, I will be using my rather tiny <a href="https://en.wikipedia.org/wiki/Single-stage-to-orbit">SSTO</a> <em>SlickOrbiter</em>, consisting of 4 Rapier engines, which are hybrid engines with both air-breathing and liquid-fuel modes. These I complement with an Atomic Rocket Motor engine for space maneuvers, with far lower thrust but much higher efficiency (\(I_{SP}\)). I will definitely dedicate a couple of posts to propulsion systems, staging modes etc. at a later time; for now just take it as it is.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/slickorbiter.gif" alt="Slick orbiter" width="100%" /></p>
<h2 id="spacecraft-orientation">Spacecraft orientation</h2>
<p>Now before we perform any orbital maneuvers or burns, we need to agree on the different directions in which we can point our spacecraft and perform these burns. Naturally, since we are in 3-dimensional space, we have 3 axes along which we can orient ourselves, each axis having 2 directions.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/spacecraftorientation.png" alt="Spacecraft orientation" width="100%" /></p>
<h4 id="prograde-and-retrograde">Prograde and retrograde</h4>
<p>These vectors run along the direction in which the spacecraft is moving along its orbit.</p>
<h4 id="normal-and-anti-normal">Normal and anti-normal</h4>
<p>The normal vectors are perpendicular to the orbital plane.</p>
<h4 id="radial-in-and-radial-out">Radial in and radial out</h4>
<p>These vectors are parallel to the orbital plane, and perpendicular to the prograde vector. The radial (or radial-in) vector points inside the orbit, towards the focus of the orbit, while the anti-radial (or radial-out) vector points outside the orbit, away from the body.</p>
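<p>To make these directions concrete, here is a small sketch (in Python, not anything KSP-specific; the function name <code>maneuver_directions</code> and the example numbers are my own) deriving the three unit vectors from a craft’s position and velocity:</p>

```python
import math

def cross(u, w):
    """Cross product of two 3-vectors."""
    return (u[1] * w[2] - u[2] * w[1],
            u[2] * w[0] - u[0] * w[2],
            u[0] * w[1] - u[1] * w[0])

def unit(u):
    """Normalize a 3-vector to unit length."""
    m = math.sqrt(sum(x * x for x in u))
    return tuple(x / m for x in u)

def maneuver_directions(r, v):
    """Prograde, normal and radial-out unit vectors from the craft's
    position r (relative to the central body) and velocity v.
    Negating each gives retrograde, anti-normal and radial-in."""
    prograde = unit(v)                    # along the direction of motion
    normal = unit(cross(r, v))            # perpendicular to the orbital plane
    radial_out = cross(prograde, normal)  # in-plane, pointing away from the body
    return prograde, normal, radial_out

# circular equatorial orbit: radial-out points straight back along r
print(maneuver_directions((7_000_000.0, 0.0, 0.0), (0.0, 7_500.0, 0.0)))
```

<p>KSP of course computes these markers for you on the navball; this is only to show they are nothing more than normalized cross products.</p>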
<h2 id="orbital-maneuvers">Orbital maneuvers</h2>
<p>Ok, now it is time to make a couple of burns in these directions and see how they affect our orbital parameters. To this end we set up maneuver nodes with directional indicators as shown below.</p>
<figure class="single ">
<img src="/assets/images/posts/Maneuvers/orbitorientation.png" alt="Orbit orientation" />
<img src="/assets/images/posts/Maneuvers/directions.png" alt="Directional markers" />
<figcaption>Orbital directions and directional markers.
</figcaption>
</figure>
<p>I will go into more detail and Math about energy efficiency for those individual maneuvers in a later post; this should only give you a first glimpse and general understanding of how to move around in space.</p>
<h4 id="prograde-and-retrograde-maneuvers">Prograde and retrograde maneuvers</h4>
<p>So we are at the apoapsis of our nearly circular orbit perfectly aligned with the equatorial plane (0 degrees inclination). Let’s see what happens if we burn into prograde direction.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/progradeburn.gif" alt="Prograde burn" width="50%" /></p>
<p>As we can see, the apoapsis moves to the opposite end of our now elliptic orbit and we raised the orbit’s altitude on the opposite side.</p>
<p>What if we do a retrograde burn?</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/retrogradeburn.gif" alt="Retrograde burn" width="50%" /></p>
<p>As we can see, the periapsis on the opposing side is lowered until we go suborbital, meaning the spacecraft will deorbit on its way to periapsis and either burn up in the atmosphere or crash on the planet (unless a proper landing procedure is initiated).</p>
<p>In summary, burning prograde will increase orbital velocity, raising the altitude of the orbit on the other side, while burning retrograde will decrease velocity and reduce the orbit altitude on the other side.</p>
<p>This is the most efficient way to change the orbital shape (specifically the most common case, raising or lowering apsides), so whenever possible these vectors should be used.</p>
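<p>Why a burn at one point reshapes the opposite side of the orbit follows from the vis-viva equation, \(v^2 = \mu(2/r - 1/a)\). A small Python sketch (the function name is mine, and I plug in Earth’s gravitational parameter as an assumed stand-in - in KSP you would use Kerbin’s or Laythe’s value instead):</p>

```python
import math

MU_EARTH = 3.986004418e14  # Earth's gravitational parameter, m^3/s^2 (assumed value)

def opposite_apsis_after_burn(r, dv, mu=MU_EARTH):
    """Start on a circular orbit of radius r, burn dv prograde
    (negative dv = retrograde) and return the radius of the apsis
    on the opposite side of the orbit, via the vis-viva equation."""
    v = math.sqrt(mu / r) + dv        # circular speed plus the burn
    a = 1.0 / (2.0 / r - v * v / mu)  # vis-viva solved for the semi-major axis
    return 2.0 * a - r                # r_peri + r_apo = 2a

# 100 m/s prograde from ~400 km altitude raises the opposite side by a few hundred km
r0 = 6_771_000.0
print(opposite_apsis_after_burn(r0, 100.0) - r0)
```

<p>A negative <code>dv</code> (retrograde) lowers the opposite apsis instead, exactly as in the animations above.</p>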
<h4 id="normal-and-anti-normal-maneuvers">Normal and anti-normal maneuvers</h4>
<p>Again we are at the apoapsis of our nearly circular orbit perfectly aligned with the equatorial plane (0 degrees inclination). Let’s see what happens if we burn into normal direction.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/normalburn.gif" alt="Normal burn" width="50%" /></p>
<p>We see that the orbital inclination (the angle between the orbital and equatorial plane) changes.</p>
<p>These vectors are generally used to match the orbital inclination of another celestial body or craft, and the only time this is possible is when the current craft’s orbit intersects the orbital plane of the target - at the ascending and descending nodes. We will get to this in a second.</p>
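<p>The cost of such a plane change follows from simple vector geometry: rotating a velocity of magnitude \(v\) by an angle \(\Delta i\) requires \(\Delta v = 2 v \sin(\Delta i / 2)\). A sketch (function name mine):</p>

```python
import math

def plane_change_dv(v, delta_i_deg):
    """Delta-v for a pure inclination change of delta_i degrees,
    burned along normal/anti-normal at a node, at orbital speed v."""
    return 2.0 * v * math.sin(math.radians(delta_i_deg) / 2.0)

# plane changes are expensive: even 10 degrees at 7,700 m/s costs ~1,340 m/s
print(plane_change_dv(7700.0, 10.0))
```

<p>This is why inclination changes are best done where the craft moves slowly - for an elliptic orbit, near apoapsis.</p>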
<h4 id="radial-in-and-radial-out-maneuvers">Radial in and radial out maneuvers</h4>
<p>One last time we are at the apoapsis of our nearly circular orbit perfectly aligned with the equatorial plane (0 degrees inclination). Let’s see what happens if we burn into the radial out direction.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/radialoutburn.gif" alt="Radial out burn" width="50%" /></p>
<p>We see that the orbit starts rotating around the craft, like spinning a hula hoop with a stick. Radial burns are usually not an efficient way of adjusting one’s path - it is generally more effective to use prograde and retrograde burns.</p>
<h2 id="orbital-insertion">Orbital insertion</h2>
<p>Now let’s combine the basic orbital maneuvers of the previous section:
All the maneuvers we experimented with above are generally described (if a sufficient change of the orbital parameters is achieved) as <strong>orbit insertion</strong>, a general term for a maneuver that is more than a small correction. It may be used for a maneuver that changes a transfer orbit or an ascent orbit into a stable one, but also for changing a stable orbit into a descent. The term <strong>orbit injection</strong> is also used - which I find even cooler - especially for changing a stable orbit into a transfer orbit, e.g. trans-lunar injection (TLI), trans-Mars injection (TMI) and trans-Earth injection (TEI).</p>
<p>Stable orbits have been described in the <a href="https://t-neumann.github.io/space/OrbitalBasics/">previous post</a>, but now we want to specifically look at transfer orbits which enable us to put satellites into orbits, travel to the moon and Mars and all the fancy wonderous places in our solar system and beyond.</p>
<p>So what is a <strong>transfer orbit</strong>: In orbital mechanics a transfer orbit is an intermediate elliptical orbit that is used to move a satellite or other object from one circular, or largely circular, orbit to another.</p>
<p>There are several types of transfer orbits, which vary in their energy efficiency and speed of transfer and I will quickly go over the most famous ones.</p>
<p>Again, I will go into more detail and Math about energy efficiency for those transfer orbits in a later post; this should only give you a first glimpse and general understanding of how these orbital insertions work.</p>
<h3 id="hohmann-transfer">Hohmann transfer</h3>
<p>In orbital mechanics, the Hohmann transfer orbit is an elliptical orbit used to transfer between two circular orbits of different radii around the same body in the same plane. The Hohmann transfer orbit uses the lowest possible amount of energy in traveling between these orbits.</p>
<p>The term is also used to refer to transfer orbits between different bodies (planets, moons etc.).</p>
<p>A Hohmann transfer requires that the starting and destination points be at particular locations in their orbits relative to each other. Space missions using a Hohmann transfer must wait for this required alignment to occur, which opens a so-called launch window. For a space mission between Earth and Mars, for example, these launch windows occur every 26 months. A Hohmann transfer orbit also determines a fixed time required to travel between the starting and destination points; for an Earth-Mars journey this travel time is about 9 months.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/Hohmann_transfer_orbit.svg" alt="Hohmann transfer" width="50%" /></p>
<p>The image shows a Hohmann transfer orbit to bring a spacecraft from a lower circular orbit into a higher one. It is one half of an elliptic orbit that touches both the lower circular orbit the spacecraft wishes to leave (green and labeled 1 on diagram) and the higher circular orbit that it wishes to reach (red and labeled 3 on diagram). The transfer (yellow and labeled 2 on diagram) is initiated by firing the spacecraft’s engine to accelerate prograde so that it will follow the elliptical orbit. This adds energy to the spacecraft’s orbit. When the spacecraft has reached its destination orbit, its orbital speed (and hence its orbital energy) must be increased again to change the elliptic orbit to the larger circular one which is termed <em>circularization</em>.</p>
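<p>The two burns (and the transfer time, half the period of the transfer ellipse) can be computed in closed form from the vis-viva equation. The formulas below are the standard textbook ones; the Python wrapper is my own sketch, with \(\mu\) being the gravitational parameter of the central body:</p>

```python
import math

def hohmann_dvs(r1, r2, mu):
    """Delta-v of the two prograde burns of a Hohmann transfer between
    coplanar circular orbits of radii r1 < r2, plus the transfer time
    (half the period of the transfer ellipse)."""
    dv1 = math.sqrt(mu / r1) * (math.sqrt(2.0 * r2 / (r1 + r2)) - 1.0)  # enter the ellipse
    dv2 = math.sqrt(mu / r2) * (1.0 - math.sqrt(2.0 * r1 / (r1 + r2)))  # circularize at r2
    t = math.pi * math.sqrt((r1 + r2) ** 3 / (8.0 * mu))
    return dv1, dv2, t

# normalized units (mu = 1), transfer between radii 1 and 15
print(hohmann_dvs(1.0, 15.0, 1.0))
```

<p>Note both burns are prograde, matching the two-burn sequence shown in the KSP gifs below.</p>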
<p>Now let’s do this in KSP. To simplify everything, assume both our starting orbit and our target orbit are already circular. Let’s say we want to reach some space station orbiting Laythe at 250 km and our <em>SlickOrbiter</em> is in a stable orbit at 100 km.</p>
<p>The first thing we have to do is match orbit inclination which is best done by a normal burn at the ascending node.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/inclinationchange.gif" alt="Orbit inclination correction" width="50%" /></p>
<p>Now that our orbital planes are synchronized, we can start with the first prograde burn of the Hohmann transfer maneuver, which raises our apoapsis to the target orbit height, effectively transforming our circular orbit into an elliptic one.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/HohmannBurn1.gif" alt="Hohmann transfer apoapsis change" width="50%" /></p>
<p>Now once we have reached our transfer orbit’s apoapsis, we can circularize and match our target orbit by another prograde burn.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/HohmannBurn2.gif" alt="Hohmann transfer circularization" width="50%" /></p>
<p>There it is, we have performed our first Hohmann transfer.</p>
<h3 id="bi-elliptic-transfer">Bi-elliptic transfer</h3>
<p>The bi-elliptic transfer consists of two half-elliptic orbits and may, in certain situations, require less energy than a Hohmann transfer maneuver.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/Bi-elliptic_transfer.svg" alt="Bi-elliptic transfer" width="50%" /></p>
<p>From the initial orbit, a first prograde burn (1) boosts the spacecraft into the first transfer orbit with an apoapsis at some point away from the central body. At this point a second prograde burn (2) sends the spacecraft into the second elliptical orbit with periapsis at the radius of the final desired orbit, where a third retrograde burn (3) is performed, injecting the spacecraft into the desired orbit.</p>
<p>While it requires one more engine burn than a Hohmann transfer and generally a greater travel time, a bi-elliptic transfer requires less energy than a Hohmann transfer when the ratio of final to initial semi-major axis is 11.94 or greater, depending on the intermediate semi-major axis chosen.</p>
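<p>All three burns again follow from applying the vis-viva equation at each apsis. A sketch (function name mine; the example uses normalized units since I am not assuming any particular body’s gravitational parameter):</p>

```python
import math

def bielliptic_dvs(r1, r2, rb, mu):
    """The three burns of a bi-elliptic transfer from a circular orbit of
    radius r1 to one of radius r2, via an intermediate apoapsis rb > r2.
    All three values are returned as positive burn magnitudes."""
    def vis_viva(r, a):
        return math.sqrt(mu * (2.0 / r - 1.0 / a))
    a1 = (r1 + rb) / 2.0                         # first transfer ellipse
    a2 = (rb + r2) / 2.0                         # second transfer ellipse
    dv1 = vis_viva(r1, a1) - math.sqrt(mu / r1)  # prograde burn at r1
    dv2 = vis_viva(rb, a2) - vis_viva(rb, a1)    # prograde burn at apoapsis rb
    dv3 = vis_viva(r2, a2) - math.sqrt(mu / r2)  # retrograde burn at r2
    return dv1, dv2, dv3

# at a radius ratio of 15 (> 11.94) the bi-elliptic route is cheaper overall
print(sum(bielliptic_dvs(1.0, 15.0, 200.0, 1.0)))  # below the ~0.536 Hohmann total
```

<p>Pushing <code>rb</code> further out shrinks the second and third burns at the cost of an ever longer travel time.</p>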
<p>Now let’s do this in KSP. To simplify everything, assume both our starting orbit and our target orbit are already circular and our orbital inclinations are already matched. Again, we want to reach some space station orbiting Laythe at 250 km and our <em>SlickOrbiter</em> is in a stable orbit at 100 km.</p>
<p>We will first raise our apoapsis above the target orbit to create an elliptic orbit with a long prograde burn.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/Bi-elliptic_burn1.gif" alt="Bi-elliptic transfer apoapsis raise" width="50%" /></p>
<p>Now we wait until we have reached the new apoapsis for another prograde burn to raise our periapsis to the level of the target orbit.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/Bi-elliptic_burn2.gif" alt="Bi-elliptic transfer periapsis raise" width="50%" /></p>
<p>Finally, we perform a retrograde burn at the new periapsis to lower our apoapsis for <em>circularizing</em> our target orbit.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/Bi-elliptic_burn3.gif" alt="Bi-elliptic transfer circularization" width="50%" /></p>
<p>There it is, we have performed our first Bi-elliptic transfer.</p>
<p>Now that you have a basic overview of spacecraft orientation, burns in those directions and their impact on the spacecraft’s orbit, as well as how to combine those maneuvers into orbit insertions, we have laid the foundation to dive deeper into the energy efficiency of those maneuvers, the famous <em>delta-v</em> and the rocket equation in a later post. Until then - godspeed.</p>Square numbers proof2019-09-02T22:05:00+02:002019-09-02T22:05:00+02:00https://t-neumann.github.io/mathematics/SquareNumberZeros<p>I recently signed up for the <a href="http://www.vds-molecules-of-life.org/index.php?id=1350">MFPL PhD Selection</a> where we got some scientific tasks to solve. One involved proving some statement about <a href="https://en.wikipedia.org/wiki/Square_number">square numbers</a> right or wrong.</p>
<h2 id="question">Question</h2>
<blockquote>
<p>Is any of the integer numbers, A, consisting of exactly 15 ones and 15 zeros a square-number, that is an integer B exists, such that B*B=A? The number A should always have 30 digits and also numbers with leading zeros are considered. Please explain your answer. A simple YES or NO is not sufficient.</p>
</blockquote>
<h2 id="probing-the-statement-approach">Probing the statement approach</h2>
<p>I’m definitely no Maths genius, so the first thing I did was to randomly build some numbers with 15 1s and 15 0s and calculate their square roots to get a feeling.</p>
<p>Here I already stumbled upon some misleading results, since for bigger numbers - such as any number with a minimum of 15 digits - the Apple calculator, <a href="https://www.r-project.org/">R</a> and Google tend to round and switch to scientific notation, making you believe you are looking at square numbers.</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">></span><span class="w"> </span><span class="n">a</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">000000000000001111111111111110</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">a</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">1.111111e+15</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="n">a</span><span class="p">)</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">33333333</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="m">33333333</span><span class="o">*</span><span class="m">33333333</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">1.111111e+15</span><span class="w">
</span></code></pre></div></div>
<p><img src="https://t-neumann.github.io/assets/images/posts/SquareNumberProof/googlecalculator.png" alt="Google calculator" width="50%" /></p>
<p>As you can see, both R and Google calculator would make you believe \(33333333^2\) yields \(1111111111111110\) when in fact it does not - cross-checked with Apple calculator.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/SquareNumberProof/applecalculator.png" alt="Apple calculator" width="50%" /></p>
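<p>Working in a language with arbitrary-precision integers sidesteps the rounding problem entirely. A quick cross-check (in Python rather than R, purely because Python’s integers never round or overflow):</p>

```python
import math

a = 1_111_111_111_111_110   # our candidate A, with its leading zeros dropped
b = 33_333_333

print(b * b)                    # 1111111088888889 - not a at all
print(b * b == a)               # False
print(math.isqrt(a) ** 2 == a)  # False: a has no integer square root
```

<p>The exact product differs from <code>a</code> already in the eighth digit - the scientific notation just hid that.</p>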
<p>So, after some detour, I had already pretty quickly found an example proving the statement above wrong - which by itself is sufficient to disprove the initial statement - but I wanted a little more sophistication.</p>
<p>I decided to take a rather lazy approach of reading up on properties of square numbers on <a href="https://en.wikipedia.org/wiki/Square_number">Wikipedia</a> and see whether any of them proves to be an easy no go. I came across the following:</p>
<ol>
<li>No square number ends in 2, 3, 7 or 8.</li>
<li>The number of zeros at the end of a perfect square is always even.</li>
<li>Squares of even numbers are always even numbers and square of odd numbers are always odd.</li>
<li>The Square of a natural number other than one is either a multiple of 3 or exceeds a multiple of 3 by 1.</li>
<li>The Square of a natural number other than one is either a multiple of 4 or exceeds a multiple of 4 by 1.</li>
<li>The unit’s digit of the square of a natural number is the unit’s digit of the square of the digit at unit’s place of the given natural number.</li>
<li>There are no natural numbers \(p\) and \(q\) such that \(p^2 = 2q^2\).</li>
<li>For every natural number \(n\),
\((n + 1)^2 - n^2 = (n + 1) + n\).</li>
<li>For any natural number \(m\) greater than 1,
\((2m, m^2 - 1, m^2 + 1)\) is a Pythagorean triplet.</li>
</ol>
<p>So let’s just quickly go through them:</p>
<p><strong>Property 1</strong> does not really help because we can only construct numbers ending at 0 and 1, both apparently valid digits for square numbers.</p>
<p><strong>Property 2</strong> - we already hit the jackpot. Since we can freely distribute 0s in our numbers, it is trivial to create one with an odd number of zeros at the end.</p>
<p>Alrighty, let’s formalize it.</p>
<h2 id="proof-square-numbers-ending-in-zeros-strictly-end-with-an-even-number-of-zeros">Proof: Square numbers ending in zeros strictly end with an even number of zeros</h2>
<blockquote>
<p>Theorem: Square numbers ending in zeros strictly end with an even number of zeros.</p>
</blockquote>
<p>(1) Let \(k\) be an integer \(k \in \mathbb{Z}\) with \(k \geq 0\).</p>
<p>(2) Let \(n\) be any number ending in \(0\): \(n = 10k\).</p>
<p>(3) The perfect square of \(n\) equals \(n^2 = (10k)^2 = 100k^2\)</p>
<p>From (3) it directly follows that each trailing zero of \(n\) contributes two trailing zeros to \(n^2\); and since the square of a number not ending in \(0\) never ends in \(0\) (by property 6, the last digit of a square is the last digit of \(1^2, \ldots, 9^2\), none of which is \(0\)), any square number with ending zeros strictly ends with a multiple of 2 - therefore an even number - of zeros.</p>
<p>We have proved the theorem and can therefore use it to probe for counter-examples given the properties in our initial question.</p>
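<p>The theorem is also easy to sanity-check numerically; a brute-force sketch in Python (helper name mine):</p>

```python
def trailing_zeros(n):
    """Number of zeros at the end of n's decimal representation."""
    s = str(n)
    return len(s) - len(s.rstrip("0"))

# every perfect square ends in an even number of zeros (0, 2, 4, ...)
print(all(trailing_zeros(n * n) % 2 == 0 for n in range(1, 100_000)))  # True
```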
<h2 id="disprove-statement-by-counterexample">Disprove statement by counterexample</h2>
<p>It is trivial to find a number \(m\) with an odd number of trailing zeros and 15 additional 1s.</p>
<p>Simplest example:</p>
\[m = 1111111111111110\]
\[\sqrt{m} = \sqrt{1111111111111110} = 33333333.33333331\dot 6\]
<p>Therefore it follows that the question</p>
<blockquote>
<p>Is any of the integer numbers, A, consisting of exactly 15 ones and 15 zeros a square-number, that is an integer B exists, such that B*B=A?</p>
</blockquote>
<p>can be answered with <strong>No</strong>:</p>
<blockquote>
<p>No integer number A consisting of exactly 15 ones and 15 zeros is a square-number; that is, no integer B exists such that B*B=A.</p>
</blockquote>
<p>Strictly speaking, the counterexample above only rules out one particular A. That <em>none</em> of these numbers is a square follows from divisibility: every such A has digit sum 15, so it is divisible by 3 but not by 9 - yet if A = B*B, then 3 must divide B and hence 9 must divide A, a contradiction.</p>Orbital basics2019-08-26T13:42:00+02:002019-08-26T13:42:00+02:00https://t-neumann.github.io/space/OrbitalBasics<p>I was always fascinated by rockets, space in general and zero-gravity environments; however, the Maths involved always seemed too complex for me. Through the playful and still complex approach of <a href="https://www.kerbalspaceprogram.com/">Kerbal Space Program</a> (KSP) - an awesome game I totally recommend to anybody remotely interested in space exploration - I picked up interest again lately and started reading into orbital mechanics, propulsion systems and related topics in more detail.</p>
<p>This blog series is dedicated to summarising basic concepts at a definitely super-simplified, probably sometimes oversimplified and not entirely correct level.</p>
<p>The easiest concept for me to grasp, since one can explore it quite interactively in KSP, is the concept of orbits and orbital changes through orbital maneuvers.</p>
<p>So this very first post of this series will cover my basic understanding of the concept of orbits.</p>
<h2 id="ellipse">Ellipse</h2>
<p>Let’s start off by refreshing our memory of what an ellipse is - because that is what most orbits relevant for this blog series will look like. In mathematical terms, an ellipse is a plane curve surrounding two focal points (\(F_1\) and \(F_2\)), such that for all points on the curve, the sum of the two distances \(d(F_1) + d(F_2)\) is constant.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/Orbits/Ellipse-definition.png" alt="Ellipse definition" width="50%" /></p>
<p>It is a generalization of a circle, where the two focal points are the same. Yes, also circular orbits exist.</p>
<h3 id="ellipse-parameters">Ellipse parameters</h3>
<p>There are a few important parameters describing an ellipse which will be referred to throughout this blog series, so make sure you memorize and understand them, because they will keep popping up again and again.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/Orbits/Ellipse-param.png" alt="Ellipse parameters" width="50%" /></p>
<h6 id="semi-major-and-semi-minor-axes-a-geq-b">Semi-major and semi-minor axes \(a \geq b\)</h6>
<p>\(a\) is referred to as the semi-major axis and \(b\) as the semi-minor axis, with \(a \geq b > 0\).</p>
<h6 id="linear-eccentricity-c">Linear eccentricity \(c\)</h6>
<p>This is the distance from the center to any of the two foci: \(c = \sqrt{a^2 - b^2}\).</p>
<h6 id="eccentricity-e">Eccentricity \(e\)</h6>
<p>The eccentricity is expressed as:</p>
\[e = \frac{c}{a} = \sqrt{1 - (\frac{b}{a})^{2}}\]
<p>assuming \(a > b\). An ellipse with equal axes \((a = b)\) has zero eccentricity and is a circle.</p>
<h6 id="semi-latus-rectum-l">Semi-latus rectum \(l\)</h6>
<p>The length of the chord through one of the foci, perpendicular to the major axis, is called the latus rectum. One half of it is the semi-latus rectum \(l\). A calculation shows:</p>
\[l = \frac{b^2}{a} = a(1-e^2)\]
<p>The semi-latus rectum \(l\) is equal to the radius of curvature of the osculating circles at the vertices.</p>
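<p>These relations are compact enough to sketch directly in code (Python; the function name is my own):</p>

```python
import math

def ellipse_params(a, b):
    """Linear eccentricity c, eccentricity e and semi-latus rectum l
    of an ellipse with semi-axes a >= b > 0."""
    c = math.sqrt(a * a - b * b)
    e = c / a          # equivalently sqrt(1 - (b/a)**2)
    l = b * b / a      # equivalently a * (1 - e**2)
    return c, e, l

print(ellipse_params(5.0, 3.0))  # (4.0, 0.8, 1.8)
print(ellipse_params(2.0, 2.0))  # (0.0, 0.0, 2.0) - a circle
```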
<h2 id="orbit">Orbit</h2>
<p>Now probably everybody has some idea what an orbit is, but before going into details, let’s first summarise the definitions I found on the web.</p>
<h4 id="definition">Definition</h4>
<p>In physics, an orbit is the gravitationally curved trajectory of an object, like the trajectory of a planet around a star or a satellite around earth. Unless mentioned differently, in this blogpost orbit refers to a regularly repeating trajectory, but there are also non-repeating trajectories. To a close approximation, planets and satellites follow elliptic orbits, with the central mass being orbited at one of the two focal points of the ellipse, as described by <a href="https://en.wikipedia.org/wiki/Kepler%27s_laws_of_planetary_motion">Kepler’s laws of planetary motion</a>.</p>
<p>This post will stick to the classical Newtonian mechanics paradigm of describing orbital motion, which is an adequate approximation for most situations. However, Einstein’s general theory of relativity, which accounts for gravity as due to curvature of spacetime, with orbits following geodesics, provides a more accurate calculation and understanding of the exact mechanics of orbital motion, which is needed near very massive bodies (e.g. Mercury’s orbit around the sun) or for extreme precision (as for GPS satellites).</p>
<h4 id="understanding-orbits">Understanding orbits</h4>
<p>There are two factors involved in understanding orbits:</p>
<ul>
<li>Gravity pulling an object from its straight path into a curved path</li>
<li>The velocity at which this object is trying to travel along its path</li>
</ul>
<p><img src="https://t-neumann.github.io/assets/images/posts/Orbits/tangentialvelocity.jpg" alt="Tangential velocity vs gravity" width="50%" /></p>
<p>This principle is illustrated above, where gravity from a massive body in the center (green) pulls an object travelling on a straight path (pink object, black arrows), effectively bending the path with its constant pull (red) around the central body.</p>
<p>Another way to illustrate how orbits develop is the thought experiment of <a href="https://en.wikipedia.org/wiki/Newton%27s_cannonball">Newton’s cannonball</a>. Here, we visualize a cannon on top of a very high mountain which can fire at any imaginable speed.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/Orbits/Newton_Cannon.png" alt="Newton cannon" width="50%" /></p>
<p>If the cannon fires its ball with a low initial speed, the trajectory of the ball curves downward and hits the ground <strong>(A)</strong>. As the firing speed is increased, the cannonball hits the ground farther <strong>(B)</strong> away from the cannon, because while the ball is still falling towards the ground, the ground is increasingly curving away from it (see first point, above). All these motions are actually “orbits” in a technical sense – they are describing a portion of an elliptical path around the center of gravity – but the orbits are interrupted by striking the Earth. The horizontal speed for both <strong>(A)</strong> and <strong>(B)</strong> is 0 - 7,000 m/s for Earth.</p>
<p>If the cannonball is fired with sufficient speed, the ground curves away from the ball at least as much as the ball falls – so the ball never strikes the ground. It is now in what could be called a non-interrupted, or circumnavigating, orbit. For any specific combination of height above the center of gravity and mass of the planet, there is one specific firing speed (unaffected by the mass of the ball, which is assumed to be very small relative to the Earth’s mass) that produces a circular orbit, as shown in <strong>(C)</strong>.</p>
<p>As the firing speed is increased beyond this, non-interrupted elliptic orbits are produced; one is shown in <strong>(D)</strong>. If the initial firing is above the surface of the Earth as shown, there will also be non-interrupted elliptical orbits at slower firing speed; these will come closest to the Earth at the point half an orbit beyond, and directly opposite the firing point, below the circular orbit. The horizontal speed for both <strong>(C)</strong> and <strong>(D)</strong> ranges from 7,300 to 10,000 m/s for Earth.</p>
<p>At a specific horizontal firing speed called escape velocity, dependent on the mass of the planet, an open orbit <strong>(E)</strong> is achieved that has a parabolic path. At even greater speeds the object will follow a range of hyperbolic trajectories. In a practical sense, both of these trajectory types mean the object is “breaking free” of the planet’s gravity, and “going off into space” never to return. This involves any horizontal speed > 10,000 m/s for Earth.</p>
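<p>That specific speed comes from equating kinetic energy with the gravitational potential energy, \(v_{esc} = \sqrt{2\mu/r}\). A small sketch (the Earth constants are assumed textbook values; the cannon’s mountain top would shift \(r\) slightly):</p>

```python
import math

MU_EARTH = 3.986004418e14  # Earth's gravitational parameter, m^3/s^2 (assumed)
R_EARTH = 6.371e6          # Earth's mean radius, m (assumed)

def escape_velocity(mu, r):
    """Speed at distance r from the body's center at which an object
    breaks free on a parabolic path: v = sqrt(2 * mu / r)."""
    return math.sqrt(2.0 * mu / r)

print(escape_velocity(MU_EARTH, R_EARTH))  # about 11,186 m/s at the surface
```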
<figure class="third ">
<img src="/assets/images/posts/Orbits/Newtonsmountainv=0.gif" alt="Newton's cannon v=0" />
<img src="/assets/images/posts/Orbits/Newtonsmountainv=6000.gif" alt="Newton's cannon v=6000" />
<img src="/assets/images/posts/Orbits/Newtonsmountainv=7300.gif" alt="Newton's cannon v=7300" />
<img src="/assets/images/posts/Orbits/Newtonsmountainv=8000.gif" alt="Newton's cannon v=8000" />
<img src="/assets/images/posts/Orbits/Newtonsmountainv=10000.gif" alt="Newton's cannon v=10000" />
<figcaption>Various firing speeds of Newton’s cannon and the resulting trajectory.
</figcaption>
</figure>
<p>This leads to four practical classes of moving objects:</p>
<ol>
<li>No orbit</li>
<li>
<p>Suborbital trajectories</p>
<ul>
<li>Range of interrupted elliptical paths</li>
</ul>
</li>
<li>
<p>Orbital trajectories</p>
<ul>
<li>Range of elliptical paths with closest point opposite firing point</li>
<li>Circular path</li>
<li>Range of elliptical paths with closest point at firing point</li>
</ul>
</li>
<li>
<p>Open (escape) trajectories</p>
<ul>
<li>Parabolic paths</li>
<li>Hyperbolic paths</li>
</ul>
</li>
</ol>
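<p>Using the rough Earth speed ranges quoted above, the classes can be summarized in a tiny sketch (the function and its thresholds are purely illustrative - the exact boundaries depend on the firing altitude):</p>

```python
def trajectory_class(v):
    """Very rough classification of a horizontally fired cannonball on
    Earth, using the approximate speed ranges quoted above (m/s)."""
    if v < 7_000:
        return "suborbital - interrupted elliptical path"
    if v < 10_000:
        return "orbital - circular or non-interrupted elliptical path"
    return "escape - parabolic or hyperbolic path"

print(trajectory_class(6_000))   # suborbital
print(trajectory_class(7_300))   # orbital
print(trajectory_class(11_000))  # escape
```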
<h4 id="apsis">Apsis</h4>
<p>The first two terms I learned about in KSP were the two apsides - probably because a lot of orbital maneuvers happen at those and they are pretty simple to comprehend.</p>
<p>Apsis denotes either of the two extreme points (i.e., the farthest or nearest point) in the orbit of a planetary body about its primary body.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/Orbits/apsis.png" alt="Apsis" width="50%" /></p>
<p>There are two apsides in any elliptic orbit. Each is named by selecting the appropriate prefix: apo- or peri-, and then joining it to the reference suffix of the “host” body being orbited. The general form is <strong>apoapsis</strong> (see figure above (1)) for the farthest point and <strong>periapsis</strong> (see top figure (2)) for the nearest point. Depending on what central body is orbited, these become apogee and perigee for objects orbiting earth, aphelion and perihelion for objects orbiting the sun, etc.</p>
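<p>Both apsides follow directly from the ellipse parameters of the previous section: \(r_{peri} = a(1-e)\) and \(r_{apo} = a(1+e)\), measured from the focus the central body sits in. A one-line sketch (function name mine):</p>

```python
def apsides(a, e):
    """Periapsis and apoapsis distances of an elliptic orbit with
    semi-major axis a and eccentricity 0 <= e < 1."""
    return a * (1.0 - e), a * (1.0 + e)

peri, apo = apsides(10_000.0, 0.1)
print(peri, apo)  # 9000.0 11000.0
```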
<h4 id="orbital-elements">Orbital elements</h4>
<p>Orbital elements are the parameters required to uniquely identify a specific orbit. In celestial mechanics, usually a Kepler orbit is used. A real orbit changes over time due to gravitational perturbations by other objects and relativistic effects, so a Keplerian orbit is merely an idealized, mathematical approximation at a particular time.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/Orbits/orbitalelements.png" alt="Orbital elements" width="50%" /></p>
<p>An orbit is generally defined by six elements (known as Keplerian elements) that can be computed from position and velocity:</p>
<p>Two define the size and shape of the trajectory (compare with <a href="###Ellipse">ellipse parameters</a>):</p>
<ul>
<li>
<p>Semimajor axis \(a\)</p>
</li>
<li>
<p>Eccentricity \(e\)</p>
</li>
</ul>
<p>Two elements define the orientation of the orbital plane in which the ellipse is embedded:</p>
<ul>
<li>
<p>Inclination \(i\) - vertical tilt of the ellipse with respect to the reference plane (for the Earth, e.g., the equatorial plane), measured at the ascending node, which is where the orbit passes upward through the reference plane. The tilt angle is measured perpendicular to the line of intersection between the orbital plane and the reference plane.</p>
</li>
<li>
<p>Longitude of the ascending node \(\Omega\) - horizontally orients the ascending node of the ellipse with respect to the reference frame’s vernal point (♈).</p>
</li>
</ul>
<p>I found it pretty hard at first to wrap my head around what the vernal point (♈) actually is - naturally, it is some arbitrary reference point used to fix the angle of the ascending node \(\Omega\). The vernal point (♈) is one of the two equinoxes, namely the one occurring in spring in the northern hemisphere. It is regarded as the instant of time when the plane of the Earth’s equator passes through the center of the Sun, so at the equator the Sun’s rays hit the Earth perpendicularly, directly from the zenith. After passing the vernal point, the northern hemisphere receives more light - summer is here; before the vernal point, the northern hemisphere received less light - winter was coming. The reverse holds for the southern hemisphere.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/Orbits/vernalpoint.png" alt="Vernal point" width="100%" /></p>
<p>The two remaining elements are as follows:</p>
<ul>
<li>
<p>Argument of periapsis \(\omega\) defines the orientation of the ellipse in the orbital plane. It is measured as the angle from the ascending node to the periapsis.</p>
</li>
<li>
<p>True anomaly (\(\nu\), \(\theta\), or \(f\)) at epoch (\(M_0\)) defines the position of the orbiting body along the ellipse at a specific time (the “epoch”). The true anomaly is an angular parameter defining the angle between the direction of the periapsis and the current position of the orbiting body.</p>
</li>
</ul>
<p>Epoch sounds pretty sophisticated, but it is basically just a moment in time used as a reference point for some time-varying astronomical quantity, like the true anomaly. Still sounds complicated?</p>
<p>Let’s look at some unit indicating a specific epoch: J2000.</p>
<p>The \(J\) unit refers to Julian years, which are intervals with the length of a mean year in the Julian calendar, i.e. 365.25 days. This interval measure does not itself define any epoch: the Gregorian calendar is in general use for dating. Thus “J2000” refers to the instant of 12:00 TT (noon) on January 1, 2000.</p>
<p>Now an arbitrary Julian epoch is therefore related to the Julian date by</p>
\[J = 2000 + \frac{\text{Julian date} - 2451545.0}{365.25}\]
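<p>The conversion is simple enough to check in a couple of lines of Python (a small illustration, not part of the original post):</p>

```python
def julian_epoch(jd):
    """Julian epoch J from a Julian date jd, per the formula above."""
    return 2000.0 + (jd - 2451545.0) / 365.25

# J2000 is by definition the Julian date 2451545.0 (12:00 TT on 2000-01-01),
# and one Julian year (365.25 days) later the epoch reads J2001.0.
print(julian_epoch(2451545.0))           # 2000.0
print(julian_epoch(2451545.0 + 365.25))  # 2001.0
```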
<p>So in a sense everybody has a feeling for an epoch, because we also structure our lives around reference points and set up meetings for certain “epochs” every day.</p>
<h4 id="orbital-period">Orbital period</h4>
<p>The orbital period is simply how long an orbiting body takes to complete one orbit.</p>
<h4 id="ellipse-vs-orbits">Ellipse vs orbits</h4>
<p>For elliptical orbits, some formulas from ellipses are directly related.</p>
<p>Let \(e\) be the eccentricity, \(r_a\) the radius of the apoapsis, \(r_p\) the radius of the periapsis and \(a\) the length of the semi-major axis. Then:</p>
\[e = \frac{r_a - r_p}{r_a + r_p} = \frac{r_a - r_p}{2a}\]
\[r_a = (1 + e)a\]
\[r_p = (1 - e)a\]
<p>Interestingly, the semi-major axis \(a\) is the arithmetic mean, the semi-minor axis \(b\) the geometric mean and the semi-latus rectum \(l\) the harmonic mean of \(r_a\) and \(r_p\):</p>
\[a = \frac{r_a + r_p}{2}\]
\[b = \sqrt{r_{a}r_{p}}\]
\[l = \frac{2}{\frac{1}{r_a} + \frac{1}{r_p}} = \frac{2r_{a}r_{p}}{r_a + r_p}\]
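<p>These relations are easy to verify numerically. The snippet below (an illustration with made-up apsis radii, not from the post) computes \(e\), \(a\), \(b\) and \(l\) from \(r_a\) and \(r_p\) and cross-checks them against the apsis formulas:</p>

```python
import math

# Apoapsis and periapsis radii for an illustrative elliptical orbit (km).
r_a, r_p = 10000.0, 6000.0

e = (r_a - r_p) / (r_a + r_p)        # eccentricity
a = (r_a + r_p) / 2                  # semi-major axis: arithmetic mean
b = math.sqrt(r_a * r_p)             # semi-minor axis: geometric mean
l = 2 * r_a * r_p / (r_a + r_p)      # semi-latus rectum: harmonic mean

# Cross-check against r_a = (1 + e) a and r_p = (1 - e) a:
assert math.isclose(r_a, (1 + e) * a)
assert math.isclose(r_p, (1 - e) * a)

print(e, a, l)  # e = 0.25, a = 8000.0, l = 7500.0 (b is roughly 7746)
```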
<h4 id="orbits-in-ksp">Orbits in KSP</h4>
<p>Now this post should leave you with a basic idea of what an orbit is, how it is defined and what the important parameters are for specifying orbits and positioning a moving object in a given orbit. As a little teaser for the next post, where we will be talking about basic orbital maneuvers and mechanics, here is a first screenshot from KSP of a random orbit. What can you tell from it?</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/Orbits/ksp-orbital-parameters.png" alt="KSP orbits" width="100%" /></p>
<p>Given what I have told you, you should be able to spot that it is a circular orbit (eccentricity = 0, or apoapsis \(\approx\) periapsis) and that its orbital plane is perfectly aligned with the equatorial plane of the central body (inclination = 0).</p>
<p>Now you should be equipped with the basic toolset for the next post, where we will be modifying orbital parameters with maneuvers.</p>Tobias NeumannI was always fascinated by rockets, space in general and zero-gravity environments, however the maths involved always seemed too complex for me. However, through the playful and still complex approach of Kerbal Space Program (KSP) - an awesome game I totally recommend to anybody remotely interested in space exploration - I lately picked up interest again and started reading into orbital mechanics, propulsion systems and related topics in more detail.Pipelines on AWS2019-08-25T21:51:00+02:00https://t-neumann.github.io/pipelines/AWS-pipeline<p>The prerequisite for this post is that you have a sound understanding of Nextflow and made yourself familiar with the <code class="language-plaintext highlighter-rouge">salmon-nf</code> workflow created in <a href="https://t-neumann.github.io/pipelines/Nextflow-pipeline/">this post</a>. Furthermore, you should know all the essential AWS building blocks and the basic architecture of an AWS-based batch scheduler as presented in my <a href="https://t-neumann.github.io/pipelines/AWS-architecture/">previous post</a>. In this post, I will show you what environment and resources you actually have to set up on AWS to make the <a href="https://github.com/t-neumann/salmon-nf"><code class="language-plaintext highlighter-rouge">salmon-nf</code></a> example pipeline run, and then how to run jobs on the resulting AWS Batch queue with <a href="https://www.nextflow.io/">Nextflow</a>.</p>
<h2 id="credits">Credits</h2>
<p>Many people have done a great job setting up tutorials and blog posts on this topic, and I would like to acknowledge a few that helped me a lot in actually making my AWS pipelines happen:</p>
<ul>
<li><a href="https://maxulysse.github.io/">Maxime Garcia</a> and his great blog</li>
<li><a href="https://apeltzer.github.io/">Alex Peltzer</a></li>
<li><a href="https://github.com/pditommaso">Paolo Di Tommaso</a> for Nextflow and Gitter support</li>
</ul>
<p>There are a couple of tutorials that helped a lot:</p>
<ul>
<li><a href="https://www.nextflow.io/docs/latest/awscloud.html#aws-batch">Nextflow documentation</a></li>
<li><a href="https://www.nextflow.io/blog/2017/scaling-with-aws-batch.html">Nextflow blog</a></li>
</ul>
<h2 id="prerequisites">Prerequisites</h2>
<h3 id="accounts-users-roles-permissions">Accounts, users, roles, permissions</h3>
<p>Some things have to be set up before creating the actual AWS compute environment - obvious things like an <code class="language-plaintext highlighter-rouge">AWS account</code>, but also an <code class="language-plaintext highlighter-rouge">IAM user</code> and <code class="language-plaintext highlighter-rouge">Service roles</code>. All of this only has to be done once and is already exhaustively covered in several blog posts, such as <a href="https://apeltzer.github.io/post/01-aws-nfcore/">this one</a> by Alex Peltzer and Tobias Koch. Therefore, I will not spend any time on it and suggest you simply follow the instructions in that post until it is time to set up your <code class="language-plaintext highlighter-rouge">AMI</code>, which is where I will start off.</p>
<h2 id="step-1-estimate-resource-requirements">Step 1: Estimate resource requirements</h2>
<p>Appropriate resource allocation is crucial for setting up AWS workflows that are both cost-efficient and high-throughput. Therefore, I strongly advise you to take a big enough test dataset, run it in a limitless test environment - hopefully many of you have some kind of in-house HPC cluster - and use the resulting measurements of resource consumption to find optimal storage, memory and CPU sizes.</p>
<p>Conveniently, Nextflow workflows can be easily executed both on <code class="language-plaintext highlighter-rouge">AWS</code> but also in your local HPC environment by simply defining additional <a href="https://www.nextflow.io/docs/latest/config.html#config-profiles">profiles</a> for the scheduler of your choice.</p>
<p>Here is one example of a simple <code class="language-plaintext highlighter-rouge">SLURM</code> profile:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">singularity</span> <span class="o">{</span>
<span class="n">enabled</span> <span class="o">=</span> <span class="kc">true</span>
<span class="o">}</span>
<span class="n">docker</span> <span class="o">{</span>
<span class="n">enabled</span> <span class="o">=</span> <span class="kc">false</span>
<span class="o">}</span>
<span class="n">process</span> <span class="o">{</span>
<span class="n">executor</span> <span class="o">=</span> <span class="err">'</span><span class="n">slurm</span><span class="err">'</span>
<span class="n">clusterOptions</span> <span class="o">=</span> <span class="err">'</span><span class="o">--</span><span class="n">qos</span><span class="o">=</span><span class="kt">short</span><span class="err">'</span>
<span class="n">cpus</span> <span class="o">=</span> <span class="err">'</span><span class="mi">12</span><span class="err">'</span>
<span class="n">memory</span> <span class="o">=</span> <span class="o">{</span> <span class="mi">8</span><span class="o">.</span><span class="na">GB</span> <span class="o">*</span> <span class="n">task</span><span class="o">.</span><span class="na">attempt</span> <span class="o">}</span>
<span class="o">}</span>
<span class="n">params</span> <span class="o">{</span>
<span class="n">salmonIndex</span> <span class="o">=</span> <span class="err">'</span><span class="o">/</span><span class="n">groups</span><span class="o">/</span><span class="nc">Software</span><span class="o">/</span><span class="n">indices</span><span class="o">/</span><span class="n">hg38</span><span class="o">/</span><span class="n">salmon</span><span class="o">/</span><span class="n">gencode</span><span class="o">.</span><span class="na">v28</span><span class="o">.</span><span class="na">IMPACT</span><span class="err">'</span>
<span class="o">}</span>
</code></pre></div></div>
<p>As you can see, usually <code class="language-plaintext highlighter-rouge">HPC</code> environments do not allow Docker containers to run, but support <a href="https://singularity.lbl.gov/">Singularity</a> containers which can be <a href="https://singularity.lbl.gov/docs-build-container#downloading-a-existing-container-from-docker-hub">easily built from Docker containers</a>.</p>
<p>The <code class="language-plaintext highlighter-rouge">process</code> section basically defines the scheduler, resources and the job queue in which the processes should run. Finally, the index files are usually stored in some globally accessible directory, similar to the <code class="language-plaintext highlighter-rouge">s3</code> storage on <code class="language-plaintext highlighter-rouge">AWS</code>.</p>
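<p>To give an idea of where we are heading: the <code class="language-plaintext highlighter-rouge">SLURM</code> profile above translates into an <code class="language-plaintext highlighter-rouge">awsbatch</code> profile along the following lines. This is only a sketch - the queue name, region, S3 work directory and CLI path are placeholders you would substitute with the values created later in this post:</p>

```groovy
process {
    executor = 'awsbatch'
    queue    = 'salmonWorkload'    // AWS Batch job queue (placeholder name)
    cpus     = 6
    memory   = { 8.GB * task.attempt }
}
aws {
    region = 'eu-west-1'           // placeholder: your AWS region
    batch {
        // path of the AWS CLI installed on the custom AMI (placeholder)
        cliPath = '/home/ec2-user/miniconda/bin/aws'
    }
}
workDir = 's3://my-bucket/work'    // placeholder S3 bucket for intermediates
```

<p>Switching between the HPC and AWS setups is then just a matter of selecting the profile on the command line.</p>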
<p>Now that we are set, Nextflow has this neat option flag <code class="language-plaintext highlighter-rouge">-with-report</code> that gives you a very <a href="https://www.nextflow.io/docs/latest/tracing.html#execution-report">comprehensive overview</a> of the resources your processes consumed during execution.</p>
<p>Below are the most important excerpts of an example report from when I ran my Nextflow workflow on 1,222 breast cancer datasets from <a href="https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga">TCGA</a>:</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/nextflowreport_CPU.png" alt="Nextflow CPU consumption" /></p>
<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/nextflowreport_memory.png" alt="Nextflow memory consumption" />
<img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/nextflowreport_time.png" alt="Nextflow time duration" /></p>
<p>On average a single task ran on <strong>6 threads</strong>, consumed <strong>8 GB of memory</strong> and ran <strong>2:30 minutes</strong> - this is the rough framework of resources we will have to consider when allocating resources and choosing appropriate <code class="language-plaintext highlighter-rouge">EC2</code> instances.</p>
<h2 id="step-2-creating-a-suitable-ami">Step 2: Creating a suitable AMI</h2>
<p>I found the setup and configuration of suitable <code class="language-plaintext highlighter-rouge">AMIs</code> to be the most demanding step when creating an environment to run a pipeline on <code class="language-plaintext highlighter-rouge">AWS</code>. Several things have to be considered:</p>
<ul>
<li>Base image: It has to be <code class="language-plaintext highlighter-rouge">ECS</code>-compatible</li>
<li><code class="language-plaintext highlighter-rouge">EBS</code> storage: The attached volumes have to be large enough to contain all input, index, temporary and output files</li>
<li><code class="language-plaintext highlighter-rouge">AWS CLI</code>: The <code class="language-plaintext highlighter-rouge">AMI</code> has to contain the <code class="language-plaintext highlighter-rouge">AWS CLI</code>, otherwise no files can be fetched from or copied to <code class="language-plaintext highlighter-rouge">S3</code> from the <code class="language-plaintext highlighter-rouge">EBS</code> volume</li>
<li>An <code class="language-plaintext highlighter-rouge">AMI</code> cannot be reused with less <code class="language-plaintext highlighter-rouge">EBS</code> storage than it was created with (more is possible)</li>
</ul>
<p>This section covers how you can set up your <code class="language-plaintext highlighter-rouge">AMI</code> for a given task of your pipeline and what to consider on the way.</p>
<h3 id="choose-an-amazon-machine-image-ami">Choose an Amazon Machine Image (AMI)</h3>
<p>As a first step, we want to make sure to pick a base image that supports <code class="language-plaintext highlighter-rouge">ECS</code> from the AWS Market Place. I strongly advise you to use one of the <code class="language-plaintext highlighter-rouge">Amazon ECS-Optimized Amazon Linux AMI</code> images.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/AMI-Choose-AMI.png" alt="Choose AMI" /></p>
<h3 id="choose-an-instance-type">Choose an Instance Type</h3>
<p>The <code class="language-plaintext highlighter-rouge">EC2</code> instance we use to create our custom <code class="language-plaintext highlighter-rouge">AMI</code> does not need to be powerful, since we won’t run any jobs on it. Therefore, a <code class="language-plaintext highlighter-rouge">t2.micro</code> instance is more than sufficient.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/AMI-Choose-Instance.png" alt="Choose Instance" /></p>
<h3 id="configure-instance-details">Configure Instance Details</h3>
<p>The instance configuration can mostly be left at the defaults. However, I would strongly advise you to set the shutdown behaviour to <code class="language-plaintext highlighter-rouge">terminate</code>; otherwise attached volumes are kept persistent and you continue to pay unless you explicitly terminate the instance manually. I actually ran up huge costs ($300) when misconfiguring this, so <strong>watch out!</strong></p>
<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/AMI-Configure-Instance.png" alt="Configure Instance" /></p>
<h3 id="add-storage">Add storage</h3>
<p>This is the single most important point of the entire <code class="language-plaintext highlighter-rouge">AMI</code> setup process - here you define the <strong>minimum</strong> amount of attached storage for your <code class="language-plaintext highlighter-rouge">AMI</code>. This storage <strong>must</strong> be large enough to contain <strong>all</strong> input and index files for a given task as well as <strong>all</strong> temporary and final output files produced during the computation. I hope you did some thorough benchmarking and extrapolation of resources on your input dataset.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/AMI-Add-Storage.png" alt="Add Storage" /></p>
<h3 id="add-tags">Add tags</h3>
<p>Unless you want to add optional tags, nothing to do here…</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/AMI-Add-Tags.png" alt="Add Tags" /></p>
<h3 id="configure-security-group">Configure Security Group</h3>
<p>Before firing up your instance, you need to configure the associated security group. For me, letting AWS create the security group worked perfectly fine; still, double-check that you can connect to the <code class="language-plaintext highlighter-rouge">EC2</code> instance - in case of doubt, set the source to <code class="language-plaintext highlighter-rouge">0.0.0.0/0</code>, even though probably all IT security experts will kill me for that. Now you are ready to <strong>launch the instance</strong>.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/AMI-Security-Group.png" alt="Security Group" /></p>
<h3 id="ssh-connect-to-instance">SSH connect to instance</h3>
<p>Now right-click and hit <em>Connect</em> to get the <code class="language-plaintext highlighter-rouge">ssh</code> command for connecting to your instance. You might have to change the default <code class="language-plaintext highlighter-rouge">root</code> user to <code class="language-plaintext highlighter-rouge">ec2-user</code>.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/AMI-SSH.png" alt="AMI SSH connect" /></p>
<h3 id="adjust-docker-container-size-to-ebs">Adjust Docker container size to EBS</h3>
<p>The first thing we want to check once connected to our instance is that the Docker configuration reflects the amount of attached EBS storage.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>ec2-user@ip-172-31-40-128 ~]<span class="nv">$ </span>docker info | <span class="nb">grep</span> <span class="nt">-i</span> data
Data Space Used: 309.3MB
Data Space Total: 42.42GB
Data Space Available: 42.11GB
Metadata Space Used: 4.833MB
Metadata Space Total: 46.14MB
Metadata Space Available: 41.3MB
</code></pre></div></div>
<p>In the above example we see that Docker is indeed configured for the specified 40 GB EBS data volume.</p>
<p>By default, the maximum storage size of a single Docker container is 10 GB - independent of the data space available - so we have to adjust this.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>ec2-user@ip-172-31-40-128 ~]<span class="nv">$ </span>docker info | <span class="nb">grep</span> <span class="nt">-i</span> base
Base Device Size: 10.74GB
</code></pre></div></div>
<p>To this end, we have to extend the file <code class="language-plaintext highlighter-rouge">/etc/sysconfig/docker-storage</code> with the parameter <code class="language-plaintext highlighter-rouge">--storage-opt dm.basesize=40GB</code> and restart the Docker service.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>vi /etc/sysconfig/docker-storage
</code></pre></div></div>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">DOCKER_STORAGE_OPTIONS</span><span class="o">=</span><span class="s2">"--storage-driver devicemapper --storage-opt dm.thinpooldev=/dev/mapper/docker-docker--pool --storage-opt dm.use_deferred_removal=true --storage-opt dm.use_deferred_deletion=true --storage-opt dm.fs=ext4 --storage-opt dm.use_deferred_deletion=true --storage-opt dm.basesize=40GB"</span>
</code></pre></div></div>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>ec2-user@ip-172-31-40-128 ~]<span class="nv">$ </span><span class="nb">sudo </span>service docker restart
Stopping docker: <span class="o">[</span> OK <span class="o">]</span>
Starting docker: <span class="nb">.</span> <span class="o">[</span> OK <span class="o">]</span>
</code></pre></div></div>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>ec2-user@ip-172-31-40-128 ~]<span class="nv">$ </span>docker info | <span class="nb">grep</span> <span class="nt">-i</span> base
Base Device Size: 42.95GB
</code></pre></div></div>
<h3 id="install-aws-cli">Install AWS CLI</h3>
<p><code class="language-plaintext highlighter-rouge">Nextflow</code> requires the <code class="language-plaintext highlighter-rouge">AWS CLI</code> to copy files such as input files and indices from and output files to <code class="language-plaintext highlighter-rouge">S3</code>.</p>
<p>Use the following lines to add it to your <code class="language-plaintext highlighter-rouge">AMI</code>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>yum <span class="nb">install</span> <span class="nt">-y</span> bzip2 wget
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh <span class="nt">-b</span> <span class="nt">-f</span> <span class="nt">-p</span> <span class="nv">$HOME</span>/miniconda
<span class="nv">$HOME</span>/miniconda/bin/conda <span class="nb">install</span> <span class="nt">-c</span> conda-forge <span class="nt">-y</span> awscli
<span class="nb">rm </span>Miniconda3-latest-Linux-x86_64.sh
</code></pre></div></div>
<p>Give it a quick spin to see whether everything is ok.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>ec2-user@ip-172-31-40-128 ~]<span class="nv">$ </span>./miniconda/bin/aws <span class="nt">--version</span>
aws-cli/1.16.121 Python/3.7.1 Linux/4.14.94-73.73.amzn1.x86_64 botocore/1.12.111
</code></pre></div></div>
<h3 id="save-your-ami">Save your AMI</h3>
<p>Now you can go back to your <code class="language-plaintext highlighter-rouge">EC2</code> instance dashboard and save your <code class="language-plaintext highlighter-rouge">AMI</code> by right clicking and going for <code class="language-plaintext highlighter-rouge">Image->Create Image</code>.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/AMI-Create-AMI.png" alt="Create AMI" /></p>
<p><strong>Congratulations</strong> you have created your first <code class="language-plaintext highlighter-rouge">AMI</code>!</p>
<p>Don’t forget to terminate the running <code class="language-plaintext highlighter-rouge">EC2</code> instance from which you created the <code class="language-plaintext highlighter-rouge">AMI</code> to prevent any further <code class="language-plaintext highlighter-rouge">EBS</code> and <code class="language-plaintext highlighter-rouge">EC2</code> costs.</p>
<h2 id="step-3-creating-compute-environments-and-job-queues">Step 3: Creating compute environments and job queues</h2>
<p>Now it is time to create appropriate compute environments and their corresponding job queues. I usually like to create a baseline <em>workload</em> queue that handles most of the jobs, providing the resources estimated in Step 1, and an <em>excess</em> queue with much larger resources that handles the few jobs that overflow the <em>workload</em> resources, so that the entire batch is still processed successfully.</p>
<h3 id="overview">Overview</h3>
<p>First, we want to create a new compute environment upon which we can base job queues. For this, go to the <code class="language-plaintext highlighter-rouge">AWS Batch</code> dashboard -> <code class="language-plaintext highlighter-rouge">Compute Environments</code>.</p>
<p>I have already created some production environments, for you this overview will probably be empty. Then go to <code class="language-plaintext highlighter-rouge">Create Environment</code>.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/ComputeEnvironment_Overview.png" alt="Compute environment overview" /></p>
<h3 id="naming-roles-and-permissions">Naming, roles and permissions</h3>
<p>First, we want a <code class="language-plaintext highlighter-rouge">managed</code> environment so that <code class="language-plaintext highlighter-rouge">AWS Batch</code> can do configuration and scaling for us. Next, we name our compute environment; I chose to create the <code class="language-plaintext highlighter-rouge">workload</code> compute environment first, thus naming it <code class="language-plaintext highlighter-rouge">salmonWorkload</code>. Then we simply select the service and instance roles as well as the keypairs we created earlier in the <code class="language-plaintext highlighter-rouge">prerequisite</code> section - there should be only one option to choose from.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/ComputeEnvironment_Names.png" alt="Compute environment naming" /></p>
<h3 id="some-words-on-instance-types-and-vcpu-limits">Some words on instance types and vCPU limits</h3>
<p>In my opinion, this part is <strong>the most crucial part</strong> of setting up an optimal environment both in terms of computation and cost efficiency. <strong>So pay special attention here!</strong></p>
<p>First of all, I hope you did a good enough job in Step 1 of estimating your resource requirements <strong>per task</strong>.</p>
<p>These are the punchlines you have to consider now for fixing instance types and vCPU limits for your compute environment:</p>
<h4 id="fit-only-1-task-in-1-instance">Fit only <strong>1</strong> task in <strong>1</strong> instance!</h4>
<p>If you look at the instance pricing table, you will see that prices scale linearly with instance size - doubling the resources results in double the price. You will not save anything by running more jobs on a single larger instance, but you may well pay for it: from experience, the Docker daemon on the instance sometimes gets confused and hangs when multiple tasks run on the same instance.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-architecture/EC2Instances.png" alt="EC2 instances" /></p>
<h4 id="vcpus-refers-to-the-total-number-of-vcpus-of-your-environments">vCPUs refers to the total number of vCPUs of your environments</h4>
<p>This also got me confused when trying to figure out how many instances would be fired up in total. Essentially, you divide this number by the number of vCPUs provided by your instance type of choice to get the number of instances launched at peak times.</p>
<p>So let’s say you chose <code class="language-plaintext highlighter-rouge">c5.2xlarge</code> as your instance type with 8 vCPUs and your specified <code class="language-plaintext highlighter-rouge">Maximum vCPUs</code> is 100; then 100 / 8 = 12.5, so at most 12 instances will be launched if the entire compute environment is utilized.</p>
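<p>In other words (a trivial sketch, not from the post), the number of concurrently running instances is the vCPU limit divided by the vCPUs per instance, rounded down:</p>

```python
def peak_instances(max_vcpus, vcpus_per_instance):
    """Instances AWS Batch can launch at full utilization of the vCPU limit."""
    return max_vcpus // vcpus_per_instance

# c5.2xlarge provides 8 vCPUs; a 100-vCPU limit therefore caps the environment
# at 12 concurrent instances (the remaining 4 vCPUs cannot fit another one).
print(peak_instances(100, 8))  # 12
```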
<h4 id="keep-some-spare-memory-for-instance-services">Keep some spare memory for instance services</h4>
<p>I will address this in detail later, but keep in mind that not the entire memory listed in the instance type specification can be used, since some of it is occupied by basic instance services.</p>
<h4 id="keep-homogeneous-compute-environments">Keep homogeneous compute environments</h4>
<p>Since we did a careful resource-requirement estimation, I find it easiest - both for keeping track of costs and for ensuring that tasks actually finish - to have homogeneous compute environments, meaning one environment only allows one specific instance type.</p>
<h3 id="specifying-instance-types-and-vcpu-limits">Specifying instance types and vCPU limits</h3>
<p>Now let’s put it all together. First up, let’s quickly recap the resource requirements we had per Salmon task:</p>
<ul>
<li>We need an instance to provide 8 GB of memory to fit index + data</li>
<li>If we run our tasks on a 6 thread instance, it will run 2:30 mins</li>
</ul>
<p>Now if we check the instance type table, we find there are actually 2 instance types that would cover these requirements:</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/ComputeEnvironment_InstanceResearch.png" alt="Potential instance types" /></p>
<p>The <code class="language-plaintext highlighter-rouge">c5.xlarge</code> comes with 8 GB of memory and 4 vCPUs, the <code class="language-plaintext highlighter-rouge">c5.2xlarge</code> with double the memory and vCPUs. So in principle, we could fit one average task into the smaller instance, but remember that the overhead of services running on the instance effectively reduces those 8 GB, and that these requirements are averages - anything above average would fail to run on such an instance. Therefore, we should definitely go for a <code class="language-plaintext highlighter-rouge">c5.2xlarge</code> here.</p>
<ul>
<li>Choose <code class="language-plaintext highlighter-rouge">c5.2xlarge</code> as your only instance type and delete <code class="language-plaintext highlighter-rouge">optimal</code></li>
<li>Set <code class="language-plaintext highlighter-rouge">Minimum vCPUs</code> and <code class="language-plaintext highlighter-rouge">Desired vCPUs</code> both to 0 to have no idle running instances in background</li>
<li>Tick the <code class="language-plaintext highlighter-rouge">Enable user-specified Ami ID</code>, copy the <code class="language-plaintext highlighter-rouge">AMI ID</code> from the <code class="language-plaintext highlighter-rouge">AMI</code> we created and validate it</li>
</ul>
<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/ComputeEnvironment_Resources.png" alt="Compute environment resources" /></p>
<p>Everything else you can leave empty and click <code class="language-plaintext highlighter-rouge">Create</code>.</p>
<p>Congratulations, you have created your first compute environment!</p>
<h2 id="step-4-creating-job-queues">Step 4: Creating job queues</h2>
<p>Now we need to create a job queue and associate it with our compute environment. This step is actually pretty easy and straightforward.</p>
<p>First go to <code class="language-plaintext highlighter-rouge">Job queues</code> and click <code class="language-plaintext highlighter-rouge">Create Queue</code>.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/JobQueue_Overview.png" alt="Job queue overview" /></p>
<p>Now you can pick a name for your job queue - in our simple case I give it the same name as our compute environment, <code class="language-plaintext highlighter-rouge">salmonWorkload</code>. You can in principle assign multiple job queues to one compute environment and set priorities via the <code class="language-plaintext highlighter-rouge">Priority</code> field, but we can simply put <code class="language-plaintext highlighter-rouge">1</code> in there.</p>
<p>Finally, associate the job queue with our <code class="language-plaintext highlighter-rouge">salmonWorkload</code> compute environment. Note again that you can in principle assign multiple compute environments to a given job queue.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/JobQueue_Create.png" alt="Job queue creation" /></p>
<p>That’s it - click <code class="language-plaintext highlighter-rouge">Create job queue</code> and you have successfully created your first job queue!</p>
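<p>For the record, the console clicks above also map to a single CLI call - again as a print-only sketch (set <code class="language-plaintext highlighter-rouge">DRY_RUN=0</code> to actually create the queue):</p>

```shell
#!/usr/bin/env bash
# Print-only sketch; flip DRY_RUN to 0 to really call AWS.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

run aws batch create-job-queue \
  --job-queue-name salmonWorkload \
  --state ENABLED \
  --priority 1 \
  --compute-environment-order order=1,computeEnvironment=salmonWorkload
```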
<h3 id="excess-queue">Excess queue</h3>
<p>Now that we have our workload compute environment and job queue, we want to do the same for our excess compute environment and job queue to handle any datasets with overshooting resource requirements.</p>
<p>Therefore, we repeat the steps starting from Step 3 to create a <code class="language-plaintext highlighter-rouge">salmonExcess</code> compute environment and job queue based on <code class="language-plaintext highlighter-rouge">c5.4xlarge</code> instances with double the resources compared to our <code class="language-plaintext highlighter-rouge">salmonWorkload</code> queue.</p>
<p>This should now leave you with the following compute environments and job queues, and you are finally ready to specify our final resource constraints before submitting our first jobs.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/TwoQueue_environments.png" alt="Two queue environments" /></p>
<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/TwoQueue_jobqueues.png" alt="Two queue job queues" /></p>
<h2 id="step-5-adjusting-resources">Step 5: Adjusting resources</h2>
<p>Ok, now that we have set up all the compute environments with associated instance types as well as the job queues on the <code class="language-plaintext highlighter-rouge">AWS</code> end, we know what resources we have available and how much of them will be consumed by our tasks.</p>
<h3 id="resource-definition">Resource definition</h3>
<p>Naïvely, we can directly enter the specifications of our <code class="language-plaintext highlighter-rouge">EC2</code> instance type of choice in the <code class="language-plaintext highlighter-rouge">awsbatch.config</code> file of our <code class="language-plaintext highlighter-rouge">salmon-nf</code> Nextflow workflow, since we know that the <code class="language-plaintext highlighter-rouge">salmonWorkload</code> queue consists of <code class="language-plaintext highlighter-rouge">c5.2xlarge</code> instances with 16 GB memory and 8 vCPUs each, and our <code class="language-plaintext highlighter-rouge">salmonExcess</code> queue of <code class="language-plaintext highlighter-rouge">c5.4xlarge</code> instances with 32 GB memory and 16 vCPUs each.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">aws</span><span class="o">.</span><span class="na">region</span> <span class="o">=</span> <span class="err">'</span><span class="n">eu</span><span class="o">-</span><span class="n">central</span><span class="o">-</span><span class="mi">1</span><span class="err">'</span>
<span class="n">aws</span><span class="o">.</span><span class="na">client</span><span class="o">.</span><span class="na">storageEncryption</span> <span class="o">=</span> <span class="err">'</span><span class="no">AES256</span><span class="err">'</span>
<span class="n">executor</span><span class="o">.</span><span class="na">name</span> <span class="o">=</span> <span class="err">'</span><span class="n">awsbatch</span><span class="err">'</span>
<span class="n">executor</span><span class="o">.</span><span class="na">awscli</span> <span class="o">=</span> <span class="err">'</span><span class="o">/</span><span class="n">home</span><span class="o">/</span><span class="n">ec2</span><span class="o">-</span><span class="n">user</span><span class="o">/</span><span class="n">miniconda</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">aws</span><span class="err">'</span>
<span class="n">process</span> <span class="o">{</span>
<span class="n">queue</span> <span class="o">=</span> <span class="o">{</span> <span class="n">task</span><span class="o">.</span><span class="na">attempt</span> <span class="o">></span> <span class="mi">1</span> <span class="o">?</span> <span class="err">'</span><span class="n">salmonExcess</span><span class="err">'</span> <span class="o">:</span> <span class="err">'</span><span class="n">salmonWorkload</span><span class="err">'</span> <span class="o">}</span>
<span class="n">memory</span> <span class="o">=</span> <span class="o">{</span> <span class="n">task</span><span class="o">.</span><span class="na">attempt</span> <span class="o">></span> <span class="mi">1</span> <span class="o">?</span> <span class="mi">32</span><span class="o">.</span><span class="na">GB</span> <span class="o">:</span> <span class="mi">16</span><span class="o">.</span><span class="na">GB</span> <span class="o">}</span>
<span class="n">cpus</span> <span class="o">=</span> <span class="o">{</span> <span class="n">task</span><span class="o">.</span><span class="na">attempt</span> <span class="o">></span> <span class="mi">1</span> <span class="o">?</span> <span class="mi">16</span> <span class="o">:</span> <span class="mi">8</span> <span class="o">}</span>
<span class="o">}</span>
<span class="n">params</span> <span class="o">{</span>
<span class="n">salmonIndex</span> <span class="o">=</span> <span class="err">'</span><span class="nl">s3:</span><span class="c1">//obenauflab/indices/salmon/gencode.v28.IMPACT'</span>
<span class="o">}</span>
</code></pre></div></div>
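<p>Note that the two closures in the <code class="language-plaintext highlighter-rouge">process</code> block encode a simple escalation: attempt 1 goes to <code class="language-plaintext highlighter-rouge">salmonWorkload</code> with the small resource set, and any retry goes to <code class="language-plaintext highlighter-rouge">salmonExcess</code> with the doubled one (in Nextflow, <code class="language-plaintext highlighter-rouge">task.attempt</code> only exceeds 1 if failed tasks are retried, i.e. with <code class="language-plaintext highlighter-rouge">errorStrategy = 'retry'</code>). The decision logic, written out as a small shell sketch with made-up function names purely for illustration:</p>

```shell
# Illustration of the escalation encoded by the Nextflow closures above;
# the function names are made up for this sketch.
select_queue()  { [ "$1" -gt 1 ] && echo "salmonExcess" || echo "salmonWorkload"; }
select_memory() { [ "$1" -gt 1 ] && echo "32 GB" || echo "16 GB"; }
select_cpus()   { [ "$1" -gt 1 ] && echo 16 || echo 8; }

for attempt in 1 2; do
  echo "attempt $attempt -> $(select_queue "$attempt"), $(select_memory "$attempt"), $(select_cpus "$attempt") vCPUs"
done
# attempt 1 -> salmonWorkload, 16 GB, 8 vCPUs
# attempt 2 -> salmonExcess, 32 GB, 16 vCPUs
```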
<p>Now let’s quickly fast-forward and look at what happens if we submit our jobs like this.</p>
<p>You will notice that we have one runnable job for each task, yet no instances will fire up.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/Resources_Overflow.png" alt="Resource overflow" /></p>
<p>If we check one of the jobs, we will see that the resource requirements have been set exactly as we specified in our Nextflow config, which is also matched by the instance types of our job queue - so why does this not work?</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/Resources_OverflowJob.png" alt="Job overflow" /></p>
<h3 id="ecs-overhead-extraction">ECS overhead extraction</h3>
<p>The reason is that there are <strong>overhead container services</strong> running on your instance which consume a chunk of your total available memory. So when you ask for X GB of memory on an instance with X GB total memory, you have to be aware that Y GB of memory is preoccupied by service tasks, so your effective available memory will be X-Y.</p>
<p>To get your jobs running on such instances, you therefore cannot request X GB of memory, but only the X-Y chunk. How do we determine Y?</p>
<p>Let’s first fire up an instance of our compute environment by simply selecting our compute environment and clicking on <code class="language-plaintext highlighter-rouge">Edit</code>.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/Resources_edit.png" alt="Edit compute environment" /></p>
<p>Now set both minimum and desired vCPUs to 1 to fire up one instance of the compute environment and click <code class="language-plaintext highlighter-rouge">Save</code>.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/Resources_vCPUs.png" alt="Select 1 vCPU" /></p>
<p>Wait a couple of minutes to let the <code class="language-plaintext highlighter-rouge">EC2</code> instance fire up, then again click on your compute environment. Follow the link given in <code class="language-plaintext highlighter-rouge">ECS Cluster name</code>.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/Resources_ECS.png" alt="Follow ECS" /></p>
<p>This will bring you to the cluster overview page, where you need to click on <code class="language-plaintext highlighter-rouge">ECS instances</code>.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/Resources_Cluster.png" alt="ECS overview" /></p>
<p>Now finally we get what we want - the actual amount of memory available on a given instance on this <code class="language-plaintext highlighter-rouge">ECS</code> cluster.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/Resources_ActualMemory.png" alt="Factual available memory" /></p>
<p>According to the ECS tab, we have <strong>15,434 MB</strong> of memory available on our <code class="language-plaintext highlighter-rouge">salmonWorkload</code> queue - repeat the same procedure to get the number for our <code class="language-plaintext highlighter-rouge">salmonExcess</code> queue.</p>
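<p>The arithmetic behind the final numbers is then trivial: subtract what the <code class="language-plaintext highlighter-rouge">ECS</code> tab reports from the instance’s nominal memory to get the overhead Y, and round the reported value down a little for the request. A quick sketch with the <code class="language-plaintext highlighter-rouge">c5.2xlarge</code> numbers from above (16 GiB nominal = 16,384 MiB):</p>

```shell
# Numbers for a c5.2xlarge: nominal memory vs. what the ECS instances tab reports.
TOTAL_MIB=16384
ECS_AVAILABLE_MIB=15434
OVERHEAD_MIB=$((TOTAL_MIB - ECS_AVAILABLE_MIB))
# round down to the nearest 100 MiB for the value we put into awsbatch.config
REQUEST_MIB=$((ECS_AVAILABLE_MIB / 100 * 100))
echo "overhead: ${OVERHEAD_MIB} MiB"   # overhead: 950 MiB
echo "request:  ${REQUEST_MIB} MiB"    # request:  15400 MiB
```

<p>The resulting 15,400 MiB is the value that ends up in the updated resource definition.</p>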
<h3 id="updated-resource-definition">Updated resource definition</h3>
<p>Having obtained the mysterious actual available memory X-Y on the <code class="language-plaintext highlighter-rouge">EC2</code> instances of our compute environments, we can finally enter the final numbers in the <code class="language-plaintext highlighter-rouge">awsbatch.config</code> of our <code class="language-plaintext highlighter-rouge">salmon-nf</code> Nextflow pipeline.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">aws</span><span class="o">.</span><span class="na">region</span> <span class="o">=</span> <span class="err">'</span><span class="n">eu</span><span class="o">-</span><span class="n">central</span><span class="o">-</span><span class="mi">1</span><span class="err">'</span>
<span class="n">aws</span><span class="o">.</span><span class="na">client</span><span class="o">.</span><span class="na">storageEncryption</span> <span class="o">=</span> <span class="err">'</span><span class="no">AES256</span><span class="err">'</span>
<span class="n">executor</span><span class="o">.</span><span class="na">name</span> <span class="o">=</span> <span class="err">'</span><span class="n">awsbatch</span><span class="err">'</span>
<span class="n">executor</span><span class="o">.</span><span class="na">awscli</span> <span class="o">=</span> <span class="err">'</span><span class="o">/</span><span class="n">home</span><span class="o">/</span><span class="n">ec2</span><span class="o">-</span><span class="n">user</span><span class="o">/</span><span class="n">miniconda</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">aws</span><span class="err">'</span>
<span class="n">process</span> <span class="o">{</span>
<span class="n">queue</span> <span class="o">=</span> <span class="o">{</span>
<span class="n">task</span><span class="o">.</span><span class="na">attempt</span> <span class="o">></span> <span class="mi">1</span> <span class="o">?</span> <span class="err">'</span><span class="n">salmonExcess</span><span class="err">'</span> <span class="o">:</span> <span class="err">'</span><span class="n">salmonWorkload</span><span class="err">'</span> <span class="o">}</span>
<span class="n">memory</span> <span class="o">=</span> <span class="o">{</span> <span class="n">task</span><span class="o">.</span><span class="na">attempt</span> <span class="o">></span> <span class="mi">1</span> <span class="o">?</span> <span class="mi">31100</span><span class="o">.</span><span class="na">MB</span> <span class="o">:</span> <span class="mi">15400</span><span class="o">.</span><span class="na">MB</span> <span class="o">}</span>
<span class="n">cpus</span> <span class="o">=</span> <span class="o">{</span> <span class="n">task</span><span class="o">.</span><span class="na">attempt</span> <span class="o">></span> <span class="mi">1</span> <span class="o">?</span> <span class="mi">16</span> <span class="o">:</span> <span class="mi">8</span> <span class="o">}</span>
<span class="o">}</span>
<span class="n">params</span> <span class="o">{</span>
<span class="n">salmonIndex</span> <span class="o">=</span> <span class="err">'</span><span class="nl">s3:</span><span class="c1">//obenauflab/indices/salmon/gencode.v28.IMPACT'</span>
<span class="o">}</span>
</code></pre></div></div>
<p>Finally, we are ready to test-drive our <code class="language-plaintext highlighter-rouge">salmon-nf</code> Nextflow pipeline on our AWS job queue!</p>
<h2 id="step-6-running-jobs-with-aws-batch">Step 6: Running jobs with AWS Batch</h2>
<p>Alright, now things are getting serious - just a little more preparation is needed to finally run our <code class="language-plaintext highlighter-rouge">salmon-nf</code> Nextflow pipeline on <code class="language-plaintext highlighter-rouge">AWS</code>:</p>
<ul>
<li>Upload our index file to <code class="language-plaintext highlighter-rouge">s3</code></li>
<li>Upload our input <code class="language-plaintext highlighter-rouge">fastq</code> files to <code class="language-plaintext highlighter-rouge">s3</code></li>
<li>Launch a submission <code class="language-plaintext highlighter-rouge">EC2</code> instance for running our <code class="language-plaintext highlighter-rouge">salmon-nf</code> Nextflow pipeline</li>
<li>Enter credentials</li>
<li>Go!</li>
</ul>
<h3 id="upload-files-to-s3">Upload files to <code class="language-plaintext highlighter-rouge">s3</code></h3>
<p>To upload files to <code class="language-plaintext highlighter-rouge">s3</code>, I recommend using the <a href="https://aws.amazon.com/cli/">AWS CLI</a>.</p>
<p>For installation, just follow the instructions. Afterwards, it is important to expose the <code class="language-plaintext highlighter-rouge">AWS credentials</code> you obtained when creating your <code class="language-plaintext highlighter-rouge">IAM</code> user to Nextflow, which can be done in <a href="https://www.nextflow.io/docs/latest/awscloud.html#aws-credentials">2 ways</a>:</p>
<ol>
<li>Exporting the default <code class="language-plaintext highlighter-rouge">AWS</code> environment variables</li>
</ol>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">AWS_DEFAULT_REGION</span><span class="o">=</span><REGION IDENTIFIER>
<span class="nb">export </span><span class="nv">AWS_ACCESS_KEY_ID</span><span class="o">=</span><YOUR S3 ACCESS KEY>
<span class="nb">export </span><span class="nv">AWS_SECRET_ACCESS_KEY</span><span class="o">=</span><YOUR S3 SECRET KEY>
</code></pre></div></div>
<ol start="2">
<li>Specify your credentials in the Nextflow configuration file</li>
</ol>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">aws</span> <span class="o">{</span>
<span class="n">region</span> <span class="o">=</span> <span class="err">'</span><span class="o"><</span><span class="no">REGION</span> <span class="no">IDENTIFIER</span><span class="o">></span><span class="err">'</span>
<span class="n">accessKey</span> <span class="o">=</span> <span class="err">'</span><span class="o"><</span><span class="no">YOUR</span> <span class="no">S3</span> <span class="no">ACCESS</span> <span class="no">KEY</span><span class="o">></span><span class="err">'</span>
<span class="n">secretKey</span> <span class="o">=</span> <span class="err">'</span><span class="o"><</span><span class="no">YOUR</span> <span class="no">S3</span> <span class="no">SECRET</span> <span class="no">KEY</span><span class="o">></span><span class="err">'</span>
<span class="o">}</span>
</code></pre></div></div>
<p>I personally prefer option 1 to not accidentally commit and push any of my credentials to my Nextflow Github repo.</p>
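<p>Either way, it is worth failing fast if one of the variables is missing before you start uploading or launching anything. A tiny helper - my own, not part of any tool - assuming the standard <code class="language-plaintext highlighter-rouge">AWS</code> variable names:</p>

```shell
#!/usr/bin/env bash
# Returns non-zero and prints the missing variable names if credentials are not set.
check_aws_env() {
  local v missing=0
  for v in AWS_DEFAULT_REGION AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; do
    if [ -z "${!v}" ]; then
      echo "missing: $v" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

<p>Call <code class="language-plaintext highlighter-rouge">check_aws_env || exit 1</code> at the top of any submission script.</p>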
<p>Now we can upload our fastq files to our target destination in our <code class="language-plaintext highlighter-rouge">s3</code> bucket, assuming you are in the directory where your <code class="language-plaintext highlighter-rouge">fastq</code> files are stored. Note that with <code class="language-plaintext highlighter-rouge">aws s3 cp</code>, an <code class="language-plaintext highlighter-rouge">--include</code> filter only has an effect after a preceding <code class="language-plaintext highlighter-rouge">--exclude</code>, so we first exclude everything and then re-include the <code class="language-plaintext highlighter-rouge">fastq</code> files:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aws s3 <span class="nb">cp</span> <span class="nb">.</span> s3://obenauflab/fastq <span class="nt">--recursive</span> <span class="nt">--exclude</span> <span class="s2">"*"</span> <span class="nt">--include</span> <span class="s2">"*.fq.gz"</span>
</code></pre></div></div>
<p>Repeat the same with your index files to their <code class="language-plaintext highlighter-rouge">s3</code> bucket destination, and all the files we need for running <code class="language-plaintext highlighter-rouge">salmon-nf</code> are ready. You can view them via numerous clients; I used <a href="https://cyberduck.io/">Cyberduck</a> for Mac. Below you can see that my 40 test samples and index files have been uploaded to the appropriate locations in my <code class="language-plaintext highlighter-rouge">s3</code> bucket.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/S3_fastqs.png" alt="S3 fastq file location" /></p>
<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/S3_index.png" alt="S3 index file location" /></p>
<h3 id="launch-and-prepare-your-submission-instance">Launch and prepare your submission instance</h3>
<p>Finally, we need some machine to run our Nextflow master process on, which submits jobs to the <code class="language-plaintext highlighter-rouge">AWS Batch</code> queues. You can of course do this locally on your machine or as a long-running job in your HPC environment.</p>
<p>But for heavy, long-running workloads it definitely makes sense to run the Nextflow process on a dedicated instance to not run into trouble.</p>
<p>Fortunately, we only need a very minimal <code class="language-plaintext highlighter-rouge">EC2</code> instance for this, which is available from <code class="language-plaintext highlighter-rouge">AWS</code> under the so-called <code class="language-plaintext highlighter-rouge">Free Tier</code> - meaning it’s free, yay!</p>
<p>So this is what we will do - first go to your <code class="language-plaintext highlighter-rouge">EC2</code> dashboard and select <code class="language-plaintext highlighter-rouge">Launch Instance</code>.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/Launch_EC2Dashboard.png" alt="EC2 Dashboard" /></p>
<p>Next up, we have to select the <code class="language-plaintext highlighter-rouge">AMI</code> we want to run on our instance. I have already pre-created a <code class="language-plaintext highlighter-rouge">Nextflow AMI</code>, which is simply an <code class="language-plaintext highlighter-rouge">AMI</code> created as in Step 2 with <a href="http://www.oracle.com/technetwork/java/javase/downloads/index.html">Java 8</a> and <a href="https://www.nextflow.io/docs/latest/getstarted.html#installation">Nextflow</a> installed in addition.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/Launch_NextflowAMI.png" alt="Nextflow AMI" /></p>
<p>For the instance type, make sure to select something labeled as <code class="language-plaintext highlighter-rouge">Free Tier eligible</code> to not run into any costs for this instance, e.g. <code class="language-plaintext highlighter-rouge">t2.micro</code> in the example below. Then just hit <code class="language-plaintext highlighter-rouge">Review and Launch</code> and then <code class="language-plaintext highlighter-rouge">Launch</code> the instance.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/Launch_EC2Instance.png" alt="Nextflow EC2 instance" /></p>
<p>Finally, make sure to launch it with a keypair that you also have downloaded, otherwise you will be unable to connect to the instance.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/Launch_KeyPair.png" alt="Nextflow keypair" /></p>
<p>Also, give your master instance a name, since many more instances will be launched once we fire up our <code class="language-plaintext highlighter-rouge">salmon-nf</code> Nextflow pipeline on our <code class="language-plaintext highlighter-rouge">AWS Batch</code> compute environment.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/Launch_Name.png" alt="Nextflow EC2 naming" /></p>
<p>Now connect to the instance as shown in Step 2. From there, we can pull our <code class="language-plaintext highlighter-rouge">salmon-nf</code> Nextflow pipeline.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>ec2-user@ip-172-31-38-222 ~]<span class="nv">$ </span>nextflow pull t-neumann/salmon-nf
Checking t-neumann/salmon-nf ...
downloaded from https://github.com/t-neumann/salmon-nf.git - revision: 6ac6e6a15a <span class="o">[</span>master]
<span class="o">[</span>ec2-user@ip-172-31-38-222 ~]<span class="err">$</span>
</code></pre></div></div>
<p>Next up, don’t forget to export your <code class="language-plaintext highlighter-rouge">AWS</code> credentials again.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>ec2-user@ip-172-31-38-222 ~]<span class="nv">$ </span><span class="nb">export </span><span class="nv">AWS_DEFAULT_REGION</span><span class="o">=</span><REGION IDENTIFIER>
<span class="o">[</span>ec2-user@ip-172-31-38-222 ~]<span class="nv">$ </span><span class="nb">export </span><span class="nv">AWS_ACCESS_KEY_ID</span><span class="o">=</span><YOUR S3 ACCESS KEY>
<span class="o">[</span>ec2-user@ip-172-31-38-222 ~]<span class="nv">$ </span><span class="nb">export </span><span class="nv">AWS_SECRET_ACCESS_KEY</span><span class="o">=</span><YOUR S3 SECRET KEY>
</code></pre></div></div>
<p>Now there is only <strong>1 last crucial</strong> step before we can actually launch our jobs on the <code class="language-plaintext highlighter-rouge">AWS Batch</code> queue: We have to create <a href="https://docs.aws.amazon.com/batch/latest/userguide/job_definitions.html">job definitions</a>. Luckily for us, Nextflow will <a href="https://www.nextflow.io/docs/latest/awscloud.html#custom-job-definition">automatically create job definitions</a> for us upon the first launch of a pipeline.</p>
<p>However, I found that job definitions will only be properly created if the initial run contains very few samples. So <strong>always do your initial run on a SINGLE sample!</strong> If you don’t, your Nextflow submission will be stuck at the following step:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>ec2-user@ip-172-31-38-222 ~]<span class="nv">$ </span>nextflow run t-neumann/salmon-nf <span class="nt">--inputDir</span> s3://obenauflab/fastq <span class="nt">--outputDir</span> s3://obenauflab/salmon <span class="nt">-profile</span> awsbatch <span class="nt">-w</span> s3://obenauflab/work/salmon
N E X T F L O W ~ version 18.10.1
Launching <span class="sb">`</span>t-neumann/salmon-nf<span class="sb">`</span> <span class="o">[</span>silly_mccarthy] - revision: 6ac6e6a15a <span class="o">[</span>master]
parameters
<span class="o">======================</span>
input directory : s3://obenauflab/fastq
output directory : s3://obenauflab/salmon
<span class="o">======================</span>
<span class="o">[</span>warm up] executor <span class="o">></span> awsbatch
</code></pre></div></div>
<p>From there on, you wait forever wondering what’s going on, as happened to me.</p>
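<p>If you do end up with jobs hanging in <code class="language-plaintext highlighter-rouge">Runnable</code>, the CLI is handy for inspecting them. Again a print-only sketch (set <code class="language-plaintext highlighter-rouge">DRY_RUN=0</code> to really query AWS); the job ID is a placeholder:</p>

```shell
#!/usr/bin/env bash
# Print-only sketch; flip DRY_RUN to 0 to really call AWS.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# list jobs stuck in RUNNABLE on our workload queue ...
run aws batch list-jobs --job-queue salmonWorkload --job-status RUNNABLE
# ... and dump the full details (including the status reason) for one of them
run aws batch describe-jobs --jobs JOB-ID-PLACEHOLDER
```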
<h3 id="start-your-nextflow-run-on-aws-batch">Start your Nextflow run on AWS batch</h3>
<p>Now the last and most rewarding step of all - you are finally ready to launch the <code class="language-plaintext highlighter-rouge">salmon-nf</code> Nextflow pipeline on <code class="language-plaintext highlighter-rouge">AWS</code>!</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>ec2-user@ip-172-31-38-222 ~]<span class="nv">$ </span>nextflow run t-neumann/salmon-nf <span class="nt">--inputDir</span> s3://obenauflab/fastq <span class="nt">--outputDir</span> s3://obenauflab/salmon <span class="nt">-profile</span> awsbatch <span class="nt">-w</span> s3://obenauflab/work/salmon
</code></pre></div></div>
<p>Notice how both <code class="language-plaintext highlighter-rouge">inputDir</code> and <code class="language-plaintext highlighter-rouge">outputDir</code> point to an <code class="language-plaintext highlighter-rouge">s3</code> directory and how we also have to supply a <code class="language-plaintext highlighter-rouge">work directory</code> on <code class="language-plaintext highlighter-rouge">s3</code> with <code class="language-plaintext highlighter-rouge">-w</code>. Now hit <code class="language-plaintext highlighter-rouge">Enter</code> and watch the beauty unfold on <code class="language-plaintext highlighter-rouge">AWS</code>.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>ec2-user@ip-172-31-38-222 ~]<span class="nv">$ </span>nextflow run t-neumann/salmon-nf <span class="nt">--inputDir</span> s3://obenauflab/fastq <span class="nt">--outputDir</span> s3://obenauflab/salmon <span class="nt">-profile</span> awsbatch <span class="nt">-w</span> s3://obenauflab/work/salmon
N E X T F L O W ~ version 18.10.1
Launching <span class="sb">`</span>t-neumann/salmon-nf<span class="sb">`</span> <span class="o">[</span>silly_mccarthy] - revision: 6ac6e6a15a <span class="o">[</span>master]
parameters
<span class="o">======================</span>
input directory : s3://obenauflab/fastq
output directory : s3://obenauflab/salmon
<span class="o">======================</span>
<span class="o">[</span>warm up] executor <span class="o">></span> awsbatch
<span class="o">[</span>4a/72c0f7] Submitted process <span class="o">></span> salmon <span class="o">(</span>d1ada222-b67f-47c0-b380-091eaab093b4_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>f2/f8d97a] Submitted process <span class="o">></span> salmon <span class="o">(</span>e46e4f3a-62f8-4bd1-a143-f384e219d6af_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>90/35eb4d] Submitted process <span class="o">></span> salmon <span class="o">(</span>1672de07-77db-4817-9c7f-f201c25e8132_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>81/c47fe3] Submitted process <span class="o">></span> salmon <span class="o">(</span>741fbacf-3694-46ef-b16f-66bac6ee0452_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>f1/bc3afc] Submitted process <span class="o">></span> salmon <span class="o">(</span>db18dd75-3b48-4c21-aa68-58b1cf37c8c2_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>a8/88095d] Submitted process <span class="o">></span> salmon <span class="o">(</span>0ac6634e-00b0-4107-a5d6-db8ffc602645_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>a6/36e366] Submitted process <span class="o">></span> salmon <span class="o">(</span>9fa785f2-1dcb-4966-a5fa-fe75d327cb81_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>7d/5ae2b0] Submitted process <span class="o">></span> salmon <span class="o">(</span>5b3c329a-aa14-4965-8d13-f508f4390eaf_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>d9/3ec3fc] Submitted process <span class="o">></span> salmon <span class="o">(</span>6cf08e2b-7e59-4537-b1c3-1c5b3838ab95_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>19/d7d441] Submitted process <span class="o">></span> salmon <span class="o">(</span>9c714c63-ee50-4385-9e25-09f940f5f902_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>71/ff40cf] Submitted process <span class="o">></span> salmon <span class="o">(</span>17686cd5-271a-4e24-9746-f93334fb86b5_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>66/aaa185] Submitted process <span class="o">></span> salmon <span class="o">(</span>0399ad16-816f-4824-ae28-7b82e006e7b7_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>67/ccd647] Submitted process <span class="o">></span> salmon <span class="o">(</span>1916abcd-61c0-4f23-96ac-be70aacb8dc1_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>7d/0a090b] Submitted process <span class="o">></span> salmon <span class="o">(</span>e1a4167d-b4ca-405c-8550-cc32bb1b1d09_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>3b/a9972e] Submitted process <span class="o">></span> salmon <span class="o">(</span>876a9725-34c1-4a23-a3fe-58a860d0f0c5_gdc_realn_rehead<span class="o">)</span>
</code></pre></div></div>
<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/AWSBatch_Dashboard.png" alt="AWS Batch dashboard" /></p>
<p>Note how <code class="language-plaintext highlighter-rouge">AWS Batch</code> automatically upscales the number of desired vCPUs of your compute environment once the jobs are submitted.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/AWSBatch_MultiInstances.png" alt="AWS Batch EC2 instances" /></p>
<p>Watch in awe how <code class="language-plaintext highlighter-rouge">AWS Batch</code> fires up multiple <code class="language-plaintext highlighter-rouge">EC2</code> instances automatically in your <code class="language-plaintext highlighter-rouge">EC2</code> dashboard.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/AWSBatch_JobTransition.png" alt="AWS Batch Job transition" /></p>
<p>Watch how jobs transition from <code class="language-plaintext highlighter-rouge">Runnable</code> to <code class="language-plaintext highlighter-rouge">Starting</code> to <code class="language-plaintext highlighter-rouge">Running</code> to <code class="language-plaintext highlighter-rouge">Succeeded</code> state until all your samples have been processed.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>47/c580b5] Submitted process <span class="o">></span> salmon <span class="o">(</span>2864cbe8-4d77-4477-ac84-791004e42237_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>8c/84bc14] Submitted process <span class="o">></span> salmon <span class="o">(</span>0fdb3d0e-e405-4e8d-8897-4a90ea4fe00c_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>1d/3f6ec6] Submitted process <span class="o">></span> salmon <span class="o">(</span>7ed99d57-f199-4dac-87a8-62393f5e0aea_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>a9/330e5d] Submitted process <span class="o">></span> salmon <span class="o">(</span>825daddc-a89a-483b-947e-74cc12ba013c_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>98/33bed5] Submitted process <span class="o">></span> salmon <span class="o">(</span>c3588f96-95c6-4008-bda2-502ceb963adb_gdc_realn_rehead<span class="o">)</span>
t-neumann/salmon-nf has finished.
Status: SUCCESS
Time: Sun Aug 25 11:20:13 UTC 2019
Duration: 10m 22s
<span class="o">[</span>ec2-user@ip-172-31-38-222 ~]<span class="err">$</span>
</code></pre></div></div>
<p>Now let’s check whether the results were produced in the correct <code class="language-plaintext highlighter-rouge">s3</code> output directory.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/AWSBatch_Success.png" alt="AWS Batch Job success" /></p>
<p>Congratulations! You did it! It took a long time, the setup was quite tedious and frustrating at numerous steps, but with amazing help from the community and Boehringer-Ingelheim, plus quite some trial-and-error, I got it to work - and hopefully so did you, with much less hassle!</p>
<p>Happy pipeline building and number crunching with <code class="language-plaintext highlighter-rouge">AWS</code> and Nextflow!</p>
<h1 id="slamdunk-paper">Slamdunk paper</h1>
<p>Tobias Neumann - 2019-06-28 - <a href="https://t-neumann.github.io/pipelines/Slamdunk">https://t-neumann.github.io/pipelines/Slamdunk</a></p>
<p>For the past couple of years I was involved in the development of <a href="http://doi.org/10.1038/nmeth.4435">SLAMseq</a>, a sequencing technology for time-resolved measurement of newly synthesized and existing RNA in cultured cells. Originally developed by the lab of <a href="https://www.imba.oeaw.ac.at/research/stefan-ameres/">Stefan Ameres</a>, the approach was extended by the lab of my boss <a href="https://www.imp.ac.at/groups/johannes-zuber/">Johannes Zuber</a> with pharmacological and chemical-genetic perturbations in order to identify direct transcriptional targets of any gene or pathway (<a href="http://doi.org/10.1126/science.aao2793">Muhar et al, Science 2018</a>).</p>
<p>Processing and interpreting this data required novel analysis methods, so I was given the opportunity to team up with a good friend of mine - <a href="https://github.com/philres">Philipp Rescheneder</a> - to develop <a href="https://t-neumann.github.io/slamdunk/">Slamdunk</a> which we recently published in <a href="http://doi.org/10.1186/s12859-019-2849-7">BMC Bioinformatics</a> and is generally applicable to any nucleotide-conversion containing dataset.</p>
<p>This post will quickly highlight the main functionality, findings and features.</p>
<h2 id="slamdunk-workflow">Slamdunk workflow</h2>
<p><img src="https://t-neumann.github.io/assets/images/posts/Slamdunk/slamdunk_outline.png" alt="Slamdunk outline" /></p>
<p>Slamdunk differs from naive read processing in 4 ways:</p>
<ul>
<li>It maps with a nucleotide-conversion aware scoring scheme since in the example of SLAMseq data, T>C mismatches are expected and identify reads from labelled transcripts</li>
<li>Since QuantSeq processes smaller, more repetitive regions of transcripts - namely the 3’ ends - Slamdunk cannot simply discard all multimappers, but utilizes a strategy to recover them</li>
<li>Genuine T>C SNPs would contribute greatly to false-positive conversion-quantifications and have to be excluded during the quantification step</li>
<li>Depending on coverage and T-content in the 3’ end, observing T>C reads will have a different likelihood which has to be corrected for during conversion quantification</li>
</ul>
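<p>To make the third point concrete, here is a minimal sketch of excluding known T&gt;C SNP positions from the conversion count (function name and position data are made up for illustration; this is not Slamdunk's actual implementation):</p>

```python
# Sketch: exclude genomic positions carrying a genuine T>C SNP from the
# conversion count, so they do not inflate the apparent labelling signal.
# Positions and the SNP set are hypothetical illustration data.
def count_conversions(observed_tc_positions, snp_positions):
    return sum(1 for pos in observed_tc_positions if pos not in snp_positions)

tc_positions = [101, 105, 110, 110, 120]   # T>C observations in reads
snps = {110}                               # called as a genuine T>C SNP

print(count_conversions(tc_positions, snps))  # 3: the two reads over 110 are dropped
```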
<h2 id="features">Features</h2>
<h3 id="conversion-aware-mapping">Conversion-aware mapping</h3>
<p><img src="https://t-neumann.github.io/assets/images/posts/Slamdunk/slamdunk_mapping.png" alt="Slamdunk mapping" /></p>
<p>Slamdunk utilizes a conversion-aware scoring scheme implemented with the mapper <a href="http://cibiv.github.io/NextGenMap/">NextGenMap</a>.
Using this scoring scheme, we could demonstrate the following:</p>
<ul>
<li>We can map reads independent of the inherent conversion-rates in the respective datasets (see top Figure a)</li>
<li>With commonly found conversion rates (0-7%), we are able to consistently map >90% of the reads at 100-150bp read length and >80% of shorter 50bp reads.</li>
</ul>
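<p>The idea behind such a scoring scheme can be illustrated with a toy example (a simplification, not NextGenMap's actual scoring parameters): a conventional scheme penalizes every mismatch, whereas a conversion-aware scheme scores T&gt;C mismatches like matches, so reads from labelled transcripts do not lose alignment score as conversion rates increase.</p>

```python
# Toy alignment scoring: illustrates the idea of a conversion-aware scheme
# (NOT NextGenMap's actual parameters). A T>C mismatch (reference T, read C)
# is scored like a match when tc_aware is enabled.
def alignment_score(ref, read, match=2, mismatch=-3, tc_aware=True):
    score = 0
    for r, q in zip(ref, read):
        if r == q or (tc_aware and r == "T" and q == "C"):
            score += match
        else:
            score += mismatch
    return score

ref  = "ACGTTTGA"
read = "ACGCTCGA"  # two T>C conversions from labelling

print(alignment_score(ref, read, tc_aware=False))  # 6: conversions penalized
print(alignment_score(ref, read, tc_aware=True))   # 16: scored like a perfect match
```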
<h3 id="multimapper-recovery">Multimapper recovery</h3>
<p>We devised a multimapper recovery strategy to deal with repetitive 3’ UTR regions of transcripts. To this end, multimapping reads that still map uniquely to annotated 3’ UTRs are recovered and only reads with alignments to several annotated 3’ UTRs are discarded.</p>
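<p>The recovery rule can be sketched as follows (a simplified illustration with hypothetical data, not Slamdunk's actual code): a multimapping read is kept only if all of its alignments fall into a single annotated 3' UTR.</p>

```python
# Sketch of the recovery rule: a multimapping read is recovered if its
# alignments hit at most one annotated 3' UTR, and discarded if they hit
# several distinct 3' UTRs. UTR assignments are hypothetical data.
def recover_multimapper(alignment_utrs):
    """alignment_utrs: 3' UTR id per alignment, or None if outside any UTR."""
    hit_utrs = {u for u in alignment_utrs if u is not None}
    return hit_utrs.pop() if len(hit_utrs) == 1 else None

print(recover_multimapper(["UTR_A", "UTR_A", None]))  # kept, assigned to UTR_A
print(recover_multimapper(["UTR_A", "UTR_B"]))        # discarded (None)
```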
<p><img src="https://t-neumann.github.io/assets/images/posts/Slamdunk/multimappers.png" alt="Multimapper recovery strategy" /></p>
<p>Using this strategy, we are able to recover valuable signal in genes with 3’ UTRs with low mappability and increase overall correlation of QuantSeq datasets to corresponding RNA-seq datasets.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/Slamdunk/rnaseqcorrelation.png" alt="RNA-seq correlation" /></p>
<h3 id="conversion-quantification">Conversion quantification</h3>
<p>Plain quantification of the number of T>C-conversion containing reads in a given interval is biased towards intervals with higher T-content and higher coverage, since the probability of observing a T>C conversion in these intervals is increased. To address this issue, we devised a T-content and coverage aware nucleotide-conversion quantification within intervals that yields clearly superior error rates (see bottom Figure left). Overall, the variance of the relative error decreases with higher coverage, and while the method slightly underestimates the true conversion rate with short reads (50bp), it accurately estimates conversion rates for reads of 100bp and longer (bottom Figure right).</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/Slamdunk/tcontentquantification.png" alt="T-content coverage aware quantification" /></p>
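<p>The normalization idea can be sketched in a few lines (simplified from the method in the paper; names and numbers are made up for illustration): dividing the observed T&gt;C conversions by the number of sequenced reference-T positions makes the estimate comparable across intervals with different T-content and coverage.</p>

```python
# Sketch: normalize observed T>C conversions by the number of sequenced
# reference-T positions instead of counting T>C-containing reads, so
# T-rich or highly covered intervals are not inflated.
def conversion_rate(tc_conversions, covered_t_positions):
    if covered_t_positions == 0:
        return 0.0
    return tc_conversions / covered_t_positions

# Two hypothetical intervals with the same true labelling rate but
# different T-content x coverage: naive counts differ tenfold, the
# normalized rate does not.
print(conversion_rate(5, 100))    # 0.05
print(conversion_rate(50, 1000))  # 0.05
```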
<h3 id="multiqc-report">MultiQC report</h3>
<p>Visualization of results and quality control is an important aspect of each analysis. To this end, with lots of help from <a href="https://phil.ewels.co.uk">Phil Ewels</a>, we developed a plugin to <a href="https://multiqc.info/">MultiQC</a> to facilitate quality control of SLAMseq datasets. Using this plugin, we can visualize conversion rates within samples (bottom Figure a), display the principal components of samples based on T>C containing reads (bottom Figure b), plot non T>C mismatches over read positions to identify problematic read positions (bottom Figure c) or plot T>C conversions at 3’ ends (bottom Figure d) to check for base composition biases.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/Slamdunk/multiqc.png" alt="MultiQC module" /></p>
<h2 id="documentation">Documentation</h2>
<p>A thorough documentation is available from the main website:</p>
<ul>
<li><a href="https://t-neumann.github.io/slamdunk/">https://t-neumann.github.io/slamdunk/</a></li>
</ul>
<h2 id="availability">Availability</h2>
<p>Slamdunk is available from several platforms:</p>
<ul>
<li><a href="https://bioconda.github.io/recipes/slamdunk/README.html">BioConda</a></li>
<li><a href="https://galaxyproject.eu/posts/2019/08/17/Slamdunk/">Galaxy</a></li>
<li><a href="https://hub.docker.com/r/tobneu/slamdunk">Docker <i class="fab fa-docker" aria-hidden="true"></i></a></li>
<li><a href="https://pypi.org/project/slamdunk/">PyPI <i class="fab fa-python" aria-hidden="true"></i></a></li>
<li><a href="https://github.com/t-neumann/slamdunk">GitHub <i class="fab fa-github" aria-hidden="true"></i></a></li>
</ul>
<embed src="https://bmcbioinformatics.biomedcentral.com/track/pdf/10.1186/s12859-019-2849-7" width="100%" height="700" type="application/pdf" />Tobias NeumannPipelines with Nextflow2019-03-03T21:51:00+01:002019-03-03T21:51:00+01:00https://t-neumann.github.io/pipelines/Nextflow-pipeline<p>Nowadays, workflow management systems have become an integral part of large-scale analysis of biological datasets with multiple software packages and multi-platform language support. These systems enable the rapid prototyping and deployment of pipelines that combine complementary software packages.
Several such systems are already available, such as <a href="https://snakemake.readthedocs.io/en/stable/">Snakemake</a> and <a href="https://www.commonwl.org/">CWL</a>.</p>
<p>This post will give you an overview of my favourite workflow building system - <a href="https://www.nextflow.io/">Nextflow</a> - and look at one toy workflow implementation example that will also be used in later posts.</p>
<h2 id="nextflow">Nextflow</h2>
<p>Here, I will more or less shamelessly copy large parts of the description of Nextflow’s <a href="https://www.nextflow.io/">website</a> since it summarises the main features quite neatly.</p>
<p>Up front, the most severe disadvantage for me: Nextflow is written in <a href="https://groovy-lang.org/">Groovy</a> which is kind of a pain for me, since I am mostly Python, R, C/C++ and Java based, but have never needed to touch any Groovy.</p>
<p>However, with some fiddling around and especially a lot of low-latency community support via the <a href="https://gitter.im/nextflow-io/nextflow">Nextflow Gitter channel</a>, these are hurdles that can be overcome.</p>
<p>Once you have lost your fear of Groovy, the advantages of Nextflow are quite convincing.</p>
<p>If you want to read more about Nextflow, <a href="https://www.nextflow.io/docs/latest/index.html">here is the documentation</a> and <a href="https://www.nature.com/articles/nbt.3820">here is the original paper</a>.</p>
<h4 id="fast-prototyping">Fast prototyping</h4>
<p>Nextflow allows you to write a computational pipeline by making it simpler to put together many different tasks.</p>
<p>You may reuse your existing scripts and tools and you don’t need to learn a new language or API to start using it.</p>
<p>As an example, look at how easy it is to run code from different languages within Nextflow processes out of the box.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">process</span> <span class="n">perlStuff</span> <span class="o">{</span>
<span class="s">"""
#!/usr/bin/perl
print "Hi there!\n";
"""</span>
<span class="o">}</span>
<span class="n">process</span> <span class="n">pyStuff</span> <span class="o">{</span>
<span class="s">"""
#!/usr/bin/python
x = 'Hello'
y = 'world!'
print('%s - %s' % (x, y))
"""</span>
<span class="o">}</span>
</code></pre></div></div>
<h4 id="portable">Portable</h4>
<p>Nextflow provides an abstraction layer between your pipeline’s logic and the execution layer, so that it can be executed on multiple platforms without it changing.</p>
<p>It provides out of the box executors for SGE, LSF, SLURM, PBS and HTCondor batch schedulers and for Kubernetes, Amazon AWS and Google Cloud platforms.</p>
<p>Again, check the so-called profile configurations one can quite easily set up to enable support for yet another scheduler.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">profiles</span> <span class="o">{</span>
<span class="n">standard</span> <span class="o">{</span>
<span class="n">process</span><span class="o">.</span><span class="na">executor</span> <span class="o">=</span> <span class="err">'</span><span class="n">local</span><span class="err">'</span>
<span class="o">}</span>
<span class="n">cluster_sge</span> <span class="o">{</span>
<span class="n">process</span><span class="o">.</span><span class="na">executor</span> <span class="o">=</span> <span class="err">'</span><span class="n">sge</span><span class="err">'</span>
<span class="n">process</span><span class="o">.</span><span class="na">penv</span> <span class="o">=</span> <span class="err">'</span><span class="n">smp</span><span class="err">'</span>
<span class="n">process</span><span class="o">.</span><span class="na">cpus</span> <span class="o">=</span> <span class="mi">20</span>
<span class="n">process</span><span class="o">.</span><span class="na">queue</span> <span class="o">=</span> <span class="err">'</span><span class="kd">public</span><span class="o">.</span><span class="na">q</span><span class="err">'</span>
<span class="n">process</span><span class="o">.</span><span class="na">memory</span> <span class="o">=</span> <span class="err">'</span><span class="mi">10</span><span class="no">GB</span><span class="err">'</span>
<span class="o">}</span>
<span class="n">cluster_slurm</span> <span class="o">{</span>
<span class="n">process</span><span class="o">.</span><span class="na">executor</span> <span class="o">=</span> <span class="err">'</span><span class="n">slurm</span><span class="err">'</span>
<span class="n">process</span><span class="o">.</span><span class="na">cpus</span> <span class="o">=</span> <span class="mi">20</span>
<span class="n">process</span><span class="o">.</span><span class="na">queue</span> <span class="o">=</span> <span class="err">'</span><span class="n">work</span><span class="err">'</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p>With these few lines of code, you can now seamlessly execute your pipeline on your local machine, on SGE or on SLURM, even with customized resource settings.</p>
<h4 id="reproducibility">Reproducibility</h4>
<p>Nextflow supports <a href="https://www.docker.com/">Docker</a> and <a href="https://singularity.lbl.gov/">Singularity</a> containers technology.</p>
<p>This, along with the integration of the GitHub code sharing platform, allows you to write self-contained pipelines, manage versions and to rapidly reproduce any former configuration.</p>
<p>This is an especially nice feature, since it also allows running Nextflow workflows on cloud-based platforms such as <a href="https://aws.amazon.com/">Amazon Web Services</a>, which strictly requires all software environments to be supplied via a public <a href="https://www.nextflow.io/docs/latest/awscloud.html#awscloud-batch-config">Docker registry</a> reachable by AWS Batch.</p>
<h4 id="unified-parallelism">Unified parallelism</h4>
<p>Nextflow is based on the dataflow programming model which greatly simplifies writing complex distributed pipelines.</p>
<p>Parallelisation is implicitly defined by the processes’ input and output declarations. The resulting applications are inherently parallel and can scale up or scale out, transparently, without having to adapt to a specific platform architecture.</p>
<h4 id="continuous-checkpoints">Continuous checkpoints</h4>
<p>All the intermediate results produced during the pipeline execution are automatically tracked.</p>
<p>This allows you to resume its execution, from the last successfully executed step, no matter what the reason was for it stopping.</p>
<h4 id="stream-oriented">Stream oriented</h4>
<p>Nextflow extends the Unix pipes model with a fluent DSL, allowing you to handle complex stream interactions easily.</p>
<p>It promotes a programming approach, based on functional composition, that results in resilient and easily reproducible pipelines.</p>
<h2 id="salmon">Salmon</h2>
<p>Our first small toy Nextflow workflow will be based upon <a href="https://combine-lab.github.io/salmon/">Salmon</a>.</p>
<p>Salmon is a tool for quantifying the expression of transcripts using RNA-seq data. Salmon uses the concept of quasi-mapping coupled with a two-phase inference procedure to provide accurate expression estimates very quickly (i.e. wicked-fast) and while using little memory. Salmon performs its inference using an expressive and realistic model of RNA-seq data that takes into account experimental attributes and biases commonly observed in real RNA-seq data.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/Salmon-pipeline/salmon.png" alt="Salmon overview" /></p>
<p>Essentially, Salmon will create a transcript index which it then uses to quantify expression estimates for each of the transcripts from raw fastq reads.</p>
<p>Our goal:</p>
<ul>
<li>Obtain those transcript expression estimates for our samples</li>
<li>Obtain reads mapping to these transcripts via the <code class="language-plaintext highlighter-rouge">--writeMappings</code> flag as pseudo-bam</li>
</ul>
<p>If you want to read more on Salmon, <a href="https://www.nature.com/articles/nmeth.4197">here is the paper</a>.</p>
<h2 id="salmon-nf">salmon-nf</h2>
<p>The Nextflow pipeline we will create during this exercise is called <code class="language-plaintext highlighter-rouge">salmon-nf</code>, and it can be found on my <a href="https://github.com/t-neumann/salmon-nf">GitHub page</a> as a fully functional repository.</p>
<p>Any standalone Nextflow pipeline will need 2 files to be executable out of the box and also directly <a href="https://www.nextflow.io/docs/latest/sharing.html#running-a-pipeline">from GitHub</a>:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">main.nf</code> - This file contains the individual processes and channels</li>
<li><code class="language-plaintext highlighter-rouge">nextflow.config</code> - The configuration file for parameters, profiles etc. For more info read <a href="https://www.nextflow.io/docs/latest/config.html#configuration-file">here</a></li>
</ul>
<h3 id="workflow-layout">Workflow layout</h3>
<p>First, we need to get an idea about what the data flow will be and what software and scripts will be run on it. I have outlined the basic workflow of <code class="language-plaintext highlighter-rouge">salmon-nf</code> below:</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/Salmon-pipeline/salmon-nf.png" alt="salmon-nf" width="50%" /></p>
<p>We will only have one single process <code class="language-plaintext highlighter-rouge">salmon</code> which will use the input <code class="language-plaintext highlighter-rouge">fastq</code> files and the respective transcriptome <code class="language-plaintext highlighter-rouge">index</code> file to produce our expression estimates and the pseudo-bam files of aligning reads.</p>
<p>So for our <code class="language-plaintext highlighter-rouge">salmon</code> process we will have 2 input channels:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">fastqChannel</code> - feeding in our raw reads in <code class="language-plaintext highlighter-rouge">fastq</code> format</li>
<li><code class="language-plaintext highlighter-rouge">indexChannel</code> - providing our transcriptome <code class="language-plaintext highlighter-rouge">index</code> to which we align the reads to</li>
</ul>
<p>Our <code class="language-plaintext highlighter-rouge">salmon</code> process will produce several output files of which we choose to feed 2 file types into output processes as our final results:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">quant.sf</code> files via the <code class="language-plaintext highlighter-rouge">salmonChannel</code> output channel</li>
<li><code class="language-plaintext highlighter-rouge">pseudo.bam</code> files via the <code class="language-plaintext highlighter-rouge">pseudoBamChannel</code> output channel</li>
</ul>
<p>Now let’s have a look how we can actually realize and implement this on the coding end.</p>
<h3 id="docker-container">Docker container</h3>
<p>Before we can run anything, we need to provide the software environment containing <strong>all</strong> dependencies and software packages our <code class="language-plaintext highlighter-rouge">salmon</code> process needs to run. These days, this is usually done via a <a href="https://www.docker.com/">Docker</a> container, or a <a href="https://singularity.lbl.gov/">Singularity</a> container on HPC environments.</p>
<p>Many software packages - including Salmon in our case - already provide ready-to-use Docker containers (<code class="language-plaintext highlighter-rouge">combinelab/salmon</code>). But even if they don’t, do not despair or blindly jump into creating your own containers: if the package is available via <a href="https://bioconda.github.io/">BioConda</a>, you will find a Docker container on <a href="https://quay.io/organization/biocontainers">BioContainers</a>. I found this last resort to work in many cases.</p>
<p>Either way, since I wanted to convert the raw <code class="language-plaintext highlighter-rouge">SAM</code> output from <code class="language-plaintext highlighter-rouge">salmon</code> into a compressed <code class="language-plaintext highlighter-rouge">BAM</code> file, I chose to extend their Docker image with adding <code class="language-plaintext highlighter-rouge">samtools</code> as shown in the <a href="https://docs.docker.com/engine/reference/builder/">Dockerfile</a> below.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Copyright (c) 2019 Tobias Neumann.</span>
<span class="c">#</span>
<span class="c"># You should have received a copy of the GNU Affero General Public License</span>
<span class="c"># along with this program. If not, see &lt;http://www.gnu.org/licenses/&gt;.</span>
FROM combinelab/salmon:0.12.0
MAINTAINER Tobias Neumann &lt;tobias.neumann.at@gmail.com&gt;
RUN <span class="nv">buildDeps</span><span class="o">=</span><span class="s1">'wget ca-certificates make g++'</span> <span class="se">\</span>
<span class="nv">runDeps</span><span class="o">=</span><span class="s1">'zlib1g-dev libncurses5-dev unzip gcc'</span> <span class="se">\</span>
<span class="o">&&</span> <span class="nb">set</span> <span class="nt">-x</span> <span class="se">\</span>
<span class="o">&&</span> apt-get update <span class="se">\</span>
<span class="o">&&</span> apt-get <span class="nb">install</span> <span class="nt">-y</span> <span class="nv">$buildDeps</span> <span class="nv">$runDeps</span> <span class="nt">--no-install-recommends</span> <span class="se">\</span>
<span class="o">&&</span> <span class="nb">rm</span> <span class="nt">-rf</span> /var/lib/apt/lists/<span class="k">*</span> <span class="se">\</span>
<span class="o">&&</span> wget https://github.com/samtools/samtools/releases/download/1.9/samtools-1.9.tar.bz2 <span class="se">\</span>
<span class="o">&&</span> <span class="nb">tar </span>xvfj samtools-1.9.tar.bz2 <span class="se">\</span>
<span class="o">&&</span> <span class="nb">cd </span>samtools-1.9 <span class="se">\</span>
<span class="o">&&</span> ./configure <span class="nt">--prefix</span><span class="o">=</span>/usr/local/ <span class="se">\</span>
<span class="o">&&</span> make <span class="se">\</span>
<span class="o">&&</span> make <span class="nb">install</span> <span class="se">\</span>
<span class="o">&&</span> apt-get purge <span class="nt">-y</span> <span class="nt">--auto-remove</span> <span class="nv">$buildDeps</span>
</code></pre></div></div>
<p>The resulting Docker image was pushed to <a href="https://hub.docker.com/">Docker Hub</a> and can be pulled via <code class="language-plaintext highlighter-rouge">docker pull obenauflab/salmon:latest</code>.</p>
<h3 id="mainnf">main.nf</h3>
<p>Now we are ready to create the central <code class="language-plaintext highlighter-rouge">main.nf</code> file which contains all processes as well as channels. As mentioned before, you will find the entire code on <a href="https://github.com/t-neumann/salmon-nf">GitHub</a>, so here is an excerpt of the important sections.</p>
<h5 id="fastqchannel"><code class="language-plaintext highlighter-rouge">fastqChannel</code></h5>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pairedEndRegex</span> <span class="o">=</span> <span class="n">params</span><span class="o">.</span><span class="na">inputDir</span> <span class="o">+</span> <span class="s">"/*_{1,2}.fq.gz"</span>
<span class="nc">SERegex</span> <span class="o">=</span> <span class="n">params</span><span class="o">.</span><span class="na">inputDir</span> <span class="o">+</span> <span class="s">"/*[!12].fq.gz"</span>
<span class="n">pairFiles</span> <span class="o">=</span> <span class="nc">Channel</span><span class="o">.</span><span class="na">fromFilePairs</span><span class="o">(</span><span class="n">pairedEndRegex</span><span class="o">)</span>
<span class="n">singleFiles</span> <span class="o">=</span> <span class="nc">Channel</span><span class="o">.</span><span class="na">fromFilePairs</span><span class="o">(</span><span class="nc">SERegex</span><span class="o">,</span> <span class="nl">size:</span> <span class="mi">1</span><span class="o">){</span> <span class="n">file</span> <span class="o">-></span> <span class="n">file</span><span class="o">.</span><span class="na">baseName</span><span class="o">.</span><span class="na">replaceAll</span><span class="o">(/.</span><span class="na">fq</span><span class="o">/,</span><span class="s">""</span><span class="o">)</span> <span class="o">}</span>
<span class="n">singleFiles</span><span class="o">.</span><span class="na">mix</span><span class="o">(</span><span class="n">pairFiles</span><span class="o">)</span>
<span class="o">.</span><span class="na">set</span> <span class="o">{</span> <span class="n">fastqChannel</span> <span class="o">}</span>
</code></pre></div></div>
<p>This elaborate chunk of code is needed to enable the <code class="language-plaintext highlighter-rouge">fastqChannel</code> input channel to our <code class="language-plaintext highlighter-rouge">salmon</code> process to handle both single- and paired-end <code class="language-plaintext highlighter-rouge">fastq</code> files. As you can see, we created a <code class="language-plaintext highlighter-rouge">pairFiles</code> channel with a paired-end regex, basically assuming that our read pairs are named <code class="language-plaintext highlighter-rouge">*_1.fq.gz</code> and <code class="language-plaintext highlighter-rouge">*_2.fq.gz</code>. In addition, we have a <code class="language-plaintext highlighter-rouge">singleFiles</code> channel that takes all <code class="language-plaintext highlighter-rouge">fastq</code> files not following the <code class="language-plaintext highlighter-rouge">_1</code> and <code class="language-plaintext highlighter-rouge">_2</code> naming convention, assuming these are single-end read files.</p>
<p>The <code class="language-plaintext highlighter-rouge">fromFilePairs</code> method creates a channel emitting the file pairs matching the regex we provided. The matching files are emitted as tuples in which the first element is the grouping key of the matching pair and the second element is the list of files (sorted in lexicographical order). For example:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span><span class="mo">03</span><span class="mi">99</span><span class="n">ad16</span><span class="o">-</span><span class="mi">816</span><span class="n">f</span><span class="o">-</span><span class="mi">4824</span><span class="o">-</span><span class="n">ae28</span><span class="o">-</span><span class="mi">7</span><span class="n">b82e006e7b7_gdc_realn_rehead</span><span class="o">,</span> <span class="o">[</span><span class="mo">03</span><span class="mi">99</span><span class="n">ad16</span><span class="o">-</span><span class="mi">816</span><span class="n">f</span><span class="o">-</span><span class="mi">4824</span><span class="o">-</span><span class="n">ae28</span><span class="o">-</span><span class="mi">7</span><span class="n">b82e006e7b7_gdc_realn_rehead_1</span><span class="o">.</span><span class="na">fq</span><span class="o">.</span><span class="na">gz</span><span class="o">,</span> <span class="mo">03</span><span class="mi">99</span><span class="n">ad16</span><span class="o">-</span><span class="mi">816</span><span class="n">f</span><span class="o">-</span><span class="mi">4824</span><span class="o">-</span><span class="n">ae28</span><span class="o">-</span><span class="mi">7</span><span class="n">b82e006e7b7_gdc_realn_rehead_2</span><span class="o">.</span><span class="na">fq</span><span class="o">.</span><span class="na">gz</span><span class="o">]]</span>
<span class="o">[</span><span class="mi">0</span><span class="n">ac6634e</span><span class="o">-</span><span class="mo">00</span><span class="n">b0</span><span class="o">-</span><span class="mi">4107</span><span class="o">-</span><span class="n">a5d6</span><span class="o">-</span><span class="n">db8ffc602645_gdc_realn_rehead</span><span class="o">,</span> <span class="o">[</span><span class="mi">0</span><span class="n">ac6634e</span><span class="o">-</span><span class="mo">00</span><span class="n">b0</span><span class="o">-</span><span class="mi">4107</span><span class="o">-</span><span class="n">a5d6</span><span class="o">-</span><span class="n">db8ffc602645_gdc_realn_rehead_1</span><span class="o">.</span><span class="na">fq</span><span class="o">.</span><span class="na">gz</span><span class="o">,</span> <span class="mi">0</span><span class="n">ac6634e</span><span class="o">-</span><span class="mo">00</span><span class="n">b0</span><span class="o">-</span><span class="mi">4107</span><span class="o">-</span><span class="n">a5d6</span><span class="o">-</span><span class="n">db8ffc602645_gdc_realn_rehead_2</span><span class="o">.</span><span class="na">fq</span><span class="o">.</span><span class="na">gz</span><span class="o">]]</span>
</code></pre></div></div>
<p>As you can see, for the single-end reads channel <code class="language-plaintext highlighter-rouge">singleFiles</code>, the method is slightly extended:</p>
<p>First, we pass an additional parameter <code class="language-plaintext highlighter-rouge">size: 1</code> so that each emitted item is expected to hold exactly one file. In addition, we manually provide a custom grouping strategy as a closure which, given the current file as parameter, returns the grouping key. In our case, we simply strip anything from the file name after <code class="language-plaintext highlighter-rouge">.fq</code> and use this as our grouping key. For example:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span><span class="mi">0</span><span class="n">fdb3d0e</span><span class="o">-</span><span class="n">e405</span><span class="o">-</span><span class="mi">4</span><span class="n">e8d</span><span class="o">-</span><span class="mi">8897</span><span class="o">-</span><span class="mi">4</span><span class="n">a90ea4fe00c_gdc_realn_rehead</span><span class="o">,</span> <span class="o">[</span><span class="mi">0</span><span class="n">fdb3d0e</span><span class="o">-</span><span class="n">e405</span><span class="o">-</span><span class="mi">4</span><span class="n">e8d</span><span class="o">-</span><span class="mi">8897</span><span class="o">-</span><span class="mi">4</span><span class="n">a90ea4fe00c_gdc_realn_rehead</span><span class="o">.</span><span class="na">fq</span><span class="o">.</span><span class="na">gz</span><span class="o">]]</span>
<span class="o">[</span><span class="mi">1916</span><span class="n">abcd</span><span class="o">-</span><span class="mi">61</span><span class="n">c0</span><span class="o">-</span><span class="mi">4</span><span class="n">f23</span><span class="o">-</span><span class="mi">96</span><span class="n">ac</span><span class="o">-</span><span class="n">be70aacb8dc1_gdc_realn_rehead</span><span class="o">,</span> <span class="o">[</span><span class="mi">1916</span><span class="n">abcd</span><span class="o">-</span><span class="mi">61</span><span class="n">c0</span><span class="o">-</span><span class="mi">4</span><span class="n">f23</span><span class="o">-</span><span class="mi">96</span><span class="n">ac</span><span class="o">-</span><span class="n">be70aacb8dc1_gdc_realn_rehead</span><span class="o">.</span><span class="na">fq</span><span class="o">.</span><span class="na">gz</span><span class="o">]]</span>
</code></pre></div></div>
<p>Finally, we combine both channels via the <code class="language-plaintext highlighter-rouge">mix</code> operator into our final <code class="language-plaintext highlighter-rouge">fastqChannel</code> input channel to our <code class="language-plaintext highlighter-rouge">salmon</code> process.</p>
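<p>For illustration, the grouping behaviour of the two channels can be roughly re-implemented in plain Python (a sketch only; Nextflow's <code class="language-plaintext highlighter-rouge">fromFilePairs</code> does the real work, and the sample names here are hypothetical):</p>

```python
import re

# Sketch of the grouping logic behind the two channels: files named
# *_1.fq.gz / *_2.fq.gz are grouped as pairs under their common prefix,
# everything else is treated as a single-end sample keyed by the file
# name up to ".fq".
def group_fastqs(filenames):
    groups = {}
    for f in sorted(filenames):
        m = re.match(r"(.+)_[12]\.fq\.gz$", f)
        key = m.group(1) if m else f.split(".fq")[0]
        groups.setdefault(key, []).append(f)
    return groups

files = ["sampleA_1.fq.gz", "sampleA_2.fq.gz", "sampleB.fq.gz"]
print(group_fastqs(files))
# {'sampleA': ['sampleA_1.fq.gz', 'sampleA_2.fq.gz'], 'sampleB': ['sampleB.fq.gz']}
```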
<h5 id="indexchannel"><code class="language-plaintext highlighter-rouge">indexChannel</code></h5>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">indexChannel</span> <span class="o">=</span> <span class="nc">Channel</span>
<span class="o">.</span><span class="na">fromPath</span><span class="o">(</span><span class="n">params</span><span class="o">.</span><span class="na">salmonIndex</span><span class="o">)</span>
<span class="o">.</span><span class="na">ifEmpty</span> <span class="o">{</span> <span class="n">exit</span> <span class="mi">1</span><span class="o">,</span> <span class="s">"Salmon index not found: ${params.salmonIndex}"</span> <span class="o">}</span>
</code></pre></div></div>
<p>Setting up this input channel is pretty straightforward. The only thing we need to do is to pre-create our Salmon index (read how to do this <a href="https://salmon.readthedocs.io/en/latest/salmon.html#preparing-transcriptome-indices-mapping-based-mode">here</a>) and supply it via the <code class="language-plaintext highlighter-rouge">salmonIndex</code> parameter - how this is done will follow later.</p>
<h5 id="process-salmon">Process <code class="language-plaintext highlighter-rouge">salmon</code></h5>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">process</span> <span class="n">salmon</span> <span class="o">{</span>
<span class="n">tag</span> <span class="o">{</span> <span class="n">lane</span> <span class="o">}</span>
<span class="nl">input:</span>
<span class="n">set</span> <span class="nf">val</span><span class="o">(</span><span class="n">lane</span><span class="o">),</span> <span class="n">file</span><span class="o">(</span><span class="n">reads</span><span class="o">)</span> <span class="n">from</span> <span class="n">fastqChannel</span>
<span class="n">file</span> <span class="n">index</span> <span class="n">from</span> <span class="n">indexChannel</span><span class="o">.</span><span class="na">first</span><span class="o">()</span>
<span class="nl">output:</span>
<span class="n">file</span> <span class="o">(</span><span class="s">"${lane}_salmon/quant.sf"</span><span class="o">)</span> <span class="n">into</span> <span class="n">salmonChannel</span>
<span class="nf">file</span> <span class="o">(</span><span class="s">"${lane}_pseudo.bam"</span><span class="o">)</span> <span class="n">into</span> <span class="n">pseudoBamChannel</span>
<span class="nl">shell:</span>
<span class="n">def</span> <span class="n">single</span> <span class="o">=</span> <span class="n">reads</span> <span class="k">instanceof</span> <span class="nc">Path</span>
<span class="nf">if</span> <span class="o">(!</span><span class="n">single</span><span class="o">)</span>
<span class="sc">'''</span>
<span class="n">salmon</span> <span class="n">quant</span> <span class="o">-</span><span class="n">i</span> <span class="o">!{</span><span class="n">index</span><span class="o">}</span> <span class="o">-</span><span class="n">l</span> <span class="no">A</span> <span class="o">-</span><span class="mi">1</span> <span class="o">!{</span><span class="n">reads</span><span class="o">[</span><span class="mi">0</span><span class="o">]}</span> <span class="o">-</span><span class="mi">2</span> <span class="o">!{</span><span class="n">reads</span><span class="o">[</span><span class="mi">1</span><span class="o">]}</span> <span class="o">-</span><span class="n">o</span> <span class="o">!{</span><span class="n">lane</span><span class="o">}</span><span class="n">_salmon</span> <span class="o">-</span><span class="n">p</span> <span class="o">!{</span><span class="n">task</span><span class="o">.</span><span class="na">cpus</span><span class="o">}</span> <span class="o">--</span><span class="n">validateMappings</span> <span class="o">--</span><span class="n">no</span><span class="o">-</span><span class="n">version</span><span class="o">-</span><span class="n">check</span> <span class="o">-</span><span class="n">z</span> <span class="o">|</span> <span class="n">samtools</span> <span class="n">view</span> <span class="o">-</span><span class="nc">Sb</span> <span class="o">-</span><span class="no">F</span> <span class="mi">256</span> <span class="o">-</span> <span class="o">></span> <span class="o">!{</span><span class="n">lane</span><span class="o">}</span><span class="n">_pseudo</span><span class="o">.</span><span class="na">bam</span>
<span class="sc">'''</span>
<span class="k">else</span>
<span class="sc">'''</span>
<span class="n">salmon</span> <span class="n">quant</span> <span class="o">-</span><span class="n">i</span> <span class="o">!{</span><span class="n">index</span><span class="o">}</span> <span class="o">-</span><span class="n">l</span> <span class="no">A</span> <span class="o">-</span><span class="n">r</span> <span class="o">!{</span><span class="n">reads</span><span class="o">}</span> <span class="o">-</span><span class="n">o</span> <span class="o">!{</span><span class="n">lane</span><span class="o">}</span><span class="n">_salmon</span> <span class="o">-</span><span class="n">p</span> <span class="o">!{</span><span class="n">task</span><span class="o">.</span><span class="na">cpus</span><span class="o">}</span> <span class="o">--</span><span class="n">validateMappings</span> <span class="o">--</span><span class="n">no</span><span class="o">-</span><span class="n">version</span><span class="o">-</span><span class="n">check</span> <span class="o">-</span><span class="n">z</span> <span class="o">|</span> <span class="n">samtools</span> <span class="n">view</span> <span class="o">-</span><span class="nc">Sb</span> <span class="o">-</span><span class="no">F</span> <span class="mi">256</span> <span class="o">-</span> <span class="o">></span> <span class="o">!{</span><span class="n">lane</span><span class="o">}</span><span class="n">_pseudo</span><span class="o">.</span><span class="na">bam</span>
<span class="sc">'''</span>
<span class="o">}</span>
</code></pre></div></div>
<p>Our only process for the <code class="language-plaintext highlighter-rouge">salmon-nf</code> workflow is the <code class="language-plaintext highlighter-rouge">salmon</code> process.</p>
<p>You will notice that it has the 2 input channels we previously defined - <code class="language-plaintext highlighter-rouge">fastqChannel</code> and <code class="language-plaintext highlighter-rouge">indexChannel</code>. Note how we use the <code class="language-plaintext highlighter-rouge">.first()</code> method on the <code class="language-plaintext highlighter-rouge">indexChannel</code>: it turns the channel into a value channel, so the single index can be read by every task instead of being consumed by the first one.</p>
<p>In addition, we have defined 2 output channels - <code class="language-plaintext highlighter-rouge">salmonChannel</code> outputting all <code class="language-plaintext highlighter-rouge">quant.sf</code> files and <code class="language-plaintext highlighter-rouge">pseudoBamChannel</code> outputting the corresponding <code class="language-plaintext highlighter-rouge">pseudo.bam</code> files.</p>
<p>The actual script that is run is a plain conditional bash script. An initial condition checks whether single read files or paired-end reads are coming in from the <code class="language-plaintext highlighter-rouge">fastqChannel</code> - and based on this evaluation one or the other script branch is run.</p>
<p>The bash script itself is then basically only a <code class="language-plaintext highlighter-rouge">salmon</code> call on the respective input files.</p>
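<p>The branching can be mirrored in a few lines of Python; this is a hedged sketch that only builds the command strings (it does not run <code class="language-plaintext highlighter-rouge">salmon</code>, and the <code class="language-plaintext highlighter-rouge">salmon_command</code> helper plus its defaults are illustrative assumptions):</p>

```python
# Sketch of the conditional in the 'salmon' process: a single file means
# single-end reads (-r), a list of two files means paired-end (-1/-2).
def salmon_command(lane, reads, index='index', cpus=4):
    single = isinstance(reads, str)      # stands in for 'reads instanceof Path'
    if single:
        read_args = f"-r {reads}"
    else:
        read_args = f"-1 {reads[0]} -2 {reads[1]}"
    return (f"salmon quant -i {index} -l A {read_args} "
            f"-o {lane}_salmon -p {cpus} --validateMappings")

print(salmon_command('sampleB', 'sampleB.fq.gz'))
print(salmon_command('sampleA', ['sampleA_1.fq.gz', 'sampleA_2.fq.gz']))
```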
<h3 id="nextflowconfig">nextflow.config</h3>
<p>The Nextflow configuration files contain directives for parameter definitions, profile definitions and many other settings.</p>
<p>In our particular example of <code class="language-plaintext highlighter-rouge">salmon-nf</code>, we will have a tidy master <code class="language-plaintext highlighter-rouge">nextflow.config</code> that includes additional configs for each section.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">includeConfig</span> <span class="err">'</span><span class="n">config</span><span class="o">/</span><span class="n">general</span><span class="o">.</span><span class="na">config</span><span class="err">'</span>
<span class="n">includeConfig</span> <span class="err">'</span><span class="n">config</span><span class="o">/</span><span class="n">docker</span><span class="o">.</span><span class="na">config</span><span class="err">'</span>
<span class="n">profiles</span> <span class="o">{</span>
<span class="n">standard</span> <span class="o">{</span>
<span class="n">process</span><span class="o">.</span><span class="na">executor</span> <span class="o">=</span> <span class="err">'</span><span class="n">local</span><span class="err">'</span>
<span class="n">process</span><span class="o">.</span><span class="na">maxForks</span> <span class="o">=</span> <span class="mi">3</span>
<span class="o">}</span>
<span class="n">slurm</span> <span class="o">{</span>
<span class="n">includeConfig</span> <span class="err">'</span><span class="n">config</span><span class="o">/</span><span class="n">slurm</span><span class="o">.</span><span class="na">config</span><span class="err">'</span>
<span class="o">}</span>
<span class="n">awsbatch</span> <span class="o">{</span>
<span class="n">includeConfig</span> <span class="err">'</span><span class="n">config</span><span class="o">/</span><span class="n">awsbatch</span><span class="o">.</span><span class="na">config</span><span class="err">'</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p>As you can see, we have simply included some more config files plus barebones profile definitions. Let’s look at the sub-config files.</p>
<h5 id="generalconfig">general.config</h5>
<p>This holds general configurations, parameters and definitions that are applicable to any of our run profiles.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">params</span> <span class="o">{</span>
<span class="n">outputDir</span> <span class="o">=</span> <span class="err">'</span><span class="o">./</span><span class="n">results</span><span class="err">'</span>
<span class="o">}</span>
<span class="n">process</span> <span class="o">{</span>
<span class="n">publishDir</span> <span class="o">=</span> <span class="o">[</span>
<span class="o">[</span><span class="nl">path:</span> <span class="n">params</span><span class="o">.</span><span class="na">outputDir</span><span class="o">,</span> <span class="nl">mode:</span> <span class="err">'</span><span class="n">copy</span><span class="err">'</span><span class="o">,</span> <span class="nl">overwrite:</span> <span class="err">'</span><span class="kc">true</span><span class="err">'</span><span class="o">,</span> <span class="nl">pattern:</span> <span class="s">"*/quant.sf"</span><span class="o">],</span>
<span class="o">[</span><span class="nl">path:</span> <span class="n">params</span><span class="o">.</span><span class="na">outputDir</span><span class="o">,</span> <span class="nl">mode:</span> <span class="err">'</span><span class="n">copy</span><span class="err">'</span><span class="o">,</span> <span class="nl">overwrite:</span> <span class="err">'</span><span class="kc">true</span><span class="err">'</span><span class="o">,</span> <span class="nl">pattern:</span> <span class="s">"*pseudo.bam"</span><span class="o">]</span>
<span class="o">]</span>
<span class="n">errorStrategy</span> <span class="o">=</span> <span class="err">'</span><span class="n">retry</span><span class="err">'</span>
<span class="n">maxRetries</span> <span class="o">=</span> <span class="mi">3</span>
<span class="n">maxForks</span> <span class="o">=</span> <span class="mi">100</span>
<span class="o">}</span>
<span class="n">cloud</span> <span class="o">{</span>
<span class="n">imageId</span> <span class="o">=</span> <span class="err">'</span><span class="n">ami</span><span class="o">-</span><span class="mi">0</span><span class="n">f99d00928be3a282</span><span class="err">'</span>
<span class="n">instanceType</span> <span class="o">=</span> <span class="err">'</span><span class="n">t2</span><span class="o">.</span><span class="na">micro</span><span class="err">'</span>
<span class="n">userName</span> <span class="o">=</span> <span class="err">'</span><span class="n">ec2</span><span class="o">-</span><span class="n">user</span><span class="err">'</span>
<span class="n">keyName</span> <span class="o">=</span> <span class="err">'</span><span class="n">awsbatch</span><span class="err">'</span>
<span class="c1">// Type: SSH, Protocol: TCP, Port: 22, Source IP: 0.0.0.0/0</span>
<span class="n">securityGroup</span> <span class="o">=</span> <span class="err">'</span><span class="n">sg</span><span class="o">-</span><span class="mo">0307</span><span class="n">dbec406526c14</span><span class="err">'</span>
<span class="o">}</span>
<span class="n">timeline</span> <span class="o">{</span>
<span class="n">enabled</span> <span class="o">=</span> <span class="kc">true</span>
<span class="o">}</span>
<span class="n">report</span> <span class="o">{</span>
<span class="n">enabled</span> <span class="o">=</span> <span class="kc">true</span>
<span class="o">}</span>
</code></pre></div></div>
<p>We set a default output directory in the <code class="language-plaintext highlighter-rouge">params</code> section, copy the <code class="language-plaintext highlighter-rouge">quant.sf</code> and <code class="language-plaintext highlighter-rouge">pseudo.bam</code> files to a dedicated publish directory, set our error strategy, define a basic cloud profile for starting up instances on <a href="https://aws.amazon.com">AWS</a> and enable <a href="https://www.nextflow.io/docs/latest/tracing.html#timeline-report">timeline</a> and <a href="https://www.nextflow.io/docs/latest/tracing.html#execution-report">execution</a> reports by default.</p>
<h5 id="dockerconfig">docker.config</h5>
<p>With this configuration file, we enable Docker support per default and supply the Docker image to use with our <code class="language-plaintext highlighter-rouge">salmon</code> process.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">docker</span> <span class="o">{</span>
<span class="n">enabled</span> <span class="o">=</span> <span class="kc">true</span>
<span class="o">}</span>
<span class="n">process</span> <span class="o">{</span>
<span class="c1">// Process-specific docker containers</span>
<span class="nl">withName:</span><span class="n">salmon</span> <span class="o">{</span>
<span class="n">container</span> <span class="o">=</span> <span class="err">'</span><span class="n">obenauflab</span><span class="o">/</span><span class="nl">salmon:</span><span class="n">latest</span><span class="err">'</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<h5 id="slurmconfig">slurm.config</h5>
<p>This configuration file defines a profile for the <a href="https://slurm.schedmd.com/documentation.html">SLURM</a> scheduler used on our HPC system. Our cluster only supports Singularity, so we disable Docker and enable Singularity instead, define basic resource constraints and the queues where our tasks should run - and finally also supply the location of the <code class="language-plaintext highlighter-rouge">salmonIndex</code> on our file system.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">singularity</span> <span class="o">{</span>
<span class="n">enabled</span> <span class="o">=</span> <span class="kc">true</span>
<span class="o">}</span>
<span class="n">docker</span> <span class="o">{</span>
<span class="n">enabled</span> <span class="o">=</span> <span class="kc">false</span>
<span class="o">}</span>
<span class="n">process</span> <span class="o">{</span>
<span class="n">executor</span> <span class="o">=</span> <span class="err">'</span><span class="n">slurm</span><span class="err">'</span>
<span class="n">clusterOptions</span> <span class="o">=</span> <span class="err">'</span><span class="o">--</span><span class="n">qos</span><span class="o">=</span><span class="kt">short</span><span class="err">'</span>
<span class="n">cpus</span> <span class="o">=</span> <span class="err">'</span><span class="mi">12</span><span class="err">'</span>
<span class="n">memory</span> <span class="o">=</span> <span class="o">{</span> <span class="mi">8</span><span class="o">.</span><span class="na">GB</span> <span class="o">*</span> <span class="n">task</span><span class="o">.</span><span class="na">attempt</span> <span class="o">}</span>
<span class="o">}</span>
<span class="n">params</span> <span class="o">{</span>
<span class="n">salmonIndex</span> <span class="o">=</span> <span class="err">'</span><span class="o">/</span><span class="n">groups</span><span class="o">/</span><span class="nc">Software</span><span class="o">/</span><span class="n">indices</span><span class="o">/</span><span class="n">hg38</span><span class="o">/</span><span class="n">salmon</span><span class="o">/</span><span class="n">gencode</span><span class="o">.</span><span class="na">v28</span><span class="o">.</span><span class="na">IMPACT</span><span class="err">'</span>
<span class="o">}</span>
</code></pre></div></div>
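<p>The combination of <code class="language-plaintext highlighter-rouge">errorStrategy = 'retry'</code>, <code class="language-plaintext highlighter-rouge">maxRetries = 3</code> and the <code class="language-plaintext highlighter-rouge">memory = { 8.GB * task.attempt }</code> closure means each resubmission asks for more memory. A small sketch of the resulting allocation per attempt (the <code class="language-plaintext highlighter-rouge">memory_gb</code> helper is illustrative; with 3 retries, up to 4 attempts run in total):</p>

```python
# Memory requested per attempt under errorStrategy 'retry' with
# memory = { 8.GB * task.attempt } and maxRetries = 3.
MAX_RETRIES = 3

def memory_gb(attempt):
    # initial attempt plus up to MAX_RETRIES resubmissions
    if not 1 <= attempt <= MAX_RETRIES + 1:
        raise ValueError("attempt outside the retry budget")
    return 8 * attempt

for attempt in (1, 2, 3):
    print(f"attempt {attempt}: {memory_gb(attempt)} GB")
# attempt 1: 8 GB, attempt 2: 16 GB, attempt 3: 24 GB
```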
<h5 id="awsbatchconfig">awsbatch.config</h5>
<p>This configuration file will be explained in detail in a later post - in brief, it enables execution of tasks in the cloud using <a href="https://aws.amazon.com/batch/">AWS Batch</a>, but it requires extensive configuration before it is usable.</p>
<h2 id="running-the-salmon-nf-nextflow-workflow">Running the <code class="language-plaintext highlighter-rouge">salmon-nf</code> Nextflow workflow</h2>
<p>Now that we have written our code and committed everything to GitHub, we can finally test-drive our workflow on some actual data.</p>
<p>First, let’s pull in our workflow:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tobias.neumann@login-01 <span class="o">[</span>BIO] <span class="nv">$ </span>nextflow pull t-neumann/salmon-nf
Picked up _JAVA_OPTIONS: <span class="nt">-Djava</span>.io.tmpdir<span class="o">=</span>/tmp
Checking t-neumann/salmon-nf ...
downloaded from https://github.com/t-neumann/salmon-nf.git - revision: 4fbaea7165 <span class="o">[</span>master]
tobias.neumann@login-01 <span class="o">[</span>BIO] <span class="err">$</span>
</code></pre></div></div>
<p>Now we are ready to run our workflow. Make sure to select the profile you desire - for this example I will run it on our in-house cluster with SLURM:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tobias.neumann@login-01 <span class="o">[</span>BIO] <span class="nv">$ </span>nextflow run t-neumann/salmon-nf <span class="nt">--inputDir</span> /tmp/data <span class="nt">--outputDir</span> results <span class="nt">-profile</span> slurm <span class="nt">-resume</span>
Picked up _JAVA_OPTIONS: <span class="nt">-Djava</span>.io.tmpdir<span class="o">=</span>/tmp
N E X T F L O W ~ version 19.01.0
Launching <span class="sb">`</span>t-neumann/salmon-nf<span class="sb">`</span> <span class="o">[</span>maniac_poisson] - revision: 4fbaea7165 <span class="o">[</span>master]
parameters
<span class="o">======================</span>
input directory : /tmp/data
output directory : results
<span class="o">======================</span>
<span class="o">[</span>warm up] executor <span class="o">></span> slurm
<span class="o">[</span>fb/20d1dc] Submitted process <span class="o">></span> salmon <span class="o">(</span>8cec7235-3572-460c-b1d7-efe7961988e1_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>e9/6f6404] Submitted process <span class="o">></span> salmon <span class="o">(</span>5e18b02d-7e56-4f0d-b892-e7798eee5205_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>f9/509312] Submitted process <span class="o">></span> salmon <span class="o">(</span>d1ada222-b67f-47c0-b380-091eaab093b4_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>6d/30354f] Submitted process <span class="o">></span> salmon <span class="o">(</span>3783843f-c4fa-4aab-8f5b-e0749763164e_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>9b/2a81e9] Submitted process <span class="o">></span> salmon <span class="o">(</span>0fdb3d0e-e405-4e8d-8897-4a90ea4fe00c_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>de/418130] Submitted process <span class="o">></span> salmon <span class="o">(</span>383e3574-d22c-4dd6-842f-656ee2ab3b32_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>c1/e00c04] Submitted process <span class="o">></span> salmon <span class="o">(</span>1916abcd-61c0-4f23-96ac-be70aacb8dc1_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>63/6a2e93] Submitted process <span class="o">></span> salmon <span class="o">(</span>30fe4005-f4f2-41ce-bb1a-4830f3959ab7_gdc_realn_rehead<span class="o">)</span>
</code></pre></div></div>
<p>Now we just have to wait until our workflow has successfully finished processing all our samples.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>76/67754e] Submitted process <span class="o">></span> salmon <span class="o">(</span>0399ad16-816f-4824-ae28-7b82e006e7b7_gdc_realn_rehead<span class="o">)</span>
t-neumann/salmon-nf has finished.
Status: SUCCESS
Time: Sun Aug 25 23:35:49 CEST 2019
Duration: 2m
tobias.neumann@login-01 <span class="o">[</span>BIO] <span class="err">$</span>
</code></pre></div></div>
<p>If we now check our results and execution folder, we will find all the files we asked for in there - Nextflow is awesome!</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tobias.neumann@login-01 <span class="o">[</span>BIO] <span class="nv">$ </span><span class="nb">ls
</span>report.html results timeline.html
tobias.neumann@login-01 <span class="o">[</span>BIO] <span class="nv">$ </span><span class="nb">ls </span>results
0399ad16-816f-4824-ae28-7b82e006e7b7_gdc_realn_rehead_pseudo.bam 0399ad16-816f-4824-ae28-7b82e006e7b7_gdc_realn_rehead_salmon
</code></pre></div></div>
<p>Have fun building workflows on your own - it pays off, especially for larger samples and heterogeneous computing environments!</p>
<p>Tobias Neumann</p>
<h1 id="aws-architecture-outline">AWS architecture outline</h1>
<p><em>2019-02-10 - <a href="https://t-neumann.github.io/pipelines/AWS-architecture">https://t-neumann.github.io/pipelines/AWS-architecture</a></em></p>
<p>If you talk about the omni-present buzzword <strong>cloud computing</strong>, you will inevitably stumble over <a href="https://aws.amazon.com">Amazon Web Services <i class="fab fa-aws" aria-hidden="true"></i></a>. Sounds super cool and everybody gets excited about it, but I for my part was simply overwhelmed by the amount of services and products available from the platform.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-architecture/AWSServices.png" alt="AWS Services" /></p>
<p>The good news for us bioinformaticians is - and probably all cloud computing professionals working on enterprise solutions are going to beat me for this statement - that for setting up a proper and failsafe analysis pipeline with AWS, you only need a tiny fraction of those services and can ignore the rest. In this post, I will walk you through the essential AWS building blocks I deem required for a basic bioinformatics processing pipeline, their characteristics, caveats and how they play together.</p>
<h1 id="aws-building-blocks">AWS building blocks</h1>
<p>If you are familiar with cluster computing environments, you should not have a hard time recognizing the same architectural principles when building your own custom cluster computing environment in the cloud with AWS. I will elaborate on those pieces I encountered when building up a basic processing pipeline:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">S3</code> for storage of input and auxiliary (e.g. index) files</li>
<li><code class="language-plaintext highlighter-rouge">EBS</code> as local compute storage</li>
<li><code class="language-plaintext highlighter-rouge">AMI</code> Machine image (the operating system) to be run on your instances</li>
<li><code class="language-plaintext highlighter-rouge">EC2</code> instances that do the actual computation</li>
<li><code class="language-plaintext highlighter-rouge">ECS</code> to create your “software” from Docker containers to run on your instances</li>
<li><code class="language-plaintext highlighter-rouge">AWS Batch</code> that handles everything from submission to scaling and proper finalization of your individual jobs</li>
</ul>
<p>In the limited number of pipelines I have set up to run in AWS (they can also run on any other compute environment, but that is a story for a later post) I have never used any services beyond these. Anything that involves reading e.g. raw read files, processing them and retrieving the output should be doable with a combination of those. This can probably be optimized or done more elegantly with different services, but I have had some discussions on this with various people and we have not come across a solution that could do it at a lower cost.</p>
<h2 id="s3---simple-storage-service">S3 - Simple Storage Service</h2>
<p>This is the long-term storage solution from AWS. If you are familiar with a compute environment, this would be your globally accessible file system where you store all your important files, reference genomes, alignment indices - you name it. Contrary to the storage you are used to (unless you copy files locally to your node’s temporary storage for fast I/O), none of the files on <code class="language-plaintext highlighter-rouge">S3</code> are directly read or written when utilizing <code class="language-plaintext highlighter-rouge">EC2</code> instances for computational tasks. Before any pipeline starts, all of the necessary files have to be present in <code class="language-plaintext highlighter-rouge">S3</code>, such as:</p>
<ul>
<li>Input files:
<ul>
<li>Raw read files (<code class="language-plaintext highlighter-rouge">fastq</code>, <code class="language-plaintext highlighter-rouge">bam</code>,…)</li>
<li>Quantification tables (<code class="language-plaintext highlighter-rouge">txt</code>, <code class="language-plaintext highlighter-rouge">tsv</code>, <code class="language-plaintext highlighter-rouge">csv</code>,…)</li>
</ul>
</li>
<li>Reference files:
<ul>
<li>Genome sequence (<code class="language-plaintext highlighter-rouge">fasta</code>)</li>
<li>Feature annotations (<code class="language-plaintext highlighter-rouge">gtf</code>, <code class="language-plaintext highlighter-rouge">bed</code>, …)</li>
</ul>
</li>
<li>Index files:
<ul>
<li>Alignment indices (<code class="language-plaintext highlighter-rouge">bwa</code>, <code class="language-plaintext highlighter-rouge">bowtie</code>, <code class="language-plaintext highlighter-rouge">STAR</code>,…)</li>
<li>Exon junction annotations (<code class="language-plaintext highlighter-rouge">gtf</code>, …)</li>
<li>Transcriptome indices (<code class="language-plaintext highlighter-rouge">callisto</code>, <code class="language-plaintext highlighter-rouge">salmon</code>, …)</li>
</ul>
</li>
</ul>
<p><code class="language-plaintext highlighter-rouge">S3</code> will also be the final storage location where any of the output files produced by your pipeline end up. Since only <code class="language-plaintext highlighter-rouge">S3</code> is long-term storage, you usually don’t have to worry about deleting intermediate or temporary files produced by your pipeline, since they will be discarded after your instance has finished processing a given task.</p>
<p>Upload to <code class="language-plaintext highlighter-rouge">S3</code> does not come with any cost; downloading data from <code class="language-plaintext highlighter-rouge">S3</code>, however, is charged at around 10 cents / GB. Storage on <code class="language-plaintext highlighter-rouge">S3</code> is charged on a per-GB / per-month basis. I suspect downloads are charged simply to keep you from shuttling data in and out for free and thus circumventing the storage cost.</p>
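<p>As a back-of-the-envelope example of how such a bill composes (the rates in the sketch are illustrative assumptions, not current AWS list prices):</p>

```python
# Rough monthly S3 bill: storage is billed per GB-month, egress per GB.
# The rates below are illustrative assumptions, not current list prices.
STORAGE_USD_PER_GB_MONTH = 0.023
EGRESS_USD_PER_GB = 0.10

def s3_monthly_cost(stored_gb, downloaded_gb):
    return (stored_gb * STORAGE_USD_PER_GB_MONTH
            + downloaded_gb * EGRESS_USD_PER_GB)

# 500 GB kept in S3 and 100 GB of results downloaded in one month:
print(round(s3_monthly_cost(500, 100), 2))  # 21.5
```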
<h2 id="ebs---elastic-block-store">EBS - Elastic Block Store</h2>
<p>Every launched instance comes with a root volume of limited size (8 GB) that holds the OS and service files required to start up an instance. To each instance, you can (and often <strong>must</strong>) attach additional volumes - <code class="language-plaintext highlighter-rouge">EBS</code> volumes - of configurable size where your data goes.</p>
<p>There are 3 things to consider when choosing your EBS size:</p>
<ul>
<li>It needs to be large enough to store all input files for a given job
<ul>
<li>This includes <strong>all</strong> auxiliary files such as index files!</li>
</ul>
</li>
<li>It needs to be large enough to store <strong>all</strong> intermediate files for a given job</li>
<li>It needs to be large enough to store <strong>all</strong> output files from a given job</li>
</ul>
<p>Remember - <code class="language-plaintext highlighter-rouge">S3</code> data is never directly accessed from your instance, but always copied to your local <code class="language-plaintext highlighter-rouge">EBS</code> volume!</p>
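<p>In other words, every job starts by staging its inputs from <code class="language-plaintext highlighter-rouge">S3</code> onto the attached <code class="language-plaintext highlighter-rouge">EBS</code> volume. A minimal Python sketch of such a staging plan (bucket names, keys and the <code class="language-plaintext highlighter-rouge">staging_plan</code> helper are hypothetical; the actual copy would be an <code class="language-plaintext highlighter-rouge">aws s3 cp</code> per entry):</p>

```python
# Sketch of staging inputs from S3 onto the instance-local EBS volume
# before a job runs; bucket and key names are hypothetical.
import posixpath

def staging_plan(s3_uris, scratch='/scratch'):
    """Map each s3:// URI to a local path on the attached EBS volume."""
    plan = []
    for uri in s3_uris:
        assert uri.startswith('s3://'), 'only S3 URIs are staged'
        local = posixpath.join(scratch, posixpath.basename(uri))
        plan.append((uri, local))            # later: aws s3 cp <uri> <local>
    return plan

plan = staging_plan(['s3://my-bucket/reads/sample_1.fq.gz',
                     's3://my-bucket/indices/salmon/hg38.idx'])
# -> [('s3://my-bucket/reads/sample_1.fq.gz', '/scratch/sample_1.fq.gz'),
#     ('s3://my-bucket/indices/salmon/hg38.idx', '/scratch/hg38.idx')]
```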
<p>Estimating <code class="language-plaintext highlighter-rouge">EBS</code> volume sizes gave me a hard time initially and I did a lot of benchmarking runs - if it is too small, your jobs will crash. In practice, I found that <code class="language-plaintext highlighter-rouge">EBS</code> cost is a negligible fraction of your overall cost - so in the end, I ended up being very generous on <code class="language-plaintext highlighter-rouge">EBS</code> volume sizes.</p>
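<p>A generous sizing rule of thumb can be written down explicitly; this sketch (the <code class="language-plaintext highlighter-rouge">ebs_size_gb</code> helper, the safety factor and the example sizes are all illustrative assumptions) simply sums everything that must sit on local disk at the same time and pads it:</p>

```python
# Sketch of sizing an EBS volume for one job: sum everything that must be
# on local disk simultaneously, then pad with a generous safety factor.
def ebs_size_gb(input_gb, index_gb, intermediate_gb, output_gb, safety=2.0):
    required = input_gb + index_gb + intermediate_gb + output_gb
    return int(required * safety) + 8        # + 8 GB of extra headroom

# e.g. 20 GB fastq, 15 GB index, 30 GB intermediates, 10 GB output:
print(ebs_size_gb(20, 15, 30, 10))  # 158
```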
<h2 id="ami---amazon-machine-image">AMI - Amazon Machine Image</h2>
<p>The <code class="language-plaintext highlighter-rouge">AMI</code> is basically Amazon’s version of an image, similar to Virtual Machine images. They offer quite a variety of OS base images in their store (Linux, Windows etc.), but what you would usually want to do is extend one of those base images yourself with all the software you need during your pipeline run. These days, with <a href="https://www.docker.com">Docker <i class="fab fa-docker" aria-hidden="true"></i></a>, setting up your software environment takes very little effort, but even then you will in most cases have to install at least the <a href="https://aws.amazon.com/cli">AWS Command Line Interface</a> to copy files from and to <code class="language-plaintext highlighter-rouge">S3</code>.</p>
<h2 id="ec2---elastic-compute-cloud">EC2 - Elastic Compute Cloud</h2>
<p><code class="language-plaintext highlighter-rouge">EC2</code> is the part where you bring the computing heat: these are the instances upon which you launch your <code class="language-plaintext highlighter-rouge">AMI</code>s, attach your <code class="language-plaintext highlighter-rouge">EBS</code> volumes and then do some heavy computation. <code class="language-plaintext highlighter-rouge">EC2</code> instances come in all forms and shapes - depending on your demands. Below is an excerpt of compute-optimized instance types, but depending on the application you might go for memory-optimized or storage-optimized instances, GPUs, you name it.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-architecture/EC2Instances.png" alt="EC2 instances" /></p>
<p>The cool thing about them - as you probably noticed already if you did the math - is that in terms of cost, it does not matter whether you pick a smaller or a larger instance. The price scales exactly linearly, meaning you don’t necessarily need to squeeze two jobs into an instance twice the size - which will become important at a later point.</p>
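<p>The linear scaling can be made concrete with a tiny sketch (the hourly rate and the <code class="language-plaintext highlighter-rouge">run_cost</code> helper are illustrative assumptions): running N jobs on N small instances costs exactly the same as packing them onto fewer, proportionally larger ones.</p>

```python
# EC2 on-demand pricing scales linearly with instance size, so N jobs on
# N small instances cost the same as N jobs packed onto larger instances.
# The hourly rate below is an illustrative assumption.
SMALL_HOURLY_USD = 0.10   # hypothetical rate for one small instance

def run_cost(n_jobs, hours_per_job, size_factor=1):
    hourly = SMALL_HOURLY_USD * size_factor  # price grows with size
    instances = n_jobs / size_factor         # but more jobs fit per instance
    return instances * hours_per_job * hourly

# 8 jobs of 2 h each: the bill is identical on small or double-size instances.
print(run_cost(8, 2, size_factor=1))  # 1.6
print(run_cost(8, 2, size_factor=2))  # 1.6
```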
<h2 id="ecs---elastic-container-service">ECS - Elastic Container Service</h2>
<p>This definition and especially its distinction from <code class="language-plaintext highlighter-rouge">AWS Batch</code> was the hardest for me - I found the most helpful explanation <a href="https://medium.freecodecamp.org/amazon-ecs-terms-and-architecture-807d8c4960fd">here</a> and summarized it below.</p>
<p>According to Amazon,</p>
<blockquote>
<p>Amazon Elastic Container Service (Amazon ECS) is a highly scalable, high-performance container orchestration service that supports Docker containers and allows you to easily run and scale containerized applications on AWS.</p>
</blockquote>
<p>With <code class="language-plaintext highlighter-rouge">ECS</code> you can run Docker containers on <code class="language-plaintext highlighter-rouge">EC2</code> instances with <code class="language-plaintext highlighter-rouge">AMIs</code> pre-installed with Docker. <code class="language-plaintext highlighter-rouge">ECS</code> handles the installation of containers and the scaling, monitoring and management of the <code class="language-plaintext highlighter-rouge">EC2</code> instances through an API or the AWS Management Console. An <code class="language-plaintext highlighter-rouge">ECS</code> instance has Docker and an <code class="language-plaintext highlighter-rouge">ECS</code> Container Agent running on it. A Container Instance can run many Tasks. The Agent takes care of the communication between ECS and the instance, reporting the status of running containers and launching new ones.</p>
<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-architecture/ECS.png" alt="ECS" /></p>
<p>Several <code class="language-plaintext highlighter-rouge">ECS</code> container instances can be combined into an <code class="language-plaintext highlighter-rouge">ECS</code> cluster: Amazon ECS handles the logic of scheduling, maintaining, and handling scaling requests to these instances. It also takes away the work of finding the optimal placement of each Task based on your CPU and memory needs.</p>
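<p>The unit that ECS schedules onto those instances is a Task, described by a task definition. A minimal one can be sketched as a plain payload - the family name, image and resource sizes below are purely illustrative, and a boto3-style client would receive such a payload via something like <code class="language-plaintext highlighter-rouge">ecs_client.register_task_definition(**payload)</code>:</p>

```python
def make_task_definition(family, image, cpu_units, memory_mib):
    """Build a minimal ECS task definition payload (illustrative values).

    CPU is given in ECS CPU units (1024 units = 1 vCPU), memory as a
    hard limit in MiB -- the two knobs the scheduler uses for placement.
    """
    return {
        "family": family,
        "containerDefinitions": [
            {
                "name": family,
                "image": image,
                "cpu": cpu_units,
                "memory": memory_mib,
                "essential": True,  # task stops if this container stops
            }
        ],
    }

# Example: one task running a (hypothetical) aligner image with 2 vCPUs.
payload = make_task_definition("align-task", "myorg/aligner:latest", 2048, 4096)
```

<p>It is exactly these CPU and memory numbers that ECS uses when it decides which container instance a Task should land on.</p>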
<h2 id="aws-batch">AWS Batch</h2>
<p>The separation of <code class="language-plaintext highlighter-rouge">AWS Batch</code> from <code class="language-plaintext highlighter-rouge">ECS</code> was most blurry to me. Essentially, <code class="language-plaintext highlighter-rouge">AWS Batch</code> is built on top of regular <code class="language-plaintext highlighter-rouge">ECS</code> and comes with additional features such as:</p>
<ul>
<li>Managed compute environment: AWS handles cluster scaling in response to workload.</li>
<li>Heterogeneous instance types: useful when outlier jobs take up large amounts of resources</li>
<li>Spot instances: Save money compared to on-demand instances</li>
<li>Easy integration with <code class="language-plaintext highlighter-rouge">Cloudwatch</code> logs (<code class="language-plaintext highlighter-rouge">stdout</code> and <code class="language-plaintext highlighter-rouge">stderr</code> captured automatically). This can also lead to insane costs, so <strong>watch out</strong>. More on that later.</li>
</ul>
<p><code class="language-plaintext highlighter-rouge">AWS Batch</code> will effectively take care of firing up instances to handle your workload and then let <code class="language-plaintext highlighter-rouge">ECS</code> handle the Docker orchestration and job execution.</p>
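<p>From the user’s side, handing work to <code class="language-plaintext highlighter-rouge">AWS Batch</code> boils down to submitting a job against a queue and a job definition. The sketch below shows the shape of such a call; it deliberately takes a boto3-style client as a parameter instead of creating one, and the queue name, job definition and container command are all made up for illustration:</p>

```python
def submit_pipeline_job(batch_client, job_queue, job_definition, sample):
    """Submit one job to AWS Batch via a boto3-style client.

    `batch_client` is expected to expose `submit_job(**kwargs)` like
    boto3's Batch client; all names below are hypothetical examples.
    """
    return batch_client.submit_job(
        jobName=f"align-{sample}",          # visible in the Batch console
        jobQueue=job_queue,                 # e.g. "pipeline-queue"
        jobDefinition=job_definition,       # e.g. "aligner-jobdef:3"
        containerOverrides={
            # Per-job overrides of the job definition's defaults.
            "command": ["run-alignment.sh", sample],
        },
    )
```

<p>Batch then figures out what capacity is needed for the queued jobs, spins up instances, and lets ECS place the corresponding containers.</p>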
<h1 id="putting-it-all-together">Putting it all together</h1>
<figure style="width: 500px" class="align-right">
<img src="https://t-neumann.github.io/assets/images/posts/AWS-architecture/AWSArchitecture.png" alt="AWS Architecture" />
</figure>
<p>So how do all the AWS building blocks we just discussed fit together to process jobs? Let’s walk through it and conclude this post:</p>
<ul>
<li>All jobs we want to be processed are sent to <code class="language-plaintext highlighter-rouge">AWS Batch</code>, which will assess the resources needed and fire up <code class="language-plaintext highlighter-rouge">ECS</code> instances accordingly.</li>
<li><code class="language-plaintext highlighter-rouge">ECS</code> will take care of pulling the Docker images needed from a container registry (usually Docker hub) and fire up containers on the <code class="language-plaintext highlighter-rouge">EC2</code> instances using the pre-installed Docker daemon.</li>
<li>These <code class="language-plaintext highlighter-rouge">EC2</code> instances have been initialized with custom <code class="language-plaintext highlighter-rouge">AMIs</code> on startup, having all <code class="language-plaintext highlighter-rouge">ECS</code> prerequisites and additional customized resources such as the <code class="language-plaintext highlighter-rouge">AWS CLI</code> and additional <code class="language-plaintext highlighter-rouge">EBS</code> volume space.</li>
<li>All data required for this job is fetched from their long-term storage in <code class="language-plaintext highlighter-rouge">S3</code> to the local <code class="language-plaintext highlighter-rouge">EBS</code> storage of the respective <code class="language-plaintext highlighter-rouge">EC2</code> instance.</li>
</ul>
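<p>The last step of that walkthrough - staging data from <code class="language-plaintext highlighter-rouge">S3</code> onto the instance’s local <code class="language-plaintext highlighter-rouge">EBS</code> storage - can be sketched as a small helper. It expects a boto3-style S3 client (anything exposing <code class="language-plaintext highlighter-rouge">download_file(bucket, key, dest)</code>); the bucket and key names are purely illustrative:</p>

```python
import os

def stage_inputs(s3_client, bucket, keys, workdir):
    """Download job inputs from S3 onto local (EBS-backed) storage.

    `s3_client` is expected to behave like a boto3 S3 client
    (`download_file(bucket, key, dest)`); bucket and key names
    passed in are purely illustrative.
    """
    os.makedirs(workdir, exist_ok=True)
    local_paths = []
    for key in keys:
        # Keep just the file name; the S3 prefix is not mirrored locally.
        dest = os.path.join(workdir, os.path.basename(key))
        s3_client.download_file(bucket, key, dest)
        local_paths.append(dest)
    return local_paths
```

<p>In a real container this would typically run as the first step of the job script, before the actual tool is invoked on the downloaded files.</p>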
<p>Now the job has everything it needs to run and will be processed.
After reading this post, you should have a basic understanding of the AWS building blocks that make up an AWS Batch scheduling system. The next step is to actually build the architecture for such a pipeline, which I will cover in a dedicated post.</p>Tobias NeumannIf you talk about the omni-present buzzword cloud computing, you will inevitably stumble over Amazon Web Services . Sounds super cool and everybody gets excited about it, but I for my part was simply overwhelmed by the amount of services and products available from the platform.Welcome to my website!2019-01-17T21:10:00+01:002019-01-17T21:10:00+01:00https://t-neumann.github.io/general/intro<h1 id="hello-world">Hello world!</h1>
<p>I was repeatedly gently pushed towards writing a couple of blog posts about all the obstacles I bothered people on various <a href="https://gitter.im">Gitter channels <i class="fab fa-gitter" aria-hidden="true"></i></a> with, so I finally made it happen.</p>
<p>Since I hate anything related to web development, HTML, CSS, JS - you name it - hosting Jekyll on GitHub is the most I can reasonably do. I’m actually quite happy that it requires little CSS and HTML and can be mostly put together via Markdown.</p>
<p>To glue this minimal website together, I shamelessly forked the <a href="https://github.com/mmistakes/minimal-mistakes">Minimal mistakes <i class="fab fa-github" aria-hidden="true"></i></a> template and borrowed some code from <a href="https://github.com/maxulysse/maxulysse.github.io">Maxime Garcia <i class="fab fa-github" aria-hidden="true"></i></a> for some stuff I liked from the blogs I looked at.</p>
<p>The plan is to put up posts here on anything related to bioinformatics, reproducible pipeline engineering and occasionally rocket science and orbital mechanics.</p>
<p>Cheers</p>Tobias NeumannHello world!