<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://t-neumann.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://t-neumann.github.io/" rel="alternate" type="text/html" /><updated>2025-11-09T16:44:36+01:00</updated><id>https://t-neumann.github.io/feed.xml</id><title type="html">t-neumann.github.io</title><subtitle>Personal website of Tobias Neumann.</subtitle><author><name>Tobias Neumann</name></author><entry><title type="html">ICLR 2025 digest</title><link href="https://t-neumann.github.io/conferences/machine%20learning/iclr2025/" rel="alternate" type="text/html" title="ICLR 2025 digest" /><published>2025-04-20T14:30:00+02:00</published><updated>2025-04-20T14:30:00+02:00</updated><id>https://t-neumann.github.io/conferences/machine%20learning/iclr2025</id><content type="html" xml:base="https://t-neumann.github.io/conferences/machine%20learning/iclr2025/"><![CDATA[<p>This April, I had the opportunity to attend the International Conference on Learning Representations (ICLR) 2025 in Singapore (after missing the 2024 edition in Vienna, shame on me). ICLR has established itself as the premier gathering for professionals dedicated to representation learning and deep learning, and I needed to know what the fuss was about.</p>

<p>The conference brought together an impressive mix of academic researchers, industry practitioners from companies like Google DeepMind, Meta, and Isomorphic Labs, as well as entrepreneurs and graduate students - all converging on Singapore to discuss cutting-edge research spanning machine vision, computational biology, speech recognition, and robotics.</p>

<p>Beyond the technical content, Singapore itself provided a stunning backdrop. The tropical climate, lush greenery, and modern architecture - especially the iconic Marina Bay Sands - created a very futuristic atmosphere. The city's food scene and details like subway pop choreography performances added to the experience. But let me dive into what mattered most to me: the technical developments that are shaping the future of AI.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/ICLR-2025/singapore_architecture.jpg" alt="Singapore architecture" /></p>

<h2 id="the-rise-of-agentic-ai">The Rise of Agentic AI</h2>

<p>If there was one overarching theme at ICLR 2025, it was the evolution of AI systems from passive responders to active agents. The shift from traditional deep learning models to agentic AI represents a fundamental change in the field of artificial intelligence.</p>

<h3 id="from-models-to-agents">From Models to Agents</h3>

<p>Traditional AI systems, even sophisticated ones, essentially function as input-output machines. You provide data, they provide predictions. Agentic AI systems, by contrast, exhibit several key characteristics that make them fundamentally different:</p>

<p><strong>Goal-oriented behavior</strong>: Rather than simply responding to prompts, agentic systems can pursue complex, multi-step objectives autonomously. They don’t just answer “what should I do next?” - they actually do it.</p>

<p><strong>Reflection and adaptation</strong>: Perhaps most intriguingly, these systems can reflect upon their own work and iteratively improve their approach. This meta-cognitive capability allows them to identify failures, adjust strategies, and come up with follow-up steps without human intervention.</p>

<p><strong>Environment interaction</strong>: Agentic AI systems actively interact with their environment - whether that’s a database, a laboratory instrument, or a computational workflow. They can query information, execute commands, and observe the results of their actions.</p>

<p><strong>Tool integration</strong>: Modern agentic systems seamlessly integrate with external tools and APIs, allowing them to leverage specialized capabilities beyond their core model. For instance, TxAgent - a therapeutic reasoning agent presented at the conference - orchestrates 211 specialized tools spanning FDA drug databases, Open Targets, and the Human Phenotype Ontology. This dynamic tool selection allows agents to access verified, continually updated knowledge rather than relying solely on their training data.</p>

<p><strong>Multi-agent collaboration</strong>: Finally - and most futuristic of all - these systems can coordinate with other AI agents, dividing tasks and sharing information to solve problems that would be intractable for a single agent.</p>

<p>Several talks at ICLR covered agentic systems targeting complex scientific workflows, from literature review and hypothesis generation to experimental design and data analysis. My favourite showcase was Agentic-Tx, a therapeutics-focused system powered by Gemini 2.5, which achieved a 52.3% relative improvement over o3-mini on Humanity’s Last Exam (Chemistry &amp; Biology) and demonstrated significant gains on ChemBench and GPQA benchmarks. The key insight is that these systems aren’t just faster versions of traditional AI - they represent a qualitatively different approach to automation.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/ICLR-2025/txagent_workflow.png" alt="TxAgent workflow" /></p>

<p><em>TxAgent workflow demonstrating agentic AI capabilities: knowledge grounding through tool calls, goal-oriented tool selection, multi-step reasoning, and access to continuously updated knowledge bases. The system generates transparent reasoning traces that show each decision step. Source: Gao et al., 2025</em></p>

<h3 id="txgemma-a-case-study-in-domain-specific-agents">TxGemma: A Case Study in Domain-Specific Agents</h3>

<p>The most interesting development coming from a drug discovery company was TxGemma, a suite of efficient, domain-specific large language models for therapeutic applications. What makes TxGemma noteworthy isn’t just its performance, but its practical accessibility and adaptability for drug discovery.</p>

<p>Built on the Gemma-2 architecture, TxGemma comes in three sizes - 2B, 9B, and 27B parameters - making it dramatically more efficient than typical foundation models. The suite was fine-tuned on a comprehensive dataset of 7.08 million training samples from the Therapeutics Data Commons (TDC), covering 66 different therapeutic development tasks spanning small molecules, proteins, nucleic acids, diseases, and cell lines. This broad training enables TxGemma to handle diverse aspects of drug discovery, from early-stage target identification to late-stage clinical trial predictions.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/ICLR-2025/txgemma_overview.png" alt="TxGemma architecture" /></p>

<p><em>TxGemma model family: Three size variants (2B, 9B, 27B) trained on diverse therapeutic data from TDC, with specialized versions for prediction (TxGemma-Predict) and conversation (TxGemma-Chat). The models can be integrated as tools in agentic systems like Agentic-Tx. Source: Wang et al., 2025</em></p>

<p>The performance results are definitely useful: Across the 66 TDC tasks, TxGemma achieved superior or comparable performance to state-of-the-art models on 64 tasks (outperforming on 45), despite being orders of magnitude smaller than many competing models. On tasks involving drug-target interactions, pharmacokinetics, and toxicity prediction, TxGemma consistently matched or exceeded specialist models that were designed specifically for those narrow applications.</p>

<p>What’s particularly intriguing is TxGemma’s data efficiency. When fine-tuning for clinical trial adverse event prediction, TxGemma matched the performance of base Gemma-2 models using less than 10% of the training data. In data-scarce domains like drug discovery - where proprietary datasets are common and expensive to generate - this efficiency advantage is definitely a major plus if you want to put this in production.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/ICLR-2025/txgemma_performance.png" alt="TxGemma performance comparison" /></p>

<p><em>TxGemma-Predict demonstrates superior performance across diverse therapeutic task types, with particularly strong results on multi-instance tasks involving multiple data modalities. Median relative improvements show consistent gains over both generalist and specialist state-of-the-art models. Source: Wang et al., 2025</em></p>

<p>The real power of domain-specific foundational models lies in their ability to serve as starting points. Rather than training from scratch, researchers can fine-tune TxGemma on their specific tasks, dramatically reducing the computational resources and data required to achieve good performance. The models can even run on a single Nvidia H100 GPU, making them accessible to smaller research groups and enabling local deployment for sensitive applications.</p>

<p>TxGemma also came with a nice illustrative example of how it can be used in agentic systems - theirs being Agentic-Tx, the therapeutics-focused agentic system powered by Gemini 2.5 that extends TxGemma’s capabilities by orchestrating complex workflows. In contrast to TxGemma’s direct generation of solutions, Agentic-Tx employs a modular, tool-usage paradigm. It builds on the ReAct framework, which interleaves reasoning steps (“thoughts”) with actions (tool use): the agent receives a task or question, iteratively takes actions based on its current context, and can therefore answer questions that require multiple reasoning steps to solve. For example, “What structural modifications could improve the potency of the given drug?” requires iteratively searching the drug’s structural space and then prompting TxGemma to predict potency.</p>
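<p>The ReAct pattern described above can be sketched in a few lines of plain Python. This is a toy illustration only, not Agentic-Tx’s actual code: the tool registry and the scripted “LLM” are hypothetical stand-ins.</p>

```python
# Minimal ReAct-style loop (illustrative sketch, not Agentic-Tx's code).
# The "LLM" is scripted and the single tool is a hypothetical stand-in
# for a TxGemma-backed potency predictor.

def fake_llm(context):
    """Stand-in for an LLM: returns a (thought, action, argument) triple
    based on what is already in the accumulated context."""
    if "Observation:" not in context:
        return ("I need a potency estimate for this analog.",
                "predict_potency", "CC(=O)Oc1ccccc1C(=O)O")
    return ("I have enough information to answer.", "finish",
            "The acetyl analog is predicted to retain potency.")

# Hypothetical tool registry; in Agentic-Tx these would wrap TxGemma etc.
TOOLS = {
    "predict_potency": lambda smiles: f"potency(pIC50) ~ 6.2 for {smiles}",
}

def react_loop(question, max_steps=5):
    context = question
    for _ in range(max_steps):
        thought, action, arg = fake_llm(context)   # reason ("thought")
        if action == "finish":
            return arg
        observation = TOOLS[action](arg)           # act, then observe
        context += (f"\nThought: {thought}"
                    f"\nAction: {action}"
                    f"\nObservation: {observation}")
    return "step budget exhausted"

answer = react_loop("What structural modifications could improve potency?")
```

<p>The essential point is the alternation: each turn appends a thought, an action, and the observed tool output to the context, so later reasoning steps can build on earlier results.</p>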

<p><img src="https://t-neumann.github.io/assets/images/posts/ICLR-2025/agenticTx.png" alt="Agentic-Tx" /></p>

<p><em>Agentic-Tx in combination with the ReAct framework to interleave thought with tool-usage. In this example, Agentic-Tx uses two tools to decide which hit from a screening campaign should be prioritized: TxGemma-Chat and the clinical toxicity prediction tool based on TxGemma-Predict.</em></p>

<p>Lastly, TxGemma goes beyond prediction. Unlike traditional models that output only answers, TxGemma-Chat - the conversational variant - can explain its reasoning. When asked why a molecule crosses the blood-brain barrier, it can discuss lipophilicity, molecular weight, and hydrogen bonding based directly on the molecular structure. This explainability is a notable first in therapeutic AI and addresses one of the field’s most significant limitations: the “black box” problem.</p>

<p>TxGemma-Chat maintains this conversational ability while accepting only about a 10% performance reduction on predictive tasks compared to TxGemma-Predict. This trade-off - slightly lower raw accuracy for vastly improved interpretability and user interaction - represents an important design decision in therapeutic AI. For research applications where understanding the model’s reasoning is crucial for gaining insight into the decisive parameters behind a prediction, this trade-off is definitely worth it.</p>

<p>Moreover, TxGemma has been released as an open model specifically trained only on commercially licensed datasets. This decision recognizes the prevalence of proprietary data in pharmaceutical research and allows smaller biotech and pharmaceutical startups to adapt and validate the models on their own datasets, potentially tailoring performance to their specific research needs and real-world applications.</p>

<h2 id="multimodal-learning-connecting-different-data-types">Multimodal Learning: Connecting Different Data Types</h2>

<p>Another major theme at ICLR was the challenge of integrating different types of biological and chemical data. In drug discovery and computational biology, we often have rich datasets in different modalities - transcriptomics, proteomics, chemical structures, microscopy images - but connecting these disparate data types is challenging.</p>

<h3 id="multimodal-adapters-efficient-cross-modal-learning">Multimodal Adapters: Efficient Cross-Modal Learning</h3>

<p>One elegant solution presented at the conference was the concept of multimodal adapters - specifically, the single-cell Drug-Conditional Adapter (scDCA). The core idea is pretty simple: rather than training massive end-to-end models that try to handle all modalities simultaneously, we can train small “adapter” layers that bridge between pre-trained foundational models for each modality.</p>

<p>Here’s how it works. You might have a powerful foundational model for single-cell transcriptomics (like scGPT, trained on 33 million cells) and another for molecular structures (like ChemBERTa, trained on 77 million compounds). Rather than starting from scratch to predict how drugs affect cells, scDCA introduces lightweight adapter layers that learn to translate molecular information into adjustments of the cellular model’s internal representations.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/ICLR-2025/scdca_architecture.png" alt="scDCA architecture" /></p>

<p><em>Architecture of scDCA showing drug-conditional adapters that efficiently fine-tune single-cell foundation models. The adapter introduces molecular conditioning through dynamic bias adjustments while keeping the original transformer weights frozen, enabling training with less than 1% of the original model’s parameters. Source: Maleki et al., 2025</em></p>

<p>The advantages are pretty striking:</p>

<p><strong>Efficiency</strong>: Adapters typically involve training only 1% of the parameters compared to the original foundational models. For scDCA specifically, while the base scGPT model has millions of parameters, the adapters add only a tiny fraction. This makes them dramatically faster and cheaper to train - critical when working with limited datasets.</p>

<p><strong>Avoiding overfitting</strong>: By keeping the foundational models frozen and only training the adapter, you preserve the learned knowledge in the original models. This is particularly valuable when working with limited paired training data - a common challenge when you have only 188 compounds with cellular response data (as in the Sciplex3 dataset used for validation).</p>

<p><strong>Flexibility</strong>: You can mix and match different foundational models by simply training new adapters, without needing to retrain entire systems. Need to connect a different molecular encoder? Just train a new adapter layer.</p>

<p>Performance was pretty neat: scDCA successfully predicted cellular responses to novel drugs and - even more impressively - generalized to completely unseen cell lines in a zero-shot setting with 82% accuracy. This generalization happens because the frozen single-cell foundation model supposedly retains its understanding of gene-gene interactions and cellular states, while the adapter learns to modulate these representations based on molecular structure.</p>

<h3 id="multimodal-lego-assembling-models-like-building-blocks">Multimodal Lego: Assembling Models Like Building Blocks</h3>

<p>Taking the adapter concept even further, MM-Lego (Multimodal Lego) introduced a framework that makes any set of encoders compatible for model merging and fusion - without requiring paired training data at all.</p>

<p>The key innovation is the “LegoBlock” - a wrapper that enforces two critical properties. First, it ensures all modalities produce latent representations with the same dimensions (making them stackable, like Lego pieces). Second, and more cleverly, it learns representations in the frequency domain using Fourier transforms. Why does this matter? Frequency-domain representations are apparently less prone to signal interference when combined, making them ideal for model merging (I have to trust them on this ^^).</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/ICLR-2025/mm_lego_workflow.png" alt="MM-Lego workflow" /></p>

<p><em>The MM-Lego workflow showing how LegoBlocks enforce structural compatibility and learn frequency-domain representations that enable merging without signal interference. Models can be merged without any fine-tuning (LegoMerge) or with minimal fine-tuning for state-of-the-art performance (LegoFuse). Source: Hemker et al., 2024</em></p>

<p>MM-Lego introduced two approaches:</p>

<p><strong>LegoMerge</strong>: Combines models trained entirely separately - without any paired data or fine-tuning. The merged representation uses a harmonic mean of magnitudes and arithmetic mean of phases in the frequency domain, carefully designed to avoid one modality dominating the signal. Interestingly and counter-intuitively, this achieved competitive performance with end-to-end trained models across seven medical datasets, despite never seeing a single multimodal training sample.</p>

<p><strong>LegoFuse</strong>: Takes the merged components and fine-tunes them for just a few epochs (as little as 2) with paired data. This allows modalities to mutually contextualize each other while avoiding the computational overhead of full end-to-end training. LegoFuse achieved state-of-the-art results on 5 of 7 benchmarked tasks.</p>
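<p>The LegoMerge combination rule lends itself to a small sketch. This is my own toy reconstruction from the description above: the complex-valued latents are made up, and real LegoBlocks would produce them via a Fourier transform over learned features.</p>

```python
# Sketch of the LegoMerge combination rule as described: harmonic mean of
# magnitudes, arithmetic mean of phases, applied elementwise to
# frequency-domain latents. Toy latents only - not MM-Lego's real code.
import cmath
import math

def merge(latent_a, latent_b):
    merged = []
    for za, zb in zip(latent_a, latent_b):
        ra, rb = abs(za), abs(zb)
        r = 2 * ra * rb / (ra + rb)          # harmonic mean of magnitudes:
                                             # a huge magnitude in one modality
                                             # cannot dominate the merge
        phi = (cmath.phase(za) + cmath.phase(zb)) / 2  # arithmetic mean of
                                             # phases (naive: ignores the
                                             # wrap-around at +/- pi)
        merged.append(cmath.rect(r, phi))
    return merged

# Two hypothetical unimodal latents already in the frequency domain.
histology = [cmath.rect(1.0, 0.0), cmath.rect(2.0, math.pi / 2)]
genomics  = [cmath.rect(1.0, 0.0), cmath.rect(4.0, math.pi / 2)]

fused = merge(histology, genomics)
# second component: magnitude 2*2*4/(2+4) = 8/3, phase pi/2
```

<p>Note how the harmonic mean pulls the merged magnitude toward the smaller of the two inputs (8/3 rather than the arithmetic mean of 3) - one plausible reading of the claim that the rule is “carefully designed to avoid one modality dominating the signal.”</p>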

<p>The practical advantages are substantial: MM-Lego scales linearly with the number of modalities (not quadratically like many attention-based methods), handles missing modalities gracefully, works with non-overlapping training sets, and - critically - doesn’t require architecturally identical models. You can combine a CNN for images with a transformer for sequences simply by wrapping each in a LegoBlock.</p>

<p>One presentation demonstrated training on completely non-overlapping datasets - one set of patients with histopathology slides, a different set with genomic data, both with the same clinical outcomes. Traditional end-to-end models can’t handle this scenario at all, but MM-Lego achieved strong performance by training each modality independently and merging the results.</p>

<h3 id="why-this-matters">Why This Matters</h3>

<p>These approaches address fundamental challenges we face everyday in computational biology and drug discovery. Paired multi-modal measurements are expensive and often impossible - you might have single-cell data for some conditions and bulk sequencing for others, microscopy for some samples and proteomics for different samples. Traditional methods force you to either throw away data (using only the intersection) or impute missing values (introducing noise).</p>

<p>Adapter-based and frequency-domain approaches like scDCA and MM-Lego let you leverage all available data by training on unpaired samples and combining models afterward. As Michael Bronstein memorably put it in his panel discussion: “Everybody wants to develop the next AlphaFold, nobody the next PDB.” The bottleneck isn’t model architecture - it’s generating high-quality data at scale. Methods that work with incomplete, unpaired, and heterogeneous data are essential for making progress.</p>

<h2 id="contrastive-learning-learning-from-similarity">Contrastive Learning: Learning from Similarity</h2>

<p>Contrastive learning was also another recurring topic at ICLR, particularly for biological and chemical applications where labeled data can be scarce but unlabeled structure is abundant.</p>

<h3 id="the-core-principle">The Core Principle</h3>

<p>The fundamental idea behind contrastive learning is elegantly simple: teach a model to recognize what’s similar and what’s different. Rather than requiring explicit labels for every example, you create pairs of data points and ask the model to learn that similar pairs should have similar representations, while dissimilar pairs should be far apart in representation space.</p>

<p>In the context of biological perturbations, this might mean:</p>

<ul>
  <li><strong>Similar pairs</strong>: Unperturbed samples from the same cell line, or cells treated with the same compound</li>
  <li><strong>Dissimilar pairs</strong>: Perturbed vs. unperturbed samples, or cells treated with different compounds</li>
</ul>

<p>By training models to maximize agreement within similar pairs and maximize disagreement between dissimilar pairs, you can learn rich representations that capture meaningful biological variation.</p>
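<p>Concretely, this usually boils down to an InfoNCE-style objective. The sketch below is a generic textbook version in plain Python, not any specific paper’s loss.</p>

```python
# Bare-bones InfoNCE-style contrastive loss (generic sketch of the
# principle, not a specific paper's objective). The loss is the negative
# log-probability of picking the positive out of positive + negatives.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(anchor, positive, negatives, temperature=0.1):
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    logits = [s / temperature for s in sims]
    m = max(logits)  # subtract the max to keep the softmax numerically stable
    denom = sum(math.exp(x - m) for x in logits)
    return -math.log(math.exp(logits[0] - m) / denom)

# e.g. embeddings of cells treated with the same compound should sit together
anchor    = [1.0, 0.0]
positive  = [0.9, 0.1]                  # same perturbation, nearby embedding
negatives = [[-1.0, 0.1], [0.0, 1.0]]   # different perturbations

loss_good = info_nce(anchor, positive, negatives)
loss_bad  = info_nce(anchor, [-1.0, 0.0], [[0.9, 0.1], [0.0, 1.0]])
# the loss is far lower when the positive really is close to the anchor
```

<p>Minimizing this loss over many such triples is what pulls same-perturbation samples together and pushes different perturbations apart in the embedding space.</p>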

<h3 id="better-separation-better-biology">Better Separation, Better Biology</h3>

<p>Several presentations demonstrated how contrastive learning leads to better separation of perturbations in embedding spaces. This has practical implications for downstream analyses - better UMAPs, clearer clustering, and more interpretable representations of complex biological states.</p>

<p>One particularly clever application involved using different molecular representations (SMILES strings, graphs, 3D conformations) and creating contrastive pairs based on their chemical similarity. Despite working with relatively small domain-specific datasets (1.5 million compounds), this approach produced models with performance comparable to those trained on billions of general chemical structures.</p>

<p>The key insight is that contrastive learning allows you to leverage the structure inherent in your data - the relationships between samples - rather than requiring expensive manual annotations for every data point.</p>

<h3 id="beyond-one-dimensional-learning-langpert-and-hybrid-llm-approaches">Beyond One-Dimensional Learning: LangPert and Hybrid LLM Approaches</h3>

<p>Several talks pushed contrastive learning into more sophisticated territory, combining it with Large Language Models to predict unseen perturbations. One particularly innovative approach was <strong>LangPert</strong>, presented by researchers from Novo Nordisk, which demonstrates a clever way to leverage LLMs’ biological knowledge without falling victim to their numerical limitations.</p>

<p>The core challenge is predicting cellular responses to genetic perturbations you’ve never experimentally tested. Traditional foundation models like scGPT and graph neural networks like GEARS have tackled this, but as we have seen repeatedly, even sophisticated deep learning methods often struggle to beat simple baselines like predicting mean expression.</p>

<p>LangPert’s insight is elegantly simple: <strong>let the LLM do biological reasoning, and let traditional methods handle the numbers</strong>. LLMs have absorbed vast scientific literature and “know” about gene functions, pathways, and interactions. But they’re terrible at handling high-dimensional gene expression data - thousands of numerical values that would choke them at the tokenization stage alone.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/ICLR-2025/langpert_framework.png" alt="LangPert framework" /></p>

<p><em>The LangPert framework architecture: Instead of asking LLMs to directly predict high-dimensional gene expression vectors, the system leverages LLMs to identify biologically relevant training examples. For an unseen perturbation (x*), the LLM examines all available training perturbations and selects a small subset of functionally related genes. These LLM-selected examples then guide a k-nearest neighbors aggregator that performs the actual numerical prediction in the high-dimensional expression space. This hybrid approach combines the biological reasoning capabilities of LLMs with efficient numerical computation. Source: Märtens et al., 2025</em></p>

<p>The framework works in two steps. First, when predicting the effects of an unseen gene knockout, LangPert asks the LLM: “Which genes from my training set are most functionally similar to this target gene?” For instance, if predicting SMG5 (involved in mRNA decay), the LLM might select UPF1, UPF2, and RBM8A - all core components of the same pathway. The LLM provides biological reasoning: these genes participate in similar cellular processes and likely produce similar knockout effects.</p>

<p>Second, the system simply averages the actual experimental expression profiles of these LLM-selected genes. This is k-nearest neighbors with a twist - the “neighbors” are chosen by biological reasoning rather than numerical distance in expression space.</p>
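<p>The two steps above can be sketched in a handful of lines. Purely illustrative: the expression profiles are made-up numbers, and <code>llm_select_neighbors</code> is a hard-coded hypothetical stand-in for the actual LLM call.</p>

```python
# Sketch of LangPert's two-step recipe (my simplification of the paper's
# description). Step 1: an "LLM" picks functionally related training genes.
# Step 2: average their measured profiles - kNN with reasoned-out neighbors.

# Measured expression changes for trained perturbations
# (made-up 3-gene profiles, for illustration only).
training_profiles = {
    "UPF1":  [0.9, -0.4, 0.1],
    "UPF2":  [0.8, -0.5, 0.2],
    "RBM8A": [1.0, -0.3, 0.0],
    "GATA1": [-0.2, 1.1, 0.7],   # unrelated pathway, never selected
}

def llm_select_neighbors(target_gene):
    """Stand-in for the LLM step: in reality a frontier LLM would be
    prompted with the list of training genes and asked which are
    functionally closest to the target."""
    pathway_knowledge = {"SMG5": ["UPF1", "UPF2", "RBM8A"]}  # hypothetical
    return pathway_knowledge.get(target_gene, [])

def predict_unseen(target_gene):
    neighbors = llm_select_neighbors(target_gene)      # step 1: reasoning
    profiles = [training_profiles[g] for g in neighbors]
    n = len(profiles)
    return [sum(col) / n for col in zip(*profiles)]    # step 2: averaging

prediction = predict_unseen("SMG5")  # mean of the three NMD-pathway profiles
```

<p>The numerical part is deliberately trivial - the value added by the LLM is entirely in <em>which</em> neighbors get averaged.</p>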

<p>On the K562 benchmark, LangPert achieved substantially better performance than previous methods, and this advantage held across different data regimes. What makes it work? Unlike static embeddings, LangPert dynamically reasons about relevance for each prediction. Different frontier LLMs (Claude, OpenAI o1, o3-mini) select somewhat different gene sets yet achieve similar performance, suggesting multiple valid biological paths to good predictions. The system can even incorporate self-critique, asking the LLM to refine its initial selections.</p>

<p>This “LLM-informed contextual synthesis” represents a template for integrating LLMs into scientific workflows more broadly. Rather than forcing LLMs to handle everything, we architect systems where LLMs do conceptual reasoning while traditional methods handle precise numerical operations. The approach also maintains interpretability - you can examine which genes were selected and read the LLM’s biological rationale, crucial for scientific applications where understanding <em>why</em> matters as much as the predictions themselves.</p>

<h2 id="training-strategies-and-data-quality">Training Strategies and Data Quality</h2>

<p>Amidst all the excitement about new architectures and approaches, several sobering talks reminded attendees about fundamental challenges in model training and evaluation.</p>

<h3 id="the-train-test-split-problem">The Train-Test Split Problem</h3>

<p>One eye-opening presentation highlighted a subtle but critical issue in molecular machine learning: how we split our data for training and testing. The standard approach of random splitting can lead to severely imbalanced distributions of molecular similarity between training and test sets.</p>

<p>Here’s why this matters: if your test set happens to contain many molecules very similar to your training set, your model will appear to perform much better than it actually does. The good performance on easy (similar) test examples masks poor performance on truly novel molecules.</p>

<p>The proposed solution - Similarity-Aware Evaluation (SAE) - explicitly controls the distribution of similarities in test sets to ensure balanced evaluation. This gives a more honest assessment of how well models generalize to genuinely new chemical space.</p>
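<p>The idea can be illustrated with toy fingerprints. This is my own minimal sketch of similarity-controlled evaluation, not the exact SAE algorithm: molecules are bit sets, and each candidate test molecule is binned by its maximum Tanimoto similarity to the training set.</p>

```python
# Minimal sketch of similarity-aware evaluation (illustrative, not the
# actual SAE method). Binning test molecules by their max similarity to
# the training set lets you report performance per similarity bin instead
# of one flattering average.

def tanimoto(fp_a, fp_b):
    inter = len(fp_a & fp_b)
    union = len(fp_a | fp_b)
    return inter / union if union else 0.0

# Toy bit-fingerprints for the training set and two test candidates.
train = [{1, 2, 3, 4}, {2, 3, 5}]
candidates = {
    "near_duplicate": {1, 2, 3, 4, 5},   # very similar to the training set
    "novel_scaffold": {7, 8, 9},         # shares no bits with training
}

def max_train_similarity(fp):
    return max(tanimoto(fp, t) for t in train)

bins = {}
for name, fp in candidates.items():
    s = max_train_similarity(fp)
    label = "easy (sim >= 0.5)" if s >= 0.5 else "hard (sim < 0.5)"
    bins.setdefault(label, []).append(name)
# reporting metrics per bin exposes models that only shine on "easy" cases
```

<p>A random split would happily mix both candidates into one test set and report a single number; the binned view makes the easy/hard distinction explicit.</p>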

<p>This serves as a reminder that methodological rigor in evaluation is just as important as sophisticated model architectures. Without proper evaluation strategies, we risk fooling ourselves about our models’ capabilities.</p>

<h3 id="the-data-generation-bottleneck">The Data Generation Bottleneck</h3>

<p>Perhaps my most memorable quote from the conference came from Michael Bronstein during a panel discussion: “Everybody wants to develop the next AlphaFold, nobody wants to develop the next Protein Data Bank.”</p>

<p>This pithy observation captures a fundamental tension in computational biology and drug discovery. We’re incredibly good at building sophisticated AI models, but generating the high-quality, large-scale datasets these models need remains a major bottleneck.</p>

<p>Several speakers from both academia and industry emphasized this point. When I chatted with Google DeepMind, they also mentioned struggling with throughput in their lab automation efforts - testing only 20-30 proteins every two weeks. Meanwhile, unlearning and safety talks highlighted that even with massive datasets, ensuring data quality and removing harmful content remains challenging.</p>

<p>The message to me was clear and also somewhat encouraging when facing behemoths like OpenAI: the next major advances in AI for science won’t come from slightly better architectures, but from systematic approaches to generating better data at scale.</p>

<h2 id="industry-perspectives">Industry Perspectives</h2>

<p>One of the highlights of attending ICLR was the opportunity to mingle with researchers from major industry players. Networking events at Marina Bay Sands brought together people from Google DeepMind, Meta, Isomorphic Labs, and various biotech companies.</p>

<h3 id="isomorphic-labs">Isomorphic Labs</h3>

<p>Isomorphic Labs, built around the core IP of AlphaFold3, has made impressive progress in translating academic breakthroughs into practical drug discovery. Their partnerships with Novartis and Eli Lilly (totaling over $90M in upfront payments, with a $700M investment round) signal serious industry confidence.</p>

<p>What’s particularly interesting is their ambition to apply AI across the entire drug development process, not just structure prediction. This includes ADME (absorption, distribution, metabolism, and excretion) prediction - traditionally challenging areas that have resisted computational approaches. I have huge reservations that they can win on this turf, though, as big pharma players already have massive proprietary datasets in their hands.</p>

<p>According to conversations at the conference, access to internal pharmaceutical ADME datasets from their partners could be a game-changer, potentially allowing them to train models on data that has never been publicly available.</p>

<h3 id="google-deepmind-building-the-lab-in-the-loop">Google DeepMind: Building the Lab-in-the-Loop</h3>

<p>DeepMind’s research arm is taking a complementary approach, focusing on protein design and active learning with automated laboratories. Their vision of a fully automated “lab-in-the-loop” system - where AI designs experiments, robots execute them, and the results feed back to improve the AI - remains aspirational but compelling.</p>

<p>Interestingly, they maintain relatively small labs with simple readouts (protein binding, basic toxicity assays), suggesting that even with Google’s resources, scaling experimental throughput remains challenging. This again reinforces the data generation bottleneck theme.</p>

<h2 id="relevant-data-resources">Relevant Data Resources</h2>

<p>Throughout the conference, several valuable data resources were repeatedly mentioned. These public datasets are enabling the current wave of AI applications in biology and chemistry:</p>

<p><strong>Therapeutics Data Commons (TDC)</strong>: A comprehensive collection of datasets for therapeutic applications, spanning from molecular properties to clinical outcomes.</p>

<p><strong>PrimeKG</strong>: A holistic knowledge graph integrating 20 high-quality biomedical resources, describing over 17,000 diseases with more than 4 million relationships across biological scales.</p>

<p><strong>BioSNAP</strong>: Diverse biomedical networks including protein-protein interactions, single-cell similarity networks, and drug-drug interactions.</p>

<p><strong>Genome-wide Perturb-seq</strong>: Large-scale perturbation screens (K562 and RPE1 cell lines) enabling systematic study of gene function.</p>

<p><strong>Sciplex3</strong>: Single-cell RNA-seq data for over 100 perturbations across multiple cell lines, with dose and time resolution - particularly valuable for training models on chemical perturbations.</p>

<p><strong>Tahoe-100M</strong>: A massive dataset of 105 million single cells across 60,000 conditions in cancer cell lines.</p>

<h2 id="philosophical-reflections">Philosophical Reflections</h2>

<p>Beyond the technical talks, ICLR featured several thought-provoking discussions on AI, human intelligence, and psychology. These sessions grappled with fundamental questions about what we’re building and where it’s headed.</p>

<p>One recurring theme was the relationship between artificial and biological intelligence. As our AI systems become more capable, are they converging on similar computational strategies to human cognition, or discovering fundamentally alien approaches to problem-solving? The jury is still out, but the question itself reflects how far the field has come.</p>

<p>Another set of talks focused on AI safety and alignment, particularly “unlearning” techniques for removing harmful capabilities from trained models. As models become more powerful, ensuring they can’t be easily jailbroken to produce harmful outputs becomes increasingly critical. The technical approaches discussed ranged from improved training set filtering to post-hoc unlearning procedures, though no silver bullet has emerged.</p>

<h2 id="closing-thoughts">Closing Thoughts</h2>

<p>ICLR 2025 showcased a field in transition. The move from passive models to active agents, the increasing sophistication of multimodal integration, and the maturation of contrastive learning approaches all point toward AI systems that are more flexible, more powerful, and more practically useful than ever before.</p>

<p>Yet the conference also highlighted persistent challenges: the data quality bottleneck, the difficulty of proper evaluation, and the gap between impressive demos and production systems. The most successful applications will likely come from groups that can address both the algorithmic and the data generation sides of the equation.</p>

<p>For computational biologists and drug discovery researchers, the message is clear: sophisticated AI tools are becoming increasingly accessible (you can run TxGemma on a single GPU!), but generating the right data to train and validate these tools remains the critical challenge. The next AlphaFold won’t come from a better architecture alone - it will require the next PDB.</p>

<p>Singapore provided an awesome setting for the conference: The city’s blend of natural beauty, cutting-edge architecture, and vibrant culture somehow felt fitting for a conference looking toward the future of technology. Between sessions, I could explore hawker centers with incredible Asian cuisine, walk through tropical gardens, or simply marvel at the engineering achievement that is Marina Bay Sands.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/ICLR-2025/singapore_food.jpg" alt="Singapore food" /></p>

<p>The field of deep learning continues to move at a breathtaking pace. If ICLR 2025 is any indication, the next few years should bring AI agents that can meaningfully accelerate scientific discovery - but only if we can match our computational architectures with sophisticated new experimental approaches to generating high-quality data at scale.</p>

<p>Looking forward to my next conference already - maybe ICLR 2026!</p>]]></content><author><name>Tobias Neumann</name></author><category term="Conferences" /><category term="Machine Learning" /><category term="AI" /><category term="Deep Learning" /><category term="Agentic AI" /><category term="Multimodal Learning" /><category term="Conference" /><summary type="html"><![CDATA[Highlights from the International Conference on Learning Representations 2025 in Singapore]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://t-neumann.github.io/assets/images/categories/iclr.png" /><media:content medium="image" url="https://t-neumann.github.io/assets/images/categories/iclr.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Splice_sim - Benchmarking RNA-seq mapping in the age of nucleotide conversions</title><link href="https://t-neumann.github.io/bioinformatics/pipelines/splice-sim/" rel="alternate" type="text/html" title="Splice_sim - Benchmarking RNA-seq mapping in the age of nucleotide conversions" /><published>2024-06-27T22:47:00+02:00</published><updated>2024-06-27T22:47:00+02:00</updated><id>https://t-neumann.github.io/bioinformatics/pipelines/splice-sim</id><content type="html" xml:base="https://t-neumann.github.io/bioinformatics/pipelines/splice-sim/"><![CDATA[<p>Nucleotide conversion RNA sequencing techniques have revolutionized how we study RNA modifications, stability, and dynamics. From metabolic labeling experiments that track RNA synthesis and decay, to bisulfite sequencing that maps methylation sites - these approaches provide unprecedented insights into post-transcriptional regulation. However, they come with a substantial challenge: the very nucleotide conversions that make these experiments powerful can introduce biases in how reads map to reference genomes.</p>

<p>Our recent paper published in <a href="https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03313-8">Genome Biology</a> introduces <strong>splice_sim</strong>, a comprehensive simulation and evaluation framework designed to systematically measure and address these mapping biases. This post will walk you through what the tool does, why it matters, and how you can use it for your own projects.</p>

<h2 id="the-problem-when-conversions-confuse-mappers">The Problem: When conversions confuse mappers</h2>

<p>Nucleotide conversion (NC) RNA-seq encompasses a broad range of techniques, each introducing specific types of base changes, for instance:</p>

<ul>
  <li><strong>Metabolic labeling</strong> (e.g., SLAM-seq with 4-thiouridine): Introduces T-to-C conversions at low rates (1-5%) to distinguish newly synthesized RNA from pre-existing transcripts</li>
  <li><strong>RNA bisulfite sequencing</strong>: Creates C-to-T conversions at very high rates (&gt;98%) to identify methylated cytosines that resist conversion</li>
</ul>

<p>These mismatches to the reference genome make read mapping more challenging. The question we wanted to address is: how much does this affect our downstream biological interpretations?</p>

<p><strong>Table 1: Overview of Nucleotide Conversion RNA-seq Techniques</strong></p>

<table>
  <thead>
    <tr>
      <th>Technique</th>
      <th>Conversion Type</th>
      <th>Conversion Rate</th>
      <th>Key Metric</th>
      <th>Biological Application</th>
      <th>Example Use Case</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Metabolic Labeling</strong> (SLAM-seq, TUC-seq)</td>
      <td>T → C</td>
      <td>Low (1-5%)</td>
      <td><strong>FCR</strong> (Fraction Converted Reads)</td>
      <td>RNA synthesis, processing, decay kinetics</td>
      <td>Measuring RNA half-lives in pulse-chase experiments</td>
    </tr>
    <tr>
      <td><strong>RNA Bisulfite Sequencing</strong></td>
      <td>C → T</td>
      <td>Very High (&gt;98%)</td>
      <td><strong>metR</strong> (Methylation Rate)</td>
      <td>Post-transcriptional cytosine methylation (m5C)</td>
      <td>Identifying methylated cytosines in cellular transcripts</td>
    </tr>
    <tr>
      <td><strong>Isoform Analysis</strong> (any NC technique)</td>
      <td>Variable</td>
      <td>Variable</td>
      <td><strong>FMAT</strong> (Fraction Mature)</td>
      <td>Alternative splicing, intron retention</td>
      <td>Comparing spliced vs unspliced isoform abundances</td>
    </tr>
  </tbody>
</table>

<p><em>Table 1: Different NC RNA-seq approaches introduce specific types and rates of nucleotide conversions, each with distinct metrics and biological applications. The challenge: all of them introduce mismatches that can bias read mapping.</em></p>

<p>Consider a metabolic labeling pulse-chase experiment where you’re measuring RNA half-lives. If converted reads (labeled RNA) map with lower accuracy than unconverted reads (unlabeled RNA), your fraction of converted reads (FCR) will be biased, leading to incorrect half-life estimates. For genes involved in critical regulatory pathways, this isn’t just a technical nuisance - it leads to wrong biological conclusions.</p>
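<p>To make this concrete, here is a minimal sketch (hypothetical numbers, not from the paper) of how a two-time-point chase estimate of half-life reacts to losses of converted reads:</p>

```python
import math

def half_life_from_fcr(fcr_t0, fcr_t1, dt):
    """Half-life from exponential decay of the fraction of converted reads (FCR)."""
    k = math.log(fcr_t0 / fcr_t1) / dt  # decay rate from two chase time points
    return math.log(2) / k

true_t12 = half_life_from_fcr(0.80, 0.40, dt=4.0)  # constructed so t1/2 = 4 h

# A uniform relative loss of converted reads scales both FCR values equally
# and cancels out in the ratio ...
biased_rel = half_life_from_fcr(0.80 * 0.8, 0.40 * 0.8, dt=4.0)

# ... but an absolute loss (e.g. converted reads dropped only in hard-to-map
# regions) distorts the ratio and hence the fitted half-life:
biased_abs = half_life_from_fcr(0.80 - 0.10, 0.40 - 0.10, dt=4.0)
```

<p>In this toy example the absolute loss shrinks the estimated half-life from 4 h to roughly 3.3 h - exactly the kind of distortion splice_sim quantifies per transcript against a known ground truth.</p>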

<h2 id="enter-splice_sim">Enter splice_sim</h2>

<p>Splice_sim is a Python-based simulation and evaluation pipeline that tackles this problem head-on. It’s built around three core capabilities:</p>

<div style="text-align: center;">
  <img src="https://t-neumann.github.io/assets/images/posts/splice_sim/splice_sim-graphical_abstract.png" alt="Splice_sim workflow" style="width: 80%; max-width: 100%;" />
</div>
<p><em>Figure 1: The splice_sim analysis workflow. The pipeline simulates reads with realistic sequencing errors for premature and mature isoforms, injects nucleotide conversions at configured rates, maps reads with evaluated mappers, and compares results to ground truth to calculate TP/FP/FN counts per genomic annotation. (Figure 1A from the paper)</em></p>

<h4 id="realistic-rna-seq-simulation">Realistic RNA-seq simulation</h4>

<p>The framework simulates RNA-seq reads that mirror real experimental conditions:</p>

<ul>
  <li>Uses <a href="https://doi.org/10.1093/bioinformatics/btr708">ART</a> for realistic Illumina sequencing errors</li>
  <li>Generates reads from configurable isoform mixtures (premature unspliced and mature spliced transcripts)</li>
  <li>Introduces nucleotide conversions at user-defined rates</li>
  <li>Supports arbitrary single nucleotide variations (SNVs)</li>
</ul>

<h4 id="comprehensive-evaluation">Comprehensive evaluation</h4>

<p>Unlike generic mappability scores that operate at the genome-wide level, splice_sim evaluates mapping accuracy for biologically meaningful units:</p>

<ul>
  <li><strong>Whole transcripts</strong> - overall gene-level performance</li>
  <li><strong>Exons and introns</strong> - feature-specific accuracy</li>
  <li><strong>Splice junctions</strong> - the most challenging mapping scenario</li>
</ul>

<p>For each annotation, it calculates true positives (TP), false positives (FP), and false negatives (FN), deriving precision, recall, and F1 scores.</p>
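<p>The derived metrics follow the standard definitions; a minimal helper (not part of splice_sim itself) could look like:</p>

```python
def mapping_metrics(tp, fp, fn):
    """Precision, recall and F1 from per-annotation TP/FP/FN read counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# e.g. an exon with 980 correctly placed reads, 20 misplaced onto it, 20 lost
p, r, f1 = mapping_metrics(tp=980, fp=20, fn=20)
```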

<h4 id="nextflow-workflow">Nextflow workflow</h4>

<p>The entire pipeline is wrapped in <a href="https://www.nextflow.io/">Nextflow</a> workflows with Docker containerization, making it reproducible and easy to deploy across different computing environments - from your laptop to HPC clusters to the cloud.</p>

<h2 id="key-contributions-of-our-study">Key contributions of our study</h2>

<p>We used splice_sim to generate deep simulated datasets for mouse and human transcriptomes, evaluating popular splice-aware mappers (STAR and HISAT-3N) under various conversion rates. Here are the headline findings:</p>

<blockquote>
  <p><strong>Notable metrics</strong>
📊 <strong>High mappability regions</strong>: F1 &gt; 0.98 for both mappers
⚠️ <strong>Low mappability regions</strong>: F1 &lt; 0.55 - substantial accuracy drop
🧬 <strong>Impact on biology</strong>: &gt;120 protein-coding genes showed &gt;10% error in half-life estimates
🎯 <strong>Mosaic improvement</strong>: Reduced outliers from 239 to 95 in half-life analysis
🔬 <strong>Methylation sites</strong>: 99.2% recall but considerable FP calls in low mappability regions</p>
</blockquote>

<h3 id="mappability-dominates-but-conversions-matter">Mappability dominates, but conversions matter</h3>

<p>Mapping accuracies with and without nucleotide conversions were high (F1 &gt; 0.98) for annotations with high or medium genome mappability, but substantially lower (F1 &lt; 0.55) for low mappability regions. This isn’t surprising - repetitive sequences are inherently difficult to map.</p>

<div style="text-align: center;">
  <img src="https://t-neumann.github.io/assets/images/posts/splice_sim/mappability_distribution.png" alt="Mappability distribution" style="width: 60%; max-width: 100%;" />
</div>
<p><em>Figure 2: Distribution of analyzed annotations across high (&gt;0.9), medium, and low (&lt;0.2) mean genome mappability categories. Low mappability regions, while smaller in number, are critical for understanding mapping biases. (Figure 1B from the paper)</em></p>

<p>Also unsurprisingly, nucleotide conversions and sequencing errors increased false discovery and false negative rates for both mappers, since read alignment becomes harder as the number of mismatches to the originating genomic sequence grows.</p>

<div style="text-align: center;">
  <img src="https://t-neumann.github.io/assets/images/posts/splice_sim/fdr_fnr_by_mismatches.png" alt="FDR and FNR by number of mismatches" style="width: 90%; max-width: 100%;" />
</div>
<p><em>Figure 3: Changes in false discovery (FDR) and false negative rates (FNR) by number of mismatches compared to reads without mismatches, stratified by mappability. Note how HISAT-3N (orange) is largely unaffected by T-to-C conversions due to its 3N mapping approach, while STAR (green) shows increasing error rates with more mismatches. (Figure 1C from the paper)</em></p>

<p>More notable, though still expected: STAR’s performance degraded with increasing conversion rates, whereas HISAT-3N (a 3-nucleotide mapper that treats T and C as equivalent) was largely unaffected by T-to-C conversions but showed quirks of its own, including ~2% of reads mapping to the wrong strand.</p>

<h2 id="the-mosaic-approach-leveraging-mapper-strengths">The Mosaic approach: Leveraging mapper strengths</h2>

<p>One of the most practical contributions of our study is the “mosaic” analysis strategy. Rather than declaring one mapper superior, it builds on the observation that different mappers excel in different genomic contexts.</p>

<p>The mosaic approach works like this:</p>

<ol>
  <li>Map your data with multiple mappers (e.g., STAR and HISAT-3N)</li>
  <li>For each genomic interval, select the result from the mapper with the best accuracy</li>
  <li>Optionally filter intervals where no mapper performs adequately</li>
</ol>
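<p>Step 2 can be sketched as follows (hypothetical per-interval F1 scores - splice_sim’s pre-computed tables provide the real numbers):</p>

```python
# Per-interval F1 scores for each mapper (hypothetical values)
f1_scores = {
    "intron_1": {"STAR": 0.99, "HISAT3N": 0.97},
    "intron_2": {"STAR": 0.40, "HISAT3N": 0.85},
    "intron_3": {"STAR": 0.30, "HISAT3N": 0.35},
}

MIN_F1 = 0.5  # optional filter: drop intervals no mapper handles well

mosaic = {}
for interval, scores in f1_scores.items():
    best_mapper = max(scores, key=scores.get)
    if scores[best_mapper] >= MIN_F1:
        mosaic[interval] = best_mapper  # take this mapper's result here

# mosaic -> {"intron_1": "STAR", "intron_2": "HISAT3N"}; intron_3 is filtered out
```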

<div style="text-align: center;">
  <img src="https://t-neumann.github.io/assets/images/posts/splice_sim/mosaic_diagram.png" alt="Mosaic flow diagram" style="width: 100%; max-width: 100%;" />
</div>

<p>When combining a mosaic approach with a filtering strategy that removed transcripts for which none of the mappers returned results close to the simulation, the overall mean FCR approached the simulated true value.</p>

<p>This strategy improved FCR reconstruction, reduced half-life outliers, and recovered more accurate FMAT values. It does require running multiple alignments, but the accuracy gains can be substantial for critical analyses.</p>

<div style="text-align: center;">
  <img src="https://t-neumann.github.io/assets/images/posts/splice_sim/fcr_mosaic.png" alt="FCR reconstruction with mosaic approach" style="width: 70%; max-width: 100%;" />
</div>
<p><em>Figure 4: Mean difference to simulated exonic FCR per mapper. The mosaic approach (selecting the best mapper per interval) reduces differences to simulated values, and when combined with filtering, reconstruction is nearly perfect. (Figure 1E from the paper)</em></p>

<h3 id="biological-consequences-half-life-estimation">Biological consequences: Half-life estimation</h3>

<p>We simulated realistic pulse-chase experiments to measure RNA decay rates. Although half-life estimation was robust for most transcripts, a considerable number of outliers showed more than 10% difference to simulated half-lives for both mappers in the medium and low mappability segments. These outliers affected over 120 protein-coding genes - genes that researchers might be studying for their biological relevance.</p>

<div style="text-align: center;">
  <img src="https://t-neumann.github.io/assets/images/posts/splice_sim/halflife_reconstruction.png" alt="Half-life reconstruction" style="width: 100%; max-width: 100%;" />
</div>
<p><em>Figure 5: Effect of NC on transcript half-life reconstruction. Top left: Normalized FCR per time point showing increasing noise with decreasing mappability. Top right: Reconstructed half-lives per decay rate.
The box plots show a considerable number of outliers for both mappers; numbers of considered transcripts
are plotted below the boxes. Bottom: Reconstructed half-lives showing outliers (red triangles) that deviate &gt;10% from simulated values. The mosaic approach (bottom) combining both mappers reduces outliers considerably. (Figure 2 from the paper)</em></p>

<h3 id="isoform-quantification-challenges">Isoform quantification challenges</h3>

<p>Alignment of spliced reads is particularly difficult: it must account for potentially large gaps caused by spliced-out introns, and it requires accurate placement of the short read sub-sequences (anchors) that span these gaps.</p>

<p>We found that transcripts often contain a mosaic of high, medium, and low mappability introns. By implementing an intron filtering strategy that removes problematic introns, we improved the accuracy of mature isoform fraction (FMAT) estimates considerably.</p>
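<p>A toy example of the filtering effect (hypothetical counts; the real pipeline selects introns via its per-feature accuracy tables):</p>

```python
# Per-intron read counts supporting the mature (spliced) vs premature
# (intron-retaining) isoform, plus mappability class (hypothetical values)
introns = [
    {"mature": 90, "premature": 45, "mappability": "high"},
    {"mature": 80, "premature": 40, "mappability": "high"},
    {"mature": 10, "premature": 90, "mappability": "low"},  # mapping artifact
]

def fmat(records):
    """Fraction of the mature isoform from pooled per-intron counts."""
    mature = sum(r["mature"] for r in records)
    premature = sum(r["premature"] for r in records)
    return mature / (mature + premature)

fmat_all = fmat(introns)
fmat_filtered = fmat([r for r in introns if r["mappability"] != "low"])
```

<p>Here the single low-mappability intron drags the pooled FMAT down to ~0.51, while excluding it recovers the 2/3 supported by the trustworthy introns.</p>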

<div style="text-align: center;">
  <img src="https://t-neumann.github.io/assets/images/posts/splice_sim/fmat_reconstruction.png" alt="FMAT reconstruction" style="width: 100%; max-width: 100%;" />
</div>
<p><em>Figure 6: FMAT (fraction of mature isoform) reconstruction improves with intron filtering. Left: Median difference to simulated FMAT showing improvement with filtering (solid vs dashed lines). Right: Distribution of FMAT values for low mappability transcripts shows that both filtering and the mosaic approach recover more accurate estimates closer to the theoretical value of 1/3. (Figure 3A and 3D from the paper)</em></p>

<h3 id="methylation-site-calling-hotspots">Methylation site calling hotspots</h3>

<p>For RNA bisulfite sequencing analysis, low mappability regions are hotspots of false cytosine methylation calls. All mappers produced considerable amounts of false positive and a few false negative m5C calls, mainly in low mappability regions of protein coding genes.</p>

<div style="text-align: center;">
  <img src="https://t-neumann.github.io/assets/images/posts/splice_sim/m5c_calling.png" alt="Methylation site calling accuracy" style="width: 100%; max-width: 100%;" />
</div>
<p><em>Figure 7: Effect of mappability on methylation site reconstruction. Most m5C sites were recovered correctly, but false positives (FP) and false negatives (FN) were predominantly located in low mappability regions. The correlation plots show methylation rates for HISAT-3N, meRanGs, and Segemehl with TP calls in green and false calls in red. (Figure 4 from the paper)</em></p>

<p>This has important implications for establishing accurate m5C site catalogs, which have varied wildly in the literature (from &lt;100 to &gt;10,000 sites reported).</p>

<h3 id="pre-computed-mapping-accuracy-tables-ready-to-use-resources">Pre-computed mapping accuracy tables: Ready-to-use resources</h3>

<p>One of the most immediately useful outputs from our study is a comprehensive set of pre-computed mapping accuracy tables for mouse (mm10) and human (GRCh38) transcriptomes, covering over 50,000 transcripts each. These tables are freely available on <a href="https://doi.org/10.5281/zenodo.11196570">Zenodo</a> and provide detailed performance metrics for every annotated transcript, exon, intron, and splice junction.</p>

<p><strong>What’s included in these tables:</strong></p>

<ul>
  <li><strong>Mapping accuracy scores</strong> (F1, precision, recall) for STAR and HISAT-3N across different conversion rates (0%, 1%, 3%, 5%, 10%)</li>
  <li><strong>Mappability classifications</strong> (high, medium, low) for each genomic feature</li>
  <li><strong>FCR and FMAT reconstruction accuracy</strong> per feature</li>
  <li><strong>Recommendations</strong> for which mapper performs best for each transcript</li>
</ul>

<p><strong>How to use them:</strong></p>

<ol>
  <li><strong>Before starting an experiment</strong>: Look up your genes of interest to assess whether they have sufficient mappability for reliable NC RNA-seq analysis</li>
  <li><strong>During data analysis</strong>: Filter out problematic transcripts/introns with known mapping issues</li>
  <li><strong>Choosing a mapper</strong>: Select the optimal mapper (or apply the mosaic strategy) based on pre-computed performance for your specific genes</li>
  <li><strong>Quality control</strong>: Flag results from low-accuracy regions for additional validation</li>
</ol>
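<p>Concretely, such a lookup could look like this in pandas (the column names and values below are purely illustrative - check the table headers on Zenodo for the actual schema):</p>

```python
import io
import pandas as pd

# In practice: acc = pd.read_csv("tx.accuracy.tsv.gz", sep="\t")
# Tiny inline stand-in with illustrative column names and values:
tsv = """gene_name\tmappability\tF1_STAR_5pct\tF1_HISAT3N_5pct
Actb\thigh\t0.99\t0.98
Gapdh\tmedium\t0.95\t0.96
GeneX\tlow\t0.41\t0.52
"""
acc = pd.read_csv(io.StringIO(tsv), sep="\t")

# Keep only transcripts that at least one mapper handles reliably at 5% conversion
f1_cols = ["F1_STAR_5pct", "F1_HISAT3N_5pct"]
reliable = acc[acc[f1_cols].max(axis=1) >= 0.9]

# Per gene, which mapper performs best (the mosaic choice)
best_mapper = reliable.set_index("gene_name")[f1_cols].idxmax(axis=1)
```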

<blockquote>
  <p><strong>📊 Ready-to-Use Data</strong></p>

  <ul>
    <li><strong>Mouse (mm10)</strong>: &gt;50,000 GENCODE transcripts evaluated</li>
    <li><strong>Human (GRCh38)</strong>: &gt;50,000 Ensembl canonical transcripts evaluated</li>
    <li><strong>Conditions tested</strong>: 0%, 1%, 3%, 5%, 10% T-to-C conversion rates</li>
    <li><strong>Mappers evaluated</strong>: STAR, HISAT-3N (metabolic labeling) + meRanGs, Segemehl (BS-seq)</li>
    <li><strong>Format</strong>: TSV.gz tables ready to import into R or Python</li>
  </ul>

  <p><strong>Available at</strong>: <a href="https://doi.org/10.5281/zenodo.11196570">https://doi.org/10.5281/zenodo.11196570</a></p>
</blockquote>

<p><strong>Beyond mouse and human:</strong></p>

<p>The beauty of splice_sim is that it’s not limited to these two species. The framework is designed to work with any genome and annotation set. Want to evaluate NC mapping accuracy for:</p>

<ul>
  <li><strong>Zebrafish</strong> developmental studies?</li>
  <li><strong>Drosophila</strong> genetics experiments?</li>
  <li><strong>Arabidopsis</strong> plant biology?</li>
  <li><strong>Yeast</strong> time-course experiments?</li>
  <li><strong>Non-model organisms</strong> with custom annotations?</li>
</ul>

<p>Simply provide your reference genome FASTA, gene annotation GFF3, and run splice_sim with your experimental parameters. The pipeline will generate the same comprehensive mapping accuracy tables tailored to your organism and genes of interest.</p>

<p>This is particularly valuable for researchers working on:</p>
<ul>
  <li>Organisms with less well-characterized repetitive elements</li>
  <li>Custom or alternative gene annotations</li>
  <li>Novel transcripts or isoforms not in standard databases</li>
  <li>Organisms with different genome complexity and mappability landscapes</li>
</ul>

<p>The paper provides detailed protocols and configuration examples for running splice_sim on arbitrary species, making it a truly universal tool for the NC RNA-seq community.</p>

<h2 id="practical-applications-what-can-you-do-with-splice_sim">Practical applications: What can you do with splice_sim?</h2>

<h3 id="1-pre-experiment-planning">1. Pre-experiment planning</h3>

<p>Before committing to expensive sequencing, simulate your experimental design:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Configure your experiment in JSON
cat &gt; my_config.json &lt;&lt;'EOF'
{
  "condition": {
    "ref": "T",
    "alt": "C",
    "conversion_rates": [0.01, 0.03, 0.05],
    "base_coverage": 50
  },
  "transcript_ids": "genes_of_interest.tsv"
}
EOF

# Run the simulation
nextflow run splice_sim.nf -profile docker -c my_config.json
</code></pre></div></div>

<p>This tells you whether your genes of interest have sufficient mappability for reliable analysis, or if you need to adjust your approach (longer reads, different protocol, etc.).</p>

<h3 id="2-benchmark-your-pipeline">2. Benchmark your pipeline</h3>

<p>Evaluating a new mapper or tweaking parameters? Run it against splice_sim-generated ground truth data by adding your custom mapper to the configuration:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"mappers": {
  "MY_MAPPER": {
    "cmd": "my_mapper",
    "index": "/path/to/index",
    "options": "--custom-parameters"
  }
}
</code></pre></div></div>

<p>The evaluation module will give you detailed per-feature accuracy metrics.</p>

<h3 id="3-data-quality-control">3. Data quality control</h3>

<p>Use our pre-computed mapping accuracy tables for mouse and human transcriptomes (available on <a href="https://doi.org/10.5281/zenodo.11196570">Zenodo</a>) to:</p>

<ul>
  <li>Filter transcripts/exons/introns with poor mapping accuracy</li>
  <li>Identify which mapper works best for your genes of interest</li>
  <li>Apply targeted filtering strategies to improve downstream analyses</li>
</ul>

<h3 id="4-comparing-sequencing-strategies">4. Comparing sequencing strategies</h3>

<p>We used splice_sim to compare full-transcript sequencing with 3’ end sequencing. Simulated 3’ end sequencing data showed higher mapping accuracies across all conversion rates and mappability classes, owing to the higher mappability of 3’ ends; yet full-length sequencing showed the smallest deviation from the simulated FCR, implying that the larger mapping space of the full transcript allows for more robust FCR estimates.</p>

<p>You can set up similar comparisons for your specific use case.</p>

<h2 id="getting-started-with-splice_sim">Getting started with splice_sim</h2>

<p>The tool is available on <a href="https://github.com/popitsch/splice_sim">GitHub</a> with comprehensive documentation. Here’s a quick-start guide:</p>

<h3 id="installation">Installation</h3>

<p>The easiest way is via Docker - all dependencies are pre-packaged:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Pull the Docker image</span>
docker pull tobneu/splice_sim:release

<span class="c"># Or for HPC environments with Singularity</span>
singularity pull docker://tobneu/splice_sim:release
</code></pre></div></div>

<p>For local development, create a conda environment:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>conda <span class="nb">env </span>create <span class="nt">-f</span> environment.yml
conda activate splice_sim
</code></pre></div></div>

<h3 id="running-a-simple-simulation">Running a simple simulation</h3>

<p>The framework uses a central JSON configuration file to define all parameters. Here’s a minimal example for a small test:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"dataset_name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"my_test_run"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"splice_sim_cmd"</span><span class="p">:</span><span class="w"> </span><span class="s2">"python /path/to/splice_sim/main.py"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"gene_gff"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/references/gencode.vM21.gff3.gz"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"genome_fa"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/references/mm10.fa"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"genome_chromosome_sizes"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/references/mm10.chrom.sizes"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"transcript_ids"</span><span class="p">:</span><span class="w"> </span><span class="s2">"test_genes.tsv"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"isoform_mode"</span><span class="p">:</span><span class="w"> </span><span class="s2">"1:1"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"condition"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"ref"</span><span class="p">:</span><span class="w"> </span><span class="s2">"T"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"alt"</span><span class="p">:</span><span class="w"> </span><span class="s2">"C"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"conversion_rates"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="mf">0.02</span><span class="p">,</span><span class="w"> </span><span class="mf">0.05</span><span class="p">],</span><span class="w">
    </span><span class="nl">"base_coverage"</span><span class="p">:</span><span class="w"> </span><span class="mi">50</span><span class="w">
  </span><span class="p">},</span><span class="w">
  </span><span class="nl">"mappers"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"STAR"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"star_cmd"</span><span class="p">:</span><span class="w"> </span><span class="s2">"STAR"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"star_genome_idx"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/indices/star_index"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"star_splice_gtf"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/references/genes.gtf"</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">},</span><span class="w">
  </span><span class="nl">"readlen"</span><span class="p">:</span><span class="w"> </span><span class="mi">100</span><span class="p">,</span><span class="w">
  </span><span class="nl">"random_seed"</span><span class="p">:</span><span class="w"> </span><span class="mi">42</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>Run the complete workflow:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Simulation workflow</span>
nextflow run splice_sim.nf <span class="nt">-c</span> my_config.json <span class="nt">-profile</span> docker

<span class="c"># Evaluation workflow  </span>
nextflow run splice_sim_eva.nf <span class="nt">-c</span> my_config.json <span class="nt">-profile</span> docker
</code></pre></div></div>

<h3 id="understanding-the-output">Understanding the output</h3>

<p>Splice_sim generates several types of output:</p>

<p><strong>Count tables</strong> (<code class="language-plaintext highlighter-rouge">count/*.counts.tsv.gz</code>): Raw TP/FP/FN counts per mapper, conversion rate, and genomic feature</p>

<p><strong>Metadata tables</strong> (<code class="language-plaintext highlighter-rouge">meta/*.metadata.tsv.gz</code>): Feature characteristics including mappability, GC content, convertibility</p>

<p><strong>Performance metrics</strong>: Precision, recall, and F1 scores calculated from count tables</p>

<p><strong>Visual tracks</strong>: Optional TDF files for viewing in IGV, highlighting misaligned reads</p>
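<p>A typical first analysis step is to join the count tables with the metadata tables and aggregate, for example (illustrative column names and values - see the shipped tables for the actual headers):</p>

```python
import pandas as pd

# Illustrative stand-ins for count/*.counts.tsv.gz and meta/*.metadata.tsv.gz
counts = pd.DataFrame({
    "tid": ["tx1", "tx2", "tx3"],
    "TP": [980, 300, 950],
    "FP": [20, 150, 30],
    "FN": [20, 250, 50],
})
meta = pd.DataFrame({
    "tid": ["tx1", "tx2", "tx3"],
    "mappability": ["high", "low", "high"],
})

df = counts.merge(meta, on="tid")
df["F1"] = 2 * df["TP"] / (2 * df["TP"] + df["FP"] + df["FN"])
f1_by_mapp = df.groupby("mappability")["F1"].mean()
```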

<h4 id="output-directory-structure">Output directory structure</h4>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>results/
├── simulation/
│   ├── model/
│   │   ├── transcript_model.pkl          # Transcript isoform definitions
│   │   ├── sequences.fa                  # Simulated transcript sequences
│   │   └── model_config.json             # Model parameters
│   │
│   ├── reads/
│   │   ├── condition_0pct/               # Unconverted reads
│   │   │   ├── simulated_reads.fq.gz
│   │   │   └── truth_alignments.bam
│   │   ├── condition_2pct/               # 2% conversion rate
│   │   │   ├── simulated_reads.fq.gz
│   │   │   └── truth_alignments.bam
│   │   └── condition_5pct/               # 5% conversion rate
│   │       ├── simulated_reads.fq.gz
│   │       └── truth_alignments.bam
│   │
│   └── alignments/
│       ├── STAR/
│       │   ├── condition_0pct.bam
│       │   ├── condition_2pct.bam
│       │   └── condition_5pct.bam
│       └── HISAT3N/
│           ├── condition_0pct.bam
│           ├── condition_2pct.bam
│           └── condition_5pct.bam
│
├── evaluation/
│   ├── count/
│   │   ├── STAR_0pct_tx.counts.tsv.gz    # Transcript-level counts
│   │   ├── STAR_0pct_fx.counts.tsv.gz    # Exon/intron counts
│   │   ├── STAR_0pct_sj.counts.tsv.gz    # Splice junction counts
│   │   ├── HISAT3N_0pct_tx.counts.tsv.gz
│   │   └── ...
│   │
│   ├── meta/
│   │   ├── tx.metadata.tsv.gz            # Transcript metadata
│   │   ├── fx.metadata.tsv.gz            # Exon/intron metadata
│   │   └── sj.metadata.tsv.gz            # Splice junction metadata
│   │
│   ├── tracks/                            # Optional IGV visualization
│   │   ├── STAR_0pct_FP.tdf              # False positive tracks
│   │   ├── STAR_0pct_FN.tdf              # False negative tracks
│   │   └── ...
│   │
│   └── processed/
│       ├── combined_results.rds           # R data object
│       └── summary_stats.tsv              # Summary statistics
│
├── reports/
│   ├── execution_report.html              # Nextflow execution report
│   ├── timeline.html                      # Pipeline timeline
│   └── trace.txt                          # Resource usage
│
└── logs/
    ├── simulation.log
    ├── mapping.log
    └── evaluation.log
</code></pre></div></div>

<p><strong>Key files to focus on:</strong></p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">count/*.counts.tsv.gz</code> - TP/FP/FN counts for calculating accuracy metrics</li>
  <li><code class="language-plaintext highlighter-rouge">meta/*.metadata.tsv.gz</code> - Mappability, GC content, and other feature characteristics</li>
  <li><code class="language-plaintext highlighter-rouge">processed/combined_results.rds</code> - Pre-processed data ready for analysis in R</li>
  <li><code class="language-plaintext highlighter-rouge">tracks/*.tdf</code> - Visual tracks for exploring specific mapping errors in IGV</li>
</ul>
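<p>To turn those raw counts into the performance metrics mentioned above, precision, recall and F1 can be computed directly from the TP/FP/FN values. A minimal Python sketch - the toy numbers and the assumption of one TP/FP/FN triple per row are illustrative, not the exact splice_sim table schema:</p>

```python
# Sketch: derive precision/recall/F1 from TP/FP/FN counts as found in the
# count/*.counts.tsv.gz tables. Column layout is assumed for illustration.
def metrics(tp, fp, fn):
    """Precision, recall and F1 from raw counts (0.0 where undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy numbers standing in for one row of a counts table
p, r, f1 = metrics(950, 30, 50)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```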

<p>The evaluation results can be imported into R for detailed analysis using the provided preprocessing script:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Rscript splice_sim/src/main/R/splice_sim/preprocess_results.R <span class="se">\</span>
  my_config.json output_dir/
</code></pre></div></div>

<h2 id="advanced-use-cases">Advanced use cases</h2>

<h3 id="custom-nucleotide-conversion-models">Custom nucleotide conversion models</h3>

<p>While splice_sim uses Bernoulli processes for NC simulation (appropriate for BS-seq and SLAM-seq), you might need more sophisticated models. For example, A-to-I RNA editing by ADAR enzymes shows sequence-context dependencies.</p>

<p>The NC simulation is implemented as a Python method with access to:</p>

<ul>
  <li>Read sequence (without NC)</li>
  <li>Genomic coordinates and strand</li>
  <li>Configured reference/alternate bases</li>
  <li>Conversion rate</li>
  <li>List of convertible positions</li>
  <li>Any configured SNPs</li>
</ul>

<p>Simply modify the <code class="language-plaintext highlighter-rouge">splice_sim.simulator.modify_bases</code> method to implement your custom conversion logic.</p>
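<p>As an illustration, here is a sketch of what a sequence-context-dependent conversion function could look like. Note that this is <em>not</em> the actual <code class="language-plaintext highlighter-rouge">modify_bases</code> signature, and the context rule (boosting the rate after a 5’ T/U neighbor) is a deliberately simplified stand-in for real ADAR sequence preferences:</p>

```python
import random

# Hypothetical sketch of context-dependent A-to-I editing (read out as A-to-G).
# NOT the real splice_sim.simulator.modify_bases interface - only the kind of
# logic you could implement inside it.
def modify_bases_context(seq, conversion_rate, ref="A", alt="G", rng=None):
    """Convert ref->alt, boosting the rate when the 5' neighbor is T/U
    (a simplified stand-in for ADAR neighbor preferences)."""
    rng = rng or random.Random(42)  # fixed seed for reproducible simulation
    out = list(seq)
    for i, base in enumerate(seq):
        if base != ref:
            continue
        rate = conversion_rate
        if i > 0 and seq[i - 1] in "TU":  # boost for a 5' T/U neighbor
            rate = min(1.0, rate * 3)
        if rng.random() < rate:
            out[i] = alt
    return "".join(out)

print(modify_bases_context("TTAAGACATA", 0.2))
```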

<h3 id="paired-end-support">Paired-end support</h3>

<p>The current version focuses on single-end reads, which are cost-effective for many experiments. However, paired-end data offers advantages:</p>

<ul>
  <li>Improved mappability from both mates</li>
  <li>Error correction in overlapping regions</li>
</ul>

<p>Paired-end support is on the TODO list.</p>

<h3 id="integrating-with-aws-batch">Integrating with AWS batch</h3>

<p>Since splice_sim uses Nextflow, you can easily scale to cloud computing. Here’s a basic AWS Batch configuration:</p>

<div class="language-groovy highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">profiles</span> <span class="o">{</span>
    <span class="n">awsbatch</span> <span class="o">{</span>
        <span class="n">process</span><span class="o">.</span><span class="na">executor</span> <span class="o">=</span> <span class="s1">'awsbatch'</span>
        <span class="n">process</span><span class="o">.</span><span class="na">queue</span> <span class="o">=</span> <span class="s1">'my-batch-queue'</span>
        <span class="n">process</span><span class="o">.</span><span class="na">container</span> <span class="o">=</span> <span class="s1">'tobneu/splice_sim:release'</span>
        <span class="n">workDir</span> <span class="o">=</span> <span class="s1">'s3://my-bucket/work'</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p>This is particularly useful when simulating large datasets (full transcriptomes) or evaluating multiple mappers across many parameter combinations.</p>

<h2 id="recommended-best-practices">Recommended best practices</h2>

<p>Based on our findings, here are practical recommendations for NC RNA-seq analysis:</p>

<h3 id="for-metabolic-labeling-experiments">For metabolic labeling experiments</h3>

<ol>
  <li><strong>Always check mappability</strong>: Use our pre-computed tables or run splice_sim on your genes of interest</li>
  <li><strong>Consider a mosaic approach</strong>: If computationally feasible, map with both STAR and HISAT-3N and combine results</li>
  <li><strong>Filter carefully</strong>: Remove low-mappability transcripts or apply targeted filtering strategies</li>
  <li><strong>Validate biological findings</strong>: If a half-life estimate seems suspicious, check the mappability and conversion rate dependency</li>
</ol>

<h3 id="for-rna-bisulfite-sequencing">For RNA bisulfite sequencing</h3>

<ol>
  <li><strong>Use specialized mappers</strong>: HISAT-3N, meRanGs, or Segemehl perform better than standard mappers</li>
  <li><strong>Be extra cautious with low mappability regions</strong>: These are hotspots for false m5C calls</li>
  <li><strong>Cross-validate calls</strong>: Don’t rely solely on methylation rates for filtering - consider mappability and structural context</li>
  <li><strong>Expect protocol artifacts</strong>: Incomplete conversion and missing reads in low mappability regions are inherent to the protocol</li>
</ol>

<h3 id="for-isoform-analysis">For isoform analysis</h3>

<ol>
  <li><strong>Provide known splice sites</strong>: This dramatically improves spliced read mapping accuracy</li>
  <li><strong>Filter problematic introns</strong>: Use splice_sim’s approach to remove introns with poor FMAT reconstruction</li>
  <li><strong>Check for intron mappability mosaics</strong>: Transcripts often mix high and low mappability introns</li>
  <li><strong>Validate novel splice junctions carefully</strong>: False-positive SJ calls increase with conversion rates</li>
</ol>

<h2 id="limitations-and-future-directions">Limitations and future directions</h2>

<p>While splice_sim is powerful, it has some limitations worth noting:</p>

<p><strong>Simulation-based</strong>: Results depend on stochastic processes, though our replicate analysis showed high correlation</p>

<p><strong>Worst-case FP scenarios</strong>: Our main dataset simulates equal coverage for all transcripts, potentially inflating false positives. For your cell type, configure transcript abundances from real RNA-seq data.</p>

<p><strong>Single-end only</strong>: Currently limited to single-end reads, though paired-end support is planned</p>

<p><strong>Bernoulli NC model</strong>: May not capture sequence-context dependencies in some NC types</p>

<p>Despite these limitations, splice_sim provides an invaluable framework for understanding and mitigating NC mapping biases.</p>

<h2 id="conclusion">Conclusion</h2>

<p>Nucleotide conversion RNA-seq techniques provide powerful insights into RNA biology, but their mapping biases can lead to substantial errors in biological interpretation. Splice_sim offers both a diagnostic tool to understand these biases and practical strategies to mitigate them.</p>

<p>The framework is:</p>

<ul>
  <li><strong>Comprehensive</strong>: Evaluates mapping accuracy at multiple biologically meaningful scales</li>
  <li><strong>Flexible</strong>: Configurable for diverse experimental designs and organisms</li>
  <li><strong>Actionable</strong>: Provides concrete strategies (mosaic approach, filtering) to improve analysis</li>
  <li><strong>Accessible</strong>: Wrapped in Nextflow with Docker containerization for easy deployment</li>
</ul>

<p>Whether you’re planning a new NC RNA-seq experiment, benchmarking analysis pipelines, or trying to understand unexpected results from existing data, splice_sim can help ensure your biological conclusions rest on solid computational ground.</p>

<h2 id="resources">Resources</h2>

<ul>
  <li><strong>Paper</strong>: <a href="https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03313-8">Genome Biology publication</a></li>
  <li><strong>Source Code</strong>: <a href="https://github.com/popitsch/splice_sim">GitHub repository</a></li>
  <li><strong>Docker Image</strong>: <a href="https://hub.docker.com/repository/docker/tobneu/splice_sim">Docker Hub</a></li>
  <li><strong>Precomputed Data</strong>: <a href="https://doi.org/10.5281/zenodo.11196570">Zenodo repository</a> with mapping accuracy tables for mouse and human</li>
</ul>

<p>Have questions or want to discuss NC RNA-seq mapping challenges? Feel free to open an issue on GitHub or reach out directly. Happy simulating!</p>]]></content><author><name>Tobias Neumann</name></author><category term="Bioinformatics" /><category term="Pipelines" /><category term="RNA-seq" /><category term="Nextflow" /><category term="Python" /><category term="Docker" /><category term="Genomics" /><category term="Benchmarking" /><summary type="html"><![CDATA[A comprehensive simulation framework for evaluating mapping accuracies in nucleotide conversion RNA-seq experiments]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://t-neumann.github.io/assets/images/categories/genomebiology.png" /><media:content medium="image" url="https://t-neumann.github.io/assets/images/categories/genomebiology.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Orbital maneuvers</title><link href="https://t-neumann.github.io/space/OrbitalManeuvers/" rel="alternate" type="text/html" title="Orbital maneuvers" /><published>2019-09-08T15:29:00+02:00</published><updated>2019-09-08T15:29:00+02:00</updated><id>https://t-neumann.github.io/space/OrbitalManeuvers</id><content type="html" xml:base="https://t-neumann.github.io/space/OrbitalManeuvers/"><![CDATA[<p>From my <a href="https://t-neumann.github.io/space/OrbitalBasics/">last post</a> you should have read up on the basics of orbits and orbital parameters. Now while this is interesting by itself, changing orbits and moving to different orbits in order to dock to space stations, escape to different celestial bodies or de-orbit onto a body’s surface - this is the stuff we are actually here for. That is why this post moves into orbital mechanics and some basic maneuvers for modifying orbits.</p>

<p>Orbital mechanics is a core discipline within space-mission design and control.
It focuses on spacecraft trajectories, including orbital maneuvers, orbital plane changes, and interplanetary transfers, and is used by mission planners to predict the results of propulsive maneuvers.</p>

<p>Now let’s pretend we have some well-funded space agency, can do anything we want, and do not have to fear killing our astronauts - if only there were some simulation for this. This is where KSP comes into play.</p>

<h2 id="vessel">Vessel</h2>

<p>We do not want to simply calculate orbits, we want an actual spaceship with propulsion systems in orbit so we can see the impact of our maneuvers live. For this purpose, I have already spent endless hours building a <i class="fab fa-github" aria-hidden="true"></i> <a href="https://github.com/t-neumann/ksp-garage">huge garage</a> of more or less efficient vessels for exploring the KSP universe.</p>

<p>For this particular post, I will be using my rather tiny <a href="https://en.wikipedia.org/wiki/Single-stage-to-orbit">SSTO</a> <em>SlickOrbiter</em>, consisting of 4 rapier engines, which are hybrid engines with both air-breathing and liquid fuel modes. These I complement with an Atomic Rocket Motor engine for space maneuvers with far lower thrust but much higher efficiency (\(I_{SP}\)). I will definitely dedicate a couple of posts to propulsion systems, staging modes etc. at a later time - for now, just take it as it is.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/slickorbiter.gif" alt="Slick orbiter" width="100%" /></p>

<h2 id="spacecraft-orientation">Spacecraft orientation</h2>

<p>Now before we perform any orbit maneuvers or burns, we need to agree on the different directions in which we can point our spacecraft and perform these burns. Naturally, since we are in 3-dimensional space, we have 3 axes along which we can orient ourselves, each axis having 2 directions.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/spacecraftorientation.png" alt="Spacecraft orientation" width="100%" /></p>

<h4 id="prograde-and-retrograde">Prograde and retrograde</h4>

<p>These vectors run along the axis in which direction the spacecraft is moving along its orbit.</p>

<h4 id="normal-and-anti-normal">Normal and anti-normal</h4>

<p>The normal vectors are perpendicular to the orbital plane.</p>

<h4 id="radial-in-and-radial-out">Radial in and radial out</h4>

<p>These vectors are parallel to the orbital plane, and perpendicular to the prograde vector. The radial (or radial-in) vector points inside the orbit, towards the focus of the orbit, while the anti-radial (or radial-out) vector points outside the orbit, away from the body.</p>

<h2 id="orbital-maneuvers">Orbital maneuvers</h2>

<p>Ok now it is time to make a couple of burns into these directions and see how it affects our orbital parameters. To this end we set up maneuver nodes with directional indicators as shown below.</p>

<figure class="single ">
  
    
      <img src="/assets/images/posts/Maneuvers/orbitorientation.png" alt="Orbit orientation" />
    
  
    
      <img src="/assets/images/posts/Maneuvers/directions.png" alt="Directional markers" />
    
  
  
    <figcaption>Orbital directions and directional markers.
</figcaption>
  
</figure>

<p>I will go into more detail and Math about energy efficiency for those individual maneuvers in a later post; for now, this should only give you a first glimpse and a general understanding of how to move around in space.</p>

<h4 id="prograde-and-retrograde-maneuvers">Prograde and retrograde maneuvers</h4>

<p>So we are at the apoapsis of our nearly circular orbit perfectly aligned with the equatorial plane (0 degrees inclination). Let’s see what happens if we burn into prograde direction.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/progradeburn.gif" alt="Prograde burn" width="50%" /></p>

<p>As we can see, the apoapsis moves to the opposite end of our now elliptic orbit and we raised the orbit’s altitude on the opposite side.</p>

<p>What if we do a retrograde burn?</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/retrogradeburn.gif" alt="Retrograde burn" width="50%" /></p>

<p>As we can see, the periapsis on the opposing side is lowered until we go suborbital, meaning the spacecraft will deorbit on its way to periapsis and either burn up in the atmosphere or crash on the planet (unless a proper landing procedure is initiated).</p>

<p>In summary, burning prograde will increase orbital velocity, raising the altitude of the orbit on the other side, while burning retrograde will decrease velocity and reduce the orbit altitude on the other side.</p>

<p>This is the most efficient way to change the orbital shape (specifically the most common case, raising or lowering apsides) so whenever possible these vectors should be used.</p>

<h4 id="normal-and-anti-normal-maneuvers">Normal and anti-normal maneuvers</h4>

<p>Again we are at the apoapsis of our nearly circular orbit perfectly aligned with the equatorial plane (0 degrees inclination). Let’s see what happens if we burn into normal direction.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/normalburn.gif" alt="Normal burn" width="50%" /></p>

<p>We see that the orbital inclination (the angle between the orbital and equatorial plane) changes.</p>

<p>These vectors are generally used to match the orbital inclination of another celestial body or craft, and the only time this is possible is when the current craft’s orbit intersects the orbital plane of the target - at the ascending and descending nodes. We will get to this in a second.</p>
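<p>The cost of such a plane change grows quickly with orbital velocity: for a pure inclination change of \(\Delta i\) at speed \(v\), the required velocity change is \(\Delta v = 2 v \sin(\Delta i / 2)\). A quick Python sketch - the 1800 m/s is just an illustrative low-orbit speed, not tied to any particular body:</p>

```python
import math

# Delta-v for a pure inclination change: dv = 2 * v * sin(di / 2).
def plane_change_dv(v, di_deg):
    """Velocity change needed to rotate the orbital plane by di_deg at speed v."""
    return 2 * v * math.sin(math.radians(di_deg) / 2)

# Illustrative numbers: plane changes get expensive fast
for di in (5, 15, 45):
    print(f"{di:>2} deg at 1800 m/s -> {plane_change_dv(1800, di):.0f} m/s")
```

<p>This is also why inclination changes are cheapest where the craft moves slowest, e.g. near apoapsis.</p>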

<h4 id="radial-in-and-radial-out-maneuvers">Radial in and radial out maneuvers</h4>

<p>One last time we are at the apoapsis of our nearly circular orbit perfectly aligned with the equatorial plane (0 degrees inclination). Let’s see what happens if we burn into the radial out direction.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/radialoutburn.gif" alt="Radial out burn" width="50%" /></p>

<p>We see that the orbit starts rotating around the craft, like spinning a hula hoop with a stick. Radial burns are usually not an efficient way of adjusting one’s path - it is generally more effective to use prograde and retrograde burns.</p>

<h2 id="orbital-insertion">Orbital insertion</h2>

<p>Now let’s combine the basic orbital maneuvers of the previous section.
If a sufficient change of the orbital parameters is achieved, such a maneuver is generally described as an <strong>orbit insertion</strong>, a general term for a maneuver that is more than a small correction. It may be used for a maneuver to change a transfer orbit or an ascent orbit into a stable one, but also to change a stable orbit into a descent. The term <strong>orbit injection</strong> is also used - which I find even cooler - especially for changing a stable orbit into a transfer orbit, e.g. trans-lunar injection (TLI), trans-Mars injection (TMI) and trans-Earth injection (TEI).</p>

<p>Stable orbits have been described in the <a href="https://t-neumann.github.io/space/OrbitalBasics/">previous post</a>, but now we want to specifically look at transfer orbits which enable us to put satellites into orbits, travel to the moon and Mars and all the fancy wonderous places in our solar system and beyond.</p>

<p>So what is a <strong>transfer orbit</strong>: In orbital mechanics a transfer orbit is an intermediate elliptical orbit that is used to move a satellite or other object from one circular, or largely circular, orbit to another.</p>

<p>There are several types of transfer orbits, which vary in their energy efficiency and speed of transfer and I will quickly go over the most famous ones.</p>

<p>Again, I will go into more detail and Math about energy efficiency for those transfer orbits in a later post; for now, this should only give you a first glimpse and a general understanding of how these orbital insertions work.</p>

<h3 id="hohmann-transfer">Hohmann transfer</h3>

<p>In orbital mechanics, the Hohmann transfer orbit is an elliptical orbit used to transfer between two circular orbits of different radii around the same body in the same plane. The Hohmann transfer orbit uses the lowest possible amount of energy in traveling between these orbits.</p>

<p>The term is also used to refer to transfer orbits between different bodies (planets, moons etc.).</p>

<p>A Hohmann transfer requires that the starting and destination points be at particular locations in their orbits relative to each other. Space missions using a Hohmann transfer must wait for this required alignment to occur, which opens a so-called launch window. For a space mission between Earth and Mars, for example, these launch windows occur every 26 months. A Hohmann transfer orbit also determines a fixed time required to travel between the starting and destination points; for an Earth-Mars journey this travel time is about 9 months.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/Hohmann_transfer_orbit.svg" alt="Hohmann transfer" width="50%" /></p>

<p>The image shows a Hohmann transfer orbit to bring a spacecraft from a lower circular orbit into a higher one. It is one half of an elliptic orbit that touches both the lower circular orbit the spacecraft wishes to leave (green and labeled 1 on diagram) and the higher circular orbit that it wishes to reach (red and labeled 3 on diagram). The transfer (yellow and labeled 2 on diagram) is initiated by firing the spacecraft’s engine to accelerate prograde so that it will follow the elliptical orbit. This adds energy to the spacecraft’s orbit. When the spacecraft has reached its destination orbit, its orbital speed (and hence its orbital energy) must be increased again to change the elliptic orbit to the larger circular one which is termed <em>circularization</em>.</p>

<p>Now let’s do this in KSP. To simplify everything, assume both our starting orbit and our target orbit are already circular. Let’s say we want to reach some space station orbiting Laythe at 250 km and our <em>SlickOrbiter</em> is in a stable orbit at 100 km.</p>

<p>The first thing we have to do is match orbit inclination which is best done by a normal burn at the ascending node.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/inclinationchange.gif" alt="Orbit inclination correction" width="50%" /></p>

<p>Now that our orbital planes are synchronized, we can start with our first prograde burn of the Hohmann transfer maneuver which is raising our apoapsis to the target orbit height, effectively transforming our circular orbit into an elliptic orbit.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/HohmannBurn1.gif" alt="Hohmann transfer apoapsis change" width="50%" /></p>

<p>Now once we have reached our transfer orbit’s apoapsis, we can circularize and match our target orbit by another prograde burn.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/HohmannBurn2.gif" alt="Hohmann transfer circularization" width="50%" /></p>

<p>There it is, we have performed our first Hohmann transfer.</p>
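<p>The two burns can also be estimated on paper with the vis-viva equation \(v^2 = \mu \left( \frac{2}{r} - \frac{1}{a} \right)\). The Python sketch below uses Laythe’s gravitational parameter and radius as I know them from KSP reference material - treat both values as assumptions to double-check:</p>

```python
import math

# Back-of-the-envelope delta-v for the Hohmann transfer above, via vis-viva.
MU = 1.962e12        # m^3/s^2, Laythe gravitational parameter (assumed KSP value)
R_LAYTHE = 500_000   # m, Laythe radius (assumed KSP value)

def vis_viva(r, a):
    """Orbital speed at radius r on an orbit with semi-major axis a."""
    return math.sqrt(MU * (2 / r - 1 / a))

r1 = R_LAYTHE + 100_000     # 100 km parking orbit
r2 = R_LAYTHE + 250_000     # 250 km target orbit
a_transfer = (r1 + r2) / 2  # semi-major axis of the transfer ellipse

dv1 = vis_viva(r1, a_transfer) - vis_viva(r1, r1)  # prograde burn: raise apoapsis
dv2 = vis_viva(r2, r2) - vis_viva(r2, a_transfer)  # prograde burn: circularize
print(f"burn 1: {dv1:.1f} m/s, burn 2: {dv2:.1f} m/s, total: {dv1 + dv2:.1f} m/s")
```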

<h3 id="bi-elliptic-transfer">Bi-elliptic transfer</h3>

<p>The bi-elliptic transfer consists of two half-elliptic orbits and may, in certain situations, require less energy than a Hohmann transfer maneuver.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/Bi-elliptic_transfer.svg" alt="Bi-elliptic transfer" width="50%" /></p>

<p>From the initial orbit, a first prograde burn (1) boosts the spacecraft into the first transfer orbit with an apoapsis at some point away from the central body. At this point a second prograde burn (2) sends the spacecraft into the second elliptical orbit with periapsis at the radius of the final desired orbit, where a third retrograde burn (3) is performed, injecting the spacecraft into the desired orbit.</p>

<p>While it requires one more engine burn than a Hohmann transfer and generally a greater travel time, a bi-elliptic transfer requires less energy than a Hohmann transfer when the ratio of final to initial semi-major axis is 11.94 or greater, depending on the intermediate semi-major axis chosen.</p>

<p>Now let’s do this in KSP. To simplify everything, assume both our starting orbit and our target orbit are already circular and our orbital inclinations are already matched. Again, we want to reach some space station orbiting Laythe at 250 km and our <em>SlickOrbiter</em> is in a stable orbit at 100 km.</p>

<p>We will first raise our apoapsis above the target orbit to create an elliptic orbit with a long prograde burn.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/Bi-elliptic_burn1.gif" alt="Bi-elliptic transfer apoapsis raise" width="50%" /></p>

<p>Now we wait until we have reached the new apoapsis for another prograde burn to raise our periapsis to the level of the target orbit.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/Bi-elliptic_burn2.gif" alt="Bi-elliptic transfer periapsis raise" width="50%" /></p>

<p>Finally, we perform a retrograde burn at the new periapsis to lower our apoapsis for <em>circularizing</em> our target orbit.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Maneuvers/Bi-elliptic_burn3.gif" alt="Bi-elliptic transfer circularization" width="50%" /></p>

<p>There it is, we have performed our first Bi-elliptic transfer.</p>
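<p>We can also compare the two transfer strategies numerically with the vis-viva equation. Since our ratio of final to initial orbit radius (750 km over 600 km from Laythe’s center) is far below 11.94, the Hohmann transfer should come out cheaper here. The Laythe parameters and the intermediate apoapsis in this sketch are assumptions for illustration:</p>

```python
import math

# Hohmann vs. bi-elliptic delta-v for the scenario above, via vis-viva.
MU = 1.962e12   # m^3/s^2, Laythe gravitational parameter (assumed KSP value)
R = 500_000     # m, Laythe radius (assumed KSP value)

def v(r, a):
    """Orbital speed at radius r on an orbit with semi-major axis a."""
    return math.sqrt(MU * (2 / r - 1 / a))

r1, r2 = R + 100_000, R + 250_000
rb = R + 1_000_000  # assumed intermediate apoapsis for the bi-elliptic transfer

hohmann = ((v(r1, (r1 + r2) / 2) - v(r1, r1))
           + (v(r2, r2) - v(r2, (r1 + r2) / 2)))
bi = ((v(r1, (r1 + rb) / 2) - v(r1, r1))              # burn 1: raise apoapsis to rb
      + (v(rb, (rb + r2) / 2) - v(rb, (r1 + rb) / 2)) # burn 2: raise periapsis to r2
      + (v(r2, (r2 + rb) / 2) - v(r2, r2)))           # burn 3: retrograde circularization
print(f"Hohmann: {hohmann:.1f} m/s, bi-elliptic: {bi:.1f} m/s")
```

<p>For small radius ratios like ours, the bi-elliptic route costs substantially more - which is why it only pays off for very large orbit changes.</p>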

<p>Now that you have a basic overview of spacecraft orientation, burns into those directions and their impact on the spacecraft’s orbit, as well as how to combine those maneuvers into orbit insertions, we have laid the foundation to dive deeper into energy efficiency of those maneuvers, the famous <em>delta-v</em> and the Rocket equation in a later post. Until then - godspeed.</p>]]></content><author><name>Tobias Neumann</name></author><category term="Space" /><category term="Orbits" /><category term="Orbital mechanics" /><category term="Orbital parameters" /><summary type="html"><![CDATA[Changing orbital parameters using propulsion systems.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://t-neumann.github.io/assets/images/categories/OOSS.jpg" /><media:content medium="image" url="https://t-neumann.github.io/assets/images/categories/OOSS.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Square numbers proof</title><link href="https://t-neumann.github.io/mathematics/SquareNumberZeros/" rel="alternate" type="text/html" title="Square numbers proof" /><published>2019-09-02T22:05:00+02:00</published><updated>2019-09-02T22:05:00+02:00</updated><id>https://t-neumann.github.io/mathematics/SquareNumberZeros</id><content type="html" xml:base="https://t-neumann.github.io/mathematics/SquareNumberZeros/"><![CDATA[<p>I recently signed up for the <a href="http://www.vds-molecules-of-life.org/index.php?id=1350">MFPL PhD Selection</a> where we got some scientific tasks to solve. One involved proving some statement about <a href="https://en.wikipedia.org/wiki/Square_number">square numbers</a> right or wrong.</p>

<h2 id="question">Question</h2>

<blockquote>
  <p>Is any of the integer numbers, A, consisting of exactly 15 ones and 15 zeros a square-number, that is an integer B exists, such that B*B=A? The number A should always have 30 digits and also numbers with leading zeros are considered. Please explain your answer. A simple YES or NO is not sufficient.</p>
</blockquote>

<h2 id="probing-the-statement-approach">Probing the statement approach</h2>

<p>I’m definitely no Maths genius, so the first thing I did was to randomly build some numbers with 15 1s and 15 0s and calculate their square roots to get a feeling for the problem.</p>

<p>Here I already stumbled upon some misleading results: for bigger numbers - such as any number with at least 15 digits - the Apple calculator, <a href="https://www.r-project.org/">R</a> and Google tend to round and switch to scientific notation, making you believe you are looking at square numbers.</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;</span><span class="w"> </span><span class="n">a</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">000000000000001111111111111110</span><span class="w">
</span><span class="o">&gt;</span><span class="w"> </span><span class="n">a</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">1.111111e+15</span><span class="w">
</span><span class="o">&gt;</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="n">a</span><span class="p">)</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">33333333</span><span class="w">
</span><span class="o">&gt;</span><span class="w"> </span><span class="m">33333333</span><span class="o">*</span><span class="m">33333333</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">1.111111e+15</span><span class="w">
</span></code></pre></div></div>

<p><img src="https://t-neumann.github.io/assets/images/posts/SquareNumberProof/googlecalculator.png" alt="Google calculator" width="50%" /></p>

<p>As you can see, both R and Google calculator would make you believe \(33333333^2\) yields \(1111111111111110\) when in fact it does not - cross-checked with Apple calculator.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/SquareNumberProof/applecalculator.png" alt="Apple calculator" width="50%" /></p>

<p>So after a small detour I had pretty quickly found an example proving the statement above wrong - which by itself is already sufficient to disprove the initial statement - but I wanted a little more sophistication.</p>

<p>I decided to take a rather lazy approach of reading up on properties of square numbers on <a href="https://en.wikipedia.org/wiki/Square_number">Wikipedia</a> and see whether any of them proves to be an easy no go. I came across the following:</p>

<ol>
  <li>No square number ends in 2, 3, 7 or 8.</li>
  <li>The number of zeros at the end of a perfect square is always even.</li>
  <li>Squares of even numbers are always even numbers and square of odd numbers are always odd.</li>
  <li>The Square of a natural number other than one is either a multiple of 3 or exceeds a multiple of 3 by 1.</li>
  <li>The Square of a natural number other than one is either a multiple of 4 or exceeds a multiple of 4 by 1.</li>
  <li>The unit’s digit of the square of a natural number is the unit’s digit of the square of the digit at unit’s place of the given natural number.</li>
  <li>There are no natural numbers \(p\) and \(q\) such that \(p^2 = 2q^2\).</li>
  <li>For every natural number \(n\),
\((n + 1)^2 - n^2 = (n + 1) + n\).</li>
  <li>For any natural number \(m\) greater than 1,
\((2m, m^2 - 1, m^2 + 1)\) is a Pythagorean triplet.</li>
</ol>

<p>So let’s just quickly go through them:</p>

<p><strong>Property 1</strong> does not really help because we can only construct numbers ending at 0 and 1, both apparently valid digits for square numbers.</p>

<p><strong>Property 2</strong> - we already hit the jackpot. Since we can freely distribute 0s in our numbers, it is trivial to create one with an odd number of zeros at the end.</p>

<p>Alrighty, let’s formalize it.</p>

<h2 id="proof-square-numbers-ending-in-zeros-strictly-end-with-an-even-number-of-zeros">Proof: Square numbers ending in zeros strictly end with an even number of zeros</h2>

<blockquote>
  <p>Theorem: Square numbers ending in zeros strictly end with an even number of zeros.</p>
</blockquote>

<p>(1) Let \(n\) be a positive integer ending in exactly \(m\) trailing zeros, with \(m \geq 0\).</p>

<p>(2) Then \(n = 10^m \cdot j\) for some integer \(j\) that is not divisible by \(10\).</p>

<p>(3) The perfect square of \(n\) equals \(n^2 = 10^{2m} \cdot j^2\). Since \(j\) is not divisible by \(10\), it lacks the prime factor \(2\) or the prime factor \(5\), and so does \(j^2\) - hence \(j^2\) is not divisible by \(10\) either.</p>

<p>From (3) it directly follows that \(n^2\) ends in exactly \(2m\) zeros - an even number of zeros.</p>

<p>We have proven the theorem and can therefore use it to construct counter-examples with the properties given in our initial question.</p>
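<p>As a quick empirical sanity check of the theorem, we can brute-force count trailing zeros of perfect squares in plain Python:</p>

```python
def trailing_zeros(n):
    """Number of trailing decimal zeros of a positive integer."""
    count = 0
    while n % 10 == 0:
        n //= 10
        count += 1
    return count

# Every perfect square up to 2000^2 ends in an even number of zeros.
assert all(trailing_zeros(k * k) % 2 == 0 for k in range(1, 2001))
print("Theorem holds for all squares up to 2000^2")
```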

<h2 id="disprove-statement-by-counterexample">Disprove statement by counterexample</h2>

<p>It is trivial to find a number \(m\) with exactly 15 ones and 15 zeros whose count of trailing zeros is odd: simply place all 15 zeros at the end.</p>

<p>Simplest example:</p>

\[m = 111111111111111000000000000000\]

<p>Since \(m\) ends in exactly 15 zeros - an odd number - the theorem tells us that \(m\) cannot be a square number.</p>
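<p>A quick empirical sanity check (a Python sketch of mine, not part of the original proof): count the trailing zeros of squares to confirm the count is always even, then let <code>math.isqrt</code> rule out concrete candidates built from ones and zeros.</p>

```python
import math

def trailing_zeros(n):
    """Count the zeros at the end of n's decimal representation (n >= 1)."""
    count = 0
    while n % 10 == 0:
        n //= 10
        count += 1
    return count

# Theorem check: every square of n = 1..100000 ends in an even number of zeros.
assert all(trailing_zeros(n * n) % 2 == 0 for n in range(1, 100001))

def is_square(n):
    r = math.isqrt(n)  # exact integer square root, works for arbitrary-size ints
    return r * r == n

# Candidates with an odd number of trailing zeros can never be squares:
print(is_square(int("1" * 15 + "0")))       # 15 ones, one trailing zero  -> False
print(is_square(int("1" * 15 + "0" * 15)))  # 15 ones, 15 trailing zeros -> False
```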

<p>Therefore it follows that the question</p>

<blockquote>
  <p>Is any of the integer numbers A, consisting of exactly 15 ones and 15 zeros, a square number - that is, does an integer B exist such that B*B=A?</p>
</blockquote>

<p>can be answered with <strong>No</strong>:</p>

<blockquote>
  <p>Not every integer number A consisting of exactly 15 ones and 15 zeros is a square number - that is, an integer B with B*B=A does not always exist.</p>
</blockquote>]]></content><author><name>Tobias Neumann</name></author><category term="Mathematics" /><category term="Proof" /><category term="Square Number" /><summary type="html"><![CDATA[Proof that the number of zeros at the end of a perfect square is always even.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://t-neumann.github.io/assets/images/categories/maths.jpg" /><media:content medium="image" url="https://t-neumann.github.io/assets/images/categories/maths.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Orbital basics</title><link href="https://t-neumann.github.io/space/OrbitalBasics/" rel="alternate" type="text/html" title="Orbital basics" /><published>2019-08-26T13:42:00+02:00</published><updated>2019-08-26T13:42:00+02:00</updated><id>https://t-neumann.github.io/space/OrbitalBasics</id><content type="html" xml:base="https://t-neumann.github.io/space/OrbitalBasics/"><![CDATA[<p>I was always fascinated by rockets, space in general and zero-gravity environments, but the maths involved always seemed too complex for me. However, through the playful yet still complex approach of <a href="https://www.kerbalspaceprogram.com/">Kerbal Space Program</a> (KSP) - an awesome game I totally recommend to anybody remotely interested in space exploration - I lately picked up interest again and started reading into orbital mechanics, propulsion systems and related topics in more detail.</p>

<p>This blog series is dedicated to summarising basic concepts at a definitely super-simplified - and probably sometimes oversimplified, not entirely correct - level.</p>

<p>The easiest concept for me to grasp - since one can explore it quite interactively in KSP - is the concept of orbits and orbital changes through orbital maneuvers.</p>

<p>So this very first post of this series will cover my basic understanding of the concept of orbits.</p>

<h2 id="ellipse">Ellipse</h2>

<p>Let’s start off with refreshing our memory of what an ellipse is - because that is what most orbits relevant for this blog series will look like. In mathematical terms, an ellipse is a plane curve surrounding two focal points (\(F_1\) and \(F_2\)), such that for all points on the curve, the sum of the two distances \(d(F_1) + d(F_2)\) is constant.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Orbits/Ellipse-definition.png" alt="Ellipse definition" width="50%" /></p>

<p>It is a generalization of a circle, where the two focal points are the same. Yes, also circular orbits exist.</p>

<h3 id="ellipse-parameters">Ellipse parameters</h3>

<p>There are a few important parameters describing an ellipse which will be referred to throughout this blog series, so make sure you memorize and understand them - they will keep popping up again and again.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Orbits/Ellipse-param.png" alt="Ellipse parameters" width="50%" /></p>

<h6 id="semi-major-and-semi-minor-axes-a-geq-b">Semi-major and semi-minor axes \(a \geq b\)</h6>

<p>\(a\) is referred to as the semi-major axis and \(b\) as the semi-minor axis, i.e. \(a \geq b &gt; 0\).</p>

<h6 id="linear-eccentricity-c">Linear eccentricity \(c\)</h6>

<p>This is the distance from the center to any of the two foci: \(c  = \sqrt{a^2 - b^2}\).</p>

<h6 id="eccentricity-e">Eccentricity \(e\)</h6>

<p>The eccentricity is expressed as:</p>

\[e = \frac{c}{a} = \sqrt{1 - (\frac{b}{a})^{2}}\]

<p>assuming \(a &gt; b\). An ellipse with equal axes \((a = b)\) has zero eccentricity and is a circle.</p>

<h6 id="semi-latus-rectum-l">Semi-latus rectum \(l\)</h6>

<p>The length of the chord through one of the foci, perpendicular to the major axis, is called the latus rectum. One half of it is the semi-latus rectum \(l\). A calculation shows:</p>

\[l = \frac{b^2}{a} = a(1-e^2)\]

<p>The semi-latus rectum \(l\) is equal to the radius of curvature of the osculating circles at the vertices.</p>
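<p>The parameter relations above are easy to check numerically. A small sketch (function and variable names are mine, chosen for illustration):</p>

```python
import math

def ellipse_params(a, b):
    """Derive linear eccentricity c, eccentricity e and semi-latus rectum l
    from the semi-major axis a and semi-minor axis b (a >= b > 0)."""
    c = math.sqrt(a**2 - b**2)   # distance from the center to either focus
    e = c / a                    # equivalently sqrt(1 - (b/a)^2)
    l = b**2 / a                 # equivalently a * (1 - e^2)
    return c, e, l

c, e, l = ellipse_params(a=5.0, b=3.0)
print(c, e, l)                       # 4.0 0.8 1.8
# A circle (a == b) has zero eccentricity:
print(ellipse_params(2.0, 2.0)[1])   # 0.0
```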

<h2 id="orbit">Orbit</h2>

<p>Now probably everybody has some idea what an orbit is, but before going into details, let’s first summarise the definitions I found on the web.</p>

<h4 id="definition">Definition</h4>

<p>In physics, an orbit is the gravitationally curved trajectory of an object, like the trajectory of a planet around a star or a satellite around earth. Unless mentioned differently, in this blogpost orbit refers to a regularly repeating trajectory, but there are also non-repeating trajectories. To a close approximation, planets and satellites follow elliptic orbits, with the central mass being orbited at one of the two focal points of the ellipse, as described by <a href="https://en.wikipedia.org/wiki/Kepler%27s_laws_of_planetary_motion">Kepler’s laws of planetary motion</a>.</p>

<p>The post will stick to the classical Newtonian mechanics paradigm of describing orbital motion, which is an adequate approximation for most situations. However, Einstein’s general theory of relativity - which accounts for gravity as curvature of spacetime, with orbits following geodesics - provides a more accurate description of orbital motion. This is needed near very massive bodies (e.g. for Mercury’s orbit around the sun) or for extreme precision (as for GPS satellites).</p>

<h4 id="understanding-orbits">Understanding orbits</h4>

<p>There are two factors involved for understanding orbits:</p>

<ul>
  <li>Gravity pulling an object from its straight path into a curved path</li>
  <li>The velocity at which this object is trying to travel along its path</li>
</ul>

<p><img src="https://t-neumann.github.io/assets/images/posts/Orbits/tangentialvelocity.jpg" alt="Tangential velocity vs gravity" width="50%" /></p>

<p>This principle is illustrated above: gravity from a massive body in the center (green) pulls an object travelling on a straight path (pink object, black arrows), effectively bending the path with its constant pull (red) around the central body.</p>

<p>Another way to illustrate how orbits develop is the thought experiment of <a href="https://en.wikipedia.org/wiki/Newton%27s_cannonball">Newton’s cannonball</a>. Here, we imagine a cannon on top of a very high mountain which can fire at any imaginable speed.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Orbits/Newton_Cannon.png" alt="Newton cannon" width="50%" /></p>

<p>If the cannon fires its ball with a low initial speed, the trajectory of the ball curves downward and hits the ground <strong>(A)</strong>. As the firing speed is increased, the cannonball hits the ground farther away from the cannon <strong>(B)</strong>, because while the ball is still falling towards the ground, the ground is increasingly curving away from it (see first point above). All these motions are actually “orbits” in a technical sense – they describe a portion of an elliptical path around the center of gravity – but the orbits are interrupted by striking the Earth. The horizontal speed for both <strong>(A)</strong> and <strong>(B)</strong> ranges from 0 to 7,000 m/s for Earth.</p>

<p>If the cannonball is fired with sufficient speed, the ground curves away from the ball at least as much as the ball falls – so the ball never strikes the ground. It is now in what could be called a non-interrupted, or circumnavigating, orbit. For any specific combination of height above the center of gravity and mass of the planet, there is one specific firing speed (unaffected by the mass of the ball, which is assumed to be very small relative to the Earth’s mass) that produces a circular orbit, as shown in <strong>(C)</strong>.</p>

<p>As the firing speed is increased beyond this, non-interrupted elliptic orbits are produced; one is shown in <strong>(D)</strong>. If the initial firing is above the surface of the Earth as shown, there will also be non-interrupted elliptical orbits at slower firing speeds; these will come closest to the Earth half an orbit beyond - and directly opposite - the firing point, below the circular orbit. The horizontal speed for both <strong>(C)</strong> and <strong>(D)</strong> ranges from 7,300 to 10,000 m/s for Earth.</p>

<p>At a specific horizontal firing speed called escape velocity, dependent on the mass of the planet, an open orbit <strong>(E)</strong> is achieved that has a parabolic path. At even greater speeds the object will follow a range of hyperbolic trajectories. In a practical sense, both of these trajectory types mean the object is “breaking free” of the planet’s gravity, and “going off into space” never to return. This involves any horizontal speed &gt; 10,000 m/s for Earth.</p>

<figure class="third ">
  <img src="/assets/images/posts/Orbits/Newtonsmountainv=0.gif" alt="Newton's cannon v=0" />
  <img src="/assets/images/posts/Orbits/Newtonsmountainv=6000.gif" alt="Newton's cannon v=6000" />
  <img src="/assets/images/posts/Orbits/Newtonsmountainv=7300.gif" alt="Newton's cannon v=7300" />
  <img src="/assets/images/posts/Orbits/Newtonsmountainv=8000.gif" alt="Newton's cannon v=8000" />
  <img src="/assets/images/posts/Orbits/Newtonsmountainv=10000.gif" alt="Newton's cannon v=10000" />
  <figcaption>Various firing speeds of Newton’s cannon and the resulting trajectory.</figcaption>
</figure>

<p>This leads to four practical classes of moving objects:</p>

<ol>
  <li>No orbit</li>
  <li>
    <p>Suborbital trajectories</p>

    <ul>
      <li>Range of interrupted elliptical paths</li>
    </ul>
  </li>
  <li>
    <p>Orbital trajectories</p>

    <ul>
      <li>Range of elliptical paths with closest point opposite the firing point</li>
      <li>Circular path</li>
      <li>Range of elliptical paths with closest point at the firing point</li>
    </ul>
  </li>
  <li>
    <p>Open (escape) trajectories</p>

    <ul>
      <li>Parabolic paths</li>
      <li>Hyperbolic paths</li>
    </ul>
  </li>
</ol>
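<p>The cannonball picture can be reproduced with a few lines of numerical integration. The following sketch is mine, not from the original thought experiment: it uses assumed Earth-like values, fires horizontally from 200 km altitude, and reports whether the ball strikes the ground within the simulated time.</p>

```python
import math

MU = 3.986004418e14   # Earth's gravitational parameter GM [m^3/s^2] (assumed)
R_EARTH = 6.371e6     # mean Earth radius [m] (assumed)

def fire(v0, altitude=200e3, dt=1.0, steps=30000):
    """Integrate a cannonball fired horizontally with speed v0 [m/s].
    Semi-implicit Euler keeps near-circular orbits numerically stable."""
    x, y = R_EARTH + altitude, 0.0
    vx, vy = 0.0, v0
    for _ in range(steps):
        r = math.hypot(x, y)
        if r <= R_EARTH:
            return "hits the ground"          # interrupted orbits (A)/(B)
        ax, ay = -MU * x / r**3, -MU * y / r**3
        vx += ax * dt                          # update velocity first ...
        vy += ay * dt
        x += vx * dt                           # ... then position
        y += vy * dt
    return "stays up"                          # non-interrupted orbits (C)/(D)/(E)

print(fire(2000.0))   # suborbital firing speed
print(fire(7800.0))   # roughly circular speed at this altitude
```

For this altitude the circular speed is about \(\sqrt{\mu / r} \approx 7{,}800\) m/s, which matches the 7,300-10,000 m/s range quoted above.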

<h4 id="apsis">Apsis</h4>

<p>The first two terms I learned about in KSP were the two apsides - probably because a lot of orbital maneuvers happen at those, and they are pretty simple to comprehend.</p>

<p>Apsis denotes either of the two extreme points (i.e., the farthest or nearest point) in the orbit of a planetary body about its primary body.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Orbits/apsis.png" alt="Apsis" width="50%" /></p>

<p>There are two apsides in any elliptic orbit. Each is named by selecting the appropriate prefix - apo- or peri- - and joining it to the reference suffix of the “host” body being orbited. The general form is <strong>apoapsis</strong> ((1) in the figure above) for the farthest point and <strong>periapsis</strong> ((2) in the figure above) for the nearest point. Depending on which central body is orbited, these become apogee and perigee for objects orbiting earth, aphelion and perihelion for objects orbiting the sun, etc.</p>

<h4 id="orbital-elements">Orbital elements</h4>

<p>Orbital elements are the parameters required to uniquely identify a specific orbit. In celestial mechanics, usually a Kepler orbit is used. A real orbit changes over time due to gravitational perturbations by other objects and relativistic effects, so a Keplerian orbit is merely an idealized, mathematical approximation at a particular time.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Orbits/orbitalelements.png" alt="Orbital elements" width="50%" /></p>

<p>An orbit is generally defined by six elements (known as Keplerian elements) that can be computed from position and velocity:</p>

<p>Two define the size and shape of the trajectory (compare with <a href="#ellipse-parameters">ellipse parameters</a>):</p>

<ul>
  <li>
    <p>Semi-major axis \(a\)</p>
  </li>
  <li>
    <p>Eccentricity \(e\)</p>
  </li>
</ul>

<p>Two elements define the orientation of the orbital plane in which the ellipse is embedded:</p>

<ul>
  <li>
    <p>Inclination \(i\) - vertical tilt of the ellipse with respect to the reference plane (for the earth e.g. the equatorial plane), measured at the ascending node, i.e. the point where the orbit passes upwards through the reference plane. The tilt angle is measured perpendicular to the line of intersection between the orbital plane and the reference plane.</p>
  </li>
  <li>
    <p>Longitude of the ascending node \(\Omega\) - horizontally orients the ascending node of the ellipse with respect to the reference frame’s vernal point ♈.</p>
  </li>
</ul>

<p>I found it pretty hard at first to wrap my head around what the vernal point ♈ actually is - naturally it is some arbitrary reference point to fix the angle for the ascending node \(\Omega\). The vernal point ♈ is one of the equinoxes, namely the one occurring in spring in the northern hemisphere. It is regarded as the instant of time when the plane of the Earth’s equator passes through the center of the sun; at the equator, the sunrays then hit the earth perpendicularly, directly from the zenith. After passing the vernal point, the northern hemisphere receives more light - summer is here; before the vernal point, the northern hemisphere received less light - winter was coming. The same holds vice versa for the southern hemisphere.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Orbits/vernalpoint.png" alt="Vernal point" width="100%" /></p>

<p>The two remaining elements are as follows:</p>

<ul>
  <li>
    <p>Argument of periapsis \(\omega\) defines the orientation of the ellipse in the orbital plane. It is measured as the angle from the ascending node to the periapsis.</p>
  </li>
  <li>
    <p>True anomaly (\(\nu\), \(\theta\), or \(f\)) at epoch defines the position of the orbiting body along the ellipse at a specific time (the “epoch”). The true anomaly is the angle between the direction of the periapsis and the current position of the orbiting body.</p>
  </li>
</ul>

<p>Epoch sounds pretty sophisticated, but it is basically just a moment in time used as a reference point for some time-varying astronomical quantity, like the true anomaly. Still sounds complicated?</p>

<p>Let’s look at some unit indicating a specific epoch: J2000.</p>

<p>The \(J\) unit refers to Julian years, which are intervals with the length of a mean year in the Julian calendar, i.e. 365.25 days. This interval measure does not itself define an epoch - the Gregorian calendar remains in general use for dating. Thus “J2000” refers to the instant of 12:00 TT (noon) on January 1, 2000.</p>

<p>Now an arbitrary Julian epoch is therefore related to the Julian date by</p>

\[J = 2000 + \frac{\text{Julian date} - 2451545.0}{365.25}\]

<p>So in a sense everybody definitely has a feeling for an epoch, because we also structure our lives and set up meetings for certain “epochs” every day.</p>
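<p>The conversion above is easy to play with; a minimal sketch (the function name is mine):</p>

```python
def julian_epoch(jd):
    """Convert a Julian date to a Julian epoch, e.g. J2000.0."""
    return 2000.0 + (jd - 2451545.0) / 365.25

print(julian_epoch(2451545.0))           # noon TT, January 1, 2000 -> 2000.0
print(julian_epoch(2451545.0 + 365.25))  # exactly one Julian year later -> 2001.0
```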

<h4 id="orbital-period">Orbital period</h4>

<p>The orbital period is simply how long an orbiting body takes to complete one orbit.</p>
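<p>For a Kepler orbit, the period follows from the semi-major axis and the central body’s gravitational parameter alone, via Kepler’s third law \(T = 2\pi\sqrt{a^3/\mu}\) - the formula is standard textbook material, not derived in this post:</p>

```python
import math

MU_EARTH = 3.986004418e14  # Earth's gravitational parameter GM [m^3/s^2] (assumed)

def orbital_period(a, mu=MU_EARTH):
    """Period of a Kepler orbit with semi-major axis a [m], in seconds."""
    return 2 * math.pi * math.sqrt(a**3 / mu)

# ISS-like orbit at roughly 420 km altitude: about 92.8 minutes
print(round(orbital_period(6.371e6 + 420e3) / 60, 1), "minutes")
```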

<h4 id="ellipse-vs-orbits">Ellipse vs orbits</h4>

<p>For elliptical orbits, some formulas from ellipses are directly related.</p>

<p>Let \(e\) be the eccentricity, \(r_a\) the radius of the apoapsis, \(r_p\) the radius of the periapsis and \(a\) the length of the semi-major axis. Then:</p>

\[e = \frac{r_a - r_p}{r_a + r_p} = \frac{r_a - r_p}{2a}\]

\[r_a = (1 + e)a\]

\[r_p = (1 - e)a\]

<p>Interestingly, the semi-major axis \(a\) is the arithmetic mean, the semi-minor axis \(b\) the geometric mean and the semi-latus rectum \(l\) the harmonic mean of \(r_a\) and \(r_p\):</p>

\[a = \frac{r_a + r_p}{2}\]

\[b = \sqrt{r_a r_p}\]

\[l = \frac{2}{\frac{1}{r_a} + \frac{1}{r_p}} = \frac{2r_{a}r_{p}}{r_a + r_p}\]
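<p>These three means, and their consistency with the apsis formulas, are quickly verified numerically (a sketch of mine; the radii are arbitrary example values):</p>

```python
import math

r_a, r_p = 3.0, 1.0   # example apoapsis and periapsis radii

a = (r_a + r_p) / 2               # arithmetic mean -> semi-major axis
b = math.sqrt(r_a * r_p)          # geometric mean  -> semi-minor axis
l = 2 * r_a * r_p / (r_a + r_p)   # harmonic mean   -> semi-latus rectum
e = (r_a - r_p) / (r_a + r_p)     # eccentricity

print(a, e)                        # 2.0 0.5
# Consistency with the apsis formulas:
print(r_a == (1 + e) * a)          # True
print(r_p == (1 - e) * a)          # True
print(abs(l - b**2 / a) < 1e-12)   # True: l = b^2 / a
```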

<h4 id="orbits-in-ksp">Orbits in KSP</h4>

<p>Now this post should leave you with a basic idea of what an orbit is, how it is defined and which parameters are important to specify an orbit and to position a moving object in a given orbit. As a little teaser for the next post, where we will be talking about basic orbital maneuvers and mechanics, here is a first screenshot from KSP of a random orbit. What can you tell from it?</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Orbits/ksp-orbital-parameters.png" alt="KSP orbits" width="100%" /></p>

<p>Given what I have told you, you should be able to spot that it is a circular orbit (eccentricity = 0, or apoapsis \(\approx\) periapsis) and its orbital plane is perfectly aligned with the equatorial plane of the central body (inclination = 0).</p>

<p>Now you should be equipped with the basic toolset for the next post where we will be modifying orbital parameters with maneuvers.</p>]]></content><author><name>Tobias Neumann</name></author><category term="Space" /><category term="Orbits" /><category term="Orbital mechanics" /><category term="Orbital parameters" /><summary type="html"><![CDATA[Basic definition, terminology and concepts of orbits]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://t-neumann.github.io/assets/images/categories/OOSS.jpg" /><media:content medium="image" url="https://t-neumann.github.io/assets/images/categories/OOSS.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Pipelines on AWS</title><link href="https://t-neumann.github.io/pipelines/AWS-pipeline/" rel="alternate" type="text/html" title="Pipelines on AWS" /><published>2019-08-25T21:51:00+02:00</published><updated>2019-08-25T21:51:00+02:00</updated><id>https://t-neumann.github.io/pipelines/AWS-pipeline</id><content type="html" xml:base="https://t-neumann.github.io/pipelines/AWS-pipeline/"><![CDATA[<p>The prerequisite for this post is that you have a sound understanding of Nextflow and made yourself familiar with the <code class="language-plaintext highlighter-rouge">salmon-nf</code> workflow created in <a href="https://t-neumann.github.io/pipelines/Nextflow-pipeline/">this post</a>. Furthermore, you should know all the essential AWS building blocks and basic architecture of an AWS based batch scheduler as I presented in my <a href="https://t-neumann.github.io/pipelines/AWS-architecture/">previous post</a>. 
In this post, I will show you which environment and resources you have to set up on AWS to make the <a href="https://github.com/t-neumann/salmon-nf"><code class="language-plaintext highlighter-rouge">salmon-nf</code></a> example pipeline run, and then how to actually run jobs on the resulting AWS Batch queue with <a href="https://www.nextflow.io/">Nextflow</a>.</p>

<h2 id="credits">Credits</h2>

<p>Many people have done a great job setting up tutorials and blogs on this, and I would like to acknowledge a few that helped me a lot to actually make my AWS pipelines happen:</p>

<ul>
  <li><a href="https://maxulysse.github.io/">Maxime Garcia</a> and his great blog</li>
  <li><a href="https://apeltzer.github.io/">Alex Peltzer</a></li>
  <li><a href="https://github.com/pditommaso">Paolo Di Tommaso</a> for Nextflow and Gitter support</li>
</ul>

<p>There are a couple of tutorials that helped a lot:</p>

<ul>
  <li><a href="https://www.nextflow.io/docs/latest/awscloud.html#aws-batch">Nextflow documentation</a></li>
  <li><a href="https://www.nextflow.io/blog/2017/scaling-with-aws-batch.html">Nextflow blog</a></li>
</ul>

<h2 id="prerequisites">Prerequisites</h2>

<h3 id="accounts-users-roles-permissions">Accounts, users, roles, permissions</h3>

<p>Some things have to be set up before building the actual AWS compute environment - obvious things such as an <code class="language-plaintext highlighter-rouge">AWS account</code>, but also an <code class="language-plaintext highlighter-rouge">IAM user</code> and <code class="language-plaintext highlighter-rouge">Service roles</code>. All of this has to be done only once and is exhaustively covered already in several blog posts such as <a href="https://apeltzer.github.io/post/01-aws-nfcore/">this one</a> by Alex Peltzer and Tobias Koch. Therefore, I will not spend any time on this and suggest you just follow the instructions in that blog post until it is time to set up your <code class="language-plaintext highlighter-rouge">AMI</code>, which is where I will start off.</p>

<h2 id="step-1-estimate-resource-requirements">Step 1: Estimate resource requirements</h2>

<p>Appropriate resource allocation is crucial for setting up AWS workflows that are both cost-efficient and high-throughput. Therefore, I strongly advise you to take a big enough test dataset, run it in a limitless test environment - hopefully many of you have some kind of in-house HPC cluster - and take the resulting measurements of resource consumption to find optimal storage, memory and CPU sizes.</p>

<p>Conveniently, Nextflow workflows can be easily executed both on <code class="language-plaintext highlighter-rouge">AWS</code> but also in your local HPC environment by simply defining additional <a href="https://www.nextflow.io/docs/latest/config.html#config-profiles">profiles</a> for the scheduler of your choice.</p>

<p>Here is one example of a simple <code class="language-plaintext highlighter-rouge">SLURM</code> profile:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">singularity</span> <span class="o">{</span>
	<span class="n">enabled</span> <span class="o">=</span> <span class="kc">true</span>
<span class="o">}</span>

<span class="n">docker</span> <span class="o">{</span>
	<span class="n">enabled</span> <span class="o">=</span> <span class="kc">false</span>
<span class="o">}</span>

<span class="n">process</span> <span class="o">{</span>

    <span class="n">executor</span> <span class="o">=</span> <span class="err">'</span><span class="n">slurm</span><span class="err">'</span>
    <span class="n">clusterOptions</span> <span class="o">=</span> <span class="err">'</span><span class="o">--</span><span class="n">qos</span><span class="o">=</span><span class="kt">short</span><span class="err">'</span>
    <span class="n">cpus</span> <span class="o">=</span> <span class="err">'</span><span class="mi">12</span><span class="err">'</span>
    <span class="n">memory</span> <span class="o">=</span> <span class="o">{</span> <span class="mi">8</span><span class="o">.</span><span class="na">GB</span> <span class="o">*</span> <span class="n">task</span><span class="o">.</span><span class="na">attempt</span> <span class="o">}</span>
<span class="o">}</span>

<span class="n">params</span> <span class="o">{</span>

   <span class="n">salmonIndex</span> <span class="o">=</span> <span class="err">'</span><span class="o">/</span><span class="n">groups</span><span class="o">/</span><span class="nc">Software</span><span class="o">/</span><span class="n">indices</span><span class="o">/</span><span class="n">hg38</span><span class="o">/</span><span class="n">salmon</span><span class="o">/</span><span class="n">gencode</span><span class="o">.</span><span class="na">v28</span><span class="o">.</span><span class="na">IMPACT</span><span class="err">'</span>

<span class="o">}</span>
</code></pre></div></div>

<p>As you can see, usually <code class="language-plaintext highlighter-rouge">HPC</code> environments do not allow Docker containers to run, but support <a href="https://singularity.lbl.gov/">Singularity</a> containers which can be <a href="https://singularity.lbl.gov/docs-build-container#downloading-a-existing-container-from-docker-hub">easily built from Docker containers</a>.</p>

<p>The <code class="language-plaintext highlighter-rouge">process</code> section basically defines the scheduler, resources and the job queue in which the processes should run. Finally, the index files are usually stored in some globally accessible directory, similar to the <code class="language-plaintext highlighter-rouge">s3</code> storage on <code class="language-plaintext highlighter-rouge">AWS</code>.</p>

<p>Now that we are set, Nextflow has this neat option flag <code class="language-plaintext highlighter-rouge">-with-report</code> that gives you a very <a href="https://www.nextflow.io/docs/latest/tracing.html#execution-report">comprehensive overview</a> of the resources your processes consumed during execution.</p>

<p>Below are the most important excerpts of an example report from when I ran my Nextflow workflow on 1,222 breast cancer datasets from <a href="https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga">TCGA</a>:</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/nextflowreport_CPU.png" alt="Nextflow CPU consumption" /></p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/nextflowreport_memory.png" alt="Nextflow memory consumption" />
<img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/nextflowreport_time.png" alt="Nextflow time duration" /></p>

<p>On average a single task ran on <strong>6 threads</strong>, consumed <strong>8 GB of memory</strong> and ran for <strong>2:30 minutes</strong> - this is the rough framework of resources we will have to consider when allocating resources and choosing appropriate <code class="language-plaintext highlighter-rouge">EC2</code> instances.</p>
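<p>From those numbers you can do a quick back-of-the-envelope packing calculation before picking instance types. A sketch (the helper function and instance shapes are my assumptions, only loosely modeled on common EC2 sizes; the per-task requirements come from the report above):</p>

```python
def tasks_per_instance(vcpus, mem_gb, task_cpus=6, task_mem_gb=8):
    """How many tasks fit on one instance - whichever resource runs out first."""
    return min(vcpus // task_cpus, int(mem_gb // task_mem_gb))

# Assumed instance shapes (vCPUs, GB RAM):
for name, vcpus, mem in [("4xlarge", 16, 64), ("9xlarge", 36, 72), ("18xlarge", 72, 144)]:
    print(name, tasks_per_instance(vcpus, mem))  # 2, 6 and 12 tasks - CPU-bound here
```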

<h2 id="step-2-creating-a-suitable-ami">Step 2: Creating a suitable AMI</h2>

<p>I found the setup and configuration of suitable <code class="language-plaintext highlighter-rouge">AMIs</code> to be the most demanding step when creating an environment to run a pipeline on <code class="language-plaintext highlighter-rouge">AWS</code>. Several things have to be considered:</p>

<ul>
  <li>Base image: It has to be <code class="language-plaintext highlighter-rouge">ECS</code>-compatible</li>
  <li><code class="language-plaintext highlighter-rouge">EBS</code> storage: The attached volumes have to be large enough to contain all input, index, temporary and output files</li>
  <li><code class="language-plaintext highlighter-rouge">AWS CLI</code>: The <code class="language-plaintext highlighter-rouge">AMI</code> has to contain the <code class="language-plaintext highlighter-rouge">AWS CLI</code>, otherwise no files can be fetched from and copied to <code class="language-plaintext highlighter-rouge">S3</code> from the <code class="language-plaintext highlighter-rouge">EBS</code> volume</li>
  <li><code class="language-plaintext highlighter-rouge">AMIs</code> cannot be reused with less <code class="language-plaintext highlighter-rouge">EBS</code> storage than they were created with (more is possible)</li>
</ul>

<p>This section covers how you can set up your <code class="language-plaintext highlighter-rouge">AMI</code> for a given task of your pipeline and what to consider on the way.</p>

<h3 id="choose-an-amazon-machine-image-ami">Choose an Amazon Machine Image (AMI)</h3>

<p>As a first step, we want to make sure to pick a base image that supports <code class="language-plaintext highlighter-rouge">ECS</code> from the AWS Market Place. I strongly advise you to use one of the <code class="language-plaintext highlighter-rouge">Amazon ECS-Optimized Amazon Linux AMI</code> images.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/AMI-Choose-AMI.png" alt="Choose AMI" /></p>

<h3 id="choose-an-instance-type">Choose an Instance Type</h3>

<p>The <code class="language-plaintext highlighter-rouge">EC2</code> instance we want to use to create our custom <code class="language-plaintext highlighter-rouge">AMI</code> does not need to be powerful, since we won’t run any jobs on it. Therefore, a <code class="language-plaintext highlighter-rouge">t2.micro</code> instance is more than sufficient.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/AMI-Choose-Instance.png" alt="Choose Instance" /></p>

<h3 id="configure-instance-details">Configure Instance Details</h3>

<p>The instance configuration can mostly be left at the defaults. However, I would strongly advise you to set the shutdown behaviour to <code class="language-plaintext highlighter-rouge">terminate</code>, otherwise attached volumes will be kept persistent and you continue to pay unless you explicitly terminate the instance manually. I actually ran into huge costs ($300) when misconfiguring this, so <strong>watch out!</strong></p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/AMI-Configure-Instance.png" alt="Configure Instance" /></p>

<h3 id="add-storage">Add storage</h3>

<p>This is the single most important point of the entire <code class="language-plaintext highlighter-rouge">AMI</code> setup process - here you define the <strong>minimum</strong> amount of added storage for your <code class="language-plaintext highlighter-rouge">AMI</code>. This storage <strong>must</strong> be large enough to contain <strong>all</strong> input and index files for a given task as well as <strong>all</strong> temporary and final output files produced during the computation. I hope you did some thorough benchmarking and extrapolation of resources on your input dataset.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/AMI-Add-Storage.png" alt="Add Storage" /></p>

<h3 id="add-tags">Add tags</h3>

<p>Unless you want to add optional tags, nothing to do here…</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/AMI-Add-Tags.png" alt="Add Tags" /></p>

<h3 id="configure-security-group">Configure Security Group</h3>

<p>Before firing up your instance, you need to configure the associated security group. For me, letting AWS create the security group worked perfectly fine. I would still double-check that you can connect to the <code class="language-plaintext highlighter-rouge">EC2</code> instance - in case of doubt set the source to <code class="language-plaintext highlighter-rouge">0.0.0.0/0</code>, even though probably all IT security experts will kill me for that. Now you are ready to <strong>launch the instance</strong>.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/AMI-Security-Group.png" alt="Security Group" /></p>

<h3 id="ssh-connect-to-instance">SSH connect to instance</h3>

<p>Now right click and hit <em>Connect</em> to get your <code class="language-plaintext highlighter-rouge">ssh</code> connect command to your instance. You might have to change the default <code class="language-plaintext highlighter-rouge">root</code> user to <code class="language-plaintext highlighter-rouge">ec2-user</code>.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/AMI-SSH.png" alt="AMI SSH connect" /></p>

<h3 id="adjust-docker-container-size-to-ebs">Adjust Docker container size to EBS</h3>

<p>The first thing we want to check once connected to our instance is that the Docker configuration reflects the amount of added EBS storage.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>ec2-user@ip-172-31-40-128 ~]<span class="nv">$ </span>docker info | <span class="nb">grep</span> <span class="nt">-i</span> data
 Data Space Used: 309.3MB
 Data Space Total: 42.42GB
 Data Space Available: 42.11GB
 Metadata Space Used: 4.833MB
 Metadata Space Total: 46.14MB
 Metadata Space Available: 41.3MB
</code></pre></div></div>

<p>In the above example we see that Docker is indeed configured for the specified 40 GB EBS data volume.</p>

<p>By default, the maximum storage size of a single Docker container is 10 GB - independent of the data space available - so we have to adjust this.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>ec2-user@ip-172-31-40-128 ~]<span class="nv">$ </span>docker info | <span class="nb">grep</span> <span class="nt">-i</span> base
 Base Device Size: 10.74GB
</code></pre></div></div>

<p>To this end, we have to extend the options in <code class="language-plaintext highlighter-rouge">/etc/sysconfig/docker-storage</code> with the parameter <code class="language-plaintext highlighter-rouge">--storage-opt dm.basesize=40GB</code> and restart the Docker service.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>vi /etc/sysconfig/docker-storage
</code></pre></div></div>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">DOCKER_STORAGE_OPTIONS</span><span class="o">=</span><span class="s2">"--storage-driver devicemapper --storage-opt dm.thinpooldev=/dev/mapper/docker-docker--pool --storage-opt dm.use_deferred_removal=true --storage-opt dm.use_deferred_deletion=true --storage-opt dm.fs=ext4 --storage-opt dm.basesize=40GB"</span>
</code></pre></div></div>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>ec2-user@ip-172-31-40-128 ~]<span class="nv">$ </span><span class="nb">sudo </span>service docker restart
Stopping docker:                                           <span class="o">[</span>  OK  <span class="o">]</span>
Starting docker:       	<span class="nb">.</span>                                  <span class="o">[</span>  OK  <span class="o">]</span>
</code></pre></div></div>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>ec2-user@ip-172-31-40-128 ~]<span class="nv">$ </span>docker info | <span class="nb">grep</span> <span class="nt">-i</span> base
 Base Device Size: 42.95GB
</code></pre></div></div>
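<p>If you prefer a non-interactive edit over <code class="language-plaintext highlighter-rouge">vi</code>, a <code class="language-plaintext highlighter-rouge">sed</code> one-liner can append the option for you. This is only a sketch - it assumes the options line ends with a closing double quote as shown above, so back up the file and eyeball the result before restarting Docker:</p>

```shell
# Back up the storage config, then append dm.basesize before the closing quote
sudo cp /etc/sysconfig/docker-storage /etc/sysconfig/docker-storage.bak
sudo sed -i 's/"$/ --storage-opt dm.basesize=40GB"/' /etc/sysconfig/docker-storage
sudo service docker restart
```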

<h3 id="install-aws-cli">Install AWS CLI</h3>

<p><code class="language-plaintext highlighter-rouge">Nextflow</code> requires the <code class="language-plaintext highlighter-rouge">AWS CLI</code> to copy files such as input files and indices from and output files to <code class="language-plaintext highlighter-rouge">S3</code>.</p>

<p>Use the following lines to add it to your <code class="language-plaintext highlighter-rouge">AMI</code>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>yum <span class="nb">install</span> <span class="nt">-y</span> bzip2 wget
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh <span class="nt">-b</span> <span class="nt">-f</span> <span class="nt">-p</span> <span class="nv">$HOME</span>/miniconda
<span class="nv">$HOME</span>/miniconda/bin/conda <span class="nb">install</span> <span class="nt">-c</span> conda-forge <span class="nt">-y</span> awscli
<span class="nb">rm </span>Miniconda3-latest-Linux-x86_64.sh
</code></pre></div></div>

<p>Give it a quick spin to see whether everything is ok.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>ec2-user@ip-172-31-40-128 ~]<span class="nv">$ </span>./miniconda/bin/aws <span class="nt">--version</span>
aws-cli/1.16.121 Python/3.7.1 Linux/4.14.94-73.73.amzn1.x86_64 botocore/1.12.111
</code></pre></div></div>

<h3 id="save-your-ami">Save your AMI</h3>

<p>Now you can go back to your <code class="language-plaintext highlighter-rouge">EC2</code> instance dashboard and save your <code class="language-plaintext highlighter-rouge">AMI</code> by right clicking and going for <code class="language-plaintext highlighter-rouge">Image-&gt;Create Image</code>.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/AMI-Create-AMI.png" alt="Create AMI" /></p>

<p><strong>Congratulations</strong> you have created your first <code class="language-plaintext highlighter-rouge">AMI</code>!</p>

<p>Don’t forget to terminate the running <code class="language-plaintext highlighter-rouge">EC2</code> instance from which you created the <code class="language-plaintext highlighter-rouge">AMI</code> to prevent any ongoing <code class="language-plaintext highlighter-rouge">EBS</code> and <code class="language-plaintext highlighter-rouge">EC2</code> costs.</p>

<h2 id="step-3-creating-compute-environments-and-job-queues">Step 3: Creating compute environments and job queues</h2>

<p>Now it is time to create appropriate compute environments and their corresponding job queues. I usually like to create some baseline <em>workload</em> queue that should handle most of the jobs providing resources estimated from Step 1 and an <em>excess</em> queue with very extensive resources that handles the few jobs that overflow the <em>workload</em> resources, so that the entire batch is still successfully processed.</p>

<h3 id="overview">Overview</h3>

<p>First, we want to create a new compute environment upon which we can base job queues. For this, go to the <code class="language-plaintext highlighter-rouge">AWS Batch</code> dashboard -&gt; <code class="language-plaintext highlighter-rouge">Compute Environments</code>.</p>

<p>I have already created some production environments, for you this overview will probably be empty. Then go to <code class="language-plaintext highlighter-rouge">Create Environment</code>.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/ComputeEnvironment_Overview.png" alt="Compute environment overview" /></p>

<h3 id="naming-roles-and-permissions">Naming, roles and permissions</h3>

<p>First, we want a <code class="language-plaintext highlighter-rouge">managed</code> environment, so <code class="language-plaintext highlighter-rouge">AWS Batch</code> can do configuration and scaling for us. Next, we can name our compute environment. I chose to create the <code class="language-plaintext highlighter-rouge">workload</code> compute environment first, thus naming it <code class="language-plaintext highlighter-rouge">salmonWorkload</code>. Then we simply select the service and instance roles as well as the keypair we created earlier in the <code class="language-plaintext highlighter-rouge">prerequisite</code> section - there should be only one option to choose from.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/ComputeEnvironment_Names.png" alt="Compute environment naming" /></p>

<h3 id="some-words-on-instance-types-and-vcpu-limits">Some words on instance types and vCPU limits</h3>

<p>In my opinion, this part is <strong>the most crucial part</strong> of setting up an optimal environment both in terms of computation and cost efficiency. <strong>So pay special attention here!</strong></p>

<p>First of all, I hope you did a good enough job in Step 1 of estimating your resource requirements <strong>per task</strong>.</p>

<p>These are the key points you have to consider when fixing instance types and vCPU limits for your compute environment:</p>

<h4 id="fit-only-1-task-in-1-instance">Fit only <strong>1</strong> task in <strong>1</strong> instance!</h4>

<p>If you look at the instance pricing table, you will see that prices scale linearly with instance size - doubling the resources doubles the price. You will not save anything by running more jobs on a single larger instance, but you may well pay for it: in my experience, the Docker daemon sometimes gets confused and hangs when multiple tasks run on the same instance.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-architecture/EC2Instances.png" alt="EC2 instances" /></p>

<h4 id="vcpus-refers-to-the-total-number-of-vcpus-of-your-environments">vCPUs refers to the total number of vCPUs of your environments</h4>

<p>This also confused me when trying to figure out how many instances will be fired up in total. Essentially, you have to divide this number by the number of vCPUs provided by your instance type of choice, which gives you the number of instances launched at peak times.</p>

<p>So let’s say you chose <code class="language-plaintext highlighter-rouge">c5.2xlarge</code> as your instance type with 8 vCPUs and your specified <code class="language-plaintext highlighter-rouge">Maximum vCPUs</code> is 100, then 100 / 8 = 12.5, so at most 12 instances will be launched in total if the entire compute environment is utilized.</p>
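<p>The calculation can be sketched in shell arithmetic (the numbers are just the hypothetical ones from the example above):</p>

```shell
MAX_VCPUS=100          # "Maximum vCPUs" of the compute environment
VCPUS_PER_INSTANCE=8   # a c5.2xlarge provides 8 vCPUs

# Integer division - AWS Batch can only launch whole instances
echo $(( MAX_VCPUS / VCPUS_PER_INSTANCE ))   # prints 12
```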

<h4 id="keep-some-spare-memory-for-instance-services">Keep some spare memory for instance services</h4>

<p>I will address this in detail later, but keep in mind that not the entire memory listed in the instance type specification can be used, since some of it will be occupied with running basic instance services.</p>

<h4 id="keep-homogeneous-compute-environments">Keep homogeneous compute environments</h4>

<p>Since we did a careful resource requirement estimation, I find it easiest - both for keeping track of cost and for ensuring that the tasks actually finish - to have homogeneous compute environments, meaning one environment only allows one specific instance type.</p>

<h3 id="specifying-instance-types-and-vcpu-limits">Specifying instance types and vCPU limits</h3>

<p>Now let’s put it all together. First up, let’s quickly refresh the resource requirements we had per Salmon task:</p>

<ul>
  <li>We need an instance to provide 8 GB of memory to fit index + data</li>
  <li>If we run our tasks with 6 threads, each task takes roughly 2:30 minutes</li>
</ul>

<p>Now if we check the instance type table, we find there are actually two instance types that would cover these requirements:</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/ComputeEnvironment_InstanceResearch.png" alt="Potential instance types" /></p>

<p>The <code class="language-plaintext highlighter-rouge">c5.xlarge</code> comes with 8 GB of memory and 4 vCPUs, the <code class="language-plaintext highlighter-rouge">c5.2xlarge</code> with double the memory and vCPUs. So in principle, we could fit an average task into the smaller instance - but remember: first, the services running on the instance create some overhead that effectively reduces those 8 GB, and second, these are average requirements, so any task above average would fail on such an instance. Therefore, we should definitely go for a <code class="language-plaintext highlighter-rouge">c5.2xlarge</code> here.</p>

<ul>
  <li>Choose <code class="language-plaintext highlighter-rouge">c5.2xlarge</code> as your only instance type and delete <code class="language-plaintext highlighter-rouge">optimal</code></li>
  <li>Set <code class="language-plaintext highlighter-rouge">Minimum vCPUs</code> and <code class="language-plaintext highlighter-rouge">Desired vCPUs</code> both to 0 so that no idle instances run in the background</li>
  <li>Tick the <code class="language-plaintext highlighter-rouge">Enable user-specified Ami ID</code>, copy the <code class="language-plaintext highlighter-rouge">AMI ID</code> from the <code class="language-plaintext highlighter-rouge">AMI</code> we created and validate it</li>
</ul>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/ComputeEnvironment_Resources.png" alt="Compute environment resources" /></p>

<p>Everything else you can leave empty and click <code class="language-plaintext highlighter-rouge">Create</code>.</p>

<p>Congratulations, you have created your first compute environment!</p>

<h2 id="step-4-creating-job-queues">Step 4: Creating job queues</h2>

<p>Now we need to create a job queue and associate it with our compute environment. This step is actually pretty easy and straightforward.</p>

<p>First go to <code class="language-plaintext highlighter-rouge">Job queues</code> and click <code class="language-plaintext highlighter-rouge">Create Queue</code>.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/JobQueue_Overview.png" alt="Job queue overview" /></p>

<p>Now you can pick a name for your job queue - in our simple case I give it the same name as our compute environment, <code class="language-plaintext highlighter-rouge">salmonWorkload</code>. You can in principle assign multiple job queues to one compute environment and set priorities via the <code class="language-plaintext highlighter-rouge">Priority</code> field, but we can simply put <code class="language-plaintext highlighter-rouge">1</code> in there.</p>

<p>Finally, associate the job queue with our <code class="language-plaintext highlighter-rouge">salmonWorkload</code> compute environment. Note again that you can in principle assign multiple compute environments to a given job queue.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/JobQueue_Create.png" alt="Job queue creation" /></p>

<p>That’s it - click <code class="language-plaintext highlighter-rouge">Create job queue</code> and you have successfully created your first job queue!</p>

<h3 id="excess-queue">Excess queue</h3>

<p>Now that we have our workload compute environment and job queue, we want to do the same for our excess compute environment and job queue to handle any datasets with overshooting resource requirements.</p>

<p>Therefore, we repeat the steps starting from Step 3 to create a <code class="language-plaintext highlighter-rouge">salmonExcess</code> compute environment and job queue based on <code class="language-plaintext highlighter-rouge">c5.4xlarge</code> instances with double the resources compared to our <code class="language-plaintext highlighter-rouge">salmonWorkload</code> queue.</p>

<p>This should leave you with the following compute environments and job queues - finally ready to specify our resource constraints before submitting our first jobs.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/TwoQueue_environments.png" alt="Two queue environments" /></p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/TwoQueue_jobqueues.png" alt="Two queue job queues" /></p>

<h2 id="step-5-adjusting-resources">Step 5: Adjusting resources</h2>

<p>OK, now that we have set up all the compute environments with associated instance types as well as job queues on the <code class="language-plaintext highlighter-rouge">AWS</code> end, we know what resources we have available and how much of them our tasks will consume.</p>

<h3 id="resource-definition">Resource definition</h3>

<p>So naïvely we can directly enter the specifications of our <code class="language-plaintext highlighter-rouge">EC2</code> instance type of choice in the <code class="language-plaintext highlighter-rouge">awsbatch.config</code> file of our <code class="language-plaintext highlighter-rouge">salmon-nf</code> Nextflow workflow, since we know the <code class="language-plaintext highlighter-rouge">salmonWorkload</code> queue consists of <code class="language-plaintext highlighter-rouge">c5.2xlarge</code> instances with 16 GB memory and 8 vCPUs each and our <code class="language-plaintext highlighter-rouge">salmonExcess</code> queue of <code class="language-plaintext highlighter-rouge">c5.4xlarge</code> instances with 32 GB memory and 16 vCPUs each.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">aws</span><span class="o">.</span><span class="na">region</span> <span class="o">=</span> <span class="err">'</span><span class="n">eu</span><span class="o">-</span><span class="n">central</span><span class="o">-</span><span class="mi">1</span><span class="err">'</span>
<span class="n">aws</span><span class="o">.</span><span class="na">client</span><span class="o">.</span><span class="na">storageEncryption</span> <span class="o">=</span> <span class="err">'</span><span class="no">AES256</span><span class="err">'</span>
<span class="n">executor</span><span class="o">.</span><span class="na">name</span> <span class="o">=</span> <span class="err">'</span><span class="n">awsbatch</span><span class="err">'</span>
<span class="n">executor</span><span class="o">.</span><span class="na">awscli</span> <span class="o">=</span> <span class="err">'</span><span class="o">/</span><span class="n">home</span><span class="o">/</span><span class="n">ec2</span><span class="o">-</span><span class="n">user</span><span class="o">/</span><span class="n">miniconda</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">aws</span><span class="err">'</span>

<span class="n">process</span> <span class="o">{</span>
  <span class="n">queue</span> <span class="o">=</span> <span class="o">{</span> <span class="n">task</span><span class="o">.</span><span class="na">attempt</span> <span class="o">&gt;</span> <span class="mi">1</span> <span class="o">?</span> <span class="err">'</span><span class="n">salmonExcess</span><span class="err">'</span> <span class="o">:</span> <span class="err">'</span><span class="n">salmonWorkload</span><span class="err">'</span> <span class="o">}</span>
	<span class="n">memory</span> <span class="o">=</span> <span class="o">{</span> <span class="n">task</span><span class="o">.</span><span class="na">attempt</span> <span class="o">&gt;</span> <span class="mi">1</span> <span class="o">?</span> <span class="mi">32</span><span class="o">.</span><span class="na">GB</span> <span class="o">:</span> <span class="mi">16</span><span class="o">.</span><span class="na">GB</span> <span class="o">}</span>
	<span class="n">cpus</span> <span class="o">=</span> <span class="o">{</span> <span class="n">task</span><span class="o">.</span><span class="na">attempt</span> <span class="o">&gt;</span> <span class="mi">1</span> <span class="o">?</span> <span class="mi">16</span> <span class="o">:</span> <span class="mi">8</span> <span class="o">}</span>
<span class="o">}</span>

<span class="n">params</span> <span class="o">{</span>

   <span class="n">salmonIndex</span> <span class="o">=</span> <span class="err">'</span><span class="nl">s3:</span><span class="c1">//obenauflab/indices/salmon/gencode.v28.IMPACT'</span>

<span class="o">}</span>
</code></pre></div></div>
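<p>One caveat with the <code class="language-plaintext highlighter-rouge">task.attempt</code> switch: it only kicks in if failed tasks are actually retried. <code class="language-plaintext highlighter-rouge">salmon-nf</code> presumably handles this in its own configuration, but if you adapt this pattern for your own workflow, the <code class="language-plaintext highlighter-rouge">process</code> scope needs a retry error strategy along these lines (a sketch, not the author’s exact settings):</p>

```groovy
process {
    // Retry failed tasks so task.attempt > 1 can route them to the excess queue
    errorStrategy = { task.attempt < 3 ? 'retry' : 'terminate' }
    maxRetries = 2
}
```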

<p>Now let’s quickly fast-forward and see what happens if we submit our jobs like this.</p>

<p>You will notice that we have one runnable job for each task, yet no instances will fire up.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/Resources_Overflow.png" alt="Resource overflow" /></p>

<p>If we check one of the jobs, we will see that the environment requirements have been exactly set up as we specified in our Nextflow config which is also matched by the instance types of our job queue - so why does this not work?</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/Resources_OverflowJob.png" alt="Job overflow" /></p>

<h3 id="ecs-overhead-extraction">ECS overhead extraction</h3>

<p>The reason is that there are <strong>overhead container services</strong> running on your instance which consume a chunk of your total available memory. So when you ask for X GB of memory on an instance with X GB total memory, you have to be aware that Y GB is already occupied by service tasks, so your effectively available memory will be X-Y.</p>

<p>To get your jobs running on such instances, you therefore cannot request X GB of memory, but only the X-Y chunk. So how do we determine Y?</p>

<p>Let’s first fire up an instance of our compute environment by simply selecting our compute environment and clicking on <code class="language-plaintext highlighter-rouge">Edit</code>.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/Resources_edit.png" alt="Edit compute environment" /></p>

<p>Now we set both minimum and desired vCPUs to 1 to fire up one instance of the compute environment and hit <code class="language-plaintext highlighter-rouge">Save</code>.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/Resources_vCPUs.png" alt="Select 1 vCPU" /></p>

<p>Wait a couple of minutes to let the <code class="language-plaintext highlighter-rouge">EC2</code> instance fire up, then again click on your compute environment. Follow the link given in <code class="language-plaintext highlighter-rouge">ECS Cluster name</code>.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/Resources_ECS.png" alt="Follow ECS" /></p>

<p>This will bring you to the cluster overview page, where you need to click on <code class="language-plaintext highlighter-rouge">ECS instances</code>.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/Resources_Cluster.png" alt="ECS overview" /></p>

<p>Now finally we get what we want - the actual amount of memory available on a given instance on this <code class="language-plaintext highlighter-rouge">ECS</code> cluster.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/Resources_ActualMemory.png" alt="Factual available memory" /></p>

<p>According to the ECS tab, we have <strong>15,434 MB</strong> memory available on our <code class="language-plaintext highlighter-rouge">salmonWorkload</code> queue - repeat the same procedure to get the numbers for our <code class="language-plaintext highlighter-rouge">salmonExcess</code> queue.</p>
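<p>With these numbers you can back out the overhead Y and pick a safe memory request. A small sketch with the values above (the <code class="language-plaintext highlighter-rouge">c5.2xlarge</code> nominal 16 GiB is 16,384 MB):</p>

```shell
TOTAL_MB=16384      # nominal c5.2xlarge memory (16 GiB)
AVAILABLE_MB=15434  # registered memory shown in the ECS instances tab

# Overhead Y occupied by the ECS agent and other instance services
echo "overhead: $(( TOTAL_MB - AVAILABLE_MB )) MB"    # prints: overhead: 950 MB

# Request slightly below the registered amount, rounded down for safety
echo "request: $(( AVAILABLE_MB / 100 * 100 )) MB"    # prints: request: 15400 MB
```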

<h3 id="updated-resource-definition">Updated resource definition</h3>

<p>Having obtained the mysterious actual available memory X-Y on our <code class="language-plaintext highlighter-rouge">EC2</code> instances of our compute environment, we can finally enter the final numbers in our <code class="language-plaintext highlighter-rouge">awsbatch.config</code> definition of our <code class="language-plaintext highlighter-rouge">salmon-nf</code> Nextflow pipeline.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">aws</span><span class="o">.</span><span class="na">region</span> <span class="o">=</span> <span class="err">'</span><span class="n">eu</span><span class="o">-</span><span class="n">central</span><span class="o">-</span><span class="mi">1</span><span class="err">'</span>
<span class="n">aws</span><span class="o">.</span><span class="na">client</span><span class="o">.</span><span class="na">storageEncryption</span> <span class="o">=</span> <span class="err">'</span><span class="no">AES256</span><span class="err">'</span>
<span class="n">executor</span><span class="o">.</span><span class="na">name</span> <span class="o">=</span> <span class="err">'</span><span class="n">awsbatch</span><span class="err">'</span>
<span class="n">executor</span><span class="o">.</span><span class="na">awscli</span> <span class="o">=</span> <span class="err">'</span><span class="o">/</span><span class="n">home</span><span class="o">/</span><span class="n">ec2</span><span class="o">-</span><span class="n">user</span><span class="o">/</span><span class="n">miniconda</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">aws</span><span class="err">'</span>

<span class="n">process</span> <span class="o">{</span>

	<span class="n">queue</span> <span class="o">=</span> <span class="o">{</span> <span class="n">task</span><span class="o">.</span><span class="na">attempt</span> <span class="o">&gt;</span> <span class="mi">1</span> <span class="o">?</span> <span class="err">'</span><span class="n">salmonExcess</span><span class="err">'</span> <span class="o">:</span> <span class="err">'</span><span class="n">salmonWorkload</span><span class="err">'</span> <span class="o">}</span>
	<span class="n">memory</span> <span class="o">=</span> <span class="o">{</span> <span class="n">task</span><span class="o">.</span><span class="na">attempt</span> <span class="o">&gt;</span> <span class="mi">1</span> <span class="o">?</span> <span class="mi">31100</span><span class="o">.</span><span class="na">MB</span> <span class="o">:</span> <span class="mi">15400</span><span class="o">.</span><span class="na">MB</span> <span class="o">}</span>
	<span class="n">cpus</span> <span class="o">=</span> <span class="o">{</span> <span class="n">task</span><span class="o">.</span><span class="na">attempt</span> <span class="o">&gt;</span> <span class="mi">1</span> <span class="o">?</span> <span class="mi">16</span> <span class="o">:</span> <span class="mi">8</span> <span class="o">}</span>
<span class="o">}</span>

<span class="n">params</span> <span class="o">{</span>

   <span class="n">salmonIndex</span> <span class="o">=</span> <span class="err">'</span><span class="nl">s3:</span><span class="c1">//obenauflab/indices/salmon/gencode.v28.IMPACT'</span>

<span class="o">}</span>
</code></pre></div></div>

<p>Finally, we are ready to test-drive our <code class="language-plaintext highlighter-rouge">salmon-nf</code> Nextflow pipeline on our AWS job queue!</p>

<h2 id="step-6-running-jobs-with-aws-batch">Step 6: Running jobs with AWS Batch</h2>

<p>Alright, now things are getting serious - just a little more preparation is needed to finally run our <code class="language-plaintext highlighter-rouge">salmon-nf</code> Nextflow pipeline on <code class="language-plaintext highlighter-rouge">AWS</code>:</p>

<ul>
  <li>Upload our index file to <code class="language-plaintext highlighter-rouge">s3</code></li>
  <li>Upload our input <code class="language-plaintext highlighter-rouge">fastq</code> files to <code class="language-plaintext highlighter-rouge">s3</code></li>
  <li>Launch a submission <code class="language-plaintext highlighter-rouge">EC2</code> instance for running our <code class="language-plaintext highlighter-rouge">salmon-nf</code> Nextflow pipeline</li>
  <li>Enter credentials</li>
  <li>Go!</li>
</ul>

<h3 id="upload-files-to-s3">Upload files to <code class="language-plaintext highlighter-rouge">s3</code></h3>

<p>To upload files to <code class="language-plaintext highlighter-rouge">s3</code>, I recommend you to use the <a href="https://aws.amazon.com/cli/">AWS CLI</a>.</p>

<p>For installation, just follow the instructions. Afterwards, it is important to expose the <code class="language-plaintext highlighter-rouge">AWS credentials</code> you obtained when creating your <code class="language-plaintext highlighter-rouge">IAM</code> user to Nextflow, which can be done in <a href="https://www.nextflow.io/docs/latest/awscloud.html#aws-credentials">2 ways</a>:</p>

<ol>
  <li>Exporting the default <code class="language-plaintext highlighter-rouge">AWS</code> environment variables</li>
</ol>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">AWS_DEFAULT_REGION</span><span class="o">=</span>&lt;REGION IDENTIFIER&gt;
<span class="nb">export </span><span class="nv">AWS_ACCESS_KEY_ID</span><span class="o">=</span>&lt;YOUR S3 ACCESS KEY&gt;
<span class="nb">export </span><span class="nv">AWS_SECRET_ACCESS_KEY</span><span class="o">=</span>&lt;YOUR S3 SECRET KEY&gt;
</code></pre></div></div>

<ol start="2">
  <li>Specify your credentials in the Nextflow configuration file</li>
</ol>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">aws</span> <span class="o">{</span>
    <span class="n">region</span> <span class="o">=</span> <span class="err">'</span><span class="o">&lt;</span><span class="no">REGION</span> <span class="no">IDENTIFIER</span><span class="o">&gt;</span><span class="err">'</span>
    <span class="n">accessKey</span> <span class="o">=</span> <span class="err">'</span><span class="o">&lt;</span><span class="no">YOUR</span> <span class="no">S3</span> <span class="no">ACCESS</span> <span class="no">KEY</span><span class="o">&gt;</span><span class="err">'</span>
    <span class="n">secretKey</span> <span class="o">=</span> <span class="err">'</span><span class="o">&lt;</span><span class="no">YOUR</span> <span class="no">S3</span> <span class="no">SECRET</span> <span class="no">KEY</span><span class="o">&gt;</span><span class="err">'</span>
<span class="o">}</span>
</code></pre></div></div>

<p>I personally prefer option 1 to not accidentally commit and push any of my credentials to my Nextflow Github repo.</p>
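<p>One way to stick with option 1 without retyping the exports each session is to keep them in a small file outside any git repository and source it. A sketch - the file name is arbitrary and the values below are obviously placeholders, not real credentials:</p>

```shell
# Store the exports in a file in your home directory, outside the repo
cat > "$HOME/.aws_batch_env" <<'EOF'
export AWS_DEFAULT_REGION=eu-central-1
export AWS_ACCESS_KEY_ID=AKIAEXAMPLE
export AWS_SECRET_ACCESS_KEY=examplesecretkey
EOF
chmod 600 "$HOME/.aws_batch_env"   # readable only by you

# Source it before launching Nextflow
. "$HOME/.aws_batch_env"
echo "$AWS_DEFAULT_REGION"   # prints eu-central-1
```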

<p>Now we can upload our fastq files to our target destination in our <code class="language-plaintext highlighter-rouge">s3</code> bucket, assuming you are in the directory where your <code class="language-plaintext highlighter-rouge">fastq</code> files are stored:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aws s3 <span class="nb">cp</span> <span class="nb">.</span> s3://obenauflab/fastq <span class="nt">--recursive</span> <span class="nt">--exclude</span> <span class="s2">"*"</span> <span class="nt">--include</span> <span class="s2">"*.fq.gz"</span>
</code></pre></div></div>

<p>Repeat the same with your index files to your <code class="language-plaintext highlighter-rouge">s3</code> bucket destination, and now all files we need for running <code class="language-plaintext highlighter-rouge">salmon-nf</code> are ready. You can view them via numerous clients - I used <a href="https://cyberduck.io/">Cyberduck</a> for Mac. Below you will see that my 40 test samples and index files have been uploaded to the appropriate locations in my <code class="language-plaintext highlighter-rouge">s3</code> bucket.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/S3_fastqs.png" alt="S3 fastq file location" /></p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/S3_index.png" alt="S3 index file location" /></p>

<h3 id="launch-and-prepare-your-submission-instance">Launch and prepare your submission instance</h3>

<p>Finally, we need some machine to run our Nextflow master process that submits jobs to the <code class="language-plaintext highlighter-rouge">AWS Batch</code> queues. You can of course do this locally on your machine or as a long-running job in your HPC environment.</p>

<p>But for heavy, long-running workloads it definitely makes sense to have a dedicated instance to run the Nextflow process on, to avoid running into trouble.</p>

<p>Fortunately, we only need a very minimal <code class="language-plaintext highlighter-rouge">EC2</code> instance for this, which is available from <code class="language-plaintext highlighter-rouge">AWS</code> under the so-called <code class="language-plaintext highlighter-rouge">Free Tier</code> - meaning it’s free, yay!</p>

<p>So this is what we will do - first go to your <code class="language-plaintext highlighter-rouge">EC2</code> dashboard and select <code class="language-plaintext highlighter-rouge">Launch Instance</code>.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/Launch_EC2Dashboard.png" alt="EC2 Dashboard" /></p>

<p>Next up, we have to select the <code class="language-plaintext highlighter-rouge">AMI</code> we want to run on our instance. I have already precreated a <code class="language-plaintext highlighter-rouge">Nextflow AMI</code>, which is simply an <code class="language-plaintext highlighter-rouge">AMI</code> created as in Step 2, where I additionally installed <a href="http://www.oracle.com/technetwork/java/javase/downloads/index.html">Java 8</a> and <a href="https://www.nextflow.io/docs/latest/getstarted.html#installation">Nextflow</a>.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/Launch_NextflowAMI.png" alt="Nextflow AMI" /></p>

<p>For the instance type, make sure to select something labeled as <code class="language-plaintext highlighter-rouge">Free Tier eligible</code> so that this instance does not incur any costs, e.g. <code class="language-plaintext highlighter-rouge">t2.micro</code> in the example below. Then just hit <code class="language-plaintext highlighter-rouge">Review and Launch</code> and then <code class="language-plaintext highlighter-rouge">Launch</code> the instance.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/Launch_EC2Instance.png" alt="Nextflow EC2 instance" /></p>

<p>Make sure to launch it with a keypair that you have also downloaded, otherwise you will be unable to connect to the instance.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/Launch_KeyPair.png" alt="Nextflow keypair" /></p>

<p>Also, give your master instance a name, since many more instances will be launched once we fire up our <code class="language-plaintext highlighter-rouge">salmon-nf</code> Nextflow pipeline on our <code class="language-plaintext highlighter-rouge">AWS Batch</code> compute environment.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/Launch_Name.png" alt="Nextflow EC2 naming" /></p>

<p>Finally, connect to the instance as shown in Step 2. Now we can pull our <code class="language-plaintext highlighter-rouge">salmon-nf</code> Nextflow pipeline.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>ec2-user@ip-172-31-38-222 ~]<span class="nv">$ </span>nextflow pull t-neumann/salmon-nf
Checking t-neumann/salmon-nf ...
 downloaded from https://github.com/t-neumann/salmon-nf.git - revision: 6ac6e6a15a <span class="o">[</span>master]
<span class="o">[</span>ec2-user@ip-172-31-38-222 ~]<span class="err">$</span>
</code></pre></div></div>

<p>Next up, once again don’t forget to export your <code class="language-plaintext highlighter-rouge">AWS</code> credentials.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>ec2-user@ip-172-31-38-222 ~]<span class="nv">$ </span><span class="nb">export </span><span class="nv">AWS_DEFAULT_REGION</span><span class="o">=</span>&lt;REGION IDENTIFIER&gt;
<span class="o">[</span>ec2-user@ip-172-31-38-222 ~]<span class="nv">$ </span><span class="nb">export </span><span class="nv">AWS_ACCESS_KEY_ID</span><span class="o">=</span>&lt;YOUR S3 ACCESS KEY&gt;
<span class="o">[</span>ec2-user@ip-172-31-38-222 ~]<span class="nv">$ </span><span class="nb">export </span><span class="nv">AWS_SECRET_ACCESS_KEY</span><span class="o">=</span>&lt;YOUR S3 SECRET KEY&gt;
</code></pre></div></div>
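
<p>Alternatively - and a little less error-prone than exporting them in every new shell session - you can let Nextflow read the credentials from its own configuration via the documented <code class="language-plaintext highlighter-rouge">aws</code> config scope. A minimal sketch (values are placeholders):</p>

<div class="language-groovy highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// ~/.nextflow/config
aws {
    accessKey = '&lt;YOUR S3 ACCESS KEY&gt;'
    secretKey = '&lt;YOUR S3 SECRET KEY&gt;'
    region    = '&lt;REGION IDENTIFIER&gt;'
}
</code></pre></div></div>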

<p>Now there is only <strong>one last crucial</strong> step before we can actually launch our jobs on the <code class="language-plaintext highlighter-rouge">AWS Batch</code> queue: we have to create <a href="https://docs.aws.amazon.com/batch/latest/userguide/job_definitions.html">job definitions</a>. Luckily for us, Nextflow will <a href="https://www.nextflow.io/docs/latest/awscloud.html#custom-job-definition">automatically create job definitions</a> upon the first launch of a pipeline.</p>

<p>However, I found that job definitions are only created properly if the initial run contains very few samples. So <strong>always do your initial run on a SINGLE SAMPLE!</strong> If you don’t, your Nextflow submission will get stuck at the following step:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>ec2-user@ip-172-31-38-222 ~]<span class="nv">$ </span>nextflow run t-neumann/salmon-nf <span class="nt">--inputDir</span> s3://obenauflab/fastq <span class="nt">--outputDir</span> s3://obenauflab/salmon <span class="nt">-profile</span> awsbatch <span class="nt">-w</span> s3://obenauflab/work/salmon
N E X T F L O W  ~  version 18.10.1
Launching <span class="sb">`</span>t-neumann/salmon-nf<span class="sb">`</span> <span class="o">[</span>silly_mccarthy] - revision: 6ac6e6a15a <span class="o">[</span>master]

 parameters
 <span class="o">======================</span>
 input directory          : s3://obenauflab/fastq
 output directory         : s3://obenauflab/salmon
 <span class="o">======================</span>

<span class="o">[</span>warm up] executor <span class="o">&gt;</span> awsbatch
</code></pre></div></div>

<p>From there on, you will wait forever wondering what is going on, as happened to me.</p>
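
<p>For reference, the <code class="language-plaintext highlighter-rouge">awsbatch</code> profile that ties a pipeline to <code class="language-plaintext highlighter-rouge">AWS Batch</code> is just a few lines of Nextflow configuration. A rough sketch of such a profile (the queue name is a placeholder - the actual profile ships with the <code class="language-plaintext highlighter-rouge">salmon-nf</code> repository):</p>

<div class="language-groovy highlighter-rouge"><div class="highlight"><pre class="highlight"><code>profiles {
    awsbatch {
        process.executor = 'awsbatch'
        process.queue    = '&lt;YOUR BATCH QUEUE&gt;'
        aws.region       = '&lt;REGION IDENTIFIER&gt;'
    }
}
</code></pre></div></div>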

<h3 id="start-your-nextflow-run-on-aws-batch">Start your Nextflow run on AWS Batch</h3>

<p>Now the last and most rewarding step of all - you are finally ready to launch the <code class="language-plaintext highlighter-rouge">salmon-nf</code> Nextflow pipeline on <code class="language-plaintext highlighter-rouge">AWS</code>!</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>ec2-user@ip-172-31-38-222 ~]<span class="nv">$ </span>nextflow run t-neumann/salmon-nf <span class="nt">--inputDir</span> s3://obenauflab/fastq <span class="nt">--outputDir</span> s3://obenauflab/salmon <span class="nt">-profile</span> awsbatch <span class="nt">-w</span> s3://obenauflab/work/salmon
</code></pre></div></div>

<p>Notice how both the <code class="language-plaintext highlighter-rouge">inputDir</code> and <code class="language-plaintext highlighter-rouge">outputDir</code> point to an <code class="language-plaintext highlighter-rouge">s3</code> directory, and how we also have to supply a <code class="language-plaintext highlighter-rouge">work directory</code> on <code class="language-plaintext highlighter-rouge">s3</code> with <code class="language-plaintext highlighter-rouge">-w</code>. Now hit <code class="language-plaintext highlighter-rouge">Enter</code> and watch the beauty unfold on <code class="language-plaintext highlighter-rouge">AWS</code>.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>ec2-user@ip-172-31-38-222 ~]<span class="nv">$ </span>nextflow run t-neumann/salmon-nf <span class="nt">--inputDir</span> s3://obenauflab/fastq <span class="nt">--outputDir</span> s3://obenauflab/salmon <span class="nt">-profile</span> awsbatch <span class="nt">-w</span> s3://obenauflab/work/salmon
N E X T F L O W  ~  version 18.10.1
Launching <span class="sb">`</span>t-neumann/salmon-nf<span class="sb">`</span> <span class="o">[</span>silly_mccarthy] - revision: 6ac6e6a15a <span class="o">[</span>master]

 parameters
 <span class="o">======================</span>
 input directory          : s3://obenauflab/fastq
 output directory         : s3://obenauflab/salmon
 <span class="o">======================</span>

<span class="o">[</span>warm up] executor <span class="o">&gt;</span> awsbatch
<span class="o">[</span>4a/72c0f7] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>d1ada222-b67f-47c0-b380-091eaab093b4_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>f2/f8d97a] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>e46e4f3a-62f8-4bd1-a143-f384e219d6af_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>90/35eb4d] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>1672de07-77db-4817-9c7f-f201c25e8132_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>81/c47fe3] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>741fbacf-3694-46ef-b16f-66bac6ee0452_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>f1/bc3afc] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>db18dd75-3b48-4c21-aa68-58b1cf37c8c2_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>a8/88095d] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>0ac6634e-00b0-4107-a5d6-db8ffc602645_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>a6/36e366] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>9fa785f2-1dcb-4966-a5fa-fe75d327cb81_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>7d/5ae2b0] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>5b3c329a-aa14-4965-8d13-f508f4390eaf_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>d9/3ec3fc] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>6cf08e2b-7e59-4537-b1c3-1c5b3838ab95_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>19/d7d441] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>9c714c63-ee50-4385-9e25-09f940f5f902_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>71/ff40cf] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>17686cd5-271a-4e24-9746-f93334fb86b5_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>66/aaa185] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>0399ad16-816f-4824-ae28-7b82e006e7b7_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>67/ccd647] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>1916abcd-61c0-4f23-96ac-be70aacb8dc1_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>7d/0a090b] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>e1a4167d-b4ca-405c-8550-cc32bb1b1d09_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>3b/a9972e] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>876a9725-34c1-4a23-a3fe-58a860d0f0c5_gdc_realn_rehead<span class="o">)</span>
</code></pre></div></div>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/AWSBatch_Dashboard.png" alt="AWS Batch dashboard" /></p>

<p>Note how <code class="language-plaintext highlighter-rouge">AWS Batch</code> automatically scales up the desired number of vCPUs of your compute environment once the jobs are submitted.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/AWSBatch_MultiInstances.png" alt="AWS Batch EC2 instances" /></p>

<p>Watch in awe how <code class="language-plaintext highlighter-rouge">AWS Batch</code> fires up multiple <code class="language-plaintext highlighter-rouge">EC2</code> instances automatically in your <code class="language-plaintext highlighter-rouge">EC2</code> dashboard.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/AWSBatch_JobTransition.png" alt="AWS Batch Job transition" /></p>

<p>Watch how jobs transition from <code class="language-plaintext highlighter-rouge">Runnable</code> to <code class="language-plaintext highlighter-rouge">Starting</code> to <code class="language-plaintext highlighter-rouge">Running</code> to <code class="language-plaintext highlighter-rouge">Succeeded</code> state until all your samples have been processed.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>47/c580b5] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>2864cbe8-4d77-4477-ac84-791004e42237_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>8c/84bc14] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>0fdb3d0e-e405-4e8d-8897-4a90ea4fe00c_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>1d/3f6ec6] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>7ed99d57-f199-4dac-87a8-62393f5e0aea_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>a9/330e5d] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>825daddc-a89a-483b-947e-74cc12ba013c_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>98/33bed5] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>c3588f96-95c6-4008-bda2-502ceb963adb_gdc_realn_rehead<span class="o">)</span>

t-neumann/salmon-nf has finished.
Status:   SUCCESS
Time:     Sun Aug 25 11:20:13 UTC 2019
Duration: 10m 22s

<span class="o">[</span>ec2-user@ip-172-31-38-222 ~]<span class="err">$</span>
</code></pre></div></div>

<p>Now let’s check whether the results were produced in the correct <code class="language-plaintext highlighter-rouge">s3</code> output directory.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-pipeline/AWSBatch_Success.png" alt="AWS Batch Job success" /></p>
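
<p>You can run the same check from the command line of the submission instance (assuming the AWS CLI is installed there; bucket and prefix as used above):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># list everything the pipeline published to the output directory
aws s3 ls s3://obenauflab/salmon/ --recursive
</code></pre></div></div>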

<p>Congratulations! You did it! It took a long time, the setup was quite tedious, and numerous steps were frustrating for me - but with amazing help from the community and Boehringer-Ingelheim, plus quite some trial-and-error, I got it to work, and hopefully so did you, with much less hassle!</p>

<p>Happy pipeline building and number crunching with <code class="language-plaintext highlighter-rouge">AWS</code> and Nextflow!</p>]]></content><author><name>Tobias Neumann</name></author><category term="Pipelines" /><category term="AMI" /><category term="AWS" /><category term="Containers" /><category term="Docker" /><category term="Nextflow" /><summary type="html"><![CDATA[Setting up and running a pipeline on AWS]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://t-neumann.github.io/assets/images/categories/aws.svg" /><media:content medium="image" url="https://t-neumann.github.io/assets/images/categories/aws.svg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Slamdunk paper</title><link href="https://t-neumann.github.io/pipelines/Slamdunk/" rel="alternate" type="text/html" title="Slamdunk paper" /><published>2019-06-28T13:42:00+02:00</published><updated>2019-06-28T13:42:00+02:00</updated><id>https://t-neumann.github.io/pipelines/Slamdunk</id><content type="html" xml:base="https://t-neumann.github.io/pipelines/Slamdunk/"><![CDATA[<p>For the past couple of years I was involved in the development of <a href="http://doi.org/10.1038/nmeth.4435">SLAMseq</a>, a sequencing technology for time-resolved measurement of newly synthesized and existing RNA in cultured cells. Originally developed by the lab of <a href="https://www.imba.oeaw.ac.at/research/stefan-ameres/">Stefan Ameres</a>, the lab of my boss <a href="https://www.imp.ac.at/groups/johannes-zuber/">Johannes Zuber</a> extended the approach with pharmacological and chemical-genetic perturbations in order to identify direct transcriptional targets of any gene or pathway (<a href="http://doi.org/10.1126/science.aao2793">Muhar et al, Science 2018</a>).</p>

<p>Processing and interpreting this data required novel analysis methods, so I was given the opportunity to team up with a good friend of mine - <a href="https://github.com/philres">Philipp Rescheneder</a> - to develop <a href="https://t-neumann.github.io/slamdunk/">Slamdunk</a> which we recently published in <a href="http://doi.org/10.1186/s12859-019-2849-7">BMC Bioinformatics</a> and is generally applicable to any nucleotide-conversion containing dataset.</p>

<p>This post will quickly highlight the main functionality, findings and features.</p>

<h2 id="slamdunk-workflow">Slamdunk workflow</h2>

<p><img src="https://t-neumann.github.io/assets/images/posts/Slamdunk/slamdunk_outline.png" alt="Slamdunk outline" /></p>

<p>Slamdunk differs from naive read processing in 4 ways:</p>

<ul>
  <li>It maps with a nucleotide-conversion aware scoring scheme since in the example of SLAMseq data, T&gt;C mismatches are expected and identify reads from labelled transcripts</li>
  <li>Since QuantSeq processes smaller, more repetitive regions of transcripts - namely the 3’ ends - Slamdunk cannot simply discard all multimappers, but utilizes a strategy to recover them</li>
  <li>Genuine T&gt;C SNPs would contribute greatly to false-positive conversion-quantifications and have to be excluded during the quantification step</li>
  <li>Depending on coverage and T-content in the 3’ end, observing T&gt;C reads will have a different likelihood which has to be corrected for during conversion quantification</li>
</ul>
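
<p>To make this concrete, a typical invocation covering all of these steps looks roughly like the following (flag names taken from the Slamdunk documentation; file paths are placeholders):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># map, filter, call SNPs and quantify conversions in one go
slamdunk all -r genome.fa -b 3utrs.bed -o results/ sample1.fastq.gz
</code></pre></div></div>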

<h2 id="features">Features</h2>

<h3 id="conversion-aware-mapping">Conversion-aware mapping</h3>

<p><img src="https://t-neumann.github.io/assets/images/posts/Slamdunk/slamdunk_mapping.png" alt="Slamdunk mapping" /></p>

<p>Slamdunk utilizes a conversion-aware scoring scheme implemented in the mapper <a href="http://cibiv.github.io/NextGenMap/">NextGenMap</a>.
Using this scoring scheme, we could demonstrate the following:</p>

<ul>
  <li>We can map reads independent of the inherent conversion-rates in the respective datasets (see top Figure a)</li>
  <li>With commonly observed conversion rates (0-7%), we consistently map &gt;90% of reads at 100-150bp read length and &gt;80% of reads at the shorter 50bp read length.</li>
</ul>

<h3 id="multimapper-recovery">Multimapper recovery</h3>

<p>We devised a multimapper recovery strategy to deal with repetitive 3’ UTR regions of transcripts. To this end, multimapping reads that still map uniquely to annotated 3’ UTRs are recovered and only reads with alignments to several annotated 3’ UTRs are discarded.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Slamdunk/multimappers.png" alt="Multimapper recovery strategy" /></p>

<p>Using this strategy, we are able to recover valuable signal in genes with 3’ UTRs with low mappability and increase overall correlation of QuantSeq datasets to corresponding RNA-seq datasets.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Slamdunk/rnaseqcorrelation.png" alt="RNA-seq correlation" /></p>

<h3 id="conversion-quantification">Conversion quantification</h3>

<p>Plain quantification of the number of T&gt;C conversion-containing reads in a given interval is biased towards intervals with higher T-content and higher coverage, since the probability of observing a T&gt;C conversion in these intervals is increased. To address this issue, we devised a T-content and coverage-aware nucleotide-conversion quantification within intervals that is clearly superior in terms of error rates (see bottom Figure left). Overall, the variance of the relative error decreases with higher coverage, and while the method slightly underestimates the true conversion rate with short reads (50bp), it accurately estimates conversion rates for reads of 100bp and longer (bottom Figure right).</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Slamdunk/tcontentquantification.png" alt="T-content coverage aware quantification" /></p>

<h3 id="multiqc-report">MultiQC report</h3>

<p>Visualization of results and quality control is an important aspect of every analysis. To this end, with lots of help from <a href="https://phil.ewels.co.uk">Phil Ewels</a>, we developed a plugin for <a href="https://multiqc.info/">MultiQC</a> to facilitate quality control of SLAMseq datasets. Using this plugin, we can visualize conversion rates within samples (bottom Figure a), display the principal components of samples based on T&gt;C containing reads (bottom Figure b), plot non-T&gt;C mismatches across read positions to identify problematic read positions (bottom Figure c), or plot T&gt;C conversions at 3’ ends (bottom Figure d) to check for base-composition biases.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Slamdunk/multiqc.png" alt="MultiQC module" /></p>

<h2 id="documentation">Documentation</h2>

<p>A thorough documentation is available from the main website:</p>

<ul>
  <li><a href="https://t-neumann.github.io/slamdunk/">https://t-neumann.github.io/slamdunk/</a></li>
</ul>

<h2 id="availability">Availability</h2>

<p>Slamdunk is available from several platforms:</p>

<ul>
  <li><a href="https://bioconda.github.io/recipes/slamdunk/README.html">BioConda</a></li>
  <li><a href="https://galaxyproject.eu/posts/2019/08/17/Slamdunk/">Galaxy</a></li>
  <li><a href="https://hub.docker.com/r/tobneu/slamdunk">Docker <i class="fab fa-docker" aria-hidden="true"></i></a></li>
  <li><a href="https://pypi.org/project/slamdunk/">PyPI <i class="fab fa-python" aria-hidden="true"></i></a></li>
  <li><a href="https://github.com/t-neumann/slamdunk">GitHub <i class="fab fa-github" aria-hidden="true"></i></a></li>
</ul>
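
<p>Installation from most of these channels is a one-liner, for example:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># via BioConda
conda install -c bioconda slamdunk

# via PyPI
pip install slamdunk

# via Docker
docker pull tobneu/slamdunk
</code></pre></div></div>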

<embed src="https://bmcbioinformatics.biomedcentral.com/track/pdf/10.1186/s12859-019-2849-7" width="100%" height="700" type="application/pdf" />]]></content><author><name>Tobias Neumann</name></author><category term="Pipelines" /><category term="SLAMseq" /><category term="Slamdunk" /><category term="Containers" /><category term="Bioconda" /><category term="PyPI" /><category term="Docker" /><summary type="html"><![CDATA[SLAMseq analysis using Slamdunk for nucleotide-conversion sequencing datasets]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://t-neumann.github.io/assets/images/categories/logo_slamdunk_rgb_72.jpg" /><media:content medium="image" url="https://t-neumann.github.io/assets/images/categories/logo_slamdunk_rgb_72.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Pipelines with Nextflow</title><link href="https://t-neumann.github.io/pipelines/Nextflow-pipeline/" rel="alternate" type="text/html" title="Pipelines with Nextflow" /><published>2019-03-03T21:51:00+01:00</published><updated>2019-03-03T21:51:00+01:00</updated><id>https://t-neumann.github.io/pipelines/Nextflow-pipeline</id><content type="html" xml:base="https://t-neumann.github.io/pipelines/Nextflow-pipeline/"><![CDATA[<p>Nowadays, workflow management systems have become an integral part of large-scale analysis of biological datasets with multiple software packages and multi-platform language support. These systems enable the rapid prototyping and deployment of pipelines that combine complementary software packages.
Several such systems are already available, such as <a href="https://snakemake.readthedocs.io/en/stable/">Snakemake</a> and <a href="https://www.commonwl.org/">CWL</a>.</p>

<p>This post will give you an overview of my favourite workflow building system - <a href="https://www.nextflow.io/">Nextflow</a> - and look at one toy workflow implementation example that will also be used in later posts.</p>

<h2 id="nextflow">Nextflow</h2>

<p>Here, I will more or less shamelessly copy large parts of the description from Nextflow’s <a href="https://www.nextflow.io/">website</a>, since it summarises the main features quite neatly.</p>

<p>Up front, the most severe disadvantage for me: Nextflow is written in <a href="https://groovy-lang.org/">Groovy</a>, which is kind of a pain for me, since I am mostly Python, R, C/C++ and Java based and had never needed to touch any Groovy.</p>

<p>However, with some fiddling around and especially a lot of low-latency community support via the <a href="https://gitter.im/nextflow-io/nextflow">Nextflow Gitter channel</a>, these are hurdles that can be overcome.</p>

<p>Once you have lost your fear of Groovy, the advantages of Nextflow are quite convincing.</p>

<p>If you want to read more about Nextflow, <a href="https://www.nextflow.io/docs/latest/index.html">here is the documentation</a> and <a href="https://www.nature.com/articles/nbt.3820">here is the original paper</a>.</p>

<h4 id="fast-prototyping">Fast prototyping</h4>

<p>Nextflow allows you to write a computational pipeline by making it simpler to put together many different tasks.</p>

<p>You may reuse your existing scripts and tools and you don’t need to learn a new language or API to start using it.</p>

<p>As an example, look at how easy it is to run code from different languages within Nextflow processes out of the box.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">process</span> <span class="n">perlStuff</span> <span class="o">{</span>

    <span class="s">"""
    #!/usr/bin/perl

    print 'Hi there!' . '\n';
    """</span>

<span class="o">}</span>

<span class="n">process</span> <span class="n">pyStuff</span> <span class="o">{</span>

    <span class="s">"""
    #!/usr/bin/python

    x = 'Hello'
    y = 'world!'
    print "</span><span class="o">%</span><span class="n">s</span> <span class="o">-</span> <span class="o">%</span><span class="n">s</span><span class="s">" % (x,y)
    """</span>

<span class="o">}</span>
</code></pre></div></div>

<h4 id="portable">Portable</h4>

<p>Nextflow provides an abstraction layer between your pipeline’s logic and the execution layer, so that it can be executed on multiple platforms without it changing.</p>

<p>It provides out of the box executors for SGE, LSF, SLURM, PBS and HTCondor batch schedulers and for Kubernetes, Amazon AWS and Google Cloud platforms.</p>

<p>Again, check out the so-called profile configurations one can quite easily set up to enable support for yet another scheduler.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">profiles</span> <span class="o">{</span>

    <span class="n">standard</span> <span class="o">{</span>
        <span class="n">process</span><span class="o">.</span><span class="na">executor</span> <span class="o">=</span> <span class="err">'</span><span class="n">local</span><span class="err">'</span>
    <span class="o">}</span>

    <span class="n">cluster_sge</span> <span class="o">{</span>
        <span class="n">process</span><span class="o">.</span><span class="na">executor</span> <span class="o">=</span> <span class="err">'</span><span class="n">sge</span><span class="err">'</span>
        <span class="n">process</span><span class="o">.</span><span class="na">penv</span> <span class="o">=</span> <span class="err">'</span><span class="n">smp</span><span class="err">'</span>
        <span class="n">process</span><span class="o">.</span><span class="na">cpus</span> <span class="o">=</span> <span class="mi">20</span>
        <span class="n">process</span><span class="o">.</span><span class="na">queue</span> <span class="o">=</span> <span class="err">'</span><span class="kd">public</span><span class="o">.</span><span class="na">q</span><span class="err">'</span>
        <span class="n">process</span><span class="o">.</span><span class="na">memory</span> <span class="o">=</span> <span class="err">'</span><span class="mi">10</span><span class="no">GB</span><span class="err">'</span>
    <span class="o">}</span>

    <span class="n">cluster_slurm</span> <span class="o">{</span>
        <span class="n">process</span><span class="o">.</span><span class="na">executor</span> <span class="o">=</span> <span class="err">'</span><span class="n">slurm</span><span class="err">'</span>
        <span class="n">process</span><span class="o">.</span><span class="na">cpus</span> <span class="o">=</span> <span class="mi">20</span>
        <span class="n">process</span><span class="o">.</span><span class="na">queue</span> <span class="o">=</span> <span class="err">'</span><span class="n">work</span><span class="err">'</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p>With these few lines of code, you can now seamlessly execute your pipeline on your local machine, on PBS and SLURM, even with customized resource settings.</p>
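
<p>Selecting one of these profiles at runtime is then a single flag:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># run the same pipeline on the SLURM cluster instead of locally
nextflow run main.nf -profile cluster_slurm
</code></pre></div></div>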

<h4 id="reproducibility">Reproducibility</h4>

<p>Nextflow supports <a href="https://www.docker.com/">Docker</a> and <a href="https://singularity.lbl.gov/">Singularity</a> containers technology.</p>

<p>This, along with the integration of the GitHub code sharing platform, allows you to write self-contained pipelines, manage versions and to rapidly reproduce any former configuration.</p>

<p>This is an especially nice feature, since it also allows to run Nextflow workflows on cloud based platforms such as <a href="https://aws.amazon.com/">Amazon Web Services</a> which strictly require all software environments supplied in a public <a href="https://www.nextflow.io/docs/latest/awscloud.html#awscloud-batch-config">Docker registry</a> reachable by ECS batch.</p>
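
<p>Enabling a container for a pipeline is again just configuration. A minimal sketch using the documented config options (the image name here is the public Salmon container we will use later):</p>

<div class="language-groovy highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// nextflow.config
process.container = 'combinelab/salmon'
docker.enabled    = true
// or, on an HPC system:
// singularity.enabled = true
</code></pre></div></div>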

<h4 id="unified-parallelism">Unified parallelism</h4>

<p>Nextflow is based on the dataflow programming model which greatly simplifies writing complex distributed pipelines.</p>

<p>Parallelisation is implicitly defined by the processes’ input and output declarations. The resulting applications are inherently parallel and can scale up or scale out, transparently, without having to adapt to a specific platform architecture.</p>

<h4 id="continuous-checkpoints">Continuous checkpoints</h4>

<p>All the intermediate results produced during the pipeline execution are automatically tracked.</p>

<p>This allows you to resume its execution, from the last successfully executed step, no matter what the reason was for it stopping.</p>
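
<p>In practice this means that after fixing whatever went wrong, you simply relaunch with the <code class="language-plaintext highlighter-rouge">-resume</code> flag and all cached, already-completed steps are skipped:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nextflow run main.nf -resume
</code></pre></div></div>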

<h4 id="stream-oriented">Stream oriented</h4>

<p>Nextflow extends the Unix pipes model with a fluent DSL, allowing you to handle complex stream interactions easily.</p>

<p>It promotes a programming approach, based on functional composition, that results in resilient and easily reproducible pipelines.</p>

<h2 id="salmon">Salmon</h2>

<p>Our first small toy Nextflow workflow will be based upon <a href="https://combine-lab.github.io/salmon/">Salmon</a>.</p>

<p>Salmon is a tool for quantifying the expression of transcripts using RNA-seq data. Salmon uses the concept of quasi-mapping coupled with a two-phase inference procedure to provide accurate expression estimates very quickly (i.e. wicked-fast) and while using little memory. Salmon performs its inference using an expressive and realistic model of RNA-seq data that takes into account experimental attributes and biases commonly observed in real RNA-seq data.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Salmon-pipeline/salmon.png" alt="Salmon overview" /></p>

<p>Essentially, Salmon will create a transcript index which it then uses to quantify expression estimates for each of the transcripts from raw fastq reads.</p>

<p>Our goal:</p>

<ul>
  <li>Obtain those transcript expression estimates for our samples</li>
  <li>Obtain reads mapping to these transcripts via the <code class="language-plaintext highlighter-rouge">--writeMappings</code> flag as pseudo-bam</li>
</ul>

<p>If you want to read more on Salmon, <a href="https://www.nature.com/articles/nmeth.4197">here is the paper</a>.</p>
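
<p>Translated into plain Salmon commands, the two goals above boil down to an indexing step and a quantification step, roughly like this (paths are placeholders; see the Salmon documentation for the full set of options):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># build the transcriptome index once
salmon index -t transcripts.fa -i transcripts_index

# quantify one sample, writing the quasi-mappings in SAM format
salmon quant -i transcripts_index -l A -r sample1.fastq.gz \
    --writeMappings=pseudo.sam -o sample1_out
</code></pre></div></div>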

<h2 id="salmon-nf">salmon-nf</h2>

<p>The Nextflow pipeline we will create during this exercise will be called <code class="language-plaintext highlighter-rouge">salmon-nf</code>, and it can be found as a fully functional repository on my <a href="https://github.com/t-neumann/salmon-nf">GitHub page</a>.</p>

<p>Any standalone Nextflow pipeline will need 2 files to be executable out of the box and also directly <a href="https://www.nextflow.io/docs/latest/sharing.html#running-a-pipeline">from GitHub</a>:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">main.nf</code> - This file contains the individual processes and channels</li>
  <li><code class="language-plaintext highlighter-rouge">nextflow.config</code> - The configuration file for parameters, profiles etc. For more info read <a href="https://www.nextflow.io/docs/latest/config.html#configuration-file">here</a></li>
</ul>

<h3 id="workflow-layout">Workflow layout</h3>

<p>First, we need to get an idea of what the data flow will be and what software and scripts will run on it. I have outlined the basic workflow of <code class="language-plaintext highlighter-rouge">salmon-nf</code> below:</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/Salmon-pipeline/salmon-nf.png" alt="salmon-nf" width="50%" /></p>

<p>We will only have one single process <code class="language-plaintext highlighter-rouge">salmon</code> which will use the input <code class="language-plaintext highlighter-rouge">fastq</code> files and the respective transcriptome <code class="language-plaintext highlighter-rouge">index</code> file to produce our expression estimates and the pseudo-bam files of aligning reads.</p>

<p>So for our <code class="language-plaintext highlighter-rouge">salmon</code> process we will have 2 input channels:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">fastqChannel</code> - feeding in our raw reads in <code class="language-plaintext highlighter-rouge">fastq</code> format</li>
  <li><code class="language-plaintext highlighter-rouge">indexChannel</code> - providing our transcriptome <code class="language-plaintext highlighter-rouge">index</code> to which we align the reads</li>
</ul>

<p>Our <code class="language-plaintext highlighter-rouge">salmon</code> process will produce several output files, of which we feed two file types into output channels as our final results:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">quant.sf</code> files via the <code class="language-plaintext highlighter-rouge">salmonChannel</code> output channel</li>
  <li><code class="language-plaintext highlighter-rouge">pseudo.bam</code> files via the <code class="language-plaintext highlighter-rouge">pseudoBamChannel</code> output channel</li>
</ul>

<p>Now let’s look at how to actually implement this in code.</p>

<h3 id="docker-container">Docker container</h3>

<p>Before we can run anything, we need to provide the software environment containing <strong>all</strong> dependencies and software packages our <code class="language-plaintext highlighter-rouge">salmon</code> process requires. These days, this is usually done via a <a href="https://www.docker.com/">Docker</a> container, or a <a href="https://singularity.lbl.gov/">Singularity</a> container on HPC environments.</p>

<p>Many software packages - including Salmon in our case - already provide ready-to-use Docker containers (<code class="language-plaintext highlighter-rouge">combinelab/salmon</code>). But even if they don’t, do not despair and blindly jump into creating your own container: if the package was provided via <a href="https://bioconda.github.io/">BioConda</a>, you will find a Docker container on <a href="https://quay.io/organization/biocontainers">BioContainers</a>. I found this last resort to work in many cases.</p>

<p>Either way, since I wanted to convert the raw <code class="language-plaintext highlighter-rouge">SAM</code> output from <code class="language-plaintext highlighter-rouge">salmon</code> into a compressed <code class="language-plaintext highlighter-rouge">BAM</code> file, I chose to extend their Docker image by adding <code class="language-plaintext highlighter-rouge">samtools</code>, as shown in the <a href="https://docs.docker.com/engine/reference/builder/">Dockerfile</a> below.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Copyright (c) 2019 Tobias Neumann.</span>
<span class="c">#</span>
<span class="c"># You should have received a copy of the GNU Affero General Public License</span>
<span class="c"># along with this program.  If not, see &lt;http://www.gnu.org/licenses/&gt;.</span>

FROM combinelab/salmon:0.12.0

MAINTAINER Tobias Neumann &lt;tobias.neumann.at@gmail.com&gt;

RUN <span class="nv">buildDeps</span><span class="o">=</span><span class="s1">'wget ca-certificates make g++'</span> <span class="se">\</span>
    <span class="nv">runDeps</span><span class="o">=</span><span class="s1">'zlib1g-dev libncurses5-dev unzip gcc'</span> <span class="se">\</span>
    <span class="o">&amp;&amp;</span> <span class="nb">set</span> <span class="nt">-x</span> <span class="se">\</span>
    <span class="o">&amp;&amp;</span> apt-get update <span class="se">\</span>
    <span class="o">&amp;&amp;</span> apt-get <span class="nb">install</span> <span class="nt">-y</span> <span class="nv">$buildDeps</span> <span class="nv">$runDeps</span> <span class="nt">--no-install-recommends</span> <span class="se">\</span>
    <span class="o">&amp;&amp;</span> <span class="nb">rm</span> <span class="nt">-rf</span> /var/lib/apt/lists/<span class="k">*</span> <span class="se">\</span>
    <span class="o">&amp;&amp;</span> wget https://github.com/samtools/samtools/releases/download/1.9/samtools-1.9.tar.bz2 <span class="se">\</span>
    <span class="o">&amp;&amp;</span> <span class="nb">tar </span>xvfj samtools-1.9.tar.bz2 <span class="se">\</span>
    <span class="o">&amp;&amp;</span> <span class="nb">cd </span>samtools-1.9 <span class="se">\</span>
    <span class="o">&amp;&amp;</span> ./configure <span class="nt">--prefix</span><span class="o">=</span>/usr/local/ <span class="se">\</span>
    <span class="o">&amp;&amp;</span> make <span class="se">\</span>
    <span class="o">&amp;&amp;</span> make <span class="nb">install</span> <span class="se">\</span>
    <span class="o">&amp;&amp;</span> apt-get purge <span class="nt">-y</span> <span class="nt">--auto-remove</span> <span class="nv">$buildDeps</span>
</code></pre></div></div>

<p>The resulting Docker image was pushed to <a href="https://hub.docker.com/">Docker Hub</a> and can be pulled via <code class="language-plaintext highlighter-rouge">docker pull obenauflab/salmon:latest</code>.</p>

<h3 id="mainnf">main.nf</h3>

<p>Now we are ready to create the central <code class="language-plaintext highlighter-rouge">main.nf</code> file, which contains all processes as well as channels. As mentioned before, you will find the entire code on <a href="https://github.com/t-neumann/salmon-nf">GitHub</a>, so here is an excerpt of the important sections.</p>

<h5 id="fastqchannel"><code class="language-plaintext highlighter-rouge">fastqChannel</code></h5>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pairedEndRegex</span> <span class="o">=</span> <span class="n">params</span><span class="o">.</span><span class="na">inputDir</span> <span class="o">+</span> <span class="s">"/*_{1,2}.fq.gz"</span>
<span class="nc">SERegex</span> <span class="o">=</span> <span class="n">params</span><span class="o">.</span><span class="na">inputDir</span> <span class="o">+</span> <span class="s">"/*[!12].fq.gz"</span>

<span class="n">pairFiles</span> <span class="o">=</span> <span class="nc">Channel</span><span class="o">.</span><span class="na">fromFilePairs</span><span class="o">(</span><span class="n">pairedEndRegex</span><span class="o">)</span>
<span class="n">singleFiles</span> <span class="o">=</span> <span class="nc">Channel</span><span class="o">.</span><span class="na">fromFilePairs</span><span class="o">(</span><span class="nc">SERegex</span><span class="o">,</span> <span class="nl">size:</span> <span class="mi">1</span><span class="o">){</span> <span class="n">file</span> <span class="o">-&gt;</span> <span class="n">file</span><span class="o">.</span><span class="na">baseName</span><span class="o">.</span><span class="na">replaceAll</span><span class="o">(/.</span><span class="na">fq</span><span class="o">/,</span><span class="s">""</span><span class="o">)</span> <span class="o">}</span>

<span class="n">singleFiles</span><span class="o">.</span><span class="na">mix</span><span class="o">(</span><span class="n">pairFiles</span><span class="o">)</span>
<span class="o">.</span><span class="na">set</span> <span class="o">{</span> <span class="n">fastqChannel</span> <span class="o">}</span>
</code></pre></div></div>

<p>This elaborate chunk of code is needed so that the <code class="language-plaintext highlighter-rouge">fastqChannel</code> input channel of our <code class="language-plaintext highlighter-rouge">salmon</code> process can handle both single- and paired-end <code class="language-plaintext highlighter-rouge">fastq</code> files. As you can see, we create a <code class="language-plaintext highlighter-rouge">pairFiles</code> channel with a paired-end glob pattern, assuming that our read pairs are named <code class="language-plaintext highlighter-rouge">*_1.fq.gz</code> and <code class="language-plaintext highlighter-rouge">*_2.fq.gz</code>. In addition, we have a <code class="language-plaintext highlighter-rouge">singleFiles</code> channel that picks up all <code class="language-plaintext highlighter-rouge">fastq</code> files not following the <code class="language-plaintext highlighter-rouge">_1</code>/<code class="language-plaintext highlighter-rouge">_2</code> naming convention and assumes they are single-end read files.</p>
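<p>A quick way to sanity-check the two patterns is to try them in bash, whose globbing behaves similarly here (the file names below are made up for illustration):</p>

```bash
#!/usr/bin/env bash
# Demo of the two glob patterns from fastqChannel, on made-up file names.
dir=$(mktemp -d)
touch "$dir/sampleA_1.fq.gz" "$dir/sampleA_2.fq.gz" "$dir/sampleB.fq.gz"

shopt -s nullglob
pe=( "$dir"/*_{1,2}.fq.gz )   # paired-end pattern: matches sampleA_1 and sampleA_2
se=( "$dir"/*[!12].fq.gz )    # single-end pattern: matches only sampleB

echo "paired-end matches: ${#pe[@]}"
echo "single-end matches: ${#se[@]}"
rm -r "$dir"
```

Note that <code class="language-plaintext highlighter-rouge">*[!12].fq.gz</code> only excludes files whose name ends in <code class="language-plaintext highlighter-rouge">1.fq.gz</code> or <code class="language-plaintext highlighter-rouge">2.fq.gz</code>, so unconventionally named paired files would leak into the single-end channel.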

<p>The <code class="language-plaintext highlighter-rouge">fromFilePairs</code> method creates a channel emitting the file pairs matching the regex we provided. The matching files are emitted as tuples in which the first element is the grouping key of the matching pair and the second element is the list of files (sorted in lexicographical order). For example:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span><span class="mo">03</span><span class="mi">99</span><span class="n">ad16</span><span class="o">-</span><span class="mi">816</span><span class="n">f</span><span class="o">-</span><span class="mi">4824</span><span class="o">-</span><span class="n">ae28</span><span class="o">-</span><span class="mi">7</span><span class="n">b82e006e7b7_gdc_realn_rehead</span><span class="o">,</span> <span class="o">[</span><span class="mo">03</span><span class="mi">99</span><span class="n">ad16</span><span class="o">-</span><span class="mi">816</span><span class="n">f</span><span class="o">-</span><span class="mi">4824</span><span class="o">-</span><span class="n">ae28</span><span class="o">-</span><span class="mi">7</span><span class="n">b82e006e7b7_gdc_realn_rehead_1</span><span class="o">.</span><span class="na">fq</span><span class="o">.</span><span class="na">gz</span><span class="o">,</span> <span class="mo">03</span><span class="mi">99</span><span class="n">ad16</span><span class="o">-</span><span class="mi">816</span><span class="n">f</span><span class="o">-</span><span class="mi">4824</span><span class="o">-</span><span class="n">ae28</span><span class="o">-</span><span class="mi">7</span><span class="n">b82e006e7b7_gdc_realn_rehead_2</span><span class="o">.</span><span class="na">fq</span><span class="o">.</span><span class="na">gz</span><span class="o">]]</span>
<span class="o">[</span><span class="mi">0</span><span class="n">ac6634e</span><span class="o">-</span><span class="mo">00</span><span class="n">b0</span><span class="o">-</span><span class="mi">4107</span><span class="o">-</span><span class="n">a5d6</span><span class="o">-</span><span class="n">db8ffc602645_gdc_realn_rehead</span><span class="o">,</span> <span class="o">[</span><span class="mi">0</span><span class="n">ac6634e</span><span class="o">-</span><span class="mo">00</span><span class="n">b0</span><span class="o">-</span><span class="mi">4107</span><span class="o">-</span><span class="n">a5d6</span><span class="o">-</span><span class="n">db8ffc602645_gdc_realn_rehead_1</span><span class="o">.</span><span class="na">fq</span><span class="o">.</span><span class="na">gz</span><span class="o">,</span> <span class="mi">0</span><span class="n">ac6634e</span><span class="o">-</span><span class="mo">00</span><span class="n">b0</span><span class="o">-</span><span class="mi">4107</span><span class="o">-</span><span class="n">a5d6</span><span class="o">-</span><span class="n">db8ffc602645_gdc_realn_rehead_2</span><span class="o">.</span><span class="na">fq</span><span class="o">.</span><span class="na">gz</span><span class="o">]]</span>
</code></pre></div></div>

<p>As you can see, for the single-end reads channel <code class="language-plaintext highlighter-rouge">singleFiles</code>, the method is slightly extended:</p>

<p>First, we set an additional parameter <code class="language-plaintext highlighter-rouge">size: 1</code> to declare that each emitted item is expected to hold exactly one file. In addition, we manually provide a custom grouping strategy via the closure, which, given the current file as parameter, returns the grouping key. In our case, we simply strip the trailing <code class="language-plaintext highlighter-rouge">.fq</code> from the file’s base name and use this as our grouping key. For example:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span><span class="mi">0</span><span class="n">fdb3d0e</span><span class="o">-</span><span class="n">e405</span><span class="o">-</span><span class="mi">4</span><span class="n">e8d</span><span class="o">-</span><span class="mi">8897</span><span class="o">-</span><span class="mi">4</span><span class="n">a90ea4fe00c_gdc_realn_rehead</span><span class="o">,</span> <span class="o">[</span><span class="mi">0</span><span class="n">fdb3d0e</span><span class="o">-</span><span class="n">e405</span><span class="o">-</span><span class="mi">4</span><span class="n">e8d</span><span class="o">-</span><span class="mi">8897</span><span class="o">-</span><span class="mi">4</span><span class="n">a90ea4fe00c_gdc_realn_rehead</span><span class="o">.</span><span class="na">fq</span><span class="o">.</span><span class="na">gz</span><span class="o">]]</span>
<span class="o">[</span><span class="mi">1916</span><span class="n">abcd</span><span class="o">-</span><span class="mi">61</span><span class="n">c0</span><span class="o">-</span><span class="mi">4</span><span class="n">f23</span><span class="o">-</span><span class="mi">96</span><span class="n">ac</span><span class="o">-</span><span class="n">be70aacb8dc1_gdc_realn_rehead</span><span class="o">,</span> <span class="o">[</span><span class="mi">1916</span><span class="n">abcd</span><span class="o">-</span><span class="mi">61</span><span class="n">c0</span><span class="o">-</span><span class="mi">4</span><span class="n">f23</span><span class="o">-</span><span class="mi">96</span><span class="n">ac</span><span class="o">-</span><span class="n">be70aacb8dc1_gdc_realn_rehead</span><span class="o">.</span><span class="na">fq</span><span class="o">.</span><span class="na">gz</span><span class="o">]]</span>
</code></pre></div></div>
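<p>The effect of the closure can be mimicked in plain bash with parameter expansion (hypothetical file name):</p>

```bash
# Bash analogue of the Groovy closure file.baseName.replaceAll(/.fq/, ""):
# derive the single-end grouping key from a (made-up) file name.
f="sample_gdc_realn_rehead.fq.gz"
base="${f%.gz}"     # baseName: drop the final .gz -> sample_gdc_realn_rehead.fq
key="${base%.fq}"   # drop the trailing .fq        -> sample_gdc_realn_rehead
echo "$key"
```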

<p>Finally, we combine both channels via the <code class="language-plaintext highlighter-rouge">mix</code> operator into our final <code class="language-plaintext highlighter-rouge">fastqChannel</code> input channel of our <code class="language-plaintext highlighter-rouge">salmon</code> process.</p>

<h5 id="indexchannel"><code class="language-plaintext highlighter-rouge">indexChannel</code></h5>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">indexChannel</span> <span class="o">=</span> <span class="nc">Channel</span>
	<span class="o">.</span><span class="na">fromPath</span><span class="o">(</span><span class="n">params</span><span class="o">.</span><span class="na">salmonIndex</span><span class="o">)</span>
	<span class="o">.</span><span class="na">ifEmpty</span> <span class="o">{</span> <span class="n">exit</span> <span class="mi">1</span><span class="o">,</span> <span class="s">"Salmon index not found: ${params.salmonIndex}"</span> <span class="o">}</span>
</code></pre></div></div>

<p>This input channel is pretty straightforward to set up. The only thing we need to do is create our Salmon index beforehand (read how to do this <a href="https://salmon.readthedocs.io/en/latest/salmon.html#preparing-transcriptome-indices-mapping-based-mode">here</a>) and supply it via the <code class="language-plaintext highlighter-rouge">salmonIndex</code> parameter - how this is done will follow later.</p>

<h5 id="process-salmon">Process <code class="language-plaintext highlighter-rouge">salmon</code></h5>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">process</span> <span class="n">salmon</span> <span class="o">{</span>

	<span class="n">tag</span> <span class="o">{</span> <span class="n">lane</span> <span class="o">}</span>

    <span class="nl">input:</span>
    <span class="n">set</span> <span class="nf">val</span><span class="o">(</span><span class="n">lane</span><span class="o">),</span> <span class="n">file</span><span class="o">(</span><span class="n">reads</span><span class="o">)</span> <span class="n">from</span> <span class="n">fastqChannel</span>
    <span class="n">file</span> <span class="n">index</span> <span class="n">from</span> <span class="n">indexChannel</span><span class="o">.</span><span class="na">first</span><span class="o">()</span>

    <span class="nl">output:</span>
    <span class="n">file</span> <span class="o">(</span><span class="s">"${lane}_salmon/quant.sf"</span><span class="o">)</span> <span class="n">into</span> <span class="n">salmonChannel</span>
    <span class="nf">file</span> <span class="o">(</span><span class="s">"${lane}_pseudo.bam"</span><span class="o">)</span> <span class="n">into</span> <span class="n">pseudoBamChannel</span>

    <span class="nl">shell:</span>

    <span class="n">def</span> <span class="n">single</span> <span class="o">=</span> <span class="n">reads</span> <span class="k">instanceof</span> <span class="nc">Path</span>

    <span class="nf">if</span> <span class="o">(!</span><span class="n">single</span><span class="o">)</span>

      <span class="sc">'''</span>
      <span class="n">salmon</span> <span class="n">quant</span> <span class="o">-</span><span class="n">i</span> <span class="o">!{</span><span class="n">index</span><span class="o">}</span> <span class="o">-</span><span class="n">l</span> <span class="no">A</span> <span class="o">-</span><span class="mi">1</span> <span class="o">!{</span><span class="n">reads</span><span class="o">[</span><span class="mi">0</span><span class="o">]}</span> <span class="o">-</span><span class="mi">2</span> <span class="o">!{</span><span class="n">reads</span><span class="o">[</span><span class="mi">1</span><span class="o">]}</span> <span class="o">-</span><span class="n">o</span> <span class="o">!{</span><span class="n">lane</span><span class="o">}</span><span class="n">_salmon</span> <span class="o">-</span><span class="n">p</span> <span class="o">!{</span><span class="n">task</span><span class="o">.</span><span class="na">cpus</span><span class="o">}</span> <span class="o">--</span><span class="n">validateMappings</span> <span class="o">--</span><span class="n">no</span><span class="o">-</span><span class="n">version</span><span class="o">-</span><span class="n">check</span> <span class="o">-</span><span class="n">z</span> <span class="o">|</span> <span class="n">samtools</span> <span class="n">view</span> <span class="o">-</span><span class="nc">Sb</span> <span class="o">-</span><span class="no">F</span> <span class="mi">256</span> <span class="o">-</span> <span class="o">&gt;</span> <span class="o">!{</span><span class="n">lane</span><span class="o">}</span><span class="n">_pseudo</span><span class="o">.</span><span class="na">bam</span>
	    <span class="sc">'''</span>
    <span class="k">else</span>
      <span class="sc">'''</span>
      <span class="n">salmon</span> <span class="n">quant</span> <span class="o">-</span><span class="n">i</span> <span class="o">!{</span><span class="n">index</span><span class="o">}</span> <span class="o">-</span><span class="n">l</span> <span class="no">A</span> <span class="o">-</span><span class="n">r</span> <span class="o">!{</span><span class="n">reads</span><span class="o">}</span> <span class="o">-</span><span class="n">o</span> <span class="o">!{</span><span class="n">lane</span><span class="o">}</span><span class="n">_salmon</span> <span class="o">-</span><span class="n">p</span> <span class="o">!{</span><span class="n">task</span><span class="o">.</span><span class="na">cpus</span><span class="o">}</span> <span class="o">--</span><span class="n">validateMappings</span> <span class="o">--</span><span class="n">no</span><span class="o">-</span><span class="n">version</span><span class="o">-</span><span class="n">check</span> <span class="o">-</span><span class="n">z</span> <span class="o">|</span> <span class="n">samtools</span> <span class="n">view</span> <span class="o">-</span><span class="nc">Sb</span> <span class="o">-</span><span class="no">F</span> <span class="mi">256</span> <span class="o">-</span> <span class="o">&gt;</span> <span class="o">!{</span><span class="n">lane</span><span class="o">}</span><span class="n">_pseudo</span><span class="o">.</span><span class="na">bam</span>
	    <span class="sc">'''</span>

<span class="o">}</span>
</code></pre></div></div>

<p>Our only process for the <code class="language-plaintext highlighter-rouge">salmon-nf</code> workflow is the <code class="language-plaintext highlighter-rouge">salmon</code> process.</p>

<p>You will notice that it has the 2 input channels we previously defined - <code class="language-plaintext highlighter-rouge">fastqChannel</code> and <code class="language-plaintext highlighter-rouge">indexChannel</code>. Note how we use the <code class="language-plaintext highlighter-rouge">.first()</code> method on the <code class="language-plaintext highlighter-rouge">indexChannel</code>: it turns the channel into a value channel whose single item - the index folder - can be reused for every incoming sample.</p>

<p>In addition, we have defined 2 output channels - <code class="language-plaintext highlighter-rouge">salmonChannel</code> outputting all <code class="language-plaintext highlighter-rouge">quant.sf</code> files and <code class="language-plaintext highlighter-rouge">pseudoBamChannel</code> outputting the corresponding <code class="language-plaintext highlighter-rouge">pseudo.bam</code> files.</p>

<p>The actual script that is run is a plain conditional bash script. An initial condition checks whether single-end read files or paired-end reads are coming in from the <code class="language-plaintext highlighter-rouge">fastqChannel</code> - and based on this evaluation, one or the other script branch is run.</p>
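<p>The check <code class="language-plaintext highlighter-rouge">reads instanceof Path</code> works because a single-end item carries one file while a paired-end item carries a list of two. A bash sketch of the same branching logic, using the file count per sample item and made-up file names:</p>

```bash
# Sketch of the single- vs paired-end branching: here we count the
# files per sample item instead of testing `instanceof Path`.
reads=("sample_1.fq.gz" "sample_2.fq.gz")

if [ "${#reads[@]}" -eq 2 ]; then
  mode="paired-end"   # would run: salmon quant ... -1 <reads[0]> -2 <reads[1]>
else
  mode="single-end"   # would run: salmon quant ... -r <reads>
fi
echo "$mode"
```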

<p>The bash script itself is then essentially a single <code class="language-plaintext highlighter-rouge">salmon quant</code> call on the respective input files, with the SAM output piped through <code class="language-plaintext highlighter-rouge">samtools view</code> (dropping secondary alignments via <code class="language-plaintext highlighter-rouge">-F 256</code>) to produce the compressed <code class="language-plaintext highlighter-rouge">pseudo.bam</code> file.</p>


<h3 id="nextflowconfig">nextflow.config</h3>

<p>Nextflow configuration files contain directives for parameter definitions, profile definitions and many other settings.</p>

<p>In our particular example of <code class="language-plaintext highlighter-rouge">salmon-nf</code>, we will keep the master <code class="language-plaintext highlighter-rouge">nextflow.config</code> tidy and include additional configs for each section.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">includeConfig</span> <span class="err">'</span><span class="n">config</span><span class="o">/</span><span class="n">general</span><span class="o">.</span><span class="na">config</span><span class="err">'</span>
<span class="n">includeConfig</span> <span class="err">'</span><span class="n">config</span><span class="o">/</span><span class="n">docker</span><span class="o">.</span><span class="na">config</span><span class="err">'</span>

<span class="n">profiles</span> <span class="o">{</span>
    <span class="n">standard</span> <span class="o">{</span>
        <span class="n">process</span><span class="o">.</span><span class="na">executor</span> <span class="o">=</span> <span class="err">'</span><span class="n">local</span><span class="err">'</span>
        <span class="n">process</span><span class="o">.</span><span class="na">maxForks</span> <span class="o">=</span> <span class="mi">3</span>
    <span class="o">}</span>

    <span class="n">slurm</span> <span class="o">{</span>
    	<span class="n">includeConfig</span> <span class="err">'</span><span class="n">config</span><span class="o">/</span><span class="n">slurm</span><span class="o">.</span><span class="na">config</span><span class="err">'</span>
    <span class="o">}</span>

    <span class="n">awsbatch</span> <span class="o">{</span>
        <span class="n">includeConfig</span> <span class="err">'</span><span class="n">config</span><span class="o">/</span><span class="n">awsbatch</span><span class="o">.</span><span class="na">config</span><span class="err">'</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p>As you can see, we have simply included some more config files plus a barebones definition of profiles. Let’s look at the sub-config files.</p>

<h5 id="generalconfig">general.config</h5>

<p>This holds general configurations, parameters and definitions that are applicable to any of our run profiles.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">params</span> <span class="o">{</span>

   <span class="n">outputDir</span> <span class="o">=</span> <span class="err">'</span><span class="o">./</span><span class="n">results</span><span class="err">'</span>
<span class="o">}</span>

<span class="n">process</span> <span class="o">{</span>

	<span class="n">publishDir</span> <span class="o">=</span> <span class="o">[</span>
      <span class="o">[</span><span class="nl">path:</span> <span class="n">params</span><span class="o">.</span><span class="na">outputDir</span><span class="o">,</span> <span class="nl">mode:</span> <span class="err">'</span><span class="n">copy</span><span class="err">'</span><span class="o">,</span> <span class="nl">overwrite:</span> <span class="err">'</span><span class="kc">true</span><span class="err">'</span><span class="o">,</span> <span class="nl">pattern:</span> <span class="s">"*/quant.sf"</span><span class="o">],</span>
      <span class="o">[</span><span class="nl">path:</span> <span class="n">params</span><span class="o">.</span><span class="na">outputDir</span><span class="o">,</span> <span class="nl">mode:</span> <span class="err">'</span><span class="n">copy</span><span class="err">'</span><span class="o">,</span> <span class="nl">overwrite:</span> <span class="err">'</span><span class="kc">true</span><span class="err">'</span><span class="o">,</span> <span class="nl">pattern:</span> <span class="s">"*pseudo.bam"</span><span class="o">]</span>
  	<span class="o">]</span>

	<span class="n">errorStrategy</span> <span class="o">=</span> <span class="err">'</span><span class="n">retry</span><span class="err">'</span>
	<span class="n">maxRetries</span> <span class="o">=</span> <span class="mi">3</span>
	<span class="n">maxForks</span> <span class="o">=</span> <span class="mi">100</span>

<span class="o">}</span>


<span class="n">cloud</span> <span class="o">{</span>
    <span class="n">imageId</span> <span class="o">=</span> <span class="err">'</span><span class="n">ami</span><span class="o">-</span><span class="mi">0</span><span class="n">f99d00928be3a282</span><span class="err">'</span>
    <span class="n">instanceType</span> <span class="o">=</span> <span class="err">'</span><span class="n">t2</span><span class="o">.</span><span class="na">micro</span><span class="err">'</span>
    <span class="n">userName</span> <span class="o">=</span> <span class="err">'</span><span class="n">ec2</span><span class="o">-</span><span class="n">user</span><span class="err">'</span>
    <span class="n">keyName</span> <span class="o">=</span> <span class="err">'</span><span class="n">awsbatch</span><span class="err">'</span>
    <span class="c1">// Type: SSH, Protocol: TCP, Port: 22, Source IP: 0.0.0.0/0</span>
    <span class="n">securityGroup</span> <span class="o">=</span> <span class="err">'</span><span class="n">sg</span><span class="o">-</span><span class="mo">0307</span><span class="n">dbec406526c14</span><span class="err">'</span>
<span class="o">}</span>


<span class="n">timeline</span> <span class="o">{</span>
	<span class="n">enabled</span> <span class="o">=</span> <span class="kc">true</span>
<span class="o">}</span>

<span class="n">report</span> <span class="o">{</span>
	<span class="n">enabled</span> <span class="o">=</span> <span class="kc">true</span>
<span class="o">}</span>
</code></pre></div></div>

<p>We set a default output directory in the <code class="language-plaintext highlighter-rouge">params</code> section, copy the <code class="language-plaintext highlighter-rouge">quant.sf</code> and <code class="language-plaintext highlighter-rouge">pseudo.bam</code> files to a dedicated publish directory, set our error strategy, define a basic cloud profile for starting up instances on <a href="https://aws.amazon.com">AWS</a>, and enable <a href="https://www.nextflow.io/docs/latest/tracing.html#timeline-report">timeline</a> and <a href="https://www.nextflow.io/docs/latest/tracing.html#execution-report">execution</a> reports by default.</p>

<h5 id="dockerconfig">docker.config</h5>

<p>With this configuration file, we enable Docker support per default and supply the Docker image to use with our <code class="language-plaintext highlighter-rouge">salmon</code> process.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">docker</span> <span class="o">{</span>
    <span class="n">enabled</span> <span class="o">=</span> <span class="kc">true</span>
<span class="o">}</span>

<span class="n">process</span> <span class="o">{</span>
    <span class="c1">// Process-specific docker containers</span>
    <span class="nl">withName:</span><span class="n">salmon</span> <span class="o">{</span>
        <span class="n">container</span> <span class="o">=</span> <span class="err">'</span><span class="n">obenauflab</span><span class="o">/</span><span class="nl">salmon:</span><span class="n">latest</span><span class="err">'</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<h5 id="slurmconfig">slurm.config</h5>

<p>This configuration file defines a profile for the <a href="https://slurm.schedmd.com/documentation.html">SLURM</a> scheduler that runs on our HPC system. Our cluster only supports Singularity, so we disable Docker and enable Singularity instead, define basic resource constraints and queues on our HPC system where our tasks run - and finally also supply the location of the <code class="language-plaintext highlighter-rouge">salmonIndex</code> on our file system.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">singularity</span> <span class="o">{</span>
	<span class="n">enabled</span> <span class="o">=</span> <span class="kc">true</span>
<span class="o">}</span>

<span class="n">docker</span> <span class="o">{</span>
	<span class="n">enabled</span> <span class="o">=</span> <span class="kc">false</span>
<span class="o">}</span>

<span class="n">process</span> <span class="o">{</span>

    <span class="n">executor</span> <span class="o">=</span> <span class="err">'</span><span class="n">slurm</span><span class="err">'</span>
    <span class="n">clusterOptions</span> <span class="o">=</span> <span class="err">'</span><span class="o">--</span><span class="n">qos</span><span class="o">=</span><span class="kt">short</span><span class="err">'</span>
    <span class="n">cpus</span> <span class="o">=</span> <span class="err">'</span><span class="mi">12</span><span class="err">'</span>
    <span class="n">memory</span> <span class="o">=</span> <span class="o">{</span> <span class="mi">8</span><span class="o">.</span><span class="na">GB</span> <span class="o">*</span> <span class="n">task</span><span class="o">.</span><span class="na">attempt</span> <span class="o">}</span>
<span class="o">}</span>

<span class="n">params</span> <span class="o">{</span>

   <span class="n">salmonIndex</span> <span class="o">=</span> <span class="err">'</span><span class="o">/</span><span class="n">groups</span><span class="o">/</span><span class="nc">Software</span><span class="o">/</span><span class="n">indices</span><span class="o">/</span><span class="n">hg38</span><span class="o">/</span><span class="n">salmon</span><span class="o">/</span><span class="n">gencode</span><span class="o">.</span><span class="na">v28</span><span class="o">.</span><span class="na">IMPACT</span><span class="err">'</span>

<span class="o">}</span>
</code></pre></div></div>

<h5 id="awsbatchconfig">awsbatch.config</h5>

<p>This configuration file will be explained in detail in a later post - in brief, it enables execution of tasks in the cloud using <a href="https://aws.amazon.com/batch/">AWS Batch</a>, but it requires extensive additional configuration before it is usable.</p>
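<p>To make this less abstract, here is a minimal sketch of what such a profile could look like - every name in it (bucket, region, queue, container image) is a hypothetical placeholder, and a working setup additionally needs AWS credentials plus a configured Batch compute environment and job queue:</p>

```groovy
// Hypothetical Nextflow profile for AWS Batch - all names are placeholders
process {
    executor  = 'awsbatch'
    queue     = 'my-batch-queue'           // an existing AWS Batch job queue
    container = 'combinelab/salmon:latest' // Docker image pulled by ECS
}

aws {
    region = 'eu-central-1'
}

// The work directory must live on S3 so tasks can stage files in and out
workDir = 's3://my-pipeline-bucket/work'
```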

<h2 id="running-the-salmon-nf-nextflow-workflow">Running the <code class="language-plaintext highlighter-rouge">salmon-nf</code> Nextflow workflow</h2>

<p>Now that we have written our code and committed everything to GitHub, we can finally test-drive our workflow on some actual data.</p>

<p>First, let’s pull in our workflow:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tobias.neumann@login-01 <span class="o">[</span>BIO] <span class="nv">$ </span>nextflow pull t-neumann/salmon-nf
Picked up _JAVA_OPTIONS: <span class="nt">-Djava</span>.io.tmpdir<span class="o">=</span>/tmp
Checking t-neumann/salmon-nf ...
 downloaded from https://github.com/t-neumann/salmon-nf.git - revision: 4fbaea7165 <span class="o">[</span>master]
tobias.neumann@login-01 <span class="o">[</span>BIO] <span class="err">$</span>
</code></pre></div></div>

<p>Now we are ready to run our workflow. Make sure to select the profile you desire - for this example I will run it on our in-house cluster with SLURM:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tobias.neumann@login-01 <span class="o">[</span>BIO] <span class="nv">$ </span>nextflow run t-neumann/salmon-nf <span class="nt">--inputDir</span> /tmp/data <span class="nt">--outputDir</span> results <span class="nt">-profile</span> slurm <span class="nt">-resume</span>
Picked up _JAVA_OPTIONS: <span class="nt">-Djava</span>.io.tmpdir<span class="o">=</span>/tmp
N E X T F L O W  ~  version 19.01.0
Launching <span class="sb">`</span>t-neumann/salmon-nf<span class="sb">`</span> <span class="o">[</span>maniac_poisson] - revision: 4fbaea7165 <span class="o">[</span>master]

 parameters
 <span class="o">======================</span>
 input directory          : /tmp/data
 output directory         : results
 <span class="o">======================</span>

<span class="o">[</span>warm up] executor <span class="o">&gt;</span> slurm
<span class="o">[</span>fb/20d1dc] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>8cec7235-3572-460c-b1d7-efe7961988e1_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>e9/6f6404] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>5e18b02d-7e56-4f0d-b892-e7798eee5205_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>f9/509312] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>d1ada222-b67f-47c0-b380-091eaab093b4_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>6d/30354f] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>3783843f-c4fa-4aab-8f5b-e0749763164e_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>9b/2a81e9] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>0fdb3d0e-e405-4e8d-8897-4a90ea4fe00c_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>de/418130] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>383e3574-d22c-4dd6-842f-656ee2ab3b32_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>c1/e00c04] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>1916abcd-61c0-4f23-96ac-be70aacb8dc1_gdc_realn_rehead<span class="o">)</span>
<span class="o">[</span>63/6a2e93] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>30fe4005-f4f2-41ce-bb1a-4830f3959ab7_gdc_realn_rehead<span class="o">)</span>
</code></pre></div></div>

<p>Now we just have to wait till our workflow has successfully finished processing all our samples.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>76/67754e] Submitted process <span class="o">&gt;</span> salmon <span class="o">(</span>0399ad16-816f-4824-ae28-7b82e006e7b7_gdc_realn_rehead<span class="o">)</span>

t-neumann/salmon-nf has finished.
Status:   SUCCESS
Time:     Sun Aug 25 23:35:49 CEST 2019
Duration: 2m

tobias.neumann@login-01 <span class="o">[</span>BIO] <span class="err">$</span>
</code></pre></div></div>

<p>If we now check our results and execution folders, we will find all the files we asked for - Nextflow is awesome!</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tobias.neumann@login-01 <span class="o">[</span>BIO] <span class="nv">$ </span><span class="nb">ls
</span>report.html  results  timeline.html
tobias.neumann@login-01 <span class="o">[</span>BIO] <span class="nv">$ </span><span class="nb">ls </span>results
0399ad16-816f-4824-ae28-7b82e006e7b7_gdc_realn_rehead_pseudo.bam  0399ad16-816f-4824-ae28-7b82e006e7b7_gdc_realn_rehead_salmon
</code></pre></div></div>

<p>Have fun building workflows on your own - it pays off, especially for larger samples and heterogeneous computing environments!</p>]]></content><author><name>Tobias Neumann</name></author><category term="Pipelines" /><category term="Containers" /><category term="Docker" /><category term="Nextflow" /><summary type="html"><![CDATA[Setting up and running a pipeline with Nextflow]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://t-neumann.github.io/assets/images/categories/nextflow.png" /><media:content medium="image" url="https://t-neumann.github.io/assets/images/categories/nextflow.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">AWS architecture outline</title><link href="https://t-neumann.github.io/pipelines/AWS-architecture/" rel="alternate" type="text/html" title="AWS architecture outline" /><published>2019-02-10T09:45:00+01:00</published><updated>2019-02-10T09:45:00+01:00</updated><id>https://t-neumann.github.io/pipelines/AWS-architecture</id><content type="html" xml:base="https://t-neumann.github.io/pipelines/AWS-architecture/"><![CDATA[<p>If you talk about the omni-present buzzword <strong>cloud computing</strong>, you will inevitably stumble over <a href="https://aws.amazon.com">Amazon Web Services <i class="fab fa-aws" aria-hidden="true"></i></a>. Sounds super cool and everybody gets excited about it, but I for my part was simply overwhelmed by the amount of services and products available from the platform.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-architecture/AWSServices.png" alt="AWS Services" /></p>

<p>The good news for us bioinformaticians is - and probably all cloud computing professionals working on enterprise solutions are going to beat me up for this statement - that for setting up a proper and failsafe analysis pipeline with AWS, you only need a tiny fraction of those services and can ignore the rest. In this post, I will walk you through the essential AWS building blocks I deem necessary for a basic bioinformatics processing pipeline, their characteristics and caveats, and how they play together.</p>

<h1 id="aws-building-blocks">AWS building blocks</h1>

<p>If you are familiar with cluster computing environments, you should have no trouble recognizing the same architectural principles when building your own custom cluster computing environment in the cloud with AWS. I will elaborate on the pieces I encountered when building up a basic processing pipeline:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">S3</code> for storage of input and auxiliary (e.g. index) files</li>
  <li><code class="language-plaintext highlighter-rouge">EBS</code> as local compute storage</li>
  <li><code class="language-plaintext highlighter-rouge">AMI</code> Machine image (the operating system) to be run on your instances</li>
  <li><code class="language-plaintext highlighter-rouge">EC2</code> instances that do the actual computation</li>
  <li><code class="language-plaintext highlighter-rouge">ECS</code> to create your “software” from Docker containers to run on your instances</li>
  <li><code class="language-plaintext highlighter-rouge">AWS Batch</code> that handles everything from submission to scaling and proper finalization of your individual jobs</li>
</ul>

<p>In the limited number of pipelines I have set up to run on AWS (they can also run in any other compute environment, but that is a story for a later post), I have never used any services beyond these. For anything that involves reading e.g. raw read files, processing them and retrieving the output, one should be able to make do with a combination of those. This setup can probably be optimized or done more elegantly with different services, but I have discussed it with various people and we have not come across a solution that could do it at a lower cost.</p>

<h2 id="s3---simple-storage-service">S3 - Simple Storage Service</h2>

<p>This is the long-term storage solution from AWS. If you are familiar with a cluster compute environment, this would be your globally accessible file system where you store all your important files, reference genomes, alignment indices - you name it. Contrary to the storage you are used to (unless you copy files to your node’s local temporary storage for fast I/O), none of the files on <code class="language-plaintext highlighter-rouge">S3</code> are directly read or written when utilizing <code class="language-plaintext highlighter-rouge">EC2</code> instances for computational tasks. Before any pipeline starts, all of the necessary files have to be present in <code class="language-plaintext highlighter-rouge">S3</code>, such as:</p>

<ul>
  <li>Input files:
    <ul>
      <li>Raw read files (<code class="language-plaintext highlighter-rouge">fastq</code>, <code class="language-plaintext highlighter-rouge">bam</code>,…)</li>
      <li>Quantification tables (<code class="language-plaintext highlighter-rouge">txt</code>, <code class="language-plaintext highlighter-rouge">tsv</code>, <code class="language-plaintext highlighter-rouge">csv</code>,…)</li>
    </ul>
  </li>
  <li>Reference files:
    <ul>
      <li>Genome sequence (<code class="language-plaintext highlighter-rouge">fasta</code>)</li>
      <li>Feature annotations (<code class="language-plaintext highlighter-rouge">gtf</code>, <code class="language-plaintext highlighter-rouge">bed</code>, …)</li>
    </ul>
  </li>
  <li>Index files:
    <ul>
      <li>Alignment indices (<code class="language-plaintext highlighter-rouge">bwa</code>, <code class="language-plaintext highlighter-rouge">bowtie</code>, <code class="language-plaintext highlighter-rouge">STAR</code>,…)</li>
      <li>Exon junction annotations (<code class="language-plaintext highlighter-rouge">gtf</code>, …)</li>
      <li>Transcriptome indices (<code class="language-plaintext highlighter-rouge">callisto</code>, <code class="language-plaintext highlighter-rouge">salmon</code>, …)</li>
    </ul>
  </li>
</ul>
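<p>Getting those files into <code class="language-plaintext highlighter-rouge">S3</code> in the first place is typically done with the <a href="https://aws.amazon.com/cli">AWS Command Line Interface</a> - a quick sketch, with the bucket name being a made-up placeholder:</p>

```shell
# Upload a single raw read file (bucket name is hypothetical)
aws s3 cp sample1.fastq.gz s3://my-pipeline-bucket/input/

# Recursively sync a whole index directory
aws s3 sync ./salmon-index s3://my-pipeline-bucket/indices/salmon/
```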

<p><code class="language-plaintext highlighter-rouge">S3</code> will also be the final storage location where any of the output files produced by your pipeline end up. Since only <code class="language-plaintext highlighter-rouge">S3</code> is long-term storage, you usually don’t have to worry about deleting intermediate or temporary files produced by your pipeline - they will be discarded after your instance has finished processing a given task.</p>

<p>Uploading to <code class="language-plaintext highlighter-rouge">S3</code> does not come at any cost; downloading data from <code class="language-plaintext highlighter-rouge">S3</code>, however, is charged at around 10 cents/GB. Storage on <code class="language-plaintext highlighter-rouge">S3</code> is charged on a per-GB, per-month basis. My guess is that downloads are charged to keep you from shuttling data in and out of <code class="language-plaintext highlighter-rouge">S3</code> for free and thereby circumventing the storage cost.</p>
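<p>To put those rates into perspective, here is a back-of-the-envelope calculation in plain shell arithmetic - the storage rate of ~2 cents/GB/month is an assumption on my part (check the current AWS pricing page), only the ~10 cents/GB download figure is the one quoted above:</p>

```shell
STORAGE_GB=500          # data kept in S3
MONTHS=12               # stored for one year
STORAGE_CENTS_PER_GB=2  # ~2 cents/GB/month - assumed standard-tier rate
EGRESS_GB=500           # data downloaded once
EGRESS_CENTS_PER_GB=10  # ~10 cents/GB download charge

STORAGE_USD=$(( STORAGE_GB * MONTHS * STORAGE_CENTS_PER_GB / 100 ))
EGRESS_USD=$(( EGRESS_GB * EGRESS_CENTS_PER_GB / 100 ))
echo "Storage: ${STORAGE_USD} USD, egress: ${EGRESS_USD} USD"
```

Even at these rough rates, storage over time dominates one-off downloads - another reason to clean up buckets you no longer need.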

<h2 id="ebs---elastic-block-store">EBS - Elastic Block Store</h2>

<p>Every launched instance comes with a root volume of limited size (8 GB) where all the OS and service files required to start up an instance are located. To each instance, you can (and often <strong>must</strong>) attach additional volumes - <code class="language-plaintext highlighter-rouge">EBS</code> volumes - of configurable size where your data goes.</p>

<p>There are three things to consider when choosing your <code class="language-plaintext highlighter-rouge">EBS</code> size:</p>

<ul>
  <li>It needs to be large enough to store all input files for a given job
    <ul>
      <li>This includes <strong>all</strong> auxiliary files such as index files!</li>
    </ul>
  </li>
  <li>It needs to be large enough to store <strong>all</strong> intermediate files for a given job</li>
  <li>It needs to be large enough to store <strong>all</strong> output files from a given job</li>
</ul>

<p>Remember - <code class="language-plaintext highlighter-rouge">S3</code> data is never directly accessed from your instance, but always copied to your local <code class="language-plaintext highlighter-rouge">EBS</code> volume!</p>

<p>Estimating <code class="language-plaintext highlighter-rouge">EBS</code> volume sizes gave me a hard time initially, and I did a lot of benchmarking runs - if the volume is too small, your jobs will crash. In practice, I found that <code class="language-plaintext highlighter-rouge">EBS</code> cost is a negligible fraction of your overall cost, so in the end I ended up being very generous with <code class="language-plaintext highlighter-rouge">EBS</code> volume sizes.</p>
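<p>My own sizing eventually boiled down to summing everything a job touches and applying a generous headroom factor - the numbers below are purely illustrative assumptions, not measurements:</p>

```shell
INPUT_GB=30         # raw reads copied in from S3
INDEX_GB=25         # auxiliary files such as an alignment index
INTERMEDIATE_GB=60  # temporary/unsorted files created mid-job
OUTPUT_GB=40        # final results copied back to S3
HEADROOM=2          # double everything - EBS is cheap, crashed jobs are not

EBS_GB=$(( (INPUT_GB + INDEX_GB + INTERMEDIATE_GB + OUTPUT_GB) * HEADROOM ))
echo "Provision at least ${EBS_GB} GB of EBS"
```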

<h2 id="ami---amazon-machine-image">AMI - Amazon Machine Image</h2>

<p>The <code class="language-plaintext highlighter-rouge">AMI</code> is basically Amazon’s version of an image, similar to virtual machine images. Amazon offers quite a variety of OS base versions in their store (Linux, Windows etc.), but what you usually want is to extend one of those base images yourself with all the software you need during your pipeline run. These days, with <a href="https://www.docker.com">Docker <i class="fab fa-docker" aria-hidden="true"></i></a>, setting up your software environment takes very little effort, but even then you will in most cases have to install at least the <a href="https://aws.amazon.com/cli">AWS Command Line Interface</a> to copy files from and to <code class="language-plaintext highlighter-rouge">S3</code>.</p>
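<p>In practice, extending a base image often amounts to a handful of provisioning commands run once before the customized <code class="language-plaintext highlighter-rouge">AMI</code> is saved - a sketch assuming an Amazon Linux base image (package names will differ on other distributions, and the bucket name is a placeholder):</p>

```shell
# Install Docker and the AWS CLI on top of an Amazon Linux base image
sudo yum update -y
sudo yum install -y docker
sudo service docker start
pip install --user awscli

# Sanity check that staging data from S3 works (bucket is hypothetical)
aws s3 ls s3://my-pipeline-bucket/
```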

<h2 id="ec2---elastic-compute-cloud">EC2 - Elastic Compute Cloud</h2>

<p><code class="language-plaintext highlighter-rouge">EC2</code> is where you bring the computing heat: these are the instances upon which you launch your <code class="language-plaintext highlighter-rouge">AMI</code>s, attach your <code class="language-plaintext highlighter-rouge">EBS</code> volumes and then do some heavy computation. <code class="language-plaintext highlighter-rouge">EC2</code> instances come in all shapes and sizes, depending on your demands. Below is an excerpt of compute-optimized instance types, but depending on the application you might go for memory-optimized or storage-optimized instances, GPUs, you name it.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-architecture/EC2Instances.png" alt="EC2 instances" /></p>

<p>The cool thing about them - as you probably noticed already if you did the math - is that in terms of cost, it does not matter whether you pick a smaller or a larger instance. The price scales exactly linearly, meaning you don’t necessarily need to squeeze two jobs into an instance twice the size - which will become important at a later point.</p>

<h2 id="ecs---elastic-container-service">ECS - Elastic Container Service</h2>

<p>This definition, and especially its distinction from <code class="language-plaintext highlighter-rouge">AWS Batch</code>, was the hardest for me to grasp - I found the most helpful explanation <a href="https://medium.freecodecamp.org/amazon-ecs-terms-and-architecture-807d8c4960fd">here</a> and summarized it below.</p>

<p>According to Amazon,</p>
<blockquote>
  <p>Amazon Elastic Container Service (Amazon ECS) is a highly scalable, high-performance container orchestration service that supports Docker containers and allows you to easily run and scale containerized applications on AWS.</p>
</blockquote>

<p>With <code class="language-plaintext highlighter-rouge">ECS</code> you can run Docker containers on <code class="language-plaintext highlighter-rouge">EC2</code> instances whose <code class="language-plaintext highlighter-rouge">AMIs</code> come pre-installed with Docker. <code class="language-plaintext highlighter-rouge">ECS</code> handles the installation of containers and the scaling, monitoring and management of the <code class="language-plaintext highlighter-rouge">EC2</code> instances through an API or the AWS Management Console. An <code class="language-plaintext highlighter-rouge">ECS</code> instance has Docker and an <code class="language-plaintext highlighter-rouge">ECS</code> Container Agent running on it. A container instance can run many tasks; the agent handles the communication between <code class="language-plaintext highlighter-rouge">ECS</code> and the instance, reporting the status of running containers and starting new ones.</p>

<p><img src="https://t-neumann.github.io/assets/images/posts/AWS-architecture/ECS.png" alt="ECS" /></p>

<p>Several <code class="language-plaintext highlighter-rouge">ECS</code> container instances can be combined into an <code class="language-plaintext highlighter-rouge">ECS</code> cluster: Amazon ECS handles the logic of scheduling, maintaining, and handling scaling requests to these instances. It also takes away the work of finding the optimal placement of each Task based on your CPU and memory needs.</p>

<h2 id="aws-batch">AWS Batch</h2>

<p>The separation of <code class="language-plaintext highlighter-rouge">AWS Batch</code> from <code class="language-plaintext highlighter-rouge">ECS</code> was the most blurry to me. Essentially, <code class="language-plaintext highlighter-rouge">AWS Batch</code> is built on top of regular <code class="language-plaintext highlighter-rouge">ECS</code> and comes with additional features such as:</p>

<ul>
  <li>Managed compute environment: AWS handles cluster scaling in response to workload.</li>
  <li>Heterogeneous instance types: useful when outlier jobs take up large amounts of resources</li>
  <li>Spot instances: save money compared to on-demand instances</li>
  <li>Easy integration with <code class="language-plaintext highlighter-rouge">Cloudwatch</code> logs (<code class="language-plaintext highlighter-rouge">stdout</code> and <code class="language-plaintext highlighter-rouge">stderr</code> captured automatically). This can also lead to insane cost, so <strong>watch out</strong>. More on that later.</li>
</ul>

<p><code class="language-plaintext highlighter-rouge">AWS Batch</code> will effectively take care of firing up instances to handle your workload and then let <code class="language-plaintext highlighter-rouge">ECS</code> handle the Docker orchestration and job execution.</p>
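<p>Once a compute environment, job queue and job definition exist (all names below are hypothetical placeholders), submitting an individual job is a single CLI call:</p>

```shell
# Submit one job to an existing AWS Batch queue - names are made up
aws batch submit-job \
    --job-name salmon-sample1 \
    --job-queue my-batch-queue \
    --job-definition salmon-jobdef:1 \
    --container-overrides '{"command":["salmon","quant","--help"]}'
```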

<h1 id="putting-it-all-together">Putting it all together</h1>

<figure style="width: 500px" class="align-right">
  <img src="https://t-neumann.github.io/assets/images/posts/AWS-architecture/AWSArchitecture.png" alt="AWS Architecture" />
</figure>

<p>So how do all the AWS building blocks we just discussed fit together to process jobs? Let’s walk through it and conclude this post:</p>

<ul>
  <li>All jobs we want to be processed are sent to <code class="language-plaintext highlighter-rouge">AWS Batch</code>, which will assess the resources needed and fire up <code class="language-plaintext highlighter-rouge">ECS</code> instances accordingly.</li>
  <li><code class="language-plaintext highlighter-rouge">ECS</code> will take care of pulling the Docker images needed from a container registry (usually Docker hub) and fire up containers on the <code class="language-plaintext highlighter-rouge">EC2</code> instances using the pre-installed Docker daemon.</li>
  <li>These <code class="language-plaintext highlighter-rouge">EC2</code> instances have been initialized with custom <code class="language-plaintext highlighter-rouge">AMIs</code> on startup, providing all <code class="language-plaintext highlighter-rouge">ECS</code> prerequisites plus customized additions such as the <code class="language-plaintext highlighter-rouge">AWS CLI</code> and extra <code class="language-plaintext highlighter-rouge">EBS</code> volume space.</li>
  <li>All data required for this job is fetched from their long-term storage in <code class="language-plaintext highlighter-rouge">S3</code> to the local <code class="language-plaintext highlighter-rouge">EBS</code> storage of the respective <code class="language-plaintext highlighter-rouge">EC2</code> instance.</li>
</ul>

<p>Now the job has everything it needs to run and will be processed.
After reading this post, you should have a basic understanding of the AWS building blocks an AWS Batch scheduling system comprises. The next step is to actually build the architecture for such a pipeline, to which I will dedicate another comprehensive post.</p>]]></content><author><name>Tobias Neumann</name></author><category term="Pipelines" /><category term="AMI" /><category term="AWS" /><category term="Containers" /><category term="Docker" /><category term="Nextflow" /><summary type="html"><![CDATA[Resources to consider for engineering pipelines on AWS]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://t-neumann.github.io/assets/images/categories/aws.svg" /><media:content medium="image" url="https://t-neumann.github.io/assets/images/categories/aws.svg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Welcome to my website!</title><link href="https://t-neumann.github.io/general/intro/" rel="alternate" type="text/html" title="Welcome to my website!" /><published>2019-01-17T21:10:00+01:00</published><updated>2019-01-17T21:10:00+01:00</updated><id>https://t-neumann.github.io/general/intro</id><content type="html" xml:base="https://t-neumann.github.io/general/intro/"><![CDATA[<h1 id="hello-world">Hello world!</h1>

<p>I was repeatedly, gently pushed towards writing a couple of blog posts about all the obstacles I bothered people on various <a href="https://gitter.im">Gitter channels <i class="fab fa-gitter" aria-hidden="true"></i></a> with, so I finally made it happen.</p>

<p>Since I hate anything related to web development, HTML, CSS, JS - you name it - hosting Jekyll on GitHub is the most I can reasonably do. I’m actually quite happy that it requires little CSS and HTML and can be mostly put together via Markdown.</p>

<p>To glue this minimal website together, I shamelessly forked the <a href="https://github.com/mmistakes/minimal-mistakes">Minimal Mistakes <i class="fab fa-github" aria-hidden="true"></i></a> template and borrowed code from <a href="https://github.com/maxulysse/maxulysse.github.io">Maxime Garcia <i class="fab fa-github" aria-hidden="true"></i></a> for some things I liked from the blogs I looked at.</p>

<p>The plan is to put up posts here on anything related to bioinformatics, reproducible pipeline engineering, and occasionally rocket science and orbital mechanics.</p>

<p>Cheers</p>]]></content><author><name>Tobias Neumann</name></author><category term="general" /><category term="update" /><summary type="html"><![CDATA[Bioinformatician's life hacks and more]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://t-neumann.github.io/assets/images/categories/OOSS.jpg" /><media:content medium="image" url="https://t-neumann.github.io/assets/images/categories/OOSS.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>