Commit c23d61e

updated README in the main part
1 parent a9fe503 commit c23d61e

1 file changed

Lines changed: 8 additions & 57 deletions

README.md

@@ -3,42 +3,24 @@
33
# MDI Biological Laboratory RNA-seq Transcriptome Assembly Module
44
---------------------------------
55

6-
7-
## Three primary and interlinked learning goals:
8-
1. From a *biological perspective*, demonstration of the **process of transcriptome assembly** from raw RNA-seq data.
9-
2. From a *computational perspective*, demonstration of **computing using workflow management and container systems**.
10-
3. Also from an *infrastructure perspective*, demonstration of **carrying out these analyses efficiently in a cloud environment.**
11-
12-
13-
14-
# Quick Overview
15-
This module teaches you how to perform a short-read RNA-seq Transcriptome Assembly with Google Cloud Platform using a Nextflow pipeline, and eventually using the Google Batch API. In addition to the overview given in this README, you will find three Jupyter notebooks that teach you different components of RNA-seq in the cloud.
16-
17-
This module will cost you about $7.00 to run end to end, assuming you shut down and delete all resources upon completion.
18-
19-
206
## Contents
217

22-
+ [Getting Started](#getting-started)
8+
+ [Overview](#overview)
9+
+ [Learning goals](#learning-goals)
2310
+ [Biological Problem](#biological-problem)
24-
+ [Set Up](#set-up)
25-
+ [Software Requirements](#software-requirements)
2611
+ [Workflow Diagrams](#workflow-diagrams)
2712
+ [Data](#data)
2813
+ [Troubleshooting](#troubleshooting)
2914
+ [Funding](#funding)
3015
+ [License for Data](#license-for-data)
3116

32-
## **Getting Started**
33-
This learning module includes tutorials and execution scripts in the form of Jupyter notebooks. The purpose of these tutorials is to help users familiarize themselves with cloud computing in the specific context of running bioinformatics workflows to prepare for and carry out a transcriptome assembly, refinement, and annotation. The tutorials use a recently published Nextflow workflow (TransPi [manuscript](https://onlinelibrary.wiley.com/doi/10.1111/1755-0998.13593), [repository](https://github.com/palmuc/TransPi), and [user guide](https://palmuc.github.io/TransPi/)), which manages and passes data between several state-of-the-art programs. It carries out every step from initial quality control and normalization, through assembly with several tools, refinement, and assessment, to annotation of the final putative transcriptome.
34-
35-
Since the work is managed by this pipeline, the notebooks will focus on setting up and running the pipeline, followed by an examination of some of the wide range of outputs produced. We will also demonstrate how to retrieve the complete results directory so that users can examine the results more extensively on their own computing systems, going step by step through specific workflows. These workflows cover basic bioinformatics analysis from start to finish: starting from raw sequence data, they carry out the steps needed to generate a final assembled and annotated transcriptome.
36-
37-
We also put an emphasis on understanding how workflows execute, using the specific example of the Nextflow (https://www.nextflow.io) workflow engine, and on using workflow engines as supported by cloud infrastructure, using the specific example of the Google Batch API (https://cloud.google.com/batch).
17+
## Overview
18+
This module teaches you how to perform a short-read RNA-seq Transcriptome Assembly on a cloud computing platform using a Nextflow pipeline. In addition to the overview given in this README, you will find a README for each platform (AWS, Google Cloud) and Jupyter notebooks that teach you different components of RNA-seq in the cloud.
3819

39-
![technical infrastructure](/images/architecture_diagram.png)
40-
41-
**Figure 1:** The technical infrastructure diagram for this project.
20+
## Learning goals:
21+
1. From a *biological perspective*, demonstration of the **process of transcriptome assembly** from raw RNA-seq data.
22+
2. From a *computational perspective*, demonstration of **computing using workflow management and container systems**.
23+
3. Also from an *infrastructure perspective*, demonstration of **carrying out these analyses efficiently in a cloud environment.**
4224

4325
## **Biological Problem**
4426
The combination of increased availability and reduced expense in obtaining high-throughput sequencing has made transcriptome profiling analysis (primarily with RNA-seq) a standard tool for the molecular characterization of widely disparate biological systems. Researchers working in common model organisms, such as mouse or zebrafish, have relatively easy access to the necessary resources (e.g., well-assembled genomes and large collections of predicted/verified transcripts), for the analysis and interpretation of their data. In contrast, researchers working on less commonly studied organisms and systems often must develop these resources for themselves.
@@ -51,37 +33,6 @@ Transcriptome assembly is the broad term used to describe the process of estimat
5133

5234
Once a new transcriptome is generated, assessed, and refined, it must be annotated with putative functional assignments to be of use in subsequent functional studies. Functional annotation is accomplished through a combination of homology-based and ab initio methods. The most well-established homology-based processes are the combination of protein-coding sequence prediction followed by protein sequence alignment to databases of known proteins, especially those from human or common model organisms. Ab initio methods use computational models of various features (e.g., known protein domains, signal peptides, or peptide modification sites) to characterize either the transcript or its predicted protein product. This training module will cover multiple approaches to the annotation of assembled transcriptomes.
5335

54-
## **Set Up**
55-
56-
#### Part 1: Setting up Environment
57-
58-
**Enable APIs and create a Nextflow Service Account**
59-
60-
If you are using Nextflow outside of NIH CloudLab, you must enable the required APIs, set up a service account, and add your service account to your notebook permissions before creating the notebook. Follow sections 1 and 2 of the accompanying [how-to document](https://github.com/NIGMS/NIGMS-Sandbox/blob/main/docs/HowToCreateNextflowServiceAccount.md) for instructions. If you are executing this tutorial with an NIH CloudLab account, your default Compute Engine service account will have all required IAM roles to run the Nextflow portion.
61-
62-
**Create the Vertex AI Instance**
63-
64-
Follow the steps highlighted [here](https://github.com/STRIDES/NIHCloudLabGCP/blob/main/docs/vertexai.md) to create a new user-managed notebook in Vertex AI. Follow steps 1-8 and be especially careful to enable idle shutdown as highlighted in step 7. For this module you should select **Debian 11** and **Python3** in the Environment tab in step 5. In step 6 in the Machine type tab, select **n1-highmem-16** from the dropdown box. This will provide you with 16 vCPUs and 104 GB of RAM which may feel like a lot but is necessary for TransPi to run.
65-
66-
67-
#### Part 2: Adding the Modules to the Notebook
68-
69-
1. From the Launcher in your new VM, click the Terminal option.
70-
![setup 22](images/Setup22.png)
71-
2. Next, paste the following git command to get a copy of everything within this repository, including all of the submodules.
72-
73-
> ```git clone https://github.com/NIGMS/Transcriptome-Assembly-Refinement-and-Applications.git```
74-
3. You are now all set!
75-
76-
**WARNING:** When you are not using the notebook, stop it. This will prevent you from incurring costs while you are not using the notebook. You can do this in the same window as where you opened the notebook. Make sure that you have the notebook selected ![setup 23](images/Setup23.png). Then click the ![setup 24](images/Setup24.png). When you want to start up the notebook again, do the same process except click the ![setup 25](images/Setup25.png) instead.
77-
78-
## **Software Requirements**
79-
80-
All of the software requirements are taken care of and installed within [Submodule_01_prog_setup.ipynb](./Submodule_01_prog_setup.ipynb). The key pieces of software needed are:
81-
1. [Nextflow workflow system](https://www.nextflow.io/): Nextflow is the workflow management software that TransPi is built on.
82-
2. [Google Batch API](https://cloud.google.com/batch/docs): Google Batch was enabled as part of the setup process and will be readily available when it is needed.
83-
3. [Nextflow TransPi Package](https://github.com/palmuc/TransPi): The rest of the software is all downloaded as part of the TransPi package. TransPi is a Nextflow pipeline that carries out many of the standard steps required for transcriptome assembly and annotation. The original TransPi is available from this GitHub [link](https://github.com/palmuc/TransPi). We have made various alterations to the TransPi package and so the TransPi files you will be using throughout this module will be our own altered version.
84-
8536
## **Workflow Diagrams**
8637

8738
![transpi workflow](images/transpi_workflow.png)
