Add files via upload

rosscampbellNIH · web-flow · commit ef55c25d0afa · 2023-11-30T15:51:32.000-05:00
Add bonus notebook to run new datasets.
diff --git a/Bonus_Notebook_new_data.ipynb b/Bonus_Notebook_new_data.ipynb
@@ -0,0 +1,240 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "6eb2c6fb-13d9-4461-ad74-44262079211c",
+   "metadata": {},
+   "source": [
+    "# MDIBL Transcriptome Assembly Learning Module\n",
+    "# Bonus notebook: Using TransPi on a new dataset"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c38bba56-40d9-4ca4-b58b-b9733b424b1f",
+   "metadata": {},
+   "source": [
+    "In this notebook, we explore how to run this module with a new dataset. These submodules provide a framework for running a rigorous and scalable transcriptome assembly, but there are some considerations to address before you run it with your own data. We walk through that process here so that you can take these notebooks to your research group and use them for your own analysis."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "80044322-a021-4fdc-ad83-504961bd1919",
+   "metadata": {},
+   "source": [
+    "The data we are using here comes from the NCBI Sequence Read Archive (SRA). In this example, we are using data from an experiment that compared RNA sequences in honeybees with and without viral infections. The BioProject ID is [PRJNA274674](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA274674). This experiment includes 6 RNA-seq samples and 2 methylation-seq samples. We are only considering the RNA-seq data here. Additionally, we have subsampled the reads to about 2 million in total across all of the samples. In a real analysis this would not be a good idea, but to keep costs and runtimes low we use the down-sampled files in this demonstration. If you want to explore the full dataset, we recommend pulling the fastq files using the [STRIDES tutorial on SRA downloads](https://github.com/STRIDES/NIHCloudLabGCP/blob/main/tutorials/notebooks/SRADownload/SRA-Download.ipynb). As with the original example in this module, we have concatenated all 6 files into one set of combined fastq files called apis_joined_R{1,2}.fastq.gz. We have stored the subsampled fastq files in this module's cloud storage bucket."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "dcf2a2d0-bc91-4a2a-9db0-62f1eee91f92",
+   "metadata": {},
+   "source": [
+    "Before we start any analysis, let's set up the environment just as we did in Submodule_01 and Submodule_02: move to the correct directory and install the software."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "795e3329-21d8-4de0-91c4-58c81371a712",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "%cd /home/jupyter"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c9701541-9c05-4a02-abe7-e1126826efe3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pwd"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2d0c3012-27ca-4eb7-a106-6532f62cbc82",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Update Java\n",
+    "!sudo apt-get update\n",
+    "!sudo apt-get install default-jdk -y\n",
+    "!java -version"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3d6ffb05-19bb-42c4-a628-6d874c5d3517",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# install mamba and dependencies\n",
+    "!curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh\n",
+    "!bash Mambaforge-$(uname)-$(uname -m).sh -b -p $HOME/mambaforge\n",
+    "!~/mambaforge/bin/mamba install -c bioconda sra-tools perl-dbd-sqlite perl-dbi -y"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "21ae7ca3-c9e1-4160-a29f-37a53b8265ab",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#install Nextflow\n",
+    "!curl https://get.nextflow.io | bash\n",
+    "!chmod +x nextflow\n",
+    "!./nextflow self-update"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f7d6f314-9af1-428f-85ea-6d02e527a6a1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Copy the software from the storage bucket\n",
+    "!gsutil -m cp -r gs://nigms-sandbox/nosi-inbremaine-storage/TransPi ./"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e5bdc3fa-7b78-4329-ae17-49ffce2085bb",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Copy the data from the storage bucket\n",
+    "!gsutil -m cp -r gs://nigms-sandbox/nosi-inbremaine-storage/resources ./"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "65350b31-588b-42d9-90b4-849a523be021",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Make the TransPi scripts executable\n",
+    "!chmod -R +x ./TransPi/bin"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "15d79bb0-9f4c-469a-b2e6-3379e68f8f73",
+   "metadata": {},
+   "source": [
+    "Let's have a look at what we've downloaded to make sure it's there."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9c5606e2-5c0c-423c-9e32-40bd550c19cb",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "!ls ./resources/seq2"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "08ae572a-fe6d-4852-a11c-5a2449c1d6b2",
+   "metadata": {},
+   "source": [
+    "You should see the apis_joined fastq files alongside the others that we used in the previous submodules. Now let's adjust the workflow to run on them."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "331a3857-7734-41e4-819a-de3603b9c95b",
+   "metadata": {},
+   "source": [
+    "One of the great benefits of using a workflow manager like Nextflow is that it allows easy swapping of input samples without drastic changes to the code. In the true spirit of reproducible workflows, the only change needed to run the apis samples is to edit the `reads` line in the `params` section of the `nextflow.config` file so that it points to the new reads location. We show this below."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c22622e3-56a7-42bb-9884-abd39c72d6e3",
+   "metadata": {},
+   "source": [
+    "```\n",
+    "// Directory for reads\n",
+    "reads=\"/home/jupyter/resources/seq2/apis_joined*R[1,2].fastq.gz\"\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fd94aacc-2fb8-4a99-8378-2883b723253a",
+   "metadata": {},
+   "source": [
+    "After this change, you should be able to run the same Nextflow command as you did in Submodule_02 and everything will progress automatically."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "24663040-553b-4434-83a8-93fb9d2fd58f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!NXF_VER=22.10.1 ./nextflow run ./TransPi/TransPi.nf \\\n",
+    "-profile docker --k 17,25,43 --maxReadLen 50 --all"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "57dc8622-fb61-4ce6-ad40-48fc710a4713",
+   "metadata": {},
+   "source": [
+    "With the subsampled reads, the assembly should complete in about 2 hours on an n1-highmem-16 machine."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a822011a-4d77-4d99-8124-67ecb8f3dacf",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "environment": {
+   "kernel": "python3",
+   "name": "common-cpu.m113",
+   "type": "gcloud",
+   "uri": "gcr.io/deeplearning-platform-release/base-cpu:m113"
+  },
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.13"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}