diff --git a/docs/_toc.yml b/docs/_toc.yml index b3ba4c8..1019ace 100644 --- a/docs/_toc.yml +++ b/docs/_toc.yml @@ -58,6 +58,12 @@ parts: - file: adir1/cloud - file: adir1/download - file: adir1/api + - caption: Anopheles darlingi + chapters: + - file: adar1/adar1.0 + - file: adar1/cloud + - file: adar1/download + - file: adar1/api - caption: Partners and community chapters: - file: studies-ag1000g diff --git a/docs/adar1/adar1.0.ipynb b/docs/adar1/adar1.0.ipynb new file mode 100644 index 0000000..1d5103e --- /dev/null +++ b/docs/adar1/adar1.0.ipynb @@ -0,0 +1,841 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "LBNBl2exUYWu" + }, + "source": [ + "# Adar1.0 (Vector Observatory - _Anopheles darlingi_ Phase 1 Data Release)\n", + "\n", + "The **[Adar1.0](adar1.0): _Anopheles darlingi_ data resource** contains single nucleotide polymorphism (SNP) calls from whole-genome sequencing of 1094 mosquitoes, part of the [Population genomics of Anopheles darlingi, the principal South American malaria vector mosquito](https://www.science.org/doi/10.1126/science.adw9761) study. These data were integrated as part of the [MalariaGEN Vector Observatory Anopheles darlingi Genomic Surveillance Project](https://www.malariagen.net/project/anopheles-darlingi-genomic-surveillance-project).\n", + "\n", + "_Anopheles darlingi_ is the primary human malaria vector species in South America and plays a key role in transmitting _Plasmodium_ parasites in the Amazon Basin. This project was established to investigate whether key characteristics observed in the malaria vectors _An. gambiae_ and _An. funestus_, such as complex taxa, high genetic diversity and distinct evolutionary histories, are also observed in _An. darlingi_. This resource features whole-genome sequence data that can be used to survey genetic diversity, population structure and evolution of _An. darlingi_, and to establish a foundation for ongoing genomic surveillance of _An.
darlingi_ populations.\n", + "\n", + "Researchers from the Broad Institute of MIT and Harvard have generated whole-genome sequence data for _An. darlingi_ individuals from six countries, forming the basis for the first large open data resource on the main malaria vector in the Amazon Basin, or on any Neotropical mosquito.\n", + "\n", + "This page introduces the open data resources released as part of the first phase of the Vector Observatory _Anopheles darlingi_ Genomic Surveillance Project, and covers the `Adar1.0` _Anopheles darlingi_ data release.\n", + "\n", + "If you have any questions about this guide or how to use the data, please [start a new discussion](https://github.com/malariagen/vector-public-data/discussions/new) on the malariagen/vector-public-data repo on GitHub. If you find any bugs, please [raise an issue](https://github.com/malariagen/vector-public-data/issues/new/choose)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kJqs4cXppk8j" + }, + "source": [ + "## Terms of use\n", + "\n", + "Data from this project will be made publicly available before journal publication, subject to the following publication embargo: unless otherwise stated, analyses of project data are ongoing and publications are in preparation by project partners, and it is not permitted to use project data for publication (including any type of communication with the general public) without prior permission from the originating partner studies.
The publication embargo will expire 24 months after the data is integrated into the Malaria Genome Vector Observatory data repository, or earlier, if the project partner agrees to remove the embargo before the expiry date.\n", + "\n", + "Although malaria is generally an endemic rather than an epidemic disease, and the focus of this project is on surveillance of disease vectors rather than pathogens, our data terms of use build on MalariaGEN's approach to data sharing, and adopt norms which have been established for rapid sharing of pathogen genomic data during disease outbreaks. The primary rationale for this approach is that malaria remains a public health emergency, where ethically appropriate and rapid sharing of genomic surveillance data can help to detect and respond to biological threats such as new forms of insecticide resistance, and to adapt malaria vector control strategies to different settings and changing circumstances.\n", + "\n", + "The publication embargo for all data in this release will expire on the **26th of March 2026**.
\n", + "\n", + "If you have any questions about the terms of use, please email [support@malariagen.net](mailto:support@malariagen.net)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iNSicUCtpk8j" + }, + "source": [ + "## Partner studies\n", + "\n", + "- [1357-VO-BR-SALLUM-VMF00326](https://www.malariagen.net/network/where-we-work/1357-vo-br-sallum/) - _Anopheles darlingi_ vector surveillance in Brazil.\n", + "\n", + "- [1358-VO-GF-GENDRIN-VMF00327](https://www.malariagen.net/network/where-we-work/1358-vo-gf-gendrin/) - _Anopheles darlingi_ vector surveillance in French Guiana.\n", + "\n", + "- [1359-VO-GY-NILES-ROBIN-VMF00328](https://www.malariagen.net/network/where-we-work/1359-vo-gy-niles-robin/) - _Anopheles darlingi_ vector surveillance in Guyana.\n", + "\n", + "- [1360-VO-PE-GAMBOA-VMF00329](https://www.malariagen.net/network/where-we-work/1360-vo-pe-gamboa/) - _Anopheles darlingi_ vector surveillance in Peru.\n", + "\n", + "- [1361-VO-VE-GRILLET-VMF00330](https://www.malariagen.net/network/where-we-work/1361-vo-ve-grillet/) - _Anopheles darlingi_ vector surveillance in Venezuela.\n", + "\n", + "- [1362-VO-CO-QUINONES-VMF00331](https://www.malariagen.net/partner_study/1362-vo-co-quinones/) - _Anopheles darlingi_ vector surveillance in Colombia." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5RHbe7N6pk8k" + }, + "source": [ + "## Whole-genome sequencing and variant calling\n", + "\n", + "All samples in `Adar1.0` have been sequenced individually to high coverage using Illumina technology at the Broad Institute. These sequence data have then been analysed to identify genetic variants such as single nucleotide polymorphisms (SNPs) and indels. After variant calling, both the samples and the variants have been through a range of quality control analyses, to ensure the data are of high quality. 
Both the raw sequence data and the curated variant calls are openly available for download and analysis at NCBI SRA (BioProject PRJNA1169887). The analysis-ready data are also available by following this guide. More details on methods can be found in the Supplementary Materials - Materials and Methods of [Population genomics of Anopheles darlingi, the principal South American malaria vector mosquito](https://www.science.org/doi/10.1126/science.adw9761)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Quality control\n", + "\n", + "### SNP filtering\n", + "\n", + "SNPs were hard-filtered with QD < 2, FS > 60, ReadPosRankSum < -8, QUAL < 30, SOR > 3, MQ < 40, and/or MQRankSum < -12.5. Indels were hard-filtered with QD < 2, QUAL < 30, FS > 200, and/or ReadPosRankSum > -20.\n", + "\n", + "### Inaccessible sites\n", + "\n", + "The VCFs did not retain invariant sites, so the inaccessible portion of the genome was defined as any 1 kb window with <5 segregating variants, a category which would be rare if coverage were normally distributed.\n", + "\n", + "### Linkage disequilibrium\n", + "\n", + "To assess linkage disequilibrium decay, variants outside of inversions with minor allele frequency ≥ 10% in each examined population were considered. The researchers at the Broad Institute calculated linkage disequilibrium with the snpgdsLDMat function in the SNPRelate v1.40.0 package with method=\"r\", squaring the results to generate r², which they corrected for sample size by subtracting the reciprocal of the sample size. They performed statistical tests and data visualization in R v. 4.4.2 using scripts available at https://doi.org/10.5281/zenodo.17652650.\n", + "\n", + "### Sample filtering\n", + "\n", + "The researchers at the Broad Institute filtered samples that they considered to be contaminated or too closely related for their analyses.
However, data for all samples that were whole-genome sequenced are made available as part of `Adar1.0`.\n", + "\n", + "Further details can be found in the Supplementary Materials - Materials and Methods of [Population genomics of Anopheles darlingi, the principal South American malaria vector mosquito](https://www.science.org/doi/10.1126/science.adw9761)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9Hfchko2pk8l" + }, + "source": [ + "## Data hosting\n", + "\n", + "SNP data from `Adar1.0` are hosted by several different services.\n", + "\n", + "The SNP data have been uploaded to Google Cloud, and can be analysed directly within the cloud without having to download or copy any data, including via free interactive computing services such as [Google Colab](https://colab.research.google.com/). Further information about analysing these data in the cloud is provided in the [cloud data access guide](cloud)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lTJ_EnvOpk8l" + }, + "source": [ + "## Sample sets\n", + "\n", + "The samples included in `Adar1.0` have been organised into 6 sample sets.\n", + "\n", + "Each sample set corresponds to a set of mosquito specimens from a contributing study. Study details can be found on the partner study webpages listed above."
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2026-04-18T15:25:04.765659Z", + "iopub.status.busy": "2026-04-18T15:25:04.765491Z", + "iopub.status.idle": "2026-04-18T15:25:04.768673Z", + "shell.execute_reply": "2026-04-18T15:25:04.768095Z", + "shell.execute_reply.started": "2026-04-18T15:25:04.765640Z" + }, + "id": "hGA4d7Yrpk8m", + "outputId": "c29827c1-0361-4926-c227-8f6e76c2a497", + "tags": [ + "remove-input" + ] + }, + "outputs": [], + "source": [ + "%pip install -qq malariagen_data" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "execution": { + "iopub.execute_input": "2026-04-18T15:25:05.312906Z", + "iopub.status.busy": "2026-04-18T15:25:05.312632Z", + "iopub.status.idle": "2026-04-18T15:25:08.573798Z", + "shell.execute_reply": "2026-04-18T15:25:08.573228Z", + "shell.execute_reply.started": "2026-04-18T15:25:05.312882Z" + }, + "id": "AnmzLmEgpk8n", + "tags": [ + "remove-input" + ] + }, + "outputs": [ + { + "data": { + "application/javascript": [ + "'use strict';\n", + "(function(root) {\n", + " function now() {\n", + " return new Date();\n", + " }\n", + "\n", + " const force = true;\n", + "\n", + " if (typeof root._bokeh_onload_callbacks === \"undefined\" || force === true) {\n", + " root._bokeh_onload_callbacks = [];\n", + " root._bokeh_is_loading = undefined;\n", + " }\n", + "\n", + "const JS_MIME_TYPE = 'application/javascript';\n", + " const HTML_MIME_TYPE = 'text/html';\n", + " const EXEC_MIME_TYPE = 'application/vnd.bokehjs_exec.v0+json';\n", + " const CLASS_NAME = 'output_bokeh rendered_html';\n", + "\n", + " /**\n", + " * Render data to the DOM node\n", + " */\n", + " function render(props, node) {\n", + " const script = document.createElement(\"script\");\n", + " node.appendChild(script);\n", + " }\n", + "\n", + " /**\n", + " * Handle when an output is cleared or removed\n", + " */\n", + " 
function handleClearOutput(event, handle) {\n", + " function drop(id) {\n", + " const view = Bokeh.index.get_by_id(id)\n", + " if (view != null) {\n", + " view.model.document.clear()\n", + " Bokeh.index.delete(view)\n", + " }\n", + " }\n", + "\n", + " const cell = handle.cell;\n", + "\n", + " const id = cell.output_area._bokeh_element_id;\n", + " const server_id = cell.output_area._bokeh_server_id;\n", + "\n", + " // Clean up Bokeh references\n", + " if (id != null) {\n", + " drop(id)\n", + " }\n", + "\n", + " if (server_id !== undefined) {\n", + " // Clean up Bokeh references\n", + " const cmd_clean = \"from bokeh.io.state import curstate; print(curstate().uuid_to_server['\" + server_id + \"'].get_sessions()[0].document.roots[0]._id)\";\n", + " cell.notebook.kernel.execute(cmd_clean, {\n", + " iopub: {\n", + " output: function(msg) {\n", + " const id = msg.content.text.trim()\n", + " drop(id)\n", + " }\n", + " }\n", + " });\n", + " // Destroy server and session\n", + " const cmd_destroy = \"import bokeh.io.notebook as ion; ion.destroy_server('\" + server_id + \"')\";\n", + " cell.notebook.kernel.execute(cmd_destroy);\n", + " }\n", + " }\n", + "\n", + " /**\n", + " * Handle when a new output is added\n", + " */\n", + " function handleAddOutput(event, handle) {\n", + " const output_area = handle.output_area;\n", + " const output = handle.output;\n", + "\n", + " // limit handleAddOutput to display_data with EXEC_MIME_TYPE content only\n", + " if ((output.output_type != \"display_data\") || (!Object.prototype.hasOwnProperty.call(output.data, EXEC_MIME_TYPE))) {\n", + " return\n", + " }\n", + "\n", + " const toinsert = output_area.element.find(\".\" + CLASS_NAME.split(' ')[0]);\n", + "\n", + " if (output.metadata[EXEC_MIME_TYPE][\"id\"] !== undefined) {\n", + " toinsert[toinsert.length - 1].firstChild.textContent = output.data[JS_MIME_TYPE];\n", + " // store reference to embed id on output_area\n", + " output_area._bokeh_element_id = 
output.metadata[EXEC_MIME_TYPE][\"id\"];\n", + " }\n", + " if (output.metadata[EXEC_MIME_TYPE][\"server_id\"] !== undefined) {\n", + " const bk_div = document.createElement(\"div\");\n", + " bk_div.innerHTML = output.data[HTML_MIME_TYPE];\n", + " const script_attrs = bk_div.children[0].attributes;\n", + " for (let i = 0; i < script_attrs.length; i++) {\n", + " toinsert[toinsert.length - 1].firstChild.setAttribute(script_attrs[i].name, script_attrs[i].value);\n", + " toinsert[toinsert.length - 1].firstChild.textContent = bk_div.children[0].textContent\n", + " }\n", + " // store reference to server id on output_area\n", + " output_area._bokeh_server_id = output.metadata[EXEC_MIME_TYPE][\"server_id\"];\n", + " }\n", + " }\n", + "\n", + " function register_renderer(events, OutputArea) {\n", + "\n", + " function append_mime(data, metadata, element) {\n", + " // create a DOM node to render to\n", + " const toinsert = this.create_output_subarea(\n", + " metadata,\n", + " CLASS_NAME,\n", + " EXEC_MIME_TYPE\n", + " );\n", + " this.keyboard_manager.register_events(toinsert);\n", + " // Render to node\n", + " const props = {data: data, metadata: metadata[EXEC_MIME_TYPE]};\n", + " render(props, toinsert[toinsert.length - 1]);\n", + " element.append(toinsert);\n", + " return toinsert\n", + " }\n", + "\n", + " /* Handle when an output is cleared or removed */\n", + " events.on('clear_output.CodeCell', handleClearOutput);\n", + " events.on('delete.Cell', handleClearOutput);\n", + "\n", + " /* Handle when a new output is added */\n", + " events.on('output_added.OutputArea', handleAddOutput);\n", + "\n", + " /**\n", + " * Register the mime type and append_mime function with output_area\n", + " */\n", + " OutputArea.prototype.register_mime_type(EXEC_MIME_TYPE, append_mime, {\n", + " /* Is output safe? 
*/\n", + " safe: true,\n", + " /* Index of renderer in `output_area.display_order` */\n", + " index: 0\n", + " });\n", + " }\n", + "\n", + " // register the mime type if in Jupyter Notebook environment and previously unregistered\n", + " if (root.Jupyter !== undefined) {\n", + " const events = require('base/js/events');\n", + " const OutputArea = require('notebook/js/outputarea').OutputArea;\n", + "\n", + " if (OutputArea.prototype.mime_types().indexOf(EXEC_MIME_TYPE) == -1) {\n", + " register_renderer(events, OutputArea);\n", + " }\n", + " }\n", + " if (typeof (root._bokeh_timeout) === \"undefined\" || force === true) {\n", + " root._bokeh_timeout = Date.now() + 5000;\n", + " root._bokeh_failed_load = false;\n", + " }\n", + "\n", + " const NB_LOAD_WARNING = {'data': {'text/html':\n", + " \"
\\n\"+\n", + " \"

\\n\"+\n", + " \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n", + " \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n", + " \"

\\n\"+\n", + " \"\\n\"+\n", + " \"\\n\"+\n", + " \"from bokeh.resources import INLINE\\n\"+\n", + " \"output_notebook(resources=INLINE)\\n\"+\n", + " \"\\n\"+\n", + " \"
\"}};\n", + "\n", + " function display_loaded(error = null) {\n", + " const el = document.getElementById(null);\n", + " if (el != null) {\n", + " const html = (() => {\n", + " if (typeof root.Bokeh === \"undefined\") {\n", + " if (error == null) {\n", + " return \"BokehJS is loading ...\";\n", + " } else {\n", + " return \"BokehJS failed to load.\";\n", + " }\n", + " } else {\n", + " const prefix = `BokehJS ${root.Bokeh.version}`;\n", + " if (error == null) {\n", + " return `${prefix} successfully loaded.`;\n", + " } else {\n", + " return `${prefix} encountered errors while loading and may not function as expected.`;\n", + " }\n", + " }\n", + " })();\n", + " el.innerHTML = html;\n", + "\n", + " if (error != null) {\n", + " const wrapper = document.createElement(\"div\");\n", + " wrapper.style.overflow = \"auto\";\n", + " wrapper.style.height = \"5em\";\n", + " wrapper.style.resize = \"vertical\";\n", + " const content = document.createElement(\"div\");\n", + " content.style.fontFamily = \"monospace\";\n", + " content.style.whiteSpace = \"pre-wrap\";\n", + " content.style.backgroundColor = \"rgb(255, 221, 221)\";\n", + " content.textContent = error.stack ?? 
error.toString();\n", + " wrapper.append(content);\n", + " el.append(wrapper);\n", + " }\n", + " } else if (Date.now() < root._bokeh_timeout) {\n", + " setTimeout(() => display_loaded(error), 100);\n", + " }\n", + " }\n", + "\n", + " function run_callbacks() {\n", + " try {\n", + " root._bokeh_onload_callbacks.forEach(function(callback) {\n", + " if (callback != null)\n", + " callback();\n", + " });\n", + " } finally {\n", + " delete root._bokeh_onload_callbacks\n", + " }\n", + " console.debug(\"Bokeh: all callbacks have finished\");\n", + " }\n", + "\n", + " function load_libs(css_urls, js_urls, callback) {\n", + " if (css_urls == null) css_urls = [];\n", + " if (js_urls == null) js_urls = [];\n", + "\n", + " root._bokeh_onload_callbacks.push(callback);\n", + " if (root._bokeh_is_loading > 0) {\n", + " console.debug(\"Bokeh: BokehJS is being loaded, scheduling callback at\", now());\n", + " return null;\n", + " }\n", + " if (js_urls == null || js_urls.length === 0) {\n", + " run_callbacks();\n", + " return null;\n", + " }\n", + " console.debug(\"Bokeh: BokehJS not loaded, scheduling load and callback at\", now());\n", + " root._bokeh_is_loading = css_urls.length + js_urls.length;\n", + "\n", + " function on_load() {\n", + " root._bokeh_is_loading--;\n", + " if (root._bokeh_is_loading === 0) {\n", + " console.debug(\"Bokeh: all BokehJS libraries/stylesheets loaded\");\n", + " run_callbacks()\n", + " }\n", + " }\n", + "\n", + " function on_error(url) {\n", + " console.error(\"failed to load \" + url);\n", + " }\n", + "\n", + " for (let i = 0; i < css_urls.length; i++) {\n", + " const url = css_urls[i];\n", + " const element = document.createElement(\"link\");\n", + " element.onload = on_load;\n", + " element.onerror = on_error.bind(null, url);\n", + " element.rel = \"stylesheet\";\n", + " element.type = \"text/css\";\n", + " element.href = url;\n", + " console.debug(\"Bokeh: injecting link tag for BokehJS stylesheet: \", url);\n", + " 
document.body.appendChild(element);\n", + " }\n", + "\n", + " for (let i = 0; i < js_urls.length; i++) {\n", + " const url = js_urls[i];\n", + " const element = document.createElement('script');\n", + " element.onload = on_load;\n", + " element.onerror = on_error.bind(null, url);\n", + " element.async = false;\n", + " element.src = url;\n", + " console.debug(\"Bokeh: injecting script tag for BokehJS library: \", url);\n", + " document.head.appendChild(element);\n", + " }\n", + " };\n", + "\n", + " function inject_raw_css(css) {\n", + " const element = document.createElement(\"style\");\n", + " element.appendChild(document.createTextNode(css));\n", + " document.body.appendChild(element);\n", + " }\n", + "\n", + " const js_urls = [\"https://cdn.bokeh.org/bokeh/release/bokeh-3.6.3.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-gl-3.6.3.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-widgets-3.6.3.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-tables-3.6.3.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-mathjax-3.6.3.min.js\"];\n", + " const css_urls = [];\n", + "\n", + " const inline_js = [ function(Bokeh) {\n", + " Bokeh.set_log_level(\"info\");\n", + " },\n", + "function(Bokeh) {\n", + " }\n", + " ];\n", + "\n", + " function run_inline_js() {\n", + " if (root.Bokeh !== undefined || force === true) {\n", + " try {\n", + " for (let i = 0; i < inline_js.length; i++) {\n", + " inline_js[i].call(root, root.Bokeh);\n", + " }\n", + "\n", + " } catch (error) {throw error;\n", + " }} else if (Date.now() < root._bokeh_timeout) {\n", + " setTimeout(run_inline_js, 100);\n", + " } else if (!root._bokeh_failed_load) {\n", + " console.log(\"Bokeh: BokehJS failed to load within specified timeout.\");\n", + " root._bokeh_failed_load = true;\n", + " } else if (force !== true) {\n", + " const cell = $(document.getElementById(null)).parents('.cell').data().cell;\n", + " cell.output_area.append_execute_result(NB_LOAD_WARNING)\n", + " }\n", + " }\n", + "\n", + " 
if (root._bokeh_is_loading === 0) {\n", + " console.debug(\"Bokeh: BokehJS loaded, going straight to plotting\");\n", + " run_inline_js();\n", + " } else {\n", + " load_libs(css_urls, js_urls, function() {\n", + " console.debug(\"Bokeh: BokehJS plotting callback run at\", now());\n", + " run_inline_js();\n", + " });\n", + " }\n", + "}(window));" + ], + "application/vnd.bokehjs_load.v0+json": "'use strict';\n(function(root) {\n function now() {\n return new Date();\n }\n\n const force = true;\n\n if (typeof root._bokeh_onload_callbacks === \"undefined\" || force === true) {\n root._bokeh_onload_callbacks = [];\n root._bokeh_is_loading = undefined;\n }\n\n\n if (typeof (root._bokeh_timeout) === \"undefined\" || force === true) {\n root._bokeh_timeout = Date.now() + 5000;\n root._bokeh_failed_load = false;\n }\n\n const NB_LOAD_WARNING = {'data': {'text/html':\n \"
\\n\"+\n \"

\\n\"+\n \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n \"

\\n\"+\n \"\\n\"+\n \"\\n\"+\n \"from bokeh.resources import INLINE\\n\"+\n \"output_notebook(resources=INLINE)\\n\"+\n \"\\n\"+\n \"
\"}};\n\n function display_loaded(error = null) {\n const el = document.getElementById(null);\n if (el != null) {\n const html = (() => {\n if (typeof root.Bokeh === \"undefined\") {\n if (error == null) {\n return \"BokehJS is loading ...\";\n } else {\n return \"BokehJS failed to load.\";\n }\n } else {\n const prefix = `BokehJS ${root.Bokeh.version}`;\n if (error == null) {\n return `${prefix} successfully loaded.`;\n } else {\n return `${prefix} encountered errors while loading and may not function as expected.`;\n }\n }\n })();\n el.innerHTML = html;\n\n if (error != null) {\n const wrapper = document.createElement(\"div\");\n wrapper.style.overflow = \"auto\";\n wrapper.style.height = \"5em\";\n wrapper.style.resize = \"vertical\";\n const content = document.createElement(\"div\");\n content.style.fontFamily = \"monospace\";\n content.style.whiteSpace = \"pre-wrap\";\n content.style.backgroundColor = \"rgb(255, 221, 221)\";\n content.textContent = error.stack ?? error.toString();\n wrapper.append(content);\n el.append(wrapper);\n }\n } else if (Date.now() < root._bokeh_timeout) {\n setTimeout(() => display_loaded(error), 100);\n }\n }\n\n function run_callbacks() {\n try {\n root._bokeh_onload_callbacks.forEach(function(callback) {\n if (callback != null)\n callback();\n });\n } finally {\n delete root._bokeh_onload_callbacks\n }\n console.debug(\"Bokeh: all callbacks have finished\");\n }\n\n function load_libs(css_urls, js_urls, callback) {\n if (css_urls == null) css_urls = [];\n if (js_urls == null) js_urls = [];\n\n root._bokeh_onload_callbacks.push(callback);\n if (root._bokeh_is_loading > 0) {\n console.debug(\"Bokeh: BokehJS is being loaded, scheduling callback at\", now());\n return null;\n }\n if (js_urls == null || js_urls.length === 0) {\n run_callbacks();\n return null;\n }\n console.debug(\"Bokeh: BokehJS not loaded, scheduling load and callback at\", now());\n root._bokeh_is_loading = css_urls.length + js_urls.length;\n\n function on_load() {\n 
root._bokeh_is_loading--;\n if (root._bokeh_is_loading === 0) {\n console.debug(\"Bokeh: all BokehJS libraries/stylesheets loaded\");\n run_callbacks()\n }\n }\n\n function on_error(url) {\n console.error(\"failed to load \" + url);\n }\n\n for (let i = 0; i < css_urls.length; i++) {\n const url = css_urls[i];\n const element = document.createElement(\"link\");\n element.onload = on_load;\n element.onerror = on_error.bind(null, url);\n element.rel = \"stylesheet\";\n element.type = \"text/css\";\n element.href = url;\n console.debug(\"Bokeh: injecting link tag for BokehJS stylesheet: \", url);\n document.body.appendChild(element);\n }\n\n for (let i = 0; i < js_urls.length; i++) {\n const url = js_urls[i];\n const element = document.createElement('script');\n element.onload = on_load;\n element.onerror = on_error.bind(null, url);\n element.async = false;\n element.src = url;\n console.debug(\"Bokeh: injecting script tag for BokehJS library: \", url);\n document.head.appendChild(element);\n }\n };\n\n function inject_raw_css(css) {\n const element = document.createElement(\"style\");\n element.appendChild(document.createTextNode(css));\n document.body.appendChild(element);\n }\n\n const js_urls = [\"https://cdn.bokeh.org/bokeh/release/bokeh-3.6.3.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-gl-3.6.3.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-widgets-3.6.3.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-tables-3.6.3.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-mathjax-3.6.3.min.js\"];\n const css_urls = [];\n\n const inline_js = [ function(Bokeh) {\n Bokeh.set_log_level(\"info\");\n },\nfunction(Bokeh) {\n }\n ];\n\n function run_inline_js() {\n if (root.Bokeh !== undefined || force === true) {\n try {\n for (let i = 0; i < inline_js.length; i++) {\n inline_js[i].call(root, root.Bokeh);\n }\n\n } catch (error) {throw error;\n }} else if (Date.now() < root._bokeh_timeout) {\n setTimeout(run_inline_js, 100);\n } else if 
(!root._bokeh_failed_load) {\n console.log(\"Bokeh: BokehJS failed to load within specified timeout.\");\n root._bokeh_failed_load = true;\n } else if (force !== true) {\n const cell = $(document.getElementById(null)).parents('.cell').data().cell;\n cell.output_area.append_execute_result(NB_LOAD_WARNING)\n }\n }\n\n if (root._bokeh_is_loading === 0) {\n console.debug(\"Bokeh: BokehJS loaded, going straight to plotting\");\n run_inline_js();\n } else {\n load_libs(css_urls, js_urls, function() {\n console.debug(\"Bokeh: BokehJS plotting callback run at\", now());\n run_inline_js();\n });\n }\n}(window));" + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "import malariagen_data\n", + "adar1 = malariagen_data.Adar1()" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 927 + }, + "execution": { + "iopub.execute_input": "2026-04-18T15:25:10.302163Z", + "iopub.status.busy": "2026-04-18T15:25:10.301603Z", + "iopub.status.idle": "2026-04-18T15:25:10.436405Z", + "shell.execute_reply": "2026-04-18T15:25:10.435839Z", + "shell.execute_reply.started": "2026-04-18T15:25:10.302143Z" + }, + "id": "qsElasBepk8n", + "outputId": "4bf80a06-c2e8-4d2d-b4a6-99c8c66da7db", + "tags": [ + "remove-input" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
sample_setsample_count
study_id
1357-VO-BR-SALLUM1357-VO-BR-SALLUM-VMF00326272
1358-VO-GF-GENDRIN1358-VO-GF-GENDRIN-VMF00327139
1359-VO-GY-NILES-ROBIN1359-VO-GY-NILES-ROBIN-VMF0032818
1360-VO-PE-GAMBOA1360-VO-PE-GAMBOA-VMF0032989
1361-VO-VE-GRILLET1361-VO-VE-GRILLET-VMF00330126
1362-VO-CO-QUINONES1362-VO-CO-QUINONES-VMF00331449
\n", + "
" + ], + "text/plain": [ + " sample_set sample_count\n", + "study_id \n", + "1357-VO-BR-SALLUM 1357-VO-BR-SALLUM-VMF00326 272\n", + "1358-VO-GF-GENDRIN 1358-VO-GF-GENDRIN-VMF00327 139\n", + "1359-VO-GY-NILES-ROBIN 1359-VO-GY-NILES-ROBIN-VMF00328 18\n", + "1360-VO-PE-GAMBOA 1360-VO-PE-GAMBOA-VMF00329 89\n", + "1361-VO-VE-GRILLET 1361-VO-VE-GRILLET-VMF00330 126\n", + "1362-VO-CO-QUINONES 1362-VO-CO-QUINONES-VMF00331 449" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_sample_sets = adar1.sample_sets(release=\"1.0\")\n", + "df_sample_sets[['study_id','sample_set', 'sample_count']].set_index('study_id')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yJ16OQ0Hpk8o" + }, + "source": [ + "Here is a more detailed breakdown of the samples contained within this sample set, summarised by country, year of collection, and species:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "execution": { + "iopub.execute_input": "2026-04-18T15:25:13.606522Z", + "iopub.status.busy": "2026-04-18T15:25:13.606267Z", + "iopub.status.idle": "2026-04-18T15:25:14.275371Z", + "shell.execute_reply": "2026-04-18T15:25:14.274536Z", + "shell.execute_reply.started": "2026-04-18T15:25:13.606503Z" + }, + "id": "a1OMvuTxUWpJ", + "outputId": "9f872334-fd50-4649-990a-df60ea71c12c", + "tags": [ + "remove-input" + ] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " \r" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
taxondarlingi
study_idsample_setcountryyear
1357-VO-BR-SALLUM1357-VO-BR-SALLUM-VMF00326Brazil202145
2022222
20236
1358-VO-GF-GENDRIN1358-VO-GF-GENDRIN-VMF00327French Guiana2020139
1359-VO-GY-NILES-ROBIN1359-VO-GY-NILES-ROBIN-VMF00328Guyana202118
1360-VO-PE-GAMBOA1360-VO-PE-GAMBOA-VMF00329Peru201243
202246
1361-VO-VE-GRILLET1361-VO-VE-GRILLET-VMF00330Venezuela201621
201756
202349
1362-VO-CO-QUINONES1362-VO-CO-QUINONES-VMF00331Colombia2022449
\n", + "
" + ], + "text/plain": [ + "taxon darlingi\n", + "study_id sample_set country year \n", + "1357-VO-BR-SALLUM 1357-VO-BR-SALLUM-VMF00326 Brazil 2021 45\n", + " 2022 222\n", + " 2023 6\n", + "1358-VO-GF-GENDRIN 1358-VO-GF-GENDRIN-VMF00327 French Guiana 2020 139\n", + "1359-VO-GY-NILES-ROBIN 1359-VO-GY-NILES-ROBIN-VMF00328 Guyana 2021 18\n", + "1360-VO-PE-GAMBOA 1360-VO-PE-GAMBOA-VMF00329 Peru 2012 43\n", + " 2022 46\n", + "1361-VO-VE-GRILLET 1361-VO-VE-GRILLET-VMF00330 Venezuela 2016 21\n", + " 2017 56\n", + " 2023 49\n", + "1362-VO-CO-QUINONES 1362-VO-CO-QUINONES-VMF00331 Colombia 2022 449" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_samples = adar1.sample_metadata(sample_sets=\"1.0\")\n", + "df_summary = df_samples.pivot_table(\n", + " index=[\"study_id\",\"sample_set\", \"country\", \"year\"], \n", + " columns=[\"taxon\"],\n", + " values=\"sample_id\", \n", + " aggfunc=len,\n", + " fill_value=0)\n", + "df_summary" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dLiU0ulIpk8p" + }, + "source": [ + "Note that there can be multiple sampling sites represented within the same sample set." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OToX5vhfpk8p" + }, + "source": [ + "## Further reading\n", + "\n", + "We hope this page has provided a useful introduction to the `Adar1.0` data resource. If you would like to start working with these data, please visit the [cloud data access guide](cloud) or the [data download guide](download) or continue browsing the other documentation on this site.\n", + "\n", + "If you have any questions about the data and how to use them, please do get in touch by [starting a new discussion](https://github.com/malariagen/vector-data/discussions/new) on the malariagen/vector-data repository on GitHub." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "celltoolbar": "Tags", + "colab": { + "name": "Ag3.0-intro.ipynb", + "provenance": [] + }, + "environment": { + "kernel": "mgenv_7.2.0", + "name": "workbench-notebooks.m139", + "type": "gcloud", + "uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m139" + }, + "kernelspec": { + "display_name": "Python (mgenv_7.2.0)", + "language": "python", + "name": "mgenv_7.2.0" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.11" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/docs/adar1/api.md b/docs/adar1/api.md new file mode 100644 index 0000000..8c075f7 --- /dev/null +++ b/docs/adar1/api.md @@ -0,0 +1,3 @@ +# Adar1 API + +For documentation on functions in the [malariagen_data](https://github.com/malariagen/malariagen-data-python) Python package for accessing *Anopheles darlingi* data, please visit the [Adar1 API docs page](https://malariagen.github.io/malariagen-data-python/latest/Adar1.html). diff --git a/docs/adar1/cloud.ipynb b/docs/adar1/cloud.ipynb new file mode 100644 index 0000000..3c7905b --- /dev/null +++ b/docs/adar1/cloud.ipynb @@ -0,0 +1,5149 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "DZw8vyUJ0y5k" + }, + "source": [ + "# Adar1 cloud data access\n", + "\n", + "This notebook provides information about how to access data from the [MalariaGEN Vector Observatory Anopheles darlingi Genomic Surveillance Project](https://www.malariagen.net/project/anopheles-darlingi-genomic-surveillance-project), for *Anopheles darlingi* via Google Cloud. 
This includes sample metadata, raw sequence reads, sequence read alignments, and single nucleotide polymorphism (SNP) calls. \n", + "\n", + "This notebook illustrates how to read data directly from the cloud, without having to first download any data locally. This notebook can be run from any computer, but will work best when run from a compute node within Google Cloud, because it will be physically closer to the data and so data transfer is faster. For example, this notebook can be run via [Google Colab](https://colab.research.google.com/), which is a free interactive computing service running in the cloud.\n", + "\n", + "To launch this notebook in the cloud and run it for yourself, click the launch icon at the top of the page and select one of the cloud computing services available.\n", + "\n", + "## Data hosting\n", + "\n", + "All data required for this notebook are hosted on Google Cloud Storage (GCS). Data are hosted in the `vo_adar_release_master_us_central1` bucket, which is a single-region bucket located in the United States. All data hosted in GCS are publicly accessible and do not require any authentication to access. 
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Zn_-HkLIQH_0" + }, + "source": [ + "## Setup\n", + "\n", + "Running this notebook requires some Python packages to be installed:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2026-04-18T15:31:04.927908Z", + "iopub.status.busy": "2026-04-18T15:31:04.927664Z", + "iopub.status.idle": "2026-04-18T15:31:04.931099Z", + "shell.execute_reply": "2026-04-18T15:31:04.930577Z", + "shell.execute_reply.started": "2026-04-18T15:31:04.927886Z" + }, + "id": "wqHBq442QH_1", + "outputId": "1c1306a2-d6f1-46a2-ee4d-30b13dad9148", + "tags": [ + "hide-output" + ] + }, + "outputs": [], + "source": [ + "%pip install -q malariagen_data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To make accessing these data more convenient, we've created the [malariagen_data](https://github.com/malariagen/malariagen-data-python) Python package. This is experimental so please let us know if you find any bugs or have any suggestions. See the [Adar1.0 API docs](https://malariagen.github.io/malariagen-data-python/latest/Adar1.0.html) for documentation of all functions available from this package. \n", + "\n", + "Import other packages we'll need to use here." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "execution": { + "iopub.execute_input": "2026-04-18T15:31:08.582719Z", + "iopub.status.busy": "2026-04-18T15:31:08.582477Z", + "iopub.status.idle": "2026-04-18T15:31:11.009090Z", + "shell.execute_reply": "2026-04-18T15:31:11.008195Z", + "shell.execute_reply.started": "2026-04-18T15:31:08.582699Z" + }, + "id": "970klnG1eu8N", + "tags": [] + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "import dask\n", + "import dask.array as da\n", + "from dask.diagnostics.progress import ProgressBar\n", + "# silence some warnings\n", + "dask.config.set(**{'array.slicing.split_large_chunks': False})\n", + "import allel\n", + "import malariagen_data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jPqZ-LFPQH_2" + }, + "source": [ + "`Adar1` data access from Google Cloud is set up with the following code:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 190 + }, + "execution": { + "iopub.execute_input": "2026-04-18T15:31:11.011078Z", + "iopub.status.busy": "2026-04-18T15:31:11.010267Z", + "iopub.status.idle": "2026-04-18T15:31:11.734369Z", + "shell.execute_reply": "2026-04-18T15:31:11.733720Z", + "shell.execute_reply.started": "2026-04-18T15:31:11.011055Z" + }, + "id": "mIsSaTuOQH_2", + "outputId": "4facd5a9-6e43-460a-811c-30293568918e", + "tags": [] + }, + "outputs": [ + { + "data": { + "application/javascript": [ + "'use strict';\n", + "(function(root) {\n", + " function now() {\n", + " return new Date();\n", + " }\n", + "\n", + " const force = true;\n", + "\n", + " if (typeof root._bokeh_onload_callbacks === \"undefined\" || force === true) {\n", + " root._bokeh_onload_callbacks = [];\n", + " root._bokeh_is_loading = undefined;\n", + " }\n", + "\n", + "const JS_MIME_TYPE = 'application/javascript';\n", + " const HTML_MIME_TYPE = 'text/html';\n", + " const EXEC_MIME_TYPE = 
'application/vnd.bokehjs_exec.v0+json';\n", + " const CLASS_NAME = 'output_bokeh rendered_html';\n", + "\n", + " /**\n", + " * Render data to the DOM node\n", + " */\n", + " function render(props, node) {\n", + " const script = document.createElement(\"script\");\n", + " node.appendChild(script);\n", + " }\n", + "\n", + " /**\n", + " * Handle when an output is cleared or removed\n", + " */\n", + " function handleClearOutput(event, handle) {\n", + " function drop(id) {\n", + " const view = Bokeh.index.get_by_id(id)\n", + " if (view != null) {\n", + " view.model.document.clear()\n", + " Bokeh.index.delete(view)\n", + " }\n", + " }\n", + "\n", + " const cell = handle.cell;\n", + "\n", + " const id = cell.output_area._bokeh_element_id;\n", + " const server_id = cell.output_area._bokeh_server_id;\n", + "\n", + " // Clean up Bokeh references\n", + " if (id != null) {\n", + " drop(id)\n", + " }\n", + "\n", + " if (server_id !== undefined) {\n", + " // Clean up Bokeh references\n", + " const cmd_clean = \"from bokeh.io.state import curstate; print(curstate().uuid_to_server['\" + server_id + \"'].get_sessions()[0].document.roots[0]._id)\";\n", + " cell.notebook.kernel.execute(cmd_clean, {\n", + " iopub: {\n", + " output: function(msg) {\n", + " const id = msg.content.text.trim()\n", + " drop(id)\n", + " }\n", + " }\n", + " });\n", + " // Destroy server and session\n", + " const cmd_destroy = \"import bokeh.io.notebook as ion; ion.destroy_server('\" + server_id + \"')\";\n", + " cell.notebook.kernel.execute(cmd_destroy);\n", + " }\n", + " }\n", + "\n", + " /**\n", + " * Handle when a new output is added\n", + " */\n", + " function handleAddOutput(event, handle) {\n", + " const output_area = handle.output_area;\n", + " const output = handle.output;\n", + "\n", + " // limit handleAddOutput to display_data with EXEC_MIME_TYPE content only\n", + " if ((output.output_type != \"display_data\") || (!Object.prototype.hasOwnProperty.call(output.data, EXEC_MIME_TYPE))) {\n", + " 
return\n", + " }\n", + "\n", + " const toinsert = output_area.element.find(\".\" + CLASS_NAME.split(' ')[0]);\n", + "\n", + " if (output.metadata[EXEC_MIME_TYPE][\"id\"] !== undefined) {\n", + " toinsert[toinsert.length - 1].firstChild.textContent = output.data[JS_MIME_TYPE];\n", + " // store reference to embed id on output_area\n", + " output_area._bokeh_element_id = output.metadata[EXEC_MIME_TYPE][\"id\"];\n", + " }\n", + " if (output.metadata[EXEC_MIME_TYPE][\"server_id\"] !== undefined) {\n", + " const bk_div = document.createElement(\"div\");\n", + " bk_div.innerHTML = output.data[HTML_MIME_TYPE];\n", + " const script_attrs = bk_div.children[0].attributes;\n", + " for (let i = 0; i < script_attrs.length; i++) {\n", + " toinsert[toinsert.length - 1].firstChild.setAttribute(script_attrs[i].name, script_attrs[i].value);\n", + " toinsert[toinsert.length - 1].firstChild.textContent = bk_div.children[0].textContent\n", + " }\n", + " // store reference to server id on output_area\n", + " output_area._bokeh_server_id = output.metadata[EXEC_MIME_TYPE][\"server_id\"];\n", + " }\n", + " }\n", + "\n", + " function register_renderer(events, OutputArea) {\n", + "\n", + " function append_mime(data, metadata, element) {\n", + " // create a DOM node to render to\n", + " const toinsert = this.create_output_subarea(\n", + " metadata,\n", + " CLASS_NAME,\n", + " EXEC_MIME_TYPE\n", + " );\n", + " this.keyboard_manager.register_events(toinsert);\n", + " // Render to node\n", + " const props = {data: data, metadata: metadata[EXEC_MIME_TYPE]};\n", + " render(props, toinsert[toinsert.length - 1]);\n", + " element.append(toinsert);\n", + " return toinsert\n", + " }\n", + "\n", + " /* Handle when an output is cleared or removed */\n", + " events.on('clear_output.CodeCell', handleClearOutput);\n", + " events.on('delete.Cell', handleClearOutput);\n", + "\n", + " /* Handle when a new output is added */\n", + " events.on('output_added.OutputArea', handleAddOutput);\n", + "\n", + " /**\n", + 
" * Register the mime type and append_mime function with output_area\n", + " */\n", + " OutputArea.prototype.register_mime_type(EXEC_MIME_TYPE, append_mime, {\n", + " /* Is output safe? */\n", + " safe: true,\n", + " /* Index of renderer in `output_area.display_order` */\n", + " index: 0\n", + " });\n", + " }\n", + "\n", + " // register the mime type if in Jupyter Notebook environment and previously unregistered\n", + " if (root.Jupyter !== undefined) {\n", + " const events = require('base/js/events');\n", + " const OutputArea = require('notebook/js/outputarea').OutputArea;\n", + "\n", + " if (OutputArea.prototype.mime_types().indexOf(EXEC_MIME_TYPE) == -1) {\n", + " register_renderer(events, OutputArea);\n", + " }\n", + " }\n", + " if (typeof (root._bokeh_timeout) === \"undefined\" || force === true) {\n", + " root._bokeh_timeout = Date.now() + 5000;\n", + " root._bokeh_failed_load = false;\n", + " }\n", + "\n", + " const NB_LOAD_WARNING = {'data': {'text/html':\n", + " \"
\\n\"+\n", + " \"

\\n\"+\n", + " \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n", + " \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n", + " \"

\\n\"+\n", + " \"\\n\"+\n", + " \"\\n\"+\n", + " \"from bokeh.resources import INLINE\\n\"+\n", + " \"output_notebook(resources=INLINE)\\n\"+\n", + " \"\\n\"+\n", + " \"
\"}};\n", + "\n", + " function display_loaded(error = null) {\n", + " const el = document.getElementById(null);\n", + " if (el != null) {\n", + " const html = (() => {\n", + " if (typeof root.Bokeh === \"undefined\") {\n", + " if (error == null) {\n", + " return \"BokehJS is loading ...\";\n", + " } else {\n", + " return \"BokehJS failed to load.\";\n", + " }\n", + " } else {\n", + " const prefix = `BokehJS ${root.Bokeh.version}`;\n", + " if (error == null) {\n", + " return `${prefix} successfully loaded.`;\n", + " } else {\n", + " return `${prefix} encountered errors while loading and may not function as expected.`;\n", + " }\n", + " }\n", + " })();\n", + " el.innerHTML = html;\n", + "\n", + " if (error != null) {\n", + " const wrapper = document.createElement(\"div\");\n", + " wrapper.style.overflow = \"auto\";\n", + " wrapper.style.height = \"5em\";\n", + " wrapper.style.resize = \"vertical\";\n", + " const content = document.createElement(\"div\");\n", + " content.style.fontFamily = \"monospace\";\n", + " content.style.whiteSpace = \"pre-wrap\";\n", + " content.style.backgroundColor = \"rgb(255, 221, 221)\";\n", + " content.textContent = error.stack ?? 
error.toString();\n", + " wrapper.append(content);\n", + " el.append(wrapper);\n", + " }\n", + " } else if (Date.now() < root._bokeh_timeout) {\n", + " setTimeout(() => display_loaded(error), 100);\n", + " }\n", + " }\n", + "\n", + " function run_callbacks() {\n", + " try {\n", + " root._bokeh_onload_callbacks.forEach(function(callback) {\n", + " if (callback != null)\n", + " callback();\n", + " });\n", + " } finally {\n", + " delete root._bokeh_onload_callbacks\n", + " }\n", + " console.debug(\"Bokeh: all callbacks have finished\");\n", + " }\n", + "\n", + " function load_libs(css_urls, js_urls, callback) {\n", + " if (css_urls == null) css_urls = [];\n", + " if (js_urls == null) js_urls = [];\n", + "\n", + " root._bokeh_onload_callbacks.push(callback);\n", + " if (root._bokeh_is_loading > 0) {\n", + " console.debug(\"Bokeh: BokehJS is being loaded, scheduling callback at\", now());\n", + " return null;\n", + " }\n", + " if (js_urls == null || js_urls.length === 0) {\n", + " run_callbacks();\n", + " return null;\n", + " }\n", + " console.debug(\"Bokeh: BokehJS not loaded, scheduling load and callback at\", now());\n", + " root._bokeh_is_loading = css_urls.length + js_urls.length;\n", + "\n", + " function on_load() {\n", + " root._bokeh_is_loading--;\n", + " if (root._bokeh_is_loading === 0) {\n", + " console.debug(\"Bokeh: all BokehJS libraries/stylesheets loaded\");\n", + " run_callbacks()\n", + " }\n", + " }\n", + "\n", + " function on_error(url) {\n", + " console.error(\"failed to load \" + url);\n", + " }\n", + "\n", + " for (let i = 0; i < css_urls.length; i++) {\n", + " const url = css_urls[i];\n", + " const element = document.createElement(\"link\");\n", + " element.onload = on_load;\n", + " element.onerror = on_error.bind(null, url);\n", + " element.rel = \"stylesheet\";\n", + " element.type = \"text/css\";\n", + " element.href = url;\n", + " console.debug(\"Bokeh: injecting link tag for BokehJS stylesheet: \", url);\n", + " 
document.body.appendChild(element);\n", + " }\n", + "\n", + " for (let i = 0; i < js_urls.length; i++) {\n", + " const url = js_urls[i];\n", + " const element = document.createElement('script');\n", + " element.onload = on_load;\n", + " element.onerror = on_error.bind(null, url);\n", + " element.async = false;\n", + " element.src = url;\n", + " console.debug(\"Bokeh: injecting script tag for BokehJS library: \", url);\n", + " document.head.appendChild(element);\n", + " }\n", + " };\n", + "\n", + " function inject_raw_css(css) {\n", + " const element = document.createElement(\"style\");\n", + " element.appendChild(document.createTextNode(css));\n", + " document.body.appendChild(element);\n", + " }\n", + "\n", + " const js_urls = [\"https://cdn.bokeh.org/bokeh/release/bokeh-3.6.3.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-gl-3.6.3.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-widgets-3.6.3.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-tables-3.6.3.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-mathjax-3.6.3.min.js\"];\n", + " const css_urls = [];\n", + "\n", + " const inline_js = [ function(Bokeh) {\n", + " Bokeh.set_log_level(\"info\");\n", + " },\n", + "function(Bokeh) {\n", + " }\n", + " ];\n", + "\n", + " function run_inline_js() {\n", + " if (root.Bokeh !== undefined || force === true) {\n", + " try {\n", + " for (let i = 0; i < inline_js.length; i++) {\n", + " inline_js[i].call(root, root.Bokeh);\n", + " }\n", + "\n", + " } catch (error) {throw error;\n", + " }} else if (Date.now() < root._bokeh_timeout) {\n", + " setTimeout(run_inline_js, 100);\n", + " } else if (!root._bokeh_failed_load) {\n", + " console.log(\"Bokeh: BokehJS failed to load within specified timeout.\");\n", + " root._bokeh_failed_load = true;\n", + " } else if (force !== true) {\n", + " const cell = $(document.getElementById(null)).parents('.cell').data().cell;\n", + " cell.output_area.append_execute_result(NB_LOAD_WARNING)\n", + " }\n", + " }\n", + "\n", + " 
if (root._bokeh_is_loading === 0) {\n", + " console.debug(\"Bokeh: BokehJS loaded, going straight to plotting\");\n", + " run_inline_js();\n", + " } else {\n", + " load_libs(css_urls, js_urls, function() {\n", + " console.debug(\"Bokeh: BokehJS plotting callback run at\", now());\n", + " run_inline_js();\n", + " });\n", + " }\n", + "}(window));" + ], + "application/vnd.bokehjs_load.v0+json": "'use strict';\n(function(root) {\n function now() {\n return new Date();\n }\n\n const force = true;\n\n if (typeof root._bokeh_onload_callbacks === \"undefined\" || force === true) {\n root._bokeh_onload_callbacks = [];\n root._bokeh_is_loading = undefined;\n }\n\n\n if (typeof (root._bokeh_timeout) === \"undefined\" || force === true) {\n root._bokeh_timeout = Date.now() + 5000;\n root._bokeh_failed_load = false;\n }\n\n const NB_LOAD_WARNING = {'data': {'text/html':\n \"
\\n\"+\n \"

\\n\"+\n \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n \"

\\n\"+\n \"\\n\"+\n \"\\n\"+\n \"from bokeh.resources import INLINE\\n\"+\n \"output_notebook(resources=INLINE)\\n\"+\n \"\\n\"+\n \"
\"}};\n\n function display_loaded(error = null) {\n const el = document.getElementById(null);\n if (el != null) {\n const html = (() => {\n if (typeof root.Bokeh === \"undefined\") {\n if (error == null) {\n return \"BokehJS is loading ...\";\n } else {\n return \"BokehJS failed to load.\";\n }\n } else {\n const prefix = `BokehJS ${root.Bokeh.version}`;\n if (error == null) {\n return `${prefix} successfully loaded.`;\n } else {\n return `${prefix} encountered errors while loading and may not function as expected.`;\n }\n }\n })();\n el.innerHTML = html;\n\n if (error != null) {\n const wrapper = document.createElement(\"div\");\n wrapper.style.overflow = \"auto\";\n wrapper.style.height = \"5em\";\n wrapper.style.resize = \"vertical\";\n const content = document.createElement(\"div\");\n content.style.fontFamily = \"monospace\";\n content.style.whiteSpace = \"pre-wrap\";\n content.style.backgroundColor = \"rgb(255, 221, 221)\";\n content.textContent = error.stack ?? error.toString();\n wrapper.append(content);\n el.append(wrapper);\n }\n } else if (Date.now() < root._bokeh_timeout) {\n setTimeout(() => display_loaded(error), 100);\n }\n }\n\n function run_callbacks() {\n try {\n root._bokeh_onload_callbacks.forEach(function(callback) {\n if (callback != null)\n callback();\n });\n } finally {\n delete root._bokeh_onload_callbacks\n }\n console.debug(\"Bokeh: all callbacks have finished\");\n }\n\n function load_libs(css_urls, js_urls, callback) {\n if (css_urls == null) css_urls = [];\n if (js_urls == null) js_urls = [];\n\n root._bokeh_onload_callbacks.push(callback);\n if (root._bokeh_is_loading > 0) {\n console.debug(\"Bokeh: BokehJS is being loaded, scheduling callback at\", now());\n return null;\n }\n if (js_urls == null || js_urls.length === 0) {\n run_callbacks();\n return null;\n }\n console.debug(\"Bokeh: BokehJS not loaded, scheduling load and callback at\", now());\n root._bokeh_is_loading = css_urls.length + js_urls.length;\n\n function on_load() {\n 
root._bokeh_is_loading--;\n if (root._bokeh_is_loading === 0) {\n console.debug(\"Bokeh: all BokehJS libraries/stylesheets loaded\");\n run_callbacks()\n }\n }\n\n function on_error(url) {\n console.error(\"failed to load \" + url);\n }\n\n for (let i = 0; i < css_urls.length; i++) {\n const url = css_urls[i];\n const element = document.createElement(\"link\");\n element.onload = on_load;\n element.onerror = on_error.bind(null, url);\n element.rel = \"stylesheet\";\n element.type = \"text/css\";\n element.href = url;\n console.debug(\"Bokeh: injecting link tag for BokehJS stylesheet: \", url);\n document.body.appendChild(element);\n }\n\n for (let i = 0; i < js_urls.length; i++) {\n const url = js_urls[i];\n const element = document.createElement('script');\n element.onload = on_load;\n element.onerror = on_error.bind(null, url);\n element.async = false;\n element.src = url;\n console.debug(\"Bokeh: injecting script tag for BokehJS library: \", url);\n document.head.appendChild(element);\n }\n };\n\n function inject_raw_css(css) {\n const element = document.createElement(\"style\");\n element.appendChild(document.createTextNode(css));\n document.body.appendChild(element);\n }\n\n const js_urls = [\"https://cdn.bokeh.org/bokeh/release/bokeh-3.6.3.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-gl-3.6.3.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-widgets-3.6.3.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-tables-3.6.3.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-mathjax-3.6.3.min.js\"];\n const css_urls = [];\n\n const inline_js = [ function(Bokeh) {\n Bokeh.set_log_level(\"info\");\n },\nfunction(Bokeh) {\n }\n ];\n\n function run_inline_js() {\n if (root.Bokeh !== undefined || force === true) {\n try {\n for (let i = 0; i < inline_js.length; i++) {\n inline_js[i].call(root, root.Bokeh);\n }\n\n } catch (error) {throw error;\n }} else if (Date.now() < root._bokeh_timeout) {\n setTimeout(run_inline_js, 100);\n } else if 
(!root._bokeh_failed_load) {\n console.log(\"Bokeh: BokehJS failed to load within specified timeout.\");\n root._bokeh_failed_load = true;\n } else if (force !== true) {\n const cell = $(document.getElementById(null)).parents('.cell').data().cell;\n cell.output_area.append_execute_result(NB_LOAD_WARNING)\n }\n }\n\n if (root._bokeh_is_loading === 0) {\n console.debug(\"Bokeh: BokehJS loaded, going straight to plotting\");\n run_inline_js();\n } else {\n load_libs(css_urls, js_urls, function() {\n console.debug(\"Bokeh: BokehJS plotting callback run at\", now());\n run_inline_js();\n });\n }\n}(window));" + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
MalariaGEN Adar1 API client
\n", + " Please note that data are subject to terms of use,\n", + " for more information see \n", + " the MalariaGEN website or contact support@malariagen.net.\n", + " See also the Adir1 API docs.\n", + "
\n", + " Storage URL\n", + " gs://vo_adar_release_master_us_central1
\n", + " Data releases available\n", + " 1.0
\n", + " Results cache\n", + " None
\n", + " Cohorts analysis\n", + " 20260303
\n", + " Site filters analysis\n", + " dt_20260301
\n", + " Software version\n", + " malariagen_data 15.6.0.post407+1430baf
\n", + " Client location\n", + " Iowa, United States (Google Cloud us-central1)
\n", + " Data filtered for unrestricted use only\n", + " False
\n", + " Data filtered for surveillance use only\n", + " False
\n", + " Relevant data releases\n", + " 1.0
\n", + " " + ], + "text/plain": [ + "\n", + "Storage URL : gs://vo_adar_release_master_us_central1\n", + "Data releases available : 1.0\n", + "Results cache : None\n", + "Cohorts analysis : 20260303\n", + "Site filters analysis : dt_20260301\n", + "Software version : malariagen_data 15.6.0.post407+1430baf\n", + "Client location : Iowa, United States (Google Cloud us-central1)\n", + "Data filtered to unrestricted use only: False\n", + "Data filtered to surveillance use only: False\n", + "Relevant data releases : 1.0\n", + "---\n", + "Please note that data are subject to terms of use,\n", + "for more information see https://www.malariagen.net/data\n", + "or contact support@malariagen.net. For API documentation see \n", + "https://malariagen.github.io/malariagen-data-python/v15.6.0.post407+1430baf/Adir1.html" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "adar1 = malariagen_data.Adar1()\n", + "adar1" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ITy4zIVoQH_2" + }, + "source": [ + "## Sample sets\n", + "\n", + "Data are organised into different releases. As an example, data in Adar1.0 are organised into 6 sample sets. Each of these sample sets corresponds to a set of mosquito specimens contributed by a collaborating study. 
Depending on your objectives, you may want to access data from only specific sample sets, or all sample sets.\n", + "\n", + "To see which sample sets are available, load the sample set manifest into a pandas dataframe:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 927 + }, + "execution": { + "iopub.execute_input": "2026-04-18T15:31:11.736128Z", + "iopub.status.busy": "2026-04-18T15:31:11.735097Z", + "iopub.status.idle": "2026-04-18T15:31:11.802547Z", + "shell.execute_reply": "2026-04-18T15:31:11.801994Z", + "shell.execute_reply.started": "2026-04-18T15:31:11.736105Z" + }, + "id": "b4ADQTOfQH_2", + "outputId": "f7c6d68b-053f-4698-8b6f-29720287c423", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
| sample_set | sample_count | study_id | study_url | terms_of_use_expiry_date | terms_of_use_url | release | unrestricted_use
0 | 1357-VO-BR-SALLUM-VMF00326 | 272 | 1357-VO-BR-SALLUM | https://www.malariagen.net/partner_study/1357-... | 2027-11-30 | https://www.malariagen.net/data/our-approach-s... | 1.0 | False
1 | 1358-VO-GF-GENDRIN-VMF00327 | 139 | 1358-VO-GF-GENDRIN | https://www.malariagen.net/partner_study/1358-... | 2027-11-30 | https://www.malariagen.net/data/our-approach-s... | 1.0 | False
2 | 1359-VO-GY-NILES-ROBIN-VMF00328 | 18 | 1359-VO-GY-NILES-ROBIN | https://www.malariagen.net/partner_study/1359-... | 2027-11-30 | https://www.malariagen.net/data/our-approach-s... | 1.0 | False
3 | 1360-VO-PE-GAMBOA-VMF00329 | 89 | 1360-VO-PE-GAMBOA | https://www.malariagen.net/partner_study/1360-... | 2027-11-30 | https://www.malariagen.net/data/our-approach-s... | 1.0 | False
4 | 1361-VO-VE-GRILLET-VMF00330 | 126 | 1361-VO-VE-GRILLET | https://www.malariagen.net/partner_study/1361-... | 2027-11-30 | https://www.malariagen.net/data/our-approach-s... | 1.0 | False
5 | 1362-VO-CO-QUINONES-VMF00331 | 449 | 1362-VO-CO-QUINONES | https://www.malariagen.net/partner_study/1362-... | 2027-11-30 | https://www.malariagen.net/data/our-approach-s... | 1.0 | False
\n", + "
" + ], + "text/plain": [ + " sample_set sample_count study_id \\\n", + "0 1357-VO-BR-SALLUM-VMF00326 272 1357-VO-BR-SALLUM \n", + "1 1358-VO-GF-GENDRIN-VMF00327 139 1358-VO-GF-GENDRIN \n", + "2 1359-VO-GY-NILES-ROBIN-VMF00328 18 1359-VO-GY-NILES-ROBIN \n", + "3 1360-VO-PE-GAMBOA-VMF00329 89 1360-VO-PE-GAMBOA \n", + "4 1361-VO-VE-GRILLET-VMF00330 126 1361-VO-VE-GRILLET \n", + "5 1362-VO-CO-QUINONES-VMF00331 449 1362-VO-CO-QUINONES \n", + "\n", + " study_url terms_of_use_expiry_date \\\n", + "0 https://www.malariagen.net/partner_study/1357-... 2027-11-30 \n", + "1 https://www.malariagen.net/partner_study/1358-... 2027-11-30 \n", + "2 https://www.malariagen.net/partner_study/1359-... 2027-11-30 \n", + "3 https://www.malariagen.net/partner_study/1360-... 2027-11-30 \n", + "4 https://www.malariagen.net/partner_study/1361-... 2027-11-30 \n", + "5 https://www.malariagen.net/partner_study/1362-... 2027-11-30 \n", + "\n", + " terms_of_use_url release unrestricted_use \n", + "0 https://www.malariagen.net/data/our-approach-s... 1.0 False \n", + "1 https://www.malariagen.net/data/our-approach-s... 1.0 False \n", + "2 https://www.malariagen.net/data/our-approach-s... 1.0 False \n", + "3 https://www.malariagen.net/data/our-approach-s... 1.0 False \n", + "4 https://www.malariagen.net/data/our-approach-s... 1.0 False \n", + "5 https://www.malariagen.net/data/our-approach-s... 1.0 False " + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_sample_sets = adar1.sample_sets(release=\"1.0\")\n", + "df_sample_sets" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "J0SHf6vaQH_3" + }, + "source": [ + "For more information about these sample sets, you can read about each sample set from the URLs under the field `study_url`." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "78L85pli9HdO" + }, + "source": [ + "## Sample metadata\n", + "\n", + "Data about the samples that were sequenced to generate this data resource are available, including the time and place of collection, the sex of the specimen, and our call regarding the species of the specimen. These are organised by sample set.\n", + "\n", + "For example, load sample metadata for all samples in the Adar1.0 release into a [pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#dataframe):" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 661 + }, + "execution": { + "iopub.execute_input": "2026-04-18T15:31:11.803875Z", + "iopub.status.busy": "2026-04-18T15:31:11.803690Z", + "iopub.status.idle": "2026-04-18T15:31:12.153697Z", + "shell.execute_reply": "2026-04-18T15:31:12.153223Z", + "shell.execute_reply.started": "2026-04-18T15:31:11.803857Z" + }, + "id": "-V8nLGSaQH_4", + "outputId": "98a12919-fd6a-4fd5-8155-d90f05d877d7", + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " \r" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
sample_idpartner_sample_idcontributorcountrylocationyearmonthlatitudelongitudesex_call...admin1_nameadmin1_isoadmin2_nametaxoncohort_admin1_yearcohort_admin1_monthcohort_admin1_quartercohort_admin2_yearcohort_admin2_monthcohort_admin2_quarter
0VMF00326-0001Coari_AM252_1Maria Anice Mureb SallumBrazilCoari202211-4.137-63.150UKN...AmazonasBR-AMCoaridarlingiBR-AM_darl_2022BR-AM_darl_2022_11BR-AM_darl_2022_Q4BR-AM_Coari_darl_2022BR-AM_Coari_darl_2022_11BR-AM_Coari_darl_2022_Q4
1VMF00326-0002Coari_AM252_2Maria Anice Mureb SallumBrazilCoari202211-4.137-63.150UKN...AmazonasBR-AMCoaridarlingiBR-AM_darl_2022BR-AM_darl_2022_11BR-AM_darl_2022_Q4BR-AM_Coari_darl_2022BR-AM_Coari_darl_2022_11BR-AM_Coari_darl_2022_Q4
2VMF00326-0003Coari_AM252_3Maria Anice Mureb SallumBrazilCoari202211-4.137-63.150UKN...AmazonasBR-AMCoaridarlingiBR-AM_darl_2022BR-AM_darl_2022_11BR-AM_darl_2022_Q4BR-AM_Coari_darl_2022BR-AM_Coari_darl_2022_11BR-AM_Coari_darl_2022_Q4
3VMF00326-0004Coari_AM252_4Maria Anice Mureb SallumBrazilCoari202211-4.137-63.150UKN...AmazonasBR-AMCoaridarlingiBR-AM_darl_2022BR-AM_darl_2022_11BR-AM_darl_2022_Q4BR-AM_Coari_darl_2022BR-AM_Coari_darl_2022_11BR-AM_Coari_darl_2022_Q4
4VMF00326-0005Coari_AM252_5Maria Anice Mureb SallumBrazilCoari202211-4.137-63.150UKN...AmazonasBR-AMCoaridarlingiBR-AM_darl_2022BR-AM_darl_2022_11BR-AM_darl_2022_Q4BR-AM_Coari_darl_2022BR-AM_Coari_darl_2022_11BR-AM_Coari_darl_2022_Q4
..................................................................
1089VMF00331-0445Tagachi_HER0112286021Martha QuiñonesColombiaTagachí202246.221-76.726UKN...ChocóCO-CHOQuibdódarlingiCO-CHO_darl_2022CO-CHO_darl_2022_04CO-CHO_darl_2022_Q2CO-CHO_Quibdó_darl_2022CO-CHO_Quibdó_darl_2022_04CO-CHO_Quibdó_darl_2022_Q2
1090VMF00331-0446Tagachi_HER0112286022Martha QuiñonesColombiaTagachí202246.221-76.726UKN...ChocóCO-CHOQuibdódarlingiCO-CHO_darl_2022CO-CHO_darl_2022_04CO-CHO_darl_2022_Q2CO-CHO_Quibdó_darl_2022CO-CHO_Quibdó_darl_2022_04CO-CHO_Quibdó_darl_2022_Q2
1091VMF00331-0447Tagachi_HER0112286023Martha QuiñonesColombiaTagachí202246.221-76.726UKN...ChocóCO-CHOQuibdódarlingiCO-CHO_darl_2022CO-CHO_darl_2022_04CO-CHO_darl_2022_Q2CO-CHO_Quibdó_darl_2022CO-CHO_Quibdó_darl_2022_04CO-CHO_Quibdó_darl_2022_Q2
1092VMF00331-0448Tagachi_HER0112286024Martha QuiñonesColombiaTagachí202246.221-76.726UKN...ChocóCO-CHOQuibdódarlingiCO-CHO_darl_2022CO-CHO_darl_2022_04CO-CHO_darl_2022_Q2CO-CHO_Quibdó_darl_2022CO-CHO_Quibdó_darl_2022_04CO-CHO_Quibdó_darl_2022_Q2
1093VMF00331-0449Tagachi_HER0112286037Martha QuiñonesColombiaTagachí202246.221-76.726UKN...ChocóCO-CHOQuibdódarlingiCO-CHO_darl_2022CO-CHO_darl_2022_04CO-CHO_darl_2022_Q2CO-CHO_Quibdó_darl_2022CO-CHO_Quibdó_darl_2022_04CO-CHO_Quibdó_darl_2022_Q2
\n", + "

1094 rows × 46 columns

\n", + "
" + ], + "text/plain": [ + " sample_id partner_sample_id contributor \\\n", + "0 VMF00326-0001 Coari_AM252_1 Maria Anice Mureb Sallum \n", + "1 VMF00326-0002 Coari_AM252_2 Maria Anice Mureb Sallum \n", + "2 VMF00326-0003 Coari_AM252_3 Maria Anice Mureb Sallum \n", + "3 VMF00326-0004 Coari_AM252_4 Maria Anice Mureb Sallum \n", + "4 VMF00326-0005 Coari_AM252_5 Maria Anice Mureb Sallum \n", + "... ... ... ... \n", + "1089 VMF00331-0445 Tagachi_HER0112286021 Martha Quiñones \n", + "1090 VMF00331-0446 Tagachi_HER0112286022 Martha Quiñones \n", + "1091 VMF00331-0447 Tagachi_HER0112286023 Martha Quiñones \n", + "1092 VMF00331-0448 Tagachi_HER0112286024 Martha Quiñones \n", + "1093 VMF00331-0449 Tagachi_HER0112286037 Martha Quiñones \n", + "\n", + " country location year month latitude longitude sex_call ... \\\n", + "0 Brazil Coari 2022 11 -4.137 -63.150 UKN ... \n", + "1 Brazil Coari 2022 11 -4.137 -63.150 UKN ... \n", + "2 Brazil Coari 2022 11 -4.137 -63.150 UKN ... \n", + "3 Brazil Coari 2022 11 -4.137 -63.150 UKN ... \n", + "4 Brazil Coari 2022 11 -4.137 -63.150 UKN ... \n", + "... ... ... ... ... ... ... ... ... \n", + "1089 Colombia Tagachí 2022 4 6.221 -76.726 UKN ... \n", + "1090 Colombia Tagachí 2022 4 6.221 -76.726 UKN ... \n", + "1091 Colombia Tagachí 2022 4 6.221 -76.726 UKN ... \n", + "1092 Colombia Tagachí 2022 4 6.221 -76.726 UKN ... \n", + "1093 Colombia Tagachí 2022 4 6.221 -76.726 UKN ... \n", + "\n", + " admin1_name admin1_iso admin2_name taxon cohort_admin1_year \\\n", + "0 Amazonas BR-AM Coari darlingi BR-AM_darl_2022 \n", + "1 Amazonas BR-AM Coari darlingi BR-AM_darl_2022 \n", + "2 Amazonas BR-AM Coari darlingi BR-AM_darl_2022 \n", + "3 Amazonas BR-AM Coari darlingi BR-AM_darl_2022 \n", + "4 Amazonas BR-AM Coari darlingi BR-AM_darl_2022 \n", + "... ... ... ... ... ... 
\n", + "1089 Chocó CO-CHO Quibdó darlingi CO-CHO_darl_2022 \n", + "1090 Chocó CO-CHO Quibdó darlingi CO-CHO_darl_2022 \n", + "1091 Chocó CO-CHO Quibdó darlingi CO-CHO_darl_2022 \n", + "1092 Chocó CO-CHO Quibdó darlingi CO-CHO_darl_2022 \n", + "1093 Chocó CO-CHO Quibdó darlingi CO-CHO_darl_2022 \n", + "\n", + " cohort_admin1_month cohort_admin1_quarter cohort_admin2_year \\\n", + "0 BR-AM_darl_2022_11 BR-AM_darl_2022_Q4 BR-AM_Coari_darl_2022 \n", + "1 BR-AM_darl_2022_11 BR-AM_darl_2022_Q4 BR-AM_Coari_darl_2022 \n", + "2 BR-AM_darl_2022_11 BR-AM_darl_2022_Q4 BR-AM_Coari_darl_2022 \n", + "3 BR-AM_darl_2022_11 BR-AM_darl_2022_Q4 BR-AM_Coari_darl_2022 \n", + "4 BR-AM_darl_2022_11 BR-AM_darl_2022_Q4 BR-AM_Coari_darl_2022 \n", + "... ... ... ... \n", + "1089 CO-CHO_darl_2022_04 CO-CHO_darl_2022_Q2 CO-CHO_Quibdó_darl_2022 \n", + "1090 CO-CHO_darl_2022_04 CO-CHO_darl_2022_Q2 CO-CHO_Quibdó_darl_2022 \n", + "1091 CO-CHO_darl_2022_04 CO-CHO_darl_2022_Q2 CO-CHO_Quibdó_darl_2022 \n", + "1092 CO-CHO_darl_2022_04 CO-CHO_darl_2022_Q2 CO-CHO_Quibdó_darl_2022 \n", + "1093 CO-CHO_darl_2022_04 CO-CHO_darl_2022_Q2 CO-CHO_Quibdó_darl_2022 \n", + "\n", + " cohort_admin2_month cohort_admin2_quarter \n", + "0 BR-AM_Coari_darl_2022_11 BR-AM_Coari_darl_2022_Q4 \n", + "1 BR-AM_Coari_darl_2022_11 BR-AM_Coari_darl_2022_Q4 \n", + "2 BR-AM_Coari_darl_2022_11 BR-AM_Coari_darl_2022_Q4 \n", + "3 BR-AM_Coari_darl_2022_11 BR-AM_Coari_darl_2022_Q4 \n", + "4 BR-AM_Coari_darl_2022_11 BR-AM_Coari_darl_2022_Q4 \n", + "... ... ... 
\n", + "1089 CO-CHO_Quibdó_darl_2022_04 CO-CHO_Quibdó_darl_2022_Q2 \n", + "1090 CO-CHO_Quibdó_darl_2022_04 CO-CHO_Quibdó_darl_2022_Q2 \n", + "1091 CO-CHO_Quibdó_darl_2022_04 CO-CHO_Quibdó_darl_2022_Q2 \n", + "1092 CO-CHO_Quibdó_darl_2022_04 CO-CHO_Quibdó_darl_2022_Q2 \n", + "1093 CO-CHO_Quibdó_darl_2022_04 CO-CHO_Quibdó_darl_2022_Q2 \n", + "\n", + "[1094 rows x 46 columns]" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_samples = adar1.sample_metadata(sample_sets=\"1.0\")\n", + "df_samples" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ssCdOykfQH_4" + }, + "source": [ + "The `sample_id` column gives the sample identifier used throughout all Adar1.0 analyses.\n", + "\n", + "The `country`, `location`, `latitude` and `longitude` columns give the location where the specimen was collected.\n", + "\n", + "The `year` and `month` columns give the approximate date when the specimen was collected.\n", + "\n", + "The `sex_call` column gives the sex as determined from the sequence data." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9APw05D5gAQ9" + }, + "source": [ + "[Pandas](https://pandas.pydata.org/) can be used to explore and query the sample metadata in various ways. 
E.g., here is a summary of the numbers of samples by species:" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2026-04-18T15:31:12.154633Z", + "iopub.status.busy": "2026-04-18T15:31:12.154296Z", + "iopub.status.idle": "2026-04-18T15:31:12.159508Z", + "shell.execute_reply": "2026-04-18T15:31:12.158944Z", + "shell.execute_reply.started": "2026-04-18T15:31:12.154613Z" + }, + "id": "PpsTgviZQH_4", + "outputId": "ddbc9515-25dc-454f-9f02-9427f1261b06", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "taxon\n", + "darlingi 1094\n", + "dtype: int64" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_samples.groupby(\"taxon\").size()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C4EPodCJjg0a" + }, + "source": [ + "## SNP calls\n", + "\n", + "Data on SNP calls, including the SNP positions, alleles, site filters, and genotypes, can be accessed as an [xarray Dataset](http://xarray.pydata.org/en/stable/user-guide/data-structures.html#dataset).\n", + "\n", + "E.g., access SNP calls for contig 2 for all samples in `Adar1.0`." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 430 + }, + "execution": { + "iopub.execute_input": "2026-04-18T15:31:12.160304Z", + "iopub.status.busy": "2026-04-18T15:31:12.160116Z", + "iopub.status.idle": "2026-04-18T15:31:12.513114Z", + "shell.execute_reply": "2026-04-18T15:31:12.512494Z", + "shell.execute_reply.started": "2026-04-18T15:31:12.160285Z" + }, + "id": "433PD7k8jlNj", + "outputId": "bc5e1b8d-f1f4-4008-df56-f577a9080561", + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " \r" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
<xarray.Dataset> Size: 569GB\n",
+       "Dimensions:                       (variants: 30593416, alleles: 4,\n",
+       "                                   samples: 1094, ploidy: 2)\n",
+       "Coordinates:\n",
+       "    variant_position              (variants) int32 122MB dask.array<chunksize=(524288,), meta=np.ndarray>\n",
+       "    variant_contig                (variants) uint8 31MB dask.array<chunksize=(524288,), meta=np.ndarray>\n",
+       "    sample_id                     (samples) <U36 158kB dask.array<chunksize=(273,), meta=np.ndarray>\n",
+       "Dimensions without coordinates: variants, alleles, samples, ploidy\n",
+       "Data variables:\n",
+       "    variant_allele                (variants, alleles) |S1 122MB dask.array<chunksize=(524288, 4), meta=np.ndarray>\n",
+       "    variant_filter_pass_darlingi  (variants) bool 31MB dask.array<chunksize=(524288,), meta=np.ndarray>\n",
+       "    call_genotype                 (variants, samples, ploidy) int8 67GB dask.array<chunksize=(300000, 50, 2), meta=np.ndarray>\n",
+       "    call_GQ                       (variants, samples) int8 33GB dask.array<chunksize=(300000, 50), meta=np.ndarray>\n",
+       "    call_MQ                       (variants, samples) float32 134GB dask.array<chunksize=(300000, 50), meta=np.ndarray>\n",
+       "    call_AD                       (variants, samples, alleles) int16 268GB dask.array<chunksize=(300000, 50, 4), meta=np.ndarray>\n",
+       "    call_genotype_mask            (variants, samples, ploidy) bool 67GB dask.array<chunksize=(300000, 50, 2), meta=np.ndarray>\n",
+       "Attributes:\n",
+       "    contigs:  ('2', '3', 'X')
" + ], + "text/plain": [ + " Size: 569GB\n", + "Dimensions: (variants: 30593416, alleles: 4,\n", + " samples: 1094, ploidy: 2)\n", + "Coordinates:\n", + " variant_position (variants) int32 122MB dask.array\n", + " variant_contig (variants) uint8 31MB dask.array\n", + " sample_id (samples) \n", + "Dimensions without coordinates: variants, alleles, samples, ploidy\n", + "Data variables:\n", + " variant_allele (variants, alleles) |S1 122MB dask.array\n", + " variant_filter_pass_darlingi (variants) bool 31MB dask.array\n", + " call_genotype (variants, samples, ploidy) int8 67GB dask.array\n", + " call_GQ (variants, samples) int8 33GB dask.array\n", + " call_MQ (variants, samples) float32 134GB dask.array\n", + " call_AD (variants, samples, alleles) int16 268GB dask.array\n", + " call_genotype_mask (variants, samples, ploidy) bool 67GB dask.array\n", + "Attributes:\n", + " contigs: ('2', '3', 'X')" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ds_snps = adar1.snp_calls(region=\"2\", sample_sets=\"1.0\")\n", + "ds_snps" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fx9ufvbCnPGn" + }, + "source": [ + "The arrays within this dataset are backed by [Dask arrays](https://docs.dask.org/en/latest/array.html), and can be accessed as shown below." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Lvv-lFHJ-Um2" + }, + "source": [ + "### SNP sites and alleles\n", + "\n", + "We have called SNP genotypes in all samples at all positions in the genome where the reference allele is not \"N\". Data on this set of genomic positions and alleles for a given contig (e.g., 2) can be accessed as [Dask arrays](https://docs.dask.org/en/latest/array.html) as follows." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 132 + }, + "execution": { + "iopub.execute_input": "2026-04-18T15:31:12.514010Z", + "iopub.status.busy": "2026-04-18T15:31:12.513827Z", + "iopub.status.idle": "2026-04-18T15:31:12.520288Z", + "shell.execute_reply": "2026-04-18T15:31:12.519839Z", + "shell.execute_reply.started": "2026-04-18T15:31:12.513992Z" + }, + "id": "GO5Os0epQH_5", + "outputId": "7c970e20-4811-46a1-8944-4bd7f6e8359f", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Array Chunk
Bytes 116.70 MiB 2.00 MiB
Shape (30593416,) (524288,)
Dask graph 59 chunks in 1 graph layer
Data type int32 numpy.ndarray
\n", + "
\n", + " \n", + "\n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + "\n", + " \n", + " 30593416\n", + " 1\n", + "\n", + "
" + ], + "text/plain": [ + "dask.array" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pos = ds_snps[\"variant_position\"].data\n", + "pos" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 197 + }, + "execution": { + "iopub.execute_input": "2026-04-18T15:31:12.521037Z", + "iopub.status.busy": "2026-04-18T15:31:12.520869Z", + "iopub.status.idle": "2026-04-18T15:31:12.530918Z", + "shell.execute_reply": "2026-04-18T15:31:12.530470Z", + "shell.execute_reply.started": "2026-04-18T15:31:12.521019Z" + }, + "id": "eD5Gtb-xQH_5", + "outputId": "60a9f964-0335-4084-b359-7902d138bec3", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Array Chunk
Bytes 116.70 MiB 2.00 MiB
Shape (30593416, 4) (524288, 4)
Dask graph 59 chunks in 5 graph layers
Data type |S1 numpy.ndarray
\n", + "
\n", + " \n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + "\n", + " \n", + " 4\n", + " 30593416\n", + "\n", + "
" + ], + "text/plain": [ + "dask.array" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "alleles = ds_snps[\"variant_allele\"].data\n", + "alleles" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "k6i3W7y1QH_5" + }, + "source": [ + "Data can be loaded into memory as a [NumPy array](https://numpy.org/doc/stable/user/absolute_beginners.html) as shown in the following examples." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2026-04-18T15:31:12.531825Z", + "iopub.status.busy": "2026-04-18T15:31:12.531644Z", + "iopub.status.idle": "2026-04-18T15:31:12.774265Z", + "shell.execute_reply": "2026-04-18T15:31:12.773744Z", + "shell.execute_reply.started": "2026-04-18T15:31:12.531808Z" + }, + "id": "3_1qTYtiQH_5", + "outputId": "c260b22a-cc89-4a3c-9371-21fde9ec189e", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "array([11, 12, 13, 28, 36, 43, 45, 53, 54, 55], dtype=int32)" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# read first 10 SNP positions into a numpy array\n", + "p = pos[:10].compute()\n", + "p" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2026-04-18T15:31:12.775756Z", + "iopub.status.busy": "2026-04-18T15:31:12.775552Z", + "iopub.status.idle": "2026-04-18T15:31:12.999983Z", + "shell.execute_reply": "2026-04-18T15:31:12.999510Z", + "shell.execute_reply.started": "2026-04-18T15:31:12.775736Z" + }, + "id": "UjeBeyOXQH_6", + "outputId": "4ef2a2e1-789a-4ec0-fff6-53e83f4951d1", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[b'C', b'A', b'G', b'T'],\n", + " [b'T', b'A', b'C', b'G'],\n", + " [b'G', b'A', b'C', 
b'T'],\n", + " [b'C', b'A', b'G', b'T'],\n", + " [b'A', b'C', b'G', b'T'],\n", + " [b'G', b'A', b'C', b'T'],\n", + " [b'A', b'C', b'G', b'T'],\n", + " [b'G', b'A', b'C', b'T'],\n", + " [b'A', b'C', b'G', b'T'],\n", + " [b'G', b'A', b'C', b'T']], dtype='|S1')" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# read first 10 SNP alleles into a numpy array\n", + "a = alleles[:10].compute()\n", + "a" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XoHkXz0Cbk_p" + }, + "source": [ + "Here the first column contains the reference alleles, and the remaining columns contain the alternate alleles.\n", + "\n", + "Note that a byte string data type is used here for efficiency. E.g., the Python code `b'T'` represents a byte string containing the letter \"T\", which here stands for the nucleotide thymine.\n", + "\n", + "Note that we have chosen to genotype all samples at all sites in the genome, assuming all possible SNP alleles. Not all of these alternate alleles will actually have been observed in the `Adar1.0` samples. To determine which sites and alleles are segregating, an allele count can be performed over the samples you are interested in. See the example below. " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BGVj0OiyAQuX" + }, + "source": [ + "### Site filters\n", + "\n", + "SNP calling is not always reliable, and we have created some site filters so that low-quality SNPs can be excluded. \n", + "\n", + "Each set of site filters provides a \"filter_pass\" Boolean mask for each contig, where True indicates that the site passed the filter and is accessible to high-quality SNP calling.\n", + "\n", + "The site filters data can be accessed as dask arrays as shown in the examples below. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 132 + }, + "execution": { + "iopub.execute_input": "2026-04-18T15:31:13.001077Z", + "iopub.status.busy": "2026-04-18T15:31:13.000895Z", + "iopub.status.idle": "2026-04-18T15:31:13.007081Z", + "shell.execute_reply": "2026-04-18T15:31:13.006665Z", + "shell.execute_reply.started": "2026-04-18T15:31:13.001058Z" + }, + "id": "wh1AaMJ_QH_6", + "outputId": "e9b544fc-2db0-4f83-e23b-30258598d552", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Array Chunk
Bytes 29.18 MiB 512.00 kiB
Shape (30593416,) (524288,)
Dask graph 59 chunks in 1 graph layer
Data type bool numpy.ndarray
\n", + "
\n", + " \n", + "\n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + "\n", + " \n", + " 30593416\n", + " 1\n", + "\n", + "
" + ], + "text/plain": [ + "dask.array" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# access darlingi site filters as a dask array\n", + "filter_pass = ds_snps['variant_filter_pass_darlingi'].data\n", + "filter_pass" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2026-04-18T15:31:13.007753Z", + "iopub.status.busy": "2026-04-18T15:31:13.007580Z", + "iopub.status.idle": "2026-04-18T15:31:13.071315Z", + "shell.execute_reply": "2026-04-18T15:31:13.070538Z", + "shell.execute_reply.started": "2026-04-18T15:31:13.007735Z" + }, + "id": "klokhPxwQH_6", + "outputId": "28c6cbfd-b6cc-46f0-9554-c027c4c57cae", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "array([ True, True, True, True, True, True, True, True, True,\n", + " True])" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# read filter values for first 10 SNPs (True means the site passes filters)\n", + "f = filter_pass[:10].compute()\n", + "f" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sMnfrmCNBzW8" + }, + "source": [ + "### SNP genotypes\n", + "\n", + "SNP genotypes for individual samples are available. Genotypes are stored as a three-dimensional array, where the first dimension corresponds to genomic positions, the second dimension is samples, and the third dimension is ploidy (2). Values are coded as integers, where -1 represents a missing value, 0 represents the reference allele, and 1, 2, and 3 represent alternate alleles.\n", + "\n", + "SNP genotypes can be accessed as dask arrays as shown below." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 173 + }, + "execution": { + "iopub.execute_input": "2026-04-18T15:31:13.072274Z", + "iopub.status.busy": "2026-04-18T15:31:13.072077Z", + "iopub.status.idle": "2026-04-18T15:31:13.079596Z", + "shell.execute_reply": "2026-04-18T15:31:13.078971Z", + "shell.execute_reply.started": "2026-04-18T15:31:13.072254Z" + }, + "id": "QPViDmX_QH_7", + "outputId": "125ba0b7-4e6d-4c61-f325-39e9eb9522e7", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Array Chunk
Bytes 62.34 GiB 28.61 MiB
Shape (30593416, 1094, 2) (300000, 50, 2)
Dask graph 2448 chunks in 7 graph layers
Data type int8 numpy.ndarray
\n", + "
\n", + " \n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + "\n", + " \n", + " 2\n", + " 1094\n", + " 30593416\n", + "\n", + "
" + ], + "text/plain": [ + "dask.array" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "gt = ds_snps[\"call_genotype\"].data\n", + "gt" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lcG-QFZRRTwx" + }, + "source": [ + "Note that the columns of this array (second dimension) match the rows in the sample metadata, if the same sample sets were loaded. I.e.:" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2026-04-18T15:31:13.080505Z", + "iopub.status.busy": "2026-04-18T15:31:13.080303Z", + "iopub.status.idle": "2026-04-18T15:31:13.090301Z", + "shell.execute_reply": "2026-04-18T15:31:13.089785Z", + "shell.execute_reply.started": "2026-04-18T15:31:13.080486Z" + }, + "id": "H0pR2bOCRcLI", + "outputId": "b3283a90-3202-45e9-9482-a926594945df", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_samples = adar1.sample_metadata(sample_sets=\"1.0\")\n", + "gt = ds_snps[\"call_genotype\"].data\n", + "len(df_samples) == gt.shape[1]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Xr_FJ-xARgyS" + }, + "source": [ + "You can use this correspondence to apply further subsetting operations to the genotypes by querying the sample metadata. 
E.g.:" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2026-04-18T15:31:13.091602Z", + "iopub.status.busy": "2026-04-18T15:31:13.091423Z", + "iopub.status.idle": "2026-04-18T15:31:13.102958Z", + "shell.execute_reply": "2026-04-18T15:31:13.102241Z", + "shell.execute_reply.started": "2026-04-18T15:31:13.091583Z" + }, + "id": "WqyNsEwLRo0q", + "outputId": "77a966bd-5ab3-416f-fb16-8cc38f46bac2", + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "found 1094 darlingi samples\n" + ] + } + ], + "source": [ + "loc_darlingi = df_samples.eval(\"taxon == 'darlingi'\").values\n", + "print(f\"found {np.count_nonzero(loc_darlingi)} darlingi samples\")" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 430 + }, + "execution": { + "iopub.execute_input": "2026-04-18T15:31:13.256653Z", + "iopub.status.busy": "2026-04-18T15:31:13.255805Z", + "iopub.status.idle": "2026-04-18T15:31:13.351363Z", + "shell.execute_reply": "2026-04-18T15:31:13.350859Z", + "shell.execute_reply.started": "2026-04-18T15:31:13.256629Z" + }, + "id": "auvV_O0Dx1GT", + "outputId": "e3991a1a-1289-4e3d-f3f3-1539d7d336d0", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
<xarray.Dataset> Size: 569GB\n",
+       "Dimensions:                       (variants: 30593416, alleles: 4,\n",
+       "                                   samples: 1094, ploidy: 2)\n",
+       "Coordinates:\n",
+       "    variant_position              (variants) int32 122MB dask.array<chunksize=(524288,), meta=np.ndarray>\n",
+       "    variant_contig                (variants) uint8 31MB dask.array<chunksize=(524288,), meta=np.ndarray>\n",
+       "    sample_id                     (samples) <U36 158kB dask.array<chunksize=(182,), meta=np.ndarray>\n",
+       "Dimensions without coordinates: variants, alleles, samples, ploidy\n",
+       "Data variables:\n",
+       "    variant_allele                (variants, alleles) |S1 122MB dask.array<chunksize=(524288, 4), meta=np.ndarray>\n",
+       "    variant_filter_pass_darlingi  (variants) bool 31MB dask.array<chunksize=(524288,), meta=np.ndarray>\n",
+       "    call_genotype                 (variants, samples, ploidy) int8 67GB dask.array<chunksize=(300000, 45, 2), meta=np.ndarray>\n",
+       "    call_GQ                       (variants, samples) int8 33GB dask.array<chunksize=(300000, 45), meta=np.ndarray>\n",
+       "    call_MQ                       (variants, samples) float32 134GB dask.array<chunksize=(300000, 45), meta=np.ndarray>\n",
+       "    call_AD                       (variants, samples, alleles) int16 268GB dask.array<chunksize=(300000, 45, 4), meta=np.ndarray>\n",
+       "    call_genotype_mask            (variants, samples, ploidy) bool 67GB dask.array<chunksize=(300000, 45, 2), meta=np.ndarray>\n",
+       "Attributes:\n",
+       "    contigs:  ('2', '3', 'X')
" + ], + "text/plain": [ + " Size: 569GB\n", + "Dimensions: (variants: 30593416, alleles: 4,\n", + " samples: 1094, ploidy: 2)\n", + "Coordinates:\n", + " variant_position (variants) int32 122MB dask.array\n", + " variant_contig (variants) uint8 31MB dask.array\n", + " sample_id (samples) \n", + "Dimensions without coordinates: variants, alleles, samples, ploidy\n", + "Data variables:\n", + " variant_allele (variants, alleles) |S1 122MB dask.array\n", + " variant_filter_pass_darlingi (variants) bool 31MB dask.array\n", + " call_genotype (variants, samples, ploidy) int8 67GB dask.array\n", + " call_GQ (variants, samples) int8 33GB dask.array\n", + " call_MQ (variants, samples) float32 134GB dask.array\n", + " call_AD (variants, samples, alleles) int16 268GB dask.array\n", + " call_genotype_mask (variants, samples, ploidy) bool 67GB dask.array\n", + "Attributes:\n", + " contigs: ('2', '3', 'X')" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ds_snps_darlingi = ds_snps.isel(samples=loc_darlingi)\n", + "ds_snps_darlingi" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xAreXD3ySw_e" + }, + "source": [ + "Data can be read into memory as numpy arrays, e.g., read genotypes for the first 5 SNPs and the first 3 samples:" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2026-04-18T15:31:13.656840Z", + "iopub.status.busy": "2026-04-18T15:31:13.656573Z", + "iopub.status.idle": "2026-04-18T15:31:13.770598Z", + "shell.execute_reply": "2026-04-18T15:31:13.770050Z", + "shell.execute_reply.started": "2026-04-18T15:31:13.656820Z" + }, + "id": "AEH-iHpYQH_7", + "outputId": "04e075b3-5f18-4e6f-882e-898335312d71", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[[0, 0],\n", + " [0, 0],\n", + " [0, 0]],\n", + "\n", + " [[0, 0],\n", + " [0, 0],\n", + " 
[0, 0]],\n", + "\n", + " [[0, 0],\n", + " [0, 0],\n", + " [0, 0]],\n", + "\n", + " [[0, 0],\n", + " [0, 0],\n", + " [0, 0]],\n", + "\n", + " [[0, 0],\n", + " [0, 0],\n", + " [0, 0]]], dtype=int8)" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "g = gt[:5, :3, :].compute()\n", + "g" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vcMEGuGsCSig" + }, + "source": [ + "If you want to work with the genotype calls, you may find it convenient to use [scikit-allel](http://scikit-allel.readthedocs.org/).\n", + "E.g., the code below sets up a genotype array." + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 207 + }, + "execution": { + "iopub.execute_input": "2026-04-18T15:31:14.002668Z", + "iopub.status.busy": "2026-04-18T15:31:14.002372Z", + "iopub.status.idle": "2026-04-18T15:31:14.492261Z", + "shell.execute_reply": "2026-04-18T15:31:14.491708Z", + "shell.execute_reply.started": "2026-04-18T15:31:14.002647Z" + }, + "id": "TBuf01BdbJ6z", + "outputId": "bec96465-4d21-4647-ced0-c687674dad40", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
<GenotypeDaskArray shape=(30593416, 1094, 2) dtype=int8>
           0    1    2    3    4    ...  1089  1090  1091  1092  1093
0          0/0  0/0  0/0  0/0  0/0  ...  0/0   0/0   0/0   0/0   0/0
1          0/0  0/0  0/0  0/0  0/0  ...  0/0   0/0   0/0   0/0   0/0
2          0/0  0/0  0/0  0/0  0/0  ...  0/0   0/0   0/0   0/0   0/0
...        ...
30593413   0/0  0/0  0/0  0/0  0/0  ...  0/0   0/0   0/0   0/0   0/0
30593414   0/0  0/0  0/0  0/0  0/0  ...  0/0   0/0   0/0   0/0   0/0
30593415   0/0  0/0  0/0  0/0  0/0  ...  0/0   0/0   0/0   0/0   0/0
" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# use the scikit-allel wrapper class for genotype calls\n", + "gt = allel.GenotypeDaskArray(ds_snps[\"call_genotype\"].data)\n", + "gt" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "arZZ_OcPoSQV" + }, + "source": [ + "## Example computation\n", + "\n", + "Here's an example computation to count the number of segregating SNPs on the longest contig (???) that also pass site filters. This may take a minute or two, because it is scanning genotype calls at millions of SNPs in hundreds of samples." + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2026-04-18T15:31:14.856817Z", + "iopub.status.busy": "2026-04-18T15:31:14.856515Z", + "iopub.status.idle": "2026-04-18T15:31:22.271798Z", + "shell.execute_reply": "2026-04-18T15:31:22.271119Z", + "shell.execute_reply.started": "2026-04-18T15:31:14.856794Z" + }, + "id": "mPUEp61aQH_8", + "outputId": "c8eecf02-09d0-4797-f25d-cf56ae1c8bb5", + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[########################################] | 100% Completed | 6.29 sms\n" + ] + }, + { + "data": { + "text/plain": [ + "22073253" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# choose contig (longest contig)\n", + "region = \"2\"\n", + "# choose site filter mask\n", + "\n", + "# choose sample sets\n", + "sample_sets = [\"1357-VO-BR-SALLUM-VMF00326\"]\n", + "\n", + "# access SNP calls\n", + "ds_snps = adar1.snp_calls(region=region, sample_sets=sample_sets)\n", + "\n", + "# locate pass sites\n", + "loc_pass = ds_snps[f\"variant_filter_pass_darlingi\"].values\n", + "\n", + "# perform an allele count over genotypes\n", + "gt = 
allel.GenotypeDaskArray(ds_snps[\"call_genotype\"].data)\n", + "with ProgressBar():\n", + "    ac = gt.count_alleles(max_allele=3).compute()\n", + "\n", + "# locate segregating sites\n", + "loc_seg = ac.is_segregating()\n", + "\n", + "# count sites that are both segregating and pass the site filter\n", + "n_pass_seg = np.count_nonzero(loc_pass & loc_seg)\n", + "n_pass_seg" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OS4U1IwZgARB" + }, + "source": [ + "## Running larger computations\n", + "\n", + "Please note that free cloud computing services such as Google Colab provide only limited computing resources. Thus, although these services can efficiently read `Adar1` data stored on Google Cloud, you may run out of memory, or find that computations take a long time when running on a single core. If you would like any suggestions regarding how to set up more powerful compute resources in the cloud, please feel free to get in touch via the [malariagen/vector-data GitHub discussion board](https://github.com/malariagen/vector-data/discussions)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4n73mSO-heAF" + }, + "source": [ + "## Feedback and suggestions\n", + "\n", + "If there are particular analyses you would like to run, or if you have other suggestions for useful documentation we could add to this site, we would love to know. Please get in touch via the [malariagen/vector-data GitHub discussion board](https://github.com/malariagen/vector-data/discussions)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "celltoolbar": "Tags", + "colab": { + "collapsed_sections": [], + "name": "Ag3.0 cloud data access 2022-03-14.ipynb", + "provenance": [] + }, + "environment": { + "kernel": "mgenv_7.2.0", + "name": "workbench-notebooks.m139", + "type": "gcloud", + "uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m139" + }, + "kernelspec": { + "display_name": "Python (mgenv_7.2.0)", + "language": "python", + "name": "mgenv_7.2.0" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.11" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/docs/adar1/download.ipynb b/docs/adar1/download.ipynb new file mode 100644 index 0000000..6188784 --- /dev/null +++ b/docs/adar1/download.ipynb @@ -0,0 +1,347 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "p0VbAgTdnvpP" + }, + "source": [ + "# Adar1.0 data downloads\n", + "\n", + "This notebook provides information about how to download data from the [MalariaGEN Vector Observatory Anopheles darlingi Genomic Surveillance Project](https://www.malariagen.net/project/anopheles-darlingi-genomic-surveillance-project), for *Anopheles darlingi*. These data are the first release (v1.0), and include sample metadata, raw sequence reads, sequence read alignments, and single nucleotide polymorphism (SNP) calls. \n", + "\n", + "Code examples that are intended to be run via a Linux command line are prefixed with an exclamation mark (!). 
If you are running these commands directly from a terminal, remove the exclamation mark.\n", + "\n", + "Examples in this notebook assume you are downloading data to a local folder within your home directory at the path `~/vo_adar_release/`. Change this if you want to download to a different folder on the local file system.\n", + "\n", + "## Data hosting\n", + "\n", + "`Adar1` data are hosted by several different services.\n", + "\n", + "Raw sequence reads in FASTQ format and sequence read alignments in BAM format are hosted by the European Nucleotide Archive (ENA). This guide provides examples of downloading data from ENA via FTP using the `wget` command line tool, but there are several other options for downloading data; see the [ENA documentation on how to download data files](https://ena-docs.readthedocs.io/en/latest/retrieval/file-download.html) for more information.\n", + "\n", + "SNP calls in VCF and Zarr formats are hosted on S3-compatible object storage at the Sanger Institute. This guide provides examples of downloading these data using `wget`.\n", + "\n", + "Sample metadata in CSV format are hosted on Google Cloud Storage (GCS) in the `vo_adar_release_master_us_central1` bucket, which is a multi-region bucket located in the United States. All data hosted on GCS are publicly accessible but do require an authentication step; please see details on the [Vector Observatory Data Access page](https://malariagen.github.io/vector-data/vobs/vobs-data-access.html).\n", + "\n", + "The guide below provides examples of downloading data from GCS to a local computer using the `wget` and `gsutil` command line tools. For more information about `gsutil`, see the [gsutil tool documentation](https://cloud.google.com/storage/docs/gsutil)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "t1wyCDH5nvpS" + }, + "source": [ + "## Sample sets\n", + "\n", + "Data in these releases are organised into sample sets. 
Each of these sample sets corresponds to a set of mosquito specimens contributed by a collaborating study. Depending on your objectives, you may want to download data from only specific sample sets, or all sample sets. For convenience, there is a tab-delimited manifest file listing all sample sets in the release, which can be downloaded via `gsutil` to a directory on the local file system, e.g.:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "rsX4TP6UnvpS", + "outputId": "a9afc995-80b7-4f62-ad0b-b4d95822cf38", + "tags": [ + "hide-output" + ] + }, + "outputs": [], + "source": [ + "!mkdir -pv ~/vo_adar_release/v1.0/\n", + "!gsutil cp gs://vo_adar_release_master_us_central1/v1.0/manifest.tsv ~/vo_adar_release/v1.0/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hWOAFxIDnvpT" + }, + "source": [ + "Here are the file contents:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "vC4ACrTEnvpT", + "outputId": "c7cfe64a-9a78-42ea-dbd9-9cc82410372d" + }, + "outputs": [], + "source": [ + "!cat ~/vo_adar_release/v1.0/manifest.tsv" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5hXT_c0pnvpU" + }, + "source": [ + "For more information about these sample sets, you can explore the [Adar1.0 data user guide](https://malariagen.github.io/vector-data/adar1/adar1.0.html)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "D0m-HL43nvpU" + }, + "source": [ + "## Sample metadata\n", + "\n", + "Data about the samples that were sequenced to generate this data resource are available, including the time and place of collection, the gender of the specimen, and our call regarding the species of the specimen.\n", + "\n", + "### Specimen collection metadata\n", + "\n", + "Specimen collection metadata can be downloaded from GCS. 
Sample metadata for all sample sets can be downloaded using `gsutil`. If you only want the sample metadata for a single sample set, include the sample set name in the path, e.g., to access the metadata for `1357-VO-BR-SALLUM-VMF00326`, you would use `gs://vo_adar_release_master_us_central1/v1.0/metadata/general/1357-VO-BR-SALLUM-VMF00326`:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2026-04-18T15:33:36.315343Z", + "iopub.status.busy": "2026-04-18T15:33:36.314446Z", + "iopub.status.idle": "2026-04-18T15:33:39.325459Z", + "shell.execute_reply": "2026-04-18T15:33:39.324682Z", + "shell.execute_reply.started": "2026-04-18T15:33:36.315300Z" + }, + "id": "CsQVgCl7nvpV", + "outputId": "e0409bcb-5eca-4b1b-e703-e968508f3aec", + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "mkdir: created directory '/home/jupyter/vo_adar_release/v1.0/metadata/1357-VO-BR-SALLUM-VMF00326/'\n", + "Building synchronization state...\n", + "Starting synchronization...\n", + "Copying gs://vo_adar_release_master_us_central1/v1.0/metadata/general/1357-VO-BR-SALLUM-VMF00326/samples.meta.csv...\n", + "Copying gs://vo_adar_release_master_us_central1/v1.0/metadata/general/1357-VO-BR-SALLUM-VMF00326/wgs_snp_data.csv...\n", + "Copying gs://vo_adar_release_master_us_central1/v1.0/metadata/general/1357-VO-BR-SALLUM-VMF00326/surveillance.flags.csv...\n", + "/ [3/3 files][118.0 KiB/118.0 KiB] 100% Done \n", + "Operation completed over 3 objects/118.0 KiB. 
\n" + ] + } + ], + "source": [ + "!mkdir -pv ~/vo_adar_release/v1.0/metadata/1357-VO-BR-SALLUM-VMF00326/\n", + "!gsutil -m rsync -r gs://vo_adar_release_master_us_central1/v1.0/metadata/general/1357-VO-BR-SALLUM-VMF00326/ ~/vo_adar_release/v1.0/metadata/1357-VO-BR-SALLUM-VMF00326/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "R7GeyShRnvpV" + }, + "source": [ + "Here are the first few rows of the sample metadata for sample set `1357-VO-BR-SALLUM-VMF00326`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "dhKjnl6knvpW", + "outputId": "6345e845-5288-41a1-e877-5417559b8c6c" + }, + "outputs": [], + "source": [ + "!head ~/vo_adar_release/v1.0/metadata/1357-VO-BR-SALLUM-VMF00326/samples.meta.csv" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VKki7qHunvpW" + }, + "source": [ + "The `sample_id` column gives the sample identifier used throughout all analyses.\n", + "\n", + "The `country`, `location`, `latitude` and `longitude` columns give the location where the specimen was collected.\n", + "\n", + "The `year` and `month` columns give the approximate date when the specimen was collected.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZgpIO8Oknvpa" + }, + "source": [ + "### Site filters\n", + "\n", + "SNP calling is not always reliable, and we have created some site filters to allow excluding low-quality SNPs. For *An. funestus* and *An. gambiae*, these are available as a VCF file. For *An. darlingi*, they are only available as a Zarr array (see below). If you would like to filter your VCF based on sites passing the filter, you will need to extract the data from the Zarr array, and subset your VCF based on these locations (e.g. using `bcftools view` with the `--regions` or `--regions-file` option)." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OBXGXzj9nvpb" + }, + "source": [ + "## SNP calls (Zarr format)\n", + "\n", + "SNP data are also available in Zarr format, which can be convenient and efficient to use for certain types of analysis. These data can be analysed directly in the cloud without downloading to the local system, see the [Adar1 cloud data access guide](https://malariagen.github.io/vector-data/adar1/cloud.html) for more information. The data can also be downloaded to your own system for local analysis if that is more convenient. Below are examples of how to download the Zarr data to your local system.\n", + "\n", + "The data are organised into several Zarr hierarchies. \n", + "\n", + "### SNP sites and alleles\n", + "\n", + "Data on the genomic positions (sites) and reference and alternate alleles that were genotyped can be downloaded as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "id": "hM4noAz3nvpb", + "tags": [ + "hide-output" + ] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "mkdir: created directory '/home/jupyter/vo_adar_release/v1.0/snp_genotypes'\n", + "mkdir: created directory '/home/jupyter/vo_adar_release/v1.0/snp_genotypes/all'\n", + "mkdir: created directory '/home/jupyter/vo_adar_release/v1.0/snp_genotypes/all/sites/'\n", + "Building synchronization state...\n", + "Reauthentication required.\n", + "Caught non-retryable exception while listing gs://vo_adar_release_master_us_central1/v1.0/snp_genotypes/all/sites/: Reauthentication challenge could not be answered because you are not in an interactive session.\n", + "CommandException: Caught non-retryable exception - aborting rsync\n" + ] + } + ], + "source": [ + "!mkdir -pv ~/vo_adar_release/v1.0/snp_genotypes/all/sites/\n", + "!gsutil -m rsync -r \\\n", + " gs://vo_adar_release_master_us_central1/v1.0/snp_genotypes/all/sites/ \\\n", + " ~/vo_adar_release/v1.0/snp_genotypes/all/sites/" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GRqTjrIhnvpb" + }, + "source": [ + "### Site filters\n", + "\n", + "SNP calling is not always reliable, and we have created some site filters to allow excluding low-quality SNPs. To download the site filter data in Zarr format:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "tWu4ajAbnvpb", + "tags": [ + "hide-output" + ] + }, + "outputs": [], + "source": [ + "!mkdir -pv ~/vo_adar_release/v1.0/site_filters/sc_20250610/darlingi/\n", + "!gsutil -m rsync -r \\\n", + " gs://vo_adar_release_master_us_central1/v1.0/site_filters/sc_20250610/darlingi/ \\\n", + " ~/vo_adar_release/v1.0/site_filters/sc_20250610/darlingi/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vKfArxCFnvpb" + }, + "source": [ + "### SNP genotypes\n", + "\n", + "SNP genotypes are available for each sample set separately. E.g., to download SNP genotypes in Zarr format for sample set `1357-VO-BR-SALLUM-VMF00326`, excluding the per-call `AD`, `GQ` and `MQ` fields, which you probably won't need:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "umeGFe1jnvpb", + "tags": [ + "hide-output" + ] + }, + "outputs": [], + "source": [ + "!mkdir -pv ~/vo_adar_release/v1.0/snp_genotypes/all/1357-VO-BR-SALLUM-VMF00326/\n", + "!gsutil -m rsync -r \\\n", + " -x '.*/calldata/(AD|GQ|MQ)/.*' \\\n", + " gs://vo_adar_release_master_us_central1/v1.0/snp_genotypes/all/1357-VO-BR-SALLUM-VMF00326/ \\\n", + " ~/vo_adar_release/v1.0/snp_genotypes/all/1357-VO-BR-SALLUM-VMF00326/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8ABQPPgAnvph" + }, + "source": [ + "## Feedback and suggestions\n", + "\n", + "If there are particular analyses you would like to run, or if you have other suggestions for useful documentation we could add to this site, we would love to know. Please get in touch via the [malariagen/vector-data GitHub discussion 
board](https://github.com/malariagen/vector-data/discussions)." + ] + } + ], + "metadata": { + "celltoolbar": "Tags", + "colab": { + "collapsed_sections": [ + "8ABQPPgAnvph" + ], + "name": "Ag3.0-data-downloads.ipynb", + "provenance": [] + }, + "environment": { + "kernel": "mgenv_7.2.0", + "name": "workbench-notebooks.m139", + "type": "gcloud", + "uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m139" + }, + "kernelspec": { + "display_name": "Python (mgenv_7.2.0)", + "language": "python", + "name": "mgenv_7.2.0" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.11" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}