Data access from SRA and other resources#

Warning

Metagenomics analysis with QIIME 2 is in alpha release. This means that results you generate should be considered preliminary, and NOT PUBLICATION QUALITY. Additionally, interfaces are subject to change, and those changes may be backward incompatible (meaning that a command or file that works in one version of the QIIME 2 Shotgun Metagenomics distribution may not work in the next version of that distribution).

Setup#

Before we dive into the tutorial, let’s set up the required directory structure and make sure we have all the required software installed.

QIIME 2 metagenome distribution#

You can install the latest release of the QIIME 2 metagenome distribution by following the instructions here. Once installed, you can activate the environment by running the following command:

conda activate qiime2-shotgun-2024.2

Directory structure#

Below you can see the directory structure that we will use throughout this tutorial:

<your working directory>
├── moshpit_tutorial
│   ├── cache
│   ├── results

Once you have decided on the location of your working directory, let’s create the results subdirectory by running the following command:

mkdir -p moshpit_tutorial/results

Next, we create the cache subdirectory (this is where QIIME 2 will write the majority of the data) by running the following command:

qiime tools cache-create \
  --cache ./moshpit_tutorial/cache

We will be saving all the artifacts into that QIIME 2 cache and all the final visualizations and tables into the results directory. If you want to read more about the QIIME 2 cache, you can do so here.
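As a quick sanity check, the layout described above can be verified from the shell. The following is a minimal sketch only: the function name check_layout is hypothetical and not part of QIIME 2, and it simply reports whether the two tutorial directories exist under a given working directory.

```shell
# Hypothetical helper (not part of QIIME 2): report which tutorial
# directories exist under a given working directory.
check_layout() {
    base="${1:-.}"
    missing=0
    for dir in moshpit_tutorial/cache moshpit_tutorial/results; do
        if [ -d "$base/$dir" ]; then
            echo "ok: $dir"
        else
            echo "missing: $dir"
            missing=1
        fi
    done
    # Non-zero exit status if at least one directory is missing.
    return "$missing"
}

# Check the current working directory; `|| true` keeps the shell
# session going even if something is missing.
check_layout . || true
```

The helper does not create anything itself, since the cache directory is meant to be created by the cache-create command above rather than by mkdir.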

Required databases#

To perform the taxonomic and functional annotation, we will need a couple of different reference databases. Below you will find instructions on how to download these databases using the respective QIIME 2 actions.

Kraken 2/Bracken database

qiime moshpit build-kraken-db \
    --p-collection standard \
    --o-kraken2-database ./moshpit_tutorial/cache:kraken_standard \
    --o-bracken-database ./moshpit_tutorial/cache:bracken_standard \
    --verbose

EggNOG databases

qiime moshpit fetch-diamond-db \
    --o-diamond-db ./moshpit_tutorial/cache:eggnog_diamond_full \
    --verbose

qiime moshpit fetch-eggnog-db \
    --o-eggnog-db ./moshpit_tutorial/cache:eggnog_annot_full \
    --verbose

Data retrieval from SRA#

The data we are using in this tutorial can be fetched from the SRA repository using the q2-fondue plugin. You can fetch the list of accession IDs using the following command:

wget TODO

Next, we need to import the accession IDs into a QIIME 2 artifact:

qiime tools import \
  --type NCBIAccessionIDs \
  --input-path ids.tsv \
  --output-path ./moshpit_tutorial/ids.qza
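For orientation, the import above reads a plain tab-separated file of accession IDs. The sketch below writes a small example file, assuming a single column with the header ID, which is how we understand the q2-fondue accession format. The file name example_ids.tsv and the accession values are placeholders for illustration and are not the tutorial's dataset.

```shell
# Write a small example accession list. The header "ID" is an assumption
# about the expected format, and the accessions are placeholders only.
printf 'ID\nSRR000001\nSRR000002\n' > example_ids.tsv

# Show the header line.
head -n 1 example_ids.tsv

# Sketch-level check: every line after the header should look like a run
# accession (SRR/ERR/DRR followed by digits). Other accession types
# (e.g. project IDs) would need a different pattern.
if tail -n +2 example_ids.tsv | grep -Eqv '^[SED]RR[0-9]+$'; then
    echo "unexpected line found"
else
    echo "all lines look like run accessions"
fi
```

The example is written to a separate file so that it does not overwrite the ids.tsv used in the tutorial.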

Finally, we can use the get-sequences action to download the data (please insert your e-mail address in the --p-email parameter):

qiime fondue get-sequences \
    --i-accession-ids ./moshpit_tutorial/ids.qza \
    --p-n-jobs 16 \
    --p-email you@tutorial.com \
    --o-single-reads ./moshpit_tutorial/cache:reads_single \
    --o-paired-reads ./moshpit_tutorial/cache:reads_paired \
    --o-failed-runs ./moshpit_tutorial/cache:failed_runs \
    --verbose

This will download all the sequences into the QIIME 2 cache. It is a lot of data, so keep in mind that depending on your network speed, this might take a while. Once the data is downloaded, you can proceed to one (or more) of the following steps:

Annotation of reads (TODO)
Generation and annotation of contigs (TODO)
Generation and annotation of MAGs (TODO)
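Whichever step you choose, it can be useful to first confirm that the downloaded reads are present in the cache. A minimal sketch, assuming the cache path used above and that the qiime tools cache-status command is available in your installed distribution:

```shell
# Path to the tutorial cache created earlier.
cache=./moshpit_tutorial/cache

# Only call QIIME 2 if it is on the PATH and the cache directory exists;
# otherwise print a note and carry on.
if command -v qiime >/dev/null 2>&1 && [ -d "$cache" ]; then
    qiime tools cache-status --cache "$cache"
else
    echo "qiime or $cache not available; skipping cache check"
fi
```

If the download succeeded, the keys used in the get-sequences command should be listed in the output.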

Before we jump into the next sections, let’s discuss our parsl_config! We are using Parsl to parallelize our computationally expensive jobs, which in this tutorial means running Kraken 2 and EggNOG-mapper. In the config we can decide how many workers, cores, nodes and blocks to use. You can see an example here (Slurm-ConFig).
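To make the terms workers, cores, nodes and blocks more concrete, the sketch below writes an illustrative Parsl configuration in the TOML layout that QIIME 2 uses for parallel configs. This is not the tutorial's actual config: the file name example_parsl_config.toml and all numeric values are assumptions chosen for illustration, and the available keys may differ between Parsl versions.

```shell
# Write an illustrative Parsl config for a Slurm cluster.
# All values below are placeholders; adjust them to your cluster.
cat > example_parsl_config.toml <<'EOF'
[parsl]
strategy = "None"

[[parsl.executors]]
class = "HighThroughputExecutor"
label = "default"
max_workers = 4          # workers started on each node

[parsl.executors.provider]
class = "SlurmProvider"
nodes_per_block = 1      # nodes requested per Slurm job (block)
cores_per_node = 8       # cores requested on each node
max_blocks = 2           # upper limit on concurrent Slurm jobs
walltime = "01:00:00"    # time limit per block
EOF

# Show the provider settings that were written.
grep -A 5 'parsl.executors.provider' example_parsl_config.toml
```

A file like this is typically handed to a QIIME 2 pipeline through its --parallel-config option. Check the help text of the action you are running to confirm that it supports parallel execution.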