Integrate metadata in Actions

We’ve done a lot so far in q2-dwq2, but we’ve left out one thing that is common in almost all real plugins: the integration of metadata in Actions. In this relatively brief section, we’ll enable users to optionally add metadata associated with reference sequences to the tables generated by the tabulate-las-results Visualizer and the search-and-summarize Pipeline. This metadata could be just about anything. For example, reference metadata could be taxonomic information about the reference sequences, enabling viewers of the resulting visualization to infer the taxonomic origin of their query sequences from the taxonomy associated with the closest matching reference sequences. (That example is also used in the Sequence Homology Searching chapter of An Introduction to Applied Bioinformatics [3], so you can refer there if the idea isn’t familiar.)

Let’s get started.

tl;dr

The code that I wrote for this section can be found here: caporaso-lab/q2-dwq2.

The input metadata

In this example, the reference metadata will be a QIIME 2 metadata file that has reference sequence ids as its identifiers, and some number of additional columns containing the actual metadata. For example, the tab-separated text (.tsv) metadata file could look like:

id	phylum	class	genus	species
ref-seq1	Bacteria	Bacillota	Bacillus_A	paranthracis
ref-seq2	Bacteria	Pseudomonadota	Photobacterium	kishitanii
ref-seq3	Bacteria	Actinomycetota	Cryptosporangium	arvum

This metadata file will be passed into tabulate_las_results along with the LocalAlignmentSearchResults artifact that it already takes.
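As a rough illustration of that layout, the sketch below parses the same example with pandas and uses the identifier column as the index. It is not part of q2-dwq2: in practice QIIME 2 loads the file itself and the action receives a qiime2.Metadata object, so the variable names here are only for illustration.

```python
import io

import pandas as pd

# The example metadata from above, inlined as tab-separated text.
example_tsv = (
    "id\tphylum\tclass\tgenus\tspecies\n"
    "ref-seq1\tBacteria\tBacillota\tBacillus_A\tparanthracis\n"
    "ref-seq2\tBacteria\tPseudomonadota\tPhotobacterium\tkishitanii\n"
    "ref-seq3\tBacteria\tActinomycetota\tCryptosporangium\tarvum\n"
)

# Read every column as text and use the first column (the ids) as the index.
reference_metadata_df = pd.read_csv(
    io.StringIO(example_tsv), sep="\t", dtype=str, index_col=0)

print(reference_metadata_df.index.name)  # id
print(reference_metadata_df.columns.to_list())
```

The index holds the reference sequence ids, and every other column is free-form metadata that the plugin makes no assumptions about.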

Note

By convention in QIIME 2, all relevant identifiers (in our case, those that could be present in the LocalAlignmentSearchResults artifact) should be represented in the metadata, or an error should be raised. However, “extra” identifiers in the metadata that do not show up in the corresponding artifacts (again, the LocalAlignmentSearchResults in this example) are allowed.

The goal of this convention is that we want to facilitate users maintaining only a single metadata file. If an upstream artifact - in this case the sequences in the FeatureData[Sequence] artifact that is passed into local-alignment-search as reference_seqs - is ever filtered to remove some sequences, the user shouldn’t have to create a filtered version of their metadata file. The reason for this is two-fold. First of all, that would generally be an extra, unnecessary step, so it complicates users’ workflows. But, more importantly, that could result in a proliferation of metadata files that need to be kept in sync. This is another application of the DRY principle: don’t encourage your users to duplicate information represented in their metadata file.
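The convention can be summarized as a one-directional set difference. The following is a minimal sketch using plain Python sets; find_missing_ids is a hypothetical helper written for this illustration, not something provided by QIIME 2 or q2-dwq2.

```python
def find_missing_ids(artifact_ids, metadata_ids):
    """Return ids used by the artifact that the metadata does not describe."""
    return set(artifact_ids) - set(metadata_ids)


metadata_ids = ["ref-seq1", "ref-seq2", "ref-seq3"]

# ref-seq2 was filtered out upstream, so the metadata contains an "extra"
# id. That is allowed: nothing is reported as missing.
filtered_hit_ids = ["ref-seq1", "ref-seq3", "ref-seq1"]
extra_only = find_missing_ids(filtered_hit_ids, metadata_ids)
print(extra_only)  # set()

# ref-seq9 appears in the hits but not in the metadata. This is the case
# that should result in an error.
bad_hit_ids = ["ref-seq1", "ref-seq9"]
missing = find_missing_ids(bad_hit_ids, metadata_ids)
print(sorted(missing))  # ['ref-seq9']
```

Only ids that are in the artifact but absent from the metadata are a problem; the reverse direction is never checked.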

Add an optional input to tabulate_las_results

Now that we know what the metadata will look like, let’s add a new optional input to our tabulate_las_results function. The new input we add will be called reference_metadata, and it will be received as a qiime2.Metadata object. Our updated function signature in _visualizers.py will look like this:

def tabulate_las_results(output_dir: str,
                         hits: pd.DataFrame,
                         title: str = _tabulate_las_defaults['title'],
                         reference_metadata: qiime2.Metadata = None) \
                         -> None:

If reference_metadata is provided, we’ll take a few steps to integrate it into hits before we write out our HTML file. Otherwise, tabulate_las_results will behave exactly as it did before. I added the following if block:

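if reference_metadata is not None:
    # Convert the qiime2.Metadata object to a pd.DataFrame.
    reference_metadata = reference_metadata.to_dataframe()

    # Move the index levels of hits into columns so we can merge on them.
    hits.reset_index(inplace=True)

    # The id header and the metadata columns are chosen by the user, so
    # look them up rather than assuming specific names.
    metadata_index = reference_metadata.index.name
    metadata_columns = reference_metadata.columns.to_list()
    reference_metadata.reset_index(inplace=True)

    # Every reference id in hits must be present in the metadata.
    missing_ids = \
        set(hits['reference id']) - set(reference_metadata[metadata_index])
    if len(missing_ids) > 0:
        raise KeyError(
            f"The following {len(missing_ids)} IDs are missing from "
            f"reference metadata: {missing_ids}.")

    # Attach the metadata columns to each hit.
    hits = pd.merge(hits, reference_metadata,
                    left_on='reference id',
                    right_on=metadata_index,
                    how='inner')

    # Restore the MultiIndex, and place the metadata columns between the
    # alignment statistics and the aligned sequences.
    hits.set_index(['query id', 'reference id'], inplace=True)
    column_order = \
        ['percent similarity', 'alignment length', 'score'] + \
        metadata_columns + ['aligned query', 'aligned reference']
    hits = hits.reindex(columns=column_order)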

This looks like a lot, but it’s really a few simple actions.

  1. First, convert the qiime2.Metadata object to a pd.DataFrame using the built-in method on qiime2.Metadata.

  2. Then, remove the index from hits to prepare it for a pd.merge operation.

  3. Next, cache the metadata’s index name because - importantly - we don’t know exactly what this index name will be. The QIIME 2 metadata format allows for a few available options, including id, sample-id, feature-id, and several others, and we don’t want to restrict a user to providing any specific one of these. We also cache the list of metadata columns, and remove the index to prepare it for the pd.merge operation with hits. Remember that we also don’t know what metadata columns the user will provide, and in this example we’re not putting any restrictions on this.

  4. Then, we confirm that all ids that are represented in hits are present in reference_metadata, and we raise an informative error if any are missing. Our error message should help a user identify what’s wrong, so I chose to indicate how many ids were missing, and provide a list of them.

  5. Next, we merge hits and the reference metadata, matching the reference id values from the index of hits against the index of our metadata. Unlike for the metadata, we do know what the index names of hits will be because we were explicit about this when we defined our LocalAlignmentSearchResultsFormat, so we can refer to them directly. This is an important distinction that differentiates metadata from File Formats we define: if we need the flexibility to allow for arbitrary column names, we’re generally working with metadata. On the other hand, if our column names are predefined, we should generally be working with a File Format.

  6. Then, we prepare the hits DataFrame for use downstream. To do this, we first set the MultiIndex. I also ordered the columns such that all of the metadata columns come between the percent similarity, alignment length, and score columns and the aligned query and aligned reference columns of the original hits DataFrame. When developing, I found this column order to be useful for reviewing the results.
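To see what these steps produce, they can be run on a small toy dataset. The sketch below is an illustration only: integrate_reference_metadata is a hypothetical helper (it is not a function in q2-dwq2), it takes a plain pd.DataFrame in place of a qiime2.Metadata object, the toy values are made up, and it uses non-inplace pandas calls so that the caller’s DataFrame is left unmodified.

```python
import pandas as pd


def integrate_reference_metadata(hits, reference_metadata):
    """Hypothetical helper mirroring the six steps on plain DataFrames."""
    # Step 2: move the index levels of hits into ordinary columns.
    hits = hits.reset_index()

    # Step 3: cache the user-chosen id header and metadata column names.
    metadata_index = reference_metadata.index.name
    metadata_columns = reference_metadata.columns.to_list()
    reference_metadata = reference_metadata.reset_index()

    # Step 4: every reference id in hits must be present in the metadata.
    missing_ids = \
        set(hits['reference id']) - set(reference_metadata[metadata_index])
    if len(missing_ids) > 0:
        raise KeyError(
            f"The following {len(missing_ids)} IDs are missing from "
            f"reference metadata: {missing_ids}.")

    # Step 5: attach the metadata columns to each hit.
    hits = pd.merge(hits, reference_metadata,
                    left_on='reference id',
                    right_on=metadata_index,
                    how='inner')

    # Step 6: restore the MultiIndex and order the columns. The reindex
    # also drops the leftover copy of the metadata id column.
    hits = hits.set_index(['query id', 'reference id'])
    column_order = \
        ['percent similarity', 'alignment length', 'score'] + \
        metadata_columns + ['aligned query', 'aligned reference']
    return hits.reindex(columns=column_order)


# Toy search results with made-up values.
hits = pd.DataFrame(
    {'query id': ['q1', 'q1', 'q2'],
     'reference id': ['ref-seq1', 'ref-seq3', 'ref-seq1'],
     'percent similarity': [98.0, 91.5, 87.0],
     'alignment length': [100, 100, 95],
     'score': [190.0, 160.0, 140.0],
     'aligned query': ['ACGT', 'ACGT', 'AC-T'],
     'aligned reference': ['ACGT', 'ACCT', 'ACGT']}
).set_index(['query id', 'reference id'])

# Toy metadata; the id header is deliberately not "id", and ref-seq2 is
# an "extra" id that no hit refers to.
reference_metadata = pd.DataFrame(
    {'genus': ['Bacillus_A', 'Photobacterium', 'Cryptosporangium'],
     'species': ['paranthracis', 'kishitanii', 'arvum']},
    index=pd.Index(['ref-seq1', 'ref-seq2', 'ref-seq3'], name='feature-id'))

merged = integrate_reference_metadata(hits, reference_metadata)
print(merged.columns.to_list())
```

Because the helper never refers to the metadata id header or column names directly, the same code works whether the user’s file starts with id, sample-id, or feature-id, and regardless of which metadata columns it contains.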

After this, our action proceeds the same as if no metadata were provided.

with open(os.path.join(output_dir, "index.html"), "w") as fh:
    fh.write(_html_template % (title, hits.to_html()))

Update search-and-summarize

We’ll also want this option to be available to users of search-and-summarize. Since tabulate-las-results is already part of the search-and-summarize Pipeline, all we need to do in our _pipelines.py file is add the new optional parameter (reference_metadata), and pass it through in our call to tabulate_las_results_action. Try to do that yourself, and refer to my code (caporaso-lab/q2-dwq2) as needed.

Update plugin_setup.py

Next, we’ll need to make the plugin aware of this new parameter when we register the tabulate-las-results and search-and-summarize actions. Metadata files are provided as Parameters of type qiime2.plugin.Metadata on action registration.

In plugin_setup.py, add Metadata to the list of imports from qiime2.plugin. Then, add the new parameter and description to the dictionaries we created to house these for tabulate-las-results.

_tabulate_las_parameters = {'title': Str,
                            'reference_metadata': Metadata}
_tabulate_las_parameter_descriptions = {
    'title': 'Title to use inside visualization.',
    'reference_metadata': 'Reference metadata to be integrated in output.'
}

Recall that we reuse these dictionaries when we register search-and-summarize, so we only need to add this parameter and description in this one place.

At this point, you should be ready to use this new functionality with your plugin. Open a terminal in an environment where your implementation of q2-dwq2 is installed, run qiime dev refresh-cache, and you should see the new parameter in calls to qiime dwq2 tabulate-las-results --help and qiime dwq2 search-and-summarize --help.

Add unit tests and update the search-and-summarize usage example

Finally, wrap this up by adding new unit tests for the functionality. Do this on your own, and then refer to my code (caporaso-lab/q2-dwq2) to see how I did it.

In my code for this section, you’ll also find a metadata file that I provided which corresponds to the example reference sequences that were provided. Use that file to update the search-and-summarize usage example, and then test your code on the command line with the new usage example.