Integrate metadata in Actions#
We’ve done a lot so far in q2-dwq2, but we’ve left out one thing that is common in almost all real plugins: the integration of metadata in Actions.
In this relatively brief section, we’ll enable users to optionally add metadata associated with reference sequences to the tables generated by the tabulate-las-results Visualizer and the search-and-summarize Pipeline.
This metadata could be just about anything.
For example, reference metadata could be taxonomic information about the reference sequences, enabling viewers of the resulting visualization to infer the taxonomic origin of their query sequences from the taxonomy associated with the closest matching reference sequences.
(That example is also used in the Sequence Homology Searching chapter of An Introduction to Applied Bioinformatics [3], so you can refer there if the idea isn’t familiar.)
Let’s get started.
tl;dr
The code that I wrote for this section can be found here: caporaso-lab/q2-dwq2.
The input metadata#
In this example, the reference metadata will be a QIIME 2 metadata file that has reference sequence ids as its identifiers, and some number of additional columns containing the actual metadata.
For example, the tab-separated text (.tsv) metadata file could look like:
id phylum class genus species
ref-seq1 Bacteria Bacillota Bacillus_A paranthracis
ref-seq2 Bacteria Pseudomonadota Photobacterium kishitanii
ref-seq3 Bacteria Actinomycetota Cryptosporangium arvum
This metadata file will be passed into tabulate_las_results along with the LocalAlignmentSearchResults artifact that it already takes.
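To make the shape of this table concrete, here is a minimal sketch that parses the example above with pandas. It only approximates what QIIME 2 does when it loads a metadata file: pandas stands in for the QIIME 2 metadata loader so that the snippet runs without QIIME 2 installed, and the variable names are chosen for illustration.

```python
import io

import pandas as pd

# The example metadata from above, written with explicit tab separators.
example_tsv = (
    "id\tphylum\tclass\tgenus\tspecies\n"
    "ref-seq1\tBacteria\tBacillota\tBacillus_A\tparanthracis\n"
    "ref-seq2\tBacteria\tPseudomonadota\tPhotobacterium\tkishitanii\n"
    "ref-seq3\tBacteria\tActinomycetota\tCryptosporangium\tarvum\n"
)

# Assumption: using the first column as the index approximates the DataFrame
# that a qiime2.Metadata object produces, with the ids as the index.
reference_metadata = pd.read_csv(
    io.StringIO(example_tsv), sep="\t", index_col=0)

print(reference_metadata.index.name)
print(reference_metadata.columns.to_list())
```

The identifiers end up in the index, and everything else is an ordinary column whose name the plugin does not know in advance.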
Note
By convention in QIIME 2, all relevant identifiers (in our case, those that could be present in the LocalAlignmentSearchResults artifact) should be represented in the metadata, or an error should be raised.
However, “extra” identifiers in the metadata that do not show up in the corresponding artifacts (again, the LocalAlignmentSearchResults in this example) are allowed.
The goal of this convention is to let users maintain only a single metadata file.
If an upstream artifact (in this case, the sequences in the FeatureData[Sequence] artifact that is passed into local-alignment-search as reference_seqs) is ever filtered to remove some sequences, the user shouldn’t have to create a filtered version of their metadata file.
The reason for this is two-fold.
First of all, that would generally be an extra, unnecessary step, so it complicates users’ workflows.
But, more importantly, that could result in a proliferation of metadata files that need to be kept in sync.
This is another application of the DRY principle: don’t encourage your users to duplicate information represented in their metadata file.
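The convention described in this note boils down to a one-directional set comparison. The following is a small sketch of that idea using only the standard library; check_ids is a hypothetical helper name, not part of QIIME 2 or q2-dwq2.

```python
def check_ids(required_ids, metadata_ids):
    """Raise KeyError if any required id is absent from the metadata.

    Hypothetical helper: ids that appear only in the metadata are ignored.
    """
    missing = set(required_ids) - set(metadata_ids)
    if missing:
        raise KeyError(f"Missing from metadata: {sorted(missing)}")


metadata_ids = {"ref-seq1", "ref-seq2", "ref-seq3"}

# Extra metadata ids are fine: ref-seq3 is simply unused here.
check_ids({"ref-seq1", "ref-seq2"}, metadata_ids)

# An id that has no metadata is an error.
try:
    check_ids({"ref-seq1", "ref-seq9"}, metadata_ids)
except KeyError as e:
    print(e)
```

Because only the artifact-to-metadata direction is checked, a user can filter their sequences as often as they like and keep pointing at the same metadata file.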
Add an optional input to tabulate_las_results#
Now that we know what the metadata will look like, let’s add a new optional input to our tabulate_las_results function.
The new input we add will be called reference_metadata, and it will be received as a qiime2.Metadata object.
Our updated function signature in _visualizers.py will look like this:
def tabulate_las_results(output_dir: str,
                         hits: pd.DataFrame,
                         title: str = _tabulate_las_defaults['title'],
                         reference_metadata: qiime2.Metadata = None) \
        -> None:
If reference_metadata is provided, we’ll take a few steps to integrate it into hits before we write out our HTML file.
Otherwise, tabulate_las_results will behave exactly as it did before.
I added the following if block:
    if reference_metadata is not None:
        reference_metadata = reference_metadata.to_dataframe()
        hits.reset_index(inplace=True)

        metadata_index = reference_metadata.index.name
        metadata_columns = reference_metadata.columns.to_list()
        reference_metadata.reset_index(inplace=True)

        missing_ids = \
            set(hits['reference id']) - set(reference_metadata[metadata_index])
        if len(missing_ids) > 0:
            raise KeyError(
                f"The following {len(missing_ids)} IDs are missing from "
                f"reference metadata: {missing_ids}.")

        hits = pd.merge(hits, reference_metadata,
                        left_on='reference id',
                        right_on=metadata_index,
                        how='inner')

        hits.set_index(['query id', 'reference id'], inplace=True)
        column_order = \
            ['percent similarity', 'alignment length', 'score'] + \
            metadata_columns + ['aligned query', 'aligned reference']
        hits = hits.reindex(columns=column_order)
This looks like a lot, but it’s really a few simple actions.
First, convert the qiime2.Metadata object to a pd.DataFrame using the built-in to_dataframe method on qiime2.Metadata.
Then, remove the index from hits to prepare it for a pd.merge operation.
Next, cache the metadata’s index name because, importantly, we don’t know exactly what this index name will be. The QIIME 2 metadata format allows for a few available options, including id, sample-id, feature-id, and several others, and we don’t want to restrict a user to providing any specific one of these. We also cache the list of metadata columns, and remove the index to prepare it for the pd.merge operation with hits. Remember that we also don’t know what metadata columns the user will provide, and in this example we’re not putting any restrictions on this.
Then, we confirm that all ids that are represented in hits are present in reference_metadata, and we raise an informative error if any are missing. Our error message should help a user identify what’s wrong, so I chose to indicate how many ids were missing, and provide a list of them.
Next, we merge our hits and the reference metadata on the reference id column of hits and the (former) index of our metadata. Unlike for the metadata, we do know what this column of hits will be called because we were explicit about this when we defined our LocalAlignmentSearchResultsFormat, so we can refer to it directly. This is an important distinction that differentiates metadata from File Formats we define: if we need the flexibility to allow for arbitrary column names, we’re generally working with metadata. On the other hand, if our column names are predefined, we should generally be working with a File Format.
Then, we prepare the hits DataFrame for use downstream. To do this, we first set the MultiIndex. I also sorted the columns such that all of the metadata columns come between the percent similarity, alignment length, and score columns and the aligned query and aligned reference columns of the original hits DataFrame. When developing, I found this column order to be useful for reviewing the results.
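To see what these steps produce, here is a self-contained sketch that runs the same merge and reordering on toy data. The hit values and the single genus column are made up for illustration, and plain DataFrames stand in for the real artifact and the qiime2.Metadata object. The metadata index is deliberately named feature-id rather than id to show that the code does not depend on a particular index name.

```python
import pandas as pd

# Toy stand-in for the hits DataFrame (values are invented for illustration).
hits = pd.DataFrame(
    {
        "query id": ["q1", "q1", "q2"],
        "reference id": ["ref-seq1", "ref-seq2", "ref-seq1"],
        "percent similarity": [99.0, 91.5, 87.2],
        "alignment length": [100, 98, 95],
        "score": [195.0, 170.0, 160.0],
        "aligned query": ["ACGT", "ACGA", "ACTT"],
        "aligned reference": ["ACGT", "ACGT", "ACGT"],
    }
).set_index(["query id", "reference id"])

# Toy stand-in for the DataFrame obtained from the metadata object.
reference_metadata = pd.DataFrame(
    {"genus": ["Bacillus_A", "Photobacterium", "Cryptosporangium"]},
    index=pd.Index(["ref-seq1", "ref-seq2", "ref-seq3"], name="feature-id"),
)

# Flatten both tables so the merge can work on ordinary columns.
hits = hits.reset_index()
metadata_index = reference_metadata.index.name
metadata_columns = reference_metadata.columns.to_list()
reference_metadata = reference_metadata.reset_index()

merged = pd.merge(hits, reference_metadata,
                  left_on="reference id", right_on=metadata_index,
                  how="inner")

# Restore the MultiIndex and slot the metadata columns into the middle.
# reindex also drops the now-redundant feature-id column.
merged = merged.set_index(["query id", "reference id"])
column_order = (["percent similarity", "alignment length", "score"]
                + metadata_columns
                + ["aligned query", "aligned reference"])
merged = merged.reindex(columns=column_order)

print(merged.columns.to_list())
```

Note that ref-seq3 has metadata but no hits, so it never appears in the merged table, which is consistent with the convention that extra metadata ids are allowed.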
After this, our action proceeds the same as if no metadata was provided.
    with open(os.path.join(output_dir, "index.html"), "w") as fh:
        fh.write(_html_template % (title, hits.to_html()))
Update search-and-summarize#
We’ll also want this option to be available to users of search-and-summarize.
Since tabulate-las-results is already part of the search-and-summarize Pipeline, all we need to do in our _pipelines.py file is add the new optional parameter (reference_metadata), and pass it through in our call to tabulate_las_results_action.
Try to do that yourself, and refer to my code (caporaso-lab/q2-dwq2) as needed.
Update plugin_setup.py#
Next, we’ll need to make the plugin aware of this new parameter when we register the tabulate-las-results and search-and-summarize actions.
Metadata files are provided as Parameters of type qiime2.plugin.Metadata on action registration.
In plugin_setup.py, add Metadata to the list of imports from qiime2.plugin.
Then, add the new parameter and description to the dictionaries we created to house these for tabulate-las-results.
_tabulate_las_parameters = {'title': Str,
                            'reference_metadata': Metadata}
_tabulate_las_parameter_descriptions = {
    'title': 'Title to use inside visualization.',
    'reference_metadata': 'Reference metadata to be integrated in output.'
}
Recall that we reuse these dictionaries when we register search-and-summarize, so we only need to add this parameter and description in this one place.
At this point, you should be ready to use this new functionality with your plugin.
Open a terminal in an environment where your implementation of q2-dwq2 is installed, run qiime dev refresh-cache, and you should see the new parameter in calls to qiime dwq2 tabulate-las-results --help and qiime dwq2 search-and-summarize --help.
Add unit tests and update the search-and-summarize usage example#
Finally, wrap this up by adding new unit tests for the functionality. Do this on your own, and then refer to my code (caporaso-lab/q2-dwq2) to see how I did it.
In my code for this section, you’ll also find a metadata file that I provided which corresponds to the example reference sequences that were provided.
Use that file to update the search-and-summarize usage example, and then test your code on the command line with the new usage example.
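If you are unsure where to start with the unit tests, the two behaviors most worth covering are that extra metadata ids are tolerated and that missing ids raise an error. The sketch below shows one possible shape for such tests. It is not the test code from q2-dwq2: integrate_metadata is a hypothetical helper that mirrors the id check on plain DataFrames, so the tests can run without QIIME 2 installed.

```python
import unittest

import pandas as pd


def integrate_metadata(hits, metadata_df):
    """Hypothetical helper mirroring the id check in tabulate_las_results."""
    missing_ids = set(hits["reference id"]) - set(metadata_df.index)
    if len(missing_ids) > 0:
        raise KeyError(
            f"The following {len(missing_ids)} IDs are missing from "
            f"reference metadata: {missing_ids}.")
    return hits.merge(metadata_df, left_on="reference id",
                      right_index=True, how="inner")


class IntegrateMetadataTests(unittest.TestCase):
    def setUp(self):
        self.hits = pd.DataFrame({"query id": ["q1", "q2"],
                                  "reference id": ["ref-seq1", "ref-seq2"]})

    def test_extra_metadata_ids_allowed(self):
        # ref-seq3 has metadata but no hits; this should not be an error.
        md = pd.DataFrame(
            {"genus": ["Bacillus_A", "Photobacterium", "Cryptosporangium"]},
            index=pd.Index(["ref-seq1", "ref-seq2", "ref-seq3"], name="id"))
        observed = integrate_metadata(self.hits, md)
        self.assertEqual(observed["genus"].to_list(),
                         ["Bacillus_A", "Photobacterium"])

    def test_missing_ids_raise(self):
        # ref-seq2 has a hit but no metadata; this should raise.
        md = pd.DataFrame({"genus": ["Bacillus_A"]},
                          index=pd.Index(["ref-seq1"], name="id"))
        with self.assertRaisesRegex(KeyError, "ref-seq2"):
            integrate_metadata(self.hits, md)


# Run the tests without exiting the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(
    IntegrateMetadataTests)
result = unittest.TextTestRunner().run(suite)
```

In the plugin itself, tests like these would call tabulate_las_results directly with a real metadata object and follow whatever test class pattern the existing tests already use.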