Part 3: the paper with Wuhan mosquito virus 6 and Josh



If you’re going in chronological order you’ve already read about how I got involved with the Chan Zuckerberg Biohub’s individual mosquito transcriptome project, the exciting findings from it and the issues it faced, and a brief foray into collaboration on one of the interesting viruses we stumbled upon. Now I’ll tell you about the most recent paper that came out of the Californian mosquito data. It’s both a kind of love story and a mission statement for my lab as we go forward.


Picture this - as we’re wrapping up and polishing the main individual Californian mosquito transcriptome paper a number of analyses that I had done remain overboard. Many of them are from a time when we thought of the paper as being a compilation of vignettes - short self-contained stories highlighting some interesting aspect of virus biology that was enabled by our study design. Some I’m keeping for potential metagenomically-inclined PhD students, others were not that exciting in retrospect, and the rest I’ll talk about here.

I should also say that this particular study lingered on my desk for an unbelievably long time. I had an early draft of this paper in late 2020 with a slightly different framing that was marinating until 2022 when Josh and I mustered enough motivation to push through the last round of revisions for it. After that we put it up on bioRxiv and initially sent it to Review Commons for peer review (try it if you haven’t!). I was very pleasantly surprised when an editor of Evolution Letters reached out to see if we wanted to submit it there but unfortunately we had other plans in mind. Since Journal of Virology doesn’t recognise reviews from Review Commons we had to do more rounds of peer review, which was fine by me. We even got some solid confirmation that we weren’t getting the interesting results because of some fluke.

Before I go into any of the details I feel like I should give you one crucial piece of information to understand why I did this study the way I did it. I love orthomyxos (members of Orthomyxoviridae). Have since my early PhD days. It’s what my first first-author paper is on. I love that its diversity is amenable to family-wide analyses, love the genome organisation, love its tractable reassortant way of recombining, and I certainly appreciate that its members can be a problem for vertebrate health.

The vignette

One of the first things that was left overboard with the Batson et al. (2021) paper was a reassortment network of Wuhan mosquito virus 6 (WuMV-6). The paper felt busy as it was without an arbitrary deeper dive into an obscure virus. We already had all the orthomyxo finds we could want - an eight-segmented clade, reconstruction of putative segments from other people’s data, some phylogenetics showing expected patterns of diversity and some reassortment. A deeper dive into WuMV-6 would’ve detracted from the paper whilst not doing our finding any justice. It was agreed that I was free to pursue the story on my own.

As I was working through some early analyses with WuMV-6 genomes from China, California, and Australia more WuMV-6 sequences started showing up. First in Sweden, then a collaborator working on mosquitoes in Cambodia found it in their sequence data. This pattern would continue over the years and today we know WuMV-6 is present as far as Trinidad, Tunisia, Madagascar, etc. Such volumes of data collected over time, particularly when segmented (now that we have the right method to analyse them), are prime targets for analyses in BEAST. Surprisingly enough (since not every RNA virus works in this regard) WuMV-6 genomes showed sufficient molecular clock signal to be calibrated with minimally informative priors and we had our very own 27-genome reassortment network.

Something that immediately caught my eye is how recent the common ancestry of WuMV-6 genomes was. If you look at WuMV-6 diversity outside of Sweden (which shares a common ancestor with the rest in the ~1950s) all of it derives from a single genome that existed in the last ~20 years despite its descendants now being found around the perimeter of the Pacific Ocean - California, China, Cambodia, Australia. Furthermore, we can see reassortment events taking place in the last ~8 years involving lineages that are later found in Australia and California, i.e. opposite sides of the Pacific Ocean. Remarkable rates of migration! But are they really?

Did I mention reassortment networks are cool?

I think it’s entirely fair to say that we know very little about the lives of insects. We can make many unusual observations - Sigmaviruses in Drosophila being exclusively vertically transmitted yet somehow jumping from one species to another in the last couple of hundred years, Sigmaviruses sweeping through the UK on the order of decades, the classic global P element (a transposable element) sweep in Drosophila melanogaster that took less than 50 years, and (as I learned from Darren Obbard on a recent visit) a P element invasion of Drosophila simulans that took just a few years to establish globally but has not yet fixed in all populations. We have observations of a whole spectrum of migration rates (often involving vertical transmission) and they’re unambiguous, yet we have very little understanding of how those migrations occur. With this little caveat in mind we can proceed to some hypotheses of WuMV-6 migration.

Ships, winds, or meaty spaceships?

It’s common knowledge in arbovirology that mosquitoes are not travelers. They’ll at most travel a few kilometres in their short lives. We’ll set aside the speed with which the vertically-inherited P element invaded Drosophila simulans (also not a strong flier) and look for alternative ways for mosquitoes to go far and fast. Humans have very efficient technological means to go far and fast. We also know that mosquitoes can get swept up into the atmosphere where they can ride high altitude winds and get deposited hundreds of kilometres away. And then there’s the classic arbovirus method of travel - inside meaty spaceships called vertebrates. We’ve seen it with Zika over the last 50 years and we’ve seen it with West Nile invading North America in 1999.

We need an extra puzzle piece from WuMV-6 to answer this so I looked into segment-wide dN/dS. I was initially dismayed that segments hypothetical 3 and 2 (unknown function) showed higher segment-wide dN/dS than gp64 (the surface protein). My initial hypothesis was that if WuMV-6 is infecting vertebrates we might see some evidence of antigenic drift that would manifest in a higher dN/dS or higher rates of non-synonymous evolution (seeing as time is a more sensitive way of normalising than more mutations). This was a bit too single-minded of me and it turns out that an analogous situation occurs in influenza - the NS segment - similarly short - evolves much faster at the non-synonymous level than haemagglutinin (HA) which we know experiences antigenic drift. In light of this I changed my interpretation of the results - WuMV-6 surface protein gp64 distinctly experiences very high rates of non-synonymous evolution outside the normal range seen in its other longer proteins. Neither anthropogenic transportation methods nor abiotic factors like high altitude winds could explain this pattern but the involvement of vertebrate hosts exerting antibody pressure certainly could. So my current suspicion is that WuMV-6 does infect a somewhat longer-lived vertebrate host with some regularity. It’s almost certainly not humans given that we probably come into contact with WuMV-6 all the time but water birds, already susceptible to other quaranjavirus infections, could be good candidates.

Surface proteins and non-synonymous evolution

A tale of two (classes) of proteins

One of the goals that I set out for the paper was to have a kind of roadmap figure for orthomyxoviruses. What is the current state of their diversity? How many genomes are complete? What can we say about their surface proteins? What patterns emerge when we synthesise data across studies?

A few things we found:

  1. Most orthomyxovirus genomes are incomplete. Public sequence databases are rife with PB1, PB2, PA, and NP because they’re quite conserved and thus easiest to identify by a simple homology search. Without individual-animal metagenomic study designs it’ll take us a lot longer to reconstruct complete genomes and thus to identify and characterise potentially problematic viruses.
  2. Most orthomyxoviruses use one of two membrane fusion protein classes if not actual proteins. Currently known orthomyxoviruses seem to use just one of two membrane fusion protein classes - class I (the haemagglutinins, haemagglutinin-esterase-fusion proteins, and the like) found in vertebrate-infecting members and class III (gp64-like proteins) in predominantly invertebrate-infecting members. That’s an interesting limitation if it’s real. It also comes with curious exceptions - there’s a clade of recently discovered fishy orthomyxos (Steelhead trout orthomyxovirus-1 and Rainbow trout orthomyxovirus-2) that have a recognisable neuraminidase that’s not accompanied by a haemagglutinin which brings us to the next observation.
  3. Too many orthomyxoviruses have mislabeled genes. Despite distinctly not having haemagglutinins both the fishy clade and a number of quaranja- and thogotoviruses are labeled as such. All currently known thogoto- and quaranjaviruses use gp64 proteins (class III) and the fishy clade seems to be using another unrelated (or unrecognisably related) class I protein that seems closer to SARS-CoV-2 Spike or retroviral env. I blame this on the familiarity with influenza A as the archetypal orthomyxovirus (so anything that’s a surface protein gets called haemagglutinin) and the lack of curation on public databases.

All of these point to our ability to do better as the research community.

Are we there yet?

RNA virus diversity is finite. Yes, it is vast but it also has to be finite. Because of the way phylogenetics works, whenever we discover a new sequence we also get to know a bit about its evolutionary history. With some exceptions it does look like we have a decent idea of what the diversity of common RNA viruses looks like. As an example, if we discovered a new extant hominid species today we’d be extremely surprised, sure, but we’d also probably have a really good idea about its biology. In the same way, I reckon by now we know the broad brushstrokes of RNA virus evolution (as far as the common eukaryotic ones are concerned) and what forms they might take.

As metagenomic studies fill in the RNA-dependent RNA polymerase (RdRp) tree with the finer strokes eventually we should start seeing that the strokes aren’t adding that much detail and we can already tell what the painting is going to be. Because diversity is finite. Josh found one extreme example whose analysis we’d reimplement for our purposes. Think of what a newly discovered species of bird would look like these days. Probably very similar to something we’ve found before. In fact one study would suggest any new bird species discovered today is likely to be so close to its nearest relative that we’d have to squint to call it a new species at all.

Turns out what we were going to quantify already has a name, it’s phylogenetic diversity in ye olden traditional ecology/evolution literature. The idea is simple - take a phylogenetic tree of your sequences and go through each tip in chronological order of discovery. Traverse the tree from each tip back to the root marking every branch encountered with the year of that tip’s discovery unless a branch has been marked with another year by a previous traversal. Darren Obbard had done this before for viruses actually, but I had totally forgotten until he graciously reminded me about it (sorry, Darren!). This allows us to look at the temporal trends in orthomyxovirus PB1 phylogenetic diversity discovery and it shows that yes, overall we have evidence to say that orthomyxoviruses discovered each year are less novel. (Un)Fortunately, we cannot make the same statement about the most diverged members discovered each year. So even though on average orthomyxoviruses discovered now are less novel (as measured in substitutions per site, i.e. branch length), the most novel members discovered each year don’t show any trend (yet).
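The traversal described above can be sketched in a few lines of Python. This is only a minimal illustration and not the code used in the paper: the function name `pd_discovered_per_year`, the dictionary-based tree representation, and the toy tree with its branch lengths and discovery years are all made up for the example.

```python
from collections import defaultdict

def pd_discovered_per_year(parent, branch_length, discovery_year):
    """Sum the branch length first 'claimed' by tips discovered in each year.

    parent:         node -> parent node (the root is absent or maps to None)
    branch_length:  node -> length of the branch leading to that node
    discovery_year: tip  -> year the tip was discovered
    (All three structures are hypothetical, chosen for illustration.)
    """
    marked = {}  # node -> year in which the branch above it was first marked
    # Visit tips in chronological order of discovery.
    for tip in sorted(discovery_year, key=lambda t: discovery_year[t]):
        node = tip
        # Walk towards the root. Once we hit a marked branch we can stop:
        # the earlier traversal that marked it also marked everything above.
        while parent.get(node) is not None and node not in marked:
            marked[node] = discovery_year[tip]
            node = parent[node]
    gained = defaultdict(float)
    for node, year in marked.items():
        gained[year] += branch_length[node]
    return dict(gained)

# Toy tree: ((A:0.25, B:0.5)X:0.125, C:1.0)R
parent = {"A": "X", "B": "X", "X": "R", "C": "R"}
branch_length = {"A": 0.25, "B": 0.5, "X": 0.125, "C": 1.0}
discovery_year = {"A": 2000, "B": 2010, "C": 2020}

gained = pd_discovered_per_year(parent, branch_length, discovery_year)
print(gained)  # {2000: 0.375, 2010: 0.5, 2020: 1.0}
```

In the toy example tip A is discovered first, so it claims both its own branch and the internal branch above it. Tips B and C, discovered later, only contribute their terminal branches. On a real tree the same per-year totals (or per-tip contributions) are what you would plot against time to look for a trend.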

Phylogenetic diversity is great

Finally, we can forecast this phylogenetic diversity discovery into the future. Granted, with a container ship full of caveats about what sort of hosts we have focused on (or not), how much of the rare orthomyxoviruses we’re missing, etc. but we are putting something on the table to advance discussions. I’d be happy if we underestimated the discovery trajectory. Even happier if we got the trajectory right. I’d definitely be surprised/disappointed if we overestimated it.

A note on databases moving forward

There’s one final thing I feel I should point out about this study. At one point when I was assembling more WuMV-6 genomes I could only find FASTQ files on the China National GeneBank DataBase (CNGBdb). It is now not uncommon to see this happening more and more since researchers in China are currently the leaders of large scale metagenomic studies. There are two major issues with this.

Firstly, I’ve already encountered a situation where the accession number for a file I used changed between reviews and proofs. This is a very serious issue for reproducibility, especially since I couldn’t find any history of the changes, something that doesn’t happen on NCBI. I hope this happened because of some oversight on my part but if that wasn’t it then CNGBdb will find it very difficult to build trust with the research community.

Secondly, the CNGBdb is not (as far as I’m aware) integrated with any tools like BLAST. It’s fine if CNGBdb is intended as a data-only repository but the data must be mirrored somewhere where it can be accessed by the myriad of tools developed over the years (e.g. BLAST!). Given the COVID-19 pandemic and some questionable decisions at GISAID I understand concerns about control over sequence data but ultimately this will prove a huge detriment to researchers everywhere - scientists abroad won’t be able to find relevant sequences without tool integration and scientists in China won’t get credit for the work they’ve done as a result. As usual, there are much better returns on transparency.


I think the fields of metagenomic virus discovery and virus evolution are entering a very interesting time largely thanks to how cheap sequencing is becoming but also because of advances in artificial intelligence that, through a combination of protein structure prediction and easily accessible, i.e. democratised, frameworks, are leading to truly novel diversity discovery. As we’re generating more and more sequence data, however, it feels like many groups performing metagenomic sequencing in search of new viruses are stepping back or maybe just getting laser-focused on more ecological questions. Here’s what I mean - the first complete (eight-segmented) WuMV-6 genomes were deposited by our team on GenBank in May 2021. Since then there have only been four WuMV-6 sequences deposited on GenBank - three out of four are PB1 sequences, one is PA. If you check the literature, however, there are tens of SRA entries from which you can reconstruct perfectly decent and complete WuMV-6 genomes. Basically, people aren’t processing their sequence data to its full potential. I’m certainly not complaining - it looks like a new bottom-feeder-like (no negative implications intended!) niche with a very low infrastructure footprint is opening up in science and folks like me stand to benefit a lot. It does suggest that we’ll need more tools like pebblescout to work the raw SRA data in addition to renewed pressure to make people share their data.

From talking to people and seeing some of the latest research coming out now it does seem like many are starting to recognise that RNA-dependent RNA polymerase (RdRp) diversity is a very important but ultimately limiting side of the story of RNA virus evolution. Whether you’re wondering what’s likely to pose a threat to human health or trying to understand why your insect virus seems to be killing males, I’d hazard a bet that the vast majority of the time those questions won’t be solved by understanding RdRp diversity but some other protein the virus codes for. As such, it becomes crucial to get complete virus genomes, an easy enough task for non-segmented viruses but increasingly difficult with the number of segments in segmented groups. The individual transcriptome across a geographic transect does seem like the winning study design here.

Lastly and most importantly, I think our paper is a good argument for continued metagenomic virus discovery efforts. Someone could look at the fact that we couldn’t name a third of the viruses we found in individual Californian mosquito transcriptomes (because they were named previously) and declare that mosquitoes have had enough attention and we should go look elsewhere. Our ability to use arthropod viruses to understand their hosts better is being noticed by other groups too and I think that is one argument for why metagenomic studies should continue, even in well-researched hosts. You’re still very likely to catch new viruses but crucially you’ll also contribute sequences that can answer questions about host populations in detail we couldn’t dream of thanks to phylodynamic methods we have at our disposal. So, as we said in our paper’s conclusion - keep going!