Tracking the Start of SARS-CoV-2

Ranger222

Joined: Oct 20, 2007

Posts: 24,780

11:43a, 3/28/20

So I've used this in response to a few threads now, and a locked thread last night wanted me to start a new thread that explains how we can track the virus and determine the start date of this "outbreak" that originated in China.

Thanks to the low cost and availability of sequencing today, hundreds if not thousands of SARS-CoV-2 genomes have now been deposited online and shared with the scientific community for analysis. The group leading the way is nextstrain.org, which offers real-time tracking of pathogen evolution. They have made several interactive charts showing how the virus is mutating, which lets them track the introduction of an infection into a community and the resulting spread.

Here is an example of the power of their approach:

Trevor Bedford, a scientist at Fred Hutchinson in Seattle, shows that ~80% of the infections that occurred in Washington state most likely originated from one founder event, or infected person.

So how do they do this?

It's pretty simple. Every virus has a characteristic mutation rate, meaning that at a roughly constant rate, a single nucleotide (A, G, T (here U) or C) becomes altered somewhere within the ~30,000-base viral genome. All living things experience this, as the machinery that replicates genetic material is prone to errors. Sometimes those errors are beneficial to the organism, and that is how evolution occurs. Other times they are deleterious, and the mutation is quickly removed because of the loss of fitness to the organism. Sometimes they just don't matter at all and persist, doing neither harm nor good. They just accumulate over time. For this virus and others, they show a mutation rate of about 2 nucleotide changes per month.
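To put that figure on the scale usually quoted for viral evolution, the whole-genome rate can be converted into a per-site rate. This is a minimal sketch that uses only the two numbers above (about 2 changes per month, ~30,000 bases); the function and constant names are made up for illustration.

```python
GENOME_LENGTH = 30_000      # approximate genome size quoted above
MUTATIONS_PER_MONTH = 2     # whole-genome rate quoted above


def per_site_rate_per_year(mutations_per_month, genome_length):
    """Convert a whole-genome monthly rate into changes per site per year."""
    return mutations_per_month * 12 / genome_length


rate = per_site_rate_per_year(MUTATIONS_PER_MONTH, GENOME_LENGTH)
print(f"{rate:.1e} changes per site per year")
```

With these inputs the result is on the order of one change per thousand sites per year, which is why two genomes sampled a few weeks apart usually differ at only a handful of positions.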

So if we start with the original viral genome that first caused infection ---

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

After one month we will have at least one population of infections whose viral genome looks like this:

XXXXXXXXAXXXXXXXXXXXXXXBXXXXXXXXXXXXXXXXXX

We can then track this specific pattern to see where around the world this viral genome originated and how it is spreading.

You may have a different one that looks like this --

XXXAXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXBXXXXXXX

and we can track this new population as well.

So over time, this version can gain a new mutation

XXXAXXXXXXXXXXXXXXCXXXXXXXXXXXXXXXBXXXXXXX

that we can track as well, while still backtracking through the A and B mutations to its original founder, even though we now have a third mutation in C.
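The toy example above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Nextstrain actually works (real analyses build phylogenetic trees from aligned genomes). The positions, sample names, and helper functions are all hypothetical, and the letters A, B, and C are the same placeholders used above rather than real nucleotides.

```python
def apply_mutations(reference, changes):
    """Return a copy of `reference` with {position: new_letter} applied."""
    seq = list(reference)
    for pos, letter in changes.items():
        seq[pos] = letter
    return "".join(seq)


def mutations(reference, genome):
    """Set of (position, letter) differences between genome and reference."""
    if len(reference) != len(genome):
        raise ValueError("sequences must be aligned to the same length")
    return {(i, g) for i, (r, g) in enumerate(zip(reference, genome)) if r != g}


founder = "X" * 42  # stand-in for the original genome
samples = {
    "variant_1": apply_mutations(founder, {8: "A", 23: "B"}),
    "variant_2": apply_mutations(founder, {3: "A", 34: "B"}),
    "variant_2_child": apply_mutations(founder, {3: "A", 18: "C", 34: "B"}),
}
profiles = {name: mutations(founder, seq) for name, seq in samples.items()}


def likely_ancestors(name):
    """Samples whose mutations are a strict subset of this sample's mutations."""
    return sorted(
        other for other, muts in profiles.items()
        if other != name and muts < profiles[name]
    )


print(likely_ancestors("variant_2_child"))  # ['variant_2']
```

The subset test is the whole trick: a genome that carries every mutation of another genome plus at least one more is a plausible descendant of it, while variant_1 shares nothing with the third genome and so belongs to a different branch.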

So how does this relate to finding out when the original outbreak began?

Well, if we know the mutation rate (two changes per month) and we have enough sequences from around the time of the initial outbreak reports (thankfully we do from China), we can go back to the earliest days, see what the viral genome looked like then, and build a "molecular clock" to find the beginning of the outbreak.

This YouTube video is a seminar given by Trevor Bedford to Georgia Tech at the beginning of February, when the outbreak really hadn't even reached the US yet. He says that of 5 viral genomes from China, three were identical (from the same virus that had not mutated) while the other two differed by only 3 mutations each. This told him that at that point in January, sequence diversity was low, and the start of the outbreak could be backtracked only 1-2 months earlier. I've started the YouTube video at the point where he discusses this.
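The back-of-the-envelope version of that reasoning is just the number of mutations divided by the rate. Here is a rough sketch, assuming a strict clock of 2 changes per month; the sampling date is hypothetical (the post only says "January"), the average month length is an approximation, and the function name is invented. A real analysis fits a statistical model over many genomes rather than one division.

```python
from datetime import date, timedelta

DAYS_PER_MONTH = 30.44  # average month length, an approximation


def estimate_origin(sample_date, mutations_from_founder, mutations_per_month=2.0):
    """Walk the clock back from a sample date to a rough founder date."""
    months_back = mutations_from_founder / mutations_per_month
    return sample_date - timedelta(days=round(months_back * DAYS_PER_MONTH))


# Hypothetical sample collected 15 Jan 2020, 3 mutations away from the founder.
print(estimate_origin(date(2020, 1, 15), 3))
```

Three mutations at two per month is about a month and a half, which lands in the 1-2 month window described above.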

Trevor is not the only one who thinks this. Here is another analysis, from Andrew Rambaut at the University of Edinburgh, that uses modeling and phylodynamic analysis of 176 genomes to give a mid-November date. TMRCA = time to the most recent common ancestor.

Data                 Coalescent model     Estimated TMRCA   95% interval
12-Feb, 75 genomes   Exponential growth   29-Nov-2019       28-Oct-2019 to 20-Dec-2019
24-Feb, 86 genomes   Exponential growth   17-Nov-2019       27-Aug-2019 to 19-Dec-2019

Rambaut Analysis

So what does this mean?

There are a lot of conspiracy theories out there that SARS-CoV-2 has been out and in the population for quite some time, perhaps last fall. I love a good conspiracy theory, but unfortunately this one cannot be true BASED ON ALL THE ACCUMULATED SCIENTIFIC DATA. If you had something in December or January that knocked you and your whole family out for a good week, it was NOT SARS-CoV-2. You are not immune right now to this virus.

21

Keegan99

Joined: Oct 10, 1999

Posts: 99,999

In reply to Ranger222 • 11:51a, 3/28/20

There was a poster on TexAgs who had a coworker return from Wuhan - having been to the wet market - in December. This coworker was reportedly out with "the flu" for two weeks.

Unless that person infected others, a la the Seattle arrival, how can you be sure they didn't have COVID?

9

cisgenderedAggie

Joined: Jul 19, 2014

Posts: 6,762

In reply to Keegan99 • 12:10p, 3/28/20

If they continue to do sequence analysis, it will come out in the end. If there was a separate, earlier founder, closer to the source, behind everyone's claims that they probably had it in November, December, or early January, it should eventually show up as a separate lineage.

This doesn't make it 100% certain that these claims are completely false, but it makes it very unlikely that this was happening. It's certainly possible that the sequenced samples are completely unrepresentative of the entire American population, but the more samples taken, the less likely that is. If there were earlier founders in different locations, this should be evident from the sequence analyses, and more so as they continue to observe.

This is from the 8th post in Trevor's thread

That's the UK, and it is suggestive of multiple founders. That's what the US sequences should look like if the stories of "this person was traveling and out with the flu for weeks..." are true. Note the rate of mutations and the timescale though. For those stories to be true, the divergence should happen sooner.

Very unlikely that this came to America in November.

1

2 edits

Post removed:

by user

12:29p, 3/28/20

Keegan99

Joined: Oct 10, 1999

Posts: 99,999

In reply to cisgenderedAggie • 12:34p, 3/28/20

But it all relies on the alleged coworker that returned from Wuhan infecting someone else and starting a cluster.

cisgenderedAggie

Joined: Jul 19, 2014

Posts: 6,762

In reply to [removed post] • 12:47p, 3/28/20

Yes, that's true. Which is why they need to continue to sequence samples from all over if anyone cares about the story of how it got here and when.

The data are freely available from nextstrain. I downloaded just the US data; there are 500 genomes from Jan through Mar. The following states are represented:
California
Arizona
Washington
Wisconsin
Illinois
Texas
New York
Connecticut
Massachusetts
Minnesota
Utah
And Grand Princess

3

cisgenderedAggie

Joined: Jul 19, 2014

Posts: 6,762

In reply to Keegan99 • 12:51p, 3/28/20

I suppose that's true. It's possible that everyone who traveled internationally was crazy careful about spreading illness right up until social distancing became a thing. Also possible that it just spread everywhere and people didn't start getting very sick until after the media told us to, but the 90% negative tests might argue against that.

I think it's way more likely that the sequencing data are telling a reasonably accurate story.

2

Ranger222

Joined: Oct 20, 2007

Posts: 24,780

In reply to Keegan99 • 7:06p, 3/28/20

Keegan99 said:
There was a poster on TexAgs who had a coworker return from Wuhan - having been to the wet market - in December. This coworker was reportedly out with "the flu" for two weeks.

Unless that person infected others, a la the Seattle arrival, how can you be sure they didn't have COVID?

I'm not sure I understand the question. The TMRCA analysis dates to mid-November, so a December infection could certainly have occurred, especially if they were at the site of the original founder. Seems to fit what was discussed, no?

BadMoonRisin

Joined: Aug 16, 2010

Posts: 19,342

In reply to Ranger222 • 7:15p, 3/28/20

Thank you for posting this. I posed a question on a thread a few days ago asking how plausible it was that this thing only got here between early January and now, and was just given surface-level answers like "read the papers", "google it", "it's not my job to educate you", etc.

This is exactly what I was asking for in the thread but was never provided, and it makes complete sense.

Again, thanks.

5

1 edit

insulator_king

Joined: Oct 22, 2004

Posts: 1,113

9:32p, 3/28/20

What if the mutation rate is 3 per month instead of 2?

Sounds like running the analysis over a range of bounds would also be informative.

I just GREATLY dislike the use of a single assumed coefficient in these types of modeling problems.

I think I got a C in Reservoir Engineering, which was chock full of various models, that I never have used.
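A quick way to see how much the assumed rate matters is to rerun the same simple clock calculation over a range of rates. This is a sketch only: the sample date and mutation count are hypothetical, the rates of 1, 2, and 3 per month are illustrative bounds rather than published values, and the function name is invented.

```python
from datetime import date, timedelta

DAYS_PER_MONTH = 30.44  # average month length, an approximation


def origin_date(sample_date, mutations, rate_per_month):
    """Rough founder date under a strict clock with the given rate."""
    days_back = mutations / rate_per_month * DAYS_PER_MONTH
    return sample_date - timedelta(days=round(days_back))


sample_date = date(2020, 1, 15)  # hypothetical sampling date
mutations = 3                    # hypothetical distance from the founder

# A faster assumed clock pulls the estimated start date later, not earlier.
for rate in (1.0, 2.0, 3.0):
    print(f"{rate:.0f} per month -> {origin_date(sample_date, mutations, rate)}")
```

Note the direction of the effect: if the true rate were higher than 2 per month, the same number of observed mutations would have needed less time to accumulate, so the estimated start of the outbreak would move closer to the sample date.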

2

BlackGoldAg2011

Joined: Sep 18, 2014

Posts: 3,474

In reply to insulator_king • 8:10a, 3/29/20

insulator_king said:
What if the mutation rate is 3 per month instead of 2?

Sounds like running the analysis over a range of bounds would also be informative.

I just GREATLY dislike the use of a single assumed coefficient in these types of modeling problems.

I think I got a C in Reservoir Engineering, which was chock full of various models, that I never have used.

It wouldn't if there actually isn't much uncertainty in the mutation rate. If they have that variable pretty much pinned down, running sensitivities on a range of values would just add artificial uncertainty where there isn't any. As a fellow PETE, I won't even begin to guess at their level of certainty, but the way he stated it, it sure sounds like they have reason to be confident in that rate of 2 per month.

1