(C) PLOS One This story was originally published by PLOS One and is unaltered. . . . . . . . . . . Sequential mutations in exponentially growing populations [1] ['Michael D. Nicholson', 'Edinburgh Cancer Research', 'Institute Of Genetics', 'Cancer', 'University Of Edinburgh', 'Edinburgh', 'United Kingdom', 'David Cheek', 'Center For Systems Biology', 'Department Of Radiology'] Date: 2023-08 Stochastic models of sequential mutation acquisition are widely used to quantify cancer and bacterial evolution. Across manifold scenarios, recurrent research questions are: how many cells are there with n alterations, and how long will it take for these cells to appear. For exponentially growing populations, these questions have been tackled only in special cases so far. Here, within a multitype branching process framework, we consider a general mutational path where mutations may be advantageous, neutral or deleterious. In the biologically relevant limiting regimes of large times and small mutation rates, we derive probability distributions for the number, and arrival time, of cells with n mutations. Surprisingly, the two quantities respectively follow Mittag-Leffler and logistic distributions regardless of n or the mutations’ selective effects. Our results provide a rapid method to assess how altering the fundamental division, death, and mutation rates impacts the arrival time, and number, of mutant cells. We highlight consequences for mutation rate inference in fluctuation assays. In settings such as bacterial infections and cancer, cellular populations grow exponentially. DNA mutations acquired during this growth can have profound effects, e.g. conferring drug resistance or faster tumour growth. In mathematical models of this fundamental process, considerable effort—spanning many decades—has been invested to understand the factors that control two key aspects of this process: how many cells exist with a set of mutations, and how long does it take for these cells to appear. In this paper, we consider these two aspects in a general mathematical framework. Surprisingly, for both quantities, we find universal probability distributions which are valid regardless of how many mutations we focus on, and what effect these mutations might have on the cells. The distributions are elegant and easy to work with, providing a computationally efficient alternative to intensive simulation-based approaches. We demonstrate the usefulness of our mathematical results by illustrating their consequences for bacterial experiments and cancer evolution. Funding: M.D.N is a cross-disciplinary post-doctoral fellow supported by funding from CRUK Brain Tumour Centre of Excellence Award (C157/A27589). The funders played no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. Copyright: © 2023 Nicholson et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. In this study we build upon the mathematical machinery developed in Refs. [ 10 , 11 ] to investigate this question. We focus on the biologically relevant settings of large times and small mutation rates. Broad-ranging features of the cell number, and arrival time, of type n cells are highlighted—including universal simple distributions—and explicit expressions make the impact of mutation and selection clear. The time until a cell type emerges, and expands to a detectable population size, depends on a variety of factors. Most obvious are the relevant mutation rates, however selection also plays an important role. For instance, if we start an experiment with an unmutated cell and wait for a cell with 2 mutations, a low division rate of cells with one mutation slows down this process. In the scenario of the sequential acquisition of driver alterations in cancer, with each mutation providing a selective advantage, Durrett and Moseley characterised the time to acquire n driver mutations [ 10 ]. We recently examined the setting of drug resistance conferring mutations, which often have a deleterious effect, so that the original cell type grew the fastest [ 11 ]. However, in general, the effects of mutation and selection on evolutionary timescales within exponentially growing populations remain unclear. To quantitatively characterise diseases, in settings such as cancer, and bacterial and viral infections, a concerted effort has been made to study evolutionary dynamics in exponentially expanding populations. Understanding the timescale of evolution is a key aspect of this research program which has proven useful in a diverse range of areas such as: measuring mutation rates [ 1 ], assessing the likelihood of therapy resistance developing [ 2 – 4 ], inferring the selective advantage of cancer driver events [ 5 – 7 ], and exploring the necessary steps in the metastatic process [ 8 , 9 ]. The common theme within these works is that they use information about when a particular cell type arises within the population of interest. For a concrete example, whose roots lie in the celebrated work of Luria and Delbrück [ 1 ], if we imagine a growing colony of bacteria, we might wish to know how quickly a mutant bacterium will develop with a specific mutation that confers resistance to an antibiotic therapy. The cancer evolution examples discussed above all assume that the type 1 cell has a driver mutation. In other settings, it may be more natural to consider the type 1 cells as wild type, for example when considering the emergence of drug resistance. We emphasise that in this paper the type one cells are always supercritical, that is they grow exponentially on average. For a more general model that describes a population with the potential to traverse multiple evolutionary paths, genotype space can be represented as a directed graph. When the original cell type has the largest net growth rate, we recently derived simple formulas for the arrival time and cell number through the directed graph of genotypes [ 11 ]. The results presented below, where the cell type with the largest net growth rate is unconstrained, hold only for a linear path through a genotype space. While in this work we cannot compare arbitrary sets of paths to a target evolutionary genotype, one may focus on each evolutionary path to the target type separately as a single linear path and then compare the median time to traverse each evolutionary path using the results presented below. For example, two sets of driver mutations might be considered: mini-drivers which have a high mutation rate, but low selective advantage, and major-drivers which have a low mutation rate but large selective advantage [ 15 ]. We would then compare the median times of the evolutionary paths ‘Driver 1 → Mini-driver → Driver 3’ and ‘Driver 1 → Major-driver → Driver 3’ to determine which path is most likely to produce the first cell with three driver mutations. When the cancer evolutionary trajectory is not specified, but it is assumed that driver mutations arise at a constant rate such that each new mutation confers a constant 1 + s d fold increase in the proliferation rate, then this model also falls within our framework. Bozic et al. [ 5 ] applied this model to cancer genetic data, thereby inferring the selective effect s d of driver mutations. Conversely to oncogenic drivers, neoantigen-creating mutations that stimulate the immune system to attack cancer cells have been modelled as increasing the death rate of the mutated cells by a factor of 1 + s n [ 14 ] ( Fig 2B ). Lakatos et al. [ 14 ] used this model to examine conditions such that a population of neoantigen-presenting cancer cells would be sufficiently large to be observed in sequencing data in order to explore the limits of detecting immune-mediated negative selection. Exploring how the distribution of the cell number with k neoantigens varies as function of s n and the neoantigen-mutation rate can be rapidly assessed with the results below. Cancer cells accumulate mutations with a variety of phenotypic effects during the cancer’s expansion. Oncogenic driver mutations are thought to increase the population’s net growth rate, either by increasing the proliferation rate or decreasing the death rate. A linear path is relevant when considering cancers that follow a specified evolutionary trajectory. For example, the canonical mutational path [ 12 , 13 ] in colorectal cancer is loss of APC (type 1 cells), followed by a KRAS mutation (type 2 cells have mutations in both genes), then loss of TP53 (type 3 cells with mutations in all 3 genes); see Fig 2B . A. Previous work has considered special cases of growth rate sequences, here we consider general sequences as long as λ 1 > 0. B. Two biological scenarios in which the growth rate sequences covered in this paper are relevant: the acquisition of driver mutations in the canonical carcinogenesis pathway of colorectal cancer, and the accumulation of neoantigens by cancer cells which results in increased cell death due to immune system surveillance. Our model considers a linear evolutionary path of cells sequentially mutating from type 1 to 2 to 3, and so on (see Figs 1A and 2 ). We briefly highlight scenarios for which our model is relevant, drawing on examples from cancer evolution (although similar statements can be made for other exponentially growing populations). We focus on two quantities; the number of cells of type n at time t—denoted Z n (t), and the arrival time of the first type n cell—termed τ n (see Fig 1B and 1C ). To describe the growth of the cellular populations, let the net growth rate of the type n cells be λ n = α n − β n . We denote the ‘running-max’ fitness, which is the largest growth rate of the cell types among 1, …, n, as δ n , that is δ n = max i=1,…,n λ i . Further, we introduce r n as the number of times the running-max has been attained over the cell types up to n, that is r n = #{i = 1, …, n : λ i = δ n }. A summary of the key notation used in this article is provided in Table 1 . A: We consider a multitype branching process in which cells can divide, die, or mutate to a new type. B: We study the waiting time until a cell of the nth type exists, τ n , starting with a single cell of type 1. C: Stochastic simulation of the number of cells over time, with dashed lines indicating the large-time trajectories given by Eq (1) . Grey horizontal line occurs at the inverse of the mutation rate, while the grey vertical lines indicate the time at which the type n population size reaches the inverse of the mutation rate, which gives the arrival time of the type n + 1 cells to leading order. Parameters: α 1 = α 3 = 1.1, α 2 = 1, β 1 = 0.8, β 2 = 0.9, β 3 = 0.5, ν 1 = ν 2 = 0.01. Thus, the net growth rates are λ 1 = 0.3, λ 2 = 0.1, λ 3 = 0.6 and the running-max fitness follows δ 1 = δ 2 = λ 1 , δ 3 = λ 3 . We consider a population of cells, where each cell can be associated with a given ‘type’ (for example ‘type 3’ might be cells with 3 particular mutations). Cells of type n divide, die, and mutate to a cell of type n + 1, at rates α n , β n and ν n , with all cells behaving independently of each other. With (n) representing a type n cell and ⌀ symbolising a dead cell, our cell level dynamics can be represented as (see also Fig 1A ): (1) In other words after a random, exponentially distributed waiting time with parameter α n + β n + ν n , a type n cell is replaced by one of the listed three options with probability proportional to its corresponding rate. The process starts with a single cell of type 1 at time t = 0, and we assume that the type 1 population is supercritical (α 1 > β 1 ) and that it survives forever (does not undergo stochastic extinction). Fig 5B shows likelihood inference for the mutation rate using both approaches assuming 100 simulated replicates and that 2 mutations (e.g. amplifications) confer resistance. The two inference approaches have strengths and weaknesses depending on the underlying mutation rate and the time t for which the cells are grown before being exposed to the drug. If t is too large the majority, or all, replicates will have resistant cells, and hence the number without resistance carries limited information on the mutation rate (e.g. the wide error bars for log 10 (ν) = −1.5 in the left plot of Fig 5B ). Instead, the long-time limit approximation of the mutant count distribution, Eq (1) , is appropriate, and here our simulated inference for the mutation rate closely matches the true parameter value ( Fig 4B ). However, if t isn’t large enough ( then Eq (1) poorly characterises the distribution of resistant cells (e.g. the incorrect inference for log 10 (ν) = −3 in the right plot of Fig 5B ); instead, the p 0 method enables accurate inference of the mutation rate. Hence, similar to the advice for the classic fluctuation assay [ 24 ], if only some replicates show resistance the p 0 method is preferred, whereas if all replicates have sizeable mutant numbers, inference using the mutant counts is advisable. Note that our inference here has assumed known birth rates and no death. These rates could be measured by standard experimental protocols, for example using growth curve assays. Kimmel and Axelrod [ 28 ] also gave statistical consideration to a fluctuation assay where two mutations are needed. However, in principle (neglecting experimental complexities), our results hold for any n, include death, and allow for varied growth rates between the cell types, extending the work of Ref. [ 28 ]. A. Schematic of a fluctuation assay for the measurement of mutation rates when n mutations are required for resistance. Drug sensitive cells are initially cultured, and after growth for a given time t, the cells are exposed to a selective medium. Non-resistant cells are killed, revealing the number of mutants. This experiment is conducted over replicates, and the number of replicates without resistance and the mutant numbers are recorded. B. Likelihood inference on a simulated fluctuation assay assuming: 2 mutations are required for resistance, 100 replicates, no death, α i = 1 for each i, t = 10, and the mutation rate ν stated on the x-axis. Wide error bars are expected when using the p 0 method for as only a small number of replicates have no resistant cells; in such a setting using the mutant counts (right panel) provides superior inference. Likewise, if the approximation of Eq (1) is not appropriate, which explains the inaccurate inference for log 10 (ν) = −3 when using the mutant counts; the p 0 method provides improved inference in this scenario. For simplicity, and as is typical for mutation rate inference, assume mutations are modelled as neutral (λ 1 = λ 2 = …) and that mutations occur at rate ν (ν = ν 1 = ν 2 = …). Suppose k replicates of a fluctuation assay are performed and the number of replicates without resistance, and/or the distribution of mutant numbers over replicates is recorded ( Fig 5A ). If the mutation rate ν is known, the distribution of replicates without resistance is binomial with k trials and success probability given by the logistic distribution of Eq (4) (further details on inference methodology is given in the supplementary material S1 Text ). In this setting the median arrival time of the (n + 1)th type is Hence, given the number of replicates without resistance, the unknown mutation rate ν may be inferred by maximum likelihood (p 0 method). Similarly, the mutant count distribution over replicates would be characterised by Eq (1) , which in this setting take the simple form of with V n an exponential random variable with mean Maximum likelihood for the mutant counts under this distribution provides a secondary approach to infer ν. Pairing mathematical models for the emergence of drug resistance during exponential population growth with experimental fluctuation assays enables the inference of mutation rates [ 1 , 23 ]. In the classic fluctuation assay, replicates are initiated by a small number of drug sensitive cells, which are then grown for either a fixed time period or until the total population reaches a given size. The cells are then exposed to the drug, killing non-resistant cells, which allows the number of replicates without resistance, and the mutant number in those replicates with resistance, to be measured. These experimental quantities are then combined with an appropriate statistical model to infer the mutation rate of acquiring resistance [ 24 ]. Originally, only wild type and mutated cells were considered in fluctuation assays. However, including multiple types is required when assessing multidrug resistance, investigating resistant-intermediates such as persistor cells [ 25 ], or if multiple gene amplifications are needed for therapy resistance. Gene amplifications are a prevalent resistance mechanism in cancer [ 26 ] and amplification rates have been previously reported using fluctuation assays [ 27 ], under the standard assumption of a single mutational transition to resistance. However, the modelling assumption of a single mutation imbuing therapy tolerance may be invalid if multiple amplifications are required for resistance. For example, the drug resistant WB 20 rat epithelial cell line in Tlsty et al [ 27 ] contained 4 gene copies, compared to the wild type having only 1 copy of the resistance gene. In such settings, to meaningfully infer amplification rates, an inference framework that describes sequential mutation acquisition is needed. With our results such a modified inference scheme can be constructed. Special cases of our results have been obtained previously. Durrett and Moseley [ 10 ] obtained the formulas for the arrival time in the special case λ 1 < λ 2 < ⋯ < λ n in the context of accumulation of driver mutations in cancer, and the leading order was also derived in [ 5 ]. A key conclusion of [ 5 , 10 ] follows directly from the representation of the difference in median arrival times given in Eq (7) : Assuming a constant driver mutation rate (ν 1 = … = ν n ), the median waiting time between the nth and (n + 1)th driver mutation is approximately which decreases as a function of n. Hence, under this model, tumor evolution accelerates during its growth [ 5 , 10 ]. For a comparison with the formulas of [ 10 ], note that in this case the running-max fitness for type j is always λ j , that is δ j = λ j , and so r j = 1 for all j. Further, the cell types in [ 10 ] are numbered from zero. Then the quantity as defined in this paper corresponds and agrees with c θ,n μ n of [ 10 ] (the formulas in [ 10 ] contain some misprints, but they are corrected in [ 22 ]). Durrett and Moseley [ 10 ] also pointed out that the shapes of the distributions of both the arrival time and the population size were independent of n. These distributions were also observed for the special case λ 1 > λ i for 1 < i ≤ n in [ 11 ]; this case was studied under the motivation of mutations that confer drug resistance but at a fitness cost. In the present paper we have found that even for a general sequence of net growth rates the distribution shapes remain independent of n and their dependence on the rate parameters can be written in relatively simple terms. The formulas for the arrival times ( 7 ) are valid for small mutation rates, and to leading order the increase in the median arrival time for each new type (i.e. ) is . An intuitive understanding can be gained by assuming that: (i) the arrival time for the type n + 1 cells approximately occurs when the type n population size reaches 1/ν n and (ii) we can ignore fluctuations in population size such that the type n population grows exponentially as in the deterministic factor of Eq (1) . Then, for the case n = 1, we simply find as the time it takes an exponentially growing population to grow from one cell to 1/ν 1 , that is we solve , which reproduces the leading order of Eq (6) as ν 1 → 0. Similarly, for the arrival times for type n + 1, suppose we start an exponential function at with net growth rate δ n ; this growth will take time to reach the threshold of from one cell. To leading order in small mutation rates, this reproduces the recursion of Eq (7) . The arrival time density has a general shape centred at ( Fig 4 ). As expected, the median arrival time increases with n or as the mutation rates decreases, and the recursion of Eq 7 explicitly details how these parameters interact. In contrast, the variance of the arrival time is always . Moreover, the entire shape of the distribution, which is centered around , is determined only by λ 1 . Thus due to the constant variance, for , modellers may safely ignore the stochastic nature of waiting times and treat the arrival time of the type n cells as deterministic. However, our result raises questions for statistical identifiability; aiming to distinguish between models, e.g. does a phenotype of interest require 2 or 3 mutations, based on fluctuations may be difficult due to the common logistic distribution. Our approximation ( 1 ) for the cell number of the type n cells is valid for large times. Additionally, small mutation rates are required when the running-max fitness increases, so λ 1 < δ n . Heuristically, we expect the approximation to be valid at large enough times such that the type n cells have been seeded with high probability, that is for . Around the arrival time for the type n cells, , fluctuations in the cell number can be greater, which can be seen even in the two-type setting. In the two-type neutral case (λ 1 = λ 2 ), from Eq (1) we expect that, for , where V 2 is exponentially distributed, and therefore has an exponentially decaying tail. However, for (or ), it is known that Z 2 (t) has a heavy-tailed distribution, commonly known as the Luria-Delbrück distribution [ 19 – 21 ]. On the other hand, for λ 1 < λ 2 , we found that V 2 does have a power-law heavy-tail as for the Luria-Delbrück distribution. Therefore, at times around the arrival time for type n cells, the fluctuations in cell number may exceed the characterisation given in Eq (1) , but at larger times they are described by the Mittag-Leffler random variable V n . We also note that, in the scale parameter recursion of Eq 2 , when mutations are mildly deleterious (0 < δ n − λ n+1 ≪ 1), the scale parameter can take large values. Therefore, caution should be adopted when using our approximation in this case. The random amplitude of the deterministic growth, V n , has a Mittag-Leffler distribution, with infinite mean if λ 1 < δ n , which is driven by a power-law decay in its distribution. Intuition for the tails can be gleaned from the case of n = 2 [ 19 ]. In the λ 1 < λ 2 case, the power-law tail arises due to rare, early mutations from the type 1 cells. The descendants of these early mutations make a considerable contribution to the total number of type 2 cells even at large times (see discussion of Theorem 3.2 in [ 19 ]). However, for λ 1 ≥ λ 2 , the type 2 descendants from any given mutation eventually make up zero proportion of the type 2 population. Instead, the sheer number of new mutations from the type 1 cells drives the growth of the type 2 population, and in this case the tail decays exponentially. To move to type n, from Eq (3) we see that if δ n ≥ λ n+1 then the randomness in the cell number is inherited from type n to type n + 1. Thus if the running-max fitness does not exceed the growth rate of the type 1 population, that is if δ 1 = δ n , then an exponential distribution will be propagated, i.e. all follow an exponential distribution. However, if the running-max fitness does increase, then for the first i such that δ i < λ i+1 , a power-law tail will emerge for V i+1 . For types that occur after the emergence of the power-law, that is for j > i + 1, if the running-max fitness does not increase then the power-law with tail-exponent λ 1 /λ i+1 will be propagated, again due to the inheritance property of Eq (3) . If instead the running-max fitness increases again, i.e. there is j > i + 1 such that λ i+1 < λ j , then the power-law tail remains but with the exponent decreased to λ 1 /λ j . Thus, if the running-max fitness ever rises above δ 1 , the tail of the random amplitude has a power-law decay with a monotone decreasing exponent λ 1 /δ n . From Eq (1) , we see that on a logarithmic scale (as in Fig 1C ), at large times the number of cells approximately follows a straight line with gradient that increases only when the running-max fitness increases. When the running-max fitness does increase (δ n−1 < λ n ), then the type n cell number grows exponentially with rate λ n . Conversely, if the type n cells have net growth rate smaller than the running-max fitness (δ n−1 > λ n ), then as the large time behaviour of the type n cell number is exponential growth with rate δ n−1 = δ n , the flux from the type n − 1 population eventually drives the cell growth. One can observe this behaviour in Fig 1C : although the type 2 cells have lower fitness than type 1, the population sizes both eventually grow at the same rate of λ 1 . However, the type 3 cells have the largest fitness so far, hence the cell number grows at its own rate λ 3 . When the type n cells have net growth rate equal to the running-max fitness (δ n−1 = λ n ), relevant for a neutral mutations scenario, then exponential growth at rate δ n occurs but with an additional geometric factor of . The origin of this geometric factor is best understood by considering the mean growth for n = 2, λ 1 = λ 2 [ 19 ]. In this case mutations occur at rate proportional to and the average number of descendants from a mutation which occurs at time s is by time t. Hence, at time t, the mean number of mutants is , which is the same geometric factor that appeared as for the limit result Eq (1) . Extending this argument to type n explains the geometric factor. For the case where each running-max fitness is attained only by one type (r i = 1 for each i) then the medians satisfy the following recursion: with (6) then for n ≥ 2 7) where c n is defined immediately after Eq (2) . If the running-max fitness may be obtained multiple times, then a more detailed recursion also exists, given as Lemma 6 in Methods. Note that since the distribution in Eq (4) is symmetric, the median and the mean coincide. Normalized histogram for the arrival times of types 1–3 obtained from 1000 simulations of the exact model versus the probability density corresponding to the logistic distribution of Eq (4) . Note the shape of the distribution remains unchanged. Parameters: α 1 = α 3 = 1, α 2 = 1.4, ν 1 = ν 2 = ν 3 = 0.01, β 1 = β 2 = 0.3, β 3 = 1.5. Similarly to the population sizes, the exact distribution of the arrival time is analytically intractable outside of the simplest settings. For example, the exact probability that type 3 cells arrive by time t is given in Ref. [ 18 ] and requires the evaluation of 4 hypergeometric functions. However, when the mutation rates are small simplicity again emerges; the time until the appearance of the first type n + 1 cell, τ n+1 , has approximately a logistic distribution (4) with scale given by and median given by (5) where ω n is the scale parameter defined in ( 2 ). Comparisons of the limiting logistic distribution with simulations are shown in Fig 4 , with further simulations provided in the supplementary figure S1 Fig . The population initiated by the first cell of type n + 1 could go extinct, and so we might wish to instead consider the waiting time until the first type n + 1 cell whose lineage survives. All lineages of type n + 1 will eventually go extinct unless λ n+1 > 0. If λ n+1 > 0 then the results given above hold also for the arrival time of the first surviving lineage if we replace ν n by ν n λ n+1 /α n+1 . In general, the equation for asymptotic growth ( 1 ) together with the formulas for ω n in ( 2 ) enables us to easily answer questions about the population of different cell types. One might ask, for example, whether the number of cells of type n is greater than a given size k and how the growth rates and mutation rates in the system influence this; this problem can be approached using Numerically evaluating the resulting distribution function is standard in scientific software (e.g. using the Mittag-Leffler package in R [ 17 ]). The variable V n /ω n is a single parameter Mittag-Leffler random variable with scale parameter one, and tail parameter γ = λ 1 /δ n . For γ = 1 its density is simply e −x , and hence V n /ω n has mean 1, while for γ < 1 the density has a x γ−1 singularity at the origin and a x −γ−1 tail, thus V n /ω n has infinite mean. A further property is that, when the running-max fitness does not increase between n and n + 1, the random variables V n and V n+1 are equal up to a constant factor (perfectly correlated), i.e. with probability 1 (3) However, in the case δ n < λ n+1 , such simple rules do not apply. Eq (1) , states that for large times and small mutation rates, the scaled number of type n cells, , is approximately Mittag-Leffler distributed with scale ω n and tail λ 1 /δ n . Here, we compare simulations of the scaled number of type n divided by ω n , to the density of V n /ω n which is Mittag-Leffler with scale parameter 1, and tail parameter λ 1 /δ n ∈ (0, 1]. We chose three tail parameter values λ 1 /δ n = 0.25, 0.5, 1.0, and these curves are depicted with solid lines. The simulation parameter were always α 1 = 1.2, β 1 = 0.2, ν 1 = 0.01, β 2 = 0.3 and for n = 2 types sim 1: α 2 = 4.3, t = 5; sim 2: α 2 = 2.3, t = 7; sim 3: α 2 = 1.0, t = 12. Then for n = 3 types sim 4: as in sim 3 plus α 3 = 2.4, β 3 = 0.4, ν 3 = 0.001, t = 12. Density lines were created in Mathematica using x γ−1 MittagLefflerE[γ, γ, −x γ ]. Understanding the distribution of the number of cells of type n at a fixed time t (e.g. the probability that 5 cells exist of type 2 at time 2) can be complex [ 16 ], however a surprising level of simplicity emerges at large times with small mutation rates. The number of cells of type n can be decomposed into the product of a time-independent random variable and a simple time-dependent deterministic function controlled by the running-max fitness δ n , and the number of times it has been attained r n up to type n: (1) The random variable V n has a Mittag-Leffler distribution with tail parameter λ 1 /δ n , and scale parameter ω n . Its density has a particularly simple Laplace transform The parameter ω n may be computed by the following recurrence relations: setting ω 1 = α 1 /λ 1 , then for n ≥ 1, (2) where Notably, when type 1 has the maximal growth rate of all types up to type n, that is δ n = λ 1 , the Mittag-Leffler distribution collapses to an exponential distribution with mean ω n . Stochastic simulations of the scaled number of type n cells for large times, , which according to Eq (1) is Mittag-Leffler distributed, are compared with theory in Fig 3 . Our results are broken into three sections. We first give an overview of our main mathematical results, stratified by whether they relate to the number of type n cells or to their arrival time. We then highlight the main properties of the results as well as providing intuitive arguments for why these properties emerge. Finally, we compare our results to previously known special cases. As the biological processes studied become increasingly complex, so too will the mathematical models constructed to describe such processes. We hope that the results of the present paper will enable researchers to find simplicity in an arbitrarily complex parameter landscape for a fundamental class of mathematical models. We have focused on the number, and arrival time, of cells with n mutations. While this problem dates back at least to the work of Luria and Delbrück—where a mutation resulted in phage resistant bacteria—specific instances of the problem are commonly used to study a variety of biological phenomena [ 3 – 5 , 8 , 9 , 14 , 24 , 31 – 34 ]. The time of first mutation is well known, however the arrival time of cells with n alterations is unclear outside of specific fitness landscapes [ 10 , 11 ]. Here, we developed approximations for the cell number and arrival time regardless of whether mutations increase, decrease, or have no effect on the growth rate of the cells carrying the alterations. We showed that, within relevant limiting regimes, the number of type n cells can be decoupled into the product of a deterministic time-dependent function and a time-independent Mittag-Leffler random variable; meanwhile the arrival time of type n cells follows a logistic distribution with a shape that depends only on the net growth of the type 1 cells. The features of these distributions, such as median arrival time, can be exactly mapped to the underlying model parameters, that is the division, death, and mutation rates. These results illuminate the effects of mutation and selection, and can be readily numerically evaluated to explore particular biological hypotheses. We highlighted the utility of our results on mutation rate inference in fluctuation assays. Due to their simplicity and ability to model fundamental biology such as cell division, death, and mutation, multitype branching processes have become a standard tool for quantitative researchers investigating evolutionary dynamics in exponentially growing populations. Further, these models are able to link detailed microscopic molecular processes to explain macroscopic experimental, clinical, and epidemiological data [ 29 , 30 ]. Despite the importance of this framework, even simple questions are often challenging to examine. Whilst numerical and simulation based methods have proven powerful for both model exploration and statistical inference, the computational expense of simulating to plausible scales can lead to challenges; e.g. simulating to tumour sizes orders of magnitude smaller than reality, which provides obstacles for biological interpretation of inferred parameters. Moreover, it is often unclear how to precisely summarise the manner in which a large number of parameters interact to influence quantities of interest, such as the time until a triply resistant cell emerges. In this study, we analysed the regimes of large times, and small mutation rates, in order to develop limiting formulas that can be used to quickly gain intuition or for approximate statistical inference Methods In this section we provide detailed results and proofs in their general form. Branching process: Population growth We first look to understand the number of cells of type n at time t, that is Z n (t), at large times. Proposition 1. Assume non-extinction of the type 1 population, that is that Z 1 (t) > 0 for all t ≥ 0. Then, for each , there exists a (0, ∞)-valued random variable V n such that almost surely. As our branching process is reducible this result is not considered classical [35]. Heuristically, the result says that for large t, and so at large times all the stochasticity of Z n (t) is bundled into the variable V n . Towards proving Proposition 1, we first consider a model of a deterministically growing population which seeds mutants as a Poisson process, the mutants growing as a branching process. The next result defines the model and describes the large-time number of mutants, generalising a result of [36]. Lemma 1. Let (f(t)) t≥0 be a non-negative cadlag function, x, δ > 0, and r ≥ 0, with Suppose that come from a Poisson process on [0, ∞) with intensity f(⋅). Suppose that (Y i (t)) t≥0 , , are i.i.d. birth-death branching processes initiating from a single cell, that is Y i (0) = 1, with birth and death rates α and β. Let λ = α − β. Define Then almost surely. Here V is some positive random variable with mean . Proof. We first give the argument assuming λ ≠ 0, and provide a comment at the end of the proof indicating modifications needed for the λ = 0 case. First we claim that is a martingale with respect to the natural filtration. Indeed, for s ≤ t, as required. Next we look to bound the second moment of M(t). To this end, observe that is a compound Poisson distribution which is a Poisson sum of i.i.d. random variables distributed as Y 1 (t − ξ), where ξ is a [0, t]-valued random variable with density proportional to f (see, e.g., Section 2 of [36]). Using the already-known second moment for a birth-death branching process [37] (see Theorem 6.1 on page 103), we have that It follows that and since , we find that Therefore (8) where C, D and E are positive constants. To conclude the proof, we will separately consider the three cases listed in the Lemma’s statement: δ < λ, δ = λ, and δ > λ. We begin with the case δ < λ. Here the martingale M(t) has a bounded second moment. By the martingale convergence theorem, M(t) converges to some random variable V′ with mean zero. Rearranging the limit of M(t), almost surely, where the integral converges because the integrand has an exponentially decaying tail. The positivity of V can be seen by Fatou’s lemma: where the are i.i.d. random variables on [0,∞) that are each non-zero with positive probability [22, 35] (recall this case assumes that λ > δ > 0 so that each Y i (⋅) is supercritical). Hence, with probability one at least one of the W i is positive. This gives the result for δ < λ. The second case is δ = λ. Here the second moment of M(t) is still bounded and so we can again apply the martingale convergence theorem to see that M(t) converges almost surely. It follows that converges to zero almost surely. Thus, using dominated convergence, is the almost sure limit of t−r−1e−δtZ(t). The third and final case is δ > λ. This case requires a new perspective because the second moment of M(t) may not be bounded, disallowing the martingale convergence theorem. Instead we appeal to Borel-Cantelli. For ϵ > 0 and , consider the events Then by Doob’s martingale inequality and then Eq (8); here G and γ are positive numbers which do not depend on n. By Borel-Cantelli, the probability that only finitely many of occur is one. Equivalently, converges to zero almost surely. Thus, using dominated convergence, is the almost sure limit of t−re−δtZ(t). For the case of λ = 0, minor modifications are required. Firstly, the second-moment has the form and hence with C′ a positive constant. When λ = 0, then δ > λ. Thus, the above bound should be used in the Borel-Cantelli centred argument, which leads to the same result. We can now give the proof of Proposition 1 on the convergence of cell numbers. Proof of Proposition 1. We prove the result by induction. Clearly it is true for n = 1. Now suppose that almost surely. Condition on the trajectory of Z n (⋅), and apply Lemma 1 to see that almost surely. Having proven that the cell numbers grow asymptotically as a deterministic function of time multiplied by a time-independent random amplitude V n , our next aim is to determine the distribution of this random amplitude. We shall proceed via induction. To establish the base case we restate a classic result [22, 35]: Lemma 2. The random variable V 1 from Proposition 1 has exponential distribution with parameter λ 1 /α 1 = 1 − β 1 /α 1 . Since the type n population seeds the type n + 1 population, one might expect that the random amplitudes V n and V n+1 of the two populations are related. The next result says that this is indeed the case for a part of parameter space—when the type n + 1 fitness is no greater than the fitnesses of previous types. Corollary 1. Let n ≥ 1. If δ n > λ n+1 while for δ n = λ n+1 Proof. Immediate from Lemma 1. Corollary 1 focuses on the case that the fitness of type n + 1 does not dominate the fitnesses of types 1 to n; here it says that the random amplitude V n+1 is simply a constant multiple of V n , meaning that the large-time stochasticity of the type n + 1 population size is perfectly inherited from the type n population. A special example is that type 1 has a larger fitness than all subsequent types, in which case V n is a constant multiple of V 1 and thus all random amplitudes are exponentially distributed, recovering a result of [11]. Corollary 1 is also a generalisation of Theorem 3.2 parts 1 and 2 of [19] which provided the distribution of V 2 in terms of V 1 . The remaining region of parameter space—where a new type may have a fitness greater than the fitness of all previous types is our next focus. Here, contrasting with the region considered in Corollary 1, the random amplitudes seem to be rather complex. The distribution of V 2 takes an intricate form, which is calculated in [16] (Eq. 56) and we do not restate it here for brevity. The distribution of V n for n > 2 apparently are unknown. We aim to find simple approximations for the V n by introducing an approximate model. Approximate model introduction The exact distribution of the random amplitude V n for a generic sequence of birth and death rates appears to be analytically intractable. Thus we look to approximate V n in the limit of small mutation rates. Towards such an approximation, we choose to follow a method inspired by Durrett and Moseley [10] which simplifies calculations by introducing an approximate model. The approximate model is motivated by the following heuristic argument: mutations to create cells of type (n + 1) occur at rate ν n Z n (t); when the mutation rates are small it will take some time for the first cell of type (n + 1) to appear; at large times (Proposition 1); therefore for small mutation rates, mutations to create cells of type (n + 1) should occur at rate . We carefully define the approximate model momentarily, but briefly it arises by assuming the type (n + 1) arrive at rate and then letting the type (n + 1) cells follow the dynamics we’ve already been assuming. Formally, we define the approximate model iteratively. We let be the size of the type n population at time t, set for t ≥ 0, and fix . Then, given , let be the times from a Poisson process with rate Then, we set (9) where the Y n,i (⋅) are independent birth-death processes initiated from a single cell with birth and death rates α n and β n , and (10) We hypothesise but do not prove that the distribution of the random amplitudes and V n for the approximate and original models respectively coincide in the limit of small mutation rates; this is known to be true in the two-type setting (Section 4.4 of [16]). [END] --- [1] Url: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011289 Published and (C) by PLOS One Content appears here under this condition or license: Creative Commons - Attribution BY 4.0. via Magical.Fish Gopher News Feeds: gopher://magical.fish/1/feeds/news/plosone/