Artificial neural networks for model identification and parameter estimation in computational cognitive models

Milena Rmus, Ti-Fen Pan, Liyu Xia, Anne G. E. Collins (Department of Psychology and Department of Mathematics, University of California, Berkeley, California, United States of America)

Date: 2024-05

Abstract

Computational cognitive models have been used extensively to formalize cognitive processes. Model parameters offer a simple way to quantify individual differences in how humans process information. Similarly, model comparison allows researchers to identify which theories, embedded in different models, provide the best accounts of the data. Cognitive modeling uses statistical tools to quantitatively relate models to data; these tools often rely on computing or estimating the likelihood of the data under the model. However, this likelihood is computationally intractable for a substantial number of models. Such models may embody reasonable theories of cognition, but they are often under-explored due to the limited range of tools available to relate them to data. We contribute to filling this gap in a simple way, using artificial neural networks (ANNs) to map data directly onto model identity and parameters, bypassing likelihood estimation. We test our instantiation of an ANN as a cognitive model fitting tool on classes of cognitive models with strong inter-trial dependencies (such as reinforcement learning models), which pose unique challenges to most methods. We show that we can adequately perform both parameter estimation and model identification using our ANN approach, including for models that cannot be fit using traditional likelihood-based methods. We further discuss our work in the context of ongoing research leveraging simulation-based approaches to parameter estimation and model identification, and how these approaches broaden the class of cognitive models researchers can quantitatively investigate.

Author summary

Computational cognitive models occupy an important position in cognitive science research, as they offer a simple way of quantifying cognitive processes (such as how fast someone learns, or how noisy they are in choice selection) and of testing which cognitive theories offer a better explanation of behavior. To relate cognitive models to behavioral data, researchers rely on statistical tools that require estimating the likelihood of the observed data under the assumptions of the cognitive model. This is, however, not possible for all models, as some present significant challenges to likelihood computation. In this work, we use artificial neural networks (ANNs) to bypass likelihood computation and approximation altogether, and demonstrate the success of this approach applied to model parameter estimation and model comparison. The proposed method contributes to the ongoing development of modeling tools that will enable cognitive researchers to test a broader range of theories of cognition. Our results show that our method is highly successful and robust at parameter and model identification while remaining technically lightweight and accessible. We highlight that our method can be applied to standard cognitive data sets (i.e. with an arbitrarily small number of participants and a typical number of trials per participant), as the ANN training is done entirely on a large simulated data set.
Our work contributes to ongoing research on leveraging artificial neural networks to advance the field of computational modeling, and provides multiple new avenues for maximizing the utility of computational cognitive models.

To validate and benchmark our approach, we first compared it against the standard model parameter fitting methods most commonly used by cognitive researchers (MLE, MAP, rejection ABC) on cognitive models from different families (reinforcement learning, Bayesian inference) with tractable likelihood. Next, we demonstrated that neural networks can be used for parameter estimation in models with intractable likelihood, and compared them to a standard approximation method (ABC). Finally, we showed that our approach can also be used for model identification.

Fig 1. A) Traditional methods rely on computing the log-likelihood (LLH) of the data under a given model, and optimizing the likelihood to derive model parameter estimates. B) The ANN is instead trained on a large simulated data set to learn the mapping between data sequences and the parameter values that generated them; the trained network can then be used to estimate cognitive model parameters from new data without the need to compute or approximate the likelihood. C) The ANN structure, inspired by [52], is suitable for data with strong inter-trial dependencies: it consists of an RNN and a fully connected feed-forward network, with an output containing the ANN's estimates of the parameter values each agent's data was simulated from. D) As in parameter estimation, traditional tools for model identification rely on the likelihood to derive model comparison metrics (e.g. AIC, BIC) that are used to determine which model likely generated the data. E) The ANN is instead trained to learn the mapping between data sequences and the cognitive models the data were simulated from. F) The structure of the ANN follows the structure introduced for parameter estimation, with the key difference that the final layer contains a probability distribution over classes representing model candidates, with the highest-probability class corresponding to the model the network identified as the one that likely generated the agent's data.

Our approach relies on the property of ANNs as universal function approximators. The ANN structure we implemented is a recurrent neural network (RNN) with feed-forward layers, inspired by [52] (Fig 1), trained to estimate model parameters or to identify which model most likely generated the data from input data sequences simulated by the cognitive model. Our approach is similar to previous work in the domain of simulation-based inference [40, 41], with the difference that those architectures are specifically designed to optimize explicit summary statistics that describe the data patterns (e.g. invertible networks). Rather than emphasizing the reduction of data dimensionality through the creation (and selection) of summary statistic vectors and subsequent inference based on parameter value samples, our focus is on the direct translation of raw data sequences into precise parameter estimates or the identification of the source model (via implicit summary statistics in network layers). In other words, we test a related, general approach that leverages advances in artificial neural networks to estimate parameters and perform model identification for models with and without tractable likelihood, entirely bypassing the likelihood estimation (or approximation) step.
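For concreteness, the following is a minimal sketch of the kind of architecture shown in Fig 1C and 1F: a GRU-based RNN that consumes one trial per time step and feeds its final hidden state through fully connected layers. The framework (PyTorch), layer sizes, and variable names are illustrative assumptions, not the exact implementation used in the paper.

```python
import torch
import torch.nn as nn

class ParameterEstimationRNN(nn.Module):
    """Map a trial sequence (one feature vector per trial, e.g. encoding
    stimulus, action, and outcome) to cognitive-model parameter estimates
    (regression) or to a distribution over candidate models (classification)."""

    def __init__(self, n_features, n_targets, hidden_size=128,
                 dropout=0.1, classify=False):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden_size, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.head = nn.Sequential(
            nn.Linear(hidden_size, 64),
            nn.ReLU(),
            nn.Linear(64, n_targets),
        )
        self.classify = classify

    def forward(self, x):
        # x has shape (batch, n_trials, n_features); the final hidden state
        # acts as an implicit summary statistic of the whole sequence.
        _, h_last = self.rnn(x)
        summary = self.dropout(h_last.squeeze(0))
        out = self.head(summary)
        # For model identification, the same architecture ends in a softmax
        # over model candidates instead of a regression output.
        return torch.log_softmax(out, dim=-1) if self.classify else out
```

In the regression setting the output layer has one unit per cognitive-model parameter and is trained with a mean squared error loss; in the classification setting it has one unit per candidate model and is trained with a cross-entropy (negative log-likelihood) loss.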
ANNs have been successfully used to fit intractable models in different fields, including weather models [51] and econometric models [6], and more recently cognitive models of decision making [40, 41]. We develop a similar approach to specifically target the likelihood intractability problem in computational cognitive science, including both parameter estimation and model identification, and thoroughly test it on a challenging class of models with strong dependencies between trials (e.g. learning experiments).

While ABC rejection algorithms provide a useful workaround, it is important to acknowledge their inherent limitations. Specifically, ABC results are sensitive to the choice of summary statistics (and rejection criteria), and the sample efficiency of ABC scales poorly with high-dimensional data [33, 38, 39]. Recent strides in simulation-based inference/likelihood-free inference have addressed these limitations by using artificial neural network (ANN) structures designed to optimize summary statistics and consequently infer parameters. These methods enable automated (or semi-automated) construction of summary statistics, minimizing the effect the choice of summary statistics may have on the accuracy of parameter estimation [38, 40-44]. This innovative approach serves to amortize the computational cost of simulation-based inference, opening new frontiers in terms of scalability and performance [40, 41, 45-50].

Some existing techniques attempt to bridge this gap. For example, Inverse Binomial Sampling [30], particle filtering [31], and assumed density estimation [32] provide approximate solutions to the Bayesian inference process in specific cases. Many of these methods, however, require advanced mathematical expertise for effective use and for adaptation beyond the specific cases they were developed for, making them less accessible to many researchers. Approximate Bayesian Computation (ABC, [33-37]) offers a more accessible avenue for estimating parameters of models with intractable likelihoods. Basic ABC rejection algorithms, which are more widely employed in cognitive modeling, involve translating trial-level data into summary statistics. Parameter values of the candidate model are then selected based on their ability to produce simulated data that closely aligns with the summarized observed data, guided by a predefined rejection criterion.

Researchers' ability to benefit from computational modeling crucially depends on the availability of methods for model fitting and comparison. Such tools are available for a large group of cognitive models (for example, reinforcement learning and drift diffusion models). Commonly used parameter fitting tools include maximum likelihood estimation (MLE, [18]), maximum a posteriori estimation (MAP, [19]), and sampling approaches [20, 21]. Commonly used model comparison tools include information criteria such as AIC and BIC [5, 22], and Bayesian group-level approaches, including protected exceedance probability [23, 24]. These methods all have one important thing in common: they require computing the likelihood of the data conditioned on models and parameters, which limits their use to models with tractable likelihood. However, many models do not have a tractable likelihood.
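To make explicit what these likelihood-based tools compute, here is a minimal sketch of maximum likelihood estimation for a simple two-parameter RL model (delta-rule Q-learning with a softmax choice rule) on a bandit-style task; the model equations and optimizer settings are illustrative rather than the paper's exact specification.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def negative_log_likelihood(params, choices, rewards, n_actions=2):
    """NLL of a delta-rule Q-learning model with softmax choice.
    params = (alpha, beta): learning rate and inverse temperature."""
    alpha, beta = params
    q = np.zeros(n_actions)
    nll = 0.0
    for c, r in zip(choices, rewards):
        # log-probability of the observed choice under the softmax policy
        nll -= beta * q[c] - logsumexp(beta * q)
        # delta-rule update of the chosen action's value
        q[c] += alpha * (r - q[c])
    return nll

def fit_mle(choices, rewards, n_restarts=10, seed=0):
    """Minimize the NLL from several random starting points and keep the best."""
    rng = np.random.default_rng(seed)
    results = [
        minimize(negative_log_likelihood,
                 x0=rng.uniform([0.0, 0.1], [1.0, 10.0]),
                 args=(choices, rewards),
                 bounds=[(0.0, 1.0), (1e-3, 20.0)])
        for _ in range(n_restarts)
    ]
    best = min(results, key=lambda res: res.fun)
    return best.x  # estimated (alpha, beta)
```

MAP estimation adds a log-prior over the parameters to this objective, and AIC/BIC are computed from the minimized negative log-likelihood; all of these steps presuppose that the per-trial choice likelihood can be written down, which is exactly what fails for the models considered next.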
The lack of a tractable likelihood severely limits the types of inferences researchers can make about cognitive processes, as many models with intractable likelihood might offer better theoretical accounts of the observed data. Examples of such models include cases where the observed data (e.g. choices) depend on latent variables, such as the unobserved rules that govern the choices [25-27], or a latent state of engagement (e.g. attentive/distracted, [28, 29]) a participant/agent might be in during the task. In these cases, computing the likelihood of the data demands integrating over the latent variables (rules/states) across all trials, a computation that grows exponentially with the number of trials and is thus computationally intractable. This highlights an important challenge: computing likelihoods is essential for estimating model parameters and for model comparison/identification, so models whose likelihoods cannot be computed are less likely to be considered or fully taken advantage of.

Computational modeling is an important tool for studying behavior, cognition, and neural processes. Computational cognitive models translate scientific theories into algorithms, using simple equations with a small number of interpretable parameters to make predictions about the cognitive or neural processes that underlie observable behavioral or neural measures. These models have been widely used to test different theories about the cognitive processes that shape behavior and relate to neural mechanisms [1-4]. By specifying model equations, researchers can inject different theoretical assumptions into models and simulate synthetic data to make predictions and compare them against observed behavior. Researchers can quantitatively arbitrate between different theories by comparing goodness of fit across different models [5, 6]. Furthermore, by estimating model parameters that quantify underlying cognitive processes, researchers have been able to characterize important individual differences (e.g. developmental: [7-10]; clinical: [11-15]) as well as condition effects [16, 17].

To make the model misspecification more extreme, we additionally simulated data from a Bayesian inference model and estimated RL model parameters from the simulated data. We did this using a standard method (MAP) and the ANN, and repeated the same process in reverse (simulating data from an RL model and fitting Bayesian inference model parameters). We found that both MAP and the ANN exhibited similar patterns. That is, when simulating from the Bayesian inference model and fitting RL model parameters, the estimated β captured the variance from the true β and p_switch, while the estimated α captured the variance driven by the Bayesian updating parameters p_reward and p_switch (S14 Fig). When simulating from the RL model and fitting Bayesian inference model parameters, the p_switch parameter captured the noise in the simulated data coming from the β parameter, and the variance from the α parameter was attributed to the p_reward parameter (S15 Fig).

We also correlated the parameter estimates generated by the two methods. A high correlation implies that MAP and the GRU generate similar parameter estimates, suggesting that they are impacted by model misspecification in a similar way (S11 Fig). In addition to testing the effects of incorrect priors, we also tested the effect of model misspecification on the performance of the standard method and the ANN (focusing on MAP and the GRU network, as these performed best in the parameter recovery tests on benchmark models).
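To make the intractability described above concrete, consider a model in which a discrete latent state z_t (e.g. the currently followed rule, or an attentive/inattentive state) takes one of K values and conditions the trial-by-trial updates. In that case the likelihood of a choice sequence a_{1:T} requires marginalizing over every possible latent-state sequence (the notation below is ours, for illustration):

\[
p(a_{1:T} \mid \theta) \;=\; \sum_{z_{1:T} \,\in\, \{1,\dots,K\}^{T}} \; \prod_{t=1}^{T} p\big(a_t \mid z_t,\, h_t(z_{1:t-1}),\, \theta\big)\, p\big(z_t \mid z_{t-1},\, \theta\big),
\]

where h_t(z_{1:t-1}) denotes learned quantities (e.g. Q-values) whose updates depend on the latent-state history. Because h_t carries this history dependence, the sum cannot in general be collapsed by standard forward recursions and instead runs over K^T latent sequences, which becomes computationally intractable as the number of trials T grows.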
Returning to the misspecification tests: we fit the Bayesian inference model (without stickiness) to data simulated from the Bayesian inference model with stickiness using MAP. For the ANN, we trained the network to estimate parameters of the Bayesian inference model, and tested it on a separate test set simulated from the Bayesian inference model with stickiness. For each method, we examined the correlation between the ground-truth parameters of the Bayesian inference with stickiness model and the method's parameter estimates (S13 Fig). Our results suggest that the parameters shared between the two models are reasonably recoverable using both MAP and the ANN (the recovery is noisier than, but comparable to, that of the parameters in Bayesian models without model misspecification; S4 and S5 Figs); furthermore, the correlation between ground-truth and estimated values is similar for the two methods.

We also tested the effects of incorrect prior assumptions about the parameter range on method performance. Specifically, we 1) trained the network using data simulated from a narrow, theoretically informed range of parameters and 2) trained the network on a broader range of parameter values. Next, we tested both networks on out-of-sample predictions for test data sets simulated from narrow and broad parameter ranges, respectively. The network trained on a narrow parameter range made large errors when estimating parameters for data simulated outside of the range it was trained on; training the network on a broader range resulted in smaller error overall, with some loss of precision for the parameter values in the range of most interest (i.e. the narrow range the alternative network was trained on). We observed similar results with MAP when we specified narrow or broad priors (where a narrow prior places high density on a specific parameter range). Notably, training the network on a broader range of parameters while oversampling from a range of interest yielded more accurate parameter estimation than MAP with broad priors (approach described in S9 Fig).

Fig 6. A) Parameter estimation in both RL-LAS and HRL shows that training with a mixture of trial sequence lengths (purple line) yields more robust out-of-sample parameter value prediction compared to fixed trial sequence lengths. B) The best model identification results, performed on different combinations of model candidates, were likewise obtained with mixed trial sequence length training. The number of agents/simulations used for training was kept constant across all tests (N = 30,000 agents).

Data-augmentation practices in machine learning increase the robustness of models during training [63] by introducing different types of variability in the training data set (e.g. adding noise, varying data sizes). In particular, slicing time-series data into sub-series is a data-augmentation practice that increases accuracy [64]. Thus, we trained our ANN with a fixed number of simulations but varying numbers of trials per simulation. As predicted, we found that ANNs trained with a mixture of trial sequence lengths across simulations (purple line) consistently yielded better performance across different numbers of test trials, for both parameter recovery and model identification (Fig 6A and 6B). To evaluate the quality of parameter recovery, we used the coefficient of determination (R²), which normalizes across different parameter ranges. We found that ANNs trained with a higher trial number reach high R² scores on long test trial sequences.
However, their performance suffers significantly with smaller numbers of test trials. The results show a similar trend for model identification, except that training with a higher trial number does not guarantee better performance. For instance, for classification between the HRL task models, the ANN trained with 300 trials reaches 87% accuracy while the ANN trained with 500 trials reaches 84%.

ANNs are sometimes known to fail catastrophically when data differ from the training distribution in minor ways [59-62]. Thus, we investigated the robustness of our method to differences in data format we might expect in empirical data, such as different numbers of trials across participants. Specifically, we conducted robustness experiments by varying the number of trials in each individual simulation contributing to the training or test sets, while fixing the number of agents in the training set.

Because we cannot compute the likelihood for our likelihood-intractable models in closed form, and thus cannot apply MAP, we only report the confusion matrices obtained from our ANN approach. In the first confusion matrix we performed model identification for 2P-RL and RL-LAS, as we reasoned that these two models differ by only one mechanism (the occasional inattentive state) and thus could pose the biggest challenge to model identification. In the second confusion matrix, we included all models used to simulate data on the HRL task (the HRL model, the Bayesian inference model, and the Bayesian inference model with stickiness). In both cases, the network successfully identified the correct models as the true models, with a very small degree of misidentification, mostly among the nested models.

Based on our benchmark comparison to AIC, and the proof-of-concept identification for likelihood-intractable models, our results indicate that the ANN can be leveraged as a valuable tool for model identification. As shown in Fig 5A, model identification performed using our ANN approach was better than AIC-based identification, with less "confusion" (lower off-diagonal proportions relative to the diagonal proportions corresponding to correct identification). Model identification using AIC is likely in part less successful because some models are nested in others (e.g. 2P-RL in 4P-RL, BI in S-BI). Specifically, since the AIC score combines the likelihood with a penalty incurred by the number of parameters, data from a more complex model can be incorrectly identified as better fit by a simpler (nested) version of that model with fewer parameters, an issue which would be more pronounced if we used a BIC confusion matrix. The same phenomenon is observed with the network, but to a much lesser extent, showing better identification out of sample, even for nested models. Furthermore, the higher degree of ANN misclassification observed for BI/S-BI was driven by S-BI simulations with a stickiness parameter close to 0, which renders BI and S-BI indistinguishable (S7 Fig).

For models with tractable likelihood, we performed the same model identification process using AIC [5], which relies on the likelihood, penalized by the number of parameters, to quantify model fitness, as a benchmark. We note that another common criterion, BIC [6], performed more poorly than AIC in our case. The best-fitting model is identified as the one with the lowest AIC score; successful model recovery means that the true (generating) model has the lowest AIC score among all models fit to the data.
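A brief sketch of this AIC benchmark, with illustrative function names (not the paper's code):

```python
import numpy as np

def aic(min_nll, n_params):
    """Akaike information criterion from the minimized negative log-likelihood."""
    return 2 * n_params + 2 * min_nll

def identify_best_model(agent_data, candidate_models):
    """candidate_models: list of (fit_function, n_params) pairs, where
    fit_function returns the minimized NLL of that model on the agent's data.
    Returns the index of the candidate with the lowest AIC."""
    scores = [aic(fit(agent_data), k) for fit, k in candidate_models]
    return int(np.argmin(scores))

# The confusion matrix then tallies, for data simulated from each true model,
# the proportion of agents assigned to each candidate (row: true model,
# column: identified model).
```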
To construct the confusion matrix, we computed the proportions of best AIC scores for all models, across all agents, for data sets simulated from each cognitive model (Fig 5; see Methods). We also tested the use of our ANN approach for model identification. Specifically, we simulated data from different cognitive models, and trained the network to predict which of the candidate models most likely generated the data. The network architecture was identical to the network used for parameter estimation, except that the last layer became a classification layer (with one output unit per model category) instead of a regression layer (with one output unit per target parameter).

We applied our method with integrated evidential learning to tractable and intractable versions of the RL models (2P-RL and RL-LAS, Fig 4). We found that incorporating this modification did not compromise point-estimate parameter recovery (compared to our baseline method, which focuses only on maximizing the accuracy of the predictions). Additionally, it enabled the estimation of the uncertainty around the point estimate, as demonstrated by [58]. This extension appears to be more computationally expensive (with longer training periods) than our original method, but not to a prohibitive extent.

Thus far, we have outlined a method that provides point estimates of parameters based on input data sequences, as is typically the case for much lightweight cognitive modeling (e.g. maximum likelihood estimation or MAP). However, it is sometimes also valuable to compute the uncertainty associated with these estimates [21]. It is possible to extend our approach to add this capability. While there are various ways to do so (e.g. Bayesian neural networks), the approach we opted for is incorporating evidential learning into our method [58]. Evidential learning differs from Bayesian networks in that it places priors over the likelihood function, rather than over network weights. The network leverages this property to learn both statistical (aleatoric) and systematic (epistemic) uncertainty while estimating a continuous target from the input data sequences. This marks a shift from optimizing a network to minimize errors based only on the average prediction, without considering uncertainty.

Fig 3. A) Parameter recovery loss on the held-out test set for the intractable-likelihood models (RL-LAS, HRL) using ABC and the GRU network. Loss is quantified as the mean squared error (MSE) based on the discrepancy between true and estimated parameters. Bars represent the average MSE for each parameter across all agents, with error bars representing the standard error across agents (S17 Fig shows variability across seeds). B) Parameter recovery for the RL-LAS and HRL models using ABC (green) and the GRU network (yellow). ρ values represent the Spearman correlation between true and estimated parameters. The red line represents the unity line (x = y) and the black line represents a least squares regression line. All correlations were significant at p < .001.

Next, we tested our method on two examples of computational models with intractable likelihood. As a comparison method, we implemented Approximate Bayesian Computation (ABC) alongside our ANN approach to estimate parameters. The two example likelihood-intractable models had in common the presence of a latent state that conditions the sequential updates: RL with a latent attentive state (RL-LAS), and a form of non-temporal hierarchical reinforcement learning (HRL, [27]).
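To give a flavor of the first of these models, below is a schematic simulation of an RL agent with a latent attentive/inattentive state. The lapse and recovery dynamics (a small per-trial lapse probability and a dwell time governed by T), the reward scheme, and the assumption that learning continues during inattentive periods are all illustrative choices; the paper's exact RL-LAS specification is described in its Methods.

```python
import numpy as np

def simulate_rl_las(alpha, beta, T, n_trials=500, n_actions=2,
                    p_lapse=0.05, p_reward_correct=0.8, rng=None):
    """Schematic simulation of an RL agent with a latent (in)attentive state.
    While attentive, the agent learns via a delta rule and chooses via softmax;
    after an occasional lapse it responds at random for a stretch of trials
    whose average length is governed by T (illustrative dynamics only)."""
    rng = rng if rng is not None else np.random.default_rng()
    q = np.zeros(n_actions)
    correct = 0                              # currently rewarded action
    attentive, lapse_left = True, 0
    choices, rewards = [], []
    for _ in range(n_trials):
        if attentive:
            p = np.exp(beta * q) / np.sum(np.exp(beta * q))
            if rng.random() < p_lapse:       # slip into the inattentive state
                attentive = False
                lapse_left = rng.geometric(1.0 / max(T, 1.0))  # mean dwell ~ T
        else:
            p = np.ones(n_actions) / n_actions   # random responding
            lapse_left -= 1
            attentive = lapse_left <= 0
        c = rng.choice(n_actions, p=p)
        p_r = p_reward_correct if c == correct else 1 - p_reward_correct
        r = float(rng.random() < p_r)
        q[c] += alpha * (r - q[c])           # delta-rule value update
        choices.append(c)
        rewards.append(r)
    return np.array(choices), np.array(rewards)
```

Because the attentive/inattentive trajectory is never observed, computing the likelihood of the choices requires marginalizing over it, which is what makes standard likelihood-based fitting impractical for this model.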
Since we cannot fit these models using MAP or MLE, we used only ABC as a benchmark. Because we found the LSTM RNN more challenging to train while achieving similar results to the GRU, we focused on the GRU for the remainder of the comparisons. We found that the average MSE was much lower for the neural network than for ABC for both RL-LAS (Fig 3A, average MSEs: ABC = .62, GRU = .21) and HRL (Fig 3A, average MSEs: ABC = .28, GRU = .19). Spearman correlations were noisier for ABC than for the GRU in both models (Fig 3B; RL-LAS: β: ρ_ABC = .72, ρ_GRU = .91; α: ρ_ABC = .83, ρ_GRU = .95; T: ρ_ABC = .50, ρ_GRU = .81; HRL: β: ρ_ABC = .86, ρ_GRU = .89; α: ρ_ABC = .85, ρ_GRU = .90; all correlations were significant at p < .001). Furthermore, some parameters were less recoverable than others (e.g. the T parameter in the RL-LAS model, which indexed how long participants remained in an inattentive state); this might be in part due to the less straightforward effect of T on behavior (S6 Fig).

Note that in order to obtain reasonable ABC results we had to perform an extensive exploration procedure to select summary statistics. Indeed, the choice of summary statistics is not trivial and represents an important difficulty in applying basic rejection ABC [33, 38], one that we can entirely bypass using our neural network approach. We acknowledge that recent methods relying on ANNs have replaced standard ABC methods by automating (or semi-automating) the construction of summary statistics [38, 40-44, 51]. However, we aimed to explore an alternative approach, independent of explicit optimization of summary statistics, and thus focused on the ABC instantiation that has been most frequently implemented in cognitive science as a benchmark [33-35].

Next, we visualized parameter recovery. We found that for each of the cognitive models the parameter recovery was largely successful (Spearman correlations between true and estimated parameter values: β: ρ_MAP = .90, ρ_GRU = .91; α+: ρ_MAP = .53, ρ_GRU = .52; α-: ρ_MAP = .88, ρ_GRU = .89; κ: ρ_MAP = .78, ρ_GRU = .79; Fig 2B; all correlations were significant at p < .001). For conciseness, we only show recovery of the parameters of the more complex model from the RL family (and only the MAP method, as it performed better than ABC and MLE, and only the GRU, since it performed better than the LSTM), as we would expect a more complex model to highlight the superiority of a fitting method more clearly than simpler models. Recovery plots for the remaining models (and respective fitting methods) can be found in S2-S5 Figs.

Our results suggest that 1) the ANN performed as well as traditional methods in parameter estimation based on the MSE loss, and 2) more complex models may limit the accuracy of parameter estimation with traditional methods, a limitation that neural networks appear to be more robust against. We note that for the 4P-RL model, parameter recovery was noisy for all methods, with some parameters being less recoverable than others (e.g. α+, Fig 2B). This is an expected property of cognitive models applied to realistic-sized experimental data as found in most human experiments (i.e. a few hundred trials per participant).
To check whether the limited recovery can be attributed to parameter identifiability rather than to pitfalls of any specific method, we examined the correlation between the parameter estimates obtained using the standard model fitting method (MAP) and the ANN (GRU) (S10 Fig), with parameters that are not well recovered (e.g. α+ in the 4P-RL model) being of particular interest. A high correlation between the estimates obtained via the two methods implies systematic errors in parameter identification that apply to both methods, suggesting that the weaker correlation between true and fitted values for some parameters is more likely due to limitations of the model applied to this data set than to method-specific issues such as poor optimization performance. We discuss the implications further in the Discussion, highlighting that computational models should be carefully crafted and specified regardless of the tools used for model fitting.

Fig 2. A) Parameter recovery loss on the held-out test set for the tractable-likelihood models (2P-RL, 4P-RL, BI, S-BI) using each of the tested methods. Loss is quantified as the mean squared error (MSE) based on the discrepancy between true and estimated parameters. Bars represent the average loss for each parameter across all agents, with error bars representing the standard error across agents. B) Parameter recovery for the 4P-RL model using MAP and the GRU. ρ values represent the Spearman correlation between true and estimated parameters. The red line represents the unity line (x = y) and the black line represents a least squares regression line. All correlations were significant at p < .001.

First, we examined the performance of the standard model-fitting tools (MLE, MAP, and ABC). The standard tools yielded the pattern of results expected for noisy, realistic-size data sets (with several hundred trials per agent). Specifically, we found that MAP outperformed MLE (Fig 2A, average MSEs: MLE = .67, MAP = .35), since the parameter prior applied in MAP regularizes the fitting process. ABC was also worse than MAP (Fig 2A, average MSE: ABC = .53). While the fitting process is also regularized in ABC, its worse performance on some models can be attributed to the signal loss that arises from approximating the likelihood. Next, we focused on the ANN performance; our results showed that for each of the models, the ANN performed better than or as well as the traditional methods (Fig 2A, average MSEs for the RNN variants: GRU = .32, LSTM = .35). The network's advantage was more evident for parameter estimation in more complex models (e.g. models with a higher number of parameters, such as 4P-RL and S-BI; average MSE across these two models: MLE = .95, MAP = .43, ABC = .71, GRU = .38, LSTM = .44).

We used the same held-out data set to evaluate all methods (the test set the ANN had not yet observed; see simulation details). For each method we extracted the best-fitting parameters, and then quantified the method's performance as the mean squared error (MSE) between estimated and true parameters across all agents, with lower MSE indicating better relative performance. All parameters were scaled for the purpose of loss computation, to ensure comparable contributions to the loss across parameters with different ranges. To quantify the overall loss for a cognitive model, we averaged across the individual parameter MSE scores; to calculate a fitting method's MSE score for a class of cognitive models (e.g. likelihood-tractable models), we averaged that method's MSE scores across those models (see Methods for details on method evaluation).
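A small sketch of this evaluation metric; the scaling convention used here (min-max scaling based on the true parameter ranges) is an assumption, and the paper's exact scaling may differ.

```python
import numpy as np

def scaled_mse_per_parameter(true_params, estimated_params):
    """true_params, estimated_params: arrays of shape (n_agents, n_params).
    Parameters are min-max scaled (per parameter) before computing MSE so that
    parameters with different ranges contribute comparably to the loss."""
    lo = true_params.min(axis=0)
    hi = true_params.max(axis=0)
    scale = np.where(hi > lo, hi - lo, 1.0)
    t = (true_params - lo) / scale
    e = (estimated_params - lo) / scale
    return ((t - e) ** 2).mean(axis=0)      # MSE for each parameter

# A cognitive model's overall loss is the average of its per-parameter MSEs;
# a method's score for a class of models averages these model-level losses.
```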
First, we sought to validate our ANN method and compare its performance to existing methods by testing it on standard likelihood-tractable cognitive models of different levels of complexity applied to the same task: 2-parameter (2P-RL) and 4-parameter (4P-RL) reinforcement learning models commonly used to model behavior on reversal tasks [7, 14, 53, 54], as well as a Bayesian inference model (BI) and a Bayesian inference model with stickiness (S-BI), an alternative model family that has been found to outperform RL in some cases [55-57]. We estimated model parameters using multiple traditional methods based on computing (maximum likelihood and maximum a posteriori estimation; MLE and MAP) or approximating (Approximate Bayesian Computation; ABC) the likelihood. We used the results of these tools as a benchmark for evaluating the neural network approach. Next, we estimated the parameters of these models using two RNN variants: gated recurrent units (GRU) and long short-term memory units (LSTM).

We focused on two distinct artificial neural network (ANN) applications in cognitive modeling: parameter estimation and model identification. Specifically, we built a network with a structure suitable for sequential data with time dependencies (a recurrent neural network, RNN; [52]). Training deep ANNs requires large training data sets. We generated such a data set at minimal cost by simulating a computational cognitive model on a cognitive task a large number of times. The model's behavior in the cognitive task (e.g. a few hundred trials of stimulus-action pairs or stimulus-action-outcome triplets, depending on the task, for each simulated agent) constituted the ANN's training input; the true, known parameter values (or the identity of the model) from which the data were simulated constituted the ANN's training targets. We evaluated the network's training performance in predicting parameter values or model identity on a separate validation set, and tested the trained network on a held-out test set. We tested RNN variants and compared their accuracy against traditional likelihood-based model fitting/identification methods using both likelihood-tractable and likelihood-intractable cognitive models. See the Methods section for details on the ANN training and testing process.

Discussion

Our results demonstrate that artificial neural networks (ANNs) can be successfully and efficiently used to estimate the best-fitting free parameters of likelihood-intractable cognitive models, in a way that is independent of likelihood approximation. ANNs also show remarkable promise in arbitrating between competing cognitive models. While our method leverages "big data" techniques, it does not require large experimental data sets: the large training set used to train the ANNs is obtained purely through efficient and fast model simulation. Thus, our method is applicable to any standard cognitive data set with a typical number of participants and trials per participant. Furthermore, while our method requires some ability to work with ANNs, it does not require any advanced mathematical skills, making it largely accessible to the broad computational cognitive modeling community. Our method adds to a family of approaches using neural networks to fit computational cognitive models.
Specifically, previous work leveraging amortized inference has focused on taking advantage of large-scale simulations and invertible networks. This approach involves training the summary segment of the network to learn relevant summary statistic vectors, while concurrently training the inference segment of the network to approximate the posterior distribution of model parameters based on the outputs of the summary network [40, 41, 46]. This method has been successfully applied to both parameter estimation and model identification (and performs in a similar range as our method when applied to the intractable models implemented in this paper), bypassing many issues of ABC. In parallel, work by [47] showcased Likelihood Approximation Networks (LANs), a method that approximates the likelihood of sequential sampling models (but requires ABC-like approaches for training) and recovers posterior parameter distributions with high accuracy for a specific class of models (e.g. drift diffusion models); more recently, [48] used a similar approach with higher training-data efficiency. Work by [65] used Approximate Bayesian Computation (ABC) in conjunction with mixture density networks to map data to parameter posterior distributions. Unlike most of these approaches, our architecture neither depends on [47, 48] nor is explicitly designed to optimize [40, 41, 46] summary statistics. By necessity, the hidden layers of our network do implicitly compute a form of summary statistics that are translated into estimated parameters or model classes in the output layer; however, we do not optimize for such statistics explicitly, beyond their ability to support parameter/model recovery.

Other approaches have used ANNs for purposes other than fitting cognitive models [66]. For example, [52] leveraged the flexibility of RNNs (which inspired our network design) to map data sequences onto separable latent dimensions that have different effects on agents' decision-making behavior, as an alternative to cognitive models that make more restrictive assumptions. Similarly, work by [67] also used RNNs to estimate RL parameters and make predictions about the behavior of RL agents. Our work goes further than this approach in that it focuses on both parameter recovery and model identification for models with intractable likelihood, without relying on likelihood approximation. Furthermore, multiple recent papers [68, 69] use ANNs as a replacement for cognitive models, rather than as a tool for supporting cognitive modeling as we do, demonstrating the many different ways in which ANNs are taking a place in computational cognitive science.

It is important to note that while ANNs may prove to be a useful tool for cognitive modeling, one should not expect their use to immediately fix or override all issues that may arise in parameter estimation and model identification. For instance, we observed that while ANNs outperformed many of the traditional likelihood-based methods, recovery of some model parameters was still noisy (e.g. the learning rate α in the 4P-RL model, Fig 2). This is a property of cognitive models applied to experimental data sets on the order of hundreds of trials. Standard methods (e.g. MAP) fail in a similar way, as shown by the high correlation between MAP and ANN parameter estimates (S10 Fig), which suggests that parameter recovery issues have more to do with identifiability limitations of the data and model than with other issues such as the optimization method.
Similarly, model parameters are often not meaningful in certain numerical ranges, and model parameters sometimes trade off in how they impact behavior through the mathematical equations that define the models, making parameter recovery more challenging. Furthermore, when it comes to model identification, particularly with nested models, the specific parameter ranges can influence the outcome of model identification, favoring simpler models over more complex ones (or vice versa). This was evident in our observations regarding the confusion between Bayesian inference models with and without stickiness, wherein the ground-truth values of stickiness played a decisive role in the model identification. In short, ANNs are a tool that is only useful if researchers apply significant forethought to developing appropriate, identifiable cognitive models.

In a similar vein, it is important to recognize that the potential negative implications of model misspecification extend to neural networks, much like they impact traditional model-fitting approaches. For instance, parameter estimation may be conducted under the assumption of model X whereas, in reality, model Y might be the most suitable for explaining the data, leading to poor parameter estimation and model predictions. Our test of the systematic effects of model misspecification involved using a network trained to estimate parameters of one model (e.g. Bayesian inference) to predict parameters for held-out test data simulated from a different model (e.g. Bayesian inference with stickiness, or RL). We compared this to model misspecification with a standard MAP approach. Notably, neither method exhibited significant adverse effects. When models were nested, the parameters shared between the two models were reasonably well recovered. When the model misspecification was more extreme (with models from different families), we again observed similar effects on the two methods, where the variance driven by a given generative parameter tended to be captured by the same fitted parameter under both methods. Thus, our approach appears equally (but not more) subject to the risk of model misspecification as other fitting methods. In light of these findings, our key takeaway is to exercise caution against assuming that the use of a neural network remedies all issues typically associated with modeling. Instead, we advocate for the application of conventional diagnostics (e.g. model comparison, predictive checks) that are commonly employed with standard methods to ensure robust and accurate results.

Relatedly, we have shown that parameter estimation accuracy varies greatly as a function of the parameter range the network was trained on, and of whether the underlying parameter distribution of the held-out test set falls within that range. This is an expected property of ANNs, which are known to underperform when the test data systematically differ from the training examples [59-61]. As such, the range of parameters/models used to generate the training inputs constitutes a form of prior that constrains the fit, and it is important to specify it carefully with informed choices (as is done with priors in other methods, such as MAP). We found that training the network on a broader parameter range, while heavily sampling from a range of interest (e.g. plausible parameter values based on previous research), affords accurate prediction for data generated outside of the main expected range with limited loss of precision within the range of interest (S9 Fig).
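A minimal sketch of such a training-set sampling scheme; the ranges and oversampling proportion below are arbitrary placeholders, not the values used in the paper.

```python
import numpy as np

def sample_training_parameters(n_agents, broad=(0.0, 1.0), narrow=(0.1, 0.4),
                               p_narrow=0.7, rng=None):
    """Draw a parameter value (e.g. a learning rate) for each simulated training
    agent from a broad range, while oversampling a theoretically plausible
    narrow range with probability p_narrow."""
    rng = rng if rng is not None else np.random.default_rng()
    from_narrow = rng.random(n_agents) < p_narrow
    low = np.where(from_narrow, narrow[0], broad[0])
    high = np.where(from_narrow, narrow[1], broad[1])
    return rng.uniform(low, high)
```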
This kind of practice is consistent with common practice in computational cognitive modeling, where a researcher might specify (e.g. using a prior) that a parameter can range between two values, with most of the mass falling within a narrower range.

One factor that is specific to ANN-based methods (as opposed to standard methods) is the effect that different hyperparameters (e.g. the size of the neural network, the choice of learning rate, dropout values, etc.) may have on network performance, commonly resulting in overfitting or underfitting. We observed that network performance, particularly in parameter recovery, is most significantly influenced by the number of units in the GRU layer and the chosen dropout rate. A suitable range for the number of GRU units is typically between 90 and 256, covering the needs of most cognitive models; a dropout rate within the range of 0.1 to 0.2 is generally sufficient. We have outlined the hyperparameter ranges we tested in S1 Table. To address this challenge, we employed an automated hyperparameter tuning approach, as outlined by Bergstra, Yamins, and Cox (2013). This Bayesian optimization of hyperparameters helps reduce the time required to obtain an optimal configuration by learning from previous iterations. Additionally, when training a neural network, the randomly initialized weights play a significant role in determining the network's convergence and final performance. Different random seeds result in different initializations of the network weights, which may affect the downstream optimization process and potentially yield different final solutions. It is important to be mindful of this; we inspected the effects of different seeds on our network's performance (S17 Fig) and found that overall performance was stable across seeds, with slight variations (for one seed) in both parameter estimation and model identification, underscoring the value of inspecting the network's performance under multiple seeds.

We compared our artificial neural network approach against existing methods that are commonly used to estimate parameters of likelihood-intractable models (e.g. ABC, [33, 70]). While traditional rejection ABC provides a workaround, it also imposes certain constraints. Specifically, it is better suited to data without sequential dependencies, and the accuracy of parameter recovery is largely contingent on the selection of appropriate summary statistics, which is not always a straightforward problem. More recent advances in simulation-based inference [38, 40, 42, 44] solve many ABC issues by automating the construction of summary statistics. For the purpose of this project we focused on the methods most commonly used in cognitive modeling (e.g. maximum likelihood/maximum a posteriori), but future work should extend the same benchmarking procedure to these inference methods. Alternative approximation methods (e.g. particle filtering [31], assumed density estimation [32], inverse binomial sampling [30]) may prove to be more robust, but they frequently require more advanced mathematical knowledge and model-specific adaptations, or are more computationally expensive; indeed, some of them may not be usable or tractable for our type of data and models, where there are sequential dependencies between trials [30, 71].
ANN-based methods such as ours and others' [40, 41, 49], on the other hand, offer a more straightforward and time-efficient path to both parameter estimation and model identification. Developing more accessible and robust methods is critical for advances in computational modeling and cognitive science, and the rising popularity of deep learning puts neural networks forward as useful tools for this purpose. Our method also offers the advantage of requiring very little computational power. The aim of the project at its current stage was not to optimize our ANN training in terms of time and computing resources; nevertheless, we used Nvidia V100 GPUs with 25 GB of memory and required at most one hour for model training and prediction. This makes the ANN tool practical, as it requires few computing resources and can be run quickly and inexpensively. All of our code is shared on GitHub.

We primarily focused on extensive tests using synthetic data, in particular in the context of learning experiments, which present important challenges for some methods (such as BADS [71] or ABC [33-35]) due to the dependency between trials and have not been thoroughly investigated with other ANN-based approaches. A critical next step will be to further validate our approach using empirical data (e.g. participant data from the tasks). Similarly, we relied on RNNs due to their flexibility and capacity to handle sequential data. However, it will be important to explore different structures, such as transformers [72], for potentially improved accuracy in parameter recovery/model identification, as well as for alternative uses in cognitive modeling. In addition, our baseline approach lacks the capability to quantify the full uncertainty of parameter estimates, offering only point estimates. This is similar to many lightweight cognitive modeling approaches (such as MAP and maximum likelihood), but stands in contrast to other methods that integrate simulation-based inference with neural network structures [40, 41, 45, 47, 48], for which the ability to capture full uncertainty is a notable strength. Nevertheless, we have shown that our method can easily be extended to provide uncertainty estimates by incorporating evidential learning techniques [58], at a slight computational cost but with minimal impact on point-estimate accuracy. Furthermore, we included both RL and Bayesian inference models to demonstrate that our approach can work with different classes of computational models. Future work will include additional models (e.g. sequential decision making models) to further test the robustness of our approach.

In conclusion, we propose an accessible ANN-based method to perform parameter estimation and model identification across a broad class of computational cognitive models for which the application of existing methods is challenging. Our work contributes to a growing literature focused on developing new methods that allow researchers to quantitatively test a broader family of theories than previously possible.

Source: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1012119 (published under a Creative Commons Attribution 4.0 license)