Noise-trained deep neural networks effectively predict human vision and its neural responses to challenging images

Hojin Jang, Devin McCormack, and Frank Tong (Psychology Department and Vanderbilt Vision Research Center, Vanderbilt University, Nashville, Tennessee, United States of America)

Date: 2022-01

Deep neural networks (DNNs) for object classification have been argued to provide the most promising model of the visual system, accompanied by claims that they have attained or even surpassed human-level performance. Here, we evaluated whether DNNs provide a viable model of human vision when tested with challenging noisy images of objects, sometimes presented at the very limits of visibility. We show that popular state-of-the-art DNNs perform in a qualitatively different manner than humans—they are unusually susceptible to spatially uncorrelated white noise and less impaired by spatially correlated noise. We implemented a noise training procedure to determine whether noise-trained DNNs exhibit more robust responses that better match human behavioral and neural performance. We found that noise-trained DNNs provide a better qualitative match to human performance; moreover, they reliably predict human recognition thresholds on an image-by-image basis. Functional neuroimaging revealed that noise-trained DNNs provide a better correspondence to the pattern-specific neural representations found in both early visual areas and high-level object areas. A layer-specific analysis of the DNNs indicated that noise training led to broad-ranging modifications throughout the network, with greater benefits of noise robustness accruing in progressively higher layers.
Our findings demonstrate that noise-trained DNNs provide a viable model to account for human behavioral and neural responses to objects in challenging noisy viewing conditions. Further, they suggest that robustness to noise may be acquired through a process of visual learning.

Competing interests: It should be acknowledged that a patent, describing the noise-training methods used in this study to train deep neural networks, has been granted by the U.S. Patent and Trademark Office. Inventors, Frank Tong and Hojin Jang; Applicant, Vanderbilt University; Patent Number 11,030,487.

Funding: This research was supported by a grant (R01EY029278) from the National Eye Institute ( https://www.nei.nih.gov ) to F.T. and a core grant (P30EY008126) from the National Eye Institute to the Vanderbilt Vision Research Center (Director David Calkins). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Data Availability: The data from this study are available on the Open Science Framework ( https://osf.io/bxr2v/ ), including behavioral data from experiments 1 and 2, performance accuracy of the deep neural networks, multivariate fMRI patterns of activity from visual areas of interest, and noise-trained versions of VGG-19.

Copyright: © 2021 Jang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

We performed a functional neuroimaging experiment to assess the degree of correspondence between DNN models and human neural activity. Multivariate decoding of activity patterns in the human visual cortex revealed better discrimination of objects in pixelated Gaussian noise as compared to Fourier phase-scrambled noise, consistent with the behavioral advantage shown by human observers and also by noise-trained DNNs.
Moreover, noise-trained DNNs provided a better correspondence to the patterns of object-specific responses found in both early visual areas and high-level object areas. We go on to show that DNNs trained to recognize objects in artificial noise can generalize their knowledge to some extent to other image distortions, including real-world conditions of visual noise. Taken together, our findings demonstrate that noise-trained DNNs provide a viable model of the noise robust properties of the human visual system. Next, we sought to investigate whether DNNs can be trained to recognize objects in extreme levels of visual noise, and, in particular, whether such noise-trained DNNs might provide a better match to human behavioral and neural performance. Although it has long been known that neural networks can be regularized by adding a small amount of noise to their input data [ 38 ] and such procedures have proven useful in the training of DNNs [ 39 ], the impact of training DNNs with extreme levels of noise has only recently begun to receive attention [ 26 – 29 ]. Here, we found that noise-trained DNNs could accurately classify objects at lower noise thresholds than human observers, but more importantly, their qualitative pattern of performance to different noise types provided a better match to human performance. Moreover, noise-trained DNNs performed far better than standard DNNs in their ability to predict human recognition thresholds on an image-by-image basis. A layer-specific analysis of DNN activity patterns indicated that noise training led to widespread changes in the robustness of the network, with more pronounced differences between standard and noise-trained networks found in the middle and higher layers. Both human and DNN systems can be stressed by lower SNRs, as the object images approach the limits of perceptual visibility. 
This experimental design allowed us to test for quantitative differences in performance, by identifying the critical noise level at which performance sharply declines. Moreover, it allowed us to test for qualitative differences in visual processing across noise type. We found that popular state-of-the-art DNNs perform in a qualitatively different manner than humans. Specifically, DNNs are unusually susceptible to pixelated Gaussian noise (i.e., white noise) and less susceptible to spatially correlated Fourier phase-scrambled noise (similar to “pink” noise), whereas human observers show the opposite pattern of performance. The goal of our study was to determine whether DNNs can provide a viable model of human behavioral and neural performance under stress test visual conditions. Our recognition task required humans and DNNs to classify objects embedded in either spatially independent noise or spatially correlated noise, across a wide range of signal-to-noise ratios (SNRs; Fig 1 ). Spatially independent Gaussian noise has been used to characterize attentional modulation of visual sensitivity and the perceptual learning of complex stimuli [ 34 , 35 ]. Pixelated noise has also been used to characterize the robustness of visual cortical responses to objects presented in high levels of noise [ 36 ]. We were also interested in assessing the impact of Fourier phase-scrambled noise on object recognition performance, as such noise preserves the 1/F amplitude spectrum of natural images [ 37 ] and contains spatially correlated structure that might be more confusing to object recognition systems. Although DNNs can perform remarkably well on tasks of object recognition, with claims that they have achieved or even surpassed human-level performance [ 24 , 25 ], a conundrum lies in the fact that these networks tend to lack robustness to more challenging viewing conditions. 
In particular, there is some evidence to suggest that DNNs are unusually susceptible to visual noise and clutter and that human recognition performance is more robust to noisy viewing conditions [ 26 – 30 ]. Identifying potential disparities between human and DNN performance is necessary to understand the limitations of current DNN models of human vision [ 31 – 33 ] and a precursor to developing better models. There is growing evidence to indicate that deep neural networks (DNNs) trained on object classification provide the best current model of the human and nonhuman primate visual systems [ 11 , 12 ]. The visual representations learned by these DNNs demonstrate a reliable correspondence with the neural representations found at multiple levels of the human visual pathway [ 13 – 18 ]. Moreover, DNNs trained on large data sets of object images, such as ImageNet [ 19 ], can reliably predict how individual neurons in the monkey inferotemporal cortex will respond to objects, faces, and even synthetic stimuli [ 20 – 23 ]. A central question in cognitive and computational neuroscience concerns how we detect, discriminate, and identify stimuli by sight [ 1 – 3 ]. The task of object recognition is exceedingly complex, yet human observers can typically recognize most any object within just fractions of a second [ 4 , 5 ]. The human visual system processes information in a hierarchically organized manner, progressing from the encoding of basic visual features in early visual areas to the representation of more complex object properties in higher visual areas [ 6 – 10 ]. What are the neural computations performed by the visual system that allow for successful recognition across a diversity of contexts and viewing conditions? 
Results

In experiment 1, we evaluated the performance of 8 pretrained DNNs (AlexNet, VGG-F, VGG-M, VGG-S, VGG-16, VGG-19, GoogLeNet, and ResNet-152 [40–43]) and 20 human observers at recognizing object images presented in either pixelated Gaussian noise or Fourier phase-scrambled noise (Fig 1; see Materials and methods). Object images were presented with varying levels of visual noise by manipulating the signal-to-signal-plus-noise ratio (SSNR), which is bounded between 0 (noise only) and 1 (signal only). This allowed us to quantify changes in performance accuracy as a function of SSNR level. Performance was assessed using a total of 800 images from 16 object categories obtained from the validation data set of the ImageNet database [19]; the categories consisted of different types of animals, vehicles, and indoor objects. These images were novel to the participants and were never used for DNN training. Fig 2a shows the mean performance accuracy of DNNs and humans plotted as a function of SSNR level, with the performance of individual DNNs shown in Fig 2b. Although DNNs could match the performance of human observers under noise-free conditions, consistent with previous reports [24], DNN performance became severely impaired in the presence of moderate levels of noise. Most DNNs exhibited a precipitous drop in recognition accuracy as SSNR declined from 0.6 to 0.4, whereas human performance was much more robust across this range.

Fig 2. Humans outperform DNNs at recognizing objects in noise. (a) Mean performance accuracy in a 16-alternative object classification task plotted as a function of SSNR level for human observers (black curves) and 8 standard pretrained DNNs (red curves) with ± 1 standard deviation in performance indicated by the shaded area around each curve.
Separate curves are plotted for pixelated Gaussian noise (solid lines with closed circles) and Fourier phase-scrambled noise (dashed lines with open circles). (b) Classification accuracy plotted as a function of SSNR level for individual pretrained DNN models. Data are available at https://osf.io/bxr2v/. DNN, deep neural network; SSNR, signal-to-signal-plus-noise ratio. https://doi.org/10.1371/journal.pbio.3001418.g002

Of particular interest, the DNNs appeared to be impaired by noise in a manner that qualitatively differed from human performance. Spatially correlated noise proved more challenging to human observers, whereas the DNNs were more severely impaired by pixelated Gaussian noise (in 7 out of 8 cases). We fitted a logistic function to the performance accuracy data of each participant and each DNN to determine the threshold SSNR level at which performance reached 50% accuracy. This analysis confirmed that human observers exhibited much lower SSNR thresholds than DNNs, outperforming the DNNs by a highly significant margin at recognizing objects in pixelated noise (t(26) = 15.94, p < 10^−14); they also outperformed DNNs at recognizing objects in Fourier phase-scrambled noise (t(26) = 12.29, p < 10^−11). Moreover, humans showed significantly lower SSNR thresholds for objects in pixelated noise as compared to spatially correlated noise (0.255 versus 0.315; t(19) = 13.41, p < 10^−10), whereas DNNs showed higher SSNR thresholds for objects in pixelated noise as compared to spatially correlated noise (0.535 versus 0.446; t(7) = 3.81, p = 0.0066). The fact that spatially independent noise proved more disruptive for DNNs was unexpected, given that a simple spatial filtering mechanism, such as averaging over a local spatial window, should allow a recognition system to reduce the impact of spatially independent noise while preserving relevant information about the object.
Instead, these DNNs are unable to effectively pool information over larger spatial regions in the presence of pixelated Gaussian noise. We performed additional analyses to compare the patterns of errors made by DNNs and human observers, plotting confusion matrices for each of 4 SSNR levels (S1 Fig). Human performance remained quite robust even at SSNR levels as low as 0.2, as the majority of responses remained correct, falling along the main diagonal. Also, error responses were generally well distributed across the various categories, although there was some degree of clustering and greater confusability occurred among animate categories. By contrast, DNNs were severely impaired by pixelated noise when SSNR declined to 0.5 or lower and showed a strong bias toward particular categories such as “hare,” “cat,” and “couch.” For objects in spatially correlated noise, the DNNs exhibited a preponderance of errors at SSNR levels of 0.3 and below, with bias toward “hare,” “owl,” and “cat.”

Development of a noise training protocol to improve DNN robustness

We devised a noise training protocol to determine whether it would be possible to improve the robustness of DNNs to noisy viewing conditions to better match human performance. For these computational investigations, we primarily worked with the VGG-19 network, as this pretrained network performed quite favorably in comparison to much deeper networks (e.g., GoogLeNet and ResNet-152) and could be trained in an efficient manner to evaluate a variety of manipulations. First, we investigated the effect of training VGG-19 on images from the 16 object categories presented at a single SSNR level with either type of noise. After such training, the network was tested on a novel set of object images presented with the corresponding noise type across a full range of SSNR levels. We observed that training the DNN at a progressively lower SSNR level led to a consistent leftward shift of the recognition accuracy by SSNR curve (Fig 3a).
However, this improvement in performance for noisy images was accompanied by a loss of performance accuracy for noise-free images. The latter was evident from the prominent downward shift in the recognition accuracy by SSNR curve. Such loss of accuracy for noise-free images would be unacceptable for any practical applications of this noise training procedure and clearly deviated from human performance. Next, we investigated whether robust performance across a wide range of SSNR levels might be attained by providing intermixed training with both noise-free and noisy images. Fig 3b indicates that such combined training was highly successful, with the strongest improvement observed for noisy images presented at a challenging SSNR level of 0.2. When the training SSNR was reduced to levels as low as 0.1, the task became too difficult and the learning process suffered.

Fig 3. Effects of training DNNs with objects in noise. (a) Impact of training VGG-19 with object images presented at a single SSNR level (1.0, 0.7, 0.5, 0.3, 0.2, or 0.1) when evaluated with novel test images presented at multiple SSNR levels. Accuracy of pretrained VGG-19 (red curve) serves as a reference in each plot. (b) Impact of training VGG-19 with a combination of noise-free images (SSNR 1.0) and noisy images at a specified SSNR level. Data are available at https://osf.io/bxr2v/. DNN, deep neural network; SSNR, signal-to-signal-plus-noise ratio. https://doi.org/10.1371/journal.pbio.3001418.g003

Given the excellent performance of VGG-19 after training with noisy object images, we sought to compare noise-trained DNNs with human performance. Using VGG-19 pretrained on standard images from ImageNet, we implemented noise training using images from the 16 object categories shown with pixelated Gaussian noise (SSNR range 0.2 to 0.99), images with Fourier phase-scrambled noise (SSNR range 0.2 to 0.99), as well as clean images.
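The combined training regime described above can be sketched as follows. This is a minimal illustration with hypothetical helper names, assuming the noisy stimulus is a simple linear mixture, ssnr * image + (1 - ssnr) * noise, and that spatially correlated noise is produced by scrambling the Fourier phases of an image while retaining its amplitude spectrum; the study's exact stimulus-generation code may differ:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(img, ssnr, noise_type="gaussian", rng=rng):
    """Blend a grayscale image with noise at a given signal-to-signal-plus-noise
    ratio, assuming a simple linear mixture: ssnr*img + (1 - ssnr)*noise."""
    if noise_type == "gaussian":
        # spatially independent pixel noise matched to the image statistics
        noise = rng.normal(img.mean(), img.std(), img.shape)
    else:
        # spatially correlated noise: keep the Fourier amplitude spectrum
        # of the image but randomize its phases
        amp = np.abs(np.fft.fft2(img))
        phase = rng.uniform(0, 2 * np.pi, img.shape)
        noise = np.real(np.fft.ifft2(amp * np.exp(1j * phase)))
    return ssnr * img + (1 - ssnr) * noise

def noisy_batch(images, ssnr_range=(0.2, 0.99), p_clean=1 / 3, rng=rng):
    """Intermix clean images with images at random SSNR levels and random
    noise type, mirroring the combined training regime described above."""
    out = []
    for img in images:
        if rng.random() < p_clean:
            out.append(img)  # keep a fraction of images noise-free
        else:
            ssnr = rng.uniform(*ssnr_range)
            kind = rng.choice(["gaussian", "phase"])
            out.append(add_noise(img, ssnr, kind))
    return np.stack(out)

batch = noisy_batch([rng.random((64, 64)) for _ in range(8)])
```

Intermixing clean images in each batch is what prevents the loss of noise-free accuracy that occurred when training at a single low SSNR level.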
Fig 4a shows that this noise-trained version of VGG-19 was able to attain robustness to both types of noise concurrently and performed far better than standard pretrained DNNs at recognizing novel images of noisy objects that were not used for training. Moreover, noise training led to better performance for objects in pixelated Gaussian noise as compared to Fourier phase-scrambled noise, in a manner that better matched the qualitative performance of human observers. The noise-trained network also seemed to outperform human observers on average. To analyze these performance differences in detail, we fitted a logistic function to identify the SSNR thresholds of each DNN and human observer, separately for each noise condition. Thresholds were defined by the SSNR level at which the fitted function reached 50% accuracy. A histogram of SSNR thresholds revealed that noise-trained VGG-19 outperformed all 20 human observers and all 8 original DNNs at recognizing objects in both Gaussian noise and Fourier phase-scrambled noise (Fig 4b). These results indicate that the noise training protocol can greatly enhance the robustness of DNNs, such that they can match or surpass human performance when tasked to recognize objects in extreme levels of visual noise. Our results are consistent with other recent reports [29]. However, such findings are insufficient to determine whether or not noise-trained DNNs have acquired visual representations that can account for the noise robust nature of human vision. Below, we evaluated this noise-trained version of VGG-19 on its ability to predict human behavioral and neural performance.

Fig 4. Noise-trained VGG-19 outperforms human observers and other DNNs.
(a) Mean classification accuracy of noise-trained VGG-19 (blue), human observers (gray), and pretrained DNNs (red) for objects in pixelated Gaussian noise (solid lines, closed circles) and Fourier phase-scrambled noise (dashed lines, open circles). Noise-trained VGG-19 was trained with objects in Gaussian noise, Fourier phase-scrambled noise, and clean images from the 16 categories. (b) Frequency histograms comparing the SSNR thresholds of noise-trained VGG-19 (blue), individual human observers (gray), and 8 standard pretrained DNNs (red). Data are available at https://osf.io/bxr2v/. DNN, deep neural network; SSNR, signal-to-signal-plus-noise ratio. https://doi.org/10.1371/journal.pbio.3001418.g004

Image-level predictions of human behavioral performance

The ability to predict human recognition performance at the level of specific images constitutes one of the most stringent tests for evaluating DNN models; however, current DNN models have yet to adequately account for image-specific human performance [31]. Here, we devised a second behavioral experiment to evaluate whether noise-trained DNNs might be capable of predicting the noise threshold at which people can successfully recognize objects on an image-by-image basis. A total of 20 observers were presented with each of 800 object images (50 per category), which slowly emerged from pixelated Gaussian noise. The SSNR level gradually increased from an initial value of 0 in small steps of 0.025 every 400 ms, until the observer pressed a key to pause the dynamic display in order to make a categorization decision. A reward-based payment scheme provided greater reward for correct responses made at lower SSNR levels. After making a categorization response, participants used a mouse pointer to demarcate the portions of the image that they relied on for their recognition judgment.
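Throughout these analyses, SSNR thresholds are derived by fitting a logistic function to accuracy-as-a-function-of-SSNR data and reading off the SSNR at a criterion accuracy (50% in experiment 1; 90% for the image-level analysis). A minimal dependency-free sketch using a coarse grid search is shown here; the authors' actual fitting procedure (e.g., with guessing and lapse parameters for the 16-alternative task) may differ:

```python
import numpy as np

def fit_ssnr_threshold(ssnr, accuracy, criterion=0.5):
    """Fit a 2-parameter logistic, acc = 1/(1 + exp(-k*(x - x0))), by grid
    search over midpoint x0 and slope k, and return the SSNR at which the
    fitted curve crosses `criterion` (ignores the guessing floor)."""
    ssnr, accuracy = np.asarray(ssnr), np.asarray(accuracy)
    best_err, best_params = np.inf, (0.5, 10.0)
    for x0 in np.linspace(0.0, 1.0, 101):   # candidate midpoints
        for k in np.linspace(1.0, 40.0, 80):  # candidate slopes
            pred = 1.0 / (1.0 + np.exp(-k * (ssnr - x0)))
            err = np.mean((pred - accuracy) ** 2)
            if err < best_err:
                best_err, best_params = err, (x0, k)
    x0, k = best_params
    # invert the fitted logistic at the criterion accuracy
    return x0 + np.log(criterion / (1.0 - criterion)) / k

# synthetic accuracy curve whose true midpoint lies near SSNR = 0.3
x = np.linspace(0.05, 0.95, 10)
acc = 1.0 / (1.0 + np.exp(-20 * (x - 0.3)))
th = fit_ssnr_threshold(x, acc)
```

At a criterion of 50%, the threshold simply equals the fitted midpoint; the same routine with criterion=0.9 yields the stricter image-level threshold.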
The resulting data allowed us to compare the similarity of humans and DNNs in their SSNR thresholds, as well as the portions of each image that were diagnostic for recognition judgments. Mean performance accuracy was high (90.3%), and human SSNR thresholds for each image were calculated based on responses for correct trials only. Accordingly, we calculated SSNR thresholds for noise-trained VGG-19 by requiring accuracy to reach 90%. For comparison, we evaluated a standard DNN, which consisted of pretrained VGG-19 that received an equal number of training epochs with the 16 object categories using noise-free images only. (Object images were included in this analysis only if a reliable logistic function could be fitted to the data, which required that the DNN in question could accurately classify that object at an SSNR of 1 and the DNN was not strongly biased to make that categorical response to Gaussian noise images with an SSNR of 0.) Although the standard DNN could predict human SSNR thresholds for individual object images to some degree (Fig 5a, slope = 0.29, r = 0.24, t(721) = 6.57, p < 10^−10), the noise-trained DNN provided a significantly better fit of human behavioral performance (slope = 0.73, r = 0.53, t(714) = 16.65, p < 10^−16, comparison of noise-trained versus standard DNN, z = 6.55, p < 10^−10). These findings indicate that noise-trained DNNs provide a better model for predicting the critical noise level at which humans can recognize individual objects. That said, it should be noted that human-to-human similarity was greater still (mean r = 0.94, standard deviation of r = 0.0033, based on a split-half correlation analysis with 10,000 random splits), indicating that further improvements can be made by future DNN models to account for human recognition performance.

Fig 5. Noise-trained VGG-19 predicts human recognition thresholds.
(a) Scatter plot comparing SSNR thresholds of human observers with the thresholds of standard VGG-19 (red) and noise-trained VGG-19 (blue). Each data point depicts the SSNR threshold for an individual object image. (b) Examples of diagnostic object features from human observers, standard VGG-19, and noise-trained VGG-19. The mean SSNR level at which human observers correctly recognized the objects is indicated. Teapot image by Frank Tong. “Red Tabby Cat With Brown Eyes” by Plutonix is licensed under CC BY SA 3.0; the cat image was converted to grayscale with Gaussian noise added. (c) Correlational similarity and overlap ratio of the spatial profile of diagnostic features reported by human observers and those measured in DNNs across a range of SSNR levels. Gray dashed lines indicate ceiling-level performance based on human-to-human correspondence. Confidence intervals (95%) for human-to-human correspondence were extremely small (less than ±0.001) and therefore are not plotted. Data are available at https://osf.io/bxr2v/. DNN, deep neural network; SSNR, signal-to-signal-plus-noise ratio. https://doi.org/10.1371/journal.pbio.3001418.g005 To complement the diagnostic regions reported by human observers, we used layer-wise relevance propagation [44] to determine what portions of each image were important for the decisions of the DNNs (Fig 5b). We calculated the spatial correlation and amount of overlap between the diagnostic regions of humans and DNNs across a range of SSNR levels. Both standard and noise-trained VGG-19 performed quite well at predicting the diagnostic regions used by human observers at high SSNR levels of 0.8 or greater (Fig 5c). However, only the noise-trained DNN could reliably predict the diagnostic regions used by human observers in noisy viewing conditions. 
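One plausible way to formalize the two map-comparison measures, correlational similarity and overlap ratio, is sketched below. The function name and the thresholding rule for defining a diagnostic region are hypothetical; the paper's exact definitions may differ:

```python
import numpy as np

def region_similarity(map_a, map_b, thresh=0.5):
    """Compare two spatial relevance maps: Pearson correlation of the
    continuous maps, plus the overlap ratio (intersection over union)
    of their thresholded binary diagnostic regions."""
    a, b = np.ravel(map_a), np.ravel(map_b)
    r = np.corrcoef(a, b)[0, 1]
    bin_a, bin_b = a >= thresh, b >= thresh
    union = np.logical_or(bin_a, bin_b).sum()
    overlap = np.logical_and(bin_a, bin_b).sum() / union if union else 0.0
    return r, overlap

# identical maps should yield perfect correlation and perfect overlap
m = np.zeros((8, 8))
m[2:6, 2:6] = 1.0
r, ov = region_similarity(m, m)
```

Human relevance maps come from the mouse demarcations, and DNN relevance maps from layer-wise relevance propagation; both reduce to the same array comparison.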
The above findings demonstrate that noise-trained DNNs can capture the fine-grained behavioral performance patterns of human observers when tasked to recognize objects in challenging noisy conditions.

Characterizing network changes caused by noise training

Given that noise-trained VGG-19 provided an effective model for predicting human recognition of objects in noise, we sought to identify the stages of processing that were most affected by noise training. We devised a layer-specific noise susceptibility analysis that required calculating the correlation strength between the layer-specific pattern of activity evoked by a noise-free image and the pattern of activity evoked by that same image when presented with noise at varying SSNR levels (Fig 6a). Here, correlation strength should monotonically increase with increasing SSNR level (from an expected R value of 0 to 1.0), and the threshold SSNR level needed to reach a correlation value of 0.5 can then be identified. A lower threshold SSNR indicates greater robustness, whereas a higher threshold SSNR indicates greater noise susceptibility. We confirmed that the correlational similarity between layer-specific responses to a noise-free image and noisy image did indeed increase as a monotonic function of the latter’s SSNR level (S2 Fig) and observed greater robustness in the noise-trained DNN than in the standard pretrained DNN, especially in the higher layers.

Fig 6. Layer-specific network changes caused by noise training. (a) Depiction of method used for layer-specific noise susceptibility analysis. (b) Correlation-based SSNR thresholds for pretrained (red) and noise-trained (blue) versions of VGG-19 plotted by layer for objects shown in pixelated Gaussian noise or Fourier phase-scrambled noise. Layers 1 to 16, convolutional layers after rectification; layers 17 and 18, fully connected layers after rectification; layer 19, softmax output layer.
Higher SSNR thresholds indicate greater susceptibility to noise. (c) Classification-based SSNR thresholds plotted by layer for pretrained and noise-trained networks. Multiclass SVMs were used to predict object category from layer-specific activity patterns. (d) Similarity of feature representations for pretrained and noise-trained versions of VGG-19, calculated using CCA. Data are available at https://osf.io/bxr2v/. CCA, canonical correlation analysis; DNN, deep neural network; SSNR, signal-to-signal-plus-noise ratio; SVM, support vector machine. https://doi.org/10.1371/journal.pbio.3001418.g006

As can be seen in Fig 6b, pretrained and noise-trained VGG-19 exhibited quite similar SSNR thresholds in the first few layers, but thereafter performance began to diverge. We could quantify the change in noise susceptibility across layers in terms of measures of slope. For the standard DNN, noise susceptibility gradually increased in progressively higher layers for both types of noise (slope = 0.008 for Gaussian noise, p < 10^−4; slope = 0.012 for Fourier phase-scrambled noise, p < 10^−12), implying that the contaminating effects of image noise tend to become amplified across successive stages of feedforward processing. After noise training, however, the network showed considerable improvement, especially in the middle and higher layers, where standard and noise-trained networks most clearly diverged. For pixelated Gaussian noise, SSNR thresholds decreased considerably across successive layers (slope = −0.017, p < 10^−8). In effect, the convolutional processing that occurs across successive stages of the noise-trained network leads to a type of de-noising process. This finding is consistent with the notion that the disruptive impact of random, spatially independent noise can be curtailed if signals over progressively larger spatial regions are pooled together in an appropriate manner.
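The layer-specific noise susceptibility analysis can be sketched as follows: correlate the activation pattern evoked by the clean image with the pattern evoked at each SSNR level, then locate the SSNR at which the correlation reaches 0.5. The toy activation vectors below stand in for real layer activations, and the linear-interpolation step is an assumption about how the crossing point is read off:

```python
import numpy as np

def correlation_ssnr_threshold(clean_act, noisy_acts, ssnr_levels,
                               criterion=0.5):
    """For one layer: correlate the activation pattern to the clean image
    with the pattern to the same image at each SSNR level, then linearly
    interpolate the SSNR at which the correlation reaches `criterion`.
    Assumes the correlations increase monotonically with SSNR."""
    rs = [np.corrcoef(clean_act.ravel(), a.ravel())[0, 1]
          for a in noisy_acts]
    return float(np.interp(criterion, rs, ssnr_levels))

# toy activations: the noisy pattern is a mixture of the clean pattern
# and independent noise, so correlation rises smoothly with SSNR
rng = np.random.default_rng(1)
clean = rng.normal(size=1000)
levels = np.linspace(0.1, 1.0, 10)
noisy = [s * clean + (1 - s) * rng.normal(size=1000) for s in levels]
th = correlation_ssnr_threshold(clean, noisy, levels)
```

Running this per layer for standard versus noise-trained networks yields the threshold-by-layer curves of Fig 6b, where lower thresholds indicate greater robustness.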
This can be contrasted with the results for the DNN trained on objects in Fourier phase-scrambled noise. Here, the SSNR thresholds of the noise-trained network remained quite stable across successive layers and decreased only slightly (slope = −0.002, p = 0.017), whereas the standard DNN became more susceptible to noise across successive layers. As a complementary analysis, we measured classification-based SSNR thresholds by applying a multiclass support vector machine (SVM) classifier to the activity patterns of each layer of a given network. Each SVM was trained on activity patterns evoked by noise-free training images and then tested on its ability to predict the object category of test stimuli presented at varying SSNR levels. The SSNR level at which classification accuracy reached 50% was identified as the classification-based SSNR threshold. For standard DNNs, we found that classification accuracy for noise-free test images gradually improved across successive layers of the network due to increased sensitivity to category information (S3 Fig), and this trend largely accounted for the gradual improvement in classification-based SSNR threshold from one layer to the next (Fig 6c). Of greater interest, the divergence between standard and noise-trained networks became more pronounced in the middle and higher layers due to the benefits of noise training. These findings favor the notion that acquisition of noise robustness involves considerable modification of representations in the middle and higher layers of the noise-trained network. Consistent with this notion, a study of DNNs trained on auditory stimuli, specifically spoken words and music in the presence of real-world background noise, found that robustness to auditory noise was more pronounced in the higher layers of the DNN [45]. We further evaluated the extent to which the feature representations or weights in each layer changed as a consequence of noise training. 
This was done by performing canonical correlation analysis between the weight matrices of the pretrained DNN and the noise-trained DNN to assess their multivariate similarity. As can be seen in Fig 6d, training with a combination of Gaussian noise and Fourier phase-scrambled noise led to negligible change to the representations in layer 1, whereas modifications were evident in layers 2 through 17, with somewhat greater change observed in the higher layers. However, layers 18 and 19, which are fully connected and tend to represent more semantic rather than visual information, exhibited negligible change in their structured representations after noise training. These analyses indicate that noise training leads to modifications to all convolutional layers of the DNN, with the exception of layer 1. Presumably, these changes account for the greater robustness to noise that was observed in our layer-wise measures of noise susceptibility (Fig 6b and 6c).

Comparison of DNNs and human visual cortical responses to objects in noise

We conducted a functional magnetic resonance imaging (fMRI) study at 7 Tesla to measure human cortical responses to objects in noise and to assess their degree of correspondence with DNN object representations. Observers were shown 16 object images (2 images from 8 selected categories) in each of 3 viewing conditions: without noise (SSNR 1.0), in pixelated Gaussian noise (SSNR 0.4), and in Fourier phase-scrambled noise (SSNR 0.4). During each image presentation, observers were instructed to perform an animate/inanimate discrimination task. Behavioral accuracy was high overall (97.1% for clean objects, 98.5% for Gaussian noise, and 95.5% for Fourier phase-scrambled noise) and did not significantly differ between conditions (F(2, 21) = 1.27, p = 0.30).
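The decoding analyses in this study used multiclass SVM classifiers trained on responses to clean images and tested on responses to noisy images. As a dependency-free stand-in for the SVM, the sketch below substitutes a correlation-based nearest-centroid classifier; the voxel patterns, category structure, and noise levels are all hypothetical toy data:

```python
import numpy as np

rng = np.random.default_rng(3)

def nearest_centroid_decode(train_X, train_y, test_X):
    """Assign each test pattern to the class whose mean training pattern
    is most correlated with it (a simple proxy for multiclass SVM)."""
    classes = np.unique(train_y)
    centroids = np.stack([train_X[train_y == c].mean(axis=0)
                          for c in classes])
    # z-score each pattern so the dot product is proportional to Pearson r
    tz = (test_X - test_X.mean(1, keepdims=True)) / test_X.std(1, keepdims=True)
    cz = (centroids - centroids.mean(1, keepdims=True)) / centroids.std(1, keepdims=True)
    return classes[np.argmax(tz @ cz.T, axis=1)]

# toy voxel patterns for 8 categories: train on clean, test on noisy
n_vox, n_rep = 100, 10
templates = rng.normal(size=(8, n_vox))
train_X = np.repeat(templates, n_rep, 0) + 0.5 * rng.normal(size=(80, n_vox))
train_y = np.repeat(np.arange(8), n_rep)
noisy_X = np.repeat(templates, n_rep, 0) + 1.5 * rng.normal(size=(80, n_vox))
pred = nearest_centroid_decode(train_X, train_y, noisy_X)
acc = (pred == train_y).mean()
```

With 8 categories, chance is 12.5%; comparing decoding accuracy across clean and noisy test conditions mirrors the logic of the cross-validated analysis reported next.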
First, we sought to determine whether the human visual cortex is more readily disrupted by spatially correlated noise than by spatially independent noise, as one might expect from our behavioral results from experiment 1. We evaluated object discrimination performance of individual visual areas by training a multiclass SVM classifier on fMRI responses to clean object images and testing the classifier's ability to predict the object category of both clean and noisy images using cross-validation. In early visual areas V1 through V4, object classification of cortical responses was most accurate for clean images, intermediate for objects in pixelated Gaussian noise, and poorest for objects in Fourier phase-scrambled noise (Fig 7a). Planned comparisons indicated that classification accuracy was significantly higher for clean objects than for objects in pixelated noise and also higher for objects in pixelated noise as compared to Fourier phase-scrambled noise (t(7) > 4.7 in all cases, p < 0.0025). These fMRI results concur with the better behavioral performance that human observers exhibited for objects in pixelated noise as compared to Fourier phase-scrambled noise. Classification accuracy for high-level object-sensitive areas was lower overall than was observed for early visual areas; this pattern of results is often found in studies of fMRI decoding.

Fig 7. Noise-trained VGG-19 provides a better model of human cortical responses to objects in noise. (a) Classification accuracy for fMRI responses in individual visual areas for clean objects (black filled circles), objects in pixelated Gaussian noise (gray filled circles), and Fourier phase-scrambled noise (gray open circles). Error bars indicate ± 1 standard error of the mean (n = 8). Chance-level performance is 12.5%.
(b) Correlational similarity of object representations obtained from human visual areas and individual layers of DNNs when comparing standard versus noise-trained networks (red versus blue, respectively). Color-coded horizontal lines at the top of each plot indicate a statistically significant advantage (p < 0.01 uncorrected) for a given DNN at predicting human neural representations of the object images. Data are available at https://osf.io/bxr2v/. DNN, deep neural network; FFA, fusiform face area; fMRI, functional magnetic resonance imaging; LOC, lateral occipital cortex; PPA, parahippocampal place area; SVM, support vector machine. https://doi.org/10.1371/journal.pbio.3001418.g007

Interestingly, discrimination accuracy for objects in pixelated noise was not significantly different from that of clean objects in any of the high-level areas. These findings are consistent with the notion that greater spatial pooling of information by higher visual areas may attenuate the detrimental effects of spatially independent noise [36]. By contrast, classification accuracy was significantly better for objects in pixelated Gaussian noise as compared to Fourier phase-scrambled noise in the lateral occipital cortex (LOC, t(7) = 3.38, p < 0.025) and the parahippocampal place area (PPA, t(7) = 2.54, p < 0.05), although not in the fusiform face area (FFA, t(7) = 1.09, p = 0.31). Our findings indicate that object processing in both low- and high-level visual areas is more readily disrupted by spatially correlated noise than by spatially uncorrelated noise. We evaluated the correspondence between human cortical responses and DNN representations by performing representational similarity analysis [13]. This involved calculating a correlation matrix of the responses to each of the 48 images (16 object images × 3 viewing conditions), separately for each visual area and for each layer of a DNN.
After excluding the main diagonal, the resulting matrices reflected the similarity (or confusability) of responses to all possible pairs of object images. The similarity of the object representational spaces across humans and DNNs could then be determined by calculating the Pearson correlation between matrices obtained from human visual areas and DNNs. The noise-trained DNN consisted of VGG-19 trained on the 16 categories of objects presented with both types of noise as well as noise-free images. Here, the standard DNN consisted of pretrained VGG-19 that received an equal number of training epochs with noise-free images only. Fig 7b shows the results of standard and noise-trained DNNs in terms of their layer-specific ability to predict the patterns of responses in human visual areas. In the lowest layers 1 to 3, standard and noise-trained DNNs performed comparably in their ability to account for the human fMRI data. However, from convolutional layer 4 and above, the noise-trained DNN exhibited a clear advantage over the standard DNN at predicting the similarity structure of human cortical responses, while also performing better overall. For early visual areas, the correspondence with noise-trained VGG-19 remained high throughout convolutional layers 4 through 16 and then exhibited a sharp decline in the fully connected layers 17 to 19. These fMRI results concur with previous findings that the later fully connected layers of DNNs tend to represent more abstract object information rather than visual information [14]. A different pattern of results was observed in high-level object-sensitive areas (LOC, FFA, and PPA). Here, the correspondence with the noise-trained DNN remained high or tended to rise in the fully connected layers. Taken together, these results demonstrate that noise-trained DNNs provide an effective model to account for the pattern of visual cortical responses to objects in noise, whereas standard DNNs do not.
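The representational similarity analysis described above reduces to a few lines of numpy, assuming the neural and model responses are each given as an images × measurements matrix (48 rows here: 16 objects × 3 viewing conditions). The function and variable names below are hypothetical; because the similarity matrix is symmetric, correlating the upper-triangle entries is equivalent to correlating all off-diagonal entries.

```python
import numpy as np

def rsa_correspondence(brain_resp, layer_resp):
    """Pearson correlation between the off-diagonal entries of two
    image-by-image correlation (similarity) matrices."""
    def similarity_vector(resp):
        sim = np.corrcoef(resp)                      # images x images
        return sim[np.triu_indices_from(sim, k=1)]   # exclude main diagonal
    return float(np.corrcoef(similarity_vector(brain_resp),
                             similarity_vector(layer_resp))[0, 1])
```

Here `brain_resp` might hold one visual area's voxel patterns (48 × voxels) and `layer_resp` one DNN layer's activations to the same 48 images (48 × units); repeating the computation per area and per layer would yield the layer-wise correspondence profiles plotted in Fig 7b.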
[1] https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001418