Advances in generative AI technology have made it easier than ever for anyone to manufacture increasingly realistic synthetic media (colloquially known as deepfakes) at faster speeds, larger scales, and with greater customization.13 This in turn has led to synthetic media increasingly being used for harmful purposes, including disinformation campaigns, nonconsensual pornography, financial fraud, child sexual abuse and exploitation, and espionage.23 As of today, the principal defense against deceptive synthetic media depends in large part on the human observer’s perceptual detection capabilities: their ability to visually or auditorily identify AI-generated content when they encounter it.13 Yet the growing realism of synthetic media impedes this ability, heightening people’s vulnerability to weaponized synthetic content. Moreover, people overestimate their ability to identify synthetic media, further exacerbating the problem.11
As a result, accurately measuring people’s perceptual ability to differentiate between real and fake is critical to effectively combat the harms from synthetic media misuse. While there are growing efforts to develop and implement alternative technical solutions, such as machine detection, watermarking, and content provenance, these methods either currently lack robustness or are not yet sufficiently widespread to be effective.10,13,31 Similarly, despite extensive calls to deploy educational interventions such as digital media literacy campaigns or to ratify national laws requiring the labeling or prohibition of certain AI-generated content, formalized efforts have thus far been relatively limited.
We conducted a study to measure people’s ability to detect AI-generated content, with participants asked to classify synthetic and authentic images, as well as audio, video, and audiovisual stimuli in a series of online surveys. In addition to reviewing general deception-performance trends, our study examines how specific stimuli characteristics, including media type, authenticity, multimodality, and content subject matter, affect human detection performance. By doing so, it is possible to identify certain types of synthetic content that may be generally more difficult for people to detect, and consequently, more successful vehicles for deception. Our study also examines the impact of demographic factors, including multilingualism, age, and prior knowledge of synthetic media, on detection performance, revealing whether certain demographics are more vulnerable to misleading synthetic content than others.
Our survey series was designed to emulate the experience of browsing a typical online news feed by replicating several ecological conditions, including the interface format, browsing behavior, and the types and subject matter of the digital media presented, and by ensuring the synthetic stimuli were representative of actual synthetic media being made with publicly accessible generative AI tools at the time. The study design is discussed in more detail in the Methods section (see Appendix).
As social media platforms have become a hot spot for the proliferation of AI-generated content, this approach was chosen so the study’s findings would more closely reflect the average individual’s detection performance if they were to encounter misleading synthetic media in their social media news feeds.7 Thus far, much of human detection research has sought to assess detection performance under circumstances not necessarily reflective of how people currently encounter synthetic media in their daily lives. Examples include constraining the synthetic content being tested, such as its subject matter or media type, as well as employing testing conditions not present in the real world, such as two-alternative forced-choice methods, providing instant feedback or unlimited testing time, and informing participants what percentage of media will be synthetic in the test.5,6,12,14,17 As a result, it is challenging to generalize many of these studies’ findings on human detection performance. Comparatively, the body of literature examining human detection capabilities under conditions more typically encountered in real life is relatively new; our study seeks to build on such work. For example, Josephs et al. examined the impact of common real-world video browsing conditions, such as video quality, time-limited exposure, and distraction, on synthetic video detection.8
Our study makes several additional contributions to the current literature, expanding the scope of synthetic stimuli for detection and examining several novel demographic factors. Thus far, the human detection field has predominantly assessed people’s ability to identify synthetic images featuring human faces.6,13,14,22,27 This study is part of a nascent trend to examine human detection of non-face images, such as urban scenes and landscapes.1 As far as we are aware, it is the first study to also include synthetic images featuring animals. Furthermore, it is the first human detection study to examine the role of language in detecting synthetic, visual-based stimuli; to compare detection-accuracy rates between heterogeneous and fully synthetic audiovisual stimuli; to assess the impact of age on synthetic audio recognition; and to examine the effect of an individual’s prior knowledge of synthetic media on detection performance.
Our results show that participants’ overall accuracy rates for identifying synthetic content are close to a chance-level 50%, with minimal variation between media types, suggesting that people’s visual and auditory perceptual capabilities are inadequate for reliably identifying synthetic media encountered online. Our results also find that detection-accuracy rates worsen when people are presented with the following:
Synthetic content compared to authentic content
Images of human faces compared to non-human-face objects
Single-modality stimuli compared to multimodal stimuli
Audiovisual stimuli with heterogeneous authenticity compared to fully synthetic audiovisual stimuli.
This indicates that individual content characteristics present within the stimuli do affect certain visual and auditory perceptual processes such as object, speech, and language recognition, providing potentially novel observations to not only synthetic media detection research but also the human perception field.
We also find that demographic factors such as multilingualism and age play a role in perceptual detection capabilities. People were less accurate in correctly identifying stimuli featuring foreign languages than those featuring languages in which participants are fluent, and older participants performed worse than younger participants—particularly in identifying audio and audiovisual stimuli. This indicates that monolingual and older demographics may be more susceptible to being deceived by synthetic media than their multilingual and younger counterparts. Finally, we find that people’s prior knowledge of synthetic media does not affect detection performance, with people who reported being unfamiliar, semi-familiar, or highly familiar with synthetic media prior to taking the survey all performing similarly. This suggests either that current public knowledge of synthetic media is insufficient for meaningfully improving detection performance, or that synthetic media has become convincingly realistic enough that perceptual-based educational interventions have become inadequate.
These results demonstrate that depending on people’s perceptual detection capabilities to discern the real from the fake is no longer a viable bulwark against the threats posed by synthetic media. Though our findings provide useful insights on how to reduce people’s susceptibility to certain digital content characteristics, such as stimuli featuring a foreign language or in a single modality, it is expected that the benefits will be relatively short term. Rather, we anticipate that continued advances in generative AI technology will eventually lead to any detection performance differences resulting from these characteristics becoming negligible, and that human detection performance overall will plateau. Ultimately, these findings serve to further emphasize the critical need for alternative countermeasures to more effectively combat both the potential and already realized harms arising from synthetic media, whether these measures are technical, educational, or otherwise.
Results
We conducted a preregistered perceptual survey series, requiring 1,276 participants to classify authentic and synthetic media stimuli.
Media type and authenticity. Mean detection performance across all stimuli was 51.2%, close to a random chance performance of 50%. With regard to the detection performance of specific modalities, participants were least accurate at classifying image stimuli, with a mean accuracy rate of 49.4% (Figure 1). Comparatively, detection accuracy was higher for video-only stimuli and audio-only stimuli, at 50.7% and 53.7% respectively. Participants were most accurate when classifying audiovisual stimuli, achieving a mean accuracy of 54.5%.
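One point worth keeping in mind when reading these near-chance figures: with enough trials, even a tiny deviation from 50% can be statistically distinguishable from chance while remaining practically useless for reliable detection. The sketch below (hypothetical trial counts, not the study’s data) illustrates this with a two-sided binomial test against the 50% chance level:

```python
# Hypothetical sketch: testing an observed accuracy rate against the 50%
# chance level with a two-sided exact binomial test. The trial counts here
# are illustrative assumptions, not the study's actual data.
from scipy import stats

n_trials = 10000   # hypothetical number of classification trials
n_correct = 5120   # corresponds to the reported 51.2% mean accuracy

result = stats.binomtest(n_correct, n_trials, p=0.5)
print(f"observed rate = {n_correct / n_trials:.3f}, p = {result.pvalue:.4f}")
```

At this (assumed) trial count the small deviation from 50% is statistically detectable, yet an accuracy of 51.2% is still far too low for observers to reliably separate real from fake content.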
The stimuli’s authenticity was also a meaningful predictor of detection performance (see Table 1). Across all modalities, participants were significantly better at correctly identifying fully authentic stimuli (mean detection performance M=64.6%) than stimuli that contained synthetic media (M=38.8%). When audiovisual stimuli were examined specifically, as the only multimodal media type in the study, an unpaired t-test found that participants’ detection performance worsened significantly (t(9861)=6.3, p<0.001) when classifying audiovisual stimuli containing synthetic video paired with authentic audio (M=43.4%) compared to audiovisual stimuli in which both the video and the audio were synthetic (M=49%). The effect size for this difference was small (Cohen’s d=0.11).
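The comparison above pairs an unpaired t-test with a pooled-standard-deviation Cohen’s d. A minimal sketch of that analysis on simulated per-trial data (group sizes and hit rates are illustrative assumptions, not the study’s data):

```python
# Illustrative sketch (simulated data): an unpaired t-test plus Cohen's d,
# mirroring the comparison of heterogeneous vs. fully synthetic audiovisual
# stimuli. Group sizes and hit rates below are assumptions for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Per-trial correctness (1 = correct classification)
hetero = rng.binomial(1, 0.434, size=20000)  # synthetic video + authentic audio
fully = rng.binomial(1, 0.490, size=20000)   # synthetic video + synthetic audio

t, p = stats.ttest_ind(hetero, fully)        # unpaired (independent) t-test

# Cohen's d using the pooled standard deviation
pooled_sd = np.sqrt((hetero.var(ddof=1) + fully.var(ddof=1)) / 2)
d = (fully.mean() - hetero.mean()) / pooled_sd
print(f"t = {t:.2f}, p = {p:.3g}, d = {d:.2f}")
```

With hit rates this close together, the standardized effect size stays small (d on the order of 0.1) even when the t-test is highly significant, which is why the paper reports both statistics.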
Table 1. Binary logistic regression results predicting detection performance.

| Predictor | β | SE | Wald χ2 | df | p | Odds Ratio | 95% OR CI |
| --- | --- | --- | --- | --- | --- | --- | --- |
| (Intercept) | 0.924 | 0.022 | 1767.698 | 1 | <0.001* | 2.519 | [2.260, 2.808] |
| Authenticity: Synthetic Media | -1.140 | 0.012 | 8300.303 | 1 | <0.001* | 0.320 | [0.317, 0.322] |
| Image Subject Matter: Human Face | -0.269 | 0.016 | 265.364 | 1 | <0.001* | 0.764 | [0.746, 0.783] |
| Language: Foreign Language | -0.158 | 0.020 | 59.707 | 1 | <0.001* | 0.854 | [0.826, 0.883] |
| Deepfakes Pre-Knowledgeability: Highly Familiar | -0.035 | 0.022 | 2.332 | 1 | 0.115 | 0.966 | [0.926, 1.007] |
| Deepfakes Pre-Knowledgeability: Semi-Familiar | -0.005 | 0.015 | 0.095 | 1 | 0.758 | 0.995 | [0.966, 1.025] |

Likelihood ratio test: χ2(6) = 9142.5, p<0.001; Nagelkerke R2 = 0.095
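For readers interpreting Table 1, each odds ratio is simply the exponential of its logistic-regression coefficient, OR = exp(β). A quick check against the table’s reported values:

```python
# Reproducing Table 1's odds ratios from the reported logistic-regression
# coefficients: OR = exp(beta). Coefficients are taken from the table above.
import math

coefs = {
    "Synthetic Media": -1.140,   # odds of a correct classification roughly a third as large
    "Human Face": -0.269,
    "Foreign Language": -0.158,
    "Highly Familiar": -0.035,
    "Semi-Familiar": -0.005,
}
for name, beta in coefs.items():
    print(f"{name}: OR = {math.exp(beta):.3f}")
```

For example, exp(-1.140) ≈ 0.320, matching the table: the odds of correctly classifying a stimulus drop by roughly two-thirds when it contains synthetic media.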
Human face vs. non-human-face images. Participants were also significantly less accurate (Table 1) when classifying images featuring human faces (M=46.6%) compared to images featuring non-human-face objects, such as animals (M=51.7%), food (M=49.9%), and landscapes (M=54.7%), shown in Figure 2. A post hoc unpaired t-test confirmed that although the effect size was small (Cohen’s d=0.21), there continued to be a significant difference in detection performance even when controlling for model type (t(12201)=-13.83, p<0.001).
Multimodality. An unpaired t-test found that the mean detection-accuracy rate was significantly higher (Cohen’s d=0.4, small-to-medium effect) for participants classifying multimodal audiovisual stimuli (M=54.5%; t(49004)=5.18, p<0.001) than for those classifying single-modality audio-only and video-only stimuli (M=52.2%). Thirty of the video-based stimuli were randomly presented in different formats across the two surveys, enabling a post hoc comparison of participants’ accuracy rates in each condition. In one survey, these stimuli were presented in a fully audiovisual format; in the other, the audio was removed and the stimuli were presented in a video-only format. Post hoc analysis found a small but meaningful effect (Cohen’s d=0.11) of multimodality on participants’ detection performance: an unpaired t-test found that participants were significantly more accurate (t(37517)=10.15, p<0.001) at identifying the video stimuli when the audio was included (M=55.9%) than when the same stimuli were presented in a video-only format (M=50.7%).
Multilingualism. We found that detection performance significantly improved (Table 1) when participants classified visual and auditory stimuli featuring a spoken language in which they were self-reportedly fluent (M=54.5%), as opposed to stimuli featuring a foreign language (M=51.3%). As shown in Figure 3, a series of post hoc unpaired t-tests found that detection-accuracy rates remained significantly better when known, as opposed to foreign, languages were present in the stimuli, even when each modality type was examined individually. The effect sizes across the three modality comparisons were small, with the presence of known languages in audiovisual stimuli having comparatively the largest effect (d=0.09), followed by audio-only stimuli (d=0.06) and video-only stimuli (d=0.04). Participants were significantly more accurate (t(17696)=3.947, p<0.001) at correctly identifying audio-only stimuli featuring known languages (M=55.3%) than those featuring foreign languages (M=52.3%). Detection performance was also significantly better when classifying audiovisual stimuli (t(22640)=6.864, p<0.001) and video-only stimuli (t(19278)=2.83, p<0.001) featuring known languages, with mean detection-accuracy rates for audiovisual (M=56.5%) and video-only (M=52%) stimuli featuring known languages higher than those for audiovisual (M=52%) and video-only (M=49.5%) stimuli featuring foreign languages.
As seen in Figure 4, a series of additional post hoc unpaired t-tests found that when examining the 30 video-based stimuli that were present in both surveys—in either a video-only or audiovisual format—the inclusion of audio featuring a known language alongside the video resulted in a small (d=0.12) but significant improvement in participants’ detection performance (t(23035)=9.17, p<0.001). Although the effect was comparatively smaller (d=0.09), detection performance also significantly improved (t(16966)=5.63, p<0.001) when audio featuring a foreign language was included alongside the video.
Age. A post hoc linear regression analysis found age to be a significant predictor of detection performance (β=-56.64, p<0.001), with older participants being significantly less accurate in classifying stimuli than younger participants [R2=0.04, F(1,1274)=55.63, p<0.001]. When examining the results by individual media type, we found that detection performance declined with age the most for audiovisual and audio-only stimuli, with a comparatively smaller decrease in accuracy rates for image and video-only stimuli (Figure 5).
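The age analysis amounts to an ordinary least-squares regression of per-participant accuracy on age. A minimal sketch on simulated data (the age range, decline rate, and noise level are assumptions for illustration; only the negative-slope pattern mirrors the study):

```python
# Hypothetical sketch of the age analysis: regress each participant's overall
# accuracy on their age with ordinary least squares. All numbers here are
# simulated assumptions; only the sample size matches the study.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_participants = 1276                         # matches the study's sample size
age = rng.integers(18, 80, size=n_participants)
# Simulated accuracy: a mild decline with age plus per-participant noise
accuracy = 0.60 - 0.002 * age + rng.normal(0, 0.05, size=n_participants)

res = stats.linregress(age, accuracy)
print(f"slope = {res.slope:.4f}, R^2 = {res.rvalue**2:.3f}, p = {res.pvalue:.2g}")
```

A negative, significant slope in this setup corresponds to the paper’s finding that accuracy falls as participant age rises.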
Prior synthetic media knowledge. Finally, there was no significant difference in detection performance found (Table 1) between participants who self-reported being previously unfamiliar with synthetic media (M=51.1%) and those who reported having prior knowledge of synthetic media, either to a semi-familiar (M=51%) or highly familiar degree (M=51.9%). A post hoc one-way ANOVA did not find any significant difference in detection-accuracy rates between the three respective groups [F(2,124185)=1.4, p=0.25; η2<0.01], with Tukey’s HSD test further confirming no significant difference between their means.
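The familiarity comparison corresponds to a one-way ANOVA across the three self-reported groups. An illustrative sketch on simulated per-trial data (group sizes are assumptions; the hit rates are chosen near the reported group means):

```python
# Hypothetical sketch of the familiarity analysis: one-way ANOVA across the
# three self-reported knowledge groups. Data are simulated; the study also
# followed up with Tukey's HSD on the group means.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Per-trial correctness for each group, with near-identical hit rates
unfamiliar = rng.binomial(1, 0.511, size=4000)
semi_familiar = rng.binomial(1, 0.510, size=4000)
highly_familiar = rng.binomial(1, 0.519, size=4000)

f, p = stats.f_oneway(unfamiliar, semi_familiar, highly_familiar)
print(f"F = {f:.2f}, p = {p:.3f}")
```

When the omnibus F-test fails to reject equal means, a follow-up such as Tukey’s HSD (as used in the study) serves as a confirmatory check on the pairwise comparisons.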
Limitations
This study is not without its limitations. First, as advances in generative AI technology have continued to rapidly improve the realism of synthetic content, current human detection performance may be lower than reported in this study. Furthermore, although the study’s design mimics online platform environmental conditions to replicate the manner in which individuals would likely encounter synthetic media in their daily lives, it excludes several features that would typically be present in such an environment. These include captions or text accompanying a media post, engagement metrics such as likes or reshares, and other users’ comments; the study also does not account for participants’ ability to cross-reference, validate the source of, or fact-check the stimuli they encounter. Although these restrictions were imposed deliberately to ensure participants relied solely on their cognitive perceptual capabilities in identifying synthetic content, they somewhat limit the generalization of the study’s findings, as the study does not account for people’s ability to rely on other contextual cues or employ digital literacy skills to determine whether a media item is real or fake. Therefore, we suggest that our results are most ecologically valid for human detection performance on online platforms where information is more context-limited or where people have a proven tendency not to critically engage with digital content.
Existing literature suggests that many social media platforms are increasingly becoming such context-limited environments, a result of the growing trend for users to passively and non-critically consume digital content on these platforms and the growing reliance on them as a primary source for information such as news.18,25,28 A final limitation is that participants self-reported their degree of preexisting synthetic media knowledge and their fluency in languages, so participants may have over- or underestimated their capabilities in comparison with other participants.
Discussion
Social media platform detection performance. Multiple implications can be drawn from the results of this study. First, people struggled to meaningfully distinguish synthetic images, as well as audio, video, and audiovisual stimuli, from their authentic counterparts, with overall detection-accuracy rates being close to a chance-level performance. This is congruent with existing research to some degree, as prior studies have recorded similar detection-accuracy rates for synthetic images,17,27 yet studies on video, audio, and audiovisual stimuli have typically found higher detection-accuracy rates.2,5,6,12,14,20,22 Although study differences bar a more detailed comparison, we hypothesize that the overall divergence in detection performance is due to differing environmental testing conditions—which suggests that people’s detection capabilities are further constrained under conditions similar to those present in browsing a social media news feed. This worsening performance is consistent with Josephs et al.’s 2023 study, which similarly found that online platform environmental factors, including divided attention and exposure length, among others, also had inhibitory effects on people’s detection capabilities.8 Further research might be conducted on how these and other environmental factors may also affect detection performance, both individually and cumulatively. Having a better understanding of these environmental factors and their effects on detection performance will provide a clearer picture of how vulnerable people may be to deceptively employed synthetic media in their daily lives. 
For instance, despite smartphones’ growing popularity as people’s primary source of online information, it has been found that people use lower amounts of cognitive resources when consuming digital content via smartphones.3,28 As a result, widespread smartphone use may further hinder people’s abilities to detect deceptive synthetic media when consuming content via their online news feeds.
Stimuli content characteristics. Detection performance has been found to be sensitive to certain stimuli content characteristics, which suggests that these individual factors may affect people’s detection capabilities due to how they impact their cognitive perceptual processes.
Authenticity. People are more accurate at identifying authentic content than synthetic content, revealing a bias toward classifying digital content as real. This is consistent with previous research, which also found people had a similar inclination to classify video-only stimuli as authentic.8,11
Modality. People are also less accurate at detecting synthetic audio-only and video-only stimuli compared to audiovisual stimuli, which suggests the inclusion of additional modalities improves people’s perceptual ability to distinguish between authentic and synthetic media. This is consistent with existing research such as Groh et al.’s 2023 study, which found the addition of audio to videos of real and fake political speeches improved detection performance.6 This may be due to the multimodal nature of speech perception, as speech-perception research has found that people rely on both visual and auditory perceptual cues to comprehend spoken words.21 Therefore, synthetic audiovisual stimuli may be easier to detect than their monomodal counterparts because people can leverage both the auditory and the visual information present to identify observable AI-generated artifacts. In addition, detection accuracy is higher when identifying fully synthetic audiovisual stimuli than heterogeneous synthetic audiovisual stimuli, in which the video is synthetic while the audio is authentic. This may be because fully synthetic stimuli can contain observable artifacts in both the audio and visual modalities, giving people more opportunities to detect them, whereas heterogeneous stimuli confine any observable artifacts to a single modality.
Image content. Meanwhile, people are worse at detecting synthetic images featuring human faces compared to synthetic images featuring animals, food, and landscapes, indicating they find synthetic images of human faces more convincing than similarly realistic images of non-face subjects. This may be due to the specialized human visual perceptual process that occurs when people observe faces as opposed to non-face objects. Visual perception research has established that, whereas humans recognize faces as a perceptual whole, non-face objects are identified by their distinct individual components.9,26 Therefore, the more gestalt face-recognition process that takes place once an image is recognized to feature a face may leave people less sensitive to observable AI-generated artifacts than the more fragmented object-recognition process that occurs when observing non-face objects.
Multilingualism. People are also more accurate in detecting synthetic audio-only, video-only, and audiovisual stimuli featuring languages in which the observer is fluent compared to stimuli featuring foreign languages, suggesting that language familiarity plays a significant role in people’s visual and auditory perceptual detection capabilities. This is consistent with existing synthetic-audio-detection research such as Müller et al.’s 2023 study, which found native English speakers to be better at detecting English synthetic audio clips than non-native speakers.20 Detection performance may be higher when known languages are present in visual and auditory stimuli due to the observer being more familiar with the visual and auditory language information available, making them more sensitive to observable artifacts.16 In addition, when comparing the detection performance of video-only stimuli with their audiovisual counterparts (identical video clips but with audio included), it was found that people’s detection accuracy improved to a greater degree between video-only and audiovisual stimuli featuring known languages versus those featuring foreign languages. This suggests that people are more sensitive to auditory perceptual cues in audiovisual stimuli featuring a familiar language as opposed to foreign ones. This is congruent with established language-perception research, which has found that people weigh auditory information more highly when observing known languages being spoken, while they rely on visual perceptual cues to a greater degree when observing foreign languages.24,29 Therefore, it may be that the inclusion of audio facilitated people’s sensitivity to synthetic visual content to a greater degree for stimuli featuring familiar languages than foreign languages because of the greater weight given to auditory perceptual cues when observing familiar languages as part of the language perceptual process.
Understanding how individual stimuli characteristics such as the ones examined in this study may inhibit or facilitate people’s perceptual detection capabilities is valuable, as it provides more accurate predictions of how susceptible people may be to different types of synthetic content. In addition, it identifies demographics that may be more vulnerable to being duped by synthetic media than others, such as monolingual speakers potentially being less sensitive than multilingual speakers to visual and auditory AI-generated artifacts presented in foreign languages. In turn, these insights can inform the development of more effective countermeasures to better mitigate people’s susceptibility to synthetic media, such as specific content-moderation tactics. For instance, although many social media platforms mute audiovisual content by default when it plays, detection performance is likely to improve if the audio is not muted when a video automatically plays. Therefore, a useful content-moderation policy may be to not automatically mute videos deemed to be at higher risk of containing deceptive synthetic content, such as those featuring political content in the run-up to elections. Future research would be useful to identify additional actionable countermeasures, as well as to further explore the effects of these and other stimuli characteristics on detection performance, such as determining whether the difference in detection performance between faces and non-face objects also holds for synthetic video and audiovisual stimuli. Regardless, as generative AI technology continues to improve, periodic reassessment of these stimuli content characteristics would be beneficial to determine whether they continue to have the same effect.
Age. People’s age significantly affected detection performance, with older people being less accurate in identifying synthetic media than younger people across all media types. The relationship between age and detection performance is consistent with previous research, which similarly found older people to be less accurate in classifying audiovisual and audio stimuli, respectively.2,14 This suggests that as people age, they become less sensitive to perceptual cues such as observable artifacts present within synthetic media. We speculate that this may be due to the widespread visual and auditory perceptual degradation that occurs with aging.4 Detection performance in older age declined the most for stimuli containing audio components, which may be a result of hearing loss being more prevalent than visual impairment in older demographics, along with the lower frequency of people using treatments to improve their hearing in comparison with their sight.4,15 Audiovisual stimuli, which younger people were the most accurate at classifying, were the media type least often correctly classified by older people. This may be due to how age-related auditory and visual degradation affects the multisensory perceptual process. Existing speech-perception research finds that older individuals depend more on multisensory information to compensate for degradation of unisensory perception, such as relying more on adjacent visual cues to improve their comprehension of auditory information.19 This higher reliance on the integration of perceptual information makes older individuals more susceptible to illusory effects, such as purposefully mismatched visual and auditory information, than their younger counterparts.19 Similarly, this may mean older individuals are less sensitive to artifacts present in audiovisual stimuli because they are more reliant on multisensory than unisensory perception when observing audiovisual content.
Altogether, this indicates that age increases susceptibility to being misled by synthetic media, with a greater vulnerability to audio and audiovisual synthetic content than to visual synthetic content. This insight suggests the potential need for age-related content-moderation policies to improve older digital consumers’ resilience against deceptive synthetic media, or for educational interventions tailored to this higher-risk demographic. This is especially important given that older individuals are increasingly becoming the targets of synthetic-media-enhanced scams.19 Future research on how age-related limitations may be mitigated would be useful, such as assessing whether corrective devices such as glasses or hearing aids meaningfully improve perceptual detection performance.
Prior synthetic media knowledge. People’s prior knowledge of synthetic media did not affect detection performance; people who reported being highly familiar with synthetic media performed similarly to those who reported being less familiar or unfamiliar. As this study did not test participants on their synthetic media knowledge, instead asking them to self-report it, this suggests one of two possible causes. The first is that synthetic media has become convincingly realistic to the degree where increased familiarity with it does not meaningfully improve people’s perceptual detection capabilities. Alternatively, it may be that current public knowledge of synthetic media, even at comparatively higher levels, does not sufficiently educate people on effective perception-based detection methods. Existing research suggests the truth currently lies somewhere in the middle, as some studies have found that increased exposure to synthetic content, immediate feedback, or prior training improved detection performance, while others have found that not to be the case.5,6,12,14 However, as synthetic media continues to improve in realism, it is expected that perceptual-based educational interventions will eventually become less effective. Nonetheless, further work to clarify in which contexts perceptual-based interventions can improve detection performance will be beneficial for developing more effective educational interventions to reduce people’s current vulnerability to deceptive synthetic content. In addition, research into the impact of non-perception-based educational interventions, such as teaching critical analysis skills like fact-checking or cross-referencing, on detection performance would also be valuable.
Conclusion
The results of our study demonstrate that people’s perceptual detection capabilities are no longer a suitable defense against deceptive synthetic media. Tools to create convincingly realistic synthetic content have now become available to anyone, including those with harmful intent. Our findings underscore the critical importance of developing and deploying robust countermeasures that are not reliant on human perceptual detection capabilities. This includes increasing investment and research for technical solutions, such as machine detection and watermarking or cryptographic signatures, as well as wider adoption of other techniques, such as content provenance or hashing databases. It also highlights the need to pursue widespread educational interventions, such as digital media literacy campaigns, to better equip people with the skills and knowledge to identify false content in other ways, such as critical analysis techniques like cross-referencing and fact-checking.
Importantly, our study also identifies several stimuli content characteristics and observer demographic factors that have inhibitory or facilitatory effects on people’s perceptual detection capabilities. These findings provide useful insights for informing immediate actionable countermeasures that could be taken in the short term to reduce people’s vulnerability to more convincingly deceptive content, such as specific content-moderation policies for online platforms. Nevertheless, as synthetic media outputs continue to progress in their realism, we anticipate that perceptual-based countermeasures will eventually plateau, requiring alternative solutions over the long term. Regardless, further work in this space is vital to improve our understanding and monitor the limitations of human perceptual detection capabilities to better identify and increase societal resilience against the dangers posed by synthetic media.
To view the Appendix for this article, please visit https://dl.acm.org/doi/10.1145/3729417 and click on Supplemental Material.
Acknowledgments
We would like to thank Gamin Kim, Ike Barrash, and Daniel Pycock for their contributions in data analysis and formatting, as well as Alexis Day for her contributions to the survey’s design and development.