CARVIEW

MOTORHOMES

Select Language

HTTP/2 200 date: Fri, 26 Dec 2025 16:25:36 GMT last-modified: Wed, 12 Feb 2025 18:11:20 GMT etag: "87c7bfb6290689b4f6a3e36686b6a551" access-control-allow-origin: * content-type: application/rss+xml; charset=UTF-8 link: ; rel="https://api.w.org/" age: 59825 accept-ranges: bytes x-cacheable: YES content-length: 245125 Around the word https://corpling.hypotheses.org A corpus linguist's research notebook Sun, 08 Dec 2024 11:53:19 +0000 fr-FR hourly 1 https://wordpress.org?v=6.8.3 Empowering budding corpus linguists with a custom shiny app https://corpling.hypotheses.org/5113 https://corpling.hypotheses.org/5113#respond Sun, 08 Dec 2024 11:52:35 +0000 https://corpling.hypotheses.org/?p=5113 As corpus linguists, we often grapple with the question of how to empower students who may not have extensive coding skills to design tools that fit their research needs. With this in mind, I developed a modest yet very helpful Shiny app that simplifies running

\chi^2

and Fisher’s exact tests for independence while offering access to the underlying code. This blog post explores how the app was built, its assets and limitations, and reflects on pedagogical goals.

Are you new to Shiny?

If you are new to Shiny, you may want to read this post, especially the section entitled “Shiny 101”. Shiny is a framework designed to enable researchers and data scientists to rapidly develop interactive web applications using the R or Python programming languages. Its intuitive structure makes it an ideal tool for creating user-friendly platforms to share and explore data, which is why it is very popular among data scientists. Shiny applications are particularly valued for their ability to visualize complex datasets and provide dynamic interfaces. Users are empowered to interact, which is why I think Shiny apps are a great addition to a professor’s toolbox.

The (humble) philosophy behind the (humble) app

Ready-made tools for corpus linguistics are user-friendly but often lack the flexibility needed to address specific requirements. Excellent tools such as AntConc (a concordance program for analyzing text, available at AntConc) and #LancsBox (a tool for corpus analysis and visualization, found at #LancsBox), which I frequently use in class with linguistics beginners, come to mind. However, a common limitation is their inability to help students select appropriate statistical methods based on the characteristics of the data (e.g., using Fisher’s exact test for small samples). This year, I decided to develop an app for my students at Université Bordeaux Montaigne to address this limitation. The app is not intended to replace existing tools. Rather, it is designed to complement them.

In a nutshell, the app allows students to upload contingency tables in various formats, automatically selects and executes either $\chi^2$ or Fisher’s exact test based on expected frequencies, generates visualizations such as association plots and mosaic plots to aid interpretation, and provides fully commented code to enhance transparency and support learning.

What the app looks like when you load it

Fisher’s exact test and the $\chi^2$ test of independence are both used to assess whether two categorical variables are independent. They differ in their assumptions and ideal use cases. Fisher’s exact test is particularly well-suited for small sample sizes or situations where the expected cell frequencies in a contingency table are very low (less than 5). Unlike tests that rely on approximations, Fisher’s exact test calculates the exact probability of observing the data under the null hypothesis. This makes it a good choice for sparse data. The $\chi^2$ test of independence, on the other hand, is ideal for larger datasets where all expected cell frequencies are sufficiently high (greater than 5). This test works by comparing the observed frequencies in the data to the expected frequencies under the null hypothesis. It is parametric as it uses an approximation based on the $\chi^2$ distribution. While it is faster and computationally simpler for large tables, the $\chi^2$ test is less accurate for small sample sizes (its reliance on approximations can lead to errors).

This app invites both critical thinking and ownership. By making the decision-making process explicit (e.g., why Fisher’s exact test might be used instead of $\chi^2$ ), the app initiates students to matching statistical methods to data characteristics. Students are also strongly encouraged to adapt the app for their specific needs, especially if they want to pursue a curriculum in linguistics.

The source code are available on my GitHub repository. Admittedly, I am not the best at coming up with creative names for my scripts, so the app is simply called chisq-fisher-viz. A good idea would be to launch it before continuing to read. Since I’m on a free plan, shinyapps.io limits app usage on their servers to 25 hours. For this reason, I am not sharing the hosted link. The good news is that you can run it locally or download, publish, and run the script for free at https://www.shinyapps.io/.

How the app works

As any other Shiny app, this one relies on a simple user interface (UI) and robust server-side logic. The students can upload tables in .xlsx, .csv, or .txt format, and the app validates the data to ensure that no rows or columns are empty.

Here is what the tables can look like.

The Shiny app selects the appropriate statistical test by applying $\chi^2$ if any expected cell frequency is greater than 5, as in Fig. 1, or Fisher’s exact test if any expected cell frequency is below 5, as in Fig. 2.

With the first sample input file, the $\chi^2$ test of independence is applied (Fig. 3).

Fig. 3. Cohen-Friendly association plot (as produced by the `vcd` package for R)

The Shiny app displays a Cohen-Friendly association plot, along with the results in clear tables with observed, expected frequencies, and residuals, as well as a p-value.

Fig. 4. Observed frequencies, expected frequencies, and residuals

With the second sample input file, Fisher’s exact test of independence is applied (Fig. 5). This time, a mosaic plot is displayed and the table of residuals is not proposed (Fig. 6).

Fig. 6. Observed and expected frequencies

Importantly, the Shiny app explains the results to help students decide whether to reject or fail to reject the null hypothesis. This presupposes that you have explained hypothesis testing beforehand, but that is quite manageable. FYI, I cover hypothesis testing in Chapter 8 of my book Corpus Linguistics and Statistics with R.

Assets

One key asset of such an app is its dual focus: it is meant to be both easy to use, and it provides a learning opportunity. Indeed, students can run tests without learning R, and the fully commented R code is accessible via GitHub. Students can therefore understand how the app works under the hood. They can also adapt or extend it for their own research, once they are more proficient in R programming.

Limitations and Next Steps

This app represents a first step forward, but it is not without limitations. One concern is the association of residuals with Fisher’s exact test, which can be misleading. While residuals are meaningful for $\chi^2$ , their interpretation alongside Fisher’s test is less straightforward.

Visualization choices also present challenges. Mosaic plots, for instance, are visually appealing but are not strictly tied to Fisher’s exact test. Their inclusion in the app might lead students to assume these are the “natural” extensions of the test. Although mosaic plots are sometimes used with Fisher’s test, other visualizations, such as heatmaps, might be equally appropriate. Future updates could refine these visualizations to better match their intended statistical contexts.

Additionally, the app simplifies complex statistical assumptions and relationships, which risks giving users a false sense of mastery. Adding warnings about the limitations of specific tests and linking to resources on statistical theory will help address this issue in future versions.

Scalability is another consideration. While the app works well for small to medium-sized datasets, handling larger datasets or more complex statistical tests may require optimization or additional features in the future.

Concluding Thoughts

This Shiny app is part of a broader effort to give linguistics students access to practical, user-friendly tools, especially when their curriculum does not include training in computational or quantitative methods, as is often the case in the humanities.

The app aims to strike a balance between being easy to use and supporting solid learning outcomes. It is designed to help students explore statistics without getting overwhelmed by technical details. I believe our role as educators is not just to teach specific tools but to spark curiosity, encourage adaptability, and nurture critical thinking.

If you try the app or use it in your classes, I would love to hear how it works for you! Feedback, suggestions, and contributions are always welcome.

]]> https://corpling.hypotheses.org/5113/feed 0 Corpus linguistics in the LLM era – the changing nature of language data https://corpling.hypotheses.org/4690 https://corpling.hypotheses.org/4690#comments Mon, 25 Nov 2024 08:41:37 +0000 https://corpling.hypotheses.org/?p=4690 The emergence of generative AIs, such as ChatGPT, Google Gemini, Microsoft Copilot, Anthropic Claude, Meta AI, or Mistral, has brought both opportunities and challenges to the field of corpus linguistics. These systems generate vast amounts of language output that often appear natural, but is this output genuinely authentic? This raises important questions for corpus linguists about the nature of linguistic data and the methods used to study it. Key issues include understanding the nature of the language generated by AI and evaluating the implications of AI-augmented tools for analyzing linguistic data. This post is the first in a new series on AI-assisted corpus linguistics. Here, I will focus on the changing nature of language data.

What kind of AI are we talking about?

If you have been keeping up with the latest in AI, you have probably heard about two main types of AI that are making waves: generative AI and predictive AI. While both types rely on big datasets and machine learning, they serve different purposes. Generative AI creates new stuff and is often used in creative fields, while predictive AI predicts what is coming next based on what has happened before and is more about analysis and decision-making. Large Language Models (LLMs), which are the foundations on which ChatGPT-like systems are built, belong to the first kind.

Why are LLMs so powerful now?

Transformers are the driving force behind the advancements in Large Language Models (LLMs) and have been a game-changer in natural language processing (NLP).

Transformers are effective because they bring together several mechanisms, which I will briefly describe and attempt to explain based on what I have read on them. These mechanisms include self-attention, parallel data processing, the encoder-decoder architecture, positional encoding, and multi-head attention. Summaries of how they work and related tutorials can be found in many places online—for example, here, here, or here. If you are not interested in these technical details, feel free to skip this part and jump to the next section.

The first mechanism is self-attention, which allows the model to weigh the importance of different words in a sentence relative to each other. Imagine you are in a busy room trying to listen to someone talk. Your brain automatically focuses on their voice and what they are saying while tuning out the less important noises. Transformers use self-attention for roughly the same purpose. This is like your brain filtering out the noise to understand the conversation better. The way that I understand it is that, unlike previous models that processed data sequentially, transformers can look at the entire sentence at once and understand how each word influences the others.

Self-attention is combined with the ability to process data in parallel, which speeds up training and inference significantly and makes it possible to handle massive datasets efficiently. Traditional neural networks, like RNNs and LSTMs, process data one step at a time, a bit like reading a book page by page. Transformers, however, can process the entire input sequence in parallel, similar to how you can glance at a whole paragraph and understand its context instantly. This parallel processing significantly speeds up training and inference.

The encoder-decoder architecture of transformers is also a great asset. Think of the transformer as a translator. The encoder is like the person who reads and understands the original text, summarizing its essence. The decoder is like the person who takes this summary and translates it into another language, generating the output sequence. Because the encoder-decoder setup is highly flexible, it can be adapted for various tasks like text generation, translation, and summarization.

Because transformers do not process data sequentially, they need a way to keep track of the order of words. This is achieved through positional encoding, which is like adding a timestamp to each word so the model knows where it fits in the sentence. It is similar to how you use chapter numbers and page numbers to navigate a book.

The multi-head attention mechanism is like having multiple pairs of eyes looking at the same scene from different angles. Each head focuses on different aspects of the input, allowing the model to capture a more comprehensive understanding of the context. This is needed for tasks that require contextual “understanding” (I am extremely uncomfortable using terms referring to human abilities when discussing LLMs, hence the scare quotes), such as teasing apart bank as a financial institution and bank as the side of a river.

Finally, transformers are efficient because they are pre-trained on vast amounts of text data. Not only that, they are also fine-tuned for specific tasks, similar to how a general practitioner might specialize in a particular field of medicine. This approach allows each LLM to use its general “knowledge” (scare quotes, again) of language and adapt quickly to new tasks without starting from scratch.

Why should we be concerned about corpora?

By definition, corpora consist of naturally occurring language produced in authentic contexts, without the speakers or writers being aware that their language output would be used one day for linguistic analysis. This spontaneous and unself-conscious production of language is a necessary condition for linguists, who base their studies on materials that are free from the potential biases or alterations that might occur if participants knew their language was being scrutinized. The resulting data thus represents a more accurate snapshot of genuine language use in various communicative situations.¹

As the boundaries between human-produced and machine-generated language become increasingly blurred, the traditional conception of corpora as carefully curated collections of authentic language use is being challenged by the growth of AI-generated text. This development raises questions about the nature of linguistic authenticity and representation.

As said before, LLMs are trained on a massive amount of text data: we are talking billions of words and phrases. This training allows them to mimic how we speak and write, generating text that appears coherent and relevant. Whether it is answering questions, writing articles, or even having a chat, LLMs can produce text that sounds a lot like it was written by a human.

Let us discuss the Turing test for a moment. This concept, introduced by Alan Turing in 1950, serves as a benchmark for evaluating AI. Imagine you are conversing with someone via text and cannot discern whether you are speaking to a human or a highly sophisticated computer program. That is essentially what the Turing test measures: if an AI can convincingly imitate human conversation to the extent that a person cannot reliably tell it apart from a human, it passes the test. ChatGPT did pass the Turing test. However, the Turing test does not imply that the AI truly understands the language it generates. Rather, it demonstrates an impressive ability to mimic human-like responses, like a parrot (see below).

The role of corpus linguists, from my perspective, is to ensure that the texts we study are produced by humans. This is vital because linguistics belongs to the human sciences, and the grammatical phenomena we study are inherently human. Grammar and linguistics are not exact sciences. We are as interested in the regularities that govern language practices as we are in the peculiarities that disrupt them.

LLMs like ChatGPT produce an “average language” – expressions abstracted from millions of ways of expressing oneself – as a direct result of their training process. This process begins with a vast corpus of text data that brings together a wide range of linguistic styles, topics, and contexts. As the model iterates through multiple training epochs, it refines its “understanding” of language but inevitably smooths out the idiosyncrasies that make human language so diverse. The model learns to generate text by predicting the most probable sequences based on statistical patterns, rather than preserving what makes each individual voice unique.

While ChatGPT may be adept at imitating accents and dialects, what value is there in studying an imitation if linguists cannot confidently and precisely associate this linguistic marking with genuine human experiences? Furthermore, what purpose does it serve to make generalizations about an ersatz language that is itself the product of a dehumanized generalization process?

Because the grammatical phenomena observed in LLM outputs are not the product of genuine human cognitive processes or social interactions, but rather the result of complex statistical computations abstracted from massive datasets, this “average language” is of limited interest to linguists. It merely represents a form of linguistic expression fundamentally detached from the human experiences and social contexts that traditionally inform linguistic study.

Stochastic parrots?

Generative AI, which includes LLMs, is primarily focused on producing new content, where “new” does not necessarily mean “original”. This type of AI can generate everything from text and images to videos, music, and even software code (in fact, it is very good at coding, as we shall see in a future post). The goal is to develop not creativity but productivity into various creative tasks such as content creation, art, music, and fashion. However, whether we can truly consider this output as “real” creativity is debatable given the parrot-like nature of generative AI.

Before ChatGPT was released in late 2022, renowned NLP figure Emily Bender and her colleagues raised concerns about the implications of these technologies. They popularized the term “stochastic parrot” in their 2021 paper titled “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” to address the limitations of LLMs.

“Stochastic parrot” is a catchy phrase that conveys the idea that LLMs are essentially advanced mimics. Picture a parrot that has listened to every conversation ever held; this parrot can piece together sentences that sound remarkably human-like but lacks any true understanding of what it is saying. That is similar to how LLMs operate but on a much larger scale. The term “stochastic” refers to the element of controlled randomness in how these models select their words. They do not simply repeat exact phrases from their training data; instead, they mix and match in ways that can appear creative or insightful. However, despite their impressive outputs, these models do not possess genuine understanding or reasoning capabilities. They are biased, cannot fact-check themselves or apply common sense and may confidently present misinformation if it aligns with their training data.

In the face of worldwide admiration for LLMs, Bender and her colleagues take a step back to ask: How big is too big? They caution against over-relying on language models that can produce human-like text without any real comprehension of truth or ethics. They explore the possible risks associated with developing larger models and propose paths for mitigating those risks. Bender et al.’s recommendations include weighing environmental and financial costs first, investing resources into curating and carefully documenting datasets rather than ingesting everything from the web,² conducting pre-development exercises to evaluate how planned approaches fit into research goals, and encouraging research directions beyond merely increasing model size.

Traditional corpora are safe…

Carefully curated corpora like the British National Corpus, BNC2014, the Corpus of Contemporary American English, the Brown Corpus and its family members, the Lancaster-Oslo/Bergen Corpus (LOB), the International Corpus of English, etc.³ are holding their ground. For now, they remain unaffected by the challenges facing other types of linguistic data. Because these corpora are built from carefully chosen, verified samples of human-produced language, often focusing on specific time periods, genres, or varieties, they are the gold standard in linguistic research. Their creation involves meticulous quality control: texts are checked and cleaned manually to ensure they truly represent the linguistic phenomena they aim to capture. Historical corpora, like the Helsinki Corpus or COHA, for instance, have an added layer of security. Their texts come from before the rise of AI-generated content, meaning they are undeniably human-authored. What is more, these corpora typically have fixed timeframes for data collection, so they avoid including material created after a certain date, AI-generated or otherwise.

Another big strength is their thorough documentation. Researchers can see exactly how the corpora were built, what sources were used, and how the data was prepared⁴. This transparency makes it clear what kind of language data is being analyzed. These corpora also aim for balance and representativeness, with each source carefully processed to be as free from distributional skew as possible and ready for linguistic study. Such resources continue to stand strong amidst the shifting landscape of language data. I believe they are safe.

…but what about web-based corpora?

However, the contamination of the web with AI-generated texts could affect corpora based on web scraping, such as Sketch Engine’s enTenTen, frTenTen, deTenTen, or ruTenTen, as well as other web-based resources like the frWaC or ukWaC, and various social-media corpora derived from platforms like X/Twitter and Reddit. Web-based corpora like these, as long as they keep being compiled on present-day scrapings, are increasingly at risk of contamination by AI-generated content.

The primary concern is that, in a matter of months or years, the proportion of natural language data on the web could decrease to the point where it is overshadowed by AI-generated text. Such contamination will skew linguistic analyses, leading to misrepresentations of actual human language patterns and usage. At best, it will add an extra layer of complexity to the composition of language samples that linguists will have to disentangle. While linguists may be prepared and equipped for this challenge, it will undoubtedly lengthen the workflow.

This issue is not specific to corpus linguistics, of course. It also impacts journalism, fiction and non-fiction publishing, educational institutions, and other sectors that rely on authentic human-generated content. For instance, news organizations are increasingly struggling to differentiate between genuine user-generated content and AI-fabricated stories. Literary agents and publishers will soon find it challenging to identify original works amid a flood of AI-generated manuscripts. Similarly, academic publishers are facing difficulties in verifying the authenticity of submissions, while educational institutions contend with issues of academic integrity as AI-generated essays become more prevalent. Market research firms relying on web-scraped data for consumer insights may find their analyses skewed by AI-generated opinions and reviews. Ethical social media platforms (such as Mastodon) may face challenges in maintaining genuine user engagement metrics due to the rising prevalence of AI interactions. This issue is particularly evident on controversial platforms like X (formerly Twitter), where AI-driven bots produce fake interactions that artificially inflate engagement statistics.

As AI-generated content becomes more sophisticated and pervasive, these sectors, just like corpus linguistics, will need to develop new strategies and tools to authenticate and validate human-authored content. Otherwise, there will be no way of ensuring the integrity and reliability of their work. As an avid fiction reader, I cannot help but imagine that, in the near future, publishers might introduce some kind of authenticity seal on book covers to distinguish human-authored fiction from AI-generated fiction. Perhaps web-based corpora will feature such as seal.

The ouroboros menace

The impending contamination of linguistic datasets with AI-generated content poses big methodological challenges. The most insidious danger in this scenario is the emergence of a linguistic ouroboros (Fig. 1): a self-consuming cycle where AI models are trained on data increasingly polluted by AI-generated content, only to produce more AI content that further contaminates the datasets.

A snake eating its own tail — Fig. 1. An ouroboros (a snake eating its own tail)

This self-reinforcing loop could lead to a progressive distortion of what we consider natural language, as each generation of AI models learns from and amplifies the artifacts and biases of its predecessors. The result could be a gradual drift away from authentic human language patterns, creating a sort of linguistic “uncanny valley” where AI-generated text becomes simultaneously more prevalent and less representative of genuine human communication (Radivojevic et al 2024).⁵

Moreover, this contamination is not limited to just skewing language models. It could also impact a wide range of NLP tasks, from sentiment analysis and topic modeling to machine translation and text summarization. As these models inadvertently incorporate AI-generated patterns, their outputs may become less aligned with human linguistic intuitions and communicative norms.

Other issues beyond corpus linguistics

The stakes are high because they extend beyond just preserving the validity of language studies. As AI takes on an increasingly significant role in content creation, we should also consider three additional concerns: the carbon footprint of generating content with AI, the traceability of text sources used in corpus composition, and copyright issues.

The environmental impact of training and running LLMs is huge, with some estimates suggesting that training a single large AI model can emit as much carbon as five cars over their lifetimes. Additionally, because AI systems rely on vast and often uncredited data sources, thus involving frequent copyright infringements, it becomes increasingly difficult to verify the origin, authenticity, and potential biases of the text used to train these models. When no copyright applies, the unauthorized use of data can still be considered theft of intellectual property. As evidenced by ongoing legal battles and discussions around these topics, clearer regulations and ethical guidelines are needed to ensure that AI development respects intellectual property rights while preserving innovation. In other words, innovation is good, but it must be pursued responsibly and fairly.

Breaking the cycle

This blog post does not offer solutions but aligns with the general blueprint that breaking the cycle requires researchers and developers to continue devising robust methods for detecting, flagging, and filtering AI-generated content. We need to make sure that AI-free datasets are created for training and evaluation in NLP and that no AI-generated text contaminates natural language corpora in corpus linguistics. This task is becoming increasingly difficult as AI models become more sophisticated and AI-generated content becomes harder to detect.

Going further

To go further, I invite you to listen to this episode of Lingthusiam, “Helping computers decode sentences – Interview with Emily M. Bender“, which was released just as I finished writing this post, and in which Lauren Gawne interviews Emily Bender. In this episode, EB talks about the complexity of language processing and explains how much computers struggle to understand language in the same way humans do. She also mentions her involvement in the Mystery AI Hype Theater 3000 podcast and her research on the societal impacts of language technologies. As you may have guessed, she advocates a critical approach to computational linguistics and artificial intelligence.

References

Bender, E. M., & Friedman, B. 2018. Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science. Transactions of the Association for Computational Linguistics, 6:587–604.

Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? FATML ’21: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610-623.

Mori, M. (1970). The uncanny valley phenomenon. Energy, 7(4), 33-35.

Radivojevic, K., Chou, M., Badillo-Urquiola, K., & Brenner, P. (2024). Human Perception of LLM-generated Text Content in Social Media Environments. arXiv.

The only exception to this principle are corpora of elicited texts, but those are designed for specific purposes, such as studying particular linguistic phenomena that may be rare in natural discourse or analyzing language use in highly specialized domains or professions, to give just two examples. In these cases, the controlled nature of elicitation allows linguists to focus on specific aspects of a given language. By doing so, linguists accept to sacrifice some degree of naturalness in exchange for the investigation targeted language phenomena.
Data statements include details such as curation rationale and data sources (Bender & Friedman 2018). They make it possible to understand how experimental results might generalize and what biases might be reflected in systems built on a given software. Data statements also address harms caused by bias in datasets. While initially developed for language data types, data statements could be adapted for a wide range of data types, including corpora, with adjustments to account for their unique characteristics. Practices involving corpora should likewise support better transparency in the compilation and documentation of natural language data.
Because I am a professor of English linguistics, I have chosen only corpora of English as examples. Of course, corpora are not limited to English.
By way of illustration, this link takes you to a spreadsheet that explains the composition of the COCA.
The term “uncanny valley” was originally coined by roboticist Masahiro Mori in 1970 to describe the unsettling feeling people experience when encountering robots or digital representations that closely resemble humans, but are not quite convincing (Mori 1970).

]]> https://corpling.hypotheses.org/4690/feed 1 A whale of a tool: navigating texts with Voyant Tools https://corpling.hypotheses.org/4465 https://corpling.hypotheses.org/4465#respond Fri, 09 Aug 2024 10:28:55 +0000 https://corpling.hypotheses.org/?p=4465 As a corpus linguist, I happen to teach my students how to use R for text analysis. However, due to the steep learning curve associated with R, I sometimes need to provide more accessible, ready-made alternatives. Two such tools I recommend are Lawrence Anthony’s AntConc and Stéfan Sinclair and Geoffrey Rockwell’s Voyant Tools. This post focuses on Voyant Tools, an open-source, web-based application for text analysis, ideal for both beginners and experienced researchers.

Voyant Tools in a few words

Voyant Tools is a web-based text reading and analysis environment designed to facilitate macro-reading and interpretive practices for digital humanities students and scholars. Its user-friendly interface makes it a perfect backup solution when time constraints prevent in-depth R training.

Getting Started with Voyant Tools

To begin using Voyant Tools, follow these steps:

Visit the Voyant Tools website

By default, Voyant Tools is in French: https://voyant-tools.org/. To access the English version: https://voyant-tools.org/?lang=en.

Input your text

Voyant Tools comes with pre-loaded corpora; William Shakespeare’s plays, Jane Austen’s novels, and Mary Shelley’s Frankenstein. You can access them by clicking ‘Open’, right below the ‘Add texts’ box.

Of course, the most interesting feature is to input your own text(s). You can either paste the text directly, preferably if it is not too long, enter a URL, or upload files from your computer. Supported file formats include plain text, HTML, XML, PDF, RTF, and MS Word documents. Here, I am using Herman Melville’s Moby Dick in .txt format, which can be downloaded from this link (the text is downloaded from Project Gutenberg; I have post-processed it with R).

Once your text is loaded, Voyant presents a multi-panel interface with several interactive tools:

Because Voyant Tools allows sharing its working environment, you should be able to play with the interface by following this link.

Key tools

Cirrus

Cirrus is a word cloud visualization showing the most frequent terms in your corpus. Right now, it does not tell us much because (a) there are not many words on the plot, and (b) most of the words are grammatical and do not tell us much about what the novel is about.

This can be fixed easily by hovering your mouse over the upper-right corner and clicking on the switch in the menu that appears (each tool has this option menu, and each menu is accessed this way):

This allows you to activate a stopword list, which will exclude common words that may clutter your analysis. Using the drop-down menu, make sure you select the language that matches the language of your text.

Here, we select English and click ‘Confirm’ to reload the word-cloud. The cloud becomes more meaningful.

Below, we set the number of words to 150 using the ‘Terms’ ruler.

Yes, Moby Dick is about a whale, and Captain Ahab is not lurking far.

Reader

Reader displays the full text of your documents for close reading and includes a search feature to help you find and examine specific terms in context.

The tool supports regular expressions, allowing you to fine-tune your searches.

Trends

Trends displays the distribution of terms across your corpus or document segments. The figure below compares the distributions of five terms (whale, sea, old, man, and like) across the corpus, which is divided into ten sections.

Options allow the user to select one specific term (like Ahab below) and choose from a variety of graph types (Area, Columns, Line, Stacked Bar, or the default Line + Stacked Bar).

What I like about this novel is that, except for segment 3 where Ahab and the whale co-occur, segments 5 to 8 show whale as quite frequent and Ahab as rare. However, from segment 8 onwards, Ahab peaks while whale does not. This nicely suggests the back-and-forth between Captain Ahab and the whale.

Summary

As its name indicates, summary provides an overview of your corpus, including word count and distinctive terms. Note that the statistics are influenced by the use of a stopword list. The Documents tag is useful if your corpus consists of several distinct parts. The Phrases tag allows the user to spot recurring multiword expressions in the corpus.

Contexts

Context is familiar to you if you are used to KWIC tools. It displays keywords in context (below, the keyword is whaler), allowing you to examine how specific terms are used throughout the text. Again, it comes with a bunch of options. You can therefore expand or reduce the context.

Collocates

The Collocates tag allows you to examine words that frequently appear near the keywords in your corpus.

The context ruler is used to expand or reduce the search context for your collocates. You can also select a specific keyword and see its most distinctive collocates.

Additional tools

Clicking on the windows logo at the top-right corner of the Voyant Tools interface will give you access to more expert tools.

The windows logo is the second to last from the right.

Navigate the drop-down menu to reveal the list of available tools.

I describe two of them briefly, but there are many more. A list of tools is available here. Feel free to explore! My own experience tells me that some tools are more like gadgets than real aids to data exploration. My recommendation is therefore to use a tool not because its graphic output looks nice, but because it reveals some important aspect of your data in an efficient way.

ScatterPlot

My favorite tool, by far because it relies on exploratory techniques that I use in my own research, namely principal component analysis, correspondence analysis, and t-SNE.¹ To access it: windows logo > Visualization tools > ScatterPlot.

ScatterPlot helps visualize statistically significant associations between terms in your corpus/corpora based on their distributions.

The plot below relies on correspondence analysis. The size of each data point (=each word) depends on its frequency. The position of each data point depends on its distribution in the part of the corpus considered. Here, the corpus is split into ten “bins” (=ten parts).

Collocates Graph

Collocates Graph visualizes keywords and terms that frequently appear together as a force-directed network graph. More on network graphs in these two posts: here and here.

The Context slider controls the number of terms included when searching for collocates. The selected value represents the number of words considered on each side of the keyword, effectively doubling the window of words. By default, the context is set to 5 words per side, with a maximum of 30. In the example below, I have set the context to 13 words and chosen ‘whale’ as the keyword.

Assets and Limitations

Voyant Tools offers a powerful, accessible platform for text analysis, making it an invaluable resource for both beginners and experienced researchers in corpus linguistics. By integrating Voyant into your toolkit, you can efficiently explore and analyze textual data, uncovering patterns and insights that might not be immediately apparent through manual analysis.

Voyant makes it easy to share your findings. You can export visualizations, data, or entire tool configurations. This feature is particularly useful for incorporating results into publications, presentations, or further analysis.

While Voyant excels at quantitative analysis and distant reading, it does not replace the need for close reading or qualitative interpretation. Researchers should use Voyant in conjunction with traditional methods for a comprehensive understanding of their texts.

Video recap

If you have one hour to spare, I highly recommend that you take a look at this step-by-step video guide published by the University Libraries of the University at Albany.

References

https://voyant-tools.org/docs/
https://infoguides.gmu.edu/textanalysistools/voyant
https://brockdsl.github.io/Voyant-Tutorial/
https://guides.library.ucsc.edu/DS/Resources/Voyant
https://www.youtube.com/watch?v=4jCGLmbLFT0

Another method, Document Similarity, is proposed, but it is really useful when your corpus consists of several documents.

]]> https://corpling.hypotheses.org/4465/feed 0 Creating frequency lists with R https://corpling.hypotheses.org/4388 https://corpling.hypotheses.org/4388#respond Thu, 21 Dec 2023 11:32:29 +0000 https://corpling.hypotheses.org/?p=4388 In a previous post, I provided two frequency lists without revealing the scripts to make them. This post explains how to create word frequency lists (lemmatized and unlemmatized) from the BNC 2014 (spoken), using R programming. In the field of corpus linguistics, frequency lists are an important tool in analyzing language usage. A frequency list provides a count of how often each word or lemma occurs in a given corpus, shedding light on the lexical coloration of your dataset. For more details, see Section 5.4 of Corpus Linguistics and Statistics with R.

Getting Started

Before diving into the code, make sure that you have the gsubfn library installed and loaded. This is done by running the following command in your R console:

install.packages("gsubfn")

The gsubfn library provides string manipulation functions. In the scripts below, it is used for pattern matching and replacement.

Next, download . The 11.5-million-word spoken component of the BNC2014 consists of transcripts of recorded conversations involving 672 speakers from different parts of the UK between 2012 and 2016. The corpus breaks down into 1,251 files, i.e. one per conversation. You need to download the BNC2014 corpus files from this page before proceeding with the code below.

As mentioned in my introductory post to the BNC2014, once you have downloaded the files and stored them on your hard drive, the folder architecture looks like this:

We are interested in the tagged folder because we want to retrieve the POS tags.

The lemmatized freqlist

We begin with the lemmatized frequency list. We want a three-column table: the first column contains the lemmas, the second column their respective POS tags, and the third column their respective frequency counts.

First, we clear the workspace and load gsubfn.

# Clear workspace
rm(list=ls(all=TRUE))
# Load necessary libraries
library(gsubfn)

Next, we specify the Path to where the BNC2014 Spoken are stored. The list.files function is then used to get a list of file names matching the pattern .xml in the specified directory.

corpus.files <- list.files(path="/bnc2014spoken/spoken/tagged", pattern="\.xml$", full.names=TRUE)

We create an empty character vector all.matches to collect all the matches found during processing.

all.matches <- character()

The code below enters a loop to iterate through each file in the list of corpus files.

for (i in 1:length(corpus.files)) {

The current corpus file is read into a character vector using the scan function.

corpus.file <- scan(corpus.files[i], what="char", sep="\n")

Regular expressions are used to extract information (lemmas and classes) from the corpus file. The strapplyc function is applied to extract matching patterns.

lemmas <- unlist(strapplyc(words, "lemma=\"(\w+)\"", backref=1))
classes <- unlist(strapplyc(words, "class=\"(\w+)\"", backref=1))

Lemmas and classes are combined and stored in the all.matches vector.

lemmas.classes <- paste(lemmas, classes, sep="_")
all.matches <- c(all.matches, lemmas.classes)
}

Note that the loop will take some time to run. The time varies depending on the speed of your processor and how much memory (RAM) your system has.

The table function is used to create a frequency table of the combined lemmas and classes.

all.matches.table <- table(all.matches)

The frequency table is sorted in decreasing order.

all.matches.sorted.table <- sort(all.matches.table, decreasing=TRUE)

The sorted frequency table is formatted into a tab-separated table.

tab.table <- paste(names(all.matches.sorted.table), all.matches.sorted.table, sep="\t")
tab.table.2 <- gsub("_", "\t", tab.table, perl=TRUE)

The final step involves saving the formatted frequency table to a text file on the desktop.

cat("LEMMA\tCLASS\tFREQUENCY", tab.table.2, file="/Users/yourname/Desktop/freqlist.bnc.2014.txt", sep="\n")

Note that you must replace /Users/yourname/Desktop/ with the actual path where you want to save the output file. The file freqlist.bnc.2014.txt can now be opened with a spreadsheet software.

Here is the same code in one single chunk:

# Clear workspace
rm(list=ls(all=TRUE))
# Load necessary libraries
library(gsubfn)
# Specify the path to the BNC 2014 spoken corpus files
corpus.files <- list.files(path="/bnc2014spoken/spoken/tagged", pattern="\\.xml$", full.names=TRUE)
# Prepare an empty vector to collect all matches
all.matches <- character()
# Enter the loop
for (i in 1:length(corpus.files)) {
  # Load current corpus file
  corpus.file <- scan(corpus.files[i], what="char", sep="\n")
  
  # Collect relevant elements (lemmas and classes)
  lemmas <- unlist(strapplyc(words, "lemma=\"(\\w+)\"", backref=1))
  classes <- unlist(strapplyc(words, "class=\"(\\w+)\"", backref=1))  
  
  # Collect all matches
  lemmas.classes <- paste(lemmas, classes, sep="_")
  all.matches <- c(all.matches, lemmas.classes)
}
# Create a frequency table
all.matches.table <- table(all.matches)
# Sort the frequency table
all.matches.sorted.table <- sort(all.matches.table, decreasing=TRUE)
# Prepare the output table
tab.table <- paste(names(all.matches.sorted.table), all.matches.sorted.table, sep="\t")
tab.table.2 <- gsub("_", "\t", tab.table, perl=TRUE)
# Save the frequency list to a text file
cat("LEMMA\tCLASS\tFREQUENCY", tab.table.2, file="/Users/yourname/Desktop/freqlist.bnc.2014.txt", sep="\n")

Upon inspection with a spreadsheet software (I am using Excel), your frequency list should look like this:

The unlemmatized freqlist

Now, let’s modify the above script for creating an unlemmatized frequency list. What changes is the last part of the loop, namely:

# Collect relevant elements (lemmas and classes)
  words <- unlist(strapplyc(corpus.file, "<w pos=\"\\w+\" lemma=\"\\w+\" class=\"\\w+\" usas=\"\\w+\">", backref=1))
  classes <- unlist(strapplyc(words, "class=\"(\\w+)\"", backref=1))  
  
  # Collect all matches
  words.classes <- paste(words, classes, sep="_")
  all.matches <- c(all.matches, words.classes)

Instead of collecting lemmas, we collect words (words <- unlist(strapplyc(corpus.file, "<w pos=\"\\w+\" lemma=\"\\w+\" class=\"\\w+\" usas=\"\\w+\">", backref=1))). Here is the code as a single chunk:

# Clear workspace
rm(list=ls(all=TRUE))
# Load necessary libraries
library(gsubfn)
# Specify the path to the BNC 2014 spoken corpus files
corpus.files <- list.files(path="/bnc2014spoken/spoken/tagged", pattern="\\.xml$", full.names=TRUE)
# Prepare an empty vector to collect all matches
all.matches <- character()
# Enter the loop
for (i in 1:length(corpus.files)) {
  # Load current corpus file
  corpus.file <- scan(corpus.files[i], what="char", sep="\n")
  
  # Collect relevant elements (lemmas and classes)
  words <- unlist(strapplyc(corpus.file, "<w pos=\"\\w+\" lemma=\"\\w+\" class=\"\\w+\" usas=\"\\w+\">", backref=1))
  classes <- unlist(strapplyc(words, "class=\"(\\w+)\"", backref=1))  
  
  # Collect all matches
  words.classes <- paste(words, classes, sep="_")
  all.matches <- c(all.matches, words.classes)
}
# Create a frequency table
all.matches.table <- table(all.matches)
# Sort the frequency table
all.matches.sorted.table <- sort(all.matches.table, decreasing=TRUE)
# Prepare the output table
tab.table <- paste(names(all.matches.sorted.table), all.matches.sorted.table, sep="\t")
tab.table.2 <- gsub("_", "\t", tab.table, perl=TRUE)
# Save the frequency list to a text file
cat("WORD\tCLASS\tFREQUENCY", tab.table.2, file="/Users/yourname/Desktop/freqlist.bnc.2014.txt", sep="\n")

This script saves the unlemmatized frequency list to a separate text file (freqlist.bnc.2014.unlem.txt).

The frequency lists’ files are available from me upon request.

Cover image credits: Glen Carrie.

]]> https://corpling.hypotheses.org/4388/feed 0 Plotting Likert-scale survey data with R: what is soup? https://corpling.hypotheses.org/4284 https://corpling.hypotheses.org/4284#respond Tue, 17 Oct 2023 10:56:33 +0000 https://corpling.hypotheses.org/?p=4284 Each academic year, I lead my Master’s students in English linguistics on an intellectual journey that traces the evolution from traditional structuralist semantics to contemporary cognitive linguistics. A significant moment of this journey is the exploration of Prototype semantics and how it addresses the conundrums of structuralist semantics. In this post, I show how to handle and plot in R the kind typically obtained in Prototype-semantics experiments, namely Likert-scale data.

Structuralist semantics vs. Prototype semantics

Structuralist semantics (SS) represents meanings in terms of checklists of necessary and sufficient features that must be satisfied. Although this might work well with simple concepts, problems arise with culturally marked ones.

The benefits of Prototype semantics (PS) lie in its ability to provide a more rigorous understanding of how language and categorization work. First, PS allows for the recognition of fuzzy boundaries within categories. Category membership is flexible and graded. With its all-or-nothing approach, structuralist semantics does not allow graded category membership. Second, PS also aligns with the principle of cognitive economy, according to which the human mind stores and processes information efficiently because it allows for the storage of general prototypes of concepts rather than exhaustive lists of necessary and sufficient conditions for each category. Third, PS accommodates variations in how different cultures conceptualize and categorize the world, recognizing that not all languages or cultures categorize concepts in the same way. Lastly, PS is more psychologically plausible than structuralist semantics because it aligns with the idea that human cognition relies on mental representations that are based on prototypes and exemplars rather than strict rules and definitions.

To show the benefits of PS over SS, one excellent case in point is BACHELOR. The structuralist approach, often referred to as componential semantics, is associated with Katz and Fodor (1963), who proposed a method of defining word meanings through a hierarchical organization of concepts, based on a list of semantic primitives. In this framework, BACHELOR is represented as follows:¹

From Katz and Fodor (1963), cited in Desagulier and Monneret (2023)

Although elegant, this approach fails to consider the centrality or salience of meanings in various contexts, as well as the variability in typicality within a category. In contrast, Prototype Theory, pioneered by cognitive psychologists Mervis and Rosch (1981), Rosch (1978), and Rosch & Mervis (1975), posits that categorization is subject to typicality effects. Not all members of a category have equal status with respect to the prototype of a given category. For example, a 25-year-old unmarried man will be considered a more prototypical bachelor compared to other unmarried men such as the Pope or Superman.

The soup experiment

Last year, I came upon this brilliant Short Stuff piece on YouTube by Scottish comedians Conor Reilly, Tommy Reilly and Malcolm Cumming:

In this video, Tommy grapples with the concept of ‘soup’. While most people might consider soup a straightforward dish, pinpointing its exact definition proves to be a perplexing puzzle for Tommy. His initial attempt to tackle the question, “what makes soup soup?” falls flat, as he approaches it through the lens of structuralist semantics. Frantically attempting to construct a comprehensive checklist of defining features for SOUP, Tommy’s efforts continuously hit roadblocks as he encounters exceptions at every turn. However, everything takes an intriguing turn when an anonymous letter mysteriously appears under his door, bearing the cryptic message, “you’re stirring the wrong pot.” This clever remark serves as a gentle nudge for Tommy to broaden his perspective along the lines of PS, suggesting that understanding soup requires considering the context in which it is cooked, encompassing not just culinary factors but also cultural contexts. Just as Tommy cracks the enigmatic code of soup, a chilling red laser suddenly appears, ominously targeting his forehead. This startling development leaves viewers to speculate whether some mysterious entity, perhaps the government itself, is determined to prevent Tommy from unraveling the ultimate truth about soup.

Obviously, the screenwriters must have taken a semantics course as part of their curriculum! I find this video inspiring for my students as it provides a solid foundation for understanding the benefits of a PS approach over SS.

In October 2023, using Google Forms, I designed a survey to see if, like Tommy, we could find the truth about soup. Here are the instructions:

Eighteen students participated in the experiment. They were instructed to rate 60 different soups on a Likert scale from 1 to 7, representing varying degrees of prototypicality as a ‘soup,’ as detailed in the instructions above. The results of the survey were collected in a spreadsheet (download the anonymized spreadsheet here).

Here is one example of an item they had to rate.

Why a Likert scale?

Likert-scale survey

A Likert scale is a psychometric scale commonly used in surveys and questionnaires to measure people’s attitudes, opinions, or perceptions. The scale is named after its creator, psychologist Rensis Likert. Typically, it involves a series of statements that express various levels of agreement or disagreement with a certain issue. Respondents are asked to indicate their level of agreement with each statement by selecting a point on the scale that reflects their opinion.

Usually, the scale consists of a range of response options, often five or seven, that represent different degrees of agreement or disagreement, typically ranging from “strongly agree” to “strongly disagree.” These response options are often represented as numerical values, with higher numbers indicating stronger agreement or disagreement. Alternatively, they can be represented with labels, such as “strongly disagree,” “disagree,” “neutral,” “agree,” and “strongly agree.” One thing to bear in mind is to keep the number of response options odd (3, 5, 7, etc.). Indeed, an odd number of responses ensures that a middle/neutral option is preserved.

Rensis Likert (University of Michigan. News and Information Services. Photographs — Bentley Historical Library, CC BY 4.0, https://commons.wikimedia.org/w/index.php?curid=76306573)

Let me now show you how the Likert data was processed and plotted in R.

Step 1: clear the workspace

rm(list=ls(all=TRUE))

Step 2: install and load the required packages

# Install the packages (run it once)
install.packages("cowplot")
install.packages("dplyr")
install.packages("ggplot2")
install.packages("likert")
install.packages("RColorBrewer")
install.packages("tidyr")
install.packages("xlsx")
# Load the packages (run in each session)
library(cowplot)
library(dplyr)
library(ggplot2)
library(likert)
library(RColorBrewer)
library(tidyr)
library(xlsx)

cowplot offers a flexible and consistent way to arrange multiple plots into complex arrangements.
dplyr should be known to you if you are familiar with my blog; it provides a consistent set of verbs that help in manipulating data (filtering, selecting specific columns, summarizing data, etc.);
ggplot2 is a widely used data visualization package that helps in creating all sorts of graphs; it is based on the grammar of graphics and provides a state-of-the-art framework for creating complex plots rather easily;
likert provides functions for handling Likert-scale data;
RColorBrewer provides a set of color palettes for creating attractive and effective plots;
tidyr should, again, be familiar to you; it is designed to help tidy messy data sets. It provides tools for changing the layout of data sets to make them easier to work with. It is particularly useful for data sets where different variables are stored in both rows and columns.
xlsx allows you to read data from and writing data to Excel files; you can also interact with Excel files directly from R.

Step 3: load the data with `xlsx`

data <- read.xlsx("/Users/filepath/PT.experiment.data.xlsx", sheetIndex = 1)

This line reads the data from the first sheet (sheetIndex=1) of the Excel file specified in the given file path (I have used a fake path here). If you have not done it yet, download the data by clicking this link.

Step 4: inspect the data

str(data)

You should see the following:

'data.frame':	117 obs. of  60 variables:
 $ Pineapple Gazpacho             : num  3 3 5 7 3 2 4 5 7 3 ...
 $ Italian wedding soup           : num  5 7 6 5 4 1 5 3 6 3 ...
 $ Beef and barley soup           : num  7 7 7 6 5 2 5 4 5 5 ...
 $ Potato Leek Soup               : num  3 1 3 2 1 1 1 1 4 1 ...
 $ Zuppa Toscana                  : num  4 5 5 6 5 2 4 5 1 3 ...
 $ Chicken Noodle Soup            : num  2 5 3 1 4 1 3 1 3 3 ...
 $ Asparagus Soup                 : num  2 1 2 3 1 1 1 1 1 1 ...
 $ Porridge                       : num  7 7 1 7 6 7 7 7 1 6 ...
 $ Cream of asparagus soup        : num  5 1 2 2 2 1 1 1 1 1 ...
 $ Cucumber soup                  : num  3 7 3 4 2 1 2 6 2 1 ...
 $ Avocado soup                   : num  2 3 3 1 1 1 2 4 2 1 ...
 $ Chili                          : num  7 7 7 7 7 7 7 7 1 7 ...
 $ Mulligatawny soup              : num  3 1 5 1 1 2 2 3 1 1 ...
 $ Sopa de Lima                   : num  3 7 4 3 3 3 5 4 3 2 ...
 $ Shrimp and corn chowder        : num  5 7 4 3 6 4 6 7 1 6 ...
 $ Hot and sour soup              : num  2 6 7 1 2 2 5 3 5 3 ...
 $ Gumbo                          : num  3 4 5 5 6 2 5 5 6 3 ...
 $ Gazpacho verde  Green Gazpacho : num  1 3 4 3 1 2 2 5 2 1 ...
 $ Matzo ball soup                : num  4 7 5 6 4 1 5 6 1 3 ...
 $ Pumpkin Soup                   : num  1 1 6 1 1 1 1 1 1 1 ...
 $ French onion soup              : num  1 7 7 5 3 3 2 3 7 3 ...
 $ Gazpacho de Aguacate           : num  3 4 5 2 5 2 2 4 1 2 ...
 $ Spinach and artichoke soup     : num  4 7 3 7 3 1 4 4 3 2 ...
 $ Butternut squash soup          : num  1 1 4 1 1 1 1 1 1 1 ...
 $ Sausage and kale soup          : num  6 7 7 7 5 2 5 4 3 4 ...
 $ Lentil soup                    : num  5 6 5 7 2 1 5 2 6 4 ...
 $ French pea soup                : num  4 5 3 1 3 2 2 1 1 1 ...
 $ Gravy                          : num  7 7 7 7 5 7 7 7 2 7 ...
 $ Watercress Soup                : num  4 4 4 7 1 1 2 1 7 1 ...
 $ Cream of mushroom soup         : num  2 7 3 2 3 1 2 1 2 1 ...
 $ Beef stew                      : num  7 7 5 5 6 5 5 7 5 7 ...
 $ Watermelon gazpacho            : num  3 7 7 7 4 1 4 4 1 2 ...
 $ New England clam chowder       : num  4 5 3 7 5 3 5 7 5 2 ...
 $ Irish potato soup              : num  1 3 2 4 1 2 1 1 2 1 ...
 $ Tomato soup                    : num  1 1 3 2 3 1 1 1 1 1 ...
 $ Clam chowder                   : num  4 7 2 5 5 3 4 7 5 2 ...
 $ Gazpacho                       : num  1 5 5 4 2 2 2 4 2 2 ...
 $ Split pea soup                 : num  2 2 2 2 1 2 3 1 2 1 ...
 $ Egg drop soup                  : num  5 7 4 5 3 3 6 4 5 4 ...
 $ Acorn squash soup              : num  3 7 5 1 1 1 2 1 1 6 ...
 $ Carrot ginger soup             : num  1 1 3 3 1 1 1 1 1 1 ...
 $ Avgolemono  Greek Lemon Soup   : num  4 7 4 6 5 2 6 4 1 4 ...
 $ Ramen                          : num  5 4 3 1 4 3 4 5 2 5 ...
 $ Lobster bisque                 : num  6 5 2 4 3 2 2 3 4 4 ...
 $ Thai Tom Yum soup              : num  7 7 4 1 4 2 5 6 6 5 ...
 $ Miso soup                      : num  5 1 4 1 3 1 4 3 4 3 ...
 $ Chilled strawberry soup        : num  7 7 3 7 5 1 2 6 1 4 ...
 $ Minestrone                     : num  6 7 6 5 6 2 6 5 7 7 ...
 $ Okroshka                       : num  5 7 3 7 3 2 5 5 6 3 ...
 $ Vegetable soup                 : num  2 3 6 1 4 1 4 1 6 1 ...
 $ Wonton soup                    : num  6 3 5 7 3 1 5 5 2 6 ...
 $ Tortilla soup                  : num  7 7 4 7 5 2 7 6 2 7 ...
 $ Pumpkin black bean soup        : num  4 7 5 2 3 2 5 5 6 3 ...
 $ Borscht                        : num  7 7 6 1 4 1 5 5 1 7 ...
 $ Ketchup                        : num  7 7 6 7 7 7 7 7 1 7 ...
 $ Corn and potato chowder        : num  3 7 2 4 4 3 4 7 3 3 ...
 $ Vichyssoise                    : num  3 4 2 7 2 1 2 7 1 1 ...
 $ Chilled Cucumber Dill Soup     : num  4 6 2 1 2 1 2 5 1 1 ...
 $ Cabbage soup                   : num  4 7 6 4 4 1 5 5 7 2 ...
 $ Chorba                         : num  2 7 5 4 3 1 3 5 7 2 ...

Step 5: modify column names

The read.xlsx() function replaces each space with a dot in the column names. Remove the dots using gsub():

colnames(data) <- gsub("\.", " ", colnames(data))

Step 6: prepare Likert scale labels

We define seven labels because our Likert scale contains seven levels.

lbs <- c("Very Good Example of Soup", 
	"Good Example of Soup", "Moderately Good Example of Soup", 
	"Neutral (Could be interpreted as a Soup)", 
	"Moderately Bad Example of Soup", 
	"Bad Example of Soup", 
	"Very Bad Example of Soup or Not a Soup at All")

Step 7: data preprocessing

We convert the data to factors, assign custom labels, drop any rows with missing values, and finally convert it back to a data frame.

data <- data %>%
  dplyr::mutate_if(is.character, factor) %>%
  dplyr::mutate_if(is.numeric, factor, levels = 1:7, labels = lbs) %>%
  drop_na() %>% # from tidyr
  as.data.frame() # from base R

Step 8: define factor levels

Because we have seven levels in our Likert scale, we define seven factor levels:

factor_levels <- c("Very Good Example of Soup", 
	"Good Example of Soup", 
	"Moderately Good Example of Soup", 
	"Neutral (Could be interpreted as a Soup)", 
	"Moderately Bad Example of Soup", 
	"Bad Example of Soup", 
	"Very Bad Example of Soup or Not a Soup at All")

Step 9: create and customize the Likert plot

This chunk generates the Likert plot using the likert function, sets the color scheme, and adds a title.

survey_p1 <- plot(likert(data), ordered = T, wrap= 60) +
	scale_fill_manual(name="", 
	values = c("red", "#FF6600", "#FF8200", "#D6DCE4", "#44A5FF","#4472C4", "darkblue"),
	breaks = factor_levels) +
	ylab("") +
	ggtitle("What is soup? A prototype-theory experiment")

The first line generates a Likert plot using the data provided. The likert function creates a plot based on the participants’ responses. The ordered = T argument ensures that the plot is ordered. The wrap = 60 argument determines the maximum number of characters per line for the plot. The line with values sets the color scheme for the Likert plot. It manually assigns colors to each level of the Likert scale. The name = "" argument sets an empty legend title, and the values argument specifies the colors to be used. The breaks = factor_levels argument ensures that the color breaks correspond to the levels defined earlier. ylab("") sets the y-axis label of the plot to an empty string, essentially removing the label from the y-axis. ggtitle("What is soup? A prototype-theory experiment") adds a title to the plot.

Step 10: save the plot

We use the save_plot() function from the cowplot package to save the Likert plot as a PDF file in the specified file path (I am using a fake path here):

cowplot::save_plot("/Users/Users/filepath/likert.plot.pdf",
	survey_p1, 
	base_asp = 2, 
	base_height = 8)

The argument base_asp = 2 sets the aspect ratio of the plot, determining the ratio of space for questions versus space for the plot itself. A value of 2 implies that the plot will have a larger space for the questions compared to the space for the plot. base_height = 8 sets the height of the plot. A value of 8 implies that the plot will have a larger height, which results in a smaller font size.

This is what you should obtain:

A plot summarizing the Likert-scale data

The items that best illustrate the SOUP category appear in red in the bottom-left part of the plot. The worst examples appear in blue in the upper-right part of the plot. Neat!

Step 11: export the data

Here, we extract the Likert results and export the original data along with the Likert results to separate sheets in an Excel file.

data.likert <- likert(data)
data.likert.df <- data.likert$results
write.xlsx(as.data.frame(data), file = "/Users/filepath/data.count.xlsx", sheetName ="data with categories", row.names=F, append=FALSE)
write.xlsx(as.data.frame(data.likert.df), file = "/Users/filepath/data.count.xlsx", sheetName ="mean scores", row.names=F, append=TRUE)

data.likert <- likert(data) applies the likert() function to the data. In other words, it converts the input data into a format suitable for Likert analysis. The resulting data.likert is an object that contains the processed Likert data.

data.likert.df <- data.likert$results extracts the summarized results from the data.likert object. It stores these results as a data frame in the variable data.likert.df, which can be used for further analysis or visualization.

write.xlsx(as.data.frame(data), file = "/Users/filepath/data.count.xlsx", sheetName ="data with categories", row.names=F, append=FALSE) writes the original data to an Excel file named “data.count.xlsx” located at the specified file path (here, again the path is fake). The data is written to a sheet named “data with categories”. The row.names=F argument ensures that row names are not included in the data. The append=FALSE argument ensures that the file is not appended if it already exists.

write.xlsx(as.data.frame(data.likert.df), file = "/Users/filepath/data.count.xlsx", sheetName ="mean scores", row.names=F, append=TRUE) writes the summarized Likert results to the same Excel file as before but in a new sheet named “mean scores”. The row.names=F argument ensures that row names are not included in the data. The append=TRUE argument appends the data to the existing file if it already exists, rather than overwriting it.

References

Desagulier G. & Philippe Monneret. 2023. Cognitive Linguistics and a usage-based approach to the study of semantics and pragmatics. In Manuel Díaz-Campos & Sonia Balasch (Eds). The Handbook of Usage-Based Linguistics. Blackwell Publishing. pdf

Katz, Jerrold J & Jerry A Fodor. 1963. The structure of a semantic theory. Language 39(2). 170–210.

Mervis, Carolyn B & Eleanor Rosch. 1981. Categorization of natural objects. Annual review of psychology 32(1). 89–115.

Rosch, Eleanor. 1978. Principles of categorization. In Eleanor Rosch & Barbara B. Lloyd (eds.), Cognition and categorization, 27–48. Hillsdale, N.J.: Lawrence Erlbaum Associates.

Rosch, Eleanor & Carolyn B Mervis. 1975. Family resemblances: studies in the internal structure of categories. Cognitive psychology 7(4). 573–605.

Two more specific meanings are part of the list described by Katz & Fodor (1963): [who has
the first or lowest academic degree] and [fur seal when without a mate during the breeding time].

]]> https://corpling.hypotheses.org/4284/feed 0 Dependency parsing in R with UDPipe https://corpling.hypotheses.org/4178 https://corpling.hypotheses.org/4178#respond Tue, 20 Dec 2022 11:52:29 +0000 https://corpling.hypotheses.org/?p=4178 As a follow-up to my previous post (POS-tagging in R with UDPipe), I explore the udpipe package further, focusing this time on dependency parsing. Dependency parsing is a process of analyzing the grammatical structure of sentences, establishing relationships between words in the sentences, and labeling these relationships using grammatical dependencies. Before you proceed, it is a good idea to become acquainted with the philosophy behind universal dependencies (UD).

Dependency parsing

Having access to the grammatical structure of a sentence is useful for a variety of NLP tasks, such as:

Information extraction: Dependency parsing can be used to identify the relationships between words in a sentence and extract specific pieces of information. For example, you might use dependency parsing to identify the subject and object of a sentence, or to extract the names of people or organizations mentioned in the text.
Text generation: Dependency parsing can be used to generate natural language text by determining the grammatical structure of a sentence and inserting words in the appropriate positions.
Machine translation: Dependency parsing can be used to analyze the structure of sentences in one language and generate equivalent sentences in another language, which can be useful for machine translation tasks.
Text classification: Dependency parsing can be used to extract features from text that can be used to classify text into different categories, such as sentiment analysis or topic classification.
Text summarization: Dependency parsing can be used to identify the most important words or phrases in a sentence or document, which can be useful for text summarization tasks.

The pipeline

We are going to re-use some of the code of POS-tagging in R with UDPipe, namely the parts designed to:

dowload and load the UDPipe language models with the udpipe_download_model() and udpipe_load_model() functions;
annotate a candidate sentence with udpipe_annotate() function.

The new part of the code involves submitting a candidate sentence and visualizing the tokens, POS tag, and dependencies with the powerful and versatile textplot package.

(Down)load the language model

# load the necessary packages
library(udpipe)
library(textplot)
# download a language model (english-ewt) and save its path
m_eng_ewt   <- udpipe_download_model(language = "english-ewt")
m_eng_ewt_path <- m_eng_ewt$file_model
# load the selected language model
m_eng_ewt_loaded <- udpipe_load_model(file = m_eng_ewt_path)

Annotate a sentence

It is now time to parse a candidate sentence, which is “The dead air shapes the dead darkness, further away than seeing shapes the dead earth” (Faulkner, As I Lay Dying). The sentence is annotated and the output is converted into a data frame.

sentence <- udpipe::udpipe_annotate(m_eng_ewt_loaded, x = "The dead air shapes the dead darkness, further away than seeing shapes the dead earth.") 
     %>%
  as.data.frame()

You can inspect the annotated sentence with head(sentence). I do not do it here because the output is too wide.

Plot the dependencies

To plot the dependencies, we use the textplot_dependencyparser() function of the textplot package.

textplot_dependencyparser(sentence, size = 3)

Dependency parser with the `english-ewt` model

The two arguments are: the annotated sentence (sentence) and the label size (size).

Interpreting the graph

Bear in mind that UD treebanks are annotated with grammatical dependencies between the words in a sentence. In UD, each word is assigned a dependency relation to one of the other words in the sentence. The word that the relation is pointing to is called the head of the relation, and the word that the relation is coming from is called the dependent.

To understand dependencies, you need to refer to an inventory of dependencies. I recommend this one, adapted from de Marneffe et al (2014). The nature of each dependency is spelled out in red. In the above sentence, we have:

det determiner
amod adjectival modifier
nsubj nominal subject
obj object
punct punctuation
advmod adverbial modifier
advcl adverbial clause modifier
csubj clausal subject
mark marker

To assess how well the parser performed, it is also a good idea to know what goes on in the sentence. Here, the first occurrence of the verb shapes is the root of the first clause and the second occurrence of the same verb is the root of the second clause. The first clause says that dead air is shaping dead darkness and the second clause says that the nominalized verb seeing is shaping dead earth. In the first clause, The determines the noun air and dead is an adjective modifying air and darkness. Air is the subject of the verb shapes, and darkness is the object. In the second clause, the adverb further modifies the adverb away. Both adverbs modify the verb shape. Than is here tagged as a ‘marker’ (mark). The arc from than to shapes signals that it is a subordinating conjunction. Seeing is the subject of the verb shapes and earth is the object. Although picky grammarians may propose alternative taggings and parsings, we can say that english-ewt has done a reasonably good job.

Choosing the right language model

You must match the model to the text data that you work with. On top of english-ewt, there are three other models that are worth considering for English: english-gum, english-lines, and english-partut. They are trained on different datasets and may have slightly different performance characteristics. Here is a brief overview of each model:

english-ewt is trained on the English Web Treebank (EWT), which is a collection of sentences from the web;
english-gum is trained on the GUM Corpus, which is a large, manually annotated corpus of English that includes a wide range of text types and genres;
english-lines is trained on the LINES Corpus, which is a collection of sentences from the web;
english-partut is trained on the ParTUT Corpus, which is originally a collection of Italian sentences. It has been adapted for use with English text by applying cross-lingual transfer learning techniques.

If you are working with a specific type of text (e.g., web text, news articles, etc.), you may want to choose a model that was trained on a similar dataset, e.g., english-ewt or english-lines. If you are working with a mix of text types, you may want to choose a more general-purpose model, e.g., english-gum.

Having said that, no parser is perfect. The graph below is based on english-gum, which is a priori the best model for a Faulkner sentence. Surprisingly, the parser has misinterpreted the grammatical status of the second occurrence of shapes, which it considers a noun instead of a verb. This error comes from the fact that, on the surface, shapes can be considered a noun or a verb.

Dependency parser with the `english-gum` model

The same problem appears with english-lines…

Dependency parser with the `english-lines` model

… and english-partut.

Dependency parser with the `english-partut` model

There are therefore three solutions: (a) compare parsers on a series of test sentences and choose the one that performs best, (b) accept that your parser will generate a certain amount of wrong tags and dependencies, or (c) train your own model on your specific data, a feature also offered by the udpipe package.

References

De Marneffe, M. C., Dozat, T., Silveira, N., Haverinen, K., Ginter, F., Nivre, J., & Manning, C. D. (2014). Universal Stanford dependencies: A cross-linguistic typology. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14) (pp. 4585-4592).

Nivre, J, de Marneffe, M.C., Ginter, F., Hajič, J., Manning, C.D., Pyysalo, S., Schuster, S., Tyers, F., and Zeman, D. 2020. Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pp. 4034–4043, Marseille, France. European Language Resources Association.

Cover image generated with DALL-E (https://labs.openai.com/)

]]> https://corpling.hypotheses.org/4178/feed 0 POS-tagging in R with UDPipe https://corpling.hypotheses.org/4081 https://corpling.hypotheses.org/4081#comments Sun, 18 Dec 2022 22:36:08 +0000 https://corpling.hypotheses.org/?p=4081 POS stands for “part-of-speech” (i.e. the grammatical nature of a word). POS tagging is the process of assigning a part-of-speech (such as noun, verb, adverb, adjective, determiner, etc.) to each word in a given text. This is a common task in corpus linguistics, NLP, and the digital humanities to help understand the structure of a text or collection of texts. Indeed, by identifying the part-of-speech for each word, it becomes easier to identify the roles that words play in a sentence and to understand the overall meaning of the sentence.

There are several R packages that can be used for POS tagging. Some of the most popular POS-tagging packages include tidytext and openNLP and udpipe. The tidytext package provides tools for text mining and analysis, including functions for POS tagging. The openNLP package is a machine learning toolkit that includes functions for POS tagging and other NLP tasks. The more recent udpipe package is designed to use the UDPipe (Universal Dependencies Parser) library, which includes functions for POS tagging and other NLP tasks such as tokenizing, lemmatizing, and parsing (Straka & Straková 2017). In this post, I will focus on udpipe.

UDPipe

Universal dependencies (UD) is a framework for annotating grammar (syntax and morphological features). UD is extremely popular in NLP, perhaps slightly less so in corpus linguistics. The goal of UD is to provide a consistent, language-independent representation of the syntactic structure of sentences. This representation is called a dependency tree, and it shows the relationships between words in a sentence, including which words are the subject, object, and other grammatical roles.

Language Models

Load the necessary packages:

library(dplyr)
library(stringr)
library(udpipe)
library(lattice)

The udpipe package includes a number of pre-trained language models for various languages. These models are trained on UD treebanks. 101 Pre-trained models are available for 65+ languages (view the full list here). Four models are available for English: english-ewt, english-gum, english-lines, english-partut. Let us download all four of them with the udpipe_download_model() function.

# english-ewt
m_eng_ewt   <- udpipe_download_model(language = "english-ewt")
#english-gum
m_eng_gum   <- udpipe_download_model(language = "english-gum")
#english-lines
m_eng_lines   <- udpipe_download_model(language = "english-lines")
#english-partut
m_eng_partut   <- udpipe_download_model(language = "english-partut")

Once you have downloaded these models, they will be stored permanently on your computer. To avoid having to download them again, it is a good idea to know the path to each of them and save it into a character vector. Here is how to do it:

m_eng_ewt_path <- m_eng_ewt$file_model
m_eng_gum_path <- m_eng_gum$file_model
m_eng_lines_path <- m_eng_lines$file_model
m_eng_partut_path <- m_eng_partut$file_mode

To load a model, use the udpipe_load_model() function:

m_eng_ewt_loaded <- udpipe_load_model(file = m_eng_ewt_path)
m_eng_gum_loaded <- udpipe_load_model(file = m_eng_gum_path)
m_eng_lines_loaded <- udpipe_load_model(file = m_eng_lines_path)
m_eng_partut_loaded <- udpipe_load_model(file = m_eng_partut_path)

Of course, you only need one of these models. We are using english-ewt.

Load and pre-process the text

For the following demo, I am going to use a short text in English, available here. It is an excerpt from the preamble to the GNU General Public License.

Load the text:

text <- readLines(url("https://tinyurl.com/gnutxt"), skipNul = T)

And clean it with the stringer package:

text <- text %>% str_squish()

FYI, str_squish() removes whitespace at the start and end, and replaces all internal whitespace with a single space. This is what the text should look like:

[1] "The GNU General Public License is a free, copyleft license for software and other kinds of works. The licenses for most software and other practical works are designed to take away your freedom to share and change the works. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change all versions of a program--to make sure it remains free software for all its users. We, the Free Software Foundation, use the GNU General Public License for most of our software; it applies also to any other work released this way by its authors. You can apply it to your programs, too. When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for them if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs, and that you know you can do these things."

Annotate the text

The text is tokenised, tagged, and dependency-parsed in one go with the udpipe_annotate() function:

text_annotated <- udpipe_annotate(m_eng_ewt_loaded, x = text) %>%
      as.data.frame() %>%
      select(-sentence)

The output is a data frame:

Two kinds of POS tags are available: upos and xpos. upos tags are independent of the specific language being used (they are ‘universal’). The list of upos tags is therefore limited:

ADJ adjective
ADP adposition
ADV adverb
AUX auxiliary
CCONJ coordinating conjunction
DET determiner
INTJ interjection
NOUN noun
NUM numeral
PART particle
PRON pronoun
PROPN proper noun
PUNCT punctuation
SCONJ subordinating conjunction
SYM symbol
VERB verb
X other

xpos tags, on the other hand, are language-specific. For example, in English, the upos tag for a verb might be VERB, while the corresponding xpos tag might be VB (for a base form verb) or VBD (for a past tense verb). In French, the upos tag for a verb might still be VERB, but the xpos tag might be VER:cond (for a conditional verb).

To append a upos tag to each word in the text, use the paste() function:

text_postagged <- paste(text_annotated$token, "_", text_annotated$upos, collapse = " ", sep = "")

This is what you obtain:

[1] "The_DET GNU_PROPN General_PROPN Public_PROPN License_PROPN is_AUX a_DET free_ADJ ,_PUNCT copyleft_ADJ license_NOUN for_ADP software_NOUN and_CCONJ other_ADJ kinds_NOUN of_ADP works_NOUN ._PUNCT The_DET licenses_NOUN for_ADP most_ADJ software_NOUN and_CCONJ other_ADJ practical_ADJ works_NOUN are_AUX designed_VERB to_PART take_VERB away_ADP your_PRON freedom_NOUN to_PART share_VERB and_CCONJ change_VERB the_DET works_NOUN ._PUNCT By_ADP contrast_NOUN ,_PUNCT the_DET GNU_PROPN General_PROPN Public_PROPN License_PROPN is_AUX intended_VERB to_PART guarantee_VERB your_PRON freedom_NOUN to_PART share_VERB and_CCONJ change_VERB all_DET versions_NOUN of_ADP a_DET program_NOUN --_PUNCT to_PART make_VERB sure_ADJ it_PRON remains_VERB free_ADJ software_NOUN for_ADP all_DET its_PRON users_NOUN ._PUNCT We_PRON ,_PUNCT the_DET Free_ADJ Software_NOUN Foundation_NOUN ,_PUNCT use_VERB the_DET GNU_PROPN General_PROPN Public_PROPN License_PROPN for_ADP most_ADJ of_ADP our_PRON software_NOUN ;_PUNCT it_PRON applies_VERB also_ADV to_ADP any_DET other_ADJ work_NOUN released_VERB this_DET way_NOUN by_ADP its_PRON authors_NOUN ._PUNCT You_PRON can_AUX apply_VERB it_PRON to_ADP your_PRON programs_NOUN ,_PUNCT too_ADV ._PUNCT When_ADV we_PRON speak_VERB of_ADP free_ADJ software_NOUN ,_PUNCT we_PRON are_AUX referring_VERB to_ADP freedom_NOUN ,_PUNCT not_ADV price_NOUN ._PUNCT Our_PRON General_ADJ Public_NOUN Licenses_NOUN are_AUX designed_VERB to_PART make_VERB sure_ADJ that_SCONJ you_PRON have_VERB the_DET freedom_NOUN to_PART distribute_VERB copies_NOUN of_ADP free_ADJ software_NOUN (_PUNCT and_CCONJ charge_VERB for_ADP them_PRON if_SCONJ you_PRON wish_VERB )_PUNCT ,_PUNCT that_SCONJ you_PRON receive_VERB source_NOUN code_NOUN or_CCONJ can_AUX get_VERB it_PRON if_SCONJ you_PRON want_VERB it_PRON ,_PUNCT that_SCONJ you_PRON can_AUX change_VERB the_DET software_NOUN or_CCONJ use_VERB pieces_NOUN of_ADP it_PRON in_ADP new_ADJ free_ADJ programs_NOUN ,_PUNCT and_CCONJ that_SCONJ you_PRON know_VERB you_PRON can_AUX do_VERB these_DET things_NOUN ._PUNCT"

We can do the same with xpos tags:

text_postagged <- paste(text_annotated$token, "_", text_annotated$xpos, collapse = " ", sep = "")

This time, when you inspect text_postagged, this is what the text looks like:

[1] "The_DT GNU_NNP General_NNP Public_NNP License_NNP is_VBZ a_DT free_JJ ,_, copyleft_JJ license_NN for_IN software_NN and_CC other_JJ kinds_NNS of_IN works_NNS ._. The_DT licenses_NNS for_IN most_JJS software_NN and_CC other_JJ practical_JJ works_NNS are_VBP designed_VBN to_TO take_VB away_RP your_PRP$ freedom_NN to_TO share_VB and_CC change_VB the_DT works_NNS ._. By_IN contrast_NN ,_, the_DT GNU_NNP General_NNP Public_NNP License_NNP is_VBZ intended_VBN to_TO guarantee_VB your_PRP$ freedom_NN to_TO share_VB and_CC change_VB all_DT versions_NNS of_IN a_DT program_NN --_, to_TO make_VB sure_JJ it_PRP remains_VBZ free_JJ software_NN for_IN all_DT its_PRP$ users_NNS ._. We_PRP ,_, the_DT Free_JJ Software_NN Foundation_NN ,_, use_VB the_DT GNU_NNP General_NNP Public_NNP License_NNP for_IN most_JJS of_IN our_PRP$ software_NN ;_, it_PRP applies_VBZ also_RB to_IN any_DT other_JJ work_NN released_VBN this_DT way_NN by_IN its_PRP$ authors_NNS ._. You_PRP can_MD apply_VB it_PRP to_IN your_PRP$ programs_NNS ,_, too_RB ._. When_WRB we_PRP speak_VBP of_IN free_JJ software_NN ,_, we_PRP are_VBP referring_VBG to_IN freedom_NN ,_, not_RB price_NN ._. Our_PRP$ General_JJ Public_NN Licenses_NNS are_VBP designed_VBN to_TO make_VB sure_JJ that_IN you_PRP have_VBP the_DT freedom_NN to_TO distribute_VB copies_NNS of_IN free_JJ software_NN (_-LRB- and_CC charge_VB for_IN them_PRP if_IN you_PRP wish_VBP )_-RRB- ,_, that_IN you_PRP receive_VBP source_NN code_NN or_CC can_MD get_VB it_PRP if_IN you_PRP want_VBP it_PRP ,_, that_IN you_PRP can_MD change_VB the_DT software_NN or_CC use_VB pieces_NNS of_IN it_PRP in_IN new_JJ free_JJ programs_NNS ,_, and_CC that_IN you_PRP know_VBP you_PRP can_MD do_VB these_DT things_NNS ._."

As expected, the level of granularity is higher with xpos. Therefore, the choice of upos vs xpos tags depends on the kind of study that you are conducting.

Plotting frequency distributions

To obtain the frequency distribution of POS tags, use the txt_freq function of the udpipe package. We do it for upos tags…

> txt_freq(text_annotated$upos)
     key freq  freq_pct
1   NOUN   36 17.391304
2   VERB   29 14.009662
3   PRON   25 12.077295
4  PUNCT   21 10.144928
5    ADP   18  8.695652
6    ADJ   17  8.212560
7    DET   15  7.246377
8  PROPN   12  5.797101
9    AUX    9  4.347826
10 CCONJ    8  3.864734
11  PART    7  3.381643
12 SCONJ    6  2.898551
13   ADV    4  1.932367

…and xpos tags:

> txt_freq(text_annotated$xpos)
     key freq   freq_pct
1     IN   23 11.1111111
2     NN   22 10.6280193
3    PRP   18  8.6956522
4     VB   16  7.7294686
5     DT   15  7.2463768
6     JJ   15  7.2463768
7    NNS   14  6.7632850
8    NNP   12  5.7971014
9      ,   12  5.7971014
10   VBP    9  4.3478261
11    CC    8  3.8647343
12     .    7  3.3816425
13    TO    7  3.3816425
14  PRP$    7  3.3816425
15   VBZ    4  1.9323671
16   VBN    4  1.9323671
17    MD    4  1.9323671
18    RB    3  1.4492754
19   JJS    2  0.9661836
20    RP    1  0.4830918
21   WRB    1  0.4830918
22   VBG    1  0.4830918
23 -LRB-    1  0.4830918
24 -RRB-    1  0.4830918

The barchart() function in the lattice package is now used to create a bar chart to display the distribution of POS tags in the text. We start with upos tags:

freq.distribution.upos <- txt_freq(text_annotated$upos)
freq.distribution.upos$key <- factor(freq.distribution.upos$key, levels = rev(freq.distribution.upos$key))
barchart(key ~ freq, data = freq.distribution.upos, col = "dodgerblue",
         main = "UPOS frequencies",
         xlab = "Freq")

and do the same for xpos tags:

freq.distribution.xpos <- txt_freq(text_annotated$xpos)
freq.distribution.xpos$key <- factor(freq.distribution.xpos$key, levels = rev(freq.distribution.xpos$key))
barchart(key ~ freq, data = freq.distribution.upos, col = "cadetblue",
         main = "XPOS frequencies",
         xlab = "Freq")

References

Straka, M., & Straková, J. (2017). Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 shared task: Multilingual Parsing from raw text to universal dependencies (pp. 88-99).

Cover image generated with DALL-E (https://labs.openai.com/)

]]> https://corpling.hypotheses.org/4081/feed 1 A word-cloud generator with R https://corpling.hypotheses.org/3941 https://corpling.hypotheses.org/3941#respond Sat, 10 Dec 2022 08:42:30 +0000 https://corpling.hypotheses.org/?p=3941 In Corpus Linguistics and Statistics with R, I showed how to make a word cloud from a text file (Section 6.3). A word cloud is a user-friendly way of representing a frequency list graphically. I have decided to share the code here in the form of a Shiny app. Shiny is a web application that is created using the shiny package in R. Shiny allows users to create interactive web apps quickly and easily using R, which makes it a popular choice for data scientists who want to share their work with others. Shiny apps are often used to visualize data and create user-friendly interfaces for data exploration.

Shiny 101

To make a Shiny app, you will need to have the shiny package installed in R. To install the Shiny package, open R and type the following:

install.packages("shiny")

Once the shiny package is installed, you can create a new Shiny app by using the shinyApp() function. This function takes two arguments: the first is the UI (user interface) of the app, and the second is the server function that defines the behavior of the app.

The user interface (UI) defines the layout and appearance of the app. The UI typically consists of a combination of input elements, such as buttons, checkboxes, and text boxes, that allow the user to interact with the app, and output elements, such as graphs and tables, that display the results of the app’s computations.

The server part is of the script contains instructions that the computer should follow to build the app. These instructions typically include instructions for reading in data, performing computations, and generating output. The server script also includes instructions for how the app should respond to user input, such as by updating the output or changing the plot in response to a button click.

Here is an example of a simple Shiny app that displays “Hello, Shiny!” on the screen:

# Load the Shiny package
library(shiny)
# Define the UI
ui <- fluidPage(
  # Add a title to the page
  title = "Hello, Shiny!",
  
  # Add a main panel to the page
  mainPanel(
    # Add a text output to the main panel
    textOutput("hello")
  )
)
# Define the server function
server <- function(input, output) {
  # Add a reactive expression that returns the string "Hello, Shiny!"
  output$hello <- reactive({
    "Hello, Shiny!"
  })
}
# Create the Shiny app by combining UI and server
shinyApp(ui = ui, server = server)

To run the app, you have two options. The first option is to copy and paste the whole script into R. The second option is to save the above script as an R file, e.g., myshinyapp.r and use the runApp() function. This function takes the file path of the script as its argument, and runs the code in the script:

library(shiny)
runApp("/path/to/myshinyapp.R")

This will open the Shiny app in your default web browser, where you can interact with it.

The Shiny word-cloud app

Save the following script into an R file: wordcloudapp.R.

# Load required packages
library(shiny)
library(tm)
library(wordcloud)
# Define UI
ui <- fluidPage(
  
  # Application title
  titlePanel("Word Cloud Generator"),
  
  # Sidebar with options to load a text file and specify number of words
  sidebarLayout(
    sidebarPanel(
      fileInput("file", "Choose a text file", accept = c("text/plain", ".txt")),
      sliderInput("num_words", "Number of words to include in word cloud:",
                  min = 50, max = 500, value = 100)
    ),
    
    # Show the word cloud in the main panel
    mainPanel(
      plotOutput("wordcloud", width = "600px", height = "600px")
    )
  )
)
# Define server logic
server <- function(input, output) {
  
  # Reactive function to process the text file
  process_text <- reactive({
    # Load the text file
    text <- readLines(input$file$datapath)
    
    # Convert the text to lowercase and remove punctuation
    text <- tolower(text)
    text <- gsub("[[:punct:]]", "", text)
    
    # Tokenize the text
    text <- unlist(strsplit(text, "\\s+"))
    
    # Return the processed text
    return(text)
  })
  
  # Generate the word cloud using the processed text
  output$wordcloud <- renderPlot({
    wordcloud(process_text(), max.words = input$num_words)
  })
  
}
# Create the Shiny app by combining UI and server
shinyApp(ui = ui, server = server)

Load the script as follows:

runApp("path/to/file/wordcloudapp.R")

The Shiny app opens in your default web browser. At first, you see an error message (‘Error: ‘con’ is not a connection’). This is because you have not loaded a text yet.

Let us load Herman Melville’s Moby Dick, which I downloaded from Project Gutenberg. It is a UTF-8 text file that I post-processed with R. To do so, click on ‘Browse’. This opens up an interactive window. Look for the text file and click ‘Open’.

The size of each word is indexed on its type frequency. By default, the number of words in the word cloud is 100. Using the ruler, we set the number of words to be included in the word cloud to 300.

The word cloud is pretty basic but does the trick. If you read the documentation of the wordcloud package, you will discover more features, such as coloring the words based on how often they occur in the text. You can also customize the stoplist (the list of words that are excluded from the word cloud, such as function words).

Have fun!

]]> https://corpling.hypotheses.org/3941/feed 0 Should we be yelling at ‘iel’? https://corpling.hypotheses.org/3593 https://corpling.hypotheses.org/3593#respond Mon, 04 Jul 2022 20:10:21 +0000 https://corpling.hypotheses.org/?p=3593 Every year in November, France eagerly awaits the Beaujolais Nouveau and the new edition of the reference dictionary of French: Le Robert. As New York Times wine critic Eric Asimov wrote: “Beaujolais is a far more relaxed wine than Bordeaux. Don’t worry about decanting, or serving it in the perfect glass. It’s versatile with food.” Indeed, the French have a relaxed attitude with respect to this cheap wine. The release of the Beaujolais Nouveau is the opportunity for a light-hearted, yet nationwide, debate on whether it has a banana, strawberry, or raspberry aftertaste. The publication of the dictionary triggers another kind of debate, one that is not so relaxed. You can joke about cheap wine all you want, but you can’t joke about the national language.¹

On November 16th 2021, the editors of Le Robert included the gender-inclusive pronoun ‘iel’ in its online 2022 edition (Dico en Ligne).

Le Robert’s Dico en Ligne (https://dictionnaire.lerobert.com/definition/iel)

‘Iel’ is a neologism, and other spelling variants exist, e.g. : yel or ielle. It is considered gender inclusive because it is conflates the masculine pronoun il ‘he’ and its feminine counterpart elle ‘she’. Iel has not yet been ratified by the Académie Française (a highly conservative French council for matters pertaining to the French language), and won’t be for a while. While the everyday use of iel is largely anecdotal for now, conservative critics deem it a linguistic provocation that must be banned.

In the immediate aftermath of the online publication, Le Robert editors were accused of wokeism by French MP François Jolivet, of La République en Marche, the party founded by current French President Emmanuel Macron. Mr Jolivet’s main complaint is summarized in an official letter reproduced below. The letter is addressed to Hélène Carrère d’Encausse, the current president of the Académie Française. Note that Mrs. Carrère d’Encausse’s official title Secrétaire perpétuel de l’Académie Française ‘Permanent Secretary of the French Academy’ is in the masculine, i.e. purposely not in a gender-inclusive form. According to Mr Jolivet, this type of initiative sullies French, and ends up dividing its users instead of bringing them together.

Mr Jolivet’s protest was supported by then French Education Minister Jean-Michel Blanquer, who tweeted that “Inclusive writing is not the future of the French language.” He added that “Our students, who are consolidating their basic knowledge, cannot have that as a reference.”

Je soutiens évidemment la protestation de @FJolivet36 vis-à-vis du #PetitRobert
L’écriture inclusive n’est pas l’avenir de la langue française.
Alors même que nos élèves sont justement en train de consolider leurs savoirs fondamentaux, ils ne sauraient avoir cela pour référence: https://t.co/09thJzQ7iN
— Jean-Michel Blanquer (@jmblanquer) November 16, 2021

Le Robert’s director denied any activist motive, saying its specialists had noted a rise in the use of iel for several months. The week after the online publication, I went to my local bookstore (by the way, they are wonderful, if you ever visit Paris, make sure you pay them a visit) and looked up iel in the print edition. I could not find it, which suggests that Le Robert are well aware that their print edition has a different attitude with respect to the exposition of the baseline of language standards. The paper version seems to lag behind the online edition, which is more flexible as far as neologisms are concerned. Chief editor Charles Bimbenet explained Le Robert did not want to promote ‘wokeism‘: “[i]t seemed useful to specify its meaning for people who come across it, whether they want to use it or, on the contrary, reject it. Defining the words at use in the world helps us to better understand it.”

Iel in particular and gender-inclusive markers in general (the use of ‘neutral’ pronouns they/them in English, the interpunct in French, the ‘neutral’ morpheme x in Spanish, such as nosotrxs instead of nosotros/nosotras, etc.) are interesting because they oppose two camps (naively: conservatives vs. activists). Both kinds instantiate change from above. Often, change from above is considered to begin in the speech of educated people, or people with a high social prestige. With respect to inclusive writing, social prestige does not seem to be a discriminating factor as both camps are educated.

From Twitter to Amazon

Out of curiosity, I read the reviews of Le Robert on Amazon. I was immediately struck by the low ratings, only three stars for the favorite dictionary of the French. Note that this is the print edition, i.e. the one without iel.

How *Le Petit Robert* rates on Amazon on Nov 30, 2021

This piqued my curiosity and prompted me to read the reviews. A few clicks later, tadaaa!

How to recognize far-right trollers (the reviews are public, but I have truncated the names)

If you do not speak French, let me translate the titles of the reviews for you. From top to bottom, we have: “Poor dictionary”, “Leftist propaganda manual”, and “Not recommended for school or university use”. The prose is clearly declinist and typical of far-right trolling. According to them, the French language is going to hell because French civilization is in decline (sic). One reviewer is a good case in point. He has left the same review and the same poor rating for all editions of the dictionary. In contrast, he gave five stars to far-right presidential candidate Éric Zemmour’s racist pamphlet (he nicknames Zemmour “the French Trump”). Another reviewer gives the dictionary only one star but is generous enough to grant five stars to former far-right leader Jean-Marie Le Pen’s memoirs.

The reviewer below has only written a total of three reviews: one for a phone case, another one for the dictionary, and yet another one for… a gun safe!

In contrast, the reviews written by lovers of the French language all over the world are quite good.

From gender-inclusive pronouns to gender-inclusive morphemes

Last semester, I invited the students in my sociolinguistics course at Paris 8 University to approach the debate with the tools that they learned in class. In other words, I asked them to leave their preferences aside (some of them use iel as a token of their activism, others reject iel on political grounds) and reflect upon gender-inclusive variables from the angle of objective science. I gave them Burnett & Pozniak (2021) to read. The authors’ goal was to conduct “a large quantitative corpus study of the (non)use of EI in Parisian undergraduate brochures.” Using a corpus of undergraduate brochures in twelve Parisian universities, they extracted all occurrences of the noun/adjective étudiant ‘student’, whose masculine/feminine alternation invites the use of inclusive morphemes, i.e. the interpunct étudiant·e, parentheses étudiant(e), the dash étudiant-e, the slash étudiant/e, the period étudiant.e, or repetition étudiant et étudiante. Burnett and Pozniak found that the period is neutral, the interpunct the most activist form, and parentheses are the least activist marker (for a full account, read Section 4.2.3 of the paper). Stratification is at work, based on several effects, the main ones being the prestige attached to each university, the discipline, and gender parity.

I expected my university (Paris 8) to rank first with respect to the use of the interpunct (‘point médian’), but I was wrong. The most activist institution is clearly Paris Nanterre University (‘Paris 10’), which happens to be the one that hosts my lab.

The distribution of gender-neutral morphemes in twelve Parisian universities (based on Burnett & Pozniak 2021)

I asked my student to replicate Burnett and Pozniak’s methodology on undergraduate brochures in sociology at non-Parisian universities. We expected to find many occurrences of inclusive writing because in this discipline, both faculty and students are traditionally pro-activist. Instead, the students found very few instances of étudiant·e, with the interpunct. We suspect place to be a significant factor in the distribution of gender-inclusive morphemes. We shall test this hypothesis next semester.

I cannot help but be amazed at how much societal chaos a single word can cause. Of course, conscious changes (those that emanate from institutions, socially dominant groups, and pressure groups) can inflect the spread of a word or expression, but at the end of the day, I am convinced that collective usage below the level of consciousness prevails. At least, that is what my experience as a usage-based linguist has taught me. For this reason, it is hard to predict how linguistic units spread across communities of speakers.

Reference

Burnett, H., & Pozniak, C. (2021). Political dimensions of gender inclusive writing in Parisian universities. Journal of Sociolinguistics, 25(5), 808-831. Full paper.

Due to a very busy year, I began writing this post last Fall but did not have time to finish it until now. Here it is, at last.

]]> https://corpling.hypotheses.org/3593/feed 0 Diachronic linguistics with distributional semantic models in R https://corpling.hypotheses.org/3583 https://corpling.hypotheses.org/3583#respond Thu, 30 Sep 2021 12:05:16 +0000 https://corpling.hypotheses.org/?p=3583 Martin Schweinberger and Michael Haugh (University of Queensland, Australia) invited me to to give a LADAL Opening Webinar Series 2021 talk. I felt quite honored to be part of a prestigious panel of fellow corpus linguists (Stefan Th. Gries, Natalia Levshina, Monika Bednarek, Laurence Anthony, Laura Janda, etc.). On Sept 30 2021, I explained ways of doing diachronic linguistics with distributional semantic models in R. Below, you will find the video of my talk as well as a link to download the companion R notebook.

Access the R companion notebook (or download it here).

]]> https://corpling.hypotheses.org/3583/feed 0 Tracing the shifting collocates of complex prepositions with diachronic word embeddings: supplementary materials https://corpling.hypotheses.org/3532 https://corpling.hypotheses.org/3532#respond Mon, 16 Aug 2021 15:58:44 +0000 https://corpling.hypotheses.org/?p=3532 The 11th International Conference on Construction Grammar (ICCG11) takes place from Wednesday 18 till Friday 20 August 2021. It covers a broad range of topics related to Construction Grammar approaches to language. My talk is scheduled on Friday at 2.15pm. This post contains supplementary materials, namely high-resolution figures and a full bibliography.

Abstract

Complex prepositional constructions in the form <in/at the middle/midst/center/heart of NP> denote a relationship of internal location between a trajector, i.e. a located entity which is the primary focal participant, and a landmark, i.e. a reference entity which is the secondary focal participant (Langacker 1987: 225–228).

Locative prepositions denoting internal location are not synonyms because each imposes specific constraints that go beyond the spatial criteria that are ordinarily used: the landmark’s internal plurality, its boundedness, the magnitude of its referent, and the degree of functional dependence between the trajector and the landmark (Gréa 2017).

I adapt the above to the study of English. Using data from the Corpus of Historical American English (Davies 2010), I assess to what extent each prepositional construction imposes its own construal and how this construal has shifted over the past 200 years.

I draw upon three key assumptions from previous research: (a) changes in the collocational patterns of a linguistic unit reflect changes in meaning and/or function (Hilpert 2008, 2011; Hilpert and Gries 2009); (b) meaning can be modeled with dense word vectors (Mikolov et al. 2013b; Mikolov et al. 2013c, Author 2019); (c) by supplementing frequency-based methods with distributional word representations (Baroni et al. 2014; Turney and Pantel 2010), one can trace semantic shifts more precisely and with greater explanatory power (Hilpert 2016; Kulkarni et al. 2015; Perek 2016, 2018).

Word vectors were obtained by running a prediction-based shallow neural model on the COHA, namely Skipgram with negative sampling (Mikolov et al. 2013a). I made a semantic vector space of the most distinctive landmark collocates of all complex prepositions across the whole period covered by the corpus (1810s–2000s). Only the most distinctive collocates of each construction appear in the vector space. Next, I divided the corpus into four arbitrary periods (1810s–1860s, 1870s–1910s, 1920s–1970s, and 1980s–2000s). I included frequency-based density plots for each period and each prepositional construction. I obtained one semantic vector space per period. The resulting maps confirm that the four prepositional constructions involve different types of constraints and that these constrains have changed since the 1810s. For example, <in the heart of NP> has shifted from profiling landmarks that denote institutional or geographic bounded areas (e.g. countries, nations, world, etc.) to interactional events whose boundedness is unclear but whose magnitude is foregrounded (e.g. dispute, problem, issue, etc.).

Slides

Extended bibliography

]]> https://corpling.hypotheses.org/3532/feed 0 Preuves et présomptions en linguistique de corpus https://corpling.hypotheses.org/3522 https://corpling.hypotheses.org/3522#respond Sun, 28 Mar 2021 17:38:20 +0000 https://corpling.hypotheses.org/?p=3522 L’Ecole Doctorale 139 de l’Université Paris Nanterre organise un séminaire pluridisciplinaire thématique. Cette année (2020-2021), le thème imposé est “Preuves et présomptions”.

Avec Sylvain Kahane, professeur en sciences du langage à l’Université Paris Nanterre, nous avons accepté l’invitation. Le défi thématique nous semblait intéressant à relever en tant que linguistes travaillant sur des corpus.

Nous abordons une question fondamentale pour la discipline : dans quelle mesure un travail sur corpus constitue-t-il la preuve d’une hypothèse modélisatrice en linguistique ? Après un rappel épistémologique sur la linguistique de corpus (son contexte d’émergence, ses prétentions, ses applications et ses limites), nous proposons deux études de cas en syntaxe : la forte limitation des auto-enchâssements centrés et la vérification d’un universal implicatif à la Greenberg. Ces études reposent sur l’exploitation de corpus où les phrases sont annotées par des arbres de dépendances syntaxiques.

Voici un lien pour télécharger la vidéo.

]]> https://corpling.hypotheses.org/3522/feed 0 Clustering corpus data with multidimensional scaling https://corpling.hypotheses.org/3497 https://corpling.hypotheses.org/3497#respond Wed, 03 Mar 2021 15:53:58 +0000 https://corpling.hypotheses.org/?p=3497 Multidimensional scaling (MDS) is a very popular multivariate exploratory approach because it is relatively old, versatile, and easy to understand and implement. It is used to visualize distances in multidimensional maps (in general: two-dimensional plots).

I hardly ever use MDS because I was trained in the French school of data analysis. This means that I favor equivalent multivariate exploratory approaches such as (multiple) correspondence analysis or hierarchical cluster analysis. However, this has the effect of puzzling most non-French reviewers. This is why I advise you to consider MDS an option in case you are considering a top-tier journal.

MDS comes in different flavors:

vanilla/classical MDS (metric MDS);
Kruskal’s non-metric multidimensional scaling;
Sammon’s non-linear mapping.

I focus on classical multidimensional scaling (MDS), which is also known as principal coordinates analysis (Gower 1966).

MDS takes as input a matrix of dissimilarities and returns a set of points such that the distances between the points are approximately equal to the dissimilarities. A strong selling point of MDS is that, given the n dimensions of a table, it returns an optimal solution to represent the data in a space whose dimensions are (much) lower than n.

Case study

The data are from Desagulier (2014). The data set was compiled to see how 23 English intensifiers cluster on the basis of their most associated adjectives. For each of the 23 adverbs, I first extracted all adjectival collocates from the Corpus of Contemporary American English (Davies 2008–2012), amounting to 432 adjective types and 316,159 co-occurrence tokens. Then, I conducted a collexeme analysis for each of the 23 degree modifiers. To reduce the data set to manageable proportions, the 35 most attracted adjectives were selected on the basis of their respective collostruction strengths, yielding a 23-by-432 contingency table containing the frequency of adverb-adjective pair types.

The contingency table is available from a secure server:

intensifiers <- readRDS(url("https://tinyurl.com/7k378zcd"))

Here is what the first ten rows and the first ten columns look like:

The dissimilarity matrix

The contingency table must be converted into a distance object. Technically, this distance object is a dissimilarity matrix. Because the matrix is symmetric, it is divided into two parts (two triangles) on either side of the diagonal of null distances between the same cities. Only one triangle is needed.

You obtain the dissimilarity matrix by converting the contingency table into a table of distances with a user-defined distance measure. When the variables are ratio-scaled, you can choose from several distance measures: Euclidean, City-Block/Manhattan, correlation, Pearson, Canberra, etc. I have noticed that the Canberra distance metric handles best the relatively large number of empty occurrences that we typically obtain in linguistic data (i.e. when we have a sparse matrix).

We use the dist() function:

the first argument is the data table;
the second argument is the distance metric (method="canberra");
the third argument (diag) lets you decide if you want R to print the diagonal of the distance object;
the fourth argument (upper) lets you decide if you want R to print the upper triangle of the distance object.

dist.object <- dist(intensifiers, method="canberra", diag=T, upper=T)

The distance object is quite large. To see a snapshot, enter the following:

dist.matrix <- as.matrix(dist.object)
dist.matrix[1:5, 1:5] # first 10 rows, first 10 columns

The diagonal of 0 values separates the upper and lower triangles, as expected from a distance matrix.

Running MDS with `cmdscale()`

The distance matrix serves as input to the base-R cmdscale() function, which performs a ‘vanilla’ version of MDS. We specify k=2, meaning that the maximum dimension of the space which the data are to be represented in is 2.

mds <- cmdscale(dist.matrix,eig=TRUE, k=2)
mds

The result is a matrix with 2 columns and 23 rows (fit$points). The function has done a good job at outputting the coordinates of intensifiers in the reduced two-dimensional space that we requested. Note that cmdscale() returns the best-fitting k-dimensional representation, where k may be less than the argument k.

To plot the results, first we retrieve the coordinates for the two dimensions (x and y).

x <- mds$points[,1]
y <- mds$points[,2]

Second, we plot the two axes and add information about the intensifiers (Fig. 1).

plot(x, y, xlab="Dim.1", ylab="Dim.2", type="n")
text(x, y, labels = row.names(intensifiers), cex=.7)

The question we are addressing is whether these dimensions reflect differences in the semantics of the intensifiers. Existing typologies of intensifiers tend to group them as follows:

diminishers (slightly, a little, a bit, somewhat)
moderators (quite, rather, pretty, fairly)
boosters (most, very, extremely, highly, awfully, terribly, frightfully, jolly)
maximizers (completely, totally, perfectly, absolutely, entirely, utterly)

Maximizers and boosters stretch horizontally across the middle of the plot. Moderators are in the upper left corner, and diminishers in the lower left corner. Note the surprising position of almost.

Combining MDS and k-means clustering

We can improve the MDS plot in Fig. 1 by grouping and coloring the individuals by means of k-means clustering. K-means clustering partitions the data points into into k classes, based on the nearest mean.

We download and load one extra package from the tidyverse, namely ggpubr.

install.packages("ggpubr")
library(ggpubr)

We convert the coordinates obtained above into a data frame.

mds.df <- as.data.frame(mds$points) # convert the coordinates colnames(mds.df) <- c("Dim.1", "Dim.2") # assign column names 
mds.df # inspect

We proceed to $k$-means clustering on the data frame with the kmeans() function.

kmclusters <- kmeans(mds.df, 5) # k-means clustering with 5 groups 
kmclusters <- as.factor(kmclusters$cluster) # convert to a factor 
mds.df$groups <- kmclusters # join to the existing data frame mds.df # inspect

We are ready to launch the plot with ggscatter() (Fig. 2). Each group will be assigned a color.

ggscatter(mds.df, 
          x = "Dim.1", 
          y = "Dim.2", 
          label = rownames(intensifiers), 
          color = "groups", 
          palette = "jco", 
          size = 1, 
          ellipse = TRUE, 
          ellipse.type = "convex", 
          repel = TRUE)

Fig. 2. The MDS plot of intensifiers with k-means clusters

A comparison with HCA

The distance matrix can also serve as input for another multivariate exploratory method: hierarchical cluster analysis.

We use the hclust() function to apply an amalgamation rule that specifies how the elements in the matrix are clustered. We amalgamate the clusters with Ward’s method, which evaluates the distances between clusters using an analysis of variance. Ward’s method is the most widely used amalgamation rule because it has the advantage of generating clusters of moderate size. We specify method="ward.D".

clusters <- hclust(dist.object, method="ward.D")

We plot the dendrogram (Fig. 3) as follows:

plot(clusters, sub="(Canberra, Ward)")

Although based on the same distance matrix, the dendrogram clusters the intensifiers slightly differently.

References

Gower, John C. 1966. “Some Distance Properties of Latent Root and Vector Methods Used in Multivariate Analysis.” Biometrika 53 (3-4): 325–38.

]]> https://corpling.hypotheses.org/3497/feed 0 Manipulate data with dplyr https://corpling.hypotheses.org/3461 https://corpling.hypotheses.org/3461#comments Wed, 03 Mar 2021 11:56:47 +0000 https://corpling.hypotheses.org/?p=3461 The dplyr package is based on a data manipulation ‘grammar’. This grammar provides a consistent set of ‘verbs’ that solve the most common data manipulation tasks. I illustrate five of these ‘verbs’: filter(), arrange(), select(), mutate(), and summarise(). Please refer to the dplyr documentation for details.

First of all, install and load the dplyr package in R:

install.packages('dplyr')
library(dplyr)

Data

The functions are illustrated with a data set from Fox and Jacewicz (2009). The authors compare the spectral change of five vowels in Western North Carolina, Central Ohio, and Southern Wisconsin. The corpus consists of 1920 utterances by 48 female informants. The authors find variation in formant dynamics as a function of phonetic factors. They also find that, for each vowel and for each measure employed, dialect is a strong source of variation in vowel-inherent spectral change.

Load the data as follows:

vow.dur <- read.table("https://bit.ly/2Iw7kn7", header=TRUE, sep="\t")

Once loaded as a data frame, here is what the data look like:

filter rows with `filter()`

The filter() function subsets a data frame, retaining all rows that meet one or several conditions. You express the condition(s) by means of the following logical operators:

== (equal to), != (not equal to), > (greater than), >= (greater than or equal to), etc
& (and), | (or), ! (not), xor() (exclusive or)
is.na() (checks whether a value is NA)
etc.

filtering with one condition

Keep only the vowels that occur in a voiceless context:

filter(vow.dur, context == "voiceless")

The same can be achieved with the tidyverse syntax:

vow.dur %>% 
     filter(context == "voiceless")

Let us now keep only the vowels whose duration is greater than 187:

filter(vow.dur, Vow_dur_ms > 187)

And now, let us keep only the vowels whose duration is greater than the mean duration:

filter(vow.dur, Vow_dur_ms > mean(Vow_dur_ms, na.rm = TRUE))

filtering with multiple conditions

To filter with multiple conditions, separate each condition with &. The code below keeps only the vowels that occur in voiceless and consonantal contexts:

filter(vow.dur, context == "voiceless" & position == "Ccontext")

Keep only the vowels that occur in a voiceless context AND whose duration is greater than 187:

filter(vow.dur, context == "voiceless" & Vow_dur_ms > 187)

arrange rows with `arrange()`

With arrange(), you can order the rows of a data frame by the values of selected columns.

arrange(vow.dur, US_state) # order by US state

arrange(vow.dur, US_state, Vow_dur_ms) # order by US state and vowel duration

arrange(vow.dur, US_state, desc(Vow_dur_ms)) # order by US state and vowel duration in decreasing order

select columns with `select()`

select() accesses the variables (columns) in a data frame based on their names. Selection can be made with the following base-R logical operators:

: for selecting a range of consecutive variables
! for taking the complement of a set of variables (e.g. !variable1 = all variables except variable1)
& and | for selecting the intersection or the union of two sets of variables
c() for combining selections

With tidyverse-specific operators, you can

match patterns in variable names:
- starts_with(): the variable name starts with a prefix
- ends_with(): the variable name ends with a suffix
- contains(): the variable name contains a literal string
- matches(): the variable name matches a regular expression
- num_range(): the variable name matches a numerical range like x01, x02, x03.
select variables from a character vector:
- all_of(): matches variable names in a character vector
- any_of(): same as all_of(), except that no error is thrown for names that don’t exist.
select variables with a function:
- where(): applies a function to all variables and selects those for which the function returns TRUE

Suppose we want to fetch US_state. With select(), we can do it it in several ways, including highly irrelevant ones.

vow.dur %>% select(starts_with("US"))
vow.dur %>% select(ends_with("te"))

The most obvious way consists in using the plain variable name, without quotes.

vow.dur %>% select(US_state)

Suppose we now want to fetch US_state and Vow_dur_ms. Both variable names have the underscore in common. Let us use this to select them.

vow.dur %>% select(contains("_"))

If you are familiar with regular expressions, write your regex as an argument of matches():

vow.dur %>% 
   select(matches("(\\w+_)+"))

vow.dur %>% select(matches("\\w+_\\w+_\\w+""))

add new variables/colums with `mutate()`

With mutate(), you can add new variables and preserve existing ones. A close equivalent, transmute() adds new variables but drops existing ones.

mutate() is often used with group_by() to calculate sums or means over grouped values.

vow.dur %>% 
   group_by(US_state) %>%
   mutate(mean_vow_dur = mean(Vow_dur_ms, na.rm = TRUE))

With transmute(), the variable Vow_dur_ms is dropped:

vow.dur %>%
   select(US_state, context, Vow_dur_ms) %>%
   group_by(US_state) %>%
   transmute(mean_vow_dur = mean(Vow_dur_ms, na.rm = TRUE))

rename variable names with `rename()`

rename() changes the names of individual variables using new_name = old_name syntax.

vow.dur %>%
   rename(vowel_duration = Vow_dur_ms)

make grouped summaries with `summarise()`

summarise() creates a new data frame based on a source data frame with one row per grouping variable.

Here is how to calculate the mean vowel duration overall:

vow.dur %>%
   summarise(mean = mean(Vow_dur_ms))

the mean vowel duration per US state:

vow.dur %>% 
   group_by(US_state) %>%
   summarise(mean = mean(Vow_dur_ms))

or the mean vowel duration per US state and context:

vow.dur %>%
   group_by(US_state, context) %>%
   summarise(mean = mean(Vow_dur_ms))

There are many functions other than mean(): median(), sd() (standard deviation), IQR() (interquartile range), min(), max(), quantile(), n() (count), etc.

References

Fox, Robert Allen, and Ewa Jacewicz. 2009. “Cross-Dialectal Variation in Formant Dynamics of American English Vowels.” The Journal of the Acoustical Society of America 126 (5): 2603–18. https://doi.org/10.1121/1.3212921.

]]> https://corpling.hypotheses.org/3461/feed 1 Tidy corpus linguistics data with tidyr https://corpling.hypotheses.org/3441 https://corpling.hypotheses.org/3441#respond Tue, 17 Nov 2020 09:46:15 +0000 https://corpling.hypotheses.org/?p=3441 The tidyr package is part of the tidyverse. As its name indicates, it is meant to help you create tidy data or tidy messy data according to the tidy data principles: each variable forms a column; each observation forms a row; each type of observational unit forms a table. This post illustrates how to tidy a data set in R using two tidyr functions: pivot_longer() and separate().

First, let us load the package.

library(tidyr)

Pivot data from wide to long

Tab. 1 was featured in a previous post. It is messy because the column headers are values (actual age ranges) that should be grouped under a single variable name (“age”).

variable	0_10	11_18	19_29	30_39	40_49	50_59	60_69	70_79	80_89	90_99
hello	77	139	377	186	261	90	80	59	20	9
hi	14	52	305	46	57	19	36	11	3	0

Table 1. A messy data frame with columns headers as values

To tidy the data table, we need to pivot it, i.e. increase the number of rows and decrease the number of columns. This is done with pivot_longer().

Load the messy data:

df.messy.1 <- read.table("https://bit.ly/366nbkn", header=T, sep="\t", check.names = F)

Apply pivot_longer():

df.longer.1 <- df.messy.1 %>%
   pivot_longer(
   !(variable), # all the columns except 'variable' are concerned
   names_to = "age", # new column
   values_to = "frequency", # where the counts will appear
   values_drop_na = TRUE # do not include NA values (providing NA values appear)
)
df.longer.1 # inspect

The age ranges are now grouped under a single variable: age.

Separate a character column into multiple columns

Tab. 2 was also featured as messy in a previous post because of its multiple variables stored in the same column. Indeed, each column apart from general_extender conflates two variables: city and socioeconomic status (WC = ‘working class’; MC = ‘middle class’)

general_extender	Reading_MC	Reading_WC	Milton.Keynes_MC	Milton.Keynes_WC	Hull_MC
and that	4	49	9	44	10
and all that	4	14	2	4	1
and stuff	36	6	45	5	62
and things	32	0	35	0	12
and everything	21	16	22	18	30
or something	72	20	30	17	23

Table 2. A messy data frame with multiple variables stored in the same column

Tidying the Tab. 2 involves two steps. We need to:

pivot the data frame (i.e. increase the number of rows and decrease the number of columns)
split each column into two distinct variables: city and socioeconomic status.

Load the data:

df.messy.2 <- read.table("https://bit.ly/34KGjER", header=TRUE, sep="\t")

pivot

Pivot the messy data frame with pivot_longer():

df.longer.2 <- df.messy.2 %>%
   pivot_longer(
   !(general_extender),
   names_to = "city_socioeconomicstatus",
   values_to = "frequency",
   values_drop_na = TRUE
   )
df.longer.2 # inspect

Split columns

Split each column into two distinct variables (city and socioeconomic status) with separate():

df.separate <- separate(df.longer.2,
   city_socioeconomicstatus,
   sep="_",
   into=c("city", "socioeconomic_status"))
df.separate # inspect

The data frame is now tidy.

More functionalities

There are of course more functionalities than the two I have illustrated above. Fore more details, Please refer to the tidyr documentation.

]]> https://corpling.hypotheses.org/3441/feed 0

Original Source | Taken Source

Are you new to Shiny?

The (humble) philosophy behind the (humble) app

How the app works

Assets

Limitations and Next Steps

Concluding Thoughts

What kind of AI are we talking about?

Why are LLMs so powerful now?

Why should we be concerned about corpora?

Stochastic parrots?

Traditional corpora are safe…

…but what about web-based corpora?

The ouroboros menace

Other issues beyond corpus linguistics

Breaking the cycle

Going further

References

Voyant Tools in a few words

Getting Started with Voyant Tools

Visit the Voyant Tools website

Input your text

Key tools

Cirrus

Reader

Trends

Summary

Contexts

Collocates

Additional tools

ScatterPlot

Collocates Graph

Assets and Limitations

Video recap

References

Getting Started

The lemmatized freqlist

The unlemmatized freqlist

Structuralist semantics vs. Prototype semantics

The soup experiment

Likert-scale survey

Step 1: clear the workspace

Step 2: install and load the required packages

Step 3: load the data with xlsx

Step 4: inspect the data

Step 5: modify column names

Step 6: prepare Likert scale labels

Step 7: data preprocessing

Step 8: define factor levels

Step 9: create and customize the Likert plot

Step 10: save the plot

Step 11: export the data

References

Dependency parsing

The pipeline

(Down)load the language model

Annotate a sentence

Plot the dependencies

Interpreting the graph

Choosing the right language model

References

UDPipe

Language Models

Load and pre-process the text

Annotate the text

Plotting frequency distributions

References

Shiny 101

The Shiny word-cloud app

From Twitter to Amazon

From gender-inclusive pronouns to gender-inclusive morphemes

Reference

Abstract

Slides

Extended bibliography

Case study

The dissimilarity matrix

Running MDS with cmdscale()

Combining MDS and k-means clustering

A comparison with HCA

References

Step 3: load the data with `xlsx`

Running MDS with `cmdscale()`

filter rows with `filter()`

arrange rows with `arrange()`

select columns with `select()`

add new variables/colums with `mutate()`

rename variable names with `rename()`

make grouped summaries with `summarise()`