Posts tagged data
Superlinguo
For those who like and use language
Linguistic Data Interest Group: Five years of improving data citation practices in linguistics
After five years with the Linguistics Data Interest Group of the Research Data Alliance, I’ve stepped down as a co-chair of the group. I wanted to use this as a chance to collate some of the work I was involved with over the last five years. I’ll still be a member of the LDIG for its next chapter, and I’m thrilled that Andrea Berez-Kroeker, Helene N. Andreassen, and Lindsay Ferrara will be heading things up. And, of course, the group has many excellent members (if you’re a linguist and/or interested in data management, you can join the LDIG too!).
The Austin Principles of Data Citation in Linguistics
The Austin Principles of Data Citation were the first major output of the LDIG; they focus on the why of data citation. This short document is a position statement on the importance of data in linguistic work. From the preamble:
Data is central to empirical linguistic research. Linguistic data comes in many different forms, and is collected and processed with a wide range of methods. Data citation recognizes the centrality of data to research. Furthermore, it facilitates verification of claims and repurposing of data for other studies.
The official Austin Principles website
The Tromsø Recommendations in academic publishing
If the Austin Principles are the why, the Tromsø Recommendations are the how.
The Tromsø Recommendations provide clear guidance for data citation for referencing language data, both in the bibliography and in the text of linguistics publications. The recommendations have been written to account for the rich variety of linguistic data, and include clear guidance and examples.
The official Tromsø Recommendations documents
Building uptake for the Tromsø Recommendations
Now that the Tromsø Recommendations have been published, there’s an ongoing campaign to normalise their use in academic publishing and grant writing. Get involved by encouraging your favourite publishers to include the Tromsø Recommendations in their author guidelines!
The Linguistics Data Interest Group (a working group in the Research Data Alliance) have developed the Tromsø Recommendations in collaboration with linguists working in a range of disciplines. The next step is to help encourage citation of data by encouraging journals to include the Tromsø Recommendations in their instructions for authors.
Publications
Alongside LDIG colleagues I was involved in a number of publications looking at the role of data in linguistics. Below are links to the Superlinguo posts with more information, the abstract and links to open access versions.
- Reproducible research in linguistics: A position statement on data citation and attribution in our field
- Situating Linguistics in the Social Science Data Movement. Chapter in the Open Handbook of Linguistic Data Management
- Data transparency and citation in the journal Gesture
- Reflections on reproducible research, in Reflections on Language Documentation 20 Years after Himmelmann 1998
- Putting practice into words: The state of data and methods transparency in grammatical descriptions
See also: The Superlinguo lingdata tag.
Adopting the Tromsø Recommendations in academic publishing
The Tromsø recommendations for citation of research data in linguistics
Data is central to empirical linguistic research. Linguistic data comes in many different forms, and is collected and processed with a wide range of methods. Data citation recognizes the centrality of data to research. Furthermore, it facilitates verification of claims and repurposing of data for other studies.
The Tromsø Recommendations provide clear guidance for data citation for referencing language data, both in the bibliography and in the text of linguistics publications. The recommendations have been written to account for the rich variety of linguistic data, and include clear guidance and examples.
The Linguistics Data Interest Group (LDIG, a working group in the Research Data Alliance) have developed the Tromsø Recommendations in collaboration with linguists working in a range of disciplines.
Making the Tromsø Recommendations part of academic publishing
The next step is to help encourage citation of data by encouraging journals to include the Tromsø Recommendations in their instructions for authors.
The LDIG has launched a campaign to encourage publishers, archives and other groups that present linguistic data to adopt the Tromsø Recommendations.
What would it look like to include the Tromsø Recommendations in academic publishing?
The Tromsø Recommendations cover all types of linguistic data, and have citation formats for all levels of detail - from a whole corpus to a single line of a text. They are designed to focus on content, not style, so they work with the formatting for a particular journal.
To include the recommendations can be as minimal as adding a line of text like this to a publication’s Information for Authors page:
<Journal Name> [encourages|requires] the citation of linguistic data in all published articles. Citation structure should follow The Tromsø Recommendations for Citation of Research Data in Linguistics. For more on the importance of data citation see The Austin Principles of Data Citation in Linguistics.
How can I help support the Tromsø Recommendations in academic publishing?
- If you are a journal or series editor, member of a journal editorial board or publisher, update your author instructions to encourage authors to include data citation.
- If you are a contributor to a journal in your field or popular publisher, write to the editors and ask them to consider adopting the Tromsø Recommendations. We have some suggested text for your email or letter in the next tab.
- If you are a member of a PhD program, a funding body or a body that gives a linguistics award, consider adopting data citation as a category for assessment.
- If you are a linguist or language scientist, encourage the journals, societies, funding and awards bodies you are a member of to take up the Tromsø Recommendations.
For this last group, we’ve created a spreadsheet that has email templates, background information and a tab where we are tracking which journals have been contacted, and whether they’ve adopted the Tromsø Recommendations: bit.ly/trecs-campaign
You do not have to be an LDIG or RDA member to be involved! Also, if the journals you publish in do not immediately adopt the Tromsø Recommendations for the whole publication, you can always adopt them yourself for the next thing you publish!
See also:
- The Austin Principles of Data Citation in Linguistics
- Reproducible research in linguistics: A position statement on data citation and attribution in our field (article)
- Data transparency and citation in the journal Gesture (article)
- Putting practice into words: The state of data and methods transparency in grammatical descriptions (article)
- Reflections on reproducible research, in Reflections on Language Documentation 20 Years after Himmelmann 1998 (article)
Linguistics Jobs: Interview with a Metadata Specialist and Genealogist
As someone who has built language archives, and spent a lot of time poking around in archives built by other people, I appreciate the importance of well-structured meta-data. It’s good meta-data that tells you what is in the giant pile of data you’re working with, making the whole process much less of a needle-in-a-haystack scenario. Mallory Manley is doing the important work of managing data across multiple languages in the field of genealogy. I appreciate Mallory’s honesty about the challenges of stepping sideways out of linguistics, and sharing that experience with us in this interview. You can follow Mallory on Twitter (@ManleyMallory).

What did you study at university?
I studied a Master of Arts in Linguistics at the University of Essex. My favorite subject in linguistics is morphology, so I continue to study it on my own.
What is your job?
I work for a genealogy company as a cataloguer. I receive digital copies of historical records and I organize them by place, record type (birth certificates, census records, etc), and year to prepare them to be published online. I am responsible for records coming from Scandinavia and South Eastern Europe.
How does your linguistics training help you in your job?
When I applied for this job, I had no working knowledge of the Scandinavian languages or the languages of Eastern Europe, except for Russian. I definitely oversold my abilities by stating in my cover letter that I could learn any language. But knowing how to analyse language has helped me learn these languages. And being able to identify patterns in language helps me read those documents when I get stuck on words I don’t know or simply can’t decipher. Learning the orthographies of each of these languages has also proved to be a challenge, partly because orthographies change over time, and partly because many of these languages didn’t have a standardized orthography at all until relatively recently. So even though I don’t use my linguistics training as much as I hoped I would in a career, it has helped me succeed in this role.
Do you have any advice you wish someone had given to you about linguistics/careers/university?
I think when we’re young and planning for our future, we get specific ideas about how our career path will look, and it becomes the only path we envision. I had to learn to be flexible and accept changes. My first year of college, I wanted to be a lexicographer (which I still think would be an awesome job). I ended up instead building a career in genealogy, and though it’s not where I expected or planned to be, it has been fulfilling and joyful.
Recent interviews:
- Interview with a Developer Advocate
- Interview with an ESL teacher, coach and podcaster
- Interview with a Juris Doctor (Master of Laws) student
- Interview with the Director of Education and Professional Practice at the American Anthropological Association
- Interview with a Research Coordinator, Speech Pathologist
Check out the full Linguist Jobs Interview List and the Linguist Jobs tag for even more interviews
New Article Published: Reflections on reproducible research, in Reflections on Language Documentation 20 Years after Himmelmann 1998
In 1998 Nikolas Himmelmann wrote “Documentary and descriptive linguistics”, an article for the journal Linguistics (I’m not going to link because it’s paywalled, but you should be able to find a copy floating around online). Himmelmann 1998 is an often cited work that makes a case for the work of documentation (collecting data) being as important as the work of description (describing data) in understanding how languages work.
To look at the 20 years since this publication, Bradley McDonnell, Andrea L. Berez-Kroeker and Gary Holton brought together over 30 linguists to reflect on how the field has changed and grown in that time. This is a Special Publication of Language Documentation & Conservation, available open access.
From the introduction to the special issue:
[W]e invited 38 experts from around the world to reflect on either particular issues within the realm of language documentation or particular regions where language documentation projects are being carried out. The issues addressed in this volume represent a broad and diverse set of topics from multiple perspectives and for multiple purposes that continue to be relevant to documentary linguists and language communities. Some topics have been hotly debated over the past two decades, while others have emerged more recently.
All of the chapters are short and to the point (they’re around 10 pages each). I had a great time writing with Andrea Berez-Kroeker about how language documentation has changed over the last two decades with regard to data management and sharing. Himmelmann talks about the importance of sharing primary data, but the nature of that data, and how we can share it, has changed so much since 1998.
Abstract
Reproducibility in language documentation and description means that the analysis given in descriptive publication is presented in a way that allows the reader to access the data on which the claims are based, to verify the analysis for themself. Linguists, including Himmelmann, have long pointed to the centrality of documentation data to linguistic description. Over the twenty years since Himmelmann’s 1998 paper we have seen a growth in digital archiving, and the rise of the Open Access movement. Although there is good infrastructure in place to make reproducible research possible, few descriptive publications clearly link to underlying data, and very little documentation data is publicly accessible. We discuss some of the institutional roadblocks to reproducibility, including a lack of support for the development of published primary data. We also look at what work on language documentation and description can learn from the recent replication crisis in psychology.
Reference
Gawne, Lauren and Andrea L. Berez-Kroeker. 2018. Reflections on reproducible research. In McDonnell, Bradley, Andrea L. Berez-Kroeker, and Gary Holton. (Eds.) Reflections on Language Documentation 20 Years after Himmelmann 1998. Language Documentation & Conservation Special Publication no. 15. [PP 22-32] Honolulu: University of Hawai‘i Press. [Open Access PDF]
New Journal Article in GESTURE: Contexts of Use of a Rotated Palms Gesture among Syuba (Kagate) Speakers in Nepal
A popular expression in Nepal is a fatalistically resigned ke garne? ‘what to do?’ The government office is closed, ke garne? The bus is running late, ke garne? When people say this, they also bring their palms up and rotate them inwards, with their thumb and index finger extended and the other fingers bunched in.
This gesture doesn’t just occur with this phrase; it turns up in all kinds of question-asking contexts, across the wider region of India-Nepal-Pakistan and beyond. This has been noted anecdotally before, and in this new paper for the journal Gesture, I look at the gesture and its use in detail for the first time. I’m very excited about this publication because it’s my first publication on gesture in Syuba, and my first publication in Gesture (if the name alone doesn’t give it away, it’s the journal in this field!).
(GIF from SUY-141022-03, Sangbu Syuba gesturing while he says ‘what do we say?’)
The data are archived with Paradisec, and I have made the clips containing the rotated palms tokens available through FigShare.
You can view the abstract on the journal website, and download the full text if you have institutional access. If you don’t, but you’d like to read the article, you can contact me for a pre-publication copy.
I’m also very excited that the paper has been included in a major review paper by Kensy Cooperrider, Natasha Abner and Susan Goldin-Meadow on ‘the palm-up puzzle’.
Abstract
In this paper I examine the use of the ‘rotated palms’ gesture family among speakers of Syuba (Tibeto-Burman, Nepal), as recorded in a video corpus documenting this language. In this family of gestures one or both forearms are rotated to a supine (‘palm up’) position, each hand with thumb and forefinger extended and the other fingers, in varying degrees, flexed toward the palm. When used independently of speech this gesture tends to be performed in a relatively consistent manner, and is recognised as an interrogative gesture throughout India and Nepal. In this use it can be considered an emblem. When used with speech it shows more variation, but can still be used to indicate the interrogative nature of what is said, even when the speech may not indicate interrogativity in its linguistic construction. I analyse the form and function of this gesture in Syuba and argue that there are a number of distinct functions relating to interrogativity. This can therefore be considered as a family of gestures. This research lays the groundwork for better understanding of this common family of gestures across the South Asian area, and beyond.
Reference
Gawne, Lauren. 2018. Contexts of use of a rotated palms gesture among Syuba (Kagate) speakers in Nepal. Gesture 17(1): 37–64. [Abstract]
Gawne, Lauren. 2018. Syuba Rotated Palms Gesture Tokens. figshare. Fileset. https://doi.org/10.4225/22/5b1a37144e1c1
Cooperrider, Kensy, Natasha Abner & Susan Goldin-Meadow. 2018. The palm-up puzzle: Meanings and origins of a widespread form in gesture and sign. Frontiers in Communication 3: 23. doi: 10.3389/fcomm.
Digital Daisy Bates - turning 90,000 words and 4,500 pages into an online portal to explore
In 1904, Daisy Bates printed out 500 copies of a survey and sent them to public servants and pastoralists in the western states of Australia. The survey listed around 1800 words, and Bates asked people to fill them in with the vocabulary of the local Indigenous population. Bates was something of an eccentric. She spent most of her adult life among Indigenous Australians, mostly in South Australia. Always dressing in Edwardian style until her death in the 1950s, in photographs she looks like a misplaced governess.
Eventually around 120 surveys were returned. For many years after Bates’ death they sat in boxes in an archive. Over the last few years, there’s been a project to turn these old surveys into an interactive digital corpus.
From an article by Nick Thieberger about the project:
There are 4,500 pages of typescript representing languages from the Southern South Australia/Western Australia border all the way up to the Kimberley. At least 123 speakers are named in the vocabularies and, even now, it’s not clear how many languages they represent.
The vocabularies preserved in the Daisy Bates questionnaires are extraordinarily precious as little else was recorded in the same time period, and nothing of the same scale has been attempted before or since.
The questionnaires she sent out contained some 2,000 prompt words and sentences in English, and asked each respondent to fill in as much as possible in the local Aboriginal language. It means that in addition to the lists of words totalling over 90,000 individual items, the collection includes grammatical information in the form of example sentences.
For every word there’s a map, showing all the tokens that were collected. They gently jostle each other on the page. You can see how broadly some words were in use, and local areas of variation. Clicking on a particular word takes you to a page where you can see the original handwritten survey answers, and an old typewritten transcript.

https://bates.org.au/word-maps/#wattle-tree

https://bates.org.au/word-maps/#kangaroo-red

https://bates.org.au/word-maps/#gum
I helped out briefly on an earlier stage of the project, where the digital scans all had to be renamed to sequence correctly. It’s exciting to see the final version online!
See also:
- My piece about Daisy Bates for Dangerous Women
- An earlier post about the Daisy Bates project
- Bringing back languages from scraps of paper (Nick Thieberger giving an overview of the project)
Communicating colours using black and white - a new app with a new perspective on language evolution
Can you use a string of black and white symbols to communicate colour? This is the premise behind the Color Game app, in which users create and solve puzzles matching colours to non-coloured symbols.
I’ve been enjoying coming up with ways to represent different colours for other players to decode, and also playing through puzzles created by others. Because humans are wonderfully clever and good at communicating, players often do better than chance at the puzzles.
Other than being entertaining, this app is also helping researchers better understand how language evolves. It was designed by scientists at the Max Planck Institute for the Science of Human History, who will use the anonymous data gathered from the game to understand how the players create an ever-changing symbolic vocabulary.
From the app’s press release:
Difficult as this may sound, players are able to reach the correct result more often than would occur by chance. Players also get better at it, as once-neutral symbols acquire meanings that they lacked at the start of the game. Players are creating a language together, in the very act of using it.
The Color Game Website, including links to download the app for Android and iOS: www.colorgame.net
Linguistic data form not only a record of scholarship, but also of cultural heritage, societal evolution, and human potential. Because of this, the data on which linguistic analyses are based are of fundamental importance to the field and should be treated as such. Linguistic data should be citable and cited, and these citations should be accorded the same importance as citations of other, more recognizable products of linguistic research like publications.
The Austin Principles of Data Citation in Linguistics (Berez-Kroeker et al. 2017)
The Austin Principles of Data Citation in Linguistics
The Linguistics Data Interest Group in the Research Data Alliance have spent the last year working on a document that explains why it’s so important for linguists to make their data ‘citable’. Linguistics bills itself as a discipline that uses empirical evidence to make claims, but so often the reader doesn’t have access to this underlying data to verify the claims (or test their own ideas!). The Austin Principles of Data Citation in Linguistics provide the reasons that citation should be central to all linguistic work. If you work with introspection, you may not have data to cite, but making clear that the data come from introspection is still an important, and often omitted, piece of information.
If you’re a linguist, you can endorse the principles
Even if you’re only just starting your linguistic research career, you can add your name to the list of linguists who endorse the principles. You can also use the principles to start conversations about data citation and management in your lab, class, department, linguistics society, or journal.
If you would like to help shape the future of linguistic data, you can also join us in the RDA Linguistics Data Interest Group (LDIG). This post from April explains what the LDIG is, and why it’s important.
From the Preamble of the Austin Principles:
Data is central to empirical linguistic research. Linguistic data comes in many different forms, and is collected and processed with a wide range of methods. Data citation recognizes the centrality of data to research. Furthermore, it facilitates verification of claims and repurposing of data for other studies.
The FORCE11 Joint Declaration of Data Citation Principles* state that “[s]ound, reproducible scholarship rests upon a foundation of robust, accessible data. For this to be so in practice as well as theory, data must be accorded due importance in the practice of scholarship and in the enduring scholarly record. In other words, data should be considered legitimate, citable products of research. Data citation, like the citation of other evidence and sources, is good research practice and is part of the scholarly ecosystem supporting data reuse.”
The FORCE11 Joint Declaration of Data Citation Principles is intentionally broad so to be as inclusive of data from as many scientific disciplines as possible. This document, the Austin Principles of Data Citation in Linguistics, interprets the FORCE11 document to address linguistic data specifically. These guiding principles have been created to enable linguists to make decisions about their data that ensure it is as accessible and transparent as possible. Some subfields of linguistics may already have specific guidelines for data citation; in these cases the Austin Principles can supplement extant guidelines to ensure that data citation conforms with current best practices.
What’s next for the Linguistics Data Interest Group
The Austin Principles of Data Citation are the first major output of the LDIG; they focus on the why of data citation. We’re now focusing on the how of data citation. This includes sharing the Austin Principles, developing more training on research data management, and also working with researchers, archives and publishers to develop data citation standards.
Reference
Berez-Kroeker, A. L., Andreassen, H. N., Gawne, L., Holton, G., Kung, S. S., Pulsifer, P., Collister, L. B., The Data Citation and Attribution in Linguistics Group, & the Linguistics Data Interest Group. 2017. Draft: The Austin Principles of Data Citation in Linguistics (Version 0.1). https://site.uit.no/linguisticsdatacitation/austinprinciples/ Accessed 13/12/2017.
Putting practice into words: New paper out about methods and data in descriptive grammar writing
Descriptive grammars take a lot of time to write, often based on years of fieldwork by the author. For many languages they are the only major source of information we have about how the language works. Although a grammar takes a lot of effort to produce, it’s not often made clear where the data come from, who the author worked with, or whether the data can be accessed by other people. Barbara Kelly, Andrea Berez-Kroeker, Tyler Heston and I decided to explore just how people are talking about their method of work and their research data in descriptive grammars. We have all done work in language documentation and description, and felt like there was a gap in how linguists talked about their work with each other and what they wrote in published grammars. This work is now published in the latest volume of Language Documentation & Conservation.
A survey of 50 published grammars and 50 dissertations
We looked at 100 grammars published between 2003 and 2012. We assessed how transparent they were in their data collection methods, looking at whether they explicitly mentioned things like how many people they worked with, and the recording equipment used. We also looked at whether researchers link from the grammar to the original example. This is important because it means that it’s possible to revisit the original recording or example.
Some of the findings:
- Fewer than a third of grammars mentioned the equipment used to collect data or software used to analyse it
- For over two-thirds of the grammars we have no idea where the data used in the analysis is now
- Over half of the grammars do not include any metadata for the examples, meaning it’s unclear where they came from, whether they’re elicited, or whether they were written or recorded.
- We can be optimistic about the future; dissertations performed better than published grammars, suggesting that a newer generation of scholars is aiming for greater transparency in their research
We don’t think linguists are doing bad research; we just think that we need to make the work we are doing clearer.
Abstract
Language documentation and description are closely related practices, often performed as part of the same fieldwork project on an un(der)-studied language. Research trends in recent decades have seen a great volume of publishing in regards to the methods of language documentation, however, it is not clear that linguists’ awareness of the importance of robust data-collection methods is translating into transparency about those methods or data citation in resultant publications. We analyze 50 dissertations and 50 grammars from a ten-year span (2003-2012) to assess the current state of the field. Publications are critiqued on the basis of transparency of data collection methods, analysis and storage, as well as citation of primary data. While we found examples of transparent reporting in these areas, much of the surveyed research does not include key information about methodology or data. We acknowledge that descriptive linguists often practice good methodology in data collection, but as a field we need to build a better culture with regard to making this clear in research writing. Thus we conclude with suggested benchmarks for the kind of information we believe is vital for creating a rich and useful research methodology in both long and short format descriptive research writing.
Reference
Gawne, Lauren, Barbara F. Kelly, Andrea L. Berez-Kroeker & Tyler Heston. 2017. Putting practice into words: The state of data and methods transparency in grammatical descriptions. Language Documentation & Conservation 11: 157-189. [Open Access PDF available here]
Linguistics Data Interest Group - New RDA group to improve data citation and transparency
A really big part of linguistic research is having data. Whether it’s a language documentation corpus, a set of experiments, or your own intuitions about how language works, this is all data on which analyses and theories are built.
While linguists have always relied on data, we’re not the best at making clear where the data we are talking about come from. There has been little incentive or support for how to do that. But it is important. Making it clear where your data come from, and giving other people access to it helps make linguistic research more reproducible.
I think that it’s important that we encourage linguists to cite where their data come from, and make that data more easily accessible to other researchers. That is why I’ve joined an awesome group of researchers to create the Linguistics Data Interest Group (LDIG) as part of the Research Data Alliance. I’m a co-chair of the group (!!!) along with Andrea L. Berez-Kroeker (U Hawai‘i), Susan S. Kung (U Texas) and Helene N. Andreassen (UiT The Arctic University). If you’re a linguist who works with data (i.e. any linguist) and you think that it’s important that we strive to do the best kind of science we can, then you’re welcome to join the LDIG through joining the RDA (it’s free to join, and puts you in a really great network).
I’m especially hoping that some people who are doing their PhD or PostDoc will join, because we’re often at the front line of data collection and management.
From the LDIG announcement:
The Linguistics Data Interest Group (LDIG) has been established through the Research Data Alliance (RDA) and aims to develop the discipline-wide adoption of common standards for data citation and attribution. In our parlance citation refers to the practice of identifying the source of linguistic data, and attribution refers to mechanisms for assessing the intellectual and academic value of data citations. The LDIG aims to encourage an international discussion of these topics, bolstering discussions that are already happening in specific sub-disciplines of linguistics in different countries.
The LDIG is for people who work with linguistic and language data. This work includes, but is not limited to, the collection, management and analysis of linguistic data. We encourage participation from academic and speaker communities.
You can see the LDIG draft Charter Statement on the RDA website (and leave a comment if you sign up as an RDA member).
Review: Women Talk More than Men… and Other Myths about Language Explained (Abby Kaplan)
Women Talk More than Men is a volume aimed at the undergraduate textbook market. Each chapter takes a ‘myth’ about language and deconstructs it, with careful and critical attention to research. This means that each chapter touches on a different theme in linguistics, including first language acquisition, second language acquisition, language and gender, sign language, sociolinguistic perception, and animal communication.
My initial skepticism about this volume was my own fault - before reading this I’d gone back and read The Language of Food, and In The Land of Invented Languages, two amazing books that are about people and their use of language, with some information about linguistics on the side. Women Talk More than Men, while not as lively and driven by anecdote, is a remarkably personable and compelling entry in the textbook genre.
As Gretchen at All Things Linguistic noted, framing arguments around ‘debunking myths’ can be problematic, as it can reinforce the presumptions you’re trying to challenge. I think the topics are well chosen, but I also find the ‘mythbusting’ a little uncomfortable; if you’ve never held that prejudice or presumption it can be hard to feel compelled by a chapter - for example, I’d never thought of signed languages as inferior to spoken language (please, I’m not flattering myself here, I didn’t suffer from such a bias because I don’t recall thinking about sign language *at all* before studying linguistics, which is really a more fundamental problem).
The content of each chapter is fairly uniformly excellent. After setting up the initial premise, it is critically situated within the domain of linguistics, and then deconstructed drawing on research. There isn’t a lot of cutting edge work, but it does touch on a lot of ‘classic’ papers. Each work is summarised, but also crucially appraised, with observations about the limitations of the method or the results. If you want to learn how to critically read research, this is the book for you. The length of each chapter is a bit varied (they’re anywhere from 19-35 pages long), and the content can be a bit unpredictable; for example, I hadn’t expected such detailed description of the different kinds of non-standard writing shortcuts that are used in text messages. The book does not shy away from linguistic terminology, but it does ensure that most of it is made accessible to those reading outside of a class syllabus.
While reading I could already see myself using different chapters of this book in future classes (and I don’t even teach at the moment). Lecturing staff will find this book incredibly useful, and some of the activities may be useful for those teaching smaller classes.
If you only read one chapter of this book, read this one:
I had not expected this, but at the end of the book is a whole chapter on critically reading statistical information. If you only read one chapter that will help you be a more critical reader, make it this one. It definitely throws a lot of information at you, but you can use it as a way to figure out what you need to improve on. There’s still the same engaging yet critical tone as the rest of the book, but it drills more deeply into the data set used to illustrate different concepts such as mean, standard deviation, correlation, and significance.
It should not be possible to finish an undergraduate degree in linguistics and still uncritically believe even one of those myths. It should not be possible to start a (post)graduate program without being able to make sense of the final chapter. Protolinguist me would have loved this book for its rigorous application of research evidence to answering questions about language and its use. Future me is going to love it as an excellent example of critical research appraisal for many years to come.
Kaplan, A. (2016). Women Talk More than Men. Cambridge University Press. ISBN: 9781107446908 (paperback), 9781107084926 (hardback)
See also: this thoughtful review of the book by Stan Carey (it’s what convinced me to give it a shot, which I’m very glad for):
Buy: Bookshop.org affiliate link, Amazon affiliate link
Check out the Superlinguo linguistics books list

Somehow prooffreader wasn’t on my reading radar until a friend (a self-confessed language and data visualisation geek) sent through a couple of links. David Taylor, who runs the site, does fun and thoughtful things with language data. Above is a table of the most decade-specific words. Click on the image, or head to prooffreader.com for a more detailed chart.
Taylor’s data visualisations are great, but so is the discussion - in the post for the chart above I learnt about Brigham Young University’s Corpus of Historical American English. In a couple of other posts I learnt more about some of the problems of working with the Google N-gram corpus, and in this post I learnt about why you should be skeptical of anyone using the US Social Security database to talk about baby naming trends.
I also love that if prooffreader isn’t geeky enough for you, there’s always prooffreader plus, which deals more explicitly with data-wrangling methodology.
Thanks to my good mate Hugh, prooffreader is on my regular reading list, and we’ll be sure to blog or tweet anything particularly interesting!
Lexical riches: Shakespeare vs. hip hop’s finest
“Literary elites love to rep Shakespeare’s vocabulary: across his entire corpus, he uses 28,829 words, suggesting he knew over 100,000 words and arguably had the largest vocabulary, ever.
I decided to compare this data point against the most famous artists in hip hop. I used each artist’s first 35,000 lyrics. That way, prolific artists, such as Jay-Z, could be compared to newer artists, such as Drake.”
- vocabulary analysis by Matt Daniels (designer, coder and data scientist). Read his full analysis at The Largest Vocabulary in Hip Hop
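For the curious, the basic idea of capping every artist at the same number of words and then counting the distinct ones can be sketched in a few lines of Python. This is only a rough illustration and not Daniels’s actual method: the function name `unique_words`, the very simple tokenisation (lowercase runs of letters and apostrophes) and the sample sentence are all my own assumptions.

```python
import re

def unique_words(text, limit=35000):
    """Count distinct word forms among the first `limit` words of `text`.

    Tokenisation here is an assumption: lowercase runs of letters and
    apostrophes. A real analysis would need to decide how to treat
    spelling variants, contractions, and so on.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens[:limit]))

# Hypothetical sample text, just to show the cap at work.
sample = "the cat and the hat and the bat"
print(unique_words(sample))           # counts across all eight words
print(unique_words(sample, limit=4))  # counts across the first four words only
```

Capping at a fixed number of words is what makes a prolific artist comparable with a newer one, because a longer text will almost always contain more distinct words than a shorter one.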
Tim Berners-Lee, the web and open data
Last week I went to see a public lecture by Tim Berners-Lee, the man who created the protocols for the World Wide Web - the shiny link-clicking front end of the internet we all take for granted. Not only did he create the web, but he is now one of the biggest advocates of open and accessible data.
It was a talk that touched on a lot of the problems facing modern digital societies; the problem with giving all your data to a single, non-sharing company (like Facebook) and the problem with governments trying to bully people into keeping data locked down (like the tragedy of Aaron Swartz). Because it was at the University of Melbourne he also talked a bit about the need to keep data accessible as part of research. This is something we talk about a lot in the field of language documentation, but not so much in other areas of linguistics - so today I thought that I would share the way I work to illustrate the kinds of ideas that Berners-Lee was trying to get across.
When I wrote the Lamjung Yolmo dictionary I didn’t just type all the words up in alphabetical order in a text document. I used a program called Toolbox, which allowed me to create a database of the language that can be used for other processes too. The program stores the underlying data in a .txt file (the more modern Fieldworks stores the same data as an XML file). This means that in the future, even if people don’t have Toolbox, they can still find the data useful for other projects because the data isn’t ‘locked’ into the program, and the underlying file is a common type with consistent formatting. Many projects involve specific programs that mean the data can never come out of them, and when people stop developing them your data can be ‘trapped’ in an outdated program.
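To give a sense of what ‘not locked in’ means in practice, here is a minimal Python sketch of reading a Toolbox-style text file without Toolbox. It assumes the common convention where each field sits on its own line and starts with a backslash marker (such as `\lx` for a headword, `\ps` for part of speech and `\ge` for an English gloss), and where a new `\lx` starts a new entry. The function name `parse_records` and the sample entries are invented for illustration; they are not real dictionary data, and a real file may use different markers.

```python
def parse_records(text, record_marker="lx"):
    """Split backslash-marked text into a list of {marker: value} dicts.

    Assumes one field per line, each starting with a backslash marker,
    and that `record_marker` opens a new entry.
    """
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("\\"):
            continue  # skip blank lines and anything without a field marker
        marker, _, value = line[1:].partition(" ")
        if marker == record_marker:
            current = {}
            records.append(current)
        if current is not None:  # fields before the first entry are ignored
            current[marker] = value.strip()
    return records

# Hypothetical sample entries (placeholders, not real language data).
SAMPLE = r"""
\lx exampleword
\ps n
\ge placeholder gloss

\lx otherword
\ps v
\ge another gloss
"""

for record in parse_records(SAMPLE):
    print(record["lx"], "-", record["ps"], "-", record["ge"])
```

Nothing here depends on Toolbox itself: because the file is plain text with consistent formatting, a dozen or so lines of general-purpose code are enough to get the entries out and into whatever a future project needs.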
My data will still be accessible after Toolbox and I are gone, because the programs I use create open type files, and because I archive it all with Paradisec (but that’s a story for another day). But even all of this doesn’t mean my research is always accessible.
For my PhD I was awarded an APA stipend. This was just over $70,000 (Australian) over three and a half years. That money came from the Australian tax paying public (for which I am always grateful, just don’t ever figure out what that works out to an hour, it’s a little grim). So, I was paid money that came from the public to do this research, but what happens when I want to publish and share my findings? It’s likely that I will write journal articles and, if someone is interested/stupid enough to green-light it, perhaps a book. Almost all of this kind of output is part of a business model that does not accept openness of information. Even if you are a member of the Australian public I cannot just give you copies of my work because now a publishing company has the rights to it; I can’t even give other colleagues copies according to the terms and conditions of contracts with such publishers. Things are changing slowly - for example the excellent journal Language Documentation and Conservation (in which I have this joint paper about tools for language elicitation) bypasses the traditional publishing model, meaning that I can distribute the paper to as many people as I like under a Creative Commons license and not be charged for sharing my own work. It doesn’t sound like a big step, but these kinds of processes mean that we’re becoming more aware of the way in which we share our work.
