Project Background

This project is being undertaken through the Brackenridge Research Fellowship at the University of Pittsburgh. The fellowship supports research in all disciplines of academic study with an eye to encouraging interdisciplinary work among undergraduates.

This project investigates how the restrictive length limit on tweets affects register shifting in discourse on Twitter. Part of the inspiration for undertaking this research stems from the work done in Computational Methods in the Humanities (LING 1050) at the University of Pittsburgh.


The research I propose investigates sociolinguistic trends in posts (known as tweets) to the social media site Twitter. Specifically, I am interested in examining the use of different linguistic registers in addressing other users on Twitter. I seek to discover how users reconcile the requisite brevity of posts to Twitter, which limits users to 140 characters or fewer, with the varying levels of formality associated with talking to friends as opposed to celebrities or strangers. The length limitation imposed by Twitter often impedes the use of the grammatically correct, well-formed language required in formal writing. The use of linguistic register within the Twitter community has not been explored extensively, and much of the existing discussion has been largely informal in nature.

Linguistic register is the variety of language a speaker uses in a given situation, and speakers shift register all the time without conscious thought: one register is used to talk to professors, another for friends, another for close family, another for one's grandparents. One would not use the same kind of language to talk to one's grandmother as to one's friends; one avoids slang and vulgar language in an academic setting, and the language used in a formal presentation is not the language used in conversation. Nor is this just a phenomenon of English: languages like Japanese have special verbs used only in honorific or humble situations, and different structures that can raise or lower the politeness of a sentence to suit any situation. This sort of shift takes place effortlessly most of the time, but relatively new forms of communication such as Twitter and other social media sites may be interfering with the process.

In response to informal claims that the current generation's language is negatively affected by modern communication tools like Twitter, Mark Liberman undertook a brief analysis comparing the inaugural addresses of various Presidents, published on the University of Pennsylvania's popular linguistics blog "Language Log". Remarkably, he found a significant trend toward shorter sentences and words over the last 200 years. My research, while not addressing this claim directly, will show whether using these services affects users' ability to shift linguistic register to match the situation, as they would normally be expected to do.

If users are failing to shift registers appropriately, this would suggest that some aspect of using Twitter gives users a false impression of closeness, allowing them to speak informally even in situations that would require otherwise if the interaction were in person. I expect to find that this is generally the case: users do not bother shifting linguistic register to interact with other users, regardless of the nature of their relationship. The convenience of shortening a message by any means necessary to say what needs to be said will override any pull toward grammaticality or formality, regardless of situational factors. Although any trend of this nature cannot be attributed solely to Twitter in this study, the presence of such a trend within the corpus of tweets would still indicate that some factor is allowing users to refrain from using different linguistic registers in their communications.

In order to ascertain this, I will use a corpus of tweets compiled by the National Institute of Standards and Technology. This corpus consists of an estimated 16 million tweets collected between January 23 and February 8, 2011. According to the corpus homepage, it is designed to be a representative sample of posts. Of these 16 million tweets, I will work only with messages directed at other users (denoted by @username, typically at the beginning of the tweet) and, of those, only messages posted in English.

Once the corpus is acquired, I plan to use XML markup to annotate particular features of each tweet, such as its register and the relationship of the user to the person being messaged. The register of a tweet will be assessed from its use of slang, chatspeak, and other informal language, and from its grammaticality. Relationship will be assessed from past interactions with the messaged user and from the verified account status of the messaged user as opposed to the messaging user (a service provided by Twitter). Given the large size of the corpus, I will use regular expressions, a pattern-matching notation used in computer programming, to filter out messages in other languages and messages not directed at another user.
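As a rough sketch of the filtering step, the following Python fragment pulls out tweets that open with an @username mention and separates the addressee from the message body. The assumption that tweets arrive as plain text lines is illustrative; the corpus's actual file format may differ, and the username pattern (1-15 word characters) follows Twitter's documented handle rules.

```python
import re

# Twitter usernames are 1-15 letters, digits, or underscores.
MENTION = re.compile(r'^@(\w{1,15})\b')

def directed_tweets(lines):
    """Yield (addressee, body) pairs for tweets that open with @username."""
    for line in lines:
        text = line.strip()
        m = MENTION.match(text)
        if m:
            yield m.group(1), text[m.end():].lstrip()

sample = [
    "@friend lol see u at 8",
    "just setting up my account",   # not directed at anyone; dropped
    "@Support my order never arrived. Could you help?",
]
for addressee, body in directed_tweets(sample):
    print(addressee, "->", body)
```

Language filtering is a harder problem than mention detection; in practice it would likely rely on the corpus's own language metadata or a dedicated language-identification step rather than regular expressions alone.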

By using regular expressions I can also auto-tag features that take a common form, like the body of a tweet (which always follows @username and a space). To begin, I will use a random sample of the corpus to fine-tune my markup and iron out any problems with the auto-tagging. If even this smaller pool of tweets proves unruly to mark up by hand, it may be useful to adjust my markup scheme to automate the assessment of formality and of relationships between users. Time allowing, I will then expand my focus and work with the entire corpus. After markup, the data can be analyzed and summarized using XSLT, a language for transforming XML markup in useful ways.
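The auto-tagging step might look something like the sketch below, which wraps one directed tweet in illustrative XML. The element and attribute names, and the crude chatspeak word list used to flag obviously informal register, are placeholders of my own invention; the real markup scheme and register criteria will be refined against a hand-tagged sample.

```python
import re
from xml.sax.saxutils import escape

MENTION = re.compile(r'^@(\w{1,15})\s+')
# Very rough informality cues; real register assessment will be done by hand.
CHATSPEAK = re.compile(r'\b(u|lol|omg|plz|thx|gonna|wanna)\b', re.IGNORECASE)

def tag_tweet(raw):
    """Wrap one directed tweet in placeholder XML markup, or return None."""
    m = MENTION.match(raw)
    if not m:
        return None
    body = raw[m.end():]
    register = "informal" if CHATSPEAK.search(body) else "unassessed"
    return ('<tweet register="{r}"><to>{to}</to><body>{b}</body></tweet>'
            .format(r=register, to=escape(m.group(1)), b=escape(body)))

print(tag_tweet("@friend omg did u see that??"))
```

Once tweets are wrapped in XML like this, an XSLT stylesheet can count and cross-tabulate register values against relationship attributes without further custom programming.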

I will be consulting with Professor Birnbaum and Professor Han, who have experience in digital humanities and linguistics, respectively. My exposure to digital humanities through Professor Birnbaum's new course, Computational Methods in the Humanities, this semester is what inspired me to pursue this research. When I expressed this interest to a professor in the Linguistics Department, I was directed to Professor Han for her experience with both computational linguistics and corpus linguistics. Computational linguistics applies computational processes to model or parse language; one example of the latter is the interpretation of user input to a search engine such as Google. Corpus linguistics is the study of linguistic phenomena in a body of texts collected into a corpus. Both fields use methods that are complementary to, and partly overlap with, those of digital humanities.

This fellowship would grant me the opportunity to improve as a scholar in a way my regular coursework cannot. Through performing this research, I hope to hone my skills as a digital humanist and gain first-hand experience doing independent research. I also hope to gain insight into the field of corpus linguistics, an area of linguistics I have had little exposure to before this year. The experience I gain will be an invaluable asset whether I enter the workforce or pursue higher education after graduation.