Twitter Register Variation: Creating and Processing the Corpus

Janis Chinn

University of Pittsburgh

I. Abstract

Register is the variety of language a speaker uses in a given situation (Reid, 1956). Speakers shift register without conscious thought all the time: one speaks differently with one’s grandparents than with one’s parents, friends, or professors. In written language, formal register is usually reflected in correct spelling, punctuation, and a more advanced vocabulary. Posts on Twitter, however, have a 140-character limit which impedes a user’s ability to employ these register-raising techniques.

In this paper, I use computational methods to investigate the extent to which users attempt to shift register in their communications (1) not directed at any user in particular, with no direct recipient; (2) directed at celebrities and other users with verified accounts, such as CNN; and (3) directed at users with non-verified accounts, the average user. To this end, I will collect a corpus of one million tweets programmatically and process it into a database of tweets which I can then query to assess certain linguistic properties of the corpus at large. Computational methods allow us to work with a large corpus without examining each individual message (which would likely produce highly inconsistent results and take an inordinate amount of time), and instead to pinpoint unusual messages for closer investigation as they arise. The results of this research may lend some support to claims that social media sites like Twitter are affecting users’ linguistic abilities by allowing users only 140 characters with which to express themselves (Liberman, 2011), although no concrete claims could be made without further research. If users are not switching registers appropriately on Twitter, it would be useful to explore whether they switch registers in other contexts both on- and off-line. This could show whether some particular aspect of Twitter causes this failure to use the appropriate register, or whether users more broadly feel there is no need to shift registers online. To limit the scope of the investigation to English messages, I will use user-reported information combined with language identification tools and other filtering methods detailed below. The messages are gathered via the Twitter API and will be judged for register based on the presence or absence of certain computationally identifiable features.
With this research, I hope to lay the groundwork for future research investigating register variation and changes in linguistic competency on a broader scale.

II. Introduction

Register variation in spoken language is typically unconscious; speakers switch between formal and informal language effortlessly to match the situation and audience at hand. A common example is the language used with one’s grandparents as opposed to one’s close friends or one’s boss. Each of these audiences requires different language use: one is respectful with one’s grandparents, but still familiar; the language used with a friend is often extremely informal and might include slang or profanity depending on the friend and exact relationship; while the language used with one’s boss must be strictly respectful and professional. In written language, however, register variation tends to be a more conscious effort. Indicators of written register include the correct use of punctuation, spelling, vocabulary, and general writing style (Biber, 1995; Ferrara, Brunner, & Whittemore, 1991). Twitter’s 140-character limit thus constrains a user’s ability to exploit all of these features as fully as they might otherwise. This paper seeks to discover whether this constraint eradicates or simply limits the appropriate shifting of linguistic register in three social environments: messages with no explicit recipient, messages to a celebrity or otherwise socially-elevated recipient, and messages to a socially equal recipient.

Twitter is a social micro-blogging site which has been on the web since 2006; it had over 300 million members as of 2011 (Taylor, 2011), in contrast to the 106 million members reported in 2010 (Bamman, 2010), and of those 300 million, nearly 20,000 are verified users (Twitter, 2012). The verified accounts system is a relatively recent measure implemented by Twitter to give both users and celebrities some confidence that interactions between them are limited to accounts confirmed by Twitter to be legitimate. Celebrities can have their accounts verified, which labels their account page specially, assuring users that any communication with the verified account is official and keeping imposters from posing as the celebrity and misrepresenting them (Twitter, 2012). Verification is a long and complicated process, so not just anyone can have their account verified; the system is limited to this special group.

Users may elect to lock their accounts, making their messages visible only to user-approved followers, or to remain in the public timeline (the default setting), which means that anyone on the internet, Twitter user or otherwise, may view that user’s messages, or tweets. These messages can be only 140 characters long; any longer and the message is truncated at 140 characters, so brevity is an important skill to acquire. Since by default any user can communicate with any other user instantaneously, an illusion of proximity is created (Deseriis, 2012). This may influence users’ perceived need to shift register at all, although other forms of written communication, such as letter writing and email, are known to still require register shifting. Twitter messages come bundled with a good deal of metadata aside from verified vs. non-verified status, such as time zone, user information, and information about the tweet itself.

To direct a message at a particular user, one includes what is known as an @recipient, which consists of an @ sign, which tells Twitter that the message has a recipient, followed by the target user’s username. CNN’s Twitter account, then, can be messaged with the following syntax: “@CNN”. One further variety of message common to Twitter, the retweet (symbolized RT), allows User A to re-send User B’s message to User A’s own timeline, sharing the message while maintaining source attribution. Twitter now supports retweeting natively through a relatively new system; however, it is also possible to perform a manual retweet, which allows User A to insert additional commentary.
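The @recipient syntax makes a tweet’s social context computationally recoverable. As a rough illustration, a tweet could be sorted into the three situational categories studied here with a few lines of Python; the `verified_usernames` set here is a hypothetical stand-in for the list of roughly 20,000 verified accounts.

```python
import re

def classify_tweet(text, verified_usernames):
    """Sort a tweet into one of three situational categories:
    'public' (no @recipient), 'verified' (directed at a verified
    account), or 'non-verified' (directed at an ordinary user)."""
    match = re.match(r'@(\w+)', text)
    if match is None:
        return 'public'          # no leading @recipient: public timeline
    if match.group(1).lower() in verified_usernames:
        return 'verified'
    return 'non-verified'
```

A manual retweet (“RT @user …”) would need separate handling, since its leading text attributes rather than addresses; the sketch above ignores that case.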

III. Literature Review

Previous research on this exact topic has not been undertaken; however, research on the language and nature of tweets and research on the computational ways a humanities scholar should approach register analysis have both been performed. Bamman’s (2012) project, the Lexicalist, analyzes a corpus of tweets for demographic information such as location and age to build a dialect map depicting real-time changes in the usage of English in the US. The Lexicalist uses the embedded geographic information which users can allow Twitter to supply in order to determine exact locations for users, and certain statistics on name trends in the US can give a rough estimate of age. Given the immense potential size of a corpus of tweets, Bamman is able to discard any tweets where this information is not reasonably assured, as happens when users do not allow exact geographic data and specify only a vague location such as “Springfield” or “Somewhere over the rainbow”, or when a name has been common over multiple generations and cannot be used to pinpoint age (Bamman, 2010). Eisenstein, O’Connor, Smith, and Xing conduct similar research, comparing explicit geographic data with speech communities and regional and topic variation (Eisenstein, O'Connor, Smith, & Xing, 2010). Russ also analyzes regional variation by examining geotagged tweets, taking advantage of the diverse demographics and the vast quantity of data available for analysis (Russ, 2012).

Biber, Conrad, and Reppen (1998), in their introduction to corpus linguistics, discuss various approaches to linguistic analysis. Particularly relevant to this discussion is their treatment of register analysis and how it manifests in a quantitatively measurable way. By using computational methods to assist in corpus analysis, greater quantities of data become manageable. Furthermore, the results of such wide-scope research become more reliable given their quantitative grounding, which affords more repeatable results. Most importantly, Biber et al. point out, corpus studies must be careful to present not just quantitative data, but also qualitative interpretations of that data (Biber, Conrad, & Reppen, 1998). In an earlier paper, Biber (1995) emphasized that computational analysis of register must be multi-dimensional in nature and reflect the nature of the work(s) being studied. In a multi-dimensional approach to quantifying register variation, multiple parameters contribute to determining the register used in a given text. Automated generation of these parameters based on a computational analysis of the corpus is an important feature of this method, as it allows researchers to focus on features which are actually present in the text and decreases the room for human error in selecting features for analysis (Biber, 1995). Using just a single measure to examine trends in word length in speeches given by US Presidents, Liberman demonstrates that mean word length, and accordingly sentence length, have both decreased with time (Liberman, 2011). Unfortunately, this sort of one-dimensional analysis does little to explain what is going on or why, although it is an interesting trend. Liberman’s research stems from the idea that social media sites such as Twitter are detrimental to the language of younger generations (Liberman, 2011).

Past research was more easily undertaken before Twitter changed its terms of service to prohibit the sharing or publishing of any collection of tweets (Watters, 2011). The Library of Congress was recently granted access to every tweet ever published in the public timeline. Unfortunately, the Library of Congress will not release this corpus to the public for several years to come (Raymond, 2010). Once these tweets have been released for public use, they can serve as an invaluable tool for linguistic analysis; in the meantime, however, researchers will each have to build their own corpora according to their specific needs.

Joos, one of the first to formalize discussion of linguistic register, identifies “five main styles which he termed intimate, casual, consultative, formal and frozen” (Joos, 1967). The main basis for these styles comes from two broad divisions, private and public, which are further divided into a total of four categories: under private, the intimate and the casual; under public, the consultative and the formal. He notes that the intimate style is often marked by shorter forms and is elliptical in nature. Casual style is used with less familiar social relationships and involves more explicit communication; jargon is more likely to be used than slang due to the nature of the in-group relationship. Consultative style, on the other hand, is reserved for semi-formal situations, with a conscious avoidance of slang, although jargon may still occur; more complex sentence structure is another important feature of this style. Formal style is typically employed in presentations and papers, where communication is one-way and one-to-many. The final style, frozen, is reserved for artistic literary style (Joos, 1967). Joos bases these styles on the presence and absence of shared background knowledge between parties, which allows participants to use less and less explicit communication styles and increasing amounts of slang and jargon. “These styles have the function, among other things, of establishing and defining the social relationship between communicants, and are manipulated as markers of group membership and social distance” (Joos, 1967).

Ferrara, Brunner, and Whittemore, in their 1991 study, investigate the use of register within a corpus of written online communication. They determine that register use is reduced, and that conventions regarding what register is appropriate in online communication were still being worked out (Ferrara, Brunner, & Whittemore, 1991). The study found that many users forwent proper capitalization and punctuation in online communication (or computer-mediated communication, as Ferrara et al. refer to it), and many users were inconsistent even within their own communications across the corpus. The widespread popularity of the internet as a mode of communication affords immediate and diverse communication, changing at a rapid rate. This suggests that the study of internet discourse offers the chance to investigate language change in progress (Ferrara, Brunner, & Whittemore, 1991).

Register variation online is demonstrably reduced, although still present in internet communication (Ferrara, Brunner, & Whittemore, 1991). The medium provided by Twitter, of 140-character messages which can be sent online or via text message, is a new and unique form of communication which has witnessed rapid changes since its inception in 2006. Studying the presence of register variation in this relatively new medium offers a chance to re-validate Ferrara et al.’s work. The different types of posts (public, directed to non-verified users, and directed to verified users) allow insight into the use of register in a number of different settings which do not often co-occur. Although Ferrara et al.’s work is thorough and insightful, the internet and the nature of online communication have changed greatly in the last two decades, and their findings are thus somewhat dated. This paper seeks to determine the ways users vary register, and the extent to which they do so, in these three situations. Based on Ferrara et al.’s work, it is expected that register usage will persist, although in a more reduced form than in other media for communication.

IV. Methodology

The corpus is built based on a tutorial for writing a Python script which accesses Twitter’s Streaming API, collecting one million tweets each time the script runs (Paul, 2010). These tweets are delivered in the JSON format, a lightweight and human-readable text-based format for representing simple data structures. The tweets in question are all part of the public timeline on Twitter; their users have all elected to allow their tweets to be seen by anyone on the Internet without restriction, as discussed above. Although the users have made this decision, Twitter’s Terms of Service indicate that no collection of tweets may be released to the public, so no publication of the corpus is possible.
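The Streaming API delivers one JSON object per line. A minimal sketch of loading such a capture into Python objects might look as follows; this is not the collection script itself (which is adapted from Paul, 2010), and it assumes only that each tweet object carries a `text` field, as Twitter’s JSON format does.

```python
import json

def load_tweets(path):
    """Read newline-delimited JSON tweets from a saved stream capture,
    skipping blank lines, truncated records, and non-tweet events."""
    tweets = []
    with open(path, encoding='utf-8') as stream:
        for line in stream:
            line = line.strip()
            if not line:
                continue
            try:
                obj = json.loads(line)
            except ValueError:
                continue  # a record cut off by a dropped connection
            if 'text' in obj:  # delete notices etc. lack a text field
                tweets.append(obj)
    return tweets
```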

Following some housekeeping to remove illegal characters and any duplicate tweets resulting from glitches in the API, a Python package called Pyxser transforms the JSON output into XML. An XML-based corpus grants researchers access to the whole wealth of XML-related technologies. The Pyxser output is cleaned, transformed with XSLT (Extensible Stylesheet Language Transformations), and pruned to remove extraneous elements such as “profile background color” to ease future processing of the corpus. In order to make judgments on register variation, the corpus must contain only tweets the researchers are equipped to analyze. These changes to the corpus are made on a server running the CentOS distribution of Linux and executed through a number of hand-crafted shell scripts employing Bash, Perl, regular expressions, and Python. To accomplish this filtering, the corpus is transformed once more, so that all tweets meet a set of requirements specified by me with input from my two faculty advisors, Professor David J. Birnbaum and Professor Na-Rae Han.
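Pyxser’s real output is considerably more verbose than needed; the XSLT pruning step aims at something closer to the minimal structure sketched below. The element names here are illustrative choices, not the project’s actual schema.

```python
import xml.etree.ElementTree as ET

def tweet_to_xml(tweet):
    """Build a pruned <tweet> element from a parsed JSON tweet,
    keeping only fields relevant to register analysis."""
    user = tweet['user']
    root = ET.Element('tweet', id=str(tweet['id']))
    ET.SubElement(root, 'text').text = tweet['text']
    u = ET.SubElement(root, 'user',
                      verified='true' if user.get('verified') else 'false')
    ET.SubElement(u, 'time_zone').text = user.get('time_zone') or ''
    ET.SubElement(u, 'lang').text = user.get('lang') or ''
    return root
```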

  1. Tweets must contain only English alphabet characters (as specified by the ASCII character set)
  2. A tweet’s time zone must be set to one of the time zones in the US or Canada
  3. A tweet’s user’s language setting must be English
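
In code, these three criteria amount to a simple predicate over each tweet’s text and user metadata. The sketch below assumes Twitter’s JSON field names (`text`, `user.time_zone`, `user.lang`); the set of US and Canadian time zone labels would be enumerated from Twitter’s own list.

```python
def passes_initial_filter(tweet, us_canada_zones):
    """Apply criteria 1-3: ASCII-only text, a US or Canada time zone,
    and an English user-language setting."""
    try:
        tweet['text'].encode('ascii')                 # criterion 1
    except UnicodeEncodeError:
        return False
    user = tweet['user']
    if user.get('time_zone') not in us_canada_zones:  # criterion 2
        return False
    return user.get('lang') == 'en'                   # criterion 3
```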

These criteria, however, prove insufficient. Although the ease of acquiring more tweets means it is reasonable to throw away tweets according to such broad criteria (Bamman, 2010), non-English tweets remaining in the corpus will corrupt any statistical analysis of language usage. Improving the quality of the corpus requires better identification of non-English tweets within the working corpus. Ideally, a language identification tool would allow an easy classification of English vs. non-English text; unfortunately, given the small size of each text sample to be identified, such tools are not particularly accurate. A language identification tool known as TextCat was identified as the most usable of these tools, and after observing it in action, a new filtering scheme was devised. TextCat produces a ranking of possible languages for each text sample it examines. Although these judgments are not generally accurate in identifying what is English, they are more often accurate in identifying what is not English. Based on this information, tweets are filtered both on the original three criteria detailed above and on new criteria derived from TextCat’s categorization.

  4. The language cannot be identified by TextCat
  5. The language is identified as either English or Scots
  6. The language is not identified as Malay, Indonesian, Spanish, or Tagalog in any of the first three categorization positions
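
The keep-logic, given TextCat’s ranked guesses for one tweet, can be sketched as a single disjunction. This sketch treats an empty ranking as TextCat failing to make any judgment, assumes lowercase language names as output labels, and reads the Malay/Indonesian/Spanish/Tagalog criterion as an exclusion of those languages, as the surrounding discussion of “frequent offenders” implies.

```python
def keep_by_textcat(ranking):
    """Return True if any one of criteria 4-6 holds for this tweet's
    TextCat ranking; any single criterion suffices to keep the tweet."""
    offenders = {'malay', 'indonesian', 'spanish', 'tagalog'}
    if not ranking:
        return True                          # criterion 4: no judgment made
    if ranking[0] in ('english', 'scots'):
        return True                          # criterion 5
    if not offenders & set(ranking[:3]):
        return True                          # criterion 6: no offender in top 3
    return False
```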

Whereas criteria 1-3 were all required to be true for a tweet to remain in the corpus, once all tweets not conforming to those three criteria have been removed, satisfying any one of criteria 4-6 is sufficient to preserve a tweet in the corpus. The reasoning for employing criteria 4-6 may not be as apparent as that for 1-3. As previously stated, TextCat is not 100% reliable, especially when working with such potentially minuscule text samples. These small sample sizes mean TextCat is sometimes unable to make any judgment of the language used; to avoid consistently losing relevant English-language data, these failed judgments are preserved (4). Keeping the English-language tweets in (5) is an obvious step; since Scots and English are syntactically quite similar, many English tweets are misidentified as Scots and must therefore be preserved as well. The languages listed in (6) are the most frequent offenders that criteria 1-3 missed, so any tweet in which one of them appears among TextCat’s top three guesses is discarded unless criterion 5 applies. Since time zone and user language are both user-specified fields, they are open to some level of error; they are still more reliable than the location field, which is free-form text, whereas time zone and language are selected from a pre-defined list of possible options. The effectiveness of criteria 4-6 has yet to be determined. Given the low yield rate of each initial corpus file, the Python script accessing the Twitter Streaming API must be run multiple times. This project will work with an initial corpus size of one million tweets, which can easily be augmented if necessary once analysis begins. Three random samples of the corpus, of 100 tweets each, show the corpus to be 98% English, an acceptable level of noise for a corpus of this size.

Tweets will be assessed for formal register according to a number of criteria, with informality judged along a continuum based on the presence or absence of these features. Although the features have not been computationally selected as Biber (1995, 1998) suggests is a necessity, they have been chosen based on a combination of informal, careful observation of language use within internet communities and the features computationally selected by Biber (1995, 1998) in his research, and they will be fine-tuned once the corpus is in a workable state. Biber’s features were not adopted wholesale because the nature of his corpus differs greatly from a corpus of tweets in terms of length and communicative purpose. The following are the primary linguistic features to be used in analysis:

  (a) the use of expletives and profanity;
  (b) the rate of non-dictionary word usage within a tweet;
  (c) the average word length of dictionary words per tweet;
  (d) the appropriate application of capitalization;
  (e) the presence of correctly applied standard punctuation, with special attention to the rate of non-alphanumeric character use in a tweet and the use of numbers and symbols to emulate letters, as in the internet ‘language’ known as leetspeak (or, more commonly, 13375P34|<);
  (f) the use of chatspeak, which is commonly found in text messages and brief communications online and often employs shortened forms of words and the substitution of numbers whose pronunciation is similar to the omitted letters, as in “gr8” for “great”;
  (g) the ratio of function words within a tweet.

Proposed additions to this set of criteria, to be evaluated once the corpus is suitably filtered down, are: (h) an analysis of word n-grams and character bi-grams, and (i) the prescriptive use of ‘whom’ over ‘who’.
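Several of these features reduce to straightforward string measures. The following illustrative sketch computes a handful of them; the chatspeak list and the dictionary argument are tiny hypothetical stand-ins for the real lexical resources.

```python
import re

CHATSPEAK = {'gr8', 'u', 'ur', 'b4', 'l8r', 'plz'}  # illustrative subset

def informality_features(text, dictionary):
    """Compute a handful of the register measures described above for
    one tweet; `dictionary` is a set of lowercase English words."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    in_dict = [w for w in words if w.lower() in dictionary]
    return {
        'nondict_rate': (1 - len(in_dict) / len(words)) if words else 0.0,
        'avg_word_len': (sum(map(len, in_dict)) / len(in_dict)
                         if in_dict else 0.0),
        'has_chatspeak': any(w.lower() in CHATSPEAK for w in words),
        'starts_capitalized': bool(text) and text[0].isupper(),
        'terminal_punctuation': text.rstrip().endswith(('.', '!', '?')),
    }
```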

By using a multi-dimensional approach, tweets in different situational contexts can be compared based on the number and combination of informality and formality markers present. The features selected are generally socially stigmatized as examples of poor writing, and thus of informal style. It is important to track not only non-dictionary words but also the use of leetspeak and chatspeak, because these are among several techniques for shortening a tweet without having to remove ideas. These techniques have been informally observed by researchers, and this study seeks to formalize not just whether Twitter users are shifting registers, but also what methods they employ in doing so. Suppose, for example, that a message requiring formal register runs to more than 140 characters; there are a number of potential solutions. Often, function words and punctuation are the first to go, followed by the vowels within words, and then by chatspeak, all in order to save valuable character space. I hypothesize that the extent to which a user is willing to follow this path of message-shortening without losing meaning will vary depending on the required register of the tweet. That is, messages to non-verified accounts or the public timeline are expected to exhibit more of these features than messages to verified accounts, despite each feature’s typical status as a marker of informality.

The corpus will be accessed via a database in eXist, a database management system built on XML technology. The tweets will be indexed as they are uploaded to improve processing time, and can then be accessed through XQuery, an XML-aware query language. Complex queries can then be constructed to look at, for example, only tweets referring to a verified account, or only tweets that do not specify an @recipient. Basic statistics generated by querying the corpus in this way can be presented as graphs and charts to allow easy visual identification of interesting data points for further analysis. Once computational methods have pinpointed these areas, detailed human analysis can be undertaken. Based on preliminary findings, alterations to the features identifying registers, or further filtering, may be necessary. After this process has been suitably fine-tuned, it can be applied to a much larger corpus to afford more accurate results.
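The production corpus will be queried with XQuery inside eXist; as a rough Python analogue of the kind of selection involved, the standard library’s limited XPath support can express the same queries over a toy corpus. The `recipient` attribute here is a hypothetical pre-computed field, not part of Twitter’s own metadata.

```python
import xml.etree.ElementTree as ET

corpus = ET.fromstring("""
<corpus>
  <tweet recipient="verified"><text>@CNN Thank you for the report.</text></tweet>
  <tweet recipient="none"><text>just got home</text></tweet>
  <tweet recipient="non-verified"><text>@bob gr8 game 2nite</text></tweet>
</corpus>
""")

# Only tweets directed at a verified account:
to_verified = corpus.findall("tweet[@recipient='verified']")
# Only tweets with no @recipient at all:
no_recipient = corpus.findall("tweet[@recipient='none']")
```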

V. References

Bamman, D. (2010, May 18). Language Log. Retrieved March 2012, from http://languagelog.ldc.upenn.edu/nll/?p=2334

Biber, D. (1995). On the role of computational, statistical, and interpretive techniques in multi-dimensional analyses of register variation: A reply to Watson. Text, 15, 341-370.

Biber, D., Conrad, S., & Reppen, R. (1998). Corpus linguistics: Investigating language structure and use. Cambridge, UK: Cambridge University Press.

Deseriis, M. (2012, January). Mail Art/Fluxus/Networking. Retrieved from Booki: http://www.booki.cc/the-digital-legacies-of-the-avant-garde/mail-artfluxusnetworking/

Eisenstein, J., O'Connor, B., Smith, N. A., & Xing, E. P. (2010). A Latent Variable Model for Geographic Lexical Variation. Proceedings of EMNLP 2010.

Ferrara, K., Brunner, H., & Whittemore, G. (1991, January). Interactive Written Discourse as an Emergent Register. Written Communication, 8(1), pp. 8-34.

Joos, M. (1967). The Five Clocks (Vol. 5). New York: Harcourt, Brace & World.

Liberman, M. (2011, October 31). Language Log. Retrieved from http://languagelog.ldc.upenn.edu/nll/?p=3534

Paul, R. (2010, April 21). Tutorial: consuming Twitter's real-time stream API in Python. Retrieved from Ars Technica: http://arstechnica.com/open-source/guides/2010/04/tutorial-use-twitters-new-real-time-stream-api-in-python.ars

Raymond, M. (2010, April 14). How Tweet It Is!: Library Acquires Entire Twitter Archive. Retrieved from The Library of Congress Blog: http://blogs.loc.gov/loc/2010/04/how-tweet-it-is-library-acquires-entire-twitter-archive/

Reid, T. (1956). Linguistics, structuralism, and philology. Archivum Linguisticum, 8, pp. 28-37.

Russ, B. (2012). Examining Large-Scale Regional Variation Through Online Geotagged Corpora. Retrieved from http://www.briceruss.com/ADStalk.pdf

Taylor, C. (2011, June 27). Social networking 'utopia' isn't coming. Retrieved from CNN Tech: http://articles.cnn.com/2011-06-27/tech/limits.social.networking.taylor_1_twitter-users-facebook-friends-connections?_s=PM:TECH

Twitter. (2012). FAQs about Verified Accounts. Retrieved from Twitter: https://support.twitter.com/groups/31-twitter-basics/topics/111-features/articles/119135-about-verified-accounts

Twitter. (2012, March). Verified Accounts (verified). Retrieved from Twitter: https://twitter.com/#!/verified

Watters, A. (2011, March 3). How Recent Changes to Twitter's Terms of Service Might Hurt Academic Research. Retrieved from Read Write Web: http://www.readwriteweb.com/archives/how_recent_changes_to_twitters_terms_of_service_mi.php