The third batch has been processed and uploaded to eXist. Here are the final numbers for the corpus:

Total words:
Unique words:
Type-token ratio:
Average tweet length (words):
Average tweet length (characters):
Total tweets:
Total authors:
Total verified authors:
Total non-verified authors:

Gave another presentation today outlining the proposed research. This presentation is available online and can be found at: http://prezi.com/ucu0igvy1cpx/twitter-register-variation-a-research-proposal/.



Two of the three batches of tweets have been categorized and cleaned of non-English tweets, and, just to get a feel for the data, we uploaded these to eXist. Using XQuery, we gathered the following data about the corpus:

Total words:
Unique words:
Type-token ratio:
Average tweet length (words):
Average tweet length (characters):
Total tweets:
Total authors:
Total verified authors:
Total non-verified authors:


Now that the textcat process has been finalized, we can assess what we want to do with our corpus. The first batch of 1 million tweets has been refined as much as we can manage, so we did some testing to get a feel for how pure it is. We manually analyzed three random samples of 100 tweets each and found that the corpus is 98% English, which is pretty good. Since the semester is nearly over and we won't have much time to do actual register analysis of the corpus, we will unfortunately have to limit ourselves to gathering some information about the corpus as a finishing point for our work this semester. We're hoping to continue this work in our free time, but we'd like to reach a real stopping place before we part ways. For now we'd like to determine:



Based on preliminary results, we had to tweak the criteria for which language categorizations should be kept and which should be thrown out. The rules now run as follows: first, any tweet judged as "I don't know" is kept. Next, any tweet judged as English or Scots at any point is kept. Any tweet judged as Malay, Indonesian, Spanish, or Tagalog within the first four judgements is thrown out. Tweets judged as Welsh at any point are kept, and finally, any tweet which met none of the previous criteria is kept.
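Put in code terms, the decision procedure might look something like this. This is only a minimal Python sketch: textcat's real output format differs, and the lowercase language names (including "unknown" for the "I don't know" case) are our own illustrative labels.

```python
# Hypothetical re-implementation of our keep/discard rules in Python;
# textcat's actual output format differs.

def keep_tweet(judgements):
    """Decide whether to keep a tweet given textcat's ranked guesses.

    `judgements` is a list of language names, best guess first;
    textcat's "I don't know" case is represented here as ["unknown"].
    """
    # Rule 1: textcat couldn't decide -- keep.
    if judgements == ["unknown"]:
        return True
    # Rule 2: English or Scots anywhere in the list -- keep.
    if "english" in judgements or "scots" in judgements:
        return True
    # Rule 3: Malay, Indonesian, Spanish, or Tagalog within the
    # first four judgements -- throw out.
    if any(lang in judgements[:4]
           for lang in ("malay", "indonesian", "spanish", "tagalog")):
        return False
    # Rule 4: Welsh anywhere -- keep.
    if "welsh" in judgements:
        return True
    # Rule 5: met none of the previous criteria -- keep.
    return True
```

Note that the rules fire in order, so an English judgement rescues a tweet even if Spanish also appears near the top of the list.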



After writing shell scripts for most of the steps in our draft procedure, we realized that we were overcomplicating things a bit, so we revised the last several steps to be more efficient. There were also a number of tweets containing newline characters, which we handle with a Perl script: the newlines are stripped, and then the tweets are processed (again using Perl) to classify them as IN or OUT based on the textcat categorization. The new process looks like this:

  1. Generate files of the tweet IDs and contents, where each tweet is its own line.
  2. Clean up these files to remove symbols and the like which will confuse textcat.
  3. Get the language classifications. (Optionally recombine these with the original messages to check results)
  4. Use a Perl script to classify tweets as IN or OUT.
  5. Combine IN and OUT judgements with ID numbers.
  6. Filter out the IN ID lines, leaving only the OUT IDs.
  7. Use XSLT to remove the non-English tweets.
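The newline-handling that feeds step 1 could be sketched in Python like this (our actual script is in Perl, and the assumption that every tweet line starts with a numeric ID is ours):

```python
# A minimal Python stand-in for the newline-stripping step; the real
# script is written in Perl.

def flatten_tweets(raw_text):
    """Collapse newlines inside tweets so each tweet occupies one line.

    Assumes each tweet begins with a tab-separated numeric ID at the
    start of a line; any line NOT starting with a digit is treated as
    a continuation of the previous tweet.
    """
    flattened = []
    for line in raw_text.splitlines():
        if line[:1].isdigit() or not flattened:
            flattened.append(line)
        else:
            # Continuation line: glue it onto the previous tweet.
            flattened[-1] += " " + line
    return "\n".join(flattened)
```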


Looking through our data, we realized that a good number of our "tweets" in the corpus up to this point are actually deletion stubs, which are instructions to Twitter indicating that a tweet has been deleted by the user. This means that we actually have many fewer tweets in our corpus than we thought. We adapted one of our stylesheets to remove these from the corpus so that our final numbers would be accurate and then got back to preparing to use textcat.

The first step in our draft procedure was to generate files containing only the information relevant to language categorization, namely the ID and the tweet itself. This turns out to be just a handful of lines in an XSLT stylesheet, which we wrote in a minute or two. The stylesheet runs over each of the corpus files, pulls out the ID and tweet (separated by a tab), and inserts a newline character before moving on to the next tweet in that file. This way each tweet occupies only one line, allowing textcat to look at them separately.
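For illustration, here is a rough Python equivalent of what the stylesheet does. The element names tweet, id, and text are assumptions for the sketch, not necessarily our exact markup.

```python
# Rough Python equivalent of the extraction stylesheet; element names
# <tweet>, <id>, and <text> are illustrative assumptions.

import xml.etree.ElementTree as ET

def tweets_to_lines(xml_string):
    """Return one "ID<TAB>text" line per <tweet> element."""
    root = ET.fromstring(xml_string)
    lines = []
    for tweet in root.iter("tweet"):
        tweet_id = tweet.findtext("id", default="").strip()
        text = tweet.findtext("text", default="").strip()
        lines.append(f"{tweet_id}\t{text}")
    return "\n".join(lines)
```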

The next step from our draft procedure from last week is to filter out troublesome characters before running textcat. We worked on figuring out which characters those would be and wrote a quick Perl script to handle them, using global substitutions to remove @ recipients, hashtags, ampersands, angle brackets, RT markers, and slashes. This should minimize mis-identification of languages.
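In Python terms, the substitutions look roughly like this (the real script is in Perl, and the exact patterns here are approximations of ours):

```python
# A Python sketch of the character-filtering pass; the substitution
# patterns are approximations of the real Perl script's.

import re

def scrub_for_textcat(tweet):
    """Strip markup-ish noise that confuses textcat's n-gram models."""
    tweet = re.sub(r"@\w+", "", tweet)         # @ recipients
    tweet = re.sub(r"#\w+", "", tweet)         # hashtags
    tweet = re.sub(r"\bRT\b", "", tweet)       # retweet markers
    tweet = re.sub(r"[&<>/]", " ", tweet)      # ampersands, angle brackets, slashes
    return re.sub(r"\s+", " ", tweet).strip()  # tidy leftover whitespace
```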



Continuing on with trying to clean out the corpus of non-English tweets, we wrote an XSLT to remove foreign tweets by ID number, based on the results of textcat categorization. So far, this involves creating a variable which will contain the IDs of tweets to remove. The main template rule is just an identity transformation, but there is an empty template rule for tweets with an ID equal to any of the IDs in the variable list.
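The same filter can be expressed as a Python sketch rather than XSLT (element names are again assumptions for illustration):

```python
# Python stand-in for the ID-based removal stylesheet: drop every
# <tweet> whose <id> text appears in the removal set. Element names
# are illustrative.

import xml.etree.ElementTree as ET

def remove_tweets_by_id(xml_string, bad_ids):
    """Return the XML with the flagged tweets removed."""
    root = ET.fromstring(xml_string)
    for parent in root.iter():
        # Snapshot the children so we can remove while iterating.
        for tweet in list(parent):
            if tweet.tag == "tweet" and tweet.findtext("id") in bad_ids:
                parent.remove(tweet)
    return ET.tostring(root, encoding="unicode")
```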

The other main task for this week was to draft the commands for actually running textcat. This is a long and complicated process which we could simplify by creating a batch file or scripting it in Python, but for development purposes we will keep each step separate, allowing us to monitor the output of each step. The steps look roughly like this:

  1. Use an XSLT stylesheet to pull out the IDs and tweet contents for each <tweet> element.
  2. Filter out symbols which are likely to confuse textcat, along with all usernames and the text "RT", which are not really part of any language.
  3. Get language judgements for each tweet, removing the ID numbers from the equation for the moment.
  4. Combine the judgements and the tweet text and eyeball the results for any tweets that will be lost/kept erroneously.
  5. Create files for the various conditions which mean a tweet should be kept. For now this is a categorization of either Scots or English as the first or second choices and keeping any which do not contain Malay, Indonesian, Spanish, or Tagalog as either of the first two choices. This may require some fine-tuning.
  6. Combine the files generated in the previous step.
  7. Create a list of the IDs to throw away by combining the file from the previous step (which contains only IDs to keep) with the original file which textcat ran on, then keeping only the IDs which appear just once.
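Step 7's combine-and-count trick can be sketched like this: every ID appears once in the original file, and the keep-IDs appear a second time, so anything counted only once is a discard. A minimal Python illustration:

```python
# Sketch of step 7: every ID appears once in the full list; IDs we
# decided to keep appear a second time, so anything seen only once
# is an ID to throw away.

from collections import Counter

def ids_to_discard(all_ids, keep_ids):
    counts = Counter(all_ids) + Counter(keep_ids)
    return [tweet_id for tweet_id, n in counts.items() if n == 1]
```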

Although this is probably not the exact procedure we will use in the future, it's a good starting point to build off of based on trial and error. Hopefully not too much will need to be changed to get the results we're looking for.



This week we tried to write an XSLT which would remove the duplicate tweets, but this turned out to be exceedingly expensive computationally, and running it over the whole corpus would have been foolhardy at best. Instead, we realized that since the duplicate tweets are identical character for character, we could use the Unix sort and uniq utilities to first sort the tweets and then weed out duplicates, leaving only unique lines. Unfortunately, this has to be done while the tweets each occupy their own (solitary) line, so we'll have to re-process all the tweets we've collected so far to be sure no duplicates are lurking. This approach is vastly more efficient than removing duplicates with XSLT, however, and is actually relatively fast, so we'll push on with this method.
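In Python terms, the whole operation amounts to a one-liner (we run the actual sort and uniq on the command line):

```python
# Illustrative Python equivalent of `sort file | uniq`:
# character-for-character duplicate lines collapse to a single copy.

def dedupe_lines(lines):
    return sorted(set(lines))
```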

In preparation for our meeting today, we read a 1995 article by Biber1 about using multi-dimensional analyses of register variation. After discussing the dimensions Biber uses, we decided to formalize our list of features for measuring linguistic register as follows.

  1. expletives
  2. rate of non-dictionary words
  3. average word length (restricted to dictionary words)
  4. capitalization scheme
  5. standard punctuation
    • rate of non-alphanumeric character use
    • 1337 (leetspeak)
  6. presence of chatspeak etc
    • particular lexical items
  7. word n-grams/character bi-grams (?)
  8. ratio of function words (!)

1Biber, D. 1995. On the role of computational, statistical, and interpretive techniques in multi-dimensional analyses of register variation: A reply to Watson (1994). Text 15.341-370.
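A couple of the features in the list above are cheap to compute. As a toy sketch (the tiny word set here stands in for a real dictionary):

```python
# Toy sketch of features 2 and 3: rate of non-dictionary words and
# average length of dictionary words. The tiny DICTIONARY below is a
# stand-in for a real word list.

DICTIONARY = {"the", "cat", "sat", "on", "mat", "lol"}

def lexical_features(tweet):
    words = tweet.lower().split()
    in_dict = [w for w in words if w in DICTIONARY]
    out_rate = 1 - len(in_dict) / len(words) if words else 0.0
    avg_len = sum(map(len, in_dict)) / len(in_dict) if in_dict else 0.0
    return out_rate, avg_len
```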



Gave a brief presentation today (~15 minutes) on our progress so far and plans for the future!



Several weeks ago we were thinking of running our tweets against an English dictionary and flagging the ones claiming to contain no English, but we've since decided that this would be a pretty expensive process computationally, and still not particularly foolproof. We discussed using SIL's English Word List and Snowball to this end, but decided that the SIL Word List would not work well enough to be worth it, and Snowball was probably too brute-force-y in method to be really effective or reliable. Instead, we looked into a number of different language identification tools in the hopes of simply throwing out anything not reporting as English, working from a list of language identification tools.

At first, we thought that the Xerox Language Identifier would be the tool for the job, but its API limit would make using the service incredibly impractical. We also investigated libtextcat, which turned out to be impractically slow for the size of corpus we plan on using. Initially, we decided that textcat was probably also not the best choice for our purposes, but after fiddling with it for some time we have decided that it is probably good enough. Although textcat is no longer maintained by its developer, it has a number of features which make it useful to us. While the language ID tools in general were unable to cope well with the small sample size of a tweet, and were especially bad at recognizing that even simple sentences were in English, textcat was fairly accurate at identifying that something was not in English. Based on this, we plan to sort out the remaining non-English tweets by keeping anything identified as English, Scots, or unknown, but throwing out anything whose top two or three choices include Indonesian, Malay, Tagalog, or Spanish. These are the languages that appear most prevalent in our data at this point, so hopefully by removing them we can be reasonably confident that our data isn't being contaminated by large quantities of foreign-language text.

Towards the end of our meeting, we realized that we had duplicate tweets in our data. We're not sure how these entered the corpus, but they're textually identical (even the time stamps match!). We'll have to find a way to remove these before we can progress much further, perhaps using XSLT or some command line function.



The other week (2012-01-25) we had to skip over one of the files because it contained non-Unicode characters that we couldn't manage to remove. To see if this problem was somehow a result of how we were splitting the files, we tried cleaning the files before splitting them. Although initially the problem file was pt30, this time around the problem appeared in multiple parts. This seems to suggest that the problem characters are products of the pyxser transformation process...



We recently had to migrate to a different server for development purposes, because the files for this project take up enough space that we didn't want them clogging up the server we'd been working on. After migrating, we had to re-install pyxser for Python. It'd been a while since we had to install Python packages, so we were a little rusty, but even once we'd refreshed our memories we found we were still having trouble. The install kept running into errors saying we were missing packages that pyxser needs to run. After checking the install directions file, we tried to install those packages, but this time found that they didn't exist. At a loss, we spent a long time trying to find the packages online until finally we discovered something unfortunate: where the install directions for pyxser directed us to download things like "libxml2_dev", the package is actually called "libxml2_devel". By the time we realized this, we were extremely frustrated and at our wits' end.

Our other project for the day was fine-tuning the transformation process that takes us from 1M separate JSON blocks to 100 XML files, each containing 10,000 tweets, with extraneous elements removed.



Today we worked on tidying the XML output of the pyxser transformation. At this point we had transformed the pyxser-generated XML to remove the pyxser namespace, but our XML hierarchy was pretty wonky, and we had a number of extraneous elements left over from the initial JSON blocks from Twitter.


The element name pt###### was unique for each tweet, which was important when we were transitioning from JSON but which is unnecessary and harmful in XML. What's more, we wanted to weed out as many of the non-English tweets as possible. Based on the information Twitter includes about a user in each tweet, we selected on the user-specified language, the time zone, and the characters used within the text of the tweet itself (we threw out tweets containing non-ASCII characters). We had some trouble sorting out foreign-language tweets which had been retweeted by English speakers, which we eventually solved using the every $var in PATH satisfies TEST expression in XPath.
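That XPath test corresponds to something like the following in Python (for illustration only; the real check lives in our XSLT):

```python
# Python analogue of the `every $c in ... satisfies ...` XPath check:
# a tweet passes only if every character is plain ASCII.

def is_all_ascii(text):
    return all(ord(c) < 128 for c in text)
```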

Once we had tested our XSLT to make sure it was doing what we expected, we wrote a shell script to run it across the rest of the corpus and let it work its magic. There's still some level of contamination from tweets which use ASCII characters but aren't in English, but they are few enough that we are not going to worry about trying to weed those out at this stage. For now we're thinking of parsing each tweet with an English dictionary and checking any tweet which claims to contain no English words to sort out the rest of the foreign language tweets.



To get an idea of what format the time zone element contains, we wrote a quick XQuery to generate an HTML list of the unique values for time_zone in the 100,000 tweets currently in eXist. There are 109 different time zones represented in this small block of tweets, and they take a descriptive enough form to be useful for filtering out non-English tweets. Obviously this will not be a perfect approach, but it's a good place to start. There's also a language element, lang, which we can make good use of. This lang value is specified by the user in their Twitter settings as the language they want the Twitter interface to be in. It's also not going to be a foolproof measure, but we're optimistic that these two factors together will be fairly effective in doing what we want to a reasonable extent.



We did some spring cleaning of our many test files on the server and in Dropbox and worked on transforming the pyxser-generated XML into slightly tidier XML (i.e., minus the pyxser namespace) using our newly segmented XML files, to alleviate the stress of processing 1,000,000 separate files. The tweets are now separated into 100 files, each containing 10,000 tweets. Schema from the previous week in hand, we created a list of pertinent elements to preserve in the tidier XML we're generating, and indexed these elements to make querying them a little more efficient. We're having trouble with eXist handling so many files (we were hoping that indexing would help), even separated out as they are now, and increasing the heap space allocated to Java is proving to be only a temporary fix. It looks like we'll need to upgrade the server to get more memory...



We started transforming the first 100,000 tweets yesterday and the process is still not complete... In retrospect, opening and closing 100,000 files (let alone 1,000,000 or 16,000,000!) is too expensive to do efficiently. We'll need to rethink our approach here before we can progress.



The pyxser-generated XML files are pretty messy, so we're creating an XSLT stylesheet to take it out of the pyxser namespace. We constructed a batch file to iterate through all the XML files and apply the transformation once we'd confirmed that the transformation would behave as expected when applied to a test file of only ten tweets. While we were working on this, we decided to create a schema (using Relax NG) to model the de-namespaced XML files based on this test transformation output. By doing this, we can better understand the structure of a tweet and get a feel for what elements will be worth keeping and indexing for ease in querying with eXist.



Upon gaining access to the TREC corpus as detailed in the original project proposal, we realized that the methods we had been developing to work with the data would not be sufficient for such a large corpus. One of the reasons for working with an established corpus is that it allows for more repeatable results and ensures that the data is of a certain standard. However, the nature of the TREC corpus negates these factors to some extent. Individuals using the TREC corpus must retrieve the tweets in a process which is apparently somewhat unstable and sometimes causes errors. There are instructions on the TREC corpus site detailing how to deal with these errors should they arise. Instead, we are working on building our own corpus using a Python script.

For now, the corpus will consist of 1,000,000 tweets, which we will need to prune programmatically to remove non-English tweets. We have some concerns about storing 1,000,000 tweets (let alone the 16,000,000 tweets of the TREC corpus), but we will tackle that problem if it arises and move forward with appropriate caution for now. Once we've got the 1,000,000 tweets, we will need to convert them from the JSON blocks used by Twitter into valid XML, which we can then process using XQuery and XSLT.
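As a toy illustration of that JSON-to-XML step (the real conversion will go through pyxser, and the field names here are invented):

```python
# Toy sketch of converting one tweet's JSON block to XML; assumes a
# flat JSON object, and the field names are illustrative. The real
# pipeline uses pyxser.

import json
import xml.etree.ElementTree as ET

def json_tweet_to_xml(json_block):
    data = json.loads(json_block)
    tweet = ET.Element("tweet")
    for key, value in data.items():
        child = ET.SubElement(tweet, key)
        child.text = str(value)
    return ET.tostring(tweet, encoding="unicode")
```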