Alexandre Patry

Reducing entropy one post at a time

Building a Tweet Corpus

I wanted to play around with some tweets, but I quickly discovered that getting hold of a corpus is not that easy because of Twitter's terms of service. It is up to everyone to create their own corpus.

Luckily, Twitter has an API to sample tweets randomly. I built a small application on top of it, which can be used by following these steps:

  1. Register a Twitter application on https://dev.twitter.com/apps/new. The application name is not important; you only need its credentials.

  2. Download the latest version of twitter-sampler.

  3. Download credentials.clj and fill in the blanks with the credentials of your application.

  4. Run the following command:

java -jar twitter-sampler-1.0.0-SNAPSHOT-standalone.jar -c credentials.clj -n 1000 tweets.json

where credentials.clj is the file containing your credentials, 1000 is the number of tweets you want to download and tweets.json is the file where the tweets should be saved.

You should now have a corpus of tweets to play with.

Word Lists in a Shell

I often want to manipulate and compare sets of words. This post presents some of the one-liners that I frequently use to manipulate such lists.

Get a set of words from a text file

If you start from a text file, the following command will convert it to a list of words:

cat input.txt | sed 's/\>/\n/g' | sed 's/^[[:space:]]*//' | sed 's/[[:space:]]*$//' | grep -v "^$" | sort | uniq > output.txt

If you run OS X, use this command instead (notice the newline in the middle of the command):

cat input.txt | sed 's/[[:>:]]/\
/g' | sed 's/^[[:space:]]*//' | sed 's/[[:space:]]*$//' | grep -v "^$" | sort | uniq > output.txt

Both of these commands replace word boundaries with newlines, trim the surrounding whitespace, drop empty lines and then print a sorted list of unique words.
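For a version that does not depend on sed's word-boundary syntax, here is a minimal sketch using tr instead. It is a swapped-in technique, not the commands above: it treats every non-alphanumeric character as a separator, so punctuation is discarded and the output may differ on real text. The sample input and the workdir variable are made up for illustration.

```shell
workdir=$(mktemp -d)
# Hypothetical sample input; replace with your own input.txt.
printf 'the cat and the dog\nthe end.\n' > "$workdir/input.txt"

# Turn every run of non-alphanumeric characters into a single newline,
# drop a possible empty line, then sort and deduplicate.
tr -cs '[:alnum:]' '\n' < "$workdir/input.txt" | grep -v '^$' | sort -u > "$workdir/output.txt"

cat "$workdir/output.txt"
```

On this input, the output file contains and, cat, dog, end and the, one per line.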

Intersection

A first one-liner to find the elements that are common to two lists:

cat file1.txt file2.txt | sort | uniq -d
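This one-liner relies on each file listing a word at most once. If a list may contain repeats, a possible guard is to deduplicate each file first. The sketch below uses made-up sample lists and hypothetical file names (set1.txt, set2.txt) to show the difference.

```shell
workdir=$(mktemp -d)
# Hypothetical lists; "apple" is repeated in the first one but absent from the second.
printf 'apple\napple\nbanana\n' > "$workdir/file1.txt"
printf 'banana\ncherry\n' > "$workdir/file2.txt"

# Naive version: the repeated "apple" is wrongly reported as common.
naive=$(cat "$workdir/file1.txt" "$workdir/file2.txt" | sort | uniq -d)

# Deduplicate each list first, so a word can only repeat if it is in both files.
sort -u "$workdir/file1.txt" > "$workdir/set1.txt"
sort -u "$workdir/file2.txt" > "$workdir/set2.txt"
common=$(cat "$workdir/set1.txt" "$workdir/set2.txt" | sort | uniq -d)

printf 'naive:\n%s\n' "$naive"
printf 'deduplicated:\n%s\n' "$common"
```

The naive run reports apple and banana, while the deduplicated run reports only banana.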

Union

A similar one-liner to find the elements that are in one set or the other:

cat file1.txt file2.txt | sort | uniq

Union minus intersection

To get the words that are only in file1.txt or file2.txt, but not both:

cat file1.txt file2.txt | sort | uniq -u

Difference

To get the elements that are in file1.txt, but not file2.txt:

cat file1.txt file2.txt file2.txt | sort | uniq -u
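The trick is that listing file2.txt twice guarantees every one of its words shows up at least twice, so uniq -u can only keep words that come from file1.txt alone. A small worked example, assuming each list is already free of duplicates (the sample words are made up):

```shell
workdir=$(mktemp -d)
# Hypothetical lists, each already free of duplicates.
printf 'apple\nbanana\ncherry\n' > "$workdir/file1.txt"
printf 'banana\ncherry\ndate\n' > "$workdir/file2.txt"

# file2.txt is listed twice, so "date" is seen 2 times and the shared
# words 3 times; only "apple" is seen exactly once and survives uniq -u.
only_in_file1=$(cat "$workdir/file1.txt" "$workdir/file2.txt" "$workdir/file2.txt" | sort | uniq -u)

printf '%s\n' "$only_in_file1"
```

This prints apple, the only word that is in the first list but not the second.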

Histogram of words

As a bonus, we can tweak our first command to get a histogram of words:

cat input.txt | sed 's/\>/\n/g' | sed 's/^[[:space:]]*//' | sed 's/[[:space:]]*$//' | grep -v "^$" | sort | uniq -c | sort -nr

The following variation prints the words appearing at least 10 times:

cat input.txt | sed 's/\>/\n/g' | sed 's/^[[:space:]]*//' | sed 's/[[:space:]]*$//' | grep -v "^$" | sort | uniq -c | sort -nr | awk '$1 >= 10 {print $2}'
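To see what the counting stages do in isolation, here is a small sketch that starts from a made-up list with one word per line (what the sed and grep stages are meant to produce) and uses a threshold of 2 instead of 10 so the tiny sample has something to show.

```shell
# Hypothetical input with one word per line.
words=$(printf 'the\ncat\nthe\ndog\nthe\ncat\n')

# Count each distinct word, sort by descending count,
# then keep only the words seen at least twice.
frequent=$(printf '%s\n' "$words" | sort | uniq -c | sort -nr | awk '$1 >= 2 {print $2}')

printf '%s\n' "$frequent"
```

Here "the" appears 3 times, "cat" twice and "dog" once, so the output is the followed by cat.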

Forer Effect

I recently learned about the Forer effect, or how people can take general statements and make them their own. In a classic experiment, Forer asked his students to fill out a personality test. A week later, he gave his analysis back to each student and had them rate its accuracy on a scale of 0 (poor) to 5 (perfect). The analyses were so targeted that only one out of 39 students rated the results lower than 4.

As it turned out, the results were not good; they were only perceived as good. Everyone received the exact same excerpt from an astrology book. Students believed the analyses were targeted because they were made of universally valid statements:

A universally valid statement, then, is one which applies equally well to the majority or the totality of the population. A universally valid statement is true for the individual, but it lacks the quantitative specification and the proper focus which are necessary for differential diagnosis.

Some universal statements taken from Forer’s paper are:

You have a great need for other people to like and admire you.

You have a tendency to be critical of yourself.

Some of your aspirations tend to be pretty unrealistic.

One of Forer’s conclusions is that people are really bad at assessing information about themselves. Denis Dutton explains very nicely how this weakness is used by mentalists to deceive people with cold reading.

Alt Car in Emacs for OSX

When I installed Emacs for OS X, the right option key acted as Meta instead of my beloved alt-car. This can be fixed with these steps:

  1. If you use emacs 23 or prior, install package.el
  2. Add marmalade to the list of repositories in your .emacs:
    ;; Adds marmalade to package.el
    (require 'package)
    (add-to-list 'package-archives '("marmalade" . "http://marmalade-repo.org/packages/"))
    (package-initialize)
  3. Refresh your package index using M-x package-refresh-contents
  4. Install mac-key-mode using M-x package-install mac-key-mode
  5. Configure the right option key to act like alt-car using the following lines in your .emacs:
    (require 'mac-key-mode)
    (setq mac-option-key-is-meta t)
    (setq mac-right-option-modifier nil)
  6. Restart emacs

You should now be able to use the right option key as alt-car to enjoy characters like @ and } on a French Canadian keyboard.

Welcome to My Blog

Welcome to my blog, a place where I will put information that I hope will be useful to others or future me.