Word lists in a shell

I often need to manipulate sets of words that I want to compare. This post presents some of the one-liners that I frequently use to manipulate such lists.

Get a set of words from a text file

If you start from a text file, the following command will convert it to a list of words:

cat input.txt | sed 's/\>/\n/g' | sed 's/^[[:space:]]*//' | sed 's/[[:space:]]*$//' | grep -v "^$" | sort | uniq  > output.txt

If you run OS X, use this command instead (notice the newline in the middle of the command):

cat input.txt | sed 's/[[:>:]]/\
/g' | sed 's/^[[:space:]]*//' | sed 's/[[:space:]]*$//' | grep -v "^$" | sort | uniq  > output.txt

Both of these commands insert a newline at the end of each word, trim the surrounding whitespace, drop empty lines and then write a sorted list of unique words to output.txt.
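If you do not care about punctuation, a different technique sidesteps the sed differences between Linux and OS X: let tr turn every run of non-alphanumeric characters into a newline. This is only a sketch of an alternative, not the command used above. The sample text is made up, and unlike the sed version it throws punctuation away and splits a word like "don't" at the apostrophe.

```shell
tmp=$(mktemp -d)
# Made-up sample input; any text file works the same way.
printf 'the cat, the dog\nand the bird.\n' > "$tmp/input.txt"

# -c complements the set (everything that is NOT alphanumeric) and
# -s squeezes each run of replaced characters into a single newline.
words=$(tr -cs '[:alnum:]' '\n' < "$tmp/input.txt" | grep -v '^$' | sort -u)
printf '%s\n' "$words"

rm -r "$tmp"
```

On this input the result is the five words and, bird, cat, dog and the, one per line.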

Intersection

A first one-liner finds the elements that are common to two lists:

cat file1.txt file2.txt | sort | uniq -d
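This one-liner relies on each file listing a word at most once, which holds for files produced by the first command. The sketch below uses two made-up lists to show what goes wrong otherwise, and one possible guard: deduplicate each file with sort -u before merging.

```shell
tmp=$(mktemp -d)
# Made-up sample lists; file1.txt deliberately repeats "apple".
printf 'apple\napple\nbanana\n' > "$tmp/file1.txt"
printf 'banana\ncherry\n' > "$tmp/file2.txt"

# Plain version: "apple" is reported although file2.txt does not contain it.
naive=$(cat "$tmp/file1.txt" "$tmp/file2.txt" | sort | uniq -d)

# Guarded version: deduplicate each file first, then merge and keep repeats.
guarded=$( { sort -u "$tmp/file1.txt"; sort -u "$tmp/file2.txt"; } | sort | uniq -d )

printf 'naive:   %s\n' "$(printf '%s' "$naive" | tr '\n' ' ')"
printf 'guarded: %s\n' "$guarded"

rm -r "$tmp"
```

The plain version prints apple and banana, the guarded one only banana. The same guard applies to the other set operations in this post.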

Union

A similar one-liner finds the elements that are in one set or the other:

cat file1.txt file2.txt | sort | uniq

Union minus intersection

To get the words that are in either file1.txt or file2.txt, but not in both:

cat file1.txt file2.txt | sort | uniq -u

Difference

To get the elements that are in file1.txt, but not file2.txt:

cat file1.txt file2.txt file2.txt | sort | uniq -u
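The trick is that listing file2.txt twice makes every one of its words appear at least twice, so uniq -u can never print them. Here is a small sketch on made-up lists, cross-checked against comm, which is another way to get the same result when both inputs are sorted. The a.sorted and b.sorted names are just placeholders.

```shell
tmp=$(mktemp -d)
# Made-up sample lists without duplicates.
printf 'apple\nbanana\ncherry\n' > "$tmp/file1.txt"
printf 'banana\ndate\n' > "$tmp/file2.txt"

# Words of file2.txt occur two or three times in the merged stream,
# so uniq -u only keeps the words found in file1.txt alone.
diff_uniq=$(cat "$tmp/file1.txt" "$tmp/file2.txt" "$tmp/file2.txt" | sort | uniq -u)

# Cross-check with comm: -2 hides lines only in the second file,
# -3 hides lines common to both. comm needs sorted input.
sort "$tmp/file1.txt" > "$tmp/a.sorted"
sort "$tmp/file2.txt" > "$tmp/b.sorted"
diff_comm=$(comm -23 "$tmp/a.sorted" "$tmp/b.sorted")

printf '%s\n' "$diff_uniq"

rm -r "$tmp"
```

Both approaches give apple and cherry here.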

Histogram of words

As a bonus, we can tweak our first command to get a histogram of words:

cat input.txt | sed 's/\>/\n/g' | sed 's/^[[:space:]]*//' | sed 's/[[:space:]]*$//' | grep -v "^$" | sort | uniq -c | sort -nr

The following variation prints the words appearing at least 10 times:

cat input.txt | sed 's/\>/\n/g' | sed 's/^[[:space:]]*//' | sed 's/[[:space:]]*$//' | grep -v "^$" | sort | uniq -c | sort -nr | awk '$1 >= 10 {print $2}'
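To see what the counting part does, here is a sketch that skips the sed step and starts from a made-up stream of words, one per line. The threshold is lowered to 2 so that the tiny sample produces some output.

```shell
# Made-up word stream, one word per line, standing in for the tokenized text.
words=$(printf 'the\ncat\nthe\ndog\nthe\ncat\n')

# uniq -c prefixes each distinct word with its count;
# sort -nr then puts the most frequent words first.
printf '%s\n' "$words" | sort | uniq -c | sort -nr

# Same awk filter as above: $1 is the count, $2 is the word.
frequent=$(printf '%s\n' "$words" | sort | uniq -c | sort -nr | awk '$1 >= 2 {print $2}')
printf '%s\n' "$frequent"
```

The histogram lists the with 3, cat with 2 and dog with 1, and the filtered list contains the and cat.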
