I often want to manipulate sets of words and compare them. This post presents some of the one-liners I frequently use to manipulate such lists.
Get a set of words from a text file
If you start from a text file, the following command will convert it to a list of words:
cat input.txt |\
sed 's/\>/\n/g' |\
sed 's/^[[:space:]]*//' |\
sed 's/[[:space:]]*$//' |\
grep -v "^$" |\
sort |\
uniq > output.txt
If you are on macOS, use this command instead (note the literal newline in the middle of the command: BSD sed does not interpret \n in the replacement text, and [[:>:]] is its word-boundary syntax):
cat input.txt | sed 's/[[:>:]]/\
/g' | sed 's/^[[:space:]]*//' | sed 's/[[:space:]]*$//' | grep -v "^$" | sort | uniq > output.txt
Both of these commands replace word boundaries with newlines, trim whitespace around each word, and then print a sorted list of unique words.
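As a small worked example with a hypothetical two-line input (this sketch assumes GNU sed, which interprets \n in the replacement):

```shell
printf 'the quick brown fox\njumps over the lazy dog\n' > input.txt

cat input.txt |
sed 's/\>/\n/g' |          # insert a newline at every word-ending boundary
sed 's/^[[:space:]]*//' |  # trim leading whitespace
sed 's/[[:space:]]*$//' |  # trim trailing whitespace
grep -v "^$" |             # drop empty lines
sort |
uniq > output.txt          # one unique word per line, sorted

cat output.txt
```

The duplicate "the" is collapsed, leaving eight words in alphabetical order: brown, dog, fox, jumps, lazy, over, quick, the.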
Intersection
A first one-liner to find the elements common to two lists (this and the following recipes assume each file already contains unique lines, as produced by the command above):
cat file1.txt file2.txt | sort | uniq -d
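A quick sanity check with two small, hypothetical word lists (each already deduplicated, as uniq -d requires):

```shell
# Two sample word lists, one word per line, no duplicates within a file
printf 'apple\nbanana\ncherry\n' > file1.txt
printf 'banana\ncherry\ndate\n' > file2.txt

# uniq -d prints only lines that occur more than once in the sorted
# stream, i.e. words present in both files
cat file1.txt file2.txt | sort | uniq -d
# → banana
# → cherry
```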
Union
A similar one-liner to find the elements that are in either set:
cat file1.txt file2.txt | sort | uniq
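With the same hypothetical inputs, the union looks like this (sort -u is an equivalent shorthand for sort | uniq):

```shell
printf 'apple\nbanana\ncherry\n' > file1.txt
printf 'banana\ncherry\ndate\n' > file2.txt

# Every word that appears in at least one of the files
cat file1.txt file2.txt | sort | uniq
# → apple
# → banana
# → cherry
# → date
```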
Union minus intersection
To get the words that are only in file1.txt or file2.txt, but not both (the symmetric difference):
cat file1.txt file2.txt | sort | uniq -u
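Again with the same hypothetical inputs:

```shell
printf 'apple\nbanana\ncherry\n' > file1.txt
printf 'banana\ncherry\ndate\n' > file2.txt

# uniq -u keeps only lines that occur exactly once in the sorted stream,
# i.e. words present in one file but not the other
cat file1.txt file2.txt | sort | uniq -u
# → apple
# → date
```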
Difference
To get the elements that are in file1.txt but not in file2.txt, list file2.txt twice: every one of its words then appears at least twice in the stream and is filtered out by uniq -u:
cat file1.txt file2.txt file2.txt | sort | uniq -u
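The same hypothetical inputs illustrate the trick:

```shell
printf 'apple\nbanana\ncherry\n' > file1.txt
printf 'banana\ncherry\ndate\n' > file2.txt

# file2.txt is concatenated twice: each of its words occurs at least
# twice and is dropped by uniq -u, leaving only words unique to file1.txt
cat file1.txt file2.txt file2.txt | sort | uniq -u
# → apple
```

Note that "date", which occurs only in file2.txt, is also dropped, because its two copies make it a duplicate in the combined stream.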
Histogram of words
As a bonus, we can tweak our first command to get a histogram of words:
cat input.txt |\
sed 's/\>/\n/g' |\
sed 's/^[[:space:]]*//' |\
sed 's/[[:space:]]*$//' |\
grep -v "^$" |\
sort |\
uniq -c |\
sort -nr
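For instance, on a small hypothetical input (again assuming GNU sed for the \n replacement), uniq -c prefixes each word with its count and sort -nr puts the most frequent words first:

```shell
printf 'the cat and the dog and the bird\n' > input.txt

cat input.txt |
sed 's/\>/\n/g' |
sed 's/^[[:space:]]*//' |
sed 's/[[:space:]]*$//' |
grep -v "^$" |
sort |
uniq -c |
sort -nr
# First line is "3 the", then "2 and", then the words seen once
```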
The following variation prints words appearing at least 10 times:
cat input.txt |\
sed 's/\>/\n/g' |\
sed 's/^[[:space:]]*//' |\
sed 's/[[:space:]]*$//' |\
grep -v "^$" |\
sort |\
uniq -c |\
sort -nr |\
awk '$1 >= 10 {print $2}'
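A small sketch with a threshold of 2 instead of 10, to keep the hypothetical input short (again assuming GNU sed); the awk comparison is where the cutoff lives:

```shell
printf 'hello world hello again world hello\n' > input.txt

cat input.txt |
sed 's/\>/\n/g' |
sed 's/^[[:space:]]*//' |
sed 's/[[:space:]]*$//' |
grep -v "^$" |
sort |
uniq -c |
sort -nr |
awk '$1 >= 2 {print $2}'   # keep only words counted at least twice
# → hello
# → world
```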