A few weeks ago I started a hobby project for fun and learning. My task this week is to compile a gazetteer of music bands and artists. I had wanted to play with Wikidata for a long time, and this was the perfect opportunity.
Getting familiar with Wikidata
Wikidata is the Wikipedia of data. Contributors are either robots or humans updating a database of facts. The best way to grasp how Wikipedia and Wikidata compare is to look at a concrete example, the entries for the band Arcade Fire: Wikidata, Wikipedia.
The first thing we notice is that Wikipedia is written in prose and is targeted at our fellow humans. Wikidata, on the other hand, is much more structured and doesn’t bother with well-formed sentences. This makes it a perfect source of knowledge for machines.
To convince ourselves, let’s find the place of origin of Arcade Fire from both sources. On Wikipedia, we first enter Arcade Fire in the search box, pick the right article and then read the text. We get Montreal, our answer, from this paragraph:
Win Butler and Josh Deu founded Arcade Fire in Montreal around 2001, having first met at Phillips Exeter Academy as high school students.
On Wikidata, we can ask directly for the place of origin of Arcade Fire using the following query:
SELECT ?origin ?originLabel
{
wd:Q58608 wdt:P740 ?origin.
?origin rdfs:label ?originLabel.
FILTER (LANG(?originLabel) = 'en').
}
And get the following result:
origin | originLabel |
---|---|
http://www.wikidata.org/entity/Q340 | Montreal |
Let’s unroll what just happened. The query is written in SPARQL, a language at the intersection of Prolog and SQL. The symbols beginning with a question mark (?origin and ?originLabel) are variables that we would like Wikidata to fill for us. The first line inside the curly braces asks Wikidata to fill the ?origin variable with the place of origin (wdt:P740) of the band Arcade Fire (wd:Q58608). The returned value (Q340) is an id for Montreal. The two remaining lines ask for the label of Montreal in English.
We could also ask for the place of origin to be labeled in French:
SELECT ?origin ?originLabel
{
wd:Q58608 wdt:P740 ?origin.
?origin rdfs:label ?originLabel.
FILTER (LANG(?originLabel) = 'fr').
}
and get the following result:
origin | originLabel |
---|---|
http://www.wikidata.org/entity/Q340 | Montréal |
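The English and French queries above differ only in the language code, so they can be generated from one template. Here is a minimal Python sketch of that idea. The function name `build_value_query` and the variable names `?value` and `?valueLabel` are my own choices for illustration, and the id checks are a simple precaution rather than anything Wikidata requires.

```python
import re


def build_value_query(entity_id, property_id, lang="en"):
    """Build a SPARQL query asking for one property of one entity, with labels.

    entity_id is a Wikidata item id such as "Q58608" (Arcade Fire) and
    property_id a property id such as "P740" (place of origin).
    """
    # Reject anything that does not look like an id or a language code,
    # so arbitrary text cannot end up inside the query.
    if not re.fullmatch(r"Q\d+", entity_id):
        raise ValueError(f"not an item id: {entity_id!r}")
    if not re.fullmatch(r"P\d+", property_id):
        raise ValueError(f"not a property id: {property_id!r}")
    if not re.fullmatch(r"[a-z]{2,3}(-[a-z0-9]+)*", lang):
        raise ValueError(f"not a language code: {lang!r}")

    return "\n".join([
        "SELECT ?value ?valueLabel",
        "{",
        f"  wd:{entity_id} wdt:{property_id} ?value.",
        "  ?value rdfs:label ?valueLabel.",
        f"  FILTER (LANG(?valueLabel) = '{lang}').",
        "}",
    ])


# Place of origin of Arcade Fire, labeled in French.
query = build_value_query("Q58608", "P740", lang="fr")
print(query)
```

The printed text can be pasted into the Wikidata query editor or saved to a file for later use.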
One fair question is why we are using these weird wd ids to identify entities and wdt ids to identify properties. The simple answer is that Wikidata is language agnostic and unambiguous. Q340 identifies the concept Montreal and can only refer to the city in Canada, never Montreal in Wisconsin.
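To make the link between the short ids and the URLs in the result tables concrete, here is a small Python sketch. It assumes the standard Wikidata prefix definitions, where wd: stands for http://www.wikidata.org/entity/ and wdt: for http://www.wikidata.org/prop/direct/. The helper names `expand` and `shorten` are hypothetical.

```python
# Prefixes as predefined by the Wikidata query service.
PREFIXES = {
    "wd": "http://www.wikidata.org/entity/",
    "wdt": "http://www.wikidata.org/prop/direct/",
}


def expand(prefixed_name):
    """Turn a prefixed name such as "wd:Q340" into its full IRI."""
    prefix, _, local = prefixed_name.partition(":")
    return PREFIXES[prefix] + local


def shorten(iri):
    """Turn a full IRI back into its prefixed form when a prefix matches."""
    for prefix, base in PREFIXES.items():
        if iri.startswith(base):
            return f"{prefix}:{iri[len(base):]}"
    return iri  # unknown namespace: leave the IRI untouched


print(expand("wd:Q340"))   # http://www.wikidata.org/entity/Q340
print(shorten("http://www.wikidata.org/entity/Q340"))  # wd:Q340
```

The same id works whatever language the labels are requested in, which is what makes it a convenient key for a gazetteer.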
Using a SPARQL query may look over-complicated, but it allows us to do things that would be difficult if we were only using Wikipedia. For example, if we wanted to get five other music bands from Montreal, we could run the following query:
SELECT ?band ?bandLabel
{
?band wdt:P740 wd:Q340.
?band wdt:P31 wd:Q215380.
?band rdfs:label ?bandLabel.
FILTER (LANG(?bandLabel) = 'en').
}
LIMIT 5
And obtain results like these:
band | bandLabel |
---|---|
http://www.wikidata.org/entity/Q368132 | Blessed by a Broken Heart |
http://www.wikidata.org/entity/Q485825 | Simple Plan |
http://www.wikidata.org/entity/Q499847 | Islands |
http://www.wikidata.org/entity/Q614949 | The Luyas |
http://www.wikidata.org/entity/Q630797 | The Stills |
For this query, we used one of Wikidata’s most useful properties, instance of (P31), to find other bands (Q215380) from Montreal. If you would like to learn more about SPARQL, you can start with this tutorial and work your way through more complex queries.
Searching for music bands on Wikidata
Our task is to build an extensive list of music band names from Wikidata. I am no Wikidata taxonomist, so we will have to look around to learn how to build the right query. Let’s start our investigation by looking at the Wikidata entry for Arcade Fire.
Arcade Fire is an instance of band (with id Q215380), which is a subclass of musical ensemble (with id Q2088357), which is defined as a group of people who perform instrumental and/or vocal music, with the ensemble typically known by a distinct name.
Let’s investigate other examples:
- Alt-J is also an instance of band.
- Jean Leloup, a songwriter from Quebec, is not an instance of band but an instance of human. This is not very useful. If we look further down the page, we see that his occupation is singer.
- Céline Dion is also entered as a singer.
- André Gagnon, a famous pianist, has the occupation pianist, which belongs to the field of occupation music. Looking back at singer, the same is true.
If we generalize from these examples, we are searching for musical ensembles or humans whose occupation is in the field of music. Let’s translate this into two different queries. One for musical ensembles:
SELECT DISTINCT ?band ?bandLabel
WHERE
{
?band wdt:P31/wdt:P279* wd:Q2088357.
?band rdfs:label ?bandLabel.
FILTER (LANG(?bandLabel) = 'en')
}
LIMIT 5
band | bandLabel |
---|---|
http://www.wikidata.org/entity/Q396 | U2 |
http://www.wikidata.org/entity/Q371 | !!! |
http://www.wikidata.org/entity/Q689 | Bastille |
http://www.wikidata.org/entity/Q50598 | Infinite |
http://www.wikidata.org/entity/Q18788 | Epik High |
And one for humans (Q5) whose occupation (P106) is musician (Q639669) or a subclass (P279) of it:
SELECT DISTINCT ?musician ?musicianLabel
WHERE
{
?musician wdt:P31 wd:Q5;
wdt:P106/wdt:P279* wd:Q639669.
?musician rdfs:label ?musicianLabel.
FILTER (LANG(?musicianLabel) = 'en')
}
LIMIT 5
musician | musicianLabel |
---|---|
http://www.wikidata.org/entity/Q254 | Wolfgang Amadeus Mozart |
http://www.wikidata.org/entity/Q255 | Ludwig van Beethoven |
http://www.wikidata.org/entity/Q180861 | Roger Waters |
http://www.wikidata.org/entity/Q122538 | Laurentius Laurentii |
http://www.wikidata.org/entity/Q107164 | Atlas Crusius |
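Both queries rely on a property path ending in wdt:P279*, which follows zero or more subclass of links. The Python sketch below mimics `?band wdt:P31/wdt:P279* wd:Q2088357` on a toy graph built from the examples in this section. It uses plain labels instead of real ids to stay readable, and the function names are hypothetical.

```python
# Toy graph: labels stand in for Wikidata ids.
INSTANCE_OF = {  # P31
    "Arcade Fire": {"band"},
    "Alt-J": {"band"},
    "Jean Leloup": {"human"},
}
SUBCLASS_OF = {  # P279
    "band": {"musical ensemble"},
}


def superclasses(cls, subclass_of):
    """Return cls plus every class reachable through zero or more P279 hops."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited, also protects against cycles
        seen.add(current)
        stack.extend(subclass_of.get(current, ()))
    return seen


def matches(item, target, instance_of, subclass_of):
    """Mimic `?item wdt:P31/wdt:P279* target` on the toy graph."""
    return any(
        target in superclasses(cls, subclass_of)
        for cls in instance_of.get(item, ())
    )


ensembles = sorted(
    item for item in INSTANCE_OF
    if matches(item, "musical ensemble", INSTANCE_OF, SUBCLASS_OF)
)
print(ensembles)  # ['Alt-J', 'Arcade Fire']
```

Because the star allows zero hops, an item that is directly an instance of the target class matches too. That is why the musician query also returns people whose occupation is musician itself.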
Once the LIMIT clause is dropped, these queries match around 77,000 bands and 238,000 musicians.
Downloading our gazetteers
Now that we know which queries we want to build our gazetteers from, we are ready to download them. The easiest way I found to achieve this is to use the following curl command:
curl --data-urlencode query@file-with-query.sparql \
--header "Accept: text/csv" \
--output dataset.csv \
https://query.wikidata.org/bigdata/namespace/wdq/sparql
Our dataset will be saved as a CSV file under dataset.csv. With these datasets in hand, my next step for Word of Mouth is to annotate band and musician mentions in Reddit posts.
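For readers who prefer to stay in Python, here is a rough equivalent of the curl command, followed by a helper that turns the CSV into a set of names. This is a sketch under assumptions. The function names and the User-Agent string are hypothetical, the sample rows are copied from the tables above, and I assume the CSV header uses the query's variable names (band, bandLabel).

```python
import csv
import io
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/bigdata/namespace/wdq/sparql"


def build_request(query):
    """Prepare the same POST request as the curl command, without sending it."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    headers = {
        "Accept": "text/csv",
        # Hypothetical identifier; replace it with one describing your project.
        "User-Agent": "gazetteer-sketch/0.1",
    }
    return urllib.request.Request(ENDPOINT, data=data, headers=headers)


def download_csv(query):
    """Send the query and return the CSV response as text (needs network access)."""
    with urllib.request.urlopen(build_request(query)) as response:
        return response.read().decode("utf-8")


def load_gazetteer(csv_text, label_column):
    """Collect the non-empty values of label_column into a set of names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[label_column] for row in reader if row[label_column]}


# Offline example using two rows from the musical ensemble results.
sample = (
    "band,bandLabel\n"
    "http://www.wikidata.org/entity/Q396,U2\n"
    "http://www.wikidata.org/entity/Q689,Bastille\n"
)
names = load_gazetteer(sample, "bandLabel")
print(sorted(names))  # ['Bastille', 'U2']
```

Storing the names in a set removes duplicate labels and keeps lookups cheap, which should help when scanning a large number of posts.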