UN Voting Cluster Analysis

The data set contains the votes of the UN General Assembly, each vote for each member nation. I discovered it when reading a blog post and I thought I could do some cluster analysis on it.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
load("Data\\UNVotesPublished.RData")

# Do a filter to only get the last 10 years (session) and ignore votes that are irrelevant
voteslast10 <- x %>% filter(session > 60 & vote <=3)

The vote column has several options, but we only care if they voted yes (1) or not:

voteslast10$vote <- as.integer(voteslast10$vote == 1)

Let’s get a look at the votes and calculate the average percent for each vote. Then filter out votes that were unamimous because they are less interesting.

votes10avg <- voteslast10 %>% group_by(rcid) %>% summarize(perc = mean(vote))
votes10avg <- votes10avg %>% filter(perc < 1 & perc > 0) %>% arrange(desc(perc))

Let’s examine a histogram of the vote percentages:

hist(votes10avg$perc)

The vote percentages are concentrated around 1. There are a few close votes and very few votes with low percentages. This makes sense, because an unpopular proposal would be unlikely to come up for a vote.

Take each country’s vote and subtract the overall average to get the score. The score is a measure of how different the country is from the consensus. If a country votes no on a popular proposal that passed with 90% approval, then then will have a score of -0.9 (0 - 0.9), whereas a country that voted yes on that proposal would have a score of 0.1 (1 - 0.9), indicating that it is closer to the consensus.

scores <- voteslast10 %>% inner_join(votes10avg, by = c("rcid", "rcid"))
scores$score <- scores$vote - scores$perc

The countrycode package will let us convert the country codes to country names. We will need a data set with one row per country, and one column per vote score.

# install.packages("countrycode")
library(countrycode)

countries <- voteslast10 %>% group_by(ccode) %>% summarize(avgscore = mean(vote))
countries$cname <- countrycode(countries$ccode, "cown","country.name")
countries <- countries %>% arrange(avgscore)

#voteslast10$cname <- countrycode(voteslast10$ccode, "cown","country.name")

countries[1:10,]
## # A tibble: 10 × 3
##    ccode  avgscore                                                cname
##    <int>     <dbl>                                                <chr>
## 1      2 0.1814404                             United States of America
## 2    666 0.1985915                                               Israel
## 3    986 0.3359375                                                Palau
## 4    987 0.3816667                     Micronesia (Federated States of)
## 5     20 0.4466019                                               Canada
## 6    983 0.4480122                                     Marshall Islands
## 7    900 0.5179558                                            Australia
## 8    970 0.5213675                                                Nauru
## 9    200 0.5222841 United Kingdom of Great Britain and Northern Ireland
## 10   220 0.5255878                                               France
countries[length(countries$ccode):(length(countries$ccode)-10),]
## # A tibble: 11 × 3
##    ccode  avgscore                 cname
##    <int>     <dbl>                 <chr>
## 1    591 0.9919679            Seychelles
## 2     80 0.9780381                Belize
## 3    781 0.9716312              Maldives
## 4    404 0.9707904         Guinea Bissau
## 5    403 0.9677419 Sao Tome and Principe
## 6    811 0.9674556              Cambodia
## 7    402 0.9665145            Cabo Verde
## 8    700 0.9637155           Afghanistan
## 9    860 0.9624796           Timor-Leste
## 10   481 0.9571984                 Gabon
## 11    56 0.9547445           Saint Lucia

So here is a simple one dimensional clustering. The first set are the countries that voted NO most often. The second set voted YES most often.

Now add a column for each vote:

# Process each vote

for (rcid1 in votes10avg$rcid)
{
  colname <- paste("V" , toString(rcid1), sep="")

  # Get votes scores for this vote
  # do join
  countries[colname] <- (countries %>% left_join(filter(scores, rcid == rcid1), by=c("ccode","ccode")))$score
}

countries[is.na(countries)] <- 0

Now that we have a good dataset, use the dist() function to create a matrix of distances between each row. Then run cmdscale() to scale those distances to 2 dimensions.

# Calculate distance matrix
countriesDist <- dist(countries[,c(-1, -2, -3)])

countriesmodel <- cmdscale(countriesDist, k=2)

countriescoord <- as.data.frame(countriesmodel)
countriescoord <- cbind(countriescoord, countries$cname)

### plot the points (with labels)
colnames(countriescoord) <- c("attribute1", "attribute2", "country")

require(ggplot2)
## Loading required package: ggplot2
ggplot()+geom_point(data=countriescoord, aes(x=attribute1, y=attribute2), color="gray")+
  geom_text(data=countriescoord, aes(x=attribute1, y=attribute2, label=country, size=1))

Analyzing the plot

We can see definite clusters in the plot, though the biggest ones are crowded. Like the one dimensional version, the U.S. is far from the others, along with Israel. This indicates that the two countries vote the same on most votes.

A plot with the more interesting items is easier to read.

## Loading required package: ggplot2

While the countries on the left vote together on many items, the top left and top right have their own differences. Canada and Australia try to split the difference between the top left and the top right. Then there are few countries that vote independently such as Naru and Cameroon.