Making a pairwise distance matrix in pandas

This is a somewhat specialized problem that forms part of a lot of data science and clustering workflows. It starts with a relatively straightforward question: if we have a bunch of measurements for two different things, how do we come up with a single number that represents the difference between the two things?

An example will make the question clearer. Let's load our olympic medal dataset:

import pandas as pd
pd.options.display.max_rows = 10
pd.options.display.max_columns = 6
data = pd.read_csv("https://raw.githubusercontent.com/mojones/binders/master/olympics.csv", sep="\t")
data
City Year Sport ... Medal Country Int Olympic Committee code
0 Athens 1896 Aquatics ... Gold Hungary HUN
1 Athens 1896 Aquatics ... Silver Austria AUT
2 Athens 1896 Aquatics ... Bronze Greece GRE
3 Athens 1896 Aquatics ... Gold Greece GRE
4 Athens 1896 Aquatics ... Silver Greece GRE
... ... ... ... ... ... ... ...
29211 Beijing 2008 Wrestling ... Silver Germany GER
29212 Beijing 2008 Wrestling ... Bronze Lithuania LTU
29213 Beijing 2008 Wrestling ... Bronze Armenia ARM
29214 Beijing 2008 Wrestling ... Gold Cuba CUB
29215 Beijing 2008 Wrestling ... Silver Russia RUS

29216 rows × 12 columns

and measure, for each different country, the number of medals they've won in each different sport:

summary = data.groupby(['Country', 'Sport']).size().unstack().fillna(0)
summary
Sport Aquatics Archery Athletics ... Water Motorsports Weightlifting Wrestling
Country
Afghanistan 0.0 0.0 0.0 ... 0.0 0.0 0.0
Algeria 0.0 0.0 6.0 ... 0.0 0.0 0.0
Argentina 3.0 0.0 5.0 ... 0.0 2.0 0.0
Armenia 0.0 0.0 0.0 ... 0.0 4.0 4.0
Australasia 11.0 0.0 1.0 ... 0.0 0.0 0.0
... ... ... ... ... ... ... ...
Virgin Islands* 0.0 0.0 0.0 ... 0.0 0.0 0.0
West Germany 62.0 0.0 67.0 ... 0.0 7.0 9.0
Yugoslavia 91.0 0.0 2.0 ... 0.0 0.0 16.0
Zambia 0.0 0.0 1.0 ... 0.0 0.0 0.0
Zimbabwe 7.0 0.0 0.0 ... 0.0 0.0 0.0

137 rows × 42 columns

Now we'll pick two countries:

summary.loc[['Germany', 'Italy']]
Sport Aquatics Archery Athletics ... Water Motorsports Weightlifting Wrestling
Country
Germany 175.0 6.0 99.0 ... 0.0 20.0 24.0
Italy 113.0 12.0 71.0 ... 0.0 14.0 20.0

2 rows × 42 columns

Each country has 42 columns giving the total number of medals won in each sport. Our job is to come up with a single number that summarizes how different those two lists of numbers are. Mathematicians have figured out lots of different ways of doing that, many of which are implemented in the scipy.spatial.distance module.

If we just import pdist from the module, and pass in our dataframe of two countries, we'll get a measuremnt:

from scipy.spatial.distance import pdist
pdist(summary.loc[['Germany', 'Italy']])
array([342.3024978])

That's the distance score using the default metric, which is called the euclidian distance. Think of it as the straight line distance between the two points in space defined by the two lists of 42 numbers.

Now, what happens if we pass in a dataframe with three countries?

pdist(summary.loc[['Germany', 'Italy', 'France']])
array([342.3024978 , 317.98584874, 144.82403116])

As we might expect, we have three measurements:

  • Germany and Italy
  • Germnay and France
  • Italy and France

But it's not easy to figure out which belongs to which. Happily, scipy also has a helper function that will take this list of numbers and turn it back into a square matrix:

from scipy.spatial.distance import squareform

squareform(pdist(summary.loc[['Germany', 'Italy', 'France']]))
array([[  0.        , 342.3024978 , 317.98584874],
       [342.3024978 ,   0.        , 144.82403116],
       [317.98584874, 144.82403116,   0.        ]])

In order to make sense of this, we need to re-attach the country names, which we can just do by turning it into a DataFrame:

pd.DataFrame(
    squareform(pdist(summary.loc[['Germany', 'Italy', 'France']])),
    columns = ['Germany', 'Italy', 'France'],
    index = ['Germany', 'Italy', 'France']
)
Germany Italy France
Germany 0.000000 342.302498 317.985849
Italy 342.302498 0.000000 144.824031
France 317.985849 144.824031 0.000000

Hopefully this agrees with our intuition; the numbers on the diagonal are all zero, because each country is identical to itself, and the numbers above and below are mirror images, because the distance between Germany and France is the same as the distance between France and Germany (remember that we are talking about distance in terms of their medal totals, not geographical distance!)

Finally, to get pairwise measurements for the whole input dataframe, we just pass in the complete object and get the country names from the index:

pairwise = pd.DataFrame(
    squareform(pdist(summary)),
    columns = summary.index,
    index = summary.index
)

pairwise
Country Afghanistan Algeria Argentina ... Yugoslavia Zambia Zimbabwe
Country
Afghanistan 0.000000 8.774964 96.643675 ... 171.947666 1.732051 17.492856
Algeria 8.774964 0.000000 95.199790 ... 171.688672 7.348469 19.519221
Argentina 96.643675 95.199790 0.000000 ... 148.128323 96.348326 89.810912
Armenia 5.830952 9.848858 96.477977 ... 171.604196 5.744563 18.384776
Australasia 18.708287 20.024984 97.744565 ... 166.991018 18.627936 22.360680
... ... ... ... ... ... ... ...
Virgin Islands* 1.414214 8.774964 96.457244 ... 171.947666 1.732051 17.492856
West Germany 153.052279 150.306354 142.537714 ... 184.945938 152.577849 144.045132
Yugoslavia 171.947666 171.688672 148.128323 ... 0.000000 171.874955 169.103519
Zambia 1.732051 7.348469 96.348326 ... 171.874955 0.000000 17.521415
Zimbabwe 17.492856 19.519221 89.810912 ... 169.103519 17.521415 0.000000

137 rows × 137 columns

A nice way to visualize these is with a heatmap. 137 countries is a bit too much to show on a webpage, so let's restrict it to just the countries that have scored at least 500 medals total:

import seaborn as sns
import matplotlib.pyplot as plt

# make summary table for just top countries
top_countries = (
    data
    .groupby('Country')
    .filter(lambda x : len(x) > 500)
    .groupby(['Country', 'Sport'])
    .size()
    .unstack()
    .fillna(0)
    )

# make pairwise distance matrix
pairwise_top = pd.DataFrame(
    squareform(pdist(top_countries)),
    columns = top_countries.index,
    index = top_countries.index
)

# plot it with seaborn
plt.figure(figsize=(10,10))
sns.heatmap(
    pairwise_top,
    cmap='OrRd',
    linewidth=1
)

png

Now that we have a plot to look at, we can see a problem with the distance metric we're using. The US has won so many more medals than other countries that it distorts the measurement. And if we think about it, what we're really interested in is not the exact number of medals in each category, but the relative number. In other words, we want two contries to be considered similar if they both have about twice as many medals in boxing as athletics, for example, regardless of the exact numbers.

Luckily for us, there is a distance measure already implemented in scipy that has that property - it's called cosine distance. Think of it as a measurement that only looks at the relationships between the 44 numbers for each country, not their magnitude. We can switch to cosine distance by specifying the metric keyword argument in pdist:

# make pairwise distance matrix
pairwise_top = pd.DataFrame(
    squareform(pdist(top_countries, metric='cosine')),
    columns = top_countries.index,
    index = top_countries.index
)

# plot it with seaborn
plt.figure(figsize=(10,10))
sns.heatmap(
    pairwise_top,
    cmap='OrRd',
    linewidth=1
)

png

And as you can see we spot some much more interstesting patterns. Notice, for example, that Russia and Soviet Union have a very low distance (i.e. their medal distributions are very similar).

When looking at data like this, remember that the shade of each cell is not telling us anything about how many medals a country has won - simply how different or similar each country is to each other. Compare the above heatmap with this one which displays the proportion of medals in each sport per country:

plt.figure(figsize=(10,10))
sns.heatmap(
    top_countries.apply(lambda x : x / x.sum(), axis=1),
    cmap='BuPu',
    square=True,
    cbar_kws = {'fraction' : 0.02}
)

png

Finally, how might we find pairs of countries that have very similar medal distributions (i.e. very low numbers in the pairwise table)? By far the easiest way is to start of by reshaping the table into long form, so that each comparison is on a separate row:

# create our pairwise distance matrix
pairwise = pd.DataFrame(
    squareform(pdist(summary, metric='cosine')),
    columns = summary.index,
    index = summary.index
)

# move to long form
long_form = pairwise.unstack()

# rename columns and turn into a dataframe
long_form.index.rename(['Country A', 'Country B'], inplace=True)
long_form = long_form.to_frame('cosine distance').reset_index()

Now we can write our filter as normal, remembering to filter out the unintersting rows that tell us a country's distance from itself!

long_form[
    (long_form['cosine distance'] < 0.05) 
    & (long_form['Country A'] != long_form['Country B'])
]
Country A Country B cosine distance
272 Algeria Zambia 0.026671
1034 Azerbaijan Mongolia 0.045618
1105 Bahamas Barbados 0.021450
1111 Bahamas British West Indies 0.021450
1113 Bahamas Burundi 0.021450
... ... ... ...
17033 United Arab Emirates Haiti 0.010051
17037 United Arab Emirates Independent Olympic Participants 0.000000
17051 United Arab Emirates Kuwait 0.000000
18164 Virgin Islands Netherlands Antilles 0.000000
18496 Zambia Algeria 0.026671

462 rows × 3 columns