League: Carries vs Non-Carries
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
from prettytable import PrettyTable
%matplotlib inline
%config InlineBackend.figure_format = 'retina' # Higher resolution figures
Summary of Findings
Introduction
In this project we look at League of Legends (LoL) match data from 2022. More specifically, we want to find out whether carry roles generally do better on average.
What Constitutes 'Doing Better'
There are many different ways to rank performance in LoL, such as number of kills, a player's creep score (CS), number of towers destroyed, total damage dealt, and even total gold accumulated. In LoL, and in MOBAs (Multiplayer Online Battle Arenas) generally, there is also the notion of game impact, which is less quantifiable and can range from minuscule micro plays that turn the tide of a battle to sacrificing one's own life to change an outcome and give the team an advantage.
The bottom line is that a huge number of factors can play a role in 'doing better'. For the sake of answering our question, we have therefore selected a player's KDA (Kills, Deaths, Assists ratio) as our definition of 'doing better': the higher a player's KDA, the better they are doing. $$KDA = \frac{Kills + Assists}{Deaths}$$
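To make the definition concrete, here is a minimal sketch of the computation (the helper name kda_of is ours, purely for illustration; the 0-death edge case is discussed in the EDA below):
# Hypothetical helper illustrating the KDA formula
def kda_of(kills, deaths, assists):
    # A 0-death game divides by zero; we treat it as infinity (a 'perfect' game)
    if deaths == 0:
        return float('inf')
    return (kills + assists) / deaths

kda_of(6, 3, 12)  # 6.0
kda_of(4, 0, 10)  # inf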
What Determines a Carry?
A carry in a MOBA is typically a role whose champions tend to grow stronger as the game goes on, eventually becoming the strongest on the map. This generally correlates with 'doing better' by our definition, as it means more kills, more assists, and fewer deaths.
To answer our question, we first identified the different roles in LoL: the Top laner, the Bottom laner (also known as the Attack Damage Carry, or ADC), the Mid laner, the Support, and the Jungler. As the Bottom laner (ADC) and Mid laner are usually considered the carry roles, for the purpose of answering our question we label them 'Carry' and the other roles 'Non-carry'.
Cleaning
First we began by loading in the data from the Oracle's Elixir website. We chose to work with the 2022 dataset as it is more representative of the current meta.
#Loading in the data
df = pd.read_csv('data/2022_LoL.csv')
#Displaying the first 5 rows of the dataframe
df.head()
c:\Users\16195\anaconda3\envs\dsc80\lib\site-packages\IPython\core\interactiveshell.py:3378: DtypeWarning: Columns (2) have mixed types.Specify dtype option on import or set low_memory=False. exec(code_obj, self.user_global_ns, self.user_ns)
 | gameid | datacompleteness | url | league | year | split | playoffs | date | game | patch | ... | opp_csat15 | golddiffat15 | xpdiffat15 | csdiffat15 | killsat15 | assistsat15 | deathsat15 | opp_killsat15 | opp_assistsat15 | opp_deathsat15
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | ESPORTSTMNT01_2690210 | complete | NaN | LCK CL | 2022 | Spring | 0 | 2022-01-10 07:44:08 | 1 | 12.01 | ... | 121.0 | 391.0 | 345.0 | 14.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
1 | ESPORTSTMNT01_2690210 | complete | NaN | LCK CL | 2022 | Spring | 0 | 2022-01-10 07:44:08 | 1 | 12.01 | ... | 100.0 | 541.0 | -275.0 | -11.0 | 2.0 | 3.0 | 2.0 | 0.0 | 5.0 | 1.0 |
2 | ESPORTSTMNT01_2690210 | complete | NaN | LCK CL | 2022 | Spring | 0 | 2022-01-10 07:44:08 | 1 | 12.01 | ... | 119.0 | -475.0 | 153.0 | 1.0 | 0.0 | 3.0 | 0.0 | 3.0 | 3.0 | 2.0 |
3 | ESPORTSTMNT01_2690210 | complete | NaN | LCK CL | 2022 | Spring | 0 | 2022-01-10 07:44:08 | 1 | 12.01 | ... | 149.0 | -793.0 | -1343.0 | -34.0 | 2.0 | 1.0 | 2.0 | 3.0 | 3.0 | 0.0 |
4 | ESPORTSTMNT01_2690210 | complete | NaN | LCK CL | 2022 | Spring | 0 | 2022-01-10 07:44:08 | 1 | 12.01 | ... | 21.0 | 443.0 | -497.0 | 7.0 | 1.0 | 2.0 | 2.0 | 0.0 | 6.0 | 2.0 |
5 rows × 123 columns
Looking at the data, the columns that best captured the information needed to answer our question were `kills`, `deaths`, and `assists`, since these are required to calculate the KDA of each role. We also needed the `position` column, which contains the role information, and we kept the `gameid` column to distinguish between different games.
#Columns that we determined were relevant
cols = ['gameid', 'position', 'kills','deaths', 'assists']
df_relevant_cols = df[cols] #Filtering
df_relevant_cols[:10]
 | gameid | position | kills | deaths | assists
---|---|---|---|---|---
0 | ESPORTSTMNT01_2690210 | top | 2 | 3 | 2 |
1 | ESPORTSTMNT01_2690210 | jng | 2 | 5 | 6 |
2 | ESPORTSTMNT01_2690210 | mid | 2 | 2 | 3 |
3 | ESPORTSTMNT01_2690210 | bot | 2 | 4 | 2 |
4 | ESPORTSTMNT01_2690210 | sup | 1 | 5 | 6 |
5 | ESPORTSTMNT01_2690210 | top | 1 | 1 | 12 |
6 | ESPORTSTMNT01_2690210 | jng | 4 | 1 | 10 |
7 | ESPORTSTMNT01_2690210 | mid | 6 | 3 | 12 |
8 | ESPORTSTMNT01_2690210 | bot | 8 | 2 | 10 |
9 | ESPORTSTMNT01_2690210 | sup | 0 | 2 | 18 |
Before we proceed, we used the pandas method `.isna()` to check whether our columns contained any missing values.
df_relevant_cols.isna().sum() #Count of all Nans present in the Dataframe
gameid      0
position    0
kills       0
deaths      0
assists     0
dtype: int64
As all the column sums were zero, we can conclude that, for this particular subset, there are no missing values that need to be handled.
As we looked through the data, we noticed odd outlying rows that appeared at every 11th and 12th position.
# Displaying rows 9-13 to show the outlying data points
df_relevant_cols[9:14]
 | gameid | position | kills | deaths | assists
---|---|---|---|---|---
9 | ESPORTSTMNT01_2690210 | sup | 0 | 2 | 18 |
10 | ESPORTSTMNT01_2690210 | team | 9 | 19 | 19 |
11 | ESPORTSTMNT01_2690210 | team | 19 | 9 | 62 |
12 | ESPORTSTMNT01_2690219 | top | 0 | 5 | 2 |
13 | ESPORTSTMNT01_2690219 | jng | 1 | 2 | 1 |
With that we looked through the whole frame and realized that the rows come in blocks of 12 per game: 5 player rows for each of the two teams, plus 2 aggregate 'team' rows. We deemed the 'team' rows irrelevant to our question, so we filtered them out (after the quick check below).
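Before filtering, a quick sanity check (a sketch against the raw dataframe): the position column should contain exactly the five roles plus the 'team' label, each appearing twice per game (once per side).
# Five roles plus 'team'; each label occurs twice per game, once per side
df['position'].value_counts()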
# Filtering out the irrelevant 'team' rows from the dataframe
df_relevant_cols = df_relevant_cols[df_relevant_cols['position'] != 'team'].copy() # .copy() avoids SettingWithCopyWarning when adding columns later
df_relevant_cols[9:14]
 | gameid | position | kills | deaths | assists
---|---|---|---|---|---
9 | ESPORTSTMNT01_2690210 | sup | 0 | 2 | 18 |
12 | ESPORTSTMNT01_2690219 | top | 0 | 5 | 2 |
13 | ESPORTSTMNT01_2690219 | jng | 1 | 2 | 1 |
14 | ESPORTSTMNT01_2690219 | mid | 2 | 4 | 0 |
15 | ESPORTSTMNT01_2690219 | bot | 0 | 2 | 2 |
Next we created a new column containing the KDA of each player in each game, using the formula $$KDA = \frac{Kills + Assists}{Deaths}$$ We also recoded the position column to distinguish between carry and non-carry roles.
# Calculating the KDA and adding it to the dataframe
kda = (df_relevant_cols['kills'] + df_relevant_cols['assists'])/df_relevant_cols['deaths']
df_relevant_cols['kda'] = kda
#Fixing the position column to fit our purposes
position_fixed = df_relevant_cols['position'].apply(lambda s:'Carry' if s in ['bot','mid'] else 'Non-carry')
df_relevant_cols['position'] = position_fixed
df_relevant_cols.head()
 | gameid | position | kills | deaths | assists | kda
---|---|---|---|---|---|---
0 | ESPORTSTMNT01_2690210 | Non-carry | 2 | 3 | 2 | 1.333333 |
1 | ESPORTSTMNT01_2690210 | Non-carry | 2 | 5 | 6 | 1.600000 |
2 | ESPORTSTMNT01_2690210 | Carry | 2 | 2 | 3 | 2.500000 |
3 | ESPORTSTMNT01_2690210 | Carry | 2 | 4 | 2 | 1.000000 |
4 | ESPORTSTMNT01_2690210 | Non-carry | 1 | 5 | 6 | 1.400000 |
EDA
During our exploratory data analysis, we wanted to visualize and understand the data more before we proceeded with a hypothesis test to answer our question.
Univariate Analysis
In order to understand the data better, we first did a univariate analysis, looking at the distributions of `kills`, `assists`, `deaths`, and `kda` in hopes of finding any relationships between them.
df_relevant_cols.describe() # Show statistics of values from the dataframe
statistic | kills | deaths | assists | kda
---|---|---|---|---
count | 122090.000000 | 122090.000000 | 122090.000000 | 1.220260e+05 |
mean | 2.891121 | 2.896888 | 6.453731 | inf |
std | 2.739519 | 1.992892 | 4.474779 | NaN |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000e+00 |
25% | 1.000000 | 1.000000 | 3.000000 | 1.333333e+00 |
50% | 2.000000 | 3.000000 | 6.000000 | 3.250000e+00 |
75% | 4.000000 | 4.000000 | 9.000000 | 8.000000e+00 |
max | 28.000000 | 14.000000 | 41.000000 | inf |
The first thing we noticed was that the max and the mean of `kda` were infinity. Considering the formula for `kda`, we realized this must come from division by zero, i.e. from games where the player had 0 deaths, meaning they had a perfect game. (The slightly lower count for `kda` reflects games with 0 kills, 0 assists, and 0 deaths, where 0/0 produces NaN instead.) This is important, as it changes the way we will calculate the `kda` when performing the hypothesis test.
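Here is a small sketch of the problem, and of the workaround we use later: taking the ratio of column means instead of the mean of per-player ratios (the toy dataframe is hypothetical):
# Toy example: one ordinary game and one 0-death 'perfect' game
toy = pd.DataFrame({'kills': [2, 4], 'deaths': [2, 0], 'assists': [3, 10]})
per_player = (toy['kills'] + toy['assists']) / toy['deaths']  # 2.5 and inf
per_player.mean()  # inf: a single perfect game swamps the average
(toy['kills'].mean() + toy['assists'].mean()) / toy['deaths'].mean()  # (3 + 6.5) / 1 = 9.5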
Next we decided to look at the distributions of `kills`, `deaths`, and `assists` as a graph.
Putting the data into a graph
# Plotting 3 histograms, for Kills, Deaths and Assists
plt.hist(df_relevant_cols['assists'], label='Assists Distribution', alpha = 0.5)
plt.hist(df_relevant_cols['kills'], label='Kill Distribution', alpha = 0.5)
plt.hist(df_relevant_cols['deaths'], label='Death Distribution', alpha = 0.5)
# Plotting the legend
plt.legend(loc='upper right')
# Title
plt.title('Assist, Kill, and Death Distributions')
# x-axis and y-axis label
plt.xlabel('Count')
plt.ylabel('Frequency')
[Figure: overlaid histograms of the assist, kill, and death distributions]
This histogram did not end up helping us visualize anything directly related to our question, but it will be useful later on for identifying realistic KDA values. ___
Bivariate Analysis
Now let's dig deeper into the analysis to search for trends and patterns. First we made a couple of pivot tables comparing statistics between 'Carry' and 'Non-carry' roles.
# Pivot table, groupby positions and displaying mean, median, max and min values of kills, deaths, assists and kda
pd.pivot_table(df_relevant_cols, index='position', aggfunc=['mean', 'median', 'max', 'min'], values=['kills', 'deaths', 'assists', 'kda'], fill_value=np.inf)
position | mean assists | mean deaths | mean kda | mean kills | median assists | median deaths | median kda | median kills | max assists | max deaths | max kda | max kills | min assists | min deaths | min kda | min kills
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Carry | 5.625072 | 2.593435 | inf | 3.872574 | 5 | 2 | 3.75 | 3 | 27 | 14 | inf | 28 | 0 | 0 | 0 | 0 |
Non-carry | 7.006170 | 3.099189 | inf | 2.236820 | 6 | 3 | 3.00 | 2 | 41 | 14 | inf | 27 | 0 | 0 | 0 | 0 |
We also wanted to see what the `kda` statistics would look like without the perfect games, so we replaced the infinities with NaNs.
# Replacing the infinite values with nans
df_relevant_cols_no_inf = df_relevant_cols.replace(np.inf, np.nan)
# Pivot table, groupby positions and displaying mean, median, max and min values of kills, deaths, assists and kda
pd.pivot_table(df_relevant_cols_no_inf, index='position', aggfunc=['mean', 'median', 'max', 'min'], values=['kills', 'deaths', 'assists', 'kda'], fill_value=np.inf)
position | mean assists | mean deaths | mean kda | mean kills | median assists | median deaths | median kda | median kills | max assists | max deaths | max kda | max kills | min assists | min deaths | min kda | min kills
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Carry | 5.625072 | 2.593435 | 4.687860 | 3.872574 | 5 | 2 | 3.0 | 3 | 27 | 14 | 33 | 28 | 0 | 0 | 0 | 0 |
Non-carry | 7.006170 | 3.099189 | 4.112728 | 2.236820 | 6 | 3 | 2.5 | 2 | 41 | 14 | 35 | 27 | 0 | 0 | 0 | 0 |
Removing the infinities from the `kda` column does yield a finite mean `kda`. This value is misleading, however: pandas' `mean` ignores NaNs, so every perfect game is silently dropped, which shrinks the number of games counted and skews the result.
A more robust alternative is to compute an aggregate KDA from the mean values of `kills`, `assists`, and `deaths`, since those columns contain no infinities and the 0-division problem never arises. Because the mean is linear, the mean of kills plus assists is just the sum of the two column means; dividing by the mean of deaths then gives a well-defined ratio-of-means KDA (note this is the ratio of the means, not the mean of per-player ratios, and it is the statistic we use in the hypothesis test below). For the 'Carry' rows this gives $$Carry_{KDA} = \frac{3.8726 + 5.6251}{2.5934} \approx 3.662$$
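As a quick check, the same number can be computed directly (a sketch; we formalize this as get_mean_kda in the hypothesis test section):
# Ratio-of-means KDA for the carry rows; matches the hand calculation above
carry_rows = df_relevant_cols[df_relevant_cols['position'] == 'Carry']
(carry_rows['kills'].mean() + carry_rows['assists'].mean()) / carry_rows['deaths'].mean()  # ~3.662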
Next we plot a histogram comparing the `kda` values of the 'Carry' and 'Non-carry' roles; we leave the infinite values as NaNs so that the histogram can be plotted.
# Getting one dataframe with only Carries and one with only Non-carries
df_carry = df_relevant_cols_no_inf[df_relevant_cols_no_inf['position'] == 'Carry']
df_noncarry = df_relevant_cols_no_inf[df_relevant_cols_no_inf['position'] != 'Carry']
# Plotting both histograms
plt.hist(df_noncarry['kda'], label = 'Non-Carry KDA')
plt.hist(df_carry['kda'], label = 'Carry KDA')
# Plotting the Legend
plt.legend(loc = 'upper right')
# x-axis and y-axis labels
plt.xlabel('KDA Value')
plt.ylabel('frequency')
# Title
plt.title(label="Carry KDA vs. Non-Carry KDA")
[Figure: histograms of Carry vs. Non-Carry KDA values]
Looking at the histogram, we noticed that carry roles tend to have higher KDA than the other roles. So now we will run a hypothesis test to see whether this is due to random chance. ___
Hypothesis Test
Null Hypothesis: It is due to random chance that 'Carry' roles have a higher `kda`.
Alternative Hypothesis: It is not due to random chance that 'Carry' roles have a higher `kda`.
For our test statistic we will use the mean `kda` (the ratio of means described above), in order to check whether the observed carry sample lies in the same distribution as a set of randomly drawn samples.
Observed Value
df_carry = df_relevant_cols[df_relevant_cols['position'] == 'Carry'] # Carry only Dataframe
def get_mean_kda(df):
"""Function to get the mean KDA of a given dataframe
Args:
df (Dataframe): Dataframe to get the mean KDA
Returns:
Float: Mean KDA of dataframe given
"""
return (df['kills'].mean() + df['assists'].mean()) / df['deaths'].mean()
# Getting the mean KDA of the observed
obsv = get_mean_kda(df_carry)
obsv
3.662187235991252
Now we will repeatedly sample from our cleaned dataset without replacement, to simulate the case where `kda` is unaffected by role. Each sample contains $\frac{2}{5}$ of the rows, matching the proportion of 'Carry' rows in the original dataset.
#Running Hypothesis Tests
drc_kda = df_relevant_cols[['position', 'kills', 'assists', 'deaths']]
N = 10000 # Number of times to run the loop
stats = [] # List to store KDA means
# Loop that samples the Dataframe and calculates KDA means and stores in the list
for _ in range(N):
sample = drc_kda.sample(2*(drc_kda.shape[0]//5))
stats.append(get_mean_kda(sample))
stats[0:10]
[3.206143008101444, 3.248640204501882, 3.2498153304165007, 3.222665912785356, 3.2389620120735727, 3.2113751763046543, 3.240155724650504, 3.2115063885119857, 3.228300629501503, 3.2473746948969744]
Now we calculate the p-value and plot a histogram of our findings.
(np.array(stats) >= obsv).sum() / N # Getting the p-value (the list must be converted to an array for the elementwise comparison)
0.0
# Plotting Histogram
plt.hist(stats, label = 'simulations')
# Observed Line
plt.axvline(obsv, color = 'red', label = 'observed')
# Cutoff
plt.axvline(np.percentile(stats, 99), color = 'orange', label = 'cutoff')
# Legend, Title, axis labels
plt.legend(loc = 'upper center')
plt.title('Test if Carry KDA is Due to Chance')
plt.xlabel('Mean KDA Values')
plt.ylabel('Frequency')
[Figure: histogram of simulated mean KDAs, with the observed value and the 99th-percentile cutoff marked]
Conclusion
According to the plot, we can reject the null hypothesis at the 99% confidence level. Therefore it does not seem likely that the higher `kda` of 'Carry' roles is due to random chance.
Assessment of Missingness
As per our assignment, we were also required to assess non-trivial missing values in the original dataset. As it happens, the columns relevant to our question did not contain any missing data, so we instead decided to analyze missing `playerid` values, which we had noticed while skimming the dataset.
In that skim, we noticed that the missing `playerid` values tend to come in bunches, so we did not think they could be classified as Not Missing At Random (NMAR). To test this, we ran a couple of permutation tests to investigate whether the missingness was dependent on other columns.
Is playerid missing due to position?
Since we had spent most of the project working with player positions/roles, we first wanted to test whether the `playerid` missingness was dependent on the role.
First, just as in the data cleaning earlier, we pick out the relevant columns and remove the rows that represent teams rather than players.
df_relevant_missing = df.copy()
# Getting relevant columns for assessing missingness
df_relevant_missing = df_relevant_missing[['position', 'playerid']]
# Cleaning Dataframe (again)
df_relevant_missing = df_relevant_missing[df_relevant_missing['position'] != 'team']
df_relevant_missing.head()
 | position | playerid
---|---|---
0 | top | oe:player:38e0af7278d6769d0c81d7c4b47ac1e |
1 | jng | oe:player:637ed20b1e41be1c51bd1a4cb211357 |
2 | mid | oe:player:d1ae0e2f9f3ac1e0e0cdcb86504ca77 |
3 | bot | oe:player:998b3e49b01ecc41eacc392477a98cf |
4 | sup | oe:player:e9741b3a238723ea6380ef2113fae63 |
Next, we added a new boolean column that tells us whether the `playerid` value is missing. From there we generated an empirical distribution of `playerid` missingness against `position`.
# Creating new boolean column
df_relevant_missing['id missing'] = df_relevant_missing['playerid'].isna()
# Pivot table groupby positions, getting the size of missing and not missing values
empircal = pd.pivot_table(df_relevant_missing,
index = 'position',
columns = 'id missing',
aggfunc = 'size',
values = None)\
.fillna(0).apply(lambda x: x / x.sum()) # Get proportions
empircal
position | id missing = False | id missing = True
---|---|---
bot | 0.200060 | 0.196455 |
jng | 0.199927 | 0.204333 |
mid | 0.199902 | 0.205810 |
sup | 0.200285 | 0.183161 |
top | 0.199827 | 0.210241 |
Looking at the values, there appears to be no real relationship between `position` and whether `playerid` is missing. We plotted a graph to further visualize this.
# Plotting empircal table
empircal.plot(kind= 'barh',
title= 'Missing vs Not Missing PlayerId by Role',
xlabel= 'Roles',
ylabel= 'Distribution')
[Figure: horizontal bar chart of missing vs. not-missing playerid proportions by role]
The graph also suggests that there is no relationship between `position` and `playerid` missingness. To confirm this, we will run a permutation test.
Permutation Test, Position vs PlayerId
Null Hypothesis: The distribution of `position` is the same whether `playerid` is missing or not.
Alternative Hypothesis: The distribution of `position` is not the same whether `playerid` is missing or not.
As we are comparing two categorical distributions, the test statistic we will use is the Total Variation Distance (TVD), calculated as $$\text{TVD}(A, B) = \frac{1}{2} \sum_{i = 1}^k |a_i - b_i|$$
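A minimal sketch of this statistic (the function name tvd is ours; it assumes two aligned arrays of category proportions):
def tvd(dist_a, dist_b):
    # Total variation distance between two aligned categorical distributions
    return np.abs(np.array(dist_a) - np.array(dist_b)).sum() / 2

tvd([0.5, 0.5], [0.9, 0.1])  # 0.4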
First we get the observed value.
# Calculate Observed TVD
obsv_tvd = empircal.diff(axis=1).iloc[:, -1].abs().sum() / 2
obsv_tvd
0.02072887772571398
Now we perform the permutation test by shuffling the position column and computing the TVD each time.
shuffled = df_relevant_missing.copy()
N = 500 # Number of times to shuffle
tvds = [] # TVDs list
# Loop that shuffles the position column, recomputes the empirical proportions, and appends the TVD to the list
for _ in range(N):
shuffled['position'] = np.random.permutation(shuffled['position'])
shuffled_empircal = pd.pivot_table(shuffled,
index = 'position',
columns = 'id missing',
aggfunc = 'size',
values = None)\
.fillna(0).apply(lambda x: x / x.sum())
tvds.append(shuffled_empircal.diff(axis=1).iloc[:, -1].abs().sum() / 2)
tvds[:10]
[0.019226785136894103, 0.020829017231635272, 0.024934736974409538, 0.019226785136894103, 0.00931297405068307, 0.016723297488861005, 0.028239340669813207, 0.019827622172422052, 0.019226785136894117, 0.014319949346749253]
Now we calculate the p-value and plot a histogram.
# Calculating the p-value
pval = (np.array(tvds) >= obsv_tvd).sum()/N
pval
0.284
# Plotting the histogram of TVDS with lines representing Observed and Cutoff values
plt.hist(tvds,
label= 'TVDs')
plt.axvline(obsv_tvd, color = 'red', label= 'Observed')
plt.axvline(np.percentile(tvds, 99), color = 'orange', label= 'Cutoff')
plt.title("Results of Simulation")
plt.legend(loc='upper right')
[Figure: histogram of simulated TVDs, with the observed TVD and the 99th-percentile cutoff marked]
Conclusion for position vs playerid
As we do not have enough evidence to reject the null, it does not seem likely that the missing values in `playerid` are due to `position`.
However, there must be some reason that the missing `playerid` values tend to be bunched up, so we decided to test `playerid`'s missingness against another column.
Missingness: date vs playerid
We thought the bunching of missing `playerid` values might be explained by the day the game was played: perhaps the servers were overloaded on certain days, and some data was lost.
So first we plotted a bar chart to see whether there is any correlation between `date` and `playerid` missingness.
df_relevant_missing = df.copy()
# Getting relevant columns
df_relevant_missing = df_relevant_missing[['playerid', 'date']]
# Converting all dates into datetime objects and formatting it as MM-DD
dates = pd.to_datetime(df_relevant_missing['date'])
df_relevant_missing['date'] = dates.apply(lambda dt:str(dt.month)) + '-' + dates.apply(lambda dt:str(dt.day))
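# An equivalent, more idiomatic form would be (a sketch; produces the same
# unpadded 'M-D' strings as the lambdas above):
# df_relevant_missing['date'] = dates.dt.month.astype(str) + '-' + dates.dt.day.astype(str)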
# Creating Boolean Columns
df_relevant_missing['id missing'] = df_relevant_missing['playerid'].isna()
# Pivot table grouped by date, getting the counts of missing and not-missing values
empircal = pd.pivot_table(df_relevant_missing, index = 'date', columns = 'id missing',\
aggfunc = 'size', values = None).fillna(0).apply(lambda x: x / x.sum())
empircal.head()
date | id missing = False | id missing = True
---|---|---
1-10 | 0.002907 | 0.003138 |
1-11 | 0.004140 | 0.004802 |
1-12 | 0.004748 | 0.005218 |
1-13 | 0.003248 | 0.002949 |
1-14 | 0.002582 | 0.002344 |
The pivot table is too long to display in full, so below we plotted a bar chart to look for trends pointing to a possible missingness relationship. We only included the first 30 entries, but that was sufficient to see that there seems to be some kind of relationship.
# Plotting Bar Chart
empircal[:30].plot(kind = 'barh',
title= 'Missing vs Not Missing playerids by date',
)
[Figure: horizontal bar chart of missing vs. not-missing playerid proportions for the first 30 dates]
Looking at the bar chart, there seems to be some kind of relationship between the missing values of `playerid` and `date`. To test this, we will perform another permutation test.
Permutation Test, date vs playerid
Null Hypothesis: The distribution of `date` is the same whether `playerid` is missing or not.
Alternative Hypothesis: The distribution of `date` is not the same whether `playerid` is missing or not.
We will again use the TVD as our test statistic, since we are again comparing two categorical distributions.
# Calculating Observed Value
observed_tvd = empircal.diff(axis=1).iloc[:, -1].abs().sum() / 2
observed_tvd
0.056299137239917495
Now we perform the permutation test.
shuffled = df_relevant_missing.copy()
N = 500 # Number of times to loop
tvds = [] # TVDs list
# Loop that shuffles the date column, recomputes the empirical proportions, and appends the TVD to the list
for _ in range(N):
shuffled['date'] = np.random.permutation(shuffled['date'])
shuffled_empircal = pd.pivot_table(shuffled, index = 'date', columns = 'id missing',\
aggfunc = 'size', values = None).fillna(0).apply(lambda x: x / x.sum())
tvds.append(shuffled_empircal.diff(axis=1).iloc[:, -1].abs().sum() / 2)
tvds[0:10]
[0.04045979647993346, 0.04555398736332357, 0.04133476422310948, 0.04538539469042754, 0.044224534012846664, 0.04445006870701895, 0.045255264712816184, 0.04372756107177824, 0.04223099263868397, 0.04100018638957388]
Calculating the p-value and plotting a histogram:
# Calculating p-value
p_val = (np.array(tvds) >= observed_tvd).sum()/N
p_val
0.0
# Plotting histogram with line representing Observed and Cutoffs
plt.hist(tvds, label='TVDs')
plt.axvline(observed_tvd, color = 'red', label='Observed_tvd')
plt.axvline(np.percentile(tvds, 99), color = 'orange', label='Cutoff')
# Title
plt.title("Results of Simulations")
# Legend
plt.legend(loc='upper center')
[Figure: histogram of simulated TVDs, with the observed TVD and the 99th-percentile cutoff marked]
Conclusion, date vs playerid
Based on the results of the simulation, we reject the null at the 99% confidence level. This means it is likely that the missingness of `playerid` is correlated with the values in `date`, which would make the missingness type of `playerid` Missing At Random (MAR).