League: Carries vs Non-Carries
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
from prettytable import PrettyTable
%matplotlib inline
%config InlineBackend.figure_format = 'retina' # Higher resolution figures
Summary of Findings
Introduction
In this project we look at League of Legends (LoL) match data from 2022. More specifically, we want to find out whether carry roles generally do better on average.
What Constitutes 'Doing Better'
There are many different ways to rank performance in LoL, such as number of kills, a player's creep score (CS), number of towers destroyed, total damage dealt, and even total gold accumulated. In LoL, and in MOBAs (Multiplayer Online Battle Arenas) generally, there is also the notion of game impact, which is less quantifiable and can range from minuscule micro plays that turn the tide of a battle to sacrificing one's own life to change an outcome and give the team an advantage.
The bottom line is that a huge number of factors can play a role in 'doing better'. For the sake of answering our question, we have therefore selected a player's KDA (Kills, Deaths, Assists ratio) as our definition of 'doing better': the higher a player's KDA, the better they are doing. $$KDA = \frac{Kills + Assists}{Deaths}$$
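To make the definition concrete, here is a minimal sketch of the computation (the helper name kda_of is ours, purely for illustration; the 0-death edge case is discussed in the EDA below):
# Hypothetical helper illustrating the KDA formula
def kda_of(kills, deaths, assists):
    # A 0-death game divides by zero; we treat it as infinity (a 'perfect' game)
    if deaths == 0:
        return float('inf')
    return (kills + assists) / deaths

kda_of(6, 3, 12)  # 6.0
kda_of(4, 0, 10)  # inf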
What Determines a Carry?
A carry in a MOBA is typically a role whose champions tend to grow stronger as the game goes on, eventually becoming the strongest on the map. This generally correlates with 'doing better' by our definition, as it means more kills, more assists, and fewer deaths.
To answer our question, we first identified the different roles in LoL: the Top laner, the Bottom laner (also known as the Attack Damage Carry, or ADC), the Mid laner, the Support, and the Jungler. As the Bottom laner (ADC) and Mid laner are usually considered the carry roles, for the purpose of answering our question we label them 'Carry' and the other roles 'Non-carry'.
Cleaning
First we began by loading in the data from the Oracle's Elixir website. We chose to work with the 2022 dataset as it is more representative of the current meta.
#Loading in the data
df = pd.read_csv('data/2022_LoL.csv')
#Displaying the first 5 rows of the dataframe
df.head()
c:\Users\16195\anaconda3\envs\dsc80\lib\site-packages\IPython\core\interactiveshell.py:3378: DtypeWarning: Columns (2) have mixed types.Specify dtype option on import or set low_memory=False. exec(code_obj, self.user_global_ns, self.user_ns)
 | gameid | datacompleteness | url | league | year | split | playoffs | date | game | patch | ... | opp_csat15 | golddiffat15 | xpdiffat15 | csdiffat15 | killsat15 | assistsat15 | deathsat15 | opp_killsat15 | opp_assistsat15 | opp_deathsat15
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | ESPORTSTMNT01_2690210 | complete | NaN | LCK CL | 2022 | Spring | 0 | 2022-01-10 07:44:08 | 1 | 12.01 | ... | 121.0 | 391.0 | 345.0 | 14.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
1 | ESPORTSTMNT01_2690210 | complete | NaN | LCK CL | 2022 | Spring | 0 | 2022-01-10 07:44:08 | 1 | 12.01 | ... | 100.0 | 541.0 | -275.0 | -11.0 | 2.0 | 3.0 | 2.0 | 0.0 | 5.0 | 1.0 |
2 | ESPORTSTMNT01_2690210 | complete | NaN | LCK CL | 2022 | Spring | 0 | 2022-01-10 07:44:08 | 1 | 12.01 | ... | 119.0 | -475.0 | 153.0 | 1.0 | 0.0 | 3.0 | 0.0 | 3.0 | 3.0 | 2.0 |
3 | ESPORTSTMNT01_2690210 | complete | NaN | LCK CL | 2022 | Spring | 0 | 2022-01-10 07:44:08 | 1 | 12.01 | ... | 149.0 | -793.0 | -1343.0 | -34.0 | 2.0 | 1.0 | 2.0 | 3.0 | 3.0 | 0.0 |
4 | ESPORTSTMNT01_2690210 | complete | NaN | LCK CL | 2022 | Spring | 0 | 2022-01-10 07:44:08 | 1 | 12.01 | ... | 21.0 | 443.0 | -497.0 | 7.0 | 1.0 | 2.0 | 2.0 | 0.0 | 6.0 | 2.0 |
5 rows × 123 columns
Looking at the data, the columns that best captured the information needed to answer our question were `kills`, `deaths`, and `assists`, since these are required to calculate the KDA of each role. We also needed the `position` column, which contains the role information, and we kept the `gameid` column to distinguish between different games.
#Columns that we determined were relevant
cols = ['gameid', 'position', 'kills','deaths', 'assists']
df_relevant_cols = df[cols] #Filtering
df_relevant_cols[:10]
 | gameid | position | kills | deaths | assists
---|---|---|---|---|---
0 | ESPORTSTMNT01_2690210 | top | 2 | 3 | 2 |
1 | ESPORTSTMNT01_2690210 | jng | 2 | 5 | 6 |
2 | ESPORTSTMNT01_2690210 | mid | 2 | 2 | 3 |
3 | ESPORTSTMNT01_2690210 | bot | 2 | 4 | 2 |
4 | ESPORTSTMNT01_2690210 | sup | 1 | 5 | 6 |
5 | ESPORTSTMNT01_2690210 | top | 1 | 1 | 12 |
6 | ESPORTSTMNT01_2690210 | jng | 4 | 1 | 10 |
7 | ESPORTSTMNT01_2690210 | mid | 6 | 3 | 12 |
8 | ESPORTSTMNT01_2690210 | bot | 8 | 2 | 10 |
9 | ESPORTSTMNT01_2690210 | sup | 0 | 2 | 18 |
Before we proceed, we used the pandas method `.isna()` to check whether our columns contained any missing values.
df_relevant_cols.isna().sum() #Count of all Nans present in the Dataframe
gameid      0
position    0
kills       0
deaths      0
assists     0
dtype: int64
As all the column sums were zero, we can conclude that, for this particular subset, there are no missing values that need to be handled.
As we looked through the data, we noticed odd outlying rows that appeared at every 11th and 12th position.
# Displaying rows 9-13 to show the outlying data points
df_relevant_cols[9:14]
 | gameid | position | kills | deaths | assists
---|---|---|---|---|---
9 | ESPORTSTMNT01_2690210 | sup | 0 | 2 | 18 |
10 | ESPORTSTMNT01_2690210 | team | 9 | 19 | 19 |
11 | ESPORTSTMNT01_2690210 | team | 19 | 9 | 62 |
12 | ESPORTSTMNT01_2690219 | top | 0 | 5 | 2 |
13 | ESPORTSTMNT01_2690219 | jng | 1 | 2 | 1 |
With that we looked through the whole frame and realized that the rows come in blocks of 12 per game: 5 player rows for each of the two teams, plus 2 aggregate 'team' rows. We deemed the 'team' rows irrelevant to our question, so we filtered them out (after the quick check below).
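Before filtering, a quick sanity check (a sketch against the raw dataframe): the position column should contain exactly the five roles plus the 'team' label, each appearing twice per game (once per side).
# Five roles plus 'team'; each label occurs twice per game, once per side
df['position'].value_counts()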
# Filtering out the irrelevant 'team' rows from the dataframe
df_relevant_cols = df_relevant_cols[df_relevant_cols['position'] != 'team'].copy() # .copy() avoids SettingWithCopyWarning when adding columns later
df_relevant_cols[9:14]
 | gameid | position | kills | deaths | assists
---|---|---|---|---|---
9 | ESPORTSTMNT01_2690210 | sup | 0 | 2 | 18 |
12 | ESPORTSTMNT01_2690219 | top | 0 | 5 | 2 |
13 | ESPORTSTMNT01_2690219 | jng | 1 | 2 | 1 |
14 | ESPORTSTMNT01_2690219 | mid | 2 | 4 | 0 |
15 | ESPORTSTMNT01_2690219 | bot | 0 | 2 | 2 |
Next we created a new column containing the KDA of each player in each game, using the formula $$KDA = \frac{Kills + Assists}{Deaths}$$ We also recoded the position column to distinguish between carry and non-carry roles.
# Calculating the KDA and adding it to the dataframe
kda = (df_relevant_cols['kills'] + df_relevant_cols['assists'])/df_relevant_cols['deaths']
df_relevant_cols['kda'] = kda
#Fixing the position column to fit our purposes
position_fixed = df_relevant_cols['position'].apply(lambda s:'Carry' if s in ['bot','mid'] else 'Non-carry')
df_relevant_cols['position'] = position_fixed
df_relevant_cols.head()
 | gameid | position | kills | deaths | assists | kda
---|---|---|---|---|---|---
0 | ESPORTSTMNT01_2690210 | Non-carry | 2 | 3 | 2 | 1.333333 |
1 | ESPORTSTMNT01_2690210 | Non-carry | 2 | 5 | 6 | 1.600000 |
2 | ESPORTSTMNT01_2690210 | Carry | 2 | 2 | 3 | 2.500000 |
3 | ESPORTSTMNT01_2690210 | Carry | 2 | 4 | 2 | 1.000000 |
4 | ESPORTSTMNT01_2690210 | Non-carry | 1 | 5 | 6 | 1.400000 |
EDA
During our exploratory data analysis, we wanted to visualize and understand the data more before we proceeded with a hypothesis test to answer our question.
Univariate Analysis
In order to understand the data better, we first did a univariate analysis, looking at the distributions of `kills`, `assists`, `deaths`, and `kda` in hopes of finding any relationships between them.
df_relevant_cols.describe() # Show statistics of values from the dataframe
statistic | kills | deaths | assists | kda
---|---|---|---|---
count | 122090.000000 | 122090.000000 | 122090.000000 | 1.220260e+05 |
mean | 2.891121 | 2.896888 | 6.453731 | inf |
std | 2.739519 | 1.992892 | 4.474779 | NaN |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000e+00 |
25% | 1.000000 | 1.000000 | 3.000000 | 1.333333e+00 |
50% | 2.000000 | 3.000000 | 6.000000 | 3.250000e+00 |
75% | 4.000000 | 4.000000 | 9.000000 | 8.000000e+00 |
max | 28.000000 | 14.000000 | 41.000000 | inf |
The first thing we noticed was that the max and the mean of `kda` were infinity. Considering the formula for `kda`, we realized this must come from division by zero, i.e. from games where the player had 0 deaths, meaning they had a perfect game. (The slightly lower count for `kda` reflects games with 0 kills, 0 assists, and 0 deaths, where 0/0 produces NaN instead.) This is important, as it changes the way we will calculate the `kda` when performing the hypothesis test.
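Here is a small sketch of the problem, and of the workaround we use later: taking the ratio of column means instead of the mean of per-player ratios (the toy dataframe is hypothetical):
# Toy example: one ordinary game and one 0-death 'perfect' game
toy = pd.DataFrame({'kills': [2, 4], 'deaths': [2, 0], 'assists': [3, 10]})
per_player = (toy['kills'] + toy['assists']) / toy['deaths']  # 2.5 and inf
per_player.mean()  # inf: a single perfect game swamps the average
(toy['kills'].mean() + toy['assists'].mean()) / toy['deaths'].mean()  # (3 + 6.5) / 1 = 9.5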
Next we decided to look at the distributions of `kills`, `deaths`, and `assists` as a graph.
Putting the data into a graph
# Plotting 3 histograms, for Kills, Deaths and Assists
plt.hist(df_relevant_cols['assists'], label='Assists Distribution', alpha = 0.5)
plt.hist(df_relevant_cols['kills'], label='Kill Distribution', alpha = 0.5)
plt.hist(df_relevant_cols['deaths'], label='Death Distribution', alpha = 0.5)
# Plotting the legend
plt.legend(loc='upper right')
# Title
plt.title('Assist, Kill, and Death Distributions')
# x-axis and y-axis label
plt.xlabel('Count')
plt.ylabel('Frequency')
[Figure: overlaid histograms of the assist, kill, and death distributions]
This histogram did not end up helping us visualize anything directly related to our question, but it will be useful later on for identifying realistic KDA values. ___
Bivariate Analysis
Now let's dig deeper into the analysis to search for trends and patterns. First we made a couple of pivot tables comparing statistics between 'Carry' and 'Non-carry' roles.
# Pivot table, groupby positions and displaying mean, median, max and min values of kills, deaths, assists and kda
pd.pivot_table(df_relevant_cols, index='position', aggfunc=['mean', 'median', 'max', 'min'], values=['kills', 'deaths', 'assists', 'kda'], fill_value=np.inf)
position | mean assists | mean deaths | mean kda | mean kills | median assists | median deaths | median kda | median kills | max assists | max deaths | max kda | max kills | min assists | min deaths | min kda | min kills
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Carry | 5.625072 | 2.593435 | inf | 3.872574 | 5 | 2 | 3.75 | 3 | 27 | 14 | inf | 28 | 0 | 0 | 0 | 0 |
Non-carry | 7.006170 | 3.099189 | inf | 2.236820 | 6 | 3 | 3.00 | 2 | 41 | 14 | inf | 27 | 0 | 0 | 0 | 0 |
We also wanted to see what the `kda` statistics would look like without the perfect games, so we replaced the infinities with NaNs.
# Replacing the infinite values with nans
df_relevant_cols_no_inf = df_relevant_cols.replace(np.inf, np.nan)
# Pivot table, groupby positions and displaying mean, median, max and min values of kills, deaths, assists and kda
pd.pivot_table(df_relevant_cols_no_inf, index='position', aggfunc=['mean', 'median', 'max', 'min'], values=['kills', 'deaths', 'assists', 'kda'], fill_value=np.inf)
position | mean assists | mean deaths | mean kda | mean kills | median assists | median deaths | median kda | median kills | max assists | max deaths | max kda | max kills | min assists | min deaths | min kda | min kills
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Carry | 5.625072 | 2.593435 | 4.687860 | 3.872574 | 5 | 2 | 3.0 | 3 | 27 | 14 | 33 | 28 | 0 | 0 | 0 | 0 |
Non-carry | 7.006170 | 3.099189 | 4.112728 | 2.236820 | 6 | 3 | 2.5 | 2 | 41 | 14 | 35 | 27 | 0 | 0 | 0 | 0 |
Removing the infinities from the `kda` column does yield a finite mean `kda`. This value is misleading, however: pandas' `mean` ignores NaNs, so every perfect game is silently dropped, which shrinks the number of games counted and skews the result.
A more robust alternative is to compute an aggregate KDA from the mean values of `kills`, `assists`, and `deaths`, since those columns contain no infinities and the 0-division problem never arises. Because the mean is linear, the mean of kills plus assists is just the sum of the two column means; dividing by the mean of deaths then gives a well-defined ratio-of-means KDA (note this is the ratio of the means, not the mean of per-player ratios, and it is the statistic we use in the hypothesis test below). For the 'Carry' rows this gives $$Carry_{KDA} = \frac{3.8726 + 5.6251}{2.5934} \approx 3.662$$
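As a quick check, the same number can be computed directly (a sketch; we formalize this as get_mean_kda in the hypothesis test section):
# Ratio-of-means KDA for the carry rows; matches the hand calculation above
carry_rows = df_relevant_cols[df_relevant_cols['position'] == 'Carry']
(carry_rows['kills'].mean() + carry_rows['assists'].mean()) / carry_rows['deaths'].mean()  # ~3.662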
Next we plot a histogram comparing the `kda` values of the 'Carry' and 'Non-carry' roles; we leave the infinite values as NaNs so that the histogram can be plotted.
# Getting one dataframe with only Carries and one with only Non-carries
df_carry = df_relevant_cols_no_inf[df_relevant_cols_no_inf['position'] == 'Carry']
df_noncarry = df_relevant_cols_no_inf[df_relevant_cols_no_inf['position'] != 'Carry']
# Plotting both histograms
plt.hist(df_noncarry['kda'], label = 'Non-Carry KDA')
plt.hist(df_carry['kda'], label = 'Carry KDA')
# Plotting the Legend
plt.legend(loc = 'upper right')
# x-axis and y-axis labels
plt.xlabel('KDA Value')
plt.ylabel('frequency')
# Title
plt.title(label="Carry KDA vs. Non-Carry KDA")
[Figure: histograms of Carry vs. Non-Carry KDA values]
Looking at the histogram, we noticed that carry roles tend to have higher KDA than the other roles. So now we will run a hypothesis test to see whether this is due to random chance. ___
Hypothesis Test
Null Hypothesis: It is due to random chance that 'Carry' roles have a higher `kda`.
Alternative Hypothesis: It is not due to random chance that 'Carry' roles have a higher `kda`.
For our test statistic we will use the mean `kda` (the ratio of means described above), in order to check whether the observed carry sample lies in the same distribution as a set of randomly drawn samples.
Observed Value
df_carry = df_relevant_cols[df_relevant_cols['position'] == 'Carry'] # Carry only Dataframe
def get_mean_kda(df):
"""Function to get the mean KDA of a given dataframe
Args:
df (Dataframe): Dataframe to get the mean KDA
Returns:
Float: Mean KDA of dataframe given
"""
return (df['kills'].mean() + df['assists'].mean()) / df['deaths'].mean()
# Getting the mean KDA of the observed
obsv = get_mean_kda(df_carry)
obsv
3.662187235991252
Now we will repeatedly sample from our cleaned dataset without replacement, to simulate the case where `kda` is unaffected by role. Each sample contains $\frac{2}{5}$ of the rows, matching the proportion of 'Carry' rows in the original dataset.
#Running Hypothesis Tests
drc_kda = df_relevant_cols[['position', 'kills', 'assists', 'deaths']]
N = 10000 # Number of times to run the loop
stats = [] # List to store KDA means
# Loop that samples the Dataframe and calculates KDA means and stores in the list
for _ in range(N):
sample = drc_kda.sample(2*(drc_kda.shape[0]//5))
stats.append(get_mean_kda(sample))
stats[0:10]
[3.206143008101444, 3.248640204501882, 3.2498153304165007, 3.222665912785356, 3.2389620120735727, 3.2113751763046543, 3.240155724650504, 3.2115063885119857, 3.228300629501503, 3.2473746948969744]
Now we calculate the p-value and plot a histogram of our findings.
(np.array(stats) >= obsv).sum() / N # Getting the p-value (the list must be converted to an array for the elementwise comparison)
0.0
# Plotting Histogram
plt.hist(stats, label = 'simulations')
# Observed Line
plt.axvline(obsv, color = 'red', label = 'observed')
# Cutoff
plt.axvline(np.percentile(stats, 99), color = 'orange', label = 'cutoff')
# Legend, Title, axis labels
plt.legend(loc = 'upper center')
plt.title('Test if Carry KDA is Due to Chance')
plt.xlabel('Mean KDA Values')
plt.ylabel('Frequency')
[Figure: histogram of simulated mean KDAs, with the observed value and the 99th-percentile cutoff marked]
Conclusion
According to the plot, we can reject the null hypothesis at the 99% confidence level. Therefore it does not seem likely that the higher `kda` of 'Carry' roles is due to random chance.
Assessment of Missingness
As per our assignment, we were also required to assess non-trivial missing values in the original dataset. As it happens, the columns relevant to our question did not contain any missing data, so we instead decided to analyze missing `playerid` values, which we had noticed while skimming the dataset.
In that skim, we noticed that the missing `playerid` values tend to come in bunches, so we did not think they could be classified as Not Missing At Random (NMAR). To test this, we ran a couple of permutation tests to investigate whether the missingness was dependent on other columns.
Is playerid missing due to position?
Since we had spent most of the project working with player positions/roles, we first wanted to test whether the `playerid` missingness was dependent on the role.
First, just as in the data cleaning earlier, we pick out the relevant columns and remove the rows that represent teams rather than players.
df_relevant_missing = df.copy()
# Getting relevant columns for assessing missingness
df_relevant_missing = df_relevant_missing[['position', 'playerid']]
# Cleaning Dataframe (again)
df_relevant_missing = df_relevant_missing[df_relevant_missing['position'] != 'team']
df_relevant_missing.head()
 | position | playerid
---|---|---
0 | top | oe:player:38e0af7278d6769d0c81d7c4b47ac1e |
1 | jng | oe:player:637ed20b1e41be1c51bd1a4cb211357 |
2 | mid | oe:player:d1ae0e2f9f3ac1e0e0cdcb86504ca77 |
3 | bot | oe:player:998b3e49b01ecc41eacc392477a98cf |
4 | sup | oe:player:e9741b3a238723ea6380ef2113fae63 |
Next, we added a new boolean column that tells us whether the `playerid` value is missing. From there we generated an empirical distribution of `playerid` missingness against `position`.
# Creating new boolean column
df_relevant_missing['id missing'] = df_relevant_missing['playerid'].isna()
# Pivot table groupby positions, getting the size of missing and not missing values
empircal = pd.pivot_table(df_relevant_missing,
index = 'position',
columns = 'id missing',
aggfunc = 'size',
values = None)\
.fillna(0).apply(lambda x: x / x.sum()) # Get proportions
empircal
position | id missing = False | id missing = True
---|---|---
bot | 0.200060 | 0.196455 |
jng | 0.199927 | 0.204333 |
mid | 0.199902 | 0.205810 |
sup | 0.200285 | 0.183161 |
top | 0.199827 | 0.210241 |
Looking at the values, there appears to be no real relationship between `position` and whether `playerid` is missing. We plotted a graph to further visualize this.
# Plotting empircal table
empircal.plot(kind= 'barh',
title= 'Missing vs Not Missing PlayerId by Role',
xlabel= 'Roles',
ylabel= 'Distribution')
[Figure: horizontal bar chart of missing vs. not-missing playerid proportions by role]
The graph also suggests that there is no relationship between `position` and `playerid` missingness. To confirm this, we will run a permutation test.
Permutation Test, Position vs PlayerId
Null Hypothesis: The distribution of `position` is the same whether `playerid` is missing or not.
Alternative Hypothesis: The distribution of `position` is not the same whether `playerid` is missing or not.
As we are comparing two categorical distributions, the test statistic we will use is the Total Variation Distance (TVD), calculated as $$\text{TVD}(A, B) = \frac{1}{2} \sum_{i = 1}^k |a_i - b_i|$$
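A minimal sketch of this statistic (the function name tvd is ours; it assumes two aligned arrays of category proportions):
def tvd(dist_a, dist_b):
    # Total variation distance between two aligned categorical distributions
    return np.abs(np.array(dist_a) - np.array(dist_b)).sum() / 2

tvd([0.5, 0.5], [0.9, 0.1])  # 0.4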
First we get the observed value.
# Calculate Observed TVD
obsv_tvd = empircal.diff(axis=1).iloc[:, -1].abs().sum() / 2
obsv_tvd
0.02072887772571398
Now we perform the permutation test by shuffling the position column and computing the TVD each time.
shuffled = df_relevant_missing.copy()
N = 500 # Number of times to shuffle
tvds = [] # TVDs list
# Loop that shuffles the position column, recomputes the empirical proportions, and appends the TVD to the list
for _ in range(N):
shuffled['position'] = np.random.permutation(shuffled['position'])
shuffled_empircal = pd.pivot_table(shuffled,
index = 'position',
columns = 'id missing',
aggfunc = 'size',
values = None)\
.fillna(0).apply(lambda x: x / x.sum())
tvds.append(shuffled_empircal.diff(axis=1).iloc[:, -1].abs().sum() / 2)
tvds[:10]
[0.019226785136894103, 0.020829017231635272, 0.024934736974409538, 0.019226785136894103, 0.00931297405068307, 0.016723297488861005, 0.028239340669813207, 0.019827622172422052, 0.019226785136894117, 0.014319949346749253]
Now we calculate the p-value and plot a histogram.
# Calculating the p-value
pval = (np.array(tvds) >= obsv_tvd).sum()/N
pval
0.284
# Plotting the histogram of TVDS with lines representing Observed and Cutoff values
plt.hist(tvds,
label= 'TVDs')
plt.axvline(obsv_tvd, color = 'red', label= 'Observed')
plt.axvline(np.percentile(tvds, 99), color = 'orange', label= 'Cutoff')
plt.title("Results of Simulation")
plt.legend(loc='upper right')
[Figure: histogram of simulated TVDs, with the observed TVD and the 99th-percentile cutoff marked]
Conclusion for position vs playerid
As we do not have enough evidence to reject the null, it does not seem likely that the missing values in `playerid` are due to `position`.
However, there must be some reason that the missing `playerid` values tend to be bunched up, so we decided to test `playerid`'s missingness against another column.
Missingness: date vs playerid
We thought the bunching of missing `playerid` values might be explained by the day the game was played: perhaps the servers were overloaded on certain days, and some data was lost.
So first we plotted a bar chart to see whether there is any correlation between `date` and `playerid` missingness.
df_relevant_missing = df.copy()
# Getting relevant columns
df_relevant_missing = df_relevant_missing[['playerid', 'date']]
# Converting all dates into datetime objects and formatting it as MM-DD
dates = pd.to_datetime(df_relevant_missing['date'])
df_relevant_missing['date'] = dates.apply(lambda dt:str(dt.month)) + '-' + dates.apply(lambda dt:str(dt.day))
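# An equivalent, more idiomatic form would be (a sketch; produces the same
# unpadded 'M-D' strings as the lambdas above):
# df_relevant_missing['date'] = dates.dt.month.astype(str) + '-' + dates.dt.day.astype(str)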
# Creating Boolean Columns
df_relevant_missing['id missing'] = df_relevant_missing['playerid'].isna()
# Pivot table grouped by date, getting the counts of missing and not-missing values
empircal = pd.pivot_table(df_relevant_missing, index = 'date', columns = 'id missing',\
aggfunc = 'size', values = None).fillna(0).apply(lambda x: x / x.sum())
empircal.head()
date | id missing = False | id missing = True
---|---|---
1-10 | 0.002907 | 0.003138 |
1-11 | 0.004140 | 0.004802 |
1-12 | 0.004748 | 0.005218 |
1-13 | 0.003248 | 0.002949 |
1-14 | 0.002582 | 0.002344 |
The pivot table is too long to display in full, so below we plotted a bar chart to look for trends pointing to a possible missingness relationship. We only included the first 30 entries, but that was sufficient to see that there seems to be some kind of relationship.
# Plotting Bar Chart
empircal[:30].plot(kind = 'barh',
title= 'Missing vs Not Missing playerids by date',
)
[Figure: horizontal bar chart of missing vs. not-missing playerid proportions for the first 30 dates]
Looking at the bar chart, there seems to be some kind of relationship between the missing values of `playerid` and `date`. To test this, we will perform another permutation test.
Permutation Test, date vs playerid
Null Hypothesis: The distribution of `date` is the same whether `playerid` is missing or not.
Alternative Hypothesis: The distribution of `date` is not the same whether `playerid` is missing or not.
We will again use the TVD as our test statistic, since we are again comparing two categorical distributions.
# Calculating Observed Value
observed_tvd = empircal.diff(axis=1).iloc[:, -1].abs().sum() / 2
observed_tvd
0.056299137239917495
Now we perform the permutation test.
shuffled = df_relevant_missing.copy()
N = 500 # Number of times to loop
tvds = [] # TVDs list
# Loop that shuffles the date column, recomputes the empirical proportions, and appends the TVD to the list
for _ in range(N):
shuffled['date'] = np.random.permutation(shuffled['date'])
shuffled_empircal = pd.pivot_table(shuffled, index = 'date', columns = 'id missing',\
aggfunc = 'size', values = None).fillna(0).apply(lambda x: x / x.sum())
tvds.append(shuffled_empircal.diff(axis=1).iloc[:, -1].abs().sum() / 2)
tvds[0:10]
[0.04045979647993346, 0.04555398736332357, 0.04133476422310948, 0.04538539469042754, 0.044224534012846664, 0.04445006870701895, 0.045255264712816184, 0.04372756107177824, 0.04223099263868397, 0.04100018638957388]
Calculating the p-value and plotting a histogram:
# Calculating p-value
p_val = (np.array(tvds) >= observed_tvd).sum()/N
p_val
0.0
# Plotting histogram with line representing Observed and Cutoffs
plt.hist(tvds, label='TVDs')
plt.axvline(observed_tvd, color = 'red', label='Observed_tvd')
plt.axvline(np.percentile(tvds, 99), color = 'orange', label='Cutoff')
# Title
plt.title("Results of Simulations")
# Legend
plt.legend(loc='upper center')
[Figure: histogram of simulated TVDs, with the observed TVD and the 99th-percentile cutoff marked]
Conclusion, date vs playerid
Based on the results of the simulation, we reject the null at the 99% confidence level. This means it is likely that the missingness of `playerid` is correlated with the values in `date`, which would make the missingness type of `playerid` Missing At Random (MAR).