Is there a correlation between the size of a football stadium and team win percentage?

Context

Football is a worldwide sport played in nearly every country. I have had a serious interest in it since I was young: I have been to many football games and I play regularly. I have chosen it as my topic because it is a great passion of mine. The data that I collected came from a competition called the Champions League. It is a European competition in which 32 teams are placed into groups and then play each other. Each team has its own football stadium in which it plays, and the stadiums vary in capacity.

Introduction:

I will be investigating whether there is a correlation between the size of a team’s football stadium and its win percentage. I have used secondary data: the stadium sizes come from Wikipedia and the teams’ results from UEFA (Union of European Football Associations). With the data that I gathered I will perform a number of mathematical calculations: the mean; the χ² test, to see whether the size of football stadium is independent of win percentage; scatter graphs, to identify any relationship; Pearson’s correlation coefficient, to show the strength of the relationship; and the least squares regression line, which will describe the trend between the two variables. I think that there could be a correlation, as the teams that do the best are the largest teams in Europe.

Data Collection:

I collected the data for each of the stadiums from Wikipedia. The 32 teams in the competition make a good sample size, as this is above 30 readings. This representative sample gives a fairly good spread of results, which makes my calculations more reliable and is suitable for this coursework.

Table of grouped stadium size and win percentage

For this table I split the 32 teams into groups to make the data more manageable. I grouped them by stadium size, with sizes ranging from 10,000 to 100,000. On the left-hand side we have stadium size, and across from it the percentage of teams that fall into each size group. I then took the number of wins, draws and losses for each group of teams. Finally, I worked out the win percentage by dividing the number of wins by the total number of games and multiplying by 100. For example, 12/60 = 0.2 and 0.2 × 100 = 20%. I did this because it is the only fair way to compare the stadium sizes, as the groups do not all contain the same number of teams.
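The win percentage calculation can be sketched in a few lines of Python. This is only an illustration: Python was not the tool used for this coursework, and the function name `win_percentage` is made up for the example. The only figures used are the 12 wins out of 60 games quoted above.

```python
def win_percentage(wins, games):
    """Return the number of wins as a percentage of the games played."""
    if games == 0:
        # A group that played no games has no meaningful win percentage.
        raise ValueError("games must be greater than 0")
    return 100 * wins / games

# The worked example from the text: 12 wins out of 60 games.
print(win_percentage(12, 60))  # 20.0
```

Because each group's wins are divided by that group's own number of games, groups containing different numbers of teams can be compared on the same scale.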

Graph 1: Scatter graph to show stadium size against win percentage

I have used a scatter graph as it can begin to show me whether there is any sort of correlation. The first graph is from Excel, as it is easier to read and shows the values; the second graph is from the TI-84 and shows the readings in more detail. Along the X axis we have football stadium size (in 1000s) and along the Y axis we have win percentage (%). From this we can begin to see that there is a fairly strong correlation between the size of football stadium and win percentage. We can come to this conclusion because the points rise from left to right and almost form a straight line, which shows a fairly strong positive correlation.

Calculation of line of least squares regression:

X = the midpoints of the stadium size groups

Y = the win percentage of the teams

The equation for the line of least squares regression is Y = aX + b

a = the gradient of the line

b = the point at which the line intercepts the Y axis

Σ = the sum of the values

n = the number of values (9)
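The formula itself is not reproduced in the text. For reference, the standard least squares formulas for the gradient and intercept are given below, using the symbols just defined, with x and y standing for the individual X and Y values. It is an assumption that this is the form used in the working.

```latex
a = \frac{n\sum xy - \sum x \sum y}{n\sum x^{2} - \left(\sum x\right)^{2}},
\qquad
b = \frac{\sum y - a\sum x}{n}
```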

I used the midpoints of the X groups because the data is grouped: the exact stadium sizes within each group are replaced by an estimate, and the midpoint is the only way to use grouped data in the calculation. The X values are given in 1000s. This has no effect on the conclusion; it just makes the numbers more manageable and helps to avoid errors. Having worked out the line of least squares regression, I can now put it onto my graph to show the trend. The equation shows that for every increase of 1,000 in stadium size, the win percentage goes up by 0.85 percentage points. In this context 0.85 is the gradient, so the line rises by 0.85 on the Y axis for each unit along the X axis.
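The steps of a least squares calculation can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the working used in this coursework: the function name `least_squares` is hypothetical, and the data is a toy set chosen to lie exactly on a known line, because the grouped table itself is not reproduced here.

```python
def least_squares(xs, ys):
    """Return (a, b) for the line y = a*x + b fitted by least squares."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x values are equal, so no line can be fitted")
    a = (n * sum_xy - sum_x * sum_y) / denominator  # gradient
    b = (sum_y - a * sum_x) / n                     # Y intercept
    return a, b

# Toy data lying exactly on y = 2x + 1 (NOT the coursework data).
toy_x = [1, 2, 3, 4]
toy_y = [3, 5, 7, 9]
a, b = least_squares(toy_x, toy_y)
print(a, b)  # 2.0 1.0
```

With the real table, `xs` would hold the nine group midpoints in 1000s and `ys` the nine win percentages, and the returned `a` would be read in the same way as the gradient of 0.85 above.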

Calculation of Pearson’s Correlation Coefficient:

The Pearson correlation coefficient shows how strong the relationship between 2 variables is. Its values lie between -1 and 1: a value of -1 represents a perfect negative correlation and a value of 1 represents a perfect positive correlation. The value will never be greater than 1 or less than -1. First I sorted all the data into columns to make the working easier and to try to avoid mistakes. Then I put the values that I had worked out into the formula to find my value of Pearson’s correlation coefficient. My working gives r = 0.9157037606, so there is clearly a relationship between the size of football stadium and win percentage. It is considered a very strong positive correlation as the value is above 0.9.
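Pearson's correlation coefficient can be computed from the same column sums used for a regression line. The sketch below is an illustration only: `pearson_r` is a hypothetical name, and the data is a made-up toy set rather than the coursework data, so it does not reproduce the value of r quoted above.

```python
import math

def pearson_r(xs, ys):
    """Return Pearson's correlation coefficient for paired values."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        raise ValueError("one of the variables does not vary")
    return numerator / denominator

# Toy data (NOT the coursework data): an upward trend with some scatter.
toy_x = [1, 2, 3, 4, 5]
toy_y = [2, 4, 5, 4, 5]
print(round(pearson_r(toy_x, toy_y), 4))  # 0.7746

# Points lying exactly on a rising line give r = 1.
print(pearson_r([1, 2, 3], [2, 4, 6]))  # 1.0
```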

The calculated χ² value is greater than the χ² critical value (16.6 > 9.49). This means we reject H0 and accept H1, and therefore win percentage is not independent of the size of football stadium. This backs up my previous findings that there is a relationship between the size of football stadium and win percentage.
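The comparison of a calculated χ² value with a critical value can be sketched in Python. Several things here are assumptions: the function name `chi_squared` is hypothetical, the 3 by 3 table of observed counts is invented purely for illustration (the real observed table is not reproduced here), and the critical value 9.49 is taken to be the standard table value for 4 degrees of freedom at the 5% significance level.

```python
def chi_squared(observed):
    """Return (statistic, degrees_of_freedom) for a table of observed counts.

    Assumes every row total and column total is greater than 0.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (obs - expected) ** 2 / expected
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df

# Invented counts for illustration only (NOT the coursework table).
toy_observed = [
    [20, 10, 5],
    [10, 15, 10],
    [5, 10, 20],
]
CRITICAL_VALUE = 9.49  # assumed: 5% level, 4 degrees of freedom

stat, df = chi_squared(toy_observed)
print(round(stat, 2), df)  # 21.43 4
print("reject H0" if stat > CRITICAL_VALUE else "do not reject H0")
```

The decision rule is the same one applied above: if the calculated value is larger than the critical value, H0 (independence) is rejected.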

Validity:

The data that I have collected from Wikipedia and UEFA is all secondary data. This means I have to assume that all the data is reliable, and I have to trust it in order to answer the question that I have asked.

The data that I have used and collected has some limitations and validity issues that may have affected the measured correlation between football stadium size and win percentage. Validity is extremely important, as it affects whether my results can show a real relationship between the variables that I am investigating. I am confident that there are few mistakes in my workings, and hopefully no other variables have affected the outcome of my results.

The first validity issue is that using grouped data comes with a few problems. The 20,000-30,000 stadium size group has only 1 team in it, and this team lost all its games, which gave the group a win percentage of 0%. This is a problem because a single team is not a reliable basis for a group’s win percentage, and it arose because I chose to group my data.

My working for the line of least squares regression was fairly accurate: I worked out the line to be Y = 0.85X - 5.083 and my TI-84 gave Y = 0.85X - 4.992. So I was fairly close to the reading my calculator produced. The difference arises because my calculator is able to use more significant figures, which makes it more accurate.
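To see how much the two versions of the line actually disagree, both can be evaluated at the same stadium sizes. The sketch below uses only the two equations quoted above; the function names are made up, and the X values 15, 55 and 95 are assumed group midpoints in 1000s, chosen for illustration.

```python
def hand_line(x):
    """Regression line from the hand calculation (x in 1000s of seats)."""
    return 0.85 * x - 5.083

def calculator_line(x):
    """Regression line reported by the TI-84 (x in 1000s of seats)."""
    return 0.85 * x - 4.992

# Assumed midpoints in 1000s, for illustration.
for x in (15, 55, 95):
    hand = hand_line(x)
    calc = calculator_line(x)
    print(x, round(hand, 3), round(calc, 3), round(calc - hand, 3))
```

Because the two lines have the same gradient, they differ by the same amount everywhere: 0.091 percentage points, which is simply the difference between the two intercepts.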

One problem that I encountered was with the χ² test, because my data was grouped. My observed table had quite a few readings that were either 0 or less than 5, and this reduces the reliability of the result. I worked out my χ² value to be 16.6, which is quite a bit larger than the χ² critical value of 9.49. Taken at face value this shows a very strong relationship. While it is true that there is a strong relationship, the value is probably not really this high. This problem affects reliability, and it occurred because I used grouped data.
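The usual rule of thumb for the χ² test is stated in terms of expected frequencies: the result is generally treated as reliable only when each expected frequency is at least 5. A check of this kind can be sketched in Python. The function name `low_expected_cells` and the table of counts are invented for illustration and are not the coursework data.

```python
def low_expected_cells(observed, threshold=5):
    """Return (row, column, expected) for each cell whose expected
    frequency under independence is below the threshold."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    flagged = []
    for i in range(len(observed)):
        for j in range(len(observed[0])):
            expected = row_totals[i] * col_totals[j] / total
            if expected < threshold:
                flagged.append((i, j, expected))
    return flagged

# Invented counts with several small cells (NOT the coursework table).
toy_observed = [
    [0, 2, 3],
    [4, 6, 5],
    [1, 7, 12],
]

flagged = low_expected_cells(toy_observed)
print(len(flagged), "of 9 cells have an expected frequency below 5")
for row, col, expected in flagged:
    print(row, col, expected)
```

A common remedy when many cells are flagged is to merge neighbouring groups so that the counts in each cell are larger, at the cost of having fewer groups to compare.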

The calculation for Pearson’s correlation coefficient was valid: I worked it out to be 0.9157 and my TI-84 worked it out to be 0.903. As with the line of least squares regression, the difference arises because my calculator uses more figures, which makes its reading more accurate than mine.

Overall my working and data collection were fairly reliable, and on the whole there do not appear to be many errors. The only real problem with validity and reliability was the χ² test, as its value was so high in relation to the χ² critical value. Using grouped data was a good idea as it made the calculations far more manageable; the only real problem that came with it was the χ² test, as I have already mentioned.

Conclusion:

In conclusion, my results clearly show a correlation between size of football stadium and win percentage. My Pearson’s correlation coefficient (0.9157) and my χ² test (16.6) both indicate a very strong relationship between the two. So although the χ² value is a bit out of proportion, I think it is quite easy to see that there is a relationship between size of football stadium and win percentage. However, stadium size is not the only factor that affects win percentage: the wealth of the club, how good the players are and transfer fees all play a part. Stadium size is clearly linked to win percentage, but it is not the only factor that makes an impact on it.