Is there a correlation between the size of a football stadium
and a team’s win percentage?
Football is a sport
played all over the world. Since I was young I have had a serious interest in
it; I have been to many football games and I play regularly. I have chosen it as my topic
because football is a great passion of mine. The data
that I collected came from a competition called the Champions League. It is a
European competition in which 32 teams are entered into groups and then play each
other. Each team has its own football stadium in which it plays, and the
stadiums vary in capacity.
I will be investigating whether there
is a correlation between the size of a team’s football stadium and its win
percentage. I have taken secondary data from Wikipedia for the size of each
stadium and the teams’ results from UEFA (Union of European Football
Associations). With the data that I have
gathered I will perform a number of mathematical calculations: the mean, a
χ² test to see if the size of
football stadiums is independent of win
percentage, scatter graphs to identify any relationship, Pearson’s correlation
coefficient to show the strength of the relationship, and the line
of least squares regression, which will help to identify whether these two
variables are correlated. I think that there could possibly be a correlation, as
the teams that do the best are the largest teams in Europe.
I collected the data for each of
the stadiums from Wikipedia. The 32 teams in the competition make a good sample
size, as it is above 30 readings. This sample should give a
reasonable spread of results, which makes my calculations more reliable,
and it is suitable for this coursework.
Table of grouped stadium size and
For this table I split the 32 teams
into groups to make the data more manageable. I did this by grouping
them by stadium size, ranging from 10,000 to 100,000. On the left-hand
side we have stadium size, and next to that the % of teams which fall
into each stadium size group. I then took the number of wins, draws and losses
for each group of teams. Finally, I worked out the win percentage by
dividing the number of wins by the total number of games and multiplying by 100.
For example, 12/60 = 0.2 and 0.2 × 100 = 20%. I did this as it is the only fair
way to compare the stadium size groups, because they don’t all contain the same
number of teams.
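The win percentage step can also be written as a short Python sketch. The function name is hypothetical, and apart from the 12 wins out of 60 games quoted in the text, the split between draws and losses is made up purely for illustration.

```python
def win_percentage(wins, draws, losses):
    """Return wins as a percentage of all games played."""
    total_games = wins + draws + losses
    if total_games == 0:
        raise ValueError("no games played")
    return wins / total_games * 100


# 12 wins out of 60 games, as in the worked example.
# The 18 draws and 30 losses are invented so that the total is 60.
example = win_percentage(12, 18, 30)
print(f"{example:.1f}%")  # prints 20.0%
```

Dividing by the total number of games is what makes groups with different numbers of teams comparable.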
Graph 1: Scatter graph to show
stadium size against win percentage
I have used a scatter graph as it
can begin to show me if there is any sort of correlation. First we have the
graph from Excel, as it is easier to read and shows the values; the second
graph is from the TI-84 and shows the readings in more detail.
Along the X axis we have football stadium size (in 1000’s) and along the Y axis we
have win percentage (%). From this we can begin to see that there is a fairly
strong correlation between the size of a
football stadium and win
percentage. We are able to come to this conclusion because the points
rise from left to right and almost form
a straight line, which indicates a fairly strong positive correlation.
Calculation of line of least squares regression
X = the midpoints of the
stadium size groups
Y = the win percentage of teams
The equation for the line of least
squares regression is Y = aX + b
a = the gradient of the line
b = the point at which the line
intercepts the Y axis
Σ = the sum of the values
n = the number of values (9)
I used the midpoints of X because
the data is grouped, so each group has to be represented by an estimate, and the
midpoint is the most reasonable single value to use. The X values are given in 1000’s; this does not
change the relationship, it just makes the numbers more manageable and helps
to reduce errors by using smaller numbers. Having worked out the line
of least squares regression, I can now put it onto my graph to
show the trend. The equation of the line shows that for every increase in stadium size
of 1000, the win percentage goes up by 0.85 percentage points. In this context 0.85 is
the gradient of the line.
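The regression line Y = aX + b can be sketched in Python from the sums Σx, Σy, Σxy and Σx². This assumes the standard textbook sum formulas, since the exact working used in this coursework is not shown. The function name is hypothetical and the points are made up (they lie exactly on y = 2x + 1) so that the answer is easy to check.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y = a*x + b."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x values are equal, so the gradient is undefined")
    a = (n * sum_xy - sum_x * sum_y) / denominator  # gradient
    b = (sum_y - a * sum_x) / n                     # Y-axis intercept
    return a, b


# Made-up midpoints (in 1000's) and percentages, NOT the coursework data.
xs = [15, 25, 35, 45]
ys = [31, 51, 71, 91]
a, b = least_squares_line(xs, ys)
print(f"y = {a:.2f}x + {b:.2f}")  # prints y = 2.00x + 1.00
```

Because X is measured in 1000’s, the gradient a is the change in Y for each extra 1000 of stadium size.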
Calculation of Pearson’s correlation coefficient
The Pearson correlation coefficient
shows how strong the linear relationship between 2 variables is. The values for
Pearson’s correlation coefficient lie between -1 and 1. A value of -1
represents a perfect negative correlation and a value of 1 represents
a perfect positive correlation. First I sorted all the data into columns to
make the workings easier and to try to avoid mistakes. Then I
put the values that I had worked out into the formula to find my
value of Pearson’s correlation coefficient. My
working shows a very strong positive correlation between the size of football
stadium and win percentage (r=0.9157037606), so
there is clearly a relationship between the two.
It is considered a strong positive correlation as the value is
close to 1.
For the χ² test, the calculated value is greater than the χ² crit value (16.6 > 9.49). This means we reject H0 and accept H1,
and therefore win percentage is not independent of the size of football stadium. This
backs up my previous findings that there is a relationship between the size of
football stadium and win percentage.
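The decision rule for the χ² test can be sketched in Python: work out the expected count for each cell from the row and column totals, add up (observed - expected)² / expected, and compare the total with the critical value. The observed table in this coursework is not shown, so the 3 by 3 table below is made up. The value 9.49 is the 5% critical value for 4 degrees of freedom, which would fit a 3 by 3 table, but that match is an assumption. The function name is hypothetical.

```python
def chi_squared_statistic(observed):
    """Return (statistic, degrees_of_freedom) for a table of observed counts.

    Assumes every row total and column total is greater than zero.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (obs - expected) ** 2 / expected
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, dof


# Made-up observed counts, NOT the coursework table.
observed = [
    [20, 10, 10],
    [10, 20, 10],
    [10, 10, 20],
]
CRITICAL_VALUE = 9.49  # 5% level, 4 degrees of freedom

stat, dof = chi_squared_statistic(observed)
reject_h0 = stat > CRITICAL_VALUE
print(f"chi-squared = {stat:.2f}, dof = {dof}, reject H0: {reject_h0}")
```

For this made-up table every expected count is 40 × 40 / 120, and the statistic comes to 15, so H0 would be rejected at the 5% level.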
The data that I have collected from
Wikipedia and UEFA is all secondary data. This means I have to assume that all
the data is reliable and trust it in order to answer my question.
The data that I have used and
collected has some limitations and validity issues that may have
affected the correlation found between football stadium size and
number of games won. Validity is extremely important as it affects how far
my results can be trusted to show whether there is any relationship between the
variables that I am investigating. I am confident that
there are few mistakes in my workings, and hopefully no other variables
have affected the outcome of my results.
The first validity issue is that
using grouped data comes with a few problems. The stadium size group
20,000-30,000 has only 1 team in it, and this team lost all its games, which
means the group has a win percentage of 0%. This is a problem because a result
based on a single team is not very
reliable, and it arose because I chose to group my data.
For my line of least squares
regression my working was fairly accurate, as I worked out the line to be
Y = 0.85X - 5.083 and my TI-84 gave the line of least squares regression as
Y = 0.85X - 4.992. So I was fairly close to the result my calculator produced. The
small difference arises because my calculator carries more significant figures
through the calculation and is therefore more accurate.
One problem that I encountered was
with the χ²
test because my data was grouped.
In my observed table I had quite a few readings which were either 0 or less
than 5, and this reduces the reliability of the result. I
worked out that my χ² value was 16.6, which is quite a bit larger than the χ² crit value of
9.49. Taken at face value this shows a very strong relationship; while it is true that
there is a strong relationship, the χ² value should not be this high. This problem affects
reliability and it occurred because I used grouped data.
The calculation for Pearson’s correlation
coefficient was valid, as I worked it out to be 0.9157 and my TI-84 worked it
out to be 0.903. As with the line of least squares regression, the difference arises
because my calculator uses more figures, which makes its reading
more accurate than mine.
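One way to check a hand calculation of Pearson’s correlation coefficient is to write the sum formula out in Python, so that no rounding happens between steps. This is a minimal sketch assuming the standard textbook formula built from Σx, Σy, Σxy, Σx² and Σy². The function name is hypothetical and the data points are made up, not taken from this coursework.

```python
import math


def pearson_r(xs, ys):
    """Return Pearson's correlation coefficient for paired values."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    spread_x = n * sum_x2 - sum_x ** 2
    spread_y = n * sum_y2 - sum_y ** 2
    if spread_x == 0 or spread_y == 0:
        raise ValueError("r is undefined when one variable never changes")
    return numerator / math.sqrt(spread_x * spread_y)


# Made-up paired values, NOT the coursework data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
print(f"r = {r:.4f}")  # prints r = 0.7746
```

Keeping full precision until the final step is the main difference between a calculator’s answer and working that rounds intermediate values by hand.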
Overall my working and data
collection were fairly reliable, and on the whole there do not appear to be many
errors. The only real problem was with the χ²
test, as the value was so high in relation
to the χ²
crit value. Using grouped data was a good idea as it made the calculations far
more manageable; the only real drawback was its effect on the χ² test, as I have already mentioned.
In conclusion, from my results it is
clear that there is a correlation between size of football
stadium and win percentage. My Pearson’s correlation coefficient
(0.9157) and my χ²
test (16.6) both indicate
that there is a very strong relationship between size of football
stadium and win percentage. So although the χ²
value is a bit out of proportion, I
think it is quite easy to see that there is a relationship between size of
football stadium and win percentage. However, stadium size
is not the only factor that affects win percentage. The wealth of the club, how good
the players are, and transfer fees all play a part. Stadium size
appears to affect win percentage, but it is not the only factor that makes an
impact on it.