By the end of this chapter, you will be familiar with:
Basic definitions:
Data (variables) can be classified into two main types: numerical (quantitative) and categorical (qualitative).
Numerical data
Numerical or quantitative data measures a numerical quantity.
There are two types of numerical data:
Categorical or Qualitative data
It measures a quality or characteristic of the experimental unit. Categorical data represents characteristics such as a person’s gender, marital status, hometown, or the types of movies they like.
We often use a pie chart to display the different values of a categorical variable.
It shows how a total amount is divided between levels of a categorical variable as a circle divided into radial slices. Each categorical value corresponds with a single slice of the circle, and the size of each slice (both in area and arc length) indicates what proportion of the whole each category level takes.
Ex. Favorite type of movies:
A frequency distribution table is one way you can organize data so that it makes more sense.
The frequency of an observation tells you the number of times the observation occurs in the data. For example, in the following list of numbers, the frequency of the number 9 is 5 (because it occurs 5 times):
1, 2, 3, 4, 6, 9, 9, 8, 5, 1, 1, 9, 9, 0, 6, 9.
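The count above can be checked with a few lines of Python. This is only a minimal sketch using the standard library's `Counter`; the variable names are illustrative.

```python
from collections import Counter

# The list of observations from the example above.
observations = [1, 2, 3, 4, 6, 9, 9, 8, 5, 1, 1, 9, 9, 0, 6, 9]

# Counter maps each distinct observation to the number of times it occurs.
frequency = Counter(observations)

print(frequency[9])  # frequency of the number 9
print(frequency[1])  # frequency of the number 1
```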
In a frequency distribution table, the left column, called classes or groups, includes numerical intervals on the variable being studied. The right column is a list of the frequencies or number of observations, for each class.
The following example shows how to generate a frequency distribution table:
Ex. The data below shows the mass of 40 students in a class, measured to the nearest kg. Draw a frequency distribution table for the data.
55, 70, 57, 73, 55, 59, 64, 72, 60, 48, 58, 54, 69, 51, 63, 78, 75, 64, 65, 57, 71, 78, 76, 62, 49, 66, 62, 76, 61, 63, 63, 76, 52, 76, 71, 61, 53, 56, 67, 71
STEP 1: Check for the minimum and maximum values. Here 48 is the minimum and 78 is the maximum.
STEP 2: Find the range. Here, the range is 78 − 48 = 30. The scale of the frequency table must cover this range of masses.
STEP 3: Decide the intervals for the distribution. Here we can use intervals of width 5, starting at 45 and ending at 79.
STEP 4: Draw the frequency table using the selected scale and intervals.
MASS (kg) | FREQUENCY |
45-49 | 2 |
50-54 | 4 |
55-59 | 7 |
60-64 | 10 |
65-69 | 4 |
70-74 | 6 |
75-79 | 7 |
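The four steps can also be sketched in Python. The function name `frequency_table` and its parameters are assumptions made for this illustration; the class limits (45-49, 50-54, and so on) follow the table above.

```python
# Masses (kg) of the 40 students from the example.
masses = [55, 70, 57, 73, 55, 59, 64, 72, 60, 48,
          58, 54, 69, 51, 63, 78, 75, 64, 65, 57,
          71, 78, 76, 62, 49, 66, 62, 76, 61, 63,
          63, 76, 52, 76, 71, 61, 53, 56, 67, 71]

def frequency_table(data, start, width, n_classes):
    """Count the values in each class [lower, lower + width - 1]."""
    table = []
    for k in range(n_classes):
        lower = start + k * width
        upper = lower + width - 1  # inclusive upper limit, e.g. 45-49
        count = sum(1 for value in data if lower <= value <= upper)
        table.append((lower, upper, count))
    return table

print(min(masses), max(masses), max(masses) - min(masses))  # steps 1 and 2

table = frequency_table(masses, start=45, width=5, n_classes=7)
for lower, upper, count in table:
    print(f"{lower}-{upper} | {count} |")
```

Because the masses are whole numbers, inclusive limits such as 45-49 and 50-54 leave no gaps between classes.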
We are familiar with bar charts. A histogram looks similar to a bar chart. Bar charts are suitable for discrete data, while a histogram is used for continuous data.
We can convert a frequency distribution into a histogram. The intervals are the x-axis variables and the frequencies are the y-axis variables. We construct bars for each interval without any space between them. The width of each bar spans the width of the data class it represents.
Histogram for the frequency distribution example worked out above:
The cumulative frequency is calculated by adding each frequency from a frequency distribution table to the sum of its predecessors. The last value will always be equal to the total for all observations, since all frequencies will already have been added to the previous total.
A relative cumulative frequency distribution converts all cumulative frequencies to cumulative percentages.
Adding columns for the cumulative and relative cumulative frequencies to the frequency distribution example worked out above gives:
MASS (kg) | FREQUENCY | CUMULATIVE FREQUENCY | PERCENTAGE OF STUDENTS (%) | CUMULATIVE PERCENTAGE (%) |
45-49 | 2 | 2 | 5 | 5 |
50-54 | 4 | 6 | 10 | 15 |
55-59 | 7 | 13 | 17.5 | 32.5 |
60-64 | 10 | 23 | 25 | 57.5 |
65-69 | 4 | 27 | 10 | 67.5 |
70-74 | 6 | 33 | 15 | 82.5 |
75-79 | 7 | 40 | 17.5 | 100 |
TOTAL | 40 | | 100.0 | |
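The cumulative and percentage columns can be reproduced with a short Python sketch. It assumes the class frequencies from the table above; `accumulate` from the standard library forms the running totals.

```python
from itertools import accumulate

classes = ["45-49", "50-54", "55-59", "60-64", "65-69", "70-74", "75-79"]
frequencies = [2, 4, 7, 10, 4, 6, 7]

total = sum(frequencies)

# Running total: each frequency added to the sum of its predecessors.
cumulative = list(accumulate(frequencies))

# Each count expressed as a percentage of all observations.
percentages = [100 * f / total for f in frequencies]
cumulative_percentages = [100 * c / total for c in cumulative]

for row in zip(classes, frequencies, cumulative, percentages, cumulative_percentages):
    print(" | ".join(str(value) for value in row))
```

The last cumulative frequency equals the total number of observations, and the last cumulative percentage is 100.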
Cumulative frequencies and their graphs help in analysing grouped data.
A cumulative frequency graph is a line graph. The upper limit of each class is plotted on the x-axis and the cumulative frequency on the y-axis.
SHAPE OF DISTRIBUTION
Note the following shapes of distribution: symmetric and skewed.
The population consists of every member in the group that you want to find out about. A sample is a subset of the population that will give you information about the population as a whole.
There are a number of ways in which we can draw a sample from the population. We should always try to choose a method which results in the sample giving the best approximation for the population as a whole.
When surveying, however, it is vital to ensure the people in your sample reflect the population or else you will get misleading results.
1.) REASONS FOR SAMPLING
TYPES OF SAMPLING
a) Random sampling
Random sampling is a sampling technique in which each member of the population has an equal probability of being chosen. It is also called probability sampling.
b) Non-random sampling
Non-random sampling is a way of selecting units based on factors other than random chance. Here, not every unit of the population has the same probability of being selected into the sample. It is also called non-probability sampling.
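As an illustration of random sampling, the sketch below draws a simple random sample with Python's standard `random` module. The population of 40 ID numbers and the sample size of 5 are hypothetical values chosen for the example.

```python
import random

# Hypothetical population: ID numbers 1 to 40 (an assumption for this sketch).
population = list(range(1, 41))

# A fixed seed makes the sketch repeatable; omit it for a fresh draw each run.
rng = random.Random(42)

# sample() selects k distinct members, each equally likely to be chosen.
sample = rng.sample(population, k=5)
print(sample)
```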
A measure of central tendency is a summary statistic that represents the center point or typical value of a dataset. These measures indicate where most values in a distribution fall and are also referred to as the central location of a distribution.
The three most common measures of central tendency are the mean, median, and mode.
Variability refers to how “spread out” a group of scores is. The terms variability, spread, and dispersion are synonyms. Just as the section on central tendency discussed measures of the center of a distribution of scores, this section discusses measures of the variability of a distribution, which include the range, the variance, the standard deviation, the interquartile range and the coefficient of variation.
Ex. You and your friends have just measured the heights of your dogs (in millimeters):
The heights (at the shoulders) are: 600mm, 470mm, 170mm, 430mm and 300mm.
Find out the Mean, the Variance, and the Standard Deviation.
Mean = (600 + 470 + 170 + 430 + 300)/5
Mean = 394 mm
Now we calculate each dog’s difference from the mean and square it.
Height of dog | |xi − x̄| | (xi − x̄)² |
600 | 206 | 42436 |
470 | 76 | 5776 |
170 | 224 | 50176 |
430 | 36 | 1296 |
300 | 94 | 8836 |
Variance = (42436 + 5776 + 50176 + 1296 + 8836)/5 = 108520/5
Variance = 21704 mm²
Standard deviation = √21704
Standard deviation = 147.32 mm
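The same calculation can be checked with Python's `statistics` module. This sketch uses the population versions (`pvariance`, `pstdev`), which divide by n as the worked example does; the sample versions (`variance`, `stdev`) divide by n − 1 and give different values.

```python
from statistics import mean, pvariance, pstdev

# Heights of the five dogs, in millimetres.
heights = [600, 470, 170, 430, 300]

m = mean(heights)                    # arithmetic mean
squared_diffs = [(h - m) ** 2 for h in heights]
var = pvariance(heights)             # population variance: sum of squared differences / n
sd = pstdev(heights)                 # population standard deviation: square root of the variance

print(m, squared_diffs, var, round(sd, 2))
```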
STEP 1: Arrange the data from smallest to largest and find the median. 25, 28, 29, 29, 30, 34, 35, 35, 37, 38
Median = (30+34 )/2 = 32
STEP 2: Find the quartiles
The first quartile is the median of the data points to the left of the median.
25, 28, 29, 29, 30
Q1=29
The third quartile is the median of the data points to the right of the median.
34, 35, 35, 37, 38
Q3=35
STEP 3: Complete the five-number summary by finding the minimum and the maximum.
Minimum = 25, Maximum = 38
The five-number summary is 25, 29, 32, 35, 38.
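The three steps can be sketched as a small Python function. The name `five_number_summary` is illustrative. For an odd number of data points the sketch leaves the middle value out of both halves, which is one common convention and an assumption here, since the example above only covers the even case.

```python
from statistics import median

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves method."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower_half = values[:half]           # data points to the left of the median
    upper_half = values[half + n % 2:]   # data points to the right (middle value skipped if n is odd)
    return (values[0], median(lower_half), median(values),
            median(upper_half), values[-1])

data = [25, 28, 29, 29, 30, 34, 35, 35, 37, 38]
summary = five_number_summary(data)
print(summary)
```

Other quartile conventions exist (for example, interpolation-based methods), so software packages may report slightly different values for Q1 and Q3.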
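Now, we extend the definitions of variance and standard deviation to data which has been grouped. An alternative, yet equivalent, formula for variance, which is often easier to use, is:
Variance = (∑fx²/n) − x̄²
Here x is the mid-interval value of a class, f is its frequency, n = ∑f is the total number of observations and x̄ = ∑fx/n is the mean.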
Ex. Find an estimate of the variance and standard deviation of the following data for the marks obtained in a test by 88 students.
Marks (x) | 0≤x<10 | 10≤x<20 | 20≤x<30 | 30≤x<40 | 40≤x<50 |
Frequency (f) | 6 | 16 | 24 | 25 | 17 |
We can show the calculations in a table as follows:
Marks | Mid Interval Value (x) | f | fx | x² | fx² |
0≤x<10 | 5 | 6 | 30 | 25 | 150 |
10≤x<20 | 15 | 16 | 240 | 225 | 3600 |
20≤x<30 | 25 | 24 | 600 | 625 | 15000 |
30≤x<40 | 35 | 25 | 875 | 1225 | 30625 |
40≤x<50 | 45 | 17 | 765 | 2025 | 34425 |
Total | | 88 | 2510 | | 83800 |
Mean = ∑fx/n = 2510/88 = 28.52
Variance = (∑fx²/n) − x̄² = (83800/88) − (2510/88)²
Variance =138.73
Standard deviation = √138.73 = 11.78
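The table calculation can be reproduced in Python. This is a sketch of the grouped-data estimate, assuming every mark in a class sits at the mid-interval value; the variable names are illustrative.

```python
from math import sqrt

# Class boundaries (lower <= x < upper) and frequencies from the example.
class_bounds = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
frequencies = [6, 16, 24, 25, 17]

# Mid-interval value of each class.
midpoints = [(lower + upper) / 2 for lower, upper in class_bounds]

n = sum(frequencies)
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))
sum_fx2 = sum(f * x * x for f, x in zip(frequencies, midpoints))

grouped_mean = sum_fx / n
grouped_variance = sum_fx2 / n - grouped_mean ** 2  # (sum of f*x^2)/n minus the squared mean
grouped_sd = sqrt(grouped_variance)

print(round(grouped_mean, 2), round(grouped_variance, 2), round(grouped_sd, 2))
```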
A practical way of seeing the significance of the standard deviation is the empirical rule.
The empirical rule applies to a normal distribution. In a normal distribution, virtually all data falls within three standard deviations of the mean. The mean, mode, and median are all equal.
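The empirical rule is also referred to as the Three Sigma Rule or the 68-95-99.7 Rule because:
About 68% of the data falls within one standard deviation of the mean (μ ± σ).
About 95% of the data falls within two standard deviations of the mean (μ ± 2σ).
About 99.7% of the data falls within three standard deviations of the mean (μ ± 3σ).
Here, μ is the mean and σ is the standard deviation.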
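The 68-95-99.7 figures can be checked numerically. The sketch below uses `NormalDist` from Python's `statistics` module (available from Python 3.8); the helper name `share_within` is an assumption made for this illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {100 * share_within(k):.2f}%")
```

The proportions do not depend on the particular mean and standard deviation, so the standard normal distribution is enough for the check.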
Bivariate statistics is a type of inferential statistics that deals with the relationship between two variables. That is, bivariate statistics examines how one variable compares with another or how one variable influences another variable.
Ex. Ice cream sales versus the temperature on that day. The two variables are Ice Cream Sales and Temperature.
(If you have only one set of data, such as just Temperature, it is called “Univariate Data”)
One way to determine whether there is a relationship between two data sets is to plot the bivariate data on a scatter diagram. A scatter diagram takes the two sets of data and plots one set on the x-axis and the other set on the y-axis.
Look at the scatter diagram for the above example:
Ex. The local ice cream shop keeps track of how much ice cream they sell versus the temperature on that day, here are their figures for the last 12 days:
Temperature (°C) | Ice cream sales |
14.2 | $215 |
16.4 | $325 |
11.9 | $185 |
15.2 | $332 |
18.5 | $406 |
22.1 | $522 |
19.4 | $412 |
25.1 | $614 |
23.4 | $544 |
18.1 | $421 |
22.6 | $445 |
17.2 | $408 |
Plot a scatter plot and find the correlation coefficient.
Scatter plot:
We can see that warmer weather and higher sales go together. The relationship is good but not perfectly linear.
Temp (°C) (𝑥) | Sales (𝑦) | 𝑥i−𝑥̅ | 𝑦i−𝑦̅ | (𝑥i−𝑥̅)(𝑦i−𝑦̅) | (𝑥i−𝑥̅)² | (𝑦i−𝑦̅)² |
14.2 | $215 | -4.5 | -187 | 842 | 20.3 | 34969 |
16.4 | $325 | -2.3 | -77 | 177 | 5.3 | 5929 |
11.9 | $185 | -6.8 | -217 | 1476 | 46.2 | 47089 |
15.2 | $332 | -3.5 | -70 | 245 | 12.3 | 4900 |
18.5 | $406 | -0.2 | 4 | -1 | 0.0 | 16 |
22.1 | $522 | 3.4 | 120 | 408 | 11.6 | 14400 |
19.4 | $412 | 0.7 | 10 | 7 | 0.5 | 100 |
25.1 | $614 | 6.4 | 212 | 1357 | 41.0 | 44944 |
23.4 | $544 | 4.7 | 142 | 667 | 22.1 | 20164 |
18.1 | $421 | -0.6 | 19 | -11 | 0.4 | 361 |
22.6 | $445 | 3.9 | 43 | 168 | 15.2 | 1849 |
17.2 | $408 | -1.5 | 6 | -9 | 2.3 | 36 |
Total | | | | 5325 | 177 | 174757 |
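Using the totals from the table, the correlation coefficient is
r = ∑(xi − x̄)(yi − ȳ) / √(∑(xi − x̄)² × ∑(yi − ȳ)²) = 5325 / √(177 × 174757) ≈ 0.96,
which shows that the correlation is strong but not perfect.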
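The correlation coefficient for the ice cream data can be computed directly in Python. This sketch implements the Pearson formula built up column by column in the table; the function name `pearson_r` is illustrative. It uses unrounded means, so its intermediate sums differ slightly from the rounded table entries.

```python
from math import sqrt

temperatures = [14.2, 16.4, 11.9, 15.2, 18.5, 22.1,
                19.4, 25.1, 23.4, 18.1, 22.6, 17.2]
sales = [215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

r = pearson_r(temperatures, sales)
print(round(r, 2))
```

A value close to +1 indicates a strong positive linear relationship, a value close to −1 a strong negative one, and a value near 0 little or no linear relationship.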
A line of best fit can be roughly determined using an eyeball method by drawing a straight line on a scatter plot so that the number of points above the line and below the line is about equal (and the line passes through as many points as possible).
Drawing the line of best fit for the previous example of ice cream sales and temperature:
Interpolation: We could use our function to predict the value of the dependent variable for a value of the independent variable that lies within the range of our data. In this case, we are performing interpolation. It is reasonable when the scatter plot shows a strong relationship.
Extrapolation: We could use our function to predict the value of the dependent variable for a value of the independent variable that lies outside the range of our data. In this case, we are performing extrapolation. Such predictions are far less reliable: without data in that region, there is no reason to believe that the relationship between X and Y is the same as in the region in which there is data.
Ex. Draw a least squares regression line for the following data:
Hours spent on essay | Grade |
6 | 82 |
10 | 88 |
2 | 56 |
4 | 64 |
6 | 77 |
7 | 92 |
0 | 23 |
1 | 41 |
8 | 80 |
5 | 59 |
3 | 47 |
Scatter plot:
Mean of hours spent = 4.73
Mean of grade=64.45
Hours spent on essay (𝑥) | Grade (𝑦) | 𝑥i−𝑥̅ | 𝑦i−𝑦̅ | (𝑥i−𝑥̅)(𝑦i−𝑦̅) | (𝑥i−𝑥̅)² |
6 | 82 | 1.27 | 17.55 | 22.33 | 1.6129 |
10 | 88 | 5.27 | 23.55 | 124.15 | 27.772 |
2 | 56 | -2.73 | -8.45 | 23.06 | 7.452 |
4 | 64 | -0.73 | -0.45 | 0.33 | 0.532 |
6 | 77 | 1.27 | 12.55 | 15.97 | 1.612 |
7 | 92 | 2.27 | 27.55 | 62.60 | 5.152 |
0 | 23 | -4.73 | -41.45 | 195.97 | 22.372 |
1 | 41 | -3.73 | -23.45 | 87.42 | 13.91 |
8 | 80 | 3.27 | 15.55 | 50.88 | 10.69 |
5 | 59 | 0.27 | -5.45 | -1.49 | 0.072 |
3 | 47 | -1.73 | -17.45 | 30.15 | 2.992 |
Total | | | | 611.36 | 94.18 |
b = ∑(xi − x̄)(yi − ȳ) / ∑(xi − x̄)²
b = (611.36)/(94.18)
b = 6.49
a = ȳ − b x̄
a = 64.45 − 6.49(4.73)
a = 33.75
We now have the slope and the intercept, so the equation of the regression line is:
y = 6.49x + 33.75
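The slope and intercept can be recomputed in Python as a check. This sketch follows the same formulas, with b as the slope and a as the intercept; the function name `least_squares_line` is illustrative. It works with unrounded means, so the last digit of the intercept may differ slightly from a hand calculation that rounds along the way.

```python
hours = [6, 10, 2, 4, 6, 7, 0, 1, 8, 5, 3]
grades = [82, 88, 56, 64, 77, 92, 23, 41, 80, 59, 47]

def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y = b*x + a."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return a, b

a, b = least_squares_line(hours, grades)
print(f"y = {b:.2f}x + {a:.2f}")
```

A useful sanity check is that the fitted line always passes through the point of means: substituting the mean number of hours into the equation should return the mean grade.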
Now, drawing the least squares regression line on the scatter plot.
In reality, a single linear function might not be sufficient to model a scenario. In this case, a different function could be used to model each section. Combining two or more linear functions results in what is called a piecewise linear function.
Ex. 𝑓(𝑥) = 2𝑥 + 3 for −3 ≤ 𝑥 < 1, and 𝑓(𝑥) = 5 for 1 ≤ 𝑥 ≤ 6
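The piecewise example can be written as a short Python function, with one branch per section of the domain. Raising an error outside −3 ≤ x ≤ 6 is a choice made for this sketch, since the function is not defined there.

```python
def f(x):
    """Piecewise linear function: 2x + 3 on [-3, 1), and the constant 5 on [1, 6]."""
    if -3 <= x < 1:
        return 2 * x + 3
    if 1 <= x <= 6:
        return 5
    raise ValueError("x is outside the domain -3 <= x <= 6")

for x in (-3, 0, 1, 6):
    print(x, f(x))
```

Both pieces give the value 5 at x = 1 (since 2(1) + 3 = 5), so the graph of this particular function has no jump where the sections meet.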