Business Analytics

Chapter 5

Numerical descriptive measures (continued)

Chapter outline

Lecture 04

5.3 Measures of relative standing and box plots

5.4 Approximating descriptive measures for grouped data

5.5 Measures of association

5.6 General guidelines on the exploration of data

Learning objectives

LO1 to LO3 were covered last week

This week

LO4 Explain the concepts of percentiles, deciles, quartiles and interquartile range, and show their usefulness through the application of a box plot

LO5 Calculate the mean and variance when the data are already in grouped form

LO6 Obtain numerical measures to calculate the direction and strength of the linear relationship between two variables

LO7 Understand the use of graphical methods and numerical measures to present summary information about a data set.

5.3 Measures of Relative Standing and Box Plots

Measures of relative standing are designed to provide information about the position of particular values relative to the entire data set.

Percentile: the pth percentile is the value for which p% of the observations are less than that value and (100 – p)% are greater than that value.

Suppose you scored in the 60th percentile on your final exam. That means 60% of the other students’ scores were below yours, while 40% were above yours.

Percentiles

The pth percentile of a set of measurements is the value for which

at most p% of the measurements are less than that value

at most (100 – p)% of all the measurements are greater than that value.

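For example, suppose 77 is the 68th percentile of the scores on a statistics exam. Then at most 68% of the scores are less than 77 and at most 32% of the scores are greater than 77.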

Quartiles

We have special names for the 25th, 50th and the 75th percentiles, namely quartiles.

•First (lower) quartile, Q1 = 25th percentile (p25)

•Second (middle) quartile, Q2 = 50th percentile (p50) (which is also the median)

•Third (upper) quartile, Q3 = 75th percentile (p75)

We can also convert percentiles into quintiles (fifths) and deciles (tenths).

Commonly Used Percentiles…

First (lower) decile = 10th percentile

First (lower) quartile, Q1 = 25th percentile

Second (middle) quartile, Q2 = 50th percentile

Third (upper) quartile, Q3 = 75th percentile

Ninth (upper) decile = 90th percentile

For example, if your exam mark places you in the 80th percentile, that doesn’t mean you scored 80% on the exam – it means that 80% of your peers scored lower than you and 20% scored higher than you in the exam. It is about your position relative to others, not the actual mark.

Example 11

Find the quartiles of the following set of measurements

7, 18, 12, 17, 29, 18, 4, 27, 30, 2, 4, 10, 21, 5, 8

Example 11 - Solution

First sort the measurements

2, 4, 4, 5, 7, 8, 10, 12, 17, 18, 18, 21, 27, 29, 30

Location of Percentiles

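Find the location of any percentile using the formula

$$L_p = (n + 1)\,\frac{p}{100}$$

where L_p is the location of the pth percentile in the sorted data and n is the number of observations.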

Example 12

Example 12 - Solution

After sorting the data we have

0, 0, 5, 7, 8, 9, 12, 14, 22, 33.

The 50th percentile is halfway between the fifth and sixth observations (in the middle between 8 and 9), that is 8.5. That is,

p50 = 8 + (.5)(9 – 8) = 8.5

The 75th percentile is one quarter of the distance from the eighth to the ninth observation. That is,

p75 = 14 + (.25)(22 – 14) = 16.
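The location-and-interpolation steps used in Examples 11 and 12 can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: it assumes the location rule L_p = (n + 1)p/100, which matches the locations 5.5 and 8.25 used in Example 12, and the function name `percentile` and the clamping of out-of-range locations are assumptions made for this sketch.

```python
def percentile(data, p):
    """Return the pth percentile using the location L_p = (n + 1) * p / 100."""
    values = sorted(data)
    n = len(values)
    location = (n + 1) * p / 100          # 1-based position in the sorted data
    # Assumption: locations outside 1..n are clamped to the nearest end.
    location = min(max(location, 1), n)
    lower = int(location)                 # position of the lower neighbour
    fraction = location - lower           # how far to move towards the next value
    if lower == n:
        return values[-1]
    return values[lower - 1] + fraction * (values[lower] - values[lower - 1])


example_12 = [0, 0, 5, 7, 8, 9, 12, 14, 22, 33]
p50 = percentile(example_12, 50)          # location 5.5
p75 = percentile(example_12, 75)          # location 8.25

example_11 = [7, 18, 12, 17, 29, 18, 4, 27, 30, 2, 4, 10, 21, 5, 8]
quartiles_11 = [percentile(example_11, p) for p in (25, 50, 75)]

print(p50, p75, quartiles_11)
```

For Example 12 this reproduces p50 = 8.5 and p75 = 16. For Example 11 (n = 15) the locations are the whole numbers 4, 8 and 12, so the quartiles are simply the 4th, 8th and 12th sorted values.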

Location of Percentiles…

Please remember…

Quartiles and Variability

Quartiles can provide an idea about the shape of a histogram.

Interquartile Range…

The quartiles can be used to create another measure of variability, the interquartile range, which is defined as follows:

Interquartile Range (IQR) = Q3 – Q1

The interquartile range measures the spread of the middle 50% of the observations.

Large values of this statistic mean that the 1st and 3rd quartiles are far apart, indicating a high level of variability.
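As a quick illustration, the interquartile range of the Example 11 data can be computed with Python's standard library. This is a sketch only: it assumes that `statistics.quantiles` with `method="exclusive"` is an acceptable stand-in for the hand calculation, since that method places the quartiles at positions (n + 1)p/100 in the sorted data and interpolates linearly.

```python
from statistics import quantiles

example_11 = [7, 18, 12, 17, 29, 18, 4, 27, 30, 2, 4, 10, 21, 5, 8]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(example_11, n=4, method="exclusive")

iqr = q3 - q1   # spread of the middle half of the observations
print(q1, q3, iqr)
```

Here Q1 = 5 and Q3 = 21, so the IQR is 16.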

Box Plots

A box plot is a pictorial display that graphs five main descriptive measures of the measurement set:

L – The largest measurement

Q3 – The upper quartile

Q2 – The median

Q1 – The lower quartile

S – The smallest measurement

The box plot is a technique that graphs five statistics:

• the minimum and maximum observations, and

• the first, second, and third quartiles.

The lines extending to the left and right are called whiskers.

Any points that lie outside the whiskers are called outliers.

The whiskers extend outward to the smaller of 1.5 times the interquartile range or to the most extreme point that is not an outlier.
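The whisker and outlier rules can be sketched as follows. This is a minimal illustration under stated assumptions: the 1.5 × IQR distance is measured outward from Q1 and Q3, the quartiles come from `statistics.quantiles` with `method="exclusive"`, and the function name `box_plot_summary` and the extra value 60 are hypothetical, added only so that an outlier appears.

```python
from statistics import quantiles


def box_plot_summary(data):
    """Return the quartiles, whisker ends and outliers used to draw a box plot."""
    values = sorted(data)
    q1, q2, q3 = quantiles(values, n=4, method="exclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr          # points below this are outliers
    upper_fence = q3 + 1.5 * iqr          # points above this are outliers
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "quartiles": (q1, q2, q3),
        # Whiskers stop at the most extreme points that are not outliers.
        "whiskers": (inside[0], inside[-1]),
        "outliers": outliers,
    }


example_11 = [7, 18, 12, 17, 29, 18, 4, 27, 30, 2, 4, 10, 21, 5, 8]
plain = box_plot_summary(example_11)

# Hypothetical: append one extreme value to show how an outlier is flagged.
with_extreme = box_plot_summary(example_11 + [60])

print(plain)
print(with_extreme)
```

With the original Example 11 data no point falls outside the fences, so the whiskers run from 2 to 30. With the hypothetical value 60 added, the upper whisker still ends at 30 and 60 is reported as an outlier.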

Example 13

Create a box plot for the data on the number of customers who purchased petrol at an independent petrol station each day over the last 200 days.

The number of customers ranges from 410 to 700.

On about half of the days, the number of customers is less than 560, and on about half it is greater than 560.

On about half of the days, the number of customers lies between 530 and 590.

On about a quarter of the days it lies below 530, and on about a quarter it lies above 590.

5.4 Approximating Descriptive Measures for Grouped Data

Descriptive measures may need to be approximated from grouped data when approximate values are sufficient, or when only secondary data in grouped form are available.

Approximate the mean and standard deviation of the telephone call durations problem, represented by the frequency distribution.
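A common way to approximate these measures is to treat every observation in a class as if it sat at the class midpoint. The sketch below assumes that approach, with mean ≈ Σ f·m / n and variance ≈ [Σ f·m² – (Σ f·m)² / n] / (n – 1), where f is the class frequency and m the class midpoint. The frequency distribution shown is hypothetical; it is not the telephone call durations data.

```python
from math import sqrt

# Hypothetical frequency distribution: (lower limit, upper limit, frequency).
classes = [(0, 10, 2), (10, 20, 3), (20, 30, 5)]

freqs = [f for _, _, f in classes]
midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]
n = sum(freqs)

sum_fm = sum(f * m for f, m in zip(freqs, midpoints))        # sum of f * m
sum_fm2 = sum(f * m ** 2 for f, m in zip(freqs, midpoints))  # sum of f * m^2

mean = sum_fm / n
variance = (sum_fm2 - sum_fm ** 2 / n) / (n - 1)
std_dev = sqrt(variance)

print(mean, variance, std_dev)
```

For this hypothetical table the approximate mean is 18 and the approximate variance is 610/9, roughly 67.8.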

5.5 Measures of Association

Two numerical measures are presented to describe the linear relationship between two variables depicted in a scatter diagram.

Covariance (is there any pattern to the way two variables move together?)

Correlation coefficient (how strong is the linear relationship between two variables?)

Covariance…

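In much the same way that there is a ‘shortcut’ for calculating the sample variance without first calculating the sample mean, there is also a shortcut for calculating the sample covariance without first calculating the means:

$$s_{xy} = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n}\right]$$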
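The sketch below compares the definitional form of the sample covariance, Σ(x – x̄)(y – ȳ) / (n – 1), with the shortcut form, [Σxy – (Σx)(Σy) / n] / (n – 1). The function names and the small data set are hypothetical, chosen only to show that the two forms agree.

```python
def sample_covariance(x, y):
    """Definitional form: average product of deviations from the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def sample_covariance_shortcut(x, y):
    """Shortcut form: needs only the sums of x, y and x*y, not the means."""
    n = len(x)
    sum_xy = sum(a * b for a, b in zip(x, y))
    return (sum_xy - sum(x) * sum(y) / n) / (n - 1)


# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

cov_definition = sample_covariance(x, y)
cov_shortcut = sample_covariance_shortcut(x, y)
print(cov_definition, cov_shortcut)
```

Both forms give 1.5 for this data. The positive sign indicates that the two variables tend to move in the same direction.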

Coefficient of Correlation…

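The coefficient of correlation is defined as the covariance divided by the product of the standard deviations of the variables:

$$\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y} \quad \text{(population)}, \qquad r = \frac{s_{xy}}{s_x s_y} \quad \text{(sample)}$$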

The coefficient of correlation can take positive or negative values.

It can take only values between –1 and +1.

Strong positive linear relationship

If the two variables have a very strong positive linear relationship, the coefficient value is close to +1.

Strong negative linear relationship

If the two variables have a very strong negative linear relationship, the coefficient value is close to –1.

No linear relationship

No linear (straight line) relationship is indicated by a coefficient value close to zero.
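The three cases above can be illustrated with a short sketch that computes the sample coefficient of correlation as the sample covariance divided by the product of the sample standard deviations. The function name and all three data sets are hypothetical.

```python
from math import sqrt


def sample_correlation(x, y):
    """Sample covariance divided by the product of sample standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    s_x = sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    s_y = sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return s_xy / (s_x * s_y)


x = [1, 2, 3, 4, 5]

r_positive = sample_correlation(x, [2, 4, 5, 4, 5])    # upward trend with scatter
r_negative = sample_correlation(x, [10, 8, 6, 4, 2])   # exact downward line
r_none = sample_correlation(x, [4, 1, 0, 1, 4])        # U-shape, no linear trend

print(r_positive, r_negative, r_none)
```

The third data set follows an exact curve, y = (x – 3)², yet its coefficient is zero. A value near zero rules out a straight-line relationship only, not every kind of relationship.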

Coefficient of Correlation…

Compute the covariance and the coefficient of correlation between advertising expenditure and sales level and discuss the strength and direction of the relationship between them. Base your calculation on the data (in millions) provided below.

Excel output

Interpretation

The covariance (10.2679) indicates that advertising expenditure and sales level are positively related.

The coefficient of correlation (0.797) indicates that there is a strong positive linear relationship between advertising expenditure and sales level.

The Least Squares Method

The objective of the scatter diagram is to measure the strength and direction of the linear relationship.

Both can be more easily judged by drawing a straight line through the data.

We need an objective method of producing a straight line.

Such a method has been developed; it is called the least squares method.

The Least Squares Method…

Recall, the slope-intercept equation for a line is expressed in these terms:

y = mx + b

where:

m is the slope of the line

b is the y-intercept.

If we’ve determined that there is a linear relationship between two variables using the covariance and the coefficient of correlation, can we determine a linear function of the relationship?

The Least Squares Method

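The least squares method produces a straight line drawn through the points so that the sum of squared deviations between the points and the line is minimised. This line is represented by the equation:

$$\hat{y} = b_0 + b_1 x$$

where b_0 is the y-intercept and b_1 is the slope.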

The Least Squares Method

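The coefficients b_0 and b_1 are given by:

$$b_1 = \frac{s_{xy}}{s_x^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$

where s_{xy} is the sample covariance, s_x^2 is the sample variance of x, and \bar{x} and \bar{y} are the sample means.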
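A minimal sketch of the calculation follows, assuming the standard least squares formulas: slope b1 = s_xy / s_x² and intercept b0 = ȳ – b1·x̄. The function name `least_squares_line` and the data are hypothetical.

```python
def least_squares_line(x, y):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    s_x2 = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    b1 = s_xy / s_x2              # slope
    b0 = mean_y - b1 * mean_x     # intercept: the line passes through the means
    return b0, b1


# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = least_squares_line(x, y)
predicted_at_6 = b0 + b1 * 6      # value of the fitted line at x = 6
print(b0, b1, predicted_at_6)
```

For this data the fitted line is approximately ŷ = 2.2 + 0.6x. Each extra unit of x is associated with an increase of about 0.6 in y.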

Fixed and Variable Costs

Fixed costs are costs that must be paid whether or not any units are produced.

These costs are ‘fixed’ over a specified period of time or range of production.

Variable costs are costs that vary directly with the number of products produced.

There are some expenses that are mixed.

There are several ways to break mixed costs into their fixed and variable components. One such method is the least squares line. That is, we express the total cost of some component as

y = b0 + b1x

where y = total mixed cost, b0 = fixed cost, b1 = variable cost per unit, and x is the number of units.

XM05-18 A tool and die maker operates out of a small shop making specialised tools. He is considering increasing the size of his business and needs to know more about his costs.

One such cost is electricity, which he needs to operate his machines and lights. (Some jobs require that he turn on extra bright lights to illuminate his work.) He keeps track of his daily electricity costs and the number of tools that he made that day. Determine the fixed and variable electricity costs.

The y-intercept is 9.587.

That is, the regression line strikes the y-axis at 9.587. This is simply the value of $\hat{y}$ when x = 0.

When x = 0, no tools are being produced, and hence the estimated fixed cost of electricity is $9.59 per day.

When we introduced the coefficient of correlation we pointed out that except for −1, 0, and +1 we cannot precisely interpret its meaning.

We can judge the coefficient of correlation in relation to its proximity to −1, 0, and +1 only.

Fortunately, we have another measure that can be precisely interpreted. It is the coefficient of determination, which is calculated by squaring the coefficient of correlation. For this reason we denote it R².

The coefficient of determination is

R² = 0.758

This tells us that 75.8% of the variation in electrical costs is explained by the number of tools. The remaining 24.2% is unexplained.
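The 'explained variation' reading of R² can be checked numerically. The sketch below, on hypothetical data rather than the electricity cost data, computes R² in two ways: by squaring the coefficient of correlation, and as 1 – SSE/SST, where SSE is the sum of squared deviations from the least squares line and SST is the total sum of squared deviations of y from its mean.

```python
# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

ss_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
ss_xx = sum((a - mean_x) ** 2 for a in x)
ss_yy = sum((b - mean_y) ** 2 for b in y)   # total variation in y (SST)

# Least squares line (the n - 1 divisors cancel in the slope).
b1 = ss_xy / ss_xx
b0 = mean_y - b1 * mean_x

# Unexplained variation: squared deviations of the points from the line (SSE).
sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))

r = ss_xy / (ss_xx * ss_yy) ** 0.5
r_squared_from_r = r ** 2
r_squared_from_fit = 1 - sse / ss_yy

print(r_squared_from_r, r_squared_from_fit)
```

Both routes give about 0.6 for this data, so roughly 60% of the variation in y is explained by x and the remaining 40% is unexplained.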

Interpreting Correlation

Because of its importance we remind you about the correct interpretation of the analysis of the relationship between two numerical variables. That is, if two variables are linearly related, it does not mean that X is causing Y. It may mean that another variable is causing both X and Y or that Y is causing X. Remember

‘Correlation is not Causation’

Parameters and Sample Statistics