IC252 — Introduction to Statistics

Olympiad-Style Problem Set with Complete Solutions

Think Unlimited · Descriptive Statistics · Distributions · Estimation


Instructions: Each problem below is a multi-part analytical question in the style of the Indian National Physics Olympiad (INPHO) and Jinhui Wang problem sets — emphasizing mathematical rigour, interpretation, and deep reasoning over rote computation. Attempt each part before revealing the solution. Pay close attention to edge cases, counterexamples, and conceptual traps embedded in each problem.

🗂 Topics Covered

  1. Problem 1 — z-Scores, Relative Standing & Comparisons
  2. Problem 2 — Measures of Central Tendency: Mean, Median, Mode
  3. Problem 3 — Levels of Measurement & Appropriate Statistics
  4. Problem 4 — Quartiles, IQR, Outlier Detection & the Five-Number Summary
  5. Problem 5 — Standard Deviation, Variance & Degrees of Freedom
  6. Problem 6 — Coefficient of Variation & Cross-Variable Comparison
  7. Problem 7 — Skewness, Box Plots & Distribution Shape
  8. Problem 8 — Percentiles & Quantile Reasoning
  9. Problem 9 — Properties of Mean & Median Under Transformations
  10. Problem 10 — Synthesis: Complete Descriptive Analysis
Problem 1 z-Scores · Relative Standing · Interpretation
Three students — Arya, Bhanu, and Charu — each take a different university-level statistics exam on the same day. Their raw scores and class statistics are shown below:
Student | Raw Score | Class Mean (μ) | Class Std Dev (σ)
Arya | 72 | 60 | 8
Bhanu | 85 | 78 | 14
Charu | 91 | 88 | 4
  1. Compute the z-score for each student. Round to two decimal places.
  2. Rank the three students by their relative performance within their respective classes. Who performed best relative to their peers?
  3. Arya's professor offers to add a flat bonus of +10 marks to every student's score. Does Arya's z-score change? Justify rigorously using the definition of z-score.
  4. Charu's class switches to a grading scheme where every score is doubled. What is Charu's new z-score? What general principle does this illustrate?
  5. Suppose a fourth student, Dev, has a z-score of −2.3. What raw score did Dev earn if his class mean was 70 and standard deviation was 10? Interpret what this score means in plain language.
▶ Reveal Full Solution
(a) Computing z-scores
The z-score formula is z = (X − μ) / σ.
Arya: z = (72 − 60) / 8 = 12/8 = 1.50
Bhanu: z = (85 − 78) / 14 = 7/14 = 0.50
Charu: z = (91 − 88) / 4 = 3/4 = 0.75
z_Arya = 1.50 | z_Bhanu = 0.50 | z_Charu = 0.75
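The arithmetic above can be checked in a few lines of Python; a minimal sketch with the table values hard-coded:

```python
# z-score: how many standard deviations a raw score sits from its class mean
def z_score(x, mu, sigma):
    return (x - mu) / sigma

# (raw score, class mean, class std dev) for each student, from the table
students = {"Arya": (72, 60, 8), "Bhanu": (85, 78, 14), "Charu": (91, 88, 4)}
z = {name: round(z_score(*v), 2) for name, v in students.items()}
print(z)  # {'Arya': 1.5, 'Bhanu': 0.5, 'Charu': 0.75}
```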
(b) Ranking by relative performance
A higher z-score means farther above the class mean. Ranking: Arya (1.50) > Charu (0.75) > Bhanu (0.50). Despite Charu's higher raw score (91) and Bhanu's second-highest raw score (85), Arya performed the best relative to her class because her raw score was 1.5 standard deviations above her class mean — the largest relative margin.
⚠️ Raw scores alone are meaningless for cross-class comparison. z-scores put all students on a common scale.
(c) Adding a flat bonus of +10
If every score increases by a constant c, the new mean becomes μ' = μ + c and the new standard deviation is unchanged (σ' = σ). Therefore:
z' = (X + c − (μ + c)) / σ = (X − μ) / σ = z
Arya's z-score remains exactly 1.50.

A uniform shift of origin does not change relative standing — everyone moves equally.

(d) Doubling all scores
If every score is multiplied by k = 2, the new mean is μ' = 2μ = 176 and the new standard deviation is σ' = 2σ = 8. Charu's new score is 2 × 91 = 182.
z' = (182 − 176) / 8 = 6/8 = 0.75
z-score is unchanged: 0.75

Principle: Multiplying all data by a nonzero constant scales both the mean and standard deviation equally, leaving z-scores invariant. The z-score is dimensionless and scale-free.

(e) Recovering Dev's raw score from his z-score
Rearranging z = (X − μ) / σ:
X = μ + z·σ = 70 + (−2.3)(10) = 70 − 23 = 47
Dev's raw score = 47

Interpretation: Dev scored 23 points below the class average, placing him 2.3 standard deviations below the mean. In a roughly normal distribution, fewer than ~2% of students would score this low, indicating Dev performed in the bottom tail of his class.

Problem 2 Mean · Median · Mode · Robustness to Outliers
The annual salaries (in ₹ lakhs) of 9 employees at a startup are:
4, 5, 5, 6, 7, 8, 8, 9, 60
  1. Calculate the mean, median, and mode. Show all working.
  2. The CEO (earning ₹60 lakhs) leaves. Recalculate all three measures for the remaining 8 employees. By what percentage does the mean change? By what percentage does the median change?
  3. Which measure — mean or median — is a more honest representation of the "typical" salary in the original dataset? Justify using the concept of skewness.
  4. A new employee joins at ₹7 lakhs (the original dataset is restored without the CEO, but with the new employee). The dataset is now: 4, 5, 5, 6, 7, 7, 8, 8, 9. Without re-summing all values, use the relationship between the mean and the total (new mean = (old sum + new value) / new n) to determine the new mean, given that the mean of the 8 employees (post-CEO departure) was 6.5.
  5. True or False: For any dataset, the sum of deviations from the mean is always zero. Prove it algebraically.
▶ Reveal Full Solution
(a) Original dataset: 4, 5, 5, 6, 7, 8, 8, 9, 60 (n = 9)
Sum = 4+5+5+6+7+8+8+9+60 = 112
Mean = 112 / 9 ≈ 12.44 lakhs
Median = 5th value (ordered) = 7 lakhs
Mode = 5 and 8 (bimodal)
Mean ≈ ₹12.44L | Median = ₹7L | Mode = ₹5L and ₹8L
(b) After CEO leaves: 4, 5, 5, 6, 7, 8, 8, 9 (n = 8)
Sum = 52
Mean = 52/8 = 6.5 lakhs
Median = (6+7)/2 = 6.5 lakhs
% change in mean = (12.44 − 6.5)/12.44 × 100 ≈ 47.8%
% change in median = (7 − 6.5)/7 × 100 ≈ 7.1%
Mean dropped ~47.8% | Median dropped only ~7.1%

One extreme outlier caused almost a 48% shift in the mean but barely affected the median — a vivid demonstration of the median's robustness.
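This robustness contrast can be reproduced with Python's statistics module; a minimal check with the salary data hard-coded:

```python
import statistics

salaries = [4, 5, 5, 6, 7, 8, 8, 9, 60]   # in ₹ lakhs, CEO included
without_ceo = salaries[:-1]                # CEO (₹60L) leaves

mean_before = statistics.mean(salaries)    # 112/9 ≈ 12.44
mean_after = statistics.mean(without_ceo)  # 52/8 = 6.5
med_before = statistics.median(salaries)   # 7
med_after = statistics.median(without_ceo) # 6.5

# one outlier moves the mean by ~48% but the median by only ~7%
print(round(mean_before, 2), mean_after, med_before, med_after)  # 12.44 6.5 7 6.5
```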

(c) Which is more representative?
The original dataset is heavily right-skewed (mean ≫ median, due to the ₹60L outlier). In such distributions, the mean is pulled far to the right by the outlier, overestimating the salary of a typical employee. The median (₹7L) accurately reflects what most employees earn.
📌 This is why national income reports use median household income, not mean — the mean is distorted by the ultra-wealthy.
(d) Shift-of-origin to find new mean
The 8 employees have mean = 6.5. Adding one employee at 7 lakhs gives n = 9. The new mean is:
New mean = (8 × 6.5 + 7) / 9 = (52 + 7) / 9 = 59/9 ≈ 6.56 lakhs
New mean ≈ ₹6.56 lakhs
We used the property: new mean = (old sum + new value) / new n, without re-summing all values.
(e) Proof: Sum of deviations from mean = 0
Σᵢ (Xᵢ − X̄) = Σᵢ Xᵢ − Σᵢ X̄ = n·X̄ − n·X̄ = 0 ✓
[since X̄ = (ΣXᵢ)/n ⟹ ΣXᵢ = n·X̄, and Σᵢ X̄ = n·X̄ because the constant X̄ is summed n times]
TRUE — always exactly zero, for any dataset.

This is why we cannot use the average of deviations as a measure of spread — we must square them first (variance) or take absolute values (mean absolute deviation).
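A numeric sanity check of the identity (the floating-point sum lands at zero up to rounding error):

```python
data = [4, 5, 5, 6, 7, 8, 8, 9, 60]
mean = sum(data) / len(data)

# deviations from the mean always cancel, whatever the data
total_dev = sum(x - mean for x in data)
print(abs(total_dev) < 1e-9)  # True
```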

Problem 3 Levels of Measurement · Data Types · Appropriate Statistics
A researcher collects the following variables for a study on university students:
# | Variable | Sample Values
1 | Department | CSE, EEE, MECH, CIVIL
2 | Year of study | 1st, 2nd, 3rd, 4th
3 | Temperature of lab (°C) | 22, 23, 25, 19
4 | Monthly pocket money (₹) | 2000, 3500, 0, 5000
5 | Whether student owns a laptop | Yes, No, Yes, Yes
  1. Classify each of the five variables as Nominal, Ordinal, Interval, or Ratio. Justify each classification rigorously — pay special attention to the "true zero" criterion.
  2. For each variable, state which measure(s) of central tendency (mean, median, mode) are mathematically meaningful and which are not. Explain any exceptions.
  3. A student argues: "Since Year of Study takes values 1, 2, 3, 4, I can compute the mean and say the average student is in Year 2.5." Is this valid? Why or why not?
  4. Variable 5 (laptop ownership) is binary. Encode "Yes" = 1, "No" = 0. Given the sample above (Yes, No, Yes, Yes), compute the mean of the encoded data. What does this number represent in practical terms?
  5. Can you compute a meaningful ratio like "Lab at 25°C is 25% warmer than at 20°C"? Why or why not? Which variable type would support such a ratio?
▶ Reveal Full Solution
(a) Classification
Variable | Level | Key Reason
Department | Nominal | Named categories with no natural order; there is no sense in which CSE > EEE
Year of Study | Ordinal | Ordered (1st < 2nd < 3rd), but gaps between years are not equal in meaning
Temp (°C) | Interval | Equal units (1°C = 1°C everywhere), but 0°C does not mean "no temperature" — zero is arbitrary
Pocket Money (₹) | Ratio | True zero exists (₹0 = no money), ratios are meaningful (₹5000 is 2.5× ₹2000)
Laptop Ownership | Nominal (Binary) | Dichotomous category; no ordering, no scale
(b) Appropriate measures of central tendency
Variable | Mode | Median | Mean
Department (Nominal) | ✅ Only valid option | ❌ No ordering | ❌ Not meaningful
Year of Study (Ordinal) | ✅ | ✅ Ordering exists | ⚠️ Technically invalid (unequal gaps)
Temp (°C) (Interval) | ✅ | ✅ | ✅
Pocket Money (Ratio) | ✅ | ✅ | ✅
Laptop Ownership (Binary) | ✅ (Majority category) | ⚠️ Tells only the majority | ✅ Gives the proportion
(c) Is mean of Year of Study valid?
Not rigorously valid. Year of study is an ordinal variable. While the labels 1, 2, 3, 4 are numbers, the difference between "1st year" and "2nd year" is not necessarily the same in academic workload, course count, or difficulty as between "3rd year" and "4th year." The arithmetic mean assumes equal intervals — a property ordinal data does not guarantee. "Year 2.5" has no real-world meaning.
⚠️ Encoding ordinal categories as numbers and computing their mean is a common but questionable practice. It requires the assumption of equal spacing.
(d) Mean of binary variable
Encoding: Yes=1, No=0 → data: 1, 0, 1, 1
Mean = (1+0+1+1)/4 = 3/4 = 0.75
Mean = 0.75 = 75% of students own a laptop

The mean of a binary 0/1 variable equals the proportion of "1"s in the sample. This is the one case where the mean is meaningful for nominal data.
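A one-line check that the mean of a 0/1 encoding equals the sample proportion:

```python
ownership = [1, 0, 1, 1]  # Yes=1, No=0, from the sample (Yes, No, Yes, Yes)
proportion = sum(ownership) / len(ownership)
print(proportion)  # 0.75, i.e. 75% of the sample owns a laptop
```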

(e) Ratio comparison for Celsius temperature
No. Celsius is an interval scale — it lacks a true zero. 0°C does not mean "no heat"; it is the arbitrary freezing point of water. Therefore, ratios like "25°C is 25% warmer than 20°C" are mathematically invalid.

To make meaningful temperature ratios, you must convert to Kelvin (a ratio scale with absolute zero): 25°C = 298 K, 20°C = 293 K → ratio = 298/293 ≈ 1.017, i.e., only ~1.7% warmer in absolute thermodynamic terms.

Only Ratio scales support meaningful ratio comparisons.
Problem 4 Quartiles · IQR · Five-Number Summary · Outlier Detection
The daily rainfall (mm) recorded at a Himalayan weather station over 15 consecutive days is:
3, 7, 7, 12, 15, 18, 22, 25, 28, 30, 35, 42, 58, 80, 140
  1. Compute Q1, Q2 (Median), and Q3 using the split-at-median method. Show the partitioning clearly.
  2. Compute the IQR. Using the standard 1.5 × IQR rule, identify all outliers (suspected and extreme). Show the fence calculations.
  3. Construct the five-number summary. Based on the relative distances Q1→Median vs. Median→Q3 and the whisker lengths (min→Q1 vs. Q3→max), determine the skewness of the distribution without computing Pearson's coefficient.
  4. If a 16th observation of 200 mm is added (a cloudburst event), which changes more — the IQR or the range? Quantify both changes.
  5. A meteorologist claims: "The IQR tells us about the variability of all 15 days." Critique this claim, explaining what the IQR actually captures and what it misses.
▶ Reveal Full Solution
(a) Quartiles (n = 15)
Ordered (n = 15): 3, 7, 7, 12, 15, 18, 22 | 25 | 28, 30, 35, 42, 58, 80, 140
Q2 (Median) = 8th value = 25 mm
Lower half (7 values below the median): 3, 7, 7, 12, 15, 18, 22 → Q1 = 4th value = 12 mm
Upper half (7 values above the median): 28, 30, 35, 42, 58, 80, 140 → Q3 = 4th value = 42 mm
Q1 = 12 mm | Q2 = 25 mm | Q3 = 42 mm
(b) IQR and Outlier Detection
IQR = Q3 − Q1 = 42 − 12 = 30 mm
Lower fence = Q1 − 1.5×IQR = 12 − 45 = −33 mm (impossible for rainfall)
Upper fence = Q3 + 1.5×IQR = 42 + 45 = 87 mm
Extreme fence = Q3 + 3×IQR = 42 + 90 = 132 mm
Values beyond the upper fence (87 mm): 140 mm
Values beyond the extreme fence (132 mm): 140 mm
140 mm lies beyond even the extreme fence: it is an extreme outlier.
The lower fence is negative — physically impossible for rainfall — so no low-end outliers exist.
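The split-at-median quartiles and the fence rule can be coded directly; a minimal sketch with the rainfall data hard-coded (it deliberately avoids numpy-style interpolated percentiles, which follow a different convention):

```python
import statistics

def split_at_median_quartiles(data):
    """Q1/Q3 as medians of the lower/upper halves, excluding the middle
    value when n is odd (the split-at-median convention)."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower, upper = xs[:half], xs[half + (n % 2):]
    return statistics.median(lower), statistics.median(xs), statistics.median(upper)

rain = [3, 7, 7, 12, 15, 18, 22, 25, 28, 30, 35, 42, 58, 80, 140]
q1, q2, q3 = split_at_median_quartiles(rain)
iqr = q3 - q1
suspected_fence = q3 + 1.5 * iqr   # 1.5×IQR rule
extreme_fence = q3 + 3 * iqr       # 3×IQR rule
outliers = [x for x in rain if x > suspected_fence]
print(q1, q2, q3, iqr, suspected_fence, extreme_fence, outliers)
# 12 25 42 30 87.0 132 [140]
```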
(c) Five-Number Summary and Skewness
Five-Number Summary: 3 | 12 | 25 | 42 | 140
Lower spread: Q2 − Q1 = 25 − 12 = 13 mm
Upper spread: Q3 − Q2 = 42 − 25 = 17 mm (upper > lower)
Whisker: Q3 → max = 140 − 42 = 98 mm (long upper whisker)
Whisker: min → Q1 = 12 − 3 = 9 mm (short lower whisker)
Conclusion: The box is asymmetric (upper half larger), and the upper whisker is dramatically longer. Distribution is strongly right-skewed (positively skewed).

This makes physical sense — rainfall has a hard lower bound of 0 mm but no upper limit (cloudburst events extend the right tail).

(d) Adding 200 mm observation
Original Range = 140 − 3 = 137 mm
New Range = 200 − 3 = 197 mm → Change = +60 mm
For n = 16, Q1 and Q3 must be recomputed.
Ordered (n = 16): 3, 7, 7, 12, 15, 18, 22, 25 | 28, 30, 35, 42, 58, 80, 140, 200
New Q1 = median of the lower 8 = (12+15)/2 = 13.5 mm
New Q3 = median of the upper 8 = (58+80)/2 = 69 mm
New IQR = 69 − 13.5 = 55.5 mm
Original IQR = 30 mm → Change = +25.5 mm
Range increased by 60 mm | IQR increased by 25.5 mm

The range absorbs the full magnitude of the new extreme value, while the quartiles respond only to ranks. However large the 16th observation is (200 mm or 2000 mm), the IQR change is identical; the range grows without bound.

(e) Critique of the meteorologist's claim
The claim is partially incorrect. The IQR describes the spread of only the central 50% of observations. It effectively ignores the bottom 25% and top 25% — the very extreme events that matter most to a meteorologist studying floods or droughts. In this dataset, IQR = 30 mm but the range is 137 mm; the IQR tells us nothing about the extreme rainfall of 80 or 140 mm. For applications where tails matter (risk, flood planning), the range, 90th/99th percentiles, or a full distributional analysis is needed.
The IQR is robust to outliers precisely because it excludes them — a feature, not a bug, in some contexts; a critical limitation in others.
Problem 5 Standard Deviation · Variance · Degrees of Freedom · Least Squares
Two machines at a factory produce ball bearings. Their diameters (in mm) over 6 production runs are recorded:
Run | Machine A (mm) | Machine B (mm)
1 | 10.1 | 9.5
2 | 10.0 | 11.0
3 | 9.9 | 10.2
4 | 10.2 | 8.8
5 | 10.0 | 10.5
6 | 9.8 | 10.0
  1. Compute the sample mean and sample standard deviation for each machine. Use n−1 in the denominator.
  2. Which machine produces more consistent ball bearings? Justify numerically.
  3. Why do we divide by (n−1) and not n when computing the sample standard deviation? Explain the concept of degrees of freedom in this context.
  4. If the target diameter is 10.0 mm, compute the sum of squared deviations Σ(Xᵢ − 10.0)² for Machine A. Now compute Σ(Xᵢ − X̄_A)². Which is larger? What fundamental property of the mean does this illustrate?
  5. Both machines are recalibrated, increasing every reading by 0.5 mm. How do the standard deviations change?
▶ Reveal Full Solution
(a) Means and Standard Deviations
Machine A:
Σ = 10.1+10.0+9.9+10.2+10.0+9.8 = 60.0 → X̄_A = 60.0/6 = 10.0 mm
Deviations: 0.1, 0, −0.1, 0.2, 0, −0.2 → Squared: 0.01, 0, 0.01, 0.04, 0, 0.04 → Σ = 0.10
s²_A = 0.10/5 = 0.02 → s_A = √0.02 ≈ 0.141 mm
Machine B:
Σ = 9.5+11.0+10.2+8.8+10.5+10.0 = 60.0 → X̄_B = 60.0/6 = 10.0 mm
Deviations: −0.5, 1.0, 0.2, −1.2, 0.5, 0.0 → Squared: 0.25, 1.00, 0.04, 1.44, 0.25, 0.00 → Σ = 2.98
s²_B = 2.98/5 = 0.596 → s_B = √0.596 ≈ 0.772 mm
s_A ≈ 0.141 mm | s_B ≈ 0.772 mm | Both have X̄ = 10.0 mm
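The same numbers fall out of Python's statistics module, whose stdev uses the n−1 denominator; a quick check with the data hard-coded:

```python
import statistics

machine_a = [10.1, 10.0, 9.9, 10.2, 10.0, 9.8]
machine_b = [9.5, 11.0, 10.2, 8.8, 10.5, 10.0]

for name, xs in [("A", machine_a), ("B", machine_b)]:
    # statistics.stdev divides by (n - 1): the sample standard deviation
    print(name, round(statistics.mean(xs), 2), round(statistics.stdev(xs), 3))
# A 10.0 0.141
# B 10.0 0.772
```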
(b) Which machine is more consistent?
Both machines have the same mean (10.0 mm), but Machine A has a standard deviation of only 0.141 mm vs. Machine B's 0.772 mm — more than 5× larger. Machine A is far more consistent. Despite producing identical average diameters, Machine B's bearings vary wildly, making them unsuitable for precision manufacturing.
(c) Why n−1? Degrees of Freedom
When we compute a sample standard deviation, we first estimate the population mean μ using the sample mean X̄. This estimation "uses up" one piece of information — once we fix X̄ from n data points, only (n−1) of the deviations (Xᵢ − X̄) are free to vary (the last one is determined by the constraint that they sum to zero). Hence we have only n−1 degrees of freedom.

Dividing by n instead of n−1 produces a biased estimator that systematically underestimates the population variance σ². Dividing by (n−1) corrects this bias, making s² an unbiased estimator of σ².

n−1 ensures s² is an unbiased estimator of σ²
(d) Least Squares Property of the Mean
Σ(Xᵢ − 10.0)² = (0.1)² + (0)² + (−0.1)² + (0.2)² + (0)² + (−0.2)² = 0.01+0+0.01+0.04+0+0.04 = 0.10
Σ(Xᵢ − X̄_A)² = 0.10 (the same calculation, since X̄_A = 10.0)
[In general: if target ≠ X̄, then Σ(Xᵢ − target)² > Σ(Xᵢ − X̄)².]
Here the target happens to equal the mean, so both sums are equal. This illustrates the least squares property: the sample mean minimises the sum of squared deviations. No other value, when subtracted from each data point and squared, produces a smaller sum. This is a defining optimality property of the mean. Σ(Xᵢ − c)² is minimised when c = X̄
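The least-squares property can be illustrated by scanning candidate centres c and confirming the minimum lands at the mean; a brute-force sketch, not a proof:

```python
machine_a = [10.1, 10.0, 9.9, 10.2, 10.0, 9.8]
mean = sum(machine_a) / len(machine_a)

def sse(c, xs):
    """Sum of squared deviations about an arbitrary centre c."""
    return sum((x - c) ** 2 for x in xs)

# scan candidate centres from 9.00 to 11.00 in steps of 0.01
candidates = [9.0 + 0.01 * k for k in range(201)]
best = min(candidates, key=lambda c: sse(c, machine_a))
print(round(best, 2), round(mean, 2))  # 10.0 10.0: the minimiser is the mean
```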
(e) Adding 0.5 mm to all readings
Adding a constant to every value shifts the mean but does not change the spread. Standard deviation measures how data points deviate from the mean; if all values shift by the same amount, each deviation (Xᵢ + 0.5) − (X̄ + 0.5) = Xᵢ − X̄ is unchanged. Standard deviations remain unchanged: s_A ≈ 0.141 mm, s_B ≈ 0.772 mm
Problem 6 Coefficient of Variation · Cross-Domain Comparison
An investor is comparing the risk of three different assets. Historical monthly returns (%) are summarized as:
Asset | Mean Monthly Return (%) | Std Dev (%)
Government Bond | 0.40 | 0.10
Blue-Chip Stock | 1.80 | 0.60
Crypto Token | 5.00 | 18.00
  1. Compute the Coefficient of Variation (CV) for each asset. Show the formula and calculation.
  2. Rank the three assets from least to most risky using CV. Does this ranking differ from the ranking by standard deviation alone? What does this reveal?
  3. The Crypto Token has a much higher standard deviation than the Bond. A naïve analyst declares: "Crypto is obviously the worst investment because its standard deviation is 180× larger." What is wrong with this reasoning? Use CV to rebut.
  4. A fourth asset has CV = 50%. If its mean return is 3%, what is its standard deviation?
  5. In what scenario would CV be a misleading or inappropriate measure of relative variability?
▶ Reveal Full Solution
(a) Computing CV = (s / X̄) × 100%
Government Bond: CV = (0.10 / 0.40) × 100 = 25.0%
Blue-Chip Stock: CV = (0.60 / 1.80) × 100 ≈ 33.3%
Crypto Token: CV = (18.00 / 5.00) × 100 = 360.0%
CV: Bond = 25% | Stock = 33.3% | Crypto = 360%
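The CV computation in code; a minimal sketch with the table values hard-coded:

```python
# (mean monthly return %, std dev %) for each asset, from the table
assets = {"Bond": (0.40, 0.10), "Stock": (1.80, 0.60), "Crypto": (5.00, 18.00)}

# CV = risk per unit of expected return, expressed in percent
cv = {name: round(s / mean * 100, 1) for name, (mean, s) in assets.items()}
print(cv)  # {'Bond': 25.0, 'Stock': 33.3, 'Crypto': 360.0}
```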
(b) Risk ranking by CV vs. by std dev
By std dev: Bond (0.10) < Stock (0.60) < Crypto (18.00)
By CV: Bond (25%) < Stock (33.3%) < Crypto (360%)
In this case both rankings agree. However, CV provides additional insight: the Crypto Token is not merely 18× riskier than the Bond in absolute terms — per unit of expected return, it carries 360/25 = 14.4× more relative risk. The CV properly normalises risk per unit of reward.
(c) Rebutting the naïve analyst
The analyst compares standard deviations across assets with very different scales of return. Crypto's std dev of 18% is large, but its mean return (5%) is also more than 12× the Bond's (0.40%). A fair comparison must normalise by the mean (CV). The Bond's CV = 25% means every 1% of expected return carries 0.25% of risk; Crypto's CV = 360% means every 1% of expected return carries 3.6% of risk. The Bond carries far less risk per unit of expected return. The analyst's reasoning is like noting that an elephant's weight fluctuates by ±10 kg and a mouse's by only ±5 g and concluding the elephant is more erratic; relative to body mass, the mouse fluctuates far more. Context matters.
CV = (risk per unit of return) × 100. Lower CV = more efficient risk-to-reward profile.
(d) Finding std dev from CV and mean
CV = (s / X̄) × 100
50 = (s / 3) × 100
s = 50 × 3 / 100 = 1.5%
Standard deviation = 1.5%
(e) When is CV misleading?
CV is problematic when:
  • Mean ≈ 0: If the mean return is close to zero or negative (e.g., an asset losing money on average), CV becomes meaningless or infinite — dividing by a near-zero denominator produces absurd numbers.
  • Data can take negative values: CV assumes the mean is positive and meaningful as a denominator. For temperatures in Celsius (which can be negative), CV has no coherent interpretation.
  • Comparing variables of different sign conventions (e.g., gains vs. losses).
⚠️ CV = s/X̄ is dimensionless and powerful, but assumes a positive, meaningful mean. Always check this assumption first.
Problem 7 Skewness · Box Plot Interpretation · Distribution Shape
The time (in hours) taken by 12 software engineers to complete a coding challenge is:
1.5, 2.0, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 6.0, 8.0, 24.0
  1. Compute the mean and median. Based on their relationship, predict the direction of skewness.
  2. Compute Q1, Q3, and IQR. Using the 1.5×IQR rule, check whether 24.0 hours is an outlier.
  3. Construct the five-number summary. If you were to sketch a box plot, describe — without drawing — where the median line would sit within the box and which whisker would be longer.
  4. The manager removes the engineer who took 24 hours from the dataset. Without full recomputation, predict (qualitatively) what happens to: (i) the mean, (ii) the median, (iii) the standard deviation, (iv) the skewness.
  5. Give one real-world context each for a naturally right-skewed and a naturally left-skewed variable, explaining why the bound on one side forces the tail to extend on the other.
▶ Reveal Full Solution
(a) Mean, Median, and Skewness Prediction
Sum = 1.5+2+2+2.5+3+3.5+4+4.5+5+6+8+24 = 66.0
Mean = 66.0/12 = 5.50 hours
Median = average of 6th and 7th values = (3.5+4.0)/2 = 3.75 hours
Since Mean (5.50) > Median (3.75), the distribution is right-skewed (positively skewed). The 24-hour outlier pulls the mean rightward while barely affecting the median.
(b) IQR and Outlier Check
Lower 6 values: 1.5, 2, 2, 2.5, 3, 3.5 → Q1 = (2+2.5)/2 = 2.25 hrs
Upper 6 values: 4, 4.5, 5, 6, 8, 24 → Q3 = (5+6)/2 = 5.5 hrs
IQR = 5.5 − 2.25 = 3.25 hrs
Upper fence = Q3 + 1.5×IQR = 5.5 + 4.875 = 10.375 hrs
24.0 > 10.375 → 24.0 hours IS a suspected outlier ✓
24.0 hours is confirmed as a suspected outlier.
(c) Five-Number Summary and Box Plot Description
Min | Q1 | Median | Q3 | Max
1.5 | 2.25 | 3.75 | 5.5 | 24.0
Box plot description: The box spans from 2.25 to 5.5 hours (IQR = 3.25 hrs). The median line at 3.75 hours sits slightly closer to Q1 than to Q3 (1.5 hrs from Q1 vs. 1.75 hrs to Q3), indicating mild asymmetry even within the box. The right whisker would extend to 8.0 hrs, the largest value inside the upper fence of 10.375 hrs; the left whisker extends to 1.5 hrs. The value 24.0 would be plotted as a separate dot (outlier). The right whisker is considerably longer.
(d) Removing the 24-hour engineer — qualitative effects
  • (i) Mean: Drops significantly. The outlier 24.0 contributed (24−5.5) = 18.5 hours above the mean. New mean = (66−24)/11 = 42/11 ≈ 3.82 hrs — a large decrease.
  • (ii) Median: Shifts slightly. With 11 observations, the new median is the 6th value = 3.5 hrs (was 3.75). Small change — the median is robust.
  • (iii) Standard Deviation: Decreases substantially. The 24-hour value was 18.5 hours above the mean; removing it dramatically reduces the sum of squared deviations.
  • (iv) Skewness: Decreases (distribution becomes more symmetric). The right tail shrinks; mean and median move closer together.
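The four qualitative predictions can be confirmed numerically; a quick before-and-after comparison with the data hard-coded:

```python
import statistics

times = [1.5, 2.0, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 6.0, 8.0, 24.0]
trimmed = times[:-1]  # drop the 24-hour engineer

for label, xs in [("with outlier", times), ("without", trimmed)]:
    # mean, median, and sample std dev before and after removal
    print(label,
          round(statistics.mean(xs), 2),
          statistics.median(xs),
          round(statistics.stdev(xs), 2))
# with outlier 5.5 3.75 6.12
# without 3.82 3.5 1.97
```

The mean and standard deviation collapse while the median barely moves, exactly as predicted.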
(e) Real-world examples of skewed distributions
Right-skewed: Income/wealth. The lower bound is ₹0 (you can't earn negative income), but there is no upper limit — a few billionaires extend the right tail to enormous values. Most people earn modest amounts, creating a long right tail.

Left-skewed: Test scores on a very easy exam. The upper bound is 100% — scores cannot exceed this. Most students score near the top (90–100%), while a few who struggle extend the left tail toward lower marks. The ceiling compresses the right side and forces the tail leftward.

📌 The key insight: a hard boundary on one side forces the tail to extend toward the opposite side.
Problem 8 Percentiles · Quantiles · Application to Real Datasets
A national entrance exam was taken by 10,000 students. A student, Priya, scored 347 marks.
  1. If Priya is in the 82nd percentile, how many students scored below her? How many scored above? How many scored the same or higher?
  2. Another student, Rohan, is in the 99th percentile with a score of 490. A third student, Sita, scored 350 (3 marks more than Priya) but is only in the 83rd percentile. What does the small percentile gap between Priya and Sita tell you about the density of the score distribution near 347–350?
  3. Monsoon rainfall data is classified using P90, P95, and P99.99 thresholds for "heavy rainfall," "extreme rainfall," and "cloudburst events" respectively. In a region with 1000 historical daily rainfall readings, how many days had rainfall classified as at least "extreme" (≥ P95)?
  4. A student scored at the 50th percentile. Is their score necessarily equal to the mean? Justify your answer using a skewed distribution example.
  5. Deciles divide a dataset into 10 equal parts. What is the relationship between the 5th decile (D5), the 2nd quartile (Q2), and the 50th percentile (P50)? Are they always identical?
▶ Reveal Full Solution
(a) Counts from percentile rank
82nd percentile means 82% of students scored below Priya.
Students below Priya = 0.82 × 10,000 = 8,200 students
Students above Priya = 10,000 − 8,200 − (those with the same score) ≈ 1,800 students (exactly 1,800 if no ties at 347)
Students at or above Priya ≈ 1,800 students
~8,200 below | ~1,800 at or above
(b) Density of the distribution near Priya's score
Priya (347, P82) and Sita (350, P83) differ by only 1 percentile despite a 3-mark difference. This means only about 0.01 × 10,000 = 100 students scored between 347 and 350. The distribution is sparse in this region. Contrast this with regions where many students cluster (e.g., a jump from P50 to P55 might cover just 2 marks in a dense middle region). Percentile rank changes fast where students are densely packed and slowly where they are sparse.
📌 Percentile rank is essentially the empirical CDF — it moves fastest where the distribution is most dense.
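Percentile rank as the empirical CDF can be sketched directly. The raw exam scores are not given, so this uses a hypothetical 10-value sample with a deliberately sparse region to mimic the 347–350 gap:

```python
def percentile_rank(scores, x):
    """Percentile rank of x: percent of observations strictly below x
    (the empirical CDF, scaled to 100)."""
    return 100 * sum(s < x for s in scores) / len(scores)

# toy data: no scores between 60 and 90, so rank climbs slowly per mark there
scores = [10, 20, 30, 35, 40, 45, 50, 55, 60, 90]
print(percentile_rank(scores, 60))  # 80.0
print(percentile_rank(scores, 90))  # 90.0: 30 marks more, only 10 ranks up
```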
(c) Counting extreme rainfall days
P95 threshold: 95% of days have rainfall below this level.
Days at or above P95 = (100 − 95)% × 1000 = 5% × 1000 = 50 days
These 50 days include both "extreme rainfall" (P95 to P99.99) and "cloudburst events" (above P99.99).
At least 50 days had "at least extreme" rainfall (≥ P95).
(d) 50th percentile ≠ mean in skewed distributions
The 50th percentile (median) is the value at which half the data falls above and half below. The mean is the arithmetic average. They coincide only in symmetric distributions.

Example: Salaries: ₹20K, ₹25K, ₹30K, ₹35K, ₹500K (n=5). Median = ₹30K (50th percentile). Mean = ₹122K. These are drastically different. A student at the 50th percentile earned exactly ₹30K, not ₹122K.

In right-skewed data: mean > median. Scores at P50 ≠ mean.
(e) D5 = Q2 = P50
Yes — they are all the same value, by definition:
P50 = value below which 50% of the data lies
Q2 = value that splits the data into two equal halves (= median)
D5 = 5th of 10 equal parts → boundary between the bottom 50% and top 50%
∴ D5 = Q2 = P50 = Median — always identical.
D5 = Q2 = P50 = Median — all three refer to the same quantity.
Similarly: Q1 = P25 and Q3 = P75. (In decile terms these sit halfway between D2 and D3, and between D7 and D8; deciles proper are only D1 through D9.)
Problem 9 Properties of Mean & Median · Linear Transformations
A dataset of n = 5 observations has: mean X̄ = 40, median = 36, and standard deviation s = 8.
  1. Every observation is increased by 15. State the new mean, median, and standard deviation. No computation is needed — derive from transformation properties.
  2. Every observation is multiplied by 3. State the new mean, median, and standard deviation.
  3. The transformation Y = 2X − 10 is applied to every observation. Find the new mean, median, variance (s²), and CV. (Original variance s² = 64.)
  4. The median of the transformed data in (c) equals 2×36 − 10 = 62. Verify that this is consistent with the general rule for linear transformations of the median.
  5. A statistician applies the transformation Z = (X − X̄)/s to all observations (standardisation). What are the mean, standard deviation, and variance of Z? Why is this transformation important?
▶ Reveal Full Solution
(a) Adding 15 to every observation
Adding a constant c shifts the location measures by c but doesn't affect spread.
New mean = 40 + 15 = 55
New median = 36 + 15 = 51
New std dev = 8 (unchanged — spread is unaffected by a shift)
Mean = 55 | Median = 51 | s = 8
(b) Multiplying every observation by 3
Multiplying by k scales both location and spread by |k|.
New mean = 3 × 40 = 120
New median = 3 × 36 = 108
New std dev = 3 × 8 = 24 (s scales with |k|)
New variance = 9 × 64 = 576 (variance scales with k²)
Mean = 120 | Median = 108 | s = 24 | s² = 576
(c) Transformation Y = 2X − 10
For Y = aX + b: new mean = aX̄ + b, new std dev = |a|×s, new variance = a²×s².
New mean = 2×40 − 10 = 70
New median = 2×36 − 10 = 62
New std dev = 2×8 = 16
New variance = 4×64 = 256
New CV = (16/70)×100 ≈ 22.9% (original CV = (8/40)×100 = 20.0%)
Mean=70 | Median=62 | s=16 | s²=256 | CV≈22.9%
Note: CV changes because the shift (−10) affects the mean but not the std dev. CV is not invariant to location shifts.
(d) Verification of linear transformation rule for median
The general rule: if Y = aX + b, then Median_Y = a × Median_X + b. This holds because:
  • Multiplying by a (positive) preserves the order of observations
  • Adding b shifts all values equally
  • The middle value of the transformed data is the transformation of the middle value of the original data
Therefore: Median_Y = 2×36 − 10 = 62. ✓ Consistent with the rule.
(e) Standardisation: Z = (X − X̄)/s
This is the linear transformation Y = (1/s)X − (X̄/s), i.e., a = 1/s, b = −X̄/s.
Mean of Z = (1/s)×X̄ + (−X̄/s) = X̄/s − X̄/s = 0
Std dev of Z = |1/s|×s = 1
Variance of Z = 1² = 1
Mean of Z = 0 | Std Dev of Z = 1 | Variance = 1
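Standardisation can be verified on any sample. The n = 5 raw observations of this problem are not given, so this sketch reuses Machine A's diameters from Problem 5 as a stand-in:

```python
import statistics

xs = [10.1, 10.0, 9.9, 10.2, 10.0, 9.8]
mean, s = statistics.mean(xs), statistics.stdev(xs)

z = [(x - mean) / s for x in xs]
# standardised data always has mean 0 and sample std dev 1 (up to float noise)
print(statistics.mean(z), statistics.stdev(z))
```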

Importance: Standardisation converts any variable into a dimensionless z-score, enabling comparison across variables measured in different units and scales. It is the foundation of z-tests, normal tables, and principal component analysis. All z-scores share the same scale, making cross-variable and cross-study comparisons rigorous.

Problem 10 Synthesis · Complete Descriptive Analysis · Inference from Summary Statistics
Two researchers study response times (in seconds) for users of a website. They collect samples and report only summary statistics (no raw data):
Statistic | Server A | Server B
n | 100 | 100
Mean | 3.2 s | 3.2 s
Median | 3.1 s | 2.1 s
Mode | 3.0 s | 1.5 s
Std Dev | 0.4 s | 2.8 s
Min | 2.0 s | 0.5 s
Max | 4.5 s | 18.0 s
Q1 | 2.9 s | 1.2 s
Q3 | 3.5 s | 4.0 s
  1. For each server, determine the skewness direction using both (i) the mean-vs-median criterion and (ii) the box plot asymmetry (Q2 position within the box and whisker lengths). Do both methods agree?
  2. Compute the IQR and CV for each server. What do these tell you about reliability?
  3. Using the 1.5×IQR rule, check whether the maximum values (4.5s and 18.0s) are outliers for their respective servers.
  4. The engineering team must choose one server for a real-time application where user experience requires consistently fast response times. Which server should they choose and why? Construct a comprehensive argument using at least four statistical measures.
  5. A manager argues: "Both servers have a mean of 3.2s, so they perform identically." Construct a formal statistical refutation of this claim, identifying every piece of evidence in the summary statistics that contradicts it.
▶ Reveal Full Solution
(a) Skewness determination
Server A:
  • Mean (3.2) ≈ Median (3.1) → nearly symmetric, slight right skew
  • Q2−Q1 = 3.1−2.9 = 0.2 s; Q3−Q2 = 3.5−3.1 = 0.4 s; Upper whisker: 4.5−3.5 = 1.0 s; Lower whisker: 2.9−2.0 = 0.9 s → mild right skew
Server B:
  • Mean (3.2) >> Median (2.1) → strong right skew
  • Q2−Q1 = 2.1−1.2 = 0.9 s; Q3−Q2 = 4.0−2.1 = 1.9 s (upper much larger); Upper whisker: 18.0−4.0 = 14.0 s; Lower whisker: 1.2−0.5 = 0.7 s → extreme right skew
Both methods agree: A ≈ symmetric; B is heavily right-skewed.
(b) IQR and CV
Server A: IQR = 3.5 − 2.9 = 0.6 s; CV = (0.4/3.2) × 100 = 12.5%
Server B: IQR = 4.0 − 1.2 = 2.8 s; CV = (2.8/3.2) × 100 = 87.5%
Server A's IQR is 4.7× smaller (much tighter middle 50%), and its CV is 7× lower. Both confirm Server A is dramatically more consistent per unit of mean response time.
(c) Outlier check via 1.5×IQR rule
Server A: Upper fence = Q3 + 1.5×IQR = 3.5 + 0.9 = 4.4 s; Max = 4.5 s > 4.4 s → 4.5 s IS a suspected outlier (barely)
Server B: Upper fence = Q3 + 1.5×IQR = 4.0 + 4.2 = 8.2 s; Max = 18.0 s > 8.2 s → suspected outlier (severely)
Server B extreme fence = 4.0 + 3×2.8 = 12.4 s → 18.0 > 12.4 → EXTREME outlier
Server A: 1 mild outlier | Server B: 1 extreme outlier (18.0s)
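The fence arithmetic can be run straight from the summary table; a minimal sketch with Q1, Q3, and the maxima hard-coded:

```python
# Q1, Q3, and max (seconds) transcribed from the summary table
servers = {"A": {"q1": 2.9, "q3": 3.5, "max": 4.5},
           "B": {"q1": 1.2, "q3": 4.0, "max": 18.0}}

for name, stats in servers.items():
    iqr = stats["q3"] - stats["q1"]
    suspect_fence = stats["q3"] + 1.5 * iqr   # 1.5×IQR rule
    extreme_fence = stats["q3"] + 3.0 * iqr   # 3×IQR rule
    verdict = ("extreme outlier" if stats["max"] > extreme_fence
               else "suspected outlier" if stats["max"] > suspect_fence
               else "inside fences")
    print(name, round(suspect_fence, 1), round(extreme_fence, 1), verdict)
# A 4.4 5.3 suspected outlier
# B 8.2 12.4 extreme outlier
```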
(d) Engineering recommendation
Choose Server A, based on every relevant dimension:
  1. Std Dev: A = 0.4s vs. B = 2.8s. Server B's variability is 7× larger.
  2. IQR: A = 0.6s vs. B = 2.8s. Even the middle 50% of B's responses are highly variable.
  3. Maximum: A's worst case is 4.5s. B's worst case is 18.0s — 4× worse. For real-time apps, worst-case latency matters enormously.
  4. Skewness: B's strong right skew means frequent spikes (the mean is pulled to 3.2s by rare 18s responses). A is symmetric — predictable.
  5. CV: A = 12.5% (excellent consistency) vs. B = 87.5% (highly erratic per unit of mean).
Server B sometimes responds faster (mode = 1.5s) but is wildly inconsistent. A real-time application requires predictable performance, not occasionally fast performance. Server A wins comprehensively.
(e) Formal refutation of the manager's claim
The manager's claim relies solely on the mean — a single statistic that collapses all distributional information into one number. Here is every piece of evidence against his claim:
  1. Median: A = 3.1s, B = 2.1s. The typical (middle) response is 1 second faster on B, yet B's mean equals A's due to extreme tail events. The servers are not equivalent for the typical user.
  2. Standard deviation: 0.4s vs. 2.8s — a factor of 7 difference in spread.
  3. Maximum: 4.5s vs. 18.0s — Server B occasionally fails catastrophically.
  4. Skewness: A is symmetric; B is right-skewed. Identical means can coexist with radically different shapes.
  5. IQR: 0.6s vs. 2.8s — the middle 50% of users experience vastly different service on each server.
  6. Outliers: B has an extreme outlier at 18s inflating its mean to match A's. With the extreme responses removed, B's mean would likely fall below A's.
⚠️ The mean is not a sufficient description of any distribution. Two distributions with identical means can differ in median, spread, shape, outliers, and skewness — making them statistically and practically incomparable based on mean alone.
The manager's claim is statistically unfounded on 6 independent dimensions.

IC252 · Introduction to Statistics · Think Unlimited · Problem set prepared in Olympiad style by Reman Dey with Claude Sonnet 4.6 (locally hosted)