Descriptive Statistics
Learning Objectives
By the end of this chapter, you will be able to:
- ✅ Understand and apply measures of location (mean, median, mode, percentiles, quartiles)
- ✅ Calculate and interpret measures of variability (range, IQR, variance, standard deviation, coefficient of variation)
- ✅ Assess distribution characteristics (skewness, z-scores, outliers)
- ✅ Create and interpret boxplots for data comparison
- ✅ Measure association between two variables using covariance and correlation
About the TechFlow Dataset
This chapter uses hypothetical business data from TechFlow Solutions’ Q4 2024 performance. All examples, calculations, and visualizations draw from this dataset, which you’ll use to complete practice problems and develop your analytical skills.
Accessing the Data:
- Download: TechFlow_Q4_2024.xlsx from your course materials folder

Dataset Structure:
- Sheet 1: Daily Sales
- Sheet 2: Product Sales
- Sheet 3: Regional Sales
- Sheet 4: Customer Orders
- Sheet 5: Customer Satisfaction
- Sheet 6: Response Times
Realism: Values reflect typical patterns in the consumer electronics industry, including:
- Weekend sales dips in daily data
- Black Friday spike (November 29)
- Holiday season surge (December)
- Right-skewed response times (most fast, some very slow)
- Left-skewed satisfaction (most satisfied, few unhappy)
Completeness: No missing values. All fields populated for easier learning (real-world data would require cleaning).
Rounding: Financial figures rounded to nearest dollar for readability.
1 Introduction: The $2.3 Million Question
It’s 9:00 AM on January 16, 2025. Sarah Chen, CEO of TechFlow Solutions, sits in the executive conference room reviewing her marketing team’s Q4 2024 performance report. Her coffee grows cold as she reads:
“Total revenue for Q4 was $2.3 million, which is up from $2.1 million in Q3. That’s an increase of about $200,000 or roughly 9.5%. The average monthly revenue was around $767,000… Daily sales ranged quite a bit—our lowest day was around $18,000 and our highest was $52,000 during the holiday rush. Most days we did somewhere in the middle range… We had some outliers on both ends.”
Sarah sets down the report and looks at her leadership team. “What decisions can we actually make from this?”
Silence.
“Should we discontinue Product D (AudioWave Speaker)? Expand in Asia-Pacific? Hire more customer service staff? This report tells me we made $2.3 million, but it doesn’t tell me what to do about it.”
Her CFO clears his throat. “The numbers are vague. ‘Around $767,000’? ‘Somewhere in the middle’? I can’t forecast with that. I can’t set budgets with that.”
Sarah slides the report across the table. “I want this redone. I want precision. I want statistics that tell us where to invest, what to fix, and which assumptions to challenge. And I want it by end of week.”
Every year, businesses lose millions because of imprecise statistical reporting:
- “Around” numbers prevent tracking month-over-month changes
- “Somewhere in the middle” doesn’t help set performance targets
- “Some outliers” misses both problems (to fix) and opportunities (to leverage)
- Vague language makes recommendations indefensible to stakeholders
The fix: Calculate exact values. Use proper statistical measures. Quantify variation.
This chapter is about the difference between those two reports. The gap between “around” and “exactly.” Between “some outliers” and “15% of tickets exceed our 21-hour threshold.” Between data and decisions.
Welcome to descriptive statistics. The art and science of making data speak clearly.
2 Types of Data: Knowing What You’re Working With
Before the analytics team at TechFlow could redo their analysis, they faced a fundamental question: What kind of data do we actually have?
It sounds simple, but this question matters. Calculate a mean for customer satisfaction categories (Very Unsatisfied, Unsatisfied, Neutral, Satisfied, Very Satisfied), and you’ll get a number that means nothing. Try to find an average color, and you’ll waste everyone’s time. Different data types require different analytical approaches. Get this wrong at the start, and everything downstream will be flawed. Know your data type before choosing your statistical tools.
2.1 The Building Blocks: Elements, Variables, and Observations
Think of your data as a spreadsheet. Each row represents an element—the thing you’re studying. At TechFlow, one element might be a single customer order. Another analysis might treat each product as an element, or each region, or each day.
The columns are your variables—the characteristics you’ve measured. For a customer order, variables might include: order amount, product purchased, customer region, satisfaction score, and response time if they contacted support.
Each cell—the intersection of a row and column—is an observation. It’s the specific value recorded: Order 1247 bought Product A (SmartHub Pro) for $220 from North America with a satisfaction score of 4.5 out of 5.
Understanding this structure matters because it determines what questions you can ask. If your elements are orders, you can analyze typical order size. If your elements are customers, you can analyze purchase frequency. Same underlying data, different analytical structure, different insights.
2.2 Categorical vs. Quantitative: The Great Divide
The analytics team begins organizing TechFlow’s Q4 data. David, the lead analyst, pulls up the customer file.
“Okay, we’ve got product type—that’s categorical. Region—also categorical. Order amount—that’s quantitative. Satisfaction rating… wait, is that categorical or quantitative?”
His colleague Maria leans over. “It’s ordinal. The numbers have order—5 is better than 4—but the distance between them isn’t necessarily equal. Someone going from 2 to 3 might be a bigger jump than 4 to 5.”
Categorical (Qualitative) Data:
- Nominal: Pure categories, no order (Product Type: A/B/C/D, Region, Color)
- Valid operations: Count frequencies, percentages, mode
- Ordinal: Categories with meaningful order (Satisfaction: 1-5, Education Level)
- Valid operations: Count, percentages, mode, median
Quantitative (Numerical) Data:
- Interval/Ratio: Actual numbers you can do math with (Sales $, Response time, Age)
- Valid operations: All mathematical operations including mean, standard deviation
Key distinction: Ordinal looks like numbers but behaves like categories. Be careful!
Wrong: “Average satisfaction is 3.7”
Problem: The distance between 3 and 4 may not equal the distance between 4 and 5. A mean of 3.7 implies precision that doesn’t exist.
Better: “Median satisfaction is 4” or “70% of customers rated us 4 or 5”
Exception: If you have large sample sizes and treat the scale as approximately continuous, mean can be acceptable—but report median too.
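The “valid operations” for ordinal data can be sketched in a few lines. This uses a small set of hypothetical 1–5 satisfaction ratings (not TechFlow’s actual survey data) to show the two measures the text recommends: the median and the percentage in the top categories.

```python
from statistics import median

# Hypothetical 1-5 satisfaction ratings (ordinal data)
ratings = [5, 4, 4, 5, 3, 4, 5, 2, 4, 5]

# Valid for ordinal data: median and category percentages (not the mean)
med = median(ratings)
top_two_box = sum(r >= 4 for r in ratings) / len(ratings) * 100

print(f"Median satisfaction: {med}")
print(f"Rated 4 or 5: {top_two_box:.0f}%")
```

Reporting “median 4” and “80% rated us 4 or 5” avoids the false precision of a mean like 3.7.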
2.3 The Time Dimension: Cross-Sectional, Time Series, and Panel Data
TechFlow’s analytics team now faces another structural question. David pulls up three different analyses:
“Report A looks at all four products’ Q4 performance—snapshot at one point in time. Report B tracks Product A’s sales monthly from 2020 to 2025. Report C tracks all four products monthly over the same period. These need different approaches.”
A dataset can have one of the following three structures:
- Cross-Sectional: Different entities, same time point
- Example: Q4 2024 sales for Products A, B, C, D
- Order doesn’t matter (can scramble rows)
- Enables: Comparison across entities
- Use for: “Which product/region/customer is best?”
- Time Series: One entity, multiple time points
- Example: Product A monthly sales 2020-2025
- Order matters enormously (dates carry information)
- Enables: Trend identification, seasonality, forecasting
- Use for: “How are we changing over time?”
- Panel Data: Multiple entities × multiple time points
- Example: All products’ monthly sales 2020-2025
- Most information-rich, most complex
- Enables: Both cross-sectional and time series analysis
- Use for: “How do differences across products evolve over time?”
Maria nods. “For the Q4 analysis, we’re mostly cross-sectional. But Sarah wants to know if Product D’s poor performance is new or ongoing. That requires time series thinking.”
3 Measures of Location: Finding the Center
Sarah’s directive was clear: precision, not approximation. The analytics team starts with the most basic question: What is typical performance?
But “typical” turns out to be more complicated than it sounds.
3.1 The Mean: Precision Matters
David pulls up the monthly revenue numbers.
“October was $720,000. November hit $780,000. December reached $800,000. The marketing report said ‘around $767,000’ for average monthly revenue. Let me calculate the exact mean.”
\[\bar{x} = \frac{720{,}000 + 780{,}000 + 800{,}000}{3} = \frac{2{,}300{,}000}{3} = \$766{,}666.67\]
“So it’s $766,667, not ‘around $767,000.’ Big difference?” Maria asks.
“Track this over six months,” David responds. “Month 1: ‘around $767k.’ Month 2: ‘around $770k.’ Month 3: ‘around $773k.’ You report to Sarah that revenue is growing, and she asks by how much. You can’t answer because ‘around’ doesn’t give you enough precision to measure change.”
\[\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}\]
In plain English: Add up all values, divide by how many you have.
Population mean (when you have ALL possible values): \[\mu = \frac{\sum_{i=1}^{N} x_i}{N}\]
What it represents: The “center of gravity” of your data. Each observation “pulls” the mean toward it.
“Around $767,000” vs. “$766,667”
Why precision matters:
- Can’t track month-over-month changes with “around”
- Can’t calculate % achievement of targets
- Can’t defend budget requests to finance
- Can’t identify meaningful trends vs. noise
Business Rule: If it’s worth reporting, it’s worth calculating exactly.
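David’s calculation takes one line in code. The monthly figures below are the three values from the chapter:

```python
from statistics import mean

monthly_revenue = [720_000, 780_000, 800_000]  # Oct, Nov, Dec (from the chapter)

exact_mean = mean(monthly_revenue)
print(f"Exact mean monthly revenue: ${exact_mean:,.2f}")  # not "around $767,000"
```

An exact figure like $766,666.67 can be tracked, targeted, and defended; “around $767,000” cannot.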
3.2 When Equal Isn’t Equal: Weighted Mean
Maria moves to the customer satisfaction analysis. “We surveyed 150 customers. Simple mean satisfaction score is 4.1 out of 5. Looks good.”
David frowns. “But wait—are all customers equally important? What if we weight by their revenue contribution?”
He pulls up the data. The top 10% of customers by order value have average satisfaction of 3.8. The bottom 50% have average satisfaction of 4.3.
“If we treat all responses equally, we get 4.1. But our high-value customers—the ones generating the most revenue—are actually less satisfied. That changes the story completely.”
\[\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i} = \frac{w_1x_1 + w_2x_2 + \cdots + w_nx_n}{w_1 + w_2 + \cdots + w_n}\]
where each observation \(x_i\) has weight \(w_i\).
When to use: Not all observations are equally important. Weight by:
- Revenue (for customer metrics)
- Investment amount (for portfolio returns)
- Volume (for supplier quality)
- Credit hours (for GPA)
- Sales volume (for product profitability)
✓ Customer satisfaction weighted by customer lifetime value
✓ Product returns weighted by sales volume
✓ Regional performance weighted by market size
✓ Supplier quality weighted by purchase volume
✓ Investment returns weighted by capital allocation
Key insight: Simple mean can hide critical patterns when importance varies across observations.
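A minimal sketch of the weighted-mean formula. The two-segment satisfaction numbers come from the narrative above; the revenue shares are hypothetical weights chosen for illustration.

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of w_i."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Hypothetical: two customer segments' satisfaction, weighted by revenue share
satisfaction = [3.8, 4.3]      # high-value customers, remaining customers
revenue_share = [0.60, 0.40]   # hypothetical revenue weights

simple = sum(satisfaction) / len(satisfaction)         # treats segments equally
weighted = weighted_mean(satisfaction, revenue_share)  # revenue-weighted

print(f"Simple: {simple:.2f}  Weighted: {weighted:.2f}")
```

The weighted mean lands below the simple mean because the less satisfied segment carries more weight, which is exactly the pattern the simple mean hides.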
3.3 When Multiplication Matters: Geometric Mean
“Here’s another problem,” David says, pulling up a different analysis. “Product C pricing. Year 1: increased price 20%. Year 2: decreased price 10%. Marketing wants to report average annual price change.”
Maria calculates: “(20% + (-10%))/2 = 5% average annual change.”
“Wrong,” David says gently. “Watch what actually happens to a $100 product: Year 1: $100 × 1.20 = $120. Year 2: $120 × 0.90 = $108. That’s an 8% total increase over two years, or 3.9% average annual, not 5%.”
\[\bar{x}_g = \sqrt[n]{x_1 \times x_2 \times \cdots \times x_n} = (x_1 \cdot x_2 \cdot \ldots \cdot x_n)^{1/n}\]
For growth rates: Convert percentages to multipliers (1 + rate), calculate geometric mean, subtract 1.
What it represents: The constant rate that would produce the same cumulative effect as the varying rates.
When to use:
- Investment returns over multiple periods
- Inflation rates averaged over years
- Population growth rates
- Any compound interest calculation
- Productivity improvements over time
- Price changes over multiple periods
Red flag: If your calculation involves multiplying growth factors, you need geometric mean.
The Classic Trap: Investment gains 50%, then loses 50%
Arithmetic mean: \((50\% + (-50\%))/2 = 0\%\) → Suggests you broke even ❌
Reality: \(\$100 \rightarrow \$150 \rightarrow \$75\) → You actually lost 25%
Geometric mean: \(\sqrt{(1.5)(0.5)} - 1 = -13.4\%\) per year → Compounding \(-13.4\%\) twice reproduces the 25% loss ✓
RULE: If values multiply over time (growth rates, returns, inflation) → Use geometric mean
Exception: Arithmetic mean is correct when averaging single-period returns across independent investments—not the same investment compounding over time.
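Both examples above can be checked with Python’s built-in geometric mean. Convert each percentage change to a multiplier, take the geometric mean, and subtract 1:

```python
from statistics import geometric_mean

# Product C pricing: +20% in year 1, -10% in year 2 (as multipliers)
multipliers = [1.20, 0.90]

avg_annual = geometric_mean(multipliers) - 1   # constant equivalent annual rate
cumulative = 1.20 * 0.90                       # actual two-year effect

print(f"Average annual change: {avg_annual:.1%}")  # ~3.9%, not 5%
print(f"Cumulative change: {cumulative - 1:.0%}")  # 8% over two years

# The classic trap: gain 50%, then lose 50%
trap = geometric_mean([1.5, 0.5]) - 1
print(f"'Break-even' trap: {trap:.1%} per year")   # about -13.4%, not 0%
```

If your calculation involves multiplying growth factors, the geometric mean is the one that reproduces the cumulative result.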
3.4 The Median: When the Middle Matters More
Maria pulls up the customer service response time data.
“Mean response time is 7.5 hours. That seems reasonable.”
David scrolls through the raw data. “Look closer. Most tickets get responses in 3-5 hours. But we’ve got tickets here at 24 hours, 36 hours, one at 52 hours. Those outliers are pulling the mean way up.”
He calculates the median by sorting all 200 response times and finding the middle value: 4.2 hours.
“So typical response time is actually 4.2 hours, not 7.5. The 7.5 is misleading because a few very slow responses are skewing it.”
The middle value when data are sorted in ascending order.
Calculation:
- Odd n: The middle observation
- Even n: Average of the two middle observations
What it represents: The point where 50% of data is below and 50% is above. Not affected by how extreme the extreme values are.
Use MEAN when:
- Data is symmetric
- No extreme outliers
- You want sensitivity to all values
- Mathematical properties matter (e.g., for further calculations)
Use MEDIAN when:
- Data is skewed
- Outliers are present
- You want robustness to extremes
- Reporting “typical” value to non-technical audience
Quick test: If mean > median, data is right-skewed → consider using median
When in doubt: Report both and explain the difference
Maria pulls up real estate data as an analogy. “Five houses sold: $200k, $210k, $225k, $230k, $1.2M. The mean is $413k. But would you advertise ‘average home price $413k’? That’s misleading. The median of $225k better represents the typical home.”
This is why real estate agents, salary surveys, and income reports use median. One billionaire in a room doesn’t make everyone else rich, but it makes the mean useless.
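Maria’s house-price analogy, verified directly with the five sale prices from the text:

```python
from statistics import mean, median

# The five house sales from Maria's example
prices = [200_000, 210_000, 225_000, 230_000, 1_200_000]

print(f"Mean:   ${mean(prices):,.0f}")    # pulled up by the $1.2M sale
print(f"Median: ${median(prices):,.0f}")  # the typical home
```

One extreme sale moves the mean by almost $200k while the median doesn’t budge, which is why robustness to outliers matters for “typical” values.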
3.5 Percentiles: Drawing Lines in the Data
Sarah enters the analytics team’s workspace. “I need to launch a VIP customer program. Who qualifies?”
David asks, “What’s the criteria?”
“Top 10% by order value.”
This is a percentile question. The 90th percentile is the value below which 90% of customers fall—meaning the top 10% are above it.
The \(p^{th}\) percentile is the value below which \(p\%\) of the data falls.
Position formula: \[L_p = \frac{p}{100}(n + 1)\]
Interpretation:
- 25th percentile: 25% of data below, 75% above
- 50th percentile: The median
- 90th percentile: 90% below, 10% above (top 10%)
For 90th percentile with 100 customer orders: \[L_{90} = \frac{90}{100}(100 + 1) = 90.9\]
This means the 90th percentile is 90% of the way between the 90th and 91st observations (when sorted).
David sorts the 100 customer orders and finds:
- 90th observation: $420
- 91st observation: $435
\[P_{90} = 420 + 0.9(435 - 420) = 420 + 13.5 = \$433.50\]
“Anyone who orders $434 or more qualifies for VIP,” David announces.
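The position formula \(L_p = \frac{p}{100}(n+1)\) can be implemented directly. Note that spreadsheet and library functions (Excel, NumPy) use slightly different interpolation conventions, so results can differ by a small amount; this sketch follows the chapter’s formula. The ten orders below are a hypothetical miniature of the customer-order data, not the full 100-order dataset.

```python
def percentile(data, p):
    """p-th percentile using the position formula L_p = (p/100) * (n + 1)."""
    xs = sorted(data)
    pos = p / 100 * (len(xs) + 1)   # 1-based position in the sorted data
    lo = int(pos)                   # observation just below the position
    frac = pos - lo                 # fractional distance to the next observation
    if lo < 1:
        return xs[0]
    if lo >= len(xs):
        return xs[-1]
    return xs[lo - 1] + frac * (xs[lo] - xs[lo - 1])

# Hypothetical 10-order miniature of the customer-order data
orders = [100, 150, 175, 200, 220, 260, 310, 360, 420, 435]
vip_cutoff = percentile(orders, 90)   # 420 + 0.9 * (435 - 420) = 433.5
print(f"VIP cutoff: ${vip_cutoff:,.2f}")
```

With ten orders, \(L_{90} = 0.9 \times 11 = 9.9\): the 90th percentile sits 90% of the way between the 9th and 10th sorted values.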
Salary & Compensation:
- 75th percentile = Competitive pay benchmark
- 50th percentile = Market median
- 25th percentile = Entry-level benchmark
Customer Programs:
- 90th percentile = VIP threshold
- 75th percentile = Premium tier
- 50th percentile = Standard tier
Service Levels:
- 95th percentile response time = SLA target
- “We respond within X hours 95% of the time”
Performance Management:
- 25th percentile = Performance improvement needed
- 75th percentile = High performer threshold
Quartiles are special percentiles that divide data into four equal parts:
- \(Q_1\) (25th percentile): 25% of data below
- \(Q_2\) (50th percentile): The median
- \(Q_3\) (75th percentile): 75% of data below
For TechFlow customer orders:
- \(Q_1 = \$175\) (bottom 25% spend less than this)
- \(Q_2 = \$220\) (median order)
- \(Q_3 = \$310\) (top 25% spend more than this)
Sarah nods. “So we could create tiers: Basic (<$175), Standard ($175-$310), Premium (>$310). That’s actionable.”
4 Measures of Variability: Understanding Consistency
The next day, Sarah returns with a harder question. “Product D did $250,000 in Q4. Is that bad?”
David looks confused. “Compared to what?”
“Exactly,” Sarah says. “If all our products do $250k ± $10k, then Product D is normal. If they typically do $600k ± $50k, then Product D is a disaster. I need context. I need to understand variation.”
Location tells you the center. Variability tells you the spread.
Two businesses with same average sales:
- Business A: $100k ± $5k (predictable, easy to plan)
- Business B: $100k ± $60k (volatile, high risk)
They’re fundamentally different businesses requiring different strategies, despite identical means.
Variability is about: Consistency, predictability, risk, and reliability.
4.1 Range: Simple but Dangerous
Maria starts simple. “Product sales range from $250,000 to $890,000. Range is $640,000.”
“Okay, but what does that tell us?” Sarah asks.
“That there’s… variation?” Maria says uncertainly.
David jumps in. “The problem with range is it only uses two numbers—the min and max. Black Friday could have spiked the max. A data entry error could have created the min. Range ignores the other 98% of your data.”
\[R = x_{max} - x_{min}\]
Simplest measure of spread: Maximum value minus minimum value.
Advantages: Easy to calculate and understand.
Disadvantages: Extremely sensitive to outliers, uses only 2 data points, ignores the bulk of your data.
The problem: One outlier destroys the range.
- Black Friday spike creates artificially large range
- One angry customer makes service look more variable than it is
- Data entry error of $180,000 instead of $18,000 distorts everything
TechFlow example: Daily sales ranged $18K-$52K (range = $34K)
- Was that typical variation?
- One unusual day?
- Steady growth trend?
Range can’t tell you. It’s better than nothing, but barely.
Rule: Use range for quick initial assessment only. Always follow with more robust measures.
4.2 IQR: The Robust Alternative
“What about the middle 50% of daily sales?” Sarah asks. “Ignore the extremes. What’s the range of typical days?”
This is the Interquartile Range (IQR):
\[IQR = Q_3 - Q_1\]
For TechFlow’s 92 days of sales:
- \(Q_1 = \$22,500\) (25th percentile)
- \(Q_3 = \$28,000\) (75th percentile)
- \(IQR = 28,000 - 22,500 = \$5,500\)
“So a typical day varies by about $5,500 from the lower to upper middle,” David explains. “This is robust—outliers don’t affect it.”
Robustness: The middle 50% tells you what’s “normal” without being fooled by outliers.
Test of robustness:
- One day at $100,000 (corporate bulk purchase)
- Range jumps to $82,000 (meaningless)
- IQR stays at $5,500 (still meaningful)
Use IQR for:
- Setting realistic performance targets
- Defining “typical” customer behavior
- Creating forecasts robust to unusual events
- Understanding your core business range
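The robustness test above can be demonstrated with a small experiment. The daily figures here are hypothetical stand-ins (in thousands of dollars), not the actual 92-day dataset; `statistics.quantiles` with `method='inclusive'` computes the quartiles:

```python
from statistics import quantiles

# Hypothetical steady daily sales, in thousands of dollars
days = list(range(10, 31))   # 21 days: 10k .. 30k

q1, _, q3 = quantiles(days, n=4, method='inclusive')
iqr = q3 - q1
rng = max(days) - min(days)

# One corporate bulk purchase appears
days_with_spike = days + [1000]
q1s, _, q3s = quantiles(days_with_spike, n=4, method='inclusive')
iqr_spike = q3s - q1s
rng_spike = max(days_with_spike) - min(days_with_spike)

print(f"Range: {rng} -> {rng_spike}")   # explodes with one outlier
print(f"IQR:   {iqr} -> {iqr_spike}")   # barely moves
```

One outlier multiplies the range nearly fifty-fold while the IQR shifts only slightly, which is the whole case for using IQR to describe “typical” spread.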
4.3 Variance and Standard Deviation: Measuring Typical Deviation
“I still don’t have a single number that tells me how much variation is normal,” Sarah says.
David opens a fresh calculation. “That’s what standard deviation does. It measures the typical distance from the mean.”
Variance = Average squared deviation from mean: \[s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}\]
Standard Deviation = Square root of variance: \[s = \sqrt{s^2}\]
Why square deviations?
1. Prevents cancellation (positive and negative deviations would sum to zero)
2. Penalizes larger deviations more heavily (quadratic penalty)
What \(s\) represents: The typical distance an observation is from the mean.
For TechFlow’s four product sales ($250k, $520k, $640k, $890k):
- Mean: \(\bar{x} = \$575,000\)
- Deviations: -$325k, -$55k, +$65k, +$315k
- Squared deviations: $105,625M, $3,025M, $4,225M, $99,225M
- Sum: $212,100M
- Variance: \(s^2 = 212{,}100M / 3 = 70{,}700M\)
- Standard deviation: \(s = \sqrt{70{,}700M} = \$265{,}895\)
“So a typical product deviates about $266,000 from the $575,000 mean,” David explains. “Product D is $325,000 below mean—about 1.2 standard deviations below. That’s unusual but not extremely rare.”
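The worked calculation above can be reproduced with the `statistics` module. Note that `stdev` divides by \(n-1\) (the sample formula used in this chapter), while `pstdev` divides by \(n\) (the population formula):

```python
from statistics import mean, stdev, pstdev

products = [250_000, 520_000, 640_000, 890_000]  # D, C, B, A quarterly sales

m = mean(products)   # 575,000
s = stdev(products)  # sample SD, divides by n-1 (matches the formula above)

print(f"Mean: ${m:,.0f}, SD: ${s:,.0f}")
# pstdev(products) would divide by n instead -- use only for full populations
```

Mixing up `stdev` and `pstdev` is a common source of small discrepancies between hand calculations and software output.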
Inventory Planning:
- Stock level = Mean demand + 1 SD (covers ~68% of days)
- For 95% coverage: Mean + 1.65 SD
Quality Control:
- Products > 3 SD from specification → Investigate immediately
- Process capability: Spec range should be ≥ 6 SD wide
Sales Forecasting:
- Mean ± 1 SD = Realistic range (covers ~68%)
- Mean ± 2 SD = Conservative range (covers ~95%)
Risk Assessment:
- Higher SD = Higher uncertainty = Need larger safety margins
- Compare SD to mean to understand relative risk
4.4 Coefficient of Variation: Comparing Apples and Oranges
Sarah brings up a new challenge. “You told me product SD is $266,000. Regional SD is $588,784. Which should I worry about more?”
“Can’t compare them directly,” David says. “Products average $575k; regions average $767k. A $266k SD on a $575k mean is different than a $589k SD on a $767k mean.”
\[CV = \frac{s}{\bar{x}} \times 100\%\]
What it is: Standard deviation expressed as a percentage of the mean.
What it does: Enables comparison of variability across different scales, units, or time periods.
Interpretation: “Variability is X% of the mean”
Product sales:
- Mean = $575,000, SD = $267,759
- CV = (267,759 / 575,000) \(\times\) 100% = 46.6%
Regional sales:
- Mean = $766,667, SD = $588,784
- CV = (588,784 / 766,667) \(\times\) 100% = 76.8%
“So relative to their means,” Maria explains, “regions are MORE variable (76.8%) than products (46.6%). Geography is your bigger consistency problem.”
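The CV comparison is a one-line calculation, shown here with the product and regional figures from the chapter:

```python
def cv(sd, mean_value):
    """Coefficient of variation: SD as a percentage of the mean."""
    return sd / mean_value * 100

product_cv = cv(267_759, 575_000)    # product sales, from the chapter
regional_cv = cv(588_784, 766_667)   # regional sales, from the chapter

print(f"Products: {product_cv:.1f}%  Regions: {regional_cv:.1f}%")
```

Because both results are percentages of their own means, they are directly comparable even though the raw SDs are on different scales.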
CV < 15%: Low variability
- Consistent, predictable performance
- Reliable for planning
- Example: TechFlow monthly revenue (CV = 5.3%)
CV = 15-30%: Moderate variability
- Normal business fluctuation
- Acceptable for most planning purposes
CV > 30%: High variability
- Investigate root causes
- Need robust planning with safety margins
- Example: TechFlow products (46.6%), regions (76.8%)
CV > 100%: Extreme variability
- Standard deviation exceeds the mean
- Highly unpredictable
- May indicate fundamental business instability
You CAN’T directly compare:
- $10 SD on $100 mean vs. $100 SD on $10,000 mean
- Product variability vs. regional variability (different scales)
- This year vs. last year (if volumes changed)
CV makes them comparable:
- 10% CV vs. 1% CV → First is more variable relative to its mean
- Product CV 46.6% vs. Regional CV 76.8% → Regions more variable
- This year CV 25% vs. Last year CV 18% → This year more volatile
Applications:
- Portfolio comparison across asset classes
- Performance comparison across business units
- Year-over-year volatility comparison
- Benchmarking against industry standards
Sarah leans back. “That changes priorities. I was going to restructure the product portfolio. But this says I should focus on geographic strategy—understand why one region dominates and others lag.”
5 Distribution Characteristics: Understanding Shape
“I’ve got location and variability,” Sarah says. “What else do I need?”
“Shape,” David responds. “Is your data symmetric or skewed? Are there outliers? Is variation normal or are we seeing something unusual?”
He pulls up two histograms showing daily sales for two different quarters.
“Both have the same mean and same standard deviation,” he says. “But look—Q3 is symmetric, most days near the middle. Q4 is right-skewed, most days are low with occasional spikes.”
“So what?” Sarah asks.
“So in Q3, mean and median are both good measures. In Q4, median is better. In Q3, most days are near average. In Q4, ‘average’ is misleading because most days are below it. Your planning assumptions would be completely different.”
5.1 Skewness: When Data Leans
Skewness measures the asymmetry of a distribution:
\[\text{Skewness} = \frac{n}{(n-1)(n-2)} \sum \left(\frac{x_i - \bar{x}}{s}\right)^3\]
What it measures: Asymmetry in the distribution.
Interpretation:
- Skewness ≈ 0: Symmetric distribution
- Skewness < 0: Left-skewed (negative) → Tail points left
- Skewness > 0: Right-skewed (positive) → Tail points right
Rules of Thumb:
- |Skewness| < 0.5: Approximately symmetric
- |Skewness| = 0.5-1.0: Moderately skewed
- |Skewness| > 1.0: Highly skewed
The sign tells you where the tail points, not where most data is.
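The skewness formula above can be implemented in a few lines. The two small datasets here are toy examples chosen to show the sign convention, not TechFlow data:

```python
from statistics import mean, stdev

def skewness(data):
    """Sample skewness: n/((n-1)(n-2)) * sum of cubed standardized deviations."""
    n = len(data)
    m, s = mean(data), stdev(data)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in data)

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 1, 10]   # most values low, one long right tail

print(f"Symmetric:    {skewness(symmetric):.2f}")     # ~0
print(f"Right-skewed: {skewness(right_skewed):.2f}")  # positive: tail points right
```

Libraries such as SciPy offer skewness functions too, but their default formula omits the \(\frac{n}{(n-1)(n-2)}\) adjustment, so values can differ slightly from this sample version.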
Maria analyzes TechFlow’s customer satisfaction scores:
- Mean = 4.1
- Median = 4.3
- Skewness = -0.95
“It’s left-skewed,” she explains. “Most customers are very satisfied (4-5 range), but a few unhappy customers drag the mean down.”
Left-Skewed (Negative): Most values HIGH, few LOW
- Use MEDIAN to report typical value
- The tail represents problems to fix
- Example: Customer satisfaction (most satisfied, few unhappy)
- Action: Investigate the unhappy customers in the left tail
Right-Skewed (Positive): Most values LOW, few HIGH
- Use MEDIAN to report typical value
- The tail represents opportunities to leverage
- Example: Order values (most small, some very large)
- Action: Identify and target high-value customers in right tail
Quick diagnostic:
- If mean > median → Right-skewed
- If mean < median → Left-skewed
- If mean ≈ median → Approximately symmetric
David adds the business implication: “When you report to Sarah, say ‘median satisfaction is 4.3’ not ‘average is 4.1.’ The median better represents typical experience.”
Sarah nods. “And those few unhappy customers? I want to know who they are and why they’re unhappy. In a left-skewed satisfaction distribution, the tail is where you find churning customers.”
5.2 Z-Scores: Your Statistical Alarm System
Sarah brings in the quarterly product comparison. “Product D did $250,000. Products A, B, and C did $890k, $640k, and $520k. How do I know if Product D is just naturally lower or actually failing?”
“Z-scores,” David says. He calculates:
\[z = \frac{x - \bar{x}}{s}\]
For Product D: \[z_D = \frac{250{,}000 - 575{,}000}{267{,}759} = \frac{-325{,}000}{267{,}759} = -1.21\]
“Product D is 1.21 standard deviations below the mean.”
\[z_i = \frac{x_i - \bar{x}}{s}\]
What it means: “How many standard deviations away from the mean is this value?”
Interpretation:
- z = 0: Exactly at the mean
- z > 0: Above the mean (positive = higher)
- z < 0: Below the mean (negative = lower)
- |z| = 1: One standard deviation from mean
- |z| = 2: Two standard deviations from mean
Power: Enables comparison across different scales. A z-score of +1.5 has the same meaning whether you’re measuring sales, response times, or satisfaction.
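David’s Product D calculation, using the product mean and SD from the chapter:

```python
def z_score(x, mean_value, sd):
    """How many standard deviations x lies from the mean."""
    return (x - mean_value) / sd

# Product D vs. the four-product mean and SD from the chapter
z_d = z_score(250_000, 575_000, 267_759)
print(f"Product D: z = {z_d:.2f}")   # about -1.21: the monitoring zone
```

The same function works unchanged for response times, satisfaction scores, or any other metric, which is what makes z-scores a universal comparison tool.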
“Is that bad?” Sarah asks.
David pulls up a reference guide:
|z| < 1.0: Normal variation
- Action: None needed
- Interpretation: Within typical range
- Example: Product with z = 0.7 is slightly above average, but normal
|z| = 1.0-2.0: Worth monitoring
- Action: Track closely, investigate if persists
- Interpretation: Unusual but not crisis
- Example: Product D at z = -1.21 → Monitor, investigate within 90 days
|z| = 2.0-3.0: Investigate immediately
- Action: Urgent investigation required
- Interpretation: Highly unusual, likely problem or major opportunity
- Example: Regional sales z = -2.5 → Immediate strategic review
|z| > 3.0: Crisis or extraordinary opportunity
- Action: Emergency response or immediate leverage
- Interpretation: Extreme outlier, statistically very rare
- Example: Customer order z = 4.2 → Potential VIP, analyze immediately
“Product D at -1.21 is in the monitoring zone,” David says. “Not a crisis yet, but definitely below normal performance. You should investigate.”
Sarah makes a note: “Investigate Product D performance. Compare pricing to competitors, survey customers for product-specific feedback, evaluate marketing spend allocation. Set 90-day checkpoint: improve or consider discontinuation.”
5.3 Chebyshev’s Theorem: Planning for Any Distribution
“Here’s a practical question,” Sarah says. “I’m setting Q1 budget. I want enough to cover most scenarios. What’s ‘most’?”
David introduces Chebyshev’s Theorem:
For ANY distribution (regardless of shape):
At least \(1 - \frac{1}{z^2}\) of values fall within \(z\) standard deviations of the mean (where \(z > 1\)).
Key Implications:
- At least 75% within ±2 SD of mean
- At least 89% within ±3 SD of mean
- At least 94% within ±4 SD of mean
Power: Works for ANY distribution—symmetric, skewed, bimodal, whatever. No assumptions needed.
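The bound itself is a one-line function, useful for generating the coverage guarantees listed above:

```python
def chebyshev_bound(z):
    """Minimum fraction of ANY distribution within z standard deviations (z > 1)."""
    return 1 - 1 / z**2

for z in (2, 3, 4):
    print(f"Within ±{z} SD: at least {chebyshev_bound(z):.0%}")
```

These are worst-case guarantees: a well-behaved distribution will usually cover far more than the bound (a normal distribution puts ~95% within ±2 SD, versus Chebyshev’s 75% floor).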
For TechFlow daily sales:
- Mean = $25,000, SD = $5,000
- 2 SD range: [$15,000, $35,000] covers at least 75% of days
- 3 SD range: [$10,000, $40,000] covers at least 89% of days
Sarah calculates: “If I budget for mean + 2 SD = $35,000 daily capacity, I’ll be prepared for at least 75% of days. If I want 89% coverage, I budget for $40,000 capacity.”
Safety Stock (Inventory):
- Mean demand + 2 SD → Covers at least 75% of demand scenarios
- Mean demand + 3 SD → Covers at least 89%
- Conservative approach when demand pattern is unknown
Capacity Planning:
- Size facility for mean + 2 SD → Handles at least 75% of days
- Critical operations: Use mean + 3 SD for 89% coverage
Budget Ranges:
- Revenue forecast: Mean ± 2 SD gives realistic envelope
- At least 75% of actual outcomes will fall in this range
Quality Control:
- Values beyond mean ± 3 SD → Investigate (Chebyshev guarantees at most 11% can fall outside)
5.4 Outliers: Problems or Opportunities?
Maria pulls up a scatterplot of customer orders. Most cluster between $150-$350. But six orders exceed $600.
“Are those data errors or real?” Sarah asks.
“Real,” Maria confirms. “I verified them. Corporate bulk purchases, returning customers stocking up.”
“Then those aren’t errors—they’re our most valuable customers,” Sarah says. “Can we identify them systematically?”
An observation with an unusually high or low value relative to the rest of the data.
Two types:
1. Problems: Data errors, fraud, defects, system failures
2. Opportunities: Top performers, VIP customers, innovations
Key insight: Never automatically delete outliers. Always investigate WHY they’re unusual.
Method 1: Z-Score Method
- Calculate: \(z = \frac{x - \bar{x}}{s}\)
- Flag: |z| > 3 (for large datasets) or |z| > 2 (for small datasets)
- Best for: Approximately normal distributions
Method 2: IQR Method
- Calculate: \(IQR = Q_3 - Q_1\)
- Lower boundary: \(Q_1 - 1.5 \times IQR\)
- Upper boundary: \(Q_3 + 1.5 \times IQR\)
- Flag: Any value outside these boundaries
- Best for: Any distribution, especially skewed data
For TechFlow customer orders using IQR method:
- \(Q_1 = \$175\), \(Q_3 = \$310\)
- \(IQR = 310 - 175 = \$135\)
- Upper boundary: \(310 + 1.5(135) = \$512.50\)
Any order above $512.50 is statistically unusual.
David creates a flagging system: “Orders above $513 automatically flag for VIP customer follow-up. Orders below lower boundary (\(175 - 1.5(135) = -\$27.50\)—not applicable) would flag for potential data errors if negative.”
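David’s flagging boundaries follow directly from the IQR method, using the quartiles from the chapter:

```python
def iqr_fences(q1, q3, k=1.5):
    """Outlier boundaries: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower, upper = iqr_fences(175, 310)   # TechFlow order quartiles
print(f"Flag orders below ${lower:,.2f} or above ${upper:,.2f}")
```

The multiplier `k=1.5` is the standard convention; a stricter `k=3` is sometimes used to flag only “extreme” outliers.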
Right approach:
1. Verify: Is it a data entry error?
2. Classify: Problem or opportunity?
3. Investigate: Why is it unusual?
4. Decide: Fix, flag, or leverage
Questions to ask:
- Is it physically possible? (Negative age = error)
- Is it consistent with other data? (Order for $50,000 from customer who usually orders $200 = investigate)
- Is there a pattern? (Multiple outliers on Black Friday = opportunity)
- What’s the business context? (Large B2B order vs. consumer fraud)
PROBLEMS to Investigate:
- Defective products (quality control)
- Fraudulent transactions (security)
- Data entry errors (data quality)
- System failures (IT/operations)
- Service breakdowns (customer service)
OPPORTUNITIES to Leverage:
- Exceptional salespeople (best practices)
- High-value customers (VIP programs)
- Top-performing stores (success factors)
- Breakthrough innovations (scale up)
- Unusually satisfied customers (testimonials)
TechFlow Example:
- Orders >$513 flagged as outliers
- Investigation revealed: B2B corporate customers
- Result: New B2B sales strategy (3% of customers, 8% of revenue)
Sarah sees the opportunity: “Those six high-value orders—can we profile those customers? Find commonalities? Market to similar prospects?”
Maria pulls up the analysis. “All six were B2B customers buying for their offices. Average order $687. They represent 3% of customers but 8% of revenue.”
“I want a dedicated B2B sales strategy by end of quarter,” Sarah decides.
6 Boxplots: Seeing the Whole Picture
“I need to compare,” Sarah says. “Products. Regions. This quarter vs. last quarter. Can you show me all this variation in one visual?”
David creates a boxplot.
6.1 The Five-Number Summary
Five-Number Summary:
1. Minimum (excluding outliers)
2. First quartile (Q1, 25th percentile)
3. Median (Q2, 50th percentile)
4. Third quartile (Q3, 75th percentile)
5. Maximum (excluding outliers)
Plus: Individual points for outliers
What each part shows:
- Box: Middle 50% of data (IQR)
- Line in box: Median
- Whiskers: Extend to min/max within 1.5×IQR
- Individual dots: Outliers beyond whiskers
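The five-number summary and whisker limits are straightforward to compute. A minimal sketch, using hypothetical daily sales figures; `method="inclusive"` matches Excel's `QUARTILE.INC`, the function used in this chapter's quick reference:

```python
from statistics import quantiles

def five_number_summary(data):
    """Min, Q1, median, Q3, max for a dataset."""
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    return min(data), q1, med, q3, max(data)

sales = [18, 22, 25, 27, 30, 33, 35, 41, 52]    # hypothetical daily sales ($K)
mn, q1, med, q3, mx = five_number_summary(sales)

iqr = q3 - q1                    # box height
upper_whisker = q3 + 1.5 * iqr   # here 50: the max (52) plots as an outlier dot
print(mn, q1, med, q3, mx)       # 18 25.0 30.0 35.0 52
```

In this example the whisker stops at 50, so the $52K day would appear as an individual dot above the box, exactly the situation described above.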
6.2 Reading a Boxplot
David shows Sarah the boxplot of monthly sales by product for Q4.
“The box shows the middle 50%—from Q1 to Q3. The line inside is the median. The whiskers extend to the min and max values within 1.5×IQR of the box. Anything beyond that plots as individual dots—outliers.”
Sarah studies it. “So Product D’s box is lower—lower median. It’s also narrower—more consistent but consistently low. Product A’s box is higher and wider—higher median but more variable.”
Position (vertical placement):
- Higher box = Higher values overall
- Compare medians to see which group is typically higher
Spread (box height):
- Taller box = More variability (larger IQR)
- Narrower box = More consistent performance
Skewness (median position in box):
- Median near bottom = Right-skewed (tail extends up)
- Median near top = Left-skewed (tail extends down)
- Median centered = Approximately symmetric
Outliers (dots beyond whiskers):
- Individual points show unusual values
- Above box = Unusually high
- Below box = Unusually low
Whiskers (lines from box):
- Long whiskers = Extended range (excluding outliers)
- Short whiskers = Compact range
6.3 Comparative Boxplots: The Executive’s Dashboard
David creates a regional comparison boxplot for all of 2024.
Sarah points to Asia-Pacific. “Tiny box, very low position. Small market, very consistent performance, but consistently low. North America: high position, wider box—larger, more variable, but strong median.”
“Should we invest in Asia-Pacific to grow it or North America to maintain our strength?” Sarah asks.
Maria pulls up per-customer revenue. “Asia-Pacific: $1,908 per customer. North America: $1,790. Europe: $1,946. The revenue per customer is actually similar—Asia-Pacific’s problem is customer acquisition, not customer value.”
Sarah makes the decision: “Asia-Pacific needs marketing and brand awareness, not product changes. The customers we have there are profitable. We just need more of them.”
At a glance, you can see:
- Which group has highest median (typical value)
- Which has most variability (risk)
- Which has outliers (problems or opportunities)
- Which has skewed distribution (plan accordingly)
- How groups compare across ALL these dimensions
No numbers needed to understand basic patterns.
Immediate insights without detailed statistical analysis.
Perfect for:
- Comparing products across regions
- Comparing departments’ performance
- Comparing this quarter vs. last quarter
- Identifying which groups need attention
7 Association Between Two Variables: Finding Relationships
Sarah’s final question: “Do longer response times hurt customer satisfaction?”
This is a question about association—whether two variables are related.
7.1 Covariance: The Direction of Relationship
David calculates covariance between response time and satisfaction for 200 customers:
\[s_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n-1}\]
The result: \(s_{xy} = -2.3\)
“Negative covariance,” Maria notes. “When response time goes up, satisfaction goes down.”
“But how strong is the relationship?” Sarah asks. “Is -2.3 a lot or a little?”
“Can’t tell,” David admits. “If satisfaction is measured 1-5 and response time in hours, -2.3 is in weird mixed units. We need to standardize.”
Sample covariance: \[s_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n-1}\]
What it measures: How two variables vary together.
Interpretation:
- \(s_{xy} > 0\): Positive relationship (move together)
- \(s_{xy} < 0\): Negative relationship (move opposite)
- \(s_{xy} = 0\): No linear relationship
Problem: Magnitude depends on units → Hard to interpret “Is 500 a strong relationship?”
The problem: Magnitude is not standardized.
- Covariance of 100 could be strong or weak depending on variables
- Changing units changes covariance (e.g., converting both variables from dollars to cents multiplies it by 10,000)
- Can’t compare covariances across different variable pairs
Solution: Use correlation coefficient instead (next section).
When covariance is useful: Just need to know direction (positive/negative), not strength.
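The unit-dependence problem is easy to demonstrate. A minimal sketch with hypothetical response-time and satisfaction data: rescaling one variable rescales the covariance, even though the relationship itself is unchanged.

```python
def sample_cov(x, y):
    """Sample covariance: sum of deviation products over n - 1."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)

hours  = [1, 2, 4, 8, 12]     # hypothetical response times (hours)
rating = [5, 5, 4, 3, 2]      # hypothetical satisfaction scores (1-5)

cov_hours = sample_cov(hours, rating)
cov_minutes = sample_cov([h * 60 for h in hours], rating)
# Same data, same relationship -- but measuring time in minutes
# instead of hours multiplies the covariance by 60.
print(cov_hours, cov_minutes)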
7.2 Correlation: The Strength and Direction
Correlation coefficient scales covariance to a standard range:
\[r_{xy} = \frac{s_{xy}}{s_x \cdot s_y}\]
where \(r \in [-1, 1]\)
David recalculates: \(r = -0.68\)
Sample correlation: \[r_{xy} = \frac{s_{xy}}{s_x \cdot s_y}\]
Population correlation: \[\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \cdot \sigma_y}\]
Range: Always between -1 and +1
What it measures:
- Strength: How closely variables follow a linear relationship
- Direction: Whether they move together or opposite
Key property: Unit-free, enabling comparison across variable pairs
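A minimal sketch of the correlation calculation, using hypothetical data: dividing the covariance by both standard deviations removes the units, so rescaling a variable leaves \(r\) unchanged.

```python
from statistics import stdev

def sample_cov(x, y):
    """Sample covariance: sum of deviation products over n - 1."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Sample correlation coefficient r, always in [-1, 1]."""
    return sample_cov(x, y) / (stdev(x) * stdev(y))

hours  = [1, 2, 4, 8, 12]     # hypothetical response times (hours)
rating = [5, 5, 4, 3, 2]      # hypothetical satisfaction scores (1-5)

r = correlation(hours, rating)
r_minutes = correlation([h * 60 for h in hours], rating)
# r is strongly negative, and switching hours to minutes does not change it.
print(r, r_minutes)
```

This unit-free property is exactly what the covariance lacked: an \(r\) of \(-0.68\) means the same thing whether response time was measured in minutes, hours, or days.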
Strong Relationship:
- |r| > 0.8: Very strong
- Variables move together consistently
- One predicts the other well
Moderate Relationship:
- |r| = 0.5 to 0.8: Moderate
- Clear relationship but substantial scatter
- Some predictive value
Weak Relationship:
- |r| < 0.5: Weak
- Relationship exists but highly variable
- Limited predictive value
No Linear Relationship:
- |r| ≈ 0: No linear relationship
- Variables independent or non-linear relationship
Sign (Direction):
- Positive r: Variables move together (both increase/decrease)
- Negative r: Variables move opposite (one up, other down)
“So r = -0.68 is a moderate negative relationship,” Maria explains. “Longer response times are associated with lower satisfaction, and the relationship is moderately strong.”
Sarah asks the key question: “If we improve response times, will satisfaction improve?”
“Probably,” David says carefully, “but correlation doesn’t prove causation. Maybe dissatisfied customers contact us more, leading to longer queues. The arrow could go either way. But it’s worth investigating.”
Strong correlation does NOT mean one causes the other.
Classic example:
- Ice cream sales correlate with drowning deaths (r ≈ +0.9)
- Does ice cream cause drowning? NO!
- Both caused by summer weather (confounding variable)
TechFlow example: Response time correlates with satisfaction (r = -0.68)
- Could mean: Slow responses → Lower satisfaction ✓
- Could mean: Unhappy customers → Contact support more → Longer queues ✓
- Could mean: Both caused by understaffing ✓
To establish causation: Need controlled experiment, not just correlation.
Rule: Correlation tells you WHERE to investigate. Experimentation tells you WHAT to do.
7.3 Business Applications of Correlation
Sarah wants to explore other relationships:
- Advertising spend vs. sales: \(r = +0.45\) (moderate positive—advertising helps but isn’t the only factor)
- Employee tenure vs. sales performance: \(r = +0.23\) (weak positive—experience helps but many factors matter)
- Product price vs. perceived quality: \(r = +0.61\) (moderate positive—higher prices signal quality)
- Order size vs. repeat purchase rate: \(r = +0.71\) (strong positive—high-value customers return more often)
“This is actionable,” Sarah says. “Strong positive correlation between order size and repeat rate means we should focus retention efforts on high-value customers. The ROI will be better than trying to retain everyone equally.”
Marketing & Sales:
- Do advertising dollars drive sales? (measure effectiveness)
- Does price correlate with perceived quality? (pricing strategy)
- Do promotions correlate with customer lifetime value? (ROI)
Operations:
- Does response time correlate with satisfaction? (service priorities)
- Does employee training correlate with performance? (HR investment)
- Does overtime correlate with errors? (quality management)
Strategy:
- Does market share correlate with profitability? (growth strategy)
- Does innovation spending correlate with revenue growth? (R&D allocation)
- Does employee satisfaction correlate with customer satisfaction? (culture investment)
Remember: Correlation suggests relationships worth investigating, not proof of causation.
Maria adds a warning: “But remember—correlation isn’t causation. Maybe high-value customers return more often because they’re more satisfied, not because they spend more. We’d need an experiment to know for sure.”
Sarah nods. “Understood. Correlation tells me where to look. Experimentation tells me what to do.”
8 Chapter Summary: From Vagueness to Precision
Six days after Sarah’s request, the analytics team presents their revised Q4 2024 report. Gone are the phrases “around,” “somewhere in the middle,” and “some outliers.”
In their place:
Monthly Performance:
- Mean: $766,667 (CV: 5.3% - highly consistent)
- Median: $780,000 (slight negative skew from October baseline)
- Progressive growth: +8.3% Oct-Nov, +2.6% Nov-Dec
Product Portfolio:
- Product D: z-score = -1.21 (statistical underperformer)
- Performance gap: $325,000 below portfolio mean (56.5% shortfall)
- Product CV: 46.6% (high variability requiring individual strategies)
- Recommendation: 90-day investigation—pricing, quality, or market fit
Regional Analysis:
- Regional CV: 76.8% vs Product CV: 46.6% (geography is primary challenge)
- 61% revenue concentration in North America (vulnerability risk)
- Asia-Pacific: $1,908 per customer vs $1,845 average (customer quality is good; need more customers)
Customer Insights:
- 90th percentile order: $425 (VIP threshold)
- IQR: $135 ($175-$310 represents middle 50% of customers)
- 6 orders exceed $513 (B2B opportunity: 3% of customers, 8% of revenue)
Service Quality:
- Median response time: 4.2 hours (mean: 7.5 hours skewed by outliers)
- 15% of tickets exceed 21 hours (outlier threshold)
- Correlation with satisfaction: r = -0.68 (moderate negative)
- Recommendation: 12-hour SLA for 95th percentile
Sarah reads through the new report. Every claim is specific. Every comparison is quantified. Every recommendation is defensible.
“This,” she says, “is a report I can act on.”
Measures of Location answer “What’s typical?”:
- Use mean for symmetric data without outliers
- Use median for skewed data or data with outliers
- Use geometric mean for growth rates and compound returns
- Use weighted mean when observations have different importance
- Use percentiles to define thresholds and create tiers
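The geometric-mean point deserves a worked sketch, since it is the one most often gotten wrong. The returns here are hypothetical, chosen to make the gap between the two averages obvious:

```python
def geometric_mean_return(returns):
    """Average annual return that, compounded, reproduces total growth."""
    growth = 1.0
    for r in returns:
        growth *= (1 + r)
    return growth ** (1 / len(returns)) - 1

returns = [0.50, -0.50]                    # +50% one year, -50% the next
arith = sum(returns) / len(returns)        # 0.0 -- suggests you broke even
geo = geometric_mean_return(returns)       # about -13.4% per year
print(arith, round(geo, 4))
# $10,000 -> $15,000 -> $7,500: only the geometric mean reflects the real loss
```

The arithmetic mean says the investor broke even; the geometric mean correctly reports the compound rate that shrank $10,000 to $7,500.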
Measures of Variability answer “How consistent are we?”:
- Range and IQR show spread (IQR is robust to outliers)
- Standard deviation measures typical deviation from mean
- Coefficient of variation enables comparison across different scales
- Higher variability = higher uncertainty = larger safety margins needed
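The CV's scale-free property can be sketched directly. Both series below are hypothetical; the point is that the smaller-valued series can still be the more volatile one once spread is expressed relative to the mean.

```python
from statistics import mean, stdev

def cv(data):
    """Coefficient of variation: std dev as a percentage of the mean."""
    return stdev(data) / mean(data) * 100

monthly_k = [720, 760, 820]     # hypothetical monthly revenue ($K)
daily_k = [18, 30, 52]          # hypothetical daily revenue ($K)

print(round(cv(monthly_k), 1))  # small CV -> consistent performance
print(round(cv(daily_k), 1))    # large CV -> volatile, despite smaller numbers
```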
Distribution Characteristics answer “What’s the shape?”:
- Skewness indicates asymmetry (affects choice of location measure)
- Z-scores identify unusually high or low values (your alarm system)
- Chebyshev’s theorem works for any distribution shape (planning tool)
- Outliers can be problems to fix or opportunities to leverage
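Chebyshev's guarantee is a one-line formula, sketched here: at least \(1 - 1/k^2\) of observations lie within \(k\) standard deviations of the mean, regardless of distribution shape.

```python
def chebyshev_min_fraction(k):
    """Guaranteed minimum fraction of data within k std devs of the mean,
    valid for ANY distribution shape (requires k > 1)."""
    return 1 - 1 / k**2

for k in (2, 3, 4):
    print(k, chebyshev_min_fraction(k))   # 2 -> 0.75, 3 -> ~0.889, 4 -> 0.9375
```

Because the bound holds for any shape, it is a safe planning tool even for heavily skewed data like TechFlow's response times.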
Visual Tools answer “Can you show me?”:
- Boxplots display five-number summary plus outliers
- Enable instant comparison across groups
- Reveal median, spread, skewness, and extremes at a glance
- Perfect executive dashboard for comparative analysis
Association Measures answer “Are things related?”:
- Covariance shows direction but hard to interpret magnitude
- Correlation standardizes to [-1, 1] for easy interpretation
- Strong correlation suggests where to investigate
- Remember: Correlation ≠ Causation (always)
8.1 The Statistical Toolkit in Action
The difference between TechFlow’s vague marketing report and the rigorous analytics report came down to proper application of descriptive statistics:
Marketing Report (❌):
- “Revenue was around $767,000”
- “Most days somewhere in the middle”
- “Some outliers on both ends”
- Result: No actionable insights, no defensible recommendations
Analytics Report (✓):
- Mean monthly revenue: $767,000 (CV: 5.3% - highly consistent)
- Product D: z-score = -1.21 (statistical underperformer, 56.5% below mean)
- Regional CV: 76.8% vs Product CV: 46.6% (geography is bigger challenge)
- Customer satisfaction: median 4.3/5 (left-skewed, majority highly satisfied)
- Response time outliers: 15% exceed 21-hour threshold (service quality issue)
- Result: Specific, quantifiable opportunities and risks with clear action items
8.2 From Student to Analyst
The difference between a weak analyst and a strong one isn’t mathematical sophistication—it’s knowing which tool to use when, and what the results actually mean for decisions.
Weak analysts calculate means and call it done. Strong analysts ask: “Is this data skewed? Should I use median instead? Are there outliers I should investigate? How does this variability compare to last quarter? What’s unusual enough to flag?”
Weak analysts report “average customer satisfaction: 4.1 out of 5.” Strong analysts report “median satisfaction: 4.3 (left-skewed distribution), but 10% of customers score below 3.0 (correlation with response times: r = -0.68).”
Weak analysts see a number below average and call it “disappointing.” Strong analysts calculate z-scores, compare to historical variation, and determine if it’s normal fluctuation or statistical evidence of a problem.
The tools in this chapter aren’t complex. But using them well—knowing when median beats mean, when to calculate CV instead of SD, when to flag outliers, when to test for correlation—that’s the professional advantage.
Statistical precision enables:
- Tracking performance accurately over time
- Setting realistic targets based on variability
- Identifying problems early through outlier detection
- Making defensible decisions supported by data
- Allocating resources effectively based on comparative analysis
Your mantra: Calculate. Don’t estimate. Quantify. Don’t assume.
Sarah’s directive was simple: “Give me precision.” The analytics team delivered by applying the right statistical tools to answer the right business questions.
Now it’s your turn.
9 Practice Problems
9.1 Problem Set 1: TechFlow Product Analysis
Given TechFlow’s Q4 product sales:
- Product A: $890,000
- Product B: $640,000
- Product C: $520,000
- Product D: $250,000
Calculate:
1. Mean and median product sales—are they different? What does this suggest?
2. Standard deviation and coefficient of variation
3. Z-score for each product
4. Which products are statistical outliers (use |z| > 2 criterion)?
5. If you could only invest in improving one product, which one and why?
9.2 Problem Set 2: Customer Order Analysis
TechFlow’s customer order data shows:
- 25th percentile: $175
- Median: $220
- 75th percentile: $310
- Mean: $245
Answer:
1. Calculate the IQR
2. Calculate outlier boundaries using the IQR method
3. Is the distribution skewed? Which direction? How do you know?
4. What minimum order value qualifies for the top 10% (VIP program)?
5. Should TechFlow use mean or median when reporting “typical order value” in marketing materials? Explain your reasoning.
9.3 Problem Set 3: Regional Performance
TechFlow’s regional Q4 sales:
- North America: $1,400,000 (782 customers)
- Europe: $650,000 (334 customers)
- Asia-Pacific: $250,000 (131 customers)
Calculate and Analyze:
1. Mean and standard deviation of regional sales
2. Coefficient of variation
3. Revenue per customer for each region
4. Z-score for each region
5. Write a one-paragraph recommendation to Sarah about regional strategy, citing specific statistics.
9.4 Problem Set 4: Investment Returns
An investment has the following annual returns over 5 years:
Year 1: +15%, Year 2: -5%, Year 3: +20%, Year 4: +10%, Year 5: -8%
Calculate:
1. Arithmetic mean return
2. Geometric mean return
3. Which measure correctly represents average annual return? Why?
4. If you invested $10,000 initially, what’s your ending value after 5 years?
5. Why would using arithmetic mean lead to wrong conclusions?
9.5 Problem Set 5: Service Quality Analysis
Response times (hours) for 200 customer tickets:
- Mean: 7.5 hours
- Median: 4.2 hours
- Q1: 2.5 hours, Q3: 6.0 hours
- Standard deviation: 8.2 hours
- Correlation with satisfaction: r = -0.68
Answer:
1. Is the distribution skewed? Provide two pieces of evidence.
2. Calculate the IQR and outlier boundaries
3. Should the company report mean or median in their public customer service metrics? Why?
4. If management sets “95% of tickets under 12 hours” as a goal, use Chebyshev’s theorem to assess if this is realistic
5. What does the correlation of -0.68 tell you? What action would you recommend?
10 Excel Functions Quick Reference
Measures of Location:
=AVERAGE(range) ' Mean
=MEDIAN(range) ' Median
=MODE.SNGL(range) ' Mode
=PERCENTILE.INC(range, k) ' kth percentile (k = 0.25 for 25th)
=QUARTILE.INC(range, q) ' Quartiles (q = 1, 2, or 3)
=GEOMEAN(range) ' Geometric mean
Measures of Variability:
=MAX(range) - MIN(range) ' Range
=VAR.S(range) ' Sample variance
=STDEV.S(range) ' Sample standard deviation
=(STDEV.S(range)/AVERAGE(range))*100 ' CV (%)
=QUARTILE.INC(range,3)-QUARTILE.INC(range,1) ' IQR
Distribution Measures:
=SKEW(range) ' Skewness
=STANDARDIZE(x, mean, std) ' Z-score
=(value-AVERAGE(range))/STDEV.S(range) ' Z-score alternative
Association:
=COVARIANCE.S(array1, array2) ' Sample covariance
=CORREL(array1, array2) ' Correlation coefficient
Helpful Combinations:
' Outlier detection (IQR method); Q1, Q3, and IQR are named ranges or cell references
=IF(OR(A2<Q1-1.5*IQR, A2>Q3+1.5*IQR), "Outlier", "Normal")
' Skewness check
=IF(SKEW(range)>0.5, "Right-skewed", IF(SKEW(range)<-0.5, "Left-skewed", "Symmetric"))
' Z-score alarm; z is a cell containing the z-score
=IF(ABS(z)>3, "URGENT", IF(ABS(z)>2, "Investigate", IF(ABS(z)>1, "Monitor", "Normal")))
11 Glossary
Boxplot: Visual display showing five-number summary (min, Q1, median, Q3, max) plus outliers
Coefficient of Variation (CV): Ratio of standard deviation to mean (as percentage); enables comparison across different scales
Correlation Coefficient: Standardized measure of linear association between two variables; ranges from -1 to +1
Covariance: Measure of how two variables vary together; positive = move together, negative = move opposite
Geometric Mean: Appropriate average for growth rates and returns; accounts for compounding
Interquartile Range (IQR): Range of middle 50% of data (Q3 - Q1); robust to outliers
Mean: Arithmetic average; sum divided by count
Median: Middle value when data sorted; robust to outliers and appropriate for skewed data
Outlier: Unusually high or low value; can be problem (error) or opportunity (exceptional case)
Percentile: Value below which a given percentage of data falls
Quartile: Specific percentiles dividing data into four equal parts (Q1 = 25th, Q2 = 50th, Q3 = 75th)
Range: Difference between maximum and minimum; sensitive to outliers
Skewness: Measure of distribution asymmetry; negative = left tail, positive = right tail
Standard Deviation: Square root of variance; measures typical deviation from mean
Variance: Average squared deviation from mean
Weighted Mean: Average where observations have different importance/weights
Z-score: Number of standard deviations an observation is from the mean; standardizes values for comparison
Statistics tell stories about business performance. Learn to read them, and you’ll make better decisions. Learn to tell them, and you’ll lead.