Statistics and Probability
MYP Unit Framework
Key Concept: LOGIC
Related Concepts: Representation. Validity. Patterns.
Global Context: Identities and Relationships (How do statistical reasoning and probability help us understand ourselves, our communities, and our decisions?)
Statement of Inquiry: Statistical reasoning enables us to make evidence-based decisions while understanding the limitations and ethical dimensions of data.
Inquiry Questions
| Type | Question |
|---|---|
| Factual | What are mean, median, and mode? What is standard deviation? What is a probability distribution? What is the difference between theoretical and experimental probability? |
| Conceptual | How do we know if a pattern is REAL or just RANDOM? How does SAMPLE SIZE affect the reliability of a conclusion? How can statistics be used to DECEIVE? |
| Debatable | Can statistics EVER 'prove' anything — or only suggest PROBABILITIES? Should ALGORITHMS use statistical profiles to make decisions about people (e.g., loan approvals, hiring, criminal sentencing)? |
1. Descriptive Statistics — Summarising Data
Measures of Central Tendency
| Measure | What it tells you | When to use |
|---|---|---|
| Mean | The ARITHMETIC average — sum of all values divided by number of values | When data is SYMMETRIC and has no extreme OUTLIERS |
| Median | The MIDDLE value when data is ordered | When data is SKEWED or has OUTLIERS (e.g., income data) |
| Mode | The MOST FREQUENT value | For CATEGORICAL data or identifying the most common value |
'The mean is SENSITIVE to outliers — one extreme value can DRAMATICALLY skew it. The median is ROBUST. This is why median income is USUALLY more meaningful than mean income when discussing economic inequality.'
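The effect of a single outlier can be checked with a short Python sketch. The income figures below are invented purely for illustration and are not real data.

```python
from statistics import mean, median

# Hypothetical annual incomes (in thousands) for a small group; invented values.
incomes = [28, 31, 33, 35, 38, 40, 42]
# The same group after one extreme earner joins it.
incomes_with_outlier = incomes + [900]

mean_before = mean(incomes)
median_before = median(incomes)
mean_after = mean(incomes_with_outlier)
median_after = median(incomes_with_outlier)

print(f"Without outlier: mean = {mean_before:.1f}, median = {median_before}")
print(f"With outlier:    mean = {mean_after:.1f}, median = {median_after}")
```

In this invented group the mean roughly quadruples when the outlier is added, while the median moves only from 35 to 36.5.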
Measures of Dispersion (Spread)
Range: MAX − MIN. 'Simple and USEFUL, but easily distorted by outliers.'
Interquartile range (IQR): Q3 − Q1 (the middle 50% of data). 'More ROBUST than the range — it IGNORES the top and bottom 25%.'
Standard deviation (σ): 'Roughly the TYPICAL distance of a data point from the mean, calculated as the square root of the average SQUARED deviation from the mean. A SMALL standard deviation means data is CLUSTERED around the mean. A LARGE standard deviation means data is SPREAD OUT.'
Variance: σ². 'Standard deviation squared. Used in more advanced statistics because it has NICER mathematical properties (for example, the variances of INDEPENDENT variables add together).'
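A minimal Python sketch of the four measures of spread is given below. It uses an invented data set and makes two assumptions: the data is treated as a whole population (not a sample), and quartiles are found by the "median of each half" convention. Calculators and software use several different quartile conventions, so IQR values can differ slightly between tools.

```python
from statistics import median, pstdev, pvariance

# Invented data set, sorted so that the halves can be sliced directly.
data = sorted([2, 4, 4, 4, 5, 5, 7, 9])

# Range: largest value minus smallest value.
data_range = max(data) - min(data)

# Quartiles by the "median of each half" convention (an assumption;
# other conventions exist). For an odd number of values the middle
# value is left out of both halves.
half = len(data) // 2
q1 = median(data[:half])
q3 = median(data[-half:])
iqr = q3 - q1

# Population standard deviation and variance (divide by n, not n - 1).
sigma = pstdev(data)
variance = pvariance(data)

print(f"range = {data_range}, IQR = {iqr}")
print(f"standard deviation = {sigma}, variance = {variance}")
```

For a sample drawn from a larger population, `statistics.stdev` and `statistics.variance` divide by n − 1 instead and would give slightly larger values.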
Visual Representations
- Histograms: Show the FREQUENCY distribution of continuous data.
- Box-and-whisker plots: Show median, quartiles, and outliers in a SINGLE diagram.
- Bar charts: For CATEGORICAL data.
- Scatter plots: Show RELATIONSHIP between TWO variables.
'CHOOSING the right visualisation is as IMPORTANT as doing the calculation. A good graph can REVEAL patterns that numbers alone HIDE. A bad graph can MISLEAD.'
2. Probability — Quantifying Uncertainty
The Language of Probability
'Probability is a NUMBER between 0 and 1 that measures how LIKELY an event is. 0 = IMPOSSIBLE. 1 = CERTAIN. 0.5 = EVEN CHANCE.'
Experimental (empirical) probability: P(event) = (number of trials in which the event occurred)/(total number of trials). 'Based on ACTUAL data: the more trials, the more RELIABLE the estimate.'
Theoretical probability: P(event) = (number of favourable outcomes)/(total number of POSSIBLE outcomes). 'Based on LOGIC — assumes all outcomes are EQUALLY LIKELY.'
The Law of Large Numbers
'As the number of trials INCREASES, the experimental probability tends to get CLOSER to the theoretical probability. If you flip a fair coin 10 times, you might get 7 heads (70%). If you flip it 1000 times, the PROPORTION of heads will almost certainly be much closer to 50%, even though the count is unlikely to be exactly 500.'
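The Law of Large Numbers can be explored with a simulation. The sketch below uses Python's pseudo-random number generator as a stand-in for a fair coin. The helper name `proportion_of_heads` and the seed value are illustrative choices, and the exact proportions printed depend on the seed.

```python
import random

def proportion_of_heads(num_flips, rng):
    """Simulate num_flips fair coin flips; return the fraction that were heads."""
    heads = sum(rng.random() < 0.5 for _ in range(num_flips))
    return heads / num_flips

# Fixed seed so the run is repeatable; the value 42 is arbitrary.
rng = random.Random(42)

# Experimental probability of heads for increasing numbers of flips.
results = {n: proportion_of_heads(n, rng) for n in (10, 100, 1_000, 10_000, 100_000)}

for n, p in results.items():
    print(f"{n:>7} flips: proportion of heads = {p:.4f}")
```

Small runs can land well away from 0.5, while the largest runs should sit very close to it. Changing the seed changes the individual values but not the overall pattern.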
Key Rules
| Rule | Formula | Example |
|---|---|---|
| Complement rule | P(not A) = 1 − P(A) | If P(rain) = 0.3, P(no rain) = 0.7 |
| Addition rule (mutually exclusive) | P(A or B) = P(A) + P(B) | P(rolling a 2 or a 5 on a die) = 1/6 + 1/6 = 1/3 |
| Addition rule (non-mutually exclusive) | P(A or B) = P(A) + P(B) − P(A and B) | P(king or heart from a deck) = 4/52 + 13/52 − 1/52 = 16/52 = 4/13 (the king of hearts is subtracted so it is counted only once) |
| Multiplication rule (independent) | P(A and B) = P(A) × P(B) | P(two heads in a row) = 1/2 × 1/2 = 1/4 |
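The examples in the table can be checked by listing every outcome and counting. The sketch below is one way to do this in Python, using exact fractions. The helper `prob` is a name introduced here for illustration and assumes all outcomes are equally likely.

```python
from fractions import Fraction
from itertools import product

def prob(event, sample_space):
    """Favourable outcomes / possible outcomes, assuming equally likely outcomes."""
    favourable = sum(1 for outcome in sample_space if event(outcome))
    return Fraction(favourable, len(sample_space))

# Sample spaces: one die, a standard 52-card deck, and two coin flips.
die = [1, 2, 3, 4, 5, 6]
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))
two_flips = list(product("HT", repeat=2))

# Complement rule: P(not A) = 1 - P(A).
p_no_rain = 1 - Fraction(3, 10)

# Addition rule, mutually exclusive events: rolling a 2 or a 5.
p_2_or_5 = prob(lambda n: n in (2, 5), die)

# Addition rule, overlapping events: king or heart.
p_king = prob(lambda card: card[0] == "K", deck)
p_heart = prob(lambda card: card[1] == "hearts", deck)
p_king_and_heart = prob(lambda card: card == ("K", "hearts"), deck)
p_king_or_heart = prob(lambda card: card[0] == "K" or card[1] == "hearts", deck)

# Multiplication rule, independent events: two heads in a row.
p_two_heads = prob(lambda flips: flips == ("H", "H"), two_flips)

print(p_no_rain, p_2_or_5, p_king_or_heart, p_two_heads)  # 7/10 1/3 4/13 1/4
```

Counting directly gives the same answers as the formulas, which is a useful check that a rule has been applied to the right kind of events.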
Conditional Probability
P(A|B) = P(A and B)/P(B). 'The probability of A GIVEN that B has happened. This is the mathematics of UPDATING beliefs based on NEW information — and is the foundation of BAYESIAN statistics.'
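As a small worked example of the formula, suppose a fair die is rolled, A is "the roll is greater than 3" and B is "the roll is even". These events are chosen here for illustration and do not come from the unit. A minimal Python sketch:

```python
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]  # one roll of a fair die

def prob(event):
    """Probability of an event, assuming all six outcomes are equally likely."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def greater_than_3(n):  # event A
    return n > 3

def is_even(n):  # event B
    return n % 2 == 0

p_a = prob(greater_than_3)
p_b = prob(is_even)
p_a_and_b = prob(lambda n: greater_than_3(n) and is_even(n))

# Conditional probability: P(A|B) = P(A and B) / P(B).
p_a_given_b = p_a_and_b / p_b

print(f"P(A) = {p_a}, P(A|B) = {p_a_given_b}")  # P(A) = 1/2, P(A|B) = 2/3
```

Learning that the roll is even raises the probability of "greater than 3" from 1/2 to 2/3, because two of the three even numbers (4 and 6) are greater than 3.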
3. Probability Distributions
Discrete vs. Continuous
'Discrete data can ONLY take specific values (number of students in a class — 25, 26, 27...). Continuous data can take ANY value within a range (height — 165.3 cm, 165.34 cm...).'
Discrete Probability Distributions
'The BINOMIAL distribution: P(X = k) = C(n,k) × pᵏ × (1−p)ⁿ⁻ᵏ. It models the number of SUCCESSES in n INDEPENDENT trials, each with the SAME probability of success p.'
Conditions for binomial: Fixed number of trials (n). Each trial has TWO outcomes (success/failure). Probability of success (p) is CONSTANT. Trials are INDEPENDENT.
Example: 'If you flip a fair coin 10 times, the probability of getting EXACTLY 6 heads is: C(10,6) × (0.5)⁶ × (0.5)⁴ ≈ 0.205.'
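The coin example can be reproduced with a few lines of Python. The function name `binomial_pmf` is an illustrative choice; `math.comb` supplies C(n, k).

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for a binomial distribution with n trials and success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability of exactly 6 heads in 10 flips of a fair coin.
p_six_heads = binomial_pmf(6, 10, 0.5)
print(round(p_six_heads, 3))  # 0.205

# The probabilities for k = 0, 1, ..., 10 cover every possible outcome,
# so they should add up to 1.
total = sum(binomial_pmf(k, 10, 0.5) for k in range(11))
print(total)
```

Here C(10, 6) = 210 and (0.5)¹⁰ = 1/1024, so the exact value is 210/1024.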
The Normal Distribution
'The NORMAL (Gaussian) distribution is the MOST important probability distribution in statistics. It is symmetric, BELL-SHAPED, and described by TWO parameters: the mean (μ) and the standard deviation (σ).'
The 68–95–99.7 Rule:
- About 68% of data falls within 1 standard deviation of the mean (μ ± σ).
- About 95% falls within 2 standard deviations (μ ± 2σ).
- About 99.7% falls within 3 standard deviations (μ ± 3σ).
Z-scores: 'A z-score tells you how many standard deviations a value is ABOVE or BELOW the mean. z = (x − μ)/σ. Z-scores allow you to COMPARE values from DIFFERENT normal distributions.'
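The 68–95–99.7 figures and the z-score formula can both be checked with `statistics.NormalDist` from Python's standard library. In the sketch below the test scores, means, and standard deviations are hypothetical values chosen for illustration, and the helper names `share_within` and `z_score` are introduced here.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.4f}")

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

# Hypothetical student: 82 in maths (class mean 70, sd 8)
# and 75 in science (class mean 60, sd 12).
z_maths = z_score(82, mu=70, sigma=8)
z_science = z_score(75, mu=60, sigma=12)
print(z_maths, z_science)  # 1.5 1.25
```

In this made-up case the maths score has the higher z-score (1.5 against 1.25), so it is the stronger result relative to the class, even though the science score is further above its mean in raw marks (15 against 12).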
4. The Misuse of Statistics — How to Lie with Data
Common Deceptive Techniques
| Technique | Description | How to spot it |
|---|---|---|
| Cherry-picking | Presenting only data that SUPPORTS your argument | Look for data that has been OMITTED |
| Biased sample | The sample does not REPRESENT the population | Ask: How was the sample SELECTED? |
| Misleading graphs | Manipulating axes, scales, or visual elements | CHECK the axes — do they start at 0? Are they labelled? |
| Correlation presented as causation | Assuming correlation implies causation | Ask: Is there a THIRD VARIABLE? |
| Small sample sizes | Drawing conclusions from TOO LITTLE data | Look at the SAMPLE SIZE — is it LARGE enough? |
| Loaded questions | Survey questions that LEAD to a specific answer | Read the QUESTION carefully — is it NEUTRAL? |
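The "third variable" check in the table can be illustrated with a simulation. In the sketch below, which uses entirely simulated data, two quantities never influence each other but both depend on a hidden third variable. The variable names (temperature, ice cream sales, sunburn cases) are a hypothetical scenario, and `pearson_r` is a hand-written correlation helper.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

rng = random.Random(0)  # arbitrary seed, fixed for repeatability

# Hidden third variable (standardised daily temperature).
temperature = [rng.gauss(0, 1) for _ in range(1000)]

# Both quantities depend on temperature plus independent noise;
# neither has any effect on the other.
ice_cream_sales = [t + rng.gauss(0, 0.5) for t in temperature]
sunburn_cases = [t + rng.gauss(0, 0.5) for t in temperature]

r_raw = pearson_r(ice_cream_sales, sunburn_cases)

# Subtract the temperature effect and correlate what is left over.
sales_left_over = [s - t for s, t in zip(ice_cream_sales, temperature)]
sunburn_left_over = [c - t for c, t in zip(sunburn_cases, temperature)]
r_adjusted = pearson_r(sales_left_over, sunburn_left_over)

print(f"correlation before accounting for temperature: {r_raw:.2f}")
print(f"correlation after accounting for temperature:  {r_adjusted:.2f}")
```

The raw correlation is strong, yet it largely disappears once the shared cause is removed, which is why a correlation on its own cannot show that one variable causes the other.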
Real-World Examples
'In the 1950s, tobacco companies used statistics to ARGUE that smoking did not cause cancer, by cherry-picking studies and attacking methodology. Similar tactics have since been used to cast doubt on CLIMATE science and, in some cases, to promote PHARMACEUTICAL products. Statistical literacy is NOT just a mathematical skill; it is a CIVIC RESPONSIBILITY.'
5. Ethical Dimensions of Data
Data and Power
'Data is POWER. Those who COLLECT data — governments, corporations, social media platforms — have ENORMOUS influence over our lives. They know our habits, our preferences, our locations, our health, our social networks. The question is: How is this data being USED?'
Algorithmic bias: 'Algorithms trained on HISTORICAL data can PERPETUATE and AMPLIFY existing biases in hiring, lending, policing, and criminal sentencing. A 2016 ProPublica analysis reported that COMPAS, an algorithm used to predict recidivism, was BIASED against Black defendants, who were more often wrongly flagged as high risk. The data REFLECTED historical inequalities, and the algorithm CODIFIED them.'
Statistical discrimination: 'Using GROUP averages to make decisions about INDIVIDUALS — "People from this postal code are less likely to repay loans." This is EFFICIENT for the bank — but UNFAIR to the individual who does not fit the pattern.'
Your Summative Assessment — The Statistical Investigation
Task: Conduct a STATISTICAL INVESTIGATION into a QUESTION that interests you.
- Collect DATA (minimum 30 data points per group if comparing groups, or 50 data points for a single group).
- Use DESCRIPTIVE STATISTICS to summarise your data (mean, median, mode, range, IQR, standard deviation).
- Create at least TWO VISUALISATIONS (histogram, box plot, scatter plot).
- If comparing groups, calculate PROBABILITIES or use a statistical test.
- Write a 1000–1200 word REPORT: What was your research question? How did you collect data? What did you find? What are the LIMITATIONS of your investigation? What are the ETHICAL CONSIDERATIONS?
'This investigation MIRRORS the IB DP Mathematics IA process: choose a question, collect data, analyse, conclude, and evaluate. The skills you develop here are DIRECTLY transferable.'
ATL Skills
| Skill | Focus |
|---|---|
| Critical Thinking | Evaluating statistical claims. Distinguishing correlation from causation. |
| Research | Collecting data ethically. Choosing appropriate statistical methods. |
| Communication | Writing a structured statistical report with clear visualisations. |
| Information Literacy | Evaluating the quality and bias of statistical claims in media. |
Formative Assessments
| Assessment | Focus |
|---|---|
| Descriptive statistics problem set | Calculate mean, median, mode, range, IQR, and standard deviation for given data sets. |
| Probability problem set | Solve probability problems using the addition and multiplication rules. |
| Data visualisation task | Create a histogram, box plot, and scatter plot from real data. |
| Misleading graphs analysis | Find and critique a misleading graph from media or advertising. |
Interdisciplinary Connections
- Sciences: Experimental design, data analysis, significance testing.
- Psychology: Statistical methods in psychological research, cognitive biases.
- Economics: Economic data analysis, probability in financial markets.
- TOK: What does it mean for a statistical result to be 'significant'? Can statistics produce KNOWLEDGE?
Service as Action
- Data literacy workshop: Design and deliver a workshop for younger students on identifying misleading statistics.
- School climate survey: Design, administer, and analyse a survey on an issue in the school community.
- Data for a local NGO: Help a community organisation collect and analyse data for their advocacy work.
IB Learner Profile Attributes
| Attribute | How This Unit Develops It |
|---|---|
| Thinkers | Students critically evaluate statistical claims and evidence. |
| Principled | Students consider the ethical dimensions of data collection and use. |
| Inquirers | Students formulate questions and investigate through data. |
| Knowledgeable | Students understand statistical concepts and their applications. |
Self-Test Questions
- Calculate the mean, median, mode, range, IQR, and standard deviation for: 4, 7, 8, 9, 10, 12, 15.
- When would you use the MEDIAN instead of the MEAN? Provide an example.
- Draw a normal distribution and label the 68–95–99.7 regions.
- A bag contains 3 red, 4 blue, and 5 green marbles. Two marbles are drawn WITHOUT replacement. What is the probability they are BOTH red?
- Explain the difference between experimental and theoretical probability. State the Law of Large Numbers.
- List FIVE ways statistics can be MISLEADING. Provide a real-world example of ONE of these.
- 'Algorithms should not be used to make important decisions about people because they reflect historical biases.' Write a paragraph arguing FOR or AGAINST this position.
