Statistics and Probability

MYP Unit Framework

Key Concept: LOGIC
Related Concepts: Representation, Validity, Patterns
Global Context: Identities and Relationships (How do statistical reasoning and probability help us understand ourselves, our communities, and our decisions?)
Statement of Inquiry: Statistical reasoning enables us to make evidence-based decisions while understanding the limitations and ethical dimensions of data.


Inquiry Questions

Type | Question
Factual | What are mean, median, and mode? What is standard deviation? What is a probability distribution? What is the difference between theoretical and experimental probability?
Conceptual | How do we know if a pattern is REAL or just RANDOM? How does SAMPLE SIZE affect the reliability of a conclusion? How can statistics be used to DECEIVE?
Debatable | Can statistics EVER 'prove' anything, or only suggest PROBABILITIES? Should ALGORITHMS use statistical profiles to make decisions about people (e.g., loan approvals, hiring, criminal sentencing)?

1. Descriptive Statistics — Summarising Data

Measures of Central Tendency

Measure | What it tells you | When to use
Mean | The ARITHMETIC average: the sum of all values divided by the number of values | When data is SYMMETRIC and has no extreme OUTLIERS
Median | The MIDDLE value when data is ordered | When data is SKEWED or has OUTLIERS (e.g., income data)
Mode | The MOST FREQUENT value | For CATEGORICAL data or identifying the most common value

'The mean is SENSITIVE to outliers — one extreme value can DRAMATICALLY skew it. The median is ROBUST. This is why median income is USUALLY more meaningful than mean income when discussing economic inequality.'
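
The effect of a single outlier can be checked with a short Python sketch. The income figures below are made up for illustration (they are not real data), and the variable names are only suggestions.

```python
from statistics import mean, median

# Hypothetical monthly incomes, invented for this illustration.
incomes = [2100, 2300, 2500, 2700, 2900]
incomes_with_outlier = incomes + [50000]  # one very high earner joins the group

mean_before = mean(incomes)
median_before = median(incomes)
mean_after = mean(incomes_with_outlier)
median_after = median(incomes_with_outlier)

print("Without outlier: mean =", mean_before, "median =", median_before)
print("With outlier:    mean =", round(mean_after, 2), "median =", median_after)
```

In this made-up data set the mean jumps from 2500 to about 10417, while the median only moves from 2500 to 2600, which is the sense in which the median is ROBUST.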

Measures of Dispersion (Spread)

Range: MAX − MIN. 'Simple and quick to calculate, but easily distorted by outliers.'

Interquartile range (IQR): Q3 − Q1 (the middle 50% of data). 'More ROBUST than the range — it IGNORES the top and bottom 25%.'

Standard deviation (σ): 'Roughly the TYPICAL distance of a data point from the mean (strictly, the square root of the average SQUARED distance from the mean). A SMALL standard deviation means data is CLUSTERED around the mean. A LARGE standard deviation means data is SPREAD OUT.'

Variance: σ². 'Standard deviation squared. Used in more advanced statistics because it has NICER mathematical properties.'
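
The four measures of spread can be computed together, as in the minimal sketch below. The data set is hypothetical. Two assumptions are worth flagging: the quartiles use the "median of each half" convention (textbooks and software use several slightly different conventions, so IQR values can differ), and `pstdev`/`pvariance` treat the data as a whole population, matching the σ notation above. The helper name `quartiles` is made up for this example.

```python
from statistics import median, pstdev, pvariance

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumption: the middle value is left out of both halves when the
    count is odd. Needs at least two values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[len(ordered) - half:]
    return median(lower), median(upper)

# Hypothetical data set chosen so the results are round numbers.
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)
q1, q3 = quartiles(data)
iqr = q3 - q1
sigma = pstdev(data)        # population standard deviation
variance = pvariance(data)  # population variance, equal to sigma squared

print("Range:", data_range)
print("Q1, Q3, IQR:", q1, q3, iqr)
print("Standard deviation:", sigma)
print("Variance:", variance)
```

For this data the range is 7, the IQR is 2, the standard deviation is 2 and the variance is 4. If the data were a SAMPLE from a larger population, `statistics.stdev` (which divides by n − 1) would be the usual choice instead.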

Visual Representations

  • Histograms: Show the FREQUENCY distribution of continuous data.
  • Box-and-whisker plots: Show median, quartiles, and outliers in a SINGLE diagram.
  • Bar charts: For CATEGORICAL data.
  • Scatter plots: Show RELATIONSHIP between TWO variables.

'CHOOSING the right visualisation is as IMPORTANT as doing the calculation. A good graph can REVEAL patterns that numbers alone HIDE. A bad graph can MISLEAD.'


2. Probability — Quantifying Uncertainty

The Language of Probability

'Probability is a NUMBER between 0 and 1 that measures how LIKELY an event is. 0 = IMPOSSIBLE. 1 = CERTAIN. 0.5 = EVEN CHANCE.'

Experimental (empirical) probability: P(event) = (number of trials in which the event occurred)/(total number of trials). 'Based on ACTUAL data: the more trials, the more RELIABLE the estimate.'

Theoretical probability: P(event) = (number of favourable outcomes)/(total number of POSSIBLE outcomes). 'Based on LOGIC — assumes all outcomes are EQUALLY LIKELY.'

The Law of Large Numbers

'As the number of trials INCREASES, the experimental probability tends to get CLOSER to the theoretical probability. If you flip a fair coin 10 times, you might get 7 heads (70%). If you flip it 1000 times, the PROPORTION of heads will very likely be much closer to 50%, even though the COUNT of heads may still miss 500 by a dozen or more.'
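
The Law of Large Numbers can be explored with a simple simulation. The sketch below is illustrative only: it uses Python's pseudo-random generator as a stand-in for a fair coin, and the seed value 42 is arbitrary (it just makes reruns repeatable). The function name `proportion_of_heads` is made up for this example.

```python
import random

def proportion_of_heads(flips, rng):
    """Simulate `flips` fair coin flips and return the fraction that are heads."""
    heads = sum(rng.random() < 0.5 for _ in range(flips))
    return heads / flips

rng = random.Random(42)  # arbitrary seed, used only so results are repeatable
for flips in (10, 100, 1_000, 10_000, 100_000):
    print(flips, "flips: proportion of heads =", proportion_of_heads(flips, rng))
```

Running this with different seeds should show the proportion wandering noticeably for 10 flips and settling near 0.5 for the largest runs. Any single run can still be unlucky, which is why the law describes a tendency rather than a guarantee for a fixed number of trials.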

Key Rules

Rule | Formula | Example
Complement rule | P(not A) = 1 − P(A) | If P(rain) = 0.3, P(no rain) = 0.7
Addition rule (mutually exclusive) | P(A or B) = P(A) + P(B) | P(rolling a 2 or a 5 on a die) = 1/6 + 1/6 = 1/3
Addition rule (non-mutually exclusive) | P(A or B) = P(A) + P(B) − P(A and B) | P(king or heart from a deck) = 4/52 + 13/52 − 1/52 = 16/52
Multiplication rule (independent) | P(A and B) = P(A) × P(B) | P(two heads in a row) = 1/2 × 1/2 = 1/4
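
Each rule in the table can be checked by listing the equally likely outcomes and counting. The sketch below does this with exact fractions. The helper `prob` and the way the deck is represented are choices made for this example, not part of the unit.

```python
from fractions import Fraction
from itertools import product

def prob(event, outcomes):
    """Theoretical probability: favourable outcomes / equally likely outcomes."""
    outcomes = list(outcomes)
    favourable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favourable, len(outcomes))

# Complement rule: P(no rain) = 1 - P(rain)
p_no_rain = 1 - Fraction(3, 10)

# Addition rule (mutually exclusive): rolling a 2 or a 5
die = range(1, 7)
p_two_or_five = prob(lambda face: face == 2, die) + prob(lambda face: face == 5, die)

# Addition rule (non-mutually exclusive): king or heart from a 52-card deck
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = list(product(ranks, suits))

def is_king(card):
    return card[0] == "K"

def is_heart(card):
    return card[1] == "hearts"

p_king_or_heart = prob(lambda c: is_king(c) or is_heart(c), deck)  # direct count
p_by_rule = (prob(is_king, deck) + prob(is_heart, deck)
             - prob(lambda c: is_king(c) and is_heart(c), deck))   # using the rule

# Multiplication rule (independent): two heads in a row
two_flips = list(product("HT", repeat=2))
p_two_heads = prob(lambda flips: flips == ("H", "H"), two_flips)

print(p_no_rain, p_two_or_five, p_king_or_heart, p_by_rule, p_two_heads)
```

`Fraction` reduces automatically, so 16/52 is displayed as 4/13. The direct count and the rule agree because subtracting P(A and B) removes the king of hearts, which would otherwise be counted twice.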

Conditional Probability

P(A|B) = P(A and B)/P(B). 'The probability of A GIVEN that B has happened. This is the mathematics of UPDATING beliefs based on NEW information — and is the foundation of BAYESIAN statistics.'
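
As a small worked example of the formula, consider one roll of a fair die with A = "the roll is a 6" and B = "the roll is even". This scenario and the helper names are assumptions chosen for illustration.

```python
from fractions import Fraction

die = range(1, 7)

def prob(event):
    """Probability of an event for one roll of a fair six-sided die."""
    return Fraction(sum(1 for face in die if event(face)), len(die))

def is_six(face):
    return face == 6

def is_even(face):
    return face % 2 == 0

p_a = prob(is_six)                                         # P(A)
p_b = prob(is_even)                                        # P(B)
p_a_and_b = prob(lambda face: is_six(face) and is_even(face))  # P(A and B)
p_a_given_b = p_a_and_b / p_b                              # P(A|B)

print("P(six) =", p_a)
print("P(six | even) =", p_a_given_b)
```

Learning that the roll is even raises the probability of a six from 1/6 to 1/3, because the possible outcomes shrink from six faces to three. This is the "updating" idea in miniature.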


3. Probability Distributions

Discrete vs. Continuous

'Discrete data can ONLY take specific values (number of students in a class — 25, 26, 27...). Continuous data can take ANY value within a range (height — 165.3 cm, 165.34 cm...).'

Discrete Probability Distributions

'The BINOMIAL distribution: P(X = k) = C(n,k) × pᵏ × (1−p)ⁿ⁻ᵏ. It models the number of SUCCESSES in n INDEPENDENT trials, each with the SAME probability of success p.'

Conditions for binomial:

  • Fixed number of trials (n).
  • Each trial has TWO outcomes (success/failure).
  • Probability of success (p) is CONSTANT.
  • Trials are INDEPENDENT.

Example: 'If you flip a fair coin 10 times, the probability of getting EXACTLY 6 heads is: C(10,6) × (0.5)⁶ × (0.5)⁴ ≈ 0.205.'
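
The coin example can be reproduced directly from the binomial formula. This is a minimal sketch using only the standard library. The function name `binomial_pmf` is chosen for this example.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k): probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

p_six_heads = binomial_pmf(6, 10, 0.5)
total = sum(binomial_pmf(k, 10, 0.5) for k in range(11))

print("P(exactly 6 heads in 10 flips) =", round(p_six_heads, 3))
print("Sum over k = 0..10 =", total)
```

Here C(10,6) = 210 and (0.5)¹⁰ = 1/1024, so the result is 210/1024 ≈ 0.205. Summing the formula over every possible k should give 1, which is a useful sanity check on any probability distribution.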

The Normal Distribution

'The NORMAL (Gaussian) distribution is the MOST important probability distribution in statistics. It is symmetric, BELL-SHAPED, and described by TWO parameters: the mean (μ) and the standard deviation (σ).'

The 68–95–99.7 Rule:

  • About 68% of normally distributed data falls within 1 standard deviation of the mean (μ ± σ).
  • About 95% falls within 2 standard deviations (μ ± 2σ).
  • About 99.7% falls within 3 standard deviations (μ ± 3σ).
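
These percentages are rounded values, and they can be checked against the normal distribution itself. The sketch below uses `statistics.NormalDist` from Python's standard library (available from Python 3.8). The helper name `share_within` is made up for this example.

```python
from statistics import NormalDist

standard = NormalDist(mu=0, sigma=1)  # the standard normal distribution

def share_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return standard.cdf(k) - standard.cdf(-k)

for k in (1, 2, 3):
    print(f"Within {k} standard deviation(s): {share_within(k):.4f}")
```

The computed shares are roughly 0.6827, 0.9545 and 0.9973, which round to the 68%, 95% and 99.7% in the rule. The same shares apply to any normal distribution, whatever its μ and σ.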

Z-scores: 'A z-score tells you how many standard deviations a value is ABOVE or BELOW the mean. z = (x − μ)/σ. Z-scores allow you to COMPARE values from DIFFERENT normal distributions.'
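
To see how z-scores allow comparison, here is a short sketch. The test scores, class means and standard deviations are hypothetical numbers invented for the example.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

# Hypothetical results for one student on two different tests.
z_maths = z_score(78, mu=70, sigma=8)      # maths: class mean 70, standard deviation 8
z_science = z_score(82, mu=76, sigma=12)   # science: class mean 76, standard deviation 12

print("Maths z-score:", z_maths)
print("Science z-score:", z_science)
```

The raw science score (82) is higher, but the maths score is 1 standard deviation above its class mean while the science score is only 0.5 above, so relative to each class the maths result is the stronger one.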


4. The Misuse of Statistics — How to Lie with Data

Common Deceptive Techniques

Technique | Description | How to spot it
Cherry-picking | Presenting only data that SUPPORTS your argument | Look for data that has been OMITTED
Biased sample | The sample does not REPRESENT the population | Ask: How was the sample SELECTED?
Misleading graphs | Manipulating axes, scales, or visual elements | CHECK the axes: do they start at 0? Are they labelled?
Correlation presented as causation | Claiming that because two variables move TOGETHER, one must CAUSE the other | Ask: Is there a THIRD VARIABLE?
Small sample sizes | Drawing conclusions from TOO LITTLE data | Look at the SAMPLE SIZE: is it LARGE enough?
Loaded questions | Survey questions that LEAD to a specific answer | Read the QUESTION carefully: is it NEUTRAL?

Real-World Examples

'In the 1950s, tobacco companies used statistics to ARGUE that smoking did not cause cancer — by cherry-picking studies and attacking methodology. Today, similar tactics are used by climate change DENIERS and the PHARMACEUTICAL industry. Statistical literacy is NOT just a mathematical skill — it is a CIVIC RESPONSIBILITY.'


5. Ethical Dimensions of Data

Data and Power

'Data is POWER. Those who COLLECT data — governments, corporations, social media platforms — have ENORMOUS influence over our lives. They know our habits, our preferences, our locations, our health, our social networks. The question is: How is this data being USED?'

Algorithmic bias: 'Algorithms trained on HISTORICAL data can PERPETUATE and AMPLIFY existing biases in hiring, lending, policing, and criminal sentencing. COMPAS, an algorithm used to predict recidivism, was reported in a 2016 ProPublica analysis to be BIASED against Black defendants, who were more likely than white defendants to be wrongly labelled high risk. The data REFLECTED historical inequalities, and the algorithm CODIFIED them.'

Statistical discrimination: 'Using GROUP averages to make decisions about INDIVIDUALS — "People from this postal code are less likely to repay loans." This is EFFICIENT for the bank — but UNFAIR to the individual who does not fit the pattern.'


Your Summative Assessment — The Statistical Investigation

Task: Conduct a STATISTICAL INVESTIGATION into a QUESTION that interests you.

  • Collect DATA (minimum 30 data points per group if comparing groups, or 50 data points for a single group).
  • Use DESCRIPTIVE STATISTICS to summarise your data (mean, median, mode, range, IQR, standard deviation).
  • Create at least TWO VISUALISATIONS (histogram, box plot, scatter plot).
  • If comparing groups, calculate PROBABILITIES or use a statistical test.
  • Write a 1000–1200 word REPORT: What was your research question? How did you collect data? What did you find? What are the LIMITATIONS of your investigation? What are the ETHICAL CONSIDERATIONS?

'This investigation MIRRORS the IB DP Mathematics IA process: choose a question, collect data, analyse, conclude, and evaluate. The skills you develop here are DIRECTLY transferable.'


ATL Skills

Skill | Focus
Critical Thinking | Evaluating statistical claims. Distinguishing correlation from causation.
Research | Collecting data ethically. Choosing appropriate statistical methods.
Communication | Writing a structured statistical report with clear visualisations.
Information Literacy | Evaluating the quality and bias of statistical claims in media.

Formative Assessments

Assessment | Focus
Descriptive statistics problem set | Calculate mean, median, mode, range, IQR, and standard deviation for given data sets.
Probability problem set | Solve probability problems using the addition and multiplication rules.
Data visualisation task | Create a histogram, box plot, and scatter plot from real data.
Misleading graphs analysis | Find and critique a misleading graph from media or advertising.

Interdisciplinary Connections

  • Sciences: Experimental design, data analysis, significance testing.
  • Psychology: Statistical methods in psychological research, cognitive biases.
  • Economics: Economic data analysis, probability in financial markets.
  • TOK: What does it mean for a statistical result to be 'significant'? Can statistics produce KNOWLEDGE?

Service as Action

  • Data literacy workshop: Design and deliver a workshop for younger students on identifying misleading statistics.
  • School climate survey: Design, administer, and analyse a survey on an issue in the school community.
  • Data for a local NGO: Help a community organisation collect and analyse data for their advocacy work.

IB Learner Profile Attributes

Attribute | How This Unit Develops It
Thinkers | Students critically evaluate statistical claims and evidence.
Principled | Students consider the ethical dimensions of data collection and use.
Inquirers | Students formulate questions and investigate through data.
Knowledgeable | Students understand statistical concepts and their applications.

Self-Test Questions

  1. Calculate the mean, median, mode, range, IQR, and standard deviation for: 4, 7, 8, 9, 10, 12, 15.

  2. When would you use the MEDIAN instead of the MEAN? Provide an example.

  3. Draw a normal distribution and label the 68–95–99.7 regions.

  4. A bag contains 3 red, 4 blue, and 5 green marbles. Two marbles are drawn WITHOUT replacement. What is the probability they are BOTH red?

  5. Explain the difference between experimental and theoretical probability. State the Law of Large Numbers.

  6. List FIVE ways statistics can be MISLEADING. Provide a real-world example of ONE of these.

  7. 'Algorithms should not be used to make important decisions about people because they reflect historical biases.' Write a paragraph arguing FOR or AGAINST this position.

Verified by the tuition.in editorial team