Variance

Author: Ernad Mujakic
Date: 2025-07-10


Variance is a Measure of Dispersion that summarizes the spread of values in a dataset from the average. It represents the average squared distance from each data point to the mean of the dataset.

Interpretation

  • Low Variance: Indicates that the data points tend to be very close to the mean, suggesting consistency and reliability in the dataset.

  • High Variance: Implies that data points are more spread out across the range, indicating greater variability and less predictability.

HighVsLowSTD.png

Properties

  • Non-Negative: The variance can never be negative, since it is an average distance measure and distance can also never be negative.

Calculation

The variance of a numeric attribute is defined as:

Population Variance

For a population, the variance is calculated using:

Where:

  • represents the population variance.
  • represents the mean of the population.
  • represents the number of observations.

Sample Variance

For a sample, the variance is calculated using:

Where:

  • represents the sample variance.
  • represents the mean of the sample.
  • represents the number of observations.

Relation to Standard Deviation

The standard deviation is equal to the square root of the variance of the same dataset. Standard deviation is expressed in the same units as the original data, while variance is expressed in squared units.


The expected value, or mean, of a random variable , denoted , is a measure of central tendency that represents the average outcome of a random variable. Intuitively, it is the average of the outcomes of many samples from .

Variance of Random Variables

The variance of a random variable measures the dispersion of a random variable around its expected value . It is defined as the expected value of the squared differences from the mean:

Where:

  • is the expected value of .

Covariance is vital in understanding the relationships between random variables in probability distributions. The value of the covariance between two variables is represents the direction of the linear relationship between them.

A Covariance Matrix is a square matrix containing the covariance between multiple variables.

Formula

The formula for calculating the covariance is:

Where:

  • and represent the mean of and respectively.
  • represents the number of samples in the dataset.

Interpretation

  • Positive Covariance: Indicates that as increases, increases.
  • Negative Covariance: Indicates that as increases, decreases.
  • Zero/Near-Zero Covariance: Indicates no linear relationship between and .

Covariance is a key step in calculating correlation, which normalizes the covariance value to a standard scale. Correlation is useful for assessing the strength and direction of the relationship between two variables.


A dataset is homoscedastic if the variance of the residuals (errors) is constant across the range of the independent variable(s). Conversely, if the variance changes as a function of the independent variable, the dataset is heteroscedastic.

Regression Analysis

Homoscedasticity is an important concept in Regression analysis as it is an essential assumption for many regression models such as Linear Regression based on the Least Squares method.

Statistical Tests

  • Residuals vs. Fitted Values Plot: A scatter plot of residuals against the predicted values. A random scatter indicates that the dataset is homoscedastic.
  • Breusch-Pagan Test: Creates an initial regression model, then fits a new model on the squared residuals of original model against the independent variable. If the corresponding P-Value from the Chi-Squared Test is less than some chosen significance level, then heteroscedasticity is assumed.
  • White Test: A more general test which can detect non-linear heteroscedastic relationships. The methodology is similar to that of the Breusch-Pagan test, though the White test involves regressing the squared residuals against the original independent variables, their squares, and their cross-products.

Addressing Heteroscedasticity


References