Binary Data

Author: Ernad Mujakic
Date: 2025-07-08


Binary data is a type of Categorical Data with only two possible values, typically 1 and 0. Binary data is a specific type of Nominal Data, meaning the values are non-numeric, and qualitative.

Representation

  • Presence: Binary data is commonly used to represent the presence or absence of a specific attribute, 1 meaning the attribute is present, 0 meaning the attribute is absent.
  • Truth: Binary data is also commonly used to represent truth, where 1 indicates truth, and 0 indicated falsehood.
  • Opposing Categories: Binary data is commonly used to represent 2 opposing categories, such as male/female, or employed/unemployed.

No Numerical Significance

Since nominal data lacks numerical significance, data operations such as Mean or Median cannot be performed. However, the frequency of nominal data values can be analyzed, and a measure like the Mode can describe the most common category within a given dataset.


Symmetry

A binary attribute is symmetric if each state 0 or 1 is equally valuable, such as a gender attribute. A binary attribute is asymmetric if the two states are not equally valuable, such as the results of a disease test.

Asymmetric binary attributes require careful consideration in Machine Learning tasks to ensure the model understands the underlying implications of the data. Common strategies include:

  • Label Encoding: Label encoding assigns numerical values to binary attributes. Each binary class can be assigned a value representative of its importance, for example, assigning the true class a value of , while the false class is assigned a value of .
  • Class Weighting: In binary classification tasks, assigning different weights to each state during model training can put more emphasis on a particular class.

Similarity

Calculating the similarity and dissimilarity of binary attributes involves different methods depending on whether the attribute is symmetric or not.

Symmetric

For symmetric binary attributes, common measures include:

  • Jaccard Coefficient: The size of the intersection of two sets over their union:

    Where:

  • is the number of attributes where and are 1.

  • is the number of attributes where A is 1 and B is 0

  • is the number of attributes where A is 0 and B is 1

  • Dice Coefficient: Twice the size of the intersection of two sets over the sum of the sets:

Asymmetric

For asymmetric binary attributes, common measures include:

  • Weighted Jaccard Coefficient: The Jaccard coefficient can be modified to emphasize a particular class:
    Where:
  • ​ is 1 if the element is present in set and 0 if absent.
  • ​ is 1 if the element is present in set and 0 if absent.
  • ​ is the weight assigned to the element .

Regression

Binary Regression estimates a function that maps one or more independent variables to a single dependent binary variable. Common techniques include:

  • Logistic Regression: A statistical method used for binary classification, predicting the probability that a given input vector belongs to a certain binary category. It is essentially a Regression model with a Logistic Function applied to map the output to a value between 0 and 1.

  • Probit Regression: Similar to logistic regression, though, it assumes that errors between the predicted and actual values are normally distributed. Probit regression also uses the Probit Link Function rather than the logistic function.


References