Probability and Statistics form the core of Data Science. Probability gives us the ability to quantify how likely different outcomes are given the data at hand. It is especially important in classification problems, where a Machine Learning algorithm outputs the probability that an instance belongs to each class. There is much more to probability than that, and knowing these basics is essential to fully understand how the algorithms work. That is what this article is about.

By the end of this tutorial, you will know the following:

- What is Probability space?
- Independent and Dependent events
- Marginal, Joint and conditional probability
- Correlation and Covariance
- Different types of probability distributions

**Understanding Probability Space**

A Probability Space describes all the possible outcomes of an experiment together with their probabilities. To understand this, we first need to see what an experiment is. A Random Experiment is one whose outcome cannot be predicted with certainty; instead, probabilities can be assigned to each of the possible outcomes. For example, consider rolling a fair die with 6 faces. This is a random experiment because any of the six numbers can come up, each with equal probability. The sample space for rolling a die is S = {1, 2, 3, 4, 5, 6}. An event is any subset of the sample space, such as {1} or {2, 3}.

To measure the probability of an event, we consider the number of outcomes in the event relative to the total number of possible outcomes. In this case, the probability of seeing a 1 on a single roll of a die is ⅙. The probability of an event always lies between 0 and 1, and the probabilities of all the disjoint outcomes in the sample space must sum to 1. The higher the probability of an event, the more likely it is to occur. In binary classification, we generally take 0.5 as the threshold: if the predicted probability for an instance is greater than 0.5, it is assigned to class A, and if it is less than 0.5, it is assigned to class B.
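As a minimal sketch of these ideas, the snippet below models the die's sample space as a Python set and computes event probabilities under the equally-likely-outcomes assumption (the `probability` helper is illustrative, not a standard library function):

```python
from fractions import Fraction

# Sample space for a fair six-sided die
sample_space = {1, 2, 3, 4, 5, 6}

def probability(event, space):
    """Probability of an event (a subset of the sample space),
    assuming all outcomes are equally likely."""
    return Fraction(len(event & space), len(space))

p_one = probability({1}, sample_space)         # P(rolling a 1) = 1/6
p_even = probability({2, 4, 6}, sample_space)  # P(even roll) = 1/2
```

Note that the six singleton events {1}, {2}, ..., {6} are disjoint and their probabilities sum to exactly 1, as the rule above requires.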

**Independent and Dependent events**

Consider 2 events, A and B. When the probability of occurrence of event A doesn’t depend on the occurrence of event B, then A and B are independent events. For eg., if you have 2 fair coins, then the probability of getting heads on both the coins will be 0.5 for both. Hence the events are independent.

Now consider a box containing 5 balls — 2 black and 3 red. The probability of drawing a black ball first will be 2/5. Now the probability of drawing a black ball again from the remaining 4 balls will be 1/4. In this case, the two events are dependent as the probability of drawing a black ball for the second time depends on what ball was drawn on the first go.
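The ball-drawing calculation can be written out directly. This is just the arithmetic from the paragraph above, chained with the multiplication rule for dependent events:

```python
from fractions import Fraction

# Box with 2 black and 3 red balls, drawn without replacement
p_black_first = Fraction(2, 5)               # 2 black balls out of 5
p_black_second_given_first = Fraction(1, 4)  # 1 black ball left out of 4

# Chain rule: P(both black) = P(first black) * P(second black | first black)
p_both_black = p_black_first * p_black_second_given_first  # 1/10
```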

## Marginal, Joint and conditional probability

**Marginal Probability:** It is the probability of an event irrespective of the outcomes of other random variables, e.g. P(A) or P(B). In the die case, it will be the probability of 1 occurring, or 2 occurring, etc. These can be then depicted as P(1), P(2), etc.

**Joint Probability:** is the probability of two different events occurring at the same time, i.e., two (or more) simultaneous events, e.g. P(A and B) or P(A, B). When the events are independent, the joint probability is simply P(A) * P(B). In our die example, the two rolls are independent, so the probability of getting a 6 on both rolls is the joint probability P(6, 6) = ⅙ * ⅙ = 1/36 ≈ 0.0278.
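A quick sketch of the double-six calculation, using exact fractions to avoid rounding:

```python
from fractions import Fraction

p_six = Fraction(1, 6)        # P(rolling a 6) on one fair roll

# The two rolls are independent, so the joint probability multiplies:
p_double_six = p_six * p_six  # P(6, 6) = 1/36 ≈ 0.0278
```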

**Conditional Probability:** is the probability of one (or more) events given the occurrence of another event; in other words, it is the probability of an event A occurring when a secondary event B is known to have occurred, e.g. P(A given B) or P(A | B). This is calculated as P(A and B)/P(B).
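To make the formula concrete, here is a small sketch with two events on the die's sample space (the events A and B below are chosen for illustration):

```python
from fractions import Fraction

space = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}  # event A: the roll is even
B = {4, 5, 6}  # event B: the roll is greater than 3

def p(event):
    """Probability of an event, assuming equally likely outcomes."""
    return Fraction(len(event), len(space))

# P(A | B) = P(A and B) / P(B)
p_a_given_b = p(A & B) / p(B)  # (2/6) / (3/6) = 2/3
```

Intuitively: once we know the roll is greater than 3, two of the three remaining outcomes {4, 5, 6} are even, giving 2/3.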

**Correlation and Covariance**

Correlation and Covariance are often confused with each other. Both measure the relationship and the dependency between two variables, but they tell us different things about the data.

**Covariance** gives the direction of the linear relationship between 2 variables. If the covariance is 0, the variables have no linear relationship with each other. If it is positive, one variable tends to increase as the other increases (directly proportional), and if it is negative, one variable tends to decrease as the other increases (inversely proportional).

**Correlation**, on the other hand, also captures the magnitude of the relation between 2 variables. Correlation is nothing but the covariance divided by the product of the standard deviations of the two variables. Hence, the value of correlation lies between -1 and 1, with -1 being a perfect inverse relationship and 1 being a perfect direct relationship. Two perfectly correlated variables carry the same information, so keeping both won't add anything new to the data.
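A minimal sketch with NumPy (assuming it is installed): a variable `y` that is a perfect linear function of `x` has positive covariance and a correlation of exactly 1.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x + 1                      # y increases perfectly linearly with x

cov_xy = np.cov(x, y)[0, 1]        # positive: x and y move together
corr_xy = np.corrcoef(x, y)[0, 1]  # 1.0 for a perfect linear relation
```

Unlike the covariance (whose scale depends on the units of `x` and `y`), the correlation is dimensionless, which is what makes it comparable across variable pairs.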

**Different types of probability distributions**

Suppose you draw a random sample from a population that measures the heights of the people in a city. As you measure the heights, you can create a distribution of values of the heights. Say, a lot of people lie in the range of 130-150 cm, fewer lie in the high range of 160-180 cm and fewer in the low range of 100-130 cm. As a result, you are most likely to pick a person at random with a height around 130-150cm, where the mean lies.

When plotting the histogram with heights on the X-axis, a curve can be drawn to approximate that histogram distribution. Where the curve is taller, there are more data points and hence a higher probability of drawing a value in that range. This comes in extremely handy when we need to calculate the probability of drawing certain values from a sample.

Therefore, the probability distribution of a variable tells us how probability is spread across the possible values of that variable. Once we know the distribution of a variable, it can greatly help us in building a Machine Learning model. Let's take a look at some common probability distributions.

### Uniform Distribution

A variable follows a Uniform distribution when every outcome has the same probability of occurrence. For example, in the die example, the probability of each number is ⅙. In general, if the variable is discrete, the probability of each outcome is 1/n, where n is the total number of possible outcomes.
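The die example can be checked empirically: simulating many rolls with Python's `random` module, each face should appear with frequency close to 1/6 (the sample size below is an arbitrary choice):

```python
import random

random.seed(0)  # fixed seed so the simulation is reproducible
n = 60_000
rolls = [random.randint(1, 6) for _ in range(n)]  # each face equally likely

# Empirical frequency of each face; all should be near 1/6 ≈ 0.167
freqs = {face: rolls.count(face) / n for face in range(1, 7)}
```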

### Bernoulli Distribution

This distribution is used for a discrete variable with exactly two possible outcomes. It has a single parameter p: if the probability of one outcome is p, then the probability of the other outcome is 1-p. For example, if the probability of a person defaulting on a loan is 0.2, the probability of them not defaulting on the loan is 1-0.2=0.8.
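A sketch of a Bernoulli trial for the loan example (the 0.2 default probability is the illustrative figure from the text, and the `bernoulli` helper is our own, not a library function):

```python
import random

random.seed(42)
p_default = 0.2  # assumed probability that a borrower defaults

def bernoulli(p):
    """One Bernoulli trial: returns 1 with probability p, else 0."""
    return 1 if random.random() < p else 0

samples = [bernoulli(p_default) for _ in range(100_000)]
empirical_p = sum(samples) / len(samples)  # should be close to 0.2
```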

### Normal Distribution

The height example above is a classic case of the Normal Distribution. When you plot a histogram of the heights, the curve that approximates it is bell-shaped and symmetric about the mean: most values cluster around the mean (130-150 cm in our example), and the frequency falls off towards both extremes. Where the curve is taller, there are more data points and hence a higher probability, which comes in extremely handy when we need to calculate the probabilities of drawing certain values from a sample.
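As a sketch, we can simulate such a sample with `random.gauss` (the mean of 140 cm and standard deviation of 15 cm are hypothetical values chosen to roughly match the ranges in the text):

```python
import random
import statistics

random.seed(1)
# Hypothetical heights (cm): normal with mean 140 and std dev 15
heights = [random.gauss(140, 15) for _ in range(50_000)]

mean_h = statistics.mean(heights)  # close to 140
std_h = statistics.stdev(heights)  # close to 15

# For a normal distribution, about 68% of values fall within
# one standard deviation of the mean (here, 125 cm to 155 cm)
within_one_sd = sum(1 for h in heights if 125 <= h <= 155) / len(heights)
```

The concentration of mass near the mean is exactly why a randomly picked person is most likely to have a height near 130-150 cm.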

**Before you go**

We covered most of the probability basics in this tutorial. When it comes to applying Data Science, knowledge of these concepts helps a great deal in getting more insights out of the data. This in turn helps us build better models and predict better outcomes.