###### The three conditionals every data scientist should know

## Conditional Expectation

*Conditional expectation of a random variable is the value that we would expect it take, on the condition that another variable that it depends on, takes up a specific value.*

**If this sounds like a mouthful, despair not. **We’ll soon get to the ‘a-ha’ moment about this concept. To aid our quest, we’ll use a Poisson distributed random variable ** Y** as our protagonist.

So let ** Y** be a Poisson distributed discrete random variable with mean rate of occurrence

*λ*.

The **P**robability **M**ass **F**unction of ** Y** looks like the following. (

**PMF**is just another name for the probability distribution of a discrete random variable):

The expected value of ** Y**, denoted

*E**(*

*Y**)*, is the sum of the product of:

- Every value of
in its range [0, ∞],*Y**and*, - Its corresponding probability
*P(**Y**=y)*, picked off from its PMF. in other words, the following:

The thing with expected value is that if you were to put your money on any value of ** Y**,

*E(*

*Y**)*is what you should bet on.

In case of ** Y** which is

*Poisson(λ)*distributed, it can be shown that

**is simply**

*E(Y)**λ*. In sample distribution shown in the above graph,

*E(*

*Y**)*= 20

*.*

If ** Y** had been a continuous valued random variable such as a normally distributed random variable,

*E(*

*Y**)*is calculated as follows:

Where *f(y)* is the **P**robability **D**ensity **F**unction (**PDF**) of ** Y** and

*[a,b]*is its range

*.*So what is **conditional **about this expectation? In this case, absolutely nothing, because ** Y**, the way we have defined it, is an independent variable i.e. its value does not depend on any other variable’s value.

We now introduce another protagonist: Variable ** X**. With

**in the picture, we can consider the situation where**

*X***depends on**

*Y***, i.e.**

*X*

*Y is some function of X:**Y** = f(**X**).*

A simple example of *f(.)* is a linear relationship between ** Y** and

**:**

*X**Y** = **β0** + **β1*X*

Where ** β0** is the intercept and

**is the slope of the straight line.**

*β1*Let’s play God for a second. We’ll claim to know the *true* relationship between ** Y** and

**Let that relation be as follows (but don’t let the regression modeler inside your head know this true relation!):**

*X.**Y** = 20 + 10***X*

Here is the plot of ** Y **against

**:**

*X*Wait, but isn’t ** Y** a

*random*variable? Thus, for each value of

**, we’d expect**

*X***to take up one or more**

*Y**random*values governed by probability distribution (PDF or PMF) of

**. And this sort of a random behavior of**

*Y***leads to the following kind of plot of**

*Y***versus**

*Y***instead of the very neat plot we saw earlier:**

*X*Recollect once again that in our example, we have assumed that ** Y** is

*Poisson(λ)*distributed

*.*It turns out that the parameter

*λ*of the Poisson distribution’s PMF forms the connective tissue between the neat straight-line plot of

**versus**

*Y***we saw earlier and the smeared-over plot of**

*X***versus**

*Y***shown above. Specifically,**

*X**λ*is the following linear function of

**:**

*X**λ(**X**) = 20 + 10***X*

But as we noted at the beginning of the article, the expected value of a *Poisson(λ) *distributed random variable i.e. *E(**Y**)*, is *simply λ.* Therefore, it follows that for some value of ** X**:

*E(**Y**|**X**) = λ(**X**), and thus:*

*E(**Y**|**X**) =20 + 10***X*

The notation *E(**Y**|**X**)* is the **conditional expectation of Y ** on

**, or the expectation of**

*X***conditioned on**

*Y***taking a specific value ‘**

*X**x’*.

For example, in the neat straight line plot of Y versus X, when *X**=6*,

*E(**Y**|**X**=6) = 20 + 10*6 = 80*.

So now we can say that when ** X**=6,

**is a Poisson distributed random variable with a mean value**

*Y**λ*of 80.

We can carry out a similar calculation for every value of ** X** to get the corresponding conditional expectation of

**on that value of**

*Y*

*X**=x*, in other words the corresponding mean

*λ*for

*X**=x*. Doing so lets us ‘superimpose’ the two

**versus**

*Y***plots as follows. The orange dots are the conditional mean values of the Poisson distributed sample of Y shown by the blue dots.**

*X*In general, if ** Y** is a random variable having a

**linear relationship**with

**we can represent this relation as follows:**

*X,**E(Y|X**=x**)** = β0 + β1*****x*

*Where E(Y|X**=x**) is called the Expectation of Y conditional upon X taking on the value **x.*

The conditional expectation, also known as the conditional mean of ** Y** on

**can also be written concisely as**

*X**E(*

*Y**|*

*X**).*

## Conditional Probability

We’ll see how the ** explanatory variable X** is about to ‘work its way’ into the Poisson distributed

**’s PMF via the parameter**

*Y**λ:*

So far we have seen that:

*E(Y|X**=x**)** = λ(**X**=x) = β0 + β1*****x*

We also know that the PMF of ** Y** is given by:

In the above equation, substitute *λ *with *β0 + β1*****x*, or indeed with any other function *f(**X**)* of ** X**, and bingo! The probability of

**taking up a particular value**

*Y**y*is now suddenly dependent on

**taking on a specific value**

*X**x*. See the updated PMF formula below:

In other words, the PMF of ** Y** has now transformed into a

**. Isn’t that neat?**

*conditional probability distribution function*** Y** can have any sort of a probability distribution, discrete, continuous, or mixed. Similarly,

**can depend on**

*Y***via any sort of a relation**

*X**f(.)*between

**and**

*Y***. Thus, we can write the conditional expectation of**

*X***on**

*Y***as follows for the case when**

*X***is a discrete random variable:**

*Y*Where *P(**Y**=y_i|**X**=x)* is the *conditional Probability **Mass **Function* of ** Y**.

If ** Y** is a continuous random variable, we can write its conditional expectation as follows:

Where *f(**Y**=y_i|**X**=x)* is the *conditional Probability **Density **Function* of *Y**.*

Conditional probability distributions and condition expectation occupy a prominent place in the specifications of regression models. We’ll see the connection between the two in the context of linear regression.

## Conditional Expectation and regression modeling

Suppose we are given a random sample of (y, x) tuples that are drawn from the population of values shown in the earlier plot:

Suppose also that you have decided to fit a linear regression model to this sample, with the goal of predicting Y from X. After your model is trained (i.e. *fitted*) to the sample, the model’s regression equation can be specified as follows:

** Y_(predicted)** =

*β0_(fitted)**+*

*β1_(fitted)*X*Where ** β0_(fitted)** and

**are the fitted model’s coefficients. Let’s skip through the fitting steps for now. The following plot shows how the fitted model looks like for the above sample plot. I’ve superimposed the conditional expectations of**

*β1_(fitted)***,**

*Y***, for each value of**

*E(Y|X)***:**

*X*You can see from the plot that fitted model doesn’t exactly pass through the conditional expectations of ** Y**. It’s off by a small amount in each case. Let’s work out this error for

*X**=8*:

When X=8:

We know E(Y|X) because we are playing God and we know the exact equation between Y and X:*E(Y|X=8) = 20+10*8 = 100*

Our model cannot know God’s mind, so at best it can estimate E(Y|X):*Y_predicted|X=8 = 22.034 + 9.458 * 8 = 97.698*

Thus the residual error ϵ for X=8 is 100 – 97.698 = 2.302

This way, we can find the residual errors for all other values of ** X **also.

We can now use these residual error terms to express the conditional expectation of ** Y** in terms of the fitted model’s coefficients. This relationship for

*X**=8*looks like this:

*E(**Y**|**X**=8) =100 = 22.034 + 9.458 * 8 + 2.302*

Or in general, we have:

E(Y|X) =β0_(fitted)+β1_(fitted)*X + ϵ

in other words:

E(Y|X) = Y_(predicted) + ϵ

*Where **ϵ** are the regression model’s errors a.k.a. the residual errors of the model.*

In general, for a ‘well-behaved’ model, as you train your model on larger and larger sample sizes, the following interesting things happen:

- The fitted model’s coefficients, in our example:
*β0_(fitted)*and*β1_(fitted)*, start approaching the true values (in our example: 20 and 10 respectively). - The fitted model’s predictions (
) start approaching the conditional expectation of*Y_predicted*on*Y*,*X**i.e. E(**Y**|**X**)*. - The model’s residual errors start approaching zero.

## Conditional Variance

*Variance is a measure of the departure of your data from the mean.*

Mathematically, the variance *s²* of a sample (or *σ²* of the population) is the sum of the squares of the differences between the sample (or population) values and the mean value of the sample (or population).

The following plot illustrates the concepts of ** unconditional **variance

*s²(*

*Y**)*and the

**variance**

*conditional**s²(*

*Y**|*

*X**=x)*:

While the **unconditional **variance of the sample (or population) takes into account all the values in the sample (or population), the **conditional** variance focuses on only a subset of values of ** Y** that correspond to the given value of

**.**

*X*In our example, we define conditional variance of the sample *s²(**Y**|**X**=x) as the variance of Y for a given value of X. *Hence the notation* s²(**Y**|**X**=x).*

## Conditional Variance and regression modeling

As with conditional expectation, conditional variance occupies a special place in the field of regression modeling, and that place is as follows:

The primary reason for building a regression model (or for that matter, any statistical model) is to try to ‘explain’ the variability in the dependent variable. One way to achieve this goal is by including relevant explanatory variables ** X** in your model. The conditional variance gives you a way to explain by how much a regression variable has helped in reducing (i.e. explaining) the variance in

**.**

*Y*If you find that the presence of an explanatory variable has not been able to explain much variance, it can be dropped from the model.

To illustrate this concept, let’s once again look at the above plot showing the unconditional and conditional variance in *Y.*

The overall variance in the sample is 870.59. But if you were to fix ** X** at 6, then just knowing that

**is 6 has brought down the variance to only 25! The following table lists the conditional variance in**

*X***for the remaining values of**

*Y***Compare each value to the overall, unconditional variance of Y, namely 870.59:**

*X.*That’s it! That’s all there is to Conditional Probability, Condition Expectation and Condition Variance. You can see how useful and central a role these three concepts play in regression modeling.

Happy modeling!

## Related

Getting to Know The Poisson Process And The Poisson Probability Distribution

**PREVIOUS: **Understanding Partial Auto-correlation And The PACF

**NEXT: **An In-depth Study of Conditional Variance and Conditional Covariance