
Probability

This part discusses concepts of probability, and uses R to demonstrate these concepts.

1. Discrete probability

The most important thing in this chapter is to learn how to calculate probabilities and use Monte Carlo simulations.

Event: refers to the things that can happen when something occurs by chance.

Distribution: we assign a probability to each category; the proportions of the categories define the distribution.

Monte Carlo

# Setting the random seed
set.seed(1986)
# rep() creates an urn with 2 red and 3 blue beads
beads <- rep(c("red", "blue"), times = c(2,3))
B <- 10000
# replicate() repeats the same task B times
events <- replicate(B, sample(beads, 1))
# use table to see the distribution
tab <- table(events)
# prop.table gives the proportions
prop.table(tab)

# equivalent, without replicate(): sample B beads with replacement
events <- sample(beads, B, replace = TRUE)

Seed: a popular way to pick the seed is year - month - day (for example, 2018 - 12 - 20 = 1986).

sample() selects elements without replacement by default. Setting the replace argument to replace = TRUE makes it select elements with replacement.

Independent: two events are independent if the outcome of one does not affect the other.

Multiplication rule:

\[ \mbox{Pr}(A \mbox{ and } B) = \mbox{Pr}(A)\mbox{Pr}(B \mid A) \]

Addition rule:

\[ \mbox{Pr}(A \mbox{ or } B) = \mbox{Pr}(A) + \mbox{Pr}(B) - \mbox{Pr}(A \mbox{ and } B) \]
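As a quick check of the multiplication rule, apply it to the urn of beads defined above and compare with a small simulation (a minimal sketch; the helper two_blue() is just for this note):

# multiplication rule: Pr(1st blue) * Pr(2nd blue | 1st blue) = 3/5 * 2/4
3/5 * 2/4

# Monte Carlo check, drawing without replacement
two_blue <- function(){
  draw <- sample(beads, 2)   # sample() draws without replacement by default
  all(draw == "blue")
}
mean(replicate(10000, two_blue()))   # should be close to 0.3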

Combinations and permutations: the permutations() and combinations() functions in the gtools package enumerate, respectively, the ordered and unordered ways to pick items.
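A tiny illustration of the difference (a sketch using the default vector 1:3):

library(gtools)

# permutations: order matters, so (1,2) and (2,1) are both listed -> 6 rows
permutations(3, 2)

# combinations: order does not matter, so only (1,2) appears -> 3 rows
combinations(3, 2)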

Examples

1. Natural 21 in Blackjack

Two ways to estimate the probability: 1. Calculate it directly

library(gtools)

suits <- c("Diamonds", "Clubs", "Hearts", "Spades")
numbers <- c("Ace", "Deuce", "Three", "Four", "Five", "Six", "Seven", 
             "Eight", "Nine", "Ten", "Jack", "Queen", "King")
deck <- expand.grid(number = numbers, suit = suits)
deck <- paste(deck$number, deck$suit)

# enumerate all ordered two-card hands
hands <- permutations(52, 2, v = deck)
first_card <- hands[,1]
second_card <- hands[,2]

aces <- paste("Ace", suits)

facecard <- c("King", "Queen", "Jack", "Ten")
facecard <- expand.grid(number = facecard, suit = suits)
facecard <- paste(facecard$number, facecard$suit)

# with ordered hands, a Natural 21 can occur in either order
mean((first_card %in% aces & second_card %in% facecard) |
       (first_card %in% facecard & second_card %in% aces))

# with unordered hands (combinations), the ace is always listed first,
# so one condition is enough
hands <- combinations(52, 2, v = deck)
mean(hands[,1] %in% aces & hands[,2] %in% facecard)

2. Monte Carlo simulation

# draw one random two-card hand
hand <- sample(deck, 2)

# TRUE if the hand is a Natural 21: an ace plus a facecard, in either order
blackjack <- function(){
  hand <- sample(deck, 2)
  (hand[1] %in% aces & hand[2] %in% facecard) | 
    (hand[2] %in% aces & hand[1] %in% facecard)
}

B <- 10000
results <- replicate(B, blackjack())
mean(results)

2. Monty Hall problem

B <- 10000
monty_hall <- function(strategy){
  doors <- as.character(1:3)
  # randomly place the car behind one of the three doors
  prize <- sample(c("car", "goat", "goat"))
  prize_door <- doors[prize == "car"]
  my_pick  <- sample(doors, 1)
  # the host opens a door that is neither our pick nor the prize door
  show <- sample(doors[!doors %in% c(my_pick, prize_door)], 1)
  stick <- my_pick
  # switching means taking the remaining unopened door
  switch <- doors[!doors %in% c(my_pick, show)]
  choice <- ifelse(strategy == "stick", stick, switch)
  # TRUE if the chosen strategy wins the car
  choice == prize_door
}
stick <- replicate(B, monty_hall("stick"))
mean(stick)
#> [1] 0.342
switch <- replicate(B, monty_hall("switch"))
mean(switch)
#> [1] 0.668

3. Birthday problem

Monte Carlo simulation

B <- 10000

# TRUE if at least two of n people share a birthday
same_birthday <- function(n){
  bdays <- sample(1:365, n, replace = TRUE)
  any(duplicated(bdays))
}

compute_prob <- function(n, B = 10000){
  results <- replicate(B, same_birthday(n))
  mean(results)
}

# Monte Carlo estimates for group sizes 1 through 60
n <- seq(1, 60)
prob <- sapply(n, compute_prob)

Exact probability

The probability that no two of the n people share a birthday is

\[ 1 \times \frac{364}{365}\times\frac{363}{365} \times \dots \times \frac{365-n+1}{365} \]

so the probability of at least one shared birthday is one minus this product.
library(ggplot2)

# exact probability of at least one shared birthday among n people
exact_prob <- function(n){
  prob_unique <- seq(365, 365 - n + 1)/365 
  1 - prod(prob_unique)
}
eprob <- sapply(n, exact_prob)

# Monte Carlo estimates (points) versus exact probabilities (red line)
qplot(n, prob) + geom_line(aes(n, eprob), col = "red")

[Figure: birthday problem, Monte Carlo estimates vs. exact probabilities]

The larger B is, the better the approximation. But how do we determine how large B needs to be?

  • One practical approach is to check for the stability of the estimate, as in the sketch below.
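
A minimal sketch of that check, reusing same_birthday() from above and assuming a fixed group size of n = 22 (the names B_values and estimate_for_B are just for this note): run the simulation for increasing values of B and watch when the estimate stops changing.

library(ggplot2)

# estimates for a fixed n as B grows
B_values <- 10^seq(1, 5, len = 100)
estimate_for_B <- function(B, n = 22){
  mean(replicate(B, same_birthday(n)))
}
estimates <- sapply(B_values, estimate_for_B)

# once the curve stops wiggling, B is large enough
qplot(log10(B_values), estimates, geom = "line")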

2. Continuous probability

In this chapter, we should know the concepts of the cumulative distribution function (CDF) and the probability density function (PDF).

R uses a convention that lets us remember the names, namely using the letters d (density function), q (quantile function), p (distribution function), and r (random generation) in front of a shorthand for the distribution. (e.g. dnorm, qnorm, pnorm, rnorm)
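
A quick sketch of the four functions for the standard normal distribution (the input values are chosen only for illustration):

dnorm(0)       # density at 0, about 0.399
pnorm(1.96)    # CDF: Pr(X <= 1.96), about 0.975
qnorm(0.975)   # quantile function, the inverse of pnorm: about 1.96
set.seed(1986)
rnorm(3)       # three random draws from a standard normal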

A simple chapter~

3. Random variables

Central Limit Theorem: when the number of draws, also called the sample size, is large, the probability distribution of the sum of the independent draws is approximately normal.

Using the CLT, we can often skip the Monte Carlo simulation and use the normal approximation instead. For the average of n independent draws:

\[ \begin{aligned} \mbox{E}\left(\frac{X_1+X_2+\dots+X_n}{n}\right) &= \frac{n\mu}{n} = \mu \\ \mbox{SE}\left(\frac{X_1+X_2+\dots+X_n}{n}\right) &= \frac{\sigma}{\sqrt{n}} \end{aligned} \]
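
A minimal sketch of these formulas in action, reusing the beads urn from the discrete-probability section; p = 0.6 (the proportion of blue beads) and n = 100 are chosen just for illustration:

# proportion of blue beads in the urn defined earlier
p <- 0.6
n <- 100

# Monte Carlo: distribution of the average of n independent draws
set.seed(1986)
x_bar <- replicate(10000, mean(sample(beads, n, replace = TRUE) == "blue"))

# the simulated mean and standard error should match the CLT formulas
c(mean(x_bar), p)                     # expected value: mu
c(sd(x_bar), sqrt(p * (1 - p) / n))   # standard error: sigma / sqrt(n)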

Law of large numbers: the standard error of the average becomes smaller and smaller as sample size grows larger.

4. The Big Short

Equations: calculating the interest rate for a 1% probability of losing money

We want to calculate the value of \(x\) for which \(\mathrm{Pr}(S<0)=0.01\). The expected value \(E[S]\) of the sum of \(n=1000\) loans given our definitions of \(x\),  \(l\) and \(p\) is:

\[ \mu_S = \{lp + x(1-p)\}\, n \]

And the standard error of the sum of \(n\) loans, \(\mathrm{SE}[S]\), is:

\[ \sigma_S=\left\vert x-l \right\vert\sqrt{np(1-p)} \]

Because we know the definition of a Z-score is \(Z=\dfrac{x-\mu}{\sigma}\), we know that \(\mathrm{Pr}(S<0)=\mathrm{Pr}(Z<-\dfrac{\mu}{\sigma})\). Thus, \(\mathrm{Pr}(S<0)=0.01\) equals:

\[ \mathrm{Pr}\left(Z<\frac{-\{lp+x(1-p)\}n}{(x-l)\sqrt{np(1-p)}}\right) \]

z<-qnorm(0.01) gives us the value of \(z\) for which \(\mathrm{Pr}(Z\leq z)=0.01\), meaning:

\[ z = \frac{-\{lp+x(1-p)\}n}{(x-l)\sqrt{np(1-p)}} \]

Solving for \(x\) gives:

\[ x = -l\frac{np-z\sqrt{np(1-p)}}{n(1-p)+z\sqrt{np(1-p)}} \]
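
A minimal sketch of this calculation in R, assuming for illustration n = 1000 loans, a default probability of p = 0.02, and a loss per foreclosure of l = -200000:

n <- 1000
p <- 0.02          # probability of default (assumed for illustration)
l <- -200000       # loss per foreclosure (assumed for illustration)
z <- qnorm(0.01)   # value with Pr(Z <= z) = 0.01

# amount x to charge per loan so that Pr(S < 0) = 0.01
x <- -l * (n*p - z*sqrt(n*p*(1-p))) / (n*(1-p) + z*sqrt(n*p*(1-p)))
x
# dividing x by the loan amount would express this as an interest rate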