Solution for Assignment 2 #12


Open · wants to merge 3 commits into main
244 changes: 244 additions & 0 deletions Tanmay/ProbStats1.ipynb

Large diffs are not rendered by default.

170 changes: 170 additions & 0 deletions Tanmay/ProbStats2.ipynb
@@ -0,0 +1,170 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "49cee8ee",
"metadata": {},
"source": [
"#Q1\n",
"\n",
"Bayes' Formula for upadation of probability: \n",
"$$\n",
"P(H|D) = \\dfrac{P(D|H)P(H)}{P(D)}\n",
"$$\n",
"\n",
"The Prior: P(Expert) = 0.01\n",
"Likelihoods: P(3 Bullseyes in 5 throws | Expert) can be computed as $(P(Bullseye | Expert))^3$ = $(0.7)^3(0.3)^2$, since each throw is independent. If he's not an expert, it's $(0.1)^3(0.9)^2$.\n",
"\n",
"Hence, we use the Bayesian Update,\n",
"\n",
"$$\n",
"P(Expert|3\\;Bullseyes\\;in\\;5\\;throws) = \\dfrac{P(3\\;Bullseyes\\;in\\;5\\;throws|Expert)P(Expert)}{P(3\\;Bullseyes\\;in\\;5\\;throws)}\\\\\n",
"= \\dfrac{(0.7)^3\\times(0.3)^2\\times(0.01)}{(0.1)^3\\times(0.9)^2\\times0.99 + (0.7)^3\\times(0.3)^2\n",
"\\times0.01}\\\\\n",
"\\approx 0.27795\n",
"$$\n",
"\n",
"The probability goes from 1% to $\\approx$ 28% based on his performance being way better than what would be expected of an average person.\n",
"If our prior was 20% instead of 1%, our posterior would grow to 0.9050 or $\\approx$ 90.5% since the prior data informs the posterior. Our higher belief in the original hypothesis increases our probability of it being true. "
]
},
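{
"cell_type": "code",
"execution_count": null,
"id": "q1check01",
"metadata": {},
"outputs": [],
"source": [
"# Quick numerical check of the Bayesian update above -- a sketch, not part of\n",
"# the required solution. The hit probabilities 0.7 / 0.1 and the priors\n",
"# 0.01 / 0.2 come from the problem statement; the helper name is ours.\n",
"def posterior_expert(prior, p_expert=0.7, p_average=0.1, hits=3, throws=5):\n",
"    # The binomial coefficient C(5, 3) cancels in Bayes' formula, so only\n",
"    # p**hits * (1 - p)**(throws - hits) is needed for each hypothesis.\n",
"    like_expert = p_expert**hits * (1 - p_expert)**(throws - hits)\n",
"    like_average = p_average**hits * (1 - p_average)**(throws - hits)\n",
"    numerator = like_expert * prior\n",
"    return numerator / (numerator + like_average * (1 - prior))\n",
"\n",
"print(posterior_expert(0.01))  # approx 0.278\n",
"print(posterior_expert(0.20))  # approx 0.905"
]
},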
{
"cell_type": "markdown",
"id": "78e06a3d",
"metadata": {},
"source": [
"#Q2\n",
"\n",
"Our set of times: ${T_1, T_2, ..., T_n}$\n",
"\n",
"Given that $T_i > 10$ and $T \\sim Exp(\\lambda)$.\n",
"\n",
"Then we must change the stnadard exponential PDF such that $\\int_{10}^{\\infty} k \\times f_T(t) \\: \\mathrm{d}t = 1$.<br/>\n",
"This gives us $g_T(t) = \\dfrac{f_X(x)}{F(\\infty) - F(10)} = \\lambda {\\mathrm{e}}^{-\\lambda(t - 10)}\\\\$\n",
"$$\n",
"L(\\lambda) = {\\lambda}^n \\prod_{i=1}^{n} {\\mathrm{e}}^{-\\lambda(T_i - 10)}\\\\\n",
"l(\\lambda) = \\log (L(\\lambda)) = n\\log (\\lambda) - \\lambda \\sum_{i = 1}^{n} (T_i - 10)\\\\\n",
"$$\n",
"Differentiating this wrt $\\lambda$,\n",
"$$\n",
"\\dfrac{n}{\\lambda} - \\sum_{i = 1}^{n} (T_i - 10) = 0\\\\\n",
"\\hat {\\lambda} = \\dfrac{n}{\\sum_{i = 1}^{n} (T_i - 10)}\\\\\n",
"$$\n",
"\n",
"Ignoring truncation would ofcourse give an incorrect pdf, which would represent the distribution of data not as it happened. Our model would then have a region of the distribution without any observations, and this would take away from the $T_i > 10$ region, giving waiting times less than they're supposed to be.\n",
"\n",
"If the device had a little bit of uncertainty that makes it sometimes start later than 10 minutes, we could model that uncertainty to get better predictions, probably as part of a mixed distribution of the waiting time where the point of truncation is variable.\n",
"\n",
"\n",
"Now if we give $\\lambda$ a $\\gamma (a, b)$ prior, the following changes:\n",
"\n",
"$\\hat \\lambda$ is $\\lambda$ that maximises $f_{Data|\\lambda}(data|t)f_{\\lambda}(t)$, where the former is the likelihood and the latter is the prior.\n",
"\n",
"The MLE only maximises the likelihood. We can use the likelihood function from our previous calculation.\n",
"\n",
"$$\n",
"f_{Data|\\lambda}(data|t)f_{\\lambda}(t) \\propto {\\lambda}^{n + \\alpha - 1}{{\\mathrm{e}}^{({\\dfrac{-\\lambda}{\\beta}})}}{\\prod_{i=1}^{n} {\\mathrm{e}}^{-\\lambda(T_i - 10)}}\n",
"$$\n",
"Taking a logarithm and differentiating, then equating to zero gives us the value:\n",
"\n",
"$\\hat \\lambda_{MAP} = \\dfrac {n + \\alpha - 1}{\\dfrac{1}{\\beta} + \\sum_{i = 1}^{n} (T_i - 10)}$\n",
"\n",
"The difference between $\\hat \\lambda_{MAP}$ and $\\hat \\lambda_{MLE}$ is $\\propto \\alpha,\\;\\beta$.\n",
"This is higher when the historical data is *very* different from the current model.\n",
"\n",
"The prior acts as said historical data, it pulls MAP towards itself, especially if data is scarce.\n",
"We should prefer MAP over MLE when in the data collected so far is low."
]
},
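{
"cell_type": "code",
"execution_count": null,
"id": "q2check01",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: compare the truncated-exponential MLE and the Gamma-prior MAP\n",
"# estimate derived above on simulated data. The true rate 0.2, sample size 50\n",
"# and prior hyperparameters alpha = 2, beta = 5 are illustrative assumptions.\n",
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(0)\n",
"true_lam, n = 0.2, 50\n",
"# Waiting times start 10 minutes late: T_i = 10 + Exp(true_lam)\n",
"T = 10 + rng.exponential(scale=1 / true_lam, size=n)\n",
"\n",
"lam_mle = n / np.sum(T - 10)    # MLE that accounts for truncation\n",
"lam_naive = n / np.sum(T)       # MLE ignoring the offset (underestimates lambda)\n",
"alpha, beta = 2.0, 5.0          # Gamma(alpha, beta) prior, shape-scale form\n",
"lam_map = (n + alpha - 1) / (1 / beta + np.sum(T - 10))\n",
"\n",
"print(lam_mle, lam_naive, lam_map)"
]
},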
{
"cell_type": "markdown",
"id": "b0a360a0",
"metadata": {},
"source": [
"#Q3\n",
"\n",
"$$\n",
"D_{{KL}}(P \\parallel Q) = \\sum_{i=1}^k p_i \\log \\left( \\frac{p_i}{q_i} \\right)\n",
"$$\n",
"\n",
"We want to show that:\n",
"\n",
"$$\n",
"D_{{KL}}(P \\parallel Q) \\geq 0\n",
"$$\n",
"\n",
"The logarithm function is strictly concave. By Jensen's inequality:\n",
"\n",
"$$\n",
"\\sum_{i=1}^k p_i \\log \\left( \\frac{q_i}{p_i} \\right) \\leq \\log \\left( \\sum_{i=1}^k p_i \\cdot \\frac{q_i}{p_i} \\right) = \\log \\left( \\sum_{i=1}^k q_i \\right) = \\log(1) = 0\n",
"$$\n",
"\n",
"Multiplying both sides by \\(-1\\), we obtain:\n",
"\n",
"$$\n",
"\\sum_{i=1}^k p_i \\log \\left( \\frac{p_i}{q_i} \\right) \\geq 0\n",
"$$\n",
"\n",
"Thus,\n",
"\n",
"$$\n",
"D_{{KL}}(P \\parallel Q) \\geq 0\n",
"$$\n",
"\n",
"\n",
"\n",
"When is \\( D_{\\text{KL}}(P \\parallel Q) = 0 \\) ?\n",
"\n",
"This occurs if and only if:\n",
"\n",
"$$\n",
"p_i = q_i \\quad \\text{for all } i\n",
"$$\n",
"\n",
"That is, \\( P = Q \\). This follows from the strict convexity of the KL divergence.\n",
"\n",
"\n",
"\n",
"Connection to Cross-Entropy\n",
"\n",
"The cross-entropy between distributions \\( P \\) and \\( Q \\) is defined as:\n",
"\n",
"$$\n",
"H(P, Q) = - \\sum_{i=1}^k p_i \\log(q_i)\n",
"$$\n",
"\n",
"The entropy of \\( P \\) is:\n",
"\n",
"$$\n",
"H(P) = - \\sum_{i=1}^k p_i \\log(p_i)\n",
"$$\n",
"\n",
"We can express the KL divergence in terms of entropy and cross-entropy:\n",
"\n",
"$$\n",
"D_{\\text{KL}}(P \\parallel Q) = \\sum_{i=1}^k p_i \\log \\left( \\frac{p_i}{q_i} \\right) = \\sum_{i=1}^k p_i \\log(p_i) - \\sum_{i=1}^k p_i \\log(q_i)\n",
"$$\n",
"\n",
"$$\n",
"D_{\\text{KL}}(P \\parallel Q) = -H(P) + H(P, Q)\n",
"$$\n",
"\n",
"\n",
"Minimizing \\( D_{\\text{KL}}(P \\parallel Q) \\) with respect to \\( Q \\) is equivalent to minimizing the cross-entropy \\( H(P, Q) \\), up to the constant \\( H(P) \\), which depends only on \\( P \\)."
]
}
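,
{
"cell_type": "code",
"execution_count": null,
"id": "q3check01",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: numerically verify that D_KL(P || Q) >= 0 and that\n",
"# D_KL(P || Q) = H(P, Q) - H(P). The two probability vectors below are\n",
"# illustrative examples, not part of the problem statement.\n",
"import numpy as np\n",
"\n",
"p = np.array([0.5, 0.3, 0.2])\n",
"q = np.array([0.4, 0.4, 0.2])\n",
"\n",
"kl = np.sum(p * np.log(p / q))           # D_KL(P || Q)\n",
"cross_entropy = -np.sum(p * np.log(q))   # H(P, Q)\n",
"entropy = -np.sum(p * np.log(p))         # H(P)\n",
"\n",
"print(kl >= 0)                                   # True\n",
"print(np.isclose(kl, cross_entropy - entropy))   # True"
]
}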
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.13.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}