FIX: Update python code to simplify and resolve FutureWarning (#540)

mmcky · jstac · web-flow · commit a58c4af24153 · 2024-08-01T14:34:53.000+10:00
* Misc edits to prob lecture

* fix variable name and minor formatting update

* add explanation for infinite support

* ENH: update code to simplify and resolve warnings

* remove all asarray

* address missed merge conflict issues

* remove extra x=df['income']

* FIX: set pd option to see if FutureWarning is resolved for inf and na

* revert test by setting pd option

* upgrade anaconda==2024.06
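
For reviewers, a minimal sketch of the positional-access change (`data[0]` becoming `x_amazon.iloc[0]` in the diff). It assumes the warning came from using an integer key on a date-indexed Series, which recent pandas versions deprecate. The series name `returns` and its values are made up for illustration and are not AMZN data.

```python
import pandas as pd

# Hypothetical monthly returns on a DatetimeIndex (illustrative values only)
returns = pd.Series(
    [1.5, -2.0, 0.7],
    index=pd.date_range("2000-02-01", periods=3, freq="MS"),
)

# Explicit positional access: independent of the index type, no warning
first = returns.iloc[0]

# Label-based access to the same observation, for comparison
same = returns.loc["2000-02-01"]

# By contrast, `returns[0]` relies on an integer key being treated as a
# position, which is the pattern this commit replaces.
print(first, same)
```

Using `.iloc` keeps the lecture code valid whether or not integer keys are later treated strictly as labels.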

---------

Co-authored-by: John Stachurski <john.stachurski@gmail.com>
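
One behavioral point worth checking when `np.asarray` is dropped: `x.var()` on a pandas Series defaults to `ddof=1`, while the NumPy array method defaults to `ddof=0`, so the number printed by the `x.mean(), x.var()` cell may change. The sketch below uses a hypothetical `income` column with made-up values, not the lecture's data.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes (illustrative values only)
df = pd.DataFrame({"income": [10.0, 20.0, 30.0, 40.0]})

x = df["income"]        # pandas Series, as in the updated lecture code
arr = np.asarray(x)     # NumPy array, as in the previous lecture code

pop_var = arr.var()          # NumPy default ddof=0: divides by n
sample_var = x.var()         # pandas default ddof=1: divides by n - 1
matched_var = x.var(ddof=0)  # pandas result aligned with NumPy

print(pop_var, sample_var, matched_var)
```

If the lecture intends the variance that divides by the number of observations, passing `ddof=0` would preserve the earlier output; the means agree either way.
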
diff --git a/environment.yml b/environment.yml
@@ -4,7 +4,7 @@ channels:
   - conda-forge
 dependencies:
   - python=3.11
-  - anaconda=2024.02
+  - anaconda=2024.06
   - pip
   - pip:
     - jupyter-book==0.15.1
diff --git a/lectures/prob_dist.md b/lectures/prob_dist.md
@@ -4,14 +4,13 @@ jupytext:
     extension: .md
     format_name: myst
     format_version: 0.13
-    jupytext_version: 1.14.5
+    jupytext_version: 1.16.1
 kernelspec:
   display_name: Python 3 (ipykernel)
   language: python
   name: python3
 ---
 
-
 # Distributions and Probabilities
 
 ```{index} single: Distributions and Probabilities
@@ -23,6 +22,7 @@ In this lecture we give a quick introduction to data and probability distributio
 
 ```{code-cell} ipython3
 :tags: [hide-output]
+
 !pip install --upgrade yfinance  
 ```
 
@@ -35,7 +35,6 @@ import scipy.stats
 import seaborn as sns
 ```
 
-
 ## Common distributions
 
 In this section we recall the definitions of some well-known distributions and explore how to manipulate them with SciPy.
@@ -99,7 +98,6 @@ n = 10
 u = scipy.stats.randint(1, n+1)
 ```
 
-
 Here's the mean and variance:
 
 ```{code-cell} ipython3
@@ -195,7 +193,6 @@ u.pmf(0)
 u.pmf(1)
 ```
 
-
 #### Binomial distribution
 
 Another useful (and more interesting) distribution is the **binomial distribution** on $S=\{0, \ldots, n\}$, which has PMF:
@@ -232,7 +229,6 @@ Let's see if SciPy gives us the same results:
 u.mean(), u.var()
 ```
 
-
 Here's the PMF:
 
 ```{code-cell} ipython3
@@ -250,7 +246,6 @@ ax.set_ylabel('PMF')
 plt.show()
 ```
 
-
 Here's the CDF:
 
 ```{code-cell} ipython3
@@ -264,7 +259,6 @@ ax.set_ylabel('CDF')
 plt.show()
 ```
 
-
 ```{exercise}
 :label: prob_ex3
 
@@ -334,7 +328,6 @@ ax.set_ylabel('PMF')
 plt.show()
 ```
 
-
 #### Poisson distribution
 
 The Poisson distribution on $S = \{0, 1, \ldots\}$ with parameter $\lambda > 0$ has PMF
@@ -372,7 +365,6 @@ ax.set_ylabel('PMF')
 plt.show()
 ```
 
-
 ### Continuous distributions
 
 
@@ -449,7 +441,6 @@ plt.legend()
 plt.show()
 ```
 
-
 Here's a plot of the CDF:
 
 ```{code-cell} ipython3
@@ -466,7 +457,6 @@ plt.legend()
 plt.show()
 ```
 
-
 #### Lognormal distribution
 
 The **lognormal distribution** is a distribution on $\left(0, \infty\right)$ with density
@@ -646,7 +636,6 @@ plt.legend()
 plt.show()
 ```
 
-
 #### Gamma distribution
 
 The **gamma distribution** is a distribution on $\left(0, \infty\right)$ with density
@@ -730,7 +719,6 @@ df = pd.DataFrame(data, columns=['name', 'income'])
 df
 ```
 
-
 In this situation, we might refer to the set of their incomes as the "income distribution."
 
 The terminology is confusing because this set is not a probability distribution
@@ -761,14 +749,10 @@ $$
 For the income distribution given above, we can calculate these numbers via
 
 ```{code-cell} ipython3
-x = np.asarray(df['income'])   # Pull out income as a NumPy array
-```
-
-```{code-cell} ipython3
+x = df['income']
 x.mean(), x.var()
 ```
 
-
 ```{exercise}
 :label: prob_ex4
 
@@ -792,15 +776,13 @@ We will cover
 We can histogram the income distribution we just constructed as follows
 
 ```{code-cell} ipython3
-x = df['income']
 fig, ax = plt.subplots()
 ax.hist(x, bins=5, density=True, histtype='bar')
 ax.set_xlabel('income')
 ax.set_ylabel('density')
 plt.show()
 ```
 
-
 Let's look at a distribution from real data.
 
 In particular, we will look at the monthly return on Amazon shares between 2000/1/1 and 2024/1/1.
@@ -811,25 +793,21 @@ So we will have one observation for each month.
 
 ```{code-cell} ipython3
 :tags: [hide-output]
-df = yf.download('AMZN', '2000-1-1', '2024-1-1', interval='1mo' )
+
+df = yf.download('AMZN', '2000-1-1', '2024-1-1', interval='1mo')
 prices = df['Adj Close']
-data = prices.pct_change()[1:] * 100
-data.head()
+x_amazon = prices.pct_change()[1:] * 100
+x_amazon.head()
 ```
 
-
 The first observation is the monthly return (percent change) over January 2000, which was
 
 ```{code-cell} ipython3
-data[0] 
+x_amazon.iloc[0]
 ```
 
 Let's turn the return observations into an array and histogram it.
 
-```{code-cell} ipython3
-x_amazon = np.asarray(data)
-```
-
 ```{code-cell} ipython3
 fig, ax = plt.subplots()
 ax.hist(x_amazon, bins=20)
@@ -838,7 +816,6 @@ ax.set_ylabel('density')
 plt.show()
 ```
 
-
 #### Kernel density estimates
 
 Kernel density estimates (KDE) provide a simple way to estimate and visualize the density of a distribution.
@@ -893,10 +870,10 @@ For example, let's compare the monthly returns on Amazon shares with the monthly
 
 ```{code-cell} ipython3
 :tags: [hide-output]
-df = yf.download('COST', '2000-1-1', '2024-1-1', interval='1mo' )
+
+df = yf.download('COST', '2000-1-1', '2024-1-1', interval='1mo')
 prices = df['Adj Close']
-data = prices.pct_change()[1:] * 100
-x_costco = np.asarray(data)
+x_costco = prices.pct_change()[1:] * 100
 ```
 
 ```{code-cell} ipython3
@@ -907,7 +884,6 @@ ax.set_xlabel('KDE')
 plt.show()
 ```
 
-
 ### Connection to probability distributions
 
 Let's discuss the connection between observed distributions and probability distributions.
@@ -941,7 +917,6 @@ ax.set_ylabel('density')
 plt.show()
 ```
 
-
 The match between the histogram and the density is not bad but also not very good.
 
 One reason is that the normal distribution is not really a good fit for this observed data --- we will discuss this point again when we talk about {ref}`heavy tailed distributions<heavy_tail>`.
@@ -967,8 +942,6 @@ ax.set_ylabel('density')
 plt.show()
 ```
 
-
 Note that if you keep increasing $N$, which is the number of observations, the fit will get better and better.
 
 This convergence is a version of the "law of large numbers", which we will discuss {ref}`later<lln_mr>`.
-