Commit a58c4af

mmcky and jstac authored
FIX: Update python code to simplify and resolve FutureWarning (#540)
* Misc edits to prob lecture
* fix variable name and minor formatting update
* add explanation for infinite support
* ENH: update code to simplify and resolve warnings
* remove all asarray
* address missed merge conflict issues
* remove extra x=df['income']
* FIX: set pd option to see if FutureWarning is resolved for inf and na
* revert test by setting pd option
* upgrade anaconda==2024.06

---------

Co-authored-by: John Stachurski <[email protected]>
1 parent 21a6894 commit a58c4af
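
One change in this commit swaps `data[0]` for `x_amazon.iloc[0]`. A plausible reason, assumed here rather than stated in the commit message, is that recent pandas releases deprecate integer keys on a Series whose index is not integer-based (such as the date index returned by a price download). The sketch below uses a small hypothetical Series of returns, not the lecture's data, to show the explicit positional form next to the legacy form.

```python
import warnings

import pandas as pd

# Hypothetical monthly returns on a date index (illustrative values only)
idx = pd.date_range("2000-02-01", periods=3, freq="MS")
returns = pd.Series([1.5, -2.0, 0.7], index=idx)

# Explicit positional access: unambiguous on any index type
first = returns.iloc[0]

# Legacy access with an integer key on a non-integer index.
# Depending on the installed pandas version this may succeed silently,
# emit a FutureWarning, or raise KeyError.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    try:
        legacy = returns[0]
    except KeyError:
        legacy = None  # positional fallback no longer available

print(first, legacy, [w.category.__name__ for w in caught])
```

Using `.iloc` keeps the lecture cell independent of which of these behaviors the installed pandas version implements.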

2 files changed: +12 -39 lines changed

environment.yml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ channels:
   - conda-forge
 dependencies:
   - python=3.11
-  - anaconda=2024.02
+  - anaconda=2024.06
   - pip
   - pip:
     - jupyter-book==0.15.1

lectures/prob_dist.md

Lines changed: 11 additions & 38 deletions
@@ -4,14 +4,13 @@ jupytext:
     extension: .md
     format_name: myst
     format_version: 0.13
-    jupytext_version: 1.14.5
+    jupytext_version: 1.16.1
 kernelspec:
   display_name: Python 3 (ipykernel)
   language: python
   name: python3
 ---
 
-
 # Distributions and Probabilities
 
 ```{index} single: Distributions and Probabilities
@@ -23,6 +22,7 @@ In this lecture we give a quick introduction to data and probability distributio
 
 ```{code-cell} ipython3
 :tags: [hide-output]
+
 !pip install --upgrade yfinance
 ```
 
@@ -35,7 +35,6 @@ import scipy.stats
 import seaborn as sns
 ```
 
-
 ## Common distributions
 
 In this section we recall the definitions of some well-known distributions and explore how to manipulate them with SciPy.
@@ -99,7 +98,6 @@ n = 10
 u = scipy.stats.randint(1, n+1)
 ```
 
-
 Here's the mean and variance:
 
 ```{code-cell} ipython3
@@ -195,7 +193,6 @@ u.pmf(0)
 u.pmf(1)
 ```
 
-
 #### Binomial distribution
 
 Another useful (and more interesting) distribution is the **binomial distribution** on $S=\{0, \ldots, n\}$, which has PMF:
@@ -232,7 +229,6 @@ Let's see if SciPy gives us the same results:
 u.mean(), u.var()
 ```
 
-
 Here's the PMF:
 
 ```{code-cell} ipython3
@@ -250,7 +246,6 @@ ax.set_ylabel('PMF')
 plt.show()
 ```
 
-
 Here's the CDF:
 
 ```{code-cell} ipython3
@@ -264,7 +259,6 @@ ax.set_ylabel('CDF')
 plt.show()
 ```
 
-
 ```{exercise}
 :label: prob_ex3
 
@@ -334,7 +328,6 @@ ax.set_ylabel('PMF')
 plt.show()
 ```
 
-
 #### Poisson distribution
 
 The Poisson distribution on $S = \{0, 1, \ldots\}$ with parameter $\lambda > 0$ has PMF
@@ -372,7 +365,6 @@ ax.set_ylabel('PMF')
 plt.show()
 ```
 
-
 ### Continuous distributions
 
 
@@ -449,7 +441,6 @@ plt.legend()
 plt.show()
 ```
 
-
 Here's a plot of the CDF:
 
 ```{code-cell} ipython3
@@ -466,7 +457,6 @@ plt.legend()
 plt.show()
 ```
 
-
 #### Lognormal distribution
 
 The **lognormal distribution** is a distribution on $\left(0, \infty\right)$ with density
@@ -646,7 +636,6 @@ plt.legend()
 plt.show()
 ```
 
-
 #### Gamma distribution
 
 The **gamma distribution** is a distribution on $\left(0, \infty\right)$ with density
@@ -730,7 +719,6 @@ df = pd.DataFrame(data, columns=['name', 'income'])
 df
 ```
 
-
 In this situation, we might refer to the set of their incomes as the "income distribution."
 
 The terminology is confusing because this set is not a probability distribution
@@ -761,14 +749,10 @@ $$
 For the income distribution given above, we can calculate these numbers via
 
 ```{code-cell} ipython3
-x = np.asarray(df['income']) # Pull out income as a NumPy array
-```
-
-```{code-cell} ipython3
+x = df['income']
 x.mean(), x.var()
 ```
 
-
 ```{exercise}
 :label: prob_ex4
 
@@ -792,15 +776,13 @@ We will cover
 We can histogram the income distribution we just constructed as follows
 
 ```{code-cell} ipython3
-x = df['income']
 fig, ax = plt.subplots()
 ax.hist(x, bins=5, density=True, histtype='bar')
 ax.set_xlabel('income')
 ax.set_ylabel('density')
 plt.show()
 ```
 
-
 Let's look at a distribution from real data.
 
 In particular, we will look at the monthly return on Amazon shares between 2000/1/1 and 2024/1/1.
@@ -811,25 +793,21 @@ So we will have one observation for each month.
 
 ```{code-cell} ipython3
 :tags: [hide-output]
-df = yf.download('AMZN', '2000-1-1', '2024-1-1', interval='1mo' )
+
+df = yf.download('AMZN', '2000-1-1', '2024-1-1', interval='1mo')
 prices = df['Adj Close']
-data = prices.pct_change()[1:] * 100
-data.head()
+x_amazon = prices.pct_change()[1:] * 100
+x_amazon.head()
 ```
 
-
 The first observation is the monthly return (percent change) over January 2000, which was
 
 ```{code-cell} ipython3
-data[0]
+x_amazon.iloc[0]
 ```
 
 Let's turn the return observations into an array and histogram it.
 
-```{code-cell} ipython3
-x_amazon = np.asarray(data)
-```
-
 ```{code-cell} ipython3
 fig, ax = plt.subplots()
 ax.hist(x_amazon, bins=20)
@@ -838,7 +816,6 @@ ax.set_ylabel('density')
 plt.show()
 ```
 
-
 #### Kernel density estimates
 
 Kernel density estimates (KDE) provide a simple way to estimate and visualize the density of a distribution.
@@ -893,10 +870,10 @@ For example, let's compare the monthly returns on Amazon shares with the monthly
 
 ```{code-cell} ipython3
 :tags: [hide-output]
-df = yf.download('COST', '2000-1-1', '2024-1-1', interval='1mo' )
+
+df = yf.download('COST', '2000-1-1', '2024-1-1', interval='1mo')
 prices = df['Adj Close']
-data = prices.pct_change()[1:] * 100
-x_costco = np.asarray(data)
+x_costco = prices.pct_change()[1:] * 100
 ```
 
 ```{code-cell} ipython3
@@ -907,7 +884,6 @@ ax.set_xlabel('KDE')
 plt.show()
 ```
 
-
 ### Connection to probability distributions
 
 Let's discuss the connection between observed distributions and probability distributions.
@@ -941,7 +917,6 @@ ax.set_ylabel('density')
 plt.show()
 ```
 
-
 The match between the histogram and the density is not bad but also not very good.
 
 One reason is that the normal distribution is not really a good fit for this observed data --- we will discuss this point again when we talk about {ref}`heavy tailed distributions<heavy_tail>`.
@@ -967,8 +942,6 @@ ax.set_ylabel('density')
 plt.show()
 ```
 
-
 Note that if you keep increasing $N$, which is the number of observations, the fit will get better and better.
 
 This convergence is a version of the "law of large numbers", which we will discuss {ref}`later<lln_mr>`.
-
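
The hunk at old line 761 replaces `x = np.asarray(df['income'])` with `x = df['income']` while leaving `x.var()` unchanged. These two calls do not use the same default: NumPy's `var` divides by `n` (`ddof=0`), whereas the pandas `Series.var` divides by `n - 1` (`ddof=1`). The sketch below uses hypothetical incomes, not the lecture's data, to show the size of the difference and how to recover the NumPy value from a Series.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes (illustrative values, not taken from the lecture)
income = pd.Series([75_000, 55_000, 89_000, 66_000], name="income")
n = len(income)

# NumPy default: population variance, divides by n
var_numpy = np.asarray(income).var()

# pandas default: sample variance, divides by n - 1
var_pandas = income.var()

# Passing ddof=0 to pandas reproduces the NumPy default
var_pandas_pop = income.var(ddof=0)

print(var_numpy, var_pandas, var_pandas_pop)
```

If the surrounding text defines the variance with a divisor of `n`, calling `x.var(ddof=0)` on the Series would keep the displayed number consistent with that definition.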