usage: job resample [-h] --weights-file WEIGHTS_FILE [--log] [--iis]
                    [--epsilon EPSILON]
                    [--neff-bounds NEFF_BOUNDS NEFF_BOUNDS]
                    [--method {residual,multinomial}] [-N SIZE] [--seed SEED]
                    [-o OUT]
                    params_file

Resample an existing parameter set

Summary
-------
Resample an existing ensemble, based on an array of weights.
Optionally, a scaled version of the weights may be used, with
addition of noise, according to Annan and Hargreaves' Iterative Importance
Sampling.

Background
----------
Typically, weights are derived from a Bayesian analysis, where each
realization is compared with observations and assigned a likelihood.  An array
of resampling indices is then derived from the weights: realizations with
large weights are resampled several times, while realizations with small
weights are not resampled.  To prevent the exact same parameter set from
appearing several times in the resampled ensemble, noise (jitter) must be
added in a way that conserves the statistical properties (covariance) of the
resampled ensemble.
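
As an illustration of how such weights might be built, the sketch below computes
normalized Gaussian-likelihood weights for a small ensemble. The function name
`normalized_weights` and the array layout are assumptions made for this example
only; they are not part of the `job resample` command.

```python
import numpy as np

def normalized_weights(model, obs, variance):
    """Gaussian-likelihood weights (hypothetical helper, for illustration).

    model    : (n_members, n_obs) array of simulated values
    obs      : (n_obs,) array of observations
    variance : (n_obs,) array of observation error variances
    """
    model = np.asarray(model, dtype=float)
    # Summing log-likelihoods over observations is equivalent to
    # multiplying the individual likelihoods.
    loglik = -0.5 * np.sum((model - obs) ** 2 / variance, axis=1)
    # Subtract the maximum before exponentiating to limit underflow.
    w = np.exp(loglik - loglik.max())
    return w / w.sum()

model = np.array([[1.0, 2.0], [1.5, 2.5], [3.0, 4.0]])
obs = np.array([1.0, 2.0])
variance = np.array([0.25, 0.25])
weights = normalized_weights(model, obs, variance)
```

The member closest to the observations receives the largest weight, and the
weights sum to one.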

The problem is not trivial and several approaches exist for both the sampling
of indices and the addition of noise. Resampling methods (before application
of jitter) differ mainly in how they handle the tail (low-weight
realizations), which influences the results for "small" ensemble sizes:

- multinomial : random sampling based on the empirical distribution function.
    Simple, but performs poorly.
- residual : some of the resampling indices can be determined deterministically
    when weights are large enough, i.e. `w_i * N > 1` where `w_i` represents
    a normalized weight (sum of all weights equals 1), and N is the ensemble size.
    The array of weight residuals (`w_i * N - int(w_i * N)`) is then resampled
    using a basic multinomial approach.
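
The two methods listed above can be sketched as follows. This is a minimal
NumPy illustration of the general technique, not the code used by
`job resample`; the function names are hypothetical.

```python
import numpy as np

def multinomial_resampling(weights, size, rng):
    """Draw `size` indices with probability proportional to the weights."""
    weights = np.asarray(weights, dtype=float)
    return rng.choice(len(weights), size=size, p=weights / weights.sum())

def residual_resampling(weights, size, rng):
    """Deterministic copies for large weights, multinomial draw for the rest."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    expected = weights * size                  # w_i * N
    counts = np.floor(expected).astype(int)    # int(w_i * N) guaranteed copies
    residuals = expected - counts              # w_i * N - int(w_i * N)
    n_left = size - counts.sum()
    if n_left > 0:
        # Remaining slots are filled by multinomial sampling of the residuals.
        extra = rng.choice(len(weights), size=n_left,
                           p=residuals / residuals.sum())
        counts = counts + np.bincount(extra, minlength=len(weights))
    return np.repeat(np.arange(len(weights)), counts)

rng = np.random.default_rng(0)
w = [0.5, 0.25, 0.125, 0.125]
idx_residual = residual_resampling(w, 10, rng)
idx_multinomial = multinomial_resampling(w, 10, rng)
```

With these weights and `N = 10`, residual resampling always copies member 0
five times, member 1 at least twice, and members 2 and 3 at least once each;
only the one remaining slot is drawn at random.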

More advanced methods are typically similar to `residual`, but the array of
residual weights is resampled taking into account the uniformity of samples in
the parameter or state space (and therefore requires additional information).
One of these methods, coined `deterministic` (re)sampling, is planned to be
implemented, in addition to the two mentioned above.

The jittering step is tricky because the noise is unlikely to have a pure
(multivariate) normal distribution (especially when the model is strongly
nonlinear).  An approach proposed by Annan and Hargreaves, "iterative
importance sampling" (`iis`), is to sample jitter with zero mean and a
covariance computed from the original (resampled) ensemble, but scaled so that
its variance is only a small fraction `epsilon` of that of the original
ensemble. Adding the noise inflates the overall covariance by a factor
`1 + epsilon`, but they show that this is balanced out if the weights used for
resampling are "flattened" with the same `epsilon` as an exponent
(`shrinking`).  This procedure leaves the posterior distribution invariant, so
that it can be applied iteratively when starting from a prior which is far
from the posterior.
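
One such step might look like the sketch below, written from the description
above rather than from the tool's source. The function name `iis_step` is
hypothetical, and plain multinomial resampling is used here only to keep the
example short.

```python
import numpy as np

def iis_step(params, weights, epsilon, rng):
    """One flatten / resample / jitter step (illustrative sketch).

    params  : (n_members, n_params) array
    weights : (n_members,) array of non-negative weights
    """
    params = np.asarray(params, dtype=float)
    weights = np.asarray(weights, dtype=float)
    n = len(params)

    # Flatten the weights with epsilon as an exponent, then normalize.
    flat = weights ** epsilon
    flat = flat / flat.sum()

    # Resample indices (multinomial here, as a simplification).
    idx = rng.choice(n, size=n, p=flat)
    resampled = params[idx]

    # Zero-mean jitter whose covariance is epsilon times that of the
    # resampled ensemble.
    cov = np.atleast_2d(np.cov(resampled, rowvar=False))
    jitter = rng.multivariate_normal(np.zeros(params.shape[1]),
                                     epsilon * cov, size=n)
    return resampled + jitter

rng = np.random.default_rng(42)
params = rng.normal(size=(200, 2))
weights = np.exp(-0.5 * np.sum((params - 1.0) ** 2, axis=1))
new_params = iis_step(params, weights, epsilon=0.05, rng=rng)
```

Because each step only moves the ensemble a small way toward the posterior,
the step would be repeated, with weights recomputed from new model runs each
time.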

One step of this resampling procedure can be activated with the `--iis` flag.
By default the epsilon factor is computed automatically to keep an "effective
ensemble size" in a reasonable proportion (50% to 90%) to the actual ensemble
size (see `--neff-bounds` parameter). No other jittering method is proposed.
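
The sketch below shows one way such an automatic adjustment could work. Two
assumptions are made here that the text above does not confirm: the effective
ensemble size is taken as `1 / sum(w_i**2)` for normalized weights (a common
definition), and `epsilon` is tuned by a simple bracketing search on the
flattened weights. The function names are hypothetical.

```python
import numpy as np

def neff_ratio(weights):
    """Effective ensemble size divided by the actual size (assumed definition)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return 1.0 / (len(w) * np.sum(w ** 2))

def adjust_epsilon(weights, epsilon=0.05, bounds=(0.5, 0.9), max_iter=50):
    """Search for an epsilon whose flattened weights give a ratio within bounds."""
    weights = np.asarray(weights, dtype=float)
    lo, hi = 0.0, None                 # bracket on epsilon; upper end unknown
    for _ in range(max_iter):
        ratio = neff_ratio(weights ** epsilon)
        if ratio < bounds[0]:          # too peaked: use a smaller exponent
            hi = epsilon
            epsilon = 0.5 * (lo + hi)
        elif ratio > bounds[1]:        # too flat: use a larger exponent
            lo = epsilon
            epsilon = 2.0 * epsilon if hi is None else 0.5 * (lo + hi)
        else:
            break
    return epsilon

x = np.linspace(-3.0, 3.0, 500)
weights = np.exp(-0.5 * x ** 2 / 0.01)     # sharply peaked weights
eps = adjust_epsilon(weights)
ratio = neff_ratio(weights ** eps)
```

The search relies on the ratio decreasing monotonically as `epsilon` grows:
`epsilon = 0` gives uniform weights (ratio 1), and larger exponents
concentrate the weight on fewer members.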

References
----------
Annan, J. D., & Hargreaves, J. C. (2010). Efficient identification of
ocean thermodynamics in a physical/biogeochemical ocean model with an iterative
Importance Sampling method. Ocean Modelling, 32(3-4), 205-215.
doi:10.1016/j.ocemod.2010.02.003

Douc and Cappe. 2005. Comparison of resampling schemes for particle filtering.
ISPA2005, Proceedings of the 4th Symposium on Image and Signal Processing.

Hol, Jeroen D., Thomas B. Schön, and Fredrik Gustafsson,
"On Resampling Algorithms for Particle Filters",
in NSSPW - Nonlinear Statistical Signal Processing Workshop 2006,
2006 <http://dx.doi.org/10.1109/NSSPW.2006.4378824>

positional arguments:
  params_file           ensemble parameter file to resample

optional arguments:
  -h, --help            show this help message and exit
  --weights-file WEIGHTS_FILE, -w WEIGHTS_FILE
                        typically the likelihood from a Bayesian analysis,
                        i.e. exp(-(model - obs)**2 / (2*variance)), to be
                        multiplied when several observations are used
  --log                 set if weights are provided as log-likelihood (no
                        exponential)

jittering:
  --iis                 IIS-type resampling with likelihood flattening +
                        jitter
  --epsilon EPSILON     Exponent to flatten the weights and derive jitter
                        variance as a fraction of resampled parameter
                        variance. If not provided 0.05 is used as a starting
                        value but adjusted if the effective ensemble size is
                        not in the range specified by --neff-bounds.
  --neff-bounds NEFF_BOUNDS NEFF_BOUNDS
                        Acceptable range for the effective ensemble size when
                        --epsilon is not provided. Defaults to (0.5, 0.9).

sampling:
  --method {residual,multinomial}
                        resampling method (default: residual)
  -N SIZE, --size SIZE  New sample size (default: same size as before)
  --seed SEED           random seed, for reproducible results (default to
                        None)

output:
  -o OUT, --out OUT     output parameter file (print to screen otherwise)