Feat/read sfcf multi #210

Merged
merged 21 commits into fjosw:develop on Oct 20, 2023

Conversation

@jkuhl-uni (Collaborator) commented Oct 6, 2023

Hey there,

so as I mentioned in a previous pull request, I wanted to rewrite the read_sfcf method once more.
So... I did. There is now a method "read_sfcf_multi", whose functionality is a superset of the "read_sfcf" method.
The read_sfcf method is now simply routed through the new one.
All of this was done for the following reason: I have been trying to accelerate the reading of sfcf measurements in my projects for some time.
I finally found that most of the time is spent in IO, because every file is opened once per read.
Since one usually does not read one correlator at a time but e.g. 8 correlators at a time, this results in 8 $\times$ #cnfg open(file) calls, each of which has to retrieve the file from disk and load it into the buffer.
Instead, I now do the following:
I read all relevant correlators directly from one file, then move on to the next file. Therefore I only need #cnfg open(file) calls. The drawback is that afterwards I have to manage unpacking a Python dict, but that can be done much faster.
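To make the access pattern concrete, here is a minimal, hypothetical sketch of the idea (toy file format and invented helper names, not the actual sfcf parser):

def _parse_corr(content, name):
    """Toy parser: collect the float values in the block headed by 'name:'."""
    in_block, values = False, []
    for line in content.splitlines():
        if line.strip().endswith(":"):
            in_block = (line.strip()[:-1] == name)
        elif in_block and line.strip():
            values.append(float(line.split()[-1]))
    return values


def read_all_corrs(cnfg_files, corr_names):
    """One open() call per configuration file instead of one per (corr, cnfg) pair."""
    data = {name: [] for name in corr_names}
    for path in cnfg_files:          # #cnfg open(file) calls in total
        with open(path) as f:
            content = f.read()       # the file is pulled from disk exactly once
        for name in corr_names:      # all correlators are served from the buffer
            data[name].append(_parse_corr(content, name))
    return data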

Here are some benchmarks taken with timeit:

#cnfg = 2146

#corrs | read_sfcf (old) | read_sfcf (routed through new) | read_sfcf_multi
-------|-----------------|--------------------------------|----------------
72     | 70.942 s        | 71.462 s                       | 4.777 s
36     | n/a             | 34.308 s                       | 2.280 s
20     | n/a             | 18.421 s                       | 1.653 s

This new functionality is not yet tested extensively: so far I have run the pytest tests that are also part of this PR and the benchmarks on a set of measurements from one of my projects.

@jkuhl-uni requested a review from fjosw as a code owner on October 6, 2023 14:37
@s-kuberski (Collaborator)

Hi,
I think that it is a good idea, also to reduce the load on the file system in case of large projects. For the tests, you could in principle compare, correlator by correlator, the output of the old version with the output of the new one, right? This should be quite a strong test.
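In code, such a check could look roughly like this (hypothetical, simplified call signatures, not the actual pyerrors API):

import numpy as np

new = read_sfcf_multi(path, prefix, corr_names)
for name in corr_names:
    old = read_sfcf(path, prefix, name)        # one separate read per correlator
    for o_old, o_new in zip(old, new[name]):   # lists of Obs, one per timeslice
        for ens in o_old.deltas:               # compare the raw deltas per ensemble
            assert np.allclose(o_old.deltas[ens], o_new.deltas[ens])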

@jkuhl-uni (Collaborator, Author)

Thanks for the quick answer. Yes, you are right; however, given the complexity of the loops, I do not have data with which I could test all loops at once, except for the case already checked in the test. I could come up with tests that use the benchmark data. These iterate over correlators and quarks, but not over wavefunctions and offsets.

@jkuhl-uni (Collaborator, Author)

Something I forgot to mention: the benchmarks were done with the compact format of sfcf, where I expect the largest benefit. The other output formats are in some sense already sorted by correlator, so there only the number of wavefunctions, quarks and offsets plays a role.

@fjosw (Owner) commented Oct 6, 2023

Thanks for this PR! I just glanced over the code and saw sets of nested for loops in many places. Is there maybe a way to refactor these?

@jkuhl-uni (Collaborator, Author) commented Oct 6, 2023

Hi, hmm... for now I have spent my time organising the loops, as that is basically all that happens here.
Maybe there is a way to iterate through multiple levels of dicts more efficiently that I am not aware of?
That would help a lot.

@jkuhl-uni (Collaborator, Author)

Hi, so to make the work inside the loops easier, I am now using python-benedict (https://github.com/fabiocaccamo/python-benedict) to map the dict keys.
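For illustration, the keypath feature works roughly like this (a minimal sketch, assuming the keypath_separator option of benedict):

from benedict import benedict

d = benedict(keypath_separator="/")
d["f1/u/0"] = [0.35]        # one flat key creates all nested levels on the fly
print(d["f1"]["u"]["0"])    # -> [0.35]; underneath it behaves like a normal nested dict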

@fjosw (Owner) commented Oct 13, 2023

Hi Justus, sorry for getting back to you so late. I had a look at your changes, and I would prefer not to introduce additional dependencies unless necessary. If I understand it correctly, you only use benedict to process the nested dicts. Here is a proposal for a slightly different data structure without nested dicts, which might be a bit more readable:

import itertools

names = ["f1", "fA"]
quarks = ["u", "d", "s"]
offs = ["1", "2", "8"]

sep = "/"  # Separator char


# Create dict with identifiers separated by sep (for example 'f1/d/2')
corr_dict = {}
for tup in itertools.product(names, quarks, offs):
    corr_dict[sep.join(tup)] = []

# Retrieve entries
for key, value in corr_dict.items():
    name, quark, off = key.split(sep)
    # Process value here

Let me know what you think.

@jkuhl-uni (Collaborator, Author)

Hi Fabian,
alright, that's what I thought, and yes, something like this could work. I'll give it another look and implement the changes you suggest. I have to admit that I thought about implementing it myself, then had a look at the solutions on PyPI and took this one instead. I'll see that I find time in the next few days to change that and remove the dependency.

In the meantime I have tested the new method a bit more and compared the deltas read by the old and the new method for an $f_1$ matrix.
It seems everything is working as it should: I compared the deltas of multiple matrix entries and everything turns out fine.

@jkuhl-uni (Collaborator, Author)

One thing that we would have to do by hand in the solution you suggest is the "undoing" of the key concatenation, so that we end up with the original dict structure. I think this is essential, and less confusing for the user.
At the same time, I wanted to make the nice_output function a little better. I'll see that I get that done too.
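A sketch of what this undoing could look like (illustrative only, not the code in this PR):

def flat_to_nested(flat, sep="/"):
    """Undo the key concatenation: {'f1/u/0': v} -> {'f1': {'u': {'0': v}}}."""
    nested = {}
    for key, value in flat.items():
        *parents, last = key.split(sep)
        level = nested
        for p in parents:
            level = level.setdefault(p, {})
        level[last] = value
    return nested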

@fjosw (Owner) commented Oct 16, 2023

> One thing that we would have to do by hand in the solution you suggest is the "undoing" of the key concatenation, so that we end up with the original dict structure.

What do you mean by original dict structure? Can you explain what the dict that is returned should look like?

@jkuhl-uni (Collaborator, Author)

Ah sorry, of course I can. I structured the returned dictionary in the following way:
dict[name][quarks][offset][wf][wf2].
Of course one could reorder the keys, but in any case I'd like the dict to have a nested structure in which the user can search for what is needed.
My plan, furthermore, was to "clean up" this dictionary such that, e.g., if len(name_list) == 1 that key level is not needed after reading. With nice_output = False, one would get the complete nested structure, even if one just reads a single correlator.
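The cleanup step could look like this hypothetical sketch (the nice_output feature was later excluded from this PR, see below):

def _drop_singleton_levels(d):
    """Recursively remove dict levels with exactly one key, e.g. when len(name_list) == 1."""
    if not isinstance(d, dict):
        return d
    if len(d) == 1:
        return _drop_singleton_levels(next(iter(d.values())))
    return {k: _drop_singleton_levels(v) for k, v in d.items()}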

@fjosw (Owner) commented Oct 16, 2023

Okay, I guess if you want to stick with the nested dict, then there is no benefit in using a different data structure for the computation.

@jkuhl-uni (Collaborator, Author)

Hi, I reverted the changes that introduced benedict. Also, I found a bug in one of the tests and excluded the "nice output" feature that I built earlier, as it was not stable enough.

@fjosw requested a review from s-kuberski on October 19, 2023 12:58
@s-kuberski (Collaborator)

Hi,
thanks for your changes. Of course, just by reading the code it is not really possible from my side to spot bugs or inconsistencies that might have slipped in while making the changes. The code has reached a complexity where it is quite hard to debug, I think. But then, it is only used for reading data, which is a task that you can (and do) easily check with the test programs.

I think that as soon as you fix the linting errors, you are good to merge the pull request, provided you are confident that the checks catch (almost) all possible mistakes.

@fjosw (Owner) commented Oct 20, 2023

I fully agree with Simon. I'm also concerned about maintainability and readability and would prefer a more concise version (for example using itertools). But then again, this is very specialised code which we might never touch again and which is not performance critical, so I would also be fine with merging the changes once flake8 is happy.

@jkuhl-uni (Collaborator, Author) commented Oct 20, 2023

Hi,
thank you for your opinions; I couldn't agree more.
There are two things I could offer to keep the code better under control:

  1. more specialised tests
  2. reintroduce templating in the code. I did this when using benedict, and I think it could contribute to cleaner code. What I mean is that very similar dict structures are currently generated at multiple places in the code; there should be a way around this. Although the complexity would stay the same, the code would become easier to read (see the sketch after this list).

Maybe addressing these two things could benefit the code?
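As a hypothetical sketch of point 2, the repeated dict structures could be generated from a single template helper:

import itertools

def _result_template(names, quarks, offs, wfs, sep="/"):
    # Build the flat result skeleton once and reuse it wherever the same
    # structure is needed, instead of rebuilding it at several places.
    return {sep.join(tup): [] for tup in itertools.product(names, quarks, offs, wfs)}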

@fjosw (Owner) commented Oct 20, 2023

Sounds good, your call if it's worth investing the time to rewrite stuff!

@jkuhl-uni (Collaborator, Author) commented Oct 20, 2023

Hi, so I finally refactored everything as Fabian suggested in an earlier comment. As the result is put into its final form only in the last few lines of the method, I use the "flat" structure all the way through until then. I then added the option to return either nested dicts or a flat dict as the result.
I agree that this makes the method quite a bit more readable. Also, as one still needs to split the key into its parts from time to time, I think the structure of the key can be understood easily as well.
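For reference, here is a reconstructed reading of the flat-key helper that appears in the diff below (a sketch, not copied from the PR):

import itertools

sep = "/"

def _lists2key(*lists):
    # Join every combination drawn from the given lists into one flat key
    # such as 'f1/u/0'.
    return [sep.join(tup) for tup in itertools.product(*lists)]

# _lists2key(["f1", "fA"], ["u", "d"]) -> ['f1/u', 'f1/d', 'fA/u', 'fA/d']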



def _lists2key(*lists):
    sep = "/"
@fjosw (Owner) commented on this code:
I noticed that you define the separator char in two places.

@jkuhl-uni (Collaborator, Author):
Oh yes, you are right... That comes from trying somewhere else, then copying...

@fjosw merged commit 0ef8649 into fjosw:develop on Oct 20, 2023