Input Pipeline rework #245
Conversation
apax/data/initialization.py
Outdated
# def initialize_dataset(
#     config,
#     atoms_list,
#     read_labels: bool = True,
#     calc_stats: bool = True,
# ):
#     if calc_stats and not read_labels:
#         raise ValueError(
#             "Cannot calculate scale/shift parameters without reading labels."
#         )
#     inputs = process_inputs(
#         atoms_list,
#         r_max=config.model.r_max,
#         disable_pbar=config.progress_bar.disable_nl_pbar,
#         pos_unit=config.data.pos_unit,
#     )
#     labels = atoms_to_labels(
#         atoms_list,
#         additional_properties_info=config.data.additional_properties_info,
#         read_labels=read_labels,
#         pos_unit=config.data.pos_unit,
#         energy_unit=config.data.energy_unit,
#     )
Are these placeholders for implementing arbitrary labels?
This was part of the old data pipeline. I forgot to remove the comment.
self.n_jit_steps = 1
if pre_shuffle:
    shuffle(atoms)
self.sample_atoms = atoms[0]
I think in general we should be more consistent with atom, atoms, and lists of atoms: atoms should refer to a single structure. Should we open an issue?
Yes, I agree.
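For illustration, a minimal sketch of the convention being proposed (the function names here are hypothetical, not the current apax API): atoms names a single ase.Atoms structure, while atoms_list names a collection of them.

from typing import List

import ase

def count_atoms(atoms: ase.Atoms) -> int:
    # 'atoms' is a single structure (one ase.Atoms object).
    return len(atoms)

def count_atoms_per_structure(atoms_list: List[ase.Atoms]) -> List[int]:
    # 'atoms_list' is a list of structures.
    return [count_atoms(atoms) for atoms in atoms_list]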
apax/data/input_pipeline.py
Outdated
def enqueue(self, num_elements):
    for _ in range(num_elements):
        data = self.prepare_item(self.count)
Maybe prepare_data? In my opinion, item is somewhat misleading.
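For illustration, a minimal sketch of how the rename might read in a queue-based producer (the surrounding class and buffer are assumptions for the sketch, not the actual apax implementation):

import queue

class Prefetcher:
    def __init__(self, dataset, buffer_size: int = 10):
        self.dataset = dataset
        self.buffer = queue.Queue(maxsize=buffer_size)
        self.count = 0

    def prepare_data(self, idx):
        # Formerly 'prepare_item'; returns one preprocessed sample.
        return self.dataset[idx]

    def enqueue(self, num_elements: int):
        for _ in range(num_elements):
            data = self.prepare_data(self.count)
            self.buffer.put(data)
            self.count += 1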
apax/train/run.py
Outdated
    config.data.shift_options,
    config.data.scale_options,
)
# TODO IMPL DELETE FILES
Which files should be deleted here?
A reminder comment left over from the .cache-based implementation.
If everything is addressed, the PR can be merged.
I have reworked the data pipeline. The neighbor list (NL) is no longer precomputed for the entire dataset and stored in memory.
This used up a lot of RAM, and the conversion of unpadded NLs to tf ragged tensors was terribly slow.
This rework drastically speeds up startup times and reduces memory consumption without compromising performance.
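For context, a minimal sketch of the on-the-fly approach, assuming a hypothetical brute-force compute_neighbor_list helper (the real apax code uses its own neighbor-list routines): each structure's NL is built only when its batch is requested, so no full-dataset NL ever resides in memory.

import numpy as np

def compute_neighbor_list(positions: np.ndarray, r_max: float) -> np.ndarray:
    # Hypothetical stand-in: brute-force O(N^2) search for pairs within r_max.
    deltas = positions[:, None, :] - positions[None, :, :]
    dists = np.linalg.norm(deltas, axis=-1)
    idx_i, idx_j = np.nonzero((dists < r_max) & (dists > 0.0))
    return np.stack([idx_i, idx_j])

def lazy_batches(structures, r_max: float, batch_size: int):
    # Neighbor lists are computed per structure at iteration time,
    # so only the current batch's NLs are held in memory.
    batch = []
    for positions in structures:
        batch.append(compute_neighbor_list(positions, r_max))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch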