Introduce `DataSource` #24

ndiamant · 2021-08-18T22:21:38Z

`ml4ht.data.data_source.DataIndex` generalizes the idea of a sample id: instead of an integer id, it is a dictionary of values used to select data.
The simplest DataIndex is something like `{"sample_id": 2}`, but you might also want to include the dates for your modalities: `{"sample_id": 2, "ecg_date": "01-01-2000", "af_date": "02-01-2000"}`.
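As a rough illustration of selecting data with a dictionary instead of an integer, here is a minimal sketch. The `DataIndex` alias, the `ECG_STORE` table, and `select_ecg` are hypothetical names made up for this example; they are not taken from ml4ht, and the date strings are kept exactly as written above.

```python
from typing import Any, Dict, List, Tuple

# Assumption: a DataIndex is treated here as a plain dict of selection values.
DataIndex = Dict[str, Any]

simple_index: DataIndex = {"sample_id": 2}
dated_index: DataIndex = {
    "sample_id": 2,
    "ecg_date": "01-01-2000",
    "af_date": "02-01-2000",
}

# Hypothetical in-memory store: one sample id can have several ECGs,
# so the date is needed to pick a single recording.
ECG_STORE: Dict[Tuple[int, str], List[float]] = {
    (2, "01-01-2000"): [0.1, 0.2, 0.3],
    (2, "03-01-2000"): [0.4, 0.5, 0.6],
}


def select_ecg(index: DataIndex) -> List[float]:
    """Look up one ECG using both keys of the index."""
    return ECG_STORE[(index["sample_id"], index["ecg_date"])]


selected = select_ecg(dated_index)
```

With only `simple_index`, the lookup above would be ambiguous, which is the motivation for carrying extra keys in the index.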

`ml4ht.data.data_source.DataSource` generalizes the data-getting side of ml4h TensorMaps and `DataDescription.get_raw_data`.
A DataSource returns two dictionaries: one of model inputs and one of model outputs. For example, `ECGHD5Source` might return `{"ecg": np.array(...), "ecg_age": [12]}, {"AF": [0, 1]}`.
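The inputs/outputs pair can be sketched with a toy source. `ToyECGSource` is a hypothetical stand-in and not ml4ht's `ECGHD5Source`; the callable interface, the record layout, and the reading of `[0, 1]` as "AF present" are all assumptions made for illustration. Plain lists are used in place of NumPy arrays to keep the sketch dependency-free.

```python
from typing import Any, Dict, Tuple

DataIndex = Dict[str, Any]
Sample = Tuple[Dict[str, Any], Dict[str, Any]]


class ToyECGSource:
    """Hypothetical source: maps a data index to (inputs, outputs)."""

    def __init__(self, records: Dict[int, Dict[str, Any]]):
        # Assumption: records are held in memory, keyed by sample id.
        self.records = records

    def __call__(self, data_index: DataIndex) -> Sample:
        record = self.records[data_index["sample_id"]]
        inputs = {"ecg": record["ecg"], "ecg_age": [record["age"]]}
        # Assumed one-hot encoding: [0, 1] means AF, [1, 0] means no AF.
        outputs = {"AF": [0, 1] if record["has_af"] else [1, 0]}
        return inputs, outputs


source = ToyECGSource({2: {"ecg": [0.0, 0.5, 1.0], "age": 12, "has_af": True}})
inputs, outputs = source({"sample_id": 2})
```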

To train using multiple DataSources, you can use `ml4ht.data.data_source.TrainingDataset`, which integrates with PyTorch's DataLoader for multiprocessing capabilities.
If you want to skip errors, or change the indices each epoch, use ml4ht.data.data_source.TrainingIterableDataset.
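The two ideas above, merging several sources and skipping indices that fail, can be sketched in plain Python. This is not the ml4ht implementation and leaves out the PyTorch DataLoader integration entirely; `ecg_source`, `af_source`, `fetch`, and `iterate_skipping_errors` are hypothetical names, and which exception types count as skippable is an assumption.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List, Tuple

DataIndex = Dict[str, Any]
Sample = Tuple[Dict[str, Any], Dict[str, Any]]
Source = Callable[[DataIndex], Sample]


def ecg_source(index: DataIndex) -> Sample:
    # Simulate missing data for one sample id.
    if index["sample_id"] == 1:
        raise KeyError("no ECG for sample 1")
    return {"ecg": [index["sample_id"]] * 3}, {}


def af_source(index: DataIndex) -> Sample:
    return {}, {"AF": [0, 1]}


def fetch(sources: List[Source], index: DataIndex) -> Sample:
    """Merge the inputs and outputs of every source for one index."""
    inputs: Dict[str, Any] = {}
    outputs: Dict[str, Any] = {}
    for source in sources:
        source_inputs, source_outputs = source(index)
        inputs.update(source_inputs)
        outputs.update(source_outputs)
    return inputs, outputs


def iterate_skipping_errors(
    sources: List[Source], indices: Iterable[DataIndex]
) -> Iterator[Sample]:
    """Yield merged samples, dropping any index where a source fails."""
    for index in indices:
        try:
            yield fetch(sources, index)
        except (KeyError, ValueError):
            continue  # assumed policy: skip the whole sample


indices = [{"sample_id": i} for i in range(3)]
samples = list(iterate_skipping_errors([ecg_source, af_source], indices))
```

A map-style dataset has to return something for every index, so skipping is more natural in an iterable-style loop like this one, where the list of indices can also be regenerated each epoch.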

ndiamant added 6 commits August 18, 2021 16:19

create DataFetcher with tests

6299c2a

fetcher -> source, and sources can return inputs and outputs

0336e62

properly rename tests

76af3b8

Add normal pytorch Dataset for DataSources

ba8d2c4

refactor sample_id_epoch_generator to data_index_epoch_generator

5f48ed9

added exploration for DataSource

f4f5f77
