Introduce DataSource #24

Open
wants to merge 6 commits into
base: main

ndiamant
Contributor

ml4ht.data.data_source.DataIndex generalizes the idea of a sample id by replacing an integer id with a dictionary of values to select data with.
The simplest DataIndex is something like {"sample_id": 2}, but you might also want to include the dates for your modalities: {"sample_id": 2, "ecg_date": "01-01-2000", "af_date": "02-01-2000"}.
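
To make the idea concrete, here is a minimal sketch that models a DataIndex as a plain dictionary. The `DataIndex` alias and the `describe` helper are assumptions made for illustration, not the definitions in ml4ht.data.data_source, and the dates are written as strings here only so the literals are valid Python.

```python
from typing import Any, Dict

# Assumption: a DataIndex is modeled as a plain dict of selection values.
# The actual type in ml4ht.data.data_source may differ.
DataIndex = Dict[str, Any]

# The simplest index selects data by sample id alone.
simple_index: DataIndex = {"sample_id": 2}

# A richer index also pins down which recording of each modality to use.
dated_index: DataIndex = {
    "sample_id": 2,
    "ecg_date": "01-01-2000",
    "af_date": "02-01-2000",
}


def describe(index: DataIndex) -> str:
    """Hypothetical helper: render an index as a stable, readable key."""
    return ", ".join(f"{key}={index[key]}" for key in sorted(index))
```

A string key like the one `describe` builds could be handy for logging which index failed to load, since two indices with the same values always render the same way.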

ml4ht.data.data_source.DataSource generalizes the data-getting side of ml4h TensorMaps and DataDescription.get_raw_data.
A DataSource returns two dictionaries: one of model inputs and one of model outputs. For example, ECGHD5Source might return {"ecg": np.array(...), "ecg_age": [12]}, {"AF": [0, 1]}.
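
The shape of that contract can be sketched with a toy, in-memory source. `ToyECGSource`, its `records` argument, the use of `__call__`, and the one-hot encoding of the AF label are all assumptions for illustration; this is not the ECGHD5Source from the PR, and plain lists stand in for arrays to keep the sketch dependency-free.

```python
from typing import Any, Dict, Tuple

DataIndex = Dict[str, Any]
Tensors = Dict[str, Any]


class ToyECGSource:
    """Hypothetical stand-in for a data source; not the ml4ht class."""

    def __init__(self, records: Dict[int, Dict[str, Any]]):
        # Assumed storage: records keyed by sample id, held in memory.
        self.records = records

    def __call__(self, index: DataIndex) -> Tuple[Tensors, Tensors]:
        # Raises KeyError if the sample id is not present.
        record = self.records[index["sample_id"]]
        inputs = {"ecg": record["ecg"], "ecg_age": [record["age"]]}
        # Assumed encoding: [1, 0] means no AF, [0, 1] means AF.
        outputs = {"AF": [0, 1] if record["has_af"] else [1, 0]}
        return inputs, outputs


source = ToyECGSource({2: {"ecg": [0.1, 0.2, 0.3], "age": 12, "has_af": True}})
inputs, outputs = source({"sample_id": 2})
```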

To train using multiple DataSources, you can use ml4ht.data.data_source.TrainingDataset, which integrates with PyTorch's DataLoader for multiprocessing capabilities.
If you want to skip errors, or change the indices each epoch, use ml4ht.data.data_source.TrainingIterableDataset.
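
The difference between the two dataset styles can be sketched in plain Python. Everything below is hypothetical: `merge_sources`, `MapStyleDataset`, and `SkippingIterableDataset` are illustrative names, not the classes in this PR. The sketch assumes that sources are callables returning an (inputs, outputs) pair, that only `KeyError` is skipped, and that "changing the indices each epoch" means reshuffling. In real use the classes would presumably subclass `torch.utils.data.Dataset` and `torch.utils.data.IterableDataset`, whose protocols (`__len__`/`__getitem__` and `__iter__`) are mirrored here.

```python
import random
from typing import Any, Callable, Dict, Iterator, List, Tuple

DataIndex = Dict[str, Any]
Tensors = Dict[str, Any]
Source = Callable[[DataIndex], Tuple[Tensors, Tensors]]


def merge_sources(sources: List[Source], index: DataIndex) -> Tuple[Tensors, Tensors]:
    """Query every source with the same index and merge the dictionaries."""
    inputs: Tensors = {}
    outputs: Tensors = {}
    for source in sources:
        source_inputs, source_outputs = source(index)
        inputs.update(source_inputs)
        outputs.update(source_outputs)
    return inputs, outputs


class MapStyleDataset:
    """Fixed list of indices; any loading error propagates to the caller."""

    def __init__(self, sources: List[Source], indices: List[DataIndex]):
        self.sources = sources
        self.indices = indices

    def __len__(self) -> int:
        return len(self.indices)

    def __getitem__(self, i: int) -> Tuple[Tensors, Tensors]:
        return merge_sources(self.sources, self.indices[i])


class SkippingIterableDataset:
    """Reshuffles indices on every pass and skips indices that fail to load."""

    def __init__(self, sources: List[Source], indices: List[DataIndex], seed: int = 0):
        self.sources = sources
        self.indices = indices
        self.rng = random.Random(seed)

    def __iter__(self) -> Iterator[Tuple[Tensors, Tensors]]:
        order = list(self.indices)
        self.rng.shuffle(order)  # new order each epoch (assumed behavior)
        for index in order:
            try:
                yield merge_sources(self.sources, index)
            except KeyError:
                continue  # assumed policy: silently skip missing samples


ages = {1: 40, 2: 12}
labels = {1: [1, 0], 2: [0, 1]}


def age_source(index: DataIndex) -> Tuple[Tensors, Tensors]:
    return {"ecg_age": [ages[index["sample_id"]]]}, {}


def af_source(index: DataIndex) -> Tuple[Tensors, Tensors]:
    return {}, {"AF": labels[index["sample_id"]]}


indices = [{"sample_id": 1}, {"sample_id": 2}, {"sample_id": 3}]  # id 3 has no data
map_ds = MapStyleDataset([age_source, af_source], indices[:2])
iter_ds = SkippingIterableDataset([age_source, af_source], indices)
```

A map-style dataset has a fixed length, so a failing index has to surface as an error. An iterable dataset makes no length promise, which is what leaves room to drop failing indices or reorder them between epochs.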
