Materials Project time split dataset - load_dataframe_from_json returns None during debugging (conditionally) #832

Open
sgbaird opened this issue Jun 4, 2022 · 0 comments

I've uploaded a Materials Project time-split dataset, sorted by earliest year of reference and limited to experimental entries with fewer than 52 sites: https://figshare.com/articles/dataset/Materials_Project_Time_Split_Data/19991516

Would this be suitable as a matminer dataset contribution? For additional context, see "How do I do a time-split of Materials Project entries? e.g. pre-2018 vs. post-2018" and sparks-baird/xtal2png#12 (comment). I'm starting to feel like I'm reinventing the wheel by trying to host it myself.
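To make the idea of a time split concrete, here is a minimal sketch under stated assumptions. The `Entry` fields (`material_id`, `year`, `nsites`, `experimental`) and the `time_split` helper are hypothetical names chosen for illustration and are not taken from the dataset or from any library. It also assumes that entries from the cutoff year itself go into the later group.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    material_id: str    # hypothetical identifier
    year: int           # earliest year of reference (hypothetical field name)
    nsites: int         # number of sites in the structure
    experimental: bool  # True for experimentally observed entries


MAX_SITES = 52  # keep only entries with fewer than 52 sites


def time_split(entries, cutoff_year):
    """Filter, sort by year, and split into (before cutoff, cutoff and later)."""
    kept = [e for e in entries if e.experimental and e.nsites < MAX_SITES]
    kept.sort(key=lambda e: e.year)
    train = [e for e in kept if e.year < cutoff_year]
    test = [e for e in kept if e.year >= cutoff_year]  # assumption: cutoff year goes here
    return train, test


entries = [
    Entry("a", 2019, 10, True),
    Entry("b", 2001, 60, True),   # dropped: too many sites
    Entry("c", 1995, 4, True),
    Entry("d", 2010, 8, False),   # dropped: not experimental
    Entry("e", 2018, 20, True),
]

train, test = time_split(entries, cutoff_year=2018)
print([e.material_id for e in train])  # ['c']
print([e.material_id for e in test])   # ['e', 'a']
```

Sorting before splitting keeps both groups in chronological order, so the same filtered list can be reused with other cutoff years.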

In my own code, I've been running into a strange issue when I use the following function:

# Imports needed to run this excerpt on its own
import json
import sys

import pandas
from monty.io import zopen
from monty.json import MontyDecoder
from tqdm import tqdm


def load_dataframe_from_json(filename, pbar=True, decode=True):
    """Load pandas dataframe from a json file.

    Automatically decodes and instantiates pymatgen objects in the dataframe.

    Args:
        filename (str): Path to json file. Can be a compressed file (gz and bz2)
            are supported.
        pbar (bool): If true, shows an ASCII progress bar for loading data from disk.
        decode (bool): If true, will automatically decode objects (slow, convenient).
            If false, will return json representations of the objects (fast, inconvenient).

    Returns:
        (Pandas.DataFrame): A pandas dataframe.
    """
    # Progress bar for reading file with hook
    pbar1 = tqdm(desc=f"Reading file {filename}", position=0, leave=True, ascii=True, disable=not pbar)

    def is_monty_object(o):
        """
        Determine if an object can be decoded into json
        by monty.

        Args:
            o (object): An object in dict-form.

        Returns:
            (bool)
        """
        if isinstance(o, dict) and "@class" in o:
            return True
        else:
            return False

    def pbar_hook(obj):
        """
        A hook for a pbar reading the raw data from json, not
        using monty decoding to decode the object.

        Args:
            obj (object): A dict-like

        Returns:
            obj (object)
        """
        if is_monty_object(obj):
            pbar1.update(1)
            sys.stderr.flush()
        return obj

    # Progress bar for decoding objects
    pbar2 = tqdm(desc=f"Decoding objects from {filename}", position=0, leave=True, ascii=True, disable=not pbar)

    class MontyDecoderPbar(MontyDecoder):
        """
        A pbar-friendly version of MontyDecoder.
        """

        def process_decoded(self, d):
            if isinstance(d, dict) and "data" in d and "index" in d and "columns" in d:
                # total number of objects to decode
                # is the number of @class mentions
                pbar2.total = str(d).count("@class")
            elif is_monty_object(d):
                pbar2.update(1)
                sys.stderr.flush()
            return super().process_decoded(d)

    if decode:
        decoder = MontyDecoderPbar if pbar else MontyDecoder
    else:
        decoder = None
    hook = pbar_hook if pbar else lambda x: x

    with zopen(filename, "rb") as f:
        dataframe_data = json.load(f, cls=decoder, object_hook=hook)

    pbar1.close()
    pbar2.close()

    # if only keys are data, columns, index then orient=split
    if isinstance(dataframe_data, dict):
        if set(dataframe_data.keys()) == {"data", "columns", "index"}:
            return pandas.DataFrame(**dataframe_data)
    else:
        return pandas.DataFrame(dataframe_data)

It returns None during an uninterrupted debugging run, but if I set a breakpoint and run the line manually in the debug console (VS Code) then it returns the expected DataFrame.
See https://github.com/sparks-baird/mp-time-split/runs/6739787243?check_suite_focus=true/#step:5:1
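One path that can yield None is the final branch of the function. The sketch below mirrors only that branch on in-memory data, assuming the final `else` pairs with the outer `isinstance` check. `pick_constructor_call` is a hypothetical stand-in that returns a label instead of building a DataFrame, so it runs without pandas. It does not explain why the uninterrupted run and the debug console would differ; it only shows that a dict whose keys are not exactly `data`, `columns`, and `index` falls off the end of the function.

```python
def pick_constructor_call(dataframe_data):
    """Mirror the loader's last branch, returning a label instead of a DataFrame."""
    if isinstance(dataframe_data, dict):
        if set(dataframe_data.keys()) == {"data", "columns", "index"}:
            return ("split", dataframe_data)    # stands in for pandas.DataFrame(**data)
    else:
        return ("records", dataframe_data)      # stands in for pandas.DataFrame(data)
    # A dict with any other key set reaches this point, so the result is None.


split_like = {"data": [[1]], "columns": ["x"], "index": [0]}
records_like = [{"x": 1}]
other_dict = {"data": [[1]], "columns": ["x"], "index": [0], "extra": 1}

print(pick_constructor_call(split_like)[0])    # split
print(pick_constructor_call(records_like)[0])  # records
print(pick_constructor_call(other_dict))       # None
```

If this turns out to be the path taken, checking the type and keys of the decoded object just before the final branch would confirm it.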
