This folder contains some of the pre-processed data files.
The table below lists, for each file in this directory, which of the Python scripts under src/scripts
was used to generate it, in case you want to recreate it yourself.
Most of these files are loaded into the global ExtraDatasetInfo object (see src/utils/data_utils.py).
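As a rough idea of what that loading looks like, here is a minimal sketch with assumed attribute names and an assumed data directory; the actual ExtraDatasetInfo definition is in src/utils/data_utils.py.

```python
# Minimal sketch of loading the pre-processed files; attribute names and the
# "data" directory are assumptions, the real class lives in src/utils/data_utils.py.
from pathlib import Path

import pandas as pd


class ExtraDatasetInfo:
    def __init__(self, data_dir="data"):
        data_dir = Path(data_dir)
        # RT reviews extended with their VADER compound score.
        self.reviews = pd.read_csv(data_dir / "reviews_with_compound.csv")
        # CMU movies that also have at least one expert rating on RT.
        self.ratings_expert = pd.read_csv(data_dir / "ratings_expert.csv")
        # Per-plot similarity to the predefined topic list.
        self.topic_similarities = pd.read_csv(data_dir / "cmu_topic_similarities.csv")
        # Weighted keyword extraction for each CMU plot.
        self.concepts = pd.read_pickle(data_dir / "cmu_concepts.pkl")
```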
As mentioned in the main README, you can find all the files here. Simply extract the archive (unzip extra_files.zip
) inside the root directory (not inside this folder), and all the files will end up in the right place (here).
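If you would rather do the extraction from Python, the equivalent of that unzip command, run from the repository root, is roughly:

```python
# Equivalent of `unzip extra_files.zip`, run from the repository root so the
# extracted files land in this directory.
import zipfile

with zipfile.ZipFile("extra_files.zip") as archive:
    archive.extractall()
```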
File | Script | Description | Depends on | Notes / Requirements |
---|---|---|---|---|
reviews_with_compound.csv | TODO.py | The Massive Rotten Tomatoes dataset, extended with a VADER sentiment (compound) score for each review (see the VADER sketch below the table). | / | ~12 GB RAM, 10 minutes |
ratings_expert.csv | Comes from the pre-processing in Milestone 2, section II. (the list_movie variable). | The CMU movies that also appear in the RT datasets and have at least one expert rating. | / | / |
cmu_topic_similarities.csv | get_topic_similarities.py | For each plot of the CMU dataset, a similarity score against a list of predefined topics. Similarity is computed with GloVe embeddings after keyword extraction and filtering on the movie plots (see the GloVe sketch below the table). | cmu_concepts.pkl | ~8 GB RAM, CPU only assuming keyword extraction has already been done (see next entry), 16 minutes |
cmu_concepts.pkl | get_cmu_concepts.py | For each plot of the CMU dataset, a (filtered and weighted) keyword extraction of the main words defining the plot (see the KeyBERT sketch below the table). | A slightly modified version of the KeyBERT API (see models.py) to enable GPU acceleration | ~10 GB RAM, GPU (4 GB+ of VRAM) for ~4x acceleration, ~20 minutes |
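The sketches below illustrate the techniques mentioned in the table; they are simplified illustrations, not the actual scripts. First, a minimal VADER pass over the reviews, where the input file name and the review_content column are assumptions:

```python
# Sketch: add a VADER "compound" column to the RT reviews.
# File and column names are assumptions, not those of the actual script.
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

reviews = pd.read_csv("rotten_tomatoes_reviews.csv")  # hypothetical input file
reviews["compound"] = reviews["review_content"].fillna("").map(
    lambda text: analyzer.polarity_scores(text)["compound"]
)
reviews.to_csv("reviews_with_compound.csv", index=False)
```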
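Next, a rough sketch of the GloVe-based topic similarity step. The topic list, the assumed layout of cmu_concepts.pkl, and the weighting are placeholders; see get_topic_similarities.py for the real ones.

```python
# Rough sketch of the GloVe topic similarity; topic list, file layout and
# weighting are placeholders, see get_topic_similarities.py for the real ones.
import numpy as np
import pandas as pd
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")   # any pre-trained GloVe model
topics = ["war", "romance", "crime"]          # placeholder topic list


def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def plot_vector(keywords):
    # Weighted average of the GloVe vectors of the in-vocabulary keywords.
    vecs = [weight * glove[kw] for kw, weight in keywords if kw in glove]
    return np.mean(vecs, axis=0) if vecs else None


# Assumed layout: {movie_id: [(keyword, weight), ...]}.
concepts = pd.read_pickle("cmu_concepts.pkl")

rows = []
for movie_id, keywords in concepts.items():
    vec = plot_vector(keywords)
    if vec is not None:
        rows.append({"movie_id": movie_id,
                     **{t: cosine(vec, glove[t]) for t in topics}})

pd.DataFrame(rows).to_csv("cmu_topic_similarities.csv", index=False)
```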
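Finally, the keyword extraction in get_cmu_concepts.py relies on a slightly modified KeyBERT API (see models.py) for GPU acceleration; with the stock library, the call looks roughly like this:

```python
# Sketch with the unmodified KeyBERT API; the repository patches it (models.py)
# so the underlying sentence-transformer runs on the GPU.
from keybert import KeyBERT

kw_model = KeyBERT()  # defaults to the all-MiniLM-L6-v2 sentence-transformer

plot = "A retired hitman is drawn back into the criminal underworld."
# Returns (keyword, relevance) pairs that can then be filtered and re-weighted.
keywords = kw_model.extract_keywords(
    plot, keyphrase_ngram_range=(1, 1), stop_words="english", top_n=5
)
print(keywords)
```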