# Pre-processed data

This folder contains some of the pre-processed data. Use the table below to find, for each file in this directory, which of the Python scripts under `src/scripts` was used to generate it, should you want to recreate it yourself. Most of these files are loaded into the global `ExtraDatasetInfo` object (see `src/utils/data_utils.py`). As mentioned in the main README, you can find all the files here. Simply extract the archive (`unzip extra_files.zip`) inside the root directory (not in this folder), and all the files will end up in the right place (here).
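If you only want to inspect one of these files outside of the full pipeline, they can be read directly. The snippet below is a minimal sketch, assuming it is run from inside this folder; the project itself loads these files through `ExtraDatasetInfo` (`src/utils/data_utils.py`), not this way.

```python
# Minimal sketch: peek at the pre-processed files directly.
# Assumes the script is run from inside this folder after extracting extra_files.zip.
import pickle

import pandas as pd

reviews = pd.read_csv("reviews_with_compound.csv")
print(reviews.shape)
print(reviews.head())

with open("cmu_concepts.pkl", "rb") as f:
    cmu_concepts = pickle.load(f)
print(type(cmu_concepts))
```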

| File | Script | Description | Depends on | Notes/Requirements |
| --- | --- | --- | --- | --- |
| `reviews_with_compound.csv` | TODO.py | The Massive Rotten Tomatoes dataset, extended with VADER sentiment analysis on the reviews (see the VADER sketch below the table). | / | ~12 GB RAM, 10 minutes |
| `ratings_expert.csv` | Comes from the pre-processing in Milestone 2, section II. (`list_movie` variable). | The movies from CMU which also appear in the RT datasets and have at least one expert rating. | / | / |
| `cmu_topic_similarities.csv` | `get_topic_similarities.py` | For each plot of the CMU dataset, a similarity score against a list of predefined topics. Similarity is computed with GloVe embeddings after keyword extraction and filtering on the movie plots (see the keyword/GloVe sketch below the table). | `cmu_concepts.pkl` | ~8 GB RAM, CPU only assuming keyword extraction has already been done (see next entry), 16 minutes |
| `cmu_concepts.pkl` | `get_cmu_concepts.py` | For each plot of the CMU dataset, a filtered and weighted keyword extraction of the main words defining the plot. | A slightly modified version of the KeyBERT API (see `models.py`) to enable GPU acceleration | ~10 GB RAM, GPU (4 GB+ VRAM) for ~4x speed-up, ~20 minutes |
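
As a reference for how a compound sentiment column like the one in `reviews_with_compound.csv` is typically produced, here is a minimal sketch using the `vaderSentiment` package. The actual generating script (marked TODO above) may differ in details such as batching, text cleaning and column names.

```python
# Minimal VADER sketch: compute a compound sentiment score per review text.
# The real preprocessing script may differ (batching, cleaning, column names).
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

reviews = [
    "An absolute masterpiece, beautifully shot and acted.",
    "Two hours of my life I will never get back.",
]

# polarity_scores returns {'neg', 'neu', 'pos', 'compound'}; 'compound' is in [-1, 1].
for text in reviews:
    scores = analyzer.polarity_scores(text)
    print(f"{scores['compound']:+.3f}  {text}")
```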
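
For the last two rows, the sketch below only illustrates the general idea: extract weighted keywords from a plot with KeyBERT, then score each predefined topic by GloVe word similarity. It is an assumption-laden approximation of what `get_cmu_concepts.py` and `get_topic_similarities.py` do; the topic list, weighting and filtering here are made up, and no GPU-enabled KeyBERT modification is used.

```python
# Illustrative sketch of the keyword -> topic-similarity idea (not the exact scripts).
# Assumes `pip install keybert gensim`; topics and weighting are illustrative only.
import gensim.downloader as api
from keybert import KeyBERT

plot = ("A retired hitman is pulled back into the criminal underworld "
        "to protect his family from a ruthless crime syndicate.")

# 1) Keyword extraction (cmu_concepts.pkl stores filtered, weighted keywords per plot).
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(plot, keyphrase_ngram_range=(1, 1), top_n=5)
# `keywords` is a list of (word, weight) pairs.

# 2) GloVe similarity of each keyword to each predefined topic, weighted by keyword score.
glove = api.load("glove-wiki-gigaword-100")  # downloads the vectors on first use
topics = ["crime", "family", "war", "romance"]

for topic in topics:
    score = sum(
        weight * glove.similarity(word, topic)
        for word, weight in keywords
        if word in glove
    )
    print(f"{topic:10s} {score:.3f}")
```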