This folder contains some of the pre-processed data files.
The table below lists, for each file in this directory, which of the Python scripts under src/scripts
was used to generate it, in case you want to recreate it yourself.
Most of these files are loaded into the global ExtraDatasetInfo object (see src/utils/data_utils.py).
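As a rough idea of what that loading looks like, here is a minimal sketch with assumed attribute names and an assumed data directory; the actual ExtraDatasetInfo definition is in src/utils/data_utils.py.

```python
# Minimal sketch of loading the pre-processed files; attribute names and the
# "data" directory are assumptions, the real class lives in src/utils/data_utils.py.
from pathlib import Path

import pandas as pd


class ExtraDatasetInfo:
    def __init__(self, data_dir="data"):
        data_dir = Path(data_dir)
        # RT reviews extended with their VADER compound score.
        self.reviews = pd.read_csv(data_dir / "reviews_with_compound.csv")
        # CMU movies that also have at least one expert rating on RT.
        self.ratings_expert = pd.read_csv(data_dir / "ratings_expert.csv")
        # Per-plot similarity to the predefined topic list.
        self.topic_similarities = pd.read_csv(data_dir / "cmu_topic_similarities.csv")
        # Weighted keyword extraction for each CMU plot.
        self.concepts = pd.read_pickle(data_dir / "cmu_concepts.pkl")
```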
As mentioned in the main README, you can find all the files here. Simply extract the archive (unzip extra_files.zip
) inside the root directory (not inside this folder), and all the files will end up in the right place (here).
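If you would rather do the extraction from Python, the equivalent of that unzip command, run from the repository root, is roughly:

```python
# Equivalent of `unzip extra_files.zip`, run from the repository root so the
# extracted files land in this directory.
import zipfile

with zipfile.ZipFile("extra_files.zip") as archive:
    archive.extractall()
```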
File | Script | Description | Depends on | Notes / Requirements |
---|---|---|---|---|
reviews_with_compound.csv | TODO.py | The Massive Rotten Tomatoes dataset, extended with a VADER sentiment (compound) score for each review (see the VADER sketch below the table). | / | ~12 GB RAM, 10 minutes |
ratings_expert.csv | Comes from the pre-processing in Milestone 2, section II. (the list_movie variable). | The CMU movies that also appear in the RT datasets and have at least one expert rating. | / | / |
cmu_topic_similarities.csv | get_topic_similarities.py | For each plot of the CMU dataset, a similarity score against a list of predefined topics. Similarity is computed with GloVe embeddings after keyword extraction and filtering on the movie plots (see the GloVe sketch below the table). | cmu_concepts.pkl | ~8 GB RAM, CPU only assuming keyword extraction has already been done (see next entry), 16 minutes |
cmu_concepts.pkl | get_cmu_concepts.py | For each plot of the CMU dataset, a (filtered and weighted) keyword extraction of the main words defining the plot (see the KeyBERT sketch below the table). | A slightly modified version of the KeyBERT API (see models.py) to enable GPU acceleration | ~10 GB RAM, GPU (4 GB+ of VRAM) for ~4x acceleration, ~20 minutes |
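The sketches below illustrate the techniques mentioned in the table; they are simplified illustrations, not the actual scripts. First, a minimal VADER pass over the reviews, where the input file name and the review_content column are assumptions:

```python
# Sketch: add a VADER "compound" column to the RT reviews.
# File and column names are assumptions, not those of the actual script.
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

reviews = pd.read_csv("rotten_tomatoes_reviews.csv")  # hypothetical input file
reviews["compound"] = reviews["review_content"].fillna("").map(
    lambda text: analyzer.polarity_scores(text)["compound"]
)
reviews.to_csv("reviews_with_compound.csv", index=False)
```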
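Next, a rough sketch of the GloVe-based topic similarity step. The topic list, the assumed layout of cmu_concepts.pkl, and the weighting are placeholders; see get_topic_similarities.py for the real ones.

```python
# Rough sketch of the GloVe topic similarity; topic list, file layout and
# weighting are placeholders, see get_topic_similarities.py for the real ones.
import numpy as np
import pandas as pd
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")   # any pre-trained GloVe model
topics = ["war", "romance", "crime"]          # placeholder topic list


def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def plot_vector(keywords):
    # Weighted average of the GloVe vectors of the in-vocabulary keywords.
    vecs = [weight * glove[kw] for kw, weight in keywords if kw in glove]
    return np.mean(vecs, axis=0) if vecs else None


# Assumed layout: {movie_id: [(keyword, weight), ...]}.
concepts = pd.read_pickle("cmu_concepts.pkl")

rows = []
for movie_id, keywords in concepts.items():
    vec = plot_vector(keywords)
    if vec is not None:
        rows.append({"movie_id": movie_id,
                     **{t: cosine(vec, glove[t]) for t in topics}})

pd.DataFrame(rows).to_csv("cmu_topic_similarities.csv", index=False)
```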
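Finally, the keyword extraction in get_cmu_concepts.py relies on a slightly modified KeyBERT API (see models.py) for GPU acceleration; with the stock library, the call looks roughly like this:

```python
# Sketch with the unmodified KeyBERT API; the repository patches it (models.py)
# so the underlying sentence-transformer runs on the GPU.
from keybert import KeyBERT

kw_model = KeyBERT()  # defaults to the all-MiniLM-L6-v2 sentence-transformer

plot = "A retired hitman is drawn back into the criminal underworld."
# Returns (keyword, relevance) pairs that can then be filtered and re-weighted.
keywords = kw_model.extract_keywords(
    plot, keyphrase_ngram_range=(1, 1), stop_words="english", top_n=5
)
print(keywords)
```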