This is the code and data for the paper "Language Models Can Teach Themselves to Program Better": https://arxiv.org/abs/2207.14502

LICENSE
MIT License, as already specified in the ../LICENSE file of the PythonProgrammingPuzzles repo.

GPU USAGE
GPU usage was large, especially for the 2.7B model, which is roughly 20X the size of the 125M model. Data generation takes the most GPU time: about 2500 GPU hours for 2.7B (on V100). Finetuning 2.7B on the 1M generated samples took about 40 GPU hours (on V100) per epoch of finetuning; 10 epochs = 400 GPU hours. Solving the 228-problem test set with 100 attempts using the finetuned 2.7B model took about 4 hours (on V100). We mostly used V100s, but we used whatever was available, so sometimes T4s and A100s when they were free. We tried everything at 125M first - debug there and make it work perfectly - then roll out the 1.3B and 2.7B jobs.

DATASETS
The datasets used are in the data directory. We feel the most interesting dataset is data/Codex_PAPER_1M_iter_0.txt, which was generated by Codex and gave the best results when finetuned on. All the datasets are part of our public release.
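For a quick look at any of these files from a shell (a minimal sketch; it assumes nothing beyond the datasets being plain-text .txt files):

wc -l data/Codex_PAPER_1M_iter_0.txt        # rough size check
head -c 1000 data/Codex_PAPER_1M_iter_0.txt # peek at the first generated puzzles/solutions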

SETUP
src/requirements.txt is what we install on our cluster machines - the cluster comes with NVIDIA drivers and a matching PyTorch. ./requirements.txt is what I personally have installed on my local machine and have tested runs - but it has lots of stuff you don't need. So try src/requirements.txt first; if that doesn't work, ./requirements.txt lists the versions of everything installed on my machine. Getting a deepspeed 0.6.1 install that matches your PyTorch and NVIDIA driver was tricky for me on some machines; torch 1.10 and 1.11 both work.
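A minimal setup sketch along those lines (assuming pip and a machine that already has NVIDIA drivers and a matching PyTorch; adjust to your environment):

cd src
pip install -r requirements.txt      # cluster-style install - try this first
pip install -r ../requirements.txt   # fallback: the full snapshot of my local machine
python -c "import torch, deepspeed; print(torch.__version__, deepspeed.__version__)"   # expect torch 1.10/1.11 and deepspeed 0.6.1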

GENERATING/FINETUNING
Run "cd src; ./babysit.sh GPU_INDEX_TO_USE" - GPU_INDEX_TO_USE=0 typically. src/babysit.sh is the script that generates data and finetunes on that data in a loop, finetuning the GPT-Neo 125M/1.3B/2.7B models. In src/babysit.sh, TEST_LOCAL=1 runs locally on the machine's GPUs, which is great for fast testing; TEST_LOCAL=0 launches on the cluster, which is slower to start but has lots of GPUs. Realistically you have to train on a cluster - data generation takes a long time, so having lots of machines all generating data is the feasible approach. But given enough time this will run locally on 1 GPU: about 1 year for 2.7B, or 2 weeks for 125M. We found that generating 75K samples after deduping worked for iteration 0 - finetune on that data. Then, using that finetuned model in iteration 1, data generation happens more quickly because the finetuned model solves many more problems. Repeating that process works well. On 125M we compared training only on 125M-generated data from iter_0 vs. iter_1 vs. iter_2, generating 600K samples for each iteration. Finetuning on iter_2 data was best on the test set: 26.9/228 solved, vs. 26.1/228 for iter_1 and 22.2/228 for iter_0. With 1M samples of 125M-generated data sampled across all of iterations 0, 1, and 2 we got 26.75/228. We understand why it's faster to generate iter_2 data with a finetuned model - it solves more problems - but why are the generated puzzles & solutions better for training the model on? We will explore that more in the future, and try iterating well beyond 3 iterations, although our preliminary experiments on 125M show it tops out at 3 iterations.
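A minimal usage sketch of that loop (GPU index 0; TEST_LOCAL is edited inside the script itself):

cd src
# set TEST_LOCAL=1 in babysit.sh for a local single-GPU run, TEST_LOCAL=0 to launch on the cluster
./babysit.sh 0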

FINETUNING ONLY
Run "cd src; ./fine_tune1.sh GPU_INDEX_TO_USE ..." - GPU_INDEX_TO_USE=0 typically. The full argument list is:

./fine_tune1.sh GPU MODEL_TO_TRAIN EXPERIMENT_NAME_DIRECTORY TRAIN_DATA EPOCHS

This allows repeated finetuning on a specific dataset. Use it to do a temperature grid search, or to try different parameter variations on a specific dataset, as sketched below.
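For example, a sketch of sweeping the number of finetuning epochs on one dataset (the argument order follows the signature above; the ft1_ep* experiment names are just an illustrative choice):

cd src
for EPOCHS in 1 5 10; do
  ./fine_tune1.sh 0 125M ft1_ep${EPOCHS}_Codex_PAPER_1M_iter_0 Codex_PAPER_1M_iter_0.txt ${EPOCHS}
done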

Detailed instructions for reproducing experiments:

Generating Codex data

python gen.py -n=32 -max_tokens=4096 -model_path=openai/code-davinci-002 -model_path_solve=openai/code-cushman-001 -out=../data/codex/iter_0 -seed=2022

Measuring Codex accuracy via API calls

./solve2.sh python solve.py -prefix=../data/train_prefix.txt -attempts=1 -model_path=openai/code-cushman-001 -gpu=0 -fixed_temp=0.8 -out=../data/codex -puzzles=../data/test_228.json -seed=2022 -batch_size=64
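To sweep the sampling temperature instead of using a single value, the same call can be looped (a sketch; only -fixed_temp and -out change, and the per-temperature output directories are an illustrative choice):

for TEMP in 0.2 0.4 0.6 0.8 1.0; do
  ./solve2.sh python solve.py -prefix=../data/train_prefix.txt -attempts=1 -model_path=openai/code-cushman-001 -gpu=0 -fixed_temp=${TEMP} -out=../data/codex_temp_${TEMP} -puzzles=../data/test_228.json -seed=2022 -batch_size=64
done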

Producing verified Codex_PAPER_1M_iter_0.txt from the puzzle/solution old style data generated by Codex

python preprocess.py -path=../data/codex/old_verified -f_name=Codex_PAPER_1M_iter_0.txt -max_sols_per_puzzle=8 -old_style_json=True -max_examples=1000000 -include_failures=False -seed=2022
cp ../data/codex/old_verified/Codex_PAPER_1M_iter_0.txt ../data/Codex_PAPER_1M_iter_0.txt

Producing unverified Codex_unverified_PAPER_1M_iter_0.txt from the puzzle/solution old style data generated by Codex

python preprocess.py -path=../data/codex/old_unverified -f_name=Codex_unverified_PAPER_1M_iter_0.txt -max_sols_per_puzzle=8 -old_style_json=True -max_examples=1000000 -include_failures=True -seed=2022
cp ../data/codex/old_unverified/Codex_unverified_PAPER_1M_iter_0.txt ../data/Codex_unverified_PAPER_1M_iter_0.txt

Producing 125M_PAPER_25K_iter_0.txt from the puzzle/solution new style data

python preprocess.py ../data/125M_PAPER/iter_0 125M_PAPER_25K_iter_0.txt 8 False 25000 False -seed=2022
cp ../data/125M_PAPER/iter_0/125M_PAPER_25K_iter_0.txt ../data/125M_PAPER_25K_iter_0.txt

Producing 125M_PAPER_1M_iter_1.txt from the puzzle/solution new style data

python preprocess.py ../data/125M_PAPER/iter_1 125M_PAPER_1M_iter_1.txt 8 False 1000000 False -seed=2022
cp ../data/125M_PAPER/iter_1/125M_PAPER_1M_iter_1.txt ../data/125M_PAPER_1M_iter_1.txt

Producing 125M_PAPER_1M_iter_2.txt from the puzzle/solution new style data

python preprocess.py ../data/125M_PAPER/iter_2 125M_PAPER_1M_iter_2.txt 8 False 1000000 False -seed=2022
cp ../data/125M_PAPER/iter_2/125M_PAPER_1M_iter_2.txt ../data/125M_PAPER_1M_iter_2.txt

Producing 13B_PAPER_25K_iter_0.txt from the puzzle/solution new style data

python preprocess.py ../data/13B_PAPER/iter_0 13B_PAPER_25K_iter_0.txt 8 False 25000 False -seed=2022
cp ../data/13B_PAPER/iter_0/13B_PAPER_25K_iter_0.txt ../data/13B_PAPER_25K_iter_0.txt

Producing 13B_PAPER_1M_iter_1.txt from the puzzle/solution new style data

python preprocess.py ../data/13B_PAPER/iter_1 13B_PAPER_1M_iter_1.txt 8 False 1000000 False -seed=2022
cp ../data/13B_PAPER/iter_1/13B_PAPER_1M_iter_1.txt ../data/13B_PAPER_1M_iter_1.txt

Producing 13B_PAPER_1M_iter_2.txt from the puzzle/solution new style data

python preprocess.py ../data/13B_PAPER/iter_2 13B_PAPER_1M_iter_2.txt 8 False 1000000 False -seed=2022
cp ../data/13B_PAPER/iter_2/13B_PAPER_1M_iter_2.txt ../data/13B_PAPER_1M_iter_2.txt

Producing 27B_PAPER_25K_iter_0.txt from the puzzle/solution new style data

python preprocess.py ../data/27B_PAPER/iter_0 27B_PAPER_25K_iter_0.txt 8 False 25000 False -seed=2022
cp ../data/27B_PAPER/iter_0/27B_PAPER_25K_iter_0.txt ../data/27B_PAPER_25K_iter_0.txt

Producing 27B_PAPER_1M_iter_1.txt from the puzzle/solution new style data

python preprocess.py ../data/27B_PAPER/iter_1 27B_PAPER_1M_iter_1.txt 8 False 1000000 False -seed=2022
cp ../data/27B_PAPER/iter_1/27B_PAPER_1M_iter_1.txt ../data/27B_PAPER_1M_iter_1.txt

Producing 27B_PAPER_1M_iter_2.txt from the puzzle/solution new style data

python preprocess.py ../data/27B_PAPER/iter_2 27B_PAPER_1M_iter_2.txt 8 False 1000000 False -seed=2022
cp ../data/27B_PAPER/iter_2/27B_PAPER_1M_iter_2.txt ../data/27B_PAPER_1M_iter_2.txt
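The nine GPT-Neo preprocessing commands above follow one pattern, so they can also be run as a single loop (a sketch that assumes only the directory and file naming shown above: iteration 0 keeps 25K examples, iterations 1 and 2 keep 1M):

for MODEL in 125M 13B 27B; do
  for ITER in 0 1 2; do
    if [ "$ITER" -eq 0 ]; then SIZE=25K; MAX=25000; else SIZE=1M; MAX=1000000; fi
    F=${MODEL}_PAPER_${SIZE}_iter_${ITER}.txt
    python preprocess.py ../data/${MODEL}_PAPER/iter_${ITER} ${F} 8 False ${MAX} False -seed=2022
    cp ../data/${MODEL}_PAPER/iter_${ITER}/${F} ../data/${F}
  done
done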

Data files produced by babysit.sh when generating data from gpt-neo-* and Codex:

At the time the experiments were run, Codex wasn't finetunable, so only iteration 0 data is available for Codex.

Codex_PAPER_1M_iter_0.txt
125M_PAPER_25K_iter_0.txt
13B_PAPER_25K_iter_0.txt
27B_PAPER_25K_iter_0.txt
125M_PAPER_1M_iter_1.txt
13B_PAPER_1M_iter_1.txt
27B_PAPER_1M_iter_1.txt
125M_PAPER_1M_iter_2.txt
13B_PAPER_1M_iter_2.txt
27B_PAPER_1M_iter_2.txt

Figure 5 - 3 diagrams - showing the 3 GPT-Neo models finetuned on verified Codex data vs. unverified Codex data vs. the baseline

5a GPT-NEO 125M

./fine_tune1.sh 0 125M ft1_Codex_PAPER_1M_iter_0 Codex_PAPER_1M_iter_0.txt
./fine_tune1.sh 0 125M ft1_Codex_unverified_PAPER_1M_iter_0 Codex_unverified_PAPER_1M_iter_0.txt
./solve1.sh 0 125M 10 228

5b GPT-NEO 1.3B (referred to as 13B in file names and script arguments)

./fine_tune1.sh 0 13B ft1_Codex_PAPER_1M_iter_0 Codex_PAPER_1M_iter_0.txt
./fine_tune1.sh 0 13B ft1_Codex_unverified_PAPER_1M_iter_0 Codex_unverified_PAPER_1M_iter_0.txt
./solve1.sh 0 13B 10 228 5

5c GPT-NEO 2.7B (referred to as 27B in file names and script arguments)

./fine_tune1.sh 0 27B ft1_Codex_PAPER_1M_iter_0 Codex_PAPER_1M_iter_0.txt
./fine_tune1.sh 0 27B ft1_Codex_unverified_PAPER_1M_iter_0 Codex_unverified_PAPER_1M_iter_0.txt
./solve1.sh 0 27B 10 228 5
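Since the three panels differ only in the model size, the Figure 5 runs can also be written as a loop (a sketch; the trailing 5 passed to solve1.sh follows the 5b/5c invocations above):

for MODEL in 125M 13B 27B; do
  ./fine_tune1.sh 0 ${MODEL} ft1_Codex_PAPER_1M_iter_0 Codex_PAPER_1M_iter_0.txt
  ./fine_tune1.sh 0 ${MODEL} ft1_Codex_unverified_PAPER_1M_iter_0 Codex_unverified_PAPER_1M_iter_0.txt
  ./solve1.sh 0 ${MODEL} 10 228 5
done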

Figure 6 - 3 diagrams - showing test_228 Pass@k for the 3 GPT-Neo models finetuned on data from 4 generators (Codex and the 3 GPT-Neo models) and the baseline

6a - GPT-NEO 125M trained on 4 different datasets and baseline

./fine_tune1.sh 0 125M ft1_Codex_PAPER_1M_iter_0 Codex_PAPER_1M_iter_0.txt (dupe of 5a)

./fine_tune1.sh 0 125M ft1_125M_PAPER_1M_iter_2 125M_PAPER_1M_iter_2.txt
./fine_tune1.sh 0 125M ft1_13B_PAPER_1M_iter_2 13B_PAPER_1M_iter_2.txt
./fine_tune1.sh 0 125M ft1_27B_PAPER_1M_iter_2 27B_PAPER_1M_iter_2.txt

6b - GPT-NEO 1.3B trained on 4 different datasets and baseline

./fine_tune1.sh 0 13B ft1_Codex_PAPER_1M_iter_0 Codex_PAPER_1M_iter_0.txt (dupe of 5b)

./fine_tune1.sh 0 13B ft1_125M_PAPER_1M_iter_2 125M_PAPER_1M_iter_2.txt
./fine_tune1.sh 0 13B ft1_13B_PAPER_1M_iter_2 13B_PAPER_1M_iter_2.txt
./fine_tune1.sh 0 13B ft1_27B_PAPER_1M_iter_2 27B_PAPER_1M_iter_2.txt

6c - GPT-NEO 2.7B trained on 4 different datasets and baseline

./fine_tune1.sh 0 27B ft1_Codex_PAPER_1M_iter_0 Codex_PAPER_1M_iter_0.txt (dupe of 5c)

./fine_tune1.sh 0 27B ft1_125M_PAPER_1M_iter_2 125M_PAPER_1M_iter_2.txt
./fine_tune1.sh 0 27B ft1_13B_PAPER_1M_iter_2 13B_PAPER_1M_iter_2.txt
./fine_tune1.sh 0 27B ft1_27B_PAPER_1M_iter_2 27B_PAPER_1M_iter_2.txt
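The Figure 6 finetuning runs are the cross product of the three models and the three GPT-Neo-generated iter_2 datasets (the Codex runs are shared with Figure 5), so a sketch:

for MODEL in 125M 13B 27B; do
  for GEN in 125M 13B 27B; do
    ./fine_tune1.sh 0 ${MODEL} ft1_${GEN}_PAPER_1M_iter_2 ${GEN}_PAPER_1M_iter_2.txt
  done
done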

Launch on torch2020 - edit solve.yaml to set the correct model and epoch parameters

./tst_human_eval_base.sh 0 125M 1024
./tst_human_eval_ft1.sh 0 125M 1024
./tst_human_eval_ft5.sh 0 125M 1024
./tst_human_eval_ft10.sh 0 125M 1024
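The four launches differ only in the checkpoint suffix, so they can be scripted as (a sketch using the same arguments as above):

for CKPT in base ft1 ft5 ft10; do
  ./tst_human_eval_${CKPT}.sh 0 125M 1024
done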