This is the code and data for the paper "Language Models Can Teach Themselves to Program Better": https://arxiv.org/abs/2207.14502
LICENSE
MIT License - as already specified in the ../LICENSE file of the PythonProgrammingPuzzles repo.
GPU USAGE
GPU usage was large, especially for the 2.7B model, which is roughly 20x the size of the 125M model.
Data generation takes the most GPU time: about 2500 GPU hours for 2.7B (on V100).
Finetuning 2.7B on the 1M generated samples took about 40 GPU hours (on V100) per epoch - 10 epochs = 400 GPU hours.
Solving the 228-problem test set with 100 attempts using the finetuned 2.7B model took about 4 hours (on V100).
We mostly used V100s, but we used whatever was available, so sometimes T4s and A100s if they were free.
We tried everything at 125M first - debug there and make it work well - then roll out the 1.3B and 2.7B jobs.
DATASETS
The data directory contains the datasets used. We feel the most interesting dataset is data/Codex_PAPER_1M_iter_0.txt, which was generated by Codex and gave the best results when finetuned on. All the datasets are part of our public release.
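To sanity-check a dataset before finetuning on it, a quick look at its size and first records is enough (the record format of the generated puzzles/solutions isn't described here, so this only previews the raw text):

wc -l data/Codex_PAPER_1M_iter_0.txt       # number of lines in the dataset
head -n 20 data/Codex_PAPER_1M_iter_0.txt  # preview the first records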
SETUP
src/requirements.txt is what we install on our cluster machines - the cluster comes with NVIDIA drivers and a matching PyTorch.
./requirements.txt is what I personally have installed on my local machine and tested that this runs - but it has lots of stuff you don't need.
So try src/requirements.txt first - and if that doesn't work, ./requirements.txt has the versions of everything installed on my machine.
Getting a DeepSpeed 0.6.1 install matching a PyTorch matching an NVIDIA driver was tricky for me on some machines; torch 1.10 and 1.11 both work.
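A minimal install sketch, assuming the machine already has matching NVIDIA drivers and PyTorch (as our cluster does); it only uses the two requirements files named above:

pip install -r src/requirements.txt   # the lean cluster requirements - try this first
pip install -r requirements.txt       # fallback: the full snapshot of my local machine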
GENERATING/FINETUNING
Run "cd src; ./babysit.sh GPU_INDEX_TO_USE" - GPU_INDEX_TO_USE=0 typically.
src/babysit.sh is the script that generates data and finetunes on that data in a loop, finetuning the GPT-Neo 125M/1.3B/2.7B models.
In src/babysit.sh, TEST_LOCAL=1 runs locally on the machine's GPUs, which is great for fast testing; TEST_LOCAL=0 launches on the cluster, which is slow but has lots of GPUs.
Realistically you have to train on a cluster - data generation takes a long time, so having lots of machines all generating data is the feasible approach. But given enough time this will run locally on 1 GPU: about 1 year for 2.7B, or 2 weeks for 125M.
We found that generating 75K samples after deduping worked for iteration 0 - finetune on that data. Then, using that finetuned model in iteration 1, data generation happens more quickly - the finetuned model solves many more problems. Repeating that process works well.
On 125M we looked at training only on 125M-generated data from iter_0 vs iter_1 vs iter_2, generating 600K samples for each iteration. It seemed finetuning on iter_2 data was best on the test set: 26.9/228 solved, vs 26.1/228 for iter_1 and 22.2/228 for iter_0. With 1M samples of 125M-generated data sampled across iterations 0, 1, and 2 we got 26.75/228.
We understand why it is faster to generate iter_2 data with a finetuned model - it solves more problems. But why are the generated puzzles & solutions better for training the model? We will explore that more in the future, and try iterating well past 3 iterations, although our preliminary experiments on 125M show it tops out at 3 iterations.
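For orientation, here is a hedged sketch of what one generate/finetune iteration looks like if driven by hand instead of by babysit.sh. The flags and script arguments are modeled on the example commands later in this README; PATH_TO_CURRENT_MODEL, PATH_TO_LATEST_FINETUNED_CHECKPOINT, and the my_run names are placeholders, so treat this as pseudocode rather than a drop-in replacement for babysit.sh:

#!/bin/bash
# Illustrative sketch only - babysit.sh automates this loop.
GPU=0
MODEL=125M                        # 125M, 13B (=1.3B), or 27B (=2.7B)
MODEL_PATH=PATH_TO_CURRENT_MODEL  # placeholder: base model for iter 0, finetuned checkpoint afterwards
for ITER in 0 1 2; do
  # 1) generate puzzles and solutions with the current model
  python gen.py -n=32 -max_tokens=4096 -model_path=${MODEL_PATH} -out=../data/my_run/iter_${ITER} -seed=2022
  # 2) verify, dedupe, and pack the generated data into a finetuning file
  python preprocess.py ../data/my_run/iter_${ITER} my_run_iter_${ITER}.txt 8 False 1000000 False -seed=2022
  cp ../data/my_run/iter_${ITER}/my_run_iter_${ITER}.txt ../data/my_run_iter_${ITER}.txt
  # 3) finetune on that file, then score the 228-problem test set
  ./fine_tune1.sh ${GPU} ${MODEL} ft1_my_run_iter_${ITER} my_run_iter_${ITER}.txt
  ./solve1.sh ${GPU} ${MODEL} 10 228
  MODEL_PATH=PATH_TO_LATEST_FINETUNED_CHECKPOINT  # placeholder: feed the new checkpoint into the next iteration
done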
FINETUNING ONLY
Run "cd src; ./fine_tune1.sh GPU_INDEX_TO_USE" - GPU_INDEX_TO_USE=0 typically.
This allows repeated finetuning on a specific dataset. Use it to do a temperature grid search, or to try different parameter variations on a specific dataset.
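For example, a hedged sketch of a solve-temperature grid search on an already-finetuned model; the flags mirror the solve.py example in the reproduction instructions below, and PATH_TO_FINETUNED_MODEL plus the -out directories are placeholders:

for TEMP in 0.4 0.6 0.8 1.0; do
  # score the 228-problem test set at each sampling temperature
  python solve.py -prefix=../data/train_prefix.txt -attempts=1 -model_path=PATH_TO_FINETUNED_MODEL -gpu=0 -fixed_temp=${TEMP} -out=../data/temp_sweep_${TEMP} -puzzles=../data/test_228.json -seed=2022 -batch_size=64
done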
Detailed instructions for reproducing experiments:
Generating iteration-0 puzzle/solution data with Codex:
python gen.py -n=32 -max_tokens=4096 -model_path=openai/code-davinci-002 -model_path_solve=openai/code-cushman-001 -out=../data/codex/iter_0 -seed=2022
Solving the 228-problem test set with Codex (code-cushman-001):
./solve2.sh python solve.py -prefix=../data/train_prefix.txt -attempts=1 -model_path=openai/code-cushman-001 -gpu=0 -fixed_temp=0.8 -out=../data/codex -puzzles=../data/test_228.json -seed=2022 -batch_size=64
Producing verified Codex_PAPER_1M_iter_0.txt from the old-style puzzle/solution data generated by Codex:
python preprocess.py -path=../data/codex/old_verified -f_name=Codex_PAPER_1M_iter_0.txt -max_sols_per_puzzle=8 -old_style_json=True -max_examples=1000000 -include_failures=False -seed=2022
cp ../data/codex/old_verified/Codex_PAPER_1M_iter_0.txt ../data/Codex_PAPER_1M_iter_0.txt
Producing unverified Codex_unverified_PAPER_1M_iter_0.txt from the old-style puzzle/solution data generated by Codex:
python preprocess.py -path=../data/codex/old_unverified -f_name=Codex_unverified_PAPER_1M_iter_0.txt -max_sols_per_puzzle=8 -old_style_json=True -max_examples=1000000 -include_failures=True -seed=2022
cp ../data/codex/old_unverified/Codex_unverified_PAPER_1M_iter_0.txt ../data/Codex_unverified_PAPER_1M_iter_0.txt
Producing the GPT-Neo finetuning datasets (the same preprocess.py arguments, passed positionally: path, output file, max_sols_per_puzzle, old_style_json, max_examples, include_failures):
python preprocess.py ../data/125M_PAPER/iter_0 125M_PAPER_25K_iter_0.txt 8 False 25000 False -seed=2022
cp ../data/125M_PAPER/iter_0/125M_PAPER_25K_iter_0.txt ../data/125M_PAPER_25K_iter_0.txt
python preprocess.py ../data/125M_PAPER/iter_1 125M_PAPER_1M_iter_1.txt 8 False 1000000 False -seed=2022
cp ../data/125M_PAPER/iter_1/125M_PAPER_1M_iter_1.txt ../data/125M_PAPER_1M_iter_1.txt
python preprocess.py ../data/125M_PAPER/iter_2 125M_PAPER_1M_iter_2.txt 8 False 1000000 False -seed=2022
cp ../data/125M_PAPER/iter_2/125M_PAPER_1M_iter_2.txt ../data/125M_PAPER_1M_iter_2.txt
python preprocess.py ../data/13B_PAPER/iter_0 13B_PAPER_25K_iter_0.txt 8 False 25000 False -seed=2022
cp ../data/13B_PAPER/iter_0/13B_PAPER_25K_iter_0.txt ../data/13B_PAPER_25K_iter_0.txt
python preprocess.py ../data/13B_PAPER/iter_1 13B_PAPER_1M_iter_1.txt 8 False 1000000 False -seed=2022
cp ../data/13B_PAPER/iter_1/13B_PAPER_1M_iter_1.txt ../data/13B_PAPER_1M_iter_1.txt
python preprocess.py ../data/13B_PAPER/iter_2 13B_PAPER_1M_iter_2.txt 8 False 1000000 False -seed=2022
cp ../data/13B_PAPER/iter_2/13B_PAPER_1M_iter_2.txt ../data/13B_PAPER_1M_iter_2.txt
python preprocess.py ../data/27B_PAPER/iter_0 27B_PAPER_25K_iter_0.txt 8 False 25000 False -seed=2022
cp ../data/27B_PAPER/iter_0/27B_PAPER_25K_iter_0.txt ../data/27B_PAPER_25K_iter_0.txt
python preprocess.py ../data/27B_PAPER/iter_1 27B_PAPER_1M_iter_1.txt 8 False 1000000 False -seed=2022
cp ../data/27B_PAPER/iter_1/27B_PAPER_1M_iter_1.txt ../data/27B_PAPER_1M_iter_1.txt
python preprocess.py ../data/27B_PAPER/iter_2 27B_PAPER_1M_iter_2.txt 8 False 1000000 False -seed=2022
cp ../data/27B_PAPER/iter_2/27B_PAPER_1M_iter_2.txt ../data/27B_PAPER_1M_iter_2.txt
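The nine commands above follow a single pattern, so for convenience here is the same work expressed as a loop (it only restates the commands above; 13B/27B are this repo's directory names for the 1.3B/2.7B models):

for MODEL in 125M 13B 27B; do
  for ITER in 0 1 2; do
    # iter_0 keeps 25K examples, iter_1 and iter_2 keep 1M, matching the file names above
    if [ "${ITER}" -eq 0 ]; then N=25000; TAG=25K; else N=1000000; TAG=1M; fi
    FNAME=${MODEL}_PAPER_${TAG}_iter_${ITER}.txt
    python preprocess.py ../data/${MODEL}_PAPER/iter_${ITER} ${FNAME} 8 False ${N} False -seed=2022
    cp ../data/${MODEL}_PAPER/iter_${ITER}/${FNAME} ../data/${FNAME}
  done
done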
At the time the experiments were run, Codex wasn't finetunable, so only iteration-0 data was available from Codex.
The dataset files used in the experiments:
Codex_PAPER_1M_iter_0.txt
125M_PAPER_25K_iter_0.txt
13B_PAPER_25K_iter_0.txt
27B_PAPER_25K_iter_0.txt
125M_PAPER_1M_iter_1.txt
13B_PAPER_1M_iter_1.txt
27B_PAPER_1M_iter_1.txt
125M_PAPER_1M_iter_2.txt
13B_PAPER_1M_iter_2.txt
27B_PAPER_1M_iter_2.txt
Figure 5 - 3 diagrams - showing the 3 GPT-Neo models trained on verified Codex data vs unverified Codex data vs the baseline
./fine_tune1.sh 0 125M ft1_Codex_PAPER_1M_iter_0 Codex_PAPER_1M_iter_0.txt
./fine_tune1.sh 0 125M ft1_Codex_unverified_PAPER_1M_iter_0 Codex_unverified_PAPER_1M_iter_0.txt
./solve1.sh 0 125M 10 228
./fine_tune1.sh 0 13B ft1_Codex_PAPER_1M_iter_0 Codex_PAPER_1M_iter_0.txt
./fine_tune1.sh 0 13B ft1_Codex_unverified_PAPER_1M_iter_0 Codex_unverified_PAPER_1M_iter_0.txt
./solve1.sh 0 13B 10 228 5
./fine_tune1.sh 0 27B ft1_Codex_PAPER_1M_iter_0 Codex_PAPER_1M_iter_0.txt
./fine_tune1.sh 0 27B ft1_Codex_unverified_PAPER_1M_iter_0 Codex_unverified_PAPER_1M_iter_0.txt
./solve1.sh 0 27B 10 228 5
Figure 6 - 3 diagrams - showing test228 Pass@k for the 3 GPT-Neo models trained on data from 4 generators (Codex and the 3 GPT-Neo models) and the baseline
./fine_tune1.sh 0 125M ft1_125M_PAPER_1M_iter_2 125M_PAPER_1M_iter_2.txt
./fine_tune1.sh 0 125M ft1_13B_PAPER_1M_iter_2 13B_PAPER_1M_iter_2.txt
./fine_tune1.sh 0 125M ft1_27B_PAPER_1M_iter_2 27B_PAPER_1M_iter_2.txt
./fine_tune1.sh 0 13B ft1_125M_PAPER_1M_iter_2 125M_PAPER_1M_iter_2.txt
./fine_tune1.sh 0 13B ft1_13B_PAPER_1M_iter_2 13B_PAPER_1M_iter_2.txt
./fine_tune1.sh 0 13B ft1_27B_PAPER_1M_iter_2 27B_PAPER_1M_iter_2.txt
./fine_tune1.sh 0 27B ft1_125M_PAPER_1M_iter_2 125M_PAPER_1M_iter_2.txt
./fine_tune1.sh 0 27B ft1_13B_PAPER_1M_iter_2 13B_PAPER_1M_iter_2.txt
./fine_tune1.sh 0 27B ft1_27B_PAPER_1M_iter_2 27B_PAPER_1M_iter_2.txt
./tst_human_eval_base.sh 0 125M 1024
./tst_human_eval_ft1.sh 0 125M 1024
./tst_human_eval_ft5.sh 0 125M 1024
./tst_human_eval_ft10.sh 0 125M 1024