diff --git a/papers/architectural-effectivity/tokenization/papers.csv b/papers/architectural-effectivity/tokenization/papers.csv
index fd98dea..51c82cf 100644
--- a/papers/architectural-effectivity/tokenization/papers.csv
+++ b/papers/architectural-effectivity/tokenization/papers.csv
@@ -14,4 +14,5 @@ Reconsidering Token Embeddings with the Definitions for Pre-trained Language Mod
 Monotonic Representation of Numeric Properties in Language Models,2024-08-15,http://arxiv.org/abs/2408.10381,Benjamin Heinzerling; Kentaro Inui
 Where is the signal in tokenization space?,2024-08-16,http://arxiv.org/abs/2408.08541,Renato Lui Geh; Honghua Zhang; Kareem Ahmed; Benjie Wang; Guy Van den Broeck
 Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?,2024-07-23,http://arxiv.org/abs/2407.16607,Jonathan Hayase; Alisa Liu; Yejin Choi; Sewoong Oh; Noah A. Smith
-An Empirical Comparison of Vocabulary Expansion and Initialization Approaches for Language Models,2024-07-08,http://arxiv.org/abs/2407.05841,Nandini Mundra; Aditya Nanda Kishore; Raj Dabre; Ratish Puduppully; Anoop Kunchukuttan; Mitesh M. Khapra
\ No newline at end of file
+An Empirical Comparison of Vocabulary Expansion and Initialization Approaches for Language Models,2024-07-08,http://arxiv.org/abs/2407.05841,Nandini Mundra; Aditya Nanda Kishore; Raj Dabre; Ratish Puduppully; Anoop Kunchukuttan; Mitesh M. Khapra
+Norm of Mean Contextualized Embeddings Determines their Variance,2024-09-17,http://arxiv.org/abs/2409.11253,Hiroaki Yamagiwa; Hidetoshi Shimodaira
\ No newline at end of file
diff --git a/papers/mechanistic-engineering/papers.csv b/papers/mechanistic-engineering/papers.csv
index 5f59526..9aac147 100644
--- a/papers/mechanistic-engineering/papers.csv
+++ b/papers/mechanistic-engineering/papers.csv
@@ -41,4 +41,5 @@ LLM Circuit Analyses Are Consistent Across Training and Scale,2024-07-15,http://
 Modularity in Transformers: Investigating Neuron Separability & Specialization,2024-08-30,http://arxiv.org/abs/2408.17324,Nicholas Pochinkov; Thomas Jones; Mohammed Rashidur Rahman
 Extracting Paragraphs from LLM Token Activations,2024-09-10,http://arxiv.org/abs/2409.06328,Nicholas Pochinkov; Angelo Benoit; Lovkush Agarwal; Zainab Ali Majid; Lucile Ter-Minassian
 Explaining Datasets in Words: Statistical Models with Natural Language Parameters,2024-09-13,http://arxiv.org/abs/2409.08466,Ruiqi Zhong; Heng Wang; Dan Klein; Jacob Steinhardt
-Optimal ablation for interpretability,2024-09-16,http://arxiv.org/abs/2409.09951,Maximilian Li; Lucas Janson
\ No newline at end of file
+Optimal ablation for interpretability,2024-09-16,http://arxiv.org/abs/2409.09951,Maximilian Li; Lucas Janson
+Self-Attention Limits Working Memory Capacity of Transformer-Based Models,2024-09-16,http://arxiv.org/abs/2409.10715,Dongyu Gong; Hantao Zhang
\ No newline at end of file
diff --git a/papers/miscellanea/papers.csv b/papers/miscellanea/papers.csv
index 5578263..92ecab2 100644
--- a/papers/miscellanea/papers.csv
+++ b/papers/miscellanea/papers.csv
@@ -73,4 +73,5 @@ Viewing Transformers Through the Lens of Long Convolutions Layers,2024-05-02,htt
 Modeling Language Tokens as Functionals of Semantic Fields,2024-05-02,http://openreview.net/pdf?id=EEO4Iktfjp,Zhengqi Pei; Anran Zhang; Shuhui Wang; Qingming Huang
 Implicit Geometry of Next-token Prediction: From Language Sparsity Patterns to Model Representations,2024-07-10,http://openreview.net/pdf?id=qyilOnIRHI,Yize Zhao; Tina Behnia; Vala Vakilian; Christos Thrampoulidis
 Beyond Parameter Count: Implicit Bias in Soft Mixture of Experts,2024-09-02,http://arxiv.org/abs/2409.00879,Youngseog Chung; Dhruv Malik; Jeff Schneider; Yuanzhi Li; Aarti Singh
-Reframing Data Value for Large Language Models Through the Lens of Plausability,2024-08-30,http://arxiv.org/abs/2409.00284,Mohamad Rida Rammal; Ruida Zhou; Suhas Diggavi
\ No newline at end of file
+Reframing Data Value for Large Language Models Through the Lens of Plausability,2024-08-30,http://arxiv.org/abs/2409.00284,Mohamad Rida Rammal; Ruida Zhou; Suhas Diggavi
+A Controlled Study on Long Context Extension and Generalization in LLMs,2024-09-18,http://arxiv.org/abs/2409.12181,Yi Lu; Jing Nathan Yan; Songlin Yang; Justin T. Chiu; Siyu Ren; Fei Yuan; Wenting Zhao; Zhiyong Wu; Alexander M. Rush
\ No newline at end of file
diff --git a/papers/phenomena-of-interest/in-context-learning/papers.csv b/papers/phenomena-of-interest/in-context-learning/papers.csv
index f909da2..cbd7a66 100644
--- a/papers/phenomena-of-interest/in-context-learning/papers.csv
+++ b/papers/phenomena-of-interest/in-context-learning/papers.csv
@@ -73,4 +73,5 @@ In-Context Learning with Representations: Contextual Generalization of Trained T
 Unveiling In-Context Learning: A Coordinate System to Understand Its Working Mechanism,2024-07-24,http://arxiv.org/abs/2407.17011,Anhao Zhao; Fanghua Ye; Jinlan Fu; Xiaoyu Shen
 Polynomial Regression as a Task for Understanding In-context Learning Through Finetuning and Alignment,2024-07-27,http://arxiv.org/abs/2407.19346,Max Wilcoxson; Morten Svendgård; Ria Doshi; Dylan Davis; Reya Vir; Anant Sahai
 One-Layer Transformer Provably Learns One-Nearest Neighbor In Context,2024-07-24,https://klusowski.princeton.edu/sites/g/files/toruqf5901/files/documents/li2024one.pdf,Zihao Li; Yuan Cao; Cheng Gao; Yihan He; Han Liu; Jason M. Klusowski; Jianqing Fan; Mengdi Wang
-Learning vs Retrieval: The Role of In-Context Examples in Regression with LLMs,2024-09-06,http://arxiv.org/abs/2409.04318,Aliakbar Nafar; Kristen Brent Venable; Parisa Kordjamshidi
\ No newline at end of file
+Learning vs Retrieval: The Role of In-Context Examples in Regression with LLMs,2024-09-06,http://arxiv.org/abs/2409.04318,Aliakbar Nafar; Kristen Brent Venable; Parisa Kordjamshidi
+Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers,2024-09-10,http://arxiv.org/abs/2409.10559,Siyu Chen; Heejune Sheen; Tianhao Wang; Zhuoran Yang
\ No newline at end of file
diff --git a/papers/phenomena-of-interest/knowledge/papers.csv b/papers/phenomena-of-interest/knowledge/papers.csv
index 13c552f..3cae712 100644
--- a/papers/phenomena-of-interest/knowledge/papers.csv
+++ b/papers/phenomena-of-interest/knowledge/papers.csv
@@ -24,4 +24,6 @@ Memorisation In In-Context Learning,2024-08-21,http://arxiv.org/abs/2408.11546,S
 "Understanding Memorisation in LLMs: Dynamics, Influencing Factors, and Implications",2024-07-27,http://arxiv.org/abs/2407.19262,Till Speicher; Mohammad Aflah Khan; Qinyuan Wu; Vedant Nanda; Soumi Das; Bishwamittra Ghosh; Krishna P. Gummadi; Evimaria Terzi
 Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data,2024-07-20,http://arxiv.org/abs/2407.14985,Antonis Antoniades; Xinyi Wang; Yanai Elazar; Alfonso Amayuelas; Alon Albalak; Kexun Zhang; William Yang Wang
 Induction Heads as an Essential Mechanism for Pattern Matching in In-context Learning,2024-07-09,http://arxiv.org/abs/2407.07011,J. Crosbie; E. Shutova
-"Schrodingers Memory: Large Language Models",2024-09-16,https://arxiv.org/pdf/2409.10482,Wei Wang; Qing Li
\ No newline at end of file
+"Schrodingers Memory: Large Language Models",2024-09-16,https://arxiv.org/pdf/2409.10482,Wei Wang; Qing Li
+Self-Attention Limits Working Memory Capacity of Transformer-Based Models,2024-09-16,http://arxiv.org/abs/2409.10715,Dongyu Gong; Hantao Zhang
+"Physics of Language Models: Part 3.1, Knowledge Storage and Extraction",2024-07-16,http://arxiv.org/abs/2309.14316,Zeyuan Allen-Zhu; Yuanzhi Li
\ No newline at end of file
diff --git a/papers/phenomena-of-interest/learning/papers.csv b/papers/phenomena-of-interest/learning/papers.csv
index 197abed..9766a7e 100644
--- a/papers/phenomena-of-interest/learning/papers.csv
+++ b/papers/phenomena-of-interest/learning/papers.csv
@@ -38,4 +38,5 @@ Reasoning in Large Language Models: A Geometric Perspective,2024-07-02,http://ar
 Unforgettable Generalization in Language Models,2024-09-03,http://arxiv.org/abs/2409.02228,Eric Zhang; Leshem Chosen; Jacob Andreas
 The Many Faces of Optimal Weak-to-Strong Learning,2024-08-30,http://arxiv.org/abs/2408.17148,Mikael Møller Høgsgaard; Kasper Green Larsen; Markus Engelund Mathiasen
 On the Empirical Complexity of Reasoning and Planning in LLMs,2024-04-17,http://arxiv.org/abs/2404.11041,Liwei Kang; Zirui Zhao; David Hsu; Wee Sun Lee
-Understanding Simplicity Bias towards Compositional Mappings via Learning Dynamics,2024-09-15,http://arxiv.org/abs/2409.09626,Yi Ren; Danica J. Sutherland
\ No newline at end of file
+Understanding Simplicity Bias towards Compositional Mappings via Learning Dynamics,2024-09-15,http://arxiv.org/abs/2409.09626,Yi Ren; Danica J. Sutherland
+"Physics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems",2024-08-29,http://arxiv.org/abs/2408.16293,Tian Ye; Zicheng Xu; Yuanzhi Li; Zeyuan Allen-Zhu
\ No newline at end of file
diff --git a/papers/representational-capacity/what-can-transformer-do/papers.csv b/papers/representational-capacity/what-can-transformer-do/papers.csv
index dc6f4c4..e7be5ba 100644
--- a/papers/representational-capacity/what-can-transformer-do/papers.csv
+++ b/papers/representational-capacity/what-can-transformer-do/papers.csv
@@ -52,4 +52,5 @@ Learning Randomized Algorithms with Transformers,2024-08-20,http://arxiv.org/abs
 Attention is a smoothed cubic spline,2024-08-19,http://arxiv.org/abs/2408.09624,Zehua Lai; Lek-Heng Lim; Yucong Liu
 Transformers As Approximations of Solomonoff Induction,2024-08-22,http://arxiv.org/abs/2408.12065,Nathan Young; Michael Witbrock
 Implicit Geometry of Next-token Prediction: From Language Sparsity Patterns to Model Representations,2024-08-27,http://arxiv.org/abs/2408.15417,Yize Zhao; Tina Behnia; Vala Vakilian; Christos Thrampoulidis
-A Law of Next-Token Prediction in Large Language Models,2024-08-24,http://arxiv.org/abs/2408.13442,Hangfeng He; Weijie J. Su
\ No newline at end of file
+A Law of Next-Token Prediction in Large Language Models,2024-08-24,http://arxiv.org/abs/2408.13442,Hangfeng He; Weijie J. Su
+"Physics of Language Models: Part 1, Learning Hierarchical Language Structures",2024-06-02,http://arxiv.org/abs/2305.13673,Zeyuan Allen-Zhu; Yuanzhi Li
\ No newline at end of file
diff --git a/papers/representational-capacity/what-can-transformer-not-do/papers.csv b/papers/representational-capacity/what-can-transformer-not-do/papers.csv
index 26c024f..5016a97 100644
--- a/papers/representational-capacity/what-can-transformer-not-do/papers.csv
+++ b/papers/representational-capacity/what-can-transformer-not-do/papers.csv
@@ -20,4 +20,5 @@ Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Ho
 When can transformers compositionally generalize in-context?,2024-07-17,http://arxiv.org/abs/2407.12275,Seijin Kobayashi; Simon Schug; Yassir Akram; Florian Redhardt; Johannes von Oswald; Razvan Pascanu; Guillaume Lajoie; João Sacramento
 When Can Transformers Count to n?,2024-07-21,http://arxiv.org/abs/2407.15160,Gilad Yehudai; Haim Kaplan; Asma Ghandeharioun; Mor Geva; Amir Globerson
 Your Context Is Not an Array: Unveiling Random Access Limitations in Transformers,2024-08-10,http://arxiv.org/abs/2408.05506,MohammadReza Ebrahimi; Sunny Panchal; Roland Memisevic
-One-layer transformers fail to solve the induction heads task,2024-08-26,http://arxiv.org/abs/2408.14332,Clayton Sanford; Daniel Hsu; Matus Telgarsky
\ No newline at end of file
+One-layer transformers fail to solve the induction heads task,2024-08-26,http://arxiv.org/abs/2408.14332,Clayton Sanford; Daniel Hsu; Matus Telgarsky
+Self-Attention Limits Working Memory Capacity of Transformer-Based Models,2024-09-16,http://arxiv.org/abs/2409.10715,Dongyu Gong; Hantao Zhang
\ No newline at end of file