
Commit

add Nov
Furyton committed Nov 11, 2024
1 parent ee56f83 commit 10ca6ac
Showing 10 changed files with 27 additions and 10 deletions.
3 changes: 2 additions & 1 deletion papers/architectural-effectivity/linear-attention/papers.csv
@@ -6,4 +6,5 @@ Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models,2024
Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations,2024-08-20,http://arxiv.org/abs/2408.10920,Róbert Csordás; Christopher Potts; Christopher D. Manning; Atticus Geiger
"Theory, Analysis, and Best Practices for Sigmoid Self-Attention",2024-09-06,http://arxiv.org/abs/2409.04431,Jason Ramapuram; Federico Danieli; Eeshan Dhekane; Floris Weers; Dan Busbridge; Pierre Ablin; Tatiana Likhomanenko; Jagrit Digani; Zijin Gu; Amitis Shidani; Russ Webb
"Autoregressive + Chain of Thought (CoT) ≃ Recurrent: Recurrence's Role in Language Models and a Revist of Recurrent Transformer",2024-09-14,http://arxiv.org/abs/2409.09239,Xiang Zhang; Muhammad Abdul-Mageed; Laks V.S. Lakshmanan
Fundamental Limitations on Subquadratic Alternatives to Transformers,2024-10-05,http://arxiv.org/abs/2410.04271,Josh Alman; Hantao Yu
kNN Attention Demystified: A Theoretical Exploration for Scalable Transformers,2024-11-06,http://arxiv.org/abs/2411.04013,Themistoklis Haris
4 changes: 3 additions & 1 deletion papers/mechanistic-engineering/papers.csv
@@ -45,4 +45,6 @@ Optimal ablation for interpretability,2024-09-16,http://arxiv.org/abs/2409.09951
Self-Attention Limits Working Memory Capacity of Transformer-Based Models,2024-09-16,http://arxiv.org/abs/2409.10715,Dongyu Gong; Hantao Zhang
Extracting Finite State Machines from Transformers,2024-10-08,http://arxiv.org/abs/2410.06045,Rik Adriaensen; Jaron Maene
Interpreting Affine Recurrence Learning in GPT-style Transformers,2024-10-22,http://arxiv.org/abs/2410.17438,Samarth Bhargav; Alexander Gu
Mechanisms of Symbol Processing for In-Context Learning in Transformer Networks,2024-10-23,http://arxiv.org/abs/2410.17498,Paul Smolensky; Roland Fernandez; Zhenghao Herbert Zhou; Mattia Opper; Jianfeng Gao
How Transformers Solve Propositional Logic Problems: A Mechanistic Analysis,2024-11-06,http://arxiv.org/abs/2411.04105,Guan Zhe Hong; Nishanth Dikkala; Enming Luo; Cyrus Rashtchian; Xin Wang; Rina Panigrahy
Towards Interpreting Language Models: A Case Study in Multi-Hop Reasoning,2024-11-06,http://arxiv.org/abs/2411.05037,Mansi Sakarvadia
4 changes: 3 additions & 1 deletion papers/miscellanea/papers.csv
@@ -83,4 +83,6 @@ softmax is not enough (for sharp out-of-distribution),2024-10-01,http://arxiv.or
Mind the Gap: a Spectral Analysis of Rank Collapse and Signal Propagation in Transformers,2024-10-10,http://arxiv.org/abs/2410.07799,Alireza Naderi; Thiziri Nait Saada; Jared Tanner
Decoding Game: On Minimax Optimality of Heuristic Text Generation Strategies,2024-10-04,http://arxiv.org/abs/2410.03968,Sijin Chen; Omar Hagrass; Jason M. Klusowski
Inevitable Trade-off between Watermark Strength and Speculative Sampling Efficiency for Language Models,2024-10-27,http://arxiv.org/abs/2410.20418,Zhengmian Hu; Heng Huang
Length-Induced Embedding Collapse in Transformer-based Models,2024-10-31,http://arxiv.org/abs/2410.24200,Yuqi Zhou; Sunhao Dai; Zhanshuo Cao; Xiao Zhang; Jun Xu
A Theoretical Perspective for Speculative Decoding Algorithm,2024-10-30,http://arxiv.org/abs/2411.00841,Ming Yin; Minshuo Chen; Kaixuan Huang; Mengdi Wang
Analyzing & Reducing the Need for Learning Rate Warmup in GPT Training,2024-10-31,http://arxiv.org/abs/2410.23922,Atli Kosson; Bettina Messmer; Martin Jaggi
4 changes: 3 additions & 1 deletion papers/phenomena-of-interest/in-context-learning/papers.csv
@@ -89,4 +89,6 @@ Can Transformers In-Context Learn Behavior of a Linear Dynamical System?,2024-10
Bayesian scaling laws for in-context learning,2024-10-21,http://arxiv.org/abs/2410.16531,Aryaman Arora; Dan Jurafsky; Christopher Potts; Noah D. Goodman
Mechanisms of Symbol Processing for In-Context Learning in Transformer Networks,2024-10-23,http://arxiv.org/abs/2410.17498,Paul Smolensky; Roland Fernandez; Zhenghao Herbert Zhou; Mattia Opper; Jianfeng Gao
On the Role of Depth and Looping for In-Context Learning with Task Diversity,2024-10-29,http://arxiv.org/abs/2410.21698,Khashayar Gatmiry; Nikunj Saunshi; Sashank J. Reddi; Stefanie Jegelka; Sanjiv Kumar
Toward Understanding In-context vs. In-weight Learning,2024-10-30,http://arxiv.org/abs/2410.23042,Bryan Chan; Xinyi Chen; András György; Dale Schuurmans
Provable In-Context Learning with Transformers: A Case Study on Linear Regression,2024-11-04,http://arxiv.org/abs/2411.02199,Dake Bu; Wei Huang; Andi Han; Atsushi Nitanda; Taiji Suzuki; Qingfu Zhang; Hau-San Wong
Pretrained transformer efficiently learns low-dimensional target functions in-context,2024-11-04,http://arxiv.org/abs/2411.02544,Kazusato Oko; Yujin Song; Taiji Suzuki; Denny Wu
3 changes: 2 additions & 1 deletion papers/phenomena-of-interest/knowledge/papers.csv
@@ -27,4 +27,5 @@ Induction Heads as an Essential Mechanism for Pattern Matching in In-context Lea
"Schrodingers Memory: Large Language Models",2024-09-16,https://arxiv.org/pdf/2409.10482,Wei Wang; Qing Li
Self-Attention Limits Working Memory Capacity of Transformer-Based Models,2024-09-16,http://arxiv.org/abs/2409.10715,Dongyu Gong; Hantao Zhang
"Physics of Language Models: Part 3.1, Knowledge Storage and Extraction",2024-07-16,http://arxiv.org/abs/2309.14316,Zeyuan Allen-Zhu; Yuanzhi Li
Optimal Memorization Capacity of Transformers,2024-09-26,http://arxiv.org/abs/2409.17677,Tokio Kajitsuka; Issei Sato
A Geometric Framework for Understanding Memorization in Generative Models,2024-10-31,http://arxiv.org/abs/2411.00113,Brendan Leigh Ross; Hamidreza Kamkari; Tongzi Wu; Rasa Hosseinzadeh; Zhaoyan Liu; George Stein; Jesse C. Cresswell; Gabriel Loaiza-Ganem
5 changes: 4 additions & 1 deletion papers/phenomena-of-interest/learning/papers.csv
@@ -51,4 +51,7 @@ A Formal Framework for Understanding Length Generalization in Transformers,2024-
Benign Overfitting for Regression with Trained Two-Layer ReLU Networks,2024-10-08,http://arxiv.org/abs/2410.06191,Junhyung Park; Patrick Bloebaum; Shiva Prasad Kasiviswanathan
Dynamics of Concept Learning and Compositional Generalization,2024-10-10,http://arxiv.org/abs/2410.08309,Yongyi Yang; Core Francisco Park; Ekdeep Singh Lubana; Maya Okawa; Wei Hu; Hidenori Tanaka
On Rank-Dependent Generalisation Error Bounds for Transformers,2024-10-15,http://arxiv.org/abs/2410.11500,Lan V. Truong
Mixture of Parrots: Experts improve memorization more than reasoning,2024-10-24,http://arxiv.org/abs/2410.19034,Samy Jelassi; Clara Mohri; David Brandfonbrener; Alex Gu; Nikhil Vyas; Nikhil Anand; David Alvarez-Melis; Yuanzhi Li; Sham M. Kakade; Eran Malach
RL-STaR: Theoretical Analysis of Reinforcement Learning Frameworks for Self-Taught Reasoner,2024-10-31,http://arxiv.org/abs/2410.23912,Fu-Chieh Chang; Yu-Ting Lee; Hui-Ying Shih; Pei-Yuan Wu
Generalization and Risk Bounds for Recurrent Neural Networks,2024-11-05,http://arxiv.org/abs/2411.02784,Xuewei Cheng; Ke Huang; Shujie Ma
Provable Length Generalization in Sequence Prediction via Spectral Filtering,2024-11-01,http://arxiv.org/abs/2411.01035,Annie Marsden; Evan Dogariu; Naman Agarwal; Xinyi Chen; Daniel Suo; Elad Hazan
4 changes: 3 additions & 1 deletion papers/phenomena-of-interest/other-phenomena/papers.csv
@@ -29,4 +29,6 @@ Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Pheno
Emergent properties with repeated examples,2024-10-09,http://arxiv.org/abs/2410.07041,François Charton; Julia Kempe
Looking Beyond The Top-1: Transformers Determine Top Tokens In Order,2024-10-26,http://arxiv.org/abs/2410.20210,Daria Lioubashevski; Tomer Schlank; Gabriel Stanovsky; Ariel Goldstein
All or None: Identifiable Linear Properties of Next-token Predictors in Language Modeling,2024-10-30,http://arxiv.org/abs/2410.23501,Emanuele Marconato; Sébastien Lachapelle; Sebastian Weichwald; Luigi Gresele
Weight decay induces low-rank attention layers,2024-10-31,http://arxiv.org/abs/2410.23819,Seijin Kobayashi; Yassir Akram; Johannes Von Oswald
On the loss of context-awareness in general instruction fine-tuning,2024-11-05,http://arxiv.org/abs/2411.02688,Yihan Wang; Andrew Bai; Nanyun Peng; Cho-Jui Hsieh
Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs,2024-10-17,http://arxiv.org/abs/2410.13835,Tianyu Guo; Druv Pai; Yu Bai; Jiantao Jiao; Michael I. Jordan; Song Mei
3 changes: 2 additions & 1 deletion papers/phenomena-of-interest/scaling-laws/papers.csv
@@ -48,4 +48,5 @@ Grokking at the Edge of Linear Separability,2024-10-06,https://arxiv.org/abs/241
Scaling Laws Across Model Architectures: A Comparative Analysis of Dense and MoE Models in Large Language Models,2024-10-08,https://arxiv.org/abs/2410.05661,Siqi Wang; Zhengyu Chen; Bei Li; Keqing He; Min Zhang; Jingang Wang
"An Information Theory of Compute-Optimal Size Scaling, Emergence, and Plateaus in Language Models",2024-10-15,https://arxiv.org/abs/2410.01243,Anuj K. Nayak; Lav R. Varshney
A Hitchhiker's Guide to Scaling Law Estimation,2024-10-15,http://arxiv.org/abs/2410.11840,Leshem Choshen; Yang Zhang; Jacob Andreas
How Does Critical Batch Size Scale in Pre-training?,2024-10-29,http://arxiv.org/abs/2410.21676,Hanlin Zhang; Depen Morwani; Nikhil Vyas; Jingfeng Wu; Difan Zou; Udaya Ghai; Dean Foster; Sham Kakade
Unlocking the Theory Behind Scaling 1-Bit Neural Networks,2024-11-03,http://arxiv.org/abs/2411.01663,Majid Daliri; Zhao Song; Chiwun Yang
3 changes: 2 additions & 1 deletion papers/phenomena-of-interest/training-dynamics/papers.csv
@@ -48,4 +48,5 @@ LoRA vs Full Fine-tuning: An Illusion of Equivalence,2024-10-28,http://arxiv.org
Abrupt Learning in Transformers: A Case Study on Matrix Completion,2024-10-29,http://arxiv.org/abs/2410.22244,Pulkit Gopalani; Ekdeep Singh Lubana; Wei Hu
Global Convergence in Training Large-Scale Transformers,2024-10-31,http://arxiv.org/abs/2410.23610,Cheng Gao; Yuan Cao; Zihao Li; Yihan He; Mengdi Wang; Han Liu; Jason Matthew Klusowski; Jianqing Fan
What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective,2024-10-31,http://arxiv.org/abs/2410.23743,Ming Li; Yanhong Li; Tianyi Zhou
Learning and Transferring Sparse Contextual Bigrams with Linear Transformers,2024-10-30,http://arxiv.org/abs/2410.23438,Yunwei Ren; Zixuan Wang; Jason D. Lee
Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs,2024-10-17,http://arxiv.org/abs/2410.13835,Tianyu Guo; Druv Pai; Yu Bai; Jiantao Jiao; Michael I. Jordan; Song Mei
@@ -66,4 +66,6 @@ Towards Understanding the Universality of Transformers for Next-Token Prediction
Attention layers provably solve single-location regression,2024-10-02,http://arxiv.org/abs/2410.01537,Pierre Marion; Raphaël Berthier; Gérard Biau; Claire Boyer
Large Language Models as Markov Chains,2024-10-03,http://arxiv.org/abs/2410.02724,Oussama Zekri; Ambroise Odonnat; Abdelhakim Benechehab; Linus Bleistein; Nicolas Boullé; Ievgen Redko
Fundamental Limitations on Subquadratic Alternatives to Transformers,2024-10-05,http://arxiv.org/abs/2410.04271,Josh Alman; Hantao Yu
Provable Optimal Transport with Transformers: The Essence of Depth and Prompt Engineering,2024-10-25,http://arxiv.org/abs/2410.19931,Hadi Daneshmand
"Ask, and it shall be given: Turing completeness of prompting",2024-11-04,http://arxiv.org/abs/2411.01992,Ruizhong Qiu; Zhe Xu; Wenxuan Bao; Hanghang Tong
Measure-to-measure interpolation using Transformers,2024-11-07,http://arxiv.org/abs/2411.04551,Borjan Geshkovski; Philippe Rigollet; Domènec Ruiz-Balet
