
Commit

add Nov
Furyton committed Nov 11, 2024
1 parent ee56f83 commit 10ca6ac
Showing 10 changed files with 27 additions and 10 deletions.
3 changes: 2 additions & 1 deletion papers/architectural-effectivity/linear-attention/papers.csv
@@ -6,4 +6,5 @@ Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models,2024
Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations,2024-08-20,http://arxiv.org/abs/2408.10920,Róbert Csordás; Christopher Potts; Christopher D. Manning; Atticus Geiger
"Theory, Analysis, and Best Practices for Sigmoid Self-Attention",2024-09-06,http://arxiv.org/abs/2409.04431,Jason Ramapuram; Federico Danieli; Eeshan Dhekane; Floris Weers; Dan Busbridge; Pierre Ablin; Tatiana Likhomanenko; Jagrit Digani; Zijin Gu; Amitis Shidani; Russ Webb
"Autoregressive + Chain of Thought (CoT) ≃ Recurrent: Recurrence's Role in Language Models and a Revist of Recurrent Transformer",2024-09-14,http://arxiv.org/abs/2409.09239,Xiang Zhang; Muhammad Abdul-Mageed; Laks V.S. Lakshmanan
Fundamental Limitations on Subquadratic Alternatives to Transformers,2024-10-05,http://arxiv.org/abs/2410.04271,Josh Alman; Hantao Yu
kNN Attention Demystified: A Theoretical Exploration for Scalable Transformers,2024-11-06,http://arxiv.org/abs/2411.04013,Themistoklis Haris
4 changes: 3 additions & 1 deletion papers/mechanistic-engineering/papers.csv
@@ -45,4 +45,6 @@ Optimal ablation for interpretability,2024-09-16,http://arxiv.org/abs/2409.09951
Self-Attention Limits Working Memory Capacity of Transformer-Based Models,2024-09-16,http://arxiv.org/abs/2409.10715,Dongyu Gong; Hantao Zhang
Extracting Finite State Machines from Transformers,2024-10-08,http://arxiv.org/abs/2410.06045,Rik Adriaensen; Jaron Maene
Interpreting Affine Recurrence Learning in GPT-style Transformers,2024-10-22,http://arxiv.org/abs/2410.17438,Samarth Bhargav; Alexander Gu
Mechanisms of Symbol Processing for In-Context Learning in Transformer Networks,2024-10-23,http://arxiv.org/abs/2410.17498,Paul Smolensky; Roland Fernandez; Zhenghao Herbert Zhou; Mattia Opper; Jianfeng Gao
How Transformers Solve Propositional Logic Problems: A Mechanistic Analysis,2024-11-06,http://arxiv.org/abs/2411.04105,Guan Zhe Hong; Nishanth Dikkala; Enming Luo; Cyrus Rashtchian; Xin Wang; Rina Panigrahy
Towards Interpreting Language Models: A Case Study in Multi-Hop Reasoning,2024-11-06,http://arxiv.org/abs/2411.05037,Mansi Sakarvadia
4 changes: 3 additions & 1 deletion papers/miscellanea/papers.csv
@@ -83,4 +83,6 @@ softmax is not enough (for sharp out-of-distribution),2024-10-01,http://arxiv.or
Mind the Gap: a Spectral Analysis of Rank Collapse and Signal Propagation in Transformers,2024-10-10,http://arxiv.org/abs/2410.07799,Alireza Naderi; Thiziri Nait Saada; Jared Tanner
Decoding Game: On Minimax Optimality of Heuristic Text Generation Strategies,2024-10-04,http://arxiv.org/abs/2410.03968,Sijin Chen; Omar Hagrass; Jason M. Klusowski
Inevitable Trade-off between Watermark Strength and Speculative Sampling Efficiency for Language Models,2024-10-27,http://arxiv.org/abs/2410.20418,Zhengmian Hu; Heng Huang
Length-Induced Embedding Collapse in Transformer-based Models,2024-10-31,http://arxiv.org/abs/2410.24200,Yuqi Zhou; Sunhao Dai; Zhanshuo Cao; Xiao Zhang; Jun Xu
A Theoretical Perspective for Speculative Decoding Algorithm,2024-10-30,http://arxiv.org/abs/2411.00841,Ming Yin; Minshuo Chen; Kaixuan Huang; Mengdi Wang
Analyzing & Reducing the Need for Learning Rate Warmup in GPT Training,2024-10-31,http://arxiv.org/abs/2410.23922,Atli Kosson; Bettina Messmer; Martin Jaggi
4 changes: 3 additions & 1 deletion papers/phenomena-of-interest/in-context-learning/papers.csv
@@ -89,4 +89,6 @@ Can Transformers In-Context Learn Behavior of a Linear Dynamical System?,2024-10
Bayesian scaling laws for in-context learning,2024-10-21,http://arxiv.org/abs/2410.16531,Aryaman Arora; Dan Jurafsky; Christopher Potts; Noah D. Goodman
Mechanisms of Symbol Processing for In-Context Learning in Transformer Networks,2024-10-23,http://arxiv.org/abs/2410.17498,Paul Smolensky; Roland Fernandez; Zhenghao Herbert Zhou; Mattia Opper; Jianfeng Gao
On the Role of Depth and Looping for In-Context Learning with Task Diversity,2024-10-29,http://arxiv.org/abs/2410.21698,Khashayar Gatmiry; Nikunj Saunshi; Sashank J. Reddi; Stefanie Jegelka; Sanjiv Kumar
Toward Understanding In-context vs. In-weight Learning,2024-10-30,http://arxiv.org/abs/2410.23042,Bryan Chan; Xinyi Chen; András György; Dale Schuurmans
Provable In-Context Learning with Transformers: A Case Study on Linear Regression,2024-11-04,http://arxiv.org/abs/2411.02199,Dake Bu; Wei Huang; Andi Han; Atsushi Nitanda; Taiji Suzuki; Qingfu Zhang; Hau-San Wong
Pretrained transformer efficiently learns low-dimensional target functions in-context,2024-11-04,http://arxiv.org/abs/2411.02544,Kazusato Oko; Yujin Song; Taiji Suzuki; Denny Wu
3 changes: 2 additions & 1 deletion papers/phenomena-of-interest/knowledge/papers.csv
@@ -27,4 +27,5 @@ Induction Heads as an Essential Mechanism for Pattern Matching in In-context Lea
"Schrodingers Memory: Large Language Models",2024-09-16,https://arxiv.org/pdf/2409.10482,Wei Wang; Qing Li
Self-Attention Limits Working Memory Capacity of Transformer-Based Models,2024-09-16,http://arxiv.org/abs/2409.10715,Dongyu Gong; Hantao Zhang
"Physics of Language Models: Part 3.1, Knowledge Storage and Extraction",2024-07-16,http://arxiv.org/abs/2309.14316,Zeyuan Allen-Zhu; Yuanzhi Li
Optimal Memorization Capacity of Transformers,2024-09-26,http://arxiv.org/abs/2409.17677,Tokio Kajitsuka; Issei Sato
A Geometric Framework for Understanding Memorization in Generative Models,2024-10-31,http://arxiv.org/abs/2411.00113,Brendan Leigh Ross; Hamidreza Kamkari; Tongzi Wu; Rasa Hosseinzadeh; Zhaoyan Liu; George Stein; Jesse C. Cresswell; Gabriel Loaiza-Ganem
5 changes: 4 additions & 1 deletion papers/phenomena-of-interest/learning/papers.csv
@@ -51,4 +51,7 @@ A Formal Framework for Understanding Length Generalization in Transformers,2024-
Benign Overfitting for Regression with Trained Two-Layer ReLU Networks,2024-10-08,http://arxiv.org/abs/2410.06191,Junhyung Park; Patrick Bloebaum; Shiva Prasad Kasiviswanathan
Dynamics of Concept Learning and Compositional Generalization,2024-10-10,http://arxiv.org/abs/2410.08309,Yongyi Yang; Core Francisco Park; Ekdeep Singh Lubana; Maya Okawa; Wei Hu; Hidenori Tanaka
On Rank-Dependent Generalisation Error Bounds for Transformers,2024-10-15,http://arxiv.org/abs/2410.11500,Lan V. Truong
Mixture of Parrots: Experts improve memorization more than reasoning,2024-10-24,http://arxiv.org/abs/2410.19034,Samy Jelassi; Clara Mohri; David Brandfonbrener; Alex Gu; Nikhil Vyas; Nikhil Anand; David Alvarez-Melis; Yuanzhi Li; Sham M. Kakade; Eran Malach
RL-STaR: Theoretical Analysis of Reinforcement Learning Frameworks for Self-Taught Reasoner,2024-10-31,http://arxiv.org/abs/2410.23912,Fu-Chieh Chang; Yu-Ting Lee; Hui-Ying Shih; Pei-Yuan Wu
Generalization and Risk Bounds for Recurrent Neural Networks,2024-11-05,http://arxiv.org/abs/2411.02784,Xuewei Cheng; Ke Huang; Shujie Ma
Provable Length Generalization in Sequence Prediction via Spectral Filtering,2024-11-01,http://arxiv.org/abs/2411.01035,Annie Marsden; Evan Dogariu; Naman Agarwal; Xinyi Chen; Daniel Suo; Elad Hazan
4 changes: 3 additions & 1 deletion papers/phenomena-of-interest/other-phenomena/papers.csv
@@ -29,4 +29,6 @@ Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Pheno
Emergent properties with repeated examples,2024-10-09,http://arxiv.org/abs/2410.07041,François Charton; Julia Kempe
Looking Beyond The Top-1: Transformers Determine Top Tokens In Order,2024-10-26,http://arxiv.org/abs/2410.20210,Daria Lioubashevski; Tomer Schlank; Gabriel Stanovsky; Ariel Goldstein
All or None: Identifiable Linear Properties of Next-token Predictors in Language Modeling,2024-10-30,http://arxiv.org/abs/2410.23501,Emanuele Marconato; Sébastien Lachapelle; Sebastian Weichwald; Luigi Gresele
Weight decay induces low-rank attention layers,2024-10-31,http://arxiv.org/abs/2410.23819,Seijin Kobayashi; Yassir Akram; Johannes Von Oswald
On the loss of context-awareness in general instruction fine-tuning,2024-11-05,http://arxiv.org/abs/2411.02688,Yihan Wang; Andrew Bai; Nanyun Peng; Cho-Jui Hsieh
Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs,2024-10-17,http://arxiv.org/abs/2410.13835,Tianyu Guo; Druv Pai; Yu Bai; Jiantao Jiao; Michael I. Jordan; Song Mei
3 changes: 2 additions & 1 deletion papers/phenomena-of-interest/scaling-laws/papers.csv
@@ -48,4 +48,5 @@ Grokking at the Edge of Linear Separability,2024-10-06,https://arxiv.org/abs/241
Scaling Laws Across Model Architectures: A Comparative Analysis of Dense and MoE Models in Large Language Models,2024-10-08,https://arxiv.org/abs/2410.05661,Siqi Wang; Zhengyu Chen; Bei Li; Keqing He; Min Zhang; Jingang Wang
"An Information Theory of Compute-Optimal Size Scaling, Emergence, and Plateaus in Language Models",2024-10-15,https://arxiv.org/abs/2410.01243,Anuj K. Nayak; Lav R. Varshney
A Hitchhiker's Guide to Scaling Law Estimation,2024-10-15,http://arxiv.org/abs/2410.11840,Leshem Choshen; Yang Zhang; Jacob Andreas
How Does Critical Batch Size Scale in Pre-training?,2024-10-29,http://arxiv.org/abs/2410.21676,Hanlin Zhang; Depen Morwani; Nikhil Vyas; Jingfeng Wu; Difan Zou; Udaya Ghai; Dean Foster; Sham Kakade
Unlocking the Theory Behind Scaling 1-Bit Neural Networks,2024-11-03,http://arxiv.org/abs/2411.01663,Majid Daliri; Zhao Song; Chiwun Yang
3 changes: 2 additions & 1 deletion papers/phenomena-of-interest/training-dynamics/papers.csv
@@ -48,4 +48,5 @@ LoRA vs Full Fine-tuning: An Illusion of Equivalence,2024-10-28,http://arxiv.org
Abrupt Learning in Transformers: A Case Study on Matrix Completion,2024-10-29,http://arxiv.org/abs/2410.22244,Pulkit Gopalani; Ekdeep Singh Lubana; Wei Hu
Global Convergence in Training Large-Scale Transformers,2024-10-31,http://arxiv.org/abs/2410.23610,Cheng Gao; Yuan Cao; Zihao Li; Yihan He; Mengdi Wang; Han Liu; Jason Matthew Klusowski; Jianqing Fan
What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective,2024-10-31,http://arxiv.org/abs/2410.23743,Ming Li; Yanhong Li; Tianyi Zhou
Learning and Transferring Sparse Contextual Bigrams with Linear Transformers,2024-10-30,http://arxiv.org/abs/2410.23438,Yunwei Ren; Zixuan Wang; Jason D. Lee
Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs,2024-10-17,http://arxiv.org/abs/2410.13835,Tianyu Guo; Druv Pai; Yu Bai; Jiantao Jiao; Michael I. Jordan; Song Mei
@@ -66,4 +66,6 @@ Towards Understanding the Universality of Transformers for Next-Token Prediction
Attention layers provably solve single-location regression,2024-10-02,http://arxiv.org/abs/2410.01537,Pierre Marion; Raphaël Berthier; Gérard Biau; Claire Boyer
Large Language Models as Markov Chains,2024-10-03,http://arxiv.org/abs/2410.02724,Oussama Zekri; Ambroise Odonnat; Abdelhakim Benechehab; Linus Bleistein; Nicolas Boullé; Ievgen Redko
Fundamental Limitations on Subquadratic Alternatives to Transformers,2024-10-05,http://arxiv.org/abs/2410.04271,Josh Alman; Hantao Yu
Provable Optimal Transport with Transformers: The Essence of Depth and Prompt Engineering,2024-10-25,http://arxiv.org/abs/2410.19931,Hadi Daneshmand
"Ask, and it shall be given: Turing completeness of prompting",2024-11-04,http://arxiv.org/abs/2411.01992,Ruizhong Qiu; Zhe Xu; Wenxuan Bao; Hanghang Tong
Measure-to-measure interpolation using Transformers,2024-11-07,http://arxiv.org/abs/2411.04551,Borjan Geshkovski; Philippe Rigollet; Domènec Ruiz-Balet
