Support determining mode based on shebang interpreter directive #47

Open
wants to merge 3 commits into
base: master
1 change: 1 addition & 0 deletions Cargo.lock

1 change: 1 addition & 0 deletions zee-grammar/Cargo.toml
@@ -18,6 +18,7 @@ libloading = "0.7.3"
log = "0.4.16"
once_cell = { version = "1.10.0", features = ["parking_lot"] }
rayon = "1.5.2"
regex = "1.5.5"
serde = "1.0.136"
serde_derive = "1.0.136"
tree-sitter = "0.20.6"
2 changes: 2 additions & 0 deletions zee-grammar/src/config.rs
@@ -17,6 +17,8 @@ pub struct ModeConfig {
pub comment: Option<CommentConfig>,
pub indentation: IndentationConfig,
pub grammar: Option<GrammarConfig>,
#[serde(default)]
pub shebangs: Vec<String>,
Collaborator

I was wondering whether this should be a new variant of FilenamePattern rather than a new field. Granted, we should then probably rename it to FilePattern, since it would also look inside the file to determine whether the mode applies. I.e. something like FilePattern::Shebang.

The structure I suggest may make it harder to avoid reading the file when the filename would suffice.
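
A minimal sketch of the variant idea, under stated assumptions: the `Suffix` and `Name` variants, the `needs_content()` helper and the `matches()` signature are hypothetical and not taken from zee. The helper is one possible way to address the concern above, since a caller can skip reading the file when no remaining pattern needs its content.

```rust
use std::path::Path;

/// Hypothetical replacement for `FilenamePattern`; all variant names are assumptions.
#[derive(Clone, Debug)]
pub enum FilePattern {
    /// Matches when the file name ends with this suffix, e.g. ".py".
    Suffix(String),
    /// Matches when the file name equals this string, e.g. "Makefile".
    Name(String),
    /// Matches when the interpreter named on the shebang line equals this string.
    Shebang(String),
}

impl FilePattern {
    /// True if the pattern can only be evaluated by looking inside the file.
    pub fn needs_content(&self) -> bool {
        matches!(self, FilePattern::Shebang(_))
    }

    /// `interpreter` is the already extracted interpreter name, if one was read.
    pub fn matches(&self, path: &Path, interpreter: Option<&str>) -> bool {
        let file_name = path.file_name().and_then(|name| name.to_str());
        match self {
            FilePattern::Suffix(suffix) => {
                file_name.map_or(false, |name| name.ends_with(suffix.as_str()))
            }
            FilePattern::Name(expected) => file_name == Some(expected.as_str()),
            FilePattern::Shebang(expected) => interpreter == Some(expected.as_str()),
        }
    }
}
```

With this shape, filename patterns ignore the `interpreter` argument, so passing `None` is enough whenever the name alone decides the mode.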

}

#[derive(Clone, Debug, Deserialize, Serialize)]
15 changes: 15 additions & 0 deletions zee-grammar/src/lib.rs
@@ -5,11 +5,15 @@ mod git;

use anyhow::Result;
use once_cell::sync::Lazy;
use regex::Regex;
use std::path::Path;
use tree_sitter::{Language, Query};

use self::config::{CommentConfig, FilenamePattern, IndentationConfig, ModeConfig};

static SHEBANG_REGEX: Lazy<Regex> =
Lazy::new(|| Regex::new(r"^#!\s*(?:\S*[/\\](?:env\s+(?:\-\S+\s+)*)?)?([^\s\.\d]+)").unwrap());

#[derive(Debug)]
pub struct Mode {
pub name: String,
@@ -19,6 +23,7 @@ pub struct Mode {
pub comment: Option<CommentConfig>,
pub indentation: IndentationConfig,
grammar: LazyGrammar,
pub shebangs: Vec<String>,
}

impl Mode {
@@ -31,6 +36,7 @@ impl Mode {
comment,
indentation,
grammar: grammar_config,
shebangs,
} = config;
Self {
name,
@@ -44,6 +50,7 @@ impl Mode {
.map(|grammar_config| grammar_config.grammar_id)
.map(builder::load_grammar)
})),
shebangs,
}
}

@@ -53,6 +60,13 @@ impl Mode {
.any(|pattern| pattern.matches(filename.as_ref()))
}

pub fn matches_by_shebang(&self, shebang: &str) -> bool {
SHEBANG_REGEX
.captures(shebang)
.and_then(|captures| self.shebangs.contains(&captures[1].into()).then(|| 0))
.is_some()
}

pub fn language(&self) -> Option<Result<Language, &anyhow::Error>> {
Some(self.grammar()?.map(|parser| parser.language))
}
@@ -74,6 +88,7 @@ impl Default for Mode {
comment: None,
indentation: Default::default(),
grammar: Lazy::new(Box::new(|| None)),
shebangs: vec![],
}
}
}
3 changes: 3 additions & 0 deletions zee/config/config.ron
@@ -146,6 +146,7 @@
width: 4,
unit: Space,
),
shebangs: ["node"],
grammar: Some(
Grammar(
id: "javascript",
@@ -248,6 +249,7 @@
width: 4,
unit: Space,
),
shebangs: ["python"],
grammar: Some(
Grammar(
id: "python",
@@ -464,6 +466,7 @@
width: 2,
unit: Space,
),
shebangs: ["sh", "bash", "dash", "zsh"],
grammar: Some(
Grammar(
id: "bash",
13 changes: 9 additions & 4 deletions zee/src/editor/buffer.rs
@@ -173,11 +173,16 @@ impl Buffer {
file_path: Option<PathBuf>,
repo: Option<RepositoryRc>,
) -> Self {
let mode = file_path
.as_ref()
.map(|path| context.0.mode_by_filename(path))
let mode = text
.line(0)
.as_str()
.and_then(|shebang| context.0.mode_by_shebang(shebang))
.or_else(|| {
file_path
.as_ref()
.and_then(|path| context.0.mode_by_filename(path))
})
Collaborator

This may read the whole file in pathological cases, e.g. a minified file with no newlines. One goal of zee is to stay fast for any kind of pathological file you can think of and to do anything potentially blocking in the background (e.g. parsing syntax or writing the file to disk). I've also been trying to avoid doing anything linear in the length of a line on the UI thread.

I think the right long-term solution is to build buffers, i.e. call Buffer::new(), in a background thread rather than in the main UI thread.

For this PR, though, I'd be happy if you just bounded how much of the file you read, say 256 bytes at most, and tested the regex against that. You'll have to deal with potentially truncated UTF-8...

Maybe a better solution is to read characters until you either 1. encounter a newline or 2. reach X characters without one, in which case you give up and don't check the shebang. Essentially, we only test whether shebangs apply up to a certain line length.
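
A rough sketch of the second option (stop at a newline or give up), assuming the content is available as a byte slice. The function name `bounded_first_line` and the constant are hypothetical; 256 is only the figure floated above. Because the line is either returned whole or not at all, a multi-byte UTF-8 sequence is never cut in half.

```rust
/// Upper bound on how many bytes are inspected (illustrative value from the discussion).
const MAX_FIRST_LINE_BYTES: usize = 256;

/// Returns the first line of `bytes` if it ends (by newline or end of input)
/// within `MAX_FIRST_LINE_BYTES`; returns `None` for longer first lines.
pub fn bounded_first_line(bytes: &[u8]) -> Option<&str> {
    // Never look past the window, so the cost is bounded regardless of file size.
    let window = &bytes[..bytes.len().min(MAX_FIRST_LINE_BYTES)];
    let line = match window.iter().position(|&byte| byte == b'\n') {
        Some(end) => &window[..end],
        // No newline, but the whole input fit in the window: treat it as one line.
        None if bytes.len() <= MAX_FIRST_LINE_BYTES => window,
        // No newline inside the window: give up instead of scanning further.
        None => return None,
    };
    // The line is complete, so a decoding error means the input is not valid UTF-8.
    std::str::from_utf8(line)
        .ok()
        .map(|line| line.trim_end_matches('\r'))
}
```

The work done is at most `MAX_FIRST_LINE_BYTES` comparisons plus one UTF-8 validation of the same span, independent of how long the first line really is.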

Collaborator

A second comment: I think we should first test whether mode_by_filename applies, and only check the shebangs if it doesn't, to avoid having to read the file when the name already matches.
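
The ordering could look roughly like the sketch below. Everything here is hypothetical: the simplified `Mode` struct, the `pick_mode` function, and the closure that stands in for whatever reads the interpreter name from the file. The closure is only invoked when no filename pattern matched.

```rust
use std::path::Path;

/// Simplified stand-in for a mode; the fields are assumptions for this sketch.
pub struct Mode {
    pub name: &'static str,
    pub suffixes: &'static [&'static str],
    pub shebangs: &'static [&'static str],
}

/// Tries filename patterns first and only calls `read_interpreter` as a fallback.
pub fn pick_mode<'a>(
    modes: &'a [Mode],
    path: &Path,
    read_interpreter: impl FnOnce() -> Option<String>,
) -> Option<&'a Mode> {
    let file_name = path.file_name().and_then(|name| name.to_str()).unwrap_or("");
    modes
        .iter()
        .find(|mode| mode.suffixes.iter().any(|suffix| file_name.ends_with(*suffix)))
        .or_else(|| {
            // Reached only when the name did not match, so the file is read lazily.
            let interpreter = read_interpreter()?;
            modes
                .iter()
                .find(|mode| mode.shebangs.iter().any(|name| *name == interpreter.as_str()))
        })
}
```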

Contributor Author

Great suggestions. I'm looking into whether there are any standards on whether whitespace is allowed before the #! and what the maximum length is on most platforms. I have some outdated information that the maximum is 127-512 bytes, but without a standard to point at, I think an overly large maximum might be best. FreeBSD, for example, historically supported 4096. If whitespace before the shebang is not allowed, then a two-pass strategy, in which only the first two characters are examined first and then the remainder of the line up to the maximum discussed, might be the most efficient.
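
A sketch of that two-pass idea, assuming no whitespace is permitted before the `#!`. The name `shebang_rest` and the limit are hypothetical; 4096 is just a placeholder for the "overly large maximum". Unlike a give-up strategy, this version truncates at the limit and returns raw bytes, so UTF-8 decoding is left to the caller.

```rust
/// Placeholder upper bound on how far the second pass scans.
const SHEBANG_SCAN_LIMIT: usize = 4096;

/// Returns the bytes after `#!` on the first line, truncated at the scan limit,
/// or `None` if the input does not start with `#!`.
pub fn shebang_rest(bytes: &[u8]) -> Option<&[u8]> {
    // Pass 1: a two-byte check rejects most files immediately.
    if !bytes.starts_with(b"#!") {
        return None;
    }
    // Pass 2: look for the end of the line, but never beyond the limit.
    let window = &bytes[..bytes.len().min(SHEBANG_SCAN_LIMIT)];
    let end = window
        .iter()
        .position(|&byte| byte == b'\n')
        .unwrap_or(window.len());
    // `end` is at least 2 here because the first two bytes are `#!`.
    Some(&window[2..end])
}
```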

.unwrap_or(&PLAIN_TEXT_MODE);

let mut parser = mode
.language()
.and_then(|result| result.ok())
11 changes: 8 additions & 3 deletions zee/src/editor/mod.rs
@@ -32,7 +32,7 @@ use crate::{
splash::{Properties as SplashProperties, Splash},
theme::{Theme, THEMES},
},
config::{EditorConfig, PLAIN_TEXT_MODE},
config::EditorConfig,
error::Result,
task::TaskPool,
};
@@ -94,11 +94,16 @@ pub struct Context {
}

impl Context {
pub fn mode_by_filename(&self, filename: impl AsRef<Path>) -> &Mode {
pub fn mode_by_filename(&self, filename: impl AsRef<Path>) -> Option<&Mode> {
self.modes
.iter()
.find(|&mode| mode.matches_by_filename(filename.as_ref()))
.unwrap_or(&PLAIN_TEXT_MODE)
}

pub fn mode_by_shebang(&self, shebang: &str) -> Option<&Mode> {
self.modes
.iter()
.find(|&mode| mode.matches_by_shebang(shebang))
Collaborator

If shebang were a variant of FilenamePattern, there would be one function and we could continue to return &Mode rather than Option<&Mode>. I would like Context to continue to own coming up with a Mode for any possible file whatsoever, and not have a default PLAIN_TEXT_MODE potentially duplicated.

Contributor Author

I've been thinking about merging these methods into something like mode_by_file_pattern that would always return a &Mode, but I'm stuck on how to handle the difference in parameters. Two of the variants operate on the filename, while the other needs a portion of the file content. To have one method, we would always have to pass a portion of the file content in addition to the filename, which would require reading at least part of every file.

All of that being said, prior to detecting the mode or creating the buffer open_file() is calling Rope::from_reader() which I think might be reading the entire file anyway.

Rope::from_reader(BufReader::new(File::open(&file_path)?))?,
Am I reading that right? If so, we already have the Rope created and available in Buffer::new(), so reading a slice should be quite efficient.

Contributor Author

Here is an updated version of this code with a limit on the length of text examined for the interpreter directive, an inversion of the order mode checks are performed (filename checks first, then shebang), and a merging of the mode_by_filename() and mode_by_shebang() methods into one, mode_by_file(), which always returns a &Mode.

I ended up settling on 256 characters for the maximum length of the shebang directive, matching what Linux has done since 2018 (https://lore.kernel.org/lkml/[email protected]/) (https://github.com/torvalds/linux/blob/master/include/uapi/linux/binfmts.h)

I haven’t found a satisfactory way of adding a shebang case to FilenamePattern, the biggest stumbling blocks being that the shebang line would need to be passed to the matches() method and the pattern list would need to be sorted/filtered if we wanted to ensure that filename patterns were always examined before shebang directives.
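
For illustration only, here is a self-contained sketch of what a merged lookup along those lines might look like. It is not the updated code from this PR: the simplified `Mode` and `Context` structs, the suffix-based filename check, and the hand-written `interpreter()` helper (a rough stand-in for SHEBANG_REGEX that avoids the regex dependency) are all assumptions.

```rust
use std::path::Path;

/// Limit on how much text is examined for the interpreter directive.
const MAX_SHEBANG_CHARS: usize = 256;

/// Simplified stand-in for a mode; the fields are assumptions for this sketch.
pub struct Mode {
    pub name: String,
    pub suffixes: Vec<String>,
    pub shebangs: Vec<String>,
}

/// Owns the fallback mode so callers never need their own default.
pub struct Context {
    pub modes: Vec<Mode>,
    pub plain_text: Mode,
}

/// Approximates the regex: drops the path and `env` (plus its flags), then
/// cuts the program name at the first '.' or digit ("python3.11" -> "python").
pub fn interpreter(first_line: &str) -> Option<&str> {
    let mut tokens = first_line.strip_prefix("#!")?.split_whitespace();
    let mut program = tokens.next()?.rsplit(|c: char| c == '/' || c == '\\').next()?;
    if program == "env" {
        program = tokens.find(|token| !token.starts_with('-'))?;
    }
    let end = program
        .find(|c: char| c == '.' || c.is_ascii_digit())
        .unwrap_or(program.len());
    (end > 0).then(|| &program[..end])
}

impl Context {
    /// Filename checks run first; the shebang is consulted only as a fallback.
    pub fn mode_by_file(&self, path: Option<&Path>, text: &str) -> &Mode {
        let file_name = path.and_then(Path::file_name).and_then(|name| name.to_str());
        file_name
            .and_then(|name| {
                self.modes
                    .iter()
                    .find(|mode| mode.suffixes.iter().any(|s| name.ends_with(s.as_str())))
            })
            .or_else(|| {
                // Examine at most MAX_SHEBANG_CHARS characters of the text.
                let limit = text
                    .char_indices()
                    .nth(MAX_SHEBANG_CHARS)
                    .map_or(text.len(), |(index, _)| index);
                let head = &text[..limit];
                let first_line = &head[..head.find('\n').unwrap_or(head.len())];
                let name = interpreter(first_line)?;
                self.modes
                    .iter()
                    .find(|mode| mode.shebangs.iter().any(|s| s.as_str() == name))
            })
            .unwrap_or(&self.plain_text)
    }
}
```

Keeping the fallback inside `Context` means the method can return `&Mode` unconditionally, which is the property the review asked to preserve.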

}
}
