Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pre-commit pylint checks are doing too much work #121

Open
jepler opened this issue Mar 23, 2021 · 6 comments
Open

pre-commit pylint checks are doing too much work #121

jepler opened this issue Mar 23, 2021 · 6 comments

Comments

@jepler
Copy link
Member

jepler commented Mar 23, 2021

I noticed that in adafruit_datetime, the pre-commit check could take a long time, especially for the "tests" step. Furthermore, all 4 of my CPU cores were active.

I believe this is because by default, pre-commit

  • assumes that the program accepts filenames on its commandline/argv
  • runs multiple processes in parallel if there are multiple files

Since pylint needs to get a view of all the source files it's checking in order to do proper code duplication checks, we make our own list of files to pylint with find and ignore the positional arguments that are given. But unless we also specify pass_filenames: false pre-commit doesn't know about it and starts invoking the "pylint all files" command once for each file!

This change is one I'm testing locally in adafruit_datetime:

From 10ebce0339182073dc0aabd010f73df00ae348a6 Mon Sep 17 00:00:00 2001
From: Jeff Epler <[email protected]>
Date: Mon, 22 Mar 2021 20:34:25 -0500
Subject: [PATCH] pre-commit: don't do too much work

---
 .pre-commit-config.yaml | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index cce4c7b..96f0626 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -32,6 +32,7 @@ repos:
         entry: /usr/bin/env bash -c
         args: ['([[ ! -d "examples" ]] || for example in $(find . -path "./examples/*.py"); do pylint --disable=missing-docstring,invalid-name $example; done)']
         language: system
+        pass_filenames: false
 -   repo: local
     hooks:
     -   id: pylint_tests
@@ -40,3 +41,4 @@ repos:
         entry: /usr/bin/env bash -c
         args: ['([[ ! -d "tests" ]] || for test in $(find . -path "./tests/*.py"); do pylint --disable=missing-docstring $test; done)']
         language: system
+        pass_filenames: false
-- 
2.29.2

If we want to make a change like this we'll have to apply it with adabot to existing repos as well.

@kattni
Copy link
Contributor

kattni commented Mar 23, 2021

@jepler I'm not entirely sure I follow. We're using find to avoid duplicate code checking on examples and tests. That is how we did it, was to force Pylint to run separately on each file in those directories. If you can, please clarify what pass_filenames: false is doing differently.

@jepler
Copy link
Member Author

jepler commented Mar 29, 2021

To better show what I mean, suppose you change the pylint to a "teapot check", which

  • prints "I am a teapot"
  • prints the files that pre-commit believed the hook should run on
  • and then reports an error to pre-commit
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index cce4c7b..1b8ebe4 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -38,5 +38,5 @@ repos: pylint (tests code)
         description: Run pylint rules on "tests/*.py" files
         entry: /usr/bin/env bash -c
-        args: ['([[ ! -d "tests" ]] || for test in $(find . -path "./tests/*.py"); do pylint --disable=missing-docstring $test; done)']
+        args: ['echo "I am a teapot"; echo "files:"; echo "$@"; exit 1', '-']
         language: system

Then run the pre-commit checks:

$ pre-commit run --all
black....................................................................Passed
reuse....................................................................Passed
Check Yaml...............................................................Passed
Fix End of Files.........................................................Passed
Trim Trailing Whitespace.................................................Passed
pylint (library code)....................................................Passed
pylint (examples code)...................................................Passed
pylint (tests code)......................................................Failed
- hook id: pylint_tests
- exit code: 1

I am a teapot
files:
examples/datetime_simpletest.py docs/_static/favicon.ico.license LICENSES/Unlicense.txt setup.py LICENSES/MIT.txt
I am a teapot
files:
.github/workflows/build.yml examples/datetime_time.py .pre-commit-config.yaml examples/datetime_timedelta.py docs/api.rst.license
I am a teapot
files:
pyproject.toml docs/conf.py adafruit_datetime.py .github/workflows/release.yml .readthedocs.yml
I am a teapot
files:
README.rst LICENSES/CC-BY-4.0.txt docs/index.rst.license CODE_OF_CONDUCT.md docs/index.rst
I am a teapot
files:
docs/examples.rst.license .gitignore tests/test_datetime.py requirements.txt .pylintrc
I am a teapot
files:
docs/api.rst LICENSE docs/_static/favicon.ico docs/examples.rst tests/test_time.py
I am a teapot
files:
LICENSES/Python-2.0.txt README.rst.license tests/test_date.py

What is going on here is, pre-commit has noticed that my system has multiple cores, and it believes (because we have not told it otherwise) that pylint_tests runs on all the files and it accepts a list of specific files to operate on via its commandline. In an attempt to make pre-commit finish faster, it divides the list of all the files up into several approximately equal pieces and invokes the given command multiple times in parallel.

When pass_filenames = false is added, the teapot test will be run just once (and no filenames will actually be passed on the commandline):

pylint (tests code)......................................................Failed
- hook id: pylint_tests
- exit code: 1

I am a teapot
files:

@jepler
Copy link
Member Author

jepler commented Mar 29, 2021

Since the last round of pre-commit changes, we've discovered a few things about how better to use pylint from pre-commit.

However, as the impact of this is actually rather modest (just some extra CPU time spent), it's fine to leave until we have another adabot patch we wanted to do on all the repositories, especially since I feel like my own ideas about this are still a bit in flux as I learn more.

pre-commit Parallelism

By default, pre-commit parallelizes each check that it runs. It makes a list of files, divides it into a number of groups of files, and then invokes the command multiple times, possibly in parallel.

This has several consequences that we care about:

Parallelism & the main package

If the main package consists of enough files, the "pylint (library code)" step will invoke the pylint command itself multiple times, each with a subset of the files. This can inhibit duplicate code checking, or cause possibly even cause duplicate code checking to give different results depending on the number of cores a developer's system has.

Resolution: Add require_serial: true

This worked as expected for a single-file library (adafruit_datetime) and a package (jepler_udecimal) even when a total of 105 files were within the package.

Parallelism, tests, & examples

A different problem existed with the pylinting of tests and examples. Because these had a command (args) to find the list of files to run on, but did not inform pre-commit about what files they would use. As a consequence, pre-commit assumed that every file was "relevant" to the check. This caused pre-commit to invoke the pylint step repeatedly. Each step received but ignored a list of files, instead checking the whole list of files that find emitted.

pylint duplicate-code checking

Pylint always had a duplicate code check, but due to bugs it was historically ineffective. Starting perhaps with pylint 2.7.0, the check became effective again. We don't want to apply the duplicate code check to tests and examples, so we used a workaround where we invoked pylint separately for each individual file.

Several ways to disable the check within pylint were investigated, but initially the one effective way was not found.

Not working:

  • specifying it in a comment in an individual file
  • specifying it in .pylintrc

Working:

  • directly specifying it on the pylint commandline

Once we have an effective way to disable duplicate-code checking on just the files we want, we can change from using a "local hook" to using a "pylint hook". pre-commit includes adquate filters ("types", "exclude" and "files") to let us run three different versions of the pylint hook on three different subsets of files

It's worth noting that it's safe to have files: tests/ even if there's no tests/ folder committed to git.

How it's working out

I initially noticed this in adafruit_datetime, where it felt like running the pylint (tests) step took a very long time and made my computer's fan spin up.

Before this change, pre-commit --all took 16s and used 1.5 minutes of CPU time. After this change, pre-commit --all takes just 7s and uses just 7s of CPU time. Even better, if I am actually committing a change to a single file in examples, the time decreases to less than 1s

Then another wrinkle occurred…

Duplicate code checking and the main pylint step still aren't right. Well, they're right when you run pre-commit --all (which CI does) but not when you have installed pre-commit and create a commit. This is because the "pylint (library code)" step is only passing changed files to pylint, which prevents it from noticing new duplication between that file and other unmodified files. This is not a huge deal, as github will verify it and kick it back if there's a problem, and then the contributor can run pre-commit --all to check locally once they are aware their contribution is potentially violating the duplicate code check.

The only alternative I'm aware of would be to return to the "local hook", but this time for the library/package itself, to ensure that pylint is run on all files all the time. Using "exclude + types" directives could still allow this step to run only when something in library/package is modified.

Patch

Here's what I'm testing locally. However, it doesn't incorporate any fix for the last item.

diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index cce4c7b..a8b58f9 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -24,19 +24,14 @@ repos:
         name: pylint (library code)
         types: [python]
         exclude: "^(docs/|tests/|examples/|setup.py$)"
--   repo: local
-    hooks:
-    -   id: pylint_examples
-        name: pylint (examples code)
-        description: Run pylint rules on "examples/*.py" files
-        entry: /usr/bin/env bash -c
-        args: ['([[ ! -d "examples" ]] || for example in $(find . -path "./examples/*.py"); do pylint --disable=missing-docstring,invalid-name $example; done)']
-        language: system
--   repo: local
-    hooks:
-    -   id: pylint_tests
-        name: pylint (tests code)
-        description: Run pylint rules on "tests/*.py" files
-        entry: /usr/bin/env bash -c
-        args: ['([[ ! -d "tests" ]] || for test in $(find . -path "./tests/*.py"); do pylint --disable=missing-docstring $test; done)']
-        language: system
+        require_serial: true  # so that duplicate code checking is the same on all systems
+    -   id: pylint
+        name: pylint (example code)
+        types: [python]
+        files: "^examples/"
+        args: ["--disable=missing-docstring,invalid-name,duplicate-code"]
+    -   id: pylint
+        name: pylint (test code)
+        types: [python]
+        files: "^tests/"
+        args: ["--disable=missing-docstring,duplicate-code"]

@tannewt
Copy link
Member

tannewt commented Apr 7, 2021

Any progress on this? I'm getting pre-commit errors on an example when only changing README.rst. I'll follow up to fix the example lint in a subsequent commit.

We should probably switch to using an external repo for pre-commit checks: https://pre-commit.com/#plugins That way we'd be able to have the code to execute it in one spot instead of many.

@lesamouraipourpre
Copy link
Contributor

I have a possible improvement for the examples and tests pylint hooks.

diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index 354c761..6da32d6 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -27,8 +27,9 @@ repos:
 -   repo: local
     hooks:
     -   id: pylint_examples
+        require_serial: true
         name: pylint (examples code)
         description: Run pylint rules on "examples/*.py" files
         entry: /usr/bin/env bash -c
-        args: ['([[ ! -d "examples" ]] || for example in $(find . -path "./examples/*.py"); do pylint --disable=missing-docstring,invalid-name $example; done)']
+        args: ['([[ ! -d "examples" ]] || pids=();for ex in $(find . -path "./examples/*.py");do pylint --disable=missing-docstring,invalid-name $ex & pids+=($!); done; exval=0; for pid in ${pids[*]}; do wait $pid; exval=$(($exval+$?)); done; exit $exval)']
         language: system

The bash command broken out:

pids=();
for ex in $(find . -path "./examples/*.py"); do
    # Run in parallel and collect the pids
    pylint --disable=missing-docstring,invalid-name $ex & pids+=($!);
done;
exval=0;
for pid in ${pids[*]}; do
    # Wait for the pids to finish and collect the exit values
    wait $pid;
    exval=$(($exval+$?));
done;
# Return a combined exit value
exit $exval

Reference for the bash idea: https://unix.stackexchange.com/a/595838

This was tested on a Raspberry Pi 4 (4 cores) on a library with 10 files in the examples directory with the following results:
Old Method 80 seconds and examples passed pylint even though pre-commit run --all-files --verbose shows problems.
New Method 25 seconds and examples fails pylint on the problems passed above.

My Bash scripting skills are negligible and this script is beyond the skill level I feel I have. Assuming this is repeatable, and works on Github (I've only tested locally) hopefully this can be the basis for speeding up workflows.

@kattni
Copy link
Contributor

kattni commented Aug 22, 2023

Removing myself as an assignee because I cannot resolve this myself, and should not have been assigned in the first place.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants