-
Notifications
You must be signed in to change notification settings - Fork 608
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(repr): add show_count
option to interactive repr
#10518
base: main
Are you sure you want to change the base?
Conversation
18eafb3
to
6dee4e6
Compare
This looks fantastic! I tried it out in a Jupyter kernel with a fairly wide table, which appears as a horizontally scrollable output using VSCode's interactive window. The dimensions appear but are center-justified, so the user has to scroll horizontally quite a bit to see them. I would suggest making it left-justified. |
6dee4e6
to
6868cd8
Compare
@@ -76,6 +79,7 @@ class Interactive(Config): | |||
max_string: int = 80 | |||
max_depth: int = 1 | |||
show_types: bool = True | |||
show_count: bool = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a nice feature to have. I do not think it should default to true.
Interactive mode is supposed to be interactive, and something that can be computationally expensive isn't something we want to turn on by default.
The main failure case I can think of here is reading from CSVs in a blob store. With show_count=False
, this can be done with a range request and only download the first N rows (ish), except if there's an order-by. show_count=True
would mean any operation would trigger a download of the entire file.
And yes, that is one more reason not to use CSVs, but 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I am OK with defaulting to False. Let's just wait for Philip to chime in to verify, but I assume he is going to have the same opinion as you.
I really don't think it is worth the complexity/cleverness, but throwing it out there that we could make each relation have its own config, and that could be auto-determined based on if it's a CSV/parquet/physical table, local/remote, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume he is going to have the same opinion as you.
We are slowly merging into the same person. The transformation is nearly complete.
I really don't think it is worth the complexity/cleverness, but throwing it out there that we could make each relation have its own config, and that could be auto-determined based on if it's a CSV/parquet/physical table, local/remote, etc.
I mean, it's definitely a reliably cheap(er) operation on anything that isn't a CSV, but I agree this seems like too much complexity. I think a better move is your suggestion of persistent default configuration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, let's leave the default to False
, otherwise it's a huge performance footgun/bug magnet.
else: | ||
nrows = "…" | ||
if isinstance(tablish, ir.Table): | ||
dims = f"{orig_ncols:_} cols by {nrows} rows" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make this follow the long-standing convention of "rows by columns"?
Resolves #10231
Doesn't have tests yet. I want to get feedback on the semantics before I do that. I assume CI is going to fail all over the place because all the docstrings etc are going to change.
CHANGELOG
Currently looks like
I made the config default be
show_count=True
. I can change this with pushback, but this is what I would start with. I would love to brainstorm a few benchmark test cases to run to see the perf difference. Some ideas:ibis.duckdb.connect("mydb.db").table("my_table")
(I expect ~0 difference)ibis.duckdb.connect().read_parquet("t.pq")
(I expect~0 difference)ibis.duckdb.connect().read_csv("thousand_rows.csv")
(I expect small difference)ibis.duckdb.connect().read_csv("billion_rows.csv")
(I expect large difference)ibis.duckdb.connect().read_csv("thousand_rows.csv").some_expensive computation()
(I expect a difference, size depends on semantics of the function)I considered adding a
.repr_options
attribute to expressions as described in #10231 (comment), but I decided that was too complicated.I considered showing the table name in the repr, eg with
table.get_name()
, but that is a separate question.Currently, the option has the semantics of
show_count: bool
, and we ALWAYS show the column count. I considered other encodings such asshow_shape: Literal["rows", "cols", "both", None]
, but I thought that was overkill.I considered adding the row count to the bottom of the table, eg something like
but then if you set a high
max_rows
it would be hard to see. Plus then this info would need to be repeated in every column. Anyway, if you have other ideas on the graphic design of where to present the counts, I'm all ears.