[EPIC] Streaming partitioned writes #6569
Comments
FYI @devinjdangelo I updated this ticket with various issues related to the write code you are working on.
Thanks @devinjdangelo -- I added it to the list.
An additional issue we should cut and add to this epic is allowing inserts to a sorted ListingTable. In the case of appending new files to a directory, I think it is as simple as having FileSinkExec require that its input be sorted. It can't really be supported efficiently for appending to an existing file, since that would require reading the existing file, sorting it together with the new data, and rewriting the whole file. For that case, you could use insert overwrite instead if you really want to do this (which is another thing we could cut a ticket to add support for). Alternatively, we could check whether 1) the table is sorted and 2) the input to FileSinkExec is sorted. If 1) is true but 2) is not, we would need to update the table's metadata to indicate to subsequent queries that it is no longer guaranteed to be sorted.
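To make the cases in the comment above concrete, here is a minimal sketch of the decision logic in plain Rust. It is not DataFusion code: `InsertMode`, `SortedInsertPlan`, and `plan_sorted_insert` are hypothetical names invented for illustration, and the sketch assumes the "check and update metadata" alternative rather than strictly requiring sorted input.

```rust
// Hypothetical sketch only: none of these types or functions exist in DataFusion.

/// How new rows reach the table (illustrative names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InsertMode {
    /// Write new files next to the existing ones in the table directory.
    AppendNewFiles,
    /// Append rows to a file that already exists.
    AppendToExistingFile,
}

/// Outcome of planning an insert into a possibly sorted table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SortedInsertPlan {
    /// The table declares no sort order, so there is nothing to preserve.
    NoOrderingToPreserve,
    /// The input is sorted, so each new file keeps the declared order.
    OrderingPreserved,
    /// The table is sorted but the input is not: the table metadata must
    /// stop advertising a sort order for subsequent queries.
    DropSortGuarantee,
    /// Keeping the order would mean reading, re-sorting, and rewriting the
    /// whole existing file (for example via insert overwrite).
    NeedsFullRewrite,
}

/// Decide what an insert has to do, given whether the table and the sink's
/// input are sorted. This assumes the "check both conditions" alternative.
fn plan_sorted_insert(
    table_sorted: bool,
    input_sorted: bool,
    mode: InsertMode,
) -> SortedInsertPlan {
    if !table_sorted {
        return SortedInsertPlan::NoOrderingToPreserve;
    }
    match (mode, input_sorted) {
        (InsertMode::AppendToExistingFile, _) => SortedInsertPlan::NeedsFullRewrite,
        (InsertMode::AppendNewFiles, true) => SortedInsertPlan::OrderingPreserved,
        (InsertMode::AppendNewFiles, false) => SortedInsertPlan::DropSortGuarantee,
    }
}
```

Under the stricter alternative from the same comment, the `DropSortGuarantee` case would instead be ruled out up front by having the sink require sorted input.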
Filed #7354 to track.
Thanks @devinjdangelo -- I added it to the list on this ticket.
@alamb I made some progress on inserts to sorted tables #7354. This also got me thinking about inserts to partitioned tables, so I opened an issue to track: Lastly, I've been thinking we may want to deprecate and eventually remove the
Thank you -- I added #7744 to the list on this ticket
That sounds like a reasonable idea to me. One challenge might be that the Hooking them into
Is your feature request related to a problem or challenge?
This is a tracking epic for a collection of features related to writing data.
The basic idea is better / full support for writing data:
This is partially supported today programmatically (see SessionContext::write_csv, etc)
Subtasks:
COPY ... TO statement #5654
DataFrame.write_* to use LogicalPlan::Write #5076
CopyOptions for controlling copy behavior #7322
allow_single_file_parallelism by default to write out parquet files in parallel #7590
Dictionary(UInt16, Utf8) #7891
COPY command #8493
SINGLE_FILE_OUTPUT option from COPY statement #8621
FileType enum and replace with a trait #8657
DataFrame::write command #9237