Limit number of records sampled and filter tables #4936
-
Thank you for this great product. It will add a lot of value to our organisation, where we are trying to set up a glossary, tag tables and columns, etc. One question: is it possible to limit the number of records scanned for profiling, or to turn profiling off for specific tables? For our tables with millions of records, the ingestion job just keeps running.
Replies: 2 comments 1 reply
-
Thanks for reaching out, and even bigger thanks for your kind words. We currently have two ways of computing the data profiling: as part of the metadata ingestion, or through separate Profiler Workflows.
What could be done here is to disable profiling during the metadata ingestion (we are thinking about removing it from there completely, for exactly the reasons you give) and have separate Profiler Workflows handle it. The interesting part about the Profiler Workflows is that we're not limited to a single workflow per service. You could deploy multiple workflows, each filtering by a different set of tables, and run each of them on a different schedule.
Regarding limiting the number of records, we have implemented a first version of data sampling in the profiler backend. It allows running the workflow on a given percentage of each table, taken as a random sample. In upcoming releases, we'll update the UI to allow selecting that percentage for each table.
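To make the idea of several filtered workflows plus percentage sampling more concrete, here is a minimal Python sketch. It is not the OpenMetadata API: `ProfilerJob`, its fields, and the regex-based include/exclude filtering are all hypothetical names chosen for illustration, and the sampling shown is a plain in-memory random sample rather than whatever the profiler backend does internally.

```python
import random
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProfilerJob:
    """Hypothetical description of one profiler workflow (not an OpenMetadata class)."""

    name: str
    include: List[str] = field(default_factory=list)  # regexes; empty means "all tables"
    exclude: List[str] = field(default_factory=list)  # regexes; a match always skips the table
    sample_percent: float = 100.0                     # share of rows to profile

    def selects(self, table: str) -> bool:
        """Return True if this job should profile the given table name."""
        if any(re.search(pattern, table) for pattern in self.exclude):
            return False
        return not self.include or any(re.search(pattern, table) for pattern in self.include)

    def sample(self, rows: list, seed: Optional[int] = None) -> list:
        """Return a random sample holding sample_percent of the rows."""
        size = round(len(rows) * self.sample_percent / 100)
        return random.Random(seed).sample(rows, size)


# Two jobs over the same service: small tables are profiled in full,
# the large table is profiled on a 10% random sample.
small = ProfilerJob("small_tables", include=[r"^sales\.customers$"])
large = ProfilerJob("large_tables", include=[r"^sales\.orders$"], sample_percent=10)
```

Each job could then be given its own schedule, so that the expensive tables are profiled less often and only on a sample.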
-
Hey @jayadevanm, let us know how using OpenMetadata in your company is going, or if we can help you with anything else!
…