-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add bed12 reader #57
base: main
Are you sure you want to change the base?
Add bed12 reader #57
Conversation
impl BatchBuilder for BedBatchBuilder { | ||
// Noodles implements bed Record structs for bed3-bed9 and bed12. | ||
// The following only handles bed12. | ||
type Record<'a> = &'a bed::Record<12>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may need to deviate from oxbow's BatchBuilder
to accomodate different record numbers.
color: StructBuilder::new( | ||
vec![ | ||
Field::new("r", DataType::UInt8, true), | ||
Field::new("g", DataType::UInt8, true), | ||
Field::new("b", DataType::UInt8, true), | ||
], | ||
vec![ | ||
Box::new(UInt8Builder::new()), | ||
Box::new(UInt8Builder::new()), | ||
Box::new(UInt8Builder::new()), | ||
], | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The StructBuilder
for the arrow Struct Array takes the schema of the struct (the boolean sets the nullability of the field) and the builders for each field.
// Handle 1 or 2 missing colors | ||
if colors.len() == 2 { | ||
self.color | ||
.field_builder::<UInt8Builder>(2) | ||
.unwrap() | ||
.append_null(); | ||
} else if colors.len() == 1 { | ||
self.color | ||
.field_builder::<UInt8Builder>(1) | ||
.unwrap() | ||
.append_null(); | ||
self.color | ||
.field_builder::<UInt8Builder>(2) | ||
.unwrap() | ||
.append_null(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A common issue when debugging struct arrays is having different numbers of records for the resulting Record Batch
.unwrap() | ||
.append_null(); | ||
} | ||
self.color.append(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another arrow builder gotcha is the difference between append_value
and append
. Not all builders have both, but append
is for the struct and append_value
(in the for loop above) is for the subarray
This PR handles spec compliant BED12 files, so it won't work as is with BED3-9 and optional fields. Handling the other types of bed files could require creating separate reader structs for each one that build off the previous and
impl
different record column sizes as noodles does withBedN
andRecord
. There may be a simpler solution. In the interim, it is still easier to read bed files with a csv reader (delimited by\t
), especially given the variety of column arrangements.Of note are 2 types of arrow builders used in the
BedBatchBuilder
that have not been used elsewhere in oxbow:StructBuilder
for Struct Arrayscolor
fieldListBuilder
for listsblock_sizes
andblock_starts
fieldsThese builders could potentially be used for the nested fields of other readers in oxbow.