Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add bed12 reader #57

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Add bed12 reader #57

wants to merge 1 commit into from

Conversation

GarrettNg
Copy link
Contributor

This PR handles spec compliant BED12 files, so it won't work as is with BED3-9 and optional fields. Handling the other types of bed files could require creating separate reader structs for each one that build off the previous and impl different record column sizes as noodles does with BedN and Record. There may be a simpler solution. In the interim, it is still easier to read bed files with a csv reader (delimited by \t), especially given the variety of column arrangements.

Of note are 2 types of arrow builders used in the BedBatchBuilder that have not been used elsewhere in oxbow:

  • StructBuilder for Struct Arrays
    • used for the color field
  • ListBuilder for lists
    • used for the block_sizes and block_starts fields

These builders could potentially be used for the nested fields of other readers in oxbow.

Comment on lines +95 to +98
impl BatchBuilder for BedBatchBuilder {
// Noodles implements bed Record structs for bed3-bed9 and bed12.
// The following only handles bed12.
type Record<'a> = &'a bed::Record<12>;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may need to deviate from oxbow's BatchBuilder to accomodate different record numbers.

Comment on lines +77 to +88
color: StructBuilder::new(
vec![
Field::new("r", DataType::UInt8, true),
Field::new("g", DataType::UInt8, true),
Field::new("b", DataType::UInt8, true),
],
vec![
Box::new(UInt8Builder::new()),
Box::new(UInt8Builder::new()),
Box::new(UInt8Builder::new()),
],
),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The StructBuilder for the arrow Struct Array takes the schema of the struct (the boolean sets the nullability of the field) and the builders for each field.

Comment on lines +145 to +160
// Handle 1 or 2 missing colors
if colors.len() == 2 {
self.color
.field_builder::<UInt8Builder>(2)
.unwrap()
.append_null();
} else if colors.len() == 1 {
self.color
.field_builder::<UInt8Builder>(1)
.unwrap()
.append_null();
self.color
.field_builder::<UInt8Builder>(2)
.unwrap()
.append_null();
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A common issue when debugging struct arrays is having different numbers of records for the resulting Record Batch

.unwrap()
.append_null();
}
self.color.append(true);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another arrow builder gotcha is the difference between append_value and append. Not all builders have both, but append is for the struct and append_value (in the for loop above) is for the subarray

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant