New metadata fields for work entity #210

Open
massimoaria opened this issue Feb 20, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@massimoaria
Collaborator

@trangdata
@yjunechoe
Recently, OpenAlex (OA) has added a lot of new metadata to the work entity.
In particular, the API now also reports keywords, topics, the grants that funded the research, APCs paid, etc.

At the moment the only way to access this information is to use the "list" format.

TO DO:
Modify the works2df() function so that the data frame also includes this new metadata. This way even using the "tibble" or "data.frame" format will output this new metadata.
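
To make the idea concrete, here is a minimal sketch of how one nested field could end up as a list column in the data frame. It is not how works2df() is implemented: the helper name subfield_to_df(), the mock record, and the shape assumed for grants are all hypothetical and only serve as an illustration.

```r
# Hypothetical stand-in for one work from oa_fetch(..., output = "list").
# Field names and values are illustrative, not real API output.
work <- list(
  id = "W0000000000",
  grants = list(
    list(funder = "F1", funder_display_name = "Funder One", award_id = "A-1"),
    list(funder = "F2", funder_display_name = "Funder Two", award_id = NULL)
  )
)

# Hypothetical helper: turn a list of records into one data frame,
# replacing NULL entries with NA so every record yields a full row.
subfield_to_df <- function(records) {
  if (length(records) == 0) return(NULL)
  rows <- lapply(records, function(rec) {
    rec[vapply(rec, is.null, logical(1))] <- NA
    as.data.frame(rec, stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# One row per work; the nested field is stored as a list column.
work_df <- data.frame(id = work$id, stringsAsFactors = FALSE)
work_df$grants <- list(subfield_to_df(work$grants))
```

Keeping nested fields as list columns of data frames would leave the output at one row per work, which is the shape the "tibble" and "data.frame" formats already have.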

@yjunechoe
Collaborator

Good point! And actually as a first step, I think it'd be helpful if we tracked somewhere what fields we already have covered vs. those that are new.

As a naive approach, this lists all fields from output="list" that are not present as columns in output="tibble":

library(openalexR)

tbl <- oa_fetch(id = "W2755950973")
lst <- oa_fetch(id = "W2755950973", output = "list")

sort(names(lst)[!names(lst) %in% colnames(tbl)])
#>  [1] "abstract_inverted_index"       "apc_list"                     
#>  [3] "apc_paid"                      "authorships"                  
#>  [5] "best_oa_location"              "biblio"                       
#>  [7] "cited_by_percentile_year"      "corresponding_author_ids"     
#>  [9] "corresponding_institution_ids" "countries_distinct_count"     
#> [11] "created_date"                  "fulltext_origin"              
#> [13] "has_fulltext"                  "indexed_in"                   
#> [15] "institutions_distinct_count"   "keywords"                     
#> [17] "locations"                     "locations_count"              
#> [19] "mesh"                          "ngrams_url"                   
#> [21] "open_access"                   "primary_location"             
#> [23] "primary_topic"                 "referenced_works_count"       
#> [25] "sustainable_development_goals" "title"                        
#> [27] "topics"                        "type_crossref"                
#> [29] "updated_date"

This of course doesn't mean we're missing coverage for these fields - some of them have been renamed in the df (e.g., authorships), intentionally dropped due to redundancy or low merit (e.g., title), or already covered via other means (e.g., we might not need ngrams_url given that we have the oa_ngrams() interface). But it's hard to distinguish those cases from fields like apc_list which is clearly new and not yet covered.

So as a preliminary, maybe it's worth introducing something to internally track covered fields, like:

#' @keywords internal
covered_fields <- c("title", "authorships", ...)

Then we (or at least I) can get a clearer picture of what we're missing and have a programmatic way to track the introduction of new fields.
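
A rough sketch of what that programmatic check might look like, assuming a hand-curated registry. The helper uncovered_fields() and the mock record are hypothetical, and the three registered names are placeholders rather than the actual covered set.

```r
# Hypothetical registry; the real list would need to be curated by hand.
covered_fields <- c("id", "title", "authorships")

# Hypothetical helper: names in a list-format record that are not registered.
uncovered_fields <- function(record, covered = covered_fields) {
  sort(setdiff(names(record), covered))
}

# Made-up record standing in for oa_fetch(..., output = "list").
record <- list(
  id = "W0000000000",
  title = "A title",
  apc_list = NULL,
  topics = list()
)

uncovered_fields(record)
#> [1] "apc_list" "topics"
```

A check like this could run in a test against a known work, so that a newly introduced API field shows up as an unregistered name instead of going unnoticed.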

I can take a stab at this, then reconvene here to decide how to deal with the new fields? For example, it immediately jumps out to me that apc_paid and apc_list share similar structures - I think it may be worth combining them into a single list column, apc, of data frames. Ex:

Original:

lst$apc_list
#> $value
#> [1] 3680
#> 
#> $currency
#> [1] "USD"
#> 
#> $value_usd
#> [1] 3680
#> 
#> $provenance
#> [1] "doaj"

lst$apc_paid
#> $value
#> [1] 3680
#> 
#> $currency
#> [1] "USD"
#> 
#> $value_usd
#> [1] 3680
#> 
#> $provenance
#> [1] "doaj"

Formatted:

rbind.data.frame(
  c(type = "list", lst$apc_list),
  c(type = "paid", lst$apc_paid)
)
#>   type value currency value_usd provenance
#> 1 list  3680      USD      3680       doaj
#> 2 paid  3680      USD      3680       doaj

@massimoaria
Collaborator Author

I totally agree

@trangdata added the enhancement label on Jul 3, 2024
@trangdata
Collaborator

Coverage now tracked in #211.

TODO: we need to agree on what other fields we should export in the dataframe (we now have topics, apc already).
