Thank you for developing this cool package! Unfortunately, I have a problem with RAM usage during large crawling jobs that only seems to occur with the latest version of your package (0.1.9-1). I did not encounter it with the previous version (0.1.8-0).
I run the script in a loop and save the output of each iteration. All objects are removed after each iteration. Nevertheless, the script consumes roughly 100 MB of additional RAM every 20 minutes and exceeds my total RAM after a few hours.
With the old version of the package, I was able to run the script for several weeks without any memory problems.
Do you have an idea why this problem might be occurring now?
Please find my script attached (data, titlepath, articlepath, paper and ending have been loaded beforehand):
```r
library(Rcrawler)
library(dplyr)
library(purrr)
library(tidyr)
library(tibble)
library(stringr)
library(readr)

for (i in 1:length(data)) {
  Rcrawler(data[i], MaxDepth = 0,
           ExtractXpathPat = c(titlepath[t], articlepath[t]),
           crawlUrlfilter = paste0(".*www\\.", paper, "\\", ending, ".*html$"),
           no_cores = 1, no_conn = 1, saveOnDisk = FALSE)
  cat(paste0("Timeframe ", year[t], "-", month[t], "\n",
             "Already ", i, " sites scraped. There are ",
             length(data) - i, " sites left to scrape"))
  if (exists("DATA")) {
    # restructure the data list returned by Rcrawler
    dat <- DATA %>%
      map_df(enframe) %>%
      slice(-1) %>%
      unnest() %>%
      mutate(Id = rep(1:nrow(INDEX), each = 2)) %>%
      group_by(Id) %>%
      rowid_to_column("id_h") %>%
      ungroup() %>%
      select(id_h, Id, value) %>%
      arrange(id_h) %>%
      spread(id_h, value) %>%
      rename(title = `1`, article = `2`)
    # extract the article date and the broader area (sports, politics, etc.)
    INDEX <- INDEX %>%
      mutate(Id = as.numeric(Id),
             date = str_match(Url, "/web/(\\w+?)/")[, 2],
             date = parse_datetime(str_sub(date, 1, 8), "%Y%m%d")) %>%
      as.tibble()
    # merge articles with metadata and clean up a bit
    dat_full <- INDEX %>%
      mutate(Id = as.numeric(Id)) %>%
      left_join(dat, by = c("Id")) %>%
      mutate(article = str_replace_all(article, pattern = "\n", " "),
             article = str_replace_all(article, pattern = "\t", " "),
             article = str_replace_all(article, pattern = "\\s+", " "))
    save(dat_full, file = paste0(wd, "/", year[t], "/articles_", paper,
                                 year[t], "-", month[t], "-", i, ".RData"))
    # remove the per-iteration objects before triggering garbage collection
    rm(DATA, INDEX, dat, dat_full)
    gc()
  } else {
    next
  }
}
```
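To narrow down whether the growth comes from the Rcrawler() call itself rather than from my post-processing, this is a minimal sketch of the test I would run next. The 50-URL sample size is arbitrary, and the memory figure comes from gc(), so it only reflects R-managed memory:

```r
library(Rcrawler)

# small sample of the URL list; any handful of pages will do
urls <- data[1:50]

mem_log <- numeric(length(urls))
for (i in seq_along(urls)) {
  Rcrawler(urls[i], MaxDepth = 0, no_cores = 1, no_conn = 1, saveOnDisk = FALSE)
  # drop whatever Rcrawler left in the global environment
  if (exists("DATA")) rm(DATA)
  if (exists("INDEX")) rm(INDEX)
  # total Mb reported as "used" by gc() after cleanup; this should stay flat
  # if the per-iteration objects are really being released
  mem_log[i] <- sum(gc()[, 2])
}
plot(mem_log, type = "l", xlab = "iteration", ylab = "R heap (Mb)")
```

If mem_log keeps climbing even though DATA and INDEX are removed every round, that would point to memory being retained inside the package rather than in my own objects.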
Update: I have now been running both versions of the package on the same URL list on two virtual machines with identical specs (4 cores, 8 GB RAM, Ubuntu 16) for six hours. With version 0.1.9-1 the process currently occupies nearly 5 GB of RAM, growing by about 100 MB every 15 minutes, whereas with 0.1.8-0 the process has stayed at about 1 GB with no change in memory usage over the whole six hours.
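For reference, this is roughly how I track the resident set size of the R process on these Ubuntu machines; the rss.log file name and the per-iteration logging are just illustrative choices, and /proc is Linux-specific:

```r
# read the resident set size (VmRSS, in MB) of the current R process from /proc
rss_mb <- function() {
  status <- readLines("/proc/self/status")
  rss_kb <- as.numeric(sub("\\D+(\\d+).*", "\\1",
                           grep("^VmRSS:", status, value = TRUE)))
  rss_kb / 1024
}

# append one timestamped measurement per loop iteration
cat(format(Sys.time()), rss_mb(), "\n", file = "rss.log", append = TRUE)
```

Unlike gc(), this captures the memory the operating system charges to the process, which is what reaches the ~5 GB figure above.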