Preparation release 0.8.1 #1123
I've run the evaluation with a partial glutton (around 80-90M records from Since I don't have a GPU machine I can log in to, I
Since I did not use the standard method, this should be taken with a pinch of salt. TL;DR: header metadata and citation context performance has decreased, the rest has increased.
|
I'm attaching all the results as files for completeness: |
Hi Luca! I think there is a major issue with the JVM version indicated by the Kotlin jvmToolchain
The classes and jar become incompatible with any JVM lower than 17, so it's no longer possible to run grobid with a JVM 11:
In addition, it has blocking consequences for other modules and libraries using grobid which can't be run with JVM 17. The solution seems to be simply to set everything to Java 11:
although source compatibility with Java 11 is not working:
gives
|
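A quick way to confirm which JVM a compiled class targets is to read the class-file header: after the magic number 0xCAFEBABE come the minor and major version, where major 55 corresponds to Java 11 and major 61 to Java 17. The sketch below is only an illustration; the class name `ClassFileVersion` and its methods are hypothetical and not part of grobid.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper (not part of grobid): reports the bytecode level of a .class file.
public class ClassFileVersion {

    /** Reads the class-file header and returns the major version (55 = Java 11, 61 = Java 17). */
    public static int majorVersion(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        if (data.readInt() != 0xCAFEBABE) {
            throw new IOException("Not a class file: bad magic number");
        }
        data.readUnsignedShort();        // minor version, ignored here
        return data.readUnsignedShort(); // major version
    }

    /** Maps a class-file major version to a Java release number (valid for Java 5 and later). */
    public static int javaRelease(int major) {
        return major - 44;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ClassFileVersion <path-to-.class-file>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            int major = majorVersion(in);
            System.out.println("major version " + major + ", requires Java " + javaRelease(major) + " or later");
        }
    }
}
```

Running it on a class extracted from the built jar shows whether the bytecode can still be loaded by a JVM 11.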
It seems the Java 11 compatibility is broken by the recent changes in FundingAcknowledgementParser:
|
Hi @kermitt2, I checked grobid-quantities, software-mentions, datastet and they seem to be compatible with JDK 17. I would say that the old modules may stay with an older version. If you want to keep JDK 11 compatibility, for the second problem, you can replace |
I think it's good to move to JDK 17 in general, but we need to update the other modules first, otherwise this is blocking for users. This is also a general issue for everything that depends on Grobid and for existing production environments where Grobid runs. For example, I am currently stuck: I failed to upgrade entity-fishing from JDK 8 to JDK 11, and this is very annoying for the users. I think it's better to ensure JDK 11 compatibility for this release. JDK 17 would be a breaking change for version 0.9.0, especially given that the move to 17 is more for our comfort than for any real advantage? |
OK, no problem. I might be too optimistic in thinking that people would have migrated to Docker by now. Let me help you with entity-fishing. Could you commit and push everything you've done so far on a branch of the project? I will have a look ASAP 😉
Sure. 👍 |
Thank you very much @lfoppiano, it is now also working for me with JDK 11 on Linux (like you, I usually run JDK 17, which is why I saw the issue only recently). About entity-fishing, the master has the latest commit if I am not wrong, and running with grobid 0.8.0 and JDK 11 fails because the current version uses an incubator module that has disappeared after jdk 1.8. I did not analyze further which dependency uses this module and whether there is a possible replacement in JDK 11. |
I observed the crashes with more PDFs, usually after 10-20K I think, and never getting past 25-30K PDFs. When running with gradlew: grobid_client_python with concurrency at 15. No crash with jvmToolchain set to JDK 17 after 700K PDFs. |
Thanks @kermitt2! I'll try again with a larger dataset; I might need a few more days to assemble it. Meanwhile, if you still have the JVM dump somewhere, could you share it? |
I added an additional 40000 unique articles to the previous 30000 and ran again, but could not reproduce the problem. I'm using an 8-vCPU machine with 32 GB of RAM, only CRF, with JDK 17 and the JDK 11 version of the bytecode. 😭 As an alternative, to solve the issue with JDK 11, I could try to run entity-fishing with JDK 17 💦 Are there other modules that require JDK 11? |
# Conflicts: # .github/workflows/ci-build-manual-crf.yml
Back to the JVM crash problem:
The same behavior happens when using command line
More info on javaToolchains as it appears on my system:
:~/grobid$ ./gradlew -q javaToolchains
+ Options
| Auto-detection: Enabled
| Auto-download: Enabled
+ Eclipse Adoptium JDK 11.0.23+9
| Location: /home/lopez/.gradle/jdks/jdk-11.0.23+9
| Language Version: 11
| Vendor: Eclipse Adoptium
| Is JDK: true
| Detected by: Auto-provisioned by Gradle
+ Ubuntu JDK 17.0.12+7-Ubuntu-1ubuntu222.04
| Location: /usr/lib/jvm/java-17-openjdk-amd64
| Language Version: 17
| Vendor: Ubuntu
| Is JDK: true
| Detected by: Common Linux Locations
+ Invalid toolchains
+ /usr/lib/jvm/openjdk-17
| Error: A problem occurred starting process 'command '/usr/lib/jvm/openjdk-17/bin/java''
|
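The listing above can also be scanned programmatically, for example to see at a glance which vendor serves each language version and how each JDK was detected. The following is a minimal sketch that assumes the `+ entry` / `| Key: value` layout shown in the output above; the class name `ToolchainListing` is hypothetical and the layout is not a stable Gradle API.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: groups the "| Key: value" lines under the preceding "+ entry" line.
public class ToolchainListing {

    public static Map<String, Map<String, String>> parse(List<String> lines) {
        Map<String, Map<String, String>> entries = new LinkedHashMap<>();
        Map<String, String> current = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.startsWith("+ ")) {
                // A new entry, e.g. "+ Eclipse Adoptium JDK 11.0.23+9"
                current = new LinkedHashMap<>();
                entries.put(line.substring(2).trim(), current);
            } else if (line.startsWith("| ") && current != null) {
                // A property of the current entry, split on the first ": "
                String body = line.substring(2);
                int sep = body.indexOf(": ");
                if (sep > 0) {
                    current.put(body.substring(0, sep).trim(), body.substring(sep + 2).trim());
                }
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "+ Eclipse Adoptium JDK 11.0.23+9",
                "| Language Version: 11",
                "| Vendor: Eclipse Adoptium",
                "| Detected by: Auto-provisioned by Gradle",
                "+ Ubuntu JDK 17.0.12+7-Ubuntu-1ubuntu222.04",
                "| Language Version: 17",
                "| Vendor: Ubuntu",
                "| Detected by: Common Linux Locations");
        parse(sample).forEach((name, props) ->
                System.out.println(props.get("Language Version") + " <- " + props.get("Vendor")
                        + " (" + props.get("Detected by") + ")"));
    }
}
```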
…arget compatibility with kotlin
Hi @kermitt2, thanks again, this indeed helps in understanding the problem better. In my test I had JDK 11.0.24, which was automatically downloaded by gradle. Anyway, I pushed a small change that should solve the issue and allow us to keep everything 🤞, in brief:
Regarding the observation with docker, in principle we don't use gradle to run the service, so I'm not sure why it crashes... 🤔 |
I ran grobid natively with gradle, built with the latest commits on this branch, on ~70000 documents using JDK 17.0.12 and JDK 11.0.24 (as installed with Ubuntu 22.04). |
I also tested the docker image resulting from my last change and it was not crashing. |
For version 0.8.1 I have set up the infrastructure so that I can reproduce the same end-to-end evaluation results :-) |
I ran some tests with the updated version, without jvmToolchain and the automatic download of the JVM, and I had no problems anymore. So with The problem, I think, was related to the build of the JDK downloaded by jvmToolchain. It was a JDK 11 distribution from Eclipse (OpenJDK Runtime Environment Temurin-11.0.23+9), while normally we should use the Ubuntu-packaged one for safety. This means jvmToolchain might not be reliable in the future, because it might download a JDK built independently of the Linux distribution instead of the one built specifically for the distribution in use. For the docker image, I suppose the Grobid project was built with the downloaded JDK 11 (in the first build layer), then the Ubuntu JRE 17 from the base image was used at runtime, so there was a possible JDK clash here. I think we're good for the release? :) |
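To check which JDK build is actually running the service (for example, a Temurin build versus the Ubuntu-packaged one), the JVM's own system properties can be printed at startup. This is a minimal sketch; the class name `RuntimeJdkInfo` is hypothetical, and `java.runtime.name` / `java.runtime.version` are properties exposed by OpenJDK builds rather than guaranteed by the Java specification, so a fallback value is used.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: collects the properties that identify the running JDK build.
public class RuntimeJdkInfo {

    private static final String[] KEYS = {
        "java.vendor",                // e.g. the organisation that built the JDK
        "java.runtime.name",          // e.g. "OpenJDK Runtime Environment" (OpenJDK-specific)
        "java.runtime.version",       // full build string (OpenJDK-specific)
        "java.specification.version", // language level of the running JVM, e.g. "11" or "17"
        "java.home"                   // installation directory of the running JVM
    };

    public static Map<String, String> describe() {
        Map<String, String> info = new LinkedHashMap<>();
        for (String key : KEYS) {
            info.put(key, System.getProperty(key, "unknown"));
        }
        return info;
    }

    public static void main(String[] args) {
        describe().forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Comparing this output between the build stage and the runtime stage of a docker image would show whether two different JDK builds are involved.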
Great!!!!! I can take care of the release, leaving you only to double-check it? 😄 |
Grobid
The docker images were built with github actions. I just re-tagged them accordingly. You can save build time by re-tagging the full image and pushing it under grobid:
Grobid modules
Here is the list of grobid modules. I did not include the ones that are old, since it's hard to maintain everything. @kermitt2 feel free to add others if there are any
Since I cannot control the S3 repository, I usually ship the JARs with the repository as flat dependencies. This requires specifying all the dependencies, but I don't know of anything better. @kermitt2 do you want me to update Software Mentions and Entity-fishing as well? |
@lfoppiano all the artifacts for 0.8.1 have been published on https://grobid.s3.eu-west-1.amazonaws.com/repo |
@lfoppiano I'll update software-mentions, entity-fishing, DataStet sure |
I don't have any particular error, but if I decide to move to a SNAPSHOT version for development I will need to ship the JARs in my repo anyway. OK. For DataStet I've updated DataSeer's branch (https://github.com/DataSeer/datastet) because I don't have access to your repository. I'm not sure whether I have pushed some PRs already. |
Does it mean it is working? You normally have snapshot versions in your local maven repo for development. This DIY stuff is anyway more for java clients, but you should never need a localLibs/grobid-core-0.8.1.jar added to a project, no? |
Yes it works. :-) For grobid-quantities and grobid-superconductors I do ship the jars in the repo. In this case, grobid-superconductors also ships the grobid-quantities's JAR. |
@kermitt2 for DataStet I've implemented a few useful things: 1) TEI processing, 2) parallel processing for DataSeerML (I know it's obsolete, but it was needed at DataSeer), and 3) a refactoring of the build using the grobid-full image. I will send a couple of PRs next week. It would be good to have a review (no rush) so that I can consolidate my knowledge of the application for the BSO project 😄 |
@lfoppiano So on my side, I have updated software-mentions, grobid-ner, DataStet (standard), entity-fishing, grobid demo on HuggingFace. I will study the PR for DataStet carefully because processing a TEI is likely very complicated. Great addition I think. I notice that the Docker image for Grobid is 2 GB larger than before (compressed size) with 0.8.1. Not that it is a problem I think, but any particular reasons? |
Great, thanks! The image of 0.8.1 that I built via github actions is 10.92 GB (compressed); version 0.8.0 was approximately 10.5 GB. 🤔 It might have something to do with your build (maybe you used a source with additional models that have been included?). |
Ah yes sorry, this is exactly what happened :D |
This PR contains the updates for the release 0.8.1