Merge pull request #1032 from kermitt2/feature/upgrade-dropwizard
Update dropwizard
kermitt2 authored Nov 18, 2023
2 parents c1bf0a2 + 8df47ec commit 9cd9207
Showing 31 changed files with 220 additions and 176 deletions.
3 changes: 2 additions & 1 deletion Readme.md
@@ -24,14 +24,15 @@ The following functionalities are available:
- __Header extraction and parsing__ from articles in PDF format. The extraction here covers the usual bibliographical information (e.g. title, abstract, authors, affiliations, keywords, etc.).
- __References extraction and parsing__ from articles in PDF format, with around .87 F1-score on an independent PubMed Central set of 1943 PDF containing 90,125 references, and around .90 on a similar bioRxiv set of 2000 PDF (using the Deep Learning citation model). All the usual publication metadata are covered (including DOI, PMID, etc.).
- __Citation contexts recognition and resolution__ of the full bibliographical references of the article. The accuracy of citation contexts resolution is between .76 and .91 F1-score depending on the evaluation collection (this corresponds to both the correct identification of the citation callout and its correct association with a full bibliographical reference).
- __Full text extraction and structuring__ from PDF articles, including a model for the overall document segmentation and models for the structuring of the text body (paragraph, section titles, reference and footnote callouts, figures, tables, etc.).
- __Full text extraction and structuring__ from PDF articles, including a model for the overall document segmentation and models for the structuring of the text body (paragraph, section titles, reference and footnote callouts, figures, tables, data availability statements, etc.).
- __PDF coordinates__ for extracted information, making it possible to create "augmented" interactive PDF based on bounding boxes of the identified structures.
- Parsing of __references in isolation__ (above .90 F1-score at instance-level, .95 F1-score at field level, using the Deep Learning model).
- __Parsing of names__ (e.g. person title, forenames, middle name, etc.), in particular author names in header, and author names in references (two distinct models).
- __Parsing of affiliation and address__ blocks.
- __Parsing of dates__, ISO normalized day, month, year.
- __Consolidation/resolution of the extracted bibliographical references__ using the [biblio-glutton](https://github.com/kermitt2/biblio-glutton) service or the [CrossRef REST API](https://github.com/CrossRef/rest-api-doc). In both cases, DOI/PMID resolution performance is higher than 0.95 F1-score from PDF extraction.
- __Extraction and parsing of patent and non-patent references in patent__ publications.
- __Extraction of Funders and funding information__ with optional matching of extracted funders with the CrossRef Funder Registry.

In a complete PDF processing, GROBID manages 55 final labels used to build relatively fine-grained structures, from traditional publication metadata (title, author first/last/middle names, affiliation types, detailed address, journal, volume, issue, pages, DOI, PMID, etc.) to full text structures (section title, paragraph, reference markers, head/foot notes, figure captions, etc.).

54 changes: 27 additions & 27 deletions build.gradle
@@ -34,6 +34,8 @@ allprojects {

tasks.withType(JavaCompile) {
options.encoding = 'UTF-8'
// note: the following is not working
options.compilerArgs << '-parameters'
}
}
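The note above records that the flag does not take effect. A quick way to verify whether `-parameters` made it into the compiled bytecode is to inspect a compiled class with `javap`, which prints `MethodParameters` attributes only when the flag was applied at compile time (the class picked below is just an example):

```console
./gradlew :grobid-core:compileJava
javap -v -p grobid-core/build/classes/java/main/org/grobid/core/data/BiblioItem.class | grep -c MethodParameters
```

A count of zero means the compiler argument was dropped somewhere in the build.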

@@ -53,8 +55,8 @@ subprojects {
}
}

sourceCompatibility = 1.8
targetCompatibility = 1.8
sourceCompatibility = 1.11
targetCompatibility = 1.11

repositories {
mavenCentral()
@@ -84,8 +86,10 @@ subprojects {
// packaging local libs inside grobid-core.jar
implementation fileTree(dir: new File(rootProject.rootDir, 'grobid-core/localLibs'), include: localLibs)

testImplementation "junit:junit:4.12"
testImplementation "org.easymock:easymock:3.4"
testRuntimeOnly 'org.junit.vintage:junit-vintage-engine:5.9.3'
testImplementation(platform('org.junit:junit-bom:5.9.3'))
testImplementation('org.junit.jupiter:junit-jupiter')
testImplementation 'org.easymock:easymock:5.1.0'
testImplementation "org.powermock:powermock-api-easymock:2.0.7"
testImplementation "org.powermock:powermock-module-junit4:2.0.7"
testImplementation "xmlunit:xmlunit:1.6"
@@ -99,7 +103,7 @@ subprojects {
implementation "org.apache.commons:commons-collections4:4.1"
implementation 'org.apache.commons:commons-text:1.11.0'
implementation "commons-dbutils:commons-dbutils:1.7"
implementation "com.google.guava:guava:28.2-jre"
implementation "com.google.guava:guava:31.0.1-jre"
implementation "org.apache.httpcomponents:httpclient:4.5.3"
implementation "black.ninia:jep:4.0.2"

@@ -147,6 +151,8 @@ subprojects {
// }

test {
useJUnitPlatform()

testLogging.showStandardStreams = true
// enable for having separate test executor for different tests
forkEvery = 1
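With `useJUnitPlatform()`, the migrated Jupiter tests and the remaining JUnit 4 tests (picked up by the vintage engine declared above) run through the same Gradle task; the package filter below is standard Gradle and only illustrative:

```console
./gradlew clean test
./gradlew :grobid-core:test --tests "org.grobid.core.*"
```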
@@ -341,31 +347,25 @@ project(":grobid-service") {
dependencies {
implementation project(':grobid-core')
implementation project(':grobid-trainer')
implementation "io.dropwizard:dropwizard-core:1.3.29"
implementation "io.dropwizard:dropwizard-assets:1.3.29"
implementation "com.hubspot.dropwizard:dropwizard-guicier:1.3.5.2"
implementation "io.dropwizard:dropwizard-forms:1.3.29"
implementation "io.dropwizard:dropwizard-client:1.3.29"
implementation "io.dropwizard:dropwizard-auth:1.3.29"
implementation "io.dropwizard:dropwizard-json-logging:1.3.29"
testImplementation "io.dropwizard:dropwizard-testing:1.3.29"

// note: moving to dropwizard 2.* breaks the support of JDK 1.8
// Guise dependency requires to change to the more modern package ru.vyarus.dropwizard-guicey
// and a few code updates
/*implementation "io.dropwizard:dropwizard-core:2.1.10"
implementation "io.dropwizard:dropwizard-assets:2.1.10"
implementation "ru.vyarus:dropwizard-guicey:5.2.0"
implementation "io.dropwizard:dropwizard-forms:2.1.10"
implementation "io.dropwizard:dropwizard-client:2.1.10"
implementation "io.dropwizard:dropwizard-auth:2.1.10"
implementation "io.dropwizard:dropwizard-json-logging:2.1.10"
testImplementation "io.dropwizard:dropwizard-testing:2.1.10"*/

//Dropwizard
implementation 'ru.vyarus:dropwizard-guicey:7.0.0'

implementation 'io.dropwizard:dropwizard-bom:4.0.0'
implementation 'io.dropwizard:dropwizard-core:4.0.0'
implementation 'io.dropwizard:dropwizard-assets:4.0.0'
implementation 'io.dropwizard:dropwizard-testing:4.0.0'
implementation 'io.dropwizard.modules:dropwizard-testing-junit4:4.0.0'
implementation 'io.dropwizard:dropwizard-forms:4.0.0'
implementation 'io.dropwizard:dropwizard-client:4.0.0'
implementation 'io.dropwizard:dropwizard-auth:4.0.0'
implementation 'io.dropwizard.metrics:metrics-core:4.2.22'
implementation 'io.dropwizard.metrics:metrics-servlets:4.2.22'

implementation "org.apache.pdfbox:pdfbox:2.0.3"
implementation "javax.activation:activation:1.1.1"
implementation "io.prometheus:simpleclient_dropwizard:0.11.0"
implementation "io.prometheus:simpleclient_servlet:0.11.0"
implementation "io.prometheus:simpleclient_dropwizard:0.16.0"
implementation "io.prometheus:simpleclient_servlet:0.16.0"
}
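A sketch of how the new metrics dependencies can be checked once the service runs: a stock Dropwizard admin connector serves the metric registry under `/metrics` and health checks under `/healthcheck` (8071 is GROBID's usual admin port, but this depends on your `config.yaml`):

```console
curl -s localhost:8071/metrics | head
curl -s localhost:8071/healthcheck
```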

shadowJar {
12 changes: 10 additions & 2 deletions doc/Configuration.md
@@ -207,15 +207,23 @@ logging:
  level: INFO
  loggers:
    org.apache.pdfbox.pdmodel.font.PDSimpleFont: "OFF"
    org.glassfish.jersey.internal: "OFF"
    com.squarespace.jersey2.guice.JerseyGuiceUtils: "OFF"
  appenders:
    - type: console
      threshold: ALL
      threshold: WARN
      timeZone: UTC
      # uncomment to have the logs in json format
      #layout:
      #  type: json
    - type: file
      currentLogFilename: logs/grobid-service.log
      threshold: ALL
      threshold: INFO
      archive: true
      archivedLogFilenamePattern: logs/grobid-service-%d.log
      archivedFileCount: 5
      timeZone: UTC
      # uncomment to have the logs in json format
      #layout:
      #  type: json
```
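With the console appender raised to `WARN`, routine request logging now reaches only the file appender; to follow it while the service runs (the `logs/` path resolves relative to the directory the service was started from):

```console
tail -f logs/grobid-service.log
```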
27 changes: 16 additions & 11 deletions doc/Grobid-service.md
@@ -2,7 +2,12 @@

The GROBID Web API provides a simple and efficient way to use the tool. A service console is available to test GROBID in a human-friendly manner. For production and benchmarking, we strongly recommend using this web service mode on a multi-core machine and avoiding the batch mode.

## Start the server with Gradle
## Start the server with Docker

This is the recommended and standard way to run the Grobid web services.
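A typical invocation looks like the following, assuming the image name and tag match the release you want (see the GROBID Docker documentation for the exact images):

```console
docker run --rm --init -p 8070:8070 -p 8071:8071 grobid/grobid:0.8.0
```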


## Start a development server with Gradle

Go under the `grobid/` main directory. Be sure that the GROBID project is built, see [Install GROBID](Install-Grobid.md).

@@ -16,7 +21,7 @@ The following command will start the server on the default port __8070__:
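Once the server is up on port __8070__, a minimal liveness probe against the REST API:

```console
curl localhost:8070/api/isalive
```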

## Install and run the service as standalone application

You could also build and install the service as a standalone service (let's supposed the destination directory is grobid-installation)
From a development installation, you can also build and install the service as a standalone service; here, let's suppose the destination directory is grobid-installation:

```console
./gradlew clean assemble
@@ -57,16 +62,16 @@ If required, modify the file under `grobid/grobid-home/config/grobid.yaml` for s
You can choose to load all the models at the start of the service or lazily, when a model is used for the first time.
Loading all models at service startup slows down the start of the server and uses more memory than the lazy mode if only a few of the services are used.

For preloading all the models, set the following config parameter to `true`:
Preloading all the models at server start is the default setting, but you can choose lazy loading of the models:

```yaml
grobid:
  # for **service only**: how to load the models,
  # false -> models are loaded when needed (default), avoiding putting useless models in memory, but slowing down
  # the service significantly at first call
  # true -> all the models are loaded into memory at the server startup, slowing the start of the services; models not
  # used will take some memory, but the server is immediately warm and ready
  modelPreload: false
  # false -> models are loaded when needed, avoiding putting useless models in memory (only in the case of CRF), but slowing
  # down the service significantly at first call
  # true -> all the models are loaded into memory at the server startup (default), slowing the start of the services;
  # models not used will take some more memory (only in the case of CRF), but the server is immediately warm and ready
  modelPreload: true
```
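The difference is easy to observe from the command line: with `modelPreload: false`, the first call to a service pays the model-loading cost while subsequent calls do not (the PDF name is illustrative):

```console
time curl --form input=@./article.pdf localhost:8070/api/processHeaderDocument
time curl --form input=@./article.pdf localhost:8070/api/processHeaderDocument
```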
## CORS (Cross-Origin Resource Sharing)

@@ -89,13 +94,13 @@ We provide clients written in Python, Java, node.js using the GROBID PDF-to-TEI
Expand All @@ -89,13 +94,13 @@ We provide clients written in Python, Java, node.js using the GROBID PDF-to-TEI
* <a href="https://github.com/kermitt2/grobid-client-java" target="_blank">Java GROBID client</a>
* <a href="https://github.com/kermitt2/grobid-client-node" target="_blank">Node.js GROBID client</a>

All these clients will take advantage of the multi-threading for scaling PDF batch processing. As a consequence, they will be much more efficient than the [batch command lines](Grobid-batch.md) (which use only one thread) and should be prefered.
All these clients will take advantage of the multi-threading for scaling PDF batch processing. As a consequence, they will be much more efficient than the [batch command lines](Grobid-batch.md) (which use only one thread) and should be preferred. The Python client is the most up-to-date and complete, and can be adapted to your needs.
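As an illustration, the Python client can drive a whole directory of PDFs against a running server; the options below are indicative, check the client README for the exact flags:

```console
pip install grobid-client-python
grobid_client --input ./pdfs --output ./tei --n 10 processFulltextDocument
```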

## Use GROBID test console

On your browser, the welcome page of the Service console is available at the URL <http://localhost:8070>.
On your browser, the welcome page of the service console is available at the URL <http://localhost:8070>.

On the console, the RESTful API can be tested under the `TEI` tab for service returning a TEI document, under the `PDF` tab for services returning annotations relative to PDF or an annotated PDF and under the `Patent` tab for patent-related services:
On the service console, the RESTful API can be tested under the `TEI` tab for services returning a TEI document, under the `PDF` tab for services returning annotations relative to the PDF or an annotated PDF, and under the `Patent` tab for patent-related services:

![Example of GROBID Service console usage](img/grobid-rest-example.png)
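The same services can be scripted directly instead of going through the console; for example, the annotations shown under the `PDF` tab correspond to requesting coordinates for selected TEI elements (the file name is illustrative):

```console
curl --form input=@./article.pdf --form "teiCoordinates=biblStruct" localhost:8070/api/processFulltextDocument
```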

13 changes: 7 additions & 6 deletions doc/Install-Grobid.md
@@ -1,8 +1,10 @@
<h1>Install GROBID</h1>>
<h1>Install a GROBID development environment</h1>

## Getting GROBID
## Getting the GROBID project source

GROBID requires a JVM installed on your machine, we tested the tool successfully up version **JVM 17**. Other recent JVM versions should work correctly.
For building GROBID yourself, a JDK must be installed on your machine. We tested the tool successfully from **JDK 1.11** up to **JDK 1.17**. Other recent JDK versions should work correctly.

Note: Java/JDK 8 is no longer supported as of Grobid version `0.8.0`; the minimum requirement is JDK 1.11.

### Latest stable release

@@ -29,7 +31,7 @@ Or download directly the zip file:
> unzip master
```

## Build GROBID
## Build GROBID from the source

**Please make sure that Grobid is installed in a path with no parent directories containing spaces.**

@@ -59,9 +61,8 @@ systemProp.https.proxyUser=username
systemProp.https.proxyPassword=password
```

## Use GROBID
## Use a built GROBID project

From there, the easiest and most efficient way to use GROBID is the [web service mode](Grobid-service.md).
You can also use the tool in [batch mode](Grobid-batch.md) or integrate it in your Java project via the [Java API](Grobid-java-library.md).


7 changes: 4 additions & 3 deletions doc/Introduction.md
@@ -22,16 +22,17 @@ The following functionalities are available:
- __Header extraction and parsing__ from articles in PDF format. The extraction here covers the usual bibliographical information (e.g. title, abstract, authors, affiliations, keywords, etc.).
- __References extraction and parsing__ from articles in PDF format, with around .87 F1-score on an independent PubMed Central set of 1943 PDF containing 90,125 references, and around .90 on a similar bioRxiv set of 2000 PDF (using the Deep Learning citation model). All the usual publication metadata are covered (including DOI, PMID, etc.).
- __Citation contexts recognition and resolution__ of the full bibliographical references of the article. The accuracy of citation contexts resolution is between .76 and .91 F1-score depending on the evaluation collection (this corresponds to both the correct identification of the citation callout and its correct association with a full bibliographical reference).
- __Full text extraction and structuring__ from PDF articles, including a model for the overall document segmentation and models for the structuring of the text body (paragraph, section titles, reference and footnote callouts, figures, tables, etc.).
- __Full text extraction and structuring__ from PDF articles, including a model for the overall document segmentation and models for the structuring of the text body (paragraph, section titles, reference and footnote callouts, figures, tables, data availability statements, etc.).
- __PDF coordinates__ for extracted information, making it possible to create "augmented" interactive PDF based on bounding boxes of the identified structures.
- Parsing of __references in isolation__ (above .90 F1-score at instance-level, .95 F1-score at field level, using the Deep Learning model).
- __Parsing of names__ (e.g. person title, forenames, middle name, etc.), in particular author names in header, and author names in references (two distinct models).
- __Parsing of affiliation and address__ blocks.
- __Parsing of dates__, ISO normalized day, month, year.
- __Consolidation/resolution of the extracted bibliographical references__ using the [biblio-glutton](https://github.com/kermitt2/biblio-glutton) service or the [CrossRef REST API](https://github.com/CrossRef/rest-api-doc). In both cases, DOI/PMID resolution performance is higher than 0.95 F1-score from PDF extraction.
- __Extraction and parsing of patent and non-patent references in patent__ publications.
- __Extraction of Funders and funding information__ with optional matching of extracted funders with the CrossRef Funder Registry.

In a complete PDF processing, GROBID manages 55 final labels used to build relatively fine-grained structures, from traditional publication metadata (title, author first/last/middle names, affiliation types, detailed address, journal, volume, issue, pages, DOI, PMID, etc.) to full text structures (section title, paragraph, reference markers, head/foot notes, figure captions, etc.).
In a complete PDF processing, GROBID manages more than 55 final labels used to build relatively fine-grained structures, from traditional publication metadata (title, author first/last/middle names, affiliation types, detailed address, journal, volume, issue, pages, DOI, PMID, etc.) to full text structures (section title, paragraph, reference markers, head/foot notes, figure captions, etc.).

GROBID includes a comprehensive [web service API](https://grobid.readthedocs.io/en/latest/Grobid-service/), [Docker images](https://grobid.readthedocs.io/en/latest/Grobid-docker/), [batch processing](https://grobid.readthedocs.io/en/latest/Grobid-batch/), a JAVA API, a generic [training and evaluation framework](https://grobid.readthedocs.io/en/latest/Training-the-models-of-Grobid/) (precision, recall, etc., n-fold cross-evaluation), systematic [end-to-end benchmarking](https://grobid.readthedocs.io/en/latest/Benchmarking/) on thousand documents and the semi-automatic generation of training data.

@@ -42,7 +43,7 @@ The key aspects of GROBID are the following ones:
+ Written in Java, with JNI calls to native CRF libraries and/or Deep Learning libraries via a Python JNI bridge.
+ Speed - on a low-profile Linux machine (8 threads): header extraction from 4000 PDF in 2 minutes (36 PDF per second with the RESTful API), parsing of 3500 references in 4 seconds, full processing of 4000 PDF (full body, header and reference, structured) in 26 minutes (around 2.5 PDF per second).
+ Scalability and robustness: we have recently been able to run the complete fulltext processing at around 10.6 PDF per second (around 915,000 PDF per day, around 20M pages per day) during one week on one 16 CPU machine (16 threads, 32GB RAM, no SSD, articles from mainstream publishers), see [here](https://github.com/kermitt2/grobid/issues/443#issuecomment-505208132) (11.3M PDF were processed in 6 days by 2 servers without crash).
+ Lazy loading of models and resources. Depending on the selected process, only the required data are loaded in memory. For instance, extracting only metadata header from a PDF requires less than 2 GB memory in a multithreading usage, extracting citations uses around 3GB and extracting all the PDF structures around 4GB.
+ Optional lazy loading of models and resources. Depending on the selected process, only the required data are loaded in memory. For instance, extracting only metadata header from a PDF requires less than 2 GB memory in a multithreading usage, extracting citations uses around 3GB and extracting all the PDF structures around 4GB.
+ Robust and fast PDF processing with [pdfalto](https://github.com/kermitt2/pdfalto), based on xpdf, and dedicated post-processing.
+ Modular and reusable machine learning models for sequence labelling. The default extractions are based on Linear Chain Conditional Random Fields, with the possibility to use various Deep Learning architectures for sequence labelling (including ELMo and BERT-CRF) for improving accuracy. The specialized sequence labelling models are cascaded to build a complete (hierarchical) document structure.
+ Full encoding in [__TEI__](http://www.tei-c.org/Guidelines/P5/index.xml), both for the training corpus and the parsed results.