Removal of specialized HTML literal handling? #2946
Comments
I would say yes! It is difficult to understand what that library is doing and for what reason, though. At the moment, for every HTML element that I process within RDFlib, I get an error message that resembles something like this (example for a Doctype node):
File "C:<path>\Python\Python312\Lib\site-packages\html5lib\html5parser.py", line 247, in mainLoop
This seriously delays processing of any HTML document, as every element has to undergo this treatment. I am trying to finish my work on the HTML vocabulary (see https://www.w3.org/community/htmlvoc/), and a proper open-source implementation of the HTML vocabulary using RDFlib/PyShacl has been number one on my list for more than a year. It would be awesome if you could fix this permanently. From your post I gather that you also do not think there are any undesired effects of removing html5lib. I trust we can keep using the datatype rdf:HTML for HTML literals in our RDF/SPARQL?
@floresbakker Thanks for your input on this. After a quick meeting with some other rdflib maintainers yesterday, this is the plan we came up with:
Yes, when html5rdf support is disabled, or even if we remove the feature entirely, then
It seems that in this plan, html5 is not really deprecated but adopted into rdflib and fixed? For extra information: the error message that I reported above was already present in the original html5lib, before you made html5lib-modern. Perhaps this helps in understanding the cause. I trust the html4rdf reference is a typo and should be html5rdf?
Not necessarily. You can link of It will be maintained by the RDFLib team for that purpose, for the use in As for the issue you described in your original post, I'm not seeing those errors in my testing. Are you able to send an example RDF file that reproduces them?
I tried reproducing the errors on the newest release, 7.1.1, from yesterday, but to my surprise I was unable to do so. That is good news for the htmlvoc project. I think I have only one remaining issue (unrelated to this discussion): being unable to process TriG files in RDFlib/PyShacl, for which I will work out a minimal working example. Thanks Ashley! There is a lot of movement within RDFlib/PyShacl, which is greatly appreciated.
Possible easy solution for #2935 and #2945
The reason we forked `html5lib` to make `html5lib-modern` was that there is no new replacement for `html5lib` that provides the same XML-based HTML-tokenizing functionality that `html5lib` does. There's no alternative to move to.

Beautifulsoup4 is the logical replacement, but it includes `html5lib` in its dependency tree, so it defeats the whole point.

But what if we just dropped that feature entirely? Why does RDFLib even want to be able to tokenize HTML Literals? The feature was added for a reason, but do we need to keep it?

Can we simply drop that feature, treat HTML the same as any other string literal, and remove `html5lib` from our dependencies entirely?
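
To make the trade-off concrete, here is a minimal sketch of how comparing HTML as plain strings differs from comparing it after tokenizing. It uses only Python's standard-library `html.parser`, not `html5lib` or RDFLib, and the names `TokenCollector` and `tokenize` are hypothetical helpers invented for this illustration. It does not claim to reproduce what RDFLib does internally with HTML literals.

```python
from html.parser import HTMLParser


class TokenCollector(HTMLParser):
    """Hypothetical helper: collect a simple token stream from an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names and strips the quotes
        # around attribute values before calling this method.
        self.tokens.append(("start", tag, tuple(attrs)))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        self.tokens.append(("data", data))


def tokenize(fragment):
    """Hypothetical helper: return the token list for one HTML fragment."""
    parser = TokenCollector()
    parser.feed(fragment)
    parser.close()
    return parser.tokens


# Two fragments that differ only in tag case and attribute quoting.
a = '<P class="x">hello</P>'
b = "<p class='x'>hello</p>"

same_as_strings = a == b                      # plain string comparison
same_as_tokens = tokenize(a) == tokenize(b)   # comparison after tokenizing

print(same_as_strings, same_as_tokens)
```

If HTML literals were treated like any other string literal, two such fragments would compare as different values. Whether that matters in practice is the open question in this issue.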