- parseArticle(options, socket) ⇒
main article parser module export function
- articleParser(options, socket) ⇒
article scraping function
- spellCheck(text, options) ⇒
checks the spelling of the article
- getRawText(html) ⇒
takes the article body and returns the raw text of the article
- getFormattedText(html, title, baseurl, options) ⇒
takes the article body and the derived title and returns the formatted text of the article with links made absolute.
- getHtmlText(text) ⇒
takes the formatted article body text and returns the "clean" html text of the article
- htmlCleaner(html, options) ⇒
takes a string of html and runs it through clean-html
- keywordParser(html, options) ⇒
takes a string of html and runs it through retext-keywords and returns keyword and keyphrase suggestions
- lighthouseAnalysis(options) ⇒
runs a google lighthouse audit on the target article
- getTitle(document) ⇒
gets the best available title for the article
- findMetaTitle(document) ⇒
gets the best available meta title of the article
- setDefaultOptions(options) ⇒
sets the default options
- prepDocument(document) ⇒
Prepare the HTML document for readability to process it. This includes things like stripping javascript, CSS, and handling terrible markup.
- cleanStyles(element) ⇒
Remove the style attribute from the element and every element under it.
- killBreaks(element) ⇒
Remove extraneous break tags from a node.
- getInnerText(element) ⇒
Get the inner text of a node in a cross-browser compatible way. This also strips out any excess whitespace.
- getCharCount(element, string) ⇒
Get the number of times the given string appears in the node.
- getLinkDensity(element) ⇒
Get the density of links as a percentage of the content. This is the amount of text that is inside a link divided by the total text in the node.
- getClassWeight(element) ⇒
Get an element's class/id weight. Uses regular expressions to tell if this element looks good or bad.
- clean(element, string) ⇒
Clean a node of all elements of type "tag". (Unless it's a youtube/vimeo video. People love movies.)
- cleanConditionally() ⇒
Clean an element of all tags of type "tag" if they look fishy. "Fishy" is an algorithm based on content length, classnames, link density, number of images & embeds, etc.
- fixLinks(element) ⇒
Converts relative urls to absolute for images and links
- cleanHeaders(element) ⇒
Clean out spurious headers from an Element. Checks things like classnames and link density.
- cleanSingleHeader(element) ⇒
Remove a header that has no next sibling.
- prepArticle(element) ⇒
Cleans the article content
- initializeNode(element) ⇒
Initialize a node with the readability object. Also checks the className/id for special names to add to its score.
main article parser module export function
Kind: global function
Returns: Object - article parser results object
Param | Type | Description |
options | Object | the options object |
socket | Object | the optional socket |
article scraping function
Kind: global function
Returns: Object - article parser results object
Param | Type | Description |
options | Object | the options object |
socket | Object | the optional socket |
checks the spelling of the article
Kind: global function
Returns: Object - object containing potentially misspelled words
Param | Type | Description |
text | String | the string of text to run the spellcheck against |
options | Object | retext-spell options |
options.dictionary | Array | the dictionary to use; set to en-gb by default |
takes the article body and returns the raw text of the article
Kind: global function
Returns: String - raw text of the article in lower case
Param | Type | Description |
html | String | the html string to process |
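As a rough illustration of what "raw text in lower case" means here, the sketch below strips markup with regular expressions. The name `rawTextSketch` and the regex approach are assumptions made for this example; the module itself may rely on an HTML-to-text library instead.

```javascript
// Hypothetical helper for illustration only, not the module's implementation.
function rawTextSketch(html) {
  return html
    .replace(/<(script|style)[^>]*>[\s\S]*?<\/\1>/gi, ' ') // drop script and style blocks
    .replace(/<[^>]+>/g, ' ')                              // drop the remaining tags
    .replace(/\s+/g, ' ')                                  // collapse whitespace runs
    .trim()
    .toLowerCase();
}
```

A regex pass like this is fine for a quick demonstration but is not a robust HTML parser.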
takes the article body and the derived title and returns the formatted text of the article with links made absolute.
Kind: global function
Returns: String - formatted text of the article
Param | Type | Description |
html | String | the body html string to process |
title | String | the title string to process |
baseurl | String | the base url of the page being scraped |
options | Object | the htmltotext formatting options |
takes the formatted article body text and returns the "clean" html text of the article
Kind: global function
Returns: String - the clean html text of the article
Param | Type | Description |
text | String | the formatted text string to process |
takes a string of html and runs it through clean-html
Kind: global function
Returns: String - the cleaned html
Param | Type | Description |
html | String | the html to clean |
options | Object | the clean-html options |
takes a string of html and runs it through retext-keywords and returns keyword and keyphrase suggestions
Kind: global function
Returns: Object - the keyword and keyphrase suggestions
Param | Type | Description |
html | String | the html to process |
options | Object | the retext-keywords options |
runs a google lighthouse audit on the target article
Kind: global function
Returns: Object - the google lighthouse analysis
Param | Type | Description |
options | Object | the article parser options object |
options.puppeteer.launch | Object | the puppeteer launch options |
gets the best available title for the article
Kind: global function
Returns: String - the title of the article
Param | Type | Description |
document | String | the html document |
gets the best available meta title of the article
Kind: global function
Returns: String - the best available meta title of the article
Param | Type | Description |
document | String | the html document |
sets the default options
Kind: global function
Returns: Object - options with defaults set if options are not specified
Param | Type | Description |
options | Object | the options object |
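One common way to fill in defaults is to merge the caller's options over a defaults object, taking care with nested keys such as `puppeteer.launch`. The sketch below shows that pattern. The `DEFAULTS` values and the `withDefaults` name are hypothetical; the module defines its own defaults.

```javascript
// Hypothetical defaults, chosen only to make the merge visible.
const DEFAULTS = {
  puppeteer: { launch: { headless: true } }
};

// Merge caller options over the defaults without losing nested launch settings.
function withDefaults(options = {}) {
  const puppeteer = options.puppeteer || {};
  return {
    ...DEFAULTS,
    ...options,
    puppeteer: {
      ...puppeteer,
      launch: { ...DEFAULTS.puppeteer.launch, ...(puppeteer.launch || {}) }
    }
  };
}
```

The nested spread matters: a plain shallow merge would replace the whole `puppeteer` object and drop any default launch settings the caller did not repeat.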
Prepare the HTML document for readability to process it. This includes things like stripping javascript, CSS, and handling terrible markup.
Kind: global function
Param | Type |
document | String |
Remove the style attribute from the element and every element under it.
Kind: global function
Param | Type |
element | jQuery |
Remove extraneous break tags from a node.
Kind: global function
Param | Type |
element | jQuery |
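To make "extraneous break tags" concrete, the sketch below collapses any run of `<br>` tags (with whitespace or `&nbsp;` between them) into a single `<br />`. It works on an HTML string rather than a jQuery element, and the name `killBreaksSketch` is an assumption for this example.

```javascript
// Hypothetical string-based version: collapse runs of <br> into one <br />.
function killBreaksSketch(html) {
  return html.replace(/(<br\s*\/?>(\s|&nbsp;)*)+/gi, '<br />');
}
```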
Get the inner text of a node in a cross-browser compatible way. This also strips out any excess whitespace.
Kind: global function
Param | Type |
element | jQuery |
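The whitespace handling can be sketched on a plain string: trim the ends, then collapse runs of two or more whitespace characters into one space. The `innerTextSketch` name and the `normalizeSpaces` flag are assumptions for illustration; the documented function takes a jQuery element.

```javascript
// Hypothetical helper operating on text that has already been extracted.
function innerTextSketch(text, normalizeSpaces = true) {
  const trimmed = text.trim();
  return normalizeSpaces ? trimmed.replace(/\s{2,}/g, ' ') : trimmed;
}
```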
Get the number of times the given string appears in the node.
Kind: global function
Returns: Number - (integer)
Param | Type | Description |
element | jQuery | |
string | string | character to split on. Default is "," |
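Counting by splitting, as the parameter description suggests, can be sketched as follows. The function works on a plain string here and the name `charCountSketch` is an assumption for this example.

```javascript
// Splitting on s yields one more piece than there are occurrences of s.
function charCountSketch(text, s = ',') {
  return text.split(s).length - 1;
}
```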
Get the density of links as a percentage of the content. This is the amount of text that is inside a link divided by the total text in the node.
Kind: global function
Returns: Number - (float)
Param | Type |
element | jQuery |
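The ratio described above can be sketched with plain strings standing in for the node's text and its link texts. The `linkDensitySketch` name and its argument shape are assumptions for illustration. The result is a fraction between 0 and 1 when the link texts are part of the total text.

```javascript
// Hypothetical helper: characters inside links divided by all characters.
function linkDensitySketch(totalText, linkTexts) {
  if (totalText.length === 0) return 0; // avoid dividing by zero on empty nodes
  const linkLength = linkTexts.reduce((sum, t) => sum + t.length, 0);
  return linkLength / totalText.length;
}
```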
Get an element's class/id weight. Uses regular expressions to tell if this element looks good or bad.
Kind: global function
Returns: Number - (Integer)
Param | Type |
element | jQuery |
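A minimal sketch of regex-based weighting, in the style of the classic Readability algorithm, is shown below. The two patterns, the 25-point step, and the `classWeightSketch` name are all assumptions for illustration, not the module's actual values.

```javascript
// Hypothetical patterns: names that hint at content versus page furniture.
const POSITIVE = /article|body|content|entry|main|post|text/i;
const NEGATIVE = /comment|footer|footnote|sidebar|sponsor|widget/i;

// Score the class name and the id independently and add the results.
function classWeightSketch(className = '', id = '') {
  let weight = 0;
  for (const name of [className, id]) {
    if (!name) continue;
    if (NEGATIVE.test(name)) weight -= 25;
    if (POSITIVE.test(name)) weight += 25;
  }
  return weight;
}
```

A name that matches both patterns, such as `comment-content`, nets out to zero under this scheme.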
Clean a node of all elements of type "tag". (Unless it's a youtube/vimeo video. People love movies.)
Kind: global function
Param | Type | Description |
element | jQuery |
string | tag to clean |
Clean an element of all tags of type "tag" if they look fishy. "Fishy" is an algorithm based on content length, classnames, link density, number of images & embeds, etc.
If there are not very many commas, and the number of non-paragraph elements is more than paragraphs or other ominous signs, remove the element.
Kind: inner constant of cleanConditionally
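The "fishy" test described above can be read as a predicate over a handful of counts. The sketch below is modeled on the classic Readability heuristic. The `looksFishy` name, the shape of the `stats` object, and every threshold are assumptions for illustration, not values taken from this module.

```javascript
// Hypothetical predicate: true means the element would be removed.
function looksFishy(stats) {
  const { commas, p, img, li, input, embeds, textLength, weight, linkDensity, isList } = stats;
  if (weight < 0) return true;        // class/id names already look bad
  if (commas >= 10) return false;     // comma-rich blocks are treated as prose
  return (
    img > p ||                                    // more images than paragraphs
    (!isList && li > p) ||                        // list items outnumber paragraphs
    input > Math.floor(p / 3) ||                  // form-heavy block
    (textLength < 25 && (img === 0 || img > 2)) || // almost no text
    (weight < 25 && linkDensity > 0.2) ||         // mostly links, weak class signal
    (weight >= 25 && linkDensity > 0.5) ||        // mostly links despite a good class
    (embeds === 1 && textLength < 75) ||          // lone embed with little text
    embeds > 1
  );
}
```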
Converts relative urls to absolute for images and links
Kind: global function
Param | Type |
element | jQuery |
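Resolving a relative URL against the page's base URL can be done with the standard `URL` constructor. The sketch below handles a single attribute value; the `absolutize` name is an assumption for this example, and the documented function applies the conversion to the images and links inside an element.

```javascript
// Hypothetical helper: resolve href against baseUrl, leaving bad input untouched.
function absolutize(href, baseUrl) {
  try {
    return new URL(href, baseUrl).href;
  } catch (err) {
    return href; // unparseable values are returned as they came in
  }
}
```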
Clean out spurious headers from an Element. Checks things like classnames and link density.
Kind: global function
Param | Type |
element | jQuery |
Remove a header that has no next sibling.
Kind: global function
Param | Type |
element | jQuery |
Cleans the article content
Kind: global function
Param | Type |
element | jQuery |
Initialize a node with the readability object. Also checks the className/id for special names to add to its score.
Kind: global function
Param | Type |
element | jQuery |
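Node initialization in Readability-style scorers usually starts from a score based on the tag name and then adds the class/id weight. The sketch below illustrates that idea. The `TAG_SCORES` table follows the classic Readability values and is an assumption here, as is the `initialScore` name.

```javascript
// Hypothetical starting scores per tag; unlisted tags start at zero.
const TAG_SCORES = {
  DIV: 5,
  PRE: 3, TD: 3, BLOCKQUOTE: 3,
  ADDRESS: -3, OL: -3, UL: -3, DL: -3, DD: -3, DT: -3, LI: -3, FORM: -3,
  H1: -5, H2: -5, H3: -5, H4: -5, H5: -5, H6: -5, TH: -5
};

// Combine the tag-based score with a precomputed class/id weight.
function initialScore(tagName, classWeight = 0) {
  return (TAG_SCORES[tagName.toUpperCase()] || 0) + classWeight;
}
```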