APIDOC.md

Functions

parseArticle(options, socket)Object

main article parser module export function

articleParser(options, socket)Object

article scraping function

spellCheck(text, options)Object

checks the spelling of the article

getRawText(html)String

takes the article body and returns the raw text of the article

getFormattedText(html, title, baseurl, options)String

takes the article body and the derived title and returns the formatted text of the article with links made absolute.

getHtmlText(text)String

takes the formatted article body text and returns the "clean" html text of the article

htmlCleaner(html, options)String

takes a string of html and runs it through clean-html

keywordParser(html, options)Object

takes a string of html and runs it through retext-keywords and returns keyword and keyphrase suggestions

lighthouseAnalysis(options)Object

runs a Google Lighthouse audit on the target article

getTitle(document)String

gets the best available title for the article

findMetaTitle(document)String

gets the best available meta title of the article

setDefaultOptions(options)Object

sets the default options

prepDocument(document)Void

Prepare the HTML document for readability to process it. This includes things like stripping javascript, CSS, and handling terrible markup.

cleanStyles(element)Void

Remove the style attribute from the element and all of its descendants.

killBreaks(element)Void

Remove extraneous break tags from a node.

getInnerText(element)String

Get the inner text of a node in a cross-browser-compatible way. This also strips out any excess whitespace.

getCharCount(element, string)Number

Get the number of times the given string appears in the element's text.

getLinkDensity(element)Number

Get the density of links as a percentage of the content. This is the amount of text that is inside a link divided by the total text in the node.

getClassWeight(element)Number

Get an element's class/id weight. Uses regular expressions to tell if this element looks good or bad.

clean(element, string)Void

Clean a node of all elements of type "tag". (Unless it's a YouTube/Vimeo video. People love movies.)

cleanConditionally()Void

Clean an element of all tags of type "tag" if they look fishy. "Fishy" is an algorithm based on content length, classnames, link density, number of images & embeds, etc.

fixLinks(element)Void

Converts relative urls to absolute for images and links

cleanHeaders(element)Void

Clean out spurious headers from an Element. Checks things like classnames and link density.

cleanSingleHeader(element)Void

Remove a header that has no next sibling.

prepArticle(element)Void

Cleans the article content

initializeNode(element)Void

Initialize a node with the readability object. Also checks the className/id for special names to add to its score.

parseArticle(options, socket) ⇒ Object

main article parser module export function

Kind: global function
Returns: Object - article parser results object

Param Type Description
options Object the options object
socket Object the optional socket

articleParser(options, socket) ⇒ Object

article scraping function

Kind: global function
Returns: Object - article parser results object

Param Type Description
options Object the options object
socket Object the optional socket

spellCheck(text, options) ⇒ Object

checks the spelling of the article

Kind: global function
Returns: Object - object containing potentially misspelled words

Param Type Description
text String the string of text to run the spellcheck against
options Object retext-spell options
options.dictionary Array the dictionary to check against; defaults to en-gb

getRawText(html) ⇒ String

takes the article body and returns the raw text of the article

Kind: global function
Returns: String - raw text of the article in lower case

Param Type Description
html String the html string to process
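
As a rough illustration of what "raw text in lower case" means here, the following minimal sketch strips markup from a string and lower-cases the result. The name `rawTextSketch` and the regex-based approach are assumptions made for this example; the module's own `getRawText` may well use an HTML-to-text library instead.

```javascript
// Hypothetical stand-in for getRawText, not the module's implementation.
function rawTextSketch(html) {
  return html
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, ' ') // drop script/style blocks entirely
    .replace(/<[^>]+>/g, ' ') // replace every remaining tag with a space
    .replace(/\s+/g, ' ')     // collapse runs of whitespace
    .trim()
    .toLowerCase();
}

console.log(rawTextSketch('<p>Hello <b>World</b></p>')); // hello world
```

A regex is enough for a demonstration, but it is not a robust HTML parser; real pages call for a proper DOM or HTML-to-text library.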

getFormattedText(html, title, baseurl, options) ⇒ String

takes the article body and the derived title and returns the formatted text of the article with links made absolute.

Kind: global function
Returns: String - formatted text of the article

Param Type Description
html String the body html string to process
title String the title string to process
baseurl String the base url of the page being scraped
options Object the htmltotext formatting options

getHtmlText(text) ⇒ String

takes the formatted article body text and returns the "clean" html text of the article

Kind: global function
Returns: String - the clean html text of the article

Param Type Description
text String the formatted text string to process

htmlCleaner(html, options) ⇒ String

takes a string of html and runs it through clean-html

Kind: global function
Returns: String - the cleaned html

Param Type Description
html String the html to clean
options Object the clean-html options

keywordParser(html, options) ⇒ Object

takes a string of html and runs it through retext-keywords and returns keyword and keyphrase suggestions

Kind: global function
Returns: Object - the keyword and keyphrase suggestions

Param Type Description
html String the html to process
options Object the retext-keywords options

lighthouseAnalysis(options) ⇒ Object

runs a Google Lighthouse audit on the target article

Kind: global function
Returns: Object - the Google Lighthouse analysis

Param Type Description
options Object the article parser options object
options.puppeteer.launch Object the puppeteer launch options

getTitle(document) ⇒ String

gets the best available title for the article

Kind: global function
Returns: String - the title of the article

Param Type Description
document String the html document

findMetaTitle(document) ⇒ String

gets the best available meta title of the article

Kind: global function
Returns: String - the best available meta title of the article

Param Type Description
document String the html document

setDefaultOptions(options) ⇒ Object

sets the default options

Kind: global function
Returns: Object - options with defaults set if options are not specified

Param Type Description
options Object the options object
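
One common way to return "options with defaults set" is a recursive merge in which caller-supplied values win and anything missing falls back to a default. The sketch below shows that pattern only; `HYPOTHETICAL_DEFAULTS`, its keys and values, and the helper names are invented for illustration and are not the module's real defaults.

```javascript
// Illustrative defaults only; the real default values are not documented here.
const HYPOTHETICAL_DEFAULTS = {
  enabled: [],
  puppeteer: { launch: { headless: true } },
};

function isPlainObject(value) {
  return value !== null && typeof value === 'object' && !Array.isArray(value);
}

// Recursively overlay user options on top of the defaults.
function withDefaultsSketch(options = {}, defaults = HYPOTHETICAL_DEFAULTS) {
  const result = { ...defaults };
  for (const [key, value] of Object.entries(options)) {
    const base = defaults[key];
    result[key] =
      isPlainObject(value) && isPlainObject(base)
        ? withDefaultsSketch(value, base) // merge nested objects key by key
        : value;                          // scalars and arrays replace the default
  }
  return result;
}

const merged = withDefaultsSketch({ puppeteer: { launch: { args: ['--no-sandbox'] } } });
console.log(merged.puppeteer.launch); // keeps headless from the defaults and adds args
```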

prepDocument(document) ⇒ Void

Prepare the HTML document for readability to process it. This includes things like stripping javascript, CSS, and handling terrible markup.

Kind: global function

Param Type
document String

cleanStyles(element) ⇒ Void

Remove the style attribute from the element and all of its descendants.

Kind: global function

Param Type
element jQuery

killBreaks(element) ⇒ Void

Remove extraneous break tags from a node.

Kind: global function

Param Type
element jQuery
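
To show what removing "extraneous" breaks can look like, here is a minimal sketch that works on an HTML string rather than a jQuery element. The function name and the exact pattern are assumptions modelled on the classic Readability approach of collapsing runs of `<br>` tags into one.

```javascript
// Hypothetical string-based stand-in for killBreaks.
function killBreaksSketch(html) {
  // One or more <br> tags, each optionally followed by whitespace or &nbsp;,
  // collapse into a single normalized <br />.
  return html.replace(/(<br\s*\/?>(\s|&nbsp;?)*)+/gi, '<br />');
}

console.log(killBreaksSketch('a<br><br/> <br />b')); // a<br />b
```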

getInnerText(element) ⇒ String

Get the inner text of a node in a cross-browser-compatible way. This also strips out any excess whitespace.

Kind: global function

Param Type
element jQuery
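
The whitespace handling described above can be sketched on a plain string as follows. The function name and the `normalizeSpaces` flag are assumptions for this example; the real function takes a jQuery element and reads its text first.

```javascript
// Hypothetical stand-in that operates on text already extracted from a node.
function innerTextSketch(text, normalizeSpaces = true) {
  const trimmed = text.trim();
  // Collapse any run of two or more whitespace characters into one space.
  return normalizeSpaces ? trimmed.replace(/\s{2,}/g, ' ') : trimmed;
}

console.log(innerTextSketch('  Hello \n\n  world  ')); // Hello world
```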

getCharCount(element, string) ⇒ Number

Get the number of times the given string appears in the element's text.

Kind: global function
Returns: Number - (integer)

Param Type Description
element jQuery
string string character to split on. Default is ","
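
Since the parameter is described as the "character to split on", the count can be pictured as a split followed by a length check. This is a minimal sketch on a plain string; the name `charCountSketch` is hypothetical.

```javascript
// Splitting on the separator yields one more piece than there are separators.
function charCountSketch(text, separator = ',') {
  return text.split(separator).length - 1;
}

console.log(charCountSketch('one, two, three')); // 2
```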

getLinkDensity(element) ⇒ Number

Get the density of links as a percentage of the content. This is the amount of text that is inside a link divided by the total text in the node.

Kind: global function
Returns: Number - (float)

Param Type
element jQuery
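
The ratio described above can be sketched without a DOM by passing in the node's full text and the text of each link inside it. The function name and its two-argument shape are assumptions for this example; the result is a fraction between 0 and 1 rather than a value scaled to 100.

```javascript
// Hypothetical stand-in: link text length divided by total text length.
function linkDensitySketch(totalText, linkTexts) {
  const total = totalText.trim().length;
  if (total === 0) return 0; // avoid dividing by zero for empty nodes
  const linked = linkTexts.reduce((sum, t) => sum + t.trim().length, 0);
  return linked / total;
}

// "here" is 4 of the 14 characters in the node's text.
console.log(linkDensitySketch('Read more here', ['here']));
```

A navigation block that is nearly all links scores close to 1, while body text with an occasional link scores close to 0.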

getClassWeight(element) ⇒ Number

Get an element's class/id weight. Uses regular expressions to tell if this element looks good or bad.

Kind: global function
Returns: Number - (Integer)

Param Type
element jQuery
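
A regex-based weight of this kind can be sketched as below. The patterns and the 25-point step are assumptions borrowed from the general Readability approach and are not taken from this module; the function accepts a plain object with `className` and `id` instead of a jQuery element.

```javascript
// Illustrative patterns only; the module's real lists are likely longer.
const POSITIVE_SKETCH = /article|body|content|entry|main|post|text/i;
const NEGATIVE_SKETCH = /comment|footer|sidebar|sponsor|widget/i;

function classWeightSketch({ className = '', id = '' }) {
  let weight = 0;
  for (const name of [className, id]) {
    if (!name) continue;
    if (NEGATIVE_SKETCH.test(name)) weight -= 25; // looks like page furniture
    if (POSITIVE_SKETCH.test(name)) weight += 25; // looks like article content
  }
  return weight;
}

console.log(classWeightSketch({ className: 'post-content', id: 'main' })); // 50
console.log(classWeightSketch({ className: 'sidebar widget' }));           // -25
```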

clean(element, string) ⇒ Void

Clean a node of all elements of type "tag". (Unless it's a YouTube/Vimeo video. People love movies.)

Kind: global function

Param Type Description
element jQuery
string tag to clean

cleanConditionally() ⇒ Void

Clean an element of all tags of type "tag" if they look fishy. "Fishy" is an algorithm based on content length, classnames, link density, number of images & embeds, etc.

Kind: global function

cleanConditionally~p

If there are not very many commas, and the number of non-paragraph elements is more than paragraphs or other ominous signs, remove the element.

Kind: inner constant of cleanConditionally
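
The "fishy" test described above can be pictured as a predicate over a handful of precomputed statistics. The sketch below is a simplification under stated assumptions: the `stats` shape, every threshold, and the function name are illustrative, loosely following the Readability heuristic rather than this module's code.

```javascript
// Hypothetical predicate: true means the element would be removed.
function looksFishySketch(stats) {
  const { commas, paragraphs, images, listItems, textLength, linkDensity, weight } = stats;
  if (weight < 0) return true;    // class/id names already look bad
  if (commas >= 10) return false; // plenty of commas suggests real prose
  return (
    images > paragraphs ||                 // more images than paragraphs
    listItems > paragraphs ||              // more list items than paragraphs
    (textLength < 25 && images === 0) ||   // almost no content at all
    (weight < 25 && linkDensity > 0.2) ||  // unremarkable name, link heavy
    (weight >= 25 && linkDensity > 0.5)    // good name, but mostly links
  );
}

const navBlock = { commas: 0, paragraphs: 0, images: 0, listItems: 8, textLength: 60, linkDensity: 0.9, weight: 0 };
console.log(looksFishySketch(navBlock)); // true
```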

fixLinks(element) ⇒ Void

Converts relative urls to absolute for images and links

Kind: global function

Param Type
element jQuery
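
Resolving a relative URL against the page's base URL can be done with the standard WHATWG `URL` constructor, which is available globally in Node.js. The helper below is a minimal sketch of that single step; the name `absoluteUrlSketch` is hypothetical, and the real function applies the conversion to the `href` and `src` attributes inside a jQuery element.

```javascript
// Resolve href against baseUrl; return the input unchanged if it cannot be resolved.
function absoluteUrlSketch(href, baseUrl) {
  try {
    return new URL(href, baseUrl).href;
  } catch {
    return href;
  }
}

const base = 'https://example.com/news/story.html';
console.log(absoluteUrlSketch('/img/a.png', base));          // https://example.com/img/a.png
console.log(absoluteUrlSketch('../b', base));                // https://example.com/b
console.log(absoluteUrlSketch('https://other.org/x', base)); // https://other.org/x
```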

cleanHeaders(element) ⇒ Void

Clean out spurious headers from an Element. Checks things like classnames and link density.

Kind: global function

Param Type
element jQuery

cleanSingleHeader(element) ⇒ Void

Remove a header that has no next sibling.

Kind: global function

Param Type
element jQuery

prepArticle(element) ⇒ Void

Cleans the article content

Kind: global function

Param Type
element jQuery

initializeNode(element) ⇒ Void

Initialize a node with the readability object. Also checks the className/id for special names to add to its score.

Kind: global function

Param Type
element jQuery