Functions

parseArticle(options, socket) ⇒ Object: main article parser module export function
articleParser(options, socket) ⇒ Object: article scraping function
spellCheck(text, options) ⇒ Object: checks the spelling of the article
getRawText(html) ⇒ String: takes the article body and returns the raw text of the article
getFormattedText(html, title, baseurl, options) ⇒ String: takes the article body and the derived title and returns the formatted text of the article with links made absolute.
getHtmlText(text) ⇒ String: takes the formatted article body text and returns the "clean" html text of the article
htmlCleaner(html, options) ⇒ String: takes a string of html and runs it through clean-html
keywordParser(html, options) ⇒ Object: takes a string of html and runs it through retext-keywords and returns keyword and keyphrase suggestions
lighthouseAnalysis(options) ⇒ Object: runs a Google Lighthouse audit on the target article
getTitle(document) ⇒ String: gets the best available title for the article
findMetaTitle(document) ⇒ String: gets the best available meta title of the article
setDefaultOptions(options) ⇒ Object: sets the default options
prepDocument(document) ⇒ Void: Prepare the HTML document for readability to process it. This includes things like stripping javascript, CSS, and handling terrible markup.
cleanStyles(element) ⇒ Void: Remove the style attribute from the element and all of its descendants.
killBreaks(element) ⇒ Void: Remove extraneous break tags from a node.
getInnerText(element) ⇒ String: Get the inner text of a node in a cross-browser-compatible way. This also strips out any excess whitespace.
getCharCount(element, string) ⇒ Number: Get the number of times the given string appears in the text of the element.
getLinkDensity(element) ⇒ Number: Get the density of links as a percentage of the content. This is the amount of text that is inside a link divided by the total text in the node.
getClassWeight(element) ⇒ Number: Get an element's class/id weight. Uses regular expressions to tell if this element looks good or bad.
clean(element, string) ⇒ Void: Clean a node of all elements of type "tag". (Unless it's a youtube/vimeo video. People love movies.)
cleanConditionally() ⇒ Void: Clean an element of all tags of type "tag" if they look fishy. "Fishy" is an algorithm based on content length, classnames, link density, number of images & embeds, etc.
fixLinks(element) ⇒ Void: Converts relative urls to absolute for images and links
cleanHeaders(element) ⇒ Void: Clean out spurious headers from an Element. Checks things like classnames and link density.
cleanSingleHeader(element) ⇒ Void: Remove a header that has no next sibling.
prepArticle(element) ⇒ Void: Cleans the article content
initializeNode(element) ⇒ Void: Initialize a node with the readability object. Also checks the className/id for special names to add to its score.

parseArticle(options, socket) ⇒ `Object`

main article parser module export function

Kind: global function
Returns: Object - article parser results object

Param	Type	Description
options	`Object`	the options object
socket	`Object`	the optional socket

articleParser(options, socket) ⇒ `Object`

article scraping function

Kind: global function
Returns: Object - article parser results object

Param	Type	Description
options	`Object`	the options object
socket	`Object`	the optional socket

spellCheck(text, options) ⇒ `Object`

checks the spelling of the article

Kind: global function
Returns: Object - object containing potentially misspelled words

Param	Type	Description
text	`String`	the string of text to run the spellcheck against
options	`Object`	retext-spell options
options.dictionary	`Array`	the dictionary to check against; defaults to en-gb.

getRawText(html) ⇒ `String`

takes the article body and returns the raw text of the article

Kind: global function
Returns: String - raw text of the article in lower case

Param	Type	Description
html	`String`	the html string to process
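
As a rough illustration of what "raw text in lower case" means, here is a minimal sketch. It is not the library's implementation: `rawTextSketch` is a hypothetical name, and it uses regular expressions on a string, which is only adequate for simple markup.

```javascript
// Hypothetical stand-in for getRawText. Regex-based tag stripping is an
// assumption made to keep the example self-contained; it is not a full
// HTML parser.
function rawTextSketch(html) {
  return html
    .replace(/<[^>]*>/g, " ") // replace every tag with a space
    .replace(/\s+/g, " ")     // collapse runs of whitespace into one space
    .trim()
    .toLowerCase();           // the documented return value is lower case
}

console.log(rawTextSketch("<p>Hello <b>World</b></p>")); // → "hello world"
```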

getFormattedText(html, title, baseurl, options) ⇒ `String`

takes the article body and the derived title and returns the formatted text of the article with links made absolute.

Kind: global function
Returns: String - formatted text of the article

Param	Type	Description
html	`String`	the body html string to process
title	`String`	the title string to process
baseurl	`String`	the base url of the page being scraped
options	`Object`	the htmltotext formatting options

getHtmlText(text) ⇒ `String`

takes the formatted article body text and returns the "clean" html text of the article

Kind: global function
Returns: String - the clean html text of the article

Param	Type	Description
text	`String`	the formatted text string to process

htmlCleaner(html, options) ⇒ `String`

takes a string of html and runs it through clean-html

Kind: global function
Returns: String - the cleaned html

Param	Type	Description
html	`String`	the html to clean
options	`Object`	the clean-html options

keywordParser(html, options) ⇒ `Object`

takes a string of html and runs it through retext-keywords and returns keyword and keyphrase suggestions

Kind: global function
Returns: Object - the keyword and keyphrase suggestions

Param	Type	Description
html	`String`	the html to process
options	`Object`	the retext-keywords options

lighthouseAnalysis(options) ⇒ `Object`

runs a Google Lighthouse audit on the target article

Kind: global function
Returns: Object - the Google Lighthouse analysis

Param	Type	Description
options	`Object`	the article parser options object
options.puppeteer.launch	`Object`	the puppeteer launch options

getTitle(document) ⇒ `String`

gets the best available title for the article

Kind: global function
Returns: String - the title of the article

Param	Type	Description
document	`String`	the html document

findMetaTitle(document) ⇒ `String`

gets the best available meta title of the article

Kind: global function
Returns: String - the best available meta title of the article

Param	Type	Description
document	`String`	the html document

setDefaultOptions(options) ⇒ `Object`

sets the default options

Kind: global function
Returns: Object - options with defaults set if options are not specified

Param	Type	Description
options	`Object`	the options object
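
Filling in defaults only where the caller has not specified a value could look like the sketch below. The option names in `EXAMPLE_DEFAULTS` and the function name `withDefaults` are hypothetical, chosen for illustration; the library's real defaults are not listed in this document.

```javascript
// Hypothetical defaults, for illustration only.
const EXAMPLE_DEFAULTS = {
  timeoutMs: 30000,
  formatting: { wordwrap: 100, uppercase: false },
};

// Return a copy of `options` with missing keys filled from `defaults`.
// Nested plain objects are merged key by key; caller values always win.
function withDefaults(options = {}, defaults = EXAMPLE_DEFAULTS) {
  const result = { ...options };
  for (const [key, value] of Object.entries(defaults)) {
    const isPlainObject =
      value !== null && typeof value === "object" && !Array.isArray(value);
    if (result[key] === undefined) {
      result[key] = isPlainObject ? withDefaults({}, value) : value;
    } else if (isPlainObject && result[key] !== null && typeof result[key] === "object") {
      result[key] = withDefaults(result[key], value);
    }
  }
  return result;
}

console.log(withDefaults({ formatting: { wordwrap: 80 } }));
```

The input object is copied rather than mutated, so the same options object can be reused across calls.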

prepDocument(document) ⇒ `Void`

Prepare the HTML document for readability to process it. This includes things like stripping javascript, CSS, and handling terrible markup.

Kind: global function

Param	Type
document	`String`

cleanStyles(element) ⇒ `Void`

Remove the style attribute from the element and all of its descendants.

Kind: global function

Param	Type
element	`jQuery`

killBreaks(element) ⇒ `Void`

Remove extraneous break tags from a node.

Kind: global function

Param	Type
element	`jQuery`
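
One plausible reading of "extraneous break tags" is a run of consecutive `<br>` tags, which can be collapsed into a single break. The sketch below assumes that reading and works on an HTML string instead of a jQuery element; `collapseBreaks` is a hypothetical name.

```javascript
// Collapse each run of <br> tags (optionally separated by whitespace or
// &nbsp;) into a single "<br />". Assumption: this is what "extraneous"
// means here; the library may apply different rules.
function collapseBreaks(html) {
  return html.replace(/(<br\s*\/?>(\s|&nbsp;)*)+/gi, "<br />");
}

console.log(collapseBreaks("a<br><br/> <br />b")); // → "a<br />b"
```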

getInnerText(element) ⇒ `String`

Get the inner text of a node in a cross-browser-compatible way. This also strips out any excess whitespace.

Kind: global function

Param	Type
element	`jQuery`
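
The whitespace handling can be sketched on a plain string as follows. This assumes "excess whitespace" means leading and trailing whitespace plus internal runs; `normalizeText` is a hypothetical helper, not the library's function.

```javascript
// Trim the ends and collapse every internal run of whitespace
// (spaces, tabs, newlines) into a single space.
function normalizeText(text) {
  return text.trim().replace(/\s+/g, " ");
}

console.log(normalizeText("  Hello \n\n  world  ")); // → "Hello world"
```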

getCharCount(element, string) ⇒ `Number`

Get the number of times the given string appears in the text of the element.

Kind: global function
Returns: Number - (integer)

Param	Type	Description
element	`jQuery`
string	`string`	character to split on. Default is ","
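
Because the parameter is described as the "character to split on", the count can be sketched as a split followed by a length check. The example takes the element's text as a plain string; `countOccurrences` is a hypothetical name.

```javascript
// Splitting on the separator yields one more piece than there are
// separators, so the count is pieces minus one.
function countOccurrences(text, separator = ",") {
  return text.split(separator).length - 1;
}

console.log(countOccurrences("one, two, three")); // → 2
console.log(countOccurrences("a-b", "-"));        // → 1
```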

getLinkDensity(element) ⇒ `Number`

Get the density of links as a percentage of the content. This is the amount of text that is inside a link divided by the total text in the node.

Kind: global function
Returns: Number - (float)

Param	Type
element	`jQuery`
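
The division described above can be sketched on an HTML string. This is an approximation under stated assumptions: tags are matched with regular expressions instead of a DOM, the result is returned as a fraction between 0 and 1, and `linkDensitySketch` is a hypothetical name.

```javascript
// Remove all tags, leaving only text.
const stripTags = (html) => html.replace(/<[^>]*>/g, "");

// Length of text inside <a>...</a> divided by the length of all text.
function linkDensitySketch(html) {
  const totalLength = stripTags(html).length;
  if (totalLength === 0) return 0; // avoid dividing by zero for empty nodes
  let linkLength = 0;
  for (const match of html.matchAll(/<a\b[^>]*>([\s\S]*?)<\/a>/gi)) {
    linkLength += stripTags(match[1]).length;
  }
  return linkLength / totalLength;
}

console.log(linkDensitySketch("<p>1234<a href='#'>5678</a></p>")); // → 0.5
```

A value close to 1 means the node is mostly link text, which is typical of navigation blocks.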

getClassWeight(element) ⇒ `Number`

Get an element's class/id weight. Uses regular expressions to tell if this element looks good or bad.

Kind: global function
Returns: Number - (Integer)

Param	Type
element	`jQuery`
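
A regular-expression weighting of this kind might look like the sketch below. The patterns and the step of 25 are assumptions modelled on Readability-style scoring and may not match the library; `classWeightSketch` takes the class and id as strings instead of a jQuery element.

```javascript
// Illustrative patterns only; the library's actual lists may differ.
const POSITIVE = /article|body|content|entry|main|post|text/i;
const NEGATIVE = /comment|footer|footnote|promo|sidebar|sponsor|widget/i;

// Add 25 for each name that looks like content and subtract 25 for each
// name that looks like page furniture.
function classWeightSketch(className = "", id = "") {
  let weight = 0;
  for (const name of [className, id]) {
    if (!name) continue;
    if (NEGATIVE.test(name)) weight -= 25;
    if (POSITIVE.test(name)) weight += 25;
  }
  return weight;
}

console.log(classWeightSketch("article-content")); // → 25
console.log(classWeightSketch("sidebar", "comments")); // → -50
```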

clean(element, string) ⇒ `Void`

Clean a node of all elements of type "tag". (Unless it's a youtube/vimeo video. People love movies.)

Kind: global function

Param	Type	Description
element	`jQuery`
string		tag to clean

cleanConditionally() ⇒ `Void`

Clean an element of all tags of type "tag" if they look fishy. "Fishy" is an algorithm based on content length, classnames, link density, number of images & embeds, etc.

Kind: global function

cleanConditionally~p

If there are not very many commas, and the number of non-paragraph elements is more than paragraphs or other ominous signs, remove the element.

Kind: inner constant of cleanConditionally
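
The "few commas plus other ominous signs" rule can be sketched as a predicate over counts gathered from the element. Every threshold and field name below is an assumption loosely modelled on Readability-style heuristics, not a value taken from the library.

```javascript
// Decide whether a block looks like non-article content.
// `stats` is a hypothetical plain object of pre-computed counts.
function looksFishy(stats) {
  const { commas, paragraphs, images, listItems, inputs,
          textLength, linkDensity, weight } = stats;
  if (commas >= 10) return false; // plenty of commas suggests real prose
  return (
    images > paragraphs ||                    // more images than paragraphs
    listItems > paragraphs ||                 // more list items than paragraphs
    inputs > Math.floor(paragraphs / 3) ||    // form-heavy block
    (textLength < 25 && images === 0) ||      // almost no content at all
    (weight < 25 && linkDensity > 0.2) ||     // low weight and link-heavy
    (weight >= 25 && linkDensity > 0.5)       // good weight but mostly links
  );
}

const navBlock = { commas: 0, paragraphs: 0, images: 0, listItems: 8,
                   inputs: 0, textLength: 120, linkDensity: 0.9, weight: 0 };
console.log(looksFishy(navBlock)); // → true
```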

fixLinks(element) ⇒ `Void`

Converts relative urls to absolute for images and links

Kind: global function

Param	Type
element	`jQuery`
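
Resolving relative URLs against a base is something the standard `URL` constructor already does, so the conversion can be sketched as below. The regex-based attribute matching and the name `absolutizeLinks` are assumptions made to keep the example free of a DOM library.

```javascript
// Rewrite href and src attribute values to absolute URLs.
function absolutizeLinks(html, baseUrl) {
  return html.replace(/\b(href|src)=(["'])(.*?)\2/gi, (match, attr, quote, value) => {
    try {
      // new URL(relative, base) resolves the value the way a browser would
      return `${attr}=${quote}${new URL(value, baseUrl).href}${quote}`;
    } catch {
      return match; // leave values the URL parser rejects untouched
    }
  });
}

console.log(
  absolutizeLinks('<a href="/about">x</a><img src="img/a.png">', "https://example.com/blog/post")
);
// → <a href="https://example.com/about">x</a><img src="https://example.com/blog/img/a.png">
```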

cleanHeaders(element) ⇒ `Void`

Clean out spurious headers from an Element. Checks things like classnames and link density.

Kind: global function

Param	Type
element	`jQuery`

cleanSingleHeader(element) ⇒ `Void`

Remove a header that has no next sibling.

Kind: global function

Param	Type
element	`jQuery`

prepArticle(element) ⇒ `Void`

Cleans the article content

Kind: global function

Param	Type
element	`jQuery`

initializeNode(element) ⇒ `Void`

Initialize a node with the readability object. Also checks the className/id for special names to add to its score.

Kind: global function

Param	Type
element	`jQuery`
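
Initial scoring of a node could be sketched as a base score per tag plus the class/id weight. The tag scores and the `contentScore` field below are assumptions modelled on classic Readability-style scoring; the document itself only states that the className/id contributes to the score.

```javascript
// Assumed base scores per tag name; not taken from the library.
const TAG_SCORES = {
  DIV: 5,
  PRE: 3, TD: 3, BLOCKQUOTE: 3,
  ADDRESS: -3, OL: -3, UL: -3, DL: -3, DD: -3, DT: -3, LI: -3, FORM: -3,
  H1: -5, H2: -5, H3: -5, H4: -5, H5: -5, H6: -5, TH: -5,
};

// Build the initial score object for a node from its tag name and a
// class/id weight computed elsewhere (0 when none applies).
function initialScoreSketch(tagName, classWeight = 0) {
  const base = TAG_SCORES[tagName.toUpperCase()] || 0;
  return { contentScore: base + classWeight };
}

console.log(initialScoreSketch("div", 25)); // → { contentScore: 30 }
```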
