This repository has been archived by the owner on Aug 14, 2021. It is now read-only.

Merge pull request #54 from andreskrey/development
Prepare for release
andreskrey authored Mar 19, 2018
2 parents f0f6906 + a7b5fa2 commit 45c5826
Showing 93 changed files with 3,588 additions and 1,447 deletions.
3 changes: 3 additions & 0 deletions .coveralls.yml
@@ -0,0 +1,3 @@
coverage_clover: test/clover.xml
json_path: test/coveralls-upload.json
service_name: travis-ci
10 changes: 9 additions & 1 deletion .travis.yml
@@ -1,11 +1,19 @@
language: php

install: composer install
install:
- composer install

php:
- "5.6"
- "7.0"
- "7.1"
- "7.2"

script:
- ./vendor/bin/phpunit --coverage-clover ./test/clover.xml

after_script:
- composer require php-coveralls/php-coveralls:^2.0
- php ./vendor/php-coveralls/php-coveralls/bin/php-coveralls -v

sudo: false
10 changes: 9 additions & 1 deletion CHANGELOG.md
@@ -3,9 +3,17 @@ All notable changes to this project will be documented in this file.

## Unreleased

- Merged PR#49 (Missing object when calling `->getContent()`)
- Imported all changes from Readability.js as of 2 March 2018 ([8525c6a](https://github.com/mozilla/readability/commit/8525c6af36d3badbe27c4672a6f2dd99ddb4097f)):
- Check for `<base>` elements before converting URLs to absolute.
- Clean `<link>` tags on `prepArticle()`
- Attempt to return at least some text if all the algorithm runs fail (Check PR [#423](https://github.com/mozilla/readability/pull/423) on JS version)
- Add new test cases for the previous changes
- And all other changes reflected [in this diff](https://github.com/mozilla/readability/compare/c3ff1a2d2c94c1db257b2c9aa88a4b8fbeb221c5...8525c6af36d3badbe27c4672a6f2dd99ddb4097f)

## [v1.1.1](https://github.com/andreskrey/readability.php/releases/tag/v1.1.1)

- Switched from assertEquals to assertSame on unit testing to avoid weak comparisons.
- Switched from assertEquals to assertSame on unit testing to avoid weak comparisons.
- Added a safe check to avoid sending the DOMDocument as a node when scanning for node ancestors.
- Fix issue #45: Small mistake in documentation
- Fix issue #46: Added `data-src` as a image source path
10 changes: 5 additions & 5 deletions README.md
@@ -1,20 +1,20 @@
# Readability.php
[![Latest Stable Version](https://poser.pugx.org/andreskrey/readability.php/v/stable)](https://packagist.org/packages/andreskrey/readability.php) [![StyleCI](https://styleci.io/repos/71042668/shield?branch=master)](https://styleci.io/repos/71042668) [![Build Status](https://travis-ci.org/andreskrey/readability.php.svg?branch=master)](https://travis-ci.org/andreskrey/readability.php) [![Total Downloads](https://poser.pugx.org/andreskrey/readability.php/downloads)](https://packagist.org/packages/andreskrey/readability.php) [![Monthly Downloads](https://poser.pugx.org/andreskrey/readability.php/d/monthly)](https://packagist.org/packages/andreskrey/readability.php)
[![Latest Stable Version](https://poser.pugx.org/andreskrey/readability.php/v/stable)](https://packagist.org/packages/andreskrey/readability.php) [![Build Status](https://travis-ci.org/andreskrey/readability.php.svg?branch=master)](https://travis-ci.org/andreskrey/readability.php) [![Coverage Status](https://coveralls.io/repos/github/andreskrey/readability.php/badge.svg?branch=master)](https://coveralls.io/github/andreskrey/readability.php/?branch=master) [![StyleCI](https://styleci.io/repos/71042668/shield?branch=master)](https://styleci.io/repos/71042668) [![Total Downloads](https://poser.pugx.org/andreskrey/readability.php/downloads)](https://packagist.org/packages/andreskrey/readability.php) [![Monthly Downloads](https://poser.pugx.org/andreskrey/readability.php/d/monthly)](https://packagist.org/packages/andreskrey/readability.php)

PHP port of *Mozilla's* **[Readability.js](https://github.com/mozilla/readability)**. Parses html text (usually news and other articles) and returns **title**, **author**, **main image** and **text content** without nav bars, ads, footers, or anything that isn't the main body of the text. Analyzes each node, gives them a score, and determines what's relevant and what can be discarded.

![Screenshot](https://raw.githubusercontent.com/andreskrey/readability.php/assets/screenshot.png)

The project aims to be a 1:1 port of Mozilla's version and to closely follow all changes introduced there, but there are some major differences in the structure. Most of the code is a 1:1 copy (even the comments were imported), but some functions and structures were adapted to better suit the PHP language.

**Lead Developer**: Andres Rey

## Requirements

PHP 5.6+, ext-dom, ext-xml, and ext-mbstring. To install all these dependencies (in the rare case your system does not have them already), you could try something like this in *nix-like environments:

`$ sudo apt-get install php7.1-xml php7.1-mbstring`

**Lead Developer**: Andres Rey

## How to use it

First you have to require the library using composer:
@@ -152,7 +152,7 @@ Self closing tags like `<br />` get automatically expanded to `<br></br`. No way

## Dependencies

Readability.php uses the [PSR Log](https://github.com/php-fig/log) interface to define the allowed type of loggers.
Readability.php uses the [PSR Log](https://github.com/php-fig/log) interface to define the allowed type of loggers. [Monolog](https://github.com/Seldaek/monolog) is only required on development installations (`--dev` option during `composer install`).

## To-do

@@ -165,7 +165,7 @@ Readability parses all the text with DOMDocument, scans the text nodes and gives

## Code porting

Up to date with readability.js as of [16 Oct 2017](https://github.com/mozilla/readability/commit/c3ff1a2d2c94c1db257b2c9aa88a4b8fbeb221c5).
Up to date with readability.js as of [2 Mar 2018](https://github.com/mozilla/readability/commit/8525c6af36d3badbe27c4672a6f2dd99ddb4097f).

## License

3 changes: 2 additions & 1 deletion composer.json
@@ -25,7 +25,8 @@
"psr/log": "^1.0"
},
"require-dev": {
"phpunit/phpunit": "^5.7"
"phpunit/phpunit": "^5.7",
"monolog/monolog": "^1.23"
},
"suggest": {
"monolog/monolog": "Allow logging debug information"
12 changes: 12 additions & 0 deletions src/Configuration.php
@@ -114,6 +114,18 @@ public function getLogger()
}
}

/**
* @param LoggerInterface $logger
*
* @return Configuration
*/
public function setLogger(LoggerInterface $logger)
{
$this->logger = $logger;

return $this;
}

/**
* @return int
*/
3 changes: 2 additions & 1 deletion src/Nodes/DOM/DOMDocument.php
@@ -20,10 +20,11 @@ public function __construct($version, $encoding)
$this->registerNodeClass('DOMDocumentFragment', DOMDocumentFragment::class);
$this->registerNodeClass('DOMDocumentType', DOMDocumentType::class);
$this->registerNodeClass('DOMElement', DOMElement::class);
$this->registerNodeClass('DOMEntity', DOMEntity::class);
$this->registerNodeClass('DOMEntityReference', DOMEntityReference::class);
$this->registerNodeClass('DOMNode', DOMNode::class);
$this->registerNodeClass('DOMNotation', DOMNotation::class);
$this->registerNodeClass('DOMProcessingInstruction', DOMProcessingInstruction::class);
$this->registerNodeClass('DOMText', DOMText::class);
$this->registerNodeClass('DOMEntityReference', DOMEntityReference::class);
}
}
10 changes: 10 additions & 0 deletions src/Nodes/DOM/DOMEntity.php
@@ -0,0 +1,10 @@
<?php

namespace andreskrey\Readability\Nodes\DOM;

use andreskrey\Readability\Nodes\NodeTrait;

class DOMEntity extends \DOMEntity
{
use NodeTrait;
}
3 changes: 3 additions & 0 deletions src/Nodes/NodeTrait.php
@@ -7,6 +7,9 @@
use andreskrey\Readability\Nodes\DOM\DOMNode;
use andreskrey\Readability\Nodes\DOM\DOMText;

/**
* @method \DOMNode removeAttribute($name)
*/
trait NodeTrait
{
/**
116 changes: 93 additions & 23 deletions src/Readability.php
@@ -77,6 +77,13 @@ class Readability
*/
private $logger;

/**
* Collection of attempted text extractions.
*
* @var array
*/
private $attempts = [];

/**
* @var array
*/
@@ -155,54 +162,76 @@ public function parse($html)
* finding the -right- content.
*/

$length = 0;
foreach ($result->getElementsByTagName('p') as $p) {
$length += mb_strlen($p->textContent);
}
$length = mb_strlen(preg_replace(NodeUtility::$regexps['onlyWhitespace'], '', $result->textContent));

$this->logger->info(sprintf('[Parsing] Article parsed. Amount of words: %s. Current threshold is: %s', $length, $this->configuration->getWordThreshold()));

if ($result && mb_strlen(preg_replace('/\s/', '', $result->textContent)) < $this->configuration->getWordThreshold()) {
$parseSuccessful = true;

if ($result && $length < $this->configuration->getWordThreshold()) {
$this->dom = $this->loadHTML($html);
$root = $this->dom->getElementsByTagName('body')->item(0);
$parseSuccessful = false;

if ($this->configuration->getStripUnlikelyCandidates()) {
$this->logger->debug('[Parsing] Threshold not met, trying again setting StripUnlikelyCandidates as false');
$this->configuration->setStripUnlikelyCandidates(false);
$this->attempts[] = ['articleContent' => $result, 'textLength' => $length];
} elseif ($this->configuration->getWeightClasses()) {
$this->logger->debug('[Parsing] Threshold not met, trying again setting WeightClasses as false');
$this->configuration->setWeightClasses(false);
$this->attempts[] = ['articleContent' => $result, 'textLength' => $length];
} elseif ($this->configuration->getCleanConditionally()) {
$this->logger->debug('[Parsing] Threshold not met, trying again setting CleanConditionally as false');
$this->configuration->setCleanConditionally(false);
$this->attempts[] = ['articleContent' => $result, 'textLength' => $length];
} else {
$this->logger->emergency('[Parsing] Could not parse text, giving up :(');
$this->logger->debug('[Parsing] Threshold not met, searching across attempts for some content.');
$this->attempts[] = ['articleContent' => $result, 'textLength' => $length];

// No luck after removing flags, just return the longest text we found during the different loops
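usort($this->attempts, function ($a, $b) {
// usort() expects an integer from the comparator, so subtract instead of
// returning a boolean; this sorts the attempts by text length, longest first
return $b['textLength'] - $a['textLength'];
});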

// But first check if we actually have something
if (!$this->attempts[0]['textLength']) {
$this->logger->emergency('[Parsing] Could not parse text, giving up :(');

throw new ParseException('Could not parse text.');
throw new ParseException('Could not parse text.');
}

$this->logger->debug('[Parsing] Threshold not met, but found some content in previous attempts.');

$result = $this->attempts[0]['articleContent'];
$parseSuccessful = true;
break;
}
} else {
break;
}
}

$result = $this->postProcessContent($result);

// If we haven't found an excerpt in the article's metadata, use the article's
// first paragraph as the excerpt. This can be used for displaying a preview of
// the article's content.
if (!$this->getExcerpt()) {
$this->logger->debug('[Parsing] No excerpt text found on metadata, extracting first p node and using it as excerpt.');
$paragraphs = $result->getElementsByTagName('p');
if ($paragraphs->length > 0) {
$this->setExcerpt(trim($paragraphs->item(0)->textContent));
if ($parseSuccessful) {
$result = $this->postProcessContent($result);

// If we haven't found an excerpt in the article's metadata, use the article's
// first paragraph as the excerpt. This can be used for displaying a preview of
// the article's content.
if (!$this->getExcerpt()) {
$this->logger->debug('[Parsing] No excerpt text found on metadata, extracting first p node and using it as excerpt.');
$paragraphs = $result->getElementsByTagName('p');
if ($paragraphs->length > 0) {
$this->setExcerpt(trim($paragraphs->item(0)->textContent));
}
}
}

$this->setContent($result);
$this->setContent($result);

$this->logger->info('*** Parse successful :)');
$this->logger->info('*** Parse successful :)');

return true;
return true;
}
}

/**
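
The fallback in the `parse()` hunk above can be sketched in isolation: every attempt is recorded with its whitespace-stripped text length, and when all flags are exhausted the longest attempt wins. This is a minimal sketch under stated assumptions, not the library's code: `textLength()` and `pickLongestAttempt()` are hypothetical helper names, plain strings stand in for the parsed documents, and `/\s/` is assumed to be equivalent to the `onlyWhitespace` regular expression.

```php
<?php

// Hypothetical helper: length of the text with all whitespace removed,
// mirroring how the hunk measures an extraction attempt.
// Assumption: '/\s/' behaves like NodeUtility::$regexps['onlyWhitespace'].
function textLength($text)
{
    return mb_strlen(preg_replace('/\s/', '', $text));
}

// Hypothetical helper: return the attempt with the most text, or null
// when no attempt produced any text at all.
function pickLongestAttempt(array $attempts)
{
    // Sort descending by length; the comparator returns an integer.
    usort($attempts, function ($a, $b) {
        return $b['textLength'] - $a['textLength'];
    });

    if (empty($attempts) || !$attempts[0]['textLength']) {
        return null;
    }

    return $attempts[0];
}

// Plain strings stand in for the DOMDocument results of each parsing run.
$candidates = ['  short text ', "a much longer\nextracted article body", ''];

$attempts = [];
foreach ($candidates as $candidate) {
    $attempts[] = ['articleContent' => $candidate, 'textLength' => textLength($candidate)];
}

$best = pickLongestAttempt($attempts);
echo $best['articleContent'], PHP_EOL;
```

Returning `null` for the all-empty case corresponds to the branch where the hunk throws `ParseException`; a caller of this sketch would decide how to surface that failure.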
@@ -468,6 +497,10 @@ private function getArticleTitle()
if (count(preg_split('/\s+/', $curTitle)) < 3) {
$curTitle = substr($originalTitle, strpos($originalTitle, ':') + 1);
$this->logger->info(sprintf('[Metadata] Title too short, using the first part of the title instead: \'%s\'', $curTitle));
} elseif (count(preg_split('/\s+/', substr($curTitle, 0, strpos($curTitle, ':')))) > 5) {
// But if we have too many words before the colon there's something weird
// with the titles and the H tags so let's just use the original title instead
$curTitle = $originalTitle;
}
}
} elseif (mb_strlen($curTitle) > 150 || mb_strlen($curTitle) < 15) {
@@ -549,7 +582,19 @@ private function toAbsoluteURI($uri)
*/
public function getPathInfo($url)
{
$pathBase = parse_url($url, PHP_URL_SCHEME) . '://' . parse_url($url, PHP_URL_HOST) . dirname(parse_url($url, PHP_URL_PATH)) . '/';
// Check for base URLs
if ($this->dom->baseURI !== null) {
if (substr($this->dom->baseURI, 0, 1) === '/') {
// URLs starting with '/' override completely the URL defined in the link
$pathBase = parse_url($url, PHP_URL_SCHEME) . '://' . parse_url($url, PHP_URL_HOST) . $this->dom->baseURI;
} else {
// Otherwise just prepend the base to the actual path
$pathBase = parse_url($url, PHP_URL_SCHEME) . '://' . parse_url($url, PHP_URL_HOST) . dirname(parse_url($url, PHP_URL_PATH)) . '/' . rtrim($this->dom->baseURI, '/') . '/';
}
} else {
$pathBase = parse_url($url, PHP_URL_SCHEME) . '://' . parse_url($url, PHP_URL_HOST) . dirname(parse_url($url, PHP_URL_PATH)) . '/';
}

$scheme = parse_url($pathBase, PHP_URL_SCHEME);
$prePath = $scheme . '://' . parse_url($pathBase, PHP_URL_HOST);
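
The three branches of the base-path logic in this hunk can be exercised on their own. The following is a rough sketch, assuming the base value is passed in as a plain string: `resolvePathBase()` is a hypothetical function name, and the `$baseURI` parameter replaces the `$this->dom->baseURI` lookup used in the hunk.

```php
<?php

// Hypothetical standalone version of the base-path logic. $baseURI is passed
// in explicitly here instead of being read from the parsed document.
function resolvePathBase($url, $baseURI = null)
{
    $origin = parse_url($url, PHP_URL_SCHEME) . '://' . parse_url($url, PHP_URL_HOST);
    $directory = dirname(parse_url($url, PHP_URL_PATH)) . '/';

    // No base defined: fall back to the directory of the article URL.
    if ($baseURI === null) {
        return $origin . $directory;
    }

    // A base starting with '/' replaces the path of the article URL entirely.
    if (substr($baseURI, 0, 1) === '/') {
        return $origin . $baseURI;
    }

    // Otherwise the base is appended to the directory of the article URL.
    return $origin . $directory . rtrim($baseURI, '/') . '/';
}

$url = 'http://example.com/blog/2018/post.html';

echo resolvePathBase($url), PHP_EOL;
echo resolvePathBase($url, '/static/'), PHP_EOL;
echo resolvePathBase($url, 'assets'), PHP_EOL;
```

Like the hunk, this sketch does not handle a base that is itself an absolute URL with a scheme and host.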

@@ -1129,6 +1174,7 @@ public function prepArticle(DOMDocument $article)
$this->_clean($article, 'embed');
$this->_clean($article, 'h1');
$this->_clean($article, 'footer');
$this->_clean($article, 'link');

// Clean out elements have "share" in their id/class combinations from final top candidates,
// which means we don't remove the top candidates even they have "share".
@@ -1479,6 +1525,28 @@ public function _cleanHeaders(DOMDocument $article)
}
}

/**
* Removes the class="" attribute from every element in the given
* subtree.
*
* Readability.js has a special filter to avoid cleaning the classes that the algorithm adds. We don't add classes
* here so no need to filter those.
*
* @param DOMDocument|DOMNode $node
*
* @return void
**/
public function _cleanClasses($node)
{
if ($node->getAttribute('class') !== '') {
$node->removeAttribute('class');
}

for ($node = $node->firstChild; $node !== null; $node = $node->nextSibling) {
$this->_cleanClasses($node);
}
}

/**
* @param DOMDocument $article
*
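
The recursive class removal added in `_cleanClasses()` can be tried against PHP's native DOM extension. This is a minimal sketch, not the library's method: `cleanClasses()` is a hypothetical free function, it works on the stock `DOMDocument` rather than the library's extended node classes, and the `instanceof DOMElement` guard is an addition needed because native document and text nodes have no attribute methods.

```php
<?php

// Hypothetical free-function version of the recursive class cleanup,
// written against the native DOM extension (ext-dom).
function cleanClasses(DOMNode $node)
{
    // Only element nodes carry attributes; the document node and text
    // nodes are skipped by this guard.
    if ($node instanceof DOMElement && $node->hasAttribute('class')) {
        $node->removeAttribute('class');
    }

    // Walk the children depth-first, as the method in the hunk does.
    for ($child = $node->firstChild; $child !== null; $child = $child->nextSibling) {
        cleanClasses($child);
    }
}

$dom = new DOMDocument();
$dom->loadHTML('<div class="a" id="keep"><p class="b">Hello <span class="c">world</span></p></div>');

cleanClasses($dom);

$html = $dom->saveHTML();
echo $html;
```

Other attributes, such as the `id` in the sample markup, are left untouched; only `class` is stripped.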
@@ -1532,6 +1600,8 @@ public function postProcessContent(DOMDocument $article)
}
}

$this->_cleanClasses($article);

return $article;
}

@@ -1564,7 +1634,7 @@ protected function setTitle($title)
*/
public function getContent()
{
return $this->content->C14N();
return ($this->content instanceof DOMDocument) ? $this->content->C14N() : null;
}

/**