Code Challenge Submission #296

Open
wants to merge 5 commits into
base: master
6 changes: 6 additions & 0 deletions Gemfile
@@ -0,0 +1,6 @@
source 'https://rubygems.org'

gem 'nokogiri'

# For Testing
gem 'rspec'
45 changes: 45 additions & 0 deletions Gemfile.lock
@@ -0,0 +1,45 @@
GEM
  remote: https://rubygems.org/
  specs:
    diff-lcs (1.5.1)
    nokogiri (1.17.2-aarch64-linux)
      racc (~> 1.4)
    nokogiri (1.17.2-arm-linux)
      racc (~> 1.4)
    nokogiri (1.17.2-arm64-darwin)
      racc (~> 1.4)
    nokogiri (1.17.2-x86-linux)
      racc (~> 1.4)
    nokogiri (1.17.2-x86_64-darwin)
      racc (~> 1.4)
    nokogiri (1.17.2-x86_64-linux)
      racc (~> 1.4)
    racc (1.8.1)
    rspec (3.13.0)
      rspec-core (~> 3.13.0)
      rspec-expectations (~> 3.13.0)
      rspec-mocks (~> 3.13.0)
    rspec-core (3.13.2)
      rspec-support (~> 3.13.0)
    rspec-expectations (3.13.3)
      diff-lcs (>= 1.2.0, < 2.0)
      rspec-support (~> 3.13.0)
    rspec-mocks (3.13.2)
      diff-lcs (>= 1.2.0, < 2.0)
      rspec-support (~> 3.13.0)
    rspec-support (3.13.2)

PLATFORMS
  aarch64-linux
  arm-linux
  arm64-darwin
  x86-linux
  x86_64-darwin
  x86_64-linux

DEPENDENCIES
  nokogiri
  rspec

BUNDLED WITH
   2.6.1
60 changes: 32 additions & 28 deletions README.md
@@ -1,28 +1,32 @@
# Extract Van Gogh Paintings Code Challenge

Goal is to extract a list of Van Gogh paintings from the attached Google search results page.

![Van Gogh paintings](https://github.com/serpapi/code-challenge/blob/master/files/van-gogh-paintings.png?raw=true "Van Gogh paintings")

## Instructions

This is already fully supported on SerpApi. ([relevant test], [html file], [sample json], and [expected array].)
Try to come up with your own solution and your own test.
Extract the painting `name`, `extensions` array (date), and Google `link` in an array.

Fork this repository and make a PR when ready.

Programming language wise, Ruby (with RSpec tests) is strongly suggested but feel free to use whatever you feel like.

Parse directly the HTML result page ([html file]) in this repository. No extra HTTP requests should be needed for anything.

[relevant test]: https://github.com/serpapi/test-knowledge-graph-desktop/blob/master/spec/knowledge_graph_claude_monet_paintings_spec.rb
[sample json]: https://raw.githubusercontent.com/serpapi/code-challenge/master/files/van-gogh-paintings.json
[html file]: https://raw.githubusercontent.com/serpapi/code-challenge/master/files/van-gogh-paintings.html
[expected array]: https://raw.githubusercontent.com/serpapi/code-challenge/master/files/expected-array.json

Add also to your array the painting thumbnails present in the result page file (not the ones where extra requests are needed).

Test against 2 other similar result pages to make sure it works against different layouts. (Pages that contain the same kind of carrousel. Don't necessarily have to be paintings.)

The suggested time for this challenge is 4 hours. But, you can take your time and work more on it if you want.
# SerpApi Code Challenge

This is a solution to the SerpApi Code Challenge.
It defines a class, `GoogleHtmlParser`, that parses Google search results pages; the results are then saved to a JSON file.

- `scraper.rb` is the file that executes the scraper.
- `document_parser.rb` defines the parent class, which handles the demo mode logic and the parsing of the HTML file.
- `google_html_parser.rb` defines the class that encapsulates the logic to parse the Google search results page.
- `carousel_item_parser.rb` defines the class that parses the carousel items and extracts the name, link, image, and extensions.
- `image_map.rb` defines the class that builds the hash mapping image id to image URL from the HTML file.

To run tests:
```
bundle install
bundle exec rspec
```

To run the scraper:
```
bundle install
ruby scraper.rb
```
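
`scraper.rb` itself is not shown here. As a rough sketch of the save step, assuming the parser returns a hash shaped like `{ artworks: [...] }` (the sample data, the `write_results` helper, and the output path below are hypothetical and only for illustration), the results could be written out with Ruby's standard `json` library:

```ruby
require 'json'
require 'tmpdir'

# Hypothetical stand-in for the hash returned by GoogleHtmlParser#execute.
SAMPLE_RESULT = {
  artworks: [
    {
      name: 'Example Painting',
      link: 'https://www.google.com/search?q=example',
      image: nil,
      extensions: ['1889']
    }
  ]
}.freeze

# Writes the result hash to `path` as pretty-printed JSON and returns the path.
def write_results(result, path)
  File.write(path, JSON.pretty_generate(result))
  path
end

output_path = write_results(SAMPLE_RESULT, File.join(Dir.tmpdir, 'scraped-sample.json'))

# Read the file back to confirm it round-trips.
reloaded = JSON.parse(File.read(output_path))
puts reloaded['artworks'].first['name']
```

Note that symbol keys become strings after a JSON round trip, which matters if a test compares the saved file against the in-memory hash.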

Demo mode is enabled when the file being scraped is `van-gogh-paintings.html`, which is the test file provided in the challenge.
I looked up two other similar result pages to test the scraper against. Demo mode is needed to distinguish the test file from the other two because they have different layouts.

# Challenges

1. Handling different layouts for the carousel items on certain search results pages.
2. Getting used to the Nokogiri gem and its methods.
3. Refactoring the code to be more readable and maintainable.
4. It took a little longer than expected to get the hang of the gem and its methods.
35 changes: 35 additions & 0 deletions files/da-vinci-artworks.html

Large diffs are not rendered by default.

36 changes: 36 additions & 0 deletions files/pablo-picasso-artworks.html

Large diffs are not rendered by default.

6 changes: 5 additions & 1 deletion files/van-gogh-paintings.html

Large diffs are not rendered by default.

53 changes: 53 additions & 0 deletions lib/carousel_item_parser.rb
@@ -0,0 +1,53 @@
require 'cgi'
require_relative 'document_parser'

class CarouselItemParser < DocumentParser
  attr_reader :base_url, :image_map, :html_file

  def initialize(base_url, image_map, html_file)
    super(html_file)
    @base_url = base_url
    @image_map = image_map
  end

  def parse(item)
    carousel_item = {
      name: name(item),
      link: link(item),
      image: image(item)
    }

    carousel_item[:extensions] = extensions(item) unless extensions(item).empty?
    carousel_item
  end

  private

  def name(item)
    if demo?
      item['aria-label']
    else
      item.css('div').children.first.text
    end
  end

  def extensions(item)
    if demo?
      item.css('div.ellip.klmeta')&.children&.map(&:text)&.map(&:strip)&.reject(&:empty?)
    else
      item.css('div>div+div').map(&:text).map(&:strip).reject(&:empty?)
    end
  end

  def link(item)
    if demo?
      @base_url + item['href']
    else
      @base_url + item.at_css('a')['href']
    end
  end

  def image(item)
    @image_map[item.at_css('img')['id']]
  end
end
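
Review note: `parse` calls `extensions(item)` twice, and `link` builds the URL by string concatenation. The following is a minimal sketch of one alternative, not the submission's code. It assumes hrefs are already percent-encoded (`URI.join` raises on invalid characters), and the helper names `absolute_link` and `build_item` are hypothetical:

```ruby
require 'uri'

BASE_URL = 'https://www.google.com'

# Resolves a possibly relative href against the base URL.
# Returns nil when there is no href to resolve.
def absolute_link(base_url, href)
  return nil if href.nil? || href.empty?

  URI.join(base_url, href).to_s
end

# Builds the item hash, adding :extensions only when the list is non-empty.
# The extensions list is passed in once rather than recomputed.
def build_item(name:, href:, image:, extensions:)
  item = { name: name, link: absolute_link(BASE_URL, href), image: image }
  item[:extensions] = extensions unless extensions.empty?
  item
end

with_date = build_item(name: 'Example', href: '/search?q=example', image: nil, extensions: ['1889'])
without_date = build_item(name: 'Example', href: '/search?q=example', image: nil, extensions: [])

puts with_date[:link]
puts without_date.key?(:extensions)
```

One design consequence: `URI.join` leaves an already absolute href untouched, whereas plain concatenation would prepend the base URL a second time.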
20 changes: 20 additions & 0 deletions lib/document_parser.rb
@@ -0,0 +1,20 @@
require 'nokogiri'

class DocumentParser
  attr_reader :parsed_document, :html_file

  def initialize(html_file)
    @html_file = html_file
    @parsed_document = parse_file(html_file)
  end

  def demo?
    html_file == 'files/van-gogh-paintings.html'
  end

  private

  def parse_file(html_file)
    File.open(html_file) { |f| Nokogiri::HTML(f, nil, 'UTF-8') }
  end
end
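
Review note: `demo?` compares against the exact relative path `files/van-gogh-paintings.html`, so it depends on how the caller spells the path. A minimal sketch of a variation that compares only the file name is shown here; the constant `DEMO_FILE_NAME` and the method `demo_file?` are hypothetical names and are not part of the submission:

```ruby
# Hypothetical constant: the challenge's sample page, identified by file name only.
DEMO_FILE_NAME = 'van-gogh-paintings.html'

# Returns true when `path` points at the demo file, regardless of directory prefix.
def demo_file?(path)
  File.basename(path.to_s) == DEMO_FILE_NAME
end

puts demo_file?('files/van-gogh-paintings.html')   # true
puts demo_file?('./files/van-gogh-paintings.html') # true
puts demo_file?('files/da-vinci-artworks.html')    # false
```

This still ties the layout choice to a file name rather than to the page's markup, so it is a narrower fix than detecting the layout from the document itself.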
36 changes: 36 additions & 0 deletions lib/google_html_parser.rb
@@ -0,0 +1,36 @@
require_relative 'image_map'
require_relative 'document_parser'
require_relative 'carousel_item_parser'

class GoogleHtmlParser < DocumentParser
  BASE_URL = "https://www.google.com".freeze

  attr_reader :image_map, :carousel_item_parser, :carousel_items

  def initialize(html_file)
    super
    @image_map = ImageMap.new(html_file).to_h
    @carousel_item_parser = CarouselItemParser.new(BASE_URL, image_map, html_file)
    @carousel_items = fetch_carousel_items
  end

  def execute
    { artworks: carousel_items.map { |item| carousel_item_parser.parse(item) } }
  end

  private

  def fetch_carousel_items
    if demo?
      parsed_document
        .css('g-scrolling-carousel')
        .css('a.klitem')
    else
      parsed_document
        .at_css('g-loading-icon')
        .parent
        .children[1]
        .children
    end
  end
end
27 changes: 27 additions & 0 deletions lib/image_map.rb
@@ -0,0 +1,27 @@
require_relative 'document_parser'

class ImageMap < DocumentParser
  def initialize(html_file)
    super
  end

  def to_h
    if demo?
      script_content
        .scan(/var s = '(.*?)'; var ii = \['(.*?)'\]/)
        .map { |src, id| [id, src.gsub('\\', '')] }
        .to_h
    else
      script_content
        .scan(/var s='(.*?)';var ii=\['(.*?)'\]/)
        .map { |src, id| [id, src.gsub('\\', '')] }
        .to_h
    end
  end

  private

  def script_content
    parsed_document.css('script').select { |script| script.text.include?('_setImagesSrc') }.map(&:text).join
  end
end
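
Review note: the two branches of `to_h` differ only in the whitespace their regular expressions expect. The following is a minimal sketch of a single whitespace-tolerant pattern that covers both spellings. The sample script text and the `build_image_map` helper are hypothetical, and the backslash stripping mirrors the submission's `gsub` rather than a full JavaScript unescape:

```ruby
# Matches both "var s = '...'; var ii = ['...']" and "var s='...';var ii=['...']".
IMAGE_ASSIGNMENT = /var s\s*=\s*'(.*?)';\s*var ii\s*=\s*\['(.*?)'\]/

# Builds a hash of image id => image source from inline script text.
def build_image_map(script_text)
  script_text
    .scan(IMAGE_ASSIGNMENT)
    .map { |src, id| [id, src.gsub('\\', '')] }
    .to_h
end

# Hypothetical sample covering both spacing styles (single-quoted heredoc keeps backslashes literal).
sample_script = <<~'JS'
  (function(){var s='data:image/jpeg;base64,abc\/def';var ii=['img_1'];_setImagesSrc(ii,s);})();
  (function(){var s = 'data:image/jpeg;base64,ghi\/jkl'; var ii = ['img_2'];_setImagesSrc(ii,s);})();
JS

image_map = build_image_map(sample_script)
puts image_map.keys.inspect
puts image_map['img_1']
```

With one pattern, `ImageMap#to_h` would no longer need to branch on `demo?`, assuming the real pages differ only in spacing as the two regular expressions in the submission suggest.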
332 changes: 332 additions & 0 deletions scraped-da-vinci-paintings.json

Large diffs are not rendered by default.

351 changes: 351 additions & 0 deletions scraped-picasso-paintings.json

Large diffs are not rendered by default.
