serpapi · kenyounot123 · Dec 20, 2024 · Dec 23, 2024 · Dec 25, 2024 · Dec 25, 2024
diff --git a/Gemfile b/Gemfile
@@ -0,0 +1,6 @@
+source 'https://rubygems.org'
+
+gem 'nokogiri'
+
+# For Testing
+gem 'rspec'
diff --git a/Gemfile.lock b/Gemfile.lock
@@ -0,0 +1,45 @@
+GEM
+  remote: https://rubygems.org/
+  specs:
+    diff-lcs (1.5.1)
+    nokogiri (1.17.2-aarch64-linux)
+      racc (~> 1.4)
+    nokogiri (1.17.2-arm-linux)
+      racc (~> 1.4)
+    nokogiri (1.17.2-arm64-darwin)
+      racc (~> 1.4)
+    nokogiri (1.17.2-x86-linux)
+      racc (~> 1.4)
+    nokogiri (1.17.2-x86_64-darwin)
+      racc (~> 1.4)
+    nokogiri (1.17.2-x86_64-linux)
+      racc (~> 1.4)
+    racc (1.8.1)
+    rspec (3.13.0)
+      rspec-core (~> 3.13.0)
+      rspec-expectations (~> 3.13.0)
+      rspec-mocks (~> 3.13.0)
+    rspec-core (3.13.2)
+      rspec-support (~> 3.13.0)
+    rspec-expectations (3.13.3)
+      diff-lcs (>= 1.2.0, < 2.0)
+      rspec-support (~> 3.13.0)
+    rspec-mocks (3.13.2)
+      diff-lcs (>= 1.2.0, < 2.0)
+      rspec-support (~> 3.13.0)
+    rspec-support (3.13.2)
+
+PLATFORMS
+  aarch64-linux
+  arm-linux
+  arm64-darwin
+  x86-linux
+  x86_64-darwin
+  x86_64-linux
+
+DEPENDENCIES
+  nokogiri
+  rspec
+
+BUNDLED WITH
+   2.6.1
diff --git a/README.md b/README.md
@@ -1,28 +1,32 @@
-# Extract Van Gogh Paintings Code Challenge
-
-Goal is to extract a list of Van Gogh paintings from the attached Google search results page.
-
-![Van Gogh paintings](https://github.com/serpapi/code-challenge/blob/master/files/van-gogh-paintings.png?raw=true "Van Gogh paintings")
-
-## Instructions
-
-This is already fully supported on SerpApi. ([relevant test], [html file], [sample json], and [expected array].)
-Try to come up with your own solution and your own test.
-Extract the painting `name`, `extensions` array (date), and Google `link` in an array.
-
-Fork this repository and make a PR when ready.
-
-Programming language wise, Ruby (with RSpec tests) is strongly suggested but feel free to use whatever you feel like.
-
-Parse directly the HTML result page ([html file]) in this repository. No extra HTTP requests should be needed for anything.
-
-[relevant test]: https://github.com/serpapi/test-knowledge-graph-desktop/blob/master/spec/knowledge_graph_claude_monet_paintings_spec.rb
-[sample json]: https://raw.githubusercontent.com/serpapi/code-challenge/master/files/van-gogh-paintings.json
-[html file]: https://raw.githubusercontent.com/serpapi/code-challenge/master/files/van-gogh-paintings.html
-[expected array]: https://raw.githubusercontent.com/serpapi/code-challenge/master/files/expected-array.json
-
-Add also to your array the painting thumbnails present in the result page file (not the ones where extra requests are needed). 
-
-Test against 2 other similar result pages to make sure it works against different layouts. (Pages that contain the same kind of carrousel. Don't necessarily have to be paintings.)
-
-The suggested time for this challenge is 4 hours. But, you can take your time and work more on it if you want.
+# SerpApi Code Challenge
+
+This is a solution to the SerpApi Code Challenge.
+It defines a `GoogleHtmlParser` class that scrapes saved Google search results pages and extracts their carousel items; the results are then saved to a JSON file.
+
+- `scraper.rb` is the file that executes the scraper.
+- `document_parser.rb` is the parent class that handles the demo mode logic and the parsing of the HTML file.
+- `google_html_parser.rb` is the class that encapsulates the logic to parse the Google search results page.
+- `carousel_item_parser.rb` is the class that parses the carousel items and extracts the name, link, image, and extensions.
+- `image_map.rb` is the class that creates the hash mapping each image id to its image URL from the HTML file.
+
+To run tests:
+```
+bundle install
+bundle exec rspec
+```
+
+To run the scraper:
+```
+bundle install
+ruby scraper.rb
+```
+
+Demo mode is enabled when the file being scraped is `van-gogh-paintings.html`, the test file provided in the challenge.
+I found two other similar result pages to test the scraper against. Demo mode is needed to differentiate between the test file and the other two because they have different layouts.
+
+# Challenges
+
+1. Different layouts for the carousel items on certain search results pages.
+2. Getting used to the Nokogiri gem and its methods.
+3. Refactoring the code to be more readable and maintainable.
+4. Getting the hang of the gem and its methods took a little longer than expected.
diff --git a/files/da-vinci-artworks.html b/files/da-vinci-artworks.html
diff --git a/files/pablo-picasso-artworks.html b/files/pablo-picasso-artworks.html
diff --git a/files/van-gogh-paintings.html b/files/van-gogh-paintings.html
diff --git a/lib/carousel_item_parser.rb b/lib/carousel_item_parser.rb
@@ -0,0 +1,53 @@
+require 'cgi'
+require_relative 'document_parser'
+
+class CarouselItemParser < DocumentParser
+  attr_reader :base_url, :image_map, :html_file
+
+  def initialize(base_url, image_map, html_file)
+    super(html_file)
+    @base_url = base_url
+    @image_map = image_map
+  end
+
+  def parse(item)
+    carousel_item = {
+      name: name(item),
+      link: link(item),
+      image: image(item)
+    }
+    item_extensions = extensions(item) # computed once instead of twice per item
+    carousel_item[:extensions] = item_extensions unless item_extensions.empty?
+    carousel_item
+  end
+
+  private
+
+  def name(item)
+    if demo?
+      item['aria-label']
+    else
+      item.css('div').children.first.text
+    end
+  end
+
+  def extensions(item)
+    if demo?
+      item.css('div.ellip.klmeta').children.map { |node| node.text.strip }.reject(&:empty?)
+    else
+      item.css('div>div+div').map(&:text).map(&:strip).reject(&:empty?)
+    end
+  end
+
+  def link(item)
+    if demo?
+      @base_url + item['href']
+    else
+      @base_url + item.at_css('a')['href']
+    end
+  end
+
+  def image(item)
+    @image_map[item.at_css('img')&.attr('id')] # nil when the item has no <img>
+  end
+end
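Review note on `link`: both branches build the URL with `@base_url + item['href']`, which assumes every `href` is a site-relative path. A minimal sketch of an alternative using the standard library's `URI.join` follows. `absolute_link` is a hypothetical helper name, not part of this PR, and the sample hrefs are made up for illustration.

```ruby
require 'uri'

# Hypothetical helper (not in the PR): resolves href against base_url.
# Relative paths are prefixed with the base; absolute URLs are returned as-is.
def absolute_link(base_url, href)
  return nil if href.nil? || href.empty?

  URI.join(base_url, href).to_s
end

# Illustrative inputs only; real hrefs come from the parsed carousel items.
puts absolute_link('https://www.google.com', '/search?q=The+Starry+Night')
puts absolute_link('https://www.google.com', 'https://example.com/painting')
```

One caveat: `URI.join` raises `URI::InvalidURIError` for hrefs that are not valid URIs (for example, ones containing raw spaces), so plain concatenation may still be the simpler choice if the input is known to be well-formed.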
diff --git a/lib/document_parser.rb b/lib/document_parser.rb
@@ -0,0 +1,20 @@
+require 'nokogiri'
+
+class DocumentParser
+  attr_reader :parsed_document, :html_file
+
+  def initialize(html_file)
+    @html_file = html_file
+    @parsed_document = parse_file(html_file)
+  end
+
+  def demo?
+    html_file == 'files/van-gogh-paintings.html'
+  end
+
+  private
+
+  def parse_file(html_file)
+    File.open(html_file) { |f| Nokogiri::HTML(f, nil, 'UTF-8') }
+  end
+end
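Review note on `demo?`: the check compares the full path string, so it only holds when the file is passed exactly as `files/van-gogh-paintings.html`. A possible sketch that compares the file name instead, so that `./files/...` or an absolute path behaves the same way. `DEMO_BASENAME` and `demo_file?` are hypothetical names used only for illustration.

```ruby
# Hypothetical names (not in the PR), shown as a standalone method for brevity.
DEMO_BASENAME = 'van-gogh-paintings.html'

# True when the path points at the challenge's demo page,
# regardless of which directory prefix was used to reach it.
def demo_file?(html_file)
  File.basename(html_file) == DEMO_BASENAME
end

puts demo_file?('files/van-gogh-paintings.html')   # true
puts demo_file?('./files/van-gogh-paintings.html') # true
puts demo_file?('files/da-vinci-artworks.html')    # false
```

This keeps the demo/non-demo split from the PR and only loosens how the demo file is recognized.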
diff --git a/lib/google_html_parser.rb b/lib/google_html_parser.rb
@@ -0,0 +1,36 @@
+require_relative 'image_map'
+require_relative 'document_parser'
+require_relative 'carousel_item_parser'
+
+class GoogleHtmlParser < DocumentParser
+  BASE_URL = "https://www.google.com".freeze
+
+  attr_reader :image_map, :carousel_item_parser, :carousel_items
+
+  def initialize(html_file)
+    super
+    @image_map = ImageMap.new(html_file).to_h
+    @carousel_item_parser = CarouselItemParser.new(BASE_URL, image_map, html_file)
+    @carousel_items = fetch_carousel_items
+  end
+
+  def execute
+    { artworks: carousel_items.map { |item| carousel_item_parser.parse(item) } }
+  end
+
+  private
+
+  def fetch_carousel_items
+    if demo?
+      parsed_document
+        .css('g-scrolling-carousel')
+        .css('a.klitem')
+    else
+      parsed_document
+        .at_css('g-loading-icon')
+        .parent
+        .children[1]
+        .children
+    end
+  end
+end
diff --git a/lib/image_map.rb b/lib/image_map.rb
@@ -0,0 +1,27 @@
+require_relative 'document_parser'
+
+class ImageMap < DocumentParser
+  def initialize(html_file)
+    super
+  end
+
+  def to_h
+    if demo?
+      script_content
+        .scan(/var s = '(.*?)'; var ii = \['(.*?)'\]/)
+        .map { |src, id| [id, src.gsub('\\', '')] }
+        .to_h
+    else
+      script_content
+        .scan(/var s='(.*?)';var ii=\['(.*?)'\]/)
+        .map { |src, id| [id, src.gsub('\\', '')] }
+        .to_h
+    end
+  end
+
+  private
+
+  def script_content
+    parsed_document.css('script').select { |script| script.text.include?('_setImagesSrc') }.map(&:text).join
+  end
+end
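Review note on `ImageMap#to_h`: the two branches differ only in the whitespace around `=` and after `;` in the script text. A minimal sketch of a single whitespace-tolerant pattern that covers both forms is shown below. `IMAGE_SRC_PATTERN` and `build_image_map` are hypothetical names, and the sample strings are shortened stand-ins for the real inline scripts.

```ruby
# Hypothetical constant (not in the PR): \s* accepts both
# "var s = '...'; var ii = ['...']" and "var s='...';var ii=['...']".
IMAGE_SRC_PATTERN = /var s\s*=\s*'(.*?)';\s*var ii\s*=\s*\['(.*?)'\]/

# Builds an { image_id => image_src } hash from inline script text.
def build_image_map(script_content)
  script_content
    .scan(IMAGE_SRC_PATTERN)
    .map { |src, id| [id, src.gsub('\\', '')] } # same backslash stripping as the PR
    .to_h
end

# Shortened, made-up samples of the spaced and minified script forms.
spaced   = "var s = 'data:image/jpeg;base64,AAA'; var ii = ['img_1'];"
minified = "var s='data:image/jpeg;base64,BBB';var ii=['img_2'];"

p build_image_map(spaced + minified)
```

With one pattern, `to_h` would not need to branch on `demo?` at all, though this assumes whitespace is the only difference between the pages' scripts.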
diff --git a/scraped-da-vinci-paintings.json b/scraped-da-vinci-paintings.json
diff --git a/scraped-picasso-paintings.json b/scraped-picasso-paintings.json