
Is it too late to teach an old dog a new trick?

I've always been the guy with a monolithic application/codebase.

For more than 10 years now, I've gotten good at caching, writing good SQL queries, using Redis to store frequently read data, and using background jobs to perform updates, edits, long-running tasks, etc.

Then again, I haven't had to deal with more than 100 million rows in a db before.

Until now, that is. I've used up all my tricks, and alas, no significant improvement.

Now I'm coming to you smart DevOps people for advice.

Until just recently, my current app was a Rails app, my DB, some scripts in /lib, Redis, and Sidekiq, all on one, you guessed it, server.

I read up on Docker, and now I have Sidekiq, Redis, my app, and my DB each in their own container.

Now, I'd like to split some functions out of the app itself. This is where I need your help.

For example:

Script A: I have this Ruby script (a Ruby class) that crawls the web for backlink data. Script A only gathers links and removes links that have already been parsed.

Script B: Another script that takes the links Script A collected and retrieves the information I want from each link.
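
Roughly, Script A looks like this (heavily simplified; the Redis list/set names and the seed URL here are just placeholders, not my real code):

```ruby
require "watir"
require "redis"

redis   = Redis.new
browser = Watir::Browser.new :chrome, headless: true   # headless Chrome via Watir

browser.goto("https://example.com")                    # placeholder seed page
browser.links.map(&:href).compact.uniq.each do |url|
  next if redis.sismember("parsed_links", url)         # drop links already parsed
  redis.rpush("links_to_fetch", url)                   # hand the link off to Script B
end

browser.close
```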

Let's stop at that for now, so it doesn't get confusing.

I'd like to run these scripts on their own servers, and they would connect to my app via APIs.

Both scripts are less than 1000 lines of code and are written in Ruby.

Requirements

  1. Both scripts require Selenium, Watir, and headless Chrome to run.
  2. The server must be able to autoscale as needed.
  3. If the server is rate-limited or stuck in a link black hole, I should be able to spin up a new server (host) with a different WAN IP and all the script's content in as little time as possible.
  4. The server should be affordable, as these are long-running tasks. The script does not require a database; it only sends the important data back to my app via an API (a minimal sketch of that call follows this list).
  5. In short, both scripts should be on their own server (host), and it should be easy to spin up a new one in no time.
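
For reference, the API hand-off in requirement 4 is nothing fancy; something like this sketch, where the endpoint, token, and payload are placeholders:

```ruby
require "net/http"
require "json"
require "uri"

def report_result(data)
  uri = URI("https://myapp.example.com/api/v1/backlinks")   # placeholder endpoint
  request = Net::HTTP::Post.new(uri, "Content-Type"  => "application/json",
                                     "Authorization" => "Bearer #{ENV['APP_API_TOKEN']}")
  request.body = data.to_json
  Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }
end

report_result(url: "https://example.com", anchors: 12)      # example payload
```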

How do you suggest I go about it?

Tommy Adey
1 Answer


I see a few scalability problems in your approach. I'm judging from a Google App Engine (GAE) context, where scalability is achieved by breaking the work down into small tasks/work items and strictly limiting the response time of each task execution:

  • you're duplicating a piece of functionality: both scripts need to load and parse the same page, which is why requirement #1 applies to both of them
  • the task execution time for Script A is fundamentally unbounded; it can vary widely with the number of links discovered, each of which needs to be checked against the set of already-crawled links

I'd approach this a bit differently to avoid these issues. I don't even see a need for separate services for this particular job; a single service can do the whole cycle (a sketch follows the list):

  • the service pulls individual page URLs as separate tasks/work items
  • it checks whether the URL was already crawled; if so, it does nothing, otherwise it continues
  • it gets the page content and parses it to extract two kinds of info:

    • the list of other URLs to crawl, creating and enqueueing a new work item for each of them (it does not check here whether they were already crawled; with many URLs that check could take a long time)
    • the other info needed from the page
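
Here's a minimal sketch of that single-service loop, with a Redis list standing in for the task queue and a Redis set for the crawled check; on GAE these would be task queues and Datastore. All names are illustrative:

```ruby
require "watir"
require "redis"

# Hypothetical helper: POST the extracted info back to the main app's API.
def send_to_app(payload)
  puts payload.inspect
end

redis   = Redis.new
browser = Watir::Browser.new :chrome, headless: true

loop do
  _, url = redis.blpop("crawl_queue", timeout: 30)   # pull one work item
  break if url.nil?                                  # queue drained; this instance can stop

  next unless redis.sadd?("crawled", url)            # the single already-crawled check (atomic)

  browser.goto(url)

  # kind 1: enqueue every discovered URL as a new work item, with no checks here
  browser.links.map(&:href).compact.each { |link| redis.rpush("crawl_queue", link) }

  # kind 2: the other info needed from the page, shipped back to the main app
  send_to_app(url: url, title: browser.title)
end

browser.close
```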

In this approach the duration of each task execution is reduced to the minimum necessary: parsing one page and performing just one check of whether a URL has already been crawled. And each page is parsed only once.

The number of pending work items would be the driver for scaling: if it rises above one threshold, new instances can be started; if it drops below another and multiple instances are running, some can be stopped. If implemented as a GAE app, the GAE infrastructure can take care of that for you automatically.
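
Outside GAE, that scaling decision could be as simple as a periodic check like this; the thresholds, start/stop hooks, and worker registry are made up for illustration:

```ruby
require "redis"

SCALE_UP_AT   = 1_000   # hypothetical thresholds
SCALE_DOWN_AT = 50

def start_worker
  puts "scale up"       # hypothetical: call your hosting provider's API here
end

def stop_worker
  puts "scale down"     # hypothetical: terminate one idle worker instance
end

redis   = Redis.new
pending = redis.llen("crawl_queue")                 # pending work items drive scaling
workers = redis.scard("active_workers")             # assumed worker registry

start_worker if pending > SCALE_UP_AT
stop_worker  if pending < SCALE_DOWN_AT && workers > 1
```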

This solution can be implemented on top of a NoSQL database (like Google Datastore, in a GAE context), which can scale better than a SQL one.
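
The key operation that database has to support is an atomic "insert if absent" for the crawled check. For illustration, here's that shape using Redis (already in your stack); Datastore fills the same role on GAE, and one key per URL shards naturally across a distributed store:

```ruby
require "redis"

redis = Redis.new

def first_visit?(redis, url)
  # SET ... NX succeeds only for the first writer, so concurrent workers
  # can never process the same page twice.
  redis.set("crawled:#{url}", Time.now.to_i, nx: true)
end

puts first_visit?(redis, "https://example.com")   # => true  (first time)
puts first_visit?(redis, "https://example.com")   # => false (already crawled)
```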

Dan Cornilescu