
We are building a system for data science in AWS and our flow is pretty simple.

  1. Get the data - from Redis
  2. Get the model - from FSx/EBS
  3. Run it.

This script should take something like 5-10 seconds, depending on the size of the data and the model. So far so good.
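For reference, a minimal sketch of that flow in Python; the Redis host, model path, key name, and `predict()` call are all made-up placeholders, not our real setup:

```python
import pickle

import redis

REDIS_HOST = "my-redis.internal"          # assumption: your Redis endpoint
MODEL_PATH = "/mnt/fsx/models/model.pkl"  # assumption: model file on the FSx/EBS mount

def run_job(data_key: str):
    # 1. Get the data - from Redis
    r = redis.Redis(host=REDIS_HOST, port=6379)
    raw_data = r.get(data_key)

    # 2. Get the model - from FSx/EBS (mounted as a local filesystem)
    with open(MODEL_PATH, "rb") as f:
        model = pickle.load(f)

    # 3. Run it
    return model.predict(raw_data)  # assumption: the model exposes predict()
```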

What we did so far is build a Python application (running in Kubernetes) that listens to a queue and runs the data science flow. The problem we faced was Python's memory growth: whatever we did, and whichever approach we took, a long-running Python application that loads/unloads large objects just grows and grows. So we are trying a new approach: running the script in its own process each time - a new process is created for each job and closed when it finishes (still in Kubernetes).
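A rough sketch of that process-per-job listener, assuming an SQS queue (the queue technology, URL, and the `run_job` function from the sketch above are all assumptions):

```python
import multiprocessing as mp

import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs"  # assumption

def handle_message(body: str) -> None:
    # The whole flow (load data, load model, run) happens here, so every
    # large object lives and dies inside this short-lived child process.
    run_job(body)  # hypothetical: the flow sketched earlier in the question

def main() -> None:
    sqs = boto3.client("sqs")
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                   MaxNumberOfMessages=1,
                                   WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            # One fresh process per job: when it exits, all of its memory is
            # returned to the OS, so the listener itself never grows.
            p = mp.Process(target=handle_message, args=(msg["Body"],))
            p.start()
            p.join()
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])

if __name__ == "__main__":
    main()
```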

My question is: is there a better way? AWS Lambda could be a good solution, but it has memory/CPU limits that don't fit. A Kubernetes Job is inefficient when the script itself takes only a few seconds to run.

Are there any other solutions for running a high load of requests with CPU/memory-intensive work? (I'm looking for a resource/infrastructure solution, not a software one.)

Amorphis

1 Answer


The service can either be always-on, listening on the queue for jobs, or single-job, started by the queue only when new work arrives.

The tradeoff is startup time for single-job runs versus cost for an always-on service.

Define the requirements - number of jobs per day, max latency, RAM/CPU needed, etc. - and then select the right technology that can deliver them in the most cost-effective way.

The options range from AWS Lambda and AWS Step Functions, through K8s Jobs and AWS Fargate, to AWS EC2 on spot pricing, and more. Plenty of options; it depends on the requirements.
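As a rough illustration of that comparison, here is a back-of-envelope calculation in Python. Every number in it is a made-up placeholder, not a real AWS price - plug in your own job counts, durations, memory needs, and current prices:

```python
JOBS_PER_DAY = 50_000   # assumption
JOB_SECONDS = 8         # the question's ~5-10 s per job
JOB_MEMORY_GB = 4       # assumption

# Pay-per-job model (Lambda/Fargate style): billed only while a job runs.
PRICE_PER_GB_SECOND = 0.0001    # assumption, not a real price
pay_per_job_daily = JOBS_PER_DAY * JOB_SECONDS * JOB_MEMORY_GB * PRICE_PER_GB_SECOND

# Always-on model (EC2 / K8s node): billed around the clock regardless of load.
INSTANCE_PRICE_PER_HOUR = 0.20  # assumption, not a real price
always_on_daily = INSTANCE_PRICE_PER_HOUR * 24

print(f"pay-per-job: ${pay_per_job_daily:,.2f}/day")
print(f"always-on:   ${always_on_daily:,.2f}/day")
```

Whichever side of that comparison comes out cheaper at your actual volumes and latency requirements points to the right family of options.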

Hope that helps :)

MLu