How should I design an architecture for an endpoint to take in a huge amount of data without overloading the database?

by xenon   Last Updated October 11, 2019 12:05 PM

Every day, there will be a set of new data that I need to write and update in the database. This set of data can contain 8-10 million records.

This new set of data comes from Service A and it writes directly into a database which is used by Service B. I don’t maintain Service A but I do maintain Service B.

The problem is that whenever Service A has to do such a massive load to the database, it consumes all the IOPS and bogs down the database which in turn affects Service B. And because there are so many records, it takes more than 10-15 hours to complete the process. This means the load of the database remains high for that number of hours too!

I’m considering providing an endpoint for Service A to send its new data to, and I will write to the database in a more “graceful” manner so that it doesn’t affect Service B.

However, I’m not sure how to do this without still bogging down the database. Even if I provide an endpoint so that Service A doesn’t have to write directly to the database, I would still have to handle the same amount of load somehow. It would just be Service B writing to the database instead, resulting in the same load issue.

I’m currently using AWS Postgres RDS as our database. All our services are hosted on AWS and we have a "microservice" architecture.

In a scenario like this, how should I design my endpoint or what AWS services should I be using so that I can handle the huge amount of data more gracefully?



Answers 2


Update

In the comment thread you clarified that the database belongs to service B and is managed by you.

This gives you the authority to control access to the database. This does mean you have to expose an endpoint and route this traffic through your service, but it also gives you the authority to throttle requests as you see fit.

If you draw the line at X requests per second (you can figure this out based on what your db can handle while remaining usable by service B), simply reject any additional request from that source during that second. If that's a problem for the service A devs, then that's a problem the service A devs will have to solve.
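As a rough illustration of the "X requests per second" rule, here is a minimal sketch of a fixed-window limiter that counts requests per source within each second and rejects the rest. The class name, the `handle_write` function, the `write` callable and the source labels are all hypothetical; only the idea of rejecting anything over the limit comes from the paragraph above, and returning HTTP 429 is one common way to signal it.

```python
import time


class PerSecondLimiter:
    """Allow at most `max_per_second` requests per source in any one clock second."""

    def __init__(self, max_per_second, clock=time.monotonic):
        self.max_per_second = max_per_second
        self.clock = clock  # injectable so the limiter can be tested without waiting
        self.windows = {}   # source -> (second the window started, requests counted)

    def allow(self, source):
        second = int(self.clock())
        start, count = self.windows.get(source, (second, 0))
        if start != second:
            # A new second has begun: start counting from zero again.
            start, count = second, 0
        if count >= self.max_per_second:
            self.windows[source] = (start, count)
            return False
        self.windows[source] = (start, count + 1)
        return True


def handle_write(limiter, source, record, write):
    """Hypothetical endpoint handler: 202 if accepted, 429 if throttled."""
    if not limiter.allow(source):
        return 429  # Too Many Requests; the caller is expected to back off and retry
    write(record)   # `write` stands in for whatever persists the record
    return 202
```

Because the count is kept per source, service A hitting its limit does not block other callers. The value of X itself still has to come from measuring what the database can absorb while staying usable.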

You're currently letting someone DDOS your database and seem to think you have no other response than to acquiesce and let the db be dragged down. This is not the right response.

Open an endpoint for them and deny service A direct access to the database: revoke the service credentials or, if A and B share the same credentials, change the password and do not give them the new one. Force them to use your service, and inform them of the throttling you've implemented. The onus is on them to comply with your service.

An alternative solution when the service A devs complain about the throttling is to approach management and justify a need for upscaling service B to have the appropriate resources to deal with the workload. Management gets to decide what's more important, keeping costs down or making sure service A gets to work at top speed.


In a scenario like this, how should I design my endpoint so that I can handle the huge amount of data more gracefully?

It all boils down to what you have ownership over. And by you, I mean service B.

If the database is considered to be service B's database, then all access to the db should go through service B. Don't give direct access to others because you can't enforce any quality control on the data operations they perform.

If the database is a shared database not owned by a particular service, then it is not your problem (if you're only a service B dev). The issue needs to be tackled either by whoever maintains the database (e.g. enforced request throttling), or the service A devs (fixing their own service).

I’m considering providing an endpoint for Service A to send its new data to, and I will write to the database in a more “graceful” manner so that it doesn’t affect Service B.

If it's service B's database, then you should've done this from the start (regardless of performance issues). If it's not service B's database, then this is a case of "not my circus, not my monkeys" (from the perspective of service B, that is).
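If service B does take over the writes, one possible shape for the "graceful" part is to accept records quickly, buffer them, and drain the buffer in small batches with a pause between batches. The sketch below is only an illustration under stated assumptions: `PacedWriter`, `write_batch`, the batch size and the pause are all hypothetical, and the in-memory buffer stands in for something durable (a queue service, for example), which a load of this size would realistically need.

```python
import time
from collections import deque


class PacedWriter:
    """Buffer incoming records and write them out in small, spaced-out batches."""

    def __init__(self, write_batch, batch_size=1000, pause_seconds=0.5, sleep=time.sleep):
        self.write_batch = write_batch      # hypothetical: persists one batch, e.g. one multi-row upsert
        self.batch_size = batch_size        # assumed value; tune against what the database tolerates
        self.pause_seconds = pause_seconds  # breathing room for service B's own queries
        self.sleep = sleep                  # injectable so tests do not actually wait
        self.pending = deque()

    def submit(self, records):
        """Called by the endpoint: cheap, no database work happens here."""
        self.pending.extend(records)

    def drain(self):
        """Write everything buffered so far; returns the number of batches written."""
        batches = 0
        while self.pending:
            size = min(self.batch_size, len(self.pending))
            batch = [self.pending.popleft() for _ in range(size)]
            self.write_batch(batch)
            batches += 1
            if self.pending:
                self.sleep(self.pause_seconds)
        return batches
```

The total amount of data written is unchanged, so the import takes at least as long as before. What changes is that service B decides the pace instead of service A.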

Suppose you do take on the role of a reasonable middle man between A and the db, and ignore the implementation details. What happens if Service C is created tomorrow and imports different (but equally sizeable) data? Are you also going to incorporate service C now? What about Service D? E? F?

Trying to cover for other faulty mechanisms just leads to more effort and an inferior solution. You're muddying the water; A and B are separate for a reason.

What you're trying to do is akin to preventing a DDOS attack on your own database server (by Service A). If it were your own managed server, I would suggest taking this up with the IT support staff for your database server to handle the issue on a larger scale. DDOS attacks are not (efficiently) prevented by programmers, but by server/network admins.

As this is AWS related, I'm not quite sure how exactly to go about this, but the principle of the approach is the same.
Rather than trying to fix the faulty actor (service A), fix the database server. Even if you fix service A, you're going to have to fix every actor's misbehavior. But if you fix the database server by e.g. enforcing request throttling for every actor, then you prevent any future services from making the same mistake and pulling the whole system down again.

Flater
October 11, 2019 11:46 AM

The architectural solution to most bulk load problems is to eliminate the bulk load. Go back to the source of the events and consume them in a stream.
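To put rough numbers on this, the snippet below compares the write rate of the current bulk load with the same volume consumed as a stream. It uses the question's figures (up to 10 million records, loaded in about 10 hours) and assumes, purely for illustration, that the source events arrive evenly across a 24-hour day, which the question does not state.

```python
SECONDS_PER_HOUR = 3600


def records_per_second(records, hours):
    """Average write rate needed to move `records` in `hours`."""
    return records / (hours * SECONDS_PER_HOUR)


# Figures from the question: up to 10 million records, loaded in roughly 10 hours.
bulk_rate = records_per_second(10_000_000, 10)
# Assumption: the same records arrive evenly over a full day when streamed.
stream_rate = records_per_second(10_000_000, 24)

print(f"bulk: {bulk_rate:.0f} records/s, streamed: {stream_rate:.0f} records/s")
# prints "bulk: 278 records/s, streamed: 116 records/s"
```

These are averages only. The real benefit depends on how bursty the source events are and how expensive each write is.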

However, where this is not possible, you need a tactical solution, and following strict rules about service B owning its own DB might not help you.

Amazon has some guidance on bulk imports:

https://aws.amazon.com/premiumsupport/knowledge-center/rds-import-data/

But you can see this isn't geared to being a regular business process. You are advised to switch off backups etc.

If you are at the level of data where you are forced to use these 'one off' methods just to get the work done in a reasonable amount of time, then I would consider spinning up a new database for each import and making service B aware of multiple RDS instances.

That way you can have Service A create a whole new DB, add the data, then message service B: "a new DB is ready for you".
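On service B's side, the handover could look something like the sketch below: B keeps track of which database it currently reads from and swaps it when the "new DB is ready" message arrives. The message format (`type`, `dsn`), the class and function names, and the connection strings are all hypothetical. How the message is delivered, and how the old instance is retired, are left out.

```python
import threading


class ActiveDatabase:
    """Tracks the connection string service B should currently use."""

    def __init__(self, initial_dsn):
        self._lock = threading.Lock()
        self._dsn = initial_dsn

    def current(self):
        with self._lock:
            return self._dsn

    def switch_to(self, new_dsn):
        """Point service B at the new database and return the previous one."""
        with self._lock:
            previous = self._dsn
            self._dsn = new_dsn
        return previous


def on_message(active, message):
    """Handle a hypothetical 'db_ready' notification sent by Service A."""
    if message.get("type") != "db_ready":
        return None  # not a handover message; ignore it
    # The previous DSN is returned so the caller can retire the old instance later.
    return active.switch_to(message["dsn"])
```

New requests in service B would call `current()` when opening a connection, so the heavy import never touches the database that is serving traffic.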

Having everything in the cloud helps with this, as you can use the AWS API to spin up and access services without having to worry about physical hardware.

Ewan
October 11, 2019 12:44 PM
