Inner join between separate databases

Question

Assume there is a large corporation with multiple SW systems that all use the same CRM system via a REST API. All systems have their own databases as does the CRM system.

Now, system A only holds references to the CRM customer in the form of customer IDs but no other customer data.
System A's database holds 10,000 customers, CRM holds 1,000,000 customers.

There's a requirement for system A to allow searching customers by name or partial name and a role held in system A. If CRM and system A were in the same database this would be trivial, a simple inner join. But since both systems have separate databases, this is not an option.

Option 1 would be that there's a REST service that accepts 10,000 customer IDs as input parameters (as well as the searched name), does the inner join against its own database, and returns the result set. In our case this is not possible as the REST service accepts at most 100 customer IDs and returns at most 300 customers.

Option 2 would be that the CRM system keeps track of which customers are in system A and can then limit name searches to those customers.

Option 3 would be that system A replicates some of the CRM data (names) in its own database, but that comes with a its own problems. How do you propagate changes in CRM to system A?

Any other options? I'm not too familiar with how MDM systems would handle this. It seems to me that with the current microservices craze similar problems are all the more prevalent.

Hans-Martin Mosner · Answer 1 · 2020-08-20T10:24:20.017

If the current data layout does not support a new requirement, you always have a number if options. Which one is most suitable in your case is something that only you (or your organization) can decide.

In general, the path of least resistance seems to be to use the APIs that are available and work around their limitations. So your system A might query the CRM system for all customer IDs matching some partial name (which could be a lot but usually returning a huge number of IDs is easier than passing a huge number as parameters) and then intersect this list with the list of customers having some role in A. This could work, but is probably suboptimal. However, you would be able to implement the solution completely within the boundaries of system A.

A more comprehensive solution might be to create a requirement for the CRM system (which may or may not be possible at all) to store additional information such as the roles each customer has in the various other systems. This has the advantage of being queriable within the CRM system and possibly supporting additional reporting functionality that requires cross-system customer role information. However, it would also require all systems to publish their customer role information to the CRM.

Your third option should also be considered, of course, but it looks like it could yield relatively little gain for the additional effort required, so to me it feels like the least attractive solution.

I'm currently working with a system which implements the second approach, and in my opinion it feels natural and sound. However, in this case the design was implemented a long time ago, so there's no pressure to change the architecture to just support a new use case. In your case, that may be harder, so the first option could actually be preferrable.

Andreas Hütter · Answer 2 · 2020-08-26T20:31:25.670

There's a requirement for system A to allow searching customers by name or partial name and a role held in system A. If CRM and system A were in the same database this would be trivial, a simple inner join. But since both systems have separate databases, this is not an option.

I'm getting a little sceptical with the there's a requirement for system A part here. From a user's perspective is there really a system A, system CRM, etc.?

If not, does the user of your software really care about where in the backend the application is searching for a name?

CRM systems often store role information along with the customer information. So maybe the easiest solution - having the role information in the CRM - is the best. With that the search API would be provided by the CRM.

But if the role information is too specific to service A and does not really reflect the role of a customer in the context of a CRM I would suggest the following:

Define the Customer domain model in the context of service A with all data that is required in service A (external customerId, customer name, role, etc)
Update your customer data in your service A with data from the CRM either
- via some caching mechanism (if this is applicable in terms of performance)
- or by letting the CRM publish changes to customers which service A can listen to in order to update its own data set (e.g. by using some messaging infrastructure)

If you go with this second approach you of course have to be aware that you might sometimes search on outdated data which could or could not be problem depending on your business requirements. Although outdated can really just mean seconds or less depending on the implementation.

Update

Bottom line I would suggest one of the two approaches I described above depending on your current situation. So publish changes from the CRM so that service A can build its own projection of the data for search capabilities if some data (for instance the role information) is too specific to service A. Or extend the CRM data so that everything is available in the CRM system and provide the search functionality via the CRM system if the searched data/parameters fit into the model of the CRM.

But if there is no way you can implement changes in the CRM system you could also look into the Change data capture pattern. With that you can utilize the transaction log of the database (which is usually provided by most modern database systems) to synchronize data changes in the background by selecting only what you need. This would not require to change anything in the CRM system's implementation or the CRM database. There are tools that also allow to setup this kind of synchronization from one database to another without implementing any custom code. You could look, for instance, into Debezium in combination with some message streaming service (e.g. Kafka). See also https://developers.redhat.com/blog/2020/04/14/capture-database-changes-with-debezium-apache-kafka-connectors/

score 0 · Answer 3 · answered Aug 22 '20 at 06:32

0

What you described a need for is ETL. You need to keep the data synchronized between two systems, you can create a process that is responsible for doing it on a schedule.

answered Aug 22 '20 at 06:32

Michael Brown

21,822

Inner join between separate databases

3 Answers3