0

We are planning to build a web-based custom business analytics tool using PHP and MySQL/PostgreSQL. In our application, customers will upload their data as an Excel sheet, and this data will be displayed on a custom-built interactive dashboard.

I have a few questions before I start working on this application:

1) For mass data uploads from Excel, which database is more suitable: MySQL or PostgreSQL?

2) At some point, multiple customers may start a data upload at the same time. To deal with that scenario, is it advisable to create a separate database for each customer, or to use a single database for the application (single tenancy vs. multi-tenancy)? Please tell me the pros and cons, if any, of each scenario.

3) If I go with the multiple-database design and join tables across databases, how would it affect my dashboard performance?

Thanks

user44429
  • 1
  • 2

3 Answers

2

For a data warehousing/analytics type application (DW/OLAP), I would go with PostgreSQL. It has set operators, window functions (also known as analytic functions) and common table expressions. With MySQL you will most likely have to implement some or all of these in your own code, with the attendant possibility of bugs. MySQL is better suited to read-heavy OLTP-type applications, whereas PostgreSQL is better (IMHO) for OLAP work.
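To make the point concrete, here is a sketch of the kind of analytic query meant above - a CTE feeding a window function. The table and column names are hypothetical; the example uses SQLite (which also supports both features) purely so it is self-contained, but the same SQL runs unchanged on PostgreSQL.

```python
import sqlite3

# Hypothetical per-customer sales data (SQLite >= 3.25 for window functions).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (customer_id INTEGER, month TEXT, amount REAL);
    INSERT INTO sales VALUES
        (1, '2023-01', 100), (1, '2023-02', 150),
        (2, '2023-01', 200), (2, '2023-02', 120);
""")

# A CTE plus a window function: a running total per customer -- the kind of
# query you would otherwise have to simulate in application code.
rows = conn.execute("""
    WITH monthly AS (
        SELECT customer_id, month, SUM(amount) AS total
        FROM sales
        GROUP BY customer_id, month
    )
    SELECT customer_id, month,
           SUM(total) OVER (PARTITION BY customer_id ORDER BY month) AS running_total
    FROM monthly
    ORDER BY customer_id, month
""").fetchall()

for row in rows:
    print(row)
```

Without window functions you would fetch every row into PHP and accumulate the running total in a loop - workable, but exactly the "implement it yourself, with the attendant bugs" situation described above.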

With respect to "multi-tenancy" - why not use the same database for all customers and simply identify them by a customer_id - or (depending on the number of customers) have separate tables for each customer, with non-customer-specific data in other tables? How many customers do you have, and how much data do you expect to store/analyse over the next 10 years? Modern RDBMSs can store large amounts of data and retrieve it quickly (with a suitable disk configuration), but I would favour PostgreSQL for large databases.
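A minimal sketch of the shared-database approach, with customer_id as the tenant discriminator (table and column names are hypothetical; SQLite stands in for either server):

```python
import sqlite3

# One shared table; every row carries the tenant it belongs to.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE uploads (
        customer_id INTEGER NOT NULL,
        metric      TEXT    NOT NULL,
        value       REAL
    )
""")
conn.executemany(
    "INSERT INTO uploads VALUES (?, ?, ?)",
    [(1, 'revenue', 500.0), (2, 'revenue', 900.0)],
)

# Every per-tenant query simply filters on customer_id.
row = conn.execute(
    "SELECT value FROM uploads WHERE customer_id = ? AND metric = ?",
    (1, 'revenue'),
).fetchone()
print(row[0])  # 500.0
```

The design cost is that every query must carry the customer_id predicate; the benefit is one schema to migrate and one database to back up, whatever the customer count.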

Sharding might be interesting if you only occasionally query across customers and most of your analysis is done per customer - you could set up a table (or even schema) per customer (on different disks). Again, knowledge of data volumes would help here.

Furthermore, with respect to multi-tenancy, see my question to jynus about querying across servers - or even different schemas within the same server.

I take jynus' point about "holy wars" - this post is just my own opinion about your particular question - were I to be asked about a different scenario, I might well recommend MySQL.

[EDIT]

What, exactly, is "my dashboard performance"?

Vérace
  • 30,923
  • 9
  • 73
  • 85
0

Either MySQL or PostgreSQL will cope fine with the number of records held in an Excel document. The insert should be performed either in batches or via the bulk interface (which is proprietary to each DB vendor; on MySQL, it may require the file you push to be on the same physical node as the DB server).
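A sketch of the batched-insert approach, assuming the Excel sheet has already been exported to CSV (file layout and table names are hypothetical; SQLite stands in for the real server). `executemany()` sends the rows as one batch instead of one round trip per row; for very large files you would switch to the server's bulk interface (COPY in PostgreSQL, LOAD DATA in MySQL).

```python
import csv
import io
import sqlite3

# Stand-in for a CSV file exported from the customer's Excel sheet.
csv_data = io.StringIO("customer_id,metric,value\n1,revenue,500\n1,cost,320\n")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE uploads (customer_id INTEGER, metric TEXT, value REAL)")

# Parse once, then insert the whole batch in a single executemany() call.
reader = csv.DictReader(csv_data)
rows = [(int(r["customer_id"]), r["metric"], float(r["value"])) for r in reader]
conn.executemany("INSERT INTO uploads VALUES (?, ?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM uploads").fetchone()[0]
print(count)  # 2
```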

Creating a separate database per user (or per whatever separate entity you may conceptually consider a boundary) is usually an anti-pattern. Imagine you have 5000 customers: managing 5000 databases (each with its own independent structure) is a real pain. If each customer has 10,000 records, you end up either managing one small DB of 50M records or managing 5000 tiny databases. On the other hand, if security concerns are involved (say you store the secret formulas of both Pepsi and Coca-Cola), you are better off with fully separated servers.

I don't see a problem if multiple customers upload at the same time. If both can update the same content (for example, the density of water at zero Celsius), the "second" uploader will receive an error for violating a unique key. On the other hand, if each customer has his/her own data, the customer ID is part of the keys, so there is no problem uploading concurrently. Even if the bulk import takes a lock on the table (it shouldn't), customers will only experience a brief delay during the import.

-1

1) Any modern RDBMS will let you import and work with Excel-like files (usually we handle them as CSV files, for which both databases have import support: COPY and LOAD DATA).

If you had to choose, worry about in-house technology knowledge and advanced features, but nothing here suggests any impossible task. Please be wary of people promoting holy wars among database vendors.

2 and 3) All modern RDBMSs allow transactional, concurrent writes to different rows of a single table. Your design concerns should be directed at the estimated write load/data size and their implications for performance - e.g., do you need partitioning/sharding? Those questions usually only arise when you are dealing with huge numbers. Make sure you are not falling into the premature-optimization anti-pattern.

There is a difference regarding joins. Schemas/databases in MySQL are mostly a logical categorization, and there is effectively no performance difference between joining tables in the same database or in different databases within the same instance. In PostgreSQL, while it is technically possible (dblink, etc.), it is not an optimal operation by design - different databases are "different entities" - and obviously you do not put tables that you plan to join in different databases knowing that (the same way that you do not put them on different instances).
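As a self-contained analogue of the MySQL behaviour described above (plain `db.table` qualifiers joining across databases on one instance), this sketch uses SQLite's ATTACH to join two separate in-memory databases through one connection. The database and table names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Attach two independent databases to the same connection, like two
# databases living on the same MySQL instance.
conn.execute("ATTACH ':memory:' AS customers_db")
conn.execute("ATTACH ':memory:' AS sales_db")

conn.execute("CREATE TABLE customers_db.customers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE sales_db.orders (customer_id INTEGER, amount REAL)")
conn.execute("INSERT INTO customers_db.customers VALUES (1, 'Acme')")
conn.execute("INSERT INTO sales_db.orders VALUES (1, 42.0)")

# Cross-database join with simple qualified names -- cheap here and in
# MySQL; in PostgreSQL the equivalent needs dblink or postgres_fdw.
row = conn.execute("""
    SELECT c.name, o.amount
    FROM customers_db.customers AS c
    JOIN sales_db.orders AS o ON o.customer_id = c.id
""").fetchone()
print(row)  # ('Acme', 42.0)
```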

jynus
  • 15,057
  • 2
  • 37
  • 46