
I've just been asked whether our company should consider Data Virtualization for our test environments. The benefits are given as:

  • Screening of sensitive data
  • Fast data refreshes in our test environments
  • Potential benefits for DR and BI scenarios

However, I've only found marketing material; nothing technical. From what I can figure out, there are two approaches:

  • A service layer over a production database which abstracts you from the data model (presumably resulting in a different data model presented by that new layer).
  • A tool to automate the restore and subsequent manipulation of data which can be used by non-technical users and is faster than using database backups and SQL scripts.

Without seeing any technical information, this smells of snake oil to me, but I want to understand it rather than dismissing it out of hand.


Keywords: [data-as-a-service] [data-virtualisation] [data-virtualization] [delphix] [denodo]

Shiwangini
JohnLBevan
1 Answer


Data virtualization is the provision of an abstraction layer so that the data consumer does not have to know the physical location or format of the original data. You may have a Postgres database, a MySQL database, a SQL Server database, and a whole batch of Parquet/ORC files, and the person writing the query is completely unaware of that. As far as they are concerned, they are hitting a connection to Presto (or whatever data virtualization solution you've chosen to use).
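As a sketch of what this looks like from the consumer's side, a single Presto query can join tables that live in different systems by addressing them through catalogs (`catalog.schema.table`); the catalog, schema, and table names below are made up for illustration:

```sql
-- Join a customers table held in PostgreSQL with order data stored as
-- Parquet files (exposed through the Hive connector), in one query.
-- Names are hypothetical; the point is that neither source's physical
-- location or format appears in the SQL beyond the catalog prefix.
SELECT c.customer_name,
       SUM(o.order_total) AS total_spend
FROM postgresql.public.customers AS c
JOIN hive.sales.orders AS o
  ON c.customer_id = o.customer_id
GROUP BY c.customer_name;
```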

Technologies such as Presto provide a central point against which to run SQL queries; Presto itself is configured to know where and what the source data is, so the end user does not need to. Presto is an open-source tool that has had a lot of input from Teradata, particularly with regard to JDBC connectivity, security, and LDAP authentication. It also has commercial support from Starburst, which has recently announced a cost-based query optimizer for Presto.
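For example, Presto learns about a source from a catalog properties file dropped into its `etc/catalog/` directory; a minimal PostgreSQL catalog (with placeholder connection details) looks like this:

```properties
# etc/catalog/postgresql.properties -- connection details are placeholders
connector.name=postgresql
connection-url=jdbc:postgresql://db.example.com:5432/sales
connection-user=presto_ro
connection-password=********
```

Once this file is in place, the tables in that database appear under the `postgresql` catalog and the query writer never sees the JDBC URL.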

AWS clearly have faith in Presto: Amazon Athena is based on it. The beauty of Presto is that the data does not have to reside in a relational database; it can be file-based as well.

In terms of screening sensitive data, you can choose who has access to what, but Presto isn't a data masking or obfuscation tool.
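To illustrate how coarse-grained that screening is compared with a dedicated masking tool: Presto's file-based system access control can restrict which users may touch which catalogs via a rules file. The user and catalog names below are hypothetical, and this grants or denies whole catalogs rather than masking individual columns:

```json
{
  "catalogs": [
    {"user": "test_team", "catalog": "masked_data", "allow": true},
    {"catalog": ".*", "allow": false}
  ]
}
```

Rules are matched top to bottom, so here the hypothetical `test_team` user sees only the `masked_data` catalog and everyone else is denied by the catch-all rule.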

It isn't snake oil, but neither is it a silver bullet. There is obviously a load placed on the source systems, and you have to understand what that load is. The key benefit is that you don't have to shift data all over the place or maintain a plethora of technologies to support that data movement.

Dave Poole