In this write-up we’ll describe our proprietary Python Dynamic Query Builder (PDQB) and walk through a use-case example.
First let’s define what is meant by “Dynamic Query Builder”. Imagine a simple SQL query such as:
SELECT column1 FROM table1 WHERE column2 IS NOT NULL
Now imagine that you want to run the same SQL query against hundreds of tables in your database. Further, imagine that the column names and WHERE conditions need to change depending on each table. Normally this would be done manually, but PDQB automates the process: each query string is constructed by Python, and the results are dumped to CSV file(s). Of course, this is a simple example. In practice the constructed queries can be highly complex, and our code is not limited to SQL databases.
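Since PDQB itself is proprietary, here is a minimal sketch of the idea: a per-table configuration drives the construction of each query string, and results are dumped to one CSV per table. The table names, column names, and the `QUERY_SPEC` structure below are hypothetical, and SQLite stands in for whatever database is actually in use.

```python
import csv
import sqlite3

# Hypothetical per-table configuration: which column to select and which
# column must be non-null. A real config could vary far more per table.
QUERY_SPEC = {
    "customers": {"select": "name", "not_null": "email"},
    "orders": {"select": "order_id", "not_null": "shipped_at"},
}

def build_query(table, spec):
    """Construct the query string dynamically for one table."""
    # NOTE: identifiers come from a trusted config, not user input;
    # production code should validate/quote them to avoid SQL injection.
    return (f'SELECT {spec["select"]} FROM {table} '
            f'WHERE {spec["not_null"]} IS NOT NULL')

def dump_all(conn, spec_map):
    """Run each generated query and write the results to <table>.csv."""
    for table, spec in spec_map.items():
        rows = conn.execute(build_query(table, spec)).fetchall()
        with open(f"{table}.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow([spec["select"]])  # header row
            writer.writerows(rows)
```

The key point is that the query text is data, not code: adding a table means adding a config entry, not writing another query by hand.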
What can you do with PDQB? Quite a lot, actually: it enables all sorts of database mining, data discovery, and, most importantly, cross-table relationship building. This matters most for Enterprise Big Data Warehouses that never implemented naming standardization for tables and columns (poor pre-planning) and have no documentation of table relationships. In such cases, the tool assists with relationship discovery and provides a list of the most likely pathways.
Now it’s time for a use-case example: building a Data Dictionary for self-service analytics of an Enterprise Data Warehouse that has zero table or column name annotation, almost zero relationship documentation, and many columns/fields containing null or insignificant data. Here is an outline of the steps that would solve this problem, or at least greatly reduce its complexity.
1. Using PDQB, create an inventory of table names, column names, field types, etc. from the Information Schema (or the NoSQL equivalent).
2. Review PDQB-suggested metrics, such as number of records, max/min (for numeric types), and count of nulls, to determine which metrics are most significant for your particular case.
3. Using PDQB, employ the selected metrics to build the metadata catalog. Detect insignificant or useless tables and columns and exclude them from further consideration.
4. Employ a combination of comparative methods, such as fuzzy matching/NLP for text fields and statistical comparison for numeric fields, to suggest table/column relationships across the Data Warehouse.
Finally, to augment the process described above, we are working on table/column relationship discovery from a different angle. This novel approach, however, still employs our foundational PDQB tool at its core.