Blog October 10, 2024
5 min read

How to Simplify Data Profiling and Management with Snowpark and Streamlit

Learn why data quality is one of the most overlooked aspects of data management. While all models need good quality data to generate useful insights and patterns, data quality is especially important across industries like retail, healthcare, and finance. In this blog, we explore how data profiling can help you understand your data quality. Discover how Tiger Analytics leverages Snowpark and Streamlit to simplify data profiling and management.

The accuracy of the data-to-insights journey is underpinned by one of the most foundational yet often overlooked aspects of data management – Data Quality. While all models need good quality data to generate useful insights and patterns, data quality is especially important across industries like retail, healthcare, and finance. Inconsistent, missing, or duplicate data can impact critical operations such as customer segmentation and can even affect regulatory compliance, resulting in potential financial or reputational losses.

Let’s look at an example:

A large retail company relies on customer data from various sources, such as online orders, in-store purchases, and loyalty program interactions. Over time, inconsistencies and errors in the customer database, such as duplicate records, incorrect addresses, and missing contact details, impacted the company’s ability to deliver personalized marketing campaigns, segment customers accurately, and forecast demand.

Data Profiling Matters – Third-party or Native app? Understanding the options

Data profiling helps an organization understand the nature of its data before building data models. It also ensures data quality and consistency, enabling faster decision-making and more accurate insights.

  • Improves Data Accuracy: Identifies inconsistencies, errors, and missing values.
  • Supports Better Decision-Making: Ensures reliable data for predictive analytics.
  • Enhances Efficiency: Helps detect and remove redundant data, optimizing resources and storage.

For clients using Snowflake for data management purposes, traditional data profiling tools often require moving data outside of Snowflake, creating complexity, higher costs, and security risks.

  • Data Transfer Overhead: External tools may require data to be moved out of Snowflake, increasing latency and security risks.
  • Scalability Limitations: Third-party tools may struggle with large Snowflake datasets.
  • Cost and Performance: Increased egress costs and underutilization of Snowflake’s native capabilities.
  • Integration Complexity: Complex setup and potential incompatibility with Snowflake’s governance and security features.

At Tiger Analytics, our clients faced similar problems. To address them, we developed a Snowflake Native App that uses Snowpark and Streamlit to perform advanced data profiling and analysis within the Snowflake ecosystem. This solution leverages Snowflake’s virtual warehouses for scalable, serverless computational power, enabling efficient profiling without external infrastructure.

How Snowpark Makes Data Profiling Simple and Effective

Snowpark efficiently manages large datasets by chunking data into smaller pieces, ensuring smooth profiling tasks. We execute YData Profiler and custom Python functions directly within Snowflake, storing results like outlier detection and statistical analysis for historical tracking.
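One way to picture the chunking idea is a profiler that processes fixed-size batches and folds each batch into running statistics, so no single step needs the whole column in memory. The sketch below is plain Python with hypothetical names (`ColumnStats`, `chunks`, `profile_in_chunks`). It illustrates the pattern only; it is not the app's implementation and does not call Snowpark or YData Profiler.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Iterable, Iterator, List, Optional


@dataclass
class ColumnStats:
    """Running summary statistics for one numeric column."""
    count: int = 0                    # non-null values seen so far
    nulls: int = 0                    # None values seen so far
    total: float = 0.0                # running sum, used for the mean
    minimum: Optional[float] = None
    maximum: Optional[float] = None

    def update(self, values: Iterable[Optional[float]]) -> None:
        """Fold one batch of values into the running statistics."""
        for v in values:
            if v is None:
                self.nulls += 1
                continue
            self.count += 1
            self.total += v
            self.minimum = v if self.minimum is None else min(self.minimum, v)
            self.maximum = v if self.maximum is None else max(self.maximum, v)

    @property
    def mean(self) -> Optional[float]:
        return self.total / self.count if self.count else None


def chunks(values: Iterable, size: int) -> Iterator[List]:
    """Yield successive lists of at most `size` items."""
    it = iter(values)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def profile_in_chunks(values: Iterable[Optional[float]],
                      chunk_size: int = 1000) -> ColumnStats:
    """Profile a column batch by batch instead of all at once."""
    stats = ColumnStats()
    for batch in chunks(values, chunk_size):
        stats.update(batch)
    return stats
```

In a Snowpark setting the batches would come from the platform rather than a Python list, for example from `DataFrame.to_pandas_batches()`; that pairing is an assumption about how such a sketch could be wired up, not a description of the app.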

We also created stored procedures and UDFs with Snowpark to automate daily or incremental profiling jobs. The app tracks newly ingested data and uses Snowflake tasks to run profiling operations on a schedule. Profiling outputs also feed into data pipelines, with alerts triggered when anomalies are detected, so data quality is monitored continuously.
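The incremental pattern can be sketched as a watermark check: profile only the rows loaded since the last run, then raise an alert if a metric crosses a threshold. Everything specific here (the `loaded_at` field, the `email` column, the 20% null-rate threshold, the function names) is a hypothetical illustration rather than the app's actual logic. In Snowflake, logic like this would typically sit inside a stored procedure that a scheduled task calls.

```python
from datetime import datetime
from typing import Any, Dict, List, Optional, Tuple

Row = Dict[str, Any]


def new_rows(rows: List[Row], watermark: datetime) -> List[Row]:
    """Keep only rows ingested after the previous profiling run."""
    return [r for r in rows if r["loaded_at"] > watermark]


def null_rate(rows: List[Row], column: str) -> float:
    """Fraction of rows where `column` is missing or None."""
    if not rows:
        return 0.0
    missing = sum(1 for r in rows if r.get(column) is None)
    return missing / len(rows)


def run_incremental_profile(
    rows: List[Row],
    watermark: datetime,
    column: str,
    max_null_rate: float = 0.2,   # assumed threshold, tune per dataset
) -> Tuple[datetime, float, Optional[str]]:
    """Return (new_watermark, null_rate, alert_message_or_None)."""
    batch = new_rows(rows, watermark)
    if not batch:
        # Nothing new since the last run: keep the watermark, no alert.
        return watermark, 0.0, None
    rate = null_rate(batch, column)
    alert = None
    if rate > max_null_rate:
        alert = f"{column}: null rate {rate:.0%} exceeds {max_null_rate:.0%}"
    # Advance the watermark to the newest row that was just profiled.
    new_watermark = max(r["loaded_at"] for r in batch)
    return new_watermark, rate, alert
```

Storing the returned watermark between runs is what makes the job incremental: the next run skips everything already profiled.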

By keeping operations within Snowflake, Snowpark reduces data transfer, lowering latency and enhancing performance. Its native integration ensures efficient, secure, and scalable data profiling.

Let’s look at the key features of the app, built leveraging Snowpark’s capabilities.

Building a Native Data Profiling App in Snowflake – Lessons learnt:

1. Comprehensive Data Profiling

At the core of the app’s profiling capabilities is either YData Profiler or a custom-built profiler, both Python libraries integrated using Snowpark. These libraries allow users to profile data directly within Snowflake by leveraging its built-in compute resources.

Key features include:

  • Column Summary Statistics: Quickly review important statistics for columns of all data types, such as string, number, and date, to understand the data at a glance.
  • Data Completeness Checks: Identify missing values and assess the completeness of your datasets to ensure no critical information is overlooked.
  • Data Consistency Checks: Detect anomalies or inconsistent data points to ensure that your data is uniform and accurate across the board.
  • Pattern Recognition and Value Distribution: Analyze data patterns and value distributions to identify trends or detect unusual values that might indicate data quality issues.
  • Overall Data Quality Checks: Review the health of your dataset by identifying potential outliers, duplicates, or incomplete data points.
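To make a few of these checks concrete, here is a minimal standard-library sketch of completeness, duplicate detection, pattern recognition, and outlier detection. The function names are hypothetical, and the interquartile-range rule for outliers is one common choice that I am assuming for illustration; the source does not say which methods the app or YData Profiler applies.

```python
import re
from collections import Counter
from statistics import quantiles
from typing import Dict, List, Optional, Sequence


def completeness(values: Sequence[Optional[object]]) -> float:
    """Share of non-null values, between 0.0 and 1.0."""
    if not values:
        return 1.0
    return sum(v is not None for v in values) / len(values)


def duplicate_count(values: Sequence[object]) -> int:
    """Number of values that repeat an earlier value."""
    return len(values) - len(set(values))


def shape(value: str) -> str:
    """Reduce a string to its pattern: letters become A, digits become 9."""
    value = re.sub(r"[A-Za-z]", "A", value)
    return re.sub(r"[0-9]", "9", value)


def pattern_distribution(values: Sequence[str]) -> Dict[str, int]:
    """Count how often each pattern occurs; rare patterns deserve a look."""
    return dict(Counter(shape(v) for v in values))


def iqr_outliers(values: Sequence[float], k: float = 1.5) -> List[float]:
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]. Needs at least two values."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]
```

For example, `pattern_distribution(["AB-12", "CD-34", "X9"])` groups the first two codes under the pattern `AA-99` and isolates `X9` as `A9`, which is the kind of stray format a pattern check is meant to surface.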

2. Snowflake Compute Efficiency

The app runs entirely within Snowflake’s virtual warehouse environment. No external servers or machines are needed, as the app fully utilizes Snowflake’s built-in computing power. This reduces infrastructure complexity while ensuring top-tier performance, allowing users to profile and manage even large datasets efficiently.

3. Flexible Profiling Options

The app allows users to conduct profiling in two distinct ways: by examining entire tables or by focusing on specific columns. This flexibility ensures that users can tailor the profiling process to their exact needs, from broad overviews to highly targeted analyses.

4. Full Data Management Capabilities

In addition to profiling, the app supports essential data management tasks. Users can insert, update, and delete records within Snowflake directly from the app, providing an all-in-one tool for both profiling and managing data.
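Record-level edits like these usually come down to parameterized DML. The sketch below builds INSERT, UPDATE, and DELETE statements with `?` placeholders so that values travel as bind parameters instead of being concatenated into the SQL text. The helper names and the `CUSTOMERS` table in the usage note are hypothetical and not taken from the app.

```python
import re
from typing import Dict, List, Tuple

# Simple unquoted identifiers only. Schema-qualified or quoted names would
# need extra handling that this sketch leaves out.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_$]*")

Statement = Tuple[str, List[object]]


def _ident(name: str) -> str:
    """Validate a table or column name; bind parameters cannot cover names."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def build_insert(table: str, record: Dict[str, object]) -> Statement:
    cols = ", ".join(_ident(c) for c in record)
    marks = ", ".join("?" for _ in record)
    sql = f"INSERT INTO {_ident(table)} ({cols}) VALUES ({marks})"
    return sql, list(record.values())


def build_update(table: str, changes: Dict[str, object],
                 key_column: str, key_value: object) -> Statement:
    assignments = ", ".join(f"{_ident(col)} = ?" for col in changes)
    sql = (f"UPDATE {_ident(table)} SET {assignments} "
           f"WHERE {_ident(key_column)} = ?")
    # Parameter order matches placeholder order: SET values, then the key.
    return sql, [*changes.values(), key_value]


def build_delete(table: str, key_column: str, key_value: object) -> Statement:
    sql = f"DELETE FROM {_ident(table)} WHERE {_ident(key_column)} = ?"
    return sql, [key_value]
```

Calling `build_update("CUSTOMERS", {"EMAIL": "a@x.com"}, "CUSTOMER_ID", 42)` returns the statement text and its parameter list. Recent Snowpark versions accept such a pair through `session.sql(sql, params=params)`, but treat that as something to verify against the version you run.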

5. Streamlit-Powered UI for Seamless Interaction

The app is built using Streamlit, which provides a clean, easy-to-use user interface. The UI allows users to interact with the app’s profiling and data management features without needing deep technical expertise. HTML-based reports generated by the app can be easily shared with stakeholders, offering clear and comprehensive data insights.

6. Ease in Generating and Sharing Profiling Reports

Once the data profiling is complete, the app generates a pre-signed URL that allows users to save and share the profiling reports. Here’s how it works:

  • Generating Pre-Signed URLs: The app creates a pre-signed URL to a file on a Snowflake stage using the stage name and relative file path. This URL provides access to the generated reports without requiring direct interaction with Snowflake’s internal storage.
  • Accessing Files: Users can access the files in the stage through several methods:
    • Navigate directly to the pre-signed URL in a web browser.
    • Retrieve the pre-signed URL within Snowsight by clicking on it in the results table.
    • Send the pre-signed URL in a request to the REST API for file support.
  • Handling External Stages: For files in external stages that reference Microsoft Azure cloud storage, the GET_PRESIGNED_URL function requires Azure Active Directory authentication; querying the function fails if the container is accessed using a shared access signature (SAS) token. The function uses this authentication to create a user delegation SAS token, relying on a storage integration object that stores a generated service principal.
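As a rough illustration of the first step, the helper below renders the query that asks Snowflake for a pre-signed URL from a stage name and a relative file path. `GET_PRESIGNED_URL` is the Snowflake function named above, with an optional expiration time in seconds; the helper name, the stage `PROFILING_REPORTS`, and the report path are hypothetical.

```python
import re

# Stage names may be qualified as db.schema.stage; quoted names are not handled.
_STAGE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_$]*(\.[A-Za-z_][A-Za-z0-9_$]*){0,2}")


def presigned_url_query(stage: str, relative_path: str,
                        expires_in_seconds: int = 3600) -> str:
    """Render a SELECT that requests a pre-signed URL for a staged file."""
    if not _STAGE_NAME.fullmatch(stage):
        raise ValueError(f"unsafe stage name: {stage!r}")
    if expires_in_seconds <= 0:
        raise ValueError("expiration must be a positive number of seconds")
    # Double any single quotes so the path stays one SQL string literal.
    path_literal = relative_path.replace("'", "''")
    return (f"SELECT GET_PRESIGNED_URL(@{stage}, '{path_literal}', "
            f"{int(expires_in_seconds)})")


query = presigned_url_query("PROFILING_REPORTS", "reports/customers.html")
print(query)
```

With a Snowpark session in hand, the URL itself could then be read with something like `session.sql(query).collect()[0][0]`; that call is shown as an assumption about usage, since the source does not include the app's code.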

7. Role-Based Usage

Different roles within an organization can use this app in various scenarios:

  • Data Analysts: Data analysts can use the app to profile datasets, identify inconsistencies, and understand data quality issues. They analyze patterns and relationships in the data and recommend the fixes needed to resolve errors such as missing values or outliers.
  • Data Stewards/Data Owners: After receiving insights from data analysts, data stewards or data owners can apply the suggested fixes to cleanse the data, ensuring it meets quality standards. They can make adjustments directly through the app by inserting, updating, or deleting records, ensuring the data is clean and accurate for downstream processes.

This collaborative approach between analysts and data stewards ensures that the data is high quality and reliable, supporting effective decision-making across the organization.

Final notes

Snowpark offers a novel approach to data profiling by bringing it into Snowflake’s native environment. This approach reduces complexity, enhances performance, and ensures security. Whether improving customer segmentation in retail, ensuring compliance in healthcare, or detecting fraud in finance, Snowflake Native Apps with Snowpark provide a timely solution for maintaining high data quality across industries.

For data engineers looking to address client pain points, this translates to:

  • Seamless Deployment: Easily deployable across teams or accounts, streamlining collaboration.
  • Dynamic UI: The Streamlit-powered UI provides an interactive dashboard, allowing users to profile data without extensive technical knowledge.
  • Flexibility: Supports profiling of both Snowflake tables and external files (e.g., CSV, JSON) in external stages like S3 or Azure Blob.

With upcoming features like AI-driven insights, anomaly detection, and hierarchical data modeling, Snowpark provides a powerful and flexible platform for maintaining data quality across industries, helping businesses make smarter decisions and drive better outcomes.

Copyright © 2024 Tiger Analytics | All Rights Reserved