Project Overview
I designed and built an integrated data profiling and cleansing platform at a data services company. The solution automates data quality assessment by analyzing large datasets to identify anomalies, inconsistencies, and inaccuracies. It generates comprehensive data profiles and provides a suite of cleansing tools to improve data reliability and prepare it for migration or analytics.
The Challenge
The primary challenges in data preparation were:
- Late Discovery of Issues: Data quality problems were often found late in the migration lifecycle, leading to significant delays and rework.
- Manual and Inefficient Profiling: Manually profiling large datasets was a time-consuming, error-prone process that couldn’t scale.
- Inconsistent and Duplicate Data: Datasets from legacy systems suffered from a lack of standardization and contained numerous duplicates that required extensive manual intervention.
- Lack of Quality Metrics: There was no standardized way to measure data quality, making it difficult to determine if a dataset was ready for migration.
My Role
As the Senior Software Engineer on this project, I:
- Architected and developed the microservices-based infrastructure for the profiling and cleansing services.
- Implemented a high-performance, PySpark-based data processing engine for comprehensive data profiling.
- Designed and developed advanced cleansing algorithms to handle duplicates, null values, and data inconsistencies.
- Created a dynamic reporting module to visualize data quality metrics and profiling results.
Technical Solution
Automated Data Profiling Service
I developed a robust FastAPI service that leverages PySpark to perform deep analysis on large datasets. Key features of this service include:
- Comprehensive Statistical Analysis: Automatically generates key statistics, including min, max, mean, standard deviation, skewness, and kurtosis for numerical data, and cardinality and mean length for string data.
- Data Quality Indicators: Measures data completeness by identifying null and missing values, and assesses uniqueness by detecting duplicate records.
- Outlier Detection: Implements the Interquartile Range (IQR) method to automatically identify and flag outliers in numerical columns.
- Dynamic HTML Reporting: Generates interactive HTML reports with detailed statistics and visualizations for each column in the dataset.
This service reduced the time required for data profiling from days to hours, achieving a 90% improvement in efficiency over manual methods.
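A minimal sketch of how this kind of PySpark-based profiling can be structured; the function, column names, and input path are illustrative assumptions, not the production code:

```python
# Illustrative PySpark profiling sketch: summary statistics, null counts,
# and IQR-based outlier detection for a single numeric column.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("profiling-sketch").getOrCreate()
df = spark.read.parquet("/data/customers")  # placeholder input path

def profile_numeric_column(df, col):
    """Collect basic statistics plus IQR-based outlier counts for one column."""
    stats = df.select(
        F.min(col).alias("min"),
        F.max(col).alias("max"),
        F.mean(col).alias("mean"),
        F.stddev(col).alias("stddev"),
        F.skewness(col).alias("skewness"),
        F.kurtosis(col).alias("kurtosis"),
        F.count(F.when(F.col(col).isNull(), 1)).alias("null_count"),
        F.count(F.lit(1)).alias("row_count"),
    ).first().asDict()

    # IQR outlier rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    q1, q3 = df.approxQuantile(col, [0.25, 0.75], relativeError=0.01)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    stats["outlier_count"] = df.filter(
        (F.col(col) < lower) | (F.col(col) > upper)
    ).count()
    stats["completeness"] = 1 - stats["null_count"] / stats["row_count"]
    return stats

print(profile_numeric_column(df, "order_amount"))
```

In the actual service, per-column results like these feed both the HTML reports and the quality scoring described below.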
Intelligent Data Cleansing Service
I built a powerful Flask-based service that offers a range of configurable, rule-based cleansing tools to address common data quality issues:
- Duplicate Detection: Implements algorithms for identifying duplicates through both exact and fuzzy matching, using libraries like fuzzywuzzy for advanced string comparison.
- Null Value Handling: Provides multiple strategies for managing null values, including removal, replacement with a default, or imputation.
- Data Standardization: Includes modules for enforcing consistent data formats, checking data validity (e.g., email and date formats), and recognizing data patterns.
- Interactive Rule Builder: Created a user-friendly interface that allowed business users to define their own custom cleansing rules without writing code.
The cleansing service automated the processing of over 10 million records, resulting in an 80% improvement in data consistency.
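A simplified sketch of the cleansing primitives described above; the rule schema, function names, and thresholds are illustrative, and the naive pairwise fuzzy comparison shown here would need blocking or batching to handle the record volumes mentioned:

```python
# Illustrative rule-driven cleansing sketch (not the production implementation).
import pandas as pd
from fuzzywuzzy import fuzz

def drop_fuzzy_duplicates(df, column, threshold=90):
    """Keep the first of any group of rows whose values in `column` match
    with a fuzzywuzzy token_sort_ratio at or above `threshold`."""
    keep, kept_values = [], []
    for idx, value in df[column].astype(str).items():
        if any(fuzz.token_sort_ratio(value, seen) >= threshold for seen in kept_values):
            continue  # treat as a fuzzy duplicate of an already-kept row
        keep.append(idx)
        kept_values.append(value)
    return df.loc[keep]

def handle_nulls(df, column, strategy="drop", default=None):
    """Apply one of several configurable null-handling strategies."""
    if strategy == "drop":
        return df.dropna(subset=[column])
    if strategy == "default":
        return df.fillna({column: default})
    if strategy == "mean_impute":
        return df.fillna({column: df[column].mean()})
    raise ValueError(f"Unknown strategy: {strategy}")

# Example of the kind of rule set a business user could assemble in the rule builder
rules = [
    {"op": "handle_nulls", "column": "email", "strategy": "drop"},
    {"op": "drop_fuzzy_duplicates", "column": "company_name", "threshold": 92},
]
```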
Data Quality Scoring Framework
I designed a scoring system to provide a clear and concise measure of data quality at both the column and table levels:
- Column Health Score: Developed a composite health score for each column based on its completeness, uniqueness, and outlier metrics.
- Overall Quality Score: Calculated an overall data quality score for each table by averaging the health scores of its columns.
- Health Classification: Categorized columns as “good,” “medium,” or “problematic” to help data stewards prioritize their data quality efforts.
This scoring framework provided at-a-glance metrics that enabled faster and more informed go/no-go decisions for data migration projects.
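A condensed sketch of the scoring idea; the weights and classification thresholds are illustrative assumptions rather than the framework's actual values:

```python
# Illustrative composite scoring sketch; weights and thresholds are assumed.
def column_health_score(completeness, uniqueness, outlier_ratio,
                        weights=(0.4, 0.3, 0.3)):
    """Weighted blend of completeness, uniqueness, and (1 - outlier ratio), scaled 0-100."""
    w_c, w_u, w_o = weights
    score = 100 * (w_c * completeness + w_u * uniqueness + w_o * (1 - outlier_ratio))
    return round(score, 1)

def classify(score):
    """Map a health score to the 'good' / 'medium' / 'problematic' buckets."""
    if score >= 80:
        return "good"
    if score >= 50:
        return "medium"
    return "problematic"

def table_quality_score(column_scores):
    """Overall table score as the average of its column health scores."""
    return round(sum(column_scores) / len(column_scores), 1)

cols = {
    "email": column_health_score(0.98, 0.95, 0.00),
    "signup_date": column_health_score(0.70, 0.40, 0.12),
}
print({name: (score, classify(score)) for name, score in cols.items()})
print("table score:", table_quality_score(list(cols.values())))
```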
Technologies Used
- Backend: Python, FastAPI, Flask, PySpark
- Reporting: Jinja2, HTML/CSS
- Infrastructure: Docker, Gunicorn
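The reporting layer renders profiling results to HTML with Jinja2; below is a minimal, hypothetical sketch of that rendering step (the template structure and field names are assumptions):

```python
# Hypothetical Jinja2 rendering sketch for the HTML profile reports.
from jinja2 import Template

REPORT_TEMPLATE = Template("""
<html><body>
  <h1>Data Profile: {{ table_name }}</h1>
  {% for col, stats in column_profiles.items() %}
    <h2>{{ col }}</h2>
    <ul>
      {% for key, value in stats.items() %}
        <li>{{ key }}: {{ value }}</li>
      {% endfor %}
    </ul>
  {% endfor %}
</body></html>
""")

def render_report(table_name, column_profiles, path="profile_report.html"):
    """Write a simple per-column statistics report to an HTML file."""
    html = REPORT_TEMPLATE.render(table_name=table_name, column_profiles=column_profiles)
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)

render_report("customers", {"order_amount": {"mean": 120.4, "null_count": 3}})
```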
Results and Impact
The platform delivered substantial improvements to the data preparation process:
- 90% Reduction in the time and effort required for manual data profiling.
- 80% Improvement in data consistency and accuracy after applying cleansing rules.
- Successfully Processed datasets with up to 50 million rows, demonstrating scalability.
- Enabled Proactive Data Quality Management, leading to a significant reduction in downstream ETL failures and migration delays.
Lessons Learned
This project highlighted several key aspects of building effective data quality solutions:
- Configurability is Key: Cleansing rules must be highly configurable to adapt to the unique requirements of different data domains.
- Performance vs. Accuracy Trade-offs: It is crucial to balance performance with accuracy, especially when implementing computationally intensive algorithms like fuzzy matching.
- The Importance of Visualization: Clear and intuitive visual reports are essential for effectively communicating data quality issues to business stakeholders.