Fuzzy Grouping – SSIS
Fuzzy Grouping Task
It is a data cleaning task which helps to group a set of similar or exact matching string based on the similarity scores.
Similarity scores are generated by a fuzzy algorithm that requires some resources at tempdb database. So, the SSIS user should have enough permission to create table in tempdb.
Only columns with the DT_WSTR and DT_STR data types can participate in fuzzy algorithm.
The canonical row has a score of 1. It is also consider to be the group match string. All others rows will be matching with this canonical row.
In the output result set, fuzzy algorithm adds 3 additional columns. They are
- _key_in column that uniquely identifies each row.
- _key_out column that identifies a group of duplicate rows. The _key_out column has the value of the _key_in column in the canonical data row. Rows with the same value in _key_out are part of the same group. The _key_outvalue for a group corresponds to the value of _key_in in the canonical data row.
- _score, a value between 0 and 1 that indicates the similarity of the input row to the canonical row.
Fuzzy grouping transformation shares the other property with fuzzy lookup transformation settings like MaxMemoryUsage, Exhaustive and Temporary Tables and Indexes.
In addition, it has some special column options in fuzzy grouping. They are FuzzyComparisonFlags and FuzzyType (Fuzzy match/exact match).
Create a table named Country with a column CName varchar(50). Add few rows are given below.
Step 1: Create a package and add a data flow task.
Step 2: Add ADO.Net task and configure with country table that you created in your local machine.
Step 3: Add Fuzzy Grouping Task and configure as given below
Open the Fuzzy Grouping editor: In the connection Manger, create OLE DB connection string for the temp table create for fuzzy algorithm.
Step 4: Add Union All task and connect with fuzzy grouping. Add Data viewer to view the output.