Home > SSIS > Fuzzy Grouping – SSIS

Fuzzy Grouping – SSIS


Fuzzy Grouping Task

It is a data cleaning task which helps to group a set of similar or exact matching string based on the similarity scores.

Similarity scores are generated by a fuzzy algorithm that requires some resources at tempdb database. So, the SSIS user should have enough permission to create table in tempdb.

Only columns with the DT_WSTR and DT_STR data types can participate in fuzzy algorithm.

The canonical row has a score of 1. It is also consider to be the group match string. All others rows will be matching with this canonical row.

In the output result set, fuzzy algorithm adds 3 additional columns. They are

  1. _key_in column that uniquely identifies each row.
  2. _key_out column that identifies a group of duplicate rows. The _key_out column has the value of the _key_in column in the canonical data row. Rows with the same value in _key_out are part of the same group. The _key_outvalue for a group corresponds to the value of _key_in in the canonical data row.
  3. _score, a value between 0 and 1 that indicates the similarity of the input row to the canonical row.

Fuzzy grouping transformation shares the other property with fuzzy lookup transformation settings like MaxMemoryUsage, Exhaustive and Temporary Tables and Indexes.

In addition, it has some special column options in fuzzy grouping. They are FuzzyComparisonFlags and FuzzyType (Fuzzy match/exact match).

Implementation

Create a table named Country with a column CName varchar(50). Add few rows are given below.

Step 1: Create a package and add a data flow task.

Step 2: Add ADO.Net task and configure with country table that you created in your local machine.

Step 3: Add Fuzzy Grouping Task and configure as given below

Open the Fuzzy Grouping editor: In the connection Manger, create OLE DB connection string for the temp table create for fuzzy algorithm.

Move to column tab and select the fuzzy grouping column and set the column options like MatchType

Move to Advance tab and can change the fuzzy additional column names _key_in, _key_out and _score. Also, we can change the similarity threshold for the fuzzy match.

Click ok

Step 4: Add Union All task and connect with fuzzy grouping. Add Data viewer to view the output.

Thanks for reading

About these ads
  1. No comments yet.
  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 304 other followers

%d bloggers like this: