Fuzzy Lookup – Approximate to Exact Search – SSIS
Fuzzy logic is a theory and concept that deals with the approximate values rather than the exact or fixed values. In general, fuzzy value is in between 0 and 1.
Fuzzy Lookup SSIS
Fuzzy lookup transformation is data cleaning task that helps to clean the incoming data with the reference table with the actual value. This transformation tries to find the exact or similar value as a result. The result data set is also depends on the fuzzy matching configuration in the fuzzy lookup transformation task.
Fuzzy lookup task will be more helpful when you have data typo issues in the source data.
Fuzzy lookup task has one input and one output pipeline. input data type must be DT_WSTR or DT_STR in fuzzy matching
Customize Fuzzy lookup transformation
It represents the amount of memory that the Fuzzy Lookup transformation uses while executing SSIS package.
If you set the Exhaustive property to True, this transformation compares every row in the input to every row in the reference table. But, it may slower the package performance if the reference dataset is large. This method gives more accurate match
Caching of Indexes and Reference Table
WarmCaches property to True, enables the transformation partially caches the index and reference table in memory.
Temporary Tables and Indexes
Fuzzy Lookup transformation creates temporary objects, such as tables and indexes in the SQL Server TempDB. So, make sure that the SSIS user account has sufficient access to the database engine to create and maintain this temporary table.
Also, Fuzzy Lookup transformation maintains the match index—that is, if MatchIndexOptionsis set to GenerateAndMaintainNewIndex—the transformation may lock the reference table for the duration of the data cleaning operation and prevent other users and applications from accessing the table. There are other options available and it is self Descriptive.
Working with Fuzzy lookup
Fuzzy lookup transformation has 3 features.
1) Defining maximum number of matches to return to output – It starts with 1 and that is the recommended.
2) Token delimiters – It has a set of predefined delimiters and we can also add our’s
3) Similarity score – It is the fuzzy algorithm input to match the score with the input row and reference row.
this value is between 0 and 1. higher the value is the accurate the result. It is usually .60 is the best value for similarity score.
Summary: this example will show you how to use fuzzy lookup transformation to find country name lookup for the give set of country names.
So, Create a table name Country and add a column CName nvarchar(50). Add few rows with country names. to make this example interesting remove some letters from the country name you have in your table.
The reference table is the adventureworks.Person.CountryRegion.
My source table is as follows
My package looks like this
Union All task is used as a dummy task to hold the fuzzy lookup output. I have added data viewer to see the resultset which flows from the fuzzy lookup to union all task.
Step 1: Create a package and add Ado.net source, fuzzy lookup and union all transformation into a data flow layout.
Step 2: configure ado.net with our source table and connect to fuzzy lookup transformation.
step 3: Open fuzzy lookup transformation and configure with the fuzzy setting and reference table.
Reference Table section also has the reference table and index table creation and maintenance option. You can change this setting during your development and explore the interesting stuff about it.
Move to column tab and map the source column with reference table lookup column.
Move to Advance tab to configure fuzzy settings
now connect the Fuzzy lookup transformation output pipeline to Union all transformation and add data viewer control to the pipeline.
Step 4: Execute the package.
Fuzzy algorithm estimated the source and lookup column similarity and confidence as given in the result picture. This result changes if we change the similarity threshold.
Question for you. Find out the similarity and confidence score for Malaysia and put your answer in comment session.
Thanks for reading.