Home > SSIS > Fetch sample dataset from a huge dataset

Fetch sample dataset from a huge dataset


Dataset is a collection of data. We may have huge volume of data related to any business, research output, statistics and etc. But, experiments need a sample data to validate the business logic or algorithm.

Percentage Sampling Transformation

Percentage Sampling Transformation fetches a percentage amount of the data from the data source input as a sampling dataset. This transformation has one input and two output pipeline. Data that is part of sampling will be gathered using a output pipeline and the rest of the data directed other one.

This transformation has an internal algorithm which selects the sample row. Though this is an algorithm that work for us to select the rows, we may get different number of rows in each execution. For instance, if we have 100,00 rows in the data source and we define 10% of sample data then the task may fetch >= 1000 or <= 1000 rows.

It also has the option define the random seed value for the random numbers generation or it will use the default random number generator, which uses system time ticks.

If we define a seed value in this task, it gives us a set of sample data in every execution based of the seed value. If we need different dataset then we have to change the seed value else the same data set will be coming out from the task.

Implementation

Step 1: Create a package. Add a Data Flow task.

Step 2: Go to Data Flow environment and add a data source task with a database connection. Now, Select a table for data sampling.

Step 3: Add Percentage Sampling task and connect the data source task with it.

Step 4: Open Percentage Sampling editor and define the seed value. Click OK.

Step 5: Add union all task and connect Percentage Sample task with it. Now, add a data viewer between these two tasks to see the output during package execution.

Step 6: Execute package. This sample output is generated from Adventure works.sales.customer table.

Play with Percentage Sample Transformation

Game 1: If you execute the package again, you will get the same output until you change the random seed value.

Game 2: If you execute the package without random seed value. Package will retrieve different dataset for each execution.

MSDN link

Thanks for reading.

Advertisements
  1. June 10, 2014 at 1:36 am

    Nice post. I learn new things and challenging on websites
    I stumbleupon every day. It will certainly often be helpful to
    learn content off their authors and practice a tiny bit
    something of their web sites.

  1. May 24, 2014 at 12:22 am

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: