Thursday, August 29, 2013

Handling Duplicate Data


Handling Duplicate Data is well-known to be a Vexing Problem for many Excel users. 

There are tools within Excel that simply delete all duplicate data, but there are times when it is more advisable to Identify these Duplicate Values prior to eliminating them.

For instance, let's say you are working with Combined and Blended human resources data that lists employees with their home addresses. If the information in the blended data includes Duplicates of employees who have had updates, and shows different addresses for these duplicates, it would be good to have an Expedient Way of doing identifying them before taking action.

Conditional Formatting can offer an excellent solution:

1.   Select the range (e.g. A1:A98).

2.   Choose Format / Conditional Formatting to display the CF dialog box

3.   Select Formula Is from the first drop-down list box, and enter =COUNTIF($A$1:$A$98,A1)>1 in the second box

4.   Click the Format button to bring up the Format Cells dialog box.

5.   Select the Patterns tab, and choose a background color

6.   Click OK twice to return to your worksheet

Bamm! If the range contains any duplicate entries, they will be highlighted with the background color you chose previously!

Removing Duplicates in databases is one of the most infamous topics for many Excel professionals. Being able to do this, enables a user to Combine Databases with repetitive data into a properly structured format.

With the last three versions of Excel (2007, 2010, and 2013), this is Remarkably Easy. Here is all you have to do to complete this simple task:

1.  Select the entire database (be careful not to select any self-generating key fields)

2.  Go to the Data tab in the Data Tools group and click Remove Duplicates

3.  Select the column(s) on which you want to base your removal and click OK

4.  A message will pop up telling you how many (if any) values were found and removed

At first glance Removing Duplicates to maintain the Integrity of your data may seem like a challenging task. As is the case with nearly everything in Excel, however, once you know how to do it, it seems so Comparatively Simple, you may wonder why you ever considered it a problem…

1 comment:

thedmv said...

In my case I have a large excel file with weather data recorded at different hours for different stations. However, when the stations report their values for a given hour, there may be two or more values given for that same hour. The file may look something like this (Temp=Temperature):

Station,Date,Time,Temp
A,Aug012013,1:00 PM, 78
A,Aug012013,1:00 PM, 77
A,Aug012013,1:00 PM, 79

So I have duplicate Station, Date, and Times but not temperatures.

My goal is that for every duplicate hour reading, all I want to save is the 2nd result (in the example this would be the 77 deg F row).

Can you use the "Remove Duplicate" feature to select which duplicate row you want?