
Police Analyst

Crime & Intelligence Analysis Tips & Tricks


The Art of Filtering Duplicates in Excel


Posted on December 28, 2012 by josephglover

Every day analysts export data from myriad sources in an effort to consolidate raw data into usable intelligence. To make good decisions it is important that the data be as accurate as possible, and one aspect of accuracy is ensuring that our information contains no duplicate data. When working with only a couple dozen rows of data, sussing out duplicates is easy, but when you have thousands of rows visual inspection just doesn't cut it and it becomes necessary to employ some automated techniques. In this blog post I'm going to detail some of the Excel tools and strategies available for handling duplicate data.

The Basic Example

The simplest example is the straight removal of duplicate rows. Below is a stripped-down data set with 3 columns and 5 rows of data.

Clearly the fourth and fifth rows are duplicates. To filter this data we'll use Excel's built-in Remove Duplicates tool, which can be found on the Data ribbon in the Data Tools section (see screen grab below).

To use the tool, first highlight the data in the spreadsheet and then click the Remove Duplicates button; this will call up the Remove Duplicates window. There are some options here, but in the simple case the default values are sufficient. Click the OK button and Excel will delete the duplicate rows, leaving only unique rows.
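For analysts who also work in Python, the same one-click behaviour can be reproduced with pandas. Below is a minimal sketch; the column names and values are made up for illustration since the original screenshots aren't reproduced here:

```python
import pandas as pd

# A small table standing in for the 3-column, 5-row example;
# these occurrence numbers and addresses are hypothetical.
df = pd.DataFrame({
    "Occurrence": ["12-1001", "12-1002", "12-1003", "12-1004", "12-1004"],
    "Address":    ["1 Main St", "2 Oak Ave", "3 Elm Rd", "4 Pine Ct", "4 Pine Ct"],
    "Type":       ["Theft", "Assault", "Theft", "Mischief", "Mischief"],
})

# Equivalent of Remove Duplicates with every column checked:
# a row is dropped only when ALL of its values match an earlier row.
unique_rows = df.drop_duplicates()
print(len(unique_rows))  # 4: the two identical rows collapse into one
```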

Filtering on Columns

In the screen shot of the Remove Duplicates window you can see that each of the data columns has a checkbox beside it. These checkboxes allow you to search for duplicates using only the checked columns as the matching criteria. This may seem a bit esoteric, so an example will help demonstrate how useful filtering on certain columns can be. In the screen shot below is a data set that contains some occurrence numbers and addresses connected with the occurrences.

Imagine that from this data set we want only the unique occurrence numbers and we don't care about the street addresses. To get just the numbers, again highlight all the columns and click the Remove Duplicates button. But this time, when the Remove Duplicates window comes up, uncheck the Address and Type checkboxes and click OK.

The result is that Excel will filter out three rows and leave two, one for each of the occurrence numbers in the data set.
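The pandas analogue of checking only some columns is the `subset` argument. Again the data below is hypothetical, arranged to match the outcome described above (five rows in, two out):

```python
import pandas as pd

# Hypothetical data set: two occurrence numbers, several address rows each.
df = pd.DataFrame({
    "Occurrence": ["12-2001", "12-2001", "12-2001", "12-2002", "12-2002"],
    "Address":    ["10 King St", "12 King St", "14 King St", "7 Queen St", "9 Queen St"],
    "Type":       ["Theft", "Theft", "Theft", "Assault", "Assault"],
})

# Equivalent of leaving only the Occurrence box checked in the
# Remove Duplicates window: rows match on that column alone, and
# the first row for each occurrence number is kept.
unique_occurrences = df.drop_duplicates(subset=["Occurrence"])
print(len(unique_occurrences))  # 2: one row per occurrence number
```

Note that, just as in Excel, the other columns aren't discarded; the surviving rows simply carry whatever Address and Type values the first-seen row happened to have.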

Filtering Sequenced Data

One question you may have about filtering by specific columns is: how does Excel decide which rows to keep and which rows to delete? It's actually pretty straightforward: Excel keeps the first row it finds and dumps the rest, and we can exploit this behaviour to achieve a helpful result. Have a look at the data set below, which lists three occurrences and a series of disposition shifts for each occurrence.

You'll notice that over time the disposition of an occurrence shifts as new information comes in. For example, many occurrences start as a Default but shift to Report To Follow (RTF) or Unfounded (UNF) as the incident evolves. This is important because when calculating crime stats it's crucial to know the final disposition for an occurrence so that it can be properly reported (especially unfoundeds). To get the last recorded disposition we can first sort the data by Occurrence ascending and then by DateTime descending. If we then highlight the data and use Remove Duplicates (with only the Occurrence column checked) we can exploit Excel's programming to remove all rows except the first one for each occurrence number. This, very conveniently, leaves us with only the final disposition for each occurrence number, thanks to our clever sorting of the data.
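The sort-then-dedupe trick translates directly to pandas. This sketch uses invented occurrence numbers and timestamps, but the disposition codes (Default, RTF, UNF) are the ones discussed above:

```python
import pandas as pd

# Hypothetical disposition history: each occurrence may change
# disposition over time, and only the latest entry matters for stats.
df = pd.DataFrame({
    "Occurrence":  ["12-3001", "12-3001", "12-3002", "12-3002", "12-3003"],
    "DateTime":    pd.to_datetime([
        "2012-12-01 08:00", "2012-12-02 14:30",
        "2012-12-01 09:15", "2012-12-03 11:00",
        "2012-12-02 16:45",
    ]),
    "Disposition": ["Default", "RTF", "Default", "UNF", "Default"],
})

# Sort Occurrence ascending and DateTime descending so the newest
# entry sits first within each occurrence, then keep the first row
# per occurrence number -- the same trick as in Excel.
latest = (
    df.sort_values(["Occurrence", "DateTime"], ascending=[True, False])
      .drop_duplicates(subset=["Occurrence"])
)
print(latest["Disposition"].tolist())  # ['RTF', 'UNF', 'Default']
```

pandas also accepts `keep="last"`, which would let you skip the descending sort entirely; the version above deliberately mirrors the Excel steps.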

Rolling Your Own

All of the above examples use Excel's built-in Remove Duplicates tool, but it is possible to identify duplicates manually. Why would you want to do this? Primarily because Excel deletes duplicate rows and doesn't give you the opportunity to review them. Let's look at an example using the data from the Filtering Sequenced Data example. The first thing to do is sort the data based on the duplicate column. While it is possible to use multiple columns (which requires multiple sorts), this example will just filter on the Occurrence column. Look in the screen shot below and you'll notice that I added a new column and named it Dupe. Also notice that in the first cell under Dupe (cell B2 in this example) I put a 0, while in the subsequent cells I used Excel's IF function to check the current row's occurrence number against the previous one. As can be seen in the function bar of the screen shot, the formula in cell B3 is =IF(A3=A2,1,0).

What this function is doing is examining the occurrence number in cell A3 and determining whether it is the same as the occurrence number in cell A2. Recall that the IF function evaluates a statement (in this case whether A3=A2): if the statement is TRUE it returns the second input (1 in this case) and if it is FALSE it returns the third input (0 in this case). With the statement in place in B3, I then filled the function down to B6 and Excel intelligently populated each cell of Dupe as in the screen shot above. You'll notice that the IF functions have accurately identified which rows are duplicates by putting a 1 in the Dupe column. Just like with Excel's Remove Duplicates tool, only subsequent records are flagged as duplicates. This means that if you sort the data by both Occurrence and DateTime you can obtain the same results as in the Filtering Sequenced Data section. One thing to note, though: if you re-sort your data the Dupe column will break. This is because the IF function always checks the current row against the previous one, and if the occurrence numbers are not in the proper order the results won't make any sense. To lock the results in you should copy the Dupe column and paste it as values. This will replace the function calls with their respective 1s and 0s, and now you can sort the data just like you otherwise would. I occasionally use this method to first review the duplicates and then sort on Dupe so that I can delete all the rows with a 1. The end result is a list of unique rows.

Wrap Up

Filtering duplicates is a fact of analytical life, and it's important to be proficient at culling excess data so that stats aren't double counted. I hope this post was helpful in identifying the various ways to get a handle on dupes.
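As a final footnote for Python users, the manual Dupe-column technique described in Rolling Your Own can be mimicked in pandas with a shifted comparison; the occurrence numbers below are hypothetical:

```python
import pandas as pd

# Hypothetical occurrence numbers, sorted on the key column first,
# just as the manual method requires.
df = pd.DataFrame({
    "Occurrence": ["12-3001", "12-3001", "12-3002", "12-3002", "12-3003"],
}).sort_values("Occurrence").reset_index(drop=True)

# Mirror of the =IF(A3=A2,1,0) formula: flag a row with 1 when its
# occurrence number matches the row above. shift() makes the first
# comparison come up False, so the first row gets a 0 automatically.
df["Dupe"] = (df["Occurrence"] == df["Occurrence"].shift()).astype(int)
print(df["Dupe"].tolist())  # [0, 1, 0, 1, 0]

# After reviewing the flags, keep only the unflagged rows.
unique = df[df["Dupe"] == 0]
```

Unlike the worksheet formula, these flags are already plain values, so there is no copy-and-paste-as-values step; if you re-sort the data, just recompute the column.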
This entry was posted in Excel Tricks and tagged IF, Remove Duplicates. Bookmark the permalink.
