
A WEB-BASED TOOL FOR ANALYSIS OF CRIME LABORATORY DATA

by

Ramesh Annabathula

Thesis submitted to the Faculty of Geomatics and Space Application at CEPT University in partial fulfillment of the requirements for the degree of

Master of Science in ENTERPRISE GEOMATICS

Dr. Anjana Vyas

Department of Geomatics and Space Application, CEPT UNIVERSITY, AHMEDABAD

ABSTRACT

A WEB-BASED SOFTWARE TOOL FOR ANALYSIS OF CRIME LABORATORY DATA

The focus of this project was the development of a software tool for the analysis of crime data at the West Virginia State Police Forensics Laboratory (WVSPFL). The tool enables users to conduct statistical analysis of criminal activity; in addition, data mining methods are implemented to identify possible crime patterns. A crime database was implemented; it contains information on incidents submitted to WVSPFL over the last fifteen years. The WVSPFL database was integrated with a hypothetical conviction database. Such integration enables law enforcement agencies to track crime from occurrence to conviction. It is anticipated that the data mining methods will assist law enforcement agencies in anticipating possible criminal activity. The software front-end was implemented in ASP.NET 2.0, Microsoft's latest technology for web application development, with SQL Server 2005 as the database.

ACKNOWLEDGEMENT
I would like to thank my advisor Dr. Rashpal S. Ahluwalia for his continued support, guidance and encouragement during the course of this research. I also wish to thank my committee members Dr. Robert C. Creese and Dr. Arun A. Ross for their valuable advice and support. Special thanks to my student colleagues who helped me in many ways. Finally, words alone cannot express the thanks I owe to my parents and Prasanthi, my wife, for their constant support and blessings in enabling my success and happiness in all my pursuits and endeavors in life. I would also like to thank the department of Industrial and Management Systems Engineering (IMSE) at West Virginia University for giving me a chance to pursue my higher education.


TABLE OF CONTENTS
ABSTRACT
ACKNOWLEDGEMENT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1: INTRODUCTION
1.1 Information Management System
1.2 Web-Based Information Management
1.3 Crime Statistics
1.4 Data Mining
1.5 Problem Statement
CHAPTER 2: LITERATURE REVIEW
2.1 Existing Data Mining Tools
2.2 Forensic Information Management Systems
CHAPTER 3: SYSTEM DESIGN AND IMPLEMENTATION
3.1 System Design
3.2 FIMS 3.0 System Architecture
CHAPTER 4: USER INTERFACE
4.1 User Interface
4.2 System Administrator Methods
4.3 WVSPFL Agency
4.3.1 WVSPFL Agency Administrator Methods
4.3.2 WVSPFL Evidence Handling Division
4.3.2.1 Central Evidence Receiving Unit
4.3.2.2 Central Evidence Processing Unit
4.3.3 WVSPFL Evidence Processing Methods
4.3.3.1 Bio-Chemistry Unit Methods


4.4 Data Analysis Agency Interface
4.4.1 Data File Management and Methods
4.4.2 Incident Statistics
4.4.3 Data Mining and Methods
4.4.4 Data Visualization Interface
CHAPTER 5: DATA ACCESS TIER
5.1 Data Access
5.2 System Administrator Data Access Methods
5.3 Data Analysis Methods
5.3.1 Statistical Method
5.3.2 Data Mining Methods
5.3.2.1 Classification Algorithm
5.3.2.2 Association Algorithm
5.3.2.3 Clustering Algorithm
CHAPTER 6: DATABASE DESIGN AND IMPLEMENTATION
6.1 Database
6.2 Database Schema
6.3 FIMS 3.0 Data Tables
6.3.1 Incident Table
6.3.2 Suspect Physical Table
6.3.3 Suspect Variable Table
6.3.4 Verdict Table
6.3.5 Victim Physical Table
6.3.6 Victim Variable Table
6.3.7 Agency Information Table
6.3.8 Convict Table
6.4 Sorted Test Data Table
6.5 Assumptions
CHAPTER 7: APPLICATION STUDY
7.1 FIMS Homepage
7.2 Login Tab

7.3 System Administrator Tab
7.4 Agency Login Tab
7.5 Contact Us Tab
7.6 Data Analysis with Mining Methods
7.6.1 Data File Management
7.6.2 Data Mining/Data Visualization
7.6.3 Incident Statistics
7.7 Building a Data Mining Model
7.7.1 Defining the Problem
7.7.2 Preparing Data
7.7.2.1 Data Analysis and Query
7.7.2.2 Query Design
7.7.2.3 Query Implementation
7.7.3 Exploring Data
7.7.4 Building Models
7.8 Application Study for WVSPF Lab Dataset
7.8.1 Crime Data Mining Techniques
7.8.2 Data Mining Techniques for Crime Type
7.8.3 Training Set
7.8.4 Estimating Predictive Accuracy
7.8.5 Comparing Classification Methods
7.8.6 Crime Data Decision Trees
7.8.7 Rule Sets
CHAPTER 8: CONCLUSION AND FUTURE WORK
8.1 Conclusion
8.2 Future Work
REFERENCES
APPENDIX A: Training Data Set for Crime Data
APPENDIX B: Test Data Set for Crime Data


LIST OF FIGURES
Figure 2.1: FIMS 2.0 Architecture
Figure 2.2: Home page of FIMS 2.0 application
Figure 4.1: User Interface or Presentation Layer
Figure 4.2: System Administrator Class and Methods
Figure 4.3: Agency Administrator Class and Methods
Figure 4.4: CER Unit Methods
Figure 4.5: CEP Unit Methods
Figure 4.6: Data Query Building Wizard User Interface
Figure 4.7: List of Methods in Data Query Building Wizard Class
Figure 4.8: Converting SQL Query Results to XML Methods User Interface
Figure 4.9: Data File Management User Interface Class and Methods
Figure 4.10: Incident Statistics Class and Methods
Figure 4.11: Selecting Data Mining Methods
Figure 4.12: Using Auto Select Option While Choosing Algorithm
Figure 4.13: Data Mining User Interface Class and Methods
Figure 4.14: Graphical Output Data Visualization Decision Tree Algorithm for Iris Data Set
Figure 4.15: Data Visualization Interface Classes
Figure 5.0: Data Access Logic Data Flow
Figure 5.1: System Administrator Class and Methods
Figure 5.2: User Authentication Classes
Figure 5.3: Business Layer Class with Methods
Figure 5.4: Histogram for Age and Number of Suspects for Committing Crime
Figure 5.5: Sample Linear Regression Graph
Figure 5.6: Sample Time Series Graph
Figure 5.7: Sample Cluster Diagram
Figure 5.8: Sample Scatter Graph for Cluster Groups
Figure 7.1: FIMS home page


Figure 7.2: Login Tab Home Page
Figure 7.3: System Administrator Home Page
Figure 7.4: Agency Search with Agency Name and Location or AgencyID
Figure 7.5: Sample Agency (WVSP) Home Page and Login Screen
Figure 7.6: Contact Us Page
Figure 7.7: Data Mining Home Page
Figure 7.8: Data File Management Home Page
Figure 7.9: Data Mining Home Page
Figure 7.10: Incident Statistics Home Page
Figure 7.11: SQL Query Building Wizard
Figure 7.12: Query Data Extraction and Creating Data File
Figure 7.13: ARFF File Viewer
Figure 7.14: Basic Crime Statistics and Graphical Data Visualization
Figure 7.15: Machine Learning Algorithms
Figure 7.16: Prediction Performance Comparisons


LIST OF TABLES
Table 5.1: Input Data Format for Time Series Model
Table 6.1: Incident Table Fields
Table 6.2: Suspect Physical Table Fields
Table 6.3: Suspect Variable Table Fields
Table 6.4: Verdict Table Fields
Table 6.5: Victim Physical Table Fields
Table 6.6: Victim Variable Table Fields
Table 6.7: AgencyInfo Table Fields
Table 6.8: Convict Table Fields
Table 6.9: Test Data Table Fields
Table 7.1: List of Data Mining Algorithms
Table 7.2: Partial Data Set for the Crime Analysis
Table 7.3: Possible Known Class Set for Test Data
Table 7.4: Prediction Accuracy
Table 7.5: Correctly/Incorrectly Classified Instances for Crime Data from Weka
Table A1: Training Data Set for Crime Data
Table B1: Test Data Set for Crime Data


CHAPTER 1: INTRODUCTION

1.1 Information Management System

In the modern age, every aspect of management relies on information. Information is an important resource needed to develop other resources, and it is believed that information is power [1]. Changing circumstances and environments have necessitated proper dissemination of information at various levels of management. The development and use of information management systems is a modern phenomenon concerned with the use of appropriate information that will lead to better planning, better decision making, and better results [1]. Technology is changing the way in which information is captured, processed, stored, disseminated, and used. Information needs to be properly managed to ensure its cost-effective use. The rapid evolution of computer technology is expanding man's desire to obtain computer assistance in solving increasingly complex problems. The need to access information conveniently, quickly, and economically makes it imperative to devise procedures for the creation, management, and utilization of databases in organizations. Information systems are generally comprised of the following functional elements:

Perception - Initial generation or capture of data.
Recording - Physical entry of data.
Processing - Transformation according to the specific needs of the organization.
Transmission - Data flow in an information system.
Storage - Storage of data for expected future use.
Retrieval - Retrieval of stored data.
Information - Creation of useful information from the data.
Presentation - Reporting, communication, and presentation of information.
Decision making - The cognitive process leading to the selection of a course of action among alternatives. Every decision-making process produces a final choice, which can be an action or an opinion.

1.2 Web-Based Information Management

Recent advances in web technology have shifted the focus of information technology to the World Wide Web (WWW). More users have access to the WWW, and information providers are able to store information of various types on the web easily. The web has now become one of the most important media for information transmission. Web-based systems maximize information sharing and enable comprehensive management of data [2]. Information systems using web technology are prevalent throughout the world. Applications that have emerged on intranets and the internet using web technologies can be referred to as web-based information management systems [3]. However, a web-based information management system should be distinguished from a normal web application; a web page simply presents information to the user. A web-based information management system can be defined as a system that not only provides information to users but also proactively interacts with them to help in their tasks [4]. Web-based information management systems take coordination, communication, and collaboration to a new level. One of the key assets of a web-based information management system is its ability to make a large volume of data and information accessible to users in a wide variety of circumstances [5]. A web-accessible device is all that is needed to view the information. Traditional information management systems focus on querying, reporting, and analyzing data related to business transactions. Web-based information systems go beyond this functionality by integrating different media for knowledge representation, thereby supporting not only the creation, integration, analysis, and distribution of structured information, but also the storage and transfer of knowledge. While web-based systems do not conceptually differ from the respective traditional systems, they allow for more advanced forms of information management than traditional information systems, such as paper tapes and punch cards [6]. Advancements in web technology have made it possible to move documentation and data management away from paper-based systems. New web programming languages take advantage of high-end database technology to deliver the promise of a paperless society. A web-based information management system manages and stores data in an electronic format; any type of record or documentation that is needed can be completed online through the web browser [7].

1.3 Crime Statistics

The National Crime Victimization Survey (NCVS) is the Nation's primary source of information on criminal victimization. Each year, data are obtained from a nationally representative sample of 77,200 households comprising nearly 134,000 persons on the frequency, characteristics, and consequences of criminal victimization in the United States. The survey enables the Bureau of Justice Statistics (BJS) to estimate the likelihood of victimization by rape, sexual assault, robbery, assault, theft, household burglary, and motor vehicle theft for the population as a whole, as well as for segments of the population such as women, the elderly, members of various racial groups, city dwellers, or other groups. The NCVS provides the largest national forum for victims to describe the impact of crime and the characteristics of violent offenders [8].

The Federal Bureau of Investigation's (FBI) Uniform Crime Reports (UCR) Program collects information from local law enforcement agencies about crimes reported to the police. The UCR crime index includes the following offenses:

1. Homicide
2. Forcible rape
3. Robbery
4. Aggravated assault
5. Burglary
6. Larceny-theft
7. Motor vehicle theft

1.4 Data Mining

Data mining methods enable the extraction of hidden predictive information from large databases. Data mining is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by the retrospective tools typical of decision support systems [9]. Data mining tools can answer business questions that traditionally were time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. Most companies already collect and refine massive quantities of data. Data mining techniques can be implemented rapidly on existing software and hardware platforms to enhance the value of existing information resources, and can be integrated with new products and systems as they are brought on-line. When implemented on high-performance client/server or parallel processing computers, data mining tools can analyze massive databases to deliver answers to questions such as, "Which clients are most likely to respond to my next promotional mailing, and why?" [9]. Recent advancements in data collection, storage, and manipulation tools, such as phenomenal storage and computational capacity, use of the internet, and advanced surveillance equipment, have broadened the scope and limits of data mining. Moreover, the common man's increasing dependence on high-technology equipment has eased the process of data collection [10].

The data may or may not be in a directly usable form and may need interpretation based on previous knowledge and experience; most importantly, the interpretation is driven by the purpose of the data analysis. The problem is further compounded by the sheer volume and texture of the data and the lack of human capability to interpret it as intended. For this reason many computational tools are used, broadly termed data mining tools [10]. Data mining tools comprise basic statistics and regression methods, ANOVA, decision trees, and rule-based techniques, and, more importantly, advanced algorithms that use artificial intelligence and neural network methods. The applications of data mining tools are boundless and are basically driven by cost, time constraints, and the current requirements of the community, business, and the government [10]. Several data mining tools are available commercially or as open source. These tools include SAS Enterprise Miner [11], JMP [12], R [13], TeraData [14], Clementine [15], and Weka [16]. Data analysis tools such as MS Excel [17], MiniTab [18], and Statistica [19] can indirectly perform data mining by searching the data for vital information. The primary issue with the aforementioned software is that these are generalized tools that require significant data preparation and/or coding on the part of the user. Moreover, the data is often not in a form that can be interpreted directly. Currently, these tools are being upgraded to reduce user input, improve cross-functionality, and obtain better results. The main thrust of data mining research is in the commercial and health care sectors, followed by science and technology. It is only recently that data mining has gained recognition in the national security field.

1.5 Problem Statement

The objective of this research was to develop a web-based software tool to analyze crime data. The software tool applies statistical and data mining methods to the FIMS database. The existing FIMS database was expanded to include hypothetical data on individuals convicted of criminal activity. The front-end of the software was implemented in ASP.NET 2.0 and the database was implemented in SQL Server 2005. The user can select any combination of fields from the database to conduct in-depth analysis. In addition, a data visualization capability is provided; the user can utilize the visualization tool to manually identify patterns in the data.

CHAPTER 2: LITERATURE REVIEW

2.1 Existing Data Mining Tools

CrimeConnect is a secure web-based tactical crime information sharing system that enables multiple police jurisdictions to share information such as wanted persons, missing persons, sex crime registrants, bulletins, etc. in real time. It also enables authorized department personnel to search the tactical crime databases and selected data from the records management systems of multiple jurisdictions to enhance the police officer's ability to solve crime [10]. The CrimeConnect software enables single or multiple jurisdictions to quickly and easily input tactical crime information into a secure central database and share that information online, along with data from other sources, either within the agency or with an unlimited number of external agencies. Each participating agency within the CrimeConnect system determines which of its data can be shared with other agencies, and only that agency can upload, alter, or remove the data. This data can also be used to automatically assemble and display customized briefings, and to make the briefing content available for retrieval by all law enforcement personnel 24/7, including mobile units [20].

CrimePointWeb is a web-based software solution that facilitates information sharing, analysis, and management for law enforcement and public safety agencies. The idea behind CrimePointWeb is to allow agencies to share not only data, but information as well. By combining data with easy-to-use automated analysis, CrimePointWeb can take information, i.e., both data and analysis, to where it is needed most [21].

AC2 is a user-friendly and powerful system that supports all aspects of data mining. It provides a comprehensive set of tools to access, select, prepare, and manipulate data. AC2 is based on an object-oriented language and allows the user to structure the data and enrich this structure with the user's domain knowledge. The AC2 knowledge discovery engine, based on state-of-the-art inductive techniques, builds predictive models automatically; these are displayed in the form of decision trees. One can easily test and validate the models and deploy them in areas such as segmentation, classification, estimation, and prediction. AC2 is a comprehensive toolkit that allows one to perform advanced data mining tasks and develop powerful decision support systems step-by-step. Its intuitive and user-friendly interface makes it easy for data owners and IT specialists alike to build and maintain their own data mining systems [22].

CART is a robust, easy-to-use decision tree tool that automatically sifts large, complex databases, searching for and isolating significant patterns and relationships. This discovered knowledge is then used to generate reliable, easy-to-grasp predictive models for applications such as profiling customers, targeting direct mailings, detecting telecommunications and credit card fraud, and managing credit risk [23]. It is an excellent pre-processing complement to other data analysis techniques; for example, CART's outputs (predicted values) can be used as inputs to improve the predictive accuracy of neural nets and logistic regression [24].

BRAINCEL is an easy-to-use Excel add-in that enhances forecasts with the power of neural networks. Knowledge of neural networks, mathematics, or statistics is not required to use it. One supplies both input data and desired output (target) data in rows and columns on a spreadsheet. It is almost as simple to use as Excel's standard regression tool, but more powerful. It has a BESTNET option that directs the program to find the best neural net size and shape; it will vary both the number of layers and the number of processing modules at each layer [25].

BrainMaker Neural Network Software lets one use a computer for business and marketing forecasting; stock, bond, commodity, and futures prediction; pattern recognition; medical diagnosis; sports handicapping; etc. It lets the user watch the network learn and easily find a network that tests well. It uses the back-propagation algorithm, displays network progress graphically, shows how well the network is learning, and helps determine the accuracy level [26].

CrimeStat III is a spatial statistics program for the analysis of crime incident locations, developed by Ned Levine & Associates under grant 2002-IJ-CX-0007 from the National Institute of Justice. The program is Windows-based and interfaces with most desktop GIS programs. It provides supplemental statistical tools to aid law enforcement agencies and criminal justice researchers in their crime mapping efforts. It is being used by many police departments around the country as well as by criminal justice and other researchers. The new version is 3.0 (CrimeStat III) and is available free of charge. The program inputs incident locations (e.g., robbery locations) in 'dbf', 'shp', ASCII, or ODBC-compliant formats using either spherical or projected coordinates. It calculates various spatial statistics and writes graphical objects to ArcView, MapInfo, Atlas*GIS, Surfer for Windows, and ArcView Spatial Analyst [27].

2.2 Forensic Information Management Systems

FIMS 1.0, the Forensic Information Management System, was developed at WVU for the WV State Police Forensic Laboratory (WVSPFL). Prior to implementation of the software, a process map of the lab was developed, and a Windows application for the DNA Unit was built [30]. The current version, FIMS 2.0, covers all nine Units of WVSPFL and has extensive tools to maintain the chain of custody [28]. FIMS 2.0 was implemented by studying the process flow of existing operations in WVSPFL. Each phase was automated by converting the existing written forms, which contain data vital to the incident, into electronic reports. Since a report may or may not need to be printed from a computer, a read/write function was designed with capabilities for sorting and retrieving data. Dropdown lists, menu structures, text boxes, and other tools within the .NET Framework enabled the customization and automation of several decision-making tasks. After information is entered into the program via the input controls supplied by ASP.NET, the data is stored as records in an MS SQL database with the incident number as the primary key. When an incident file is needed, it is simply a matter of retrieving the appropriate record and placing it into the I/O controls. The overall architecture of FIMS 2.0 is shown in Figure 2.1.


Figure 2.1: FIMS 2.0 Architecture

FIMS 2.0 is organized into four main modules: ORI Agencies, Prosecutor's Office, CER, and Evidence Analysis Units. Each module has a separate login and different access privileges. The ORI agency can submit incidents online and can view incident status and incident submission reports. The prosecutor's office can view incident status and incident reports in its jurisdiction. All of the laboratory Units and CER have access to the incident submission report. Each laboratory Unit has its own set of forms. All of the information submitted is stored in a common FIMS database. FIMS 2.0 can be accessed online through any browser at http://fims.rsa.wvu.edu. Figure 2.2 shows the home page of the FIMS 2.0 application accessed through a browser. The user can select the appropriate link to log into the respective module.


Figure 2.2: Home page of FIMS 2.0 application


CHAPTER 3: SYSTEM DESIGN AND IMPLEMENTATION

3.1 System Design

The focus of this project was to design and implement a crime database and associated search methods to identify crime patterns from the database. The database was created in Microsoft SQL Server (back end). The user interface (front end) and the crime pattern identification software (middle tier) were implemented in ASP.NET. Such a web-based approach enables users to utilize the database from anywhere and at any time. ARFF and XML files can also be generated so that users on a Windows-based platform can use other data mining software, such as Weka, for further analysis.

ASP.NET is Microsoft's latest technology for building web-based applications and services, a successor to Active Server Pages (ASP) that draws on the power of the .NET Framework development platform and the Visual Studio .NET developer toolset. ASP.NET provides an efficient approach to developing web applications because of the following:

A simple, effective approach to creating database structures.
The ability to design a user-friendly interface while writing the underlying programs within this design.
The ability to write programs for mobile devices.
The ability to create applications directly implemented on the World Wide Web (WWW).
The ability to recover from memory leaks and errors to make sure that the website is always available.
Multiple language support.

ASP.NET brings new functionality, power, and ease of rapid development for web programmers. Some of these features, such as the advent of code-behind, are well documented and have become standard practice in everyday development. Another feature that ASP.NET brings is development with a real compiled language [25]. Accessing a database from a web application is an often-used technique for displaying data to website visitors, and ASP.NET makes it easier than ever to access a database for this purpose. ASP.NET provides a simple model that enables developers to write logic that runs at the application level; developers can write this code in the global.asax text file or in a compiled class deployed as an assembly. ASP.NET provides easy-to-use application and session-state facilities that are familiar to ASP developers and readily compatible with the rest of the .NET Framework [27]. ASP.NET takes advantage of performance enhancements found in the .NET Framework and the common language runtime. The .NET Framework and ASP.NET provide default authorization and authentication schemes for web applications.

3.2 FIMS 3.0 System Architecture

One of the key elements of any application design is the system architecture. The system architecture defines how the pieces of the application interact with each other and what functionality each piece is responsible for performing. There are several styles of application architecture, each characterized by the number of layers between the user and the data. Each layer generally runs on a different system, or in a different process space on the same system. With three-tier applications, the business rules are removed from the client and are executed on a system between the user interface and the data storage system. The client application provides the user interface for the system. The business rules server ensures that all of the business processing is done correctly and serves as an intermediary between the client and the data storage; in this type of application, the client does not access the data storage system directly. This type of system allows any part of the system to be modified without having to change the other two parts. Since the parts of the application communicate through interfaces, then as long as the interface remains the same, the internal workings can be changed without affecting the rest of the system [28]. A three-tier application is a program organized into three major disjoint tiers:

Tier 1: User Interface (Front End)
Tier 2: Data Access (Middleware)
Tier 3: Database (Back End)

Each tier can be deployed on geographically separated computers in a network, i.e., on physically separate machines. The characteristic of tier communication is that the tiers communicate only with their adjacent neighbors; for example, the User Interface Tier interacts directly with the Data Access Tier and not directly with the Database Tier. The FIMS 3.0 software design is based on this three-tier architecture; in addition, the following functionality has been added:


A new Unit has been added for data analysis; it makes use of statistical methods and data mining methods to analyze lab data and identify crime patterns.

The database has been redesigned to identify the user across different Units from a single login session. The user is not required to log into each and every Unit separately; access is based on the privileges granted to the user by the administrator.
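To make the three-tier separation described in Section 3.2 concrete, the sketch below shows a data request passing from an ASP.NET page through a business-rules class to a data-access class. This is a minimal illustration only: the class names (IncidentRepository, IncidentService), the IncidentDate column, and the query are assumptions made for the sketch, not the actual FIMS 3.0 classes.

// Minimal sketch of the three-tier separation; names and schema details are illustrative.
using System;
using System.Data.SqlClient;

// Tier 3 boundary: only this class talks to SQL Server.
public class IncidentRepository
{
    private readonly string connectionString;

    public IncidentRepository(string connectionString)
    {
        this.connectionString = connectionString;
    }

    public int GetIncidentCount(int year)
    {
        using (SqlConnection conn = new SqlConnection(connectionString))
        using (SqlCommand cmd = new SqlCommand(
            "SELECT COUNT(*) FROM Incident WHERE YEAR(IncidentDate) = @year", conn))
        {
            cmd.Parameters.AddWithValue("@year", year);   // IncidentDate is an assumed column name
            conn.Open();
            return (int)cmd.ExecuteScalar();
        }
    }
}

// Tier 2: business rules sit between the user interface and the database.
public class IncidentService
{
    private readonly IncidentRepository repository;

    public IncidentService(IncidentRepository repository)
    {
        this.repository = repository;
    }

    public int IncidentsForYear(int year)
    {
        // A simple business rule: reject obviously invalid years before touching the database.
        if (year < 1990 || year > DateTime.Now.Year)
            throw new ArgumentOutOfRangeException("year");
        return repository.GetIncidentCount(year);
    }
}

// Tier 1: an ASP.NET code-behind would only call the service, never the database directly,
// e.g. countLabel.Text = service.IncidentsForYear(2006).ToString();

Because the page never opens a SqlConnection itself, the database schema or the business rules can change without any modification to the user interface, which is the main benefit claimed for the three-tier design.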


CHAPTER 4: USER INTERFACE


4.1 User Interface

The presentation layer of most applications is critical to the application's success. The presentation layer represents the interface between the user and the rest of the application. If the user cannot interact with the application in a way that allows work to be performed efficiently and effectively, the overall success of the application will be severely impaired. Through the topmost user interface layer (Tier 1), the user can input data, view the results of requests, and interact with the underlying system. On the web, the browser performs these user interface functions. In non-web-based applications, the client tier is a stand-alone, compiled front-end application. The client tier is responsible for communication with users and web service consumers, and it uses objects from the Business Layer to respond to GUI events. ASP.NET is used to implement the presentation tier. The presentation tier is the first level of the three-tier FIMS model. It interfaces directly with application services for the application processes; it also issues requests to the second (data access) tier. The common presentation tier services provide semantic conversion between associated data access processes. Figure 4.1 shows the user interface diagram, which lists all user web pages, typically organized in hierarchical fashion. It is divided into the main sections used in the navigation bar of the website. A brief description of each section is given below:


Figure 4.1: User Interface or Presentation Layer

4.2 System Administrator Methods

The System Administrator home page is the gateway for the system administrator, who invokes the methods in the system administrator module. The System Administration module is responsible for meeting the many administrative, managerial, and technology needs of FIMS. The primary goal of system administration is to manage the entire FIMS. These methods perform tasks such as creating agencies, creating divisions or units for a selected agency, deleting a selected agency, division, or unit, editing agency, division, or unit information, database administration, access logging, system performance monitoring, maintenance, and backup. The section also serves as liaison to state, local, and federal agencies; reviews and evaluates analyses and examinations performed in the agencies; coordinates interagency and interdepartmental activities involving agency operations; and oversees the preparation and maintenance of agency records and reports. It instructs law enforcement personnel in the proper agency procedures for identifying, handling, and examining physical evidence. Figure 4.2 shows the System Administrator class diagram and its methods. For example, the ActivateAgency method activates a particular agency that was deactivated earlier.
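As an illustration of what such a method could look like, the following sketch re-activates an agency by flipping a status flag. The AgencyInfo table name is taken from Chapter 6, but the IsActive flag, the AgencyID column, and the use of a direct UPDATE statement are assumptions made for this example; the actual FIMS implementation is not shown in the text.

// Hypothetical sketch of an ActivateAgency-style method; column names are assumed.
using System.Data.SqlClient;

public class SystemAdministratorData
{
    private readonly string connectionString;

    public SystemAdministratorData(string connectionString)
    {
        this.connectionString = connectionString;
    }

    // Re-activates an agency that was previously deactivated.
    public bool ActivateAgency(string agencyId)
    {
        using (SqlConnection conn = new SqlConnection(connectionString))
        using (SqlCommand cmd = new SqlCommand(
            "UPDATE AgencyInfo SET IsActive = 1 WHERE AgencyID = @id", conn))
        {
            cmd.Parameters.AddWithValue("@id", agencyId);
            conn.Open();
            // True if exactly one agency row was updated.
            return cmd.ExecuteNonQuery() == 1;
        }
    }
}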

Figure 4.2: System Administrator Class and Methods

The System Administrative Unit interface class contains methods to interact with the form controls of the administrative unit. These methods link directly to user controls on the web form; on request, they display or send information to the web controls on the form. They are responsible for tasks such as adding a new agency, editing package and item information, and other functions such as password encryption and password checking.

4.3 WVSPFL Agency

On the agency home page, the user selects his/her agency on the basis of agency name and location, or agency ID. There are nearly 900 law enforcement agencies in West Virginia, and WVSPFL is one such agency. WVSPFL is located in South Charleston, West Virginia. It provides facilities for the analysis of samples related to a particular incident. It has three main divisions: administration, evidence handling, and evidence processing.

4.3.1 WVSPFL Agency Administrator Methods

The agency administration module is responsible for meeting the many administrative, managerial, and technology needs within the agency. The primary goal of agency administration is to provide a quality operational work environment that supports the necessary work of agency personnel. It reviews and evaluates analyses and examinations performed in the agency, coordinates interagency and interdepartmental activities involving laboratory operations, and oversees the preparation and maintenance of agency records and reports. It assists in the planning of agency facilities, equipment, staffing, and training to meet the day-to-day needs of the agency. The agency administration module is responsible for adding, deleting, and editing agency members; adding, deleting, and editing packages, items, and samples; and assigning the inventoried items to analysts in the testing Units. It is also responsible for verifying and approving the final agency incident report. Figure 4.3 shows the Agency Administrator class diagram.

Figure 4.3: Agency Administrator Class and Methods

The Agency Administrative Unit interface class contains methods to interact with the form controls of the administrative Unit. These methods link directly to user controls on the web form; on request, they display or send information to the web controls on the form. They are responsible for tasks such as adding, deleting, and editing agency members; adding, deleting, and editing package and item information; and other functions such as password encryption and password checking.

4.3.2 WVSPFL Evidence Handling Division

The Evidence Handling Division has two units: the Central Evidence Processing (CEP) Unit and the Central Evidence Receiving (CER) Unit.

4.3.2.1 Central Evidence Receiving Unit

The Central Evidence Receiving (CER) Unit is responsible for receiving form DPS-53 and packages from agencies. Packages are sent by the originating agencies via mail or personal delivery, and CER can accept or deny form DPS-53. If the items within a package are for a single unit, CER sends the package to the appropriate unit. The analyst in that unit then inventories the items within the package using a Laboratory Evidence Inventory Form (LEIF). If the items are for multiple units, the packages are sent to processing, and the Processing Unit inventories the items in the packages using a LEIF based on the CR and the items received. CER is also responsible for chain-of-custody management. Chain of custody is a process to maintain and document the chronological history of the evidence. The CER Unit module is responsible for package inventory, chain of custody, sending packages, and receiving packages. Figure 4.4 shows the CER Unit Methods class diagram.
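Conceptually, a chain-of-custody record is an append-only log entry stating who transferred which package, between which units, and when. The sketch below illustrates the idea; the ChainOfCustody table and its columns are hypothetical and are not the actual FIMS schema.

// Illustrative chain-of-custody logging; table and column names are assumptions.
using System;
using System.Data.SqlClient;

public class ChainOfCustodyLog
{
    private readonly string connectionString;

    public ChainOfCustodyLog(string connectionString)
    {
        this.connectionString = connectionString;
    }

    // Records who moved a package, from which unit to which unit, and when.
    public void RecordTransfer(string packageId, string userId, string fromUnit, string toUnit)
    {
        const string sql =
            "INSERT INTO ChainOfCustody (PackageID, UserID, FromUnit, ToUnit, TransferTime) " +
            "VALUES (@pkg, @user, @from, @to, @time)";
        using (SqlConnection conn = new SqlConnection(connectionString))
        using (SqlCommand cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@pkg", packageId);
            cmd.Parameters.AddWithValue("@user", userId);
            cmd.Parameters.AddWithValue("@from", fromUnit);
            cmd.Parameters.AddWithValue("@to", toUnit);
            cmd.Parameters.AddWithValue("@time", DateTime.Now);
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}

Because entries are only ever inserted and never updated, such a table preserves the chronological history of the evidence that chain-of-custody documentation requires.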

Figure 4.4: CER Unit Methods

The Central Evidence Receiving Unit interface class contains methods to interact with the form controls of the CER Unit. As shown in Figure 4.4, the SendPackage method sends all the information about the package to the sender. These methods link directly to user controls on the web form; on request, they display or send information to the web controls on the form. They are responsible for tasks such as checking the status of the package inventory, tracking the chain of custody of packages, opening existing incident information, receiving packages, and sending packages to other Units or back to the respective Units.

4.3.2.2 Central Evidence Processing Unit

An independent CEP Unit enables better control of the existing evidence and information flow between different Units. CEP receives packages from the CER Unit for further processing. CEP is responsible for the item inventory and the distribution of evidence items among the different Units. After receiving a package from the CER Unit, the CEP Unit is responsible for opening the package and inventorying its items. The item name and description entered by CEP override the name and description provided by the originating officer. After completion of testing and final item report generation, CEP is responsible for compiling the package report, which is then sent to the administration unit for final review and approval. Package Item Inventory shows the list of packages with their description, examination requested, and item inventory status. After receiving a package, CEP is responsible for filling out the laboratory item inventory form; this section enables a user to add, update, or delete existing items. Chain of Custody shows the chain of custody of packages. The Send Package/Item section describes how to send a package or item from the source to the destination unit; the CEP user can send a package by selecting the appropriate package number. The CEP user can also view the list of items assigned and then send those items to their relevant unit through this section. The CEP Unit module is responsible for package inventory, chain of custody, sending packages, and receiving packages. Figure 4.5 shows the CEP Unit Methods class diagram.


Figure 4.5: CEP Unit Methods

The Central Evidence Processing Unit interface class contains methods to interact with the form controls of the CEP Unit. In Figure 4.5, the ExamineInventoryStatus method checks the status of an inventory item. These methods link directly to user controls on the web form; on request, they display or send information to the web controls on the form. They are responsible for tasks such as checking the status of the package inventory, tracking the chain of custody of packages, opening existing incident information, receiving packages, and sending packages to other units or back to the respective Units.

4.3.3 WVSPFL Evidence Processing Methods

WVSPFL has eight Units: the Bio-Chemistry Unit, DNA Unit, Drug Unit, Fire Arms Unit, Latent Print Unit, Trace Evidence Unit, Toxicology Unit, and Questioned Documents Unit.

4.3.3.1 Bio-Chemistry Unit Methods

The Bio-Chemistry Unit of the laboratory receives and examines physical evidence for the presence of biological material. This Unit performs DNA analysis on blood, semen, and other biological specimens. The sensitivity of DNA analysis allows a large variety of samples to be tested; generally, any cellular material can be used as a source of DNA. In routine incident work, techniques are so sensitive that DNA profiles have been obtained from blood-stained clothing that had been washed, from gum that had been chewed, and from envelopes that had been licked. The Bio-Chemistry Unit manages the Convicted Offender DNA Identification System (CODIS) program for the State of West Virginia. All individuals convicted of violent crimes, sex offenses, and most felonies in the State of West Virginia are required to provide the State Police with a DNA sample for analysis. After analysis, the DNA profile of the convicted offender is entered into a national database managed by the FBI. DNA profiles developed from evidence are compared to the convicted offender database. If a hit or match occurs as a result of a database search, laboratory personnel review the match to determine what additional action is required. If appropriate, the investigating agency is contacted with the name of the convicted offender whose profile matched the evidence.

The analyst in the Bio-Chemistry Unit receives the LEIF form and chain-of-custody form along with the evidence items. The analyst prepares samples and performs the required tests. The test results are documented on the appropriate forms. Once the examinations are completed, the evidence is returned to the CER and the secretary prepares a draft report. The draft report is then reviewed by the analyst/reviewer. After approval, the final report is sent to CER. The View Items Assigned method shows the list of items assigned to the analyst by the laboratory administration Unit for further analysis. Item Picture Management methods allow the user to attach pictures to an item; analysts can view and upload pictures. Working with Samples methods allow analysts to receive items from the CEP Unit for further analysis and processing. Once received, items are sub-divided into samples for conducting the requested tests; each item is divided into one or more samples according to the requirement and complexity of the required test. Analysts can create new samples or update or delete existing samples from the item details section. This method also enables analysts to view the final sample, sample photo, item, and item photo reports. The Extraction/Amplification worksheet is used for recording information pertaining to the extraction/amplification test performed on the sample by the analyst; after filling in the required data fields, the analyst is required to save the worksheet. The Sperm Identification worksheet is used for recording information pertaining to the sperm identification test performed on the sample. The Product Gel worksheet is used for recording information pertaining to the product gel test performed on the sample. The Yield Gel worksheet is used for recording information pertaining to the yield gel test performed on the sample. Receive Item: once an item is sent by the CEP Unit, the Bio-Chemistry Unit analyst is responsible for receiving it. Send Item: after completion of item testing, the analyst is required to return the evidence item to the CEP Unit. Chain of Custody: the FIMS application records the user ID, Unit name, and time stamp for tracking the item chain of custody. The Generating Item and Sample Reports methods generate the Item Report, Item Photo Report, Sample Report, and Sample Photo Report. The DNA Unit, Drug Unit, Fire Arms Unit, Latent Print Unit, Trace Evidence Unit, Toxicology Unit, and Questioned Documents Unit have their respective methods [28].


4.4 Data Analysis Agency Interface

The main task of the data analysis agency is to provide a means to draw logical conclusions by applying basic statistical and data mining algorithms. It provides incident statistics to agencies and guides law enforcement agencies in making decisions related to incident handling and control. The data analysis agency interface is divided into three divisions: Incident Statistics, Data Mining, and Data Visualization. In order to build a data analysis model, data need to be collected from different data tables. The Data File Management link, connected to the Incident Statistics page, allows the user to collect the relevant data from the database by building a SQL query.

4.4.1 Data File Management and Methods

Data File Management has Data Query Builder Wizard, Create Arff Data File, Arff Data File Viewer, and Create and View Xml Data File links. The Data Query Builder Wizard link provides the tools for building a SQL query from the available tables and columns in the database; to make it more user friendly, the user has the option to use check boxes. Figure 4.6 shows the user interface for the Data Query Building Wizard page. On this page, the left-most column shows the table names and the right-most column shows the corresponding table columns with check boxes.


Figure 4.6: Data Query Building Wizard User Interface

Figure 4.7 shows the class that lists the methods used in the Data Query Builder Wizard. The data query builder wizard interface class contains methods to interact with the form controls of the wizard form. In Figure 4.7, BuildSQLQuery builds a SQL statement from the selected table and columns. These methods directly link to user controls on the web form; on request, they display or send information to the web controls on the form. They are responsible for tasks such as building the SQL query, changing the color of selected columns, checking the SQL query syntax, creating check boxes at run time and adding them to the form, listing the data tables from the database, listing the corresponding columns for each table, and saving the query. As shown in Figure 4.8, the Create Arff Data File link calls the method that creates an Arff data file. Weka-like tools take the Arff file format, so the Arff file created by this method allows the user to take the data to other software and check the validity of the tool. Running Run SQL Query displays the data in a data grid, and Create Arff File with this Data creates the Arff file from the selected data. The Arff Data File Viewer link calls methods that let the user view the Arff file data in a grid format. Create and View Xml Data File calls three methods: one creates an XML file from the Query Wizard data, the second creates XML from an Arff file, and the third views the Arff file. Checking the Convert SQL Query Results as XML File radio button and calling the Create XML File method produces an XML file from the Query Wizard data. Checking the Convert Arff to XML File radio button and calling the Create XML File method produces an XML file from the input data file.
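To make the role of BuildSQLQuery concrete, the following is a minimal sketch of how a SELECT statement could be assembled from the tables and columns a user has checked; the class name and data structures are illustrative assumptions, not the actual FIMS code.

using System;
using System.Collections.Generic;

// Minimal sketch: assemble a SELECT statement from user-checked tables and columns.
// "QuerySketch" and its members are hypothetical names, not the FIMS class itself.
public static class QuerySketch
{
    // selections maps a table name to the list of columns checked for that table.
    public static string BuildSqlQuery(IDictionary<string, List<string>> selections)
    {
        List<string> columns = new List<string>();
        List<string> tables = new List<string>();
        foreach (KeyValuePair<string, List<string>> entry in selections)
        {
            tables.Add(entry.Key);
            foreach (string column in entry.Value)
                columns.Add(entry.Key + "." + column);   // e.g. AgencyInfo.County
        }
        string columnList = columns.Count > 0 ? string.Join(", ", columns.ToArray()) : "*";
        return "SELECT " + columnList + " FROM " + string.Join(", ", tables.ToArray());
    }

    public static void Main()
    {
        Dictionary<string, List<string>> picks = new Dictionary<string, List<string>>();
        picks["AgencyInfo"] = new List<string>(new string[] { "AgencyType", "County" });
        picks["Convict"] = new List<string>(new string[] { "ArrestDate" });
        Console.WriteLine(BuildSqlQuery(picks));
        // SELECT AgencyInfo.AgencyType, AgencyInfo.County, Convict.ArrestDate FROM AgencyInfo, Convict
    }
}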

Figure 4.7: List of Methods in Data Query Building Wizard Class

Figure 4.8: Converting SQL Query Results to XML Methods User Interface


The data file management user interface class contains methods to interact with the form controls of the Data Analysis agency user form, as shown in Figure 4.9. It is responsible for tasks such as displaying analysis attributes, converting XML files, and creating SQL queries. Figure 4.9 shows the data analysis class diagram with its methods. For example, the ConvertArffXMLfile method converts an Arff data file into XML format.
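Because several of these methods revolve around producing Arff files for Weka-style tools, a minimal sketch of writing a DataTable in the Arff format follows; the class name is hypothetical and, for brevity, every column is declared as a string attribute.

using System;
using System.Data;
using System.IO;
using System.Text;

// Minimal sketch of writing a DataTable out in Weka's ARFF format.
// A real tool would distinguish nominal and numeric columns; here every
// column is declared as a string attribute for simplicity.
public static class ArffSketch
{
    public static void WriteArff(DataTable table, string relationName, string path)
    {
        StringBuilder sb = new StringBuilder();
        sb.AppendLine("@relation " + relationName);

        foreach (DataColumn col in table.Columns)
            sb.AppendLine("@attribute " + col.ColumnName + " string");

        sb.AppendLine("@data");
        foreach (DataRow row in table.Rows)
        {
            string[] values = new string[table.Columns.Count];
            for (int i = 0; i < table.Columns.Count; i++)
                values[i] = "'" + Convert.ToString(row[i]).Replace("'", "\\'") + "'";
            sb.AppendLine(string.Join(",", values));
        }
        File.WriteAllText(path, sb.ToString());
    }
}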

Figure 4.9: Data File Management User Interface Class and Methods

4. 4. 2 Incident Statistics

This division's user interface consists of methods for calculating basic statistics on the data. Figure 4.10 shows the class diagram of the incident statistics class and its methods.

Figure 4.10: Incident Statistics Class and Methods


The Incident Statistics methods interface class contains methods to interact with the form controls of the Incident Statistics division. As shown in Figure 4.10, the Average methods provide the average number of incidents for a year, a month, or a period of time. These methods directly link to user controls on the web form; on request, they display or send information to the web controls on the form. They are responsible for tasks such as computing totals and averages of incidents for a selected year, month, or date range.

4. 4. 3 Data Mining and Methods

The data mining user interface class consists of methods that interact with the user form. The Logical Analysis hyperlink takes the user to the main page of the mining algorithms. Figure 4.11 shows the available mining algorithms and the results window.

Figure 4.11: Selecting Data Mining Methods


By choosing the appropriate algorithm, the user can call that algorithm and run the analysis. If the user does not know which algorithm to choose, the program selects the algorithm that gives the best performance on the basis of percentage accuracy; this is done by checking the Auto Select check box. Figure 4.12 shows how the Auto Select option is used. Note that when the check box is checked, all the individual algorithm options are disabled because the program is going to choose the best algorithm itself.

Figure 4.12: Using the Auto Select Option While Choosing an Algorithm

In Figure 4.12 it can be seen that the program selected the NaiveBayes classifier as the method giving the highest accuracy for this model. Figure 4.13 shows the data mining user interface class, which provides methods for logical algorithms such as the Decision Tree, Distance Rule, IB1, ID3, Inference, KNN, and Naive Bayes algorithms to the user interface. For example, the Decision Tree algorithm provides a set of rules for the predictive class.
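A minimal sketch of the auto-select idea is shown below: each available classifier is scored on the same data and the one with the highest accuracy is kept. The IClassifier interface and its members are assumptions for illustration, not the FIMS interfaces.

using System;
using System.Collections.Generic;

// Sketch of "auto select": run every classifier on the same data and keep the
// one with the highest accuracy. Interface and method names are hypothetical.
public interface IClassifier
{
    string Name { get; }
    double TrainAndScore(string arffPath);   // returns percent of correctly classified instances
}

public static class AutoSelectSketch
{
    public static IClassifier PickBest(IList<IClassifier> classifiers, string arffPath)
    {
        IClassifier best = null;
        double bestAccuracy = -1.0;
        foreach (IClassifier c in classifiers)
        {
            double accuracy = c.TrainAndScore(arffPath);
            Console.WriteLine(c.Name + ": " + accuracy.ToString("F3") + "%");
            if (accuracy > bestAccuracy)
            {
                bestAccuracy = accuracy;
                best = c;
            }
        }
        return best;   // e.g. Naive Bayes when it scores highest on the current data set
    }
}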


Figure 4.13: Data Mining User Interface Class and Methods

4. 4. 4 Data Visualization Interface

The Graphical Output Data Visualization link calls the methods that display data as a decision tree model. Show Tree View calls the methods that feed the raw data to the decision algorithm and display the resulting decision tree. Figure 4.14 shows the data visualization for the Iris data set.

Figure 4.14: Graphical Output Data Visualization Decision Tree Algorithm for Iris Data Set


The data visualization user interface class contains methods to interact with the form controls of the data visualization form. It is responsible for the graphical display of the analysis data. Figure 4.15 shows the data visualization class diagram with its methods and properties.

Figure 4.15: Data Visualization Interface Classes

The data visualization interface class contains methods to interact with the form controls of the data visualization page. For example, the RandomTree method creates a tree structure randomly. These methods directly link to user controls on the web form; on request, they create an image with rectangular boxes as nodes. They are responsible for tasks such as drawing rectangles, Cleanup, Build, and Permute.
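A rough sketch of drawing tree nodes as rectangular boxes with System.Drawing follows; the node type, box size, and layout constants are assumptions made for illustration.

using System.Drawing;

// Sketch of rendering decision-tree nodes as rectangles on a Bitmap, in the
// spirit of the drawing methods described above. Layout values are arbitrary.
public class TreeNodeSketch
{
    public string Label;
    public TreeNodeSketch[] Children = new TreeNodeSketch[0];

    public static Bitmap Render(TreeNodeSketch root, int width, int height)
    {
        Bitmap image = new Bitmap(width, height);
        using (Graphics g = Graphics.FromImage(image))
        {
            g.Clear(Color.White);
            Draw(g, root, width / 2, 20, width / 4);
        }
        return image;
    }

    private static void Draw(Graphics g, TreeNodeSketch node, int x, int y, int spread)
    {
        Rectangle box = new Rectangle(x - 40, y, 80, 30);
        g.DrawRectangle(Pens.Black, box);
        g.DrawString(node.Label, SystemFonts.DefaultFont, Brushes.Black, box.X + 4, box.Y + 8);

        int childX = x - spread * (node.Children.Length - 1) / 2;
        foreach (TreeNodeSketch child in node.Children)
        {
            g.DrawLine(Pens.Black, x, y + 30, childX, y + 70);   // edge to the child box
            Draw(g, child, childX, y + 70, spread / 2);
            childX += spread;
        }
    }
}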


CHAPTER 5: DATA ACCESS TIER

5. 1 Data Access

In a Web application, the middle tier consists of components registered as part of a transactional application or instantiated by a script in Active Server Pages (ASP); it acts as the brain of the three-tier application. The middle tier can be divided into two sub-tiers, the Business Tier and the Data Access Tier (DAT), although some architects do not make any distinction between them. The Business Tier contains classes that calculate aggregated values such as total crimes per year, annual crime rate, number of victims per year, number of incidents resolved, and number of incidents pending; it does not know about any Graphical User Interface (GUI) controls or how to access databases. The Data Access Tier does not know about GUI controls either, but it knows how to retrieve and store information: its classes supply the needed data from the databases, as shown in the data flow diagram (Figure 5.0).


Figure 5.0: Data Access Logic Data Flow

5. 2 System Administrator Data Access Methods

The System Administrator performs system-level events such as creating agencies; creating divisions or Units for a selected agency; deleting a selected agency, division, or Unit; editing agency, division, or Unit information; database administration; access logging; system performance monitoring; maintenance; and backup. The system administrator has full rights to manage the entire FIMS, and the system administrator's user credential is created initially. Administrative personnel have the ability to browse the FIMS data of their respective agencies. User credentials are gathered from the username and password supplied on the system administrator login page. The system administrator data access methods verify the supplied credentials against the user credentials already stored in the database, and include methods to populate the list of authorized agencies. Figure 5.1 shows the classes responsible for authenticating the user; these classes check the user credentials and redirect the user to the respective Units.
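As a minimal sketch of such a credential check, the following compares the submitted username and password against stored values with a parameterized query; the Users table and its columns are assumptions, not the actual FIMS schema, and a real system would store and compare password hashes rather than plain text.

using System.Data.SqlClient;

// Sketch of verifying submitted credentials against stored ones.
// Table and column names are hypothetical.
public static class LoginSketch
{
    public static bool CredentialsAreValid(string connectionString, string userName, string passwordHash)
    {
        using (SqlConnection connection = new SqlConnection(connectionString))
        using (SqlCommand command = new SqlCommand(
            "SELECT COUNT(*) FROM Users WHERE UserName = @user AND PasswordHash = @hash", connection))
        {
            command.Parameters.AddWithValue("@user", userName);
            command.Parameters.AddWithValue("@hash", passwordHash);
            connection.Open();
            int matches = (int)command.ExecuteScalar();
            return matches == 1;   // exactly one matching account
        }
    }
}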

Figure 5.1: System Administrator Class and Methods

Figure 5.2 shows the business layer class, which is responsible for business-layer tasks such as the maximum number of incidents registered over a period, the number of pending and in-process incidents, and the number of resolved incidents.


Figure 5.2: User Authentication Classes

Figure 5.3 shows the class that lists the methods used in the business layer. The business layer class contains methods that interact with the database directly and perform mathematical functions. In Figure 5.3, the TotalNumberOfCasePerYear method gives the total number of cases filed in a given year. These methods directly link to user controls on the web form; on request, they display or send information to the web controls on the form. They are responsible for tasks such as calculating the maximum number of incidents registered by an officer, the maximum number of incidents in a year, the number of incidents in process, the number of pending incidents, the number of resolved incidents, the total incidents per officer, the total number of incidents per year, and the total number of victims per year.


The BusinessLayerFunctions class exposes the following methods: MaxCaseregistarOfficer, MaxNumberCaseIntheYear, NoOfCasesOnProcess, NumberOfPendingCases, ResolvedCases, TotalCasesPerOfficer, TotalNumberOfCasePerYear, and TotalNumberOfVictimsPerYear.

Figure 5.3: Business Layer Class with Methods
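As an illustration of how one of these business-layer functions could query the database, the following sketch counts the incidents filed in a given year; the Incident table and IncidentDate column names are assumptions for the example, not necessarily the FIMS schema.

using System;
using System.Data.SqlClient;

// Sketch of one business-layer function: count incidents filed in a given year.
public static class BusinessLayerSketch
{
    public static int TotalNumberOfCasesPerYear(string connectionString, int year)
    {
        const string sql =
            "SELECT COUNT(*) FROM Incident " +
            "WHERE IncidentDate >= @start AND IncidentDate < @end";

        using (SqlConnection connection = new SqlConnection(connectionString))
        using (SqlCommand command = new SqlCommand(sql, connection))
        {
            command.Parameters.AddWithValue("@start", new DateTime(year, 1, 1));
            command.Parameters.AddWithValue("@end", new DateTime(year + 1, 1, 1));
            connection.Open();
            return (int)command.ExecuteScalar();
        }
    }
}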

5. 3 Data Analysis Methods

A new Unit, the data analysis Unit, was added to the existing FIMS; it makes use of statistical methods and data mining methods to analyze laboratory data and identify crime patterns.

5. 3. 1 Statistical Method

The mean of a statistical distribution with a discrete random variable is the mathematical average of all the terms. The median of such a distribution depends on whether the number of terms is even or odd. If the number of terms is odd, the median is the value of the middle term: the value such that the number of terms with values greater than or equal to it equals the number of terms with values less than or equal to it. If the number of terms is even, the median is the average of the two middle terms, defined in the same way. The mode is the value of the term that occurs most often. The range is the difference between the maximum value and the minimum value. The standard deviation measures the spread of the data about the mean value, and the variance of a random variable is a measure of its statistical dispersion, indicating how far its values typically are from the expected value.
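The following is a small sketch of these descriptive statistics computed over a list of values, for example incidents per month; the class and method names are illustrative only.

using System;
using System.Collections.Generic;

// Sketch of the basic descriptive statistics described above.
public static class StatisticsSketch
{
    public static double Mean(List<double> data)
    {
        double sum = 0;
        foreach (double x in data) sum += x;
        return sum / data.Count;
    }

    public static double Median(List<double> data)
    {
        List<double> sorted = new List<double>(data);
        sorted.Sort();
        int n = sorted.Count;
        // Odd count: middle term; even count: average of the two middle terms.
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    public static double Mode(List<double> data)
    {
        Dictionary<double, int> counts = new Dictionary<double, int>();
        double mode = data[0];
        int best = 0;
        foreach (double x in data)
        {
            int c;
            counts.TryGetValue(x, out c);
            counts[x] = ++c;
            if (c > best) { best = c; mode = x; }
        }
        return mode;   // value that occurs most often
    }

    public static double Range(List<double> data)
    {
        double min = data[0], max = data[0];
        foreach (double x in data) { if (x < min) min = x; if (x > max) max = x; }
        return max - min;
    }

    public static double Variance(List<double> data)
    {
        double mean = Mean(data), sum = 0;
        foreach (double x in data) sum += (x - mean) * (x - mean);
        return sum / data.Count;            // population variance
    }

    public static double StandardDeviation(List<double> data)
    {
        return Math.Sqrt(Variance(data));   // spread of the data about the mean
    }
}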

5. 3. 2 Data Mining Methods

A data mining algorithm is the mechanism that creates data mining models. To create a model, an algorithm first analyzes a set of data, looking for specific patterns and trends, and then uses the results of this analysis to define the parameters of the mining model. The mining model that an algorithm creates can take various forms, including:

A set of rules that describe how criminal activities are grouped together in incident investigations.

A decision tree that predicts whether a particular suspect will commit a crime or not.

A mathematical model that forecasts the future location of criminal activity.

A set of clusters that describe how the incidents in a dataset are related.

Choosing the right algorithm to use for a specific task can be a challenge. While it may be possible to use different algorithms to perform the same task, each algorithm produces a different result, and some algorithms can produce more than one type of result.


The Association algorithms find correlations between different attributes in a dataset. The most common application of this kind of algorithm is creating association rules, which can be used in a crime background analysis. An example of an association algorithm is the Apriori algorithm. The Classification algorithms predict one or more discrete variables based on the other attributes in the dataset; an example is the Decision Trees algorithm. The Clustering algorithms divide data into groups, or clusters, of items that have similar properties; a clustering algorithm recognizes patterns and breaks the data into groups that are more or less homogeneous. In prediction, data are mined to anticipate behavior, patterns, and trends. Prediction is often the outcome of the previous three basic models: once decision rules are generated through classification or clustering, those rules form the basis of the prediction model, and the error or probability factor is considered when choosing one algorithm over another. An example of a prediction algorithm is the Time Series algorithm. Sequence Discovery is used to determine sequential patterns in the data. These sequences are often associations between various data fields, but they are essentially based on time and follow a particular order; the technique encompasses association rules as well as Markov concepts. A simple example is that a person who commits theft may be likely to commit burglary sooner rather than later. Generalization, also called Description or Summarization, pulls the data into subsets with their respective descriptions; sometimes an actual portion of the mined data is retrieved and the subsets are created from it. Generalization is not itself a data mining method; it is the outcome of a data mining technique [34].

5. 3. 2. 1 Classification Algorithm

Decision Trees Algorithm

A decision tree is a tree in which each branch node represents a choice between a number of alternatives and each leaf node represents a decision. Decision trees are commonly used for gaining information for the purpose of decision making. A decision tree starts with a root node on which users can take actions; from this node, each node is split recursively according to the decision tree learning algorithm. The final result is a decision tree in which each branch represents a possible decision scenario and its outcome [16]. The algorithm makes predictions based on the relationships between input columns in a dataset. It uses the values, or states, of those columns to predict the states of a column designated as predictable. Specifically, the algorithm identifies the input columns that are correlated with the predictable column. For example, in a scenario predicting which suspects are likely to commit a crime, if nine out of ten younger suspects commit a crime but only two out of ten older suspects do so, the algorithm infers that age is a good predictor of committing a crime. The decision tree makes predictions based on this tendency toward a particular outcome.

Example

Suppose an agency in West Virginia wants to identify characteristics of previous suspects that might indicate whether those suspects are likely to commit a crime in the future. The FIMS database stores demographic information that describes previous suspects. By using the Decision Trees algorithm to analyze this information, the agency's crime data analysis department can build a model that predicts whether a particular suspect will commit a crime, based on the states of known columns about that suspect, such as demographics or past suspicious patterns.

How the Algorithm Works

The Decision Trees algorithm builds a data mining model by creating a series of splits, also called nodes, in the tree. The algorithm adds a node to the model every time an input column is found to be significantly correlated with the predictable column. The way the Decision Trees algorithm builds a tree for a discrete predictable column can be demonstrated with a histogram. Figure 5.4 shows a histogram that plots a predictable column, crime suspects, against an input column, age; it shows that the age of a person helps distinguish whether that suspect will commit a crime.

Figure 5.4: Histogram for Age and Number of Suspects for Committing Crime


As the algorithm adds new nodes to the model, a tree structure is formed. The top node of the tree describes the breakdown of the predictable column for the overall population of suspects. As the model continues to grow, the algorithm considers all columns.

IB1

An IB1-type classifier uses a simple distance measure to find the training instance closest to the given test instance and predicts the same class as that training instance. If multiple instances are at the same (smallest) distance from the test instance, the first one found is used.

Naive Bayes Algorithm

The Naive Bayes algorithm is a classification algorithm. It calculates the conditional probability between input and predictable columns and assumes that the columns are independent. This assumption of independence gives the algorithm its name, the assumption often being naive in that the algorithm does not take into account dependencies that may exist. The algorithm is less computationally intense than other algorithms and is therefore useful for quickly generating mining models to discover relationships between input columns and predictable columns.

Example

Facing an ongoing security threat, the security unit of an agency has decided to target potential suspects. To reduce costs, it wants to focus only on those suspects who are likely to pose a threat. The agency stores information in a database about the demographics and responses of previous suspects involved in such incidents. It wants to use these data to see how demographics such as age and location can help predict involvement at a crime incident location, by comparing potential suspects with suspects who have similar characteristics and who have committed crimes in the past. Specifically, it wants to see the differences between those suspects who committed a crime and those who did not. By using the Naive Bayes algorithm, the security department can quickly predict an outcome for a particular suspect profile and can therefore determine which suspects are most likely to commit the crime.
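A very small sketch of the Naive Bayes calculation for nominal attributes is shown below: class priors and per-class attribute probabilities are estimated from counts and combined in log space to score a new case. The sample attributes and the simple add-one smoothing are assumptions made for illustration; this is not the FIMS implementation.

using System;
using System.Collections.Generic;

// Sketch of Naive Bayes for nominal attributes with simple add-one smoothing.
public class NaiveBayesSketch
{
    public static string Classify(List<string[]> rows, List<string> labels, string[] candidate)
    {
        string best = null;
        double bestScore = double.NegativeInfinity;
        List<string> classes = new List<string>();
        foreach (string label in labels)
            if (!classes.Contains(label)) classes.Add(label);

        foreach (string cls in classes)
        {
            int classCount = 0;
            for (int i = 0; i < labels.Count; i++)
                if (labels[i] == cls) classCount++;

            double score = Math.Log((double)classCount / labels.Count);   // log P(class)

            // add log P(attribute value | class) for each attribute
            for (int a = 0; a < candidate.Length; a++)
            {
                int match = 0;
                for (int i = 0; i < labels.Count; i++)
                    if (labels[i] == cls && rows[i][a] == candidate[a]) match++;
                score += Math.Log((match + 1.0) / (classCount + 2.0));
            }

            if (score > bestScore) { bestScore = score; best = cls; }
        }
        return best;
    }

    public static void Main()
    {
        List<string[]> rows = new List<string[]>();
        List<string> labels = new List<string>();
        rows.Add(new string[] { "Below25", "Kanawha" }); labels.Add("Yes");
        rows.Add(new string[] { "Below25", "Marion" });  labels.Add("Yes");
        rows.Add(new string[] { "Above40", "Kanawha" }); labels.Add("No");
        Console.WriteLine(Classify(rows, labels, new string[] { "Below25", "Kanawha" }));  // Yes
    }
}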

Linear Regression Algorithm

Linear regression determines a relationship between two continuous columns. The relationship takes the form of an equation for a line that best represents a series of data; for example, the line in Figure 5.5 is the best possible linear representation of the data.

Figure 5.5: Sample Linear Regression Graph

The equation that represents the line in the diagram takes the general form y = ax + b and is known as the regression equation. The variable y represents the output variable, x represents the input variable, and a and b are adjustable coefficients. Each data point in the diagram has an error associated with its distance from the regression line. The coefficients a and b adjust the slope and location of the regression line, and the regression equation is obtained by adjusting a and b until the sum of the errors associated with the points reaches its lowest value.
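A minimal sketch of fitting y = ax + b by ordinary least squares, which minimizes the sum of squared errors described above, is given below; the example data are invented.

using System;

// Sketch of ordinary least squares for a single input variable.
public static class RegressionSketch
{
    public static void Fit(double[] x, double[] y, out double a, out double b)
    {
        int n = x.Length;
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int i = 0; i < n; i++)
        {
            sumX += x[i]; sumY += y[i];
            sumXY += x[i] * y[i]; sumXX += x[i] * x[i];
        }
        a = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);  // slope
        b = (sumY - a * sumX) / n;                                   // intercept
    }

    public static void Main()
    {
        double[] x = { 1, 2, 3, 4 };
        double[] y = { 10, 14, 19, 25 };
        double a, b;
        Fit(x, y, out a, out b);
        Console.WriteLine("y = " + a + "x + " + b);   // y = 5x + 4.5
    }
}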

Ordinal Regression (PLUM)

This method helps make predictions with ordinal responses. For example, one can identify the satisfaction level of officers' responses to incident reporting (very dissatisfied, somewhat dissatisfied, somewhat satisfied, or very satisfied). By choosing different link functions, one has the flexibility to fit ordinal logistic regression, ordinal probit, and ordinal Cauchy models, and one can model both the location and scale of the underlying distribution. Ordinal regression also gives options to save the predicted probabilities for all dependent-variable categories back to the data.

Time Series Algorithm

The Time Series algorithm is a regression algorithm that can be used to forecast the expected number of crimes for a year or to forecast a crime scenario. Figure 5.6 shows a typical model for forecasting the number of crimes over time.

Figure 5.6: Sample Time Series Graph


The model shown in Figure 5.6 is made up of two parts: historical information, drawn as the lighter series, and predicted information, drawn as the thicker, darker series. The lighter data represent the information that the algorithm uses to create the model, while the thicker data represent the forecast that the model makes; together they form a series. Each forecasting model must contain an incident series, which is the column that distinguishes between points in a series.

Example

The management team at one of the agencies in West Virginia wants to predict the number of incidents for the coming year. By using the Time Series algorithm on historical data from the past three years, the agency can produce a data mining model that forecasts the number of crime incidents in the future. Additionally, the agency can perform cross predictions to see whether the crime incident trends of other agencies are related.

How the Algorithm Works

Input data for the Time Series model can be defined in the format of Table 5.1:

Table 5.1: Input Data Format for Time Series Model

Incident Date   County        Registered Incidents   No. Crimes
1/2001          Morgan        20                     12
2/2001          Morgan        25                     8
1/2001          Monongalia    15                     6
2/2001          Monongalia    18                     7


The Incident Date column in the table contains a time identifier and has two entries for each date. The County column identifies a county in the database. The Registered Incidents column gives the incidents registered for that date, and the No. Crimes column gives the number of those incidents that are crimes. In this case, the model would contain two predictable columns: Registered Incidents and No. Crimes.

Neural Network Algorithm

The Neural Network algorithm calculates probabilities for each possible state of the input attribute when given each state of the predictable attribute. It uses these probabilities to predict an outcome of the predictable attribute, based on the input attributes.

Example

The Neural Network algorithm is useful for analyzing complex input data, such as forensic laboratory data for which a significant quantity of training data is available but for which rules cannot be easily derived by using other algorithms.

How the Algorithm Works

The Neural Network algorithm uses a Multilayer Perceptron network, also called a Back-Propagated Rule network, composed of up to three layers of neurons, or perceptrons: an input layer, an optional hidden layer, and an output layer. In a Multilayer Perceptron network, each neuron receives one or more inputs and produces one or more identical outputs; each output is a simple non-linear function of the sum of the inputs to the neuron. Inputs pass forward only from nodes in the input layer to nodes in the hidden layer, and then to the output layer; there are no connections between neurons within a layer. A neuron receives several inputs: input neurons receive their inputs from the original data, while hidden and output neurons receive their inputs from the outputs of other neurons in the network. Inputs establish relationships between neurons, and the relationships serve as a path of analysis for a specific set of incidents. Each input has a value assigned to it, called the weight, which describes the relevance or importance of that input to the hidden or output neuron; the greater the weight assigned to an input, the more relevant or important its value is to the receiving neuron when the algorithm determines whether that input successfully classifies a specific incident. The value of the input is multiplied by the weight to emphasize the input for a specific neuron. Correspondingly, each neuron has a simple non-linear function assigned to it, called the activation function, which describes the relevance or importance of a particular neuron to the layer of the network. Hidden neurons use a hyperbolic tangent function for their activation, whereas output neurons use a sigmoid function; both are nonlinear, continuous functions that allow the neural network to model nonlinear relationships between input and output neurons.

Training Neural Networks

The algorithm first evaluates and extracts the training data from the data source. A percentage of the training data, called the holdout data, is reserved for measuring the accuracy of the structure of the resulting model. During the training process, the model is evaluated against the holdout data after each iteration over the training data; when the accuracy of the model no longer increases, the training process is stopped. The algorithm next determines the number and complexity of the networks that the mining model supports. If the mining model contains one or more attributes that are used only for prediction, the algorithm creates a single network that represents all such attributes; if it contains one or more attributes that are used for both input and prediction, the algorithm provider constructs a network for each such attribute. The algorithm provider iteratively evaluates the weights for all inputs across the network at the same time, by taking the set of training data that was reserved earlier and comparing the actual known value for each incident in the holdout data with the network's prediction, in a process known as batch learning. After the algorithm has evaluated the entire set of training data, it reviews the predicted and actual value for each neuron, calculates the degree of error, if any, and adjusts the weights associated with the inputs for that neuron, working backward from output neurons to input neurons in a process known as back propagation. The algorithm then repeats this process over the entire set of training data.
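A minimal sketch of a single forward pass through such a network, with a hyperbolic tangent hidden layer and a sigmoid output, is shown below; the weights are placeholders and training by back propagation is omitted.

using System;

// Sketch of one forward pass: tanh hidden layer, sigmoid output.
public static class NeuralSketch
{
    static double Sigmoid(double z) { return 1.0 / (1.0 + Math.Exp(-z)); }

    // inputs: one incident's attribute values; hiddenWeights[h][i]; outputWeights[h]
    public static double Predict(double[] inputs, double[][] hiddenWeights, double[] outputWeights)
    {
        double[] hidden = new double[hiddenWeights.Length];
        for (int h = 0; h < hiddenWeights.Length; h++)
        {
            double sum = 0;
            for (int i = 0; i < inputs.Length; i++)
                sum += hiddenWeights[h][i] * inputs[i];    // weighted sum of the inputs
            hidden[h] = Math.Tanh(sum);                    // hidden activation
        }

        double outSum = 0;
        for (int h = 0; h < hidden.Length; h++)
            outSum += outputWeights[h] * hidden[h];
        return Sigmoid(outSum);                            // probability-like output in (0,1)
    }
}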

5. 3. 2. 2 Association Algorithm

Association models are built on datasets that contain identifiers both for individual incidents and for the items that the incidents contain. A group of items in an incident is called an itemset. An association model is made up of a series of itemsets and the rules that describe how those items are grouped together within the incidents. The rules that the algorithm identifies can be used to predict a suspect's likelihood of committing a crime, based on the items already established for that suspect. An association algorithm can potentially find many rules within a dataset. The algorithm uses two parameters, support and probability, to describe the itemsets and rules that it generates. For example, if X and Y represent two items that could be in a selected suspect's set of established behaviors, the support parameter is the number of incidents in the dataset that contain the combination of items X and Y.

Example

The agency is redesigning the functionality of its officer groups, with the goal of decreasing incidents in their region. Because the agency records the geographical information of each incident in a transactional database, it can use the Association algorithm to identify sets of incidents that tend to be committed together. It can then predict the incident items that someone may commit, based on items already present in the behavior of existing suspects; for example, a reduced-speed-limit violator may also be suspected of other traffic violations.

How the Algorithm Works

The Association algorithm traverses a dataset to find items that appear together in an incident and groups any such associated items into itemsets. For example, an itemset could be Reduced Speed Violator = Existing, Traffic Violator = Existing, and could have strong support. The algorithm then generates rules from the itemsets. These rules are used to predict the presence of an item in the database, based on the presence of other specific items that the algorithm identifies as important.
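The support and confidence measures behind such rules can be sketched as follows; the sample item names are invented and the code is only an illustration of the idea, not the algorithm used in FIMS.

using System;
using System.Collections.Generic;

// Sketch of support and confidence for a rule X -> Y over a list of incidents,
// where each incident is the list of items it contains.
public static class AssociationSketch
{
    public static void Measure(List<List<string>> incidents, string x, string y,
                               out int support, out double confidence)
    {
        int both = 0, withX = 0;
        foreach (List<string> items in incidents)
        {
            bool hasX = items.Contains(x);
            bool hasY = items.Contains(y);
            if (hasX) withX++;
            if (hasX && hasY) both++;
        }
        support = both;                                   // incidents containing both X and Y
        confidence = withX == 0 ? 0.0 : (double)both / withX;
    }

    public static void Main()
    {
        List<List<string>> incidents = new List<List<string>>();
        incidents.Add(new List<string>(new string[] { "SpeedViolation", "TrafficViolation" }));
        incidents.Add(new List<string>(new string[] { "SpeedViolation" }));
        incidents.Add(new List<string>(new string[] { "Theft", "Burglary" }));

        int support; double confidence;
        Measure(incidents, "SpeedViolation", "TrafficViolation", out support, out confidence);
        Console.WriteLine("support=" + support + " confidence=" + confidence);  // support=1 confidence=0.5
    }
}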

5. 3. 2. 3 Clustering Algorithm

The Clustering algorithm uses iterative techniques to group incidents in a dataset into clusters that contain similar characteristics. These groupings are useful for exploring data, identifying anomalies in the data, and creating predictions. Clustering models identify relationships in a dataset that might not be derived logically through casual observation. For example, one can logically discern that older people do not typically commit crimes a long distance from where they live; the algorithm, however, can find other characteristics about suspects that are not as obvious. In Figure 5.7, Cluster A represents data about suspects who hold a valid driving license, while Cluster B represents data about suspects who do not.

A: Suspects who have driving license B: Suspects who do not have driving license

Figure 5.7: Sample Cluster Diagram

The Clustering algorithm trains the model strictly from the relationships that exist in the data and from the clusters that the algorithm identifies.

Example

Consider a group of suspects, known to the agency, who share similar demographic information and engage in similar activities. This group of people represents a cluster of data, and several such clusters may exist in a database. By observing the columns that make up a cluster, it can be seen more clearly how records in a dataset are related to one another.

How the Algorithm Works

The Clustering algorithm first identifies relationships in a dataset and generates a series of clusters based on those relationships. A scatter plot is a useful way to visually represent how the algorithm groups data, as shown in Figure 5.8. The scatter plot represents all the incidents in the dataset, each incident being a point on the graph; the clusters group points on the graph and illustrate the relationships that the algorithm identifies.

Figure 5.8: Sample Scatter Graph for Cluster Groups

After first defining the clusters, the algorithm calculates how well the clusters represent groupings of the points, and then tries to redefine the groupings to create clusters that better represent the data. The algorithm iterates through this process until it can no longer improve the results by redefining the clusters. The Clustering algorithm offers a method for calculating how well points fit within the clusters: clustering with K-Means. For K-Means, the algorithm uses a distance measure to assign each data point to its closest cluster.

KNN (K*): K* is an instance-based classifier; that is, the class of a test instance is based upon the class of those training instances similar to it, as determined by some similarity function. The underlying assumption of instance-based classifiers such as K*, IB1, and PEBLS is that similar instances will have similar classes.

Two-Step cluster analysis: One can work with very large datasets using this scalable cluster analysis algorithm, which can handle both continuous and categorical variables or attributes. This procedure enables one to group data so that records within a group are similar. For example, it can be applied to data that describe crime locations, gender, age, population, race, and so on; the investigation patterns and crime prevention strategy can then be customized for each group to investigate crime more easily.

Hierarchical cluster analysis: This method starts with clusters of a single record and forms groups until all clusters are merged. One can choose from more than 40 measures of similarity or dissimilarity, standardize data using several methods, and cluster incidents or variables. One can also analyze raw variables or choose from a variety of standardizing transformations, generate distance or similarity measures using the proximities procedure, and display statistics at each stage to help select the best solution. This procedure is recommended for smaller datasets, for example focus group lists. A bio-chemistry researcher could use hierarchical cluster analysis to identify DNA types that look similar for each race type, clustering DNA into homogeneous groups based on racial characteristics to identify cluster segments.
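A minimal sketch of the K-Means assignment and update steps mentioned above is given below; the two-dimensional points and the fixed number of iterations are assumptions for illustration.

using System;

// Sketch of K-Means: assign each point to its nearest centre, then move each
// centre to the mean of its assigned points, for a fixed number of iterations.
public static class KMeansSketch
{
    public static int[] Cluster(double[][] points, double[][] centres, int iterations)
    {
        int[] assignment = new int[points.Length];
        for (int it = 0; it < iterations; it++)
        {
            // assignment step: nearest centre by squared Euclidean distance
            for (int p = 0; p < points.Length; p++)
            {
                double best = double.MaxValue;
                for (int c = 0; c < centres.Length; c++)
                {
                    double d = 0;
                    for (int k = 0; k < points[p].Length; k++)
                        d += (points[p][k] - centres[c][k]) * (points[p][k] - centres[c][k]);
                    if (d < best) { best = d; assignment[p] = c; }
                }
            }
            // update step: move each centre to the mean of its points
            for (int c = 0; c < centres.Length; c++)
            {
                double[] sum = new double[centres[c].Length];
                int count = 0;
                for (int p = 0; p < points.Length; p++)
                {
                    if (assignment[p] != c) continue;
                    count++;
                    for (int k = 0; k < sum.Length; k++) sum[k] += points[p][k];
                }
                if (count == 0) continue;                 // leave an empty cluster's centre alone
                for (int k = 0; k < sum.Length; k++) centres[c][k] = sum[k] / count;
            }
        }
        return assignment;
    }
}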


CHAPTER 6: DATABASE DESIGN AND IMPLEMENTATION


6. 1 Database

MS SQL Server 2005 is a sophisticated relational database management system (RDBMS). An RDBMS enables users to store related pieces of data in two-dimensional data structures called tables. These data may consist of many defined types, such as integers, floating-point numbers, character strings, and timestamps. Data inserted into a table can be organized using a grid-like system of vertical columns and horizontal rows. While MS SQL Server is commonly considered an RDBMS, or a database, it may not be commonly understood what is meant specifically by the word database. A database within MS SQL Server is an object-relational implementation of what is formally called a database schema. Simply put, a database is a stored set of data that is logically interrelated; typically, these data can be accessed in a multi-user environment. It may also not be commonly understood that MS SQL Server can have several databases concurrently available, each with its own owner and each with its own unique tables, views, indices, sequences, and functions. A database contains a number of tables, where the data are stored in a systematic order. Tables are quite possibly the most important aspect of SQL to understand inside and out, as all of the data reside within them; in order to correctly plan and design SQL data structures, and any programmatic routines for accessing and applying those data, a thorough understanding of tables is an absolute prerequisite. A table is composed of columns and rows, and their intersections are fields. Columns within a table describe the name and type of the data that will be found (and can be entered) in that column's fields for each row. Rows within a table represent records composed of fields that are described from left to right by their corresponding column's name and type. Each field in a row is implicitly correlated with every other field in that row. A data provider may be a database management system (DBMS) such as a Microsoft SQL Server database, an unstructured data store such as Microsoft Exchange, or a transaction-processing mechanism such as Transaction Services or Message Queuing; a single application can enlist the services of one or more of these data providers. The data tier is responsible for retrieving, storing, and updating information. Separating the application into layers isolates each major area of functionality: the presentation is independent of the business logic, which in turn is separate from the data. Designing applications in this way has its trade-offs; it requires a little more analysis and design at the start, but greatly reduces maintenance costs and increases functional flexibility.

6. 2 Database Schema

The crime database was created from the various tables described below. The relation between different data entities can be one-to-one or one-to-many; in the schema diagrams, the key symbol represents the primary key and the connecting symbol represents the many side of a one-to-many relationship. The relationships between the various tables of the database are described by the database schema. For the crime database, various data fields were identified to cover real-world scenarios. The inputs were taken from the Forensic Information Management System (FIMS) developed for the West Virginia State Police Forensics Laboratory (WVSPFL).


The entity tables identified are: 1) Incident, 2) Suspect Physical, 3) Suspect Variable, 4) Verdict, 5) Victim Physical, 6) Victim Variables, 7) Agency Info, and 8) Convict.

6. 3 FIMS 3. 0 Data Tables

The fields of the data tables were built from various sources, as stated earlier. The tables are populated as incidents are registered by the crime-investigating agency. Data preparation is addressed in a later section.

6. 3.1 Incident Table

The fields of the Incident table are shown in Table 6.1. This table stores data about incident submissions.


Table 6.1: Incident Table fields
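As a minimal sketch of how such a table could be created in SQL Server 2005 from code, the following uses a reduced and hypothetical set of columns; the actual field list is the one shown in Table 6.1.

using System.Data.SqlClient;

// Sketch only: creates a cut-down Incident table. Column names and types are
// guesses for illustration, not the full FIMS field list.
public static class SchemaSketch
{
    public static void CreateIncidentTable(string connectionString)
    {
        const string ddl =
            "CREATE TABLE Incident (" +
            "  IncidentID    INT IDENTITY(1,1) PRIMARY KEY," +
            "  IncidentDate  DATETIME      NOT NULL," +
            "  County        VARCHAR(50)   NULL," +
            "  IncidentType  VARCHAR(50)   NULL," +
            "  AgencyID      INT           NULL" +
            ")";

        using (SqlConnection connection = new SqlConnection(connectionString))
        using (SqlCommand command = new SqlCommand(ddl, connection))
        {
            connection.Open();
            command.ExecuteNonQuery();
        }
    }
}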

6. 3.2 Suspect Physical Table

The various fields of the Suspect Physical Table are shown in Table 6.2. This table stores details of the suspect's physical description, such as age, height, and hair color.


Table 6.2: Suspect Physical Table Fields

6. 3.3 Suspect Variable Table

The various fields of the Suspect Variable Table are shown in Table 6.3. This table stores details of the suspect's variable description, such as behavioral aspects, weapon training, and personal condition.


Table 6.3: Suspect Variable Table Fields

6. 3.4 Verdict Table

The fields of the Verdict Table are shown in Table 6.4. This table stores the official information about the verdicts.


Table 6.4: Verdict Table Fields

This table stores information pertaining to the different verdicts involved in crimes, such as status, sentence, and problem date.


6. 3.5 Victim Physical Table

The various fields of the Victim Physical Table are shown in Table 6.5. This table stores details of the victim's physical description, such as age, height, and hair color.

Table 6.5: Victim Physical Table Fields


6. 3.6 Victim Variable Table

The various fields of the Victim Variable Table are shown in Table 6.6. This table stores details of the victim's variable description, such as behavioral aspects, weapon training, and personal condition.

Table 6.6: Victim Variable Table Fields


6. 3.7 Agency Information Table

The fields of the AgencyInfo Table are shown in Table 6.7. This table stores information pertaining to the different agencies involved in crime fighting.

Table 6.7: AgencyInfo Table Fields

6. 3.8 Convict Table The fields of the Convict table are shown in Table 6.8. This table stores the official information about the individuals convicted of crime.

Table 6.8: Convict Table Fields


6. 4 Sorted Test Data Table

From the above database, an additional test table, called the Test Data Table, was created for analysis purposes. This table is built by dynamically choosing some of the fields from various tables of the database, and it can be populated with the original data by choosing the available tables and columns. The focus of this initial table is to find crime patterns. The fields of the Test Data Table are shown in Table 6.9.

Table 6.9: Test Data Table Fields

The salient features of the Test Data Table are: Important fields, based on domain knowledge and inputs from agencies, were duplicated from the database into the Test Data Table. These fields were placed in a single table while preserving the integrity and normalization constraints.


While doing so, the privacy and security of the victim, suspect and the convicts are ensured.

All fields of the table are of nominal, non-ordinal type, and most are nullable. Moreover, there is no independent/dependent variable relationship in the data.

The data entries are sequential in terms of the date of registration, not the date of the actual event.

6. 5 Assumptions

These data tables have predefined relations between them, and certain database constraints are required to express them. These constraints provide data stability and at the same time address security issues. The assumptions are as follows:

One criminal offense can be registered with many agencies.

One criminal offense is assigned to one investigating officer only; however, one investigating officer can handle multiple criminal offenses.

One criminal offense can include multiple suspects, victims, and convicts, and they may have multiple aliases carrying additional sets of information.

One convict can face multiple criminal charges, and each criminal charge will have a single verdict.


CHAPTER 7: APPLICATION STUDY


7. 1 FIMS homepage

By default, the initial screen of FIMS is the home page screen as shown in Figure 7.1. The screen shows the different modules of the system. User can click on the navigation bar tab on the top section to navigate to that page.

Figure 7.1: FIMS home page


7. 2 Login Tab

The FIMS login tab has two hyperlinks: one redirects the user to the system administrator login page and the other to the agency search page.

Figure 7.2: Login Tab Home Page

The Login Tab home page shows the links to System Administrator and Agency Login.

7. 3 System Administrator Tab

The System Administrator login page allows the user to enter their credentials; the screen looks similar to the one in Figure 7.3. Each user must be enrolled with the system before accessing the work area. The System Administrator can add, edit, delete, and review the agencies.


Figure 7.3: System Administrator Home Page

The header region enables the user to navigate to the FIMS home page or end the current session. It also displays the user's full name and the current date in the upper right-hand corner. The side navigation bar shows the Unit name and input boxes for the username and password. After the user credentials are verified, the content is updated according to the Unit. For example, WVSPFL users will see the list of authorized Units they can work with after a successful login. The main work area is the space in which all of the user's action results are displayed; for example, if a user clicks a Unit's hyperlink, that Unit's forms and reports will be displayed in this region. The agency login hyperlink redirects the agency user to find his or her agency by agency name and location; the user can also type an agency ID and search for the agency. Figure 7.4 shows the agency search page for the West Virginia State Police at South Charleston, where the user is redirected to his or her respective agency.

Figure 7.4: Agency Search with Agency Name and Location or Agency ID

7. 4 Agency Login Tab

The agency login home page, shown in Figure 7.5, allows users to enter their credentials and work in their authorized units. Each user must be enrolled with the agency before accessing the work area. The Agency Administrator has permission to add, edit, and delete divisions and units for that agency.


Figure 7.5: Sample Agency (WVSP) Home Page and Login Screen 7. 5 Contact Us Tab

The Contact Us tab in the navigation bar, shown in Figure 7.6, displays contact information for the FIMS project.


Figure 7.6: Contact Us Page

7. 6 Data Analysis with Mining Methods

The Data Analysis home page allows the user to create and view data, apply statistical and data mining methods, and view graphical representations of selected data. Figure 7.7 shows the user interface of the data analysis home page. On the left side of the page, as in Figure 7.7, there is a vertical navigation bar with hyperlinks to Data File Management, Data Mining/Data Visualization, and Incident Statistics.


Figure 7.7: Data Mining Home Page 7. 6.1 Data File Management

The Data File Management tab on the navigation bar allows the user to create a data file for further analysis. As shown in Figure 7.8, the Data File Management page has four hyperlinks: Data Query Builder Wizard, Create Arff Data File, Arff Data File Viewer, and Create and View XML Data File. The Data Query Builder Wizard tab allows the user to select data from the database. The Create Arff Data File tab creates an Arff file from the previously selected data. The Arff Data File Viewer tab displays the Arff file data in a data grid. The Create and View XML Data File tab creates an XML file from the data file.


Figure 7.8: Data File Management Home Page

7. 6.2 Data Mining/Data Visualization

The Data Mining/Data Visualization tab on the navigation bar allows the user to perform logical analysis using data mining algorithms. As shown in Figure 7.9, the Data Mining/Data Visualization page has two hyperlinks: Logical Analysis and Graphical Output Data Visualization. The Logical Analysis tab applies data mining methods to the selected query wizard data. The Graphical Output Data Visualization tab displays the output as a graphical decision tree.


Figure 7.9: Data Mining Home Page

7. 6.3 Incident Statistics

The Incident Statistics tab on the navigation bar allows the user to perform statistical analysis using basic statistical methods. As shown in Figure 7.10, the Incident Statistics page has three hyperlinks: Numerical Statistics, Crime between Dates, and Graphical View. The Numerical Statistics tab presents the statistical results as numerical values. The Crime between Dates tab shows the crime incidents for specified dates. The Graphical View tab displays the statistical results on a graph.


Figure 7.10: Incident Statistics Home Page

7. 7 Building a Data Mining Model

Data mining derives patterns and trends that exist in data. These patterns and trends can be collected together and defined as a mining model. Mining models can be applied to specific Agency Incident scenarios, such as:

Forecasting possible crime activity.

Targeting locations toward specific criminals.

Determining which activities are likely to occur together in most crime incidents.

Finding sequences in the order in which criminal acts are committed.

An important concept is that building a mining model is part of a larger process. This process can be defined by using the following five basic steps:

77

1. Defining the Problem 2. Preparing Data 3. Exploring Data 4. Building Models 5. Exploring and Validating Models Creating a data mining model is a dynamic and iterative process. Table 7.1 gives some guidelines for using particular data mining algorithms according to problem type.

78

Table 7.1: List of Data Mining Algorithms


Table 7.1 cross-references each method against the problem types it addresses (association, classification, clustering, prediction, sequence discovery, and description/summarization) and its typical input and output. The methods listed are: Regression Analysis; Naive Bayes Classification Algorithm (Probabilistic); Decision Rules or Trees (DT): ID3, C4.5, Classification And Regression Trees (CART), Chi Squared Automatic Interaction Detection (CHAID), and Quick Unbiased Efficient Statistical Tree (QUEST); Rule Induction; Neural Networks: Back Propagation and Self-Organizing Map (Kohonen); Genetic Algorithm; Association Algorithm (Apriori); Agglomerative and Divisive Algorithms (Hierarchical Clustering); k-Nearest Neighbor and k-Means Clustering (Partitional Clustering); Markov and Hidden Markov Models; and Data Visualization. Typical inputs range from numerical values and sample computational data to transactional data, feature data, binary streams, and instance-based data; typical outputs include multivariate equations, class probabilities, decision rules and trees, sets of rules, frequencies of feature associations, clusters or groupings of data based on distance from a centroid, and sequence probabilities.


7. 7. 1 Defining the Problem

The first step in the data mining process is to clearly define the problem. This step includes analyzing the requirements, defining the scope of the problem, defining the metrics by which the model will be evaluated, and defining the final objective. These tasks translate into questions such as the following:

What are you looking for? Which attribute of the dataset do you want to try to predict? What types of relationships are you trying to find?

7. 7. 2 Preparing Data

The second step is to consolidate and clean the data that was identified in the Defining the Problem step. The Data File Management and Incident Statistics tabs of the FIMS page provide the tools to select the initial data.

7. 7. 2. 1 Data Analysis and Query

Queries can be used to quickly analyze and sort information in a database. A query allows one to pose a question to the database by specifying particular criteria. Queries allow the users to specify:

The table fields that appear in a query

The order of the fields in a query

Filter and sort criteria for each field in a query

Select queries are the most commonly used type of query in many database applications. A Join statement is used to query multiple tables. Search conditions are tests that records must pass; the retrieved dataset will contain the records that meet the given condition. When the query is run, SQL Server retrieves data from the tables and creates a dataset. A select query can be used to select certain data from a table or tables; it basically filters and sorts the data and can perform simple calculations, such as adding and averaging.

7. 7. 2. 2 Query Design

To make the tool more user friendly, the user is given the flexibility to select from the following options:

View and select the tables the user wants to include in a query

View and select the fields of the tables the user wants to include in a query

Add conditions such as AND, OR, NOT, and their combinations to each of the selected fields

Based on the choices made by the user, the query is generated at run time

Finally, a virtual table with multiple column fields from the selected tables is created

7. 7. 2. 3 Query Implementation

The data wizard dynamically retrieves all available tables from a particular SQL Server database (if a table is added in the future, it will appear among the available tables on the next run). The user has the option to select all or only a few of the tables from the database. The data wizard then allows the user to select the required fields from the selected tables and specify conditions and value selections. Figure 7.11 shows that the user has selected the AgencyInfo and Convict tables from the available database tables. The Show Selected Tables option allows the user to view the selected tables, and the Show Columns for All Selected Tables option populates all the columns for the corresponding tables. In Figure 7.11 the user has selected the fields AgencyType, Street, County, and Zipcode from the AgencyInfo table and Arrestdate, CourtAssigned, and Trail from the Convict table. After adding conditions to the query, the user can check whether the syntax is correct. The Build SQL option allows the user to obtain the dynamically generated SQL query statement.
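The exact statement produced by the wizard is not reproduced in the text; a plausible form of it, based on the fields named above, might look like the following, where the join column and the example condition are assumptions.

// A plausible form of the statement the wizard could emit for the selections
// described above. The join column (AgencyID) and the WHERE clause are
// assumptions; the actual statement depends on the schema and chosen conditions.
public static class GeneratedQuerySketch
{
    public const string Sql =
        "SELECT AgencyInfo.AgencyType, AgencyInfo.Street, AgencyInfo.County, AgencyInfo.Zipcode, " +
        "       Convict.Arrestdate, Convict.CourtAssigned, Convict.Trail " +
        "FROM AgencyInfo " +
        "INNER JOIN Convict ON Convict.AgencyID = AgencyInfo.AgencyID " +
        "WHERE AgencyInfo.County = 'Kanawha'";   // example user-added condition
}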

Figure 7.11: SQL Query Building Wizard

The query is run, based on the choices made by the user, with the Run SQL button; a sample screen looks similar to the one in Figure 7.12. The query output is displayed in a data grid. To provide a better user interface, check boxes were designed and implemented; this feature gives the user the flexibility to query the database with complex queries and to view only the desired information. Based on the fields selected by the user, an .arff file is created.


Figure 7.12: Query Data Extraction and Creating data file

7. 7. 3 Exploring Data

The third step in the data mining process is to explore the prepared data. The user is also given the option View Arff File Data, which displays the data of the corresponding attributes of the .arff file between which the analysis is being performed. Figure 7.13 shows the Arff file viewer page and its data grid.


Figure 7.13: ARFF File Viewer

The data exploration techniques include calculating minimum and maximum values, calculating means and standard deviations, and looking at the distribution of the data. Incident Statistics provides basic statistics such as the number of crimes per year, the average number of crimes per month, and a graphical representation of the number of crimes for each individual agency. In Figure 7.14, the graph shows the number of crimes on the vertical axis and the agency names on the horizontal axis.


Figure 7.14: Basic Crime Statistics and Graphical Data Visualization

7. 7. 4 Building Models

The fourth step in the data mining process is to build the mining models. Before a model is built, the prepared data are randomly separated into training and testing datasets: the training dataset is used to build the model, and the testing dataset is used to test its accuracy. Once the .arff data file is generated, data analysis can be carried out under the FIMS data mining tab, where the user is provided with an option to choose from nine different machine learning algorithms. The Data Analysis page shows the nine algorithms with radio buttons. The Show Analysis Class button displays all the class labels in the dataset, and the output results are shown in a text box. Figure 7.15 shows a sample screen similar to the Data Analysis page. The Start Learning button runs the algorithm; a grid on the right side shows each algorithm and its prediction accuracy, a graph on the left shows a graphical comparison of each method for the given dataset, and a label above the results text box shows the algorithm that was selected.

Figure 7.15: Machine Learning Algorithms

7. 8 Application Study for WVSPF Lab Dataset

Concern about national security has increased due to recent terrorist attacks. Federal agencies are actively collecting domestic and foreign intelligence to prevent future attacks. These efforts have in turn motivated local authorities to monitor criminal activities in their own jurisdictions more closely. A major challenge facing all law-enforcement and intelligence-gathering organizations is accurately and efficiently analyzing the growing volumes of crime data. Data mining is a powerful tool that enables criminal investigators who may lack extensive training as data analysts to explore large databases quickly and efficiently.

7. 8. 1 Crime Data Mining Techniques

Traditional data mining techniques such as association analysis, classification and prediction, cluster analysis, and outlier analysis identify patterns in structured data. Newer techniques identify patterns from both structured and unstructured data. As with other forms of data mining, crime data mining raises privacy concerns. Classification finds common properties among different crime entities and organizes them into predefined classes. This technique has been used to identify the source of e-mail spamming based on the sender's linguistic patterns and structural features. Often used to predict crime trends, classification can reduce the time required to identify crime entities. However, the technique requires a predefined classification scheme. Classification also requires reasonably complete training and testing data, because a high degree of missing data would limit prediction accuracy.

7. 8. 2 Data Mining Techniques for Crime Type

Understanding the relationship between analysis capability and crime type characteristics can help investigators use these techniques more effectively to identify trends and patterns, address problem areas, and even predict crimes. Based on the West Virginia State Police Forensic Laboratory's crime classification database and on the existing literature, the crime data mining techniques were applied to criminal and intelligence analysis and to the crime types for Suspect, Incident, and Victim profiles. Investigators can apply the various techniques independently or jointly to tackle a particular crime analysis problem. The WVSP has a large collection of information recorded by officers at the time of each incident. Crime rates are changing rapidly, and improved analysis enables hidden patterns of crime, if any, to be discerned without explicit prior knowledge of these patterns. With this background, the study was planned with the following objectives:

Prediction of a crime based on the suspect, victim, and crime characteristics of existing data, anticipation of the crime rate, and assistance to the police with crime investigation.

Guidance in predicting a crime.

Discerning trends and identifying an analytical solution for investigating officers that can routinely be used to associate types of incidents, location, time, and descriptive details of the incident.

These approaches preprocess data to quickly generate relevant results, analyze patterns and the co-occurrence of identified concepts, and develop an automated solution to identify crime patterns. The agency has records about criminal offenses containing descriptive information on the crime type, time of incident, type of weapon, and details of the incident. The key terms extracted include terms referring to the age, gender, and physical description of the suspect, the time and location of the incidents, and the type of crime resulting from the incidents.


The attributes selected from the crime data are:

IncidentID: individual crimes are designated by unique incident IDs.
CrimeName: the name of the crime.
Gender: the gender of the individual.
Age: the age of the individual criminal.
Location: the location of the incident.
CrimeType: the crime category to which the particular criminal belongs.

Table 7.2 shows a partial dataset extracted from the incident database. The fields Incident Year, Incident Month, Incident Date, Incident Hour, Incident Minutes, Incident Day, County, Gender, Race, and Age were used to predict the class label, the likelihood of committing a crime. Table 7.3 shows the possible values for each known class.
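For illustration, the attributes listed above could be represented in code roughly as follows; the field types are assumptions, since in the study all values are treated as nominal strings.

// Sketch of a record holding the selected attributes; names follow the list above.
public class CrimeRecordSketch
{
    public string IncidentID;   // unique identifier of the incident
    public string CrimeName;    // name of the crime
    public string Gender;       // gender of the individual
    public string Age;          // age (or age group) of the individual
    public string Location;     // location of the incident
    public string CrimeType;    // which crime category the record belongs to
}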

Table 7.2 Shows the Partial Data Set for the Crime Analysis


Table 7.3 Shows the Possible Known Class Set for Test Data
Column Name                                   Possible Values
County                                        {Berkley, Cabell, Hancock, Harrison, Kanawha, Marion, Mercer, Monongalia, Raleigh}
IncidentDay                                   {Fri, Mon, Sat, Sun, Thru, Tue, Wed}
IncidentMonth                                 {April, Aug, Dec, Feb, Jan, July, June, March, May, Nov, Oct, Sept}
IncidentType                                  {Accident, Arson, Assault, Burglary, Drugs, Murder, Rape, Sex_Assault, Weap_Threat}
IncidentYear                                  {1991, 2004, 2005, 2006}
OffenderAgeGroup                              {26to40, Above40, Below25}
OffenderCriminalHistory                       {no, yes}
OffenderGender                                {F, M}
OffenderRelationShipToVictim                  {known, unknown}
ProximityofCrimeTooffender                    {Above40, Below25, Between26to40, Outof5miles, Within5miles}
VictimRace                                    {AorPI, Black, Hspnc, White}
xLikelyhoodOfOffCommitCrime (unclassified)    {No, Yes}

7. 8. 3 Training Set

In the data classification process, a model is built describing a predetermined set of data classes or concepts. The model is constructed by analyzing database tuples described by attributes. Each tuple is assumed to belong to a predefined class, as determined by one of the attributes, called the class label attribute. The data tuples are also referred to as samples, examples, or objects. The data tuples analyzed to build the model collectively form the training data set. The individual tuples making up the training set are referred to as training samples and are randomly selected from the sample population.
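As a rough illustration of how training samples can be drawn at random from the sample population, the following Weka-based Java sketch shuffles a data set and holds part of it out for testing. The file name and the 90/10 split ratio are illustrative assumptions (the thesis data set uses 295 training and 35 test instances), not the tool's actual implementation.

    import java.util.Random;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TrainTestSplitSketch {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("crime.arff");      // illustrative file name
            data.setClassIndex(data.numAttributes() - 1);        // class label is the last attribute

            data.randomize(new Random(1));                        // random selection of training samples
            int trainSize = (int) Math.round(data.numInstances() * 0.9);
            int testSize  = data.numInstances() - trainSize;

            Instances train = new Instances(data, 0, trainSize);         // training samples
            Instances test  = new Instances(data, trainSize, testSize);  // held-out test samples
            System.out.println(train.numInstances() + " training / " + test.numInstances() + " test instances");
        }
    }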

7. 8. 4 Estimating Predictive Accuracy

The accuracy of a model on a given test set is the percentage of test set samples that are correctly classified by the model. The holdout method uses a test set of class-labeled samples. These samples are randomly selected and are independent of the training samples. For each test sample, the known class label is compared with the learned model's class prediction for that sample. If the accuracy of the model is considered acceptable, the model can be used to classify future data tuples or objects for which the class label is not known.
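A minimal Java sketch of the holdout estimate described above, using Weka's evaluation classes (Weka is the validation tool cited later in this chapter), is shown below. The file names are illustrative assumptions, and Weka's built-in NaiveBayes classifier stands in for the web tool's own algorithms.

    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class HoldoutAccuracySketch {
        public static void main(String[] args) throws Exception {
            Instances train = DataSource.read("crime_train.arff");  // illustrative file names
            Instances test  = DataSource.read("crime_test.arff");
            train.setClassIndex(train.numAttributes() - 1);
            test.setClassIndex(test.numAttributes() - 1);

            Classifier model = new NaiveBayes();
            model.buildClassifier(train);                 // the model never sees the test samples

            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(model, test);              // compare known labels with predictions
            System.out.printf("Accuracy on test set: %.2f%%%n", eval.pctCorrect());
        }
    }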

7. 8. 5 Comparing Classification Methods

Classification methods can be compared and evaluated according to the following criteria:

- Predictive accuracy: the ability of the model to correctly predict the class label of new or previously unseen data. Table 7.4 shows each algorithm and its predictive accuracy. Because the data are nominal, it can be noticed that the current dataset is not suitable for the ID3 and Perceptron algorithms.
- Speed: the computational cost involved in generating and using the model.
- Robustness: the ability of the model to make correct predictions given noisy data or data with missing values.
- Scalability: the ability to construct the model efficiently given a large amount of data.
- Interpretability: the level of understanding and insight provided by the model.


Table 7.4: Prediction Accuracy

Algorithm                        Accuracy (%)
O_Rule Algorithm                 72.727
ID3 Algorithm                    60.960
KNN Algorithm                    64.136
Distance Rule Algorithm          27.272
IB1 Algorithm                    60.559
Perceptron Algorithm             Not Applicable
B40 Algorithm                    60.9601
Naive Bayes Simple Algorithm     72.727

Figure 7.16 shows each algorithm and its prediction performance as a percentage. It can be noticed that the O-Rule algorithm performs much better than the other algorithms on this kind of crime data set.

Figure 7.16 Prediction Performance Comparisons.


7. 8. 6 Crime Data Decision Trees

A tree can be regarded as a hierarchically organized set of rules. A prediction for a new example is made by following a path from the root of the tree to a leaf node. The actual path from the root is decided by the outcome of the tests associated with each node. For example, when making a prediction with the tree in Figure 7.15, which was built from the Test Data table, the intensity of the criminal history determines whether to go to the left (high), right (medium), or straight (low) child of the root. Assume there is an example incident for which this intensity is medium; we therefore go to the right, and the next test concerns whether the gender is male or female. Assume that in this example the gender is indeed male. A leaf of the tree is then reached and no more tests are required to make a prediction for the example. Since this particular tree concerns classification (i.e., prediction of a categorical value as opposed to a numerical value), the estimated class probabilities at the leaf are used to form the prediction. The class label shown at each node of the tree represents these estimated probabilities, which have been formed from the training data (i.e., the data that was used to build the model).


Figure 7.15: A Tree Built with the Web Tool from Criminal Offender Data

Each leaf in a tree corresponds to an if-then rule, where the condition of the rule (the if-part) is formed by making a conjunction (i.e., combining with the Boolean operator AND) of all conditions on the path from the root to the leaf. The conclusion of the rule (the then-part) is, in the case of classification, to predict the most likely class. The rules obtained from the leaves of the above tree are shown in Figure 7.16; the estimated class probabilities can be expressed not only by a class label value but also as real numbers.
IF OffenderCriminalHistory = yes AND Gender = Male THEN AgeGroup = 26to40
IF OffenderCriminalHistory = no AND Proximity = Within5mls THEN AgeGroup = Below25
IF OffenderCriminalHistory = yes AND Proximity = Within5mls THEN AgeGroup = 26to40

Figure 7.16: A Rule Built with the Analysis Tool from Offender Crime Data

As a consequence of how the rules in a tree are organized, they are all non-overlapping, i.e., the conditions of the rules are mutually exclusive, and together the rules cover all possible examples. This means that there are no multiple, possibly conflicting, rules covering the same example and no cases where none of the rules apply: there is always exactly one rule that applies.
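For concreteness, the path-following behaviour of the tree can be written out as nested if statements. The Java sketch below encodes only the three example rules from Figure 7.16; the class name, method name, and default branch are illustrative assumptions, since the figure shows just part of the tree built by the tool.

    public class TreePredictionSketch {
        // Branching follows the example rules in Figure 7.16; the tree in Figure 7.15
        // covers more cases, so a default is kept for paths not shown in the figure.
        static String predictAgeGroup(String criminalHistory, String gender, String proximity) {
            if ("yes".equals(criminalHistory)) {
                if ("Male".equals(gender)) {
                    return "26to40";              // history = yes AND gender = Male
                }
                if ("Within5mls".equals(proximity)) {
                    return "26to40";              // history = yes AND proximity = Within5mls
                }
            } else if ("Within5mls".equals(proximity)) {
                return "Below25";                 // history = no AND proximity = Within5mls
            }
            return "unclassified";                // path not shown in the partial figure
        }

        public static void main(String[] args) {
            // The worked example from the text: known criminal history, male offender.
            System.out.println(predictAgeGroup("yes", "Male", "Within5mls"));  // prints 26to40
        }
    }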

7. 8. 7 Rule Sets

Rule sets are sets of rules with no specific order. This means that one may look at any rule in the set and interpret it independently of its position. This contrasts with so-called decision lists, for which rules can only be interpreted in their context. Part of a rule set generated for the crime dataset is shown in Figure 7.17.

Figure 7.17: Rule Set for the Test Data Table

Each rule in a rule set is interpreted in exactly the same way as the rules in a tree. There is, however, one major difference between the rules in a rule set and those in a tree when considered as a whole: each rule in a set has a unique root to which tests are associated independently of the other rules. In contrast, rules in a tree share a number of conditions corresponding to the path from the root that they have in common. In particular, this means that all rules in a tree will contain a condition concerning the variable associated with the root, whereas the rules in a rule set may consider completely different variables. This characteristic of rule sets may be beneficial in particular when the target function is disjunctive, i.e., when there are sub-groups of a class that can be defined using different sets of variables. In contrast to the rules in a tree, the rules in a rule set may overlap, and in some cases none of the generated rules apply.

The final step in the model building process is validating the model with other software tools such as Weka [35]. Table 7.5 shows the Weka output for the crime data set. As shown in Appendix A, the training data contains 295 instances and the test data contains 35 instances. Each algorithm was trained on the training data set, and the test data set was used to determine how well the algorithms classified the data. The O-Rule and Perceptron algorithms were able to classify 88% of the instances correctly. Such accuracy holds only for this data set and cannot be generalized.
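The rule-set behaviour described above, overlapping rules and the possibility that no rule fires, can be sketched as below. The rules are the illustrative ones from Figure 7.16, and the class, field names, and default handling are assumptions rather than the tool's implementation; an instance matched by no rule is counted as unclassified, which corresponds to the "No Classification" column of Table 7.5.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.function.Predicate;

    public class RuleSetSketch {
        // A rule pairs an independent condition with a predicted age group.
        record Rule(Predicate<Map<String, String>> condition, String prediction) {}

        public static void main(String[] args) {
            List<Rule> rules = new ArrayList<>();
            // Illustrative rules taken from Figure 7.16; each is interpreted on its own.
            rules.add(new Rule(r -> "yes".equals(r.get("History")) && "M".equals(r.get("Gender")), "26to40"));
            rules.add(new Rule(r -> "no".equals(r.get("History")) && "Within5mls".equals(r.get("Proximity")), "Below25"));
            rules.add(new Rule(r -> "yes".equals(r.get("History")) && "Within5mls".equals(r.get("Proximity")), "26to40"));

            Map<String, String> incident = Map.of("History", "yes", "Gender", "M", "Proximity", "Within5mls");

            List<String> predictions = new ArrayList<>();
            for (Rule rule : rules) {
                if (rule.condition().test(incident)) {
                    predictions.add(rule.prediction());   // rules may overlap, so several can fire
                }
            }
            if (predictions.isEmpty()) {
                System.out.println("unclassified");       // counted under "No Classification"
            } else {
                System.out.println(predictions);          // here the first and third rule both fire
            }
        }
    }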

Table 7.5: Correctly/Incorrectly Classified Instances for Crime Data from Weka

Algorithm                        Train    Test    Correctly     Incorrectly    % Correctly    % Incorrectly    No
                                 Set      Set     Classified    Classified     Classified     Classified       Classification
O_Rule Algorithm                 295      35      31            4              88.57          11.42            0
ID3 Algorithm                    295      35      23            3              65.71           8.57            9
Naive Bayes Simple Algorithm     295      35      26            9              74.28          25.71            0
KNN Algorithm                    295      35      24            11             68.57          31.42            0
Distance Rule Algorithm          295      35      26            9              74.28          25.71            0
IB1 Algorithm                    295      35      24            11             68.57          31.42            0
Perceptron Algorithm             295      35      31            4              88.57          11.42            0
B40 Algorithm                    295      35      29            6              82.85          17.14            0


CHAPTER 8: CONCLUSION AND FUTURE WORK


8. 1 Conclusion

The objective of this thesis was to develop a web-based data mining tool in ASP.NET 2.0 with SQL Server 2005 as the backend. This tool can help the West Virginia State Police Forensics Laboratory (WVSPFL) identify crime patterns from the incident data (crime, suspect, and victim information) stored in the FIMS database. It will assist WVSPFL and other law enforcement agencies in better utilizing their resources and in anticipating possible criminal activity. The key features of this tool are:

- Dynamic data set creation
- Crime pattern identification with nine different machine learning algorithms
- Visualization of output data in a simple and easily understood format
- Creation of output files in XML format, a widely used data transfer technique
- Web-based operation, requiring only an Internet connection and a browser for anytime access to data analysis

The user is given the option to dynamically choose the required tables and their fields as per the analysis requirements. The output of the .arff file is displayed in a readable and easily understood format. The sample results show that the current web tool can generate crime pattern identification rules which can guide law enforcement agencies in predicting possible criminal activity. The classification statistics produced by Weka on the same crime data also corroborated the web tool's results. For the sample crime data set, the O-Rule and Perceptron algorithms were able to correctly classify 88% of the crime instances.

8. 2 Future Work

This data mining tool was developed with a number of distinctive features. It can be further extended to include analysis of the date and time fields as well. The tool was developed in such a manner that new algorithms can be incorporated into it with ease. The graphical representations can be enhanced to show the relation between the selected fields of each table.


REFERENCES
[1] Adekeye A., "The Importance of Management Information Systems", Library Review, Vol. 46, No. 5, pp. 318-327, 1997.
[2] Schwiesow D., "Team Management: New Web-based Management Systems Can Dramatically Improve Communication", Engineering Inc., October 2004.
[3] Paynter J., Pearson M., "An Incident Study of the Web-Based Information Systems Development", Research Paper, The University of Auckland, 2003.
[4] Takahashi K., Liang E., "Analysis and Design of Web-based Information Systems", Computer Networks, 1997.
[5] Wroblewski L., Rantanen E., "Web-Based Information Management: An Open Portal Interface Solution", Human Computer Interaction International Conference, 2001.
[6] Strauch B., Winter R., "Towards a Methodology for the Development of Web-Based Systems", Managing Business with Electronic Commerce, Idea Group Publishing, pp. 37-58, 2002.
[7] Remington J., "Managing your Data in a Web-based Paperless Environment", White Paper, Saber Logic, May 2003.
[8] U.S. Department of Justice, Office of Justice Programs, Bureau of Justice Statistics: http://www.ojp.usdoj.gov/bjs/cvict.htm (accessed 20 September 2006).
[9] An Introduction to Data Mining: http://www.thearling.com/text/dmwhite/dmwhite.htm (accessed 27 September 2006).
[10] Padhye, Manoday Dhananjay, "Use of Data Mining for Investigation of Crime Patterns", Master's Thesis, West Virginia University, 2006.
[11] SAS Enterprise Miner: http://www.sas.com/technologies/analytics/datamining/miner/ (accessed 15 September 2006).
[12] JMP Desktop Statistical Discovery Software: http://www.jmp.com/software/ (accessed 12 September 2006).
[13] The R Project for Statistical Computing: http://www.r-project.org/ (accessed 11 September 2006).
[14] Teradata Enterprise Data Warehouse Roadmaps and Logical Data Models: http://www.teradata.com/t/page/156388/index.html (accessed 12 September 2006).
[15] Maximize Returns with Data Mining and Predictive Analysis: http://www.spss.com/clementine/ (accessed 12 September 2006).
[16] Data Mining Software in Java: http://www.cs.waikato.ac.nz/ml/weka/ (accessed 11 September 2006).
[17] MS Excel: http://office.microsoft.com/en-us/FX010858001033.aspx (accessed 11 September 2006).
[18] MiniTab: http://www.minitab.com/ (accessed 10 September 2006).
[19] Statistica: http://www.statsoft.com/ (accessed 2 September 2006).
[20] Crime Connect, Tactical Information System: http://www.crimeconnect.com (accessed 3 October 2006).
[21] Forensic Logic, CrimePointWeb: http://www.forensiclogic.com/products.html (accessed 5 October 2006).
[22] http://www.alice-soft.com/html/prod_ac2.htm (accessed 9 September 2006).
[23] http://www.salford-systems.com/cart.php (accessed 7 July 2006).
[24] Braincel: Neural Net Add-In Module: http://www.jurikres.com/catalog/ms_bcel.htm (accessed 22 July 2006).
[25] http://www.calsci.com/BrainIndex.html (accessed 14 July 2006).
[26] http://www.icpsr.umich.edu/CRIMESTAT/ (accessed 29 July 2006).
[25] Ennis G., "Object-Oriented ASP.NET Introduction", Website of Developer Fusion: http://www.developerfusion.com/show/4045.19
[26] Creating a Data Access Layer, Website of Microsoft: http://msdn2.microsoft.com/en-us/library/aa581778.aspx
[27] Bangar P., "Introduction to ASP.NET", Website of Clickfire: http://www.clickfire.com/viewpoints/articles/webdevelopment/asp_net.php
[28] Singh, Parmjit, "Web Based Forensic Information Management System", Master's Thesis, West Virginia University.
[29] A Wide Range of Statistics for Data Analysis, 2006: http://www.spss.com/spss/data_analysis.htm (accessed 29 September 2006).
[30] Srinivasan A., "Forensic Information Management System", Master's Thesis, Department of Industrial and Management Systems Engineering, West Virginia University, Morgantown, WV, May 2004.
[31] Govindarajulu S., "A Web Based Forensic Information Management System", Master's Thesis, Department of Industrial and Management Systems Engineering, West Virginia University, Morgantown, WV, May 2005.
[32] Statistical Mean, Median, Mode, and Range: http://searchdatacenter.techtarget.com/gDefinition/0,294236,sid80_gci1060882,00.html
[33] Standard Deviation: http://www.carlton.srsd119.ca/chemical/Sigfigs/standard_deviation.htm
[34] Microsoft Time Series Algorithm: http://msdn2.microsoft.com/en-
[35] Weka, A Data Mining Software in Java: http://www.cs.waikato.ac.nz/ml/weka/


APPENDIX A: Training Data Set for Crime Data


Table A1: Training Data Set for Crime Data

County Harrison Monongalia Cabell Monongalia Marion Hancock Berkley Monongalia Mercer Monongalia Monongalia Hancock Mercer Cabell Marion Monongalia Monongalia Kanawha Mercer Kanawha Cabell Hancock Cabell Monongalia Mercer Monongalia Cabell Raleigh Harrison Hancock Monongalia Monongalia Cabell Harrison Mercer Monongalia Kanawha

Day Tue Sat Wed Thru Sat Sun Tue Fri Tue Sat Sat Mon Tue Tue Wed Tue Sat Mon Mon Sat Wed Sun Tue Fri Sun Sat Sat Sat Tue Thru Wed Sat Sun Tue Wed Fri Thru

Month Jan Jan Jan Jan Jan Jan Jan Jan Jan Feb Feb Feb Feb Feb Feb Feb Feb Feb March March March April April April April April April April April April May May May May May May May

Type Assault Drugs Murder Drugs Burglary Murder Accident Sex_Assault Accident Sex_Assault Rape Accident Burglary Burglary Burglary Burglary Rape Sex_Assault Accident Sex_Assault Arson Weap_Threat Burglary Rape Burglary Drugs Arson Rape Arson Drugs Burglary Drugs Burglary Arson Assault Rape Sex_Assault

AgeGroup 26to40 Below25 Below25 Below25 Below25 26to40 Above40 26to40 Above40 Below25 Below25 Below25 26to40 Below25 Above40 26to40 Above40 26to40 26to40 Below25 26to40 Below25 Above40 26to40 Above40 Below25 26to40 Above40 26to40 26to40 26to40 26to40 Below25 Below25 26to40 Above40 26to40

Gender M M M F M M F M M M M F M M M F M M M M M M M M M M M M M M M M M M M M M

Rel2Victim known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known

Proximity Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles

Bkgd No No Yes No Yes Yes No Yes No No Yes No Yes No No Yes Yes Yes No Yes Yes Yes No Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes No Yes Yes


Monongalia Kanawha Kanawha Cabell Monongalia Monongalia Monongalia Monongalia Monongalia Hancock Hancock Monongalia Monongalia Monongalia Mercer Marion Kanawha Cabell Raleigh Mercer Monongalia Cabell Harrison Kanawha Berkley Berkley Monongalia Monongalia Marion Cabell Kanawha Harrison Berkley Monongalia Cabell Kanawha Hancock Cabell Marion Hancock Berkley Kanawha Cabell Mercer Raleigh Hancock Hancock Marion

Thru Wed Wed Wed Tue Fri Fri Sat Tue Wed Fri Thru Thru Fri Wed Wed Sat Sun Mon Tue Sat Sun Mon Sat Sun Tue Fri Sat Tue Tue Fri Sun Sun Sun Sun Sun Mon Tue Fri Sat Tue Wed Tue Sat Sun Thru Thru Thru

May June June June June June June June June June June June June June June June July July July July July July July July Aug Sept Sept Sept Sept Sept Sept Sept Oct Oct Oct Oct Oct Oct Oct Oct Nov Nov Nov Nov Nov Dec Dec Dec

Rape Accident Accident Accident Accident Sex_Assault Rape Sex_Assault Accident Assault Weap_Threat Drugs Drugs Sex_Assault Accident Burglary Rape Burglary Accident Accident Sex_Assault Burglary Accident Sex_Assault Arson Drugs Sex_Assault Accident Murder Burglary Drugs Accident Accident Assault Burglary Weap_Threat Accident Burglary Sex_Assault Drugs Weap_Threat Arson Burglary Rape Accident Weap_Threat Drugs Drugs

Below25 26to40 Above40 Above40 26to40 Below25 Below25 Above40 Below25 26to40 Above40 Below25 Below25 Above40 Above40 26to40 26to40 26to40 26to40 Below25 26to40 26to40 26to40 Above40 Above40 Below25 26to40 Above40 26to40 Below25 26to40 Below25 Above40 26to40 Above40 Below25 26to40 Below25 26to40 Below25 Above40 26to40 Above40 26to40 26to40 26to40 Above40 26to40

M F F M M M M M M M M M M M M F M M M M M F M M M M M M F M M M M M M M M M M M M M M M M M M M

known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known unknown unknown unknown

Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Between26to40 Above40 Between26to40

Yes Yes No No No Yes Yes Yes Yes Yes Yes Yes No Yes No Yes Yes Yes Yes No Yes Yes No Yes Yes Yes Yes Yes Yes Yes Yes Yes No Yes No Yes Yes No Yes Yes Yes No Yes Yes No Yes Yes Yes


Raleigh Monongalia Hancock Harrison Kanawha Harrison Cabell Cabell Monongalia Hancock Mercer Monongalia Hancock Berkley Monongalia Harrison Mercer Kanawha Hancock Raleigh Monongalia Berkley Raleigh Monongalia Marion Harrison Kanawha Monongalia Kanawha Cabell Monongalia Raleigh Harrison Hancock Raleigh Kanawha Hancock Berkley Cabell Kanawha Cabell Marion Mercer Raleigh Mercer Hancock Marion Monongalia

Fri Fri Sun Sun Tue Mon Sat Sun Fri Tue Tue Sun Mon Tue Wed Mon Mon Wed Thru Fri Sat Tue Sat Sat Tue Sun Sun Tue Thru Fri Sat Tue Sun Mon Sat Tue Thru Mon Wed Thru Sat Sat Wed Fri Fri Sat Tue Tue

Dec Dec Jan Jan Jan Jan Jan Jan Jan Jan Jan Jan Jan Feb Feb Feb Feb Feb Feb Feb Feb Feb Feb Feb March March March March March April April April April April April April May May May May May May May May May June June June

Assault Sex_Assault Rape Arson Weap_Threat Accident Assault Burglary Rape Weap_Threat Sex_Assault Arson Accident Burglary Sex_Assault Accident Burglary Accident Drugs Rape Drugs Sex_Assault Drugs Sex_Assault Murder Murder Assault Burglary Rape Drugs Sex_Assault Burglary Accident Weap_Threat Sex_Assault Accident Drugs Assault Accident Rape Burglary Murder Burglary Rape Drugs Accident Accident Sex_Assault

Below25 26to40 26to40 Below25 26to40 Above40 26to40 Above40 26to40 Above40 26to40 Below25 Below25 26to40 26to40 Above40 26to40 Above40 Above40 26to40 Below25 26to40 Below25 Above40 Above40 Above40 Above40 26to40 Above40 Below25 26to40 26to40 Below25 Above40 Above40 Below25 Below25 Below25 26to40 26to40 Below25 Above40 Below25 26to40 Below25 26to40 26to40 Above40

M M M M M M M M M M M M F M M M M M M M F M M M M M M M M F M M M M M M F M M M M M M M M M M M

unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown

Below25 Between26to40 Between26to40 Below25 Between26to40 Above40 Between26to40 Above40 Between26to40 Above40 Between26to40 Below25 Below25 Between26to40 Between26to40 Above40 Between26to40 Above40 Above40 Between26to40 Below25 Between26to40 Below25 Above40 Above40 Above40 Above40 Between26to40 Above40 Below25 Between26to40 Between26to40 Below25 Above40 Above40 Below25 Below25 Below25 Between26to40 Between26to40 Below25 Above40 Below25 Between26to40 Below25 Between26to40 Between26to40 Above40

No Yes Yes No Yes No Yes Yes Yes Yes Yes Yes No Yes Yes Yes Yes No Yes Yes Yes No Yes Yes Yes Yes Yes No Yes No Yes Yes Yes Yes Yes No Yes Yes Yes Yes Yes Yes Yes Yes No Yes No Yes


Cabell Kanawha Kanawha Raleigh Harrison Harrison Berkley Monongalia Kanawha Cabell Marion Harrison Cabell Raleigh Harrison Berkley Harrison Monongalia Hancock Kanawha Berkley Raleigh Cabell Marion Monongalia Monongalia Harrison Berkley Berkley Mercer Kanawha Mercer Cabell Mercer Hancock Monongalia Hancock Cabell Kanawha Marion Monongalia Hancock Monongalia Raleigh Berkley Kanawha Cabell Hancock

Wed Thru Sat Sat Wed Thru Fri Tue Fri Mon Tue Wed Sun Mon Sun Tue Sun Mon Sun Mon Tue Thru Tue Wed Sat Sat Sun Sun Wed Sat Fri Wed Wed Fri Mon Fri Sat Sat Sat Sun Thru Thru Sun Thru Sun Tue Wed Wed

June June June June June June June June June July July July July July July July July Aug Aug Aug Aug Aug Aug Aug Aug Sept Sept Sept Sept Sept Sept Sept Sept Sept Oct Oct Oct Oct Oct Oct Oct Oct Oct Nov Nov Nov Nov Nov

Accident Drugs Sex_Assault Sex_Assault Burglary Drugs Assault Rape Arson Accident Murder Burglary Assault Accident Assault Accident Arson Drugs Weap_Threat Sex_Assault Assault Drugs Burglary Accident Rape Drugs Sex_Assault Assault Sex_Assault Accident Drugs Burglary Weap_Threat Arson Accident Rape Assault Drugs Sex_Assault Assault Drugs Drugs Murder Drugs Rape Accident Burglary Arson

26to40 Below25 26to40 Above40 Above40 Below25 26to40 Below25 26to40 Below25 26to40 Above40 Below25 Above40 26to40 26to40 Above40 Below25 26to40 Above40 Below25 26to40 Above40 Above40 26to40 Below25 26to40 26to40 Above40 26to40 26to40 Below25 Above40 Above40 Above40 Above40 Below25 Below25 Above40 Below25 Below25 26to40 Above40 Below25 Above40 Above40 Below25 Above40

M M M M M M M M M M M M M F M M M M F M M M M M M M M M M F F M M M M M M M M M M M M M M M M M

unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown known known known known known known known known

Between26to40 Below25 Between26to40 Above40 Above40 Below25 Between26to40 Below25 Between26to40 Below25 Between26to40 Above40 Below25 Above40 Between26to40 Between26to40 Above40 Below25 Between26to40 Above40 Below25 Between26to40 Above40 Above40 Between26to40 Below25 Between26to40 Between26to40 Above40 Between26to40 Between26to40 Below25 Above40 Above40 Above40 Above40 Below25 Below25 Above40 Below25 Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles

Yes Yes Yes Yes Yes Yes Yes Yes Yes No Yes Yes Yes No Yes No Yes No Yes Yes No Yes Yes No Yes Yes Yes Yes Yes No Yes Yes Yes Yes No Yes Yes Yes Yes No Yes Yes Yes No Yes Yes Yes Yes


Berkley Monongalia Monongalia Monongalia Mercer Monongalia Cabell Monongalia Berkley Harrison Mercer Hancock Monongalia Harrison Marion Monongalia Kanawha Raleigh Monongalia Harrison Mercer Marion Cabell Hancock Monongalia Monongalia Marion Raleigh Monongalia Hancock Marion Mercer Monongalia Marion Monongalia Mercer Raleigh Marion Cabell Monongalia Monongalia Mercer Cabell Harrison Mercer Monongalia Cabell Kanawha

Wed Sat Sat Thru Thru Tue Sat Sat Thru Sat Tue Fri Fri Mon Tue Sat Thru Sat Sun Thru Thru Thru Tue Thru Sat Sun Mon Mon Thru Wed Thru Sat Sat Sun Sun Sat Tue Fri Wed Mon Sat Mon Mon Tue Fri Fri Sun Sun

Nov Dec Dec Dec Dec Jan Jan Jan Jan Jan Jan Jan Jan Jan Jan Feb Feb Feb Feb Feb Feb Feb Feb March March March March March March March March April April April April April April April April May May May May May May May May May

Drugs Drugs Drugs Drugs Rape Drugs Arson Rape Drugs Assault Weap_Threat Weap_Threat Drugs Accident Burglary Rape Drugs Rape Sex_Assault Drugs Accident Weap_Threat Burglary Rape Rape Assault Accident Accident Drugs Burglary Drugs Sex_Assault Rape Murder Burglary Rape Accident Drugs Accident Sex_Assault Drugs Accident Burglary Accident Arson Sex_Assault Weap_Threat Assault

26to40 Below25 Below25 Below25 26to40 Below25 Above40 Above40 26to40 26to40 Below25 Above40 Below25 Below25 Above40 Above40 Below25 26to40 26to40 Below25 Above40 26to40 Above40 26to40 Above40 26to40 26to40 Above40 Below25 Below25 Below25 Above40 26to40 Above40 Below25 Above40 Below25 26to40 Above40 26to40 Above40 Above40 Below25 Above40 26to40 26to40 26to40 Below25

M M F F M M M M M M M M M M M M M M M M M M M M M M M M M F M M M M M M M M M M M M M F M M F M

known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known

Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles

Yes Yes No Yes Yes Yes No Yes Yes Yes Yes Yes No No Yes Yes Yes Yes Yes No No Yes Yes Yes Yes No No No No Yes No Yes Yes Yes No Yes Yes Yes Yes Yes Yes No Yes No Yes Yes No Yes


Marion Raleigh Berkley Harrison Cabell Monongalia Raleigh Marion Monongalia Monongalia Berkley Cabell Mercer Harrison Berkley Monongalia Raleigh Kanawha Monongalia Hancock Monongalia Mercer Berkley Hancock Monongalia Harrison Mercer Cabell Harrison Berkley Marion Monongalia Cabell Monongalia Mercer Harrison Marion Harrison Hancock Monongalia Monongalia Mercer Hancock Monongalia Cabell Mercer Raleigh Cabell

Mon Fri Fri Tue Tue Fri Sat Mon Thru Sat Fri Mon Sat Sat Fri Sun Wed Wed Sat Sun Wed Wed Sat Tue Sat Mon Sun Wed Sun Sun Thru Fri Fri Sat Mon Sun Sun Wed Fri Sat Sun Mon Fri Thru Sun Tue Sat Mon

May May May May May May May May June June June June June June June June June June July July July July July July July July July Aug Aug Aug Aug Aug Aug Aug Aug Sept Sept Sept Sept Sept Sept Sept Sept Oct Oct Oct Oct Oct

Accident Rape Accident Arson Burglary Drugs Rape Murder Drugs Rape Drugs Rape Sex_Assault Rape Assault Drugs Accident Assault Burglary Weap_Threat Accident Sex_Assault Rape Accident Sex_Assault Accident Weap_Threat Burglary Accident Accident Drugs Rape Sex_Assault Rape Accident Burglary Murder Accident Weap_Threat Sex_Assault Arson Accident Drugs Sex_Assault Accident Burglary Rape Accident

Above40 Above40 Above40 Below25 Above40 Below25 Above40 Above40 Below25 Above40 Below25 Above40 26to40 26to40 Below25 Below25 Above40 Above40 Above40 Above40 Below25 26to40 26to40 Above40 26to40 Above40 Below25 26to40 26to40 Below25 26to40 Above40 26to40 Above40 Above40 Below25 26to40 Below25 26to40 Above40 Below25 Above40 Below25 26to40 Above40 Below25 Above40 Above40

M M M M M M M M M M M M M M M M M M M M F M M F M M M M M M M M M M M M M M F M M M M M M M M M

known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known known unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown

Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles

No Yes No No Yes Yes Yes Yes Yes No No Yes Yes Yes Yes Yes No Yes Yes Yes No Yes Yes Yes No No Yes Yes No Yes Yes Yes Yes No No Yes Yes Yes Yes Yes No No No Yes Yes Yes Yes No


Harrison Cabell Monongalia Kanawha Hancock Marion Raleigh Monongalia Harrison Hancock Monongalia Berkley Monongalia Monongalia Cabell Mercer Harrison Raleigh

Tue Mon Thru Mon Tue Tue Fri Thru Wed Sat Sat Mon Mon Sat Sat Sun Sun Sun

Oct Oct Nov Nov Nov Nov Nov Nov Dec Dec Dec Dec Dec Dec Dec Dec Dec Dec

Accident Burglary Drugs Sex_Assault Weap_Threat Burglary Rape Drugs Accident Drugs Sex_Assault Accident Weap_Threat Rape Accident Sex_Assault Accident Rape

Above40 26to40 Below25 Above40 26to40 Below25 Above40 Below25 Below25 Above40 Above40 Below25 26to40 Above40 Below25 26to40 Above40 26to40

M M M M M M M M M M M M M M M M M M

unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown

Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles

Yes Yes No Yes Yes Yes Yes Yes No Yes Yes No Yes No No Yes Yes Yes


APPENDIX B: Test Data Set for Crime Data


Table B1: Test Data Set for Crime Data
County Harrison Mercer Cabell Cabell Marion Monongalia Cabell Hancock Berkley Cabell Monongalia Marion Marion Berkley Cabell Monongalia Monongalia Mercer Marion Hancock Monongalia Marion Harrison Raleigh Mercer Hancock Hancock Berkley Harrison Monongalia Kanawha Berkley Kanawha Mercer Marion Day Sun Sun Tue Tue Sun Mon Wed Sat Tue Tue Sat Sun Mon Wed Tue Wed Sat Tue Mon Tue Thru Tue Tue Sun Sun Sat Sun Mon Mon Fri Sat Sat Sat Sun Sun Month Jan Jan Jan Jan Feb Feb June June June June Aug Aug Aug Aug Aug Sept Sept Nov Dec Dec Dec Dec Dec June June July July Dec Dec Dec Dec Dec Dec Jan Jan Type Arson Weap_Threat Assault Burglary Murder Accident Burglary Assault Weap_Threat Burglary Rape Arson Accident Burglary Burglary Accident Rape Burglary Murder Burglary Drugs Assault Arson Sex_Assault Burglary Rape Assault Accident Accident Drugs Rape Sex_Assault Rape Weap_Threat Burglary AgeGroup Below25 26to40 26to40 Above40 Above40 Below25 26to40 26to40 Above40 26to40 26to40 Below25 Below25 26to40 Below25 Below25 26to40 Below25 26to40 Below25 Below25 26to40 Below25 Above40 Above40 Above40 Below25 Above40 Above40 Below25 Above40 Above40 26to40 Below25 Above40 Gender M M M M M M M M M F M M M F M M M M M M M M M M M M M M M M M M M M M History known known known known known known known known known known known known known known known known known known unknown unknown unknown unknown unknown unknown unknown unknown unknown known known known known known known known known Proximity Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Within5miles Below25 Below25 Between26to40 Below25 Above40 Above40 Above40 Below25 Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Outof5miles Bkgd Yes Yes Yes Yes Yes No Yes Yes Yes Yes Yes Yes No Yes Yes No Yes No Yes No Yes Yes Yes Yes Yes Yes Yes No No No Yes Yes No Yes Yes

