
Chapter 4. Building Good Training Sets – Data Preprocessing

The quality of the data and the amount of useful information it contains are key factors that determine how well a machine learning algorithm can learn. It is therefore critical that we examine and preprocess a dataset before feeding it to a learning algorithm. In this chapter, we will discuss the essential data preprocessing techniques that will help us build good machine learning models.

The topics that we will cover in this chapter are as follows:

- Removing and imputing missing values from the dataset

- Getting categorical data into shape for machine learning algorithms

- Selecting relevant features for the model construction

Dealing with missing data

It is not uncommon in real-world applications for our samples to be missing one or more values for various reasons. For example, there could have been an error in the data collection process, certain measurements may not be applicable, or particular fields could simply have been left blank in a survey. We typically see missing values as blank spaces in our data table or as placeholder strings such as NaN, which stands for "not a number", or NULL (a commonly used indicator of unknown values in relational databases).

Unfortunately, most computational tools are unable to handle such missing values, or

produce unpredictable results if we simply ignore them. Therefore, it is crucial that

we take care of those missing values before we proceed with further analyses. In this

section, we will work through several practical techniques for dealing with missing

values by removing entries from our dataset or imputing missing values from other

samples and features.

Identifying missing values in tabular data

Before we discuss several techniques for dealing with missing values, let's create a simple example DataFrame from a comma-separated values (CSV) file to get a better grasp of the problem:


>>> import pandas as pd

>>> from io import StringIO

>>> csv_data = \

... '''A,B,C,D

... 1.0,2.0,3.0,4.0

... 5.0,6.0,,8.0

... 10.0,11.0,12.0,'''

>>> # If you are using Python 2.7, you need

>>> # to convert the string to unicode:

>>> # csv_data = unicode(csv_data)

>>> df = pd.read_csv(StringIO(csv_data))

>>> df
      A     B     C    D
0   1.0   2.0   3.0  4.0
1   5.0   6.0   NaN  8.0
2  10.0  11.0  12.0  NaN

Using the preceding code, we read CSV-formatted data into a pandas DataFrame via the read_csv function and noticed that the two missing cells were replaced by NaN.

The StringIO function in the preceding code example was used simply for the purposes of illustration. It allows us to read the string assigned to csv_data into a pandas DataFrame as if it were a regular CSV file on our hard drive.
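In everyday use, we would of course read from an actual file on disk instead; the filename below is just a hypothetical placeholder:

>>> # assuming a CSV file at this (hypothetical) path
>>> df = pd.read_csv('my_data.csv')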

For a larger DataFrame, it can be tedious to look for missing values manually; in this case, we can use the isnull method to return a DataFrame with Boolean values that indicate whether a cell contains a numeric value (False) or whether data is missing (True). Using the sum method, we can then return the number of missing values per column as follows:

>>> df.isnull().sum()

A    0
B    0
C    1
D    1
dtype: int64

This way, we can count the number of missing values per column; in the following

subsections, we will take a look at different strategies for how to deal with this

missing data.
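As a quick preview of those strategies, here is a minimal sketch on our example df (the detailed discussion follows in the subsections): we can either remove incomplete rows or columns with dropna, or impute the gaps, for instance with the per-column mean via fillna:

>>> # remove samples (rows) that contain at least one NaN
>>> df.dropna(axis=0)
     A    B    C    D
0  1.0  2.0  3.0  4.0
>>> # remove features (columns) that contain at least one NaN
>>> df.dropna(axis=1)
      A     B
0   1.0   2.0
1   5.0   6.0
2  10.0  11.0
>>> # impute missing values with the mean of each column
>>> df.fillna(df.mean())
      A     B     C    D
0   1.0   2.0   3.0  4.0
1   5.0   6.0   7.5  8.0
2  10.0  11.0  12.0  6.0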

Note

Although scikit-learn was developed for working with NumPy arrays, it can sometimes be more convenient to preprocess data using pandas' DataFrame. We can always access the underlying NumPy array of a DataFrame via the values attribute before we feed it into a scikit-learn estimator:

>>> df.values

array([[  1.,   2.,   3.,   4.],
       [  5.,   6.,  nan,   8.],
       [ 10.,  11.,  12.,  nan]])
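Note that in more recent pandas versions (0.24 and later), the to_numpy method is the recommended way to obtain the underlying array; for our example it returns the same result:

>>> df.to_numpy()  # equivalent to df.values for this DataFrame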
