1.1 DATA
Data is one of the most critical assets of any business. Data is a collection of raw,
unorganized facts that need to be processed. The word comes from the Latin datum, which
means “something given”. Strictly, data is the plural of datum, a single piece of information,
but in practice the word is used as both singular and plural. Data represents a fact or a
statement of an event without relation to other things.
Example
• A student’s test score is one piece of data
• The history of temperature readings all over the world for the past 100 years is data
Fig. 1.2 Types of Data: data is either Qualitative (Categorical) or Quantitative (Numerical); quantitative data is either Discrete or Continuous.
Qualitative Data
Qualitative data refers to the quality of something. It deals with description: it is data
that can be observed but not measured.
E.g. color, texture, smell and taste
Qualitative data are often termed as categorical data. The categorical data are values
or observations that can be sorted into groups or categories.
Example:
• Tennis balls can be categorized as New, Used or Damaged
Quantitative Data
Quantitative data deals with numbers and can be measured.
Example:
Number of golf balls
Quantitative data is further divided into discrete data and continuous data.
Discrete data
Discrete data are numeric data that can take only distinct, separate values, so there are
gaps between the possible values. Discrete data are counted in whole numbers.
0 1 2 3 4 5 6 7
Fig. 1.3 Discrete Data
Examples
• The number of products damaged in shipment
• The count of golf balls
• The number of students in the class
Continuous data
• Continuous data is numerical data with a continuous range; there is no gap between
possible values. Continuous data are measured rather than counted.
Example: People’s heights could be any value within the range of human heights.
0 200
Fig. 1.4 Continuous Data
1.4 Structured and Unstructured
Data can be classified as structured or unstructured based on how it is stored and
managed.
Fig. 1.6 Other Types of Data: digital data such as JPEG images, text and numbers.
1.6 Information
When data are processed, organized, structured or presented in a given context so
as to make them useful, they are called information. Data by themselves are of little use, but
when they are interpreted and processed to determine their true meaning they become
useful and can be called information. Information is data that has been processed in such
a way as to be meaningful to the person who receives it. Information is the intelligence and
knowledge derived from data.
Data: E.g. each student's test score is one piece of data.
Information: E.g. the average score of a class or of the entire department is information
that can be derived from the given data.
The physical devices used to store and process data in computers are two-state devices. A
switch, for example, is a two-state device; it can be either ON or OFF. To a computer everything
is a number. Numbers are numbers, letters are numbers, sounds and pictures are numbers. Even
the computer's own instructions are numbers. A string of alphabetic characters such as a sentence
looks just like a string of ones and zeros to a computer. A computer has only two possible states
available to represent data: on or off. When a switch is off it is represented by a 0; when it is
on it is represented by a 1. Thus all data to be stored and processed in computers are
transformed or coded as strings of two symbols, one symbol to represent each state. The two
symbols normally used are 0 and 1. These are known as bits, an abbreviation for binary digits.
A group of 8 bits is called a byte. With one byte the computer can represent 256 different
symbols or characters, because the eight 1s and 0s in a byte can be combined in 2^8 = 256
different ways.
• 4 bits = Nibble
• 0110
• 8 bits = Byte
• 0110 1010
• 16 bits = Word
• 0110 1010 1001 1111
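The counts above can be checked with a short Python sketch. The helper names symbols_for_bits and to_bits are illustrative assumptions, not part of the original notes; the sketch only shows that n two-state devices give 2^n patterns and prints the bit pattern of one character.

```python
def symbols_for_bits(n_bits: int) -> int:
    """Number of distinct patterns that n_bits two-state devices can hold."""
    return 2 ** n_bits


def to_bits(value: int, width: int = 8) -> str:
    """Render a non-negative integer as a fixed-width string of 0s and 1s."""
    return format(value, f"0{width}b")


print(symbols_for_bits(4))   # patterns in a nibble -> 16
print(symbols_for_bits(8))   # patterns in a byte   -> 256
print(to_bits(ord("j")))     # character code of 'j' as one byte -> 01101010
```

The last line happens to reproduce the byte 0110 1010 used in the list above: to the computer, the letter j is just that number.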
Session 2
Number Systems
2.1 Number System
In the Stone Age, knots and marks on stones were used to count items. In the Roman number
system, I, II, III, etc. are used for counting items. Many positional-value systems
are in use, such as decimal, binary and octal.
a) Divide the decimal number by the base n; the remainder is the least significant
digit (the LSB, for base 2) of the base n number.
b) If the quotient is zero, the conversion is complete; otherwise repeat step (a) using the
quotient as the decimal number. The new remainder is the next most
significant digit of the base n number.
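The repeated-division procedure can be sketched in Python as follows. The function name to_base and the restriction to bases 2 to 10 are assumptions made to keep the illustration short.

```python
def to_base(number: int, base: int = 2) -> str:
    """Convert a non-negative decimal integer to a digit string in base 2..10."""
    if number < 0 or not 2 <= base <= 10:
        raise ValueError("expects number >= 0 and 2 <= base <= 10")
    if number == 0:
        return "0"
    digits = []
    while number > 0:
        # Divide by the base: the remainder is the next digit,
        # starting from the least significant one.
        number, remainder = divmod(number, base)
        digits.append(str(remainder))
    # Remainders were produced least significant first, so reverse them.
    return "".join(reversed(digits))


print(to_base(18, 2))   # 10010
print(to_base(18, 8))   # 22
```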
Binary digit:  1     0     0     1     0
Power of 2:    2^4   2^3   2^2   2^1   2^0
Place value:   16    8     4     2     1
16 + 0 + 0 + 2 + 0 = 18
Therefore (10010)_2 = (18)_10
a) (13)_10 = ?
b) (22)_10 = ?
c) (43)_10 = ?
d) (158)_10 = ?
Example:
Convert the binary number (0110101)_2 into its decimal equivalent.
Binary digit:  0     1     1     0     1     0     1
Power of 2:    2^6   2^5   2^4   2^3   2^2   2^1   2^0
Place value:   64    32    16    8     4     2     1
0 + 32 + 16 + 0 + 4 + 0 + 1 = 53
(0110101)_2 = (53)_10
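The place-value method used in this example can be written as a small Python sketch. The function name binary_to_decimal is an assumption for illustration; Python's built-in int(bits, 2) performs the same conversion.

```python
def binary_to_decimal(bits: str) -> int:
    """Sum bit * 2^position, with position 0 at the rightmost bit."""
    total = 0
    for position, bit in enumerate(reversed(bits)):
        if bit not in ("0", "1"):
            raise ValueError(f"not a binary digit: {bit!r}")
        total += int(bit) * 2 ** position
    return total


print(binary_to_decimal("0110101"))  # 53
print(binary_to_decimal("10010"))    # 18
```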
(11010)_2 = ?
(0110101)_2 = ?
(11010011)_2 = ?
• 0+1=1
• 1+0=1
Binary Subtraction
• 1-0=1
• 1-1=0
Binary Multiplication
Example
    101
  x  11
  -----
    101
   1010
  -----
   1111
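The long-multiplication layout above is a shift-and-add procedure: for every 1 bit in the multiplier, a shifted copy of the multiplicand is added. A minimal Python sketch, with the assumed helper name multiply_binary:

```python
def multiply_binary(a: str, b: str) -> str:
    """Multiply two binary strings by shift-and-add, as in long multiplication."""
    total = 0
    for shift, bit in enumerate(reversed(b)):
        if bit == "1":
            # Partial product: the multiplicand shifted left by `shift` places.
            total += int(a, 2) << shift
    return format(total, "b")


print(multiply_binary("101", "11"))  # 1111
```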
Binary Division
A B Output
0 1 0
1 1 1
Session 4
Data Compression
4.1 Introduction
While technology keeps growing, the world keeps shrinking. Everything seems nearer
and smaller. Our world has changed a lot from an era in which a computer occupied
a room to the present, in which supercomputers can be conveniently carried in your hand. It would
be an understatement to call this transformation technological growth; it is better
described as a technological explosion. Many eminent people the world over contributed
to it, and the advent of data compression played a remarkable part. The Internet uses
data compression techniques extensively and in innumerable ways; without them, the
growth of web technology would not have been possible.
Data Compression is the process of encoding the data, so that fewer bits will be needed to
represent the original data whereby the size of the data is reduced. Compressing data can save
storage capacity, speed file transfer, and decrease costs for storage hardware and network
bandwidth.
Text compression can be as simple as removing all unneeded characters, inserting a single
repeat character to indicate a string of repeated characters, and substituting a smaller bit string
for a frequently occurring bit string. Compression can reduce a text file to 50% or a
significantly higher percentage of its original size.
For data transmission, compression can be performed on the data content or on the entire
transmission unit, including header data. When information is sent or received via the Internet,
larger files, either singly or with others as part of an archive file, may be transmitted in a .ZIP,
gzip or other compressed format.
Lossy compression permanently eliminates bits of data that are redundant, unimportant or
imperceptible. Lossy compression is useful with graphics, audio, video and images, where the
removal of some data bits has little or no discernible effect on the representation of the content.
Graphics image compression can be lossy or lossless. Graphic image file formats are typically
designed to compress information since the files tend to be large. JPEG is an image file format
that supports lossy image compression. Formats such as GIF and PNG use lossless
compression.
Lossy methods are used for compressing images, video and audio files (our eyes and ears
cannot distinguish subtle changes, so some loss of data is acceptable).
These methods are cheaper: they take less time and space.
Several methods:
• JPEG: compresses pictures and graphics
• MPEG: compresses video
• MP3: compresses audio
File-level deduplication eliminates redundant files and replaces them with stubs pointing to the
original file. Block-level deduplication identifies duplicate data at the sub-file level. The
system saves unique instances of each block, uses a hash algorithm to process them and
generates a unique identifier to store them in an index. Deduplication typically looks for larger
chunks of duplicate data than compression, and systems can de-duplicate using a fixed or
variable-sized chunk.
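A minimal sketch of block-level deduplication with fixed-size chunks is given below. The function names, the 4-byte chunk size and the choice of SHA-256 as the hash algorithm are assumptions for illustration; real storage systems use much larger chunks and their own index formats.

```python
import hashlib


def deduplicate(data: bytes, chunk_size: int = 4):
    """Split data into fixed-size chunks and keep one copy of each unique chunk."""
    store = {}    # hash identifier -> chunk bytes (the unique instances)
    recipe = []   # ordered identifiers needed to rebuild the original data
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        key = hashlib.sha256(chunk).hexdigest()
        store.setdefault(key, chunk)   # store the chunk only the first time
        recipe.append(key)
    return store, recipe


def rebuild(store, recipe) -> bytes:
    """Reassemble the original data from the index and the unique chunks."""
    return b"".join(store[key] for key in recipe)


data = b"ABCDABCDABCDWXYZ"
store, recipe = deduplicate(data)
print(len(recipe), "chunks,", len(store), "unique")  # 4 chunks, 2 unique
print(rebuild(store, recipe) == data)                # True
```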
Deduplication is most effective in environments that have a high degree of redundant data,
such as virtual desktop infrastructure or storage backup systems. Compression tends to be more
effective than deduplication in reducing the size of unique information such as image, audio,
video, database and executable files. Many storage systems support both compression and
deduplication.
The main disadvantage of compression is the performance impact resulting from the use of
CPU and memory resources to compress and decompress the data. Many vendors have
designed their systems to try to minimize the impact of the processor-intensive calculations
associated with compression. If the compression runs inline, before the data is written to disk,
the system may offload compression to preserve system resources. For instance, IBM uses a
separate hardware acceleration card to handle compression with some of its enterprise storage
systems.
If data is compressed after it is written to disk, or post process, the compression may run in the
background to reduce the performance impact. Although post-process compression can reduce
the response time for each input/output (I/O), it still consumes memory and processor cycles,
and can affect the overall number of I/Os a storage system can handle. Also, because data
initially must be written to disk or flash drives in an uncompressed form, the physical storage
savings are not as great as they are with inline compression.
Many systems and devices perform compression transparently, but some give users the option
to turn compression on or off. Compression can be performed more than once on the same file
or piece of data, but subsequent compressions result in little to no additional compression and
may even increase the size of the file to a slight degree, depending on the algorithms.
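The effect of compressing the same data twice can be observed with Python's standard zlib module. This is only a sketch on made-up sample data; the exact sizes depend on the data and the algorithm, so no particular numbers are claimed here.

```python
import zlib

# Hypothetical sample: a highly repetitive byte string.
original = b"WWWWWWWWWWWWBWWWWWWWWWWWWBBB" * 200

once = zlib.compress(original)    # first compression pass
twice = zlib.compress(once)       # compressing the already compressed data

print(len(original), len(once), len(twice))

# Decompressing twice restores the original data exactly (lossless).
restored = zlib.decompress(zlib.decompress(twice))
print(restored == original)
```

Comparing the second and third printed sizes shows how much, if anything, the second pass gains.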
WinZip is a popular Windows program that compresses files when it packages them in an
archive. Archive file formats that support compression include ZIP and RAR. The bzip2 and
gzip formats see widespread use for compressing individual files.
4.7 Run-length encoding
Run-length encoding (RLE) is a very simple form of lossless data compression in which runs
of data (that is, sequences in which the same data value occurs in many consecutive data
elements) are stored as a single data value and count, rather than as the original run. This is
most useful on data that contains many such runs. Consider, for example, simple graphic
images such as icons, line drawings, and animations. It is not useful with files that don't have
many runs as it could greatly increase the file size.
RLE may also be used to refer to an early graphics file format supported by CompuServe for
compressing black and white images, but was widely supplanted by their later Graphics
Interchange Format. RLE also refers to a little-used image format in Windows 3.x, with the
extension rle, which is a Run Length Encoded Bitmap, used to compress the Windows 3.x
startup screen.
Typical applications of this encoding are when the source information comprises long
substrings of the same character or binary digit.
For example, consider a screen containing plain black text on a solid white background. There
will be many long runs of white pixels in the blank space, and many short runs of black pixels
within the text. A hypothetical scan line, with B representing a black pixel and W representing
white, might read as follows:
WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWW
WWWWWWWWWWWWWWWBWWWWWWWWWWWWWW
With a run-length encoding (RLE) data compression algorithm applied to the above
hypothetical scan line, it can be rendered as follows:
12W1B12W3B24W1B14W
This can be interpreted as a sequence of twelve Ws, one B, twelve Ws, three Bs, etc.
The run-length code represents the original 67 characters in only 18. While the actual format
used for the storage of images is generally binary rather than ASCII characters like this, the
principle remains the same. Even binary data files can be compressed with this method; file
format specifications often dictate repeated bytes in files as padding space. However, newer
compression methods such as DEFLATE often use LZ77-based algorithms, a generalization of
run-length encoding that can take advantage of runs of strings of characters (such as
BWWBWWBWWBWW).
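The count-then-symbol encoding of the scan line can be reproduced with a short Python sketch. The function names are assumptions, and the decoder assumes that the symbols themselves are never digits.

```python
from itertools import groupby


def rle_encode(text: str) -> str:
    """Encode each run as <count><symbol>, e.g. 'WWWB' -> '3W1B'."""
    return "".join(f"{len(list(run))}{symbol}" for symbol, run in groupby(text))


def rle_decode(encoded: str) -> str:
    """Invert rle_encode; assumes the symbols are not digits."""
    out, count = [], ""
    for ch in encoded:
        if ch.isdigit():
            count += ch               # collect multi-digit run lengths
        else:
            out.append(ch * int(count))
            count = ""
    return "".join(out)


line = "W" * 12 + "B" + "W" * 12 + "BBB" + "W" * 24 + "B" + "W" * 14
encoded = rle_encode(line)
print(encoded)                    # 12W1B12W3B24W1B14W
print(len(line), len(encoded))    # 67 18
print(rle_decode(encoded) == line)
```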
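Run-length encoding can be expressed in other ways. One variant stores a count only for
runs of two or more characters and uses the doubled character itself as the marker for a
run. Applied to the example above, this gives:
WW12BWW12BB3WW24BWW14
This would be interpreted as a run of twelve Ws, a B, a run of twelve Ws, a run of three Bs,
etc. In data where runs are less frequent, this can significantly improve the compression rate.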
One other matter is the application of additional compression algorithms. Even with the runs
extracted, the frequencies of different characters may be large, allowing for further
compression; however, if the run lengths are written in the file in the locations where the runs
occurred, the presence of these numbers interrupts the normal flow and makes it harder to
compress. To overcome this, some run-length encoders separate the data and escape symbols
from the run lengths, so that the two can be handled independently. For the example data, this
would result in two outputs, the string "WWBWWBBWWBWW" and the numbers (12, 12, 3,
24, and 14).
4.7 a. Examples:
1. Replace consecutive repeating occurrences of a symbol by 1 occurrence of the symbol
itself, then followed by the number of occurrences.
2. The method can be more efficient if the data uses only 2 symbols (0s and 1s) in bit
patterns and 1 symbol is more frequent than another.
4.8 Activity
Compress the following data using Run-length encoding method
1. Original Data - AAABBCDDDD
Compressed Data is A3B2C1D4
2. Original Data - aabbbccccdddddeeeeeefffffffgggggggg
Compressed Data is a2b3c4d5e6f7g8
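The answers to the activity can be checked with a sketch of the symbol-then-count format described in 4.7 a. The function name rle_symbol_count is an assumption; the second input is built programmatically (two a's, three b's, up to eight g's) to avoid miscounting.

```python
from itertools import groupby


def rle_symbol_count(text: str) -> str:
    """Encode each run as <symbol><count>, e.g. 'AAAB' -> 'A3B1'."""
    return "".join(f"{symbol}{len(list(run))}" for symbol, run in groupby(text))


first = "AAABBCDDDD"
second = "".join(ch * n for ch, n in zip("abcdefg", range(2, 9)))

print(rle_symbol_count(first))    # A3B2C1D4
print(rle_symbol_count(second))   # a2b3c4d5e6f7g8
```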
Session 5
Data Collection and Analysis
5.1 Introduction
Data collection is the systematic approach to gathering and measuring information from a
variety of sources to get a complete and accurate picture of an area of interest. Data collection
enables a person or organization to answer relevant questions, evaluate outcomes and make
predictions about future probabilities and trends.
Accurate data collection is essential to maintaining the integrity of research, making informed
business decisions and ensuring quality assurance. For example, in retail sales, data might be
collected from mobile applications, website visits, loyalty programs and online surveys to learn
more about customers. In a server consolidation project, data collection would include not just
a physical inventory of all servers, but also an exact description of what is installed on each
server -- the operating system, middleware and the application or database that the server
supports.
Qualitative Data
Description:
• Examines non-numerical data for patterns and meanings
• Often described as being more "rich" than quantitative data
• Is gathered and analysed by an individual, so it can be more subjective
• Can be collected through methods such as observation techniques, focus groups,
interviews, and case studies
Example: Evaluators may wish to look at the level of engagement of afterschool staff in
program trainings. They might conduct interviews of these staff members to capture the
level of engagement that each staff member feels they have during the trainings.
Primary data sources include information collected and processed directly by the researcher,
such as observations, surveys, interviews, and focus groups.
Secondary data sources include information that you retrieve through pre-existing sources such
as research articles, Internet or library searches. Pre-existing data may also include examining
existing records and data within the program such as publications and training materials,
financial records, student/ client data, and performance reviews of staff, etc.
Primary Data Sources Secondary Data Sources
Likewise, there are a variety of techniques to use when gathering primary data. Listed below
are some of the most common data collection techniques used for collecting data.
Documents and Records
Description:
• Consists of examining existing data in the form of databases, meeting minutes,
reports, attendance logs, financial records, newsletters, etc.
• This can be an inexpensive way to gather information, but may be an incomplete
data source
Example: To understand the primary reasons students miss school, records on student
absences are collected and analysed.
Types of technology that can be used to collect data traditionally captured with surveys include:
The focus is on using technology to collect quantitative data from participants. However you
could use a social networking site to engage participants in a virtual focus group. Or conduct
observations of interactions on a social-networking site.
Advantages
• Simpler and quicker way of collecting both quantitative and qualitative data
• Easy to access a large group of respondents in geographically diverse locations
• More cost effective than manually administering surveys
• Data can typically be exported, eliminating manual data entry
• Improves accuracy of data entry (e.g., reduces omissions, duplicate entries)
Disadvantages
• Limited to respondents who have access to the internet
• Some may find on-line interface off-putting
• Does not guarantee the quality (reliability and validity) of actual survey design
• Potential lack of security
5.7.2 Clickers
Clickers are hand-held devices, much like household remote controls, that have been
implemented in classrooms to gauge student participation and learning. Clickers can also be
used to collect data from a group of participants gathered in one location at the same time.
Advantages
• Reduce errors/missing data
• Greatly reduce/eliminate data entry
• Increase internal program evaluation capacity
• Can collect data from large groups of respondents at once
• Cost effective over time
Disadvantages
• Typically limited to collecting quantitative data
• Technology may be off-putting to certain demographics of respondents
• Cost of obtaining clicker technology system may be prohibitive initially
A Personal Digital Assistant (PDA) is a hand-held mobile computer that can also be used for
data collection in the field. Data is inputted directly on the PDA and then transferred to
another computer for analysis.
Advantages
• Streamlines the data collection process
• Reduce errors/missing data
• Greatly reduce/eliminate data entry
• Can be cost effective over time
• Enable collection of more data in a shorter time frame
Disadvantages
• Cost of obtaining PDAs may be prohibitive initially
• Data loss due to malfunctioning device
• Learning curve associated with i
Cell phones can also be used as portable, real time data collection tools. Text messaging is a
way to capture information from a large group at once. Each participant would need to have a
cell phone and familiarity with texting. To store/collate data, the use of a message relay system
or interface technology (software program, for example) is also needed. Formatting of
responses needs to be very specific to be received by interface or relay service. The cell phone
receiving messages may need to be linked to a computer or a Web-based interface designed to
capture and store all messages sent to a specific cell phone.
Advantages
• Capturing data in real time from many users
• Popularity of texting may mean increased level of comfort with this method
• No need to purchase costly technology for initial data collection
Disadvantages
• Requires advanced technological proficiency of the administrator
• All users must possess cell phone and texting capabilities
• Currently, message relay systems are not equipped to receive high number of text
responses at once
• Loss of data or technological difficulties with interface may occur
Social networking sites often include profiles of individuals with information available such as
the user’s age/birth date, gender, ethnicity, location (address/city), sexual orientation, political
orientation, education, and contact information (email, phone number, website). These sites
can also include forums in which users can dialogue with one another. Examples of networking
sites include MySpace, Facebook, and Twitter, all of which have gained popularity as social
forums and modes of communication. Data could be collected through sampling random sites
for trends, soliciting information from specific users and creating a profile for data collection
that attracts certain users for discussions (such as online focus groups).
Advantages
• Able to reach a young demographic using a popular medium
• Option to create a profile to target specific community
• Ability to engage participants at remote locations in real time
• Can be a rich source of quantitative and qualitative data, some of which is publicly
available
Disadvantages
• No verification of information available on public profiles
• Privacy settings on profiles may impede data collection
• Social networking caters to very specific demographic of users, with an average age
range of 14-35 years
• Consent issues involved working with underage youth (if soliciting information not
publicly available on profile)
While data analysis in qualitative research can include statistical procedures, many times
analysis becomes an ongoing iterative process where data is continuously collected and
analyzed almost simultaneously. Indeed, researchers generally analyze for patterns in
observations through the entire data collection phase. An essential component of ensuring data
integrity is the accurate and appropriate analysis of research findings. Improper statistical
analyses distort scientific findings, mislead casual readers and may negatively influence the
public perception of research. Integrity issues are just as relevant to analysis of non-statistical
data as well.
Mean
Median
Standard Deviation
The mean and the median are both measures of central tendency; they give an indication of the
average value of a distribution of figures.
The mean is the arithmetic average of a group of scores; that is, the scores are added up and
divided by the number of scores. The mean is sensitive to extreme scores when population
samples are small. For example, for a class of 20 students, if there were two students who
scored well above the others, the mean will be skewed higher than the rest of the scores might
indicate. Means are better used with larger sample sizes.
The median is the middle score in a list of scores; it is the point at which half the scores are
above and half the scores are below. Medians are less sensitive to extreme scores and are
probably a better indicator generally of where the middle of the class is achieving, especially
for smaller sample sizes.
The larger the population sample (number of scores) the closer mean and median become. In
fact, in a perfect bell curve, the mean and median are identical.
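The sensitivity of the mean to extreme scores can be seen in a small Python sketch using the standard statistics module. The class of 20 scores below is invented for illustration: eighteen students score 65 and two score far higher.

```python
from statistics import mean, median

# Hypothetical class of 20 students: 18 typical scores and 2 very high ones.
scores = [65] * 18 + [98, 100]

class_mean = mean(scores)      # pulled upward by the two high scores
class_median = median(scores)  # stays at the middle of the class

print(class_mean)    # 68.4
print(class_median)  # 65.0
```

The median stays at 65, where most of the class actually scored, while the two high scores raise the mean above every typical score.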
Standard deviation (SD) is a widely used measurement of variability used in statistics. It shows
how much variation there is from the average (mean). A low SD indicates that the data points
tend to be close to the mean, whereas a high SD indicates that the data are spread out over a
large range of values.
In a normal (bell-shaped) distribution, one SD away from the mean in either direction on
the horizontal axis (the orange area on the graph) accounts for around 68 percent of the
people in this group. Two SDs away from the mean (the orange and beige areas) account for
roughly 95 percent of the people. And three SDs (the orange, beige and blue areas) account
for about 99.7 percent of the people.
If this curve were flatter and more spread out, the SD would have to be larger in order to account
for those 68 percent or so of the people. So the SD can tell you how spread out the examples
in a set are from the mean.
For example, if you were to calculate the SD of scores from a class of students of similar ability,
you would expect it to be low, because the scores would all be close to the mean. On the other
hand, you would expect the SD of scores from a mixed-ability class to be higher. If these
calculations did not conform to expectations, you would want to look more closely at the data
to check for inaccuracies.
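The two ideas above can be sketched with Python's statistics module. The two sets of scores are invented for illustration, pstdev (the population standard deviation) is one reasonable choice of SD formula, and the helper name share_within is an assumption. The second part computes the share of a normal distribution lying within one, two and three SDs of the mean.

```python
from statistics import NormalDist, pstdev

# Hypothetical scores: a class of similar ability and a mixed-ability class.
similar = [68, 70, 71, 69, 72, 70, 70, 71, 69, 70]
mixed = [40, 95, 55, 88, 70, 30, 99, 62, 81, 80]

print(pstdev(similar))  # small: scores cluster near the mean
print(pstdev(mixed))    # larger: scores are spread out


def share_within(k: float) -> float:
    """Fraction of a normal distribution within k SDs of the mean."""
    standard = NormalDist()  # mean 0, SD 1
    return standard.cdf(k) - standard.cdf(-k)


for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 1))  # 68.3, 95.4, 99.7 percent
```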
Fig. 5.2 Standard Deviation Chart
A pie chart (or a circle chart) is a circular statistical graphic, which is divided into slices to
illustrate numerical proportion. In a pie chart, the arc length of each slice (and consequently its
central angle and area), is proportional to the quantity it represents. While it is named for its
resemblance to a pie which has been sliced, there are variations on the way it can be presented.
There are sub-types of the Pie Chart available. The second chart above is the Pie in 3-D and
the third chart is an Exploded Pie Chart; an Exploded Pie in 3-D is also available.
It is possible to customize the design of the pie chart so either the numeric values or the
percentages display inside the chart on top of the slices of the pie.
A bar chart or bar graph is a chart that presents grouped data with rectangular bars with
lengths proportional to the values that they represent. The bars can be plotted vertically or
horizontally. A vertical bar chart is sometimes called a column bar chart.
A bar graph is a chart that uses either horizontal or vertical bars to show comparisons among
categories. One axis of the chart shows the specific categories being compared, and the other
axis represents a discrete value. Some bar graphs present bars clustered in groups of more than
one.
Fig. 5.4 Column Chart
The Column Chart very effectively shows the comparison of one or more series of data points.
But the Clustered Column Chart is especially useful in comparing multiple data series.
One variation of this chart type is the Stacked Column Chart. We show a 3-D Stacked Column
Chart at left. In a Stacked Column Chart, the data points for each time period are "stacked"
instead of "clustered." This chart type lets us see the percentage of the total for each data point
in the series.
All the Column Charts have a version in which the columns display in three-dimension - as
illustrated by the 3-D Stacked Column Chart above. But one chart, the "3-D Column Chart," is
special because the chart itself is three-dimensional - displaying multiple series on the X-axis,
Y-axis, and Z-axis. The first chart below is a 3-D Column Chart of our data series.
In newer versions cylinders, pyramids, and cones can be used instead of bars for most of the
Column charts. The second chart above shows a 3-D Pyramid Chart.
The Bar Chart
The Bar Chart is like a Column Chart lying on its side. The horizontal axis of a Bar Chart
contains the numeric values. The first chart below is the Bar Chart for our single series,
Flowers.
When to use a Bar Chart versus a Column Chart depends on the type of data and user
preference. Sometimes it is worth the time to create both charts and compare the results.
However, Bar Charts do tend to display and compare a large number of series better than the
other chart types.
All of the Bar Charts are available in 2-D and 3-D formats, but only the bars are 3-D. There is
no 3-D Bar chart containing three axes.
A line chart or line graph is a type of chart which displays information as a series of data
points called 'markers' connected by straight line segments. It is a basic type of chart common
in many fields. It is similar to a scatter plot except that the measurement points are ordered
(typically by their x-axis value) and joined with straight line segments. A line chart is often
used to visualize a trend in data over intervals of time – a time series – thus the line is often
drawn chronologically.
The Line Chart is especially effective in displaying trends. In a Line Chart, the vertical axis
(Y-axis) always displays numeric values and the horizontal axis (X-axis) displays time or other
category.
Fig. 5.9 Line Chart (Multiple Series)
The Line Chart is equally effective in displaying trends for multiple series as shown in our
chart at right. As you will notice, each line is a different color. Though not as colorful as the
other charts, it is easy to see how effective the Line Chart is in showing a trend for a single series,
and comparing trends for multiple series of data values.
An area chart or area graph displays quantitative data graphically. It is based on the line
chart. The area between the axis and the line is commonly emphasized with colors, textures and
hatchings. An area chart is commonly used to compare two or more quantities.
Area Charts are like Line Charts except that the area below the plot line is solid. And like Line
Charts, Area Charts are used primarily to show trends over time or other category. The chart at
left is an Area Chart for our single series.
In many cases, the 2-D version of the Area Chart can be ineffective in displaying multiple
series of data meaningfully. Series with lesser values may be completely hidden behind series
with greater values - as demonstrated in the first chart below. Flowers is totally hidden, and
just a wee bit of Trees peeks through.
Fig. 5.12 Area Charts
The Scatter Chart
The purpose of a Scatter Chart is to observe how the values of two series compare over time
or other category. A scatter plot can be used either when one continuous variable is under
the control of the experimenter and the other depends on it, or when both continuous variables
are independent.
"Scatter plots are similar to line graphs in that they use horizontal and vertical axes to plot data
points. However, they have a very specific purpose. Scatter plots show how much one variable
is affected by another. The relationship between two variables is called their correlation."
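One common way to put a number on the correlation that a scatter plot shows is the Pearson correlation coefficient, which ranges from -1 to +1. The notes do not name a specific measure, so this choice, the function name pearson and the sample data are assumptions for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of products of deviations from the means.
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)


# Hypothetical data: hours studied and marks obtained.
hours = [1, 2, 3, 4, 5]
marks = [52, 55, 61, 64, 70]
print(round(pearson(hours, marks), 3))  # close to +1: strong positive correlation
```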
Take a look at our two sample Scatter Charts below. The first chart is a Scatter Chart with Only
Markers, and the second chart is a Scatter Chart with Smooth Lines.
There are forms of data analysis that can be used to extract models describing important
classes or to predict future data trends. Those forms are as follows:
Clustering
Classification
Prediction
5.11 Classification
Classification is a classic data mining technique based on machine learning. Basically,
classification is used to assign each item in a set of data to one of a predefined set of classes
or groups. The classification method makes use of mathematical techniques such as decision trees,
linear programming, neural networks and statistics. In classification, we develop software
that can learn how to classify the data items into groups. For example, we can apply
classification in an application that, “given all records of employees who left the company,
predicts who will probably leave the company in a future period.” In this case, we divide the
records of employees into two groups named “leave” and “stay”, and then ask our
data mining software to classify the employees into these groups.
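As a toy illustration of assigning items to the predefined classes “leave” and “stay”, the sketch below uses a 1-nearest-neighbour rule. This is a deliberately simple stand-in, not one of the techniques listed above, and the function name, the two features (years of service and a satisfaction score) and all records are hypothetical.

```python
def classify(record, labelled):
    """Return the label of the closest labelled record (1-nearest neighbour)."""
    def distance(a, b):
        # Squared Euclidean distance between two feature tuples.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    features, label = min(labelled, key=lambda item: distance(item[0], record))
    return label


# Hypothetical history: (years of service, satisfaction score) -> known outcome.
history = [
    ((1, 2), "leave"),
    ((2, 3), "leave"),
    ((8, 8), "stay"),
    ((10, 9), "stay"),
]

print(classify((1.5, 2.5), history))  # leave
print(classify((9, 7), history))      # stay
```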
5.12 Clustering
Clustering is a data mining technique that automatically forms meaningful or useful clusters
of objects that have similar characteristics. The clustering technique defines the
classes and puts objects in each class, while in classification, objects are
assigned to predefined classes.
5.13 Prediction
Prediction, as its name implies, is a data mining technique that discovers the
relationships among independent variables and between dependent and
independent variables. For instance, prediction analysis can be used in sales
to predict future profit: if we consider sales to be the independent variable, profit could
be the dependent variable. Then, based on historical sales and profit data, we can draw a fitted
regression curve that is used for profit prediction.
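A minimal sketch of such a prediction fits a straight line by least squares, with sales as the independent variable and profit as the dependent variable. A straight line is only the simplest kind of regression curve, and the function name fit_line and the figures below are invented for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical historical data (same currency unit for both).
sales = [10, 20, 30, 40]
profit = [3, 5, 7, 9]

slope, intercept = fit_line(sales, profit)
predicted = slope * 50 + intercept   # predicted profit for sales of 50
print(round(slope, 3), round(intercept, 3), round(predicted, 3))
```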
5.14 Summary
The process involved in finding the story within your data
1. Finding Data – find the data that is suitable to answer your question
2. Wrangle the Data – bring it to a format that is useable
3. Merge Datasets – Bring different datasets together
4. Filter and sort the Data – Pick the data that is interesting
5. Analyze Data – Is there something to it?
6. Visualize Data – If there is something interesting in the data how can we best
showcase it to others?
"Reason sits firm and holds the reins, and she will not let the feelings burst away and
hurry her to wild chasms. The passions may rage furiously, like true heathens, as they
are; and the desires may imagine all sorts of vain things: but judgement shall still have
the last word in every argument, and the casting vote in every decision."
-- Charlotte Brontë
Session 6
6.2 Arguments
Logical thinking is a special mental activity used to draw conclusions. Whenever we express
a thought, we do so by means of statements. An argument is a collection of statements that
helps us draw some conclusion. Formally, an argument is a group of statements, one of which (the
conclusion) is claimed to follow from the other or others (the premises).
The premises of an argument are intended to support (justify) the conclusion of the
argument. Premises are statements that set forth the evidence. The conclusion is the statement
that is claimed to follow from the evidence.
Example:
Premise 1: All Animals can jump
Premise 2: Cat is an animal
Conclusion: Therefore, cat can jump
General Nature of arguments
When we draw conclusions, we do so in some circumstances, based on some information
(premises), using some concepts. Good arguments are those in which the conclusion really does
follow from the premises. Bad arguments are those in which the conclusion does not follow
from the premises, even though it is claimed to.
The attribute by which a statement is either true or false is called the truth value of the
statement. The truth value of a statement can be either true or false.
Interrogative, imperative, and exclamatory sentences are not capable of being true or
false, so they are not statements. The following are examples of sentences
that are not statements.
Are you hungry?
Shut the door, please.
We suggest that you travel by bus.
Turn to the left at the next corner.
You, here!
Example: The children in that house yell loudly when they play in their bedroom. I can hear
children yelling in that house; therefore, the children must be playing in their bedroom.
Given the data, this is certainly a reasonable hypothesis. But the children may be playing
somewhere else and yelling. Here the conclusion is only probable.
This cat is black. That cat is black. A third cat is black. Therefore, all cats are black.
This marble from the bag is black. That marble from the bag is black. A third marble
from the bag is black. Therefore, all the marbles in the bag are black.
Fig 3a. Truth Table for AND Fig 3b. Circuit for AND
Fig 3a. represents the truth table for the AND operation: the output is 1 if and only if both
inputs are 1. Fig 3b. is the circuit diagram for the AND operation, in which the switches A and
B are connected in series. The lamp L is in the ON state only if both A and B are in the ON state.
2. OR operation or Disjunction
It describes events which can occur if at least one of the other events is true.
The OR operation says that if any input is on, the output will be on.
It is represented as C = A + B
Fig 4a. Truth Table for OR Fig 4b. Circuit Diagram for OR
Fig 4a. represents the truth table for the OR operation, in which the output is 1 if any of the
inputs is 1. Fig 4b. represents the circuit diagram for OR, in which the switches A and B
are connected in parallel, so if either switch is ON the lamp will be ON.
3. NOT Operation or Inversion
The NOT operation changes a statement from true to false and vice versa. It is
represented as C = Ā.
Fig 5a. Truth Table for NOT Fig 5b. Circuit Diagram for NOT
Fig 5a. represents the truth table for NOT, and the circuit diagram for the same is given in
Fig 5b. If the switch A is in the ON state, the lamp is in the OFF state, and vice versa.
AND
OR
XOR
NAND
NOR
7.6 Basic Laws of Boolean Algebra
Commutative Law
A+B=B+A
A.B=B.A
Associative Law
A + (B + C) = (A + B) + C
A (BC) = (AB) C
Distributive Law
A (B + C) = AB + AC
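Because each variable can only be 0 or 1, these laws can be checked by trying every combination of inputs. The sketch below does that; the helper name holds_for_all and the LAWS dictionary are illustrative, with + written as | (OR) and the product written as & (AND).

```python
from itertools import product


def holds_for_all(law, n_vars):
    """Return True if law(...) is true for every combination of 0/1 inputs."""
    return all(law(*values) for values in product((0, 1), repeat=n_vars))


# Each entry: (function that checks the law for one input combination, number of variables)
LAWS = {
    "A+B = B+A": (lambda a, b: (a | b) == (b | a), 2),
    "A.B = B.A": (lambda a, b: (a & b) == (b & a), 2),
    "A+(B+C) = (A+B)+C": (lambda a, b, c: (a | (b | c)) == ((a | b) | c), 3),
    "A(BC) = (AB)C": (lambda a, b, c: (a & (b & c)) == ((a & b) & c), 3),
    "A(B+C) = AB+AC": (lambda a, b, c: (a & (b | c)) == ((a & b) | (a & c)), 3),
}

results = {name: holds_for_all(law, n) for name, (law, n) in LAWS.items()}
for name, ok in results.items():
    print(f"{name}: {'holds' if ok else 'fails'}")
```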
Example 2:
There is a car with three main control systems. A warning lamp should be designed to light if
any of the following conditions occur:
All systems are down
Systems A,B down but C is ok
Systems A,C down but B is ok
System A down, but B,C are ok
Step 1 : Define the problem
There are two possible states for each system
Assign:
System : Down = 0, OK = 1
Light : Off = 0, On = 1
Step 2: Draw a logic block diagram
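Using the assignment from Step 1 (Down = 0, OK = 1), the four listed conditions can be tabulated directly. The sketch below is only an illustration; the names LAMP_ON_STATES and lamp are hypothetical. It also checks an observation that follows from the list: all four conditions have system A down, with every combination of B and C, so the lamp output equals NOT A.

```python
from itertools import product

# (A, B, C) combinations that must light the lamp, with Down = 0 and OK = 1
LAMP_ON_STATES = {
    (0, 0, 0),  # all systems are down
    (0, 0, 1),  # A, B down but C is OK
    (0, 1, 0),  # A, C down but B is OK
    (0, 1, 1),  # A down, but B, C are OK
}


def lamp(a, b, c):
    """Return 1 (lamp on) if the system state is one of the warning conditions."""
    return 1 if (a, b, c) in LAMP_ON_STATES else 0


print("A B C | Lamp")
for a, b, c in product((0, 1), repeat=3):
    print(a, b, c, "|", lamp(a, b, c))

# The lamp depends only on A: it is on exactly when A is down.
simplified_matches = all(
    lamp(a, b, c) == 1 - a for a, b, c in product((0, 1), repeat=3)
)
print("Lamp equals NOT A:", simplified_matches)
```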
Example- PROPOSITION
• 5+2 = 7
• 5 + 2 = 12
• Milk is white
• Go in a straight line
• Simple sentences are relatively short and do not contain any other sentence as a component.
Ex.
• Grass is green.
• The sky is blue.
• The first sentence ‘Grass is green’ can be symbolized as G. The second sentence ‘The sky is
blue’ can be symbolized as S.
Session 9
Connectives
9.1 Connectives
What are Connectives?
In propositional logic, connectives are used to show the relationship between the propositions
or sentences.
Need for Connectives
Complex and larger propositions can be constructed by combining simple propositions using
connectives.
• OR ()
• NOT ()
• IF_AND_ONLY_IF ()
3) Querying Database
A search query returns web pages that contain all the terms or phrases in the query.
By default, search queries that use more than one word are treated as a conjunction.
The query returns a flood of web pages related to people named Lisha of all kinds, none of whom
is the “Lisha Joy” who was our high school friend.
Conjunction:
o We decide to use the search query “Lisha Joy”.
o The search engine understands this query to be the conjunction of the two words.
We still cannot find any web page related to our high school classmate Lisha Joy.
As we think about how to improve the search results, we remember that she married someone
with the surname “Stephen”.
A search query containing all three words would be overly narrow, since only pages containing
all three words would be returned.
The search engine would interpret this query to mean “find all web pages that contain the
words Lisha AND Joy AND Stephen”.
The search engine shows the pages that match, but it fails to show the desired result.
Disjunction:
o We therefore attempt to construct a search query that expresses the possibility that her
last name is either Joy OR Stephen.
Suppose the search query “Lisha AND (Joy OR Stephen)” returns many web pages related to a
well-known soccer player.
We are certain that our high school classmate is not a soccer player, and therefore we attempt
to construct a search query that prevents soccer-related pages from appearing.
Negation:
o Search engines use the NOT operator to exclude pages that contain certain words.
o We can therefore construct the following search query: “Lisha AND (Joy OR
Stephen) AND NOT soccer”. It gives the desired result.
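The final query combines all three connectives, and its effect can be imitated on a tiny collection of pages. This is only a sketch: the page names and page texts below are invented for illustration, and a real search engine works very differently internally.

```python
# Hypothetical pages; the texts are made up for this example.
pages = {
    "page1": "Lisha Stephen scores twice in soccer final",
    "page2": "Lisha Joy alumni reunion high school",
    "page3": "Lisha Stephen wedding photos",
    "page4": "Joy of cooking recipes",
}


def words(text):
    """Split a page into a set of lower-case words."""
    return set(text.lower().split())


def matches(text):
    """Lisha AND (Joy OR Stephen) AND NOT soccer."""
    w = words(text)
    return "lisha" in w and ("joy" in w or "stephen" in w) and "soccer" not in w


hits = sorted(name for name, text in pages.items() if matches(text))
print(hits)
```

Here page1 is rejected by the NOT clause and page4 is rejected because it does not contain “Lisha”.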
Example 2:
Consider a situation where we live in a city that has a grocery store named “Green”. We
would like to use a search engine to find the phone number and the hours of operation for this
store.
The single word “Green” is ambiguous, because green is also a color.
Conjunction:
For example, Green Basket Coimbatore
Suppose you want to search for the details of either Green Basket or Green BigBasket.
Then use disjunction.
Disjunction:
The search query should be “Green AND (Basket OR
BigBasket)”.
This query finds web pages that contain Green together with Basket or BigBasket.
Because OR is inclusive, pages that contain both words also match; excluding them
would require an additional NOT clause.
Negation:
Search engines use the NOT operator to exclude pages that contain
certain words. (Represented by the - symbol.)
These digital circuits are typically constructed by combining logic gates in various ways.
Each logic gate has one or more inputs and produces a single output corresponding to the
operator that it implements.
For the AND circuit (Figure: AND circuit), both switches must be closed in
order to light up the bulb.
If either one of the switches in the OR circuit (Figure: OR circuit) is closed, the light bulb will
be illuminated.
Each row of the table contains a set of data that belongs to a single record.
Every cell of the table contains one field for one record of data.
Output:
Helen Lobby
Reggle Green
Jennifer Dichali
Session 11
Introduction to Computational Thinking and Problem Solving
11.1 Computational Thinking
Computers are used in everyday life for solving problems of various kinds in various fields.
Computer programming is used for implementing the solutions to these problems. Although
programming is an essential activity in computer science, it is not the only activity involved in
solving a problem. Computer science is mainly about computational thinking, or computational
problem solving. It is about learning how computers solve a problem, that is, orienting
our thought process to the way in which a computer solves a problem. Computational problem
solving involves the following processes.
Analyzing the problem
Designing the solution to a problem
Implementing the solution
Testing the solution
The above processes are the steps involved in solving a problem computationally.
The first phase in solving a problem computationally is problem analysis. Problem analysis
requires two things.
1. A representation that captures all the relevant aspects of the problem
2. An algorithm that solves the problem by using the representation.
Let us consider the Man, Cabbage, Goat, Wolf [MCGW] problem. There is a man who lives
on the east side of a river. He has a goat, a wolf and a cabbage with him. While he is present, he
makes sure that the goat does not eat the cabbage and the wolf does not eat the goat. He
wishes to bring the cabbage, the goat and the wolf to the west side of the river to sell them.
But he has a boat which is only large enough to carry himself and either the cabbage or the goat or
the wolf. The man cannot leave the cabbage alone with the goat because the goat will eat the
cabbage. He cannot leave the wolf alone with the goat because the wolf will eat the goat. How
can he bring all of them safely to the west side of the river?
An algorithmic approach for solving this problem is simply trying all possible combinations of
the items that can be taken back and forth across the river and then arriving at a correct solution.
Trying all possible solutions to a problem and finding a solution is known as Brute Force
Approach. Initially we have to find out the relevant aspects of the problem. When a problem
is analyzed, there may be relevant details as well as irrelevant details. For example in the
MCGW problem, the relevant details are
What is the current location of the items? – On the east side of the river
What is their destination? – On the west side of the river
What are the items? – Man, Cabbage, Goat, Wolf
How many items can travel in the boat? – Only two
But there may be irrelevant details like, the color of the boat, the width of the river, the name
of the man etc. These details are not required for our solution. Therefore, these details need not
be represented in our data representation. The process of hiding the irrelevant data and exposing
only the relevant data is known as data abstraction.
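The representation and the brute-force idea can be sketched together. In the illustration below, a state records only the relevant details: the side of the river ("E" or "W") for the man, cabbage, goat and wolf. Breadth-first search is assumed here as one systematic way of trying the possible crossings; the text does not prescribe a particular search order, and the function names are hypothetical.

```python
from collections import deque

ITEMS = ("man", "cabbage", "goat", "wolf")


def is_safe(state):
    """A state is unsafe if the goat is left with the cabbage or the wolf without the man."""
    man, cabbage, goat, wolf = state
    if goat == cabbage and man != goat:
        return False
    if wolf == goat and man != goat:
        return False
    return True


def next_states(state):
    """Yield every safe state reachable by one boat crossing."""
    man = state[0]
    other = "W" if man == "E" else "E"
    # i == 0: the man crosses alone; otherwise he takes item i, if it is on his side.
    for i in range(4):
        if state[i] == man:
            new = list(state)
            new[0] = other
            new[i] = other
            new = tuple(new)
            if is_safe(new):
                yield new


def solve(start=("E",) * 4, goal=("W",) * 4):
    """Try crossings level by level until everything is on the west side."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for state in next_states(path[-1]):
            if state not in seen:
                seen.add(state)
                queue.append(path + [state])
    return None


solution = solve()
for step in solution:
    print(dict(zip(ITEMS, step)))
```

Details such as the color of the boat never appear in the state, which is the data abstraction described above.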
Table Representation
List of Lists
[ ['Atlanta', ['Boston', 1110], ['Chicago', 718], ['Los Angeles', 2175], ['New York', 888],
    ['San Francisco', 2473] ],
  ['Boston', ['Chicago', 992], ['Los Angeles', 2991], ['New York', 215], ['San Francisco',
    3106] ],
  ['Chicago', ['Los Angeles', 2015], ['New York', 791], ['San Francisco', 2131] ],
  ['Los Angeles', ['New York', 2790], ['San Francisco', 381] ],
  ['New York', ['San Francisco', 2901] ] ]
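A short sketch shows how a value could be looked up in this list-of-lists representation. The variable name distances and the function name distance are assumptions made for the example. Because each pair of cities is stored only once, the lookup compares the pair in either order.

```python
distances = [
    ["Atlanta", ["Boston", 1110], ["Chicago", 718], ["Los Angeles", 2175],
     ["New York", 888], ["San Francisco", 2473]],
    ["Boston", ["Chicago", 992], ["Los Angeles", 2991], ["New York", 215],
     ["San Francisco", 3106]],
    ["Chicago", ["Los Angeles", 2015], ["New York", 791], ["San Francisco", 2131]],
    ["Los Angeles", ["New York", 2790], ["San Francisco", 381]],
    ["New York", ["San Francisco", 2901]],
]


def distance(city_a, city_b):
    """Return the stored distance between two cities, or None if the pair is not stored."""
    for row in distances:
        origin, entries = row[0], row[1:]
        for destination, value in entries:
            if {origin, destination} == {city_a, city_b}:
                return value
    return None


print(distance("Boston", "New York"))
print(distance("New York", "Boston"))
```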
Algorithm Description: The second major task in problem design is algorithm description. You
can choose either an existing algorithm or you may design a new algorithm for solving a
problem. For example, for the calendar problem, day of week algorithm already exists. For the
travelling salesman problem, many algorithms are available. Algorithms that work well in
general, but are not guaranteed to give the correct result for each specific problem are called
heuristic algorithms.
Brainstorming Guidelines
Find out ways to motivate the members who participate in brainstorming. Clarify the
understanding. Once all the ideas have been generated, review the ideas that have been offered.
Combine items that are similar and eliminate duplicates.
2. Multivoting
Multivoting is a way to vote to select the most important or popular items (alternatives)
from a list. It is used to help a group of people to make a decision with which they are
comfortable.
Hints
Completed Map
Draw over clusters of similar thoughts that are associated with the main focus point.
Have fun using a different color highlighter with each cluster of words.
How does the variety of ideas relate to one another?
Do you notice any common causes of the problem? What are the most important
causes?
You are now ready to brainstorm solutions!
11.8 Activities
1. Write an algorithm to find the area of a circle.
2. Write an algorithm to find the greatest number among three numbers
3. Write an algorithm to find the factorial of a number.
4. Represent the following data in suitable formats
a. Seasons of a year
b. Colors of a rainbow
c. A unit matrix
d. Details of a book
5. Identify the modules in Library Management System
6. Group your class students based on some similarity
7. Identify the rules to join B.Tech. course in Karunya University.
8. What is the four-digit number in which the first digit is one-third the second, the third
is the sum of the first and second, and the last is three times the second?
9. The following verse spells out a word, letter by letter. "My first" refers to the word's
first letter, and so on. What's the word that this verse describes?
– Input/output
– Processing
– Error handling
• In general, non-functional requirements include:
– Reliability
– Safety
– Security
– Performance
– Delivery
– Help facilities
Example: Functional Requirements of a counter application
The purpose is to develop a counter application which is used to count items up or
down.
The sample design of the counter is shown in Fig. 12.1.
[Fig. 12.1 Sample design of the counter, showing a Reset button and the digits 0 to 9]
12.3 Activities
Think about an app you can develop for your mobile phone (may be to solve problems you
face in everyday life) and prepare its functional requirement specification.
Session 13
Problem Decomposition
13.1 Introduction
A solution to a large problem is often complex. It can be made simple if the problem is
broken down into smaller parts. Then each part can be solved individually and then the
solutions are combined to produce the solution to the original problem. The process of breaking
down a complex problem or system into smaller sub problems with more manageable parts is
known as Problem Decomposition.
Decomposition helps to solve complex problems and manage large projects.
Large problems can be tackled with “divide and conquer” method
The problem should be decomposed such that every sub problem is of the same level
of detail.
Each sub problem can be solved independently.
The solutions to the sub problems can be combined to solve the original problem.
Example 1: Consider the problem of making a pizza. Making a pizza is a
fairly large task, but it can be subdivided into the following subtasks.
1. Make crust
2. Make spread and sauce
3. Spread cheese
4. Spread toppings
5. Bake
6. Slice
Fig. 13.1 Dividing a problem into smaller subproblems
Hence, if the task is divided into subtasks, the problem becomes smaller and more manageable.
Example 2: Organizing a school trip
The task of organizing a school trip can be subdivided into subtasks such as booking a coach,
getting consent letters, staffing, checking the weather and checking resources. Each subtask can be
delegated to a different person, and the school trip can then be organized successfully.
Index:  0  1  2  3  4  5  6  7  8  9 10
Value: 20 23 15  7  4  8 10 20 40 25 12
Index:  0  1  2  3  4  5  6  7  8  9 10
Value:  4  7  8 10 12 15 20 20 23 25 40
Index:  0  1  2  3  4  5  6  7  8  9 10
Value:  4  7  8 10 12 15 20 20 23 25 40
Fig. 13.5 Middle element is 15; 25 > 15, so the lower (left) half is discarded
Now, the new low is calculated as low = mid + 1, since the lower half of the list is discarded.
If the upper half had been discarded instead, high would be calculated as high = mid - 1.
So
low = mid + 1 = 5 + 1 = 6
mid = low + (high - low)/2 = 6 + (10 - 6)/2 = 6 + 2 = 8
Index:  6  7  8  9 10
Value: 20 20 23 25 40
Fig. 13.6 New list
Now the search element is to be searched only in the upper half of the list shown in Fig. 13.6.
The low and middle values are calculated as shown above and the middle element is 23. Since
25 > 23, once again the lower half of the list is discarded and the upper half is considered. The
low and middle values are calculated using the usual formula.
low = mid + 1 = 8 + 1 = 9
mid = low + (high - low)/2 = 9 + (10 - 9)/2 = 9.5, which is rounded to 10.
Now the new list is shown in Fig. 13.7.
Index:  9 10
Value: 25 40
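The search traced above can be written as a short function. This is a minimal sketch with one stated assumption: the midpoint uses integer (floor) division, which is the common convention in code, so the last midpoint comes out as 9 rather than the rounded value 10 from the hand calculation. The search still finds 25 at index 9. The function name binary_search is illustrative.

```python
def binary_search(values, target):
    """Search a sorted list; return the index of target, or -1 if it is absent."""
    low, high = 0, len(values) - 1
    while low <= high:
        mid = low + (high - low) // 2  # floor division (assumption, see lead-in)
        print(f"low={low} high={high} mid={mid} middle element={values[mid]}")
        if values[mid] == target:
            return mid
        if target > values[mid]:
            low = mid + 1   # discard the lower half
        else:
            high = mid - 1  # discard the upper half
    return -1


data = [20, 23, 15, 7, 4, 8, 10, 20, 40, 25, 12]
sorted_data = sorted(data)  # binary search needs a sorted list
index = binary_search(sorted_data, 25)
print("found at index", index)
```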
13.5 Activities
Divide the following problems into modules.
1. Finding the sum of first ‘n’ natural numbers
2. Finding the factorial of a number
3. Finding the sum of the digits of a number
4. Counting the digits of a number and checking whether the given number is an
Armstrong number or not.
5. Counting the divisors of a number and checking whether the given number is prime or
composite.
6. Finding the sum of the divisors of a number and checking whether the given number
is perfect or not.
7. Evaluation of nCr.
Session 17
Algorithm Design
17.1 Need for Algorithm
A computer is an electronic machine. It cannot think on its own to solve a problem, whether
the problem is the sum of two numbers or something far more complex. But if we give it
instructions, the computer follows them and solves the problem.
17.3 Algorithms
At its most basic, an algorithm is a method for solving a computational problem. An
algorithm is any well-defined computational procedure that takes some value, or set of values,
as input and produces some value, or set of values, as output.
An algorithm is thus a sequence of computational steps that transform the input into the
output. We can also view an algorithm as a tool for solving a well-specified computational
problem.
The statement of the problem specifies in general terms the desired input/output
relationship. The algorithm describes a specific computational procedure for achieving that
input/output relationship.
Step 1: Start
Step 2: Declare the variables
Step 3: Input the values of a, b and c
Step 4: Calculate d=b*b-4*a*c
Step 5: if d=0 then
Print the roots are real and equal
Calculate root1=root2= -b/(2*a)
Print the values of root1 and root2
Step 6: if d>0 then
Print the roots are real and unequal
Calculate root1=(-b+sqrt(d))/(2*a)
root2=(-b-sqrt(d))/(2*a)
Print the values of root1 and root2
Step 7: if d<0 then
Print the roots are complex (imaginary)
Calculate r1=-b/(2*a)
r2=sqrt(abs(d))/(2*a)
root1=r1+i*r2
root2=r1-i*r2
Print the values of root1 and root2
Step 8: Stop
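The steps above, which find the roots of a quadratic equation with coefficients a, b and c, can be sketched as follows. The function name quadratic_roots and the check for a = 0 are additions for the example. Note the parentheses around 2*a: without them, -b/2*a would divide by 2 and then multiply by a.

```python
import math


def quadratic_roots(a, b, c):
    """Return (description, root1, root2) for a*x^2 + b*x + c = 0."""
    if a == 0:
        raise ValueError("a must be non-zero for a quadratic equation")
    d = b * b - 4 * a * c  # discriminant
    if d == 0:
        root = -b / (2 * a)
        return "real and equal", root, root
    if d > 0:
        s = math.sqrt(d)
        return "real and unequal", (-b + s) / (2 * a), (-b - s) / (2 * a)
    # d < 0: the roots are r1 + i*r2 and r1 - i*r2
    r1 = -b / (2 * a)
    r2 = math.sqrt(abs(d)) / (2 * a)
    return "complex", complex(r1, r2), complex(r1, -r2)


print(quadratic_roots(1, -3, 2))
print(quadratic_roots(1, 2, 1))
print(quadratic_roots(1, 2, 5))
```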
Step 1: Start
Step 2: Declare the variables
Step 3: Input the value of n
Step 4: Initialize sum=0
Step 5: Calculate x=n%10
sum=sum+x
n=n/10 (integer division)
Step 6: Repeat Step 5 while n>0
Step 7: Print sum
Step 8: Stop
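These steps add up the digits of a number. A minimal sketch, assuming n is a non-negative integer and that the division in Step 5 is integer division (written // in Python); the function name digit_sum is illustrative.

```python
def digit_sum(n):
    """Return the sum of the decimal digits of a non-negative integer n."""
    total = 0
    while n > 0:
        x = n % 10   # last digit
        total += x
        n //= 10     # drop the last digit
    return total


print(digit_sum(1234))
```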
Step1: Start
Step2: Declare the variables
Step3: Input the value of n
Step4: Initialize f1=-1, f2=1 and f3=0
Step5: Repeat the following n times
f3=f1+f2
f1=f2
f2=f3
Step6: Display f3
Step7: Stop
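The steps above generate Fibonacci numbers starting from f1 = -1 and f2 = 1, so the first value computed is 0. The sketch below assumes the loop body runs n times, which the steps do not state explicitly. Under that assumption the displayed f3 is the n-th term of the series 0, 1, 1, 2, 3, 5, 8, ... The function name fibonacci_term is illustrative.

```python
def fibonacci_term(n):
    """Return the n-th term (counting from 1) of the series 0, 1, 1, 2, 3, ..."""
    f1, f2, f3 = -1, 1, 0
    for _ in range(n):  # assumed loop count: n iterations
        f3 = f1 + f2
        f1 = f2
        f2 = f3
    return f3


print([fibonacci_term(k) for k in range(1, 8)])
```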
17.7 Summary
1. A sequence of steps to solve a given problem
2. Algorithm should be self-contained
3. The steps should not be ambiguous
4. It should produce an outcome when executed
5. May be designed to do one or multiple tasks
6. Written in plain natural language [with some specific terminology]
Session 18
Flow Chart
18.1 Flow Chart
Flow chart is the pictorial representation of an algorithm. The flowchart uses some
standard notations or symbols to represent the programming components.
18.2 Symbols
• Selection
– Selecting one of two possible actions based on a particular condition
• Iteration
– Repeating actions
18.4 Connectors
• Connectors are used to connect two parts of a flow chart
• Connectors should be named uniquely
• There are two types of connectors
– In page connector
– Off page connector
18.5 Advantages of using a FlowChart
• Communication: Flowcharts are a better way of communicating the logic of a system.
• Effective Analysis: With the help of a flowchart, problems can be analyzed in more
effective ways.
• Proper Documentation: Program flowcharts serve as good documentation, which is
needed for various purposes.
• Efficient Coding: Flowcharts act as a guide or blueprint during the system analysis and
program development phases.
• Efficient Program Maintenance: Maintenance of an operating program becomes easy
with the help of a flowchart.
• Proper Debugging: A flowchart helps in the debugging process.
18.6 Examples
1. Flowchart for Area of a Circle
2. To add 6 subject marks of a person and calculate the total and average
3. To check whether a person is eligible to vote