
Summer Reading and Topics List for

Master of Science in Data Science (MSDS) Program


Linear Algebra

In the linear algebra boot camp, the instructor will draw from a combination of the following
books:

1. Introduction to Linear Algebra by Gilbert Strang
2. Matrix Analysis and Applied Linear Algebra by Carl D. Meyer
3. Applied Linear Algebra and Matrix Analysis by Thomas S. Shores
4. Numerical Linear Algebra by Lloyd N. Trefethen and David Bau

For the time being, you can rely on your linear algebra book from college, as well as the free
linear algebra book by Jim Hefferon at

http://joshua.smcvt.edu/linearalgebra/book.pdf.

In your initial review, focus on the following topics: vectors, matrices, and associated oper-
ations, solving linear equations, determinants, vector spaces, eigenvalues and eigenvectors,
and linear transformations. Also start looking

If you are looking for an online learning venue, consider taking the OCW Scholar course in
linear algebra at the Massachusetts Institute of Technology at

http://ocw.mit.edu/courses/mathematics/18-06sc-linear-algebra-fall-2011/.

Another basic online resource is the Khan Academy course

https://www.khanacademy.org/math/linear-algebra.

A fair warning on the online courses: watching the lectures without doing the homework has
next to no value. Engage in the work.

Probability and Statistics

In the probability and statistics boot camp (MSAN 504), the instructor will use a combina-
tion of the following books:

1. Ghahramani, Saeed. Fundamentals of Probability, with Stochastic Processes, 3rd edition.
ISBN: 978-0131427068.
2. Freund, John. Mathematical Statistics, 8th edition. ISBN: 978-0321807090.
3. Ross, Sheldon. Simulation, 5th edition. ISBN: 978-0124158252.
4. Moore, David. The Basic Practice of Statistics, 7th edition. ISBN: 978-1464142536.
5. Hogg, Robert V. and Elliot A. Tanis. Probability and Statistical Inference, 9th edition.
ISBN: 978-0321923271.
6. Grolemund, Garrett and Hadley Wickham. R for Data Science. ISBN: 978-1491910368.
7. Ross, Sheldon. A First Course in Probability. ISBN: 978-0321866813.
8. Lock, Lock, Lock, Lock and Lock. Statistics: Unlocking the Power of Data, 2nd edition.
ISBN: 978-1119308843.

Selected excerpts from the above-mentioned books are available on the Canvas page for
MSAN 504. Students will have access to this Canvas page as soon as they log onto MyUSF
(and the Canvas page is published).

As you review, focus on the following learning outcomes:

• Understanding the definitions of probability mass functions, probability density functions,
cumulative distribution functions, and moments;
• Knowing the properties of the most famous examples of random variables (Bernoulli,
binomial, geometric, exponential, Poisson, normal, etc.);
• Mastering the underpinnings of the most common parameter estimation technique,
maximum likelihood estimation;
• Understanding the difference between a sample and a population;
• Being able to state the Central Limit Theorem, understanding its importance, and
applying it in a variety of basic situations;
• Being able to implement, by hand and in R, all elementary one- and two-sample tests of
hypotheses and confidence interval constructions (e.g., means, proportions, correlation,
ratios of variances, etc.);
• Understanding the fundamental axioms, rules, and laws of probability theory;
• Simulating (using R) random numbers governed by various probability distributions
using the method of inverse transformation and the acceptance-rejection technique;
• Defining, and working with examples related to, conditional probability;
• Understanding the importance of the concept of independence;
• Proving and using the Law of Total Probability;
• Using the Law of Total Probability to prove Bayes’ Theorem and deploying Bayes’
Theorem in a variety of practical situations;
• Working with random vectors as well as random variables;
• Working with multivariate distributions, as well as the concepts of conditional expec-
tation and independence in a high-dimensional setting; and
• Working with the multivariate Gaussian distribution.
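
The two simulation techniques named above (inverse transformation and acceptance-rejection) can be sketched in a few lines. The boot camp asks for these in R; this sketch uses Python, and the same logic carries over to R. The target distributions (an exponential with rate 2 and a Beta(2, 2)), the function names, the seed, and the sample size are all assumptions chosen for illustration, not course material.

```python
import math
import random

def sample_exponential(rate, rng):
    """Inverse transformation: if U ~ Uniform(0, 1), then
    -ln(1 - U) / rate follows an Exponential(rate) distribution."""
    u = rng.random()                 # u is in [0, 1), so 1 - u is in (0, 1]
    return -math.log(1.0 - u) / rate

def sample_beta22(rng):
    """Acceptance-rejection for the Beta(2, 2) density f(x) = 6x(1 - x)
    on [0, 1], with a Uniform(0, 1) proposal and the bound f(x) <= 1.5."""
    while True:
        x = rng.random()             # candidate drawn from the proposal
        u = rng.random()             # uniform used for the accept test
        if u * 1.5 <= 6.0 * x * (1.0 - x):
            return x                 # accepted with probability f(x) / 1.5

rng = random.Random(504)             # fixed seed so runs are repeatable
exp_draws = [sample_exponential(2.0, rng) for _ in range(20000)]
beta_draws = [sample_beta22(rng) for _ in range(20000)]

exp_mean = sum(exp_draws) / len(exp_draws)      # theory: 1 / rate = 0.5
beta_mean = sum(beta_draws) / len(beta_draws)   # theory: 0.5
print("exponential sample mean: %.3f" % exp_mean)
print("Beta(2, 2) sample mean:  %.3f" % beta_mean)
```

Both sample means should land close to their theoretical value of 0.5. Comparing a simulated mean against the known mean is a quick sanity check on any sampler you write.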
If you are looking for an online venue to review this material, there are several you might
consider. We recommend the first two courses in Duke University’s “Statistics with R Spe-
cialization” at Coursera.
Economics, Finance, Communication and Business

The following represent some resources that you may find helpful:

• Data Science for Business by Foster Provost and Tom Fawcett contains a wealth of
useful information.

• Web Analytics 2.0: The Art of Online Accountability and Science of Customer Centricity
by Avinash Kaushik. The following chapters of this book are useful: 1 – web
analytics 2.0 framework; 5 – conversions, revenue, and satisfaction; 6 – customer centricity;
7 – experimentation and testing; 8 – competitive intelligence analysis; 13 –
planning your career for success; and 14 – creating a data-driven culture.

• Next, consider reading Predictive Analytics: The Power to Predict Who Will Click,
Buy, Lie, or Die by Eric Siegel and Thomas H. Davenport.

• If your English skills (particularly your written skills) are weak, practice some of the
writing exercises at http://www.autoenglish.org/writing.htm. Also, consider reviewing
The Elements of Style by W. Strunk and E.B. White and Style: Ten Lessons
in Clarity and Grace by J. Williams.
Computer Programming Languages and Tools

Laptops
Every student must have a laptop with enough computing power and memory to complete
projects and practicum work. We strongly recommend you buy a Mac (or Linux) laptop
with at least the following specs:

2.9 GHz Dual-Core Intel Core i5, Turbo Boost up to 3.3 GHz,
or 2.6 GHz quad-core Intel Core i7 processor, Turbo Boost up to 3.5 GHz.
256 GB PCIe-based Flash Storage minimum

We further recommend that you get 16 GB rather than 8 GB of RAM (memory). The more memory
you have, the better.
Let us be very clear that Mac OS X is the preferred operating system. You can get
away with Linux, but should avoid Windows. If you choose to use Windows, you are on your
own: Faculty will not be able to help you install software or packages on Windows! Most
of the software we use in this program does not work well or cannot be easily installed on
Windows.

Programming
The greater your facility with writing software and using programming tools, the easier
you will find the entire curriculum. When students have difficulty in the programming
assignments, our first question to them is: What did you do in the months prior to the boot
camp? Every year a few students do not pass the computational boot camp and must exit
the program. We have created these notes so that you can properly prepare yourself.

Computer programming has four key elements:


1. Knowledge of at least one programming language’s syntax
2. An ability to translate a problem description or specification to properly-working code
3. An ability to write code that is properly organized, robust, and that is easy to read,
understand, and modify
4. A facility with the programming ecosystem that includes development tools, processes,
and protocols
At minimum, we expect all incoming students to satisfy #1. As you will write programs
in Python and R immediately upon entry into any of the boot camp courses, you should
take this opportunity to learn as much as you can before arriving. You will use a variety of
tools, #4, throughout the degree program and learning about them before you arrive will
definitely make your life easier. The computational boot camp will focus on #2 and #3, but
those skills will continue to improve for the rest of your career.

Python 3.x
To ensure that you satisfy #1 in Python (3.x) you must be able to read and write code
using the following items. There will be a quiz at the start of the first class of computational
boot camp to verify this.
• Types vs. variables. Common types: strings, integers, and floating-point numbers. Difference
between objects and values (see section 8.11 of the book “Python for Everybody”).
Conversion between types. Multiple assignment: a, b = f().

• if statements and conditional expression; nested if statements.

• Iteration with for and while loops.

• List and string element access, s[i], and slicing, s[1:5], s[1:], etc.

• String formatting with the % operator.

• Adding elements to a list, traversing a list.

• Adding elements to a dictionary, accessing by key, traversing dictionaries.

• Function declaration syntax; the difference between local variables, parameters, and
global variables; return values.

• Importing and using common functions and libraries: range(), round(), len(), min(),
max(), splitting strings with split(), reading and writing text files.

• Be able to look up functions to learn about their parameters and return values.

• An understanding of Python packages and how to import code from one file to another.

• Install Python packages using the pip program.

• Accessing script parameters via sys.argv.
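
As a rough self-check, here is a short script that touches many of the items above: type conversion, multiple assignment, a conditional expression, for and while loops, slicing, % formatting, lists, dictionaries, a function with parameters and return values, and sys.argv. The function name word_lengths and the default text are made up for illustration; this is a sketch of the kind of code you should be able to read and write, not a sample of the actual quiz.

```python
import sys

def word_lengths(line):
    """Return the words in line and a dictionary mapping each word to its length."""
    words = line.split()              # split a string into a list of words
    lengths = {}                      # empty dictionary
    for w in words:                   # traverse the list with a for loop
        lengths[w] = len(w)           # add (or overwrite) a key/value pair
    return words, lengths             # two return values, packed in a tuple

# sys.argv[0] is the script name; the remaining items are the script parameters.
joined = " ".join(sys.argv[1:])
# Conditional expression: fall back to a default when no text was passed in.
text = joined if joined.strip() else "data science boot camp"

words, lengths = word_lengths(text)   # multiple assignment: a, b = f()

count = int("3") + 2                  # str -> int conversion, then addition
ratio = float(count) / 2              # int -> float conversion; ratio is 2.5

first = words[0]                      # element access
print(first[1:], first[:2])           # slicing: drop first char, keep first two

i = 0
while i < len(words):                 # the same traversal with a while loop
    w = words[i]
    if lengths[w] > 4:
        print("%-10s %d letters (long)" % (w, lengths[w]))   # % formatting
    else:
        print("%-10s %d letters" % (w, lengths[w]))
    i += 1

for key in sorted(lengths):           # traverse a dictionary by sorted key
    print("%s -> %d" % (key, lengths[key]))
```

If any line here is unclear, type it into the Python interactive shell and experiment with it until it makes sense.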

You can glean all of that from the book “Python for Everybody” at
http://www.pythonlearn.com/book.php or by going through any of the following free courses:

• https://www.coursera.org/learn/python

• https://www.coursera.org/learn/interactive-python-1

• https://www.udacity.com/course/programming-foundations-with-python--ud036

• https://www.edx.org/course/introduction-computer-science-mitx-6-00-1x-10

• https://www.edx.org/course/cs-all-introduction-computer-science-harveymuddx-cs005x-0

• https://www.codecademy.com/learn/python

Python Tutor. You will also find http://pythontutor.com/ to be a very useful tool
when trying to visualize what’s going on with the objects in your running program. For
example, it will show you how a list of numbers is laid out in memory. Being able to
visualize data structures is a critical skill.
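
A small example worth pasting into Python Tutor is the difference between two names for one list and two separate lists. The variable names below are arbitrary, and the snippet is only a sketch of the kind of experiment that makes the object diagrams useful.

```python
import copy

a = [1, 2, 3]
b = a                  # b refers to the SAME list object as a
c = list(a)            # c is a new list with the same elements (a shallow copy)

b.append(4)            # mutates the one list that both a and b refer to
print(a)               # [1, 2, 3, 4]
print(c)               # [1, 2, 3]
print(a is b, a is c)  # True False

grid = [[0, 0], [0, 0]]
shallow = list(grid)           # new outer list, but the inner lists are shared
deep = copy.deepcopy(grid)     # new outer list and new inner lists
grid[0][0] = 9
print(shallow[0][0], deep[0][0])   # 9 0
```

Stepping through this in a visualizer shows a and b as arrows to one list, which is why appending through b changes what a prints.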
How to learn. Many of you have already gone through these courses, but you might
not have gotten that much out of them, particularly if you didn’t do the projects. If you want
to learn to write software, there is no substitute for actually typing in code. Don’t just
listen and watch the instructor write software. Study the code, understand the problem it is
solving, and then with a blank screen try to reproduce the code yourself without looking at
the solution unless you get stuck. Just as in a foreign natural language, we are much better
at listening than speaking. To improve your speaking, you must get as much practice as
possible speaking that language. You should be typing in (not just cut/pasting) lots and lots
of Python between now and the boot camp.
For incoming students with some Python experience already, you might consider review-
ing the first three chapters of Python for Data Analysis: Data Wrangling with Pandas,
NumPy, and IPython, by Wes McKinney.

R
You will do a lot of programming in R. If you are new to R, you should familiarize yourself
with Robert Kabacoff’s “R in Action.” A good introduction to data science techniques can
be found in “R for Data Science” by Grolemund and Wickham, which is available free
online at http://r4ds.had.co.nz/. There will be a strong focus on the following packages:
dplyr, ggplot2 and magrittr. Students with more substantial programming experience,
but in languages other than R, should consider reviewing Chapters 1 to 3 of Software for
Data Analysis: Programming with R (Statistics and Computing) by John M. Chambers.

We recommend that you watch YouTube videos on R programming, such as:

https://www.youtube.com/playlist?list=PLOU2XLYxmsIK9qQfztXeybpHvru-TrqAP

There are plenty of online courses, such as:

https://www.edx.org/course/programming-r-data-science-microsoft-dat209x

Tools
Throughout the program you will use a number of tools in order to write software, execute
code, and communicate with team members and faculty. Before arriving at orientation, you
are required to have the following software installed on your laptop:

• Python 3.x, not 2.7.x. Install Anaconda 3, a Python installation that includes
most of the packages you will need in this program:
https://www.anaconda.com/download
Be aware that there is almost certainly an existing Python installation on your laptop,
which you do not want to use. To ensure that you are using the correct version from
the command line (see bash below), you should see “Anaconda” when you start up
the Python interactive shell:

$ python
Python 3.6.3 |Anaconda custom (64-bit)| (default, Oct 6 2017, 12:04:38)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

• Jupyter notebooks (or in beta: Jupyter Lab). See http://jupyter.org/, but these
notebooks are built into the Anaconda installation. (Run jupyter notebook from
the shell to start.) Notebooks are combined programs and text that are great for
presentations as well as development. You can intersperse your thoughts with the code
that implements those ideas. Machine learning programming and many other kinds of
data science programming are particularly challenging because of the amount of data
involved. Being able to display graphs, data frames, and output directly in line with
the code in the same document is extremely valuable.

• Bash. Using the command line is a critical skill in this degree program. The command
line is also called “the shell,” or is referred to by the name of the specific shell we use: Bash.
When we create servers in the cloud, we communicate with them through the command line. Further,
you will manage and process files on your laptop or remote server using the command
line. Bash is the default shell on both OS X and Linux. You should go through a
course such as this one: http://www.bash.academy

• git. The most important collaborative tool used by programmers is a revision control
system. We will use a tool called git in particular, which is hideous but powerful and
is the most commonly used. Very often you will be submitting software for grading
through git to http://www.github.com. Certainly, it is how multiple students work
on the same project. It is likely something that you will use on the first day of your
practicum. Potential employers will look at your sample projects on GitHub. You should
go through these courses:
https://www.codecademy.com/learn/learn-git
https://www.udacity.com/course/how-to-use-git-and-github--ud775

• Amazon Web Services (AWS) https://aws.amazon.com. Every student is required
to have an AWS account. AWS is the cloud computing environment we use in this
degree program. Get an Amazon account if you don’t already have one and then sign
up to use AWS services. I think you might need a credit card to register for this, but
we will likely give you coupons for free computing services at AWS.
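
Once everything is installed, a quick way to confirm the setup is to ask Python which interpreter is running and whether the command-line tools listed above are on your PATH. The sketch below is an informal check, not an official program script. The function name check_environment and the list of tool names are assumptions, and the "Anaconda" test only looks for that word in the version banner, which some newer Anaconda builds may word differently.

```python
import shutil
import sys

def check_environment(tools=("bash", "git", "jupyter", "pip")):
    """Return a dict describing the running Python and which tools are on PATH."""
    report = {
        "python_version": "%d.%d.%d" % sys.version_info[:3],
        "is_python3": sys.version_info[0] >= 3,
        # Assumption: the banner mentions "Anaconda", as in the example output above.
        "anaconda_in_banner": "Anaconda" in sys.version,
        "executable": sys.executable,
    }
    for tool in tools:
        # shutil.which returns the full path of the program, or None if not found.
        report[tool] = shutil.which(tool)
    return report

report = check_environment()
for name in sorted(report):
    print("%-20s %s" % (name, report[name]))
```

A value of None next to a tool means the shell cannot find it. If is_python3 is False or the executable path does not point into your Anaconda directory, you are probably running the system Python instead.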
