Table of Contents
1.1 Introduction
1.2 Installation
1.3 Setup
1.4 Training
1.5 How it Works

Introduction

LearnProgrammingBot

LearnProgrammingBot is a bot for reddit that uses scikit-learn and supervised learning
techniques to categorise submissions and reply with useful and appropriate links.
It is intended to answer common questions on /r/learnprogramming that can be found on the
wiki, but in theory it should be suitable for any subreddit provided that it is trained properly.
The default training set should be fine for most programming subreddits, but it can be
extended at any time.

Installation
LearnProgrammingBot requires scikit-learn, praw and sqlalchemy. Because of these dependencies, the
installation instructions differ slightly depending on which platform you are using. It
should work with both Python 2 and Python 3 (unit tests are coming soon).
Before continuing, download the code (either through the source zip or releases tab) and
extract it if necessary, or clone using git. Then, open a command prompt or terminal and cd
to the directory where you have extracted the code.

Windows
As an administrator, in the command prompt, run:
pip install -r requirements.txt

Mac
sudo pip install -r requirements.txt

Debian/Ubuntu/Mint

sudo apt-get install python-scipy


sudo pip install sqlalchemy scikit-learn praw

Setup and Running


You'll need to enter a few variables into settings.py . Just follow the instructions on lines
preceded by # (comments) and fill in the correct data.
To run, use ./main.py run in the terminal. This will run continuously until killed using Ctrl+C
or an exception. You might find useful logging information in bot.log if the bot does crash.
Feel free to report an issue if you do find a bug!

Classifications
Currently, the classifier recognises only three post classes:
'good' - the post is a good question for /r/learnprogramming
'faq' - the post contains a common question that is probably on the FAQ
'bad' - the post is formatted badly, off topic or does not contain enough detail.

Accuracy
As of commit 9742b376ef4e845ac45cbd96e86dfe7156dc913e , the classifier's accuracy is as
follows:
Correct classification = 81%
False negative = 13%
False positive = 5%
Wrong category = 1%
A false negative is counted whenever the actual class was not 'good' but the
classifier returned 'good'. A false positive occurs when the actual class was 'good' but the
classifier did not return 'good'. A wrong category classification occurs when the classifier
returned a different non-'good' class from the actual one (e.g. 'faq' instead of 'bad').
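
The four figures above can be computed from pairs of (actual, predicted) labels. The sketch below follows the definitions in this section; the function name `summarise` and the sample data are hypothetical and are not part of LearnProgrammingBot.

```python
from collections import Counter


def summarise(pairs):
    """Count outcome types for (actual, predicted) label pairs.

    Uses the definitions from this section: a false negative is a
    non-'good' post that the classifier labelled 'good'.
    """
    counts = Counter()
    for actual, predicted in pairs:
        if actual == predicted:
            counts["correct"] += 1
        elif predicted == "good":
            counts["false_negative"] += 1   # missed a post that needed a reply
        elif actual == "good":
            counts["false_positive"] += 1   # flagged a perfectly good post
        else:
            counts["wrong_category"] += 1   # e.g. 'faq' instead of 'bad'
    return counts


# Hypothetical sample data, for illustration only.
sample = [
    ("good", "good"),
    ("faq", "good"),
    ("good", "bad"),
    ("bad", "faq"),
]
result = summarise(sample)
```

Dividing each count by `len(sample)` gives percentages in the same form as the list above.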

Roadmap and Planned Features


Unit tests
More modular approach so that extra modules can be installed to make the bot more
customisable.

Contributing
We're happy to accept contributors - you don't need to be an expert, just file an issue about
getting started and we can start there!

License
MIT License. Please see the LICENSE file!

Installation
LearnProgrammingBot requires scikit-learn, praw and sqlalchemy. Because of these dependencies, the
installation instructions differ slightly depending on which platform you are using. It
should work with both Python 2 and Python 3 (unit tests are coming soon).
To install, first download the source from either the master ZIP or the releases tab, and
extract the zip file into any directory. Alternatively, you can clone the source using git by
running:
git clone https://github.com/Aurora0001/LearnProgrammingBot.git

Then, follow the instructions for your platform to install the dependencies:

Windows
As an administrator, in the command prompt, run:
pip install -r requirements.txt

If pip is not recognised, you may need to install it using these instructions. If you want to
download a pre-compiled version of Python with SciPy, try Python(x,y).

Mac
sudo pip install -r requirements.txt

Debian/Ubuntu/Mint
sudo apt-get install python-scipy python-pip
sudo pip install sqlalchemy scikit-learn praw

Other Linux Distributions


Check if your distribution has a package such as python-scipy , which will save time and
avoid the need for you to compile NumPy from source (which is slow and quite difficult). If
you can use a package, just run this afterwards:
sudo pip install sqlalchemy scikit-learn praw

Make sure that you've installed the package for pip too, if you haven't already.
If your distribution does not have a SciPy package, just run this (and prepare for a long
wait!):
sudo pip install -r requirements.txt

Setup
To configure LearnProgrammingBot, you'll need to obtain an OAuth access token from
reddit. This will allow LearnProgrammingBot to log in to the account that you want to
automate.
If you already have the token for CLIENT_ACCESSCODE , skip this section. This code is not the
ID or secret, though.

Getting the OAuth Tokens


To use OAuth (which reddit requires), you need 3 tokens: the client id, the client secret and
the access token.

Getting the ID and Secret Tokens


To create these tokens, you'll need to go to the app preferences page, while logged in as
your bot account. If you don't see something like this, you may need to click 'create another
app...':
Set the name box to 'LearnProgrammingBot' (or a custom name, if you prefer - it isn't
important). Select the script app type from the radio buttons below the textbox. Leave
description and about url blank, and enter http://127.0.0.1/callback in the redirect uri
box. Then, click 'create app', and you should see something like what you see in the image:

The token under 'personal use script' is your client ID. The token underlined in red is your
client secret.
Open up settings.py and change the following lines to your ID and secret:
CLIENT_ID = 'my_client_id_here'
CLIENT_SECRET = 'my_client_secret_here'

You can ignore any lines preceded by #.

Getting your Access Token


LearnProgrammingBot can help you generate your access token automatically. This only
needs to be done once - after this, it can be done manually.
In a terminal, run:
./main.py create-token

A web browser should open (if you are logged in as your bot account). Click 'Allow', and wait
to be redirected. You will probably get something like this:

Don't worry, this is correct. Copy the token after code= (circled in the image), and put it in
settings.py as CLIENT_ACCESSTOKEN. Do not include the code= section - this will not work!
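
If you'd rather not copy the token by hand, the value can be pulled out of the redirect URL with the standard library. This is only a sketch (Python 3): `extract_code` is a hypothetical helper, not part of LearnProgrammingBot, and the URL below is made up.

```python
from urllib.parse import urlparse, parse_qs


def extract_code(redirect_url):
    """Return the value of the 'code' query parameter, without 'code='."""
    query = parse_qs(urlparse(redirect_url).query)
    values = query.get("code")
    if not values:
        raise ValueError("no 'code' parameter found in URL")
    return values[0]


# Hypothetical redirect URL; the state and code values are invented.
url = "http://127.0.0.1/callback?state=example&code=abc123"
code = extract_code(url)
```

The returned string is the part that goes into settings.py.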

Running LearnProgrammingBot
You're now ready to run LearnProgrammingBot (finally!). Use ./main.py run in the terminal.
This will run continuously until killed using Ctrl+C or an exception. You might find useful
logging information in bot.log if the bot does crash. Feel free to report an issue if you do find
a bug!

Training
To train the bot, you need to install LearnProgrammingBot and its dependencies (see the
Installation section). You do not need to create OAuth tokens as shown in the Setup section
if you are only training the bot.

Training with a Specific Post


You can train the bot with one post if it has misclassified it, using the following command:
./main.py train --id ID

Where ID is the reddit submission ID, for example:


https://www.reddit.com/r/learnprogramming/comments/4g4far/meta_i_wrote_a_bot_for_rlearnprogramming_that/
                                                   ^^^^^^

In this link, the id is 4g4far, so you could train it with:


./main.py train --id 4g4far
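
If you have a lot of links to go through, the ID can be extracted with a regular expression. A minimal sketch, assuming the link has the usual `/comments/ID/` form; `submission_id` is a hypothetical helper and not something main.py provides.

```python
import re

# Matches the path segment that follows '/comments/' in a submission URL.
SUBMISSION_ID = re.compile(r"/comments/([a-z0-9]+)")


def submission_id(url):
    """Return the submission ID from a reddit permalink, or None."""
    match = SUBMISSION_ID.search(url)
    return match.group(1) if match else None


link = ("https://www.reddit.com/r/learnprogramming/comments/4g4far/"
        "meta_i_wrote_a_bot_for_rlearnprogramming_that/")
post_id = submission_id(link)
```

The result can then be passed to the train command as the `--id` value.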

LearnProgrammingBot will then fetch the post from reddit and display it for you to review. It
will then prompt you to enter the correct classification of the post. Here are the categories
(an up-to-date list can be found in review_corpus.py):
Valid categories:
good
off_topic (incl. bad questions)
faq_get_started (incl. getting started with a project - where do I start?)
faq_career
faq_resource (incl. challenges e.g. codewars)
faq_resource_podcast
faq_tool (incl. laptop specs)
faq_language (e.g. how much should I know before I am expert, which should I pick)
faq_other (including motivation, 'does a programmer google?', project ideas etc.)
faq_what_now (what to do after codecademy etc.)

For best results, be generous with your classification and, if in doubt, classify
as 'good'. If you're not sure, check data.db for examples of how previous posts were
classified.

Training in Batches
You might find it easier to train with larger samples from the 'new' feed of
/r/learnprogramming. This is supported with the train-batch command, which can be used
like so:
./main.py train-batch --limit AMOUNT_OF_POSTS_TO_CLASSIFY

This is also interactive, just like the train command. To see the valid classifications, please
see the above section.

Committing Changes
To merge your database changes with the main repository, fork LearnProgrammingBot on
GitHub, then clone your copy. Train the classifier using the steps listed above, then create a
pull request. Try to do this relatively quickly (i.e. don't wait for days before merging) because
it's difficult to resolve merge conflicts with the database.

Summary
1. Fork repository
2. git clone https://github.com/MyUserName/LearnProgrammingBot
3. Train classifier
4. git commit -a -m "Trained classifier with X new records"
5. git push origin master
6. Create pull request on GitHub

How it Works
The code for LearnProgrammingBot is quite simple, but the theory behind it is slightly more
difficult to get to grips with. Here's a 'bird's eye view' of how LearnProgrammingBot works:
1. Train support vector machine with known data (a 'corpus')
2. Fetch latest posts from reddit
3. 'Vectorize' the post into a numpy array
4. Classify the array using the trained support vector machine
5. If the post class is not 'good', check the responses dictionary for the correct response,
and reply.
Below, I'll try to explain the reasons for each of the steps and how they work.
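
The five steps above can be sketched with scikit-learn as follows. This is a minimal illustration, not the bot's actual code: the tiny corpus, the `RESPONSES` dictionary and the `reply_for` helper are made up, the reddit fetch in step 2 is replaced by a hard-coded string, and pairing `TfidfVectorizer` with `LinearSVC` is an assumption made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Step 1: a made-up, pre-classified corpus (illustration only).
corpus = [
    ("Why does my recursive function overflow the stack? Code inside.", "good"),
    ("My loop prints the wrong total, here is the code and the output.", "good"),
    ("Which programming language should I learn first?", "faq"),
    ("How do I get started with programming?", "faq"),
    ("help", "bad"),
    ("it does not work pls fix", "bad"),
]
texts = [text for text, _ in corpus]
labels = [label for _, label in corpus]

# Steps 1, 3 and 4 in one object: vectorise the text, then fit/apply an SVM.
model = make_pipeline(TfidfVectorizer(), LinearSVC(random_state=0))
model.fit(texts, labels)

# Step 5: hypothetical replies, keyed by class. 'good' posts get no reply.
RESPONSES = {
    "faq": "This question is answered in the FAQ.",
    "bad": "Please add more detail to your post.",
}


def reply_for(post_text):
    """Classify a post and return the reply to send, or None for 'good'."""
    predicted = model.predict([post_text])[0]
    return RESPONSES.get(predicted)


# Step 2 stand-in: a hard-coded post instead of one fetched from reddit.
reply = reply_for("What language should a beginner learn first?")
```

With a corpus this small the predictions are not meaningful; the point is only the order of the steps.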

The Classifier
Before explaining how LearnProgrammingBot's classifier works, it might be helpful to briefly
talk about the document classification problem as a whole, and the different types of learning
techniques.

Types of Machine Learning


There are two types of learning that are used for the majority of AI problems: supervised
learning and unsupervised learning.
Supervised learning is where the algorithm is shown some samples and the correct
answers, and it extrapolates so that it can answer similar questions. It's similar to how a
child learns through asking questions and using the answers to predict things in the future.
Unsupervised learning is less useful for classification, because we already know the correct
categories. It works better for data mining (finding trends that you don't already know).

Classification Algorithms
There are a few big solutions to classification problems, which all work in slightly different
ways but provide similar outcomes.
Naive Bayes (NB) classifiers are simple and popular classifiers which are often used for
spam detection. They work on a simple principle, which Wikipedia illustrates like this:
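
Stated in plain words, and assuming the usual textbook presentation of Bayes' theorem is what is meant here, the principle is:

$$
\text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}}
$$

For a document classifier, the posterior is the probability that a post belongs to a category given the words it contains, and the classifier picks the category with the highest posterior.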

Usually, NB classifiers work very quickly, but aren't as accurate as Support Vector Machines
(SVMs). If you're interested in reading more about their competitiveness with SVMs, you can
read this paper.
Support Vector Machines appear similar to NB classifiers, but they are not probability-based
- they can only return either 'Category A' or 'Category B'. Essentially, they find a line in a
graph that splits the two datasets as accurately as possible, like this:

It's clear that both $$H_2$$ and $$H_3$$ are suitable lines, but $$H_1$$ is incorrect. The
training period allows the SVM to calculate the best line.
As you can see, SVMs can only split data points into two groups. To allow the SVM to split
data points into multiple groups, a strategy called one-vs-the-rest is used. Essentially, this
makes multiple graphs, which might be like this:
'good' vs rest
'faq' vs rest
'bad' vs rest
Therefore, if it is in the 'rest' section for every graph but 'bad', the document must be 'bad'.
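
Here's a sketch of how the one-vs-the-rest decision can be made. The scores are invented: each one stands for the signed distance of a document from one binary classifier's line, where positive means the named class and negative means 'rest'. Picking the largest score also gives an answer when a document falls on the 'rest' side of every graph.

```python
def one_vs_rest(scores):
    """Pick the class whose 'class vs rest' classifier is most confident.

    `scores` maps each class name to a signed distance from that
    classifier's dividing line: positive = the class, negative = 'rest'.
    """
    return max(scores, key=scores.get)


# Invented scores for one document: only the 'bad' classifier claims it.
scores = {"good": -1.3, "faq": -0.4, "bad": 0.8}
label = one_vs_rest(scores)

# Every classifier says 'rest' here, so the least negative score wins.
fallback = one_vs_rest({"good": -0.1, "faq": -0.9, "bad": -0.5})
```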

The Vectorizer
It's easy to understand how the SVM works with points, but one aspect that we haven't
covered is how the points are actually calculated from a document of text. Obviously, you
can't just pick a random point for a document - that'd produce nonsensical results!
The solution to this is the vectorizer. As the name suggests, it turns text into a mathematical
vector. This is done through a model known as the bag-of-words. The example on Wikipedia
(see the link) is very clear, and this is how scikit-learn's CountVectorizer works. Once the
text has been turned into a vector, the numerical values can be used to position a point for
the SVM.
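
A bag-of-words vector can be sketched in plain Python as below. This is a simplified stand-in for `CountVectorizer`, with made-up documents and a deliberately crude tokeniser (lower-case, split on whitespace); the helper names are hypothetical.

```python
from collections import Counter


def build_vocabulary(documents):
    """Assign each distinct word a column index, in sorted order."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}


def vectorise(document, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]


docs = ["the cat sat", "the dog sat on the cat"]
vocabulary = build_vocabulary(docs)   # columns: cat, dog, on, sat, the
vectors = [vectorise(doc, vocabulary) for doc in docs]
```

Each document becomes a list of the same length, one number per known word, so every document can be placed as a point in the same space.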
However, this method is a bit naive and might miss important words that aren't common.
Instead, 'the' might be ranked as the most important word, which could cause the SVM to fall
victim to an effect called overfitting. This is where 'junk values' are misinterpreted as
statistically important, leading to significant inaccuracies.
An improved technique uses tf-idf. This is an algorithm to rank words in a body of text by
their importance, which can help to catch the key words in a message, even if they're only
said once or twice.
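
To see why tf-idf pushes words like 'the' down the ranking, here is a sketch (Python 3) using the textbook formula: term frequency multiplied by the log of the inverse document frequency. scikit-learn's `TfidfVectorizer` uses a smoothed and normalised variant, so its numbers will differ; the documents and the `tf_idf` helper are made up for illustration.

```python
import math


def tf_idf(documents):
    """Return one {word: weight} dict per document (textbook tf-idf)."""
    tokenised = [doc.lower().split() for doc in documents]
    total = len(tokenised)
    # Document frequency: in how many documents does each word appear?
    doc_freq = {}
    for words in tokenised:
        for word in set(words):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    weights = []
    for words in tokenised:
        scores = {}
        for word in set(words):
            tf = words.count(word) / len(words)
            idf = math.log(total / doc_freq[word])
            scores[word] = tf * idf
        weights.append(scores)
    return weights


docs = ["the loop is the problem", "the pointer is null", "the end"]
weights = tf_idf(docs)
```

Because 'the' appears in every document, its weight is zero, while 'loop', which appears in only one, gets the highest weight in the first document.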

Summary
Here's a beautiful ASCII-art graph for the key stages:

Training
Corpus of Training Data (pre-classified)
|
Process with Vectorizer to calculate all
key words and store it in the 'bag'
|
Process with SVM to train and fit the correct
lines to split the groups

Classification
Text To Process (fetched from reddit)
|
Process with Vectorizer into Bag-of-Words,
searching for words found in training
phase
|
Classify with SVM using pre-fitted line
|
Return correct document classification
