Table of Contents
1.1 Introduction
1.2 Basics: HTML, HTTP/REST, Python
1.3 Web Crawler
1.4 Information Retrieval
1.5 Crawler
1.6 crawler2 (multi-threaded crawler)
1.7 crawler3 (distributed crawler)
1.8 Project Plan
1.9 Material
1.10 FAQ
Introduction
Introduction
Project Description
First stage
Second stage
Third stage
Setup Requirements
Reference
Introduction
Web crawling is a common technique for efficiently collecting information from across
the web. As an introduction to web crawling, in this project we will use Scrapy, a free
and open-source web crawling framework written in Python[1]. Originally designed for web
scraping, it can also be used to extract data through APIs or as a general-purpose web
crawler. Even though Scrapy is a comprehensive infrastructure for web crawling, real
applications bring challenges of their own, e.g., dynamically generated JavaScript
content or your IP being blocked.
The project has three parts, each an extension of the previous one. The end goal is
to build a Scrapy project that can crawl tens of thousands of apps from the Xiaomi AppStore,
or any other app store with which you are familiar.
Project Description
First stage
Create a Scrapy project to crawl the content of the Xiaomi AppStore homepage, or any
other app store homepage.
Second stage
Save the crawled content in MongoDB[2]. Install the Python MongoDB driver and modify
pipelines.py to insert the crawled data into MongoDB.
Third stage
Crawl more content by following next-page links. So far you have likely crawled only the
home page. If the next-page link is generated by JavaScript, use Splash[3] and ScrapyJS[4]
to re-render the page, turning its dynamic parts into static content.
Setup Requirements
Python 2.7
Scrapy 1.0+
Splash
ScrapyJS
MongoDB
Reference
[1] Scrapy: http://scrapy.org
[2] MongoDB: https://www.mongodb.org/
[3] Splash & ScrapyJS: https://github.com/scrapinghub/scrapy-splash
[4] Handling JavaScript in Scrapy with Splash: https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash/
1. Basics: HTML, HTTP/REST, Python
This chapter covers the background a crawler builds on: Python, HTML, and HTTP/REST.
Python
Compared with C/C++, Python trades raw speed for much faster development, which suits
crawling well, and it ships a rich toolchain for fetching and parsing pages: urllib,
urllib2, Beautiful Soup, and Scrapy. A first request with urllib2:
import urllib2                                     # the urllib2 module handles HTTP
request = urllib2.Request("http://www.baidu.com")  # build a Request object
response = urllib2.urlopen(request)                # send it and get the response
print response.read()                              # print the page's HTML source
HTML
What comes back from a request is HTML source. An HTML document is a tree of tags: the
root element contains a Head and a Body. For example:
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names wer
e
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
Tags are written in angle brackets (<>). The Head carries metadata such as the title,
while the Body carries the visible content. The HTML source itself reaches the crawler
over HTTP.
CSS
CSS describes how HTML is presented. There are three ways to attach CSS to an HTML page:
1. an inline style attribute on an element;
2. a <style> block inside the page;
3. an external stylesheet pulled in with a <link> tag.
JavaScript
JavaScript adds behavior on top of HTML and CSS, and can likewise be included in three ways:
1. inline on an element (for example in an event handler or a form action);
2. a <script> block inside the page;
3. an external file referenced by <script src>.
HTTP/REST
Representational State Transfer (REST) models everything as a resource identified by a
URI; operations on a resource map onto the HTTP verbs GET, POST, PUT, and DELETE. APIs
that follow these conventions are called RESTful.
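For example, with urllib2 the verbs map directly onto requests; the endpoint below is a
hypothetical REST API, purely for illustration:

import json
import urllib2

# GET a resource (urllib2 sends GET by default)
response = urllib2.urlopen("http://api.example.com/apps/42")
app = json.loads(response.read())   # RESTful APIs commonly answer with JSON

# DELETE the same resource by overriding the verb on the Request
request = urllib2.Request("http://api.example.com/apps/42")
request.get_method = lambda: "DELETE"
urllib2.urlopen(request)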
Scrapy packages all of the above: it speaks HTTP for you, parses the HTML, and hands
back the extracted data.
2. Web Crawler
Build the crawler with Python and Scrapy, scaling it up against the AppStore across the
project's stages.
Outline
1. Scrapy at a Glance
2. Write the spider:
   - `items.py` -- schema
   - `pipelines.py` -- item pipelines
   - `parse()` -- crawl the apps on a list page
   - `parse_item()` -- parse each app's detail page
3. Store the data in MongoDB
4. Avoid being blocked
5. Render JavaScript
6. Present the results
7. References
Scrapy at a Glance
1. A Spider sends requests out to the Internet.
2. The responses come back and are parsed into Python objects (items).
3. Pipelines take those Python objects and load them into a database or a file.
Before writing any code:
1. Open the appstore page and inspect which tags hold the data you want.
2. Write parsers that match that tag pattern.
Install Scrapy:
$ pip install scrapy
Create a scrapy project:
$ scrapy startproject appstore
Add a spider file under appstore/spiders:
$ touch huawei_spider.py
Running tree over the generated scrapy project shows the four files an appstore crawler
revolves around.
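A freshly generated layout looks roughly like this (huawei_spider.py is the file we just
added by hand):

appstore/
    scrapy.cfg            # deploy configuration
    appstore/
        __init__.py
        items.py          # item schema
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py
            huawei_spider.py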
items.py -- schema
import scrapy

class AppstoreItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    url = scrapy.Field()
    appid = scrapy.Field()
    intro = scrapy.Field()
pipelines.py -- item pipelines that post-process each crawled item (we add a MongoDB pipeline in section 3)
settings.py -- scrapy configuration
1. Register the pipelines so scrapy actually runs them:
ITEM_PIPELINES = {
    'appstore.pipelines.AppstorePipeline': 300,
}
2. Throttle how fast we send requests:
DOWNLOAD_DELAY = 5
spiders/huawei_spider.py -- the spider itself
Three attributes matter: the spider's name, the domains it is allowed to crawl, and
start_urls, the appstore page where the crawl begins.
class HuaweiSpider(BaseSpider):
    name = "appstore"
    allowed_domains = ["huawei.com"]
    start_urls = [
        "http://appstore.huawei.com/more/all"
    ]
The parse() callback then extracts each app's title text. With these four files in place,
run the spider (scrapy crawl takes the spider's name attribute):
$ cd appstore
$ scrapy crawl appstore
$ cat appstore.dat
Each app's detail page also lists recommended apps. To capture them, add a new field to
the schema in items.py:
import scrapy

class AppstoreItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    url = scrapy.Field()
    appid = scrapy.Field()
    intro = scrapy.Field()
    recommended = scrapy.Field()  # new field
In spiders/huawei_spider.py, the spider now walks the apps themselves.
Crawler Unit
parse() -- crawl the apps on a list page
For each app on the current page, parse() builds an item and yields a request for the
app's detail page; if a next-page link exists, it also yields a request for the next page:
def parse(self, response):
    """
    response.body is a result of render.html call; it contains HTML processed by a browser.
    here we parse the html
    :param response:
    :return: request to detail page & request to next page if exists
    """
    # count apps on current page
    page = Selector(response)
    divs = page.xpath('//div[@class="list-game-app dotline-btn nofloat"]')
    current_url = response.url
    print "num of app in current page: ", len(divs)
    print "current url: ", current_url
    # parse details when looping apps on current page
    count = 0
    for div in divs:
        if count >= 2:
            break
        item = AppstoreItem()
        # (the rest of the loop fills the item from div and yields a request
        # to the app's detail page with parse_item as the callback)
parse_item() -- parse an app's detail page and collect its recommended apps
Scrapy tracks the URLs it has already requested and filters duplicates. The start URLs
are the root of the crawl tree; from each list page the spider extracts every app's ID
and builds the request for that app's detail page.
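The detail-page callback did not survive in this copy; below is a minimal sketch. It
assumes parse() forwarded the half-filled item in request.meta and that the recommended
apps sit in anchors under a div of class "app-recommend" (both the meta key and the
selector are illustrative, not the original code):

def parse_item(self, response):
    # the item started in parse(), forwarded via the request's meta dict
    item = response.meta['item']
    page = Selector(response)
    # hypothetical selector: adapt it to the detail page's real markup
    item['recommended'] = page.xpath(
        '//div[@class="app-recommend"]//a/text()').extract()
    yield item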
3. Store the data in MongoDB
MongoDB is a document database: crawled items go in as JSON-like documents. Two programs
matter here: mongod, the database server daemon, and mongo, the interactive client shell.
Install mongodb with homebrew:
$ brew install mongodb
By default mongodb keeps its data under /data/db, so take ownership of that directory:
$ sudo chown xxx /data/db
# replace xxx with the output of $ whoami
Alternatively, point mongoDB at a directory in your home:
$ mkdir -p ~/data/db
$ mongod --dbpath ~/data/db  # or set an alias: alias mongod='mongod --dbpath ~/data/db'
Then put mongodb/bin on your $PATH:
$ touch .bash_profile
$ vim .bash_profile
Add the exports below, then open a new terminal:
export MONGO_PATH=/usr/local/Cellar/mongodb/3.2.1
export PATH=$MONGO_PATH/bin:$PATH
Start mongodb:
$ mongod
To query mongoDB interactively, leave mongod running in one terminal and open the mongo
client in another.
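For example, once the spider has run you can inspect the crawled items from the mongo
shell (the database and collection names match the pipeline below):

$ mongo
> use appstore
> db.AppstoreItem.find().limit(3)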
To connect scrapy to mongoDB, install the python package pymongo and add a MongoDB
pipeline to pipelines.py; the pipeline receives each item and inserts it into mongoDB:
import pymongo

class AppstoreMongodbPipeline(object):

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        """
        return an instance of this pipeline
        crawler.settings --> settings.py
        get mongo_uri & mongo_database from settings.py
        :param crawler:
        :return: pipeline instance
        """
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]
        self.cleaned = set()  # collections already wiped during this crawl

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        """
        process data here before loading to mongodb
        :param item:
        :param spider:
        :return: item
        """
        collection_name = item.__class__.__name__  # use itemName as the collectionName
        if collection_name not in self.cleaned:
            # clean the collection once when a new crawl starts (removing on
            # every item would leave only the last item in the collection)
            self.db[collection_name].remove({})
            self.cleaned.add(collection_name)
        self.db[collection_name].insert(dict(item))
        return item
Finally, add the pipeline and the Mongo settings to settings.py:
ITEM_PIPELINES = {
    'appstore.pipelines.AppstoreWritePipeline': 1,
    'appstore.pipelines.AppstoreImagesPipeline': 2,
    'appstore.pipelines.AppstoreMongodbPipeline': 3,
}
# mongo db settings
MONGO_URI = "127.0.0.1:27017"
MONGO_DATABASE = "appstore"
4. Avoid being blocked
1. Rotate the user agent. Servers read the User-Agent header to see which client is
calling (Chrome, Mozilla, Safari, ...), and scrapy's default user agent gives the spider
away immediately, so have the spider pick a random user agent for every request.
2. When too many requests arrive from one address, the server may block the IP outright.
Proxies address this the same way random user agents do: each request appears to come
from somewhere new.
Enable a downloader middleware in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'appstore.random_useragent.RandomUserAgentMiddleware': 400,
}
import random
from scrapy import log
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

class RandomUserAgentMiddleware(UserAgentMiddleware):

    def __init__(self, settings, user_agent='Scrapy'):
        super(RandomUserAgentMiddleware, self).__init__()
        self.user_agent = user_agent

    def process_request(self, request, spider):
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)
            spider.log(
                u'User-Agent: {} {}'.format(request.headers.get('User-Agent'), request),
                level=log.DEBUG
            )

    """
    the default user_agent_list composes chrome, IE, Firefox, Mozilla, Opera,
    for more user agent strings, you can find it in
    http://www.useragentstring.com/pages/useragentstring.php
    """
    user_agent_list = [
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1664.3 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.103 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1664.3 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.13 (KHTML, like Gecko) Chrome/24.0.1290.1 Safari/537.13",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.45 Safari/535.19",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.45 Safari/535.19",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.11 Safari/535.19",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.66 Safari/535.11",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36",
    ]
5. Render JavaScript
Some pages generate their content, including the next-page link, with JavaScript, and
Scrapy on its own cannot execute JavaScript. scrapy-splash integrates Splash, a
JavaScript rendering service, with Scrapy: Splash renders the JavaScript and exposes the
result over an HTTP API. It is implemented in Python on top of Twisted and QT, and the
easiest way to run Splash is as a docker container.
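With docker installed, pulling and starting the Splash service looks like this (8050 is
Splash's default HTTP port):

$ docker pull scrapinghub/splash
$ docker run -p 8050:8050 scrapinghub/splash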
Then wire splash into settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}
SPLASH_URL = 'http://192.168.99.100:8050'  # 'DOCKER_HOST_IP:CONTAINER_PORT'
DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'  # duplicate filter that understands Splash requests
HTTPCACHE_STORAGE = 'scrapyjs.SplashAwareFSCacheStorage'  # HTTP cache storage aware of Splash
In appstoreSpider.py, send requests through Splash so the JavaScript is rendered:
class HuaweiSpider(BaseSpider):
    name = "appstore"
    allowed_domains = ["huawei.com"]
    start_urls = [
        "http://appstore.huawei.com/more/all"
    ]

    # render since the start url
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            })
With scrapy-splash, the splash service runs as its own server on the docker host.
Requests go through the splash middleware instead of straight to the appstore: splash
fetches the appstore page, executes its JavaScript, and the response handed back is the
rendered HTML page, which scrapy parses as usual.
6. Present the results
Flask is a lightweight python web framework, so the presentation layer stays in python:
read the crawled apps back out of MongoDB and render them as a web page. Reuse the Mongo
settings from settings.py and add a template at templates/appstore_index.html.
# coding=utf-8
__author__ = 'jing'

from flask import Flask, render_template
import pymongo

from settings import MONGO_URI, MONGO_DATABASE

app = Flask(__name__, static_folder="images")  # instantiate flask

@app.route("/")
def hello():
    client = pymongo.MongoClient(MONGO_URI)
    db = client[MONGO_DATABASE]
    # materialize the cursor before closing the connection, otherwise the
    # template would iterate over a cursor whose client is already closed
    apps = list(db["AppstoreItem"].find())
    client.close()
    return render_template("appstore_index.html", apps=apps)  # render anything we have in each app

if __name__ == "__main__":
    app.run(debug=True)  # some errors won't show up until you enable debugging
7. References
See also the Information Retrieval chapter of this GitBook.
- Scrapy tutorial: http://doc.scrapy.org/en/latest/intro/tutorial.html
- Notes on web scraping with python: http://kissg.me/2016/06/01/note-on-web-scraping-with-python/
- Meeting room: https://appear.in/captivating-wren
- Lessons 44, 45, and 53: Crawler
3. Information Retrieval
Solr/ElasticSearch = NoSQL + Search
4. Crawler
(Lesson 44: Crawler)
A crawler does the information collection; information retrieval (rank, search,
recommend) then works over what it collects.
Fetching a page starts with a TCP connection, which is established by the three-way
handshake:
1. the client sends SYN;
2. the server replies SYN-ACK;
3. the client confirms with ACK.
Layers
- HTTP is the application-layer protocol; every page fetch is an HTTP request/response.
- TCP and UDP form the transport layer; HTTP runs on top of TCP.
- IP is the network layer underneath.
- A socket is the abstract layer between application and transport: the API a program
  calls to use TCP.
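To see what the socket API buys us, here is a bare-bones HTTP GET over a raw TCP socket
in python (example.com stands in for any host):

import socket

# open a TCP connection to the web server (HTTP's default port is 80)
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("example.com", 80))

# an HTTP request is just bytes written to the socket
s.send("GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n")

# read until the server closes the connection
response = ""
while True:
    chunk = s.recv(4096)
    if not chunk:
        break
    response += chunk
s.close()

print response[:300]  # status line and headers come first, then the HTML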
What is HTML?
HTML is the raw material every crawler parses; its tree structure (head, body, tags) is
covered in the primer in chapter 1.
Architecture
Running example: crawl all the news of a website.
1. Start with a single-machine crawler in python, using xPath or BeautifulSoup parsers
to pull the fields out of each page.
4.
2.
28
4.
crawler,
crawler
Schedulercrawlers
taskTable
python
pythonscrapy
socket
clientserver page
linkIDlist
crawlerID taskTable
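A single-machine sketch of that loop, before any scheduling is added (the seed URL is
illustrative, and extract_links is a hypothetical parser standing in for your
xPath/BeautifulSoup code):

import urllib2

task_table = ["http://news.example.com/"]  # seed URL (illustrative)
page_table = {}                            # url -> html
seen = set(task_table)

while task_table:
    url = task_table.pop(0)              # the scheduler hands out the next task
    html = urllib2.urlopen(url).read()   # the crawler fetches the page
    page_table[url] = html
    for link in extract_links(html):     # hypothetical link-extraction parser
        if link not in seen:
            seen.add(link)
            task_table.append(link)      # new tasks go back into the taskTable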
5. crawler2
Multi-threaded Crawler
Three ways to coordinate crawler threads: Sleep, a Conditional Variable, or a Semaphore
that marks how many tasks are available. In each case the Scheduler and the Crawlers
have to wake one another when work appears.
With Sleep, the Crawler simply polls: it takes a task from the taskTable, writes the
fetched page into the pageTable, extracts the urls on the page, and pushes the new urls
back into the taskTable; whenever the taskTable is empty, the Crawler sleeps for a while
and retries.
The Crawler therefore touches two shared tables:
- pages go into the pageTable,
- tasks go into the taskTable,
and each table needs its own lock.
Conditional Variable
Cond_Wait blocks the calling thread until another thread calls Cond_Signal, so instead
of sleeping and polling, a Crawler waits until the Scheduler signals that the taskTable
is non-empty.
{width="6.0in" height="3.5419630358705163in"}
Semaphore
Wait decrements the semaphore's counter, blocking while it is zero; Signal increments
it. Initialize the counter to the number of available tasks and the crawlers throttle
themselves automatically.
6. crawler3
Distributed Crawler
The Task & Pages tables move off the Crawler machines onto a node of their own; each
Crawler reaches them only through a Connector.
Each side of the link gets a Connector pairing a Sender with a Receiver: the Crawler
sends requests and receives tasks, and never touches the tables directly, only its
Connector.
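A minimal sketch of the Connector idea over TCP, with tasks exchanged as
newline-delimited urls (the class and method names are illustrative, not from the
original notes):

import socket

class Connector(object):
    """Pairs the sender and receiver halves over one TCP connection
    to the machine that owns the task & page tables."""

    def __init__(self, host, port):
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.sock.connect((host, port))
        self.rfile = self.sock.makefile("rb")

    def send_task(self, url):
        # Sender half: push a newly discovered url to the table server
        self.sock.send(url + "\n")

    def receive_task(self):
        # Receiver half: block until the table server hands us a url to crawl
        return self.rfile.readline().strip()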
Project Plan
- Week 1 (6.21-6.27): lcl
- Week 2 (6.28-7.4)
- Week 3 (7.5-7.11): QA
- Week 4 (7.12-7.18)
- Week 5 (7.19- )
Deliverables:
- gitbook -- (by 7.2)
- react.js, node.js -- xing (by 7.2)
- es-mongoDB connector -- eva (by 7.2)
Material
https://github.com/BitTigerInst/Kumamon
FAQ
Q: How do we parse fields out of the http response in python?
A: Inspect the page's html/css to find the element holding the column you need, then
write an xpath parser for it; scrapy's selectors do the rest.
Q: The mongoDB connection fails.
A: mongodb is split into a server and a client: the server must be launched and active
before the client (and scrapy) can connect to it.
Q: How should the data in mongodb be presented?
A: Render it as an html page with its own css (present live html rather than screenshots
in a ppt).
Q: What if the site blocks our ip?
A: Don't hard-code a workaround; scrapy lets you enable middleware features for rotating
proxies, so requests stop coming from a single ip.
Other notes from the session: search ideas for the team (e.g., google adwords); plan on
roughly five project meetings.
Q: How does the crawler's control flow work?
A: scrapy's python workflow is built on generators and iterators: spider callbacks yield
requests and items lazily.
Feel free to add fancy features to your crawler :)
Topics to review:
1. html
2. http / REST (put, get, etc.)
3. Python
See jing's ppt.