
Table of Contents

- Introduction
  - Project Description
    - First stage
    - Second stage
    - Third stage
  - Setup Requirements
  - Reference
- 1. Python, HTML, HTTP/REST
- 2. Web Crawler
- 3. Information Retrieval
- 4. Crawler
- 5. crawler2
- 6. crawler3
- Material
- FAQ

Introduction
Web crawling is a common technique for efficiently collecting information from across the web. As an introduction to web crawling, in this project we will use Scrapy, a free and open source web crawling framework written in Python[1]. Originally designed for web scraping, it can also be used to extract data through APIs or as a general-purpose web crawler. Even though Scrapy is a comprehensive infrastructure for web crawling, real applications bring their own challenges, e.g., content generated dynamically by JavaScript or your IP being blocked.
The project contains 3 parts, each an extension of the previous one. The end goal is a Scrapy project that can crawl tens of thousands of apps from the Xiaomi AppStore, or any other app store with which you are familiar.

Project Description
First stage
Create a Scrapy project that crawls the content of the Xiaomi AppStore homepage, or of any other app store homepage.

Second stage
Save the crawled content in MongoDB[2]. Install the Python MongoDB driver and modify pipelines.py to insert the crawled data into MongoDB.

Third stage

Crawl more content by following the next-page links. So far you have likely crawled only the content of the home page. If the next-page link is generated by JavaScript, we need Splash[3] and ScrapyJS[4] to re-render the web page and turn the dynamic part into static content.

Setup Requirements
Python 2.7
Scrapy 1.0+
Splash
ScrapyJS
MongoDB

Reference
[1] Scrapy: http://scrapy.org
[2] MongoDB: https://www.mongodb.org/
[3] Splash & ScrapyJS: https://github.com/scrapinghub/scrapy-splash
[4] Handling JavaScript in Scrapy with Splash: https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash/

1. Python, HTML, HTTP/REST

This chapter covers the basics the project relies on: Python, HTML, and HTTP/REST.

Python
Python, rather than C/C++, is the usual choice for writing crawlers; the common tools are urllib/urllib2, Beautiful Soup and Scrapy. Fetching a page with urllib2:

import urllib2                                     # urllib2 ships with Python 2
request = urllib2.Request("http://www.baidu.com")  # build a Request object
response = urllib2.urlopen(request)                # send it and get the response
print response.read()                              # print the page body

NOTE: in Python 3, urllib2 has been merged into urllib.request; BeautifulSoup works with either version.
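For reference, the Python 3 equivalent of the snippet above:

from urllib import request               # Python 3: urllib.request replaces urllib2

response = request.urlopen("http://www.baidu.com")
print(response.read())                   # bytes of the page body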


BeautifulSoup
BeautifulSoup parses the HTML you fetched with urllib2 (or loaded from a file) into a tree you can navigate:

from bs4 import BeautifulSoup            # pip install beautifulsoup4
soup = BeautifulSoup(open('html.html'))  # parse an html file into a soup object
print soup.prettify()                    # pretty-print the parsed tree

HTML
An HTML document is a tree of tags: the <head> holds metadata such as the <title>, and the <body> holds the content that is actually rendered. For example:
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names wer
e
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

Tags are written in angle brackets; the <head> contains the <title>, while the <body> contains the text and links the browser displays. HTML documents are delivered to the browser (and to our crawler) over HTTP.
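Going back to BeautifulSoup, pulling the three links out of the example document above looks like this:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html)                 # html is the string defined above
for a in soup.find_all('a', class_='sister'):
    print a.get('href'), a.get('id')       # url and id of each sister link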

CSS
CSS describes how the HTML should be displayed. A page can include it in three ways: an inline style attribute on a tag, an embedded <style> block, or an external stylesheet pulled in with a <link> tag. When crawling we mostly ignore the styling itself, but the class names are what our selectors match on.

Javascript
While CSS controls presentation, JavaScript adds behavior to the page. It can be attached in three places: inline event-handler attributes (actions such as onclick), embedded <script> blocks, and external files referenced with <script src="...">. Content produced by JavaScript is not present in the raw HTML, which is exactly the problem Splash solves later in this project.

Http/REST
Representational State Transfer (REST): every resource is identified by a URI, and clients operate on resources with the standard HTTP methods GET, POST, PUT and DELETE. An API designed this way is called RESTful. As mentioned in the introduction, Scrapy can also be used to pull data from such APIs.
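A quick sketch of reading one resource from a REST API (the endpoint below is hypothetical; a real API's documentation defines the actual URI layout):

import json
import urllib2

url = "http://api.example.com/apps/12345"         # hypothetical resource URI
request = urllib2.Request(url)
request.add_header("Accept", "application/json")  # ask for a JSON representation
response = urllib2.urlopen(request)               # an HTTP GET
app = json.loads(response.read())                 # parse the JSON body
print app.get("title")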

2. Web Crawler

This chapter builds the AppStore crawler step by step with Python and Scrapy.

Outline
- Scrapy at a Glance
  - items.py -- the item schema
  - pipelines.py -- writing crawled items out
  - parse() -- crawl the apps on each list page
  - parse_item() -- crawl each app's detail page and its recommended apps
- Store the crawled data in MongoDB
- Avoid being blocked
- Render Javascript
- Display the results with Flask
- Summary and further reading

Scrapy at a Glance
1. The Spider sends requests for pages on the Internet.
2. The responses come back and the parser callbacks turn them into Python objects (items).
3. The item pipelines take those Python objects and write them into a database or a file.

Before writing any code, open the appstore page in a browser and inspect which tags wrap the data you want; the parser is then just a set of patterns (XPath expressions) that match those tags.
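The Scrapy shell is a convenient place to try such patterns out, for example with the list-page XPath used later in this chapter:

$ scrapy shell "http://appstore.huawei.com/more/all"
>>> response.xpath('//div[@class="list-game-app dotline-btn nofloat"]')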

Install Scrapy:
$ pip install scrapy

Create a Scrapy project:
$ scrapy startproject appstore

Add a spider file under appstore/spiders:
$ touch huawei_spider.py

Running tree on the generated project shows the four files we will work with: items.py, pipelines.py, settings.py and the spider.


items.py -- schema

We want four fields per app: title, url, appid and intro. Define them in the items.py schema:

import scrapy

class AppstoreItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    url = scrapy.Field()
    appid = scrapy.Field()
    intro = scrapy.Field()

pipelines.py -- write items to a file

The simplest pipeline writes each item's appid, title and intro to appstore.dat:

class AppstorePipeline(object):
    def __init__(self):
        self.file = open('appstore.dat', 'wb')

    def process_item(self, item, spider):
        # one tab-separated line per item: appid, title, intro
        val = '{0}\t{1}\t{2}\n'.format(item['appid'], item['title'], item['intro'])
        self.file.write(val)
        return item

settings.py -- Scrapy configuration

1. Register the pipeline so Scrapy actually uses it:

ITEM_PIPELINES = {
'appstore.pipelines.AppstorePipeline': 300,
}

2. Slow down how fast requests are sent:
DOWNLOAD_DELAY = 5

spiders/huawei_spider.py -- the spider

The spider needs three things to start: a name, the allowed domains, and start_urls pointing at the appstore list page.

import scrapy

class HuaweiSpider(scrapy.Spider):
    name = "appstore"
    allowed_domains = ["huawei.com"]
    start_urls = [
        "http://appstore.huawei.com/more/all"
    ]

In its parse() callback the spider extracts fields such as the title text with XPath (the full version of parse() is shown below).

With these four files in place, run the crawler and look at the output:
$ cd appstore
$ scrapy crawl appstore

$ cat appstore.dat

Each app's detail page also lists recommended apps, and we want to crawl those too. Add the new fields to the items.py schema:

import scrapy

class AppstoreItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    url = scrapy.Field()
    appid = scrapy.Field()
    intro = scrapy.Field()
    image_url = scrapy.Field()    # app icon, filled in on the detail page
    recommended = scrapy.Field()  # new field: the recommended apps

Now extend spiders/huawei_spider.py so that it walks through the apps page by page; each list page is one crawl unit.

parse() -- crawl the apps on each list page

For every app on the list page, parse() creates an item, fires off a request for the app's detail page (carrying the item along), and, if a next-page link exists, fires off a request for the next list page as well:
# needed at the top of the spider file:
#   import re, scrapy
#   from scrapy import Selector, Request
#   from appstore.items import AppstoreItem
def parse(self, response):
    """
    response.body is a result of render.html call; it contains HTML processed by a browser.
    here we parse the html
    :param response:
    :return: request to detail page & request to next page if exists
    """
    # count apps on current page
    page = Selector(response)
    divs = page.xpath('//div[@class="list-game-app dotline-btn nofloat"]')
    current_url = response.url
    print "num of app in current page: ", len(divs)
    print "current url: ", current_url
    # parse details when looping apps on current page
    count = 0
    for div in divs:
        if count >= 2:  # only take the first 2 apps on each page
            break
        item = AppstoreItem()
        info = div.xpath('.//div[@class="game-info whole"]')
        detail_url = info.xpath('./h4[@class="title"]/a/@href').extract_first()
        item["url"] = detail_url
        req = Request(detail_url, callback=self.parse_detail_page)
        req.meta["item"] = item
        count += 1
        yield req
    # go to next page
    page_ctrl = response.xpath('//div[@class="page-ctrl ctrl-app"]')
    isNextPageThere = page_ctrl.xpath('.//em[@class="arrow-grey-rt"]').extract()
    if isNextPageThere:
        # "span[not(@*)]": the span with no attributes holds the current page number
        current_page_index = int(page_ctrl.xpath('./span[not(@*)]/text()').extract_first())
        if current_page_index >= 5:  # stop after 5 pages for now
            print "let's stop here for now"
            return
        next_page_index = str(current_page_index + 1)
        next_page_url = self.start_urls[0] + "/" + next_page_index
        print "next_page_index: ", next_page_index, "next_page_url: ", next_page_url
        request = scrapy.Request(next_page_url, callback=self.parse, meta={  # render the next page with Splash
            'splash': {
                'endpoint': 'render.html',
                'args': {'wait': 0.5}
            },
        })
        yield request
    else:
        print "this is the end!"

parse_item() -- crawl each app's details and its recommended apps

The item created in parse() travels with the request through its meta dict (req.meta["item"] = item). The detail-page callback, implemented here as parse_detail_page(), pulls the item back out, fills in the remaining fields and yields the completed item:

def parse_detail_page(self, response):
    """
    GET details for each app
    :param response:
    :return: item
    """
    item = response.meta["item"]
    # details about current app
    item["image_url"] = response.xpath('//ul[@class="app-info-ul nofloat"]//img[@class="app-ico"]/@lazyload').extract()[0]
    item["title"] = response.xpath('//ul[@class="app-info-ul nofloat"]//span[@class="title"]/text()').extract_first().encode('utf-8')
    item["appid"] = re.match(r'http://.*/(.*)', item["url"]).group(1)
    item["intro"] = response.xpath('//div[@class="content"]/div[@id="app_strdesc"]/text()').extract_first().encode('utf-8')
    # recommended apps
    divs = response.xpath('//div[@class="unit nofloat corner"]/div[@class="unit-main nofloat"]/div[@class="app-sweatch nofloat"]')
    recommends = []
    for div in divs:
        rank = div.xpath('./div[@class="open nofloat"]/em/text()').extract_first()
        name = div.xpath('./div[@class="open nofloat"]/div[@class="open-info"]/p[@class="name"]/a/@title').extract()[0].encode('utf-8')
        url = div.xpath('./div[@class="open nofloat"]/div[@class="open-info"]/p[@class="name"]/a/@href').extract_first()
        rec_appid = re.match(r'http://.*/(.*)', url).group(1)
        recommends.append({'name': name, 'rank': rank, 'appid': rec_appid})
    item["recommended"] = recommends  # field must be declared in items.py
    yield item

Scrapy keeps scheduling the requests we yield: the start URLs are the root of the tree, and every detail page and next page grows it. Duplicate request URLs are filtered out automatically, and the app ID extracted from each URL identifies the app.

3. Store the crawled data in MongoDB

MongoDB is a NoSQL database that stores documents as collections of key-value pairs, which fits the semi-structured items we crawl better than a rigid relational schema.

Installing and running MongoDB: mongo is the interactive client shell, and mongod is the database server process.
Install mongodb with homebrew (on macOS):
$ brew install mongodb

By default mongodb keeps its data in /data/db; give your own user ownership of that directory:
$ sudo chown xxx /data/db
# replace xxx with the output of $ whoami

Or keep the data directory in your home folder instead:
$ mkdir -p ~/data/db
$ mongod --dbpath ~/data/db   # or: alias mongod='mongod --dbpath ~/data/db'

If mongodb/bin is not on your $PATH yet, add it in your shell profile:
$ touch ~/.bash_profile
$ vim ~/.bash_profile

Add the following lines (adjust the version to whatever brew installed) and restart the terminal:
export MONGO_PATH=/usr/local/Cellar/mongodb/3.2.1
export PATH=$MONGO_PATH/bin:$PATH

Start mongodb:
$ mongod

To query mongoDB interactively, leave mongod running in one terminal and start the mongo client in another.

Connecting Scrapy to mongoDB: install the Python package pymongo and add a mongoDB pipeline to pipelines.py that inserts each crawled item into mongoDB:


import pymongo


class AppstoreMongodbPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        """
        return an instance of this pipeline
        crawler.settings --> settings.py
        get mongo_uri & mongo_database from settings.py
        :param crawler:
        :return: pipeline instance
        """
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]
        self.cleaned = set()  # collections already emptied during this crawl

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        """
        process data here before loading to mongodb
        :param item:
        :param spider:
        :return: item
        """
        collection_name = item.__class__.__name__   # use the item name as the collection name
        if collection_name not in self.cleaned:
            self.db[collection_name].remove({})     # clean the collection once, when a new crawl starts
            self.cleaned.add(collection_name)
        self.db[collection_name].insert(dict(item))
        return item

Then register the pipeline and the Mongo connection settings in settings.py:

ITEM_PIPELINES = {
'appstore.pipelines.AppstoreWritePipeline': 1,
'appstore.pipelines.AppstoreImagesPipeline': 2,
'appstore.pipelines.AppstoreMongodbPipeline': 3,
}
# mongo db settings
MONGO_URI = "127.0.0.1:27017"
MONGO_DATABASE = "appstore"
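After a crawl, a few lines of pymongo are enough to check what actually landed in MongoDB (the collection name matches the item class defined above):

import pymongo

client = pymongo.MongoClient("127.0.0.1:27017")   # same MONGO_URI as in settings.py
db = client["appstore"]                           # same MONGO_DATABASE as in settings.py

print db["AppstoreItem"].count()                  # number of stored apps
for app in db["AppstoreItem"].find().limit(3):    # peek at a few documents
    print app["appid"], app["title"]
client.close()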

4. Avoid being blocked

1) The user agent string tells the server which client is making the request. Real browsers send user agents like Chrome, Mozilla or Safari, while Scrapy sends its own default user agent, so a spider that never changes it is easy to spot. We therefore make the spider rotate its user agent.
2) If the server sees too many requests coming from one address, it may block the IP.

A proxy helps against IP blocks; for the user agent, the idea is to pick a random one from a pool of real browser user agents on every request.

Enable a custom downloader middleware in settings.py:
DOWNLOADER_MIDDLEWARES = {
'appstore.random_useragent.RandomUserAgentMiddleware': 400,
}

Then create random_useragent.py in the same directory as settings.py. The middleware below chooses a random user agent from a list for every outgoing request:


import random
from scrapy import log
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class RandomUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, settings, user_agent='Scrapy'):
        super(RandomUserAgentMiddleware, self).__init__()
        self.user_agent = user_agent

    def process_request(self, request, spider):
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)
            spider.log(
                u'User-Agent: {} {}'.format(request.headers.get('User-Agent'), request),
                level=log.DEBUG
            )

    """
    the default user_agent_list composes chrome, IE, Firefox, Mozilla, Opera,
    for more user agent strings, you can find it in http://www.useragentstring.com/pages/useragentstring.php
    """
    user_agent_list = [
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1664.3 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.103 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1664.3 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.13 (KHTML, like Gecko) Chrome/24.0.1290.1 Safari/537.13",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.45 Safari/535.19",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.45 Safari/535.19",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.11 Safari/535.19",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.66 Safari/535.11",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36",
    ]
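The same downloader-middleware mechanism can be used against IP blocks: a middleware that sets request.meta['proxy'], which Scrapy's built-in HttpProxyMiddleware honors. A minimal sketch, with placeholder proxy addresses you would replace with ones you actually control:

import random

class RandomProxyMiddleware(object):
    # placeholder proxies for illustration -- replace with real ones
    proxy_list = [
        "http://127.0.0.1:8123",
        "http://127.0.0.1:8124",
    ]

    def process_request(self, request, spider):
        # Scrapy's HttpProxyMiddleware routes the request through meta['proxy']
        request.meta['proxy'] = random.choice(self.proxy_list)

Register it in DOWNLOADER_MIDDLEWARES next to the user-agent middleware.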

5. Render Javascript

Pages whose content (such as the next-page links here) is generated by JavaScript cannot be crawled by Scrapy alone, because Scrapy does not execute JavaScript. scrapy-splash connects Scrapy to Splash, a lightweight JavaScript rendering service with an HTTP API, built in Python on top of Twisted and QT. The easiest way to run Splash is inside Docker.
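For example, the Splash image published by Scrapinghub can be started with (the port matches SPLASH_URL below):
$ docker run -p 8050:8050 scrapinghub/splash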
Then point Scrapy at Splash in settings.py:
DOWNLOADER_MIDDLEWARES = {
'scrapyjs.SplashMiddleware': 725,
}
SPLASH_URL = 'http://192.168.99.100:8050'  # DOCKER_HOST_IP:CONTAINER_PORT
DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'  # Splash-aware request deduplication
HTTPCACHE_STORAGE = 'scrapyjs.SplashAwareFSCacheStorage'  # Splash-aware HTTP cache

In the spider, route the requests through Splash so the JavaScript is rendered from the very first URL:

import scrapy

class HuaweiSpider(scrapy.Spider):
    name = "appstore"
    allowed_domains = ["huawei.com"]
    start_urls = [
        "http://appstore.huawei.com/more/all"
    ]

    # render pages with Splash starting from the start url
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            })

With scrapy-splash, the host running the Splash service acts as a rendering server: the middleware forwards each appstore request to Splash, Splash loads the page and executes its JavaScript, and the response handed back to Scrapy is the fully rendered HTML of the page.


6. Display the results with Flask

Flask is a lightweight Python web framework, so a few lines of Python are enough to show the crawled apps in a browser. The script below reads the Mongo settings from settings.py and renders templates/appstore_index.html:


# coding=utf-8
__author__ = 'jing'
from flask import Flask, render_template
import pymongo
from settings import MONGO_URI, MONGO_DATABASE

app = Flask(__name__, static_folder="images")  # instantiate flask

@app.route("/")
def hello():
    client = pymongo.MongoClient(MONGO_URI)
    db = client[MONGO_DATABASE]
    apps = list(db["AppstoreItem"].find())  # materialize the cursor before closing the connection
    client.close()
    return render_template("appstore_index.html", apps=apps)  # render everything we have for each app

if __name__ == "__main__":
    app.run(debug=True)  # some errors won't show up until you enable debugging

7. Summary and further reading

The next chapter of this GitBook covers Information Retrieval.

References:
http://doc.scrapy.org/en/latest/intro/tutorial.html
http://kissg.me/2016/06/01/note-on-web-scraping-with-python/
https://appear.in/captivating-wren

Related lectures: 44 Crawler, 45 Crawler, 53 Crawler

3. Information Retrieval
Solr/ElasticSearch = NoSQL + Search
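As a rough sketch of what that equation means in practice (assuming a local Elasticsearch node and the elasticsearch Python client; the document fields mirror the crawled item, and the sample values are only for illustration), the same documents we keep in MongoDB can be indexed and then queried with full-text search:

from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch(["http://127.0.0.1:9200"])  # assumes Elasticsearch on the default port

# store a crawled app document, much like inserting into a NoSQL store
doc = {"appid": "C10000000", "title": "Example App", "intro": "A demo application"}
es.index(index="appstore", doc_type="app", id=doc["appid"], body=doc)

# ...plus full-text search on top of it
hits = es.search(index="appstore", body={"query": {"match": {"intro": "demo"}}})
for hit in hits["hits"]["hits"]:
    print hit["_source"]["title"]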

4. Crawler
44 Crawler

The crawler handles Information Collection; Information Retrieval (rank, search, recommend) is what happens on top of the collected data.

What is the network process when you are crawling a webpage?

Fetching a page starts with the TCP three-way handshake:
1. SYN
2. SYN-ACK
3. ACK


Layers
- Application layer: HTTP -- the request/response protocol the crawler speaks
- Transport layer: TCP and UDP; HTTP runs on top of TCP
- Network layer: IP

A socket is the abstraction in between: the socket API is how a program opens a TCP connection and sends the HTTP request over it.
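A minimal sketch of these layers using the socket API directly (urllib2 and Scrapy do all of this for you; the host is the appstore used earlier):

import socket

# opening a TCP connection -- this is where the three-way handshake happens
sock = socket.create_connection(("appstore.huawei.com", 80))

# ...then speak HTTP over the connection by hand
sock.sendall("GET /more/all HTTP/1.1\r\n"
             "Host: appstore.huawei.com\r\n"
             "Connection: close\r\n\r\n")

response = ""
while True:
    chunk = sock.recv(4096)
    if not chunk:
        break
    response += chunk
sock.close()

print response[:200]  # the status line and first few response headers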

What is HTML?

HTML is the markup language that structures every page we crawl; parsing it (see the HTML section in chapter 1) is how the crawler turns a page into structured data.

Architecture
1. Crawl all the news of a website
A single Python process is enough: fetch each page and extract the fields with xPath or BeautifulSoup parsers.


2. Crawl more websites


One crawler is no longer enough, so a Scheduler coordinates multiple crawlers and keeps the pending work in a taskTable. In Python this can be built with scrapy or directly on sockets: each crawler (client) asks the scheduler (server) for a page to fetch, extracts the links from it, and reports the list of link IDs back so the scheduler can add them to the taskTable.


5.crawler2

Multi-threaded Crawler
There are three ways for a crawler thread to wait for new work: Sleep, a Conditional Variable, or a Semaphore (a counter that marks how much work is available). With a conditional variable or a semaphore the Scheduler wakes the Crawler up when work arrives; with sleep the Crawler simply polls.

Sleep version: the Crawler periodically checks the taskTable. When it finds a task it fetches the page, stores the page in the pageTable, extracts the URLs and appends them to the taskTable as new tasks, then goes back to sleep. Because both tables are shared between threads, every table access has to be protected.


Crawler with a Conditional Variable: when there is no task, the crawler calls Cond_Wait and blocks; when the scheduler adds a task it calls Cond_Signal to wake the crawler up.

Crawler with a Semaphore: the semaphore counts the available tasks; the crawler calls Wait (decrement, blocking at zero) before taking a task, and whoever adds a task calls Signal (increment).
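A minimal sketch of the conditional-variable version in Python (the taskTable and pageTable are plain lists here, and the link-extraction step is left as a comment):

import threading
import urllib2

task_table = ["http://appstore.huawei.com/more/all"]  # shared taskTable of URLs to fetch
page_table = []                                       # shared pageTable of fetched pages
cond = threading.Condition()                          # protects both tables

def crawler():
    while True:
        with cond:
            while not task_table:
                cond.wait()                 # Cond_Wait: block until the scheduler signals
            url = task_table.pop(0)         # take one task
        page = urllib2.urlopen(url).read()  # fetch outside the lock
        with cond:
            page_table.append(page)         # store the page
            # links extracted from the page would be appended to task_table here,
            # followed by cond.notify_all() to wake the other crawler threads

def scheduler(new_urls):
    with cond:
        task_table.extend(new_urls)
        cond.notify_all()                   # Cond_Signal: wake the waiting crawlers

for _ in range(3):                          # start a few crawler threads
    t = threading.Thread(target=crawler)
    t.daemon = True
    t.start()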


6.crawler3

Distributed Crawler
When one machine is no longer enough, the Task & Pages tables live on a central node and every Crawler machine talks to it through a Connector. The Connector is split into a Sender, which ships the crawler's results out, and a Receiver, which brings new tasks in, so the Crawler itself only ever talks to its local Connector.
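A minimal sketch of such a Connector using Python's multiprocessing.connection (the address, authkey and message format are made up for illustration):

from multiprocessing.connection import Client

class Connector(object):
    """Local connector a crawler talks to; it hides the network from the crawler."""

    def __init__(self, scheduler_address=("127.0.0.1", 6000)):
        # one connection to the central node that owns the task & page tables
        self.conn = Client(scheduler_address, authkey="crawler")

    def receive_task(self):
        # receiver half: ask the central node for the next URL to crawl
        self.conn.send({"type": "get_task"})
        return self.conn.recv()

    def send_page(self, url, page, links):
        # sender half: ship the fetched page and the newly found links back
        self.conn.send({"type": "page", "url": url, "page": page, "links": links})

# inside a crawler process:
# connector = Connector()
# url = connector.receive_task()
# ... fetch and parse ...
# connector.send_page(url, html, extracted_links)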


1621627lcl
262874
375711
QA
4712718
5 (719

gitbook
-- (by 7.2)
react.js, node.js -- xing (by 7.2)
es-mongoDB connector -- eva (by 7.2)

Material
https://github.com/BitTigerInst/Kumamon


FAQ
62xpath parse httppython
columnhtmlcssxpathparser

scrapy
(4) MongoDB connection problems: mongodb runs as a server/client pair, so the mongod server has to be launched and active before the client or the Scrapy pipeline can connect to it.
5mongodbhtmlcss

present htmlppt

If your IP gets blocked: don't hard-code a single identity; Scrapy lets you enable features such as rotating user agents, proxies and download delays, and if necessary you can switch to a different IP.
teamsearch
google adwords

projectmeeting

meetings5

crawler

Scrapy's Python workflow is built on generators/iterators, which is why the spider callbacks use yield.


fancyfeaturescrawler

:)

1. HTML
2. HTTP / REST (PUT, GET, etc.)
3. Python
See jing's ppt.
