
Table of Contents

- Introduction
  - Project Description
    - First stage
    - Second stage
    - Third stage
  - Setup Requirements
  - Reference
- 1. Python, HTML, HTTP/REST
- 2. Web Crawler
- 3. Information Retrieval
- 4. Crawler
- 5. crawler2
- 6. crawler3
- Material
- FAQ

Introduction
Web crawling is a common technique for efficiently collecting information from across the web. As an introduction to web crawling, in this project we will use Scrapy, a free and open source web crawling framework written in Python[1]. Originally designed for web scraping, it can also be used to extract data through APIs or as a general-purpose web crawler. Even though Scrapy is a comprehensive infrastructure for web crawling, real applications bring their own challenges, e.g., content generated dynamically by JavaScript or your IP being blocked.
The project contains 3 parts, each an extension of the previous one. The end goal is a Scrapy project that can crawl tens of thousands of apps from the Xiaomi AppStore, or any other app store with which you are familiar.

Project Description
First stage
Create a Scrapy project that crawls the content of the Xiaomi AppStore homepage, or of any other app store homepage.

Second stage
Save the crawled content in MongoDB[2]. Install the Python MongoDB driver and modify pipelines.py to insert the crawled data into MongoDB.

Third stage

Crawl more content by following the next-page links. So far you have likely crawled only the content of the home page. If the next-page link is generated by JavaScript, we need Splash[3] and ScrapyJS[4] to re-render the web page and turn the dynamic part into static content.

Setup Requirements
Python 2.7
Scrapy 1.0+
Splash
ScrapyJS
MongoDB

Reference
[1] Scrapy: http://scrapy.org
[2] MongoDB: https://www.mongodb.org/
[3] Splash & ScrapyJS: https://github.com/scrapinghub/scrapy-splash
[4] Handling JavaScript in Scrapy with Splash: https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash/

1. Python, HTML, HTTP/REST

This chapter covers the basics the project relies on: Python, HTML, and HTTP/REST.

Python
Python, rather than C/C++, is the usual choice for writing crawlers; the common tools are urllib/urllib2, Beautiful Soup and Scrapy. Fetching a page with urllib2:

import urllib2                                     # urllib2 ships with Python 2
request = urllib2.Request("http://www.baidu.com")  # build a Request object
response = urllib2.urlopen(request)                # send it and get the response
print response.read()                              # print the page body

NOTE: in Python 3, urllib2 has been merged into urllib.request; BeautifulSoup works with either version.
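For reference, the Python 3 equivalent of the snippet above:

from urllib import request               # Python 3: urllib.request replaces urllib2

response = request.urlopen("http://www.baidu.com")
print(response.read())                   # bytes of the page body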


BeautifulSoup
BeautifulSoup parses the HTML you fetched with urllib2 (or loaded from a file) into a tree you can navigate:

from bs4 import BeautifulSoup            # pip install beautifulsoup4
soup = BeautifulSoup(open('html.html'))  # parse an html file into a soup object
print soup.prettify()                    # pretty-print the parsed tree

HTML
An HTML document is a tree of tags: the <head> holds metadata such as the <title>, and the <body> holds the content that is actually rendered. For example:
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names wer
e
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

Tags are written in angle brackets; the <head> contains the <title>, while the <body> contains the text and links the browser displays. HTML documents are delivered to the browser (and to our crawler) over HTTP.
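Going back to BeautifulSoup, pulling the three links out of the example document above looks like this:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html)                 # html is the string defined above
for a in soup.find_all('a', class_='sister'):
    print a.get('href'), a.get('id')       # url and id of each sister link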

CSS
CSS describes how the HTML should be displayed. A page can include it in three ways: an inline style attribute on a tag, an embedded <style> block, or an external stylesheet pulled in with a <link> tag. When crawling we mostly ignore the styling itself, but the class names are what our selectors match on.

Javascript
While CSS controls presentation, JavaScript adds behavior to the page. It can be attached in three places: inline event-handler attributes (actions such as onclick), embedded <script> blocks, and external files referenced with <script src="...">. Content produced by JavaScript is not present in the raw HTML, which is exactly the problem Splash solves later in this project.

Http/REST
Representational State Transfer (REST): every resource is identified by a URI, and clients operate on resources with the standard HTTP methods GET, POST, PUT and DELETE. An API designed this way is called RESTful. As mentioned in the introduction, Scrapy can also be used to pull data from such APIs.
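A quick sketch of reading one resource from a REST API (the endpoint below is hypothetical; a real API's documentation defines the actual URI layout):

import json
import urllib2

url = "http://api.example.com/apps/12345"         # hypothetical resource URI
request = urllib2.Request(url)
request.add_header("Accept", "application/json")  # ask for a JSON representation
response = urllib2.urlopen(request)               # an HTTP GET
app = json.loads(response.read())                 # parse the JSON body
print app.get("title")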

2. Web Crawler

This chapter builds the AppStore crawler step by step with Python and Scrapy.

Outline
- Scrapy at a Glance
  - items.py -- the item schema
  - pipelines.py -- writing crawled items out
  - parse() -- crawl the apps on each list page
  - parse_item() -- crawl each app's detail page and its recommended apps
- Store the crawled data in MongoDB
- Avoid being blocked
- Render Javascript
- Display the results with Flask
- Summary and further reading

Scrapy at a Glance
1. The Spider sends requests for pages on the Internet.
2. The responses come back and the parser callbacks turn them into Python objects (items).
3. The item pipelines take those Python objects and write them into a database or a file.

Before writing any code, open the appstore page in a browser and inspect which tags wrap the data you want; the parser is then just a set of patterns (XPath expressions) that match those tags.
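The Scrapy shell is a convenient place to try such patterns out, for example with the list-page XPath used later in this chapter:

$ scrapy shell "http://appstore.huawei.com/more/all"
>>> response.xpath('//div[@class="list-game-app dotline-btn nofloat"]')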

Install Scrapy:
$ pip install scrapy

Create a Scrapy project:
$ scrapy startproject appstore

Add a spider file under appstore/spiders:
$ touch huawei_spider.py

Running tree on the generated project shows the four files we will work with: items.py, pipelines.py, settings.py and the spider.


items.py -- schema

We want four fields per app: title, url, appid and intro. Define them in the items.py schema:

import scrapy

class AppstoreItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    url = scrapy.Field()
    appid = scrapy.Field()
    intro = scrapy.Field()

pipelines.py -- write items to a file

The simplest pipeline writes each item's appid, title and intro to appstore.dat:

class AppstorePipeline(object):
    def __init__(self):
        self.file = open('appstore.dat', 'wb')

    def process_item(self, item, spider):
        # one tab-separated line per item: appid, title, intro
        val = '{0}\t{1}\t{2}\n'.format(item['appid'], item['title'], item['intro'])
        self.file.write(val)
        return item

settings.py -- Scrapy configuration

1. Register the pipeline so Scrapy actually uses it:

ITEM_PIPELINES = {
'appstore.pipelines.AppstorePipeline': 300,
}

2. Slow down how fast requests are sent:
DOWNLOAD_DELAY = 5

spiders/huawei_spider.py -- the spider

The spider needs three things to start: a name, the allowed domains, and start_urls pointing at the appstore list page.

import scrapy

class HuaweiSpider(scrapy.Spider):
    name = "appstore"
    allowed_domains = ["huawei.com"]
    start_urls = [
        "http://appstore.huawei.com/more/all"
    ]

In its parse() callback the spider extracts fields such as the title text with XPath (the full version of parse() is shown below).

With these four files in place, run the crawler and look at the output:
$ cd appstore
$ scrapy crawl appstore

$ cat appstore.dat

Each app's detail page also lists recommended apps, and we want to crawl those too. Add the new fields to the items.py schema:

import scrapy

class AppstoreItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    url = scrapy.Field()
    appid = scrapy.Field()
    intro = scrapy.Field()
    image_url = scrapy.Field()    # app icon, filled in on the detail page
    recommended = scrapy.Field()  # new field: the recommended apps

Now extend spiders/huawei_spider.py so that it walks through the apps page by page; each list page is one crawl unit.

parse() -- crawl the apps on each list page

For every app on the list page, parse() creates an item, fires off a request for the app's detail page (carrying the item along), and, if a next-page link exists, fires off a request for the next list page as well:
# needed at the top of the spider file:
#   import re, scrapy
#   from scrapy import Selector, Request
#   from appstore.items import AppstoreItem
def parse(self, response):
    """
    response.body is a result of render.html call; it contains HTML processed by a browser.
    here we parse the html
    :param response:
    :return: request to detail page & request to next page if exists
    """
    # count apps on current page
    page = Selector(response)
    divs = page.xpath('//div[@class="list-game-app dotline-btn nofloat"]')
    current_url = response.url
    print "num of app in current page: ", len(divs)
    print "current url: ", current_url
    # parse details when looping apps on current page
    count = 0
    for div in divs:
        if count >= 2:  # only take the first 2 apps on each page
            break
        item = AppstoreItem()
        info = div.xpath('.//div[@class="game-info whole"]')
        detail_url = info.xpath('./h4[@class="title"]/a/@href').extract_first()
        item["url"] = detail_url
        req = Request(detail_url, callback=self.parse_detail_page)
        req.meta["item"] = item
        count += 1
        yield req
    # go to next page
    page_ctrl = response.xpath('//div[@class="page-ctrl ctrl-app"]')
    isNextPageThere = page_ctrl.xpath('.//em[@class="arrow-grey-rt"]').extract()
    if isNextPageThere:
        # "span[not(@*)]": the span with no attributes holds the current page number
        current_page_index = int(page_ctrl.xpath('./span[not(@*)]/text()').extract_first())
        if current_page_index >= 5:  # stop after 5 pages for now
            print "let's stop here for now"
            return
        next_page_index = str(current_page_index + 1)
        next_page_url = self.start_urls[0] + "/" + next_page_index
        print "next_page_index: ", next_page_index, "next_page_url: ", next_page_url
        request = scrapy.Request(next_page_url, callback=self.parse, meta={  # render the next page with Splash
            'splash': {
                'endpoint': 'render.html',
                'args': {'wait': 0.5}
            },
        })
        yield request
    else:
        print "this is the end!"

parse_item() -- crawl each app's details and its recommended apps

The item created in parse() travels with the request through its meta dict (req.meta["item"] = item). The detail-page callback, implemented here as parse_detail_page(), pulls the item back out, fills in the remaining fields and yields the completed item:

def parse_detail_page(self, response):
    """
    GET details for each app
    :param response:
    :return: item
    """
    item = response.meta["item"]
    # details about current app
    item["image_url"] = response.xpath('//ul[@class="app-info-ul nofloat"]//img[@class="app-ico"]/@lazyload').extract()[0]
    item["title"] = response.xpath('//ul[@class="app-info-ul nofloat"]//span[@class="title"]/text()').extract_first().encode('utf-8')
    item["appid"] = re.match(r'http://.*/(.*)', item["url"]).group(1)
    item["intro"] = response.xpath('//div[@class="content"]/div[@id="app_strdesc"]/text()').extract_first().encode('utf-8')
    # recommended apps
    divs = response.xpath('//div[@class="unit nofloat corner"]/div[@class="unit-main nofloat"]/div[@class="app-sweatch nofloat"]')
    recommends = []
    for div in divs:
        rank = div.xpath('./div[@class="open nofloat"]/em/text()').extract_first()
        name = div.xpath('./div[@class="open nofloat"]/div[@class="open-info"]/p[@class="name"]/a/@title').extract()[0].encode('utf-8')
        url = div.xpath('./div[@class="open nofloat"]/div[@class="open-info"]/p[@class="name"]/a/@href').extract_first()
        rec_appid = re.match(r'http://.*/(.*)', url).group(1)
        recommends.append({'name': name, 'rank': rank, 'appid': rec_appid})
    item["recommended"] = recommends  # field must be declared in items.py
    yield item

Scrapy keeps scheduling the requests we yield: the start URLs are the root of the tree, and every detail page and next page grows it. Duplicate request URLs are filtered out automatically, and the app ID extracted from each URL identifies the app.

3. Store the crawled data in MongoDB

MongoDB is a NoSQL database that stores documents as collections of key-value pairs, which fits the semi-structured items we crawl better than a rigid relational schema.

Installing and running MongoDB: mongo is the interactive client shell, and mongod is the database server process.
Install mongodb with homebrew (on macOS):
$ brew install mongodb

By default mongodb keeps its data in /data/db; give your own user ownership of that directory:
$ sudo chown xxx /data/db
# replace xxx with the output of $ whoami

Or keep the data directory in your home folder instead:
$ mkdir -p ~/data/db
$ mongod --dbpath ~/data/db   # or: alias mongod='mongod --dbpath ~/data/db'

If mongodb/bin is not on your $PATH yet, add it in your shell profile:
$ touch ~/.bash_profile
$ vim ~/.bash_profile

Add the following lines (adjust the version to whatever brew installed) and restart the terminal:
export MONGO_PATH=/usr/local/Cellar/mongodb/3.2.1
export PATH=$MONGO_PATH/bin:$PATH

Start mongodb:
$ mongod

To query mongoDB interactively, leave mongod running in one terminal and start the mongo client in another.

Connecting Scrapy to mongoDB: install the Python package pymongo and add a mongoDB pipeline to pipelines.py that inserts each crawled item into mongoDB:


import pymongo


class AppstoreMongodbPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        """
        return an instance of this pipeline
        crawler.settings --> settings.py
        get mongo_uri & mongo_database from settings.py
        :param crawler:
        :return: pipeline instance
        """
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]
        self.cleaned = set()  # collections already emptied during this crawl

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        """
        process data here before loading to mongodb
        :param item:
        :param spider:
        :return: item
        """
        collection_name = item.__class__.__name__   # use the item name as the collection name
        if collection_name not in self.cleaned:
            self.db[collection_name].remove({})     # clean the collection once, when a new crawl starts
            self.cleaned.add(collection_name)
        self.db[collection_name].insert(dict(item))
        return item

Then register the pipeline and the Mongo connection settings in settings.py:

ITEM_PIPELINES = {
'appstore.pipelines.AppstoreWritePipeline': 1,
'appstore.pipelines.AppstoreImagesPipeline': 2,
'appstore.pipelines.AppstoreMongodbPipeline': 3,
}
# mongo db settings
MONGO_URI = "127.0.0.1:27017"
MONGO_DATABASE = "appstore"
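After a crawl, a few lines of pymongo are enough to check what actually landed in MongoDB (the collection name matches the item class defined above):

import pymongo

client = pymongo.MongoClient("127.0.0.1:27017")   # same MONGO_URI as in settings.py
db = client["appstore"]                           # same MONGO_DATABASE as in settings.py

print db["AppstoreItem"].count()                  # number of stored apps
for app in db["AppstoreItem"].find().limit(3):    # peek at a few documents
    print app["appid"], app["title"]
client.close()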

4. Avoid being blocked

1) The user agent string tells the server which client is making the request. Real browsers send user agents like Chrome, Mozilla or Safari, while Scrapy sends its own default user agent, so a spider that never changes it is easy to spot. We therefore make the spider rotate its user agent.
2) If the server sees too many requests coming from one address, it may block the IP.

A proxy helps against IP blocks; for the user agent, the idea is to pick a random one from a pool of real browser user agents on every request.

Enable a custom downloader middleware in settings.py:
DOWNLOADER_MIDDLEWARES = {
'appstore.random_useragent.RandomUserAgentMiddleware': 400,
}

Then create random_useragent.py in the same directory as settings.py. The middleware below chooses a random user agent from a list for every outgoing request:


import random
from scrapy import log
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class RandomUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, settings, user_agent='Scrapy'):
        super(RandomUserAgentMiddleware, self).__init__()
        self.user_agent = user_agent

    def process_request(self, request, spider):
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)
            spider.log(
                u'User-Agent: {} {}'.format(request.headers.get('User-Agent'), request),
                level=log.DEBUG
            )

    """
    the default user_agent_list composes chrome, IE, Firefox, Mozilla, Opera,
    for more user agent strings, you can find it in http://www.useragentstring.com/pages/useragentstring.php
    """
    user_agent_list = [
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1664.3 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.103 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1664.3 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.13 (KHTML, like Gecko) Chrome/24.0.1290.1 Safari/537.13",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.45 Safari/535.19",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.45 Safari/535.19",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.11 Safari/535.19",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.66 Safari/535.11",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36",
    ]
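The same downloader-middleware mechanism can be used against IP blocks: a middleware that sets request.meta['proxy'], which Scrapy's built-in HttpProxyMiddleware honors. A minimal sketch, with placeholder proxy addresses you would replace with ones you actually control:

import random

class RandomProxyMiddleware(object):
    # placeholder proxies for illustration -- replace with real ones
    proxy_list = [
        "http://127.0.0.1:8123",
        "http://127.0.0.1:8124",
    ]

    def process_request(self, request, spider):
        # Scrapy's HttpProxyMiddleware routes the request through meta['proxy']
        request.meta['proxy'] = random.choice(self.proxy_list)

Register it in DOWNLOADER_MIDDLEWARES next to the user-agent middleware.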

5. Render Javascript

Pages whose content (such as the next-page links here) is generated by JavaScript cannot be crawled by Scrapy alone, because Scrapy does not execute JavaScript. scrapy-splash connects Scrapy to Splash, a lightweight JavaScript rendering service with an HTTP API, built in Python on top of Twisted and QT. The easiest way to run Splash is inside Docker.
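For example, the Splash image published by Scrapinghub can be started with (the port matches SPLASH_URL below):
$ docker run -p 8050:8050 scrapinghub/splash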
Then point Scrapy at Splash in settings.py:
DOWNLOADER_MIDDLEWARES = {
'scrapyjs.SplashMiddleware': 725,
}
SPLASH_URL = 'http://192.168.99.100:8050'  # DOCKER_HOST_IP:CONTAINER_PORT
DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'  # Splash-aware request deduplication
HTTPCACHE_STORAGE = 'scrapyjs.SplashAwareFSCacheStorage'  # Splash-aware HTTP cache

In the spider, route the requests through Splash so the JavaScript is rendered from the very first URL:

import scrapy

class HuaweiSpider(scrapy.Spider):
    name = "appstore"
    allowed_domains = ["huawei.com"]
    start_urls = [
        "http://appstore.huawei.com/more/all"
    ]

    # render pages with Splash starting from the start url
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            })

With scrapy-splash, the host running the Splash service acts as a rendering server: the middleware forwards each appstore request to Splash, Splash loads the page and executes its JavaScript, and the response handed back to Scrapy is the fully rendered HTML of the page.


6. Display the results with Flask

Flask is a lightweight Python web framework, so a few lines of Python are enough to show the crawled apps in a browser. The script below reads the Mongo settings from settings.py and renders templates/appstore_index.html:


# coding=utf-8
__author__ = 'jing'
from flask import Flask, render_template
import pymongo
from settings import MONGO_URI, MONGO_DATABASE

app = Flask(__name__, static_folder="images")  # instantiate flask

@app.route("/")
def hello():
    client = pymongo.MongoClient(MONGO_URI)
    db = client[MONGO_DATABASE]
    apps = list(db["AppstoreItem"].find())  # materialize the cursor before closing the connection
    client.close()
    return render_template("appstore_index.html", apps=apps)  # render everything we have for each app

if __name__ == "__main__":
    app.run(debug=True)  # some errors won't show up until you enable debugging

7. Summary and further reading

The next chapter of this GitBook covers Information Retrieval.

References:
http://doc.scrapy.org/en/latest/intro/tutorial.html
http://kissg.me/2016/06/01/note-on-web-scraping-with-python/
https://appear.in/captivating-wren

Related lectures: 44 Crawler, 45 Crawler, 53 Crawler

3. Information Retrieval
Solr/ElasticSearch = NoSQL + Search
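As a rough sketch of what that equation means in practice (assuming a local Elasticsearch node and the elasticsearch Python client; the document fields mirror the crawled item, and the sample values are only for illustration), the same documents we keep in MongoDB can be indexed and then queried with full-text search:

from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch(["http://127.0.0.1:9200"])  # assumes Elasticsearch on the default port

# store a crawled app document, much like inserting into a NoSQL store
doc = {"appid": "C10000000", "title": "Example App", "intro": "A demo application"}
es.index(index="appstore", doc_type="app", id=doc["appid"], body=doc)

# ...plus full-text search on top of it
hits = es.search(index="appstore", body={"query": {"match": {"intro": "demo"}}})
for hit in hits["hits"]["hits"]:
    print hit["_source"]["title"]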

4. Crawler
44 Crawler

The crawler handles Information Collection; Information Retrieval (rank, search, recommend) is what happens on top of the collected data.

What is the network process when you are crawling a webpage?

Fetching a page starts with the TCP three-way handshake:
1. SYN
2. SYN-ACK
3. ACK


Layers
- Application layer: HTTP -- the request/response protocol the crawler speaks
- Transport layer: TCP and UDP; HTTP runs on top of TCP
- Network layer: IP

A socket is the abstraction in between: the socket API is how a program opens a TCP connection and sends the HTTP request over it.
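A minimal sketch of these layers using the socket API directly (urllib2 and Scrapy do all of this for you; the host is the appstore used earlier):

import socket

# opening a TCP connection -- this is where the three-way handshake happens
sock = socket.create_connection(("appstore.huawei.com", 80))

# ...then speak HTTP over the connection by hand
sock.sendall("GET /more/all HTTP/1.1\r\n"
             "Host: appstore.huawei.com\r\n"
             "Connection: close\r\n\r\n")

response = ""
while True:
    chunk = sock.recv(4096)
    if not chunk:
        break
    response += chunk
sock.close()

print response[:200]  # the status line and first few response headers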

What is HTML?

HTML is the markup language that structures every page we crawl; parsing it (see the HTML section in chapter 1) is how the crawler turns a page into structured data.

Architecture
1. Crawl all the news of a website
A single Python process is enough: fetch each page and extract the fields with xPath or BeautifulSoup parsers.


2. Crawl more websites


One crawler is no longer enough, so a Scheduler coordinates multiple crawlers and keeps the pending work in a taskTable. In Python this can be built with scrapy or directly on sockets: each crawler (client) asks the scheduler (server) for a page to fetch, extracts the links from it, and reports the list of link IDs back so the scheduler can add them to the taskTable.


5.crawler2

Multi-threaded Crawler
There are three ways for a crawler thread to wait for new work: Sleep, a Conditional Variable, or a Semaphore (a counter that marks how much work is available). With a conditional variable or a semaphore the Scheduler wakes the Crawler up when work arrives; with sleep the Crawler simply polls.

Sleep version: the Crawler periodically checks the taskTable. When it finds a task it fetches the page, stores the page in the pageTable, extracts the URLs and appends them to the taskTable as new tasks, then goes back to sleep. Because both tables are shared between threads, every table access has to be protected.


Crawler with a Conditional Variable: when there is no task, the crawler calls Cond_Wait and blocks; when the scheduler adds a task it calls Cond_Signal to wake the crawler up.

Crawler with a Semaphore: the semaphore counts the available tasks; the crawler calls Wait (decrement, blocking at zero) before taking a task, and whoever adds a task calls Signal (increment).
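A minimal sketch of the conditional-variable version in Python (the taskTable and pageTable are plain lists here, and the link-extraction step is left as a comment):

import threading
import urllib2

task_table = ["http://appstore.huawei.com/more/all"]  # shared taskTable of URLs to fetch
page_table = []                                       # shared pageTable of fetched pages
cond = threading.Condition()                          # protects both tables

def crawler():
    while True:
        with cond:
            while not task_table:
                cond.wait()                 # Cond_Wait: block until the scheduler signals
            url = task_table.pop(0)         # take one task
        page = urllib2.urlopen(url).read()  # fetch outside the lock
        with cond:
            page_table.append(page)         # store the page
            # links extracted from the page would be appended to task_table here,
            # followed by cond.notify_all() to wake the other crawler threads

def scheduler(new_urls):
    with cond:
        task_table.extend(new_urls)
        cond.notify_all()                   # Cond_Signal: wake the waiting crawlers

for _ in range(3):                          # start a few crawler threads
    t = threading.Thread(target=crawler)
    t.daemon = True
    t.start()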


6.crawler3

Distributed Crawler
When one machine is no longer enough, the Task & Pages tables live on a central node and every Crawler machine talks to it through a Connector. The Connector is split into a Sender, which ships the crawler's results out, and a Receiver, which brings new tasks in, so the Crawler itself only ever talks to its local Connector.
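A minimal sketch of such a Connector using Python's multiprocessing.connection (the address, authkey and message format are made up for illustration):

from multiprocessing.connection import Client

class Connector(object):
    """Local connector a crawler talks to; it hides the network from the crawler."""

    def __init__(self, scheduler_address=("127.0.0.1", 6000)):
        # one connection to the central node that owns the task & page tables
        self.conn = Client(scheduler_address, authkey="crawler")

    def receive_task(self):
        # receiver half: ask the central node for the next URL to crawl
        self.conn.send({"type": "get_task"})
        return self.conn.recv()

    def send_page(self, url, page, links):
        # sender half: ship the fetched page and the newly found links back
        self.conn.send({"type": "page", "url": url, "page": page, "links": links})

# inside a crawler process:
# connector = Connector()
# url = connector.receive_task()
# ... fetch and parse ...
# connector.send_page(url, html, extracted_links)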


1621627lcl
262874
375711
QA
4712718
5 (719

gitbook
-- (by 7.2)
react.js, node.js -- xing (by 7.2)
es-mongoDB connector -- eva (by 7.2)

Material
https://github.com/BitTigerInst/Kumamon


FAQ
62xpath parse httppython
columnhtmlcssxpathparser

scrapy
(4) MongoDB connection problems: mongodb runs as a server/client pair, so the mongod server has to be launched and active before the client or the Scrapy pipeline can connect to it.
5mongodbhtmlcss

present htmlppt

If your IP gets blocked: don't hard-code a single identity; Scrapy lets you enable features such as rotating user agents, proxies and download delays, and if necessary you can switch to a different IP.
teamsearch
google adwords

projectmeeting

meetings5

crawler

Scrapy's Python workflow is built on generators/iterators, which is why the spider callbacks use yield.


fancyfeaturescrawler

:)

1. HTML
2. HTTP / REST (PUT, GET, etc.)
3. Python
See jing's ppt.
