
Art of the Scrape!!!!

Show the internet who's boss. Scrape it!

Before we begin!
You might want...

A computer!
Server space!
Processing!

What's going on here?
Today we're going to...
Examine our data resources!
Try some scraping!
Try some pulling!
Mess around with an API!
Say hello to visualization!

Data? I hardly knew 'a!
Data: Any discrete unit and its meta information
Useful data: More than one record of data... but that second record can be in your head!
Everything is numbers!

Internal Use

External Use

Tell me more of this data of which you speak!
Real time
  Blogs
  Twitter feed
  News feeds
  Etc.

Static datasets
  Govt census data, EPA data
  National League salaries
  Etc.

Data is Powerful!
The act of measuring something solidifies its state.
Ahh, the power!!!

Data is misleading!
Choosing one source over another
Only portraying parts of the statistic
Choosing a biased method of portrayal

Information Overload: don't believe the hype

Flavors of data
Indexed data: documents, weblogs, images, videos, shopping articles, jobs...
Cartographic and geographic data: Geolocation software, Geovisualization
News Aggregators: Feeds, podcasts

DATA TYPE!
Straight text
CSV / tab-delimited
XML / RSS / ATOM
JSON

Would it fit in here? Then it's data!
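To make the data types above concrete, here is a minimal sketch that reads the same two records from CSV, JSON, and XML. The sample records and the function names (rowsFromCsv, rowsFromJson, rowsFromXml) are made up for illustration; only str_getcsv(), json_decode(), and SimpleXMLElement are standard PHP.

```php
<?php
// Hypothetical sample records: the same two rows in three formats.
$csvText  = "city,temp\nNew York,71\nBoston,65";
$jsonText = '[{"city":"New York","temp":71},{"city":"Boston","temp":65}]';
$xmlText  = '<rows><row><city>New York</city><temp>71</temp></row>'
          . '<row><city>Boston</city><temp>65</temp></row></rows>';

// CSV: split into lines, then let str_getcsv() split each line into fields.
function rowsFromCsv(string $text): array
{
    $lines  = explode("\n", trim($text));
    $header = str_getcsv(array_shift($lines), ',', '"', '\\');
    $rows   = [];
    foreach ($lines as $line) {
        $rows[] = array_combine($header, str_getcsv($line, ',', '"', '\\'));
    }
    return $rows;
}

// JSON: json_decode() with true as the second argument returns associative arrays.
function rowsFromJson(string $text): array
{
    return json_decode($text, true);
}

// XML: SimpleXMLElement exposes child tags as properties.
function rowsFromXml(string $text): array
{
    $rows = [];
    foreach ((new SimpleXMLElement($text))->row as $row) {
        $rows[] = ['city' => (string) $row->city, 'temp' => (string) $row->temp];
    }
    return $rows;
}

foreach (rowsFromCsv($csvText) as $row) {
    echo $row['city'] . ': ' . $row['temp'] . "\n";
}
```

Note that CSV and XML hand back every value as a string, while JSON keeps numbers as numbers.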

VIA
Text file
Data feed
Scraping HTML
API
Some combination

Could it potentially be transferred by this? Then it's grabbable!

DESTINATION
Spreadsheet (by hand)
Browser (direct, javascript, php, perl...)
Database (via sql using php, perl, etc....)
Application (Processing, java, python)
A second API
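For the spreadsheet destination, "by hand" can be made less painful by having PHP write the CSV for you. This is only a sketch: rowsToCsv, the sample rows, and the trends.csv filename are hypothetical, while fputcsv() and the php://memory stream are standard PHP.

```php
<?php
// Turn a header and some rows into CSV text that a spreadsheet can open.
function rowsToCsv(array $header, array $rows): string
{
    $handle = fopen('php://memory', 'w+');      // in-memory scratch file
    fputcsv($handle, $header, ',', '"', '\\');
    foreach ($rows as $row) {
        fputcsv($handle, $row, ',', '"', '\\'); // quotes fields that need it
    }
    rewind($handle);
    $csv = stream_get_contents($handle);
    fclose($handle);
    return $csv;
}

// Hypothetical scraped rows; note the comma inside the second term.
$csv = rowsToCsv(['term', 'rank'], [['world cup', 1], ['summer, finally', 2]]);
echo $csv;
// file_put_contents('trends.csv', $csv); // uncomment to save a file for Excel
```

Letting fputcsv() do the quoting means a comma inside a scraped value does not split the cell.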

Mom, where does data come from?
HTML for scraping: Anywhere you can see text online
  Weather.com
  Yahoo trending topics
Preformatted datasets: Anywhere it's available
  Amazon datasets
  opendata.gov
Real-time RSS feeds: Anywhere there's a data feed
  Any blog feed
  Any news feed
Personalized awesome targeted data: Anywhere with an API
  New York Times API
  Twitter API

Choose wisely!
DATATYPE   VIA         DESTINATION
xml/rss    browser     Excel
csv        text file   php: database
xml        api         php: browser
xml        api         javascript: browser
html       scraping    php: browser
csv        text file   Processing
html       scraping    Processing
xml        browser     Processing (through php)

Example 1 and 2
Datatype        VIA         Destination
HTML            SCRAPING    BROWSER
(Weather info)  (PHP)       (Firefox, or whatever)

Step one: Get to know your data:
http://www.weather.com/weather/today/New+York+NY+10010?lswe=10010

Step two: Set up the code

Example 1: Straight scrapin'
<?php
// Get the data!
$url = 'http://www.weather.com/weather/today/New+York+NY+10010?lswe=10010';
$output = file_get_contents($url);

// Do something with it!
echo $output;
?>
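Example 1 assumes the request always works. A slightly more defensive sketch is below: fetchPage is a hypothetical helper, the user-agent string is made up, and the demo reads a temporary local file so it runs without a network. Swap in a real URL to scrape for real.

```php
<?php
// Fetch a page (or local file) and return its contents, or null on failure.
function fetchPage(string $source, int $timeoutSeconds = 10): ?string
{
    // The 'http' options only take effect for http:// and https:// URLs.
    $context = stream_context_create([
        'http' => [
            'timeout'    => $timeoutSeconds,
            'user_agent' => 'ArtOfTheScrape-demo/0.1', // hypothetical name
        ],
    ]);
    // @ hides the warning on failure; we check the return value instead.
    $body = @file_get_contents($source, false, $context);
    return $body === false ? null : $body;
}

// Demo against a temporary local file so it runs offline.
$sample = tempnam(sys_get_temp_dir(), 'scrape');
file_put_contents($sample, '<html><body>72 degrees and sunny</body></html>');

$page = fetchPage($sample);
echo $page === null ? "Could not get the data!\n" : strlen($page) . " bytes fetched\n";
unlink($sample);
```

Checking for false matters because file_get_contents() returns false, not an empty string, when the fetch fails.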


Example2:Scrapingwithapurpose
$currentTerm =NULL;//we'llusethistoholdthewords! $myUrl ="http://www.google.com/trends/hottrends/atom/hourly $searchForStart ="sa=X\">"; $searchForEnd ="</a>"; $rawPage =file_get_contents($myUrl);

Geteverythingready Getthedata!

echo"<B>Thesearethishour'strendingtopicsonGoogle!</b><BR><BR>"; while($startPos =(strpos($rawPage,$searchForStart))){//aslongasthere'smorestufftofind,findit! $endPos =strpos($rawPage,$searchForEnd);//Andthenfindwhereitends! $length=$endPos $startPos; //Howlongisthisstringwe'vefound,anyway? if($startPos &&$endPos){ //Didwefindsomething?Then $currentTerm =substr($rawPage,($startPos+strlen($searchForStart)),$length6); echo$currentTerm ."<BR>"; }//endif $rawPage =substr($rawPage,($endPos +4)); }//endwhile

DoSomethingwithit!
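The find-the-start, find-the-end loop in Example 2 comes up in nearly every scrape, so it can be worth wrapping in a function. A minimal sketch follows; extractBetween and the sample markup are hypothetical, and the snippet only imitates the shape of the trends page rather than reproducing it.

```php
<?php
// Collect every piece of text that sits between $startMarker and $endMarker.
function extractBetween(string $haystack, string $startMarker, string $endMarker): array
{
    $found  = [];
    $offset = 0;
    while (($startPos = strpos($haystack, $startMarker, $offset)) !== false) {
        $textStart = $startPos + strlen($startMarker);
        $endPos    = strpos($haystack, $endMarker, $textStart);
        if ($endPos === false) {
            break; // a start marker with no matching end: stop looking
        }
        $found[] = substr($haystack, $textStart, $endPos - $textStart);
        $offset  = $endPos + strlen($endMarker); // continue after this match
    }
    return $found;
}

// Hypothetical snippet shaped like the markup searched in Example 2.
$rawPage = '<li><a href="/s?q=a&sa=X">world cup</a></li>'
         . '<li><a href="/s?q=b&sa=X">summer solstice</a></li>';

foreach (extractBetween($rawPage, 'sa=X">', '</a>') as $term) {
    echo htmlspecialchars($term) . "<BR>\n";
}
```

Moving an offset forward instead of chopping the page with substr() leaves the original string intact, and comparing against false with !== keeps a match at position 0 from ending the loop early.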


Example 3
Datatype           VIA         Destination
XML                RSS FEED    BROWSER
(Huffington Post)  (PHP)       (Firefox, or whatever)

Step one: Get to know your data:
http://feeds.huffingtonpost.com/huffingtonpost/raw_feed

Step two: Set up the code

What's this XML stuff?
<introductory tags>
<entry>
  <title></title>
  <id></id>
  <published></published>
  <updated>2010-06-19T15:50:45Z</updated>
  <summary></summary>
  <author>
    <name></name>
    <uri>http://www.huffingtonpost.com/annenaylor/</uri>
  </author>
  <content></content>
</entry>

Example 3: XML makes things awesome
// Get the data!
$url = "http://feeds.huffingtonpost.com/huffingtonpost/raw_feed";
$data = file_get_contents($url);

// Get it in a form we can use
$xml = new SimpleXMLElement($data);

// Do something with it!
echo "<b>Here are the current popular posts from Huffington Post without the ads!</b><BR><BR><ul>";
foreach ($xml->entry as $item) { // navigate to the tag we want
    $myTitle = "unknown"; // initialize the variable so it's all set!
    $myTitle = trim($item->title);
    echo "<LI>" . $myTitle . "<br>"; // now print it!
} // end foreach
echo "</ul>";
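To try the Example 3 approach without hitting the live feed, the sketch below runs the same SimpleXMLElement loop over a tiny inline sample. The entryTitles helper and the sample feed are hypothetical; the sample only mimics the entry/title layout shown earlier and is not the real Huffington Post feed.

```php
<?php
// Pull the trimmed <title> out of every <entry> in a feed.
function entryTitles(string $feedXml): array
{
    libxml_use_internal_errors(true); // keep parse warnings off the page
    try {
        $xml = new SimpleXMLElement($feedXml);
    } catch (Exception $e) {
        return []; // not well-formed XML, so there is nothing to list
    }
    $titles = [];
    foreach ($xml->entry as $item) {
        $titles[] = trim((string) $item->title);
    }
    return $titles;
}

// Hypothetical two-entry feed for the demo.
$sampleFeed = <<<XML
<feed>
  <entry><title> First post </title><id>1</id></entry>
  <entry><title>Tom &amp; Jerry</title><id>2</id></entry>
</feed>
XML;

echo "<ul>\n";
foreach (entryTitles($sampleFeed) as $title) {
    // Escape before printing: titles come from someone else's server.
    echo "<LI>" . htmlspecialchars($title) . "\n";
}
echo "</ul>\n";
```

The try/catch is there because the SimpleXMLElement constructor throws when the string is not valid XML, which happens more often than you would like with feeds in the wild.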


Example 4
Datatype              VIA      Destination
XML data              FEED     API and BROWSER
(US exchange rates)   (PHP)    (Google Charts API, Firefox, or whatever)

Step one: Get to know your data:
http://rss.timegenie.com/forex.xml

Step two: Set up the code

APIs? Eh?
Data: all the types of data we discussed before
Functionality:
  Data converters: language translators, speech processing, URL shorteners
  Communication: email, IM, notifications
  Visual data rendering: information visualization, diagrams, maps
  Security related: electronic payment systems, ID identification...

Example 4: Doing the two-step
Get the data!
Get it in a form we can use
Run it through a second process
Do something with it (like displaying that baby!)
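The four steps above could be sketched roughly like this. Everything specific here is an assumption: the data/code/rate layout of the sample XML is invented rather than taken from the real forex feed, the sample rates are made up, and the chart parameters (cht, chs, chd, chds, chxt, chxl) are recalled from the legacy Google image-charts API, which has since been deprecated, so verify them before relying on this.

```php
<?php
// Step 2: get the feed into a form we can use (currency code => rate).
// The <data><code>/<rate> layout is an assumption, not the real feed's schema.
function ratesFromXml(string $feedXml): array
{
    $rates = [];
    foreach ((new SimpleXMLElement($feedXml))->data as $item) {
        $rates[(string) $item->code] = (float) $item->rate;
    }
    return $rates;
}

// Step 3: run it through a second process by building a chart request URL.
function chartUrl(array $rates): string
{
    $query = http_build_query([
        'cht'  => 'bvs',                                    // vertical bar chart
        'chs'  => '400x200',                                // size in pixels
        'chd'  => 't:' . implode(',', $rates),              // the data series
        'chds' => '0,' . max($rates),                       // scale from 0 to the largest rate
        'chxt' => 'x',                                      // show an x axis
        'chxl' => '0:|' . implode('|', array_keys($rates)), // label it with the codes
    ]);
    return 'https://chart.googleapis.com/chart?' . $query;
}

// Step 1 stand-in: a hypothetical feed instead of a live request.
$sampleFeed = '<forex>'
    . '<data><code>EUR</code><rate>0.81</rate></data>'
    . '<data><code>GBP</code><rate>0.67</rate></data>'
    . '</forex>';

// Step 4: do something with it (like displaying that baby!)
$url = chartUrl(ratesFromXml($sampleFeed));
echo '<img src="' . htmlspecialchars($url) . '">' . "\n";
```

The point of the two-step is that the second API never sees the feed; it only sees the numbers you pulled out of it, so any chart service that accepts data in a URL could stand in here.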

Bringing data into a higher-level application like Processing!
Install the simpleML library:
http://www.learningprocessing.com/tutorials/simpleml/

Inspect your data for structure
Write some code!
  Declare your XML intent!
  Make the request!
  Process the request!
  Do fun stuff with it!

Go out and do some scraping!
Zoe Fraade-Blanar
Fraade@gmail.com
www.binaryspark.com
