Scraping
Page under construction
A scraper builder is being implemented. For now it only provides some basic features, but it should still cover most use cases. The scraper supports HTML only (XML should work too, but requires more testing). JSON scraping will come later for APIs once the HTML scraper is stable.
The syntax used is based on xPath; you can find a tutorial here, for example: https://www.w3schools.com/xml/xpath_intro.asp.
I would recommend using a browser extension like Try xPath to test your xPath expressions.
Here is an example for Discogs
- xPath for the item name.
- URL pattern this scraper will be used for. Optional, it only speeds up the process by automatically selecting the correct scraper based on the provided URL.
- xPath for the image src, which must be a URL (the image will be downloaded from that URL on item submit).
- xPaths for additional data.
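To illustrate how a URL pattern can select a scraper automatically, here is a minimal Python sketch. It is not the application's actual code: the `pick_scraper` function and the `SCRAPERS` list are hypothetical, and it assumes the pattern is matched as a plain substring of the URL, which the page does not specify.

```python
from typing import Optional

# Hypothetical in-memory list of scrapers; only the fields needed here are shown.
SCRAPERS = [
    {"name": "Discogs - release", "urlPattern": "https://www.discogs.com/release/"},
    {"name": "MyFigureCollection", "urlPattern": "https://myfigurecollection.net/item/"},
]


def pick_scraper(url: str, scrapers: list) -> Optional[dict]:
    """Return the first scraper whose urlPattern occurs in the URL, else None."""
    for scraper in scrapers:
        pattern = scraper.get("urlPattern")
        # Assumption: plain substring match. The pattern is optional, so
        # scrapers without one are simply skipped.
        if pattern and pattern in url:
            return scraper
    return None


match = pick_scraper("https://www.discogs.com/release/12345-Example", SCRAPERS)
print(match["name"] if match else "no automatic match, choose a scraper manually")
```

When no pattern matches, the scraper still has to be chosen by hand, which is consistent with the pattern being optional.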
Here is the full xPath from the previous example:
As mentioned above, the scraper uses xPath syntax.
On top of that, each xPath MUST be wrapped in # characters. This makes it possible to combine multiple xPaths in the same field.
In the example #//h1/span/a/text()# - #//h1/text()[2]#
will result in something like The band name - The album name
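The # wrapping can be pictured as a small template: each #...# segment is replaced by the result of its xPath, and everything outside the markers is kept as literal text. The Python sketch below illustrates that idea only. `render_template` is a hypothetical helper, and the evaluator is a stand-in dictionary rather than a real xPath engine, so the page content is assumed.

```python
import re
from typing import Callable


def render_template(template: str, evaluate: Callable[[str], str]) -> str:
    """Replace every #...# segment with the value returned by evaluate(xpath)."""
    # Text outside the # markers (here " - ") is left untouched.
    return re.sub(r"#([^#]*)#", lambda m: evaluate(m.group(1)), template)


# Stand-in for a real xPath evaluation against a page (assumed values).
fake_results = {
    "//h1/span/a/text()": "The band name",
    "//h1/text()[2]": "The album name",
}

title = render_template(
    "#//h1/span/a/text()# - #//h1/text()[2]#",
    lambda xpath: fake_results.get(xpath, ""),
)
print(title)  # The band name - The album name
```

One consequence of this simple splitting is that an xPath containing a literal # character would not be parsed correctly by the sketch.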
Three types of data fields are supported for now:
- Text -> if your xPath matches multiple strings, they will be concatenated using commas
- List -> each xPath match will create a new list element
- Country -> will try to match either the full country name or the alpha-2 or alpha-3 code, based on ISO 3166
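The three field types above could be handled roughly as in the following Python sketch. All names are hypothetical, the country table is a tiny excerpt of ISO 3166 used for illustration, and two details are assumptions not stated on this page: the exact separator used for Text (", " here) and the value returned for a Country match (the alpha-2 code here).

```python
from typing import List, Optional

# Tiny illustrative excerpt of ISO 3166: (name, alpha-2, alpha-3).
COUNTRIES = [
    ("France", "FR", "FRA"),
    ("Germany", "DE", "DEU"),
    ("Japan", "JP", "JPN"),
]


def to_text(matches: List[str]) -> str:
    """Text: concatenate all matches with commas (separator is an assumption)."""
    return ", ".join(matches)


def to_list(matches: List[str]) -> List[str]:
    """List: each match becomes its own list element."""
    return list(matches)


def to_country(matches: List[str]) -> Optional[str]:
    """Country: match the full name, the alpha-2 code, or the alpha-3 code."""
    for raw in matches:
        value = raw.strip()
        for name, alpha2, alpha3 in COUNTRIES:
            if value.lower() == name.lower() or value.upper() in (alpha2, alpha3):
                return alpha2  # assumption: the alpha-2 code is what gets stored
    return None


print(to_text(["Rock", "Pop"]))
print(to_list(["Track 1", "Track 2"]))
print(to_country(["Japan"]))
```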
On the Item or Collection create form, click on the scrap button. Choose the scraper you want to use and the URL to be scraped.
Alternatively, you can upload an HTML file if the URL isn't publicly accessible.
Using the Discogs example from above, here is the result:
To make it easier for people to share their scrapers, an import/export function is available. Here are two scrapers I've been using as examples (save each one in its own JSON file):
{"name":"Discogs - release","namePath":"#\/\/h1\/span\/a\/text()# - #\/\/h1\/text()[2]#","imagePath":"#(\/\/img)[2]\/@src#","urlPattern":"https:\/\/www.discogs.com\/release\/","dataPaths":[{"name":"Style","path":"#\/\/th[contains(text(),'Style')]\/ancestor::tr\/td\/a\/text()#","type":"text","position":1},{"name":"Country","path":"#\/\/th[contains(text(),'Country')]\/ancestor::tr\/td\/a\/text()#","type":"country","position":2},{"name":"Tracks","path":"#\/\/td[contains(@class,'trackTitle')]\/span\/text()# - #\/\/td[contains(@class,'duration')]\/span\/text()#","type":"list","position":3}]}
{"name":"MyFigureCollection","namePath":"#\/\/span[@class='headline']\/text()#","imagePath":"#\/\/a[@class='main']\/img\/@src#","urlPattern":"https:\/\/myfigurecollection.net\/item\/","dataPaths":[{"name":"Origin","path":"#\/\/div[contains(text(),'Origin')]\/ancestor::div\/div[contains(@class, 'form-input')]\/a\/span\/text()#","type":"text","position":1},{"name":"Character","path":"#\/\/div[contains(text(),'Character')]\/ancestor::div\/div[contains(@class, 'form-input')]\/a\/span\/text()#","type":"text","position":2},{"name":"Version","path":"#\/\/div[contains(text(),'Version')]\/ancestor::div\/div[contains(@class, 'form-input')]\/a\/text()#","type":"text","position":3},{"name":"Company","path":"#\/\/div[contains(text(),'Company')]\/ancestor::div\/div[contains(@class, 'form-input')]\/a\/span\/text()#","type":"text","position":4},{"name":"Classification","path":"#\/\/div[contains(text(),'Classification')]\/ancestor::div\/div[contains(@class, 'form-input')]\/a\/span\/text()#","type":"text","position":5},{"name":"Sculpted by","path":"#\/\/small[contains(text(),'As Sculptor')]\/ancestor::a\/span\/text()#","type":"text","position":6},{"name":"Illustrated by","path":"#\/\/small[contains(text(),'As Illustrator')]\/ancestor::a\/span\/text()#","type":"text","position":7},{"name":"Designed by","path":"#\/\/small[contains(text(),'As Designer')]\/ancestor::a\/span\/text()#","type":"text","position":8},{"name":"Color production by","path":"#\/\/small[contains(text(),'As Color producer')]\/ancestor::a\/span\/text()#","type":"text","position":9},{"name":"Material","path":"#\/\/div[contains(text(),'Material')]\/ancestor::div\/div[contains(@class, 'form-input')]\/a\/span\/text()#","type":"text","position":10},{"name":"Scale","path":"#\/\/a[contains(@class,'item-scale')]\/small\/text()##\/\/a[contains(@class,'item-scale')]\/text()#","type":"text","position":11},{"name":"Country","path":"#substring-before(substring-after(\/\/div[contains(text(),'Releases')]\/ancestor::div\/div[contains(@class, 'form-input')][1]\/small\/em\/text(), '('),')')#","type":"country","position":12}]}
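Before importing a scraper shared by someone else, it can help to check that the file is valid JSON and see which fields it defines. The Python sketch below does this for a shortened copy of the Discogs export above (only the first data path is kept). It is an illustration, not part of the application. Note that the `\/` sequences in the export are ordinary JSON escapes for `/`, so they disappear once the file is parsed.

```python
import json

# Shortened copy of the Discogs export shown above (first data path only).
EXPORTED = r"""{"name":"Discogs - release","namePath":"#\/\/h1\/span\/a\/text()# - #\/\/h1\/text()[2]#","imagePath":"#(\/\/img)[2]\/@src#","urlPattern":"https:\/\/www.discogs.com\/release\/","dataPaths":[{"name":"Style","path":"#\/\/th[contains(text(),'Style')]\/ancestor::tr\/td\/a\/text()#","type":"text","position":1}]}"""

scraper = json.loads(EXPORTED)  # raises json.JSONDecodeError if the file is broken

print("Scraper:", scraper["name"])
print("Name xPath:", scraper["namePath"])
print("Image xPath:", scraper["imagePath"])
print("URL pattern:", scraper["urlPattern"])

# Data fields are listed in the order given by their "position" value.
for field in sorted(scraper["dataPaths"], key=lambda f: f["position"]):
    print(f'{field["position"]}. {field["name"]} ({field["type"]}): {field["path"]}')
```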
Some websites have protections against scraping, especially e-commerce websites.