scrape the web using scrapy and mongodb
prepare
you'd better do this inside a virtual environment.
- install mongodb
- install scrapy
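for example, scrapy (and pymongo, which the mongodb pipeline later needs) can be installed with pip inside the activated virtualenv; mongodb itself is usually installed through your system's package manager:

```
pip install scrapy pymongo
```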
description
use scrapy to scrape some new questions from stackoverflow and save the data into mongodb.
details
if you use the virtual env, make sure you are inside the environment.
1. edit the items
here we just record the title and url fields.
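a minimal items.py sketch (the item class name is our own choice):

```python
# items.py
import scrapy


class QuestionItem(scrapy.Item):
    # we only record the question title and its url
    title = scrapy.Field()
    url = scrapy.Field()
```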
2. create the spiders
before we create the spiders, we need to figure out what xpath we will use to scrape the data. if you are not familiar with xpath, you can find more details in the xpath documentation. an xpath is like a file path in a file system; given an xpath, scrapy can find the data we need.
you can use your web browser to help you find the xpath:
- use firefox or chrome to open the page you want to scrape.
- select the item you are interested in.
- right-click it and select "inspect element".
- once you have the html tag, right-click it again and copy the xpath.
if you select this article's title and follow the steps above, you will find that the xpath on this site is: “/html/body/div/div/article/header/h1”
if you want to check that you have the correct xpath, you can test it in the browser js console.
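both chrome and firefox developer consoles ship a $x() helper that evaluates an xpath against the current page:

```js
// run in the developer console on the page you opened
$x("/html/body/div/div/article/header/h1")
// a non-empty array containing the <h1> element means the xpath is correct
```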
3. get ready to create the spiders
- here we just want to scrape the website “http://stackoverflow.com/questions”.
- we just want to get the title and url items as defined in items.py.
- get the xpath of the title: "//*[@id='question-summary-34959124']/div[2]/h3/a/text()"
note that if you want to extract the title text, you need to append the text() function. please use single quotes inside the double quotes.
- get the xpath of the url: "//*[@id='question-summary-34959124']/div[2]/h3/a/@href"
note that if you want to get an attribute, you need to put the @ operator before the html attribute name. again, use single quotes inside the double quotes.
- inspect the source html structure: an xpath that tests successfully in the browser js console can still fail when scrapy fetches the page, since the browser may rewrite the html it renders.
what you really want to do is scrape the whole list of questions, but the xpaths above match only one specific question. one way to generalize them is sketched below.
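a generalization sketch (the question-summary class reflects the page structure at the time of writing and may have changed):

```python
# match every question summary on the page instead of one hard-coded id
for q in response.xpath('//div[contains(@class, "question-summary")]'):
    title = q.xpath('div[2]/h3/a/text()').extract_first()
    url = q.xpath('div[2]/h3/a/@href').extract_first()
```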
4. write the spiders
here comes the source code; do more testing in the scrapy shell.
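the following is a sketch of what the spider can look like (the file name, spider name, and project module are assumptions; the selectors are the generalized versions from step 3):

```python
# spiders/stackoverflow_spider.py  (file name is an assumption)
import scrapy

from myproject.items import QuestionItem  # adjust "myproject" to your project name


class StackOverflowSpider(scrapy.Spider):
    name = "stackoverflow"
    allowed_domains = ["stackoverflow.com"]
    start_urls = ["http://stackoverflow.com/questions"]

    def parse(self, response):
        # yield one item per question summary on the listing page
        for q in response.xpath('//div[contains(@class, "question-summary")]'):
            item = QuestionItem()
            item["title"] = q.xpath('div[2]/h3/a/text()').extract_first()
            item["url"] = q.xpath('div[2]/h3/a/@href').extract_first()
            yield item
```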
test scrapy
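for example (the spider name matches the sketch above):

```
# try the selectors interactively first
scrapy shell "http://stackoverflow.com/questions"
# then run the spider
scrapy crawl stackoverflow
```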
use mongodb to save the result
in scrapy, if you want to save the result (the returned items) to mongodb or another database, you can use the item pipeline interface to process each item.
you can find settings.py and pipelines.py inside your scrapy project directory. just enable the pipeline setting inside settings.py.
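for example (the pipeline path and the three mongodb keys are our own assumed names; the pipeline sketch below reads them):

```python
# settings.py
ITEM_PIPELINES = {
    "myproject.pipelines.MongoDBPipeline": 300,
}

# connection settings read by the pipeline below (assumed names)
MONGODB_URI = "mongodb://localhost:27017"
MONGODB_DATABASE = "stackoverflow"
MONGODB_COLLECTION = "questions"
```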
customize the pipeline class inside the pipelines.py file.
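a minimal sketch using pymongo (the class name matches the ITEM_PIPELINES entry assumed above):

```python
# pipelines.py
import pymongo


class MongoDBPipeline(object):
    def open_spider(self, spider):
        # connect once when the spider starts
        settings = spider.settings
        self.client = pymongo.MongoClient(settings.get("MONGODB_URI"))
        db = self.client[settings.get("MONGODB_DATABASE")]
        self.collection = db[settings.get("MONGODB_COLLECTION")]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # save one document per scraped question
        self.collection.insert_one(dict(item))
        return item
```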
test the mongodb
before you run the spider, you need to start the mongod service.
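for example (the dbpath and the database/collection names follow the assumptions above):

```
# start the mongod service (dbpath is an example)
mongod --dbpath /path/to/your/db

# in another terminal, run the crawl
scrapy crawl stackoverflow

# then check the saved questions in the mongo shell
mongo
> use stackoverflow
> db.questions.find().limit(5)
```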
all done here, hope you enjoy it.