How to build a scheduled Python crawler with login

A web crawler is also known as a web spider.

Python has a few libraries for this:

  • BeautifulSoup: BeautifulSoup is a library for parsing HTML and XML documents. Requests (which handles HTTP sessions and makes HTTP requests) combined with BeautifulSoup (a parsing library) is the best toolset for small, quick web-scraping jobs. For scraping simpler, static pages with little JS complexity, this tool is probably what you’re looking for. If you want to know more about BeautifulSoup, please refer to my previous guide on Extracting Data from HTML with BeautifulSoup.

    lxml is a high-performance, straightforward, feature-rich parsing library and a prominent alternative to BeautifulSoup.

  • Scrapy: Scrapy is a web crawling framework that provides a complete toolkit for scraping. In Scrapy, we create Spiders, which are Python classes that define how a particular site (or sites) will be scraped. So, if you want to build a robust, concurrent, scalable, large-scale scraper, then Scrapy is an excellent choice. Also, Scrapy comes with a bunch of middlewares for cookies, redirects, sessions, caching, etc. that help you deal with the different complexities you might come across. If you want to know more about Scrapy, please refer to my previous guide on Crawling the Web with Python and Scrapy.

  • Selenium: For heavily JS-rendered pages or very sophisticated websites, Selenium WebDriver is the best tool to choose. Selenium automates web browsers through a component known as a web driver. With it, you can open an automated Google Chrome/Mozilla Firefox window that visits a URL and navigates its links. However, it is not as efficient as the tools discussed so far. It is the tool to reach for when all other doors to web scraping are closed and you still want the data that matters to you. If you want to know more about Selenium, please refer to Web Scraping with Selenium.
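As a quick taste of the requests + BeautifulSoup combination described above, here is a minimal sketch that parses a hardcoded HTML snippet (in a real crawler the HTML would come from `requests.get(url).text`; the snippet and class names are made up for illustration):

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = """
<html><body>
  <h1>Devices</h1>
  <ul>
    <li class="device">router</li>
    <li class="device">camera</li>
  </ul>
</body></html>
"""

# "html.parser" is the stdlib parser; lxml can be swapped in for speed.
soup = BeautifulSoup(html, "html.parser")
devices = [li.get_text() for li in soup.select("li.device")]
# devices == ["router", "camera"]
```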

 

A simple Scrapy example: https://www.digitalocean.com/community/tutorials/how-to-crawl-a-web-page-with-scrapy-and-python-3

The example above does not require login.

 


If login is required, use Scrapy's FormRequest.

As an example, take https://ktu3333.asuscomm.com:9085/enLogin.htm

Login has been tested successfully.


Scrapy only fetches static content; since the target page contains JS and Ajax, it needs to be combined with Selenium and a web driver.

For the reason, see: https://www.geeksforgeeks.org/scrape-content-from-dynamic-websites/


How to install ChromeDriver on macOS:

https://www.swtestacademy.com/install-chrome-driver-on-mac/

 


2021-07-27: Stopped using Scrapy for login, because the session Scrapy holds after logging in is not the same session as Selenium's, so log in directly with Selenium instead.
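Logging in directly with Selenium keeps the authenticated session inside the same browser instance that later scrapes the JS-rendered pages. A minimal sketch — the URL is from the note above, but the XPath locators and credentials are placeholder assumptions; inspect the real form:

```python
LOGIN_URL = "https://ktu3333.asuscomm.com:9085/enLogin.htm"

# Placeholder locators -- inspect the real page for the actual names/ids.
USERNAME_XPATH = "//input[@name='username']"
PASSWORD_XPATH = "//input[@name='password']"
SUBMIT_XPATH = "//input[@type='submit']"


def login(username, password):
    """Open the login page in an automated Chrome window and submit the form."""
    # Lazy import so this module still loads where selenium isn't installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # chromedriver must be on PATH (see link above)
    driver.get(LOGIN_URL)
    driver.find_element(By.XPATH, USERNAME_XPATH).send_keys(username)
    driver.find_element(By.XPATH, PASSWORD_XPATH).send_keys(password)
    driver.find_element(By.XPATH, SUBMIT_XPATH).click()
    # The logged-in session (cookies) lives in this driver instance,
    # so keep using it for the subsequent scraping.
    return driver
```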


Use XPath to find elements; note the special syntax needed when the XPath contains a parameter.
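Building an XPath with a runtime parameter usually means string interpolation, and the quoting is the usual trap. A small illustration (the element names are made up):

```python
def xpath_for_row(device_name: str) -> str:
    """Build an XPath selecting the table row whose first cell matches a name.

    The parameter is wrapped in double quotes inside a single-quoted f-string;
    this assumes device_name itself contains no double quotes.
    """
    return f'//tr[td[text()="{device_name}"]]'


# Usage with Selenium would look like:
#   driver.find_element(By.XPATH, xpath_for_row("camera"))
```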


Code that runs so far (the scheduling feature is not added yet); Python version 3.8.6.
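For the missing scheduling feature, one stdlib-only option is a simple interval loop around the scrape job; the interval and the job itself are placeholders here:

```python
import time


def run_periodically(job, interval_seconds, max_runs=None):
    """Call job() every interval_seconds; stop after max_runs if given."""
    runs = 0
    while max_runs is None or runs < max_runs:
        job()
        runs += 1
        if max_runs is not None and runs >= max_runs:
            break
        time.sleep(interval_seconds)
    return runs


# Example: scrape once an hour, forever.
#   run_periodically(lambda: print("scraping..."), interval_seconds=3600)
```

For production use, cron (or launchd on macOS) or the third-party `schedule` package are common alternatives to rolling your own loop.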

 


Improved version: put the results into a JSON array.
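The usual pattern for collecting results into a JSON array is to append one dict per scraped item to a list and serialize once at the end; the field names here are illustrative:

```python
import json


def results_to_json(rows):
    """Convert scraped (name, value) pairs into a JSON array string."""
    results = [{"name": name, "value": value} for name, value in rows]
    # ensure_ascii=False keeps any non-ASCII text readable in the output.
    return json.dumps(results, ensure_ascii=False, indent=2)
```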