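Crawling a web page generally comes down to two stages: fetching the page, and then extracting the data you need from it.
Fetching the page is not actually hard. The hard part is filtering the data, that is, getting exactly the data you want. This article walks through how to use BeautifulSoup for that.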
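The official BeautifulSoup documentation describes it as follows:
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree, and it commonly saves programmers hours or days of work.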
1 Installation
BeautifulSoup can be installed directly with pip:
$ pip install beautifulsoup4
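Besides the HTML parser in the Python standard library (html.parser), BeautifulSoup supports third-party parsers such as lxml (for both HTML and XML) and html5lib, but each of these requires its library to be installed. If none of them is installed, BeautifulSoup falls back to the built-in html.parser. The lxml parser is more powerful and faster, so installing it is recommended.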
$ pip install html5lib
$ pip install lxml
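Since which parser is available depends on what has been installed, the choice can be made explicit in code. The following is a minimal sketch, not part of the original article: the helper name `make_soup` and its `preferred` parameter are hypothetical, and it relies on `bs4.FeatureNotFound`, the exception BeautifulSoup raises when a requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse `markup` with the first parser in `preferred` that is installed.

    Hypothetical helper for illustration; returns (soup, parser_name).
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser's library is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p>Hello</p>")
print(used, soup.p.string)
```

Because html.parser ships with Python, the loop always ends with a usable parser even when neither lxml nor html5lib is installed.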
2 Basic usage of BeautifulSoup
First, create a string. The rest of this article uses it to demonstrate BeautifulSoup.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a rel="external nofollow" class="sister" id="link1">Elsie</a>,
<a rel="external nofollow" class="sister" id="link2">Lacie</a> and
<a rel="external nofollow" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
Parsing this snippet with BeautifulSoup yields a BeautifulSoup object, which can be printed as a tree with standard indentation:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc, "lxml")
>>> print(soup.prettify())
For brevity, the output is not shown here.
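To get a feel for what prettify() produces without printing the whole document, here is a small sketch on a one-line fragment. The fragment and the variable name `pretty` are made up for illustration, and the built-in html.parser is used so the example runs without lxml and does not wrap the fragment in html and body tags.

```python
from bs4 import BeautifulSoup

# A tiny, hypothetical fragment instead of the full html_doc.
fragment = "<p class='title'><b>The Dormouse's story</b></p>"

soup = BeautifulSoup(fragment, "html.parser")

# prettify() returns a string with each tag and each text node
# on its own line, indented according to nesting depth.
pretty = soup.prettify()
print(pretty)
```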
In addition, here are a few simple ways to navigate the structured data:
>>> soup.title
<title>The Dormouse's story</title>
>>> soup.title.name
'title'
>>> soup.title.string
"The Dormouse's story"
>>> soup.p['class']
['title']
>>> soup.a
<a class="sister" rel="external nofollow" id="link1">Elsie</a>
>>> soup.find_all('a')
[<a class="sister" rel="external nofollow" id="link1">Elsie</a>, <a class="sister" rel="external nofollow" id="link2">Lacie</a>, <a class="sister" rel="external nofollow" id="link3">Tillie</a>]
>>> soup.find(id='link1')
<a class="sister" rel="external nofollow" id="link1">Elsie</a>
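The lookups above can be combined into a short script that collects a value from every matching tag. This is a sketch under a few assumptions: it uses a shortened copy of the sample document, parses with the built-in html.parser so it runs without lxml, and the names `sample` and `links` are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample document used in this article.
sample = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a class="sister" id="link1">Elsie</a>,
<a class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tillie</a>
</p>
"""

soup = BeautifulSoup(sample, "html.parser")

# find_all() returns every matching tag; class_ filters by CSS class.
# Tag.get() reads an attribute (None if absent), get_text() the inner text.
links = [(a.get("id"), a.get_text()) for a in soup.find_all("a", class_="sister")]
print(links)

# class is a multi-valued attribute, so it comes back as a list.
print(soup.p["class"])
```

Using get() rather than indexing with square brackets avoids a KeyError when a tag lacks the attribute.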