Python Web Scraping: A Guide to Using the Beautiful Soup Module

2020-02-15 22:11:54

Source: reposted

Contributed by: a site user

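Scraping a web page usually follows these steps:

1. Choose the URL you want to scrape.
2. Open that URL from Python (with urlopen, requests, and so on).
3. Read the page content (for example with read()).
4. Pass the content to BeautifulSoup.
5. Use BeautifulSoup to select tags and extract the information you need.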
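The steps above can be sketched roughly as follows. This is only a minimal illustration: the helper names `fetch_html` and `extract_title` are made up for this example, the built-in "html.parser" is used so that no extra parser has to be installed, and the demo at the bottom parses an inline string instead of a live URL.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Steps 2 and 3: open the URL and read the raw page bytes.

    Hypothetical helper for illustration; it is not called in the demo below.
    """
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def extract_title(html):
    """Steps 4 and 5: hand the markup to BeautifulSoup and pick out a tag."""
    soup = BeautifulSoup(html, "html.parser")
    # soup.title is None when the document has no <title> tag.
    return soup.title.string if soup.title else None


if __name__ == "__main__":
    # An inline page stands in for the downloaded content.
    sample = "<html><head><title>Example page</title></head><body></body></html>"
    print(extract_title(sample))
```

For a real page, the result of `fetch_html(url)` would be passed to `extract_title` in place of the inline sample.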

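As you can see, fetching the page is not the hard part. The hard part is filtering the data, that is, extracting exactly the data you want. This article walks through how to use BeautifulSoup for that.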

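The official BeautifulSoup site describes it as follows:

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.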

1 Installation

You can install it directly with pip:

$ pip install beautifulsoup4

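BeautifulSoup supports the HTML parser in Python's standard library (html.parser) as well as several third-party parsers, such as lxml (for both HTML and XML) and html5lib, but the third-party ones require installing the corresponding libraries. If none of them is installed, BeautifulSoup falls back to Python's built-in parser. The lxml parser is more powerful and faster, so installing it is recommended.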

$ pip install html5lib
$ pip install lxml
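Because the parser is chosen by name when the soup is created, one way to cope with a missing lxml is to catch the error and fall back to the built-in parser. The sketch below assumes this approach; `make_soup` is a hypothetical helper name, not part of the library. `FeatureNotFound` is the exception bs4 raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; use the standard-library one.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello, <b>soup</b></p>")
print(soup.b.string)
```

Keep in mind that different parsers can build slightly different trees from malformed HTML, so a silent fallback is best kept for simple cases.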

2 Basic usage of BeautifulSoup

First, let's create a string. The rest of this article uses it to demonstrate BeautifulSoup.

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" id="link1">Elsie</a>,
<a class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

Parsing this markup with BeautifulSoup gives you a BeautifulSoup object, which can print the document as a properly indented structure:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc, "lxml")
>>> print(soup.prettify())

To save space, the output is not shown here.
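To get a feel for what prettify() does without the full output, here is a small sketch on a much shorter fragment. The fragment and the variable names are made up for illustration, and "html.parser" is used so the example does not depend on lxml being installed.

```python
from bs4 import BeautifulSoup

# A tiny fragment instead of the full html_doc, to keep the output short.
fragment = "<p class='title'><b>The Dormouse's story</b></p>"
soup = BeautifulSoup(fragment, "html.parser")

pretty = soup.prettify()
print(pretty)

# Each tag and each text node ends up on its own line, indented by depth.
lines = [line.strip() for line in pretty.splitlines() if line.strip()]
print(len(lines))
```

The nesting is shown through indentation, which makes it easier to see which tag to target before writing a selector.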

Here are a few simple ways to navigate the parsed structure:

>>> soup.title
<title>The Dormouse's story</title>
>>> soup.title.name
'title'
>>> soup.title.string
"The Dormouse's story"
>>> soup.p['class']
['title']
>>> soup.a
<a class="sister" id="link1">Elsie</a>
>>> soup.find_all('a')
[<a class="sister" id="link1">Elsie</a>, <a class="sister" id="link2">Lacie</a>, <a class="sister" id="link3">Tillie</a>]
>>> soup.find(id='link1')
<a class="sister" id="link1">Elsie</a>
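In practice these lookups are usually combined to pull structured data out of the page. The following sketch, which assumes a shortened copy of the story paragraph and the built-in "html.parser", collects the id and text of every link with class "sister"; the variable name `sisters` is only illustrative.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" id="link1">Elsie</a>,
<a class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find_all returns every matching tag; class_ filters on the CSS class.
# tag.get(...) reads an attribute and get_text() returns the tag's text.
sisters = [(a.get("id"), a.get_text()) for a in soup.find_all("a", class_="sister")]
print(sisters)
```

Using `a.get("id")` rather than `a["id"]` returns None instead of raising KeyError when a tag lacks the attribute, which tends to be safer on real pages.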
