Hypy Tutorial by Example: A Python Application

2024-09-09 19:03:44

Source: reposted

Contributed by: a reader

Important Classes and Methods
-----------------------------

HDatabase
~~~~~~~~~
Represents the indexed data store.

Key methods: db.open(filename, mode), db.close(), db.putDoc(document), db.search(condition)

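HDocument
~~~~~~~~~

Represents one record that is indexed and stored.

Key methods: doc.addText(text), doc.addHiddenText(text), doc.encode(encoding)

HDocument behaves like a dictionary whose keys hold the record's metadata (field names, for example).
For instance, doc[u'@uri'] returns the @uri attribute of the document (record). It supports all the dictionary methods.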

HHit
~~~~
A subclass of HDocument that represents one record in a search result. It has everything HDocument has, plus one more key method: result.teaser(wordlist)

HResults
~~~~~~~~

Represents a search result, which is a collection of HHit objects. Key methods: results.hintWords(), results.pluck(attr)

HResults behaves like a list, so you can iterate over it to reach the hits.

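HCondition
~~~~~~~~~~

A query object, used to perform a search. Create an instance with search parameters such as the search phrase, the maximum number of hits to return, or other metadata.

Key methods: cond.addAttr(attributeExpression), cond.setOrder(orderExpression)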

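Examples
--------

Each example relies on the code of the examples before it. If you plan to run all of them, run them in order from top to bottom. If you are only browsing, the order does not matter; just read the examples that interest you.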

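Opening an Index
~~~~~~~~~~~~~~~~

Before you can use an index (also called a database), you must create it at a path on disk.

The flags for opening an index are the same as those of Python's built-in open() function:

'r' : flag
    open read-only
'w' : flag
    if the file exists, delete or truncate it first, then open it for writing
'a' : flag
    open for writing, creating the file automatically if it does not exist (the most common choice)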

Example:

    from hypy import *  # don't do this in production code
    # "import *" is a bad habit; I only use it here to save typing.

    index = 'breakfast/'
    db = HDatabase()
    db.open(index, 'w')  # create a new index, deleting the old one if it exists
    db.close()
    db.open(index, 'a')
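Because the flags mirror those of open(), the difference between 'w' and 'a' can be seen with an ordinary file. The following is a minimal sketch in Python 3 (the tutorial's own snippets are Python 2) that uses only the standard library and does not touch Hypy; the file name demo.txt and the function demo_modes are made up for illustration.

```python
import os
import tempfile


def demo_modes():
    """Show that 'w' truncates and 'a' appends, using a plain text file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.txt")  # hypothetical file name
        with open(path, "w") as f:   # creates the file
            f.write("first\n")
        with open(path, "a") as f:   # keeps existing content and appends
            f.write("second\n")
        with open(path, "r") as f:
            after_append = f.read()
        with open(path, "w") as f:   # discards the existing content
            f.write("third\n")
        with open(path, "r") as f:
            after_truncate = f.read()
    return after_append, after_truncate


after_append, after_truncate = demo_modes()
print(repr(after_append))    # 'first\nsecond\n'
print(repr(after_truncate))  # 'third\n'
```

This is why the example above opens the index with 'w' once to start clean and then reopens it with 'a' for normal use.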

Crawling Content
~~~~~~~~~~~~~~~~

Hypy itself is not a web crawler, but because it depends on Hyper Estraier it gets crawling for free. Here is how to use Hyper Estraier's spider, which is called estwaver. For more details on estwaver, search the web.

Example (bash syntax):

    $ cd ~/projects/
    $ estwaver init hypysite
    2009-02-21T23:18:45Z INFO the root directory created

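estwaver init does much the same thing as db.open(index, 'w') in the example above, with one difference: it also creates a template configuration file, hypysite/_conf.

Edit this configuration file. Change the seeds at the top to the URLs you want to crawl. If you want to restrict the scope of the crawl, edit the regular expressions that define which URLs may be visited.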

    # original file
    seed: 1.5|http://hyperestraier.sourceforge.net/uguide-en.html
    seed: 1.0|http://hyperestraier.sourceforge.net/pguide-en.html
    seed: 1.0|http://hyperestraier.sourceforge.net/nguide-en.html
    seed: 0.0|http://qdbm.sourceforge.net/
    …
    # allowing regular expressions of urls to be visited
    allowrx: ^http://

My edited file follows. I added a few rules so that the wiki's built-in pages are not indexed.

    # my changes
    seed: 1.0|http://mysite.goonmill.org/
    …
    allowrx: ^http://(|[a-zA-Z0-9_]*\.)goonmill\.org
    denyrx: ^http://wiki\.goonmill\.org/help
    denyrx: ^http://wiki\.goonmill\.org/.*wiki
    denyrx: ^http://wiki\.goonmill\.org/.*\?action=
    denyrx: ^http://wiki\.goonmill\.org/systempages

    # leave the remaining denyrx lines unchanged.
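To check what such rules let through before starting a crawl, the patterns can be tried against sample URLs with Python's re module. This is only a sketch in Python 3: the patterns are the ones from the edited configuration, written with backslash-escaped dots, and the combination rule (a URL is visited when some allow pattern matches and no deny pattern matches) is my assumption, not something taken from the estwaver documentation. The function name may_visit is made up.

```python
import re

# Patterns based on the edited configuration above.
ALLOW = [r"^http://(|[a-zA-Z0-9_]*\.)goonmill\.org"]
DENY = [
    r"^http://wiki\.goonmill\.org/help",
    r"^http://wiki\.goonmill\.org/.*wiki",
    r"^http://wiki\.goonmill\.org/.*\?action=",
    r"^http://wiki\.goonmill\.org/systempages",
]


def may_visit(url):
    """Assumed rule: some allow pattern matches and no deny pattern matches."""
    allowed = any(re.search(pattern, url) for pattern in ALLOW)
    denied = any(re.search(pattern, url) for pattern in DENY)
    return allowed and not denied


for url in [
    "http://goonmill.org/cory",
    "http://wiki.goonmill.org/help",
    "http://wiki.goonmill.org/page?action=edit",
    "http://example.com/",
]:
    print(url, may_visit(url))
```

Only the first URL passes; the two wiki URLs are caught by deny rules and the last one never matches the allow rule.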

Now you can crawl with the estwaver crawl command. Remember to tell it where the index lives (./hypysite).

    $ estwaver crawl hypysite
    2009-02-21T23:44:43Z    INFO    db-event: status: name=hypysite/_index …
    2009-02-21T23:44:43Z    INFO    crawling started (continue)
    2009-02-21T23:44:43Z    INFO    fetching: 0: http://goonmill.org/
    2009-02-21T23:44:44Z    INFO    seeding: 1.000: http://goonmill.org/
    2009-02-21T23:44:45Z    INFO    [1]: fetching: 0: http://goonmill.org/
    2009-02-21T23:44:45Z    INFO    [2]: fetching: 1: http://goonmill.org/cory

    …

    2009-02-21T23:47:02Z INFO db-event: closing: name=hypysite/_index …
    2009-02-21T23:47:02Z INFO finished successfully

CRUD (Create/Read/Update/Delete)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If your content does not come from crawling, or you need to add content by hand, the following examples show how to work with documents directly.

Simple example:
^^^^^^^^^^^^^^^

    doc = HDocument(uri=u'http://estraier.gov/example.txt')
    doc.addText(u"hello there, this is my text.")
    db.putDoc(doc)

I have just added one simple document to the index; its uri attribute is http://estraier.gov/example.txt. Note that both the uri and the text are unicode strings. Hypy always requires unicode strings.

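The document becomes searchable only after the index has been flushed. The following operations flush the index:

* closing the index object with its close() method
* flushing the index explicitly with its flush() method
* calling putDoc(), if the autoflush option is on

To turn on autoflush:

    # db = HDatabase(autoflush=True)  ## or
    db.autoflush = True
    # Turning the option on does not flush the index by itself, so flush once by hand:
    db.flush()

Tip

autoflush increases disk I/O. For better performance when indexing a large batch of documents, turn autoflush off and call flush() by hand once every n documents. Choose n large enough to reduce disk load noticeably, but not so large that pending documents fill your memory. Use your own judgment.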
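The batching pattern in the tip can be sketched as follows. This is Python 3 and deliberately does not use Hypy: CountingStore is a hypothetical stand-in that only counts calls, and its add() and flush() methods, like the helper index_in_batches, are invented for illustration. With a real index you would call the index object's own methods in the same places.

```python
class CountingStore:
    """Hypothetical stand-in for an index; it only counts what happens."""

    def __init__(self):
        self.pending = 0   # documents added but not yet flushed
        self.stored = 0    # documents that have been flushed
        self.flushes = 0   # how many times flush() ran

    def add(self, doc):
        self.pending += 1

    def flush(self):
        self.stored += self.pending
        self.pending = 0
        self.flushes += 1


def index_in_batches(store, docs, n):
    """Add every document, flushing once per n documents and once at the end."""
    for count, doc in enumerate(docs, start=1):
        store.add(doc)
        if count % n == 0:
            store.flush()
    if store.pending:
        store.flush()  # do not leave a partial batch unflushed


store = CountingStore()
index_in_batches(store, ["doc%d" % i for i in range(25)], n=10)
print(store.flushes, store.stored)  # 3 25
```

Twenty-five documents with n=10 cost three flushes instead of twenty-five, which is the saving the tip is after.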

Every document needs a @uri attribute, which is what you pass when creating an HDocument object. You can also add any other named attributes to a document.

    doc2 = HDocument(uri=u'http://estraier.gov/pricelist.txt')
    doc2.addText(u"""coffee: $2.00
    toast: $1.00
    eggs (2, any style): $3.00          eggs (9, any style): $13.50
    eggs (3, any style): $4.50          eggs (10, any style): $15.00
    eggs (4, any style): $6.00          eggs (11, any style): $16.50
    eggs (5, any style): $7.50          eggs (12, any style): $18.00
    eggs (6, any style): $9.00          eggs (13, any style): $19.50
    eggs (7, any style): $10.50         eggs (14, any style): $21.00
    eggs (8, any style): $12.00         eggs (15, any style): $22.50
    …
    with apologies to the new yorker for this joke""")
    doc2[u'maxprice'] = u'22.50'
    doc2[u'minprice'] = u'1.00'
    db.putDoc(doc2)

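Some attributes are generated automatically:

    print doc[u'@id']
    ## prints 1
    print doc[u'@uri']
    ## prints http://estraier.gov/example.txt

You can remove a specific document from the index by giving its id. You can also remove one by passing the document itself or another built-in attribute such as its uri.

    db.remove(id=1)
    db.putDoc(doc)  # put it back so we can remove it again
    db.remove(doc)
    db.putDoc(doc)  # add it once more
    db.remove(uri=u'http://estraier.gov/example.txt')
    # Note: each time we add this document it gets a new id
    print doc[u'@id']
    ## prints 4

Removing a document and putting it back is the standard way to update a document.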
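The remove-then-put update pattern, and the fact that a re-added document gets a fresh id, can be pictured with a small in-memory model. This Python 3 sketch is a toy and not Hypy's implementation: ToyIndex, its put(), remove() and update() methods, and the rule that ids are never reused are all assumptions made for illustration.

```python
import itertools


class ToyIndex:
    """Hypothetical in-memory model mapping uri -> (id, text)."""

    def __init__(self):
        self._next_id = itertools.count(1)  # ids are handed out once, never reused
        self._docs = {}

    def put(self, uri, text):
        doc_id = next(self._next_id)
        self._docs[uri] = (doc_id, text)
        return doc_id

    def remove(self, uri):
        del self._docs[uri]

    def update(self, uri, text):
        """Update by removing the old copy (if any) and putting the new one."""
        if uri in self._docs:
            self.remove(uri)
        return self.put(uri, text)

    def __len__(self):
        return len(self._docs)


index = ToyIndex()
first_id = index.put("http://estraier.gov/example.txt", "hello there")
second_id = index.update("http://estraier.gov/example.txt", "hello again")
print(first_id, second_id, len(index))  # 1 2 1
```

The document count stays at one while the id changes, so code that remembers ids across updates needs to refresh them.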

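So how do you know how many documents the index holds? Python's built-in len() function tells you. Let's confirm that one document is left in db:

    print len(db)
    ## prints 1

As you might guess, you can fetch a document by its uri attribute, using dictionary-style item access:

    print db[u'http://estraier.gov/pricelist.txt']
    ## prints @digest=caacaefddcc1fd244de251723b0814be
    ##        @id=2
    ##        @uri=http://estraier.gov/pricelist.txt
    ##        …

What gets printed here is the "draft" format, the document's internal representation, which is the result of str(doc). Reading a document this way is not recommended. Use the encode() method to get the representation you want:

    print doc2.encode('utf-16')
    ## prints ÿþc^@o^@f^@f^@e^@e^@:…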

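A More Complex Example of Creating Documents
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Hypy offers no direct way to set the weight of search results. The workaround is to decide the weight at indexing time. If you crawl with estwaver, change the number on each seed line: different seed values lead to different weights. If you add documents by hand, doc.addHiddenText(text) lets you change the weight of a keyword.

A document's weight is computed mainly from how many times the search keyword occurs in it. If you want one keyword to weigh especially heavily, simply add some hidden text:

    doc5 = HDocument(u'http://estraier.gov/weighted.txt')
    doc5.addText(u"this is my boom-stick.")
    doc5.addHiddenText(u"eggs " * 30)
    db.putDoc(doc5)

When you search for eggs, doc5 (the third document added) will score higher than doc2. The hidden text does not show up when you print the document.

    print doc5.encode('utf-8')
    # prints this is my boom-stick.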
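Why thirty hidden copies of a word push a document up the ranking can be seen with a toy model that scores a document by counting occurrences of the term in visible plus hidden text. This is a Python 3 sketch of the idea only; the real scoring in Hyper Estraier is more involved, and toy_score and the shortened sample texts are made up for illustration.

```python
def toy_score(term, visible, hidden=""):
    """Toy model: the score is the number of times term occurs in all text."""
    words = (visible + " " + hidden).lower().split()
    return words.count(term.lower())


pricelist_text = "coffee toast eggs eggs eggs"  # shortened stand-in for doc2
boomstick_text = "this is my boom-stick."       # visible text of doc5
boomstick_hidden = "eggs " * 30                 # hidden text of doc5

print(toy_score("eggs", pricelist_text))                    # 3
print(toy_score("eggs", boomstick_text, boomstick_hidden))  # 30
print(boomstick_text)  # the hidden words are not part of the visible text
```

The visible text never mentions eggs, yet the document outranks the price list for that term.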

Searching and Reading Results
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

So how do you search? It is simple: build a condition, then call db.search(condition). The result object is list-like.

    cond = HCondition(u'eggs')
    results = db.search(cond)
    for doc in results:
        print doc[u'@id']
    # prints 5 then 2

    # pluck() returns one attribute from every document in the results:

    print results.pluck(u'@id')
    # prints [u'5', u'2']

Search phrases also support wildcards:

    cond = HCondition(u'egg*')
    results = db.search(cond)
    print len(results)
    # prints 2
    print results[0][u'@uri']
    # prints …/weighted.txt

There are other matching modes too. The default mode is "simple", which takes the intersection of multiple search terms (an AND query). Passing the matching parameter to the condition changes this default (to an OR query, for example). Even without the matching parameter you can change the behavior by using special syntax inside the search phrase. The phrase syntax is recommended because it gives you flexible control over the results.

    doc6 = HDocument(uri=u'http://estraier.gov/spam.txt')
    doc6.addText(u'spam and eggs')
    db.putDoc(doc6)  # document @id is 6

    # simple, AND query:
    print db.search(HCondition(u'spam* eggs*')).pluck(u'@id')
    # prints [u'6']

    # union, OR query:
    print db.search(HCondition(u'eggs spam', matching='union')).pluck(u'@id')
    # prints [u'5', u'2', u'6']

    # unions with simple matching - you cannot use wildcard matches with
    # matching='union' but you can do so with '|' syntax
    print db.search(HCondition(u'egg* | spam')).pluck(u'@id')
    # prints [u'5', u'2', u'6']

The Hyper Estraier user's guide documents the complete search syntax.
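The difference between the AND behavior of simple matching and the OR behavior of union matching comes down to requiring all terms versus any term. The Python 3 sketch below illustrates that with a tiny in-memory corpus; it is a toy, not Hypy. The function toy_search and the shortened texts are invented, wildcards are not modeled, and hits come back sorted by id rather than by score as in the real results above.

```python
# Shortened stand-ins for documents 2, 5 and 6 from the examples above.
docs = {
    "2": "coffee toast eggs",
    "5": "boom-stick eggs",
    "6": "spam and eggs",
}


def toy_search(words, matching="simple"):
    """Return ids of documents containing all words (simple) or any word (union)."""
    combine = all if matching == "simple" else any
    hits = []
    for doc_id in sorted(docs):
        tokens = docs[doc_id].split()
        if combine(word in tokens for word in words):
            hits.append(doc_id)
    return hits


print(toy_search(["spam", "eggs"]))                    # ['6']
print(toy_search(["eggs", "spam"], matching="union"))  # ['2', '5', '6']
```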

Finally, you can get a summary of a document, called a "teaser", with the teaser() method. It currently supports two output formats, html and rst. You must supply the words to highlight as a list.

    words = [u'toast']
    results = db.search(HCondition(u' '.join(words)))
    hit = results[0]
    print hit.teaser(words)  # default is 'html'
    # prints coffee: $2.00 <strong>toast</strong>: $1.00 eggs (2, …

    # Another way to get the words to highlight:
    words = results.hintWords()

    print hit.teaser(words, format='rst')
    # prints coffee: $2.00 **toast**: $1.00 eggs (2, …
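What the two output formats do to a highlighted word can be imitated with a few lines of Python 3. This sketch only wraps the listed words in markup; it does not choose an excerpt the way a real teaser does, and toy_teaser, its fmt parameter and the sample text are made up for illustration.

```python
import re


def toy_teaser(text, words, fmt="html"):
    """Wrap each listed word in <strong> tags (html) or double asterisks (rst)."""
    if not words:
        return text
    before, after = ("<strong>", "</strong>") if fmt == "html" else ("**", "**")
    alternatives = "|".join(re.escape(word) for word in words)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(lambda match: before + match.group(1) + after, text)


sample = "coffee: $2.00 toast: $1.00 eggs"
print(toy_teaser(sample, ["toast"]))
# coffee: $2.00 <strong>toast</strong>: $1.00 eggs
print(toy_teaser(sample, ["toast"], fmt="rst"))
# coffee: $2.00 **toast**: $1.00 eggs
```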

Attribute Search
~~~~~~~~~~~~~~~~

Hypy supports attribute searches through a powerful expression syntax.

    cond = HCondition()
    cond.addAttr(u'@id streq 5')

    print db.search(cond)[0][u'@id']
    # prints 5

The remaining examples are lighter going, so we can relax a little:

    # To keep the following examples clear, I define this helper:
    def attrsx(expr):
        cond = HCondition()
        cond.addAttr(expr)
        return u' '.join(db.search(cond).pluck(u'@id'))

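    # Give doc6 (and doc5) some interesting attributes
    doc6[u'maxprice'] = u'100'
    doc6[u'minprice'] = u'0'
    doc5[u'date'] = u'2009-01-01'
    doc6[u'date'] = u'2009-02-02'

    # Store the documents again. The assignments above do not update the index
    # by themselves; in these examples we rely on autoflush.
    db.putDoc(doc6)
    db.putDoc(doc5)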

Numeric attribute searches:

    print attrsx(u'maxprice numge 50')
    # prints 6

    # Note: a document that lacks the attribute is treated as having it with the
    # value "0". So every document matches the following condition:
    print attrsx(u'minprice numle 50')
    # prints 2 6 5

    # Two documents match this condition:
    print attrsx(u'minprice numle 0.99')
    # prints 2 5

Date comparisons are treated as numeric comparisons:

    print attrsx(u'date numge 2008-12-31')
    # prints 6 5

    print attrsx(u'date numge 2009-01-30')
    # prints 6

Regular expressions are supported:

    print attrsx(u'@uri strrx (pricelist.txt|spam.txt)')
    # prints 2 6

You can negate a condition, which here leaves 1 match:

    print attrsx(u'@uri !strrx (pricelist.txt|spam.txt)')
    # prints 5

Of course, you can add attribute conditions on top of a phrase search. Conditions are combined with AND by default.

    cond = HCondition(u'spam')
    cond.addAttr(u'minprice numle 50')
    print db.search(cond)[0][u'@id']
    # prints 6

Other Search Options
~~~~~~~~~~~~~~~~~~~~

Phew, that is a lot of search options, and it is still not everything. You can also limit the number of results returned, or change the order of the results.

    print db.search(HCondition(u'e*')).pluck(u'@id')
    # prints [u'5', u'2', u'6']
    print db.search(HCondition(u'e*', max=2)).pluck(u'@id')
    # prints [u'5', u'2'] .. what did you expect?

If you like max, you may also like skip.

    print db.search(HCondition(u'e*', skip=2)).pluck(u'@id')
    # prints [u'6']
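The effect of max and skip on the ranked hits shown above can be pictured as list slicing. The Python 3 sketch below is an analogy only, not a description of how the search engine applies the limits internally; the function window and its parameter max_hits are invented names.

```python
ranked_ids = ["5", "2", "6"]  # the ids in the order shown above


def window(hits, max_hits=None, skip=0):
    """Drop the first `skip` hits, then keep at most `max_hits` of the rest."""
    remaining = hits[skip:]
    if max_hits is None:
        return remaining
    return remaining[:max_hits]


print(window(ranked_ids, max_hits=2))  # ['5', '2']
print(window(ranked_ids, skip=2))      # ['6']
```

Used together, the two give simple paging: skip selects where a page starts and max_hits sets its size.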

To change the order of the results, call setOrder(order) on the condition object. The Hyper Estraier user's guide has a complete reference for order expressions. The following example changes the default order of the results by setting an order expression.

    cond = HCondition(u'e*')

    # natural (scored) order
    print db.search(cond).pluck(u'@id')
    # prints [u'5', u'2', u'6']

    # numeric ascending
    cond.setOrder(u'@id numa')
    print db.search(cond).pluck(u'@id')
    # prints [u'2', u'5', u'6']

    # numeric descending
    cond.setOrder(u'@id numd')
    print db.search(cond).pluck(u'@id')
    # prints [u'6', u'5', u'2']
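Note that the ids are strings, so "numeric" ordering matters: the values are compared as numbers, not as text. A Python 3 sketch of the same two orderings on a plain list, using only built-in sorting and no Hypy calls:

```python
scored_order = ["5", "2", "6"]  # natural (scored) order from the example above

# Compare the id strings by their integer value, as a numeric order would.
ascending = sorted(scored_order, key=int)
descending = sorted(scored_order, key=int, reverse=True)

print(ascending)   # ['2', '5', '6']
print(descending)  # ['6', '5', '2']

# With multi-digit ids, text order and numeric order differ:
print(sorted(["10", "9"]))           # ['10', '9']
print(sorted(["10", "9"], key=int))  # ['9', '10']
```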

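Further Reading
---------------

That is plenty for one tutorial. If you still want more, keep digging:

1. The API documentation, which is really good.
2. The Hyper Estraier user's guide, which describes the search phrase syntax, the attribute search syntax and the order syntax.
3. Hypy's unit tests, which contain many examples of the search syntax, especially testdatabase.test_queries and testdatabase.test_condextras. These tests cover 100% of the code in lib.py. They have thorough docstrings and comments, and include examples of skip and max searches and of all the various comparisons.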
