

Importing RSS Feeds in the Naive Bayes Chapter of "Machine Learning in Action"

2019-11-14 17:41:32

I got stuck at this point while following the code in the book. Since other readers are likely to hit the same problems, I'm writing my notes down to share.

A quick rant first: I suspect the main reason most readers get stuck here is the Great Firewall, which means anyone who can already get around it (whether by software or by physically being abroad) will probably never see this post. If you'd rather not read on, feel free to apply your circumvention skills instead.

 

How do I install feedparser?

Installing feedparser from the URL given in the book fails with an error saying setuptools is missing. The official advice is that Windows users should install setuptools via ez_setup.py, but I couldn't download ez_setup.py from the official site. This post gives a workaround: http://adesquared.wordpress.com/2013/07/07/setting-up-python-and-easy_install-on-windows-7/

Copy ez_setup.py into the C:\Python27 folder and run:

python ez_setup.py install

Then change into the folder containing the feedparser sources and run:

python setup.py install
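If both commands complete without errors, a quick check from the Python shell confirms the install. Run it from some directory other than the feedparser source folder, so Python doesn't pick up the local copy instead of the installed one:

>>> import feedparser            # no ImportError means the install worked
>>> feedparser.__version__       # prints whatever version you downloaded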

 

What if the RSS feed the author uses, "http://newyork.craigslist.org/stp/index.rss", is unreachable?

In the book, posts from the feed http://newyork.craigslist.org/stp/index.rss are labeled class 1, and posts from the feed http://sfbay.craigslist.org/stp/index.rss are labeled class 0.

To get the example code running, any two working RSS feeds will do as substitutes.

I used these two feeds:

NASA Image of the Day: http://www.nasa.gov/rss/dyn/image_of_the_day.rss

Yahoo Sports - NBA - Houston Rockets News: http://sports.yahoo.com/nba/teams/hou/rss.xml

In other words, if the algorithm runs correctly, every article from NASA should be classified as 1, and every Houston Rockets story from Yahoo Sports as 0.
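Before going further, it's worth a quick sanity check that both replacement feeds parse and actually return entries (entry counts vary over time; any nonzero count will do):

import feedparser

ny = feedparser.parse('http://www.nasa.gov/rss/dyn/image_of_the_day.rss')   # class 1
sf = feedparser.parse('http://sports.yahoo.com/nba/teams/hou/rss.xml')      # class 0
print 'NASA entries:   ', len(ny['entries'])
print 'Rockets entries:', len(sf['entries'])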

 

With my own RSS feeds, the program errors out at trainNB0(array(trainMat),array(trainClasses)). What's wrong?

Judging by the book's output, the author's feeds carry plenty of articles: len(ny['entries']) is 100. The feeds I found have only around 10-20 entries each.

>>> import feedparser
>>> ny = feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
>>> ny['entries']
>>> len(ny['entries'])
100

Because each of the author's feeds holds 100 articles, he can afford to strip the 30 highest-frequency words ("stop words") and still randomly hold out 20 articles as a test set. With a substitute feed of only 10 articles, trying to hold out 20 obviously fails. Simply scale the test-set size down to match your data and the code runs; and if the articles contain few words, removing fewer "stop words" improves the classifier's accuracy.
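A minimal sketch of that adjustment inside localWords(), keeping the book's structure; numTestDocs and the 20% holdout ratio are my own choices, not the book's:

    # create the test set: hold out ~20% of the 2*minLen documents
    # instead of the book's hard-coded 20 (random comes from numpy)
    trainingSet = range(2*minLen); testSet = []
    numTestDocs = max(1, int(0.2 * 2 * minLen))
    for i in range(numTestDocs):
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])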

 

What if I want to drop "stop words" without removing the 30 highest-frequency words?

Store the stop words in a txt file and read it at runtime, in place of the code that removes the high-frequency words. For a reference list of which words to drop, see http://www.ranks.nl/stopwords

For the code below to run, save the stop words into stopword.txt.

My txt file holds the following words, and they work reasonably well (a focused sketch of the file-reading swap follows the list):

a
about
above
after
again
against
all
am
an
and
any
are
aren't
as
at
be
because
been
before
being
below
between
both
but
by
can't
cannot
could
couldn't
did
didn't
do
does
doesn't
doing
don't
down
during
each
few
for
from
further
had
hadn't
has
hasn't
have
haven't
having
he
he'd
he'll
he's
her
here
here's
hers
herself
him
himself
his
how
how's
i
i'd
i'll
i'm
i've
if
in
into
is
isn't
it
it's
its
itself
let's
me
more
most
mustn't
my
myself
no
nor
not
of
off
on
once
only
or
other
ought
our
ours
ourselves
out
over
own
same
shan't
she
she'd
she'll
she's
should
shouldn't
so
some
such
than
that
that's
the
their
theirs
them
themselves
then
there
there's
these
they
they'd
they'll
they're
they've
this
those
through
to
too
under
until
up
very
was
wasn't
we
we'd
we'll
we're
we've
were
weren't
what
what's
when
when's
where
where's
which
while
who
who's
whom
why
why's
with
won't
would
wouldn't
you
you'd
you'll
you're
you've
your
yours
yourself
yourselves
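Here is the sketch of that swap, assuming stopword.txt sits in the working directory: read the file once, then filter the vocabulary with it instead of calling calcMostFreq(). The full listing below integrates the same idea:

import re

def stopWords():
    # one read of the whole file, split on runs of non-word characters
    wordList = open('stopword.txt').read()    # word list from http://www.ranks.nl/stopwords
    listOfTokens = re.split(r'\W+', wordList)
    return [tok.lower() for tok in listOfTokens if tok]

# in localWords(), replace the calcMostFreq()/top-30 removal with:
stopWordList = stopWords()
for stopWord in stopWordList:
    if stopWord in vocabList:
        vocabList.remove(stopWord)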

 

The complete listing I ran (Python 2.7), with the stop-word removal wired in:

'''
Created on Oct 19, 2010
@author: Peter
'''
from numpy import *

def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'my', 'dog', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]    # 1 is abusive, 0 not
    return postingList, classVec

def createVocabList(dataSet):
    vocabSet = set([])  # create empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union of the two sets
    return list(vocabSet)

def bagOfWords2Vec(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
        else:
            print "the word: %s is not in my Vocabulary!" % word
    return returnVec

def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory)/float(numTrainDocs)
    p0Num = ones(numWords); p1Num = ones(numWords)      # change to ones(): Laplace smoothing
    p0Denom = 2.0; p1Denom = 2.0                        # change to 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = log(p1Num/p1Denom)          # change to log() to avoid underflow
    p0Vect = log(p0Num/p0Denom)          # change to log()
    return p0Vect, p1Vect, pAbusive

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)    # element-wise mult
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec

def testingNB():
    print '*** load dataset for training ***'
    listOPosts, listClasses = loadDataSet()
    print 'listOPost:\n', listOPosts
    print 'listClasses:\n', listClasses
    print '\n*** create Vocab List ***'
    myVocabList = createVocabList(listOPosts)
    print 'myVocabList:\n', myVocabList
    print '\n*** Vocab show in post Vector Matrix ***'
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(bagOfWords2Vec(myVocabList, postinDoc))
    print 'train matrix:', trainMat
    print '\n*** train P0V p1V pAb ***'
    p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
    print 'p0V:\n', p0V
    print 'p1V:\n', p1V
    print 'pAb:\n', pAb
    print '\n*** classify ***'
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(bagOfWords2Vec(myVocabList, testEntry))
    print testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb)
    testEntry = ['stupid', 'garbage']
    thisDoc = array(bagOfWords2Vec(myVocabList, testEntry))
    print testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb)

def textParse(bigString):    # input is big string, output is word list
    import re
    listOfTokens = re.split(r'\W*', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

def spamTest():
    docList = []; classList = []; fullText = []
    for i in range(1, 26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)    # create vocabulary
    trainingSet = range(50); testSet = []   # create test set
    for i in range(10):
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat = []; trainClasses = []
    for docIndex in trainingSet:    # train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
    errorCount = 0
    for docIndex in testSet:    # classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
            print "classification error", docList[docIndex]
    print 'the error rate is: ', float(errorCount)/len(testSet)
    #return vocabList, fullText

def calcMostFreq(vocabList, fullText):
    import operator
    freqDict = {}
    for token in vocabList:
        freqDict[token] = fullText.count(token)
    sortedFreq = sorted(freqDict.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedFreq[:30]

def stopWords():
    import re
    wordList = open('stopword.txt').read()    # see http://www.ranks.nl/stopwords
    listOfTokens = re.split(r'\W*', wordList)
    print 'read stop word from \'stopword.txt\':', listOfTokens
    return [tok.lower() for tok in listOfTokens]

def localWords(feed1, feed0):
    import feedparser
    docList = []; classList = []; fullText = []
    print 'feed1 entries length: ', len(feed1['entries']), '\nfeed0 entries length: ', len(feed0['entries'])
    minLen = min(len(feed1['entries']), len(feed0['entries']))
    print '\nmin Length: ', minLen
    for i in range(minLen):
        wordList = textParse(feed1['entries'][i]['summary'])
        print '\nfeed1\'s entries[', i, ']\'s summary - ', 'parse text:\n', wordList
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)    # NY is class 1
        wordList = textParse(feed0['entries'][i]['summary'])
        print '\nfeed0\'s entries[', i, ']\'s summary - ', 'parse text:\n', wordList
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)    # create vocabulary
    print '\nVocabList is ', vocabList
    print '\nRemove Stop Word:'
    stopWordList = stopWords()
    for stopWord in stopWordList:
        if stopWord in vocabList:
            vocabList.remove(stopWord)
            print 'Removed: ', stopWord
##    top30Words = calcMostFreq(vocabList, fullText)    # remove top 30 words
##    print '\nTop 30 words: ', top30Words
##    for pairW in top30Words:
##        if pairW[0] in vocabList:
##            vocabList.remove(pairW[0])
##            print '\nRemoved: ', pairW[0]
    trainingSet = range(2*minLen); testSet = []    # create test set
    print '\n\nBegin to create a test set: \ntrainingSet:', trainingSet, '\ntestSet', testSet
    for i in range(5):
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    print 'random select 5 sets as the testSet:\ntrainingSet:', trainingSet, '\ntestSet', testSet
    trainMat = []; trainClasses = []
    for docIndex in trainingSet:    # train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    print '\ntrainMat length:', len(trainMat)
    print '\ntrainClasses', trainClasses
    print '\n\ntrainNB0:'
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
    #print '\np0V:', p0V, '\np1V', p1V, '\npSpam', pSpam
    errorCount = 0
    for docIndex in testSet:    # classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        classifiedClass = classifyNB(array(wordVector), p0V, p1V, pSpam)
        originalClass = classList[docIndex]
        result = classifiedClass != originalClass
        if result:
            errorCount += 1
        print '\n', docList[docIndex], '\nis classified as: ', classifiedClass, ', while the original class is: ', originalClass, '. --', not result
    print '\nthe error rate is: ', float(errorCount)/len(testSet)
    return vocabList, p0V, p1V

def testRSS():
    import feedparser
    ny = feedparser.parse('http://www.nasa.gov/rss/dyn/image_of_the_day.rss')
    sf = feedparser.parse('http://sports.yahoo.com/nba/teams/hou/rss.xml')
    vocabList, pSF, pNY = localWords(ny, sf)

def testTopWords():
    import feedparser
    ny = feedparser.parse('http://www.nasa.gov/rss/dyn/image_of_the_day.rss')
    sf = feedparser.parse('http://sports.yahoo.com/nba/teams/hou/rss.xml')
    getTopWords(ny, sf)

def getTopWords(ny, sf):
    import operator
    vocabList, p0V, p1V = localWords(ny, sf)
    topNY = []; topSF = []
    for i in range(len(p0V)):
        if p0V[i] > -6.0: topSF.append((vocabList[i], p0V[i]))
        if p1V[i] > -6.0: topNY.append((vocabList[i], p1V[i]))
    sortedSF = sorted(topSF, key=lambda pair: pair[1], reverse=True)
    print "SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**"
    for item in sortedSF:
        print item[0]
    sortedNY = sorted(topNY, key=lambda pair: pair[1], reverse=True)
    print "NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**"
    for item in sortedNY:
        print item[0]

def test42():
    print '\n*** Load DataSet ***'
    listOPosts, listClasses = loadDataSet()
    print 'List of posts:\n', listOPosts
    print 'List of Classes:\n', listClasses
    print '\n*** Create Vocab List ***'
    myVocabList = createVocabList(listOPosts)
    print 'Vocab List from posts:\n', myVocabList
    print '\n*** Vocab show in post Vector Matrix ***'
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(bagOfWords2Vec(myVocabList, postinDoc))
    print 'Train Matrix:\n', trainMat
    print '\n*** Train ***'
    p0V, p1V, pAb = trainNB0(trainMat, listClasses)
    print 'p0V:\n', p0V
    print 'p1V:\n', p1V
    print 'pAb:\n', pAb
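To run the whole example, save the listing as bayes.py (the module name the book uses for this chapter) and call the two drivers from an interactive session. Expect an occasional classification error; the test set is tiny and randomly chosen:

>>> import bayes
>>> bayes.testRSS()        # train on the two feeds and print the error rate
>>> bayes.testTopWords()   # print the most indicative words for each feed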
