[Python Data Mining] Part 10: Supplementary Practical Code for Pandas, Matplotlib, and PCA Plotting

2019-11-06 06:07:06


Contributed by: a reader

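While recently organizing student work and courseware from the "Data Mining and Analysis" course, I collected several useful code snippets, which this article shares for study. The further I go in data analysis, the more I feel that, unless you are researching new algorithms, the data itself is what matters most and is where the real value lies. Later installments will gradually cover using Python with Hadoop and Spark, as well as networkx and other data science topics. If the article contains mistakes or shortcomings, please bear with me. I hope it helps.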

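1. Loading and displaying a dataset with Pandas

We use Pandas to run a time-series analysis on a dataset of commercial housing prices from 2002 to 2014, picking several cities to compare against Guiyang and then analyzing Guiyang's prices on their own.

The dataset is 32.csv, with the values below (readers can copy them directly; save the file comma-separated, since pd.read_csv expects commas by default):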

year  Beijing   Chongqing  Shenzhen  Guiyang  Kunming  Shanghai  Wuhai    Changsha
2002  4764.00   1556.00    5802.00   1643.00  2276.00  4134.00   1928.00  1802.00
2003  4737.00   1596.00    6256.00   1949.00  2233.00  5118.00   2072.00  2040.00
2004  5020.93   1766.24    6756.24   1801.68  2473.78  5855.00   2516.32  2039.09
2005  6788.09   2134.99    7582.27   2168.90  2639.72  6842.00   3061.77  2313.73
2006  8279.51   2269.21    9385.34   2372.66  2903.32  7196.00   3689.64  2644.15
2007  11553.26  2722.58    14049.69  2901.63  3108.12  8361.00   4664.03  3304.74
2008  12418.00  2785.00    12665.00  3149.00  3750.00  8195.00   4781.00  3288.00
2009  13799.00  3442.00    14615.00  3762.00  3807.00  12840.00  5329.00  3648.00
2010  17782.00  4281.00    19170.00  4410.00  3660.00  14464.00  5746.00  4418.00
2011  16851.95  4733.84    21350.13  5069.52  4715.23  14603.24  7192.90  5862.39
2012  17021.63  5079.93    19589.82  4846.14  5744.68  14061.37  7344.05  6100.87
2013  18553.00  5569.00    24402.00  5025.00  5795.00  16420.00  7717.00  6292.00
2014  18833.00  5519.00    24723.00  5608.00  6384.00  16787.00  7951.00  6116.00

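The code for plotting and comparing the housing prices of all the cities is as follows:

# -*- coding: utf-8 -*-
"""
Created on Mon Mar 06 10:55:17 2017
@author: eastmount
"""
import pandas as pd
import matplotlib.pyplot as plt

# index_col names the column used as the row index
data = pd.read_csv("32.csv", index_col='year')

# Show the shape and the first 6 rows
print(data.shape)
print(data.head(6))

plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False    # display minus signs correctly
data.plot()
plt.savefig('時序圖.png', dpi=500)  # the filename means "time-series plot"
plt.show()

The script prints the shape and the first six rows, then draws a line chart with one curve per city.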
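Key points: 1. plt.rcParams configures the display of Chinese text and minus signs; 2. plt.savefig saves the figure to disk; 3. pandas reads the data and plots it directly, with index_col choosing the index column.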
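The three key points can be tried without the CSV file. The following is a minimal sketch, not the author's code: it assumes a headless environment (hence the `Agg` backend), embeds only the first three rows of three columns from the table above as a string, and writes to a made-up filename, `prices_demo.png`. The SimHei font setting is left out because that font may not be installed.

```python
import io
import os

import matplotlib
matplotlib.use("Agg")  # assumption: no display available; remove to open a window
import matplotlib.pyplot as plt
import pandas as pd

# First three rows of 32.csv (three columns only), written with commas so that
# read_csv's default separator works.
csv_text = """year,Beijing,Guiyang,Shanghai
2002,4764.00,1643.00,4134.00
2003,4737.00,1949.00,5118.00
2004,5020.93,1801.68,5855.00
"""

# index_col='year' turns the year column into the row index.
data = pd.read_csv(io.StringIO(csv_text), index_col="year")

plt.rcParams["axes.unicode_minus"] = False  # keep minus signs readable with CJK fonts

ax = data.plot()              # one line per column, index on the x-axis
ax.set_ylabel("price")

out_path = "prices_demo.png"  # hypothetical output name
ax.figure.savefig(out_path, dpi=100)
plt.close(ax.figure)
```

Because `year` is the index, it does not show up as a plotted line; only the city columns do.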
2. Selecting a column with Pandas and drawing a bar chart

Continuing the experiment above, we now extract the Guiyang column and plot it.

# -*- coding: utf-8 -*-
"""
Created on Mon Mar 06 10:55:17 2017
@author: eastmount
"""
import pandas as pd
import matplotlib.pyplot as plt

# index_col names the column used as the row index
data = pd.read_csv("32.csv", index_col='year')

# Show the shape and the first 6 rows
print(data.shape)
print(data.head(6))

plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False    # display minus signs correctly
data.plot()
plt.savefig('時序圖.png', dpi=500)  # the filename means "time-series plot"
plt.show()

# Select the Guiyang column and plot it
gy = data['Guiyang']
print('Guiyang data:')
print(gy)
gy.plot()
plt.show()

data['Guiyang'] selects a single column, which is then plotted on its own as a line chart of Guiyang prices by year.
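Calling the bar function on the same data draws the corresponding bar chart. Note that the x-axis is the year, so the plot needs two sequences: the years and the prices.

# -*- coding: utf-8 -*-
"""
Created on Mon Mar 06 10:55:17 2017
@author: eastmount
"""
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # missing in the original snippet

# index_col names the column used as the row index
data = pd.read_csv("32.csv", index_col='year')

# Show the shape and the first 6 rows
print(data.shape)
print(data.head(6))

# Select the Guiyang column
gy = data['Guiyang']
print('Guiyang data:')
print(gy)

x = ['2002', '2003', '2004', '2005', '2006', '2007', '2008',
     '2009', '2010', '2011', '2012', '2013', '2014']
N = 13
ind = np.arange(N)  # bar positions 0..12
width = 0.35
plt.bar(ind, gy, width, color='r', label='sum num')

# Label the ticks along the bottom, rotated 40 degrees.
# Bars are centered on ind by default in matplotlib 2.0+, so the ticks go at ind
# rather than at ind + width/2.
plt.xticks(ind, x, rotation=40)
plt.title('The price of Guiyang')
plt.xlabel('year')
plt.ylabel('price')
plt.savefig('guiyang.png', dpi=400)
plt.show()

The output is a red bar chart of Guiyang prices, one bar per year.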
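The year labels do not have to be typed by hand: since `year` is the index of the data, it can supply the tick labels. The sketch below is one possible way to do this, assuming a headless `Agg` backend, the Guiyang values for 2002 to 2006 copied from the table in section 1, and a hypothetical output filename.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Guiyang prices for 2002-2006, copied from the table in section 1.
gy = pd.Series(
    [1643.00, 1949.00, 1801.68, 2168.90, 2372.66],
    index=pd.Index([2002, 2003, 2004, 2005, 2006], name="year"),
    name="Guiyang",
)

ind = np.arange(len(gy))             # bar positions 0..len(gy)-1
labels = [str(y) for y in gy.index]  # tick labels derived from the index

fig, ax = plt.subplots()
bars = ax.bar(ind, gy.values, width=0.35, color="r")
ax.set_xticks(ind)
ax.set_xticklabels(labels, rotation=40)
ax.set_title("The price of Guiyang")
ax.set_xlabel("year")
ax.set_ylabel("price")
fig.savefig("guiyang_demo.png", dpi=100)  # hypothetical output name
plt.close(fig)
```

Deriving the labels from the index keeps the chart in step with the data if rows are added or removed.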
Here is an additional snippet that draws a histogram with hist:

import numpy as np
import matplotlib.pyplot as plt

# Make an array of random numbers with a Gaussian distribution:
# mean = 5.0, standard deviation = 3.0, number of points = 1000
data = np.random.normal(5.0, 3.0, 1000)

# Make a histogram of the data array; 'stepfilled' removes the black outlines
plt.hist(data, histtype='stepfilled')

# Make plot labels
plt.xlabel('data')
plt.show()

The output is a filled histogram of the sampled values.

Recommended article: http://www.cnblogs.com/jasonfreak/p/5441512.html
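3. Plotting a time series in Python: the autocorrelation plot

The core code is as follows:

# -*- coding: utf-8 -*-
"""
Created on Mon Mar 06 10:55:17 2017
@author: yxz15
"""
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.tsa.stattools import adfuller as ADF

data = pd.read_csv("32.csv", index_col='year')

# Show the shape and the first 6 rows
print(data.shape)
print(data.head(6))

plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
data.plot()
plt.savefig('時序圖.png', dpi=500)  # the filename means "time-series plot"
plt.show()

# Autocorrelation plot for the Guiyang series
gy = data['Guiyang']
print(gy)
fig = plot_acf(gy)                       # plot_acf returns the Figure it draws on
fig.savefig('貴陽自相關圖.png', dpi=300)  # "Guiyang autocorrelation plot"; save before showing
plt.show()

# Augmented Dickey-Fuller unit-root test
print('ADF:', ADF(gy))

Running it draws the autocorrelation plot for the Guiyang series and prints the result of the ADF test.

Recommended articles on time series:
        Python time series analysis
        Regression analysis of individual stocks against the index (Python)
        Python Statsmodels package: time series analysis, ARIMA model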
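To see what an autocorrelation plot measures, the coefficients can be computed by hand. The sketch below uses the usual textbook (biased) sample estimator, r_k = sum((x_t - mean)(x_{t+k} - mean)) / sum((x_t - mean)^2), on the Guiyang column from the table in section 1. The helper `sample_acf` is a name made up for this illustration, and it is an assumption, not something checked here, that these values match the bars drawn by plot_acf.

```python
import numpy as np


def sample_acf(x, nlags):
    """Biased sample autocorrelation r_0..r_nlags of a 1-D sequence."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()          # deviations from the mean
    denom = np.sum(d * d)     # lag-0 sum of squares
    n = len(x)
    return np.array([np.sum(d[k:] * d[:n - k]) / denom for k in range(nlags + 1)])


# Guiyang prices 2002-2014, copied from the table in section 1.
guiyang = [1643.00, 1949.00, 1801.68, 2168.90, 2372.66, 2901.63, 3149.00,
           3762.00, 4410.00, 5069.52, 4846.14, 5025.00, 5608.00]

r = sample_acf(guiyang, nlags=5)
print(np.round(r, 3))
```

For a steadily rising series like this one, the low-lag coefficients stay positive, which is a hint that the series is not stationary and a reason to run a test such as ADF.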
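4. Cluster analysis of a Dalian Commodity Exchange dataset

This part mainly provides a URL for downloading datasets. Earlier articles mentioned that sklearn ships with some datasets and that the UCI repository offers many more. Here we use a dataset from the Dalian Commodity Exchange. Address: http://www.dce.com.cn/dalianshangpin/xqsj/lssj/index.html#

For example, download the "coke" dataset, name it "35.csv", and then run a cluster analysis on it. The code is as follows:

# -*- coding: utf-8 -*-
"""
Created on Mon Mar 06 10:19:15 2017
@author: yxz15
"""
# Part 1: load the dataset
import pandas as pd
Coke1 = pd.read_csv("35.csv")
print(Coke1[:4])

# Part 2: clustering
from sklearn.cluster import KMeans
clf = KMeans(n_clusters=3)
pre = clf.fit_predict(Coke1)
print(pre[:4])

# Part 3: dimensionality reduction to 2 components for plotting
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
newData = pca.fit_transform(Coke1)
print(newData[:4])
x1 = newData[:, 0]
x2 = newData[:, 1]

# Part 4: plot with matplotlib, colored by cluster label
import matplotlib.pyplot as plt
plt.title("KMeans clusters (PCA projection)")  # the original had a bare plt.title with no call
plt.xlabel("x feature")
plt.ylabel("y feature")
plt.scatter(x1, x2, c=pre, marker='x')
plt.savefig("bankloan.png", dpi=400)
plt.show()

The output is a scatter plot of the two PCA components, with each point colored by its cluster.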
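KMeans and PCA both expect purely numeric input. The layout of 35.csv is not shown in the article, so it is only an assumption that the downloaded file might contain text columns such as a date; if it does, one option is to keep the numeric columns before clustering. The sketch below illustrates this on synthetic data; the column names `date`, `open`, `close`, and `volume` are hypothetical, and the values are random, not exchange data.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic stand-in for 35.csv; column names and values are made up.
rng = np.random.default_rng(0)
n = 30
df = pd.DataFrame({
    "date": pd.date_range("2017-01-01", periods=n).strftime("%Y-%m-%d"),
    "open": rng.normal(1500, 100, n),
    "close": rng.normal(1500, 100, n),
    "volume": rng.normal(20000, 3000, n),
})

# Keep numeric columns only and drop rows with missing values.
features = df.select_dtypes(include="number").dropna()

# Cluster into 3 groups; a fixed random_state makes the run repeatable.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)

# Project to 2 components for plotting.
reduced = PCA(n_components=2).fit_transform(features)

print(labels[:4])
print(reduced[:4])
```

The `labels` and `reduced` arrays play the same roles as `pre` and `newData` in the code above, so they can be passed to plt.scatter in the same way.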
5. PCA dimensionality reduction and plotting code

The PCA plot follows this blog post: http://blog.csdn.net/xiaolewennofollow/article/details/46127485

The code is as follows:

# -*- coding: utf-8 -*-
"""
Created on Mon Mar 06 21:47:46 2017
@author: yxz
"""
import numpy as np
import matplotlib.pyplot as plt


def loadDataSet(fileName, delim='\t'):  # the original had '/t', which is not a tab
    # Parse a delimited text file into a 2-D float array, skipping blank lines
    with open(fileName) as fr:
        stringArr = [line.strip().split(delim) for line in fr if line.strip()]
    return np.array([[float(v) for v in row] for row in stringArr])


def pca(dataMat, topNfeat=9999999):
    meanVals = np.mean(dataMat, axis=0)
    meanRemoved = dataMat - meanVals              # center the data
    covMat = np.cov(meanRemoved, rowvar=False)    # covariance of the columns
    eigVals, eigVects = np.linalg.eigh(covMat)    # eigh: covMat is symmetric
    eigValInd = np.argsort(eigVals)[:-(topNfeat + 1):-1]  # largest eigenvalues first
    redEigVects = eigVects[:, eigValInd]
    print(meanRemoved)
    print(redEigVects)
    lowDDatMat = meanRemoved @ redEigVects                 # project onto top components
    reconMat = lowDDatMat @ redEigVects.T + meanVals       # map back to the original space
    return lowDDatMat, reconMat


def plotPCA(dataMat, reconMat):
    datArr = np.asarray(dataMat)
    reconArr = np.asarray(reconMat)
    fig = plt.figure()
    ax = fig.add_subplot(111)
    # Original data: red triangles
    ax.scatter(datArr[:, 0], datArr[:, 1], s=90, c='red', marker='^')
    # Reconstruction from the kept components: yellow circles
    ax.scatter(reconArr[:, 0], reconArr[:, 1], s=50, c='yellow', marker='o')
    plt.title('PCA')
    plt.savefig('ccc.png', dpi=400)
    plt.show()


dataMat = loadDataSet('41.txt')
lowDMat, reconMat = pca(dataMat, 1)
plotPCA(dataMat, reconMat)

The output is a scatter plot in which PCA reduces the dimensionality of the dataset: the red triangles (the original data) are projected onto the line formed by the yellow points, so a plane is reduced to a line. In essence, PCA diagonalizes the covariance matrix: it decomposes the n*n symmetric matrix into its eigenvectors and projects the centered data onto that basis, keeping here only the eigenvector with the largest eigenvalue. The dataset is 41.txt, with two tab-separated values per row, as follows:
61.5	55
59.8	61
56.9	65
62.4	58
63.3	58
62.8	57
62.3	57
61.9	55
65.1	61
59.4	61
64	55
62.8	56
60.4	61
62.2	54
60.2	62
60.9	58
62	54
63.4	54
63.8	56
62.7	59
63.3	56
63.8	55
61	57
59.4	62
58.1	62
60.4	58
62.5	57
62.2	57
60.5	61
60.9	57
60	57
59.8	57
60.7	59
59.5	58
61.9	58
58.2	59
64.1	59
64	54
60.8	59
61.8	55
61.2	56
61.1	56
65.2	56
58.4	63
63.1	56
62.4	58
61.8	55
63.8	56
63.3	60
60.7	60
60.9	61
61.9	54
60.9	55
61.6	58
59.3	62
61	59
59.3	61
62.6	57
63	57
63.2	55
60.9	57
62.6	59
62.5	57
62.1	56
61.5	59
61.4	56
62	55.3
63.3	57
61.8	58
60.7	58
61.5	60
63.1	56
62.9	59
62.5	57
63.7	57
59.2	60
59.9	58
62.4	54
62.8	60
62.6	59
63.4	59
62.1	60
62.9	58
61.6	56
57.9	60
62.3	59
61.2	58
60.8	59
60.7	58
62.9	58
62.5	57
55.1	69
61.6	56
62.4	57
63.8	56
57.5	58
59.4	62
66.3	62
61.6	59
61.5	58
63.2	56
59.9	54
61.6	55
61.7	58
62.9	56
62.2	55
63	59
62.3	55
58.8	57
62	55
61.4	57
62.2	56
63	58
62.2	59
62.6	56
62.7	53
61.7	58
62.4	54
60.7	58
59.9	59
62.3	56
62.3	54
61.7	63
64.5	57
65.3	55
61.6	60
61.4	56
59.6	57
64.4	57
65.7	60
62	56
63.6	58
61.9	59
62.6	60
61.3	60
60.9	60
60.1	62
61.8	59
61.2	57
61.9	56
60.9	57
59.8	56
61.8	55
60	57
61.6	55
62.1	64
63.3	59
60.2	56
61.1	58
60.9	57
61.7	59
61.3	56
62.5	60
61.4	59
62.9	57
62.4	57
60.7	56
60.7	58
61.5	58
59.9	57
59.2	59
60.3	56
61.7	60
61.9	57
61.9	55
60.4	59
61	57
61.5	55
61.7	56
59.2	61
61.3	56
58	62
60.2	61
61.7	55
62.7	55
64.6	54
61.3	61
63.7	56.4
62.7	58
62.2	57
61.6	56
61.5	57
61.8	56
60.7	56
59.7	60.5
60.5	56
62.7	58
62.1	58
62.8	57
63.8	58
57.8	60
62.1	55
61.1	60
60	59
61.2	57
62.7	59
61	57
61	58
61.4	57
61.8	61
59.9	63
61.3	58
60.5	58
64.1	59
67.9	60
62.4	58
63.2	60
61.3	55
60.8	56
61.7	56
63.6	57
61.2	58
62.1	54
61.5	55
61.4	59
61.8	60
62.2	56
61.2	56
60.6	63
57.5	64
61.3	56
57.2	62
62.9	60
63.1	58
60.8	57
62.7	59
62.8	60
55.1	67
61.4	59
62.2	55
63	54
63.7	56
63.6	58
62	57
61.5	56
60.5	60
61.1	60
61.8	56
63.3	56
59.4	64
62.5	55
64.5	58
62.7	59
64.2	52
63.7	54
60.4	58
61.8	58
63.2	56
61.6	56
61.6	56
60.9	57
61	61
62.1	57
60.9	60
61.3	60
65.8	59
61.3	56
58.8	59
62.3	55
60.1	62
61.8	59
63.6	55.8
62.2	56
59.2	59
61.8	59
61.3	55
62.1	60
60.7	60
59.6	57
62.2	56
60.6	57
62.9	57
64.1	55
61.3	56
62.7	55
63.2	56
60.7	56
61.9	60
62.6	55
60.7	60
62	60
63	57
58	59
62.9	57
58.2	60
63.2	58
61.3	59
60.3	60
62.7	60
61.3	58
61.6	60
61.9	55
61.7	56
61.9	58
61.8	58
61.6	56
58.8	66
61	57
67.4	60
63.4	60
61.5	59
58	62
62.4	54
61.9	57
61.6	56
62.2	59
62.2	58
61.3	56
62.3	57
61.8	57
62.5	59
62.9	60
61.8	59
62.3	56
59	70
60.7	55
62.5	55
62.7	58
60.4	57
62.1	58
57.8	60
63.8	58
62.8	57
62.2	58
62.3	58
59.9	58
61.9	54
63	55
62.4	58
62.9	58
63.5	56
61.3	56
60.6	54
65.1	58
62.6	58
58	62
62.4	61
61.3	57
59.9	60
60.8	58
63.5	55
62.2	57
63.8	58
64	57
62.5	56
62.3	58
61.7	57
62.2	58
61.5	56
61	59
62.2	56
61.5	54
67.3	59
61.7	58
61.9	56
61.8	58
58.7	66
62.5	57
62.8	56
61.1	68
64	57
62.5	60
60.6	58
61.6	55
62.2	58
60	57
61.9	57
62.8	57
62	57
66.4	59
63.4	56
60.9	56
63.1	57
63.1	59
59.2	57
60.7	54
64.6	56
61.8	56
59.9	60
61.7	55
62.8	61
62.7	57
63.4	58
63.5	54
65.7	59
68.1	56
63	60
59.5	58
63.5	59
61.7	58
62.7	58
62.8	58
62.4	57
61	59
63.1	56
60.7	57
60.9	59
60.1	55
62.9	58
63.3	56
63.8	55
62.9	57
63.4	60
63.9	55
61.4	56
61.9	55
62.4	55
61.8	58
61.5	56
60.4	57
61.8	55
62	56
62.3	56
61.6	56
60.6	56
58.4	62
61.4	58
61.9	56
62	56
61.5	57
62.3	58
60.9	61
62.4	57
55	61
58.6	60
62	57
59.8	58
63.4	55
64.3	58
62.2	59
61.7	57
61.1	59
61.5	56
58.5	62
61.7	58
60.4	56
61.4	56
61.5	55
61.4	56
65	56
56	60
60.2	59
58.3	58
53.1	63
60.3	58
61.4	56
60.1	57
63.4	55
61.5	59
62.7	56
62.5	55
61.3	56
60.2	56
62.7	57
62.3	58
61.5	56
59.2	59
61.8	59
61.3	55
61.4	58
62.8	55
62.8	64
62.4	61
59.3	60
63	60
61.3	60
59.3	62
61	57
62.9	57
59.6	57
61.8	60
62.7	57
65.3	62
63.8	58
62.3	56
59.7	63
64.3	60
62.9	58
62	57
61.6	59
61.9	55
61.3	58
63.6	57
59.6	61
62.2	59
61.7	55
63.2	58
60.8	60
60.3	59
60.9	60
62.4	59
60.2	60
62	55
60.8	57
62.1	55
62.7	60
61.3	58
60.2	60
60.7	56
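The statement in section 5 that PCA diagonalizes the covariance matrix can be checked numerically. The sketch below is an illustration rather than part of the original post: it takes only the first eight rows of 41.txt, uses NumPy's `eigh` on the symmetric covariance matrix, and confirms that the eigenvector basis turns the covariance into a diagonal matrix and that a one-component reconstruction lies on a single line.

```python
import numpy as np

# First eight rows of 41.txt, as listed above.
X = np.array([[61.5, 55], [59.8, 61], [56.9, 65], [62.4, 58],
              [63.3, 58], [62.8, 57], [62.3, 57], [61.9, 55]])

mean = X.mean(axis=0)
centered = X - mean
cov = np.cov(centered, rowvar=False)     # 2x2 symmetric covariance matrix

eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]        # reorder so the largest comes first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Diagonalization: V^T C V has the eigenvalues on its diagonal and zeros elsewhere.
diag = eigvecs.T @ cov @ eigvecs

top = eigvecs[:, :1]                     # keep the first principal component
low = centered @ top                     # 8x1 projection
recon = low @ top.T + mean               # 8x2 reconstruction, all points on one line

explained = eigvals[0] / eigvals.sum()   # share of variance kept by one component
print(np.round(diag, 6))
print(round(float(explained), 3))
```

The reconstructed points play the role of the yellow circles in the plot from section 5: they differ from the originals only by the variance carried along the discarded eigenvector.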
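        Finally, I hope this article is useful to you, especially to my students and to readers who are getting started with data mining and machine learning. It is mainly a record of code snippets, kept as online notes.
        One revel, one light dance; one dream, one cycle of rebirth. One song, one life; one lifetime, one wish.
       (By: Eastmount, 2017-03-07, 3:30 PM, http://blog.csdn.net/eastmount/ )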