
Web Crawler + HtmlAgilityPack + Windows Service: Crawling 200,000 Blog Posts from Cnblogs

2019-11-14 15:52:36
Source: reposted
Contributed by: a site user

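1. Preface

I have recently been working on a project at my company that needs a body of article data. My first thought was to use a web crawler to collect articles from a few technical sites, and since Cnblogs (博客園) is the one I visit most often, that is how this article came about.

Source code: CSDN download link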

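2. Preparation

The data crawled from Cnblogs has to be stored somewhere, and a database is the obvious choice. So we first create a database and then a table to hold the data. Both steps are straightforward. The table has the following columns.

BlogArticleId is the auto-increment ID of the post, BlogTitle the post title, BlogUrl the post URL, BlogAuthor the author, BlogTime the publication time, BlogMotto the author's motto, BlogDepth the depth at which the crawler reached the post, and IsDeleted a flag marking the row as deleted.

With the table in place, we start with a database helper class.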

    using System;
    using System.Data;
    using System.Data.SqlClient;

    /// <summary>
    /// Database helper class.
    /// </summary>
    public class MssqlHelper
    {
        #region Fields

        /// <summary>
        /// Database connection string.
        /// </summary>
        private static string conn = "Data Source=.;Initial Catalog=Cnblogs;User ID=sa;Password=123";

        #endregion

        #region Write a row into the DataTable

        public static void GetData(string title, string url, string author, string time, string motto, string depth, DataTable dt)
        {
            DataRow dr = dt.NewRow();
            dr["BlogTitle"] = title;
            dr["BlogUrl"] = url;
            dr["BlogAuthor"] = author;
            dr["BlogTime"] = time;
            dr["BlogMotto"] = motto;
            dr["BlogDepth"] = depth;

            // Append the new row to the in-memory table.
            dt.Rows.Add(dr);
        }

        #endregion

        #region Insert data into the database

        /// <summary>
        /// Bulk-inserts the buffered rows into the database.
        /// </summary>
        public static void InsertDb(DataTable dt)
        {
            try
            {
                using (SqlBulkCopy copy = new SqlBulkCopy(conn))
                {
                    // Name of the destination table.
                    copy.DestinationTableName = "BlogArticle";

                    // Map each DataTable column to the BlogArticle column of the same name.
                    copy.ColumnMappings.Add("BlogTitle", "BlogTitle");
                    copy.ColumnMappings.Add("BlogUrl", "BlogUrl");
                    copy.ColumnMappings.Add("BlogAuthor", "BlogAuthor");
                    copy.ColumnMappings.Add("BlogTime", "BlogTime");
                    copy.ColumnMappings.Add("BlogMotto", "BlogMotto");
                    copy.ColumnMappings.Add("BlogDepth", "BlogDepth");

                    // Write every row of dt to BlogArticle in a single batch.
                    copy.WriteToServer(dt);
                    dt.Rows.Clear();
                }
            }
            catch (Exception)
            {
                // A failed batch is discarded so the buffer does not keep growing.
                dt.Rows.Clear();
            }
        }

        #endregion
    }
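GetData and InsertDb both assume a DataTable whose column names match the BlogArticle table, but the post does not show how that table is built. The following is a minimal sketch under that assumption. The class name `BlogArticleBuffer` and its methods are hypothetical, and every column is typed as string only because GetData takes string arguments.

```csharp
using System;
using System.Data;

// Hypothetical helper, not part of the original source.
public static class BlogArticleBuffer
{
    // Column names taken from the ColumnMappings in MssqlHelper.InsertDb.
    private static readonly string[] ColumnNames =
    {
        "BlogTitle", "BlogUrl", "BlogAuthor", "BlogTime", "BlogMotto", "BlogDepth"
    };

    // Builds an empty in-memory table with one string column per mapped field.
    public static DataTable Create()
    {
        var dt = new DataTable("BlogArticle");
        foreach (string name in ColumnNames)
        {
            dt.Columns.Add(name, typeof(string));
        }

        return dt;
    }

    // Mirrors MssqlHelper.GetData so the sketch runs without a database.
    public static void AddRow(DataTable dt, string title, string url, string author, string time, string motto, string depth)
    {
        DataRow dr = dt.NewRow();
        dr["BlogTitle"] = title;
        dr["BlogUrl"] = url;
        dr["BlogAuthor"] = author;
        dr["BlogTime"] = time;
        dr["BlogMotto"] = motto;
        dr["BlogDepth"] = depth;
        dt.Rows.Add(dr);
    }
}
```

A caller would presumably add rows as pages arrive and hand the table to MssqlHelper.InsertDb once it holds a batch of some chosen size; the batch size is left open here because the post does not state one.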

3. Logging

A small logging helper makes it easier to see what the crawler is doing. The code is as follows.

    using System;
    using System.IO;

    /// <summary>
    /// Logging helper class.
    /// </summary>
    public class LogHelper
    {
        #region Write to the log

        // Appends one line of text to the log file.
        public static void WriteLog(string text)
        {
            // Alternative location, next to the executable:
            // StreamWriter sw = new StreamWriter(AppDomain.CurrentDomain.BaseDirectory + "\\log.txt", true);
            using (StreamWriter sw = new StreamWriter("F:" + "\\log.txt", true))
            {
                sw.WriteLine(text);
            }
        }

        #endregion
    }
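The crawler in the next section can run several threads, so two of them may call WriteLog at the same moment, and opening one file from two writers can throw an IOException. Below is a minimal sketch of one way to serialize the writes, assuming a simple lock is acceptable for the log volume. `SafeLog`, the `path` parameter and the timestamp prefix are hypothetical additions, not part of the original helper.

```csharp
using System;
using System.IO;

// Hypothetical variant of LogHelper that is safe to call from several threads.
public static class SafeLog
{
    // Only one thread at a time may hold this lock and touch the file.
    private static readonly object Gate = new object();

    public static void WriteLog(string path, string text)
    {
        string line = DateTime.Now.ToString("yyyy-MM-dd HH:mm:ss") + " " + text + Environment.NewLine;
        lock (Gate)
        {
            // AppendAllText opens, writes and closes the file in one call.
            File.AppendAllText(path, line);
        }
    }
}
```

Passing the path in, rather than hard-coding a drive letter, also lets the same helper run on a machine without an F: drive.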

4. The crawler

My web spider is built on a third-party class library. The code is as follows.

    namespace Feng.SimpleCrawler
    {
        using System;

        /// <summary>The add url event handler.</summary>
        /// <param name="args">The args.</param>
        /// <returns>The <see cref="bool"/>.</returns>
        public delegate bool AddUrlEventHandler(AddUrlEventArgs args);

        /// <summary>The add url event args.</summary>
        public class AddUrlEventArgs : EventArgs
        {
            #region Public Properties

            /// <summary>Gets or sets the depth.</summary>
            public int Depth { get; set; }

            /// <summary>Gets or sets the title.</summary>
            public string Title { get; set; }

            /// <summary>Gets or sets the url.</summary>
            public string Url { get; set; }

            #endregion
        }
    }
AddUrlEventArgs.cs
    namespace Feng.SimpleCrawler
    {
        using System;
        using System.Collections;

        /// <summary>The bloom filter.</summary>
        /// <typeparam name="T">The generic type.</typeparam>
        public class BloomFilter<T>
        {
            #region Fields

            /// <summary>The get hash secondary.</summary>
            private readonly HashFunction getHashSecondary;

            /// <summary>The hash bits.</summary>
            private readonly BitArray hashBits;

            /// <summary>The hash function count.</summary>
            private readonly int hashFunctionCount;

            #endregion

            #region Constructors and Destructors

            /// <summary>Initializes a new instance of the <see cref="BloomFilter{T}"/> class.</summary>
            /// <param name="capacity">The capacity.</param>
            public BloomFilter(int capacity)
                : this(capacity, null)
            {
            }

            /// <summary>Initializes a new instance of the <see cref="BloomFilter{T}"/> class.</summary>
            /// <param name="capacity">The capacity.</param>
            /// <param name="errorRate">The error rate; it must lie strictly between 0 and 1, so it is a float.</param>
            public BloomFilter(int capacity, float errorRate)
                : this(capacity, errorRate, null)
            {
            }

            /// <summary>Initializes a new instance of the <see cref="BloomFilter{T}"/> class.</summary>
            /// <param name="capacity">The capacity.</param>
            /// <param name="hashFunction">The hash function.</param>
            public BloomFilter(int capacity, HashFunction hashFunction)
                : this(capacity, BestErrorRate(capacity), hashFunction)
            {
            }

            /// <summary>Initializes a new instance of the <see cref="BloomFilter{T}"/> class.</summary>
            /// <param name="capacity">The capacity.</param>
            /// <param name="errorRate">The error rate.</param>
            /// <param name="hashFunction">The hash function.</param>
            public BloomFilter(int capacity, float errorRate, HashFunction hashFunction)
                : this(capacity, errorRate, hashFunction, BestM(capacity, errorRate), BestK(capacity, errorRate))
            {
            }

            /// <summary>Initializes a new instance of the <see cref="BloomFilter{T}"/> class.</summary>
            /// <param name="capacity">The capacity.</param>
            /// <param name="errorRate">The error rate.</param>
            /// <param name="hashFunction">The hash function.</param>
            /// <param name="m">The m (number of bits).</param>
            /// <param name="k">The k (number of hash functions).</param>
            public BloomFilter(int capacity, float errorRate, HashFunction hashFunction, int m, int k)
            {
                if (capacity < 1)
                {
                    throw new ArgumentOutOfRangeException("capacity", capacity, "capacity must be > 0");
                }

                if (errorRate >= 1 || errorRate <= 0)
                {
                    throw new ArgumentOutOfRangeException(
                        "errorRate",
                        errorRate,
                        string.Format("errorRate must be between 0 and 1, exclusive. Was {0}", errorRate));
                }

                if (m < 1)
                {
                    throw new ArgumentOutOfRangeException(
                        string.Format(
                            "The provided capacity and errorRate values would result in an array of length > int.MaxValue. Please reduce either of these values. Capacity: {0}, Error rate: {1}",
                            capacity,
                            errorRate));
                }

                if (hashFunction == null)
                {
                    if (typeof(T) == typeof(string))
                    {
                        this.getHashSecondary = HashString;
                    }
                    else if (typeof(T) == typeof(int))
                    {
                        this.getHashSecondary = HashInt32;
                    }
                    else
                    {
                        throw new ArgumentNullException(
                            "hashFunction",
                            "Please provide a hash function for your type T, when T is not a string or int.");
                    }
                }
                else
                {
                    this.getHashSecondary = hashFunction;
                }

                this.hashFunctionCount = k;
                this.hashBits = new BitArray(m);
            }

            #endregion

            #region Delegates

            /// <summary>The hash function.</summary>
            /// <param name="input">The input.</param>
            /// <returns>The <see cref="int"/>.</returns>
            public delegate int HashFunction(T input);

            #endregion

            #region Public Properties

            /// <summary>Gets the truthiness (the fraction of bits that are set).</summary>
            public double Truthiness
            {
                get
                {
                    return (double)this.TrueBits() / this.hashBits.Count;
                }
            }

            #endregion

            #region Public Methods and Operators

            /// <summary>The add.</summary>
            /// <param name="item">The item.</param>
            public void Add(T item)
            {
                int primaryHash = item.GetHashCode();
                int secondaryHash = this.getHashSecondary(item);
                for (int i = 0; i < this.hashFunctionCount; i++)
                {
                    int hash = this.ComputeHash(primaryHash, secondaryHash, i);
                    this.hashBits[hash] = true;
                }
            }

            /// <summary>The contains.</summary>
            /// <param name="item">The item.</param>
            /// <returns>The <see cref="bool"/>.</returns>
            public bool Contains(T item)
            {
                int primaryHash = item.GetHashCode();
                int secondaryHash = this.getHashSecondary(item);
                for (int i = 0; i < this.hashFunctionCount; i++)
                {
                    int hash = this.ComputeHash(primaryHash, secondaryHash, i);
                    if (this.hashBits[hash] == false)
                    {
                        return false;
                    }
                }

                return true;
            }

            #endregion

            #region Methods

            /// <summary>The best error rate.</summary>
            /// <param name="capacity">The capacity.</param>
            /// <returns>The <see cref="float"/>.</returns>
            private static float BestErrorRate(int capacity)
            {
                var c = (float)(1.0 / capacity);
                if (Math.Abs(c) > 0)
                {
                    return c;
                }

                double y = int.MaxValue / (double)capacity;
                return (float)Math.Pow(0.6185, y);
            }

            /// <summary>The best k.</summary>
            /// <param name="capacity">The capacity.</param>
            /// <param name="errorRate">The error rate.</param>
            /// <returns>The <see cref="int"/>.</returns>
            private static int BestK(int capacity, float errorRate)
            {
                return (int)Math.Round(Math.Log(2.0) * BestM(capacity, errorRate) / capacity);
            }

            /// <summary>The best m.</summary>
            /// <param name="capacity">The capacity.</param>
            /// <param name="errorRate">The error rate.</param>
            /// <returns>The <see cref="int"/>.</returns>
            private static int BestM(int capacity, float errorRate)
            {
                return (int)Math.Ceiling(capacity * Math.Log(errorRate, 1.0 / Math.Pow(2, Math.Log(2.0))));
            }

            /// <summary>The hash int 32.</summary>
            /// <param name="input">The input.</param>
            /// <returns>The <see cref="int"/>.</returns>
            private static int HashInt32(T input)
            {
                unchecked
                {
                    // Unbox through object: "input as uint?" is always null when T is int.
                    uint x = (uint)(int)(object)input;
                    x = ~x + (x << 15);
                    x = x ^ (x >> 12);
                    x = x + (x << 2);
                    x = x ^ (x >> 4);
                    x = x * 2057;
                    x = x ^ (x >> 16);
                    return (int)x;
                }
            }

            /// <summary>The hash string.</summary>
            /// <param name="input">The input.</param>
            /// <returns>The <see cref="int"/>.</returns>
            private static int HashString(T input)
            {
                var str = input as string;
                int hash = 0;
                if (str != null)
                {
                    for (int i = 0; i < str.Length; i++)
                    {
                        hash += str[i];
                        hash += hash << 10;
                        hash ^= hash >> 6;
                    }

                    hash += hash << 3;
                    hash ^= hash >> 11;
                    hash += hash << 15;
                }

                return hash;
            }

            /// <summary>The compute hash.</summary>
            /// <param name="primaryHash">The primary hash.</param>
            /// <param name="secondaryHash">The secondary hash.</param>
            /// <param name="i">The i.</param>
            /// <returns>The <see cref="int"/>.</returns>
            private int ComputeHash(int primaryHash, int secondaryHash, int i)
            {
                int resultingHash = (primaryHash + (i * secondaryHash)) % this.hashBits.Count;
                return Math.Abs(resultingHash);
            }

            /// <summary>The true bits.</summary>
            /// <returns>The <see cref="int"/>.</returns>
            private int TrueBits()
            {
                int output = 0;
                foreach (bool bit in this.hashBits)
                {
                    if (bit)
                    {
                        output++;
                    }
                }

                return output;
            }

            #endregion
        }
    }
BloomFilter.cs
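BestM and BestK follow the usual Bloom filter sizing rules: for n expected items and a false-positive rate p, the filter needs m = -n ln p / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The logarithm base used in BestM, 1 / 2^(ln 2), is just another way of writing the first formula. The sketch below recomputes both values outside the class to give a sense of scale. The class name `BloomSizing` is hypothetical, and the figures in the usage note (200,000 URLs, 1% error rate) are assumptions chosen for illustration, not settings taken from the post.

```csharp
using System;

// Hypothetical standalone version of the sizing formulas used by BloomFilter<T>.
public static class BloomSizing
{
    // Number of bits: m = ceil(-n * ln(p) / (ln 2)^2).
    public static int BestM(int capacity, double errorRate)
    {
        double ln2 = Math.Log(2.0);
        return (int)Math.Ceiling(-capacity * Math.Log(errorRate) / (ln2 * ln2));
    }

    // Number of hash functions: k = round(ln 2 * m / n).
    public static int BestK(int capacity, double errorRate)
    {
        return (int)Math.Round(Math.Log(2.0) * BestM(capacity, errorRate) / capacity);
    }
}
```

For 200,000 URLs at a 1% error rate this works out to roughly 1.9 million bits (under 240 KB) and 7 hash functions, which is why a Bloom filter is attractive for remembering visited URLs compared with storing every URL string.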
    namespace Feng.SimpleCrawler
    {
        using System;

        /// <summary>The crawl error event handler.</summary>
        /// <param name="args">The args.</param>
        public delegate void CrawlErrorEventHandler(CrawlErrorEventArgs args);

        /// <summary>The crawl error event args.</summary>
        public class CrawlErrorEventArgs : EventArgs
        {
            #region Public Properties

            /// <summary>Gets or sets the exception.</summary>
            public Exception Exception { get; set; }

            /// <summary>Gets or sets the url.</summary>
            public string Url { get; set; }

            #endregion
        }
    }
CrawlErrorEventArgs.cs
    namespace Feng.SimpleCrawler
    {
        using System;
        using System.Collections.Generic;
        using System.IO;
        using System.IO.Compression;
        using System.Linq;
        using System.Net;
        using System.Text;
        using System.Text.RegularExpressions;
        using System.Threading;

        /// <summary>The crawl master.</summary>
        public class CrawlMaster
        {
            #region Constants

            /// <summary>The web url regular expressions.</summary>
            private const string WebUrlRegularExpressions = @"^(http|https)://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?";

            #endregion

            #region Fields

            /// <summary>The cookie container.</summary>
            private readonly CookieContainer cookieContainer;

            /// <summary>The random.</summary>
            private readonly Random random;

            /// <summary>The thread status.</summary>
            private readonly bool[] threadStatus;

            /// <summary>The threads.</summary>
            private readonly Thread[] threads;

            #endregion

            #region Constructors and Destructors

            /// <summary>Initializes a new instance of the <see cref="CrawlMaster"/> class.</summary>
            /// <param name="settings">The settings.</param>
            public CrawlMaster(CrawlSettings settings)
            {
                this.cookieContainer = new CookieContainer();
                this.random = new Random();
                this.Settings = settings;
                this.threads = new Thread[settings.ThreadCount];
                this.threadStatus = new bool[settings.ThreadCount];
            }

            #endregion

            #region Public Events

            /// <summary>The add url event.</summary>
            public event AddUrlEventHandler AddUrlEvent;

            /// <summary>The crawl error event.</summary>
            public event CrawlErrorEventHandler CrawlErrorEvent;

            /// <summary>The data received event.</summary>
            public event DataReceivedEventHandler DataReceivedEvent;

            #endregion

            #region Public Properties

            /// <summary>Gets the settings.</summary>
            public CrawlSettings Settings { get; private set; }

            #endregion

            #region Public Methods and Operators

            /// <summary>The crawl.</summary>
            public void Crawl()
            {
                this.Initialize();
                for (int i = 0; i < this.threads.Length; i++)
                {
                    this.threads[i].Start(i);
                    this.threadStatus[i] = false;
                }
            }

            /// <summary>The stop.</summary>
            public void Stop()
            {
                foreach (Thread thread in this.threads)
                {
                    thread.Abort();
                }
            }

            #endregion

            #region Methods

            /// <summary>The config request.</summary>
            /// <param name="request">The request.</param>
            private void ConfigRequest(HttpWebRequest request)
            {
                request.UserAgent = this.Settings.UserAgent;
                request.CookieContainer = this.cookieContainer;
                request.AllowAutoRedirect = true;
                request.MediaType = "text/html";
                request.Headers["Accept-Language"] = "zh-CN,zh;q=0.8";
                if (this.Settings.Timeout > 0)
                {
                    request.Timeout = this.Settings.Timeout;
                }
            }

            /// <summary>The crawl process.</summary>
            /// <param name="threadIndex">The thread index.</param>
            private void CrawlProcess(object threadIndex)
            {
                var currentThreadIndex = (int)threadIndex;
                while (true)
                {
                    // Decide whether this thread sleeps or exits, based on the number
                    // of queued URLs and the number of idle threads.
                    if (UrlQueue.Instance.Count == 0)
                    {
                        this.threadStatus[currentThreadIndex] = true;
                        if (!this.threadStatus.Any(t => t == false))
                        {
                            break;
                        }

                        Thread.Sleep(2000);
                        continue;
                    }

                    this.threadStatus[currentThreadIndex] = false;
                    if (UrlQueue.Instance.Count == 0)
                    {
                        continue;
                    }

                    UrlInfo urlInfo = UrlQueue.Instance.DeQueue();
                    HttpWebRequest request = null;
                    HttpWebResponse response = null;
                    try
                    {
                        if (urlInfo == null)
                        {
                            continue;
                        }

                        // Automatic rate limiting: wait a random 1 to 5 seconds.
                        if (this.Settings.AutoSpeedLimit)
                        {
                            int span = this.random.Next(1000, 5000);
                            Thread.Sleep(span);
                        }

                        // Create and configure the web request. The request is only
                        // configured once it is known to be an HttpWebRequest.
                        request = WebRequest.Create(urlInfo.UrlString) as HttpWebRequest;
                        if (request != null)
                        {
                            this.ConfigRequest(request);
                            response = request.GetResponse() as HttpWebResponse;
                        }

                        if (response != null)
                        {
                            this.PersistenceCookie(response);
                            Stream stream = null;

                            // Decompress the stream if the page is gzip-compressed.
                            if (response.ContentEncoding == "gzip")
                            {
                                Stream responseStream = response.GetResponseStream();
                                if (responseStream != null)
                                {
                                    stream = new GZipStream(responseStream, CompressionMode.Decompress);
                                }
                            }
                            else
                            {
                                stream = response.GetResponseStream();
                            }

                            // The using block closes the stream when parsing is done.
                            using (stream)
                            {
                                string html = this.ParseContent(stream, response.CharacterSet);
                                this.ParseLinks(urlInfo, html);
                                if (this.DataReceivedEvent != null)
                                {
                                    this.DataReceivedEvent(
                                        new DataReceivedEventArgs
                                            {
                                                Url = urlInfo.UrlString,
                                                Depth = urlInfo.Depth,
                                                Html = html
                                            });
                                }
                            }
                        }
                    }
                    catch (Exception exception)
                    {
                        if (this.CrawlErrorEvent != null)
                        {
                            if (urlInfo != null)
                            {
                                this.CrawlErrorEvent(
                                    new CrawlErrorEventArgs { Url = urlInfo.UrlString, Exception = exception });
                            }
                        }
                    }
                    finally
                    {
                        if (request != null)
                        {
                            request.Abort();
                        }

                        if (response != null)
                        {
                            response.Close();
                        }
                    }
                }
            }

            /// <summary>The initialize.</summary>
            private void Initialize()
            {
                if (this.Settings.SeedsAddress != null && this.Settings.SeedsAddress.Count > 0)
                {
                    foreach (string seed in this.Settings.SeedsAddress)
                    {
                        if (Regex.IsMatch(seed, WebUrlRegularExpressions, RegexOptions.IgnoreCase))
                        {
                            UrlQueue.Instance.EnQueue(new UrlInfo(seed) { Depth = 1 });
                        }
                    }
                }

                for (int i = 0; i < this.Settings.ThreadCount; i++)
                {
                    var threadStart = new ParameterizedThreadStart(this.CrawlProcess);
                    this.threads[i] = new Thread(threadStart);
                }

                ServicePointManager.DefaultConnectionLimit = 256;
            }

            /// <summary>The is match regular.</summary>
            /// <param name="url">The url.</param>
            /// <returns>The <see cref="bool"/>.</returns>
            private bool IsMatchRegular(string url)
            {
                bool result = false;
                if (this.Settings.RegularFilterExpressions != null && this.Settings.RegularFilterExpressions.Count > 0)
                {
                    if (
                        this.Settings.RegularFilterExpressions.Any(
                            pattern => Regex.IsMatch(url, pattern, RegexOptions.IgnoreCase)))
                    {
                        result = true;
                    }
                }
                else
                {
                    result = true;
                }

                return result;
            }

            /// <summary>The parse content.</summary>
            /// <param name="stream">The stream.</param>
            /// <param name="characterSet">The character set.</param>
            /// <returns>The <see cref="string"/>.</returns>
            private string ParseContent(Stream stream, string characterSet)
            {
                var memoryStream = new MemoryStream();
                stream.CopyTo(memoryStream);
                byte[] buffer = memoryStream.ToArray();

                // Decode as ASCII first, only to locate the charset declared in a meta tag.
                Encoding encode = Encoding.ASCII;
                string html = encode.GetString(buffer);
                string localCharacterSet = characterSet;
                Match match = Regex.Match(html, "<meta([^<]*)charset=([^<]*)\"", RegexOptions.IgnoreCase);
                if (match.Success)
                {
                    localCharacterSet = match.Groups[2].Value;
                    var stringBuilder = new StringBuilder();
                    foreach (char item in localCharacterSet)
                    {
                        if (item == ' ')
                        {
                            break;
                        }

                        if (item != '\"')
                        {
                            stringBuilder.Append(item);
                        }
                    }

                    localCharacterSet = stringBuilder.ToString();
                }

                if (string.IsNullOrEmpty(localCharacterSet))
                {
                    localCharacterSet = characterSet;
                }

                if (!string.IsNullOrEmpty(localCharacterSet))
                {
                    encode = Encoding.GetEncoding(localCharacterSet);
                }

                memoryStream.Close();
                return encode.GetString(buffer);
            }

            /// <summary>The parse links.</summary>
            /// <param name="urlInfo">The url info.</param>
            /// <param name="html">The html.</param>
            private void ParseLinks(UrlInfo urlInfo, string html)
            {
                if (this.Settings.Depth > 0 && urlInfo.Depth >= this.Settings.Depth)
                {
                    return;
                }

                var urlDictionary = new Dictionary<string, string>();
                Match match = Regex.Match(html, "(?i)<a .*?href=\"([^\"]+)\"[^>]*>(.*?)</a>");
                while (match.Success)
                {
                    // The href is the key.
                    string urlKey = match.Groups[1].Value;

                    // The link text, with nested tags stripped, is the value.
                    string urlValue = Regex.Replace(match.Groups[2].Value, "(?i)<.*?>", string.Empty);
                    urlDictionary[urlKey] = urlValue;
                    match = match.NextMatch();
                }

                foreach (var item in urlDictionary)
                {
                    string href = item.Key;
                    string text = item.Value;
                    if (!string.IsNullOrEmpty(href))
                    {
                        bool canBeAdd = true;
                        if (this.Settings.EscapeLinks != null && this.Settings.EscapeLinks.Count > 0)
                        {
                            if (this.Settings.EscapeLinks.Any(suffix => href.EndsWith(suffix, StringComparison.OrdinalIgnoreCase)))
                            {
                                canBeAdd = false;
                            }
                        }

                        if (this.Settings.HrefKeywords != null && this.Settings.HrefKeywords.Count > 0)
                        {
                            if (!this.Settings.HrefKeywords.Any(href.Contains))
                            {
                                canBeAdd = false;
                            }
                        }

                        if (canBeAdd)
                        {
                            string url = href.Replace("%3f", "?")
                                .Replace("%3d", "=")
                                .Replace("%2f", "/")
                                .Replace("&amp;", "&");
                            if (string.IsNullOrEmpty(url) || url.StartsWith("#")
                                || url.StartsWith("mailto:", StringComparison.OrdinalIgnoreCase)
                                || url.StartsWith("javascript:", StringComparison.OrdinalIgnoreCase))
                            {
                                continue;
                            }

                            var baseUri = new Uri(urlInfo.UrlString);
                            Uri currentUri = url.StartsWith("http", StringComparison.OrdinalIgnoreCase)
                                                 ? new Uri(url)
                                                 : new Uri(baseUri, url);
                            url = currentUri.AbsoluteUri;
                            if (this.Settings.LockHost)
                            {
                                // Drop the leftmost label (the subdomain) and compare the rest;
                                // hosts that then match count as the same site,
                                // e.g. mail.pzcast.com and www.pzcast.com.
                                if (baseUri.Host.Split('.').Skip(1).Aggregate((a, b) => a + "." + b)
                                    != currentUri.Host.Split('.').Skip(1).Aggregate((a, b) => a + "." + b))
                                {
                                    continue;
                                }
                            }

                            if (!this.IsMatchRegular(url))
                            {
                                continue;
                            }

                            var addUrlEventArgs = new AddUrlEventArgs { Title = text, Depth = urlInfo.Depth + 1, Url = url };
                            if (this.AddUrlEvent != null && !this.AddUrlEvent(addUrlEventArgs))
                            {
                                continue;
                            }

                            UrlQueue.Instance.EnQueue(new UrlInfo(url) { Depth = urlInfo.Depth + 1 });
                        }
                    }
                }
            }

            /// <summary>The persistence cookie.</summary>
            /// <param name="response">The response.</param>
            private void PersistenceCookie(HttpWebResponse response)
            {
                if (!this.Settings.KeepCookie)
                {
                    return;
                }

                string cookies = response.Headers["Set-Cookie"];
                if (!string.IsNullOrEmpty(cookies))
                {
                    var cookieUri =
                        new Uri(
                            string.Format(
                                "{0}://{1}:{2}/",
                                response.ResponseUri.Scheme,
                                response.ResponseUri.Host,
                                response.ResponseUri.Port));
                    this.cookieContainer.SetCookies(cookieUri, cookies);
                }
            }

            #endregion
        }
    }
CrawlMaster.cs
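The core of ParseLinks is three steps: pull each href out of the page with a regular expression, resolve it against the page's own address, and (when LockHost is set) keep it only if it belongs to the same site. The sketch below isolates those steps so they can be tried without the rest of the crawler. `LinkSketch` and its method names are hypothetical; the anchor pattern is the one used in ParseLinks, while the use of `Uri.TryCreate` and `string.Join` is a substitution made here for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical, simplified restatement of CrawlMaster.ParseLinks.
public static class LinkSketch
{
    // Group 1 is the href value, group 2 the inner HTML of the anchor.
    private static readonly Regex Anchor =
        new Regex("(?i)<a .*?href=\"([^\"]+)\"[^>]*>(.*?)</a>");

    // Returns the absolute, same-site URLs linked from the given page.
    public static List<string> Extract(string pageUrl, string html)
    {
        var baseUri = new Uri(pageUrl);
        var result = new List<string>();
        foreach (Match match in Anchor.Matches(html))
        {
            string href = match.Groups[1].Value.Replace("&amp;", "&");
            if (href.StartsWith("#")
                || href.StartsWith("mailto:", StringComparison.OrdinalIgnoreCase)
                || href.StartsWith("javascript:", StringComparison.OrdinalIgnoreCase))
            {
                continue;
            }

            // Resolves relative links against the page; absolute links pass through.
            Uri target;
            if (!Uri.TryCreate(baseUri, href, out target))
            {
                continue;
            }

            if (SameSite(baseUri.Host, target.Host))
            {
                result.Add(target.AbsoluteUri);
            }
        }

        return result;
    }

    // Two hosts count as one site when they agree after the leftmost label is dropped.
    public static bool SameSite(string hostA, string hostB)
    {
        return StripFirstLabel(hostA) == StripFirstLabel(hostB);
    }

    private static string StripFirstLabel(string host)
    {
        return string.Join(".", host.Split('.').Skip(1));
    }
}
```

One property of this comparison is worth knowing: a host without a subdomain, such as cnblogs.com, is reduced to com, so the rule is only dependable when both hosts carry a subdomain.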
    namespace Feng.SimpleCrawler
    {
        using System;
        using System.Collections.Generic;

        /// <summary>The crawl settings.</summary>
        [Serializable]
        public class CrawlSettings
        {
            #region Fields

            /// <summary>The depth.</summary>
            private byte depth = 3;

            /// <summary>The lock host.</summary>
            private bool lockHost = true;

            /// <summary>The thread count.</summary>
            private byte threadCount = 1;

            /// <summary>The timeout.</summary>
            private int timeout = 15000;

            /// <summary>The user agent.</summary>
            private string userAgent =
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11";

            #endregion

            #region Constructors and Destructors

            /// <summary>Initializes a new instance of the <see cref="CrawlSettings"/> class.</summary>
            public CrawlSettings()
            {
                this.AutoSpeedLimit = false;
                this.EscapeLinks = new List<string>();
                this.KeepCookie = true;
                this.HrefKeywords = new List<string>();
                this.LockHost = true;
                this.RegularFilterExpressions = new List<string>();
                this.SeedsAddress = new List<string>();
            }

            #endregion

            #region Public Properties

            /// <summary>Gets or sets a value indicating whether auto speed limit.</summary>
            public bool AutoSpeedLimit { get; set; }

            /// <summary>Gets or sets the depth.</summary>
            public byte Depth
            {
                get
                {
                    return this.depth;
                }

                set
                {
                    this.depth = value;
                }
            }

            /// <summary>Gets the escape links.</summary>
            public List<string> EscapeLinks { get; private set; }

            /// <summary>Gets or sets a value indicating whether keep cookie.</summary>
            public bool KeepCookie { get; set; }

            /// <summary>Gets the href keywords.</summary>
            public List<string> HrefKeywords { get; private set; }

            /// <summary>Gets or sets a value indicating whether lock host.</summary>
            public bool LockHost
            {
                get
                {
                    return this.lockHost;
                }

                set
                {
                    this.lockHost = value;
                }
            }

            /// <summary>Gets the regular filter expressions.</summary>
            public List<string> RegularFilterExpressions { get; private set; }

            /// <summary>Gets the seeds address.</summary>
            public List<string> SeedsAddress { get; private set; }

            /// <summary>Gets or sets the thread count.</summary>
            public byte ThreadCount
            {
                get
                {
                    return this.threadCount;
                }

                set
                {
                    this.threadCount = value;
                }
            }

            /// <summary>Gets or sets the timeout.</summary>
            public int Timeout
            {
                get
                {
                    return this.timeout;
                }

                set
                {
                    this.timeout = value;
                }
            }

            /// <summary>Gets or sets the user agent.</summary>
            public string UserAgent
            {
                get
                {
                    return this.userAgent;
                }

                set
                {
                    this.userAgent = value;
                }
            }

            #endregion
        }
    }
CrawlSettings.cs
namespace Feng.SimpleCrawler
{
    /// <summary>
    /// The crawl status.
    /// </summary>
    public enum CrawlStatus
    {
        /// <summary>
        /// The completed.
        /// </summary>
        Completed = 1,

        /// <summary>
        /// The never been.
        /// </summary>
        NeverBeen = 2
    }
}
CrawlStatus.cs
namespace Feng.SimpleCrawler
{
    using System;

    /// <summary>
    /// The data received event handler.
    /// </summary>
    /// <param name="args">
    /// The args.
    /// </param>
    public delegate void DataReceivedEventHandler(DataReceivedEventArgs args);

    /// <summary>
    /// The data received event args.
    /// </summary>
    public class DataReceivedEventArgs : EventArgs
    {
        #region Public Properties

        /// <summary>
        /// Gets or sets the depth.
        /// </summary>
        public int Depth { get; set; }

        /// <summary>
        /// Gets or sets the html.
        /// </summary>
        public string Html { get; set; }

        /// <summary>
        /// Gets or sets the url.
        /// </summary>
        public string Url { get; set; }

        #endregion
    }
}
DataReceivedEventArgs.cs
namespace Feng.SimpleCrawler
{
    using System.Collections.Generic;
    using System.Threading;

    /// <summary>
    /// The security queue.
    /// </summary>
    /// <typeparam name="T">
    /// Any type.
    /// </typeparam>
    public abstract class SecurityQueue<T>
        where T : class
    {
        #region Fields

        /// <summary>
        /// The inner queue.
        /// </summary>
        protected readonly Queue<T> InnerQueue = new Queue<T>();

        /// <summary>
        /// The sync object.
        /// </summary>
        protected readonly object SyncObject = new object();

        /// <summary>
        /// The auto reset event.
        /// </summary>
        private readonly AutoResetEvent autoResetEvent;

        #endregion

        #region Constructors and Destructors

        /// <summary>
        /// Initializes a new instance of the <see cref="SecurityQueue{T}"/> class.
        /// </summary>
        protected SecurityQueue()
        {
            this.autoResetEvent = new AutoResetEvent(false);
        }

        #endregion

        #region Delegates

        /// <summary>
        /// The before en queue event handler.
        /// </summary>
        /// <param name="target">
        /// The target.
        /// </param>
        /// <returns>
        /// The <see cref="bool"/>.
        /// </returns>
        public delegate bool BeforeEnQueueEventHandler(T target);

        #endregion

        #region Public Events

        /// <summary>
        /// The before en queue event.
        /// </summary>
        public event BeforeEnQueueEventHandler BeforeEnQueueEvent;

        #endregion

        #region Public Properties

        /// <summary>
        /// Gets the auto reset event.
        /// </summary>
        public AutoResetEvent AutoResetEvent
        {
            get
            {
                return this.autoResetEvent;
            }
        }

        /// <summary>
        /// Gets the count.
        /// </summary>
        public int Count
        {
            get
            {
                lock (this.SyncObject)
                {
                    return this.InnerQueue.Count;
                }
            }
        }

        /// <summary>
        /// Gets a value indicating whether has value.
        /// </summary>
        public bool HasValue
        {
            get
            {
                return this.Count != 0;
            }
        }

        #endregion

        #region Public Methods and Operators

        /// <summary>
        /// The de queue.
        /// </summary>
        /// <returns>
        /// The <see cref="T"/>.
        /// </returns>
        public T DeQueue()
        {
            lock (this.SyncObject)
            {
                if (this.InnerQueue.Count > 0)
                {
                    return this.InnerQueue.Dequeue();
                }

                return default(T);
            }
        }

        /// <summary>
        /// The en queue.
        /// </summary>
        /// <param name="target">
        /// The target.
        /// </param>
        public void EnQueue(T target)
        {
            lock (this.SyncObject)
            {
                if (this.BeforeEnQueueEvent != null)
                {
                    if (this.BeforeEnQueueEvent(target))
                    {
                        this.InnerQueue.Enqueue(target);
                    }
                }
                else
                {
                    this.InnerQueue.Enqueue(target);
                }

                this.AutoResetEvent.Set();
            }
        }

        #endregion
    }
}
SecurityQueue.cs
namespace Feng.SimpleCrawler
{
    /// <summary>
    /// The url info.
    /// </summary>
    public class UrlInfo
    {
        #region Fields

        /// <summary>
        /// The url.
        /// </summary>
        private readonly string url;

        #endregion

        #region Constructors and Destructors

        /// <summary>
        /// Initializes a new instance of the <see cref="UrlInfo"/> class.
        /// </summary>
        /// <param name="urlString">
        /// The url string.
        /// </param>
        public UrlInfo(string urlString)
        {
            this.url = urlString;
        }

        #endregion

        #region Public Properties

        /// <summary>
        /// Gets or sets the depth.
        /// </summary>
        public int Depth { get; set; }

        /// <summary>
        /// Gets the url string.
        /// </summary>
        public string UrlString
        {
            get
            {
                return this.url;
            }
        }

        /// <summary>
        /// Gets or sets the status.
        /// </summary>
        public CrawlStatus Status { get; set; }

        #endregion
    }
}
UrlInfo.cs
namespace Feng.SimpleCrawler
{
    /// <summary>
    /// The url queue.
    /// </summary>
    public class UrlQueue : SecurityQueue<UrlInfo>
    {
        #region Constructors and Destructors

        /// <summary>
        /// Prevents a default instance of the <see cref="UrlQueue"/> class from being created.
        /// </summary>
        private UrlQueue()
        {
        }

        #endregion

        #region Public Properties

        /// <summary>
        /// Gets the instance.
        /// </summary>
        public static UrlQueue Instance
        {
            get
            {
                return Nested.Inner;
            }
        }

        #endregion

        /// <summary>
        /// The nested.
        /// </summary>
        private static class Nested
        {
            #region Static Fields

            /// <summary>
            /// The inner.
            /// </summary>
            internal static readonly UrlQueue Inner = new UrlQueue();

            #endregion
        }
    }
}
UrlQueue.cs

5. Creating the Windows Service

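With the groundwork done, we finally reach the main part. Console applications are not very stable for this kind of job, and crawling posts from Cnblogs (博客園) has to keep running reliably over a long period, so I turned to a Windows service. Once the service project is created, the code looks like this.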

using System;
using System.Data;
using System.ServiceProcess;
using Feng.SimpleCrawler;
using Feng.DbHelper;
using Feng.Log;
using HtmlAgilityPack;

namespace Feng.Demo
{
    /// <summary>
    /// The Windows service.
    /// </summary>
    partial class FengCnblogsService : ServiceBase
    {
        #region Constructor

        /// <summary>
        /// Constructor.
        /// </summary>
        public FengCnblogsService()
        {
            InitializeComponent();
        }

        #endregion

        #region Fields

        /// <summary>
        /// The crawler settings.
        /// </summary>
        private static readonly CrawlSettings Settings = new CrawlSettings();

        /// <summary>
        /// In-memory table that buffers rows before they are written to the database.
        /// </summary>
        private static DataTable dt = new DataTable();

        /// <summary>
        /// Bloom filter for URLs. Background: http://www.49028c.com/heaad/archive/2011/01/02/1924195.html
        /// </summary>
        private static BloomFilter<string> filter;

        #endregion

        #region Start the service

        /// <summary>
        /// Starts the crawl when the service starts.
        /// </summary>
        /// <param name="args"></param>
        protected override void OnStart(string[] args)
        {
            ProcessStart();
        }

        #endregion

        #region Stop the service

        /// <summary>
        /// TODO: add any shutdown work needed to stop the service here.
        /// </summary>
        protected override void OnStop()
        {
        }

        #endregion

        #region Start processing

        /// <summary>
        /// Configures the crawler and starts it.
        /// </summary>
        private void ProcessStart()
        {
            dt.Columns.Add("BlogTitle", typeof(string));
            dt.Columns.Add("BlogUrl", typeof(string));
            dt.Columns.Add("BlogAuthor", typeof(string));
            dt.Columns.Add("BlogTime", typeof(string));
            dt.Columns.Add("BlogMotto", typeof(string));
            dt.Columns.Add("BlogDepth", typeof(string));

            filter = new BloomFilter<string>(200000);
            const string CityName = "";

            #region Seed addresses

            // Seed addresses the crawl starts from
            Settings.SeedsAddress.Add(string.Format("http://www.49028c.com/{0}", CityName));
            Settings.SeedsAddress.Add("http://www.49028c.com/artech");
            Settings.SeedsAddress.Add("http://www.49028c.com/wuhuacong/");
            Settings.SeedsAddress.Add("http://www.49028c.com/dudu/");
            Settings.SeedsAddress.Add("http://www.49028c.com/guomingfeng/");
            Settings.SeedsAddress.Add("http://www.49028c.com/daxnet/");
            Settings.SeedsAddress.Add("http://www.49028c.com/fenglingyi");
            Settings.SeedsAddress.Add("http://www.49028c.com/ahthw/");
            Settings.SeedsAddress.Add("http://www.49028c.com/wangweimutou/");

            #endregion

            #region URL keywords

            // Adds the keywords "a/" through "z/", same as listing all 26 by hand
            for (char c = 'a'; c <= 'z'; c++)
            {
                Settings.HrefKeywords.Add(c + "/");
            }

            #endregion

            // Number of crawl threads
            Settings.ThreadCount = 1;

            // Crawl depth
            Settings.Depth = 55;

            // Links to skip while crawling, matched by suffix; more than one can be added
            Settings.EscapeLinks.Add("http://www.oschina.net/");

            // Automatic speed limit: a random 1 to 5 second interval between requests
            Settings.AutoSpeedLimit = false;

            // Whether to lock the host: with subdomains stripped, equal domains count as the same site
            Settings.LockHost = false;

            Settings.RegularFilterExpressions.Add(@"http://([w]{3}.)+[VEVb]+.com/");

            var master = new CrawlMaster(Settings);
            master.AddUrlEvent += MasterAddUrlEvent;
            master.DataReceivedEvent += MasterDataReceivedEvent;
            master.Crawl();
        }

        #endregion

        #region Print URL

        /// <summary>
        /// The master add url event.
        /// </summary>
        /// <param name="args">
        /// The args.
        /// </param>
        /// <returns>
        /// The <see cref="bool"/>.
        /// </returns>
        private static bool MasterAddUrlEvent(AddUrlEventArgs args)
        {
            if (!filter.Contains(args.Url))
            {
                filter.Add(args.Url);
                Console.WriteLine(args.Url);

                if (dt.Rows.Count > 200)
                {
                    MssqlHelper.InsertDb(dt);
                    dt.Rows.Clear();
                }

                return true;
            }

            return false; // false means: do not add the URL to the queue
        }

        #endregion

        #region Parse HTML

        /// <summary>
        /// The master data received event.
        /// </summary>
        /// <param name="args">
        /// The args.
        /// </param>
        private static void MasterDataReceivedEvent(SimpleCrawler.DataReceivedEventArgs args)
        {
            // Parse the page here. HtmlAgilityPack (an HTML parsing component) is used below;
            // regular expressions or hand-written string analysis would also work.
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(args.Html);

            HtmlNode node = doc.DocumentNode.SelectSingleNode("//title");
            string title = node.InnerText;

            HtmlNode node2 = doc.DocumentNode.SelectSingleNode("//*[@id='post-date']");
            string time = node2.InnerText;

            HtmlNode node3 = doc.DocumentNode.SelectSingleNode("//*[@id='topics']/div/div[3]/a[1]");
            string author = node3.InnerText;

            HtmlNode node6 = doc.DocumentNode.SelectSingleNode("//*[@id='blogTitle']/h2");
            string motto = node6.InnerText;

            MssqlHelper.GetData(title, args.Url, author, time, motto, args.Depth.ToString(), dt);

            LogHelper.WriteLog(title);
            LogHelper.WriteLog(args.Url);
            LogHelper.WriteLog(author);
            LogHelper.WriteLog(time);
            LogHelper.WriteLog(motto == "" ? "null" : motto);
            LogHelper.WriteLog(args.Depth + "&dt.Rows.Count=" + dt.Rows.Count);

            // Write to the database once more than 100 rows are buffered; tune the number to your needs
            if (dt.Rows.Count > 100)
            {
                MssqlHelper.InsertDb(dt);
                dt.Rows.Clear();
            }
        }

        #endregion
    }
}

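Here the crawler fetches posts from Cnblogs, and the third-party HtmlAgilityPack library parses out the fields we need: the post title, the author, the post URL, and so on. We can also configure some properties of the service itself.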

The crawler needs a few parameters: the seed addresses, the URL keywords, the crawl depth, and so on. Once those are set, all that remains is to install the Windows service.
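One detail of the service code is worth a closer look: `MasterAddUrlEvent` uses a `BloomFilter<string>` sized for 200000 entries to decide whether a URL has been seen before, but the source of that class is not shown in this article. The following is only a minimal sketch of how such a filter might work. `SimpleBloomFilter`, its `bitsPerItem` and `hashCount` parameters, and the FNV-1a based double hashing are all assumptions made for illustration, not the class the project actually uses.

```csharp
using System;
using System.Collections;

// Hypothetical stand-in for the BloomFilter<string> used by the service.
public sealed class SimpleBloomFilter
{
    private readonly BitArray bits;
    private readonly int hashCount;

    // bitsPerItem and hashCount are illustrative defaults, not values from the article.
    public SimpleBloomFilter(int capacity, int bitsPerItem = 10, int hashCount = 7)
    {
        if (capacity <= 0)
        {
            throw new ArgumentOutOfRangeException("capacity");
        }

        this.bits = new BitArray(capacity * bitsPerItem);
        this.hashCount = hashCount;
    }

    // Sets one bit per hash function for the item.
    public void Add(string item)
    {
        uint h1 = Fnv1a(item, 2166136261u);
        uint h2 = Fnv1a(item, 0x9747b28cu) | 1u; // forced odd so the step is never zero
        for (int i = 0; i < this.hashCount; i++)
        {
            this.bits[this.Index(h1, h2, i)] = true;
        }
    }

    // Returns false if the item was definitely never added;
    // true means "probably added" (false positives are possible).
    public bool Contains(string item)
    {
        uint h1 = Fnv1a(item, 2166136261u);
        uint h2 = Fnv1a(item, 0x9747b28cu) | 1u;
        for (int i = 0; i < this.hashCount; i++)
        {
            if (!this.bits[this.Index(h1, h2, i)])
            {
                return false;
            }
        }

        return true;
    }

    // Double hashing: the i-th bit position is derived from two base hashes.
    private int Index(uint h1, uint h2, int i)
    {
        return (int)((h1 + (ulong)i * h2) % (ulong)this.bits.Length);
    }

    // FNV-1a over the UTF-16 code units of the string, with a configurable seed.
    private static uint Fnv1a(string s, uint seed)
    {
        unchecked
        {
            uint hash = seed;
            foreach (char c in s)
            {
                hash ^= c;
                hash *= 16777619u;
            }

            return hash;
        }
    }
}

public static class Program
{
    public static void Main()
    {
        var filter = new SimpleBloomFilter(200000);
        string url = "http://www.49028c.com/artech";

        Console.WriteLine(filter.Contains(url)); // False: nothing has been added yet
        filter.Add(url);
        Console.WriteLine(filter.Contains(url)); // True: an added item is always found
    }
}
```

The trade-off is memory against accuracy: a Bloom filter never misses a URL that was added, but it can occasionally report an unseen URL as seen, in which case the crawler would skip that page. For a crawl of this size that is usually an acceptable price for not keeping every URL string in memory.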

6. Installing the Windows Service

Here we install the Windows service with the tool that ships with Visual Studio.

After a successful installation, open the Windows Services console and the newly installed service appears in the list.

We can also open the log file to see the information parsed from the crawled posts, as shown in the figure below.

Now check the database; at this point my service had been running for one day...

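If you found this article useful, please recommend it. My ability is limited, so corrections are welcome if anything here is wrong. If you need the source code, leave your email address...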

 

