
Detecting the encoding of a text file in C#

2020-01-24 01:15:15
Source: reposted
Contributed by: a site user

How can C# detect the encoding of a text file? This article shares sample code for doing so; the details follow.

using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

namespace KlerksSoft
{
    public static class TextFileEncodingDetector
    {
        /*
         * Simple class to handle text file encoding woes (in a primarily English-speaking tech
         * world).
         *
         * - This code is fully managed, no shady calls to MLang (the unmanaged codepage
         *   detection library originally developed for Internet Explorer).
         *
         * - This class does NOT try to detect arbitrary codepages/charsets, it really only
         *   aims to differentiate between some of the most common variants of Unicode
         *   encoding, and a "default" (western / ascii-based) encoding alternative provided
         *   by the caller.
         *
         * - As there is no "Reliable" way to distinguish between UTF-8 (without BOM) and
         *   Windows-1252 (in .Net, also incorrectly called "ASCII") encodings, we use a
         *   heuristic - so the more of the file we can sample the better the guess. If you
         *   are going to read the whole file into memory at some point, then best to pass
         *   in the whole byte array directly. Otherwise, decide how to trade off
         *   reliability against performance / memory usage.
         *
         * - The UTF-8 detection heuristic only works for western text, as it relies on
         *   the presence of UTF-8 encoded accented and other characters found in the upper
         *   ranges of the Latin-1 and (particularly) Windows-1252 codepages.
         *
         * - For more general detection routines, see existing projects / resources:
         *   - MLang - Microsoft library originally for IE6, available in Windows XP and later APIs now (I think?)
         *     - MLang .Net bindings: http://www.codeproject.com/KB/recipes/DetectEncoding.aspx
         *   - CharDet - Mozilla browser's detection routines
         *     - Ported to Java then .Net: http://www.conceptdevelopment.net/Localization/NCharDet/
         *     - Ported straight to .Net: http://code.google.com/p/chardetsharp/source/browse
         *
         * Copyright Tao Klerks, Jan 2010, tao@klerks.biz
         * Licensed under the modified BSD license:
         *
         * Redistribution and use in source and binary forms, with or without modification, are
         * permitted provided that the following conditions are met:
         *
         * - Redistributions of source code must retain the above copyright notice, this list of
         *   conditions and the following disclaimer.
         * - Redistributions in binary form must reproduce the above copyright notice, this list
         *   of conditions and the following disclaimer in the documentation and/or other materials
         *   provided with the distribution.
         * - The name of the author may not be used to endorse or promote products derived from
         *   this software without specific prior written permission.
         *
         * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES,
         * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
         * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY
         * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
         * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
         * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
         * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
         * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY
         * OF SUCH DAMAGE.
         */

        //completely arbitrary - inappropriate for high numbers of files / high speed requirements
        const long _defaultHeuristicSampleSize = 0x10000;

        public static Encoding DetectTextFileEncoding(string InputFilename, Encoding DefaultEncoding)
        {
            using (FileStream textfileStream = File.OpenRead(InputFilename))
            {
                return DetectTextFileEncoding(textfileStream, DefaultEncoding, _defaultHeuristicSampleSize);
            }
        }

        public static Encoding DetectTextFileEncoding(FileStream InputFileStream, Encoding DefaultEncoding, long HeuristicSampleSize)
        {
            //ArgumentNullException takes (paramName, message), in that order
            if (InputFileStream == null)
                throw new ArgumentNullException("InputFileStream", "Must provide a valid Filestream!");

            if (!InputFileStream.CanRead)
                throw new ArgumentException("Provided file stream is not readable!", "InputFileStream");

            if (!InputFileStream.CanSeek)
                throw new ArgumentException("Provided file stream cannot seek!", "InputFileStream");

            Encoding encodingFound = null;

            long originalPos = InputFileStream.Position;

            InputFileStream.Position = 0;

            //First read only what we need for BOM detection
            byte[] bomBytes = new byte[InputFileStream.Length > 4 ? 4 : InputFileStream.Length];
            InputFileStream.Read(bomBytes, 0, bomBytes.Length);

            encodingFound = DetectBOMBytes(bomBytes);

            if (encodingFound != null)
            {
                InputFileStream.Position = originalPos;
                return encodingFound;
            }

            //BOM Detection failed, going for heuristics now.
            // create sample byte array and populate it
            byte[] sampleBytes = new byte[HeuristicSampleSize > InputFileStream.Length ? InputFileStream.Length : HeuristicSampleSize];
            Array.Copy(bomBytes, sampleBytes, bomBytes.Length);
            if (InputFileStream.Length > bomBytes.Length)
                InputFileStream.Read(sampleBytes, bomBytes.Length, sampleBytes.Length - bomBytes.Length);
            InputFileStream.Position = originalPos;

            //test byte array content
            encodingFound = DetectUnicodeInByteSampleByHeuristics(sampleBytes);

            if (encodingFound != null)
                return encodingFound;
            else
                return DefaultEncoding;
        }

        public static Encoding DetectTextByteArrayEncoding(byte[] TextData, Encoding DefaultEncoding)
        {
            if (TextData == null)
                throw new ArgumentNullException("TextData", "Must provide a valid text data byte array!");

            Encoding encodingFound = null;

            encodingFound = DetectBOMBytes(TextData);

            if (encodingFound != null)
            {
                return encodingFound;
            }
            else
            {
                //test byte array content
                encodingFound = DetectUnicodeInByteSampleByHeuristics(TextData);

                if (encodingFound != null)
                    return encodingFound;
                else
                    return DefaultEncoding;
            }
        }

        public static Encoding DetectBOMBytes(byte[] BOMBytes)
        {
            if (BOMBytes == null)
                throw new ArgumentNullException("BOMBytes", "Must provide a valid BOM byte array!");

            if (BOMBytes.Length < 2)
                return null;

            //UTF-16 LE: FF FE, unless followed by 00 00 (that is the UTF-32 LE BOM, checked below)
            if (BOMBytes[0] == 0xff
                && BOMBytes[1] == 0xfe
                && (BOMBytes.Length < 4
                    || BOMBytes[2] != 0
                    || BOMBytes[3] != 0
                    )
                )
                return Encoding.Unicode;

            //UTF-16 BE: FE FF
            if (BOMBytes[0] == 0xfe
                && BOMBytes[1] == 0xff
                )
                return Encoding.BigEndianUnicode;

            if (BOMBytes.Length < 3)
                return null;

            //UTF-8: EF BB BF
            if (BOMBytes[0] == 0xef && BOMBytes[1] == 0xbb && BOMBytes[2] == 0xbf)
                return Encoding.UTF8;

            //UTF-7: 2B 2F 76 (Encoding.UTF7 is marked obsolete in .NET 5 and later)
            if (BOMBytes[0] == 0x2b && BOMBytes[1] == 0x2f && BOMBytes[2] == 0x76)
                return Encoding.UTF7;

            if (BOMBytes.Length < 4)
                return null;

            //UTF-32 LE: FF FE 00 00
            if (BOMBytes[0] == 0xff && BOMBytes[1] == 0xfe && BOMBytes[2] == 0 && BOMBytes[3] == 0)
                return Encoding.UTF32;

            //UTF-32 BE: 00 00 FE FF (code page 12001)
            if (BOMBytes[0] == 0 && BOMBytes[1] == 0 && BOMBytes[2] == 0xfe && BOMBytes[3] == 0xff)
                return Encoding.GetEncoding(12001);

            return null;
        }

        public static Encoding DetectUnicodeInByteSampleByHeuristics(byte[] SampleBytes)
        {
            long oddBinaryNullsInSample = 0;
            long evenBinaryNullsInSample = 0;
            long suspiciousUTF8SequenceCount = 0;
            long suspiciousUTF8BytesTotal = 0;
            long likelyUSASCIIBytesInSample = 0;

            //Cycle through, keeping count of binary null positions, possible UTF-8
            // sequences from upper ranges of Windows-1252, and probable US-ASCII
            // character counts.

            long currentPos = 0;
            int skipUTF8Bytes = 0;

            while (currentPos < SampleBytes.Length)
            {
                //binary null distribution
                if (SampleBytes[currentPos] == 0)
                {
                    if (currentPos % 2 == 0)
                        evenBinaryNullsInSample++;
                    else
                        oddBinaryNullsInSample++;
                }

                //likely US-ASCII characters
                if (IsCommonUSASCIIByte(SampleBytes[currentPos]))
                    likelyUSASCIIBytesInSample++;

                //suspicious sequences (look like UTF-8)
                if (skipUTF8Bytes == 0)
                {
                    int lengthFound = DetectSuspiciousUTF8SequenceLength(SampleBytes, currentPos);

                    if (lengthFound > 0)
                    {
                        suspiciousUTF8SequenceCount++;
                        suspiciousUTF8BytesTotal += lengthFound;
                        skipUTF8Bytes = lengthFound - 1;
                    }
                }
                else
                {
                    skipUTF8Bytes--;
                }

                currentPos++;
            }

            //1: UTF-16 LE - in english / european environments, this is usually characterized by a
            // high proportion of odd binary nulls (starting at 0), with (as this is text) a low
            // proportion of even binary nulls.
            // The thresholds here used (less than 20% nulls where you expect non-nulls, and more than
            // 60% nulls where you do expect nulls) are completely arbitrary.

            if (((evenBinaryNullsInSample * 2.0) / SampleBytes.Length) < 0.2
                && ((oddBinaryNullsInSample * 2.0) / SampleBytes.Length) > 0.6
                )
                return Encoding.Unicode;

            //2: UTF-16 BE - in english / european environments, this is usually characterized by a
            // high proportion of even binary nulls (starting at 0), with (as this is text) a low
            // proportion of odd binary nulls.
            // The thresholds here used (less than 20% nulls where you expect non-nulls, and more than
            // 60% nulls where you do expect nulls) are completely arbitrary.

            if (((oddBinaryNullsInSample * 2.0) / SampleBytes.Length) < 0.2
                && ((evenBinaryNullsInSample * 2.0) / SampleBytes.Length) > 0.6
                )
                return Encoding.BigEndianUnicode;

            //3: UTF-8 - Martin Dürst outlines a method for detecting whether something CAN be UTF-8 content
            // using regexp, in his w3c.org unicode FAQ entry:
            // http://www.w3.org/International/questions/qa-forms-utf-8
            // adapted here for C#.
            //
            // NOTE: Encoding.ASCII decodes every byte above 0x7F as '?', so the high-byte alternatives
            // in the pattern never see the original byte values; as written, this check mainly rejects
            // samples that contain control characters other than tab, LF and CR.
            string potentiallyMangledString = Encoding.ASCII.GetString(SampleBytes);
            Regex UTF8Validator = new Regex(@"\A("
                + @"[\x09\x0A\x0D\x20-\x7E]"
                + @"|[\xC2-\xDF][\x80-\xBF]"
                + @"|\xE0[\xA0-\xBF][\x80-\xBF]"
                + @"|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}"
                + @"|\xED[\x80-\x9F][\x80-\xBF]"
                + @"|\xF0[\x90-\xBF][\x80-\xBF]{2}"
                + @"|[\xF1-\xF3][\x80-\xBF]{3}"
                + @"|\xF4[\x80-\x8F][\x80-\xBF]{2}"
                + @")*\z");
            if (UTF8Validator.IsMatch(potentiallyMangledString))
            {
                //Unfortunately, just the fact that it CAN be UTF-8 doesn't tell you much about probabilities.
                //If all the characters are in the 0-127 range, no harm done, most western charsets are same as UTF-8 in these ranges.
                //If some of the characters were in the upper range (western accented characters), however, they would likely be mangled to 2-byte by the UTF-8 encoding process.
                // So, we need to play stats.

                // The "Random" likelihood of any pair of randomly generated characters being one
                // of these "suspicious" character sequences is:
                // 128 / (256 * 256) = 0.2%.
                //
                // In western text data, that is SIGNIFICANTLY reduced - most text data stays in the <127
                // character range, so we assume that more than 1 in 500,000 of these character
                // sequences indicates UTF-8. The number 500,000 is completely arbitrary - so sue me.
                //
                // We can only assume these character sequences will be rare if we ALSO assume that this
                // IS in fact western text - in which case the bulk of the UTF-8 encoded data (that is
                // not already suspicious sequences) should be plain US-ASCII bytes. This, I
                // arbitrarily decided, should be 80% (a random distribution, eg binary data, would yield
                // approx 40%, so the chances of hitting this threshold by accident in random data are
                // VERY low).

                if ((suspiciousUTF8SequenceCount * 500000.0 / SampleBytes.Length >= 1) //suspicious sequences
                    && (
                        //all suspicious, so cannot evaluate proportion of US-Ascii
                        SampleBytes.Length - suspiciousUTF8BytesTotal == 0
                        ||
                        likelyUSASCIIBytesInSample * 1.0 / (SampleBytes.Length - suspiciousUTF8BytesTotal) >= 0.8
                        )
                    )
                    return Encoding.UTF8;
            }

            return null;
        }

        private static bool IsCommonUSASCIIByte(byte testByte)
        {
            if (testByte == 0x0A //lf
                || testByte == 0x0D //cr
                || testByte == 0x09 //tab
                || (testByte >= 0x20 && testByte <= 0x2F) //common punctuation
                || (testByte >= 0x30 && testByte <= 0x39) //digits
                || (testByte >= 0x3A && testByte <= 0x40) //common punctuation
                || (testByte >= 0x41 && testByte <= 0x5A) //capital letters
                || (testByte >= 0x5B && testByte <= 0x60) //common punctuation
                || (testByte >= 0x61 && testByte <= 0x7A) //lowercase letters
                || (testByte >= 0x7B && testByte <= 0x7E) //common punctuation
                )
                return true;
            else
                return false;
        }

        //Returns the length (2 or 3) of a UTF-8 sequence at currentPos that encodes a character
        // from the upper range of Windows-1252, or 0 if there is none.
        //The length checks use ">" so that the look-ahead bytes are guaranteed to exist
        // (the original ">=" could read one byte past the end of the sample).
        private static int DetectSuspiciousUTF8SequenceLength(byte[] SampleBytes, long currentPos)
        {
            int lengthFound = 0;

            if (SampleBytes.Length > currentPos + 1
                && SampleBytes[currentPos] == 0xC2
                )
            {
                if (SampleBytes[currentPos + 1] == 0x81
                    || SampleBytes[currentPos + 1] == 0x8D
                    || SampleBytes[currentPos + 1] == 0x8F
                    )
                    lengthFound = 2;
                else if (SampleBytes[currentPos + 1] == 0x90
                    || SampleBytes[currentPos + 1] == 0x9D
                    )
                    lengthFound = 2;
                else if (SampleBytes[currentPos + 1] >= 0xA0
                    && SampleBytes[currentPos + 1] <= 0xBF
                    )
                    lengthFound = 2;
            }
            else if (SampleBytes.Length > currentPos + 1
                && SampleBytes[currentPos] == 0xC3
                )
            {
                if (SampleBytes[currentPos + 1] >= 0x80
                    && SampleBytes[currentPos + 1] <= 0xBF
                    )
                    lengthFound = 2;
            }
            else if (SampleBytes.Length > currentPos + 1
                && SampleBytes[currentPos] == 0xC5
                )
            {
                if (SampleBytes[currentPos + 1] == 0x92
                    || SampleBytes[currentPos + 1] == 0x93
                    )
                    lengthFound = 2;
                else if (SampleBytes[currentPos + 1] == 0xA0
                    || SampleBytes[currentPos + 1] == 0xA1
                    )
                    lengthFound = 2;
                else if (SampleBytes[currentPos + 1] == 0xB8
                    || SampleBytes[currentPos + 1] == 0xBD
                    || SampleBytes[currentPos + 1] == 0xBE
                    )
                    lengthFound = 2;
            }
            else if (SampleBytes.Length > currentPos + 1
                && SampleBytes[currentPos] == 0xC6
                )
            {
                if (SampleBytes[currentPos + 1] == 0x92)
                    lengthFound = 2;
            }
            else if (SampleBytes.Length > currentPos + 1
                && SampleBytes[currentPos] == 0xCB
                )
            {
                if (SampleBytes[currentPos + 1] == 0x86
                    || SampleBytes[currentPos + 1] == 0x9C
                    )
                    lengthFound = 2;
            }
            else if (SampleBytes.Length > currentPos + 2
                && SampleBytes[currentPos] == 0xE2
                )
            {
                if (SampleBytes[currentPos + 1] == 0x80)
                {
                    if (SampleBytes[currentPos + 2] == 0x93
                        || SampleBytes[currentPos + 2] == 0x94
                        )
                        lengthFound = 3;
                    if (SampleBytes[currentPos + 2] == 0x98
                        || SampleBytes[currentPos + 2] == 0x99
                        || SampleBytes[currentPos + 2] == 0x9A
                        )
                        lengthFound = 3;
                    if (SampleBytes[currentPos + 2] == 0x9C
                        || SampleBytes[currentPos + 2] == 0x9D
                        || SampleBytes[currentPos + 2] == 0x9E
                        )
                        lengthFound = 3;
                    if (SampleBytes[currentPos + 2] == 0xA0
                        || SampleBytes[currentPos + 2] == 0xA1
                        || SampleBytes[currentPos + 2] == 0xA2
                        )
                        lengthFound = 3;
                    if (SampleBytes[currentPos + 2] == 0xA6)
                        lengthFound = 3;
                    if (SampleBytes[currentPos + 2] == 0xB0)
                        lengthFound = 3;
                    if (SampleBytes[currentPos + 2] == 0xB9
                        || SampleBytes[currentPos + 2] == 0xBA
                        )
                        lengthFound = 3;
                }
                else if (SampleBytes[currentPos + 1] == 0x82
                    && SampleBytes[currentPos + 2] == 0xAC
                    )
                    lengthFound = 3;
                else if (SampleBytes[currentPos + 1] == 0x84
                    && SampleBytes[currentPos + 2] == 0xA2
                    )
                    lengthFound = 3;
            }

            return lengthFound;
        }
    }
}
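The "CAN this be UTF-8?" step can also be done without a regular expression. The sketch below is an alternative technique, not the author's method: it relies on the framework's own strict UTF-8 decoder. `StrictUtf8Check` and `IsWellFormedUtf8` are hypothetical names introduced only for this illustration. As with the regex approach, a positive answer means only that the bytes are well-formed UTF-8 (pure ASCII passes too), so the statistical step is still needed, and a sample cut in the middle of a multi-byte character is reported as invalid.

```csharp
using System;
using System.Text;

public static class StrictUtf8Check
{
    // Hypothetical helper, not part of TextFileEncodingDetector.
    // Returns true when the bytes decode as well-formed UTF-8.
    public static bool IsWellFormedUtf8(byte[] data)
    {
        if (data == null)
            throw new ArgumentNullException("data");

        // throwOnInvalidBytes: true makes the decoder throw on malformed input
        // instead of silently substituting replacement characters.
        var strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            strictUtf8.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

For example, the bytes `63 61 66 C3 A9` ("café" in UTF-8) pass, while `63 E9 66` (an "é" in Windows-1252 followed by "f") fail, because `E9` announces a multi-byte sequence that `66` cannot continue.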

Usage:

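// Encoding.Default is the fallback when no Unicode encoding is detected.
// Note that on .NET Core and .NET 5+ Encoding.Default is UTF-8, not the system ANSI code page.
Encoding fileEncoding = TextFileEncodingDetector.DetectTextFileEncoding("your file path", Encoding.Default);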
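For comparison, when a file is known to carry a byte order mark, the framework's `StreamReader` can perform the BOM part of the detection by itself. The following is a minimal sketch of that built-in route, not part of the class above; `BomAwareReader` and its `ReadAllText` method are hypothetical names used only here. It does no heuristics: without a BOM it simply uses the fallback encoding, which is exactly the gap the detector above tries to fill.

```csharp
using System.IO;
using System.Text;

public static class BomAwareReader
{
    // Hypothetical helper for illustration only.
    // Reads a whole file, honouring a BOM if one is present and
    // using `fallback` otherwise; `used` reports which encoding applied.
    public static string ReadAllText(string path, Encoding fallback, out Encoding used)
    {
        using (var reader = new StreamReader(path, fallback, true))
        {
            string text = reader.ReadToEnd();
            // CurrentEncoding is only meaningful after the first read.
            used = reader.CurrentEncoding;
            return text;
        }
    }
}
```

The encoding returned by `DetectTextFileEncoding` can be passed as the `fallback` argument, so that files with a BOM and files without one go through the same read path.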

That is all for this article; I hope it helps you in learning C# programming.
