A Simple Java Scraper, Part 1

2019-11-14 21:03:00


Source: reposted

Contributed by: a reader


【Goal】Scrape the nationwide mobile phone number segments from the target website into a database table.

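【Steps】

1. Get a first look at regular expressions and learn to write simple ones.

2. Fetch the content of a single web page and learn the basic I/O streams in Java.

3. Insert the scraped data into a MySQL table and learn basic JDBC programming.

5. Build the full URL of every city by concatenating URL parts.

6. Scrape the number segments for the whole site and bulk-insert them into the table using batching plus precompiled statements.

7. Use StringBuilder to speed things up.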

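【Database table】Note: if you create the table from the cmd command line, the column names do not need to be quoted (the backticks can be dropped).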

create table number_segment (
    `id` bigint not null auto_increment unique,
    `segment` char(7) not null primary key,
    `province` varchar(255) not null,
    `city` varchar(255) not null
) default charset=utf8;


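【A first look at regular expressions】

1. Learn the basic syntax: see the tutorial "Regular Expressions: A 30-Minute Introduction".

2. Test your own expressions with an online regular expression tester.

3. Use Java's Pattern and Matcher classes.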

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class test_ZhengZe {
    public static void main(String[] args) {
        // "13" followed by five digits, then one character that is not '<'
        Pattern p = Pattern.compile("(13\\d{5}[^<])");
        String s = "/mobile/guangzhou_1300040.>1300040</a></li><li><a href=\"../../mobile/guangzhou_1300041.html\">1300041</a></li><li><a ";
        Matcher m = p.matcher(s);
        while (m.find()) {
            System.out.println("Matched segment: " + m.group(0));
        }
        // groupCount() is the number of capturing groups in the pattern, not the number of matches
        System.out.print("Number of capturing groups: " + m.groupCount());
    }
}
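To make the difference between group(0), group(1), and groupCount() concrete, here is a minimal sketch that reuses the segment pattern from the next section. The class name SegmentRegexDemo, the helper extractSegments, and the HTML snippet are made up for illustration; the real page markup may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SegmentRegexDemo {
    // One capturing group; the surrounding '>' and '<' tie the match to link text.
    static final Pattern SEGMENT =
            Pattern.compile(">(13\\d{5}|15\\d{5}|18\\d{5}|147\\d{4})<");

    // Returns every 7-digit segment found between '>' and '<' in the given HTML.
    static List<String> extractSegments(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = SEGMENT.matcher(html);
        while (m.find()) {
            found.add(m.group(1)); // group(1) excludes the '>' and '<' delimiters
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical snippet shaped like the links on a city page (an assumption).
        String html = "<li><a href=\"guangzhou_1300040.html\">1300040</a></li>"
                + "<li><a href=\"guangzhou_1300041.html\">1300041</a></li>";
        List<String> segments = extractSegments(html);
        System.out.println("Matches found: " + segments.size() + " " + segments);
        // groupCount() depends only on the pattern, so it stays 1 however many matches there are.
        System.out.println("Capturing groups in the pattern: "
                + SEGMENT.matcher(html).groupCount());
    }
}
```

Counting matches means counting find() iterations (or the size of the returned list); groupCount() only says how many parenthesized groups the pattern defines.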


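【Fetching the page content】

This step mainly uses two I/O stream classes, InputStreamReader and BufferedReader. For more approaches, see the article "A summary of ways to fetch web page content in Java".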

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class getHtml {
    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();
        String str_url = "http://www.hiphop8.com/city/guangdong/guangzhou.php";
        // Match the number segments
        Pattern p = Pattern.compile(">(13\\d{5}|15\\d{5}|18\\d{5}|147\\d{4})<");
        String html = get_Html(str_url);
        Matcher m = p.matcher(html);
        int num = 0;
        while (m.find()) {
            System.out.println("Matched segment: " + m.group(1) + "  No. " + (++num));
        }
        System.out.println(num);
        long end = System.currentTimeMillis();
        System.out.println("Time taken: " + (end - start) + " ms");
    }

    // Download the page at str_url and return its content as one string
    public static String get_Html(String str_url) throws IOException {
        URL url = new URL(str_url);
        String content = "";
        StringBuffer page = new StringBuffer();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
            while ((content = in.readLine()) != null) {
                page.append(content);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return page.toString();
    }
}
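get_Html decodes the page with the platform default charset and drops the line breaks. The sketch below shows one possible variation: it takes an explicit charset, keeps the line breaks, and uses StringBuilder, which the step list names as the later optimization. The class name PageReader and the helper readAll are hypothetical, and UTF-8 is only an assumption because the source does not state the site's encoding. A byte array stands in for url.openStream() so the example runs without network access.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class PageReader {
    // Reads the whole stream as text in the given charset, ending every line with '\n'.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder page = new StringBuilder();
        // try-with-resources closes the reader (and the wrapped stream) on exit
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                page.append(line).append('\n');
            }
        } catch (IOException e) {
            // Rethrow instead of returning a silently truncated page
            throw new UncheckedIOException(e);
        }
        return page.toString();
    }

    public static void main(String[] args) {
        // Stand-in for url.openStream(); these bytes are made up for the demo.
        byte[] fake = "<li>\n<a>1300040</a>\n</li>".getBytes(StandardCharsets.UTF_8);
        System.out.print(readAll(new ByteArrayInputStream(fake), StandardCharsets.UTF_8));
    }
}
```

With a real page the call would look like readAll(url.openStream(), charset), where charset matches whatever encoding the site actually serves.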

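【Inserting the scraped content into the database】

Connecting to MySQL from Java goes roughly as follows:

Load the MySQL driver -> create a database connection -> create a Statement object for executing SQL -> define the SQL as a String and have the Statement execute it -> close the Statement and the connection.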

import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class database {
    public static String driver = "com.mysql.jdbc.Driver";
    public static String url = "jdbc:mysql://127.0.0.1:3306/tele_dat?autoReconnect=true&characterEncoding=UTF-8";
    public static String user = "root";
    public static String passWord = "123456";
    public static Statement statement = null;
    public static java.sql.Connection conn = null;
    public static int i = 0;

    // Insert data by executing the given SQL statement
    public static void datatoMySql(String sql) throws SQLException {
        try {
            Class.forName(driver);
        } catch (ClassNotFoundException e) {
            System.out.println("Failed to load the driver");
            e.printStackTrace();
        }
        conn = DriverManager.getConnection(url, user, passWord); // open a connection
        statement = conn.createStatement(); // create a Statement object to send the SQL
        statement.executeUpdate(sql);
    }

    public static void close() throws SQLException {
        if (statement != null) statement.close(); // close the statement
        if (conn != null) conn.close();           // close the connection
    }

    // Example: test the database connection
    public static void main(String args[]) {
        String sql = "insert  into   number_segment(segment,province,city) "
                + "values (123458,'廣東1','廣州') ";
        try {
            datatoMySql(sql);
            System.out.println("Insert succeeded");
        } catch (SQLException e) {
            System.out.println("Insert failed");
            e.printStackTrace();
        }
        try {
            close();
            System.out.print("Database closed");
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
}

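I used the MySQL database bundled with WampServer and worked from the cmd command line; for the commands I used most, see a list of common MySQL commands. If you are not familiar with JDBC programming, the blog post referenced here is a good starting point.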
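The step list names batching plus precompiled statements as the way to bulk-insert the whole site's segments. The following is a minimal sketch of that idea using the standard JDBC PreparedStatement batch calls. It is not the author's code: the class name BatchInsertSketch, the method insertAll, and the row layout {segment, province, city} are assumptions made for illustration.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

class BatchInsertSketch {
    // Precompiled insert for the number_segment table; values are bound per row.
    static final String INSERT_SQL =
            "insert into number_segment(segment, province, city) values (?, ?, ?)";

    // Inserts all rows in one transaction and returns the number of batched statements.
    // Each row is assumed to be {segment, province, city}.
    static int insertAll(Connection conn, List<String[]> rows) throws SQLException {
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false); // commit once at the end instead of per row
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.setString(3, row[2]);
                ps.addBatch(); // queue the row instead of executing it immediately
            }
            int[] counts = ps.executeBatch();
            conn.commit();
            return counts.length;
        } catch (SQLException e) {
            conn.rollback(); // undo the partial batch
            throw e;
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
    }
}
```

A caller would open the connection the same way the database class does, with DriverManager.getConnection(url, user, passWord), and pass in the rows collected by the scraper. Binding values through placeholders also avoids building SQL strings by hand.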

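【Getting the URL of every city on the site】

The source of the site's home page contains the URL of every province. Each province page in turn contains the trailing part of its cities' URLs, so joining a province URL with a city suffix gives the complete URL of that city.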

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class get_all_city_url {
    public static void main(String[] args) throws Exception {
        String home_url = "http://www.hiphop8.com";
        String pattern_pro = "\\w{3}\\.\\w{7}\\.\\w{3}\\/\\w{4}\\/\\w+"; // matches a province URL
        String pattern_city_hz = "<LI><A href=\"(.*?)\" target=_blank>"; // captures a city suffix
        Matcher mat_home = get(home_url, pattern_pro);
        int i = 0;
        // The URLs could be stored in an ArrayList, and the strings could be joined with
        // StringBuilder, but in testing the running time was about the same.
        long start = System.currentTimeMillis();
        while (mat_home.find()) {
            String city_url_qz = "http://" + mat_home.group() + "/"; // province URL prefix
            Matcher mat_city_hz = get(city_url_qz, pattern_city_hz);
            while (mat_city_hz.find()) {
                i++;
                String city_url = city_url_qz + mat_city_hz.group(1);
                System.out.println(i + "  " + city_url);
            }
        }
        long end = System.currentTimeMillis();
        long time = end - start;
        System.out.println("Total time: " + time);
    }

    // Fetch the page at str and return a Matcher for pattern pa over its content
    public static Matcher get(String str, String pa) throws Exception {
        String urlsource = get_Html(str);
        Pattern p = Pattern.compile(pa);
        Matcher m = p.matcher(urlsource);
        return m;
    }

    public static String get_Html(String str_url) throws IOException {
        URL url = new URL(str_url);
        String content = "";
        StringBuffer page = new StringBuffer();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
            while ((content = in.readLine()) != null) {
                page.append(content);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return page.toString();
    }
}
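The concatenation step can be tried without touching the network. This small sketch applies the same city-suffix pattern to a hard-coded fragment and joins each suffix onto a province prefix. The class name CityUrlBuilder, the method buildCityUrls, and the HTML fragment (including shenzhen.php) are hypothetical and only mimic the shape the pattern expects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CityUrlBuilder {
    // Same expression as pattern_city_hz: captures the href value of each city link.
    static final Pattern CITY_SUFFIX =
            Pattern.compile("<LI><A href=\"(.*?)\" target=_blank>");

    // Joins the province URL prefix with every city suffix found in the province page.
    static List<String> buildCityUrls(String provinceUrl, String provinceHtml) {
        List<String> urls = new ArrayList<>();
        Matcher m = CITY_SUFFIX.matcher(provinceHtml);
        while (m.find()) {
            urls.add(provinceUrl + m.group(1)); // prefix + captured suffix
        }
        return urls;
    }

    public static void main(String[] args) {
        String provinceUrl = "http://www.hiphop8.com/city/guangdong/";
        // Hypothetical page fragment; the real markup may differ.
        String html = "<LI><A href=\"guangzhou.php\" target=_blank>...</A></LI>"
                + "<LI><A href=\"shenzhen.php\" target=_blank>...</A></LI>";
        for (String url : buildCityUrls(provinceUrl, html)) {
            System.out.println(url);
        }
    }
}
```

Keeping the parsing in a method that takes a string makes it easy to check the pattern against saved page source before running the full crawl.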
