HTMLParser1.6 源代码阅读

看到博客园的大牛们都喜欢发系列的文章，我也发一篇。不过我不打算写什么spring hibernate配置什么的，我只想写写自己阅读别人代码的一些笔记。欢迎大家拍砖。

从开始进行阅读，第一个包是：org.htmlparser.里面的类包括

Attribute.java

Node.java

NodeFactory.java

NodeFilter.java

Parser.java

PrototypicalNodeFactory.java

Remark.java

Tag.java

Text.java

可以看出都是针对基本数据结构的类。

一个一个进行分析，Attribute.java是记录网页元素的属性的，

                                 <  a   href  ="EditPosts.aspx"   id  ="TabPosts"  > 随笔 </  a  >

这个是博客园的，那Attribute应当可以记录 href=“MySubscibe.aspx"这样的元素。看他的构造方法：

 public  Attribute (String name, String assignment, String value,  char   quote)
    {
        setName (name);
        setAssignment (assignment);
          if  (0 ==  quote)
            setRawValue (value);
          else  
        {
            setValue (value);
            setQuote (quote);
        }
    }

看见分为是否含有引号的情况。setName，和setValue没有什么可以看的，如果不含引号的情况，value怎么设置

 /**  
     * Set the value of the attribute and the quote character.
     * If the value is pure whitespace, assign it 'as is' and reset the
     * quote character. If not, check for leading and trailing double or
     * single quotes, and if found use this as the quote character and
     * the inner contents of <code>value</code> as the real value.
     * Otherwise, examine the string to determine if quotes are needed
     * and an appropriate quote character if so. This may involve changing
     * double quotes within the string to character references.
     *   @param   value The new value.
     *   @see   #getRawValue
     *   @see   #getRawValue(StringBuffer)
       */ 
     public   void   setRawValue (String value)
    {
          char   ch;
          boolean   needed;
          boolean   singleq;
          boolean   doubleq;
        String ref;
        StringBuffer buffer;
          char   quote;

        quote  = 0 ;
          if  (( null  != value) && (0 !=  value.trim ().length ()))
        {
              if  (value.startsWith ("'") && value.endsWith ("'" )
                 && (2 <=  value.length ()))
            {
                quote  = '\'' ;
                value  = value.substring (1, value.length () - 1 );
            }
              else   if  (value.startsWith ("\"") && value.endsWith ("\"" )
                 && (2 <=  value.length ()))
            {
                quote  = '"' ;
                value  = value.substring (1, value.length () - 1 );
            }
              else  
            {
                  //   first determine if there's whitespace in the value
                  //   and while we're at it find a suitable quote character 
                needed =  false  ;
                singleq  =  true  ;
                doubleq  =  true  ;
                  for  ( int  i = 0; i < value.length (); i++ )
                {
                    ch  =  value.charAt (i);
                      if  ('\'' ==  ch)
                    {
                        singleq   =  false  ;
                        needed  =  true  ;
                    }
                      else   if  ('"' ==  ch)
                    {
                        doubleq  =  false  ;
                        needed  =  true  ;
                    }
                      else   if  (!('-' == ch) && !('.' == ch) && !('_' ==  ch)
                        && !(':' == ch) && ! Character.isLetterOrDigit (ch))
                    {
                        needed  =  true  ;
                    }
                }

                  //   now apply quoting 
                 if   (needed)
                {
                      if   (doubleq)
                        quote  = '"' ;
                      else   if   (singleq)
                        quote  = '\'' ;
                      else  
                    {
                          //   uh-oh, we need to convert some quotes into character
                          //   references, so convert all double quotes into &#34; 
                        quote = '"' ;
                        ref  = "&quot;";  //   Translate.encode (quote);
                          //   JDK 1.4: value = value.replaceAll ("\"", ref); 
                        buffer =  new   StringBuffer (
                                value.length()  * (ref.length () - 1 ));
                          for  ( int  i = 0; i < value.length (); i++ )
                        {
                            ch  =  value.charAt (i);
                              if  (quote ==  ch)
                                buffer.append (ref);
                              else  
                                buffer.append (ch);
                        }
                        value  =  buffer.toString ();
                    }
                }
            }
        }
        setValue (value);
        setQuote (quote);
    }

如果没有设置分割字符的话，需要进行判断，首先判断value的字符是哪种？单引号，双引号，还是其他。如果是单引号开头，单引号结尾，那么分割是单引号。如果是双引号，就是双引号。如果不是这样，可能需要修复，如果里面有单引号，那么我们用双引号进行包装，里面含有双引号，我们用单引号进行包装。或者其中含有一些特别的字符（不是数字，字符，-，_,:），我们需要用引号引用起来。这样属性就可以保存下来。

属性暂时分析到这里，有兴趣的可以自己阅读剩下的部分。

Node.java

其实htmlParser就是一个词法语法分析器，学过自动机的同学应当对此很熟悉，（ps，本人自动机挂掉了。。）而HTML的元素有三种类型，text，Tag，Remark（remark是不是这种,这个不确定）。我们进行语法解析的时候肯定要返回相应的node，那这个node应当设计成抽象类或者接口，的确，也是这样设计的。看代码

 package   org.htmlparser;

  import   org.htmlparser.lexer.Page;
  import   org.htmlparser.util.NodeList;
  import   org.htmlparser.util.ParserException;
  import   org.htmlparser.visitors.NodeVisitor;
 
 public   interface   Node
      extends  
        Cloneable
{
      
    String toPlainTextString ();    
    String toHtml ();  
    String toHtml (  boolean   verbatim);  
    String toString ();     
     void   collectInto (NodeList list, NodeFilter filter);    
     int   getStartPosition ();     
     void  setStartPosition ( int   position)；    
     int   getEndPosition ();    
     void  setEndPosition ( int   position);     
    Page getPage (); 
     void   setPage (Page page);  
     void   accept (NodeVisitor visitor);     
    Node getParent (); 
     void   setParent (Node node);  
    NodeList getChildren ();   
     void   setChildren (NodeList children);  
    Node getFirstChild ();  
    Node getLastChild ();   
    Node getPreviousSibling ();        
    Node getNextSibling ();       
    String getText ();   
     void   setText (String text);  
     void   doSemanticAction ()
          throws  
            ParserException;    
    Object clone ()
          throws  
            CloneNotSupportedException;
}

这里光看这个也没办法领略精髓，大致就是一个开始克隆，可以转换成html元素，转换成string的类型，并且可以迭代的一系列方法。不过下一步我们应当从lexer中寻找相关答案，保持我们的阅读顺序，我们继续进行下一个java分析

NodeFactory

先看代码

 package   org.htmlparser;

  import   java.util.Vector;

  import   org.htmlparser.lexer.Page;
  import   org.htmlparser.util.ParserException;
 
 public   interface   NodeFactory
{
     
    Text createStringNode (Page page,   int  start,  int   end)
          throws  
            ParserException;
   
    Remark createRemarkNode (Page page,   int  start,  int   end)
          throws  
            ParserException;
     
    Tag createTagNode (Page page,   int  start,  int   end, Vector attributes)
          throws  
            ParserException;
}

就是创建上面三种HTML页面的基本元素。

NodeFilter

 package   org.htmlparser;

  import   java.io.Serializable;

  /**  
 * Implement this interface to select particular nodes.
   */ 
 public   interface   NodeFilter
      extends  
        Serializable,
        Cloneable
{
 
     boolean   accept (Node node);
}

进行一个Node的合理性验证

Parser

终于看到重点了

 package   org.htmlparser;

  import   java.io.Serializable;
  import   java.net.HttpURLConnection;
  import   java.net.URLConnection;

  import   org.htmlparser.filters.TagNameFilter;
  import   org.htmlparser.filters.NodeClassFilter;
  import   org.htmlparser.http.ConnectionManager;
  import   org.htmlparser.http.ConnectionMonitor;
  import   org.htmlparser.http.HttpHeader;
  import   org.htmlparser.lexer.Lexer;
  import   org.htmlparser.lexer.Page;
  import   org.htmlparser.util.DefaultParserFeedback;
  import   org.htmlparser.util.IteratorImpl;
  import   org.htmlparser.util.NodeIterator;
  import   org.htmlparser.util.NodeList;
  import   org.htmlparser.util.ParserException;
  import   org.htmlparser.util.ParserFeedback;
  import   org.htmlparser.visitors.NodeVisitor;

 
 public   class   Parser
      implements  
        Serializable,
        ConnectionMonitor
{
    
     public   static   final   double  
    VERSION_NUMBER  = 1.6 
    ;

    
     public   static   final   String
    VERSION_TYPE  = "Release Build" 
    ;

    
     public   static   final   String
    VERSION_DATE  = "Jun 10, 2006" 
    ;

    
     public   static   final  String VERSION_STRING =
            "" +  VERSION_NUMBER
             + " (" + VERSION_TYPE + " " + VERSION_DATE + ")" ;

   
     protected   ParserFeedback mFeedback;
 
     protected   Lexer mLexer;
 
     public   static   final  ParserFeedback DEVNULL =
         new   DefaultParserFeedback (DefaultParserFeedback.QUIET);
 
     public   static   final  ParserFeedback STDOUT =  new   DefaultParserFeedback ();

      static  
    {
        getConnectionManager ().getDefaultRequestProperties ().put (
             "User-Agent", "HTMLParser/" +  getVersionNumber ());
    
    }

 
     public   static   String getVersion ()
    {
          return   (VERSION_STRING);
    }
 
     public   static   double   getVersionNumber ()
    {
          return   (VERSION_NUMBER);
    }
 
     public   static   ConnectionManager getConnectionManager ()
    {
          return   (Page.getConnectionManager ());
    }
 
     public   static   void   setConnectionManager (ConnectionManager manager)
    {
        Page.setConnectionManager (manager);
    }
 
     public   static   Parser createParser (String html, String charset)
    {
        Parser ret;

          if  ( null  ==  html)
              throw   new  IllegalArgumentException ("html cannot be null" );
        ret  =  new  Parser ( new  Lexer ( new   Page (html, charset)));

          return   (ret);
    }

 
     public   Parser ()
    {
          this  ( new  Lexer ( new  Page ("" )), DEVNULL);
    }
 
     public   Parser (Lexer lexer, ParserFeedback fb)
    {
        setFeedback (fb);
        setLexer (lexer);
        setNodeFactory (  new   PrototypicalNodeFactory ());
    }
 
     public   Parser (URLConnection connection, ParserFeedback fb)
          throws  
            ParserException
    {
          this  ( new   Lexer (connection), fb);
    }
 
     public   Parser (String resource, ParserFeedback feedback)
          throws  
            ParserException
    {
        setFeedback (feedback);
        setResource (resource);
        setNodeFactory (  new   PrototypicalNodeFactory ());
    }
 
     public  Parser (String resource)  throws   ParserException
    {
          this   (resource, STDOUT);
    }
 
     public   Parser (Lexer lexer)
    {
          this   (lexer, STDOUT);
    }
 
     public  Parser (URLConnection connection)  throws   ParserException
    {
          this   (connection, STDOUT);
    }
 
     public   void   setResource (String resource)
          throws  
            ParserException
    {
          int   length;
          boolean   html;
          char   ch;

          if  ( null  ==  resource)
              throw   new  IllegalArgumentException ("resource cannot be null" );
        length  =  resource.length ();
        html  =  false  ;
          for  ( int  i = 0; i < length; i++ )
        {
            ch  =  resource.charAt (i);
              if  (! Character.isWhitespace (ch))
            {
                  if  ('<' ==  ch)
                    html  =  true  ;
                  break  ;
            }
        }
          if   (html)
            setLexer (  new  Lexer ( new   Page (resource)));
          else  
            setLexer (  new   Lexer (getConnectionManager ().openConnection (resource)));
    }
 
     public   void   setConnection (URLConnection connection)
          throws  
            ParserException
    {
          if  ( null  ==  connection)
              throw   new  IllegalArgumentException ("connection cannot be null" );
        setLexer (  new   Lexer (connection));
    }
 
     public   URLConnection getConnection ()
    {
          return   (getLexer ().getPage ().getConnection ());
    }

 
     public   void   setURL (String url)
          throws  
            ParserException
    {
          if  (( null  != url) && !"" .equals (url))
            setConnection (getConnectionManager ().openConnection (url));
    }
 
     public   String getURL ()
    {
          return   (getLexer ().getPage ().getUrl ());
    }
 
     public   void   setEncoding (String encoding)
          throws  
            ParserException
    {
        getLexer ().getPage ().setEncoding (encoding);
    }
 
     public   String getEncoding ()
    {
          return   (getLexer ().getPage ().getEncoding ());
    }

 
     public   void   setLexer (Lexer lexer)
    {
        NodeFactory factory;
        String type;

          if  ( null  ==  lexer)
              throw   new  IllegalArgumentException ("lexer cannot be null" );
          //   move a node factory that's been set to the new lexer 
        factory =  null  ;
          if  ( null  !=  getLexer ())
            factory  =  getLexer ().getNodeFactory ();
          if  ( null  !=  factory)
            lexer.setNodeFactory (factory);
        mLexer  =  lexer;
          //   warn about content that's not likely text 
        type =  mLexer.getPage ().getContentType ();
          if  (type !=  null  && !type.startsWith ("text" ))
            getFeedback ().warning (
                 "URL "
                +  mLexer.getPage ().getUrl ()
                 + " does not contain text" );
    }
 
     public   Lexer getLexer ()
    {
          return   (mLexer);
    }
 
     public   NodeFactory getNodeFactory ()
    {
          return   (getLexer ().getNodeFactory ());
    }
 
     public   void   setNodeFactory (NodeFactory factory)
    {
          if  ( null  ==  factory)
              throw   new  IllegalArgumentException ("node factory cannot be null" );
        getLexer ().setNodeFactory (factory);
    }

  
     public   void   setFeedback (ParserFeedback fb)
    {
          if  ( null  ==  fb)
            mFeedback  =  DEVNULL;
          else  
            mFeedback  =  fb;
    }
 
     public   ParserFeedback getFeedback()
    {
          return   (mFeedback);
    }
 
     public   void   reset ()
    {
        getLexer ().reset ();
    }

  
     public  NodeIterator elements ()  throws   ParserException
    {
          return  ( new   IteratorImpl (getLexer (), getFeedback ()));
    }
    
     public  NodeList parse (NodeFilter filter)  throws   ParserException
    {
        NodeIterator e;
        Node node;
        NodeList ret;

        ret  =  new   NodeList ();
          for  (e =  elements (); e.hasMoreNodes (); )
        {
            node  =  e.nextNode ();
              if  ( null  !=  filter)
                node.collectInto (ret, filter);
              else  
                ret.add (node);
        }

          return   (ret);
    }
 
     public   void  visitAllNodesWith (NodeVisitor visitor)  throws   ParserException
    {
        Node node;
        visitor.beginParsing();
          for  (NodeIterator e =  elements(); e.hasMoreNodes(); )
        {
            node  =  e.nextNode();
            node.accept(visitor);
        }
        visitor.finishedParsing();
    }
 
     public   void   setInputHTML (String inputHTML)
          throws  
            ParserException
    {
          if  ( null  ==  inputHTML)
              throw   new  IllegalArgumentException ("html cannot be null" );
          if  (!"" .equals (inputHTML))
            setLexer (  new  Lexer ( new   Page (inputHTML)));
    }
 
     public   NodeList extractAllNodesThatMatch (NodeFilter filter)
          throws  
            ParserException
    {
        NodeIterator e;
        NodeList ret;

        ret  =  new   NodeList ();
          for  (e =  elements (); e.hasMoreNodes (); )
            e.nextNode ().collectInto (ret, filter);

          return   (ret);
    }

 
     public   void   preConnect (HttpURLConnection connection)
          throws  
            ParserException
    {
        getFeedback ().info (HttpHeader.getRequestHeader (connection));
    }

 
     public   void   postConnect (HttpURLConnection connection)
          throws  
            ParserException
    {
        getFeedback ().info (HttpHeader.getResponseHeader (connection));
    }
 
     public   static   void   main (String [] args)
    {
        Parser parser;
        NodeFilter filter;

          if  (args.length < 1 || args[0].equals ("-help" ))
        {
            System.out.println ( "HTML Parser v" + getVersion () + "\n" );
            System.out.println ();
            System.out.println ( "Syntax : java -jar htmlparser.jar"
                    + " <file/page> [type]" );
            System.out.println ( "   <file/page> the URL or file to be parsed" );
            System.out.println ( "   type the node type, for example:" );
            System.out.println ( "     A - Show only the link tags" );
            System.out.println ( "     IMG - Show only the image tags" );
            System.out.println ( "     TITLE - Show only the title tag" );
            System.out.println ();
            System.out.println ( "Example : java -jar htmlparser.jar"
                    + " http://www.yahoo.com" );
            System.out.println ();
        }
          else 
             try  
            {
                parser  =  new   Parser ();
                  if  (1 <  args.length)
                    filter  =  new  TagNameFilter (args[1 ]);
                  else  
                {
                    filter  =  null  ;
                      //   for a simple dump, use more verbose settings 
                     parser.setFeedback (Parser.STDOUT);
                    getConnectionManager ().setMonitor (parser);
                }
                getConnectionManager ().setRedirectionProcessingEnabled (  true  );
                getConnectionManager ().setCookieProcessingEnabled (  true  );
                parser.setResource (args[ 0 ]);
                System.out.println (parser.parse (filter));
            }
              catch   (ParserException e)
            {
                e.printStackTrace ();
            }
    }
}

首先我们要看看构造方法：通常，我们使用HTMLParser的时候是这样new的 Parser a = Parser.createParser(content,"UTF-8");而这个方法就是通过new Parser(New Lexer(new Page(html,charset)));方法创建一个Parser，也就是核心是Page对象，Lexer对象和Parser对象。

看其他的构造方法，空构造方法，我们略去。

public Parser(Lexer lexer,ParserFeedback fb)，里面无非就是设置词法解析器，设置异常信息接收器，和设置一个nodeFactory。

public Parser (URLConnection connection, ParserFeedback fb)网页
public Parser (String resource, ParserFeedback feedback) 对内容进行解析

public Parser (String resource)

public Parser (Lexer lexer)

public Parser (URLConnection connection)

无非就是用默认的fd，之后会对FeedBack类进行相关说明。

到这里，parser完成的无非就是对给定的url，或者content，或者urlconnection对象，创建相应的词法分析器，feedback，和根据条件创建相应的Page对象。同时，我们看到Nodefactory实际上的设置是在Lexer中进行的。这里先不管。

这里有几个我们经常用的方法

public NodeList parse (NodeFilter filter) throws ParserException 根据过滤条件返回相应的nodelist

public void visitAllNodesWith (NodeVisitor visitor) throws ParserException 运用迭代器的方式进行遍历，这里涉及了NodeVisitor，IteratorImpl 这里还没有看。暂时不知道为什么这样设计。

public void setInputHTML (String inputHTML) 可以对html元素进行解析

public NodeList extractAllNodesThatMatch (NodeFilter filter) 根据过滤条件返回满足条件的NodeList

最后是测试方法。parser类读完了，其实parser类就是一个大的入口，将与页面相关的信息传递给parser类，parser调用其他类对其进行解析，返回nodelist。nodelist里面有若干node，而实例化的node里面存有我们相用的各种信息，到这里没有看到词法分析器的影子。。

PrototypicalNodeFactory

对Text，Remark，Tag进行标准化的约束。这个类暂时不做过多介绍，等用到的时候进行相关解释。

Remark

Text

Tag 这三个接口类，实现node，为具体的Node提供接口。

至此，org.htmlParser 阅读完毕，看见还是比较简单的。为什么我要阅读这个代码，一时项目中用到了这个HTMLParser，因为不放心，想知道内部是如何实现的。二是，我自动机以前挂掉过，我想自己实现一个Parser。三是，我的时间和充裕。

下一章节，我们将介绍另一个核心包：org.htmlparser.lexer 。同时，我们要对html页面标准进行学习，不然如何自己实现一个Parser呢？

欢迎批评指正。

吸取上次代码过多的教训，这次主要讲设计。

org.htmlparser.lexer 包，是主要的进行html解析的包。Page类可以根绝传入的urlConnection，text，stream等类型，构造相应的Page对象，Page对象中比较关键的是Source，url，PageIndex对象，他们的用途是：Source相当于一个Reader，但是与Reader不同的地方是，Source应当是线程同步的，字符可以改变，而且有可能多次请求。这边主要是要对解析的内容进行记录，记录位置等信息。Source是抽象类，具体的实现是StringSource和InputStreamSource。而inputStreamSource 的同步操作是依赖Stream类实现的同步方法。对于string类型的的source，直接构造相应buffered data即可。pageIndex对象是是对每行的第一个字符的位置进行记录。最后lexer是对page对象进行词法解析，我们看到有如下的几个方法，parseCDATA，scanJIS，parseString，parseTag，parseRemark，parseJsp，parsePI。这个可能要根据不同的页面进行不同的解析方法的编写。剩下的比较重要的包无非就是filter包。那这样我们对HTMLParser的构造就大致了解了，用图呈现如下：

HTMLParser各个包之间的关系图（只将比较重要的几个类，用流程的方式串联起来）

以上是我对htmlparser 包和类之间的分析，具体的htmlparser的包的组织结构如下图：

大部分的包都在关系图中显示了，剩下的是一些测试包，一些数据的组织包，并不是htmlparser的核心。按照这个思路，下一步可以自己做一个小型的parser解析器了。

总感觉少了点什么，缺又不清楚少了什么。

博客园精华文章收录

http://www.cnblogs.com/pick/ 架构篇

作者： Leo_wl
　　　　
出处： http://www.cnblogs.com/Leo_wl/
　　　　
本文版权归作者和博客园共有，欢迎转载，但未经作者同意必须保留此段声明，且在文章页面明显位置给出原文连接，否则保留追究法律责任的权利。
版权信息

查看更多关于HTMLParser1.6 源代码阅读的详细内容...

声明：本文来自网络，不代表【好得很程序员自学网】立场，转载请注明出处：http://haodehen.cn/did48502

更新时间：2022-09-24 阅读：38次