Parsing HTML in C#

wsql


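When building a search engine, we need to index the HTML content of web pages, which inevitably means parsing the HTML: splitting it into nodes and extracting the content between them. This post introduces two ways to parse HTML in C#.
Method 1:
Use System.Net.WebClient to download the web page into a local file or a string, then analyze it with regular expressions. This approach suits applications such as web crawlers that need to analyze many pages.
It is probably the most direct approach and the first one most people think of.
Here is an example taken from the web that extracts the href attributes from a page: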

using System;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

namespace HttpGet
{
    class Class1
    {
        [STAThread]
        static void Main(string[] args)
        {
            // Download the page as raw bytes and decode it as UTF-8.
            System.Net.WebClient client = new WebClient();
            byte[] page = client.DownloadData("http://www.google.com");
            string content = System.Text.Encoding.UTF8.GetString(page);

            // Matches quoted href attributes: an optional http://, ./ or / prefix,
            // a path made of word characters, and an optional query string.
            string regex = "href=[\\\"\\\'](http:\\/\\/|\\.\\/|\\/)?\\w+(\\.\\w+)*(\\/\\w+(\\.\\w+)?)*(\\/|\\?\\w*=\\w*(&\\w*=\\w*)*)?[\\\"\\\']";
            Regex re = new Regex(regex);
            MatchCollection matches = re.Matches(content);

            // Print every match, one per line.
            System.Collections.IEnumerator enu = matches.GetEnumerator();
            while (enu.MoveNext() && enu.Current != null)
            {
                Match match = (Match)(enu.Current);
                Console.Write(match.Value + "\r\n");
            }
        }
    }
}

Some crawlers parse HTML in a similar way.
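
As a smaller, offline illustration of the regex approach, the sketch below pulls href values out of an in-memory string with a named capture group, so that only the URL is returned rather than the whole attribute. The class name HrefExtractor, the method ExtractHrefs, and the sample HTML are assumptions made for this example and are not part of the original code. Like any regex-based extraction, it only handles quoted attribute values and will miss unquoted ones.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // Hypothetical helper: returns the value of every quoted href attribute in the HTML text.
    public static List<string> ExtractHrefs(string html)
    {
        // "q" captures the opening quote, "url" captures the value,
        // and \k<q> requires the same quote character to close it.
        var pattern = new Regex(
            "href\\s*=\\s*(?<q>[\"'])(?<url>.*?)\\k<q>",
            RegexOptions.IgnoreCase);

        var urls = new List<string>();
        foreach (Match m in pattern.Matches(html))
        {
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }

    public static void Main()
    {
        // Sample input is made up for the example; no network access is needed.
        string html = "<a href=\"http://example.com/a\">A</a> <a HREF='/b?x=1'>B</a>";
        foreach (string url in ExtractHrefs(html))
        {
            Console.WriteLine(url);
        }
    }
}
```

Returning the captured group instead of match.Value avoids having to strip the href= prefix and the quotes afterwards.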
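Method 2:

Use Winista.Htmlparser.Net to parse the HTML. It is an open-source HTML parser for the .NET platform. The source can be downloaded online and is easy to find with a Baidu search, so no link is given here. It also comes with English help documentation. If you cannot find it, leave your email address.
In my opinion it is a good solution for parsing HTML on .NET and covers most HTML parsing needs.
Here is an example I wrote myself: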

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using Winista.Text.HtmlParser;
using Winista.Text.HtmlParser.Lex;
using Winista.Text.HtmlParser.Util;
using Winista.Text.HtmlParser.Tags;
using Winista.Text.HtmlParser.Filters;

namespace HTMLParser
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
            AddUrl();
        }

        private void btnParser_Click(object sender, EventArgs e)
        {
            #region Fetch the page HTML
            try
            {
                txtHtmlWhole.Text = "";
                string url = CBUrl.SelectedItem.ToString().Trim();
                System.Net.WebClient aWebClient = new System.Net.WebClient();
                aWebClient.Encoding = System.Text.Encoding.Default;
                string html = aWebClient.DownloadString(url);
                txtHtmlWhole.Text = html;
            }
            catch (Exception ex)
            {
                MessageBox.Show(ex.Message);
            }
            #endregion

            #region Parse the HTML into nodes
            Lexer lexer = new Lexer(this.txtHtmlWhole.Text);
            Parser parser = new Parser(lexer);
            NodeList htmlNodes = parser.Parse(null);
            this.treeView1.Nodes.Clear();
            this.treeView1.Nodes.Add("root");
            TreeNode treeRoot = this.treeView1.Nodes[0];
            for (int i = 0; i < htmlNodes.Count; i++)
            {
                this.RecursionHtmlNode(treeRoot, htmlNodes[i], false);
            }
            #endregion
        }

        private void RecursionHtmlNode(TreeNode treeNode, INode htmlNode, bool siblingRequired)
        {
            if (htmlNode == null || treeNode == null) return;

            TreeNode current = treeNode;
            TreeNode content;
            // current node
            if (htmlNode is ITag)
            {
                ITag tag = (htmlNode as ITag);
                if (!tag.IsEndTag())
                {
                    string nodeString = tag.TagName;
                    if (tag.Attributes != null && tag.Attributes.Count > 0)
                    {
                        if (tag.Attributes["ID"] != null)
                        {
                            nodeString = nodeString + "{id=\"" + tag.Attributes["ID"].ToString() + "\"}";
                        }
                        if (tag.Attributes["HREF"] != null)
                        {
                            nodeString = nodeString + "{href=\"" + tag.Attributes["HREF"].ToString() + "\"}";
                        }
                    }

                    current = new TreeNode(nodeString);
                    treeNode.Nodes.Add(current);
                }
            }

            // get the content between nodes
            if (htmlNode.Children != null && htmlNode.Children.Count > 0)
            {
                this.RecursionHtmlNode(current, htmlNode.FirstChild, true);
                content = new TreeNode(htmlNode.FirstChild.GetText());
                treeNode.Nodes.Add(content);
            }

            // the sibling nodes
            if (siblingRequired)
            {
                INode sibling = htmlNode.NextSibling;
                while (sibling != null)
                {
                    this.RecursionHtmlNode(treeNode, sibling, false);
                    sibling = sibling.NextSibling;
                }
            }
        }

        private void AddUrl()
        {
            CBUrl.Items.Add("http://www.hao123.com");
            CBUrl.Items.Add("http://www.sina.com");
            CBUrl.Items.Add("http://www.heuet.edu.cn");
        }
    }
}

Result:
It is easy to implement. With the Winista.Htmlparser source at hand, you can quickly get the result you want.
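
The recursion in RecursionHtmlNode follows a first-child / next-sibling pattern: a node is visited, its first child is visited with siblingRequired set to true, and that child then walks its own siblings with the flag set to false. The sketch below reproduces only that traversal order so it can run without the library. SketchNode and TraversalSketch are hypothetical names introduced for illustration; they are not Winista types, and the tree built in Main is made up.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a parsed HTML node; it is not a Winista type.
public class SketchNode
{
    public string Name;
    public SketchNode FirstChild;
    public SketchNode NextSibling;

    public SketchNode(string name) { Name = name; }
}

public static class TraversalSketch
{
    // Records the node, then its subtree; when siblingRequired is true,
    // it also walks every following sibling at the same depth.
    public static void Walk(SketchNode node, bool siblingRequired, int depth, List<string> output)
    {
        if (node == null) return;

        output.Add(new string(' ', depth * 2) + node.Name);

        // Descend: the first child is asked to bring its siblings along.
        Walk(node.FirstChild, true, depth + 1, output);

        if (siblingRequired)
        {
            // Siblings are visited with the flag off, so each node is visited exactly once.
            for (SketchNode s = node.NextSibling; s != null; s = s.NextSibling)
            {
                Walk(s, false, depth, output);
            }
        }
    }

    public static void Main()
    {
        // Tree shape: html has children head and body; body has children p and a.
        var html = new SketchNode("html");
        var head = new SketchNode("head");
        var body = new SketchNode("body");
        var p = new SketchNode("p");
        var a = new SketchNode("a");
        html.FirstChild = head;
        head.NextSibling = body;
        body.FirstChild = p;
        p.NextSibling = a;

        var lines = new List<string>();
        Walk(html, false, 0, lines);
        foreach (string line in lines) Console.WriteLine(line);
    }
}
```

Passing false when walking siblings is what keeps the traversal from visiting the same sibling chain more than once.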

Summary:
This post briefly introduced two ways to parse HTML. If you know of other good approaches, please share them.