Wednesday, March 28, 2012

The best way to parse an html file?

Hi,

I have an HTML file that I want to parse with ASP.NET to retrieve the
value of a custom tag. Let's say that the average HTML file is about 30 KB.
Once the HTML file is loaded and converted into a single string, I currently
use two string.IndexOf calls to find the beginning and the end of the desired
tag, and then string.Substring to extract the data. I'm not using regular
expressions since I know exactly which tags to find.
My function goes like this:

private string ParseHtml(string html)
{
    html = html.Replace("\r\n","");
    int begin = html.IndexOf("%%StartGetHtml%%");
    int end = html.IndexOf("%%EndGetHtml%%",begin);
    int begin2, end2;
    string str = null;
    if (begin > 0 && end > 0)
    {
        // Gets the beginning of the tag
        begin2 = html.IndexOf("<",begin);
        // Gets the end of the tag
        end2 = html.IndexOf(">",end-3);
        if (begin2 < end2 && end2 < end)
        {
            // Gets the tag
            str = html.Substring(begin2,end-begin2);
        }
    }
    return str;
}

Is this the fastest way, or is there a better way to do this?

Thanks

Stephane

Stephane wrote:

> I have an HTML file that I want to parse with ASP.NET to retrieve the
> value of a custom tag. Let's say that the average HTML file is about 30 KB.
> Once the HTML file is loaded and converted into a single string, I currently
> use two string.IndexOf calls to find the beginning and the end of the desired
> tag, and then string.Substring to extract the data. I'm not using regular
> expressions since I know exactly which tags to find.
> My function goes like this:
> private string ParseHtml(string html)
> {
>     html = html.Replace("\r\n","");
>     int begin = html.IndexOf("%%StartGetHtml%%");
>     int end = html.IndexOf("%%EndGetHtml%%",begin);
>     int begin2, end2;
>     string str = null;
>     if (begin > 0 && end > 0)
>     {
>         // Gets the beginning of the tag
>         begin2 = html.IndexOf("<",begin);
>         // Gets the end of the tag
>         end2 = html.IndexOf(">",end-3);
>         if (begin2 < end2 && end2 < end)
>         {
>             // Gets the tag
>             str = html.Substring(begin2,end-begin2);
>         }
>     }
>     return str;
> }
> Is this the fastest way, or is there a better way to do this?

If those string processing attempts suffice for you, then use them, but in
general, if you want to parse HTML, you might want to check out SGMLReader; see
http://www.gotdotnet.com/community/...uery=sgmlreader

--

Martin Honnen
http://JavaScript.FAQTs.com/
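
For the plain string approach discussed in this thread, here is a minimal sketch of a marker-based extraction, offered as one possible variant rather than the poster's actual method. The class and method names (MarkerExtractor, ExtractBetweenMarkers) are hypothetical; only the two marker strings come from the original post. It assumes that everything between the two markers is wanted, so it does not reproduce the extra "<" / ">" trimming of the original function. It also treats a match at index 0 as valid, since IndexOf returns -1 (not 0) when nothing is found, and it skips the Replace call, which copies the whole string.

```csharp
using System;

public static class MarkerExtractor
{
    // Marker strings taken from the original post.
    private const string StartMarker = "%%StartGetHtml%%";
    private const string EndMarker = "%%EndGetHtml%%";

    // Returns the text between the two markers, or null if either is missing.
    public static string ExtractBetweenMarkers(string html)
    {
        if (html == null) return null;

        // Ordinal comparison matches the markers character by character,
        // without culture-sensitive rules.
        int start = html.IndexOf(StartMarker, StringComparison.Ordinal);
        if (start < 0) return null; // -1 means "not found"; 0 is a valid position

        // Search for the end marker only after the start marker.
        int contentStart = start + StartMarker.Length;
        int end = html.IndexOf(EndMarker, contentStart, StringComparison.Ordinal);
        if (end < 0) return null;

        return html.Substring(contentStart, end - contentStart);
    }
}
```

Whether this is measurably faster than the original on a file of about 30 KB would need to be timed; the main point of the sketch is the explicit handling of the "not found" cases.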
