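Author: yingwan (yingwan)
Board: RegExp
Title: [Question] Extracting HTML tags
Time: Sat Nov 8 07:39:43 2008
I want to pull the opening tags (<...>) and the closing tags (</...>) out of a web page as two separate lists.
The source is: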
<HTML >
<HEAD><TITLE> Hello World </TITLE></HEAD >
<BODY>
<H1>Greetings</H1>
<a href="index.html"
targe=_self > Homepage </a ><p>
<strong >Tat Tval Asi</strong>
</BODY>
</HTML>
After extraction, the output should look like this:
These are the opening tags:
<HTML>
<HEAD>
<TITLE>
<BODY>
<H1>
<a href="index.html" targe=_self>
<p>
<strong>
These are the closing tags:
</TITLE>
</HEAD>
</H1>
</a>
</strong>
</BODY>
</HTML>
This is what I wrote in Perl:
open(IN, $file) || die "can't read $file";
@file = <IN>;    # read every line of the file into an array

print "These are the opening tags:\n";
foreach $line (@file){
    find_opening_tags($line);
}
print "\n";
print "These are the closing tags:\n";
foreach $line (@file){
    find_closing_tags($line);
}
close IN;
# end of main

#-------------------
# subroutines
#-------------------

# print the first match of "<", a non-slash character, ".*", ">" on the line
sub find_opening_tags {
    my $line = $_[0];
    if ($line=~ /(\<[^\/].*\>)/){
        print "$1\n";
    }
}

# print the first match of "</", ".*", ">" on the line
sub find_closing_tags {
    my $line = $_[0];
    if ($line =~ /(\<\/.*\>)/) {
        print "$1\n";
    }
}
The result is:
These are the opening tags:
<HTML >
<HEAD>
<BODY>
<H1>
<p>
<strong >
These are the closing tags:
</TITLE></HEAD >
</H1>
</a ><p>
</strong>
</BODY>
</HTML>
I'd appreciate any pointers from the experts. Thanks.
--
※ Origin: 批踢踢实业坊 (ptt.cc)
◆ From: 149.159.132.73
1F:→ supertitler: Use *? (non-greedy) so the match doesn't swallow the rest of the string 11/08 13:37
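
To illustrate the suggestion in 1F, here is a minimal sketch, not a definitive fix. It makes a few assumptions that are not in the original post: the whole page is held in a single string (an inline copy of the sample, with the href written as index.html) so that the <a ...> tag split across two lines can still be matched; the /g and /s modifiers are added to pick up every tag rather than only the first one per line; and normalize_tag is a hypothetical helper that tidies the whitespace the way the desired output shows.

```perl
use strict;
use warnings;

# Sample input copied from the post (assumption: href is index.html).
my $html = <<'END_HTML';
<HTML >
<HEAD><TITLE> Hello World </TITLE></HEAD >
<BODY>
<H1>Greetings</H1>
<a href="index.html"
targe=_self > Homepage </a ><p>
<strong >Tat Tval Asi</strong>
</BODY>
</HTML>
END_HTML

# Greedy vs. non-greedy on one line:
# .* runs on to the last '>' of the line, .*? stops at the first one.
my $line = '<HEAD><TITLE> Hello World </TITLE></HEAD >';
my ($greedy) = $line =~ /(<\/.*>)/;
my ($lazy)   = $line =~ /(<\/.*?>)/;
print "greedy: $greedy\n";
print "lazy:   $lazy\n\n";

# Hypothetical helper: collapse whitespace runs (including newlines)
# into one space and drop any space left in front of the final '>'.
sub normalize_tag {
    my ($tag) = @_;
    $tag =~ s/\s+/ /g;
    $tag =~ s/\s+>$/>/;
    return $tag;
}

# In list context, /g returns every captured match in the string;
# /s lets '.' match the newline inside the two-line <a ...> tag.
my @opening = map { normalize_tag($_) } $html =~ /(<[^\/].*?>)/sg;
my @closing = map { normalize_tag($_) } $html =~ /(<\/.*?>)/sg;

print "These are the opening tags:\n";
print "$_\n" for @opening;
print "\nThese are the closing tags:\n";
print "$_\n" for @closing;
```

On the sample above, the greedy pattern captures </TITLE></HEAD > while the non-greedy one captures only </TITLE>, and the two lists should match the output asked for in the post. A negated character class such as <[^>]*> is a common alternative to .*?. Neither approach copes with a '>' inside an attribute value or an HTML comment; for real pages a parser such as HTML::Parser from CPAN is the safer choice.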