html2regexp - Regular Expression Generator for HTML Element
html2regexp
html2regexp is a Ruby program
that generates regular expressions for extracting HTML elements.
An example
The input to html2regexp is a UTF-8 HTML file in which
each target HTML element is marked with an "h2r" attribute placed as its last attribute.
<ul>
<li><a href="hoge" class="h" h2r>hoge</a></li>
<li><a href="huga" class="h" h2r>huga</a></li>
</ul>
<div>
<a href="f">f</a>
</div>
From this input, html2regexp generates the following regular expression.
(<(\w*?)\s*([^>]*?" class="h"[^>]*?)>(.*?)<\/\2>)
To use this regular expression, the multiline, ignore-case and UTF-8 options must be specified.
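As a rough sketch of how the generated expression might be applied in Ruby, the snippet below runs it over the sample page with the h2r markers removed. The variable names and the use of String#scan are illustrative assumptions and not part of html2regexp; the m, i and u flags correspond to the multiline, ignore-case and UTF-8 options.

```ruby
# Sample page from above without the h2r markers (illustrative input).
html = <<~HTML
  <ul>
  <li><a href="hoge" class="h">hoge</a></li>
  <li><a href="huga" class="h">huga</a></li>
  </ul>
  <div>
  <a href="f">f</a>
  </div>
HTML

# The generated expression, with the multiline (m), ignore-case (i)
# and UTF-8 (u) options set.
pattern = /(<(\w*?)\s*([^>]*?" class="h"[^>]*?)>(.*?)<\/\2>)/miu

# Each match yields four groups: whole element, tag name, attributes, content.
elements = html.scan(pattern).map { |whole, _tag, _attrs, _content| whole }

puts elements
```

Only the two anchors with class="h" are matched; the anchor inside the div is left out.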
Applications
- HTML scraping
- Functional tests for web applications
Limitations
- Target HTML elements must have close tags.
- Target HTML elements must not contain another element with the same tag.
For example,
html2regexp can extract
<div class="2">hoge</div>
from <body><div class="1"><div class="2">hoge</div></div></body>
but cannot extract
<div class="1"><div class="2">hoge</div></div>
- html2regexp generates rough regular expressions that cannot handle attributes containing '<' or '>'.
- html2regexp cannot always produce an appropriate regular expression.
In such cases, html2regexp tries to produce the best regular expression it can.
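The nesting and attribute limitations above can be illustrated with a short Ruby sketch. The patterns below are hand-written in the style of the generated expression shown earlier; they are assumptions for illustration only, not actual html2regexp output.

```ruby
# Hand-written patterns modelled on html2regexp's output style (assumed, not generated).
inner = /(<(\w*?)\s*([^>]*?class="2"[^>]*?)>(.*?)<\/\2>)/miu
outer = /(<(\w*?)\s*([^>]*?class="1"[^>]*?)>(.*?)<\/\2>)/miu

html = '<body><div class="1"><div class="2">hoge</div></div></body>'

# The inner div contains no other div, so it is matched completely.
inner_match = html[inner, 1]

# The outer div contains another div: the lazy (.*?) stops at the first
# </div>, so the match is cut short and the final </div> is lost.
outer_match = html[outer, 1]

# An attribute value containing '>' defeats [^>]*?, so nothing matches.
tricky = '<a title="a>b" class="h">x</a>'
tricky_match = tricky[/(<(\w*?)\s*([^>]*?" class="h"[^>]*?)>(.*?)<\/\2>)/miu, 1]

p inner_match
p outer_match
p tricky_match
```

The inner match is the complete element, the outer match ends at the first closing div tag, and the last lookup returns nil.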
Demo is here.
Requirements
html2regexp requires Ruby.
libstree
libstree-0.4.2-y is an extension of the suffix tree library libstree-0.4.2.
Caution: libstree-0.4.2-y is not compatible with the original libstree, and
installing libstree-0.4.2-y will overwrite the original version.
$ tar xvzf libstree-0.4.2-y.tar.gz
$ cd libstree-0.4.2-y
$ ./configure
$ make
$ sudo make install
liblaika
liblaika is a C++ library for extracting discriminating substrings from positive and negative strings.
$ tar xvzf liblaika-0.0.1.tar.gz
$ cd liblaika-0.0.1
$ ./configure
$ make
$ sudo make install
laika-ruby
laika-ruby is a Ruby binding for liblaika, built with SWIG.
$ cd liblaika-0.0.1
$ cd ruby
$ ruby extconf.rb
$ make
$ sudo make install
html2regexp
$ tar xvzf html2regexp-0.0.1.tar.gz
$ cd html2regexp-0.0.1
$ sudo ruby setup.rb
llamerada at gmail dot com
Last modified: Thu Oct 19 22:38:32 JST 2006