html2regexp - Regular Expression Generator for HTML Element


Introduction

html2regexp

html2regexp is a Ruby program that generates regular expressions for extracting HTML elements.

An example

The input to html2regexp is a UTF-8 HTML file in which each target HTML element is marked by adding an "h2r" attribute as its last attribute.

<ul>
   <li><a href="hoge" class="h" h2r>hoge</a></li>
   <li><a href="huga" class="h" h2r>huga</a></li>
</ul>
<div>
   <a href="f">f</a>
</div>
html2regexp will generate the following regular expression.
(<(\w*?)\s*([^>]*?" class="h"[^>]*?)>(.*?)<\/\2>)
To use this regular expression, the multiline, ignore-case, and UTF-8 options must be specified.
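
As a minimal sketch of how the generated expression might be applied in Ruby, the snippet below copies the expression from the example above and enables the three options with the `m`, `i` and `u` regexp flags. The sample page, the constant name `PATTERN` and the variable names are assumptions made for illustration; they are not part of html2regexp itself.

```ruby
# Regular expression copied from the example above, with the
# multiline (m), ignore-case (i) and UTF-8 (u) options enabled.
PATTERN = /(<(\w*?)\s*([^>]*?" class="h"[^>]*?)>(.*?)<\/\2>)/miu

# Hypothetical page to extract from; the "h2r" markers are only
# needed when generating the expression, not when applying it.
html = <<~HTML
  <ul>
     <li><a href="hoge" class="h">hoge</a></li>
     <li><a href="huga" class="h">huga</a></li>
  </ul>
  <div>
     <a href="f">f</a>
  </div>
HTML

# With capture groups, String#scan returns one array per match:
# [whole element, tag name, attribute string, inner content].
matches = html.scan(PATTERN)
texts = matches.map { |_element, _tag, _attrs, inner| inner }

puts texts
```

On this sample the two links with class "h" are matched and the link inside the div is skipped, so the script prints hoge and huga.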

Applications

Limitations

Demo

Demo is here.

Download

Install

Requirements

html2regexp requires Ruby.

libstree

libstree-0.4.2-y is an extended version of the suffix tree library libstree-0.4.2.
Caution: libstree-0.4.2-y is not compatible with the original libstree, and installing it will overwrite the original version.
$ tar xvzf libstree-0.4.2-y.tar.gz
$ cd libstree-0.4.2-y
$ ./configure
$ make
$ sudo make install

liblaika

liblaika is a C++ library for extracting discriminating substrings from positive and negative strings.
$ tar xvzf liblaika-0.0.1.tar.gz
$ cd liblaika-0.0.1
$ ./configure
$ make
$ sudo make install

laika-ruby

laika-ruby is a Ruby binding for liblaika, generated with SWIG.
$ cd liblaika-0.0.1
$ cd ruby
$ ruby extconf.rb
$ make
$ sudo make install

html2regexp

$ tar xvzf html2regexp-0.0.1.tar.gz
$ cd html2regexp-0.0.1
$ sudo ruby setup.rb

llamerada at gmail dot com Last modified: Thu Oct 19 22:38:32 JST 2006