
HtmlRipper

HtmlRipper is a Java package that extracts dynamic data from Web pages using pre-defined rule sets. It allows multiple data sets to be combined into a single dynamic Web page, and is well suited to building data mining, page analysis, Web page filtering, and article clipping software. The package includes a sample rules-enabled browser and a rules editor.

Tags: Implementation

Recent releases

•  27 Mar 2007 01:01

   Release Notes: Minor bugfixes to the example browser and ripping classes. The documentation has been updated. Shell scripts have been added to execute the various example programs in the package.

•  22 Mar 2007 21:01

   Release Notes: New division (packaging) of classes within the jar file. Minor bugfixes. Minor modifications to the interface with the updated Fishcroft Java Utilities.

•  11 Apr 2005 04:06

   Release Notes: The REB browser, as part of the rule set creation process, can now transfer your rule sets to a central rule set repository located on the Web for use by any standard or rules-enabled browser. Minor bugfixes were made. Servlet classes for the creation of rules repositories are now included.

•  30 Mar 2005 05:40

   Release Notes: Minor bugs in the rule creation routines of the REB browser were corrected.

•  27 Mar 2005 08:03

   Release Notes: A sample rules-enabled Web browser with rule creation and editing features has been added. It supports GET/POST and helper applications. The rules manipulation classes have been rewritten.

            Recent comments

            15 Jan 2006 21:02 rgmccue

            Re: Why not just use XSLT?

> Hi, bear with me if I sound like a newbie. I am confused about the
> usage. I downloaded the ripper zip file and cannot find the
> JRipViewer you mentioned. A more basic question: with a rule file
> created (e.g. one of your existing ones, such as cnn.hrr), how could
> one use it to extract?
>
> The reb program does NOT seem to have any functionality that allows
> one to use the .hrr rule file to extract.

The JRipViewer was absorbed by, and replaced with, the REB (Rules Enabled Browser) sample program.


To use an HtmlRipper rules file (.hrr) in the REB example program, just enter the address of the particular rules file in the REB address line. The REB program will determine that the file to display is an .hrr file rather than an HTML file, and will process it accordingly.

For example, entering the address:

http://www.htmlripper.org/rules/cert.hrr

will cause the "Advisory & Incident Notes" to be extracted from the CERT home page and displayed in the REB program.
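A minimal sketch of the kind of address-line dispatch described above; the check and the println calls are illustrative stand-ins for REB's internal handling, not its actual code:

    public class DispatchSketch {
        public static void main(String[] args) {
            String address = "http://www.htmlripper.org/rules/cert.hrr";
            if (address.endsWith(".hrr")) {
                // Treat the target as an HtmlRipper rules file.
                System.out.println("Apply rules from: " + address);
            } else {
                // Treat the target as an ordinary Web page.
                System.out.println("Render HTML from: " + address);
            }
        }
    }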

Please note that not all of the .hrr rules files are up to date; they may not reflect the current structure of the target Web pages.

To have the .hrr file displayed by your own browser, use the helper program:

            java ca.fishcroft.htmlRipper.HrrHelper [--help] [file_name]
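For example, to display a copy of the cert.hrr rules file mentioned above (assuming it has been saved to the current directory; the file name is illustrative):

java ca.fishcroft.htmlRipper.HrrHelper cert.hrr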

            Regards,
            Robert

            15 Jan 2006 01:19 xjz27

            Re: Why not just use XSLT?
Hi, bear with me if I sound like a newbie. I am confused about the usage. I downloaded the ripper zip file and cannot find the JRipViewer you mentioned. A more basic question: with a rule file created (e.g. one of your existing ones, such as cnn.hrr), how could one use it to extract?

The reb program does NOT seem to have any functionality that allows one to use the .hrr rule file to extract.


            17 May 2001 21:20 rgmccue

            Re: Why not just use XSLT?

            > XSLT does this, has many
            > implementations, and from
            > the looks of things is easier to use.
            > What does
            > this offer over it? Just curious...

            In answer to your query:

XSLT is a language for creating so-called "stylesheets" containing transformation rules for XML documents. An XSLT processor allows you to apply a stylesheet to an XML document. The transformed output document can be another XML document, an HTML document, or any text document.
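As a concrete illustration of that workflow, here is a minimal Java program that applies a stylesheet with the standard javax.xml.transform API (the file names are placeholders):

    import javax.xml.transform.*;
    import javax.xml.transform.stream.*;

    public class ApplyXslt {
        public static void main(String[] args) throws TransformerException {
            // Compile the stylesheet, apply it to the XML document,
            // and write the transformed result to standard output.
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource("rules.xsl"));
            t.transform(new StreamSource("input.xml"),
                        new StreamResult(System.out));
        }
    }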

Our HtmlRipper, on the other hand, is designed to extract structure (data) from an HTML document regardless of how poorly it is formatted. It will even function on documents that would make an XML parser that can handle HTML "give up and cry". This data is then made available to other applications.

As to ease of use, I was able to create a rules file that would rip the top five entries out of the Freshmeat home page's new products list in less than five minutes. Most of that time was spent just studying how your page was put together. I took a copy of the home page and entered the following tags:

            <htmlripper type="source" name="FreshMeat Home Page" source_type="url" source="http://freshmeat.html">

            The above was inserted after the <BODY> tag to describe the source document for the rip.

            Then the two tags:

            <htmlripper type="element" name="top five">

            </htmlripper>

            were inserted to bracket the structure of the top five stories.
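To make the bracketing idea concrete, here is a toy, self-contained sketch (not HtmlRipper's actual API) that pulls out the region of a page enclosed by those two marker tags:

    public class RipSketch {
        public static void main(String[] args) {
            String page = "<html><body>header..."
                    + "<htmlripper type=\"element\" name=\"top five\">"
                    + "<ul><li>story 1</li><li>story 2</li></ul>"
                    + "</htmlripper>"
                    + "...footer</body></html>";
            String open = "<htmlripper type=\"element\" name=\"top five\">";
            String close = "</htmlripper>";
            int start = page.indexOf(open) + open.length();
            int end = page.indexOf(close, start);
            // Everything between the two markers is the "ripped" structure.
            System.out.println(page.substring(start, end));
        }
    }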

Then, using our sample JRipViewer program (supplied in the zip file), this file was passed to the rules generator, which created an XML file defining the area of the HTML document to rip. The XML file was then passed to the ripper, and the data was ripped out of your home page, without any of the surrounding page entries, and displayed in the JRipViewer.

            All in less than five minutes.

            So now at any time I can grab the top five entries from the home page without having the other stuff cluttering up my screen. I already use it on several news sites on the net.

Not only that, but it is small and fast.

            Hope this helps.

            Regards,
            Robert

            17 May 2001 19:18 mnot

            Why not just use XSLT?
            XSLT does this, has many implementations, and from
            the looks of things is easier to use. What does
            this offer over it? Just curious...
