Comments for HtmlRipper

15 Jan 2006 21:02 rgmccue

Re: Why not just use XSLT?

> Hi, bear with me if I sound like a
> newbie. I am confused about the usage. I
> downloaded the ripper zip file and I
> cannot find JRipViewer as you mentioned. A
> more basic question: once a rules file has
> been created (e.g. your existing
> cnn.hrr), how does one use it to
> extract?
>
> The REB program does NOT seem to have any
> functionality that lets one use a
> .hrr rules file to extract.

The JRipViewer was absorbed by, and replaced with, the REB (Rules Enabled Browser) sample program.

To use an HtmlRipper rules file (.hrr) in the REB example program, just enter the address of the rules file in the REB address line. REB will determine that the file to display is an .hrr file rather than an HTML file and will process it accordingly.

For example, entering the address:

http://www.htmlripper.org/rules/cert.hrr

will cause the "Advisory & Incident Notes" to be extracted from the CERT home page and displayed in the REB program.
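
How REB tells the two kinds of file apart is not spelled out here. As a minimal sketch, assuming the decision rests only on the address ending in .hrr (REB may well inspect the content instead), a check could look like the following. The class AddressKind and the method looksLikeRulesFile are made up for illustration and are not part of HtmlRipper or REB; the index.html address is only an example.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical helper for illustration; not part of HtmlRipper or REB. */
class AddressKind {

    /** Returns true when the path of the address ends in ".hrr" (case-insensitive). */
    static boolean looksLikeRulesFile(String address) {
        // Assumption: the file type is judged from the extension alone.
        String path = URI.create(address).getPath();
        return path != null && path.toLowerCase(Locale.ROOT).endsWith(".hrr");
    }

    public static void main(String[] args) {
        // A rules file address, as in the CERT example above.
        System.out.println(looksLikeRulesFile("http://www.htmlripper.org/rules/cert.hrr"));
        // An ordinary HTML page (example address only).
        System.out.println(looksLikeRulesFile("http://www.htmlripper.org/index.html"));
    }
}
```

The first call reports a rules file and the second does not, which is the branch a browser-like program would need before deciding whether to render a page or run a rip.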

Please note that not all of the rules files (.hrr) are up to date, so some may no longer reflect the current structure of their target web pages.

To have the .hrr file displayed by your own browser, use the helper program:

java ca.fishcroft.htmlRipper.HrrHelper [--help] [file_name]

Regards,
Robert

15 Jan 2006 01:19 xjz27

Re: Why not just use XSLT?
Hi, bear with me if I sound like a newbie. I am confused about the usage. I downloaded the ripper zip file and I cannot find JRipViewer as you mentioned. A more basic question: once a rules file has been created (e.g. your existing cnn.hrr), how does one use it to extract?

The REB program does NOT seem to have any functionality that lets one use a .hrr rules file to extract.

17 May 2001 21:20 rgmccue

Re: Why not just use XSLT?

> XSLT does this, has many implementations, and from
> the looks of things is easier to use. What does
> this offer over it? Just curious...

In answer to your query:

XSLT is a language for creating so-called "stylesheets" containing transformation rules for "XML documents". An XSLT processor allows you to apply the stylesheet to an XML document. The transformed output document can be another XML document, an HTML document or any text document.
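
To make that concrete, here is a small sketch of applying a stylesheet to an XML document with the XSLT processor that ships with Java (javax.xml.transform). The class name XsltDemo, the sample news document and the stylesheet are all invented for this illustration; the only assumption is a standard JDK.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Illustrative only: applies an XSLT stylesheet to a well-formed XML string. */
class XsltDemo {

    // Made-up stylesheet: emits each item's title as plain text, one per line.
    static final String STYLESHEET =
          "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"text\"/>"
        + "<xsl:template match=\"/news\">"
        + "<xsl:for-each select=\"item\">"
        + "<xsl:value-of select=\"title\"/><xsl:text>&#10;</xsl:text>"
        + "</xsl:for-each>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Runs the stylesheet over the XML input and returns the transformed text. */
    static String transform(String xml, String xslt) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Raised when the stylesheet or the input is not well-formed XML.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up, well-formed input document.
        String xml = "<news><item><title>First</title></item>"
                   + "<item><title>Second</title></item></news>";
        System.out.print(transform(xml, STYLESHEET));
    }
}
```

Note that both the stylesheet and the input have to be well-formed XML before the processor will do anything with them.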

Our HtmlRipper, on the other hand, is designed to extract structure (data) from an "HTML document" regardless of how poorly it is formatted. It will even function on documents that make an XML parser that can handle HTML "give up and cry". This data is then made available to other applications.
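
The sketch below shows the kind of failure meant here, using Java's standard strict XML parser (javax.xml.parsers), which is less forgiving than an HTML-tolerant parser. The class name StrictParseDemo and both markup samples are invented for this illustration. It says nothing about how HtmlRipper itself copes with such input.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Illustrative only: checks whether markup survives a strict XML parse. */
class StrictParseDemo {

    /** Returns true if the markup is well-formed XML, false if the parser rejects it. */
    static boolean parsesAsXml(String markup) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // The parser gave up; its default error handler may also log to stderr.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Typical sloppy HTML: unclosed <br> and <p> elements (made-up sample).
        String sloppy = "<html><body><p>First<br><p>Second</body></html>";
        // The same content written as well-formed markup.
        String tidy = "<html><body><p>First<br/></p><p>Second</p></body></html>";

        System.out.println("sloppy parses: " + parsesAsXml(sloppy));
        System.out.println("tidy parses:   " + parsesAsXml(tidy));
    }
}
```

The sloppy page is rejected outright, so an XSLT stylesheet never gets a chance to run on it, while a browser would display both versions without complaint.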

As to ease of use, I was able to create a rules file that would rip the top five entries out of the Freshmeat home page's new products list in less than five minutes. Most of that time was spent studying how your page was put together. I took a copy of the home page and entered the following tag:

<htmlripper type="source" name="FreshMeat Home Page" source_type="url" source="http://freshmeat.html">

The above was inserted after the <BODY> tag to describe the source document for the rip.

Then the two tags:

<htmlripper type="element" name="top five">

</htmlripper>

were inserted to bracket the structure of the top five stories.
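
To show what the bracketing marks out, here is a rough sketch that pulls the bracketed region from an annotated copy of a page using plain string searches. This is not how the rules generator or the ripper work internally (they produce and apply an XML rules file); the class BracketDemo, the method bracketed and the tiny sample page are all hypothetical.

```java
import java.util.Optional;

/** Illustrative only: finds the markup enclosed by an htmlripper element tag pair. */
class BracketDemo {

    /** Returns the text between the named element tag and the next closing tag. */
    static Optional<String> bracketed(String annotatedHtml, String name) {
        // Assumption: the opening tag is written exactly as in the example above.
        String open = "<htmlripper type=\"element\" name=\"" + name + "\">";
        String close = "</htmlripper>";

        int start = annotatedHtml.indexOf(open);
        if (start < 0) {
            return Optional.empty();
        }
        int contentStart = start + open.length();
        int end = annotatedHtml.indexOf(close, contentStart);
        if (end < 0) {
            return Optional.empty();
        }
        return Optional.of(annotatedHtml.substring(contentStart, end).trim());
    }

    public static void main(String[] args) {
        // Made-up annotated page: clutter before and after the bracketed list.
        String page =
              "<html><body>\n"
            + "<h1>Header clutter</h1>\n"
            + "<htmlripper type=\"element\" name=\"top five\">\n"
            + "<ul><li>Story one</li><li>Story two</li></ul>\n"
            + "</htmlripper>\n"
            + "<p>Footer clutter</p>\n"
            + "</body></html>";

        System.out.println(bracketed(page, "top five").orElse("(not found)"));
    }
}
```

Only the list between the two tags comes back; the header and footer clutter is left behind, which is the effect the annotation is meant to describe.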

Then, using our sample JRipViewer program (supplied in the zip file), this file was passed to the rules generator, which created an XML file defining the area of the HTML document to rip. The XML file was then passed to the ripper, and the data was ripped out of your home page without any of the surrounding page entries and displayed in the JRipViewer.

All in less than five minutes.

So now at any time I can grab the top five entries from the home page without having the other stuff cluttering up my screen. I already use it on several news sites on the net.

Not only that, but it is small and fast.

Hope this helps.

Regards,
Robert

17 May 2001 19:18 mnot

Why not just use XSLT?
XSLT does this, has many implementations, and from the looks of things is easier to use. What does this offer over it? Just curious...
