An open community 
of Macintosh users,
for Macintosh users.

Downloading text only from web pages
#27060 10/14/13 05:27 PM
Joined: Sep 2009
deniro Offline OP
I know there are many ways to do this, but I haven't found one that I'm really happy with.

I often find articles that I would like to download from the web. But I want only the text, not an html file.

I don't want to select the text and then cut and paste. I've done it many times in the past, but it is tiresome, a pain in the neck, and often doesn't work very well. I don't want to have to deal with a lot of awkward paragraph spacing, extra space, odd sentence length and so on. (Tom Bender's Tex Edit used to do a great job of cleaning up files, but that too was time-consuming.) I know the browsers are supposed to be able to download text-only files (I have Firefox 3.6 and Safari 4.1), but I can't seem to get it the way I want, text only. I've tried converting HTML with TexEdit but it doesn't work very well.

This is especially a problem on big files. I tried using TexEdit to convert them, but TexEdit does a poor job of handling long documents. That leaves me with NeoOffice, which should work converting html to text, but it doesn't seem to.

For example, a long file from Project Gutenberg:

http://www.gutenberg.org/files/16350/16350-h/16350-h.htm


2) I also have an XHTML file downloaded from a Windows desktop but I can't convert it. Surely I have something on my computer that ought to be able to read this file. So far no luck. Tried TextEdit but it didn't work.

Re: Downloading text only from web pages
deniro #27061 10/14/13 07:04 PM
Joined: Aug 2009
Likes: 4
Offline

Have you tried saving as a PDF file?

Given your OS 10.4.11 and ancient Firefox 3.6.28 I suspect you can't. The latest versions of Firefox have Adobe Acrobat - Create PDF 1.2 loaded as an extension, so it's a piece of cake.

Re: Downloading text only from web pages
deniro #27063 10/14/13 07:39 PM
Joined: Aug 2009
Likes: 16
Moderator
Offline
One problem with text conversion on many web pages is that a substantial segment of the "text" is not HTML at all; rather, it is a GIF, JPG, PDF, or other graphic image. In that case the "text" will have to be processed with an Optical Character Recognition (OCR) app before it is selectable as text.

For what it is worth, I just downloaded your "long file from the Gutenberg project" and opened it in NeoOffice 2013.1. It opened with no problems: the text was clear and readable and the links all worked as advertised. The same was true in Nisus Writer Pro 2.0.6 (IMO NeoOffice did a better job of rendering and formatting the pages). Pages, on the other hand, could not open the file at all. It should be noted I am running OS X 10.8.5, and those versions of NeoOffice and Nisus Writer Pro are probably not available to you in Tiger.


If we knew what it was we were doing, it wouldn't be called research, would it?

— Albert Einstein
Re: Downloading text only from web pages
deniro #27066 10/14/13 09:37 PM
Joined: Aug 2009
Offline

Sometimes there is a "Print" version available for an article or web page; those are usually a lot easier to deal with for a cut-and-paste.

Of course, not every article or page will have one.


MacBook 2.4 Ghz · 4 Gb ram · 10.7.5
stuff I'm interested in
iPhone 4s 7.0.2
Re: Downloading text only from web pages
roger #27067 10/14/13 10:11 PM
Joined: Sep 2009
deniro Offline OP
Thanks for the quick replies.

1) Yes, I love when there is a print command giving me a text page with nothing but text. Just doesn't happen often enough.

2) I goofed. I meant I was using TextEdit to try to convert html. Tom Bender's program from long ago was called Tex Edit (he was from Texas).

3) My version of NeoOffice stopped at 3.1.1

4) Maybe a Firefox extension?

Re: Downloading text only from web pages
deniro #27073 10/15/13 06:43 PM
Joined: Aug 2009
Likes: 1
Offline

I use TextWrangler from Bare Bones Software to clean up text in cases like this.

If you use Safari, you can often get clean text by hitting the Reader button, then copy-pasting that into a text editor.
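
For anyone who would rather script the cleanup than do it by hand, here is a minimal sketch in Python 3 (assuming an interpreter is available on your machine). The function name tidy_text is made up for illustration. It joins hard-wrapped lines within each paragraph, collapses runs of spaces and tabs, and leaves exactly one blank line between paragraphs. It is only one way to do it, not what TextWrangler does internally.

```python
import re


def tidy_text(raw):
    """Normalize pasted text: one line per paragraph, single spaces,
    and exactly one blank line between paragraphs."""
    # Treat Windows (\r\n) and classic Mac (\r) line endings as \n.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # A paragraph break is one or more blank (or whitespace-only) lines.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = []
    for para in paragraphs:
        # split() with no arguments breaks on any run of whitespace,
        # so this also rejoins lines that were hard-wrapped.
        joined = " ".join(para.split())
        if joined:
            cleaned.append(joined)
    return "\n\n".join(cleaned)
```

Paste the copied article into a file, read it in, pass the contents to tidy_text, and write the result back out as plain text.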


Photo gallery, all about me, and more: www.xeromag.com/franklin.html
Re: Downloading text only from web pages
deniro #27075 10/15/13 09:21 PM
Joined: Sep 2009
Offline

Give iCab a try. It can save the files as a text document. Opening them in TextWrangler seems to work better than TextEdit.

Re: Downloading text only from web pages
tacit #27080 10/16/13 11:53 AM
Joined: Aug 2009
Likes: 7
Offline

Originally Posted By: tacit
If you use Safari, you can often get clean text by hitting the Reader button, then copy-pasting that into a text editor.
If memory serves, the version of Safari in 10.4 does not include Reader.


Jon

macOS 11.7.10, iMac Retina 5K 27-inch, late 2014, 3.5 GHz Intel Core i5, 1 TB fusion drive, 16 GB RAM, Epson SureColor P600, Photoshop CC, Lightroom CC, MS Office 365
Re: Downloading text only from web pages
deniro #27083 10/16/13 08:21 PM
Joined: Aug 2009
Offline

You might also be able to get a lighter copy by changing the user-agent string in your browser to something that makes it look like a mobile device, which really ought not to be sent big flashy banner ads and the like.
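
The same trick can be tried outside the browser. This is a rough sketch in Python 3 using the standard urllib module; the user-agent value is only an illustrative iPhone-style string, and the names MOBILE_UA, build_mobile_request and fetch are made up for this example. Whether a given site actually serves a lighter page in response is up to the site.

```python
import urllib.request

# Assumption: an illustrative mobile-style identification string.
# Any string that a site recognizes as a phone browser would do.
MOBILE_UA = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) "
    "AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 "
    "Mobile/11A465 Safari/9537.53"
)


def build_mobile_request(url, user_agent=MOBILE_UA):
    """Create a request that identifies itself with the given user agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Download the page and return it as text (needs a network connection)."""
    with urllib.request.urlopen(build_mobile_request(url), timeout=30) as resp:
        # Fall back to UTF-8 if the server does not declare a charset.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Calling fetch with the article's address returns the raw HTML, which still has to be converted to text afterwards.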

But in any case, the text is there and in many cases it's not too difficult to sort it out from the rest, but formatting can be lost. Columns that appear side by side will appear one after the other. Tables may be completely split up. Comments, captions, and other unnecessary clutter normally formatted away from the body of the text may appear spliced in with the body.
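
To show what "sorting it out from the rest" can look like, here is a minimal sketch in Python 3 using only the standard html.parser module. The names TextExtractor and html_to_text are made up for illustration. It keeps the visible text, drops script and style contents, and puts a blank line between block-level elements. As described above, it makes no attempt to preserve columns or tables, so those will come out one piece after another.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}
    BLOCK = {"p", "div", "br", "li", "tr",
             "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the text of an HTML string, one paragraph per block element."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Collapse whitespace inside each line and drop empty lines.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n\n".join(line for line in lines if line)
```

For a saved page, read the file into a string (for example with open(path, encoding="utf-8", errors="replace")), pass it to html_to_text, and write the result to a .txt file. The same approach should cope with most XHTML files, since the parser is tolerant of both.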


I work for the Department of Redundancy Department
Re: Downloading text only from web pages
deniro #27086 10/16/13 11:30 PM
Joined: Aug 2009
Likes: 7
Offline

Are Services available in 10.4? I don't recall. If so, you can select the desired text and then choose Services from the Safari or Firefox menu. At that point, select Print Selection.


Jon

macOS 11.7.10, iMac Retina 5K 27-inch, late 2014, 3.5 GHz Intel Core i5, 1 TB fusion drive, 16 GB RAM, Epson SureColor P600, Photoshop CC, Lightroom CC, MS Office 365

Moderated by  alternaut, dianne, MacManiac 
