[Cialug] OT - manipulating blogger exported XML

Nicolai nicolai-cialug at chocolatine.org
Sun Oct 2 19:19:45 CDT 2011


On Sun, Oct 02, 2011 at 06:11:38PM -0500, Nathan C. Smith wrote:

> Or instead use wget and pull down the whole site into html and somehow
> stitch that together into a single document from the resultant pages.

After looking at Dave's posted blog.xml file and at a typical
http://example.blogspot.com/date/file.html page, I gotta second the
wget approach, maybe with a little lynx -dump magic.  (Does wget have
a text-dump feature like that built in?)  Those html files are heinous.

With wget + lynx -dump on the resulting files (ugly, but not as bad as
parsing blog.xml), the (1) title and (2) text body are easy to identify.
There may be a few [EMBED]-type strings in the text, but those are easy
to fix.
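
Roughly what I have in mind, as an untested-against-the-real-blog sketch
in Python.  Assumptions on my part: the site was already mirrored with
something like "wget --recursive --no-parent http://example.blogspot.com/",
lynx is on the PATH, and the junk shows up as a literal "[EMBED]" token.
The names strip_embeds, dump_page and stitch are made up for this
example, and sorting by path only approximates posting order.

```python
import re
import subprocess
from pathlib import Path

# Assumption: lynx renders embedded objects as a literal "[EMBED]" token.
EMBED_RE = re.compile(r"\[EMBED\]")


def strip_embeds(text: str) -> str:
    """Drop [EMBED] tokens and trailing whitespace, keeping lynx's indentation."""
    lines = (EMBED_RE.sub("", line).rstrip() for line in text.splitlines())
    return "\n".join(lines)


def dump_page(path: Path) -> str:
    """Render one mirrored HTML file to plain text with lynx -dump."""
    result = subprocess.run(
        ["lynx", "-dump", "-nolist", str(path)],  # -nolist omits the link list
        capture_output=True,
        text=True,
        check=True,
    )
    return strip_embeds(result.stdout)


def stitch(mirror_dir: Path, out_file: Path) -> None:
    """Concatenate the text dumps of every .html file under mirror_dir."""
    pages = sorted(mirror_dir.rglob("*.html"))  # path order, not true date order
    with out_file.open("w", encoding="utf-8") as out:
        for page in pages:
            out.write(dump_page(page))
            out.write("\n\n")
```

Calling stitch(Path("example.blogspot.com"), Path("blog.txt")) would then
give one big text file.  Picking out the title and body of each post
still has to be done by eye or with a per-blog pattern, since that
depends on the template.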

Nicolai

