From: vasan999 on
The site says that this will convert HTML to LaTeX. Can anyone
explain this code to me? I am not familiar with such difficult
commands, especially since there are no comments. I would appreciate a
line-by-line explanation and a description of the overall operation.

1i\
\\documentstyle{article}
1i\
\\begin{document}
$a\
\\end{document}
# Too bad there's no way to make sed ignore case!
/<[Xx][Mm][Pp]>/,/<.[Xx][Mm][Pp]>/b lit
/<.[Xx][Mm][Pp]>/b lit
/<[Ll][Ii][Ss][Tt][Ii][Nn][Gg]>/,/<.[Ll][Ii][Ss][Tt][Ii][Nn][Gg]>/b lit
/<.[Ll][Ii][Ss][Tt][Ii][Nn][Gg]>/b lit
/<[Pp][Rr][Ee]>/,/<.[Pp][Rr][Ee]>/b pre
/<.[Pp][Rr][Ee]>/b pre
# Stuff to ignore
s?<[Ii][Ss][Ii][Nn][Dd][Ee][Xx]>??
s?</[Aa][Dd][Dd][Rr][Ee][Ss][Ss]>??g
s?<[Nn][Ee][Xx][Tt][Ii][Dd][^>]*>??g
# character set translations for LaTeX special chars
s?&gt.?>?g
s?&lt.?<?g
s?\\?\\backslash ?g
s?{?\\{?g
s?}?\\}?g
s?%?\\%?g
s?\$?\\$?g
s?&?\\&?g
s?#?\\#?g
s?_?\\_?g
s?~?\\~?g
s?\^?\\^?g
# Paragraph borders
s?<[Pp]>?\\par ?g
s?</[Pp]>??g
# Headings
s?<[Tt][Ii][Tt][Ll][Ee]>\([^<]*\)</[Tt][Ii][Tt][Ll][Ee]>?\
\\section*{\1}?g
s?<[Hh]n>?\\part{?g
s?</[Hh]n>?}?g
s?<[Hh]1>?\\section*{?g
s?</[Hh][0-9]>?}?g
s?<[Hh]2>?\\subsection*{?g
s?<[Hh]3>?\\subsubsection*{?g
s?<[Hh]4>?\\subsubsection*{?g
s?<[Hh]5>?\\paragraph{?g
s?<[Hh]6>?\\subparagraph{?g
# UL is itemize
s?<[Uu][Ll]>?\\begin{itemize}?g
s?</[Uu][Ll]>?\\end{itemize}?g
s?<[Ll][Ii]>?\\item ?g
# DL is description
s?<[Dd][Ll]>?\\begin{description}?g
s?</[Dd][Ll]>?\\end{description}?g
# closing delimiter for DT is first < or end of line which ever comes first NO
#s?<[Dd][Tt]>\([^<]*\)<?\\item[\1]<?g
#s?<[Dd][Tt]>\([^<]*\)$?\\item[\1]?g
#s?<[Dd][Dd]>??g
s?<[Dd][Tt]>?\\item[?g
s?<[Dd][Dd]>?]?g
# Other common SGML markup. this is ad-hoc
s?<sec[ab]>??
s?</sec[ab]>??g
# Italics
s?<it>\([^<]*\)</it>?{\\it \1 }?g
# Get rid of Anchors
:pre
s?<[Aa][^>]*>??g
s?</[Aa]>??g
# This is a subroutine in sed, in case you are not a sed guru
: lit
s?<[Xx][Mm][Pp]>?\\begin{verbatim}?g
s?</[Xx][Mm][Pp]>?\\end{verbatim}?
s?<[Ll][Ii][Ss][Tt][Ii][Nn][Gg]>?\\begin{verbatim}?g
s?</[Ll][Ii][Ss][Tt][Ii][Nn][Gg]>?\\end{verbatim}?
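
For anyone else puzzling over the script, a much-reduced sketch of its main mechanism may help. It assumes a POSIX-style sed run from a shell; the function name demo_convert is made up for illustration, and only three of the script's rules are kept. Lines between <xmp> and </xmp> branch to the lit label and so skip the LaTeX escaping, while every other line is escaped first. In the s commands, ? is simply the delimiter, used in place of the usual /.

```shell
# Hypothetical cut-down demo, not the original script.
demo_convert() {
  # 1. Lines from <xmp> through </xmp> jump straight to the "lit" label.
  # 2. All other lines get "_" escaped and <p> turned into \par.
  # 3. From "lit" on, the xmp tags become a verbatim environment.
  sed -e '/<[Xx][Mm][Pp]>/,/<.[Xx][Mm][Pp]>/b lit' \
      -e 's?_?\\_?g' \
      -e 's?<[Pp]>?\\par ?g' \
      -e ':lit' \
      -e 's?<[Xx][Mm][Pp]>?\\begin{verbatim}?g' \
      -e 's?</[Xx][Mm][Pp]>?\\end{verbatim}?'
}

# Usage: the underscore inside the xmp block is left alone.
printf '%s\n' '<p>a_b' '<xmp>' 'x_y' '</xmp>' | demo_convert
```

The full script follows the same pattern: the 1i\ and $a\ commands at the top add the LaTeX preamble before the first line and \end{document} after the last, and the long run of s commands in between is only reached by lines that did not branch away.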


On Oct 22, 2:57 pm, vasan...(a)hotmail.com wrote:
> Basically, it should do everything that the tools below do and, in
> addition:
>
> 1/
> Human-readable output that maintains the text lines of the source, i.e.
> does not scramble the text lines or insert or remove newlines
> unnecessarily. Inserts minimal LaTeX elements.
>
> 2/
> Maintains cross-links, i.e. converts <href to \ref and <name= to \label.
>
> But if the set of HTML files is incomplete, it should proceed on the
> assumption that the reference is there, i.e. it should not delete the
> links or try to modify them or their addresses. One of the tools I
> tested is too smart in this respect and actually ruins the result.
>
> 3/
> Proper conversion of images, tables, etc. No math mode is involved in
> HTML.
>
> 4/
> Even an Emacs Lisp function, written by a guru, could do the
> job.
>
> 5/
> Is there any commercial WYSIWYG tool?
>
> LaTeX etc
>
> * html2latex is a program based on the NCSA html parser. Contact:
> Nathan.Torking...(a)vuw.ac.nz.
> * Another html2latex can combine several HTML files into a single
> LaTeX file, converting links between the files to references. External
> URLs can be converted into footnotes or into a bibliography sorted on
> URL. Contact: F.J.Fa...(a)cs.utwente.nl (Frans J. Faase)
> * Another html2latex, implemented on Linux with yacc+lex+C. Also
> available from the TSX-11 Linux FTP site as nc-html2latex-0.97.tar.gz.
> Contact: naoc...(a)naochan.com (Naoya Tozuka)
> * htmlatex.pl is a Perl script to do the conversion (may be moving
> soon). Contact: n9146...(a)cc.wwu.edu (Jake Kesinger)
> * There is also a sed script to convert HTML into LaTeX.
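
A rough sketch of requirement 2/ above, in the same sed style, might look like the following. It is only an illustration under several assumptions: the function name links_to_refs is made up, anchors are written as <a name="..."> or <a href="#..."> with double quotes, and each anchor sits on a single line. Links that do not point to a "#" fragment are left completely untouched, in line with the wish not to delete or modify references whose target files are missing.

```shell
# Hypothetical sketch: map same-document anchors to \label and \ref.
links_to_refs() {
  # <a name="x">text</a>   becomes  text\label{x}
  # <a href="#x">text</a>  becomes  text~\ref{x}
  # Any other link (for example to another file) is passed through as-is.
  sed -e 's?<[Aa] [Nn][Aa][Mm][Ee]="\([^"]*\)">\([^<]*\)</[Aa]>?\2\\label{\1}?g' \
      -e 's?<[Aa] [Hh][Rr][Ee][Ff]="#\([^"]*\)">\([^<]*\)</[Aa]>?\2~\\ref{\1}?g'
}

# Usage:
printf '%s\n' '<a name="intro">Intro</a> see <a href="#intro">here</a>' | links_to_refs
```

Anchors that span several lines or carry extra attributes would need a real parser rather than line-by-line substitutions.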