From: Werner Opriel on 15 Dec 2009 04:00

I have a text file containing some random words with unwanted spaces between
their characters, such as:

===
This is a correct line with text, this is still Text this is still Text this
is still Text, b u t this is T e x t w e don't w a n t so it's u n w a n t
e d F o r m this Text is ok this Text is ok this Text is ok w r o n g this
Text is ok
===

The example is showing one paragraph without a Linefeed.

Can anyone give me a hint for a regex to solve this problem?
From: Stachu 'Dozzie' K. on 15 Dec 2009 04:01

On 15.12.2009, Werner Opriel wrote:
> I have a text file containing some random words with unwanted spaces between
> their characters, such as:
>
>===
> This is a correct line with text, this is still Text this is still Text this
> is still Text, b u t this is T e x t w e don't w a n t so it's u n w a n t
> e d F o r m this Text is ok this Text is ok this Text is ok w r o n g this
> Text is ok
>===
>
> The example is showing one paragraph without a Linefeed.
>
> Can anyone give me a hint for a regex to solve this problem?

s/ //g

Or maybe you should define how to tell unwanted from wanted spaces.

--
Stanislaw Klekot
From: Sidney Lambe on 15 Dec 2009 04:41

On comp.unix.shell, Werner Opriel <w.opr(a)gmx.de> wrote:
> I have a text file containing some random words with unwanted spaces between
> their characters, such as:
>
>===
> This is a correct line with text, this is still Text this is still Text this
> is still Text, b u t this is T e x t w e don't w a n t so it's u n w a n t
> e d F o r m this Text is ok this Text is ok this Text is ok w r o n g this
> Text is ok
>===
>
> The example is showing one paragraph without a Linefeed.
>
> Can anyone give me a hint for a regex to solve this problem?
>

Your only solution is to prevent the corruption of the files
in the first place.

Looks to me like the garbage produced by some shoddy pdf to text
utilities.

Sid
From: Werner Opriel on 15 Dec 2009 04:55

Sidney Lambe wrote:
> On comp.unix.shell, Werner Opriel <w.opr(a)gmx.de> wrote:
>> I have a text file containing some random words with unwanted spaces
>> between their characters, such as:
>>
>>===
>> This is a correct line with text, this is still Text this is still Text
>> this is still Text, b u t this is T e x t w e don't w a n t so it's u n w
>> a n t e d F o r m this Text is ok this Text is ok this Text is ok w r o n
>> g this Text is ok
>>===
>>
>> The example is showing one paragraph without a Linefeed.
>>
>> Can anyone give me a hint for a regex to solve this problem?
>>
>
> Your only solution is to prevent the corruption of the files
> in the first place.
>
> Looks to me like the garbage produced by some shoddy pdf to text
> utilities.
>
> Sid

You are right, but it was not the pdftotext utility, it's already the
garbage pdf file itself.
From: Sidney Lambe on 15 Dec 2009 05:45

On comp.unix.shell, Werner Opriel <w.opr(a)gmx.de> wrote:
> Sidney Lambe wrote:
>
>> On comp.unix.shell, Werner Opriel <w.opr(a)gmx.de> wrote:
>>> I have a text file containing some random words with unwanted spaces
>>> between their characters, such as:
>>>
>>>===
>>> This is a correct line with text, this is still Text this is still Text
>>> this is still Text, b u t this is T e x t w e don't w a n t so it's u n w
>>> a n t e d F o r m this Text is ok this Text is ok this Text is ok w r o n
>>> g this Text is ok
>>>===
>>>
>>> The example is showing one paragraph without a Linefeed.
>>>
>>> Can anyone give me a hint for a regex to solve this problem?
>>>
>>
>> Your only solution is to prevent the corruption of the files
>> in the first place.
>>
>> Looks to me like the garbage produced by some shoddy pdf to text
>> utilities.
>>
>> Sid
>
> You are right, but it was not the pdftotext utility, it's already the
> garbage pdf file itself.

Then I'd guess that the text for the pdf file was taken from
converted pdf files in the first place. Probably done by a script.
Someone ripping off google's conversions, maybe.

Perhaps you could locate the original pdf files?

There's a slim chance that the corruptions are mathematically
predictable, but we are still talking about a very complex script
that I cannot imagine anyone being willing to take the time to
write.

Something could be thrown together that would reduce the amount
of manual editing needed to clean them up, but not by much.

Sid
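[Editorial note: Sid's closing remark -- that something could be thrown together to reduce the manual editing, but not by much -- might look like the awk sketch below. It joins only runs of three or more consecutive single-letter tokens (almost certainly broken words) and leaves shorter runs, such as a legitimate "a I", untouched for hand review. The three-token threshold and the whole approach are assumptions, not anything proposed in the thread.]

```shell
# Hypothetical sketch: conservatively join long runs of single-letter
# tokens, leaving short runs for manual review.
echo "b u t this is w r o n g and a I stay" | awk '
{
  out = ""; run = ""; cnt = 0
  for (i = 1; i <= NF; i++) {
    if ($i ~ /^[A-Za-z]$/) {        # single letter: extend current run
      run = run $i; cnt++
    } else {                        # longer token: flush run, emit token
      out = out emit(run, cnt) $i " "
      run = ""; cnt = 0
    }
  }
  out = out emit(run, cnt)          # flush a trailing run
  sub(/ $/, "", out)
  print out
}
function emit(r, c,    j, s) {
  if (c == 0) return ""
  if (c >= 3) return r " "          # long run: join into one word
  s = ""
  for (j = 1; j <= c; j++)          # short run: keep letters separate
    s = s substr(r, j, 1) " "
  return s
}'
# -> but this is wrong and a I stay
```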