From: Werner Opriel on 15 Dec 2009 04:00

I have a text file containing some random words with unwanted spaces between
their characters, such as:

===
This is a correct line with text, this is still Text this is still Text this
is still Text, b u t this is T e x t w e don't w a n t so it's u n w a n t
e d F o r m this Text is ok this Text is ok this Text is ok w r o n g this
Text is ok
===

The example is showing one paragraph without a Linefeed.

Can anyone give me a hint for a regex to solve this problem?
From: Stachu 'Dozzie' K. on 15 Dec 2009 04:01

On 15.12.2009, Werner Opriel wrote:
> I have a text file containing some random words with unwanted spaces between
> their characters, such as:
>
>===
> This is a correct line with text, this is still Text this is still Text this
> is still Text, b u t this is T e x t w e don't w a n t so it's u n w a n t
> e d F o r m this Text is ok this Text is ok this Text is ok w r o n g this
> Text is ok
>===
>
> The example is showing one paragraph without a Linefeed.
>
> Can anyone give me a hint for a regex to solve this problem?

s/ //g

Or maybe you should define how to tell unwanted from wanted spaces.

--
Stanislaw Klekot
From: Sidney Lambe on 15 Dec 2009 04:41

On comp.unix.shell, Werner Opriel <w.opr(a)gmx.de> wrote:
> I have a text file containing some random words with unwanted spaces between
> their characters, such as:
>
>===
> This is a correct line with text, this is still Text this is still Text this
> is still Text, b u t this is T e x t w e don't w a n t so it's u n w a n t
> e d F o r m this Text is ok this Text is ok this Text is ok w r o n g this
> Text is ok
>===
>
> The example is showing one paragraph without a Linefeed.
>
> Can anyone give me a hint for a regex to solve this problem?
>

Your only solution is to prevent the corruption of the files
in the first place.

Looks to me like the garbage produced by some shoddy pdf to text
utilities.

Sid
From: Werner Opriel on 15 Dec 2009 04:55

Sidney Lambe wrote:
> On comp.unix.shell, Werner Opriel <w.opr(a)gmx.de> wrote:
>> I have a text file containing some random words with unwanted spaces
>> between their characters, such as:
>>
>>===
>> This is a correct line with text, this is still Text this is still Text
>> this is still Text, b u t this is T e x t w e don't w a n t so it's u n w
>> a n t e d F o r m this Text is ok this Text is ok this Text is ok w r o n
>> g this Text is ok
>>===
>>
>> The example is showing one paragraph without a Linefeed.
>>
>> Can anyone give me a hint for a regex to solve this problem?
>>
>
> Your only solution is to prevent the corruption of the files
> in the first place.
>
> Looks to me like the garbage produced by some shoddy pdf to text
> utilities.
>
> Sid

You are right, but it was not the pdftotext utility, it's already the
garbage pdf file itself.
From: Sidney Lambe on 15 Dec 2009 05:45

On comp.unix.shell, Werner Opriel <w.opr(a)gmx.de> wrote:
> Sidney Lambe wrote:
>
>> On comp.unix.shell, Werner Opriel <w.opr(a)gmx.de> wrote:
>>> I have a text file containing some random words with unwanted spaces
>>> between their characters, such as:
>>>
>>>===
>>> This is a correct line with text, this is still Text this is still Text
>>> this is still Text, b u t this is T e x t w e don't w a n t so it's u n w
>>> a n t e d F o r m this Text is ok this Text is ok this Text is ok w r o n
>>> g this Text is ok
>>>===
>>>
>>> The example is showing one paragraph without a Linefeed.
>>>
>>> Can anyone give me a hint for a regex to solve this problem?
>>>
>>
>> Your only solution is to prevent the corruption of the files
>> in the first place.
>>
>> Looks to me like the garbage produced by some shoddy pdf to text
>> utilities.
>>
>> Sid
>
> You are right, but it was not the pdftotext utility, it's already the
> garbage pdf file itself.

Then I'd guess that the text for the pdf file was taken from
converted pdf files in the first place. Probably done by a script.
Someone ripping off google's conversions, maybe.

Perhaps you could locate the original pdf files?

There's a slim chance that the corruptions are mathematically
predictable, but we are still talking about a very complex script
that I cannot imagine anyone being willing to take the time to
write.

Something could be thrown together that would reduce the amount
of manual editing needed to clean them up, but not by much.

Sid
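[Editorial note: Sid's closing remark -- that something could be thrown together to reduce the manual editing, but not by much -- might look like the awk sketch below. It joins only runs of three or more consecutive single-letter tokens (almost certainly broken words) and leaves shorter runs, such as a legitimate "a I", untouched for hand review. The three-token threshold and the whole approach are assumptions, not anything proposed in the thread.]

```shell
# Hypothetical sketch: conservatively join long runs of single-letter
# tokens, leaving short runs for manual review.
echo "b u t this is w r o n g and a I stay" | awk '
{
  out = ""; run = ""; cnt = 0
  for (i = 1; i <= NF; i++) {
    if ($i ~ /^[A-Za-z]$/) {        # single letter: extend current run
      run = run $i; cnt++
    } else {                        # longer token: flush run, emit token
      out = out emit(run, cnt) $i " "
      run = ""; cnt = 0
    }
  }
  out = out emit(run, cnt)          # flush a trailing run
  sub(/ $/, "", out)
  print out
}
function emit(r, c,    j, s) {
  if (c == 0) return ""
  if (c >= 3) return r " "          # long run: join into one word
  s = ""
  for (j = 1; j <= c; j++)          # short run: keep letters separate
    s = s substr(r, j, 1) " "
  return s
}'
# -> but this is wrong and a I stay
```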