Textfile with unwanted spaces in words [Shell]


From: Moody on 15 Dec 2009 06:06

On Dec 15, 2:00 pm, Werner Opriel <w....(a)gmx.de> wrote:
> I have a text file containing some random words with unwanted spaces between
> their characters, such as:
>
> ===
> This is a correct line with text, this is still Text this is still Text this
> is still Text, b u t this is T e x t w e don't w a n t so it's u n w a n t
> e d F o r m this Text is ok this Text is ok this Text is ok w r o n g this
> Text is ok
> ===

Try copying the text into another word processor and changing the
font; this is the only way to make it work. The problem as posted is
very generic, and any sort of stream editing will make many words
run together into a single word...

>
> The example is showing one paragraph without a Linefeed.
>
> Can anyone give me a hint for a regex to solve this problem?

From: Werner Opriel on 15 Dec 2009 06:30

Sidney Lambe wrote:

> On comp.unix.shell, Werner Opriel <w.opr(a)gmx.de> wrote:
>> Sidney Lambe wrote:
>>
>>> On comp.unix.shell, Werner Opriel <w.opr(a)gmx.de> wrote:
>>>> I have a text file containing some random words with unwanted spaces
>>>> between their characters, such as:
>>>>
>>>>===
>>>> This is a correct line with text, this is still Text this is still Text
>>>> this is still Text, b u t this is T e x t w e don't w a n t so it's u n
>>>> w a n t e d F o r m this Text is ok this Text is ok this Text is ok w r
>>>> o n g this Text is ok
>>>>===
>>>>
>>>> The example is showing one paragraph without a Linefeed.
>>>>
>>>> Can anyone give me a hint for a regex to solve this problem?
>>>>
>>>
>>> Your only solution is to prevent the corruption of the files
>>> in the first place.
>>>
>>> Looks to me like the garbage produced by some shoddy pdf to text
>>> utilities.
>>>
>>>
>>> Sid
>>
>> You are right, but it was not the pdftotext utility, it's already the
>> garbage pdf file itself.
>
> Then I'd guess that the text for the pdf file was taken from converted
> pdf files in the first place. Probably done by a script. Someone ripping
> off google's conversions, maybe.
>
> Perhaps you could locate the original pdf files?

I only have the garbage pdf file, no original.

> There's a slim chance that the corruptions are mathematically
> predictable, but we are still talking about a very complex
> script that I cannot imagine anyone being willing to take
> the time to write.

Ok.

> Something could be thrown together that would reduce the amount
> of manual editing needed to clean them up, but not by much.

Reducing the amount of manual editing is what I'm trying to do, but I think
it's a little bit too hard for me.

My first try: match paragraphs containing runs of single characters with sed:
/ \([[:alpha:]] [[:alpha:]] \)\+/n

It seems to be ok.
But I don't know how to remove the spaces.
I'm aware that adjacent words will be merged into a single word, but I hope
that's the lesser of the two evils.
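
For the "how do I remove the spaces" part, here is a minimal sketch using awk
rather than a single sed regex. It is not something posted in the thread. The
function name join_spaced and the default minimum run length of 3 are
assumptions made for illustration; the threshold only exists so that a lone
"a" or "I" is left alone. It matches ASCII letters only, collapses repeated
whitespace to single spaces, and works line by line, so a run that is split
across a line break is not rejoined. As noted above, adjacent garbled words
still end up merged into one word.

```shell
#!/bin/sh
# join_spaced: hypothetical helper, not from the thread.
# Reads text on stdin and joins every run of at least MIN whitespace-separated
# single ASCII letters into one word. MIN is the optional first argument
# (default 3, an assumption).
join_spaced() {
    awk -v min="${1:-3}" '
    # Append the pending run to the output line: joined into one word if it
    # is long enough, otherwise put back as separate single letters.
    function flush(   k, word) {
        if (n == 0) return
        if (n >= min) {
            word = ""
            for (k = 1; k <= n; k++) word = word run[k]
            out = out sep word; sep = " "
        } else {
            for (k = 1; k <= n; k++) { out = out sep run[k]; sep = " " }
        }
        n = 0
    }
    {
        out = ""; sep = ""; n = 0
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^[A-Za-z]$/) {
                run[++n] = $i            # collect consecutive single letters
            } else {
                flush()                  # a normal word ends the current run
                out = out sep $i; sep = " "
            }
        }
        flush()                          # handle a run at the end of the line
        print out
    }'
}
```

Usage, with placeholder file names: join_spaced < garbled.txt > cleaned.txt
Passing a number, for example join_spaced 2, lowers the threshold at the cost
of joining more legitimate one-letter words.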

From: Werner Opriel on 15 Dec 2009 08:28

Moody wrote:

> On Dec 15, 2:00 pm, Werner Opriel <w....(a)gmx.de> wrote:
>> I have a text file containing some random words with unwanted spaces
>> between their characters, such as:
>>
>> ===
>> This is a correct line with text, this is still Text this is still Text
>> this is still Text, b u t this is T e x t w e don't w a n t so it's u n w
>> a n t e d F o r m this Text is ok this Text is ok this Text is ok w r o n
>> g this Text is ok
>> ===
>
> Try copying the text into another word processor and changing the
> font; this is the only way to make it work. The problem as posted is
> very generic, and any sort of stream editing will make many words
> run together into a single word...

Hi Moody,
thanks, you are right. It seems to have been a font problem!
xpdf and pdftotext apparently cannot display or interpret one of the fonts
used in the file correctly.
Opening the PDF file in Acrobat Reader under Windows and saving it as a
text file fixed the problem.
Thanks a lot!

From: Sidney Lambe on 15 Dec 2009 18:24

On comp.unix.shell, Werner Opriel <w.opr(a)gmx.de> wrote:
> Sidney Lambe wrote:
>
>> On comp.unix.shell, Werner Opriel <w.opr(a)gmx.de> wrote:
>>> Sidney Lambe wrote:
>>>
>>>> On comp.unix.shell, Werner Opriel <w.opr(a)gmx.de> wrote:
>>>>> I have a text file containing some random words with unwanted spaces
>>>>> between their characters, such as:
>>>>>
>>>>>===
>>>>> This is a correct line with text, this is still Text this is still Text
>>>>> this is still Text, b u t this is T e x t w e don't w a n t so it's u n
>>>>> w a n t e d F o r m this Text is ok this Text is ok this Text is ok w r
>>>>> o n g this Text is ok
>>>>>===
>>>>>
>>>>> The example is showing one paragraph without a Linefeed.
>>>>>
>>>>> Can anyone give me a hint for a regex to solve this problem?
>>>>>
>>>>
>>>> Your only solution is to prevent the corruption of the files
>>>> in the first place.
>>>>
>>>> Looks to me like the garbage produced by some shoddy pdf to text
>>>> utilities.
>>>>
>>>>
>>>> Sid
>>>
>>> You are right, but it was not the pdftotext utility, it's already the
>>> garbage pdf file itself.
>>
>> Then I'd guess that the text for the pdf file was taken from converted
>> pdf files in the first place. Probably done by a script. Someone ripping
>> off google's conversions, maybe.
>>
>> Perhaps you could locate the original pdf files?
>
> I only have the garbage pdf file, no original.
>
>> There's a slim chance that the corruptions are mathematically
>> predictable, but we are still talking about a very complex
>> script that I cannot imagine anyone being willing to take
>> the time to write.
>
> Ok.
>
>> Something could be thrown together that would reduce the amount
>> of manual editing needed to clean them up, but not by much.
>
> Reducing the amount of manual editing is what I'm trying to do, but I think
> it's a little bit too hard for me.
>
> My first try: match paragraphs containing runs of single characters with sed:
> / \([[:alpha:]] [[:alpha:]] \)\+/n
>
> It seems to be ok.
> But I don't know how to remove the spaces.
> I'm aware that adjacent words will be merged into a single word, but I hope
> that's the lesser of the two evils.

Maybe the corruptions are mathematically predictable (every so many
bytes)?

Moody's post gave me an idea: Try the utility 'par' and see what
happens. It has many options and args and is really quite amazing.
It's no longer maintained but you should be able to find the source
or a compiled executable. I could send it to you if all else fails.

usenet4444
AT
gmail
(dot)
com

Wish I could be of more help, Werner, but just thinking about it gives
me a headache :-/

Sid
