Bogus assumption prevents d2u/u2d/conv/etal working on mixed files.

Charles Wilson cygwin@cwilson.fastmail.fm
Sat Apr 3 04:49:00 GMT 2004


Dave Korn wrote:

>   I was pretty stunned to find d2u didn't have the same effect as tr -d.  A
> few seconds work in the debugger, however, made it clear.
> 
>   Right inside conv.c, in the main convert (...) function, there's an
> attempted optimisation.  After opening the file for conversion, it reads a
> char at a time until it finds the first '\n' or '\r' in the whole file.  If
> a '\n' comes first, it assumes the file is in Unix format; if a '\r' comes
> first, it assumes the file must be in DOS format.
> 
>   Now, these assumptions are reasonable enough ways of guessing the file
> format if it hasn't been specified by the command name or command line
> switch, and therefore of deducing which kind of translation is required.
>
>   But then it checks to see if the guessed format matches the format you've
> asked it to convert into.  If so, it attempts to 'optimise' the conversion
> by simply not performing it: it closes the file and leaves it untouched.

This is not just a performance enhancement.  It also has the following 
properties:
   (1) the access time on files which are not actually modified does not 
get updated

   (2) it's an attempt to prevent users from permanently scrogging 
binary files.  See: d2u, on a binary file, is an irreversible operation. 
  So, if you do "d2u *" you'll probably kill something deep inside some 
binary file, and you can't fix it -- unless some minimal safeguards are 
in place.

   u2d MAY be reversible -- IF there were no pre-exising \r\n 
combinations in the file to begin with -- so when (OMG-fixit-)d2u is 
run, obviously the first '\n' is preceeded by a (newly-added) '\r\n', so 
the prog merrily replaces ALL '\r\n' with '\n'...which MAY fix your 
oops, but maybe not.


So, with the current code, if you snarf the first "line" -- all chars 
until the first '\n' -- if it's a binary file the odds are pretty low 
that the immediately-preceeding character is a '\r' -- so d2u as 
currently coded will bail out, and no harm is done.

It doesn't work so well in the other direction -- by the same logic 
above, you'll almost never bail out early if you run 'u2d' on a binary 
file -- but if you immediately do a 'd2u' you MIGHT be able to recover.)

> 
>   Unfortunately, there is an extra unstated assumption in between deducing
> the file type from the first EOL in the file and deducing that you don't
> need to perform a conversion, and that assumption is that every other line
> in the file has the same EOL as the first line.  And that assumption is
> bogus, and it means that d2u/u2d and friends are no use on files which have
> mixed EOL types, unless by good chance the very first line has the EOL type
> that you wish to convert away from.
> 
>   My attached patch simply removes the attempted optimisation.  Like I say,
> I think it's an invalid shortcut to assume that every line in a file has the
> same EOL type.  I could imagine a case could be made for keeping the
> 'optimisation' and perhaps providing a command-line switch "-f" or "--force"
> to force full processing of files even if they seem to already be in the
> right mode;  OTOH I'd say that even if you wanted to keep the optimisation
> in some cases, it's a dangerous optimisation that can lead to incorrect
> output, and therefore it should only be switched on when the user
> deliberately adds a command-line option, rather than being on by default and
> disableable.

While I admit that the 'safety' aspect is not foolproof, it is a 
valuable mini-feature.  To make the default behavior un-"safe" kinda 
defeats the purpose.  Further, why do you need an extra commandline 
option at all?  Obvious, the first EOL is either one or the other.

So, if you have a mixed-mode file, and want to go to UNIX EOL, then this 
will always work:

u2d myFile.txt && d2u myFile.txt

I'm not shooting down your patch, but I'd like further discussion on 
list.  I think the "safety" features of the current behavior have been 
overlooked, and I hesistate to remove them without a thorough 
discussion.  'cause as soon as I remove it, some poor slob is going to 
do 'd2u /cygdrive/c/WINNT/nt.dat' and invent new swear words using my 
name...

--
Chuck




--
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
Problem reports:       http://cygwin.com/problems.html
Documentation:         http://cygwin.com/docs.html
FAQ:                   http://cygwin.com/faq/



More information about the Cygwin mailing list