From: Mihai N. on 15 May 2010 06:21
> Do you know anywhere where I can get a table that maps all
> of the code points to their category?

ftp://ftp.unicode.org/Public/5.2.0/ucd
UnicodeData.txt

The main guide for that is
ftp://ftp.unicode.org/Public/5.1.0/ucd/UCD.html
(if you don't want to go thru the standard, which is the advisable thing)

And when you bump your head, remember that Joe and I warned you about UTF-8.
It was not designed for this kind of usage.

--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
From: Peter Olcott on 15 May 2010 10:12
"Mihai N." <nmihai_year_2000(a)yahoo.com> wrote in message
news:Xns9D7922352F422MihaiN(a)207.46.248.16...
>
>> Do you know anywhere where I can get a table that maps
>> all of the code points to their category?
>
> ftp://ftp.unicode.org/Public/5.2.0/ucd
>
> UnicodeData.txt
> The main guide for that is
> ftp://ftp.unicode.org/Public/5.1.0/ucd/UCD.html
> (if you don't want to go thru the standard, which is the
> advisable thing)
>
> And when you bump your head, remember that joe and I warned
> you about utf-8.
> It was not designed for this kind of usage.

Joe also said that UTF-8 was designed for data interchange, which is how I will be using it. Joe also falsely assumed that I would be using UTF-8 for my internal representation. I will be using UTF-32 for my internal representation.

I will be using UTF-8 as the source encoding for my language interpreter, which has the advantage of simply being ASCII for the English language, and of working across every platform without requiring adaptations for little-endian and big-endian byte order. UTF-8 will also be the output of my OCR4Screen DFA recognizer.

> --
> Mihai Nita [Microsoft MVP, Visual C++]
> http://www.mihai-nita.net
> ------------------------------------------
> Replace _year_ with _ to get the real email
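The split Peter describes (UTF-8 at the input and output edges, UTF-32 code points internally) can be sketched roughly as follows. This is only an illustration in Python, not code from the thread; the helper names `utf8_to_code_points` and `code_points_to_utf8` are made up for the example, and a list of integers stands in for a UTF-32 buffer.

```python
def utf8_to_code_points(data):
    """Decode UTF-8 bytes into a list of code points (one int per character).

    The list plays the role of a UTF-32 buffer: fixed width, no surrogates.
    """
    return [ord(ch) for ch in data.decode("utf-8")]


def code_points_to_utf8(code_points):
    """Encode a sequence of code points back into UTF-8 bytes."""
    return "".join(chr(cp) for cp in code_points).encode("utf-8")


if __name__ == "__main__":
    source = "A\u00e9\u20ac".encode("utf-8")   # 'A', e-acute, euro sign
    internal = utf8_to_code_points(source)
    print([hex(cp) for cp in internal])
    print(code_points_to_utf8(internal) == source)

    # Pure-ASCII source text is byte-for-byte identical in UTF-8.
    print("if x".encode("utf-8") == b"if x")

    # UTF-32 on disk depends on byte order; UTF-8 has no such variants.
    print("A".encode("utf-32-le"), "A".encode("utf-32-be"))
```

All parsing logic would then work on the integer list, and only these two functions would know that the files are UTF-8.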
From: Peter Olcott on 15 May 2010 11:08
"Mihai N." <nmihai_year_2000(a)yahoo.com> wrote in message
news:Xns9D7922352F422MihaiN(a)207.46.248.16...
>
>> Do you know anywhere where I can get a table that maps
>> all of the code points to their category?
>
> ftp://ftp.unicode.org/Public/5.2.0/ucd
>

What I am looking for is a mapping from Unicode code points (compressed into code point ranges when possible) to General Category values as two-character abbreviations. I will look through this first link to see if I can find this. Initially I saw a lot of things that were not this.

> UnicodeData.txt
> The main guide for that is
> ftp://ftp.unicode.org/Public/5.1.0/ucd/UCD.html
> (if you don't want to go thru the standard, which is the
> advisable thing)
>
> And when you bump your head, remember that joe and I warned
> you about utf-8.
> It was not designed for this kind of usage.
>
> --
> Mihai Nita [Microsoft MVP, Visual C++]
> http://www.mihai-nita.net
> ------------------------------------------
> Replace _year_ with _ to get the real email
From: Peter Olcott on 15 May 2010 11:48
"Mihai N." <nmihai_year_2000(a)yahoo.com> wrote in message
news:Xns9D7922352F422MihaiN(a)207.46.248.16...
>
>> Do you know anywhere where I can get a table that maps
>> all of the code points to their category?
>
> ftp://ftp.unicode.org/Public/5.2.0/ucd

I found the table that I was looking for here:
ftp://ftp.unicode.org/Public/5.2.0/ucd/UnicodeData.txt
Thanks for all your help.

>
> UnicodeData.txt
> The main guide for that is
> ftp://ftp.unicode.org/Public/5.1.0/ucd/UCD.html
> (if you don't want to go thru the standard, which is the
> advisable thing)
>
> And when you bump your head, remember that joe and I warned
> you about utf-8.
> It was not designed for this kind of usage.
>
> --
> Mihai Nita [Microsoft MVP, Visual C++]
> http://www.mihai-nita.net
> ------------------------------------------
> Replace _year_ with _ to get the real email
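For readers after the same thing, UnicodeData.txt can be collapsed into the "code point range to two-letter General Category" form asked for above. The sketch below is an illustration, not anything posted in the thread. It relies on the file's layout: semicolon-separated fields, with the hexadecimal code point in field 0, the name in field 1, and the General Category in field 2, and with large blocks written as a pair of records whose names end in ", First>" and ", Last>". The function name `category_ranges` and the `SAMPLE` records are assumptions made for the example; code points missing from the file are simply absent from the result.

```python
def category_ranges(lines):
    """Collapse UnicodeData.txt records into (first, last, category) tuples.

    Adjacent code points with the same General Category are merged into one
    range. "<..., First>" / "<..., Last>" record pairs become a single range.
    """
    ranges = []
    range_start = None  # set while between a First record and its Last record
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        cp = int(fields[0], 16)
        name, category = fields[1], fields[2]
        if name.endswith(", First>"):
            range_start = cp
            continue
        if name.endswith(", Last>") and range_start is not None:
            first, range_start = range_start, None
        else:
            first = cp
        # Extend the previous range if it is contiguous and has the same category.
        if ranges and ranges[-1][2] == category and ranges[-1][1] + 1 == first:
            ranges[-1] = (ranges[-1][0], cp, category)
        else:
            ranges.append((first, cp, category))
    return ranges


# A few records in UnicodeData.txt format, used here only as sample input.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FCB;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
"""

if __name__ == "__main__":
    for first, last, category in category_ranges(SAMPLE.splitlines()):
        print(f"{first:04X}..{last:04X} {category}")
```

To run it against the real table, pass an open file object for a downloaded copy of UnicodeData.txt instead of `SAMPLE.splitlines()`.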
From: Joseph M. Newcomer on 17 May 2010 00:03
How about: a non-answer is a substitute for "this is the most incredibly stupid idea I have seen in decades, and I'm not going to waste my time pointing out the obvious silliness of it"?

You are again spending massive effort to solve an artificial problem of your own creation, caused by making poor initial design choices, and supported by nonsensical rationalizations. A professional programmer knows certain patterns (that is our strength!) and among these is the recognition that if you have to implement complex solutions to simple problems, you have made a bad design choice and are best served by re-examining the design choices and making design choices that eliminate the need for complex solutions, particularly when the complexity simply goes away if a different set of solutions is postulated.

Personally, if I had to do a complex parser design, I'd want to eliminate the need to deal with UTF-16 surrogates, and I'd write my code in terms of UTF-32. Much simpler, and it isolates the complexity at the input and output edges, not making it uniformly distributed throughout the code. And I'd know not to make childish decisions such as "it costs too much to do the conversion" because I outgrew those kinds of arguments certainly by 1980 (that's thirty years ago).

My first instance of this was a typesetting program I did around 1970 where I stored the text as 9-bit rather than 7-bit bytes because I could encode font information more readily in the upper two bits. And I didn't even CONSIDER the size and performance issues of 9-bit vs. 7-bit bytes because I knew they didn't matter in the slightest. So I guess I learned this lesson 40 years ago. It greatly simplified the internal coding.

But you are sounding like a first-semester programmer who was taught by some old PDP-11 programmer, and I don't buy either the size or the conversion performance arguments. You don't even have NUMBERS to argue your position!
Optimization decisions that are argued without quantitative supporting measurements are almost always wrong. But we've had this discussion before, and your view is "My mind is made up, don't require me to get FACTS to support my decision!" In the Real World, before we can justify wasting lots of programmer time to implement bad decisions, we require justification. But maybe that's just my project management experience talking. Horrible, this dependence on reality that I have.

If someone came to me with such a design, and was as insistent as you will be, my first requirement would be "Write a program that reads UTF-8 files of the expected size, then writes them back out. Measure its performance reading several dozen different files, and run each experiment 100 times, measuring the time-to-completion". Then "modify the program to convert the data to UTF-16, convert it back to UTF-8, and run the same experiment set. Demonstrate that the change in the mean time is statistically significant".

Hell, the variation of LOADING the PROGRAM is going to differ from experiment to experiment by a variance several orders of magnitude greater than the conversion cost! So don't try to make the case that the conversion cost matters; the truth, based on actual performance measurements end-to-end, is that it does not. But, not having actually done performance measurement, you don't understand that. Those of us who devoted nontrivial parts of our lives to optimizing program performance KNOW what the problems are, and know that the conversion cannot possibly matter.
joe
*****
On Fri, 14 May 2010 11:53:30 -0500, "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote:
>
>"Pete Delgado" <Peter.Delgado(a)NoSpam.com> wrote in message
>news:uU4O0P48KHA.1892(a)TK2MSFTNGP05.phx.gbl...
>>
>> "Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in
>> message news:jfvou5ll41i9818ub01a4mgbfvetg4giu1(a)4ax.com...
>>> Actually, what it does is give us another opportunity to
>>> point out how really bad this
>>> design choice is, and thus Peter can tell us all we are
>>> fools for not answering a question
>>> that should never have been asked, not because it is
>>> inappropriate for the group, but
>>> because it represents the worst-possible-design decision
>>> that could be made.
>>> joe
>>
>> Come on Joe, give Mr. Olcott some credit. I'm sure that he
>> could dream up an even worse design as he did with his OCR
>> project once he is given (and ignores) input from the
>> professionals whose input he claims to seek. ;)
>>
>>
>> -Pete
>>
>>
>
>Most often I am not looking for "input from professionals",
>I am looking for answers to specific questions.
>
>I now realize that every non-answer response tends to be a
>mask for the true answer of "I don't know".
>
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
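The experiment Joe proposes (copy UTF-8 files as-is, then copy them with a UTF-8 to UTF-16 to UTF-8 round trip, and compare the timings) could be set up along these lines. This is a rough Python sketch, not code from the thread: the function names, the generated sample file, and the use of Welch's t statistic as the significance check are all assumptions made for illustration. It times each run inside one process, so it does not capture the program-load variation Joe mentions; timing whole process launches would be needed for that.

```python
import os
import statistics
import tempfile
import time


def copy_plain(src, dst):
    """Baseline: read a UTF-8 file and write the same bytes back out."""
    with open(src, "rb") as f:
        data = f.read()
    with open(dst, "wb") as f:
        f.write(data)


def copy_via_utf16(src, dst):
    """Variant: convert UTF-8 -> UTF-16 -> UTF-8 between the read and the write."""
    with open(src, "rb") as f:
        data = f.read()
    utf16 = data.decode("utf-8").encode("utf-16-le")
    data = utf16.decode("utf-16-le").encode("utf-8")
    with open(dst, "wb") as f:
        f.write(data)


def time_runs(func, src, dst, runs=100):
    """Return the wall-clock duration of each of `runs` read-and-write passes."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        func(src, dst)
        times.append(time.perf_counter() - start)
    return times


def welch_t(a, b):
    """Welch's t statistic for the difference between two sample means."""
    var = statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    return (statistics.mean(b) - statistics.mean(a)) / var ** 0.5 if var else 0.0


def main():
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "in.txt")
        dst = os.path.join(tmp, "out.txt")
        # Hypothetical test file; real runs would use files of the expected size.
        with open(src, "wb") as f:
            f.write("caf\u00e9 \u20ac 100\n".encode("utf-8") * 10000)
        plain = time_runs(copy_plain, src, dst)
        converted = time_runs(copy_via_utf16, src, dst)
    print("plain     mean %.6f s, stdev %.6f s"
          % (statistics.mean(plain), statistics.stdev(plain)))
    print("via UTF-16 mean %.6f s, stdev %.6f s"
          % (statistics.mean(converted), statistics.stdev(converted)))
    print("Welch t = %.2f" % welch_t(plain, converted))


if __name__ == "__main__":
    main()
```

The numbers it prints depend entirely on the machine and the files used, so no expected output is shown; the point is only that the claim can be measured rather than argued.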