Basics of ASCII, Unicode, and File IO using LabVIEW

Basics:
Mathematics: 
Bit: Stands for Binary Digit; it can represent two possible states, 0 and 1. To denote that a number is in the binary system, let us prefix it with b, as in b0 and b1.
Nibble: A group of 4 bits forms a nibble; it can represent 16 possible states, namely b0000 to b1111 (0x0 to 0xF in hexadecimal).
Octet: A group of 8 bits (or 2 nibbles) forms an octet; it can represent 256 possible states, namely b00000000 to b11111111 (0x00 to 0xFF in hexadecimal).

Memory Terminologies:
Byte: The byte is the smallest addressable unit for a CPU. A byte can be 8 bits, 16 bits, or any other width, depending on the CPU.
Word: Two bytes of memory form a Word.
Dword: Two Words of memory form a Dword (meaning Double word).
Qword: Four Words of memory form a Qword (meaning Quad word).

In many computer architectures, the smallest addressable memory unit is the octet. Only with respect to those systems does byte refer to 8 bits, word to 16 bits, Dword to 32 bits, and Qword to 64 bits (and this has become the de facto standard too). For all further discussions, let us stick to this definition.

Information Storage:
As we know, a computer can store and process numbers only. But we want our computers to store and process any kind of information. So we need translators to translate any kind of information into numbers. Let us look at the various translators available.

ASCII Translation Table: This standard defines the mapping between numbers and English language characters. Using this mapping, one can store all English-based content on computers.
If one has to store the word "Hello", then using the ASCII look-up table we get the number sequence "72 101 108 108 111". These numbers are then stored in adjacent memory locations.

Custom Defined Translation Table: Let us define a translation table to store English alphabets only.
0 to 25 correspond to the small (lower-case) letters "a to z".
26 to 51 correspond to the capital letters "A to Z".
With this translation table, let us see how the word "Hello" will be stored.
We have to store the number sequence "33 4 11 11 14". This number sequence is different from the previous one.
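The two number sequences above can be reproduced with a short Python sketch (Python here is only for illustration; the article's file IO discussion uses LabVIEW). The `custom_code` helper is a hypothetical implementation of the custom table defined above, not a standard function.

```python
# Compare the built-in ASCII mapping with the custom translation
# table defined above (a-z -> 0..25, A-Z -> 26..51).
word = "Hello"

# ASCII mapping: ord() returns the ASCII/Unicode number of a character.
ascii_codes = [ord(c) for c in word]

def custom_code(c):
    """Hypothetical custom table: a-z -> 0..25, A-Z -> 26..51."""
    if c.islower():
        return ord(c) - ord("a")
    return ord(c) - ord("A") + 26

custom_codes = [custom_code(c) for c in word]

print(ascii_codes)   # [72, 101, 108, 108, 111]
print(custom_codes)  # [33, 4, 11, 11, 14]
```

The same word yields two different number sequences, which is exactly the point made in the conclusions below.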

From the above discussion we can conclude the following two things.

1) By using different translation tables, the same information ("Hello") is stored differently in the computer memory.
2) If we don't know the translation table, then the data stored in the computer memory has no meaning, as the computer storage by itself is just meaningless numbers.

Even though the ASCII translation table is popular among computer users, when it comes to storing content in other languages, the ASCII definition can't be used. So many custom-defined translation tables came into the picture for storing content in various Latin-based languages. Then came the Unicode translation table, which defines characters for all the languages around the world with unique numbers. The numbers range from 0x0 to 0x10FFFF. This clear definition of Unicode makes it suitable for multi-language support without any confusion. The mapping of the first 128 numbers in the Unicode translation table is exactly the same as that defined by the ASCII translation table.

So logically, we can say that English content stored using the Unicode TT and the ASCII TT will be one and the same. But physically, if you examine the memory device, you will observe at least 3 bytes of difference in the overall storage size between an ASCII file and a Unicode file (typically the byte-order mark that editors write at the start of a Unicode file). Let us see this in detail.

Storage requirements for ASCII and Unicode characters: 

As per the ASCII definition, the numbers range from 0 to 127; to store numbers in this range, 7 bits are perfectly fine. But on most computers the least addressable memory width is 8 bits, so 1 byte is used to store a single English character. So if one stores the word "Hello" using ASCII, it will take 5 bytes of memory.
ASCII Translation (hexadecimal)
Hello = 48 65 6C 6C 6F

As per the Unicode definition, the numbers range from 0x0 to 0x10FFFF; to store any number in this range, we need 21 bits, i.e. at least 3 bytes per character. But a computer can address memory only as byte, word, Dword, Qword and so on, so it will take a Dword (4 bytes) to store a single character. So if one stores the word "Hello" using Unicode, it will take 20 bytes of memory.
Unicode Translation  
Hello = U+0048 U+0065 U+006C U+006C U+006F (Unicode notation)

In memory, with 4-byte notation:
Hello = 00000048 00000065 0000006C 0000006C 0000006F

If you observe carefully, for any English content stored in Unicode format, the first 3 most significant bytes are always zeros. Also, for most languages, the first 2 bytes will always be zeros. So, to reduce the storage space, the Unicode standard also came up with encoding formats, namely UTF-8, UTF-16LE and UTF-16BE.

Hello (UTF 8) = 48 65 6C 6C 6F 
Hello (UTF 16BE) = 00 48 00 65 00 6C 00 6C 00 6F 
Hello (UTF 16LE) = 48 00 65 00 6C 00 6C 00 6F 00.
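The byte sequences above can be verified with a small Python sketch (Python is used only as an illustration tool here; note that the explicit "-be"/"-le" codec names are used so that no byte-order mark is prepended):

```python
# Encode "Hello" in the three Unicode encoding formats discussed above,
# plus the fixed 4-byte-per-character form for comparison.
word = "Hello"

print(word.encode("utf-8").hex(" ").upper())      # 48 65 6C 6C 6F
print(word.encode("utf-16-be").hex(" ").upper())  # 00 48 00 65 00 6C 00 6C 00 6F
print(word.encode("utf-16-le").hex(" ").upper())  # 48 00 65 00 6C 00 6C 00 6F 00
print(word.encode("utf-32-be").hex(" ").upper())  # 4 bytes per character, 20 bytes total
```

For pure English text, UTF-8 produces exactly the same bytes as ASCII, which is why the two files can be byte-identical apart from any byte-order mark.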

Similarly, storing the Tamil word "வணக்கம்" (Vanakkam) using the Unicode translation will look like this:

வணக்கம் = 00000BB5 00000BA3 00000B95 00000BCD 00000B95 00000BAE 00000BCD

வணக்கம் (UTF 8) = E0 AE B5 E0 AE A3 E0 AE 95 E0 AF 8D E0 AE 95 E0 AE AE E0 AF 8D
வணக்கம் (UTF 16BE) = 0B B5 0B A3 0B 95 0B CD 0B 95 0B AE 0B CD 
வணக்கம் (UTF 16LE) = B5 0B A3 0B 95 0B CD 0B 95 0B AE 0B CD 0B
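The Tamil byte sequences can be checked the same way in Python (again, purely illustrative; the "-be"/"-le" codecs avoid a byte-order mark):

```python
# Print the Unicode code points and encodings of the Tamil word above.
s = "வணக்கம்"

print([f"U+{ord(c):04X}" for c in s])             # U+0BB5 U+0BA3 U+0B95 U+0BCD ...
print(s.encode("utf-8").hex(" ").upper())         # E0 AE B5 E0 AE A3 ...
print(s.encode("utf-16-be").hex(" ").upper())     # 0B B5 0B A3 ...
print(s.encode("utf-16-le").hex(" ").upper())     # B5 0B A3 0B ...
```

Since Tamil code points lie in the range 0x0B80 to 0x0BFF, each character costs 3 bytes in UTF-8 but only 2 bytes in UTF-16, which is why UTF-16 is more compact for this script.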

Similarly, here are the translations for a few other Indian language names in their own scripts.

हिन्दी (Hindi) = U+0939, U+093F, U+0928, U+094D, U+0926, U+0940
தமிழ் (Tamil) = U+0BA4, U+0BAE, U+0BBF, U+0BB4, U+0BCD
ಕನ್ನಡ (Kannada) = U+0C95, U+0CA8, U+0CCD, U+0CA8, U+0CA1
తెలుగు (Telugu) = U+0C24, U+0C46, U+0C32, U+0C41, U+0C17, U+0C41
മലയാളം (Malayalam) = U+0D2E, U+0D32, U+0D2F, U+0D3E, U+0D33, U+0D02


File Formats:

File formats can be broadly classified into two categories, namely text files and binary files.
  • Text files contain human-readable language content (Tamil, English, Spanish, etc.) translated using ASCII or Unicode characters only, so it is easy to open such files in any text editor and understand their content. For example: txt, csv, configuration files, xml files, html files...
  • Binary files are generally readable only with the help of an appropriate application, for example image files, music files, video files, etc. If you know the internal file format of such binary files, you can read them and make sense out of the file. At a high level you can say "all files are binary files if you don't know the translation table used for information exchange".
  • Generally, a binary file format consumes less space to store numerical information than storing the same information in a text file format. Also, random access for read or write is easier with a binary file format than with a text file (with fixed-width data or data delimited by a special character) format.
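The space difference in that last point can be sketched in Python (illustrative only; the values and the comma-delimited text layout are made up for the example):

```python
import struct

# Three example readings stored two ways.
values = [3.141592653589793, 2.718281828459045, 1.4142135623730951]

# Text format: decimal digits, comma-delimited, as in a csv file.
text_bytes = ",".join(repr(v) for v in values).encode("ascii")

# Binary format: fixed 8-byte IEEE-754 doubles, big-endian.
bin_bytes = struct.pack(">3d", *values)

print(len(text_bytes), len(bin_bytes))  # the text form is larger
```

The binary form is also what makes random access easy: the i-th value always starts at byte offset i * 8, whereas in the text form you must scan for delimiters.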

Let's look at the file IO primitives in LabVIEW.

LabVIEW ships with the following binary file IO primitives.

1) Raw Binary file read/write primitives

This works very well for basic data types like numerics, booleans, strings, etc. See the following examples to understand this file format. It is best suited for applications that acquire only one channel of data (let's say the temperature of a device under test) and write one or more points at a time to a file. When it comes to data types that are more than 1 byte in size, you have to use the same byte order for reading and writing; if you change the byte order, your data becomes invalid or meaningless. Big endian and little endian are the two byte orders.

  • <LabVIEW>\examples\file\smplfile.llb\Read Binary File.vi
  • <LabVIEW>\examples\file\smplfile.llb\Write Binary File.vi          
When it comes to storing complex data types like an array of clusters, these binary files may not handle them properly. That is, when you have several channels of data grouped logically as a cluster, several points form an array of clusters. You may then have to consider the next binary file primitives.
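The byte-order pitfall mentioned above is language-independent; since LabVIEW primitives are graphical, here is the same idea sketched in Python with the standard `struct` module (the value is a made-up 16-bit reading):

```python
import struct

value = 1000  # a hypothetical 16-bit reading from a device

big = struct.pack(">H", value)     # big-endian bytes:    03 E8
little = struct.pack("<H", value)  # little-endian bytes: E8 03

# Reading big-endian bytes back with the wrong (little-endian) order:
wrong = struct.unpack("<H", big)[0]
print(wrong)  # 59395, not 1000 -- the data has become meaningless
```

The bytes on disk never changed; only the translation applied on read did, which echoes the earlier conclusion that stored numbers are meaningless without the right translation.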

2) Datalog file read/write primitives

The drawback of raw binary file read/write in managing complex data types for random access can be overcome by using the datalog file primitives; refer to the following examples to understand this file format. It is best suited for applications that acquire more than one channel. Each record in this file format may be a logical group of one point from each channel. As you go on acquiring, you can write them to a single file. Thus the file contains an array of clusters of the several channels' info (let's say temperature and pressure data of the device under test).

  • <LabVIEW>\examples\file\datalog.llb\Write Datalog File Example.vi 
  • <LabVIEW>\examples\file\datalog.llb\Read Datalog File Example.vi 
When it comes to accessing only, say, one channel's information, it becomes tough with datalog; you may have to read all the records and then extract the channel of interest. That is, if you want to read and plot only the pressure data, there is no direct file read for this. In that case you may have to consider the TDMS file IOs.
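Why extracting one channel forces you to read every record can be sketched in Python (a hypothetical record layout, not the actual LabVIEW datalog format: each record is one cluster of (temperature, pressure) stored as two 8-byte doubles):

```python
import struct

# Hypothetical datalog-style file contents: records of (temperature, pressure).
records = [(25.0, 101.3), (25.5, 101.1), (26.0, 100.9)]
blob = b"".join(struct.pack(">2d", t, p) for t, p in records)

# To plot only pressure, every whole record must still be unpacked,
# because the pressure values are interleaved with the temperatures:
record_size = struct.calcsize(">2d")
pressures = [
    struct.unpack_from(">2d", blob, i * record_size)[1]
    for i in range(len(blob) // record_size)
]
print(pressures)  # [101.3, 101.1, 100.9]
```

A channel-oriented layout, which is what TDMS provides, stores each channel's points contiguously, so one channel can be read without touching the others.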

3) TDMS file read/write primitives 

If you want faster file IO with the ability to read any specific channel randomly, the TDMS file format is the best. This file IO is generally applicable if you have information in the following logical order: a single file has several groups, with properties for the file and for each group; each group contains several channels, with properties for each channel; and each channel contains several points. Properties and their values are user-defined information about the file, group, or channel.


To know more about this file format read the following white paper.
http://www.ni.com/white-paper/5696/en



