XenWord Functions
This page has been updated to be consistent with XenWord Version 0.20, Jan 5, 2003.
XenWord provides the following functions:
Function | Description |
wordBreaks(CHARACTER, OUTPUT MEMPTR, SHORT, MEMPTR(256)) | This function parses text into words. The first argument (CHARACTER) is the text to be parsed. The second argument (MEMPTR) receives the results. The third argument (SHORT) is the length in bytes of the second argument; wordBreaks will not write more bytes than that to the output argument. The fourth argument is the word break table to be used for text parsing. |
initTable(INPUT-OUTPUT MEMPTR(256)) | This function takes a 256-byte MEMPTR and populates it with a default set of values for Windows code page 1252 word breaking. |
breakEncoding(INPUT-OUTPUT CHARACTER, MEMPTR(256)) | This function is primarily used for debugging. Given a CHARACTER string, each character in the string is replaced with its word break attribute value, that is, a digit 1-8 taken from the word rules table. The Word Break Attribute Assignments table below lists the possible attributes, with the numeric value in parentheses next to each attribute name. |
XenWordID(OUTPUT MEMPTR, SHORT) | This function returns an identifier string in the MEMPTR. The SHORT specifies the maximum number of bytes to write to the MEMPTR. The current version of the XenWord DLL writes: "XenWord Version 0.20. Copyright © 2002-2003 XenCraft. All rights reserved.". |
Using the XenWord functions
To parse text into words, you need to have a set of word rules
that define which characters are used in words and which
characters are not used in words. Some characters
participate in words only under certain conditions.
The rules are encoded as values in a 256-byte table, one byte for each possible character in a single-byte code page.
This table of values (word rules table) needs to be made available for use by the functions.
There are two ways to initialize such a table. You can populate a
MEMPTR of size 256 bytes and set the
values individually or you can load the values from one of the
Progress Word Rules files. Of course, if you are looking to
evaluate or emulate how Progress word indexes will parse text,
then you should use the relevant word rules file.
To load the MEMPTR, populate it with the last 256 bytes of the Word Rules file.
For example:
DEFINE VARIABLE bvals AS MEMPTR.
DEFINE VARIABLE wordTable AS MEMPTR.
DEFINE VARIABLE fsize AS INTEGER. /* size of the rules file in bytes */
FILE-INFO:FILE-NAME = "c:\src-vs\word\proword.2". /* rules file to read */
SET-SIZE(bvals) = FILE-INFO:FILE-SIZE.
INPUT FROM VALUE(FILE-INFO:FILE-NAME) BINARY NO-CONVERT. /* read raw bytes, no code page conversion */
IMPORT bvals.
INPUT CLOSE.
fsize = GET-SIZE(bvals). /* size of file */
SET-SIZE(wordTable) = 256. /* extract last 256 bytes */
wordTable = GET-BYTES(bvals, fsize - 255, 256).
SET-SIZE(bvals) = 0. /* release the file buffer */
If you just want to test the functionality of wordBreaks, you can instead use the initTable function to create a set of rules that
use all the Progress defaults for ASCII characters and use reasonable values for Windows code page 1252 for the characters greater than 127.
DEFINE VARIABLE wordTable AS MEMPTR.
SET-SIZE(wordTable) = 256.
run initTable(INPUT-OUTPUT wordTable).
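The other option mentioned above, setting the values individually, could look something like the following sketch. It is not taken from the XenWord distribution. It assumes that the byte for a character sits at position code point + 1 (MEMPTR positions are 1-based), and it fills in only the ASCII defaults listed in the Word Break Attribute Assignments table, leaving every character above 127 as a terminator. The loop variable i is illustrative.

```progress
DEFINE VARIABLE wordTable AS MEMPTR  NO-UNDO.
DEFINE VARIABLE i         AS INTEGER NO-UNDO.

SET-SIZE(wordTable) = 256.

/* Start with every character marked TERMINATOR (1). */
DO i = 1 TO 256:
    PUT-BYTE(wordTable, i) = 1.
END.

/* ASSUMPTION: the byte for code point n is stored at position n + 1. */
DO i = ASC("A") TO ASC("Z"):              /* LETTER (2), uppercase */
    PUT-BYTE(wordTable, i + 1) = 2.
END.
DO i = ASC("a") TO ASC("z"):              /* LETTER (2), lowercase */
    PUT-BYTE(wordTable, i + 1) = 2.
END.
DO i = ASC("0") TO ASC("9"):              /* DIGIT (3) */
    PUT-BYTE(wordTable, i + 1) = 3.
END.

/* USE_IT (4): $ % # @ _ */
PUT-BYTE(wordTable, ASC("$") + 1) = 4.
PUT-BYTE(wordTable, ASC("%") + 1) = 4.
PUT-BYTE(wordTable, ASC("#") + 1) = 4.
PUT-BYTE(wordTable, ASC("@") + 1) = 4.
PUT-BYTE(wordTable, ASC("_") + 1) = 4.

/* BEFORE_DIGIT (6): . , - */
PUT-BYTE(wordTable, ASC(".") + 1) = 6.
PUT-BYTE(wordTable, ASC(",") + 1) = 6.
PUT-BYTE(wordTable, ASC("-") + 1) = 6.

/* IGNORE (8): apostrophe */
PUT-BYTE(wordTable, ASC("'") + 1) = 8.
```

A table built this way can be checked with the breakEncoding function described further down this page before it is passed to wordBreaks.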
Using the wordBreaks function is easy. Put the text you want parsed into a CHARACTER variable.
Have a MEMPTR variable available for the output. It should be about the same size
as the input string, plus one byte for the terminating null. The byte length
of the output variable and the word rules MEMPTR must also be supplied.
For example:
DEFINE VARIABLE myText AS CHARACTER.
DEFINE VARIABLE parsedText AS MEMPTR.
myText = "Here is the text to parse.".
SET-SIZE(parsedText) = LENGTH(myText, "RAW") + 1. /* include null */
RUN wordBreaks(myText, OUTPUT parsedText, LENGTH(myText, "RAW"), wordTable).
DISPLAY GET-STRING(parsedText, 1). /* Convert to CHARACTER, see results! */
SET-SIZE(parsedText) = 0. /* release memory */
That's pretty much it. If you are not sure whether the word rules table is correct, you can use the breakEncoding function to check
the attribute of any character (or list of characters):
DEFINE VARIABLE myText AS CHARACTER. /* Can be either CHAR or MEMPTR */
myText = "123abcABC$%^&*". /* find out the attributes of these characters */
RUN breakEncoding(INPUT-OUTPUT myText, wordTable).
DISPLAY myText. /* shows "33322222244111" */
Attribute 3 indicates a digit and 2 a letter. 4 indicates that the
character is always considered part of a word, and 1 that the
character always terminates a word. The function replaces the
input string with an equal number of digits,
one for each character's attribute.
Of course, calling a function from a DLL requires a little (painless) setup in Progress:
PROCEDURE breakEncoding EXTERNAL "c:\...\xenword.dll" PERSISTENT:
DEFINE INPUT-OUTPUT PARAMETER hIn AS CHAR.
DEFINE INPUT PARAMETER mTABLE AS MEMPTR.
END.
PROCEDURE wordBreaks EXTERNAL "c:\...\xenword.dll" PERSISTENT:
DEFINE INPUT PARAMETER cIn AS CHAR.
DEFINE OUTPUT PARAMETER cOut AS MEMPTR.
DEFINE INPUT PARAMETER iLen AS SHORT.
DEFINE INPUT PARAMETER mTABLE AS MEMPTR.
END.
PROCEDURE initTable EXTERNAL "c:\...\xenword.dll" PERSISTENT:
DEFINE INPUT-OUTPUT PARAMETER mTABLE AS MEMPTR.
END.
PROCEDURE XenWordID EXTERNAL "c:\...\xenword.dll" PERSISTENT:
DEFINE OUTPUT PARAMETER mVersionInfo AS MEMPTR.
DEFINE INPUT PARAMETER iLen AS SHORT.
END.
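XenWordID is the only function without a usage example on this page. A call might look like the sketch below, which relies on the XenWordID declaration just shown. The 128-byte buffer size and the variable name versionInfo are assumptions, and the last byte is zeroed up front because this page does not say whether the DLL null-terminates the string.

```progress
DEFINE VARIABLE versionInfo AS MEMPTR NO-UNDO.

SET-SIZE(versionInfo) = 128.          /* assumed large enough for the identifier */
PUT-BYTE(versionInfo, 128) = 0.       /* guarantee a terminating null */

/* Let the DLL write at most 127 bytes, keeping the last byte as the null. */
RUN XenWordID(OUTPUT versionInfo, GET-SIZE(versionInfo) - 1).

DISPLAY GET-STRING(versionInfo, 1) FORMAT "x(100)".
SET-SIZE(versionInfo) = 0.            /* release memory */
```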
Word Break Attribute Assignments
Progress Word Indexes use a set of character attributes as the basis for text being parsed into word units.
These characteristics are called Word Break Attributes. The attributes are defined in the
following table, which is taken from the Progress Internationalization Guide.
The third column lists the attributes assigned to each of the characters in the ASCII character
set, i.e. those characters with code points less than 128. The ASCII characters are the first 128 characters
of all the non-EBCDIC code pages that Progress supports, so the Word Break Attribute assignments for the first 128 characters
are the same for all of those code pages.
Upgrading to a TYPE 3 word break table is strongly recommended. There is no downside, and using TYPE 3 fixes some problems.
Word Delimiter Attributes
Word Delimiter Attribute | Word Index behavior when a character with this attribute is encountered | Default Assignments |
LETTER (2) | Always part of a word. | Assigned to all characters that the current code page attribute table defines as alphabetic, i.e. the ISALPHA attribute is "1". These are the uppercase characters A-Z and the lowercase characters a-z. |
DIGIT (3) | Always part of a word. | Assigned to the characters 0-9. |
USE_IT (4) | Always part of a word. | Assigned to the following characters: Dollar sign ($), Percent sign (%), Number sign (#), At symbol (@), Underline (_) |
BEFORE_LETTER (5) | Part of a word only if followed by a character with the LETTER attribute. Otherwise, treated as a word delimiter. | No characters have this attribute by default. |
BEFORE_DIGIT (6) | Treated as part of a word only if followed by a character with the DIGIT attribute. | Assigned to the following characters: Period (.), Comma (,), Hyphen (-). For example, "12.34" is one word, but "ab.cd" is two words. |
BEFORE_LET_DIG (7) | Treated as part of a word only if followed by a character with the LETTER or DIGIT attribute. | No characters have this attribute by default. |
IGNORE (8) | The character is removed from the word for indexing. | Assigned to the apostrophe ('). For example, "John's" is equivalent to "Johns". |
TERMINATOR (1) | Word delimiter, i.e. the character is not part of any word and can indicate that the end of a word has been reached or that a new one is about to begin. | Assigned to all other characters. |
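To make the rules in the table concrete, here is a rough pure-Progress sketch that applies the default ASCII assignments to a string. It is only an illustration of the table, not the algorithm used by Progress or by the XenWord DLL. The function names attrOf and splitWords are hypothetical, the "|" separator in the result is an arbitrary choice, and "followed by" is read as "the very next character".

```progress
/* Default ASCII attribute of one character, per the table above. */
FUNCTION attrOf RETURNS INTEGER (INPUT ch AS CHARACTER):
    DEFINE VARIABLE c AS INTEGER NO-UNDO.
    c = ASC(ch).
    IF (c >= 65 AND c <= 90) OR (c >= 97 AND c <= 122) THEN RETURN 2. /* LETTER */
    IF c >= 48 AND c <= 57 THEN RETURN 3.                              /* DIGIT */
    IF INDEX("$%#@_", ch) > 0 THEN RETURN 4.                           /* USE_IT */
    IF INDEX(".,-", ch) > 0 THEN RETURN 6.                             /* BEFORE_DIGIT */
    IF c = 39 THEN RETURN 8.                                           /* IGNORE (apostrophe) */
    RETURN 1.                                                          /* TERMINATOR */
END FUNCTION.

/* Split txt into words and return them separated by "|". */
FUNCTION splitWords RETURNS CHARACTER (INPUT txt AS CHARACTER):
    DEFINE VARIABLE i      AS INTEGER   NO-UNDO.
    DEFINE VARIABLE iAttr  AS INTEGER   NO-UNDO.
    DEFINE VARIABLE iNext  AS INTEGER   NO-UNDO.
    DEFINE VARIABLE cWord  AS CHARACTER NO-UNDO.
    DEFINE VARIABLE cWords AS CHARACTER NO-UNDO.

    DO i = 1 TO LENGTH(txt):
        iAttr = attrOf(SUBSTRING(txt, i, 1)).
        IF iAttr = 8 THEN NEXT.           /* IGNORE: drop it, the word continues */

        /* Attribute of the next character; end of text acts as a terminator. */
        iNext = IF i < LENGTH(txt) THEN attrOf(SUBSTRING(txt, i + 1, 1)) ELSE 1.

        IF iAttr = 2 OR iAttr = 3 OR iAttr = 4
           OR (iAttr = 5 AND iNext = 2)
           OR (iAttr = 6 AND iNext = 3)
           OR (iAttr = 7 AND (iNext = 2 OR iNext = 3))
        THEN cWord = cWord + SUBSTRING(txt, i, 1).
        ELSE IF cWord <> "" THEN
            ASSIGN cWords = cWords + (IF cWords = "" THEN "" ELSE "|") + cWord
                   cWord  = "".
    END.

    /* Flush the last word, if the text did not end with a delimiter. */
    IF cWord <> "" THEN
        cWords = cWords + (IF cWords = "" THEN "" ELSE "|") + cWord.
    RETURN cWords.
END FUNCTION.

DISPLAY splitWords("12.34 ab.cd John's") FORMAT "x(40)".
/* expected: 12.34|ab|cd|Johns */
```

The sample input covers the two examples from the table: the period stays inside "12.34" but splits "ab.cd", and the apostrophe is dropped from "John's".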