XenWORD

XenCraft's word parser for Progress Software users

The XenCraft Word Parser for Progress (XenWORD) is a shared library that parses text into words using the Progress word rules table, the same rules used for database word indexes. This gives developers and users the ability to review how the text is being parsed, to confirm the correct rules are being used or to experiment with new rules. The library can also be used more generally for text parsing and manipulation.


XenWORD Functions

This page has been updated to be consistent with XenWord Version 0.20, Jan 5, 2003.

XenWord provides the following functions:
Function / Description

wordBreaks(CHARACTER, OUTPUT MEMPTR, SHORT, MEMPTR(256))
This function parses text into words. The first argument (CHARACTER) is the text to be parsed. The second argument (MEMPTR) receives the results. The third argument is the length in bytes of the second argument; wordBreaks will not write more bytes than that to the output argument. The fourth argument is the word break table to be used for text parsing.

initTable(INPUT-OUTPUT MEMPTR(256))
This function takes a 256-byte MEMPTR and populates it with a default set of values for Windows code page 1252 word breaking.

breakEncoding(INPUT-OUTPUT CHARACTER, MEMPTR(256))
This function is primarily used for debugging. Given a CHARACTER string, it replaces each character in the string with its word break attribute value, that is, with a digit 1-8 taken from the word rules table. The Word Break Attribute Assignments table below lists the possible attributes, with the numeric value in parentheses next to each attribute name.

XenWordID(OUTPUT MEMPTR, SHORT)
This function returns an identifier string in the MEMPTR. The SHORT specifies the maximum number of bytes to write to the MEMPTR. The current version of the XenWord DLL writes:
"XenWord Version 0.20. Copyright © 2002-2003 XenCraft. All rights reserved.".

Using the XenWord functions

To parse text into words, you need a set of word rules that define which characters are used in words and which are not. Some characters participate in words only under certain conditions. The rules are encoded as values in a 256-byte table, one byte for each possible character in a single-byte code page. This table of values (the word rules table) must be made available to the functions.

There are two ways to initialize such a table. You can populate a MEMPTR of size 256 bytes and set the values individually or you can load the values from one of the Progress Word Rules files. Of course, if you are looking to evaluate or emulate how Progress word indexes will parse text, then you should use the relevant word rules file.


To load the MEMPTR, populate it with the last 256 bytes of the Word Rules file. For example:
DEFINE VARIABLE bvals AS MEMPTR.
DEFINE VARIABLE wordTable AS MEMPTR.
DEFINE VARIABLE fsize AS INTEGER. /* size of the rules file in bytes */
FILE-INFO:FILE-NAME = "c:\src-vs\word\proword.2". /* rules file to read */
SET-SIZE(bvals) = FILE-INFO:FILE-SIZE.
INPUT FROM VALUE(FILE-INFO:FILE-NAME) BINARY NO-CONVERT. /* read raw bytes */
IMPORT bvals.
INPUT CLOSE.
fsize = GET-SIZE(bvals). /* size of file */
SET-SIZE(wordTable) = 256. /* extract last 256 bytes */
wordTable = GET-BYTES(bvals, fsize - 255, 256).
SET-SIZE(bvals) = 0. /* release the file buffer */

If you just want to test the functionality of wordBreaks, you can instead use the initTable function to create a set of rules that use all the Progress defaults for ASCII characters and use reasonable values for Windows code page 1252 for the characters greater than 127.
DEFINE VARIABLE wordTable AS MEMPTR.
SET-SIZE(wordTable) = 256.
run initTable(INPUT-OUTPUT wordTable).
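
Setting values individually, the first option mentioned above, might look roughly like the following sketch. It rests on two assumptions that this page does not confirm: that the table is indexed by code point, so the byte for code point N sits at MEMPTR position N + 1, and that each byte holds one of the numeric attribute values 1-8 listed under Word Break Attribute Assignments. The sketch starts from the initTable defaults and then marks the hyphen as USE_IT (4).

```abl
/* Requires the initTable PROCEDURE ... EXTERNAL declaration shown later on this page. */
DEFINE VARIABLE wordTable AS MEMPTR NO-UNDO.

SET-SIZE(wordTable) = 256.
RUN initTable(INPUT-OUTPUT wordTable).    /* start from the defaults */

/* ASSUMPTION: the byte for code point N is stored at position N + 1
   (MEMPTR positions are 1-based) and holds an attribute value 1-8. */
PUT-BYTE(wordTable, ASC("-") + 1) = 4.    /* 4 = USE_IT */

/* Read the byte back to confirm the change. */
DISPLAY GET-BYTE(wordTable, ASC("-") + 1) LABEL "Hyphen attribute".
```

If the layout assumption holds, running breakEncoding on a string containing a hyphen should then report 4 for that character.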


Using the wordBreaks function is easy. Put the text you want parsed into a CHARACTER variable, and have a MEMPTR variable available for the output; it should be about the same size as the input string. You must also supply the length in bytes of the output variable and the word rules MEMPTR. For example:

DEFINE VARIABLE myText AS CHARACTER.
DEFINE VARIABLE parsedText AS MEMPTR.
myText = "Here is the text to parse.".
SET-SIZE(parsedText) = LENGTH(myText, "RAW") + 1. /* input length plus null */
RUN wordBreaks(myText, OUTPUT parsedText, LENGTH(myText, "RAW"), wordTable).
DISPLAY GET-STRING(parsedText, 1). /* Convert to CHARACTER, see results! */
SET-SIZE(parsedText) = 0. /* release memory */


That's pretty much it. If you are not sure whether the word rules table is correct, you can use the breakEncoding function to verify the attribute of any character (or list of characters):
DEFINE VARIABLE myText AS CHARACTER. /* Can be either CHAR or MEMPTR */
myText = "123abcABC$%^&*". /* find out the attributes of these characters */
RUN breakEncoding(INPUT-OUTPUT myText, wordTable).
DISPLAY myText. /* shows "33322222244111" */

The attribute 3 indicates a digit. 2 is a letter. 4 indicates the character is always considered part of a word. 1 indicates the character always terminates a word. The function replaces the string that was input with an equivalent number of digits indicating each character's attribute.
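
When reading longer results, it can help to translate the digits back into attribute names. The sketch below does this in plain Progress code without calling the DLL: it hard-codes the input and the result string from the example above, and the variable names (cOriginal, cEncoded, cAttrNames) are illustrative only. The names come from the Word Break Attribute Assignments table, ordered by numeric value 1-8.

```abl
/* Attribute names in numeric order 1..8, taken from the attribute table. */
DEFINE VARIABLE cAttrNames AS CHARACTER NO-UNDO
    INITIAL "TERMINATOR,LETTER,DIGIT,USE_IT,BEFORE_LETTER,BEFORE_DIGIT,BEFORE_LET_DIG,IGNORE".
DEFINE VARIABLE cOriginal AS CHARACTER NO-UNDO INITIAL "123abcABC$%^&*".
DEFINE VARIABLE cEncoded  AS CHARACTER NO-UNDO INITIAL "33322222244111".
DEFINE VARIABLE i         AS INTEGER   NO-UNDO.

/* REPEAT gives the DISPLAY a down frame, so each character gets its own row. */
REPEAT i = 1 TO LENGTH(cEncoded):
    DISPLAY
        SUBSTRING(cOriginal, i, 1) FORMAT "x(1)" LABEL "Char"
        ENTRY(INTEGER(SUBSTRING(cEncoded, i, 1)), cAttrNames)
            FORMAT "x(14)" LABEL "Attribute".
END.
```

Because breakEncoding overwrites its input, keep a copy of the original text before the call if you want to pair characters with attributes like this.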


Of course, calling a function from a DLL requires a little (painless) setup in Progress:
PROCEDURE breakEncoding EXTERNAL "c:\...\xenword.dll" PERSISTENT:
    DEFINE INPUT-OUTPUT PARAMETER hIn AS CHAR.
    DEFINE INPUT PARAMETER mTABLE AS MEMPTR.
END PROCEDURE.

PROCEDURE wordBreaks EXTERNAL "c:\...\xenword.dll" PERSISTENT:
    DEFINE INPUT PARAMETER cIn AS CHAR.
    DEFINE OUTPUT PARAMETER cOut AS MEMPTR.
    DEFINE INPUT PARAMETER iLen AS SHORT.
    DEFINE INPUT PARAMETER mTABLE AS MEMPTR.
END PROCEDURE.

PROCEDURE initTable EXTERNAL "c:\...\xenword.dll" PERSISTENT:
    DEFINE INPUT-OUTPUT PARAMETER mTABLE AS MEMPTR.
END PROCEDURE.

PROCEDURE XenWordID EXTERNAL "c:\...\xenword.dll" PERSISTENT:
    DEFINE OUTPUT PARAMETER mVersionInfo AS MEMPTR.
    DEFINE INPUT PARAMETER iLen AS SHORT.
END PROCEDURE.
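
With the declarations in place, a quick way to check that the DLL loads is to ask it for its identifier string. This is only a sketch: the 128-byte buffer size and the variable name idInfo are arbitrary choices, not values taken from the XenWord documentation, and the last byte is kept as a null so that GET-STRING has a terminator even if the DLL does not write one.

```abl
/* Requires the XenWordID PROCEDURE ... EXTERNAL declaration above. */
DEFINE VARIABLE idInfo AS MEMPTR NO-UNDO.

SET-SIZE(idInfo) = 128.              /* hypothetical size, larger than the ID string */
PUT-BYTE(idInfo, 128) = 0.           /* guarantee a terminating null in the last byte */
RUN XenWordID(OUTPUT idInfo, 127).   /* let the DLL write at most 127 bytes */
DISPLAY GET-STRING(idInfo, 1) FORMAT "x(78)".
SET-SIZE(idInfo) = 0.                /* release memory */
```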


Word Break Attribute Assignments

Progress word indexes parse text into word units based on a set of character attributes called Word Break Attributes. The attributes are defined in the following table, which is taken from the Progress Internationalization Guide. The third column, Default Assignments, lists the attributes assigned to each character in the ASCII character set, i.e. the characters with code points less than 128. The ASCII characters are the first 128 characters of every non-EBCDIC code page that Progress supports, so the Word Break Attribute assignments for those first 128 characters are the same in all of these code pages.

Upgrading to a TYPE 3 word break table is strongly recommended. There is no downside, and using TYPE 3 fixes some problems.

Word Delimiter Attributes

Word Delimiter Attribute / Word Index behavior when a character with this attribute is encountered / Default Assignments

LETTER (2)
Behavior: Always part of a word.
Default Assignments: All characters that the current code page attribute table defines as alphabetic, i.e. those whose ISALPHA attribute is "1". These are the uppercase characters A-Z and the lowercase characters a-z.

DIGIT (3)
Behavior: Always part of a word unit.
Default Assignments: The characters 0-9.

USE_IT (4)
Behavior: Always part of a word.
Default Assignments: Dollar sign ($), Percent sign (%), Number sign (#), At symbol (@), Underline (_).

BEFORE_LETTER (5)
Behavior: Part of a word only if followed by a character with the LETTER attribute. Otherwise, treated as a word delimiter.
Default Assignments: No characters have this attribute by default.

BEFORE_DIGIT (6)
Behavior: Treated as part of a word only if followed by a character with the DIGIT attribute. For example, "12.34" is one word, but "ab.cd" is two words.
Default Assignments: Period (.), Comma (,), Hyphen (-).

BEFORE_LET_DIG (7)
Behavior: Treated as part of a word only if followed by a character with the LETTER or DIGIT attribute.
Default Assignments: No characters have this attribute by default.

IGNORE (8)
Behavior: The character is removed from the word for indexing. For example, "John's" is equivalent to "Johns".
Default Assignments: The apostrophe (').

TERMINATOR (1)
Behavior: Word delimiter, i.e. the character is not part of any word and can indicate that the end of a word has been reached or that a new one is about to begin.
Default Assignments: All other characters.
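
To make the attribute rules concrete, here is a rough pure-Progress approximation of how they could be applied to ASCII text. It is not the XenWord or Progress implementation: the functions attrOf and splitWords are hypothetical helpers, the attribute lookup hard-codes the default ASCII assignments from the table instead of reading a word rules table, the lookahead inspects only the raw next character, and words are joined with "|" purely for display.

```abl
/* Hypothetical helper: default ASCII attribute (1-8) for one character. */
FUNCTION attrOf RETURNS INTEGER (INPUT pcChar AS CHARACTER):
    IF pcChar = "" THEN RETURN 1.                    /* blank or empty: TERMINATOR */
    IF INDEX("0123456789", pcChar) > 0 THEN RETURN 3.               /* DIGIT */
    IF INDEX("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ", pcChar) > 0
        THEN RETURN 2.                                              /* LETTER */
    IF INDEX("$%#@_", pcChar) > 0 THEN RETURN 4.                    /* USE_IT */
    IF INDEX(".,-", pcChar) > 0 THEN RETURN 6.                      /* BEFORE_DIGIT */
    IF pcChar = "'" THEN RETURN 8.                                  /* IGNORE */
    RETURN 1.                                                       /* TERMINATOR */
END FUNCTION.

/* Hypothetical helper: split text into words, joined by "|". */
FUNCTION splitWords RETURNS CHARACTER (INPUT pcText AS CHARACTER):
    DEFINE VARIABLE i       AS INTEGER   NO-UNDO.
    DEFINE VARIABLE iAttr   AS INTEGER   NO-UNDO.
    DEFINE VARIABLE iNext   AS INTEGER   NO-UNDO.
    DEFINE VARIABLE cChar   AS CHARACTER NO-UNDO.
    DEFINE VARIABLE cWord   AS CHARACTER NO-UNDO.
    DEFINE VARIABLE cResult AS CHARACTER NO-UNDO.
    DEFINE VARIABLE lKeep   AS LOGICAL   NO-UNDO.

    DO i = 1 TO LENGTH(pcText):
        cChar = SUBSTRING(pcText, i, 1).
        iAttr = attrOf(cChar).
        /* Attribute of the following character; TERMINATOR past the end. */
        iNext = IF i < LENGTH(pcText)
                THEN attrOf(SUBSTRING(pcText, i + 1, 1)) ELSE 1.

        IF iAttr = 8 THEN NEXT.      /* IGNORE: drop it without ending the word */

        lKeep = (iAttr >= 2 AND iAttr <= 4)                 /* LETTER, DIGIT, USE_IT */
             OR (iAttr = 5 AND iNext = 2)                   /* BEFORE_LETTER */
             OR (iAttr = 6 AND iNext = 3)                   /* BEFORE_DIGIT */
             OR (iAttr = 7 AND (iNext = 2 OR iNext = 3)).   /* BEFORE_LET_DIG */

        IF lKeep THEN
            cWord = cWord + cChar.
        ELSE IF cWord <> "" THEN DO:
            /* A delimiter ends the current word. */
            cResult = cResult + (IF cResult = "" THEN "" ELSE "|") + cWord.
            cWord = "".
        END.
    END.

    IF cWord <> "" THEN
        cResult = cResult + (IF cResult = "" THEN "" ELSE "|") + cWord.
    RETURN cResult.
END FUNCTION.

DISPLAY splitWords("12.34 ab.cd John's") FORMAT "x(40)".
/* expected: 12.34|ab|cd|Johns */
```

The sample input reuses the examples from the table: "12.34" stays one word, "ab.cd" splits in two, and the apostrophe in "John's" is dropped. For anything beyond a quick illustration, wordBreaks with the relevant word rules file remains the reference behavior.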


 