Online Herbs Shopping Project
Word Similarity algorithm for Merging Thai Herb Information from Heterogeneous Data Sources
In the past few years, many studies have addressed word similarity. These studies can be classified into two models. The first model is based on string matching algorithms [8-11]. Its principle is to count the matched and unmatched characters and normalize the result by a common divisor. The similarity score ranges from 0 (no similarity) to 1 (closely similar). The other model is based on semantic similarity. Studies based on this model often use WordNet to measure the similarity between words. A widely used measure is the edge-counting technique [12-15]. Given words a and b, their lowest common ancestor node is identified. The shorter path length from either a or b to this lowest common ancestor node is used as the base term, which is then normalized by the sum of the path length from a to b (via the common ancestor node) and their depths. Hence, the score also ranges from 0 (no similarity) to 1 (perfect synonymy).
Our approach to finding similarities between Thai herb names uses the exact-match string matching model. For symptom names, however, we propose a new algorithm. There are three reasons why we cannot use the two models above for symptom names. First, we do not have a Thai WordNet for symptoms. Second, similar or identical symptoms can have many names. Third, similar symptom names can denote totally different symptoms.
A. Merging Thai herb name
The objective of this process is to collect Thai herb information under both the official name and the other names (common names). This process is necessary in order to complete the integration of Thai herb information from various data sources. The process of merging Thai herb names has two main steps. In the first step, we compare the official name in one data source with the official names in the other data sources. If the official names match exactly, the two Thai herb records are merged. In the second step, we compare both the official name and the other names in one data source with the other names in the other data sources. If either the official name or any of the other names matches one of the other names in another data source, the two Thai herb records are merged.
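The two merging steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the `HerbRecord` structure, its field names, and the helper functions are hypothetical, and the single pass shown here does not re-check records that only become linked through a later merge.

```python
from dataclasses import dataclass, field


@dataclass
class HerbRecord:
    # Hypothetical record layout; the paper does not specify one.
    official_name: str
    other_names: set = field(default_factory=set)
    info: dict = field(default_factory=dict)


def same_herb(a, b):
    """Return True if two records match under step 1 or step 2."""
    # Step 1: the official names match exactly.
    if a.official_name == b.official_name:
        return True
    # Step 2: the official or other names of one record match
    # the other names of the other record (checked in both directions).
    names_a = {a.official_name} | a.other_names
    names_b = {b.official_name} | b.other_names
    return bool(names_a & b.other_names) or bool(names_b & a.other_names)


def merge_records(a, b):
    """Merge b into a, keeping a's official name (an arbitrary choice)."""
    names = ({b.official_name} | a.other_names | b.other_names) - {a.official_name}
    # On conflicting keys, values from a win (also an arbitrary choice).
    return HerbRecord(a.official_name, names, {**b.info, **a.info})


def merge_sources(*sources):
    """Merge lists of HerbRecord from several data sources."""
    merged = []
    for source in sources:
        for record in source:
            for i, existing in enumerate(merged):
                if same_herb(existing, record):
                    merged[i] = merge_records(existing, record)
                    break
            else:
                # No match found in any earlier record: keep as a new herb.
                merged.append(record)
    return merged
```

Because both steps rely on exact string equality, spelling variants of the same herb name stay separate records in this sketch.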
B. Finding similarities between symptoms
This process is the main challenge of this work because a single Thai symptom can be called by many names, such as Thongruang (ท้องร่วง) and Thongsia (ท้องเสีย). On the other hand, some totally different symptoms, such as Puadhua (ปวดหัว) and Puadthong (ปวดท้อง), have above-average string-matching scores. Hence, we propose an algorithm to find similarities between symptoms. The process is divided into four steps as follows.
1) Step 1: Grouping body systems into sub-organ similarity tables: First, we divided the body organs into ten groups based on the body organ systems shown in figure 2: the alimentary, skeletal, nervous, muscular, cardiovascular, respiratory, reproductive, integumentary, lymphatic and urinary systems. Each system contains sub-organs; for example, the alimentary system has a mouth, a stomach, an intestine, an anus and so on. Hence, ten 2-D tables are constructed; each contains a similarity score between 0 and 1 for each pair of sub-organs (based on human expert opinion). For example, mouth and teeth have a similarity score of 0.7, while mouth and stomach have a similarity score of 0.0.
2) Step 2: Constructing a list of all symptoms organized by sub-organs: Our assumption is that the same symptom affects the same sub-organ. Hence, we construct 35 lists of symptoms, one for each sub-organ that the symptoms can affect. The symptom data used in our work are extracted from five websites and two databases. These lists are the input to the next step.
3) Step 3: Constructing symptom word similarity tables: Next, we take a symptom name and separate it into words. In Thai, a symptom is usually expressed by at least two words: one specifies a body part, and the other specifies the irregularity that happens to that body part.
These irregularity specification words (or symptom words) are grouped into tables based on the sub-organs of the body systems from step 1. Examples of these symptom words are colic (จุกเสียด) and ache (ปวด). As in the sub-organ similarity tables, the similarity scores are between 0 and 1. However, instead of relying on human expert opinion, the similarity score of a symptom word pair is calculated as follows.
• For each sub-organ, if the word pair appears together at least once, count it as 1. (Hence, for a total of 35 sub-organs, there can be at most 35 counts.)
• Divide the count by the number of sub-organs (35).
4) Step 4: Calculating symptom similarity score: This is the main step for calculating symptom similarity, based on the lists and tables from steps 1-3. A pair of symptom names is passed to Algorithm 1, shown in figure 3. The algorithm is based on a standard edit distance score (distance_score) computed with dynamic programming, with some modifications. The modifications are as follows. First, the input pair, symptom S1 and symptom S2, consists of Thai words instead of English strings. Next, if S1 and S2 are in different sub-organ lists (from step 2 above), the algorithm returns distance_score 1 (meaning that they are not the same symptom). Then, both symptoms are segmented into lists of words using LexTo and passed to the distance_score calculation with a modified diff(word1, word2) as follows:
• Step 4.1: if word1 and word2 are the same word, return distance score 0.
• Step 4.2: if word1 and word2 are synonyms in the LEXiTRON Thai dictionary, return distance score 0.
• Step 4.3: otherwise, look up wordscore_similarity(word1, word2) in the tables from step 1 and step 3 above, and return distance_score = 1 - wordscore_similarity.
• Finally, the distance_score is divided by the maximum length of the pair, and the result is subtracted from one.
In sum, the similarity score of two symptoms S1 and S2 is defined as:
$$\mathrm{similarity}(S_1, S_2) = 1 - \frac{\mathrm{distance\_score}(S_1, S_2)}{\mathrm{MaxLength}(S_1, S_2)} \tag{1}$$
This similarity score is between 0 and 1. A pair of symptom names must obtain a similarity score above 0.6 in order to be considered the same symptom. Table I shows the results for some symptom pairs. The threshold of 0.6 was chosen in a preliminary experiment that used 20 symptoms as seeds and varied the threshold over 0.4, 0.5, 0.6, 0.7 and 0.8. We found that scores below 0.5 did not indicate a good pair of symptom synonyms. In addition, only a few symptom synonyms can achieve scores above 0.7. This is due to the nature of symptom names, which mostly consist of two or three words.
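The word-pair score of step 3 and the modified edit distance of step 4 can be sketched in Python as follows. This is a minimal illustration under several assumptions, not a reproduction of Algorithm 1:
- The symptom names are already segmented into word lists; the segmentation tool is not called here.
- The synonym dictionary is a plain set of word pairs.
- The tables from steps 1 and 3 are merged into a single dictionary, and pairs missing from it score 0.
- "Appears together" is read as both words occurring in the word list of the same sub-organ.
- Symptoms in different sub-organ lists are given similarity 0.
- Insertions and deletions cost 1.

All function and parameter names are hypothetical.

```python
def pair_key(word1, word2):
    """Order-independent dictionary key for a word pair."""
    return frozenset((word1, word2))


def cooccurrence_similarity(word1, word2, words_by_suborgan):
    """Step 3: fraction of sub-organs whose word list contains both words."""
    count = sum(
        1 for words in words_by_suborgan.values()
        if word1 in words and word2 in words
    )
    return count / len(words_by_suborgan)


def word_diff(word1, word2, synonyms, word_similarity):
    """Substitution cost between two words (steps 4.1 to 4.3)."""
    if word1 == word2:                      # Step 4.1: identical words
        return 0.0
    key = pair_key(word1, word2)
    if key in synonyms:                     # Step 4.2: dictionary synonyms
        return 0.0
    # Step 4.3: 1 - table score (assumption: unknown pairs score 0).
    return 1.0 - word_similarity.get(key, 0.0)


def distance_score(words1, words2, synonyms, word_similarity):
    """Word-level edit distance computed with dynamic programming."""
    m, n = len(words1), len(words2)
    # dp[i][j] = distance between the first i words of words1
    # and the first j words of words2.
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = float(i)
    for j in range(1, n + 1):
        dp[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            substitute = dp[i - 1][j - 1] + word_diff(
                words1[i - 1], words2[j - 1], synonyms, word_similarity
            )
            dp[i][j] = min(dp[i - 1][j] + 1.0, dp[i][j - 1] + 1.0, substitute)
    return dp[m][n]


def symptom_similarity(words1, words2, synonyms, word_similarity,
                       same_suborgan_list=True):
    """Equation (1): 1 - distance_score / MaxLength."""
    if not same_suborgan_list:
        return 0.0  # assumption: treated as definitely different symptoms
    max_length = max(len(words1), len(words2))
    if max_length == 0:
        return 1.0
    distance = distance_score(words1, words2, synonyms, word_similarity)
    return 1.0 - distance / max_length
```

With English placeholder words, `symptom_similarity(["head", "ache"], ["stomach", "ache"], set(), {})` gives 0.5, which falls below the 0.6 threshold, so the pair would not be treated as the same symptom. This also illustrates why two-word names rarely land between the extremes: a single unmatched word already costs half of the score.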