As of 26 06 2008

Processing SMS from the database / Testing

Task 12 – Working on now

Banned words

When a person sends an SMS to the system, it may contain banned words. To check for them, the system keeps a list of banned words. It would be very inefficient to look up every word in the database, so all the banned words in the database are loaded into a list when the system initializes. The table “swearwords” holds all the banned words in the system.

db_bannedwords.jpg
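
The load-once-then-check-in-memory idea can be sketched as follows. This is a minimal sketch in standard C++ rather than the C++/CLI used by the application; the function names are hypothetical, the rows are passed in directly instead of being read from the “swearwords” table, a hash set is used in place of a plain list, and case-insensitive matching is an assumption that the text above does not confirm.

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

// Lower-case a word so lookups ignore case (an assumption; the page does
// not say whether matching is case-sensitive).
std::string toLower(std::string word) {
    std::transform(word.begin(), word.end(), word.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return word;
}

// Build the in-memory set once at start-up. In the real system the rows
// would come from the "swearwords" table; here they are passed in directly.
std::unordered_set<std::string> loadBannedWords(const std::vector<std::string>& rows) {
    std::unordered_set<std::string> banned;
    for (const std::string& row : rows) {
        banned.insert(toLower(row));
    }
    return banned;
}

// Return true if any whitespace-separated word of the message is banned.
bool containsBannedWord(const std::string& message,
                        const std::unordered_set<std::string>& banned) {
    std::istringstream words(message);
    std::string word;
    while (words >> word) {
        if (banned.count(toLower(word)) > 0) {
            return true;
        }
    }
    return false;
}
```

Note that this sketch splits on whitespace only, so a banned word with punctuation attached (for example a trailing "!") would not match; whether the real system strips punctuation first is not stated here.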

Invalid characters

If the SMS contains characters that cannot be processed by POSTagger, it will generate errors. To prevent this, a list of characters that can be processed is initialized. The following string contains all the characters that can be processed by the system.

String^ charLine    = " ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz`1234567890-=[]\\;',./~!@#$%^&*()_+|}{:\"<>?";

The SMS received by the system is analyzed to find whether it contains any characters other than the listed ones. If it does, it would generate an error in POSTagger, so the SMS is not processed any further.
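
One way to express this check, as a sketch in standard C++ (not the application's C++/CLI code): the constant below copies the contents of charLine, while the names kValidChars and hasOnlyValidChars are hypothetical.

```cpp
#include <string>

// The allowed characters, copied from the charLine string above.
const std::string kValidChars =
    " ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz`1234567890-=[]\\;',./~!@#$%^&*()_+|}{:\"<>?";

// True when every character of the message appears in kValidChars.
// find_first_not_of returns npos when no disallowed character exists.
bool hasOnlyValidChars(const std::string& message) {
    return message.find_first_not_of(kValidChars) == std::string::npos;
}
```

With this list, tabs, line breaks and any non-ASCII bytes count as invalid, because none of them appear in the string.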

POSTagger

The English POS tagger used in this application was not developed by us. For more information on it, refer to:

Tsuruoka Y., and Tsujii, J. Bidirectional Inference with the Easiest-First Strategy for Tagging Sequence Data. In Proceedings of HLT/EMNLP, 2005, pp. 467-474.

POSTagger is used to analyze the user's SMS.

Finding the poetry

The system goes into a while loop and continually checks for entries in the “sms” table with status == 0.

When it finds such an entry, it checks the message length. The length cannot exceed the maximum allowed length, which is set in the configuration file, and it cannot be zero (nothing was sent to BlogWall). If either condition is violated, BlogWall gives an invalid length message, deletes the current entry in the “sms” table and moves on to the next entry in the “sms” table.

The message is then analyzed to find out whether it is a polling reply. If so, the necessary steps are taken to update the poll-related tables.

The message is then analyzed to find out whether it contains invalid characters. The list of valid characters is loaded during initialization. If invalid characters are present in the message, BlogWall gives an invalid length message, deletes the current entry in the “sms” table and moves on to the next entry in the “sms” table.

Then the message is processed using POSTagger. If the length of the output is zero, that indicates an error in the tagging process. In that case BlogWall gives an error message, deletes the current entry in the “sms” table and moves on to the next entry in the “sms” table.

Next, the number of words in the SMS is checked. If the SMS has fewer than 3 unique words, the system cannot generate proper poetry. In this case the entry in the “sms” table is NOT deleted. Instead, the table entry “valid” is set to 0. The display system can either display the user's SMS without poetry or ignore it completely.

It is very unlikely that a user would send an SMS containing a word longer than 40 characters; the probability of this being a malicious attack is high. So if the system receives a word of more than 40 characters, the SMS is considered illegal. In this case the entry in the “sms” table is NOT deleted. Instead, the table entry “valid” is set to 0. The display system can decide what to do with such entries.
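
The unique-word and word-length rules can be sketched together in standard C++ as below. The names WordStats, analyseWords and passesWordChecks are hypothetical, and the sketch assumes words are separated by whitespace and compared case-sensitively; the real system may count words differently, for example from the POSTagger output.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <sstream>
#include <string>

struct WordStats {
    std::size_t uniqueWords;  // number of distinct whitespace-separated words
    std::size_t longestWord;  // length of the longest word, in characters
};

// Walk the message once, collecting distinct words and the longest length.
WordStats analyseWords(const std::string& message) {
    std::set<std::string> seen;
    std::size_t longest = 0;
    std::istringstream words(message);
    std::string word;
    while (words >> word) {
        seen.insert(word);
        longest = std::max(longest, word.size());
    }
    return {seen.size(), longest};
}

// Mirrors the two rules above: at least 3 unique words and no word
// longer than 40 characters. A failure means valid is set to 0.
bool passesWordChecks(const WordStats& stats) {
    return stats.uniqueWords >= 3 && stats.longestWord <= 40;
}
```

For example, "the cat the cat" has only two unique words and fails, while "roses are red" passes.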

Next, the SMS is checked for banned words. If the message contains a banned word, BlogWall gives an error message, deletes the current entry in the “sms” table and moves on to the next entry in the “sms” table.
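
The checks described so far lead to one of two failure outcomes: the entry is deleted, or it is kept with valid set to 0. The sketch below only illustrates the order of the checks and their outcomes in standard C++. All names are hypothetical, the polling-reply step is left out, and the check results are passed in precomputed, whereas the real loop would presumably stop at the first failed check.

```cpp
// What the processing loop does with the current "sms" entry.
enum class Outcome {
    Continue,     // all checks passed; go on to the remaining steps
    DeleteEntry,  // report an error and delete the entry
    MarkInvalid   // keep the entry but set valid to 0
};

// Results of the individual checks (hypothetical field names).
struct CheckResults {
    bool lengthOk;                 // non-zero and within the configured maximum
    bool charsOk;                  // only characters from the valid list
    bool taggerOutputNonEmpty;     // POSTagger produced output
    bool atLeastThreeUniqueWords;  // enough words to build poetry
    bool noWordOver40Chars;        // no suspiciously long word
    bool noBannedWord;             // nothing from the banned word list
};

// Apply the checks in the order the text describes them.
Outcome decide(const CheckResults& r) {
    if (!r.lengthOk) return Outcome::DeleteEntry;
    if (!r.charsOk) return Outcome::DeleteEntry;
    if (!r.taggerOutputNonEmpty) return Outcome::DeleteEntry;
    if (!r.atLeastThreeUniqueWords) return Outcome::MarkInvalid;
    if (!r.noWordOver40Chars) return Outcome::MarkInvalid;
    if (!r.noBannedWord) return Outcome::DeleteEntry;
    return Outcome::Continue;
}
```

Because the word-count and word-length checks come before the banned-word check, a message that fails both kinds of check is marked invalid rather than deleted in this sketch.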

The next step is the calculation of the emotional weight of the SMS.

After that, the tag ids in the output string generated by POSTagger are identified.

After that, the tf-idf weight of each word is retrieved from the database. The words with the highest values are selected; the number of words selected is set in the configuration file. In the next step, synonyms are fetched for each of these words. The synonyms are then ranked by the tf-idf weight retrieved from the database, and the highest-weighted synonyms are selected; the number of synonyms selected is also set in the configuration file. The data is stored in the "sms_text_word" table.

db_sms_text_word.jpg
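
Selecting the highest-weighted words can be sketched in standard C++ as follows. WeightedWord and topByWeight are hypothetical names, the weights are assumed to have been fetched from the database already, and n stands for the count read from the configuration file. The same function could be applied to the synonyms in the second ranking step.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct WeightedWord {
    std::string word;
    double tfidf;  // weight as retrieved from the database
};

// Return the n highest-weighted words, highest first. If fewer than n
// words are available, all of them are returned.
std::vector<WeightedWord> topByWeight(std::vector<WeightedWord> words, std::size_t n) {
    const std::size_t keep = std::min(n, words.size());
    const auto middle = words.begin() + static_cast<std::ptrdiff_t>(keep);
    // partial_sort orders only the first `keep` elements, which is all we need.
    std::partial_sort(words.begin(), middle, words.end(),
                      [](const WeightedWord& a, const WeightedWord& b) {
                          return a.tfidf > b.tfidf;
                      });
    words.erase(middle, words.end());
    return words;
}
```

How ties between equal weights are broken is not specified on this page, so the sketch leaves their order unspecified as well.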