HBKU research explores AI use in detecting fraudulent text messages

 

The College of Science and Engineering (CSE) at Hamad Bin Khalifa University focuses on world-class research and innovation in conversational artificial intelligence. Its goal is to develop useful technologies that deliver clear benefits for regional industry.

According to Dr. David Yang, an associate professor at CSE, the college recently carried out research on detecting fraudulent SMS messages while preserving user privacy.


The research effort was carried out in conjunction with Ooredoo and was supported by the Qatar National Research Fund (QNRF).

 

According to Dr. Yang, the project aims to design and build privacy-preserving data analytics solutions for telecommunication data in order to identify fraudulent messages, safeguard customers, and improve the overall customer experience.

“This project’s outcomes include novel natural language processing (NLP) models based on the Transformer architecture, which are required for parsing and classifying text messages, as well as graph analytics tools that analyze customer relationships to identify vulnerable customers who are likely to become fraud victims,” he continued.
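To make the graph-analytics idea above more concrete, the following is a minimal sketch of how customer relationships might be scored with the open-source networkx library. It is not the project's actual tooling; the customer IDs, edge list, and vulnerability threshold are hypothetical placeholders.

```python
# Illustrative sketch only; NOT the HBKU project's actual tooling.
# Customer IDs, edges, and the scoring rule are hypothetical placeholders.
import networkx as nx

# Hypothetical communication graph: nodes are customers, edges mean "exchanged SMS".
edges = [
    ("alice", "bob"), ("bob", "carol"), ("carol", "dave"),
    ("eve", "bob"), ("eve", "carol"), ("frank", "alice"),
]
known_fraud_victims = {"bob", "carol"}  # labels assumed to be available

graph = nx.Graph(edges)

def vulnerability_score(g, node, victims):
    """Fraction of a customer's neighbours who are known fraud victims."""
    neighbours = list(g.neighbors(node))
    if not neighbours:
        return 0.0
    return sum(n in victims for n in neighbours) / len(neighbours)

for customer in graph.nodes:
    if customer in known_fraud_victims:
        continue
    score = vulnerability_score(graph, customer, known_fraud_victims)
    if score >= 0.5:  # arbitrary threshold for this sketch
        print(f"{customer}: potentially vulnerable (score={score:.2f})")
```

The intuition matches Dr. Yang's description: customers whose contacts include many known fraud victims are themselves more exposed and can be flagged for protection.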

QNRF funded the study under Cycle 10 of its National Priority Research Program (NPRP), and the work resulted in two major academic awards.

Artificial intelligence (AI) that can understand and respond to human speech is known as conversational AI. Chatbots, virtual assistants, and other applications that rely on natural language processing use this kind of AI.

Discussing the potential of conversational AI and its widespread applications, Dr. Yang explained that the AI system first grasps the question and then generates a natural, succinct response on its own.

Dr. Yang outlined the process of natural language understanding (NLU), stating, “We do not quite understand how natural language understanding occurs. We do, however, understand how to build an AI system for this purpose. A large-scale Transformer, a deep learning model trained on a substantial corpus of text downloaded from the Internet, is typically used for this.”
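As a small illustration of the kind of pre-trained Transformer Dr. Yang describes, the sketch below loads a publicly available masked-language model with the Hugging Face transformers library and asks it to complete a sentence. The checkpoint name is chosen only for illustration and is not the model used in the HBKU project.

```python
# Illustration of a pre-trained Transformer trained on large web-text corpora.
# "bert-base-uncased" is a common public checkpoint, not the project's model.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The model predicts plausible words for the masked position, showing the
# language patterns it absorbed during pre-training.
for prediction in fill_mask("Your bank account has been [MASK]."):
    print(f"{prediction['token_str']:>12}  score={prediction['score']:.3f}")
```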

According to Dr. Yang, as of August 2022 the Transformer architecture is well known, and there is a wealth of text available online that can be used to train models.

“So, anyone with access to enough computational power can create an NLU model, although we still lack a solid theoretical grasp of how artificial intelligence comprehends natural language,” he continued.

Dr. Yang gave examples of NLU applications and the industries in which they are most commonly used, citing voice assistants such as Apple Siri, Amazon Echo, and Google Home; text autocomplete in emails written in Outlook and Gmail; and chatbots that provide customer service on websites.

Chatbots that incorporate AI capabilities offered by major cloud-computing platforms such as Google Cloud and IBM Watson are a typical example of widely used technology in Qatar. Arabic-to-English machine translation is another frequent application of NLP.
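As an illustration of such machine translation, the short sketch below uses the Hugging Face transformers library with one publicly available Arabic-to-English checkpoint (Helsinki-NLP/opus-mt-ar-en); it is not necessarily the system deployed by any provider in Qatar.

```python
# Hedged sketch of Arabic-to-English machine translation.
# The checkpoint is a public model chosen purely for illustration.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-ar-en")

result = translator("مرحبا بكم في خدمة العملاء")  # "Welcome to customer service"
print(result[0]["translation_text"])
```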

He explained that training any AI model requires data. Unstructured data can be used to train a general model for NLU, a procedure commonly referred to as “pre-training.” A task-specific model can then be trained using structured data, a process known as “fine-tuning.” “For instance, by using unstructured data from the Internet to pre-train a model for Arabic language comprehension, we can then fine-tune the model to interact with users in a chatbot for a particular domain, like telecom customer service, by using structured data from this domain,” explained Dr. Yang.
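The sketch below illustrates the fine-tuning half of this workflow: a general-purpose pre-trained Transformer is adapted to a binary fraudulent-versus-legitimate SMS classification task. The checkpoint name, the tiny in-memory dataset, and the label scheme are hypothetical placeholders, not the project's actual setup.

```python
# Minimal "pre-train then fine-tune" sketch using Hugging Face transformers.
# Model name, toy dataset, and labels are hypothetical placeholders; a real
# system would use a large labelled SMS corpus (and an Arabic checkpoint for Arabic).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased"  # a general-purpose pre-trained model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Toy labelled data: 1 = fraudulent SMS, 0 = legitimate SMS.
texts = [
    "URGENT: your account is blocked, click this link to verify your card",
    "Your package will be delivered tomorrow between 9am and 12pm",
]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for _ in range(3):  # a few gradient steps, just to show the fine-tuning loop
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"loss: {outputs.loss.item():.4f}")
```

The design mirrors Dr. Yang's description: the expensive language knowledge comes from pre-training on unstructured text, while the comparatively small labelled dataset only has to teach the task-specific decision.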

The ethics of artificial intelligence, he added, is another crucial aspect. For instance, a chatbot trained to respond to posts from unmoderated Internet forums often picks up objectionable language, such as profanity and racial insults. To prevent this, the training data must be “sanitized” by removing examples of offensive language.
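A toy example of such sanitization is sketched below: training examples containing blocked terms are filtered out before fine-tuning. Real pipelines rely on far more sophisticated toxicity classifiers; the blocklist and the example posts here are placeholders.

```python
# Toy illustration of "sanitizing" training data; real systems use stronger
# toxicity filters. The blocklist and posts are hypothetical placeholders.
BLOCKLIST = {"profanity1", "slur1"}  # stand-ins for actual offensive terms

def is_clean(text: str, blocklist: set = BLOCKLIST) -> bool:
    """Return True if the text contains none of the blocked terms."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    return words.isdisjoint(blocklist)

raw_posts = [
    "Thanks for the help, that solved my issue!",
    "This is profanity1, what a useless answer.",
]

training_examples = [post for post in raw_posts if is_clean(post)]
print(training_examples)  # only the inoffensive post is kept
```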

“With enough data and a contemporary AI model architecture, the model eventually picks up the language. The fundamental issue is that for certain less common languages there is not much data freely available on the Internet, not the perceived difficulty of the language or its accents,” he added.

This effort resulted in two significant academic awards: the Best Paper Award at the VLDB 2021 conference and first place in the NLP4IF competition at the EMNLP 2019 conference.
