
Pre-training & Unbiased Learning for Web Search


Learning to Rank (LTR), which aims to measure documents' relevance with respect to queries, is a popular research topic in information retrieval with wide practical use in web search engines, e-commerce, and streaming services. However, directly optimizing a ranking model on click data yields unsatisfactory performance because implicit user feedback is strongly biased, for example by position bias, trust bias, and click necessity bias. Unbiased learning to rank (ULTR) was proposed to debias user feedback with counterfactual learning algorithms. However, real-world user feedback can be more complex than synthetic feedback generated under specific user behavior assumptions such as the position-dependent click model, and ULTR algorithms that perform well on synthetic datasets may not perform consistently well in real-world scenarios. Furthermore, it is nontrivial to directly apply recent advances in pre-trained language models (PLMs) to web-scale search engine systems, since explicitly capturing the comprehensive relevance between queries and documents is crucial to the ranking task. Existing pre-training objectives, whether sequence-based tasks (e.g., masked token prediction) or sentence pair-based tasks (e.g., permuted language modeling), learn contextual representations based on intra- and inter-sentence coherence, which cannot be straightforwardly adapted to model query-document relevance. Therefore, in this project, we focus on unbiased learning and pre-training for web search under real long-tail user feedback. A large-scale dataset and a series of works have been proposed in this project to develop more practical large language models for web search.
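To make the idea of debiasing clicks concrete, here is a minimal sketch of inverse propensity weighting (IPW), one common counterfactual technique, under a position-based click model. It is an illustration and not this project's method: the function names, the toy click log, and the examination propensities are all assumptions made for the example, and in practice the propensities would have to be estimated.

```python
from collections import defaultdict

def naive_ctr(click_log):
    """Raw click-through rate per document, ignoring display position."""
    shown = defaultdict(int)
    clicks = defaultdict(int)
    for doc_id, _rank, clicked in click_log:
        shown[doc_id] += 1
        clicks[doc_id] += int(clicked)
    return {d: clicks[d] / shown[d] for d in shown}

def ipw_relevance(click_log, propensity):
    """IPW relevance estimate per document.

    Assumes a position-based click model: a click happens only if the
    user examines the rank AND the document is relevant. Each click is
    weighted by 1 / P(examined at that rank).
    """
    shown = defaultdict(int)
    weighted_clicks = defaultdict(float)
    for doc_id, rank, clicked in click_log:
        shown[doc_id] += 1
        if clicked:
            weighted_clicks[doc_id] += 1.0 / propensity[rank]
    return {d: weighted_clicks[d] / shown[d] for d in shown}

# Hypothetical examination probabilities per rank (assumed, not measured).
propensity = {1: 1.0, 2: 0.5}

# Toy log of (doc_id, rank, clicked): "a" is always shown at rank 1,
# "b" is always shown at rank 2.
click_log = (
    [("a", 1, True)] * 2 + [("a", 1, False)] * 2
    + [("b", 2, True)] * 1 + [("b", 2, False)] * 3
)

print(naive_ctr(click_log))
print(ipw_relevance(click_log, propensity))
```

In this toy log the raw click-through rate of "b" (0.25) is half that of "a" (0.5) only because "b" was shown at a position that is examined half as often. After reweighting, both documents receive the same relevance estimate of 0.5. Real click logs rarely follow such a clean model, which is the gap between synthetic and real-world feedback described above.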

Causal Inference for Web Search

Causal inference has become an increasingly important topic in web search, as it allows us to determine the causal relationship between user behavior and search results. By leveraging counterfactual reasoning, causal inference can help us to overcome the limitations of observational data, such as selection bias and confounding variables. However, applying causal inference to web search is challenging, as it requires modeling the complex interactions between users, queries, and search results. Additionally, the scale and complexity of web search data present unique challenges for causal inference methods. In this project, we aim to develop novel causal inference methods that are tailored to the needs of web search. Our approach will involve both theoretical development and practical implementation, leveraging state-of-the-art techniques from machine learning and statistics. By combining causal inference with machine learning, we hope to improve the accuracy and fairness of web search, making it a more useful and trustworthy tool for users.


Personalized Recommender System

Personalized Recommender Systems (PRS) are widely used in e-commerce, social networks, and content platforms to provide personalized recommendations to users based on their historical behavior, preferences, and interests. A PRS aims to predict a user's preference for a given item or piece of content and recommends items that the user is likely to be interested in. However, designing an effective PRS is challenging due to the sparsity of user-item interaction data and the need to balance the trade-off between exploration and exploitation. PRS draw on a range of methods from machine learning and data science, such as collaborative filtering, content-based filtering, deep learning, and reinforcement learning. Despite the success of PRS, many challenges remain. For example, a PRS may suffer from the cold start problem for new users or items with limited historical data. A PRS may also face issues of fairness and diversity, as it can reinforce existing biases and limit exposure to new or underrepresented content. Therefore, ongoing research efforts focus on developing more sophisticated PRS models that can address these challenges and provide more accurate, diverse, and personalized recommendations to users.
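As an illustration of collaborative filtering and of how the cold start problem shows up in practice, the following is a minimal sketch of user-based collaborative filtering with cosine similarity. The function names and the toy ratings are assumptions made for this example and do not describe a specific system.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse rating dicts (item -> rating)."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

def predict(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`.

    Returns None when no similar user has rated the item, which is
    what happens for a brand-new user (cold start).
    """
    numerator = 0.0
    denominator = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine(ratings[user], other_ratings)
        if sim <= 0.0:
            continue
        numerator += sim * other_ratings[item]
        denominator += sim
    return numerator / denominator if denominator else None

# Toy ratings (hypothetical): user -> {item: rating on a 1-5 scale}.
ratings = {
    "u1": {"a": 5, "b": 4},
    "u2": {"a": 5, "b": 4, "c": 2},
    "u3": {"a": 1, "c": 5},
    "new_user": {},
}

print(predict(ratings, "u1", "c"))        # dominated by the similar user u2
print(predict(ratings, "new_user", "c"))  # None: no history to compare
```

The prediction for "u1" lands close to the rating given by "u2", whose history is most similar. The new user has no overlap with anyone, so the sketch cannot produce a prediction at all, which is why cold-start cases usually need content features or exploration strategies.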


Product Search for E-Commerce

Product Search for E-commerce is a crucial aspect of the online shopping experience, enabling customers to efficiently find products that meet their needs. In recent years, significant progress has been made in this field through the development of sophisticated ranking algorithms that take into account multiple factors such as relevance, popularity, and price. However, challenges still remain, particularly when it comes to handling long-tail queries, which account for a significant portion of search traffic but are often difficult to match with relevant products. To address this issue, researchers have proposed a range of approaches, such as query expansion and query reformulation, to improve the effectiveness of product search. Additionally, there is growing interest in leveraging large language models (LLMs) to enhance the relevance of search results, by training them on large-scale e-commerce datasets that incorporate real-world user feedback. However, the application of LLMs to e-commerce search is still in its early stages, and much work remains to be done to optimize their performance for this specific task. In this context, unbiased learning is a promising approach to enhance the accuracy of product search, by accounting for various sources of bias in user feedback. By developing new techniques for pre-training and fine-tuning LLMs on large-scale e-commerce datasets, we can improve the relevance of product search results and provide a more satisfying shopping experience for customers.
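A minimal sketch of query expansion, one of the approaches mentioned above for long-tail queries, might look like the following. The synonym table, the function names, and the term-overlap score are hypothetical and chosen only to show the mechanism; production systems typically learn expansions from query logs or language models.

```python
def expand_query(query, synonyms):
    """Append known synonyms of each query term, preserving order."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        for alt in synonyms.get(term, []):
            if alt not in expanded:
                expanded.append(alt)
    return expanded

def match_score(query_terms, title):
    """Number of query terms that appear in the product title."""
    title_terms = set(title.lower().split())
    return sum(1 for term in query_terms if term in title_terms)

# Hypothetical synonym table (an assumption for this example).
synonyms = {"sofa": ["couch"]}

title = "Red velvet couch"
original = "red sofa".split()
expanded = expand_query("red sofa", synonyms)

print(expanded)
print(match_score(original, title), match_score(expanded, title))
```

Here the product title never contains the word "sofa", so the original query matches on "red" alone. The expanded query also matches "couch", which raises the overlap score from 1 to 2.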


Social Media Analysis and Mining

Social Media Analysis and Mining is a rapidly growing field that focuses on extracting insights and knowledge from social media platforms such as Twitter, Facebook, and Instagram. With the explosive growth of social media in recent years, there is an increasing need for effective methods to analyze and make sense of the vast amounts of data generated by these platforms. Social Media Analysis and Mining techniques include natural language processing, machine learning, network analysis, and sentiment analysis, among others. These techniques can be used to extract information about user behavior, trends, opinions, and social networks, and can have a wide range of applications in areas such as marketing, politics, health, and public opinion analysis. However, social media data is often noisy, unstructured, and biased, which makes it challenging to extract meaningful insights. To address these challenges, researchers in the field are developing new algorithms and tools that can handle large-scale, heterogeneous, and complex social media data, and that can account for sources of bias and misinformation. Additionally, there is growing interest in leveraging advanced techniques such as deep learning and graph mining to improve the accuracy and scalability of social media analysis. With these advancements, Social Media Analysis and Mining has the potential to transform our understanding of social behavior and provide valuable insights into a wide range of social phenomena.


Natural Language Understanding and Generation

Natural Language Understanding (NLU) and Generation (NLG) are two crucial research areas in the field of artificial intelligence that aim to enable machines to comprehend human language and generate human-like responses. With the increasing demand for intelligent conversational agents, chatbots, and virtual assistants, the development of NLU and NLG technologies has become more important than ever. NLU is concerned with extracting meaning from human language. It involves processing natural language input, such as text or speech, and analyzing it to derive meaning. NLU techniques include syntactic and semantic analysis, entity recognition, and sentiment analysis. Applications of NLU include chatbots, language translation, and voice assistants. NLG, on the other hand, is concerned with generating human-like responses in natural language. NLG techniques involve converting structured data or machine-readable input into natural language output. Applications of NLG include conversational agents, automated report generation, and content creation. Recent advancements in deep learning and natural language processing have led to significant progress in NLU and NLG. Pre-trained language models, such as GPT-3 and BERT, have revolutionized the field by achieving state-of-the-art performance on various NLU and NLG tasks. However, there are still many challenges in developing more robust and efficient NLU and NLG systems, such as handling rare or complex language constructs, dealing with ambiguity and sarcasm, and ensuring the ethical use of these technologies. As such, ongoing research in NLU and NLG is focused on addressing these challenges and developing more sophisticated and reliable systems.
