Publications | Shubhashis Roy Dipta

2025

Q2E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval

Shubhashis Roy Dipta and Francis Ferraro

arXiv preprint, 2025

@misc{dipta2025q2equerytoeventdecompositionzeroshot,
  title = {Q2E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval},
  author = {Dipta, Shubhashis Roy and Ferraro, Francis},
  year = {2025},
  publisher = {arXiv},
  archiveprefix = {arXiv},
  primaryclass = {cs.CL},
  url = {https://arxiv.org/abs/2506.10202},
  dimensions = {true},
}

2024

NAACL
UMBCLU at SemEval-2024 Task 1A and 1C: Semantic Textual Relatedness with and without machine translation

Shubhashis Roy Dipta and Sai Vallurupalli

SemEval 2024, 2024

This paper describes the system we developed for SemEval-2024 Task 1, "Semantic Textual Relatedness for African and Asian Languages." The aim of the task is to build a model that can identify semantic textual relatedness (STR) between two sentences of a target language belonging to a collection of African and Asian languages. We participated in Subtasks A and C and explored supervised and cross-lingual training leveraging large language models (LLMs). Pre-trained large language models have been extensively used for machine translation and semantic similarity. Using a combination of machine translation and sentence embedding LLMs, we developed a unified STR model, TranSem, for Subtask A and fine-tuned the T5 family of models on the STR data, FineSem, for use in Subtask C. Our model results for 7 languages in Subtask A were better than the official baseline for 3 languages and on par with the baseline for the remaining 4 languages. Our model results for the 12 languages in Subtask C resulted in 1st place for Afrikaans, 2nd place for Indonesian, and 3rd place for English, with low performance for the remaining 9 languages.
@article{dipta2024umbclu,
  title = {UMBCLU at SemEval-2024 Task 1A and 1C: Semantic Textual Relatedness with and without machine translation},
  author = {Dipta, Shubhashis Roy and Vallurupalli, Sai},
  journal = {SemEval 2024},
  year = {2024},
  url = {https://arxiv.org/abs/2402.12730},
  dimensions = {true}
}
NAACL
HU at SemEval-2024 Task 8A: Can Contrastive Learning Learn Embeddings to Detect Machine-Generated Text?

Shubhashis Roy Dipta and Sadat Shahriar

SemEval 2024, 2024

This paper describes our system developed for SemEval-2024 Task 8, "Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection." Machine-generated text has become a major concern due to the use of large language models (LLMs) in fake text generation, phishing, cheating in exams, and even plagiarizing copyrighted material. Many systems have been developed to detect machine-generated text. Nonetheless, the majority of these systems rely on the text-generating model, a limitation that is impractical in real-world scenarios, as it’s often impossible to know which specific model the user has used for text generation. In this work, we propose a single model based on contrastive learning, which uses 40% of the baseline’s parameters (149M vs. 355M) but shows comparable performance on the test dataset (21st out of 137 participants). Our key finding is that even without an ensemble of multiple models, a single base model can have comparable performance with the help of data augmentation and contrastive learning.
@article{dipta2024hu,
  title = {HU at SemEval-2024 Task 8A: Can Contrastive Learning Learn Embeddings to Detect Machine-Generated Text?},
  author = {Dipta, Shubhashis Roy and Shahriar, Sadat},
  journal = {SemEval 2024},
  year = {2024},
  url = {https://arxiv.org/abs/2402.11815},
  dimensions = {true},
}

2023

ACL
Semantically-informed Hierarchical Event Modeling

Shubhashis Roy Dipta, Mehdi Rezaee, and Francis Ferraro

In Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023), Jul 2023

Prior work has shown that coupling sequential latent variable models with semantic ontological knowledge can improve the representational capabilities of event modeling approaches. In this work, we present a novel, doubly hierarchical, semi-supervised event modeling framework that provides structural hierarchy while also accounting for ontological hierarchy. Our approach consists of multiple layers of structured latent variables, where each successive layer compresses and abstracts the previous layers. We guide this compression through the injection of structured ontological knowledge that is defined at the type level of events: importantly, our model allows for partial injection of semantic knowledge and it does not depend on observing instances at any particular level of the semantic ontology. Across two different datasets and four different evaluation metrics, we demonstrate that our approach is able to outperform the previous state-of-the-art approaches by up to 8.5%, demonstrating the benefits of structured and semantic hierarchical knowledge for event modeling.
@inproceedings{roy-dipta-etal-2023-semantically,
  title = {Semantically-informed Hierarchical Event Modeling},
  author = {Roy Dipta, Shubhashis and Rezaee, Mehdi and Ferraro, Francis},
  booktitle = {Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023)},
  month = jul,
  year = {2023},
  address = {Toronto, Canada},
  publisher = {Association for Computational Linguistics},
  url = {https://aclanthology.org/2023.starsem-1.31},
  doi = {10.18653/v1/2023.starsem-1.31},
  pages = {353--369},
  dimensions = {true}
}
arXiv
SeeBel: Seeing is Believing

Sourajit Saha and Shubhashis Roy Dipta

arXiv preprint, Jul 2023

Semantic segmentation is a significant research field in computer vision. Although it is widely studied, few visualization tools capture segmentation quality and dataset statistics, such as class imbalance, in the same view. While the significance of discovering and introspecting the correlation between dataset statistics and AI model performance for dense prediction computer vision tasks such as semantic segmentation is well established in the computer vision literature, to the best of our knowledge, no visualization tools have been proposed to view and analyze the aforementioned correlation. Our project aims to bridge this gap by proposing three visualizations, within a single tool, that enable users to compare dataset statistics and AI performance for segmenting all images or a single image in the dataset, explore the trained AI model’s attention on image regions, and browse the quality of the masks predicted by the AI for any user-selected number of objects. Our project further aims to increase the interpretability of the trained segmentation model by visualizing its image attention weights. For visualization, we use a scatterplot and a heatmap to encode correlation and features, respectively. We further propose to conduct surveys with real users to study the efficacy of our visualization tool in the computer vision and AI domains.
@article{saha2023seebel,
  title = {SeeBel: Seeing is Believing},
  author = {Saha, Sourajit and Roy Dipta, Shubhashis},
  journal = {arXiv preprint},
  year = {2023},
  publisher = {arXiv},
  dimensions = {true}
}

2022

Springer
MethEvo: an accurate evolutionary information-based methylation site predictor

Sadia Islam, Shafayat Bin Shabbir Mugdha, Shubhashis Roy Dipta, MD Easin Arafat, Swakkhar Shatabda, Hamid Alinejad-Rokny, and Iman Dehzangi

Neural Computing and Applications, Jul 2022

@article{islam2022methevo,
  title = {MethEvo: an accurate evolutionary information-based methylation site predictor},
  author = {Islam, Sadia and Mugdha, Shafayat Bin Shabbir and Dipta, Shubhashis Roy and Arafat, MD Easin and Shatabda, Swakkhar and Alinejad-Rokny, Hamid and Dehzangi, Iman},
  journal = {Neural Computing and Applications},
  pages = {1--12},
  year = {2022},
  publisher = {Springer},
  doi = {10.1007/s00521-022-07738-9},
  dimensions = {true}
}

2020

Genes
Accurately predicting glutarylation sites using sequential bi-peptide-based evolutionary features

Md Easin Arafat, Md Wakil Ahmad, SM Shovan, Abdollah Dehzangi, Shubhashis Roy Dipta, Md Al Mehedi Hasan, Ghazaleh Taherzadeh, Swakkhar Shatabda, and Alok Sharma

Genes, Jul 2020

Post Translational Modification (PTM) is defined as the alteration of protein sequence upon interaction with different macromolecules after the translation process. Glutarylation is considered one of the most important PTMs, which is associated with a wide range of cellular functions, including metabolism, translation, and specified separate subcellular localizations. During the past few years, a wide range of computational approaches has been proposed to predict Glutarylation sites. However, despite all the efforts that have been made so far, the prediction performance of the Glutarylation sites has remained limited. One of the main challenges in tackling this problem is extracting features with significant discriminatory information. To address this issue, we propose a new machine learning method called BiPepGlut using the concept of a bi-peptide-based evolutionary method for feature extraction. To build this model, we also use the Extra-Trees (ET) classifier for classification, which, to the best of our knowledge, has never been used for this task. Our results demonstrate that BiPepGlut is able to significantly outperform previously proposed models for this problem. BiPepGlut achieves 92.0%, 84.8%, 95.6%, 0.82, and 0.88 in accuracy, sensitivity, specificity, Matthews correlation coefficient, and F1-score, respectively. BiPepGlut is implemented as a publicly available online predictor.
@article{arafat2020accurately,
  title = {Accurately predicting glutarylation sites using sequential bi-peptide-based evolutionary features},
  author = {Arafat, Md Easin and Ahmad, Md Wakil and Shovan, SM and Dehzangi, Abdollah and Dipta, Shubhashis Roy and Hasan, Md Al Mehedi and Taherzadeh, Ghazaleh and Shatabda, Swakkhar and Sharma, Alok},
  journal = {Genes},
  volume = {11},
  number = {9},
  pages = {1023},
  year = {2020},
  publisher = {MDPI},
  doi = {10.3390/genes11091023},
  dimensions = {true}
}
IEEE Access
Mal-light: Enhancing lysine malonylation sites prediction problem using evolutionary-based features

Md Wakil Ahmad, Md Easin Arafat, Ghazaleh Taherzadeh, Alok Sharma, Shubhashis Roy Dipta, Abdollah Dehzangi, and Swakkhar Shatabda

IEEE Access, Jul 2020

Post Translational Modification (PTM) is considered an important biological process with a tremendous impact on the function of proteins in both eukaryotic and prokaryotic cells. During the past decades, a wide range of PTMs has been identified. Among them, malonylation is a recently identified PTM which plays a vital role in a wide range of biological interactions. Moreover, this modification plays a potential role in energy metabolism in different species including Homo sapiens. The identification of PTM sites using experimental methods is time-consuming and costly. Hence, there is a demand for fast and cost-effective computational methods. In this study, we propose a new machine learning method, called Mal-Light, to address this problem. To build this model, we extract local evolutionary-based information according to the interaction of neighboring amino acids using a bi-peptide based method. We then use Light Gradient Boosting (LightGBM) as our classifier to predict malonylation sites. Our results demonstrate that Mal-Light is able to significantly improve malonylation site prediction performance compared to previous studies found in the literature. Using Mal-Light we achieve Matthews correlation coefficient (MCC) of 0.74 and 0.60, Accuracy of 86.66% and 79.51%, Sensitivity of 78.26% and 67.27%, and Specificity of 95.05% and 91.75%, for Homo sapiens and Mus musculus proteins, respectively. Mal-Light is implemented as an online predictor which is publicly available at http://brl.uiu.ac.bd/MalLight/.
@article{ahmad2020mal,
  title = {Mal-light: Enhancing lysine malonylation sites prediction problem using evolutionary-based features},
  author = {Ahmad, Md Wakil and Arafat, Md Easin and Taherzadeh, Ghazaleh and Sharma, Alok and Dipta, Shubhashis Roy and Dehzangi, Abdollah and Shatabda, Swakkhar},
  journal = {IEEE Access},
  volume = {8},
  pages = {77888--77902},
  year = {2020},
  publisher = {IEEE},
  doi = {10.1109/access.2020.2989713},
  dimensions = {true}
}
Elsevier
SEMal: Accurate protein malonylation site predictor using structural and evolutionary information

Shubhashis Roy Dipta, Ghazaleh Taherzadeh, Md Wakil Ahmad, Md Easin Arafat, Swakkhar Shatabda, and Abdollah Dehzangi

Computers in Biology and Medicine, Jul 2020

Post Translational Modification (PTM) is a vital process which plays an important role in a wide range of biological interactions. One of the most recently identified PTMs is Malonylation. It has been shown that Malonylation has an important impact on different biological pathways including glucose and fatty acid metabolism. Malonylation can be detected experimentally using mass spectrometry. However, this process is both costly and time-consuming, which has inspired research to find faster and more efficient computational methods to solve this problem. This paper proposes a novel approach, called SEMal, to identify Malonylation sites in protein sequences. It uses both structural and evolutionary-based features to solve this problem. It also uses Rotation Forest (RoF) as its classification technique to predict Malonylation sites. To the best of our knowledge, our extracted features as well as our employed classifier have never been used for this problem. SEMal outperforms the previously proposed methods in all metrics, such as sensitivity (0.94 and 0.89), accuracy (0.94 and 0.91), and Matthews correlation coefficient (0.88 and 0.82), for Homo sapiens and Mus musculus species, respectively. SEMal is publicly available as an online predictor at: http://brl.uiu.ac.bd/SEMal/.
@article{dipta2020semal,
  title = {SEMal: Accurate protein malonylation site predictor using structural and evolutionary information},
  author = {Dipta, Shubhashis Roy and Taherzadeh, Ghazaleh and Ahmad, Md Wakil and Arafat, Md Easin and Shatabda, Swakkhar and Dehzangi, Abdollah},
  journal = {Computers in Biology and Medicine},
  volume = {125},
  pages = {104022},
  year = {2020},
  publisher = {Elsevier},
  doi = {10.1016/j.compbiomed.2020.104022},
  dimensions = {true}
}