Spring 2025 Projects Mini Conference
Deep Learning for Autism Screening Using Transfer Learning on Video Data
We propose to develop a deep learning-based classifier for early autism screening by analyzing video recordings of children’s behavior and gaze. By leveraging transfer learning from pre-trained visual models and integrating temporal modeling techniques, our approach aims to provide a robust probability estimate of Autism Spectrum Disorder (ASD) from standard video recordings. This project has the potential to improve early screening accuracy while using accessible, low-cost hardware.
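As a minimal sketch of one such architecture, the snippet below pairs a frozen pre-trained image backbone with an LSTM over per-frame features; the backbone choice, hidden size, and single-logit head are illustrative assumptions, not the project's final design.

```python
import torch
import torch.nn as nn
from torchvision import models

class VideoASDClassifier(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        for p in self.features.parameters():
            p.requires_grad = False          # transfer learning: reuse pretrained weights
        self.lstm = nn.LSTM(512, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)     # single logit -> screening probability

    def forward(self, clips):                # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        f = self.features(clips.flatten(0, 1)).flatten(1)  # (B*T, 512) frame features
        _, (h, _) = self.lstm(f.view(b, t, -1))            # temporal modeling over frames
        return torch.sigmoid(self.head(h[-1]))             # (B, 1) ASD probability estimate
```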
Cricket Commentary Generation Using LLaVA
This project aims to develop a deep learning model for real-time cricket commentary using video highlights. We leverage Whisper for speech-to-text transcription of existing commentary and LLaVA for generating context-aware commentary. Training on structured cricket event data ensures accuracy and engagement, advancing automated sports broadcasting with vision-language modeling.
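For the transcription stage, the open-source whisper package already exposes the needed call; a minimal sketch, where the highlight file name is a placeholder:

```python
import whisper

model = whisper.load_model("base")
result = model.transcribe("highlight_clip.mp4")   # ffmpeg extracts the audio track
for seg in result["segments"]:
    print(f"[{seg['start']:.1f}s - {seg['end']:.1f}s] {seg['text']}")
```

The timestamped segments can then be aligned with the video frames that LLaVA conditions on.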
Legal Contract Dataset
We aim to create a labeled dataset of contract clauses that will serve as a foundation for training and evaluating AI models in legal contract NLP. By curating and annotating contracts with a focus on clause categorization and summarization, our dataset will contribute to the development of AI-driven tools for contract analysis and verification.
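To make the annotation target concrete, one record of the dataset might look like the following; the field names and label set are assumptions, not the final schema:

```python
clause_record = {
    "contract_id": "C-0042",
    "clause_text": ("Either party may terminate this Agreement upon "
                    "thirty (30) days' written notice."),
    "category": "termination",   # e.g. termination, indemnification, confidentiality
    "summary": "Either party can end the agreement with 30 days' notice.",
}
```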
Bone Fracture Graphical Chatbot
Bone fractures are a significant healthcare challenge, requiring timely and accurate diagnosis for effective treatment. However, radiographic interpretation remains prone to variability and depends heavily on specialized expertise, limiting access to quality care in resource-constrained settings. Recent advances in LLMs and computer vision offer promising solutions for automating fracture detection and classification. In this study, we develop an LLM-powered chatbot integrated with a graph segmentation model for fracture diagnosis, leveraging computer vision to analyze medical images.
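One way to wire the two halves together is to serialize the segmentation model's findings into the LLM prompt; the helper below is a hypothetical sketch, and the region names, scores, and prompt template are all placeholders:

```python
def build_prompt(findings: list[dict], question: str) -> str:
    # Summarize segmentation output (hypothetical format) as LLM context.
    summary = "; ".join(f"{f['region']} (confidence {f['score']:.2f})" for f in findings)
    return (f"A fracture segmentation model flagged: {summary}.\n"
            f"Patient question: {question}\n"
            f"Answer cautiously and recommend clinical follow-up.")

print(build_prompt([{"region": "distal radius", "score": 0.91}], "Is my wrist broken?"))
```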
Efficient Monocular Depth Estimation
This project aims to develop a lightweight, efficient model for monocular depth estimation that runs smoothly on mobile systems. By using techniques like quantization, pruning, and knowledge distillation, along with efficient architectures such as MobileNetV3 and depth-wise separable convolutions, we strive to achieve fast inference without compromising depth accuracy.
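The depth-wise separable block at the heart of MobileNet-style backbones factorizes a standard convolution into a per-channel spatial convolution and a 1x1 pointwise convolution; a minimal PyTorch sketch:

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # groups=in_ch makes the 3x3 conv operate on each channel independently
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)  # mixes channels
        self.bn1, self.bn2 = nn.BatchNorm2d(in_ch), nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU6(inplace=True)

    def forward(self, x):
        x = self.act(self.bn1(self.depthwise(x)))
        return self.act(self.bn2(self.pointwise(x)))
```

For 3x3 kernels and typical channel counts, this factorization costs roughly 8-9x fewer multiply-adds than a dense convolution, which is what makes fast mobile inference plausible.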
Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning
Deep learning for radiologic image analysis is a rapidly growing field in biomedical research and is likely to become standard practice in modern medicine. The authors will use a publicly available Kaggle dataset of chest X-ray images labeled for the presence or absence of pneumonia to train a neural network that diagnoses pneumonia as accurately as trained radiologists. Success will be measured by the model’s F1 score, a standard metric for evaluating classification models.
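Since F1 is the headline metric, it helps to fix its definition up front: the harmonic mean of precision and recall. A toy check with scikit-learn (the labels are illustrative):

```python
from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0, 1]      # 1 = pneumonia present
y_pred = [1, 0, 1, 0, 0, 1]
print(f1_score(y_true, y_pred))  # 2PR/(P+R); here P = 1.0, R = 0.75 -> 0.857
```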
Efficient Computer Vision: Pushing the Frontier of Edge Computing
The Low Power Computer Vision Challenge (LPCVC) advances three critical directions in computer vision by focusing on model optimization for edge devices. Leveraging the Qualcomm AI-Hub ecosystem, this competition enables developers to deploy efficient vision models on mobile phones and AI PCs. Participants can submit models in various formats (PyTorch, TensorFlow, TFLite, ONNX), making the challenge globally accessible.
Unlike traditional competitions that rely on cloud computing, LPCVC emphasizes practical applications that run directly on edge devices with minimal hardware requirements. This approach democratizes participation for developers, researchers, and students worldwide. The challenge provides open-source sample solutions as qualification benchmarks, with winning solutions and test datasets being publicly released to foster continued innovation in efficient computer vision.
In this paper, we propose a solution to the LPCVC challenge that combines model architecture improvements such as quantization, pruning, and token-efficient methods. In addition, we enhance the training pipeline with 8-bit General Matrix Multiplication (GEMM) and Multi-head Latent Attention (MLA). Finally, we propose techniques for edge device optimization. Our solution is designed to be lightweight, efficient, and to run directly on edge devices with minimal hardware requirements.
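As one concrete ingredient, post-training dynamic quantization stores weights in 8 bits; a minimal PyTorch sketch on a toy model (the real submission would export to the formats Qualcomm AI-Hub accepts rather than remain in eager PyTorch):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8  # int8 weights; activations quantized on the fly
)
print(quantized)
```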
ChainThought: Enhancing Reasoning Capabilities in Language Models Through Step-by-Step Problem Solving
This project aims to develop a fine-tuned language model that demonstrates explicit chain-of-thought reasoning capabilities for solving mathematical problems. By training a DeepSeek-7B model on the GSM8K dataset with step-by-step reasoning annotations, we seek to create an AI system that not only produces correct answers but also transparently explains its problem-solving process, similar to how students show their work in educational settings.
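One plausible way to turn a GSM8K item into a supervised chain-of-thought target is shown below; the prompt template is our assumption, not a fixed standard:

```python
def format_cot(question: str, steps: list[str], answer: str) -> str:
    reasoning = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(steps))
    return (f"Question: {question}\n"
            f"Let's think step by step.\n{reasoning}\nAnswer: {answer}")

print(format_cot(
    "Tom has 3 boxes of 12 pencils and gives away 10. How many remain?",
    ["3 boxes * 12 pencils = 36 pencils", "36 - 10 = 26 pencils"],
    "26",
))
```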
Efficient Open-Vocabulary Models for Low-Power Computer Vision
This project aims to improve open-vocabulary grounding segmentation using the LPCVC 2025 dataset and model framework. We will start by reproducing the provided sample code and then enhance it using Focal-T/ViT-B models trained on COCO and evaluated on RefCOCOg. Our goal is to refine segmentation performance by experimenting with model improvements such as fine-tuning strategies and data augmentation, as sketched below.
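One candidate augmentation recipe for the fine-tuning runs; because grounding segmentation pairs each image with a mask and a referring expression, geometric transforms must be applied to image and mask together (and flips can invalidate "left"/"right" in the text), so only photometric operations are shown. Parameters are placeholders.

```python
import torchvision.transforms as T

photometric = T.Compose([
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    T.RandomGrayscale(p=0.05),  # mask and referring text stay valid under photometric ops
])
```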
AI-Based Car Crash Detection Using Deep Learning
Car crash detection is a critical application of AI-driven traffic surveillance, enabling real-time accident identification and response. This project aims to develop a deep learning-based system that detects car crashes from video and audio data using multimodal approaches. We will integrate computer vision techniques for video analysis, optical flow for motion tracking, autoencoders for anomaly detection, and audio classification for crash sound recognition. Our approach will leverage multiple datasets, including DoTA, UCF-Crime (road accidents), and BDD100K (normal driving footage). The final system will be evaluated based on precision, recall, and the Spatial-Temporal Area Under Curve (STAUC) metric.
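For the motion-tracking component, dense optical flow already yields a usable anomaly signal; a sketch with OpenCV's Farneback implementation (the parameters and the mean-magnitude statistic are illustrative):

```python
import cv2

def flow_magnitude(prev_bgr, curr_bgr):
    prev_g = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    curr_g = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_g, curr_g, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    return mag.mean()   # a sudden spike in mean flow can flag abnormal motion
```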
Enhanced Relationformer for Linear Detection
Detection of linear structures is crucial for many computer vision tasks. While Relationformer has advanced inter-object relationship modeling, it struggles with endpoint prediction and matching stability. To address these issues, we will first reproduce and validate Relationformer, and then propose a distance-weighted approach that replaces the Dirac delta distribution to enhance endpoint localization. Moreover, inspired by Generalized Focal Loss, we will mitigate errors arising from negative samples.
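The distance-weighted idea can be made concrete in the style of Generalized Focal Loss's distribution term: instead of a one-hot (Dirac) target bin, the continuous endpoint offset is spread over its two nearest bins. A minimal sketch, where the bin count and discretization are assumptions:

```python
import torch
import torch.nn.functional as F

def distribution_endpoint_loss(logits, target):
    """logits: (N, bins) scores over discretized offsets; target: (N,) in [0, bins-1]."""
    left = target.floor().long()
    right = (left + 1).clamp(max=logits.size(1) - 1)
    w_right = target - left.float()          # linear weights on the two nearest bins
    w_left = 1.0 - w_right
    logp = F.log_softmax(logits, dim=1)
    return -(w_left * logp.gather(1, left[:, None]).squeeze(1)
             + w_right * logp.gather(1, right[:, None]).squeeze(1)).mean()
```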
Machine Learning and Physics
In this project, we will explore research at the interface of machine learning and physics. More specifically, we will attempt to recreate the plots of the papers published by Sharma et al. and Nguyen et al., and then test their models on unseen astrophysical data. To verify that our recreations are accurate, we will compare them qualitatively with the plots from the papers and show consistency with theory when testing on unseen experimental data. The end goal of our project is a report on the success of our model recreation as well as a pedagogical presentation to shed light on this emerging field of study.
Kilter Board Beta and Problem Generation
Rock climbing is a sport of problem-solving, especially in bouldering. The Kilter board is a standardized training board for bouldering, and many climbers initially struggle to read where to put their hands and feet. This tool will read a route and suggest an optimal path for the climber to attempt. In addition, the tool will generate similar routes at a grade the climber selects.
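A hypothetical encoding of a board problem, just to fix the data the tool would consume; the grid coordinates and role names are placeholders:

```python
problem = {
    "grade": "V4",
    "holds": [
        {"x": 4, "y": 1,  "role": "start"},
        {"x": 6, "y": 5,  "role": "hand"},
        {"x": 3, "y": 8,  "role": "foot"},
        {"x": 5, "y": 12, "role": "finish"},
    ],
}
# A generated "beta" would then be a sequence of (limb, hold) moves over these holds.
```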
Deep Learning Approach on Music Recommendation System
This project aims to develop a deep learning-based music recommendation system using content-based filtering to suggest relevant music to users according to distinct characteristics, such as danceability, tempo, and energy. By training neural networks on these features, the model will map songs to a latent space where similar music is grouped together. Essentially, this project demonstrates the feasibility of personalizing music discovery using AI and contributes to the improvement of overall recommendation systems.
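A minimal sketch of the latent-space idea: a small encoder maps the listed audio features to embeddings, and recommendations come from cosine similarity. The network size and three-feature input are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 16))

songs = torch.tensor([[0.8, 0.6, 0.7],   # [danceability, tempo (scaled), energy]
                      [0.2, 0.3, 0.1],
                      [0.7, 0.5, 0.9]])
z = F.normalize(encoder(songs), dim=1)   # unit-length latent embeddings
sims = z @ z.T                           # pairwise cosine similarities
print(sims[0].topk(2).indices)           # nearest songs to song 0 (includes itself)
```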
Predicting Alzheimer’s Disease using Structural MRIs: Activation Map Analysis of Memory Related Brain Regions
Alzheimer’s Disease (AD) is a progressive neurodegenerative disorder and the primary cause of dementia, characterized by significant memory loss and cognitive decline. This project aims to develop a deep learning model, trained on 3D T1-weighted structural MRI scans, that analyzes structural patterns in memory-related regions of the brain in order to predict AD onset. The model will be refined using insights from activation maps of established models to enhance accuracy and consistency, with the end goal of improving early detection and providing a more interpretable and reliable approach to predicting Alzheimer’s.
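Activation maps of the kind the project would inspect can be captured with a forward hook on a 3D CNN; the tiny model and random volume below are stand-ins for the real architecture and T1-weighted scans:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv3d(1, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(8, 2),  # AD vs. control logits
)

activations = {}
model[0].register_forward_hook(lambda m, i, o: activations.update(conv1=o.detach()))

scan = torch.randn(1, 1, 64, 64, 64)   # toy stand-in for a T1-weighted volume
logits = model(scan)
print(activations["conv1"].shape)      # (1, 8, 64, 64, 64): per-voxel activation maps
```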
Smart Multimodal Classroom Video Recorder
This project proposes a smart multimodal classroom video recording system that automatically composes multiple content streams—camera feeds, slides, and whiteboard—based on real-time cues like gestures and spoken references. By leveraging computer vision, automatic speech recognition (ASR), and content analysis, it can dynamically pan, zoom, and switch between sources to create a more engaging, context-aware lecture recording. The goal is to overcome the limitations of static cameras and provide a richer, more immersive experience for both live and recorded viewers.
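At its simplest, the composition logic reduces to a cue-to-source policy like the toy rule below; the cue names are assumptions about what the gesture and ASR modules would emit:

```python
def pick_stream(cues: set[str]) -> str:
    if "points_at_board" in cues or "writes_on_board" in cues:
        return "whiteboard"
    if "says_next_slide" in cues or "slide_changed" in cues:
        return "slides"
    return "camera"   # default: keep the lecturer in frame

print(pick_stream({"slide_changed"}))   # -> "slides"
```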
AI-Powered Storyboard Generator Using Prompt-Based Diffusion Models
Storyboarding is an essential process in films, animations, and comics, providing a structured visual representation of narratives before production begins. However, traditional storyboard creation is time-consuming and requires artistic expertise. This project introduces an AI-powered storyboard generator that leverages prompt-based diffusion models and large language models (LLMs). The system will automate the creation of frame-by-frame storyboards by collecting and processing comic and storyboard datasets, generating scene descriptions using image captioning models, and fine-tuning a diffusion model for sequential image generation. The approach follows an agent-based architecture, where frames are generated iteratively to ensure narrative coherence. This project builds upon research in script-to-storyboard automation, generative storytelling, and AI-assisted animation, enhancing efficiency, accessibility, and creative possibilities in storyboard creation.
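The frame-generation step maps naturally onto the Hugging Face diffusers API; a sketch, where the base model id is a placeholder for whichever checkpoint gets fine-tuned and the prompts come from the LLM stage:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",  # placeholder base checkpoint
    torch_dtype=torch.float16,
).to("cuda")

scene_descriptions = [
    "A detective enters a dim office, noir storyboard style",
    "Close-up of the detective reading a letter, noir storyboard style",
]
frames = [pipe(prompt).images[0] for prompt in scene_descriptions]
```

In the agent loop, each new prompt would also carry a summary of the previously generated frames to keep the panels narratively coherent.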
Reproduction of DeepSeek R1 Zero: A Case Study on Math Tasks
This project aims to reproduce the improvement of reasoning ability in DeepSeek R1 Zero. Specifically, we will fine-tune the Qwen-2.5 1.5B model using PPO and GRPO. Our main focus is on exploring the impact of hyperparameters on LLM fine-tuning performance.
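The part of GRPO that differs from PPO fits in a few lines: rewards for a group of sampled completions per prompt are normalized within the group, replacing a learned value baseline. A minimal sketch:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (groups, samples_per_prompt), e.g. 1.0 if the answer is correct."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)   # group-relative advantages, no value network

print(grpo_advantages(torch.tensor([[1.0, 0.0, 0.0, 1.0]])))
```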
Real-Time Localized Image Enhancement via YOLOv8 and Lightweight Super Resolution Model
This project proposes a real-time framework for enhancing image clarity in localized regions by integrating YOLOv8 for object detection and a lightweight super-resolution (SR) model. Targeting applications such as video conferencing and live streaming, the system dynamically identifies regions of interest (e.g., human subjects) and applies adaptive SR enhancement while maintaining a processing speed of 30+ FPS on consumer-grade GPUs. Our approach balances computational efficiency and perceptual quality through model optimization and/or hardware acceleration.
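The detect-then-enhance loop is sketched below with the ultralytics API; bicubic upsampling stands in for the lightweight SR network, which would slot into the marked line:

```python
import cv2
from ultralytics import YOLO

detector = YOLO("yolov8n.pt")

def enhance_rois(frame):
    results = detector(frame, classes=[0])   # class 0 = "person" in COCO
    patches = []
    for box in results[0].boxes.xyxy:
        x1, y1, x2, y2 = map(int, box.tolist())
        roi = frame[y1:y2, x1:x2]
        # placeholder for the lightweight SR model
        sr = cv2.resize(roi, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
        patches.append(((x1, y1, x2, y2), sr))
    return patches
```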
Fine-Grained Visual Classification of Bird Species
This project explores Fine-Grained Visual Classification (FGVC) using deep learning, focusing on the CUB-200-2011 dataset. We aim to develop a model leveraging ResNet50 as a baseline, with improvements from attention mechanisms and Supervised Contrastive Learning to enhance classification accuracy and interpretability.
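A compact sketch of the supervised contrastive objective (Khosla et al., 2020): embeddings sharing a species label are pulled together and all others pushed apart. The temperature value is illustrative:

```python
import torch

def supcon_loss(features, labels, temperature=0.1):
    """features: (N, D) L2-normalized embeddings; labels: (N,) species ids."""
    sim = features @ features.T / temperature
    self_mask = torch.eye(len(labels), dtype=torch.bool)
    sim = sim.masked_fill(self_mask, -1e9)                 # drop self-similarities
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    return -(log_prob * pos).sum(1).div(pos.sum(1).clamp(min=1)).mean()
```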
ProtMotifGen: Protein Motif Generation
This project aims to develop a conditional protein generative model that can generate novel proteins, with both amino acid sequences and 3D structures, conditioned on a specific target motif, i.e., a functional subunit. It will combine existing protein generative models, such as EvoDiff or ProteinMPNN, with protein structure prediction models, such as AlphaFold or OmegaFold, to validate the structural feasibility of the novel proteins containing the motif.
Explaining DeepSeek with Manim
New AI models such as DeepSeek sit at the cutting edge of the field and thus often lack easily digestible resources explaining how they work. This project aims to create a video using the open-source library Manim to explain the math behind DeepSeek, as well as how the model works, to a viewer without any prior knowledge of the subject.
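A minimal Manim scene of the kind the video would be assembled from; the formula shown (softmax) is only a stand-in for the DeepSeek math to be covered:

```python
from manim import Scene, MathTex, Write

class SoftmaxIntro(Scene):
    def construct(self):
        eq = MathTex(r"\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}")
        self.play(Write(eq))   # animate the formula being written out
        self.wait(2)

# render with: manim -pql explain.py SoftmaxIntro
```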
Enhancing AI-Generated Image Detection: A Comparative Study of CNNs, Transformers, and Contrastive Learning
AI has reached a point where it can generate highly realistic faces, scenes, and objects. This study addresses the problem of distinguishing AI-generated visuals from authentic photographs using a unique dataset, “AI vs. Human-Generated Images,” from a Kaggle competition. Unlike conventional datasets, this dataset provides paired images where each real image has a corresponding AI-generated counterpart, allowing for direct comparative analysis. We leverage this structured pairing within a deep learning framework, incorporating convolutional neural networks (CNNs) and transformer-based architectures to develop robust classifiers. In addition, we explore contrastive learning to enhance feature discrimination, hypothesizing that it improves generalization by enforcing a more distinct separation between real and AI-generated images.
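The paired structure suggests a simple contrastive term on top of the classifiers: push each real image's embedding away from its AI-generated counterpart. A hedged sketch, where the margin and embedding shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def pair_margin_loss(z_real, z_fake, margin=1.0):
    """z_real, z_fake: (N, D) embeddings of each real image and its AI twin."""
    d = F.pairwise_distance(z_real, z_fake)
    return F.relu(margin - d).mean()   # penalize pairs whose embeddings stay too close
```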