Abstract
Mental health conditions affect hundreds of millions globally, yet early detection remains limited. While large language models
(LLMs) have shown promise in mental health applications, their size and computational demands hinder practical deployment. Small
language models (SLMs) offer a lightweight alternative, but their use for social media–based mental health prediction remains largely
underexplored. In this study, we introduce Menta, the first optimized SLM fine-tuned specifically for multi-task mental health prediction
from social media data. Menta is jointly trained across six classification tasks using a LoRA-based framework, a cross-dataset strategy,
and a balanced accuracy–oriented loss. Evaluated against nine state-of-the-art SLM baselines, Menta achieves an average improvement
of 15.2% across tasks covering depression, stress, and suicidality compared with the best-performing non–fine-tuned SLMs. It also
achieves higher accuracy on depression and stress classification tasks compared to 13B-parameter LLMs, while being approximately
3.25× smaller. Moreover, we demonstrate real-time, on-device deployment of Menta on an iPhone 15 Pro Max, requiring only
approximately 3GB RAM. Supported by a comprehensive benchmark against existing SLMs and LLMs, Menta highlights the potential
for scalable, privacy-preserving mental health monitoring.
1. Overview
Menta is a small language model for digital mental health prediction from social media text. Instead of relying on large server-side LLMs, Menta focuses on early screening of stress, depression, and suicidality in a way that is lightweight enough to run fully on consumer devices such as smartphones.
The model is jointly trained on six Reddit-based classification tasks that cover stress, depression severity, suicidal ideation, and suicide risk categories. This multi-task setup encourages shared representations across related conditions while preserving task-specific decision boundaries.
- 4B-parameter small language model with a Qwen-style backbone.
- Six mental health classification tasks collected from expert-annotated Reddit corpora.
- LoRA-based multi-task fine-tuning for efficient adaptation.
- Balanced-accuracy–aware optimization to handle imbalanced labels.
- Demonstrated real-time on-device deployment for privacy-preserving mental health screening.
2. Model and Training
Menta is built on top of a 4B-parameter transformer-based small language model and fine-tuned with parameter-efficient LoRA adapters for multi-task mental health prediction. The base model remains mostly frozen while LoRA layers capture task-specific adaptations.
The training pipeline uses a shared transformer backbone with task-specific classification heads. LoRA adapters are inserted into attention projections (e.g., query and value matrices), enabling effective fine-tuning while updating only a small fraction of the total parameters.
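As a concrete illustration, this adapter placement can be expressed with Hugging Face PEFT. The snippet below is a minimal sketch: the exact base checkpoint, rank, and alpha are assumptions, since the text specifies only a 4B Qwen-style backbone with LoRA on the attention projections.

```python
# Minimal sketch of the LoRA setup described above, using Hugging Face PEFT.
# The base checkpoint, rank, and alpha are illustrative assumptions; only the
# 4B Qwen-style backbone and the query/value placement come from the text.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B")  # assumed backbone

lora_cfg = LoraConfig(
    r=16,                                 # adapter rank (assumption)
    lora_alpha=32,                        # scaling factor (assumption)
    target_modules=["q_proj", "v_proj"],  # attention query/value projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only a small fraction of weights train
```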
- Parameter-efficient tuning. Only LoRA parameters and classifier heads are trained; base model weights are frozen, substantially reducing GPU memory requirements.
- Task sampling. A task-level sampling strategy mitigates dataset size imbalance, preventing large datasets from dominating the multi-task objective.
- Class imbalance handling. Class-weighted cross-entropy and a balanced-accuracy–aware term are combined to encourage robust performance on minority classes; this and the task sampling above are illustrated in the sketch after this list.
- Joint optimization. All six tasks are optimized in a single training run, encouraging the model to share knowledge across stress, depression, and suicidality detection.
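The following is a minimal PyTorch sketch of how these ingredients could fit together. The soft balanced-accuracy surrogate and the weighting `lam` are assumptions for illustration; the exact objective and sampling schedule used in training may differ.

```python
import random

import torch
import torch.nn.functional as F


def multitask_step_loss(logits, labels, class_weights, lam=0.5):
    """Class-weighted cross-entropy plus a soft balanced-accuracy penalty.

    The penalty uses a differentiable surrogate: the mean, over classes
    present in the batch, of the probability assigned to the true class.
    `lam` is an assumed weighting, not taken from the paper.
    """
    ce = F.cross_entropy(logits, labels, weight=class_weights)
    probs = logits.softmax(dim=-1)
    per_class = [probs[labels == c, c].mean() for c in labels.unique().tolist()]
    soft_balanced_acc = torch.stack(per_class).mean()
    return ce + lam * (1.0 - soft_balanced_acc)


def sample_task_batch(task_iters, task_loaders):
    """Uniform task-level sampling so large datasets do not dominate."""
    task = random.choice(list(task_loaders))
    try:
        batch = next(task_iters[task])
    except StopIteration:  # restart an exhausted loader
        task_iters[task] = iter(task_loaders[task])
        batch = next(task_iters[task])
    return task, batch
```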
3. Results
3.1 Overall performance compared with other small language models
We first compare Menta with several strong small language models used without mental health fine-tuning, including Phi-4 Mini, StableLM, and Falcon. On the six tasks covering depression, stress, and suicidality, Menta clearly raises the average level of performance: average accuracy improves by about 15.2 percentage points over the best of these non-fine-tuned models.
Figure 6 shows accuracy and balanced accuracy scores for each task. Menta's bars are usually the tallest in both plots, meaning the model not only predicts correctly more often overall, but also treats minority labels such as severe depression and high suicide risk in a more balanced way. In other words, it does not simply favor the majority class, which is important for safety-critical applications.
3.2 Multi-task Menta compared with task-specific variants
We then ask whether it is better to train one shared model or a separate model for each task. To this end, we fine-tune six task-specific variants, named Menta T1 to Menta T6, each trained on a single task only. All of them use the same backbone and the same LoRA capacity as the general Menta model.
The radar charts in Figure 7 summarize the results. The red curve for the general Menta model tends to lie near the outer boundary on almost all axes. The curves for the task-specific variants sometimes peak slightly higher on their own task, but they drop more on the remaining tasks and produce an irregular shape. Overall, the shared multi-task model reaches higher average accuracy and balanced accuracy and shows a more even profile across all six tasks.
This suggests that learning all tasks together helps the model capture common patterns in mental health language while maintaining good performance on each individual task. For deployment this is attractive: one compact 4B-parameter model suffices instead of six separate models.
3.3 Case study on depression severity
Finally, we examine one example from the depression severity task in more detail. The post describes a sudden and very uncomfortable feeling on the tongue that has lasted for about twelve hours. The writer says the tongue has really started to hurt and asks whether anyone else has felt something similar. In the dataset this case is labeled as the minimum level of depression.
In Figure 10, the upper box contains the original post; the lower-left part shows Menta's prediction and reasoning, and the lower-right part shows the output of Qwen3. Menta correctly assigns the minimum level. Its explanation focuses on the fact that the main concern is a strange but localized physical sensation, not sadness, hopelessness, or other typical signs of a depressive episode. Menta reads the message as a request for medical reassurance and keeps the predicted severity low.
Qwen3, in contrast, treats the same post as a case of severe distress. Its reasoning reacts strongly to phrases about intense pain and long duration, connecting them directly with high anxiety and serious psychological problems, and as a result it overestimates the level of depression. This case study illustrates that, after domain-specific training, Menta is better at separating physical complaints from genuine mental health symptoms and follows the dataset's labeling rules more closely.
4. Datasets
Menta is trained and evaluated on four expert-annotated Reddit corpora, organized into six classification tasks that cover stress, depression, suicidal ideation, and suicide risk.
| Task | Dataset | Label type |
|---|---|---|
| Task 1 | Dreaddit | Stress vs. non-stress |
| Task 2 | Depression severity dataset | Binary depression |
| Task 3 | Depression severity dataset | Multi-level depression severity |
| Task 4 | SDCNL | Suicidal ideation vs. non-ideation |
| Task 5 | CSSRS-based suicide risk dataset | Binary suicide risk |
| Task 6 | CSSRS-based suicide risk dataset | Multi-level suicide risk categories |
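For illustration, this mapping can be mirrored in code as a simple task registry. The dataset keys and label strings below are placeholders; the multi-level label sets are defined by the underlying corpora and are not reproduced from the paper.

```python
# Hypothetical task registry mirroring the table above. Label strings for the
# multi-level tasks are placeholders, not the corpora's exact category names.
TASKS = {
    "task1": {"dataset": "Dreaddit", "labels": ["non-stress", "stress"]},
    "task2": {"dataset": "DepressionSeverity", "labels": ["not_depressed", "depressed"]},
    "task3": {"dataset": "DepressionSeverity",
              "labels": ["minimum", "mild", "moderate", "severe"]},
    "task4": {"dataset": "SDCNL", "labels": ["non-ideation", "ideation"]},
    "task5": {"dataset": "CSSRS", "labels": ["no_risk", "at_risk"]},
    "task6": {"dataset": "CSSRS",
              "labels": ["risk_level_1", "risk_level_2", "risk_level_3"]},  # placeholder levels
}

# Per-task output dimensions for the classification heads.
NUM_CLASSES = {task: len(cfg["labels"]) for task, cfg in TASKS.items()}
```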
5. On-Device Deployment
We deploy Menta on mobile devices using a lightweight inference stack with quantized weights; on an iPhone 15 Pro Max, inference runs in real time within approximately 3 GB of RAM. The goal is to enable privacy-preserving, real-time mental health screening directly on user devices without uploading raw text to remote servers. A minimal inference sketch follows.
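The memory figure is consistent with a back-of-the-envelope estimate: 4B parameters at roughly 4-bit quantization occupy about 2 GB of weights, with the remainder going to the KV cache and runtime overhead. As a desktop analogue of the mobile stack, a quantized export could be exercised with llama-cpp-python; the GGUF filename and prompt format below are assumptions, and the actual iOS runtime is not reproduced here.

```python
# Minimal sketch of quantized inference with llama-cpp-python. The GGUF
# filename and prompt format are hypothetical; the on-device iOS runtime
# described in the paper is not shown here.
from llama_cpp import Llama

llm = Llama(model_path="menta-4b-q4_k_m.gguf", n_ctx=2048)  # assumed 4-bit checkpoint

prompt = (
    "Classify the following post as 'stress' or 'non-stress'.\n"
    "Post: I can't sleep before exams and my heart races all day.\n"
    "Label:"
)
out = llm(prompt, max_tokens=4, temperature=0.0)
print(out["choices"][0]["text"].strip())
```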
6. Implementation and Resources
The full training and evaluation code, together with instructions for reproducing our experiments and running Menta on mobile devices, is available in the GitHub repository. Pre-trained weights and configuration files for different quantization levels are hosted on Hugging Face.
- GitHub: end-to-end training, evaluation, and on-device demo code.
- Hugging Face: model checkpoints and configuration files.
- Mobile demo: a reference iOS application showing multi-task predictions for stress, depression, and suicidality from example posts.
BibTeX
If you find Menta useful in your research, please cite it as:
@article{menta2025,
  title   = {Menta: A Small Language Model for On-Device Mental Health Prediction},
  author  = {Zhang, Tianyi and Xue, Xiangyuan and Ruan, Lingyan and Fu, Shiya
             and Xia, Feng and D'Alfonso, Simon and Kostakos, Vassilis
             and Dang, Ting and Jia, Hong},
  journal = {arXiv preprint arXiv:2512.02716},
  year    = {2025}
}