A Survey on Word Sense Disambiguation
WSD systems have gone from 65% to 89% F1 over four decades, mostly by getting better at context modeling. Nouns are nearly solved (94%). Verbs are not (74.6%). The remaining verb errors cluster around semantically similar senses, yet every system trains with cross-entropy, which treats all wrong answers equally. We test whether replacing random negatives with WordNet-guided hard negatives and a triplet margin loss improves verb disambiguation, using a small frozen backbone trainable on a laptop.
Introduction
Run has 41 senses in WordNet. Bank has 10. Picking the right one from context is trivial for humans and has been a core NLP problem since the 1980s. It matters because downstream tasks (translation, search, QA) all break in predictable ways when word sense is wrong.
The task is genuinely hard: you need local syntax, world knowledge, and commonsense simultaneously. It is sometimes called AI-complete for this reason.
WordNet
The standard sense inventory is Princeton WordNet (Miller, 1995)—a lexical graph where nodes are synsets (one concept, multiple surface forms) and edges are semantic relations.
car.n.01 = {car, auto, automobile, motorcar}, defined as "a motor vehicle with four wheels." Synsets connect upward via IS-A (car to vehicle to artifact) and laterally via PART-OF, SIMILAR-TO, etc.
WordNet 3.0: 117,000+ synsets, 206,941 word-sense pairs. Verbs average 2.17 senses; nouns 1.24. Verbs are harder.
Benchmark
Raganato et al. (2017) unified five shared tasks into one benchmark. Training: SemCor (Miller et al., 1994), 226,036 hand-labeled instances. Test: 7,253 instances across Senseval-2/3, SemEval-2007/2013/2015. Metric: micro-F1.
Prior Work
Knowledge-Based Systems (1986–2010)
Lesk
Lesk (1986): for a target word in context, compare its candidate sense glosses against the glosses of surrounding words; pick the sense with maximum word overlap. Extended Lesk (Banerjee and Pedersen, 2002) expands glosses with WordNet neighbors:
- Collect glosses of all WordNet neighbors of each candidate sense
- Collect glosses of all context words
- Pick the sense with highest overlap
Glosses are short (~12 words). Exact overlap is sparse. Tops out at 51–65% F1.
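The overlap scoring above can be sketched in a few lines. The mini sense inventory and glosses below are invented for illustration; the real algorithm reads glosses from a machine-readable dictionary or WordNet.

```python
# Toy Lesk: pick the sense whose gloss shares the most words with the context.
# The sense inventory below is invented for illustration.

def tokenize(text):
    return set(text.lower().split())

SENSES = {
    "bank.n.01": "a financial institution that accepts deposits and lends money",
    "bank.n.02": "sloping land beside a body of water such as a river",
}

def lesk(context, senses=SENSES):
    ctx = tokenize(context)
    # Score each candidate sense by word overlap between its gloss and the context.
    return max(senses, key=lambda s: len(tokenize(senses[s]) & ctx))

print(lesk("he sat on the bank of the river watching the water"))
# overlap with "water", "river" selects bank.n.02
```

The sparsity problem is visible even here: only exact token matches count, so a short gloss that paraphrases the context scores zero.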
UKB
Agirre and Soroa (2009) runs Personalized PageRank over the WordNet graph, seeded by context words. The highest-ranked synset wins. Reaches ~68% F1; no learning from examples.
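A minimal sketch of the Personalized PageRank step, on a toy four-node graph with invented edges and seeds; UKB runs the same iteration over the full WordNet graph.

```python
# Personalized PageRank on a tiny synset graph, seeded by a context word.
# Graph and seeds are invented for illustration.

def personalized_pagerank(graph, seeds, damping=0.85, iters=50):
    nodes = list(graph)
    # Teleport mass goes only to seed nodes (the "personalization").
    teleport = {n: (1 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iters):
        new = {}
        for n in nodes:
            # Mass flowing in from neighbors, split by their out-degree.
            inflow = sum(rank[m] / len(graph[m]) for m in nodes if n in graph[m])
            new[n] = (1 - damping) * teleport[n] + damping * inflow
        rank = new
    return rank

graph = {
    "bank.n.01": ["money.n.01"],
    "bank.n.02": ["river.n.01"],
    "money.n.01": ["bank.n.01"],
    "river.n.01": ["bank.n.02"],
}
rank = personalized_pagerank(graph, seeds={"river.n.01"})
print(rank["bank.n.02"] > rank["bank.n.01"])  # the river-side sense wins
```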
Supervised ML (2010–2016)
IMS
Zhong and Ng (2010) trains one SVM per ambiguous word on POS context, surrounding words, and bigram features. 68.9% F1. Hard limit: words with fewer than 10 training instances get 0%—the system abstains on unseen vocabulary.
Neural (2016–2018)
Static Embeddings
Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) assign one vector per word-type. The vector for bank is, in effect, a frequency-weighted average over its senses:

$$v_{\text{bank}} \approx \sum_{k} p(s_k)\, v_{s_k}$$
This is the meaning conflation problem (Camacho-Collados et al., 2016). The vector is uninformative for disambiguation.
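The conflation can be made concrete with toy 2-d sense vectors; the frequency 0.7 is an assumed value for illustration.

```python
# Meaning conflation: a type-level vector is roughly a frequency-weighted
# average of sense vectors, so it lands between the senses (toy 2-d vectors).

finance_sense = [1.0, 0.0]
river_sense = [0.0, 1.0]
p_finance = 0.7  # assumed corpus frequency of the financial sense

bank_vec = [p_finance * f + (1 - p_finance) * r
            for f, r in zip(finance_sense, river_sense)]
print(bank_vec)  # sits between both sense vectors, close to neither
```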
Bi-LSTM
Kågebäck and Salomonsson (2016): context-sensitive token representations via bidirectional LSTM. Different occurrences of bank get different vectors. Limited by fixed-window context. ~69% F1.
Transformers (2019–2023)
BERT
Devlin et al. (2019): full self-attention over the input sequence. Every token's representation is a function of every other token:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Fine-tuned on SemCor: 73.7% F1.
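The attention computation can be spelled out in plain Python. This sketch omits the learned query/key/value projections (Q = K = V = the input) to show only the mixing step.

```python
import math

# Scaled dot-product self-attention for one head, on toy 2-d token vectors.
# Learned Q/K/V projections are omitted; each output row mixes all tokens.

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(X):
    d = len(X[0])
    out = []
    for q in X:  # each token attends to every token, including itself
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, X)) for j in range(d)])
    return out

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(self_attention(X)[0])  # first token's output depends on all three inputs
```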
GlossBERT
Huang et al. (2019): instead of classifying sense IDs, run BERT on (sentence, gloss) pairs and score each candidate:
[CLS] He went to the bank. [SEP] A financial institution.
Scores every candidate and takes the argmax. 77.0% F1.
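The pair construction is simple to sketch; the glosses below are toy stand-ins, and the real model then scores each string with BERT.

```python
# GlossBERT-style input construction: one (sentence, gloss) pair per candidate
# sense. The model scores each pair; the argmax wins. Glosses are toy values.

CANDIDATES = {
    "bank.n.01": "a financial institution that accepts deposits",
    "bank.n.02": "sloping land beside a body of water",
}

def build_pairs(sentence, candidates=CANDIDATES):
    return [f"[CLS] {sentence} [SEP] {gloss} [SEP]"
            for gloss in candidates.values()]

for pair in build_pairs("He went to the bank."):
    print(pair)
```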
EWISER
Bevilacqua and Navigli (2020): post-hoc graph smoothing on BERT's output logits,

$$z' = z + Az,$$

where $A$ is a WordNet adjacency matrix over hypernyms and hyponyms. The graph touches only inference, not training. 80.1% F1.
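A simplified sketch of the smoothing step on invented logits and edges: each sense's score is boosted by its graph neighbors' scores. EWISER's actual formulation bakes the adjacency into a structured logit layer; the additive form here is the simplest version of the idea.

```python
# Simplified graph smoothing over sense logits: each sense is boosted by the
# scores of its WordNet neighbors (hypernyms/hyponyms). Values are toy.

def smooth_logits(logits, neighbors, alpha=0.5):
    return {
        s: z + alpha * sum(logits.get(n, 0.0) for n in neighbors.get(s, []))
        for s, z in logits.items()
    }

logits = {"bank.n.01": 2.0, "bank.n.02": 1.8, "depository.n.01": 1.5}
neighbors = {"bank.n.01": ["depository.n.01"]}  # hypernym edge
print(smooth_logits(logits, neighbors))
# bank.n.01 pulls ahead: 2.0 + 0.5 * 1.5 = 2.75
```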
ConSeC
Barba et al. (2021): autoregressive disambiguation, in which already-resolved senses are appended as context for subsequent words:
- For each word $w_i$ in the sentence, left to right:
  - Append the glosses of the already-resolved words $w_1, \dots, w_{i-1}$ to the input
  - Predict the sense of $w_i$
82.0% F1.
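The control flow above can be sketched as follows. `GLOSSES` and `predict_sense` are stand-ins invented for illustration; the real model is a trained transformer scoring candidate glosses.

```python
# Sketch of ConSeC's autoregressive loop: glosses of already-resolved words
# are appended to the input before predicting the next word's sense.

GLOSSES = {
    "river.n.01": "a large natural stream of water",
    "bank.n.01": "a financial institution",
    "bank.n.02": "sloping land beside a body of water",
}

def predict_sense(word, context, candidates):
    # Placeholder: the real model scores each candidate gloss against context.
    return candidates[0]

def consec(sentence, candidates):
    resolved = {}
    for word in sentence:
        # Glosses of previously disambiguated words become extra context.
        extra = " ".join(GLOSSES[s] for s in resolved.values())
        context = " ".join(sentence) + " " + extra
        resolved[word] = predict_sense(word, context, candidates[word])
    return resolved

print(consec(["river", "bank"],
             {"river": ["river.n.01"], "bank": ["bank.n.01", "bank.n.02"]}))
```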
BEM
Blevins and Zettlemoyer (2021): dual-encoder retrieval. Context and gloss are encoded separately; disambiguation is nearest-neighbor lookup by cosine similarity:

$$\hat{s} = \operatorname*{arg\,max}_{s \in S(w)} \cos\big(E_c(c),\, E_g(g_s)\big)$$

84.5% F1. Glosses are pre-encoded, so inference is fast.
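The retrieval step reduces to a cosine argmax over pre-encoded gloss vectors; the toy 2-d embeddings here stand in for the encoder outputs.

```python
import math

# Dual-encoder retrieval sketch: disambiguation as nearest-neighbor search
# between a context embedding and pre-encoded gloss embeddings (toy vectors).

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def disambiguate(context_vec, gloss_vecs):
    return max(gloss_vecs, key=lambda s: cosine(context_vec, gloss_vecs[s]))

gloss_vecs = {"bank.n.01": [0.9, 0.1], "bank.n.02": [0.1, 0.9]}
print(disambiguate([0.8, 0.3], gloss_vecs))  # nearest gloss wins
```

Because the gloss side is fixed at inference, all gloss vectors can be computed once and cached, which is why BEM is fast.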
SANDWiCH
Guzman-Olivares et al. (2025): the current state of the art. Modifies WordNet by severing edges between distinct senses of the same lemma, partitioning its graph into disjoint per-sense subgraphs. The model classifies subgraphs rather than individual nodes.
89.0% F1 overall; 94.0% nouns; 74.6% verbs; 77.1% rare senses.
Results
| System | ALL | N | V | Adj | Key idea |
|---|---|---|---|---|---|
| MFS Baseline | 65.5 | 67.7 | 49.8 | 73.1 | Most frequent sense |
| IMS | 68.9 | 70.2 | 55.1 | 75.6 | SVM per word |
| GlossBERT | 77.0 | 79.8 | 67.1 | 79.6 | Sentence-gloss matching |
| EWISER | 80.1 | 81.7 | 66.3 | 81.2 | Inference-time graph smoothing |
| ConSeC | 82.0 | 85.4 | 70.8 | 84.0 | Autoregressive context |
| BEM | 84.5 | 81.4 | 68.5 | 83.0 | Dual-encoder retrieval |
| SANDWiCH | 89.0 | 94.0 | 74.6 | 86.8 | Subgraph classification |
Nouns are nearly done. Verb errors concentrate on fine-grained distinctions—run.v.01 (operate) vs. run.v.03 (execute)—senses that are siblings in WordNet but receive identical gradient signal under cross-entropy.
The Gap
Every system above trains with cross-entropy:

$$\mathcal{L}_{\text{CE}} = -\log \frac{\exp(z_{s^*})}{\sum_{s \in S(w)} \exp(z_s)}$$

Cross-entropy penalizes all incorrect senses uniformly. Predicting bank.n.03 (the building) when the answer is bank.n.01 (the institution) produces the same gradient as predicting bank.n.02 (river bank). In WordNet:
- bank.n.01 IS-A financial_institution
- bank.n.03 IS-A structure
- bank.n.02 IS-A geological_formation
n01 and n03 are far closer than n01 and n02. The model never learns this.
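The "identical gradient" claim is easy to verify: the gradient of cross-entropy with respect to logit $j$ is $p_j - \mathbb{1}[j = y]$, which depends only on the predicted probability, never on graph distance to the gold sense.

```python
import math

# Gradient of cross-entropy w.r.t. the logits is softmax(z) - onehot(y):
# a near-miss sense and a far-miss sense with equal probability receive
# exactly the same gradient.

def softmax(z):
    m = max(z)
    es = [math.exp(x - m) for x in z]
    s = sum(es)
    return [e / s for e in es]

# Uniform logits over bank.n.01 (gold), bank.n.03 (near miss), bank.n.02 (far miss).
z, gold = [1.0, 1.0, 1.0], 0
p = softmax(z)
grad = [p[j] - (1.0 if j == gold else 0.0) for j in range(len(z))]
print(grad)  # both wrong senses get the same positive gradient
```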
SANDWiCH uses the graph to restructure the label space but still uses cross-entropy within each group. Hard negatives that span group boundaries are not explicitly targeted.
Proposed Experiment
Hard Negative Mining
For each training instance $(c_i, s_i^*)$, we sample a hard negative $s^-$ from the graph neighborhood of $s_i^*$:
- Find the hypernym $h$ of $s_i^*$
- Collect every sibling sense $s$ of the same lemma (same hypernym $h$, $s \neq s_i^*$)
- For each other sense $s$ of the target lemma:
  - If $\text{sim}_{\text{path}}(s, s_i^*) > \tau$: add $s$ to the candidate pool
- Return $s^-$ sampled uniformly from the pool
Path similarity: $\text{sim}_{\text{path}}(s_1, s_2) = \frac{1}{1 + d(s_1, s_2)}$, where $d$ is the shortest hypernym-path distance between the two synsets.
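The mining procedure can be sketched on a toy hypernym graph. The edges and the threshold `tau = 0.3` are invented for illustration; the real version walks WordNet (e.g., via NLTK's `path_similarity`).

```python
import random

# Graph-guided hard negative mining on a toy hypernym graph (child -> parent).
# Edges and the threshold tau are invented for illustration.

HYPERNYM = {
    "run.v.01": "operate.v.01",
    "run.v.03": "operate.v.01",   # sibling of run.v.01
    "run.v.11": "travel.v.01",    # a more distant sense
    "operate.v.01": "act.v.01",
    "travel.v.01": "act.v.01",
}

def path_to_root(s):
    path = [s]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path

def path_similarity(a, b):
    # 1 / (1 + shortest hypernym-path distance), via the lowest common ancestor.
    pa, pb = path_to_root(a), path_to_root(b)
    common = next(n for n in pa if n in pb)
    return 1.0 / (1 + pa.index(common) + pb.index(common))

def sample_hard_negative(gold, candidates, tau=0.3):
    pool = [s for s in candidates
            if s != gold and path_similarity(s, gold) > tau]
    return random.choice(pool) if pool else None

print(sample_hard_negative("run.v.01", ["run.v.01", "run.v.03", "run.v.11"]))
# only the sibling sense run.v.03 clears the threshold
```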
Triplet Margin Loss
$$\mathcal{L}_{\text{triplet}} = \max\big(0,\; d(c, g^+) - d(c, g^-) + m\big)$$

where $d$ is cosine distance and $m$ is the margin. The correct gloss must be at least $m$ closer to the context than the hard negative. If that is already satisfied, the gradient is zero.
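A minimal sketch of the loss on toy embeddings; the margin value `m = 0.2` is an assumed hyperparameter, not a value from the experiment.

```python
import math

# Triplet margin loss over (context, positive gloss, hard-negative gloss)
# embeddings, with cosine distance. Margin m is an assumed hyperparameter.

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def triplet_loss(ctx, pos, neg, m=0.2):
    return max(0.0, cosine_distance(ctx, pos) - cosine_distance(ctx, neg) + m)

ctx = [1.0, 0.0]
pos, neg = [0.9, 0.1], [0.1, 0.9]
print(triplet_loss(ctx, pos, neg))  # margin already satisfied: zero loss
```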
Setup
- Backbone: `all-MiniLM-L6-v2` (22M params), frozen.
- Trainable: a linear projection head on the 384-dimensional sentence embeddings.
- Data: SemCor, verb instances only.
- Eval: Raganato verb subset.
- Baseline: same setup, random negatives instead of graph-guided.
The goal is not to beat SANDWiCH. It is to isolate the effect of graph-guided negatives with everything else held constant.
Expected Outcomes
- Hard negatives outperform random negatives on verb F1.
- Ablating the graph selection (random negatives, same loss) degrades performance.
- t-SNE of learned sense embeddings shows tighter within-sense clusters and larger sibling-sense margins under hard negative training.
Conclusion
WSD is largely solved for nouns. Verbs remain hard, and the errors are not random—they follow the structure of WordNet. Cross-entropy cannot see this structure. A triplet loss trained on graph-guided negatives can. The experiment described here is a minimal, reproducible test of that claim. If it works, the idea scales naturally to larger models. If it does not, the bottleneck is probably data coverage or gloss quality, not the loss.
References
- Agirre, E. and Soroa, A. (2009). Personalizing PageRank for word sense disambiguation.
- Banerjee, S. and Pedersen, T. (2002). An adapted Lesk algorithm for word sense disambiguation using WordNet.
- Barba, E., Procopio, L., and Navigli, R. (2021). ConSeC: Word sense disambiguation as continuous sense comprehension.
- Bevilacqua, M. and Navigli, R. (2020). Breaking through the 80% glass ceiling: Raising the state of the art in word sense disambiguation by incorporating knowledge graph information.
- Blevins, T. and Zettlemoyer, L. (2021). Moving down the long tail of word sense disambiguation with gloss-informed bi-encoders.
- Camacho-Collados, J., Pilehvar, M.T., and Navigli, R. (2016). Nasari: Integrating explicit knowledge and corpus statistics for a multilingual representation of concepts and entities.
- Devlin, J. et al. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding.
- Guzman-Olivares, D. et al. (2025). SANDWiCH: Semantical analysis of neighbours for disambiguating words in context ad hoc.
- Huang, L. et al. (2019). GlossBERT: BERT for word sense disambiguation with gloss knowledge.
- Kågebäck, M. and Salomonsson, H. (2016). Word sense disambiguation using a bidirectional LSTM.
- Lesk, M. (1986). Automatic sense disambiguation using machine readable dictionaries.
- Mikolov, T. et al. (2013). Distributed representations of words and phrases and their compositionality.
- Miller, G.A. (1995). WordNet: A lexical database for English.
- Miller, G.A. et al. (1994). A semantic concordance.
- Pennington, J. et al. (2014). GloVe: Global vectors for word representation.
- Raganato, A. et al. (2017). Word sense disambiguation: A unified evaluation framework and empirical comparison.
- Zhong, Z. and Ng, H.T. (2010). It makes sense: A wide-coverage word sense disambiguation system for free text.