Teaching Machines to Tell Words Apart

March 29, 2026


A Survey on Word Sense Disambiguation

WSD systems have gone from 65% to 89% F1 over four decades, mostly by getting better at context modeling. Nouns are nearly solved (94%). Verbs are not (74.6%). The remaining verb errors cluster around semantically similar senses, yet every system trains with cross-entropy, which treats all wrong answers equally. We test whether replacing random negatives with WordNet-guided hard negatives and a triplet margin loss improves verb disambiguation, using a small frozen backbone trainable on a laptop.

Introduction

Run has 41 senses in WordNet. Bank has 10. Picking the right one from context is trivial for humans and has been a core NLP problem since the 1980s. It matters because downstream tasks (translation, search, QA) all break in predictable ways when the word sense is wrong.

The task is genuinely hard: you need local syntax, world knowledge, and commonsense simultaneously. It is sometimes called AI-complete for this reason.

WordNet

The standard sense inventory is Princeton WordNet (Miller, 1995)—a lexical graph where nodes are synsets (one concept, multiple surface forms) and edges are semantic relations.

car.n.01 = {car, auto, automobile, motorcar}, defined as "a motor vehicle with four wheels." Synsets connect upward via IS-A (car to vehicle to artifact) and laterally via PART-OF, SIMILAR-TO, etc.

WordNet 3.0: 117,000+ synsets, 206,941 word-sense pairs. Verbs average 2.17 senses; nouns 1.24. Verbs are harder.
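
The synset-plus-IS-A structure described above can be modeled in a few lines. This is a hedged pure-Python sketch with a toy three-node chain (in practice one would query `nltk.corpus.wordnet`; the `Synset` class and `hypernym_chain` helper here are illustrative names, not the WordNet API):

```python
from dataclasses import dataclass

@dataclass
class Synset:
    """One concept: a set of surface forms plus a gloss and an IS-A parent."""
    name: str
    lemmas: set
    gloss: str
    hypernym: "Synset | None" = None

    def hypernym_chain(self):
        """Walk IS-A edges upward, e.g. car -> vehicle -> artifact."""
        node, chain = self, []
        while node.hypernym is not None:
            node = node.hypernym
            chain.append(node.name)
        return chain

artifact = Synset("artifact.n.01", {"artifact"}, "a man-made object")
vehicle = Synset("vehicle.n.01", {"vehicle"}, "a conveyance", hypernym=artifact)
car = Synset("car.n.01", {"car", "auto", "automobile", "motorcar"},
             "a motor vehicle with four wheels", hypernym=vehicle)

print(car.hypernym_chain())  # -> ['vehicle.n.01', 'artifact.n.01']
```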

Benchmark

Raganato et al. (2017) unified five shared tasks into one benchmark. Training: SemCor (Miller et al., 1994), 226,036 hand-labeled instances. Test: 7,253 instances across Senseval-2/3, SemEval-2007/2013/2015. Metric: micro-F1.

Prior Work

Knowledge-Based Systems (1986–2010)

Lesk

Lesk (1986): for a target word in context, compare its candidate sense glosses against the glosses of surrounding words; pick the sense with maximum word overlap. Extended Lesk (Banerjee and Pedersen, 2002) expands glosses with WordNet neighbors:

  1. Collect glosses of all WordNet neighbors of each candidate sense
  2. Collect glosses of all context words
  3. Pick the sense with highest overlap

Glosses are short (~12 words). Exact overlap is sparse. Tops out at 51–65% F1.
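
The three steps above reduce to bag-of-words overlap. A minimal sketch, with abridged toy glosses (not the real WordNet text) and hypothetical helper names:

```python
def lesk_score(gloss: str, context: set) -> int:
    """Count word overlap between a sense gloss and the context bag."""
    return len(set(gloss.lower().split()) & context)

def simplified_lesk(candidates: dict, sentence: str) -> str:
    """Pick the candidate sense whose gloss overlaps the sentence most."""
    context = set(sentence.lower().split())
    return max(candidates, key=lambda s: lesk_score(candidates[s], context))

# Toy glosses, abridged for illustration.
senses = {
    "bank.n.01": "a financial institution that accepts deposits",
    "bank.n.02": "sloping land beside a body of water",
}
print(simplified_lesk(senses, "He deposits money at the financial institution"))
# -> bank.n.01 (overlap: deposits, financial, institution)
```

The sparsity problem is visible even here: drop the shared words from the sentence and the overlap count collapses to zero.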

UKB

UKB (Agirre and Soroa, 2009) runs Personalized PageRank over the WordNet graph, seeded by the context words. The highest-ranked synset wins. Reaches ~68% F1, with no learning from examples.
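
The core mechanism is ordinary power iteration with the teleport mass restricted to seed nodes. A sketch over a toy graph (illustrative synsets and edge structure, not real WordNet; `personalized_pagerank` is a hypothetical helper, not the UKB code):

```python
def personalized_pagerank(graph, seeds, damping=0.85, iters=50):
    """Power iteration where teleport probability lands only on seed nodes."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    for _ in range(iters):
        rank = {
            n: (1 - damping) * teleport[n]
               + damping * sum(rank[m] / len(graph[m])
                               for m in graph if n in graph[m])
            for n in nodes
        }
    return rank

# Toy WordNet-like graph: bank.n.01 sits near finance terms.
graph = {
    "bank.n.01": ["deposit.n.01", "money.n.01"],
    "bank.n.02": ["river.n.01"],
    "deposit.n.01": ["bank.n.01", "money.n.01"],
    "money.n.01": ["bank.n.01", "deposit.n.01"],
    "river.n.01": ["bank.n.02"],
}
rank = personalized_pagerank(graph, seeds={"deposit.n.01", "money.n.01"})
print(rank["bank.n.01"] > rank["bank.n.02"])  # -> True: finance sense wins
```

Seeding with context words ("deposit", "money") pushes probability mass toward the financial sense without any labeled training data, which is exactly why the approach caps out: it has no way to learn from its mistakes.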

Supervised ML (2010–2016)

IMS

IMS (Zhong and Ng, 2010) trains one SVM per ambiguous word on POS context, surrounding words, and bigram features. 68.9% F1. Hard limit: words with fewer than 10 training instances get 0%, because the system abstains on unseen vocabulary.

Neural (2016–2018)

Static Embeddings

Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) assign one vector per word-type. The vector for bank is a weighted average across all its senses:

\vec{v}_{\text{bank}} \approx \alpha \cdot \vec{v}_{\text{finance}} + \beta \cdot \vec{v}_{\text{river}} + \cdots

This is the meaning conflation problem (Camacho-Collados et al., 2016). The vector is uninformative for disambiguation.
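
A tiny numeric illustration of the conflation equation, using made-up 2-d "sense" directions (real embeddings are hundreds of dimensions; the values here are assumptions for the demo):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy orthogonal sense directions.
v_finance = [1.0, 0.0]
v_river = [0.0, 1.0]
# A static embedding is a frequency-weighted blend of the senses it saw.
v_bank = [0.6 * a + 0.4 * b for a, b in zip(v_finance, v_river)]

print(cosine(v_bank, v_finance))  # high, but not 1.0
print(cosine(v_bank, v_river))    # nonzero: the river sense leaks in
```

The blended vector is close to neither sense, so no threshold on similarity to `v_bank` alone can separate the two readings.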

Bi-LSTM

Kågebäck and Salomonsson (2016): context-sensitive token representations via bidirectional LSTM. Different occurrences of bank get different vectors. Limited by fixed-window context. ~69% F1.

Transformers (2019–2023)

BERT

Devlin et al. (2019): full self-attention over the input sequence. Every token's representation is a function of every other token:

\text{Attention}(Q,K,V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V

Fine-tuned on SemCor: 73.7% F1.

GlossBERT

Huang et al. (2019): instead of classifying sense IDs, run BERT on (sentence, gloss) pairs and score each candidate:

[CLS] He went to the bank. [SEP] A financial institution.

Scores every candidate and takes the argmax. 77.0% F1.

EWISER

Bevilacqua and Navigli (2020): post-hoc graph smoothing on BERT's output logits,

s_i^{\text{final}} = s_i^{\text{BERT}} + \lambda \sum_{j \in \mathcal{N}(i)} w_{ij} \cdot s_j^{\text{BERT}}

where \mathcal{N}(i) are hypernyms and hyponyms. The graph touches only inference, not training. 80.1% F1.
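
The smoothing formula is a one-liner. A sketch with toy logits and an assumed neighbor-weight map (values are illustrative, not EWISER's learned weights):

```python
def smooth_logits(logits, neighbors, lam=0.1):
    """s_i_final = s_i + lam * sum_j w_ij * s_j over graph neighbors of i."""
    return {
        i: s + lam * sum(w * logits[j] for j, w in neighbors.get(i, []))
        for i, s in logits.items()
    }

# Toy output logits for three senses of "bank" (illustrative values).
logits = {"bank.n.01": 2.0, "bank.n.02": 1.9, "bank.n.03": 0.5}
# n.01 and n.03 are linked in the graph; n.02 (river) is isolated from both.
neighbors = {"bank.n.01": [("bank.n.03", 1.0)],
             "bank.n.03": [("bank.n.01", 1.0)]}

smoothed = smooth_logits(logits, neighbors)
print(smoothed)  # n.01 gains support from its neighbor; n.02 is unchanged
```

The effect: graph-adjacent senses reinforce each other at inference time, which can flip near-ties like the 2.0 vs. 1.9 case above.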

ConSeC

Barba et al. (2021): autoregressive disambiguation—already-resolved senses are appended as context for subsequent words:

  1. For each word w_i in the sentence
  2. Append the glosses of already-resolved words to the input
  3. Predict the sense of w_i

82.0% F1.
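
The loop above can be sketched with a stub predictor standing in for the trained model (`stub_predict` and the gloss dictionary are assumptions for the demo, not ConSeC's actual interface):

```python
def consec_pass(words, glosses, predict):
    """Disambiguate left to right, feeding resolved glosses back as context."""
    resolved = {}  # word -> chosen sense
    for w in words:
        # Context = the sentence plus glosses of everything already resolved.
        extra = [glosses[s] for s in resolved.values()]
        resolved[w] = predict(w, words, extra)
    return resolved

# Stub predictor: records how much resolved-gloss context each step saw.
seen_context = []
def stub_predict(word, sentence, extra_glosses):
    seen_context.append(len(extra_glosses))
    return f"{word}.x.01"

glosses = {"run.x.01": "gloss of run", "fast.x.01": "gloss of fast"}
consec_pass(["run", "fast"], glosses, stub_predict)
print(seen_context)  # -> [0, 1]: the second word sees one resolved gloss
```

The design trade-off is visible in the loop: later words get richer context, but the pass is inherently sequential, so inference cannot be parallelized across words.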

BEM

Blevins and Zettlemoyer (2021): dual-encoder retrieval. Context and gloss are encoded separately; disambiguation is nearest-neighbor lookup by cosine similarity:

\text{score}(s) = \cos\!\big(\text{Enc}_{\text{ctx}}(c,w),\;\text{Enc}_{\text{gloss}}(g_s)\big)

84.5% F1. Glosses are pre-encoded, so inference is fast.
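
Once glosses are pre-encoded, disambiguation is an argmax over cosine similarities. A sketch with toy 2-d vectors standing in for the two encoders' outputs (in BEM these come from two fine-tuned BERT encoders):

```python
import math

def cos(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def retrieve_sense(context_vec, gloss_vecs):
    """Disambiguation as nearest-neighbor lookup over pre-encoded glosses."""
    return max(gloss_vecs, key=lambda s: cos(context_vec, gloss_vecs[s]))

# Toy encoder outputs (illustrative values, not real embeddings).
gloss_vecs = {
    "bank.n.01": [0.9, 0.1],   # financial institution
    "bank.n.02": [0.1, 0.9],   # river bank
}
context_vec = [0.8, 0.3]       # encoding of a finance-flavored context
print(retrieve_sense(context_vec, gloss_vecs))  # -> bank.n.01
```

Because the gloss side is fixed after training, all gloss vectors can be cached once, which is why the paper's inference is fast.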

SANDWiCH

Guzman-Olivares et al. (2025): the current state of the art. Modifies WordNet by severing edges between distinct senses of the same lemma, partitioning its graph into disjoint per-sense subgraphs. The model classifies subgraphs rather than individual nodes:

\forall\, s_i, s_j \in \text{senses}(w):\quad \mathcal{N}(s_i) \cap \mathcal{N}(s_j) = \emptyset

89.0% F1 overall; 94.0% nouns; 74.6% verbs; 77.1% rare senses.

Results

All scores are F1 (%); N, V, and Adj are the per-POS breakdowns.

| System | ALL | N | V | Adj | Key idea |
|---|---|---|---|---|---|
| MFS Baseline | 65.5 | 67.7 | 49.8 | 73.1 | Most frequent sense |
| IMS | 68.9 | 70.2 | 55.1 | 75.6 | SVM per word |
| GlossBERT | 77.0 | 79.8 | 67.1 | 79.6 | Sentence-gloss matching |
| EWISER | 80.1 | 81.7 | 66.3 | 81.2 | Inference-time graph smoothing |
| ConSeC | 82.0 | 85.4 | 70.8 | 84.0 | Autoregressive context |
| BEM | 84.5 | 81.4 | 68.5 | 83.0 | Dual-encoder retrieval |
| SANDWiCH | 89.0 | 94.0 | 74.6 | 86.8 | Subgraph classification |

Nouns are nearly done. Verb errors concentrate on fine-grained distinctions—run.v.01 (operate) vs. run.v.03 (execute)—senses that are siblings in WordNet but receive identical gradient signal under cross-entropy.

The Gap

Every system above trains with cross-entropy:

\mathcal{L}_{\text{CE}} = -\log \frac{\exp(s_{\text{correct}})}{\sum_{i} \exp(s_i)}

Cross-entropy penalizes all incorrect senses uniformly. Predicting bank.n.03 (the building) when the answer is bank.n.01 (the institution) produces the same gradient as predicting bank.n.02 (river). In WordNet:

  • financial_institution ← (IS-A) bank.n.01
  • structure ← (IS-A) bank.n.03
  • geological_formation ← (IS-A) bank.n.02

bank.n.01 and bank.n.03 are far closer in the graph than bank.n.01 and bank.n.02. The model never learns this.
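
The blindness is easy to see in the gradient itself: dL_CE/ds_i = softmax(s)_i − 1[i = correct], so a wrong sense's gradient depends only on its logit, never on its graph distance. A numeric check (toy logits, hypothetical sense indices):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def ce_grad(logits, correct):
    """d L_CE / d s_i = softmax(s)_i - 1[i == correct]."""
    p = softmax(logits)
    return [pi - (1.0 if i == correct else 0.0) for i, pi in enumerate(p)]

# Index 0 = bank.n.01 (correct), 1 = bank.n.03 (near sibling),
# 2 = bank.n.02 (semantically distant). Equal logits for both wrong senses.
grad = ce_grad([2.0, 1.0, 1.0], correct=0)
print(grad[1] == grad[2])  # -> True: cross-entropy ignores WordNet distance
```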

SANDWiCH uses the graph to restructure the label space but still uses cross-entropy within each group. Hard negatives that span group boundaries are not explicitly targeted.

Proposed Experiment

Hard Negative Mining

For each training instance (x, s^+), we sample a hard negative s^- from the graph neighborhood of s^+:

  1. H ← ∅
  2. Find the hypernym of s^+
  3. For each sibling sense s (same hypernym, same lemma, s ≠ s^+):
    • H ← H ∪ {s}
  4. For each other sense s of the target lemma:
    • If sim(s^+, s) > 0.7: H ← H ∪ {s}
  5. Return a sample from H

Path similarity: \mathrm{sim}(s_i, s_j) = \frac{1}{\mathrm{dist}(s_i,s_j)+1}. Note that for distinct synsets this peaks at 0.5 (graph distance 1), so a 0.7 threshold in step 4 is never met and H reduces to the siblings from step 3; a lower threshold such as 0.4 would also admit direct hypernym-hyponym sense pairs.
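
Steps 1-5 can be sketched over a toy hypernym map. Everything below is an assumption for illustration: the synsets and edges are made up, `sample_hard_negative` is a hypothetical helper, and the threshold is set to 0.4 so the path-similarity branch is reachable on distinct synsets:

```python
import random

def path_distance(a, b, hypernym):
    """Edge count between two synsets along their shared hypernym chains."""
    def chain(s):
        out = [s]
        while s in hypernym:
            s = hypernym[s]
            out.append(s)
        return out
    ca, cb = chain(a), chain(b)
    for i, node in enumerate(ca):       # first common ancestor wins
        if node in cb:
            return i + cb.index(node)
    return len(ca) + len(cb)            # no shared ancestor: max out

def sample_hard_negative(positive, lemma_senses, hypernym, thresh=0.4):
    """Steps 1-5: collect siblings, then high path-similarity senses."""
    hard = set()
    for s in lemma_senses:
        if s == positive:
            continue
        if hypernym.get(s) == hypernym.get(positive):        # sibling sense
            hard.add(s)
        elif 1.0 / (path_distance(positive, s, hypernym) + 1) > thresh:
            hard.add(s)
    return random.choice(sorted(hard)) if hard else None

# Toy graph (illustrative, not real WordNet edges).
hypernym = {
    "run.v.01": "operate.v.01",
    "run.v.03": "operate.v.01",   # sibling of run.v.01
    "run.v.07": "move.v.01",      # hypothetical distant sense
}
senses = ["run.v.01", "run.v.03", "run.v.07"]
print(sample_hard_negative("run.v.01", senses, hypernym))
# -> run.v.03: the sibling, not the distant sense
```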

Triplet Margin Loss

\mathcal{L} = \max\!\left(0,\; d(\mathbf{c}, \mathbf{g}^+) - d(\mathbf{c}, \mathbf{g}^-) + \alpha\right)

d is cosine distance, α = 0.4. The correct gloss must be at least α closer to the context than the hard negative. If that margin is already satisfied, the gradient is zero.
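
The loss fits in a few lines of pure Python. A sketch with toy 2-d vectors (illustrative values; real context and gloss vectors come from the encoder):

```python
import math

def cosine_distance(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - num / den

def triplet_loss(context, pos_gloss, neg_gloss, margin=0.4):
    """max(0, d(c, g+) - d(c, g-) + margin): zero once g+ beats g- by margin."""
    return max(0.0, cosine_distance(context, pos_gloss)
                    - cosine_distance(context, neg_gloss) + margin)

c = [1.0, 0.0]
g_pos = [1.0, 0.0]   # identical to the context: distance 0
g_neg = [0.0, 1.0]   # orthogonal: distance 1
print(triplet_loss(c, g_pos, g_neg))  # -> 0.0: margin already satisfied
print(triplet_loss(c, g_neg, g_pos))  # -> 1.4: swapped triplet is penalized
```

The zero-gradient region is the point: unlike cross-entropy, which keeps pushing on every example, the triplet loss spends all of its gradient budget on the confusable pairs the hard-negative miner selects.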

Setup

  • Backbone: all-MiniLM-L6-v2 (22M params), frozen.
  • Trainable: linear projection head, 384 → 256 (all-MiniLM-L6-v2 produces 384-dimensional embeddings).
  • Data: SemCor verbs only, ≈40K instances.
  • Eval: Raganato verb subset.
  • Baseline: same setup, random negatives instead of graph-guided.

The goal is not to beat SANDWiCH. It is to isolate the effect of graph-guided negatives with everything else held constant.

Expected Outcomes

  1. Hard negatives outperform random negatives on verb F1.
  2. Ablating the graph selection (random negatives, same loss) degrades performance.
  3. t-SNE of learned sense embeddings shows tighter within-sense clusters and larger sibling-sense margins under hard negative training.

Conclusion

WSD is largely solved for nouns. Verbs remain hard, and the errors are not random—they follow the structure of WordNet. Cross-entropy cannot see this structure. A triplet loss trained on graph-guided negatives can. The experiment described here is a minimal, reproducible test of that claim. If it works, the idea scales naturally to larger models. If it does not, the bottleneck is probably data coverage or gloss quality, not the loss.

References

  • Agirre, E. and Soroa, A. (2009). Personalizing PageRank for word sense disambiguation.
  • Banerjee, S. and Pedersen, T. (2002). An adapted Lesk algorithm for word sense disambiguation using WordNet.
  • Barba, E., Procopio, L., and Navigli, R. (2021). ConSeC: Word sense disambiguation as continuous sense comprehension.
  • Bevilacqua, M. and Navigli, R. (2020). Breaking through the 80% glass ceiling: Raising the state of the art in word sense disambiguation by incorporating knowledge graph information.
  • Blevins, T. and Zettlemoyer, L. (2021). Moving down the long tail of word sense disambiguation with gloss-informed bi-encoders.
  • Camacho-Collados, J., Pilehvar, M.T., and Navigli, R. (2016). Nasari: Integrating explicit knowledge and corpus statistics for a multilingual representation of concepts and entities.
  • Devlin, J. et al. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding.
  • Guzman-Olivares, D. et al. (2025). SANDWiCH: Semantical analysis of neighbours for disambiguating words in context ad hoc.
  • Huang, L. et al. (2019). GlossBERT: BERT for word sense disambiguation with gloss knowledge.
  • Kågebäck, M. and Salomonsson, H. (2016). Word sense disambiguation using a bidirectional LSTM.
  • Lesk, M. (1986). Automatic sense disambiguation using machine readable dictionaries.
  • Mikolov, T. et al. (2013). Distributed representations of words and phrases and their compositionality.
  • Miller, G.A. (1995). WordNet: A lexical database for English.
  • Miller, G.A. et al. (1994). A semantic concordance.
  • Pennington, J. et al. (2014). GloVe: Global vectors for word representation.
  • Raganato, A. et al. (2017). Word sense disambiguation: A unified evaluation framework and empirical comparison.
  • Zhong, Z. and Ng, H.T. (2010). It makes sense: A wide-coverage word sense disambiguation system for free text.