赵子龙 (@xiaoyu-ops) posted in "How does my abstract look, hehe"
Translation attached below 😀
Large-scale web datasets often form a chaotic “digital swamp”: unstructured,
heterogeneous collections with substantial redundancy that challenge machine
learning workflows. Existing cleaning methods typically assume pre-sorted data
or lack robustness against near-duplicates and mixed modalities. This paper
introduces MMdedup, an end-to-end framework built on a novel “Triage-a...