IDENTIFYING SIMILAR DOCUMENTS IN A FILE REPOSITORY USING UNIQUE DOCUMENT SIGNATURES

Publication date: 15/08/2024
Source: WIPO "digitalization"
Methods, systems, and non-transitory computer readable storage media are disclosed for determining clusters of similar digital documents using unique document signatures. Specifically, the disclosed system processes digital text in a digital document to tokenize character strings (e.g., words) in the digital document by combining a subset of character values and string lengths in the character strings. Additionally, the disclosed system generates a document signature for the digital document by combining subsets of tokens generated for the digital document into a token sequence indicative of the digital text in the digital document. The disclosed system determines a cluster of similar digital documents including the digital document by comparing the document signature of the digital document to document signatures corresponding to a plurality of digital documents.
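
The pipeline described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say which character values are kept, how the token subset is chosen, or how signatures are compared, so every concrete choice below is an assumption made for illustration: the token is the first two characters plus the word length, the subset is every second token, and the comparison uses the standard-library difflib.SequenceMatcher with a 0.75 threshold. The function names (tokenize_word, document_signature, cluster_documents) are hypothetical and do not come from the source.

```python
from __future__ import annotations

import re
from difflib import SequenceMatcher


def tokenize_word(word: str, n_chars: int = 2) -> str:
    """Assumed token format: first n_chars characters plus the string length."""
    w = word.lower()
    return f"{w[:n_chars]}{len(w)}"


def document_signature(text: str, step: int = 2) -> tuple[str, ...]:
    """Tokenize every word, then keep every `step`-th token (assumed subset rule).

    The resulting ordered token sequence serves as the document signature.
    """
    words = re.findall(r"[A-Za-z0-9]+", text)
    tokens = [tokenize_word(w) for w in words]
    return tuple(tokens[::step])


def similarity(sig_a: tuple[str, ...], sig_b: tuple[str, ...]) -> float:
    """Assumed comparison: difflib's sequence-matching ratio in [0, 1]."""
    return SequenceMatcher(None, sig_a, sig_b).ratio()


def cluster_documents(docs: dict[str, str], threshold: float = 0.75) -> list[list[str]]:
    """Greedy clustering: a document joins the first cluster whose
    representative (first member) has a signature at least `threshold` similar."""
    signatures = {name: document_signature(text) for name, text in docs.items()}
    clusters: list[list[str]] = []
    for name, sig in signatures.items():
        for cluster in clusters:
            if similarity(sig, signatures[cluster[0]]) >= threshold:
                cluster.append(name)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([name])
    return clusters


if __name__ == "__main__":
    docs = {
        "a": "The quarterly report shows strong revenue growth in Europe",
        "b": "The quarterly report shows strong revenue growth in Asia",
        "c": "Recipe for chocolate cake with vanilla frosting and sprinkles",
    }
    print(cluster_documents(docs))
```

In this sketch the two near-identical reports differ in only one signature token and land in the same cluster, while the unrelated recipe forms its own. A greedy single-pass grouping keeps the example short; a real system could substitute any clustering method and similarity measure.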