DedupEndNote: Details, lots of details ...

Details of the comparisons

  1. Publication year: Are they at most 1 year apart?
  2. Starting page or DOI: Are they the same?
  3. Authors: Is the Jaro-Winkler similarity of the first 40 authors > 0.67?
  4. Title: Is the Jaro-Winkler similarity of (one of) the normalized titles > 0.89?
  5. ISBN, ISSN or Journal: Are they the same (ISBN, ISSN) or similar (Journal)?

The following comparisons are used (in this order, chosen for performance reasons):

1. Publication year: Are they at most 1 year apart?

This 1 year margin was chosen because "ahead of print" publications and final publications often are not published in the same year. A margin larger than 1 year would make the program a lot slower (see Justification: Order of comparisons).

  • Preprocessing: Publication years before 1800 are removed.
  • Insufficient data: Records without a publication year are compared to all records unless they have already been identified as a duplicate.
  • Special cases: Cochrane Reviews are compared for the same publication year only.
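
The year rule above can be sketched as a small function. This is an illustration only, not DedupEndNote's own code: the function name and the `cochrane` flag are hypothetical, and treating a missing year as "still compare" also for Cochrane Reviews is an assumption.

```python
from typing import Optional


def years_compatible(year_a: Optional[int], year_b: Optional[int],
                     cochrane: bool = False) -> bool:
    """Step 1: are the publication years at most 1 year apart?"""
    # Preprocessing: years before 1800 are removed, i.e. treated as missing.
    if year_a is not None and year_a < 1800:
        year_a = None
    if year_b is not None and year_b < 1800:
        year_b = None
    # Insufficient data: a record without a year is compared to all records.
    if year_a is None or year_b is None:
        return True
    # Special case: Cochrane Reviews must have the same publication year.
    margin = 0 if cochrane else 1
    return abs(year_a - year_b) <= margin
```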

2. Starting page or DOI: Are they the same?

If the starting and ending pages of at least one of the publications are more than 2 pages apart, the DOIs are compared first; if the DOIs are different or one or both are absent, the starting pages are compared.
Otherwise, the starting pages are compared first; if the starting pages are different or one or both are absent, the DOIs are compared.

Meeting abstracts often get the DOI of the whole conference proceedings. Comparing them by DOI produces a lot of False Positives. Using this "more than 2 pages" choice circumvents this problem.

  • Preprocessing:
    • Article number is treated as a starting page if the starting page itself is empty or does not contain "-".
    • Starting pages are compared only for number: "S123" and "123" are considered the same.
    • In DOIs 'http://dx.doi.org/', 'http://doi.org/', ... are left out, so that only the part starting with "10." is compared.
  • Insufficient data: If one or both DOIs are missing AND one or both of the starting pages are missing, the answer is YES. This is important because of PubMed ahead of print publications.
  • Special cases: Cochrane Reviews: if both records have a DOI, only the DOIs are compared, otherwise the starting pages are compared.
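
The preprocessing of starting pages and DOIs described above can be sketched as follows. The function names are hypothetical, and lower-casing the DOI is an assumption made here because DOIs are case-insensitive; the text above does not state it.

```python
import re
from typing import Optional


def normalize_doi(doi: str) -> Optional[str]:
    """Keep only the part of the DOI that starts with "10."."""
    match = re.search(r"10\.\S+", doi.strip())
    # Lower-casing is an assumption of this sketch.
    return match.group(0).lower() if match else None


def starting_page_number(pages: str) -> Optional[str]:
    """Return only the digits of the starting page ("S123" -> "123")."""
    start = pages.split("-")[0]
    digits = re.sub(r"\D", "", start)
    return digits or None
```

With these helpers, "S123" and "123-130" yield the same starting page, and a DOI written as a URL compares equal to its bare form.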

3. Authors: Is the Jaro-Winkler similarity of the first 40 authors > 0.67?

Jaro-Winkler similarity: See this Wikipedia page for a definition and here for some examples.

  • Preprocessing:
    • The author "Anonymous," and all Author Groups are skipped.
    • Only the first 40 authors are retained.
    • All diacritical signs/accents, supplemental characters (e.g. "£") and all characters in non-Latin scripts are removed.
    • First names are reduced to initials ("Moorthy, Ranjith K." becomes "Moorthy RK").
    • Names with a space in the last name are also compared in a transposed form ("Lofving Gupta, S." is compared as "Lofving Gupta S" and as "Gupta SL").
    • All authors from each record are joined by "; ". This is the string that is used in the comparison.
  • Insufficient data: If one or both records have no authors,
    • if both have an ISBN AND (one or both have no DOI OR one or both have no starting page), the answer is YES
      Reason: Conference proceedings have no authors, the comparison by ISBN will happen in step 5.
    • if one or both have no DOI OR one or both have no starting page, the answer is NO,
      Reason: Too few fields to make a safe comparison.
  • Special case: If one of the records is a reply, erratum or comment (see below), a higher Jaro-Winkler Similarity threshold is used.
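
A minimal sketch of the author preprocessing, assuming names arrive in EndNote's "Last, First M." form. The function names are hypothetical, skipping of Author Groups is not implemented, and only the first (untransposed) form is used when building the joined string.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Drop diacritics and every character that has no ASCII form."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def normalize_author(author: str) -> list[str]:
    """Return the comparison form(s) of one author name."""
    last, _, first = strip_accents(author).partition(",")
    last = last.strip()
    given = first.replace(".", " ").replace("-", " ").split()
    initials = "".join(part[0].upper() for part in given)
    forms = [f"{last} {initials}".strip()]
    parts = last.split()
    if len(parts) > 1:
        # Transposed form: the final word is taken as the last name, the
        # earlier words become extra initials after the given-name initials.
        extra = "".join(part[0].upper() for part in parts[:-1])
        forms.append(f"{parts[-1]} {initials}{extra}")
    return forms


def author_string(authors: list[str]) -> str:
    """Join the first 40 normalized authors with "; ", skipping "Anonymous,"."""
    kept = [a for a in authors if a.strip().rstrip(",").lower() != "anonymous"]
    return "; ".join(normalize_author(a)[0] for a in kept[:40])
```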

4. Title: Is the Jaro-Winkler similarity of (one of) the normalized titles > 0.89?

The fields Original publication (OP), Short Title (ST), Title (TI) and sometimes Book section (T3, see below) are treated as titles.
Because the Jaro-Winkler similarity algorithm puts a heavy penalty on differences at the beginning of a string, the normalized titles are also compared in reversed form if the publication is longer than 1 page (i.e. except for meeting abstracts, replies, retraction notices, ...).

  • Preprocessing:
    • The titles are normalized (converted to lower case, text between "<...>" removed, all diacritical signs/accents, supplemental characters (e.g. "£") and all characters in non-Latin scripts are removed, all characters which are not letters or numbers are replaced by a space character, ...).
    • Titles which can be split on ". ", ": " or "? " and have 2 parts of at least 50 characters, are also compared on these parts.
    • In the titles of retracted publications all parts which refer to the retraction are removed. Example of such a title: "RETRACTED: Response of Breast Cancer Cells and Cancer Stem Cells to Metformin and Hyperthermia Alone or Combined (Retracted article. See vol. 20, 2025)". A publication is considered a retraction if the title starts with "retracted", "removed" or "withdrawn", or contains "retracted article" (all case insensitive).
  • Insufficient data: If one of the records is a reply, erratum or comment (see below), the titles are not compared / the answer is YES (but the Jaro-Winkler similarity of the authors should be > 0.75 and the comparison between the journals is more strict).
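
The title normalization and reversal can be sketched as below. This covers only the steps spelled out above (the "..." in the list hides further rules), the function names are hypothetical, and reading "text between <...> removed" as removal of markup tags such as <sup> is an assumption.

```python
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Lower-case, drop markup tags and accents, keep letters and digits."""
    text = re.sub(r"<[^>]*>", "", title)  # assumption: tags, not their content
    text = unicodedata.normalize("NFKD", text.lower())
    # Drops diacritics, characters like the pound sign, and non-Latin scripts.
    text = text.encode("ascii", "ignore").decode("ascii")
    # Every run of other characters becomes a single space.
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return text.strip()


def reversed_title(title: str) -> str:
    """Reversed normalized title, so a differing start moves to the end."""
    return normalize_title(title)[::-1]
```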

5. ISBN, ISSN or Journal: Are they the same (ISBN, ISSN) or similar (Journal)?

This rule is skipped if both records have the same DOI (that comparison was made in step 2).
The fields Journal / Book Title (T2), Alternate Journal (J2) and sometimes Book section (T3, see below) are treated as journals. All ISBNs, ISSNs and journal titles (including abbreviations) in the records are used.

If both records have an ISBN, the ISBNs are compared (stop); if both have an ISSN, the ISSNs are compared (stop); otherwise the journal titles are compared.
Abbreviated and full journal titles are compared in a sensible way (see Examples of comparisons).

  • Preprocessing:
    • ISBNs and ISSNs are normalized (dashes are removed, lowercased). For ISBN-10 the first 9 digits are used, for ISBN-13 the 9 digits starting at position 4.
    • Journal titles of the form "Zhonghua wai ke za zhi [Chinese journal of surgery]" or "Zhonghua wei chang wai ke za zhi = Chinese journal of gastrointestinal surgery" or "The Canadian Journal of Neurological Sciences / Le Journal Canadien Des Sciences Neurologiques" are split into 2 journal titles.
    • The journal titles are normalized (hyphens, dots and apostrophes are replaced with space, end part between round or square brackets is removed, initial article is removed, ...).
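
The ISBN and ISSN normalization can be sketched as follows (hypothetical function names; removing spaces as well as dashes is an assumption). Taking the first 9 digits of an ISBN-10 and the 9 digits starting at position 4 of an ISBN-13 skips the "978" prefix and the check digit, so both forms of the same ISBN produce the same key.

```python
def normalize_isbn(isbn: str) -> str:
    """Reduce an ISBN-10 or ISBN-13 to its 9 identifying digits."""
    compact = isbn.replace("-", "").replace(" ", "").lower()
    if len(compact) == 13:
        return compact[3:12]   # 9 digits starting at position 4 (1-based)
    if len(compact) == 10:
        return compact[:9]     # first 9 digits, check digit dropped
    return compact             # unexpected length: left as-is in this sketch


def normalize_issn(issn: str) -> str:
    """Remove dashes and lower-case (the check character may be "X")."""
    return issn.replace("-", "").strip().lower()
```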

Remarks:

Comment:
a publication is considered a comment if the title (fields ST and TI) contains words such as "comment" or "commentary".
Erratum:
a publication is considered an erratum if the title (fields ST and TI) contains "Correction", "Corrigendum" or "Erratum".
Reply:
a publication is considered a reply if the title (fields ST and TI) contains "reply", or contains "author(...)respon(...)", or is nothing but "response" (all case insensitive).
T3 field:
Especially EMBASE (OVID) uses this field for (1) Conference title (majority of cases), (2) an alternative journal title, and (3) original (non English) title. Case 1 (identified as containing a number or "Annual", "Conference", "Congress", "Meeting" or "Society") is skipped. All other T3 fields are treated as Journals and as titles.
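
The title-based classification of replies, errata and comments could look roughly like this. The function names and regular expressions are illustrative guesses at the rules stated above, not the patterns DedupEndNote uses; in particular, matching "Correction" etc. case-insensitively and matching "comment" only as a whole word are assumptions.

```python
import re


def is_reply(title: str) -> bool:
    """Contains "reply" or "author...respon...", or is nothing but "response"."""
    text = title.strip().lower()
    return ("reply" in text
            or re.search(r"author.*respon", text) is not None
            or text == "response")


def is_erratum(title: str) -> bool:
    """Contains "Correction", "Corrigendum" or "Erratum"."""
    text = title.lower()
    return any(word in text for word in ("correction", "corrigendum", "erratum"))


def is_comment(title: str) -> bool:
    """Contains the word "comment" or "commentary"."""
    return re.search(r"\bcomment(ary)?\b", title.lower()) is not None
```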

Examples of comparisons

In the following table the results of EndNote's Find duplicates are compared to the comparisons in DedupEndNote. For these tests only one field was selected in EndNote in "Edit > Preferences > Duplicates", and "Ignore spacing and punctuation" was selected.

Field | Examples | EndNote finds duplicates | DedupEndNote score == result
Starting page and article number
  • ...
  • ...
??? ???
Title
  • 90Y radioembolization using resin microspheres in patients with hepatocellular carcinoma and portal vein thrombosis
  • 90Y RADIOEMBOLIZATION USING RESIN MICROSPHERES IN PATIENTS WITH HEPATOCELLULAR CARCINOMA AND PORTAL VEIN THROMBOSIS
Yes 1.00 == Yes
Title
  • Comments about Glisson's capsule phleboliths and portal vein thrombosis [1]
  • COMMENTS ABOUT GLISSON CAPSULE PHLEBOLITHS AND PORTAL-VEIN THROMBOSIS
No 0.92 == Yes
Title
  • Transarterial chemoembolization and <sup>90</sup>y radioembolization for hepatocellular carcinoma: Review of current applications beyond intermediate-stage disease
  • Transarterial Chemoembolization and Y-90 Radioembolization for Hepatocellular Carcinoma: Review of Current Applications Beyond Intermediate-Stage Disease
No 0.92 == Yes
Title
  • Epidemiology and diagnosis profile of digestive cancer in teaching hospital campus of lome: About 250 cases. [French]
  • Epidemiology and diagnosis profile of digestive cancer in teaching Hospital Campus of Lome: about 250 cases
No 0.99 == Yes
Title
  • Post Splenectomy Outcome in beta-Thalassemia
  • Post Splenectomy Outcome in β-Thalassemia
No 0.96 == Yes
Title
  • Letter: portal vein obstruction--which subset of patients could benefit the most? Authors' reply
  • Letter: Portal vein obstruction - Which subset of patients could benefit the most?
No 0.97 == Yes *
Title
  • Title: Some diseases associated with ulcero- hemorrhagic colitis: complication or coincidence. [French]
    Original Title: Quelques maladies associees a la colite ulcero- Hemorragique: Complications ou coincidences
  • Title: [Various diseases associated with ulcero-hemorrhagic colitis: complications or coincidences]
    Original Title: Quelques maladies associees a la colite ulcero-hemorragique: complications ou coincidences.
No 1.00 == Yes *
Title
  • Title: [HELLP in the second trimester in a patient with antiphospholipid syndrome]
    Original Title: HELLP kan ses i andet trimester ved antifosfolipidsyndrom.
  • Title: HELLP kan ses i andet trimester ved antifosfolipidsyndrom
No 1.00 == Yes *
Title
  • Title: NFkappaB inhibition decreases hepatocyte proliferation but does not alter apoptosis in obstructive jaundice
    Reversed Title: ecidnuaj evitcurtsbo ni sisotpopa retla ton seod tub noitarefilorp etycotapeh sesaerced noitibihni BappakFN
  • Title: NF kappa B inhibition decreases hepatocyte proliferation but does not alter apoptosis in obstructive jaundice
    Reversed title: ecidnuaj evitcurtsbo ni sisotpopa retla ton seod tub noitarefilorp etycotapeh sesaerced noitibihni B appak FN
No 1.00 == Yes *
Title
  • Title: Case report. Duplication of the portal vein: a rare congenital anomaly
    Reversed Title: ylamona latinegnoc erar a :niev latrop eht fo noitacilpuD .troper esaC
  • Title: Duplication of the portal vein - A rare congenital anomaly
    Reversed title: ylamona latinegnoc erar A - niev latrop eht fo noitacilpuD
No 0.96 == Yes *
Title
  • Title: La sémantique de l'image radiologique. Intérêt du procédé de soustraction électronique en couleurs d'Oosterkamp en angiographie abdominale
    Reversed Title: elanimodba eihpargoigna ne pmakretsoO'd srueluoc ne euqinortcelé noitcartsuos ed édécorp ud têrétnI .euqigoloidar egami'l ed euqitnamés aL
  • Title: INTERET DU PROCEDE DE SOUSTRACTION ELECTRONIQUE EN COULEURS D'OOSTERKAMP EN ANGIOGRAPHIE ABDOMINALE
    Reversed title: ELANIMODBA EIHPARGOIGNA NE PMAKRETSOO'D SRUELUOC NE EUQINORTCELE NOITCARTSUOS ED EDECORP UD TERETNI
No 0.91 == Yes *
Authors
  • Cobos Mateos, J. M.; Aguinaga Manzanos, M. V.; Casas Pinillos, M. S.; Gonzalez Conde, R.; Gonzalez Sanchez, J. A.; De Miguel Velasco, J. E.; Soleto Saez, E.; Suarez Mier, M. P.
  • Mateos, J. M. C.; Manzanos, M. V. A.; Pinillos, M. S. C.; Conde, R. G.; Sanchez, J. A. G.; Velasco, J. E. D.; Saez, E. S.; Mier, M. P. S.
No 0.75 == Yes
Authors
  • Danilă, M.; Sporea, I.; Popescu, A.; şirli, R.
  • Danila, M.; Sporea, I.; Popescu, A.; Sirli, R.
No 0.93 == Yes
Authors
  • Lv, Y.; Qi, X.; Xia, J.; Fan, D.; Han, G.
  • Lv, Y.; Qi, X. S.; Xia, J. L.; Fan, D. M.; Han, G. H.
No 0.90 == Yes
Authors
  • [empty]
  • Anonymous,
No Yes
Journal
  • British journal of surgery
  • Br J Surg
No Similar == Yes
Journal
  • European Journal of Gastroenterology and Hepatology
  • European Journal of Gastroenterology & Hepatology
No Similar == Yes
Journal + ISSN
  • Japanese Journal of Cancer and Chemotherapy [ISSN: 2690-2692]
  • Gan To Kagaku Ryoho [ISSN: 2690-2692]
No Similar == Yes
Journal
  • JAMA
  • Journal of the American Medical Association
No Similar == Yes
Journal
  • The Lancet Haematology
  • Lancet Haematol
No Similar == Yes
Journal
  • Hepatology
  • Hepatology International
No Similar == Yes *
Journal
  • AJR Am J Roentgenol
  • American Journal of Roentgenology
No Similar == Yes
Journal
  • British journal of surgery
  • Surgery
No NOT similar == No

*: In these cases DedupEndNote's comparison of this field is not accurate for this content. However, the comparisons of the other fields of these records do not result in YES answers, so the records are ultimately not considered duplicates.

Enrichment of the deduplicated records

Only the first record of a set of duplicate records is copied to the output file.

When writing the output file, the following fields will be changed:

Author (AU)
  • if the (only) author is "Anonymous", the author is omitted
DOI (DO):
  • the DOIs of the removed duplicate records are copied to the saved record and deduplicated. The DOI field is important for finding the full text in EndNote.
  • DOIs of the form "10.1038/ctg.2014.12", "http://dx.doi.org/10.1038/ctg.2014.12", ... are rewritten in the prescribed form "https://doi.org/10.1038/ctg.2014.12". DOIs of this form are clickable links in EndNote.
Journal name (T2)
  • if the saved record has no value for the Journal name (T2) but has a value for the Alternate Journal (J2), this J2 value is copied to the T2 field. This is relevant for the ClinicalTrials.gov records in Embase.
  • a missing Journal name (T2) for "Social Science Research Network" records from Embase.com is filled in
Publication year (PY):
  • if the saved record has no value for its Publication year but one of the removed duplicate records has, the first non-empty Publication year of the duplicates is copied to the saved record.
Starting page (SP) and Article Number (C7):
  • the article number is put in the Pages field (SP) if the Pages field is empty or does not contain a "-", overwriting the Pages field content.
  • the article number field (C7) is omitted
  • if the saved record has no value for its Pages field (e.g. PubMed ahead of print publications) but one of the removed duplicate records has, the first non-empty pages of the duplicates are copied to the saved record.
  • the Pages field gets an unabbreviated form: e.g. "482-91" is rewritten as "482-491".
  • if the ending page is the same as the starting page, only the starting page is written ("192" instead of "192-192").
  • for Cochrane Reviews a missing review number ("CD...") is extracted from the DOI.
Title (TI):
  • If the publication is a reply / erratum / comment / retraction, the title is replaced with the longest title from the duplicates (e.g. "Reply from the authors" is replaced by "Coagulation parameters and portal vein thrombosis in cirrhosis Reply")
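
Two of the rewrites above, the unabbreviated Pages form and the prescribed DOI form, can be sketched like this. The function names are hypothetical, and only purely numeric page ranges are handled; anything else is returned unchanged.

```python
import re


def expand_pages(pages: str) -> str:
    """Rewrite "482-91" as "482-491" and "192-192" as "192"."""
    match = re.fullmatch(r"(\d+)-(\d+)", pages.strip())
    if not match:
        return pages
    start, end = match.groups()
    if len(end) < len(start):
        # Borrow the missing leading digits from the starting page.
        end = start[: len(start) - len(end)] + end
    return start if start == end else f"{start}-{end}"


def doi_as_link(doi: str) -> str:
    """Rewrite any DOI form as "https://doi.org/10...."."""
    match = re.search(r"10\.\S+", doi)
    return f"https://doi.org/{match.group(0)}" if match else doi
```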

Justification

  1. General rule
  2. Comparisons
     2.1. The fields chosen
     2.2. The fields not chosen
     2.3. Similarity
     2.4. Order of comparisons
     2.5. Effect of insufficient data from one of two records
  3. Enriching the duplicate chosen

1. General rule

The general rule is:

General rule

Comparison | Result | Action
1 ... 5 | YES | go to next comparison if present, else mark the records as duplicates
1 ... 5 | (insufficient data for the comparison in one of the records) | go to next comparison if present, else mark the records as duplicates
1 ... 5 | NO | stop comparisons for this pair of records

The part "(insufficient data for the comparison in one of the records) → go to next comparison ..." sets DedupEndNote apart from other deduplication programs.

Justification:

  • If bibliographic databases never missed data (counterexample: DOI https://doi.org/10.1007/5584_2016_118 has Pages information in PubMed, but not in EMBASE or Scopus),
  • if EndNote import filters (EndNote provided, provided by producers of bibliographic databases, home made) always put comparable data in the same EndNote fields, and
  • if all bibliographic data of imported records were already available at import time (which is not the case for, e.g., PubMed ahead of print publications),

then the absence of Starting Page / DOI / Authors / ... in an EndNote record would mean that the corresponding publication has no Starting Page / DOI / Authors / .... Alas, the world isn't perfect yet.

DedupEndNote therefore interprets a comparison with insufficient data from one of two records not as "NO" (i.e. these records are different), but as "UNKNOWN" (i.e. we can't tell (yet)), and continues with the other comparisons for these records.

As a consequence, fields which are useful in comparisons but are not always available (e.g. DOI), can be used.

NOTE: If both records have insufficient data for a comparison, DedupEndNote interprets the fields as different. If there is no alternative for the comparison (see below), then DedupEndNote stops the comparisons for this pair of records and considers them different publications. See 2.5. Effect of insufficient data from one of two records.
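
The general rule amounts to a three-valued comparison chain. The sketch below is a simplified illustration with hypothetical names: each field is compared by plain equality, which is not how the actual comparisons work, and the per-step exceptions described earlier are left out.

```python
from enum import Enum
from typing import Callable, Iterable


class Answer(Enum):
    YES = "yes"
    NO = "no"
    UNKNOWN = "unknown"  # insufficient data in one of the two records


Comparison = Callable[[dict, dict], Answer]


def compare_field(field: str) -> Comparison:
    """Build a comparison on one field (plain equality, for illustration)."""
    def compare(rec_a: dict, rec_b: dict) -> Answer:
        value_a, value_b = rec_a.get(field), rec_b.get(field)
        if value_a is None and value_b is None:
            return Answer.NO       # both insufficient: treated as different
        if value_a is None or value_b is None:
            return Answer.UNKNOWN  # one insufficient: we can't tell (yet)
        return Answer.YES if value_a == value_b else Answer.NO
    return compare


def are_duplicates(rec_a: dict, rec_b: dict,
                   comparisons: Iterable[Comparison]) -> bool:
    """Stop at the first NO; YES and UNKNOWN both move on to the next step."""
    for compare in comparisons:
        if compare(rec_a, rec_b) is Answer.NO:
            return False
    return True
```

A record without a DOI can then still be matched to a record with one, as long as every later comparison answers YES or UNKNOWN.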

2. Comparisons

2.1. The fields chosen

The EndNote fields used in the comparisons are:

The EndNote fields used in the comparisons

EndNote field | Content | Treated as | Used in comparison no.
PY | Publication year | Year | 1
SP | Pages | Starting page | 2
C7 | Article number | Starting page | 2
DO | DOI | DOI | 2
AU | Authors | Authors | 3
TI | Title | Title | 4
OP * | Original title (when reference type is not Conference Proceedings) | Title | 4
ST | Short title | Title | 4
T3 ** | Conference title / Alternate journal title / Original title | Title | 4
SN | ISSN or ISBN | ISBN / ISSN | 5
T2 | Journal title / Book title | Journal | 5
J2 | Alternate journal | Journal | 5
OP * | Conference title (when reference type is Conference Proceedings) | Journal | 5
T3 ** | Conference title / Alternate journal title / Original title | Journal | 5
VL *** | Can be last part of journal title (T2) | Journal | 5

*: the field OP can be a title variant or a journal variant.

**: Conference titles in T3 are omitted.

***: VL can be the last part of the T2 field ("T2 - American journal of physiology" and "VL - Regulatory, integrative and comparative physiology. 303").

Justification:

  • These fields (or combinations of fields) are present for most records. DOIs may be absent, but in comparison 2 (Starting Page OR DOI) most records provide data
  • These fields proved to be sufficient to get good deduplication results.
  • Author: Reducing first names to initials identifies more duplicates although cases where first and last names are mixed up ("Ashish Anil, Sule" vs "Sule, A. A.") are no longer identified as duplicates.
  • Journal: There is no standard (used) for journals. PubMed puts the journal abbreviation in the Journal field, most other databases use the full journal title. There is no agreement on the journal names / abbreviations: "Am J Roentgenol" vs "AJR Am J Roentgenol".
    DedupEndNote treats the content of the journal fields as a set. Comparing the journals of two records comes down to "Is there a journal from the set of journals of record 1 sufficiently similar to a journal from the set of journals of record 2?"
  • Title: There is no standard (used) for non-English titles. Bibliographic databases put the English translation of the title in the Title field, and sometimes the original title in another field.
    The translation may be their own: the original title "Autogreffe de cellules souches hématopoïétiques périphériques dans le cadre du traitement d'hémopathies malignes. Partie I: patients." became:
    • [Autologous transplantation of peripheral blood hematopoietic stem cells in the treatment of hematological malignancies. I: patients] (PubMed)
    • Autologous transplantation of peripheral blood stem cells for haematological malignancies. Part I: patients (Scopus, Web of Science)

    In the test database of 52,000 records 11% of the records were of non-English origin.

  • Starting Page: Bibliographic databases do not always use the same fields for comparable content (e.g. PubMed doesn't use the Article number field, and puts the article number in the Pages field, Web of Science uses the Article Number field, and puts the number of pages of such publication in the Pages field).
    While reading the EndNote file DedupEndNote first picks the Article number (field C7) as Starting page (if present), overwriting it with the Starting page (field SP) of the file if it contains a "-". In the absence of an Article number (C7), Starting page (SP) is taken as Starting page.

    The DOI would be a very useful field for deduplication if only each meeting abstract had a unique DOI. However, all the meeting abstracts within a conference proceedings share the same DOI. Because it is not easy to identify meeting abstracts, DedupEndNote looks at the number of pages: if it is greater than 2, DedupEndNote considers the publication as NOT a meeting abstract, and prefers the comparison by DOI above the comparison by starting page.

2.2. The fields not chosen

Justification:

  • Reference type: Bibliographic databases don't agree on this. Publications in "Advances in Experimental Medicine and Biology" are Journal Article when imported from PubMed or EMBASE, but Serial when imported from Web of Science.
    The only exception to this rule is Conference proceedings: The OP field can be the title of a conference or the original (non-English) title
  • Volume and Issue: these fields were found to have no additional value for deduplication
  • Accession number: accession numbers are tied to specific bibliographic databases, so they are of minimal value for deduplication
  • Type of Work: Not always provided by bibliographic databases, not standardized over bibliographic databases
  • Ending page: Not always provided by bibliographic databases. Comparison by only Starting page proved to be sufficient to get good deduplication results.

2.3. Similarity

DedupEndNote uses Jaro-Winkler Similarity instead of equality when comparing some fields (Authors, Titles, and partly Journals).

Justification:

  • (see Examples of comparisons)
  • Most bibliographic databases use UTF-8, some however still use the ASCII set (Web of Science) making comparisons by equality hopeless when diacritical signs or Greek letters are used in one but not both records
  • Bibliographic databases may add metadata (e.g. for Title: "[French]", "Authors reply", ...)
  • Bibliographic databases may use different rules for number of authors recorded (when to use "et al."). Are:
    • Albrecht, M. H.; Vogl, T. J.; Wichmann, J. L.; Martin, S. S.; Scholtz, J. E.; Fischer, S.; Hammerstingl, R. M.; Harth, M.; Nour-Eldin, N. A.; Thalhammer, A.; et al.,
    • Albrecht, M. H.; Vogl, T. J.; Wichmann, J. L.; Martin, S. S.; Scholtz, J. E.; Fischer, S.; Hammerstingl, R. M.; Harth, M.; Nour-Eldin, N. A.; Thalhammer, A.; Zangos, S.; Bauer, R. W.
    the same authors?
  • Is "NFκB" (publication form) "NFκB", "NFkappaB" or "NF kappa B"?
  • Bibliographic databases don't agree on the author name:
    • Is "Xingshun Qi" (publication form) "Qi, X." (PubMed) or "Qi, X. S." (Web of Science)?
    • Is "E. MorenoGonzález" (publication form) "MorenoGonzález, E." (Scopus), "Moreno González, E." (PubMed) or "Gonzalez, E. M." (Web of Science)?

Why Jaro-Winkler Similarity (JWS) and not Levenshtein distance / ...?

  • JWS is always a value between 0 and 1 making it easier to choose a cut-off point for allowed similarity (e.g. similarity > 0.91)
  • JWS's property of putting a heavier penalty on differences at the beginning of strings proved very helpful in comparing Titles and Authors:
    • "..." and "... - Commentary" are more similar than they would be under a measure that weighs all positions equally
    • "..." and "Case report: ..." (when compared in their reversed form) are more similar than they would be under a measure that weighs all positions equally
    • some bibliographic databases list all authors, others limit them and add "et al."
  • JWS proved useful, so no other similarity measures were tested
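
To make the effect of the reversed comparison concrete, here is a textbook Jaro-Winkler implementation (prefix scale 0.1, at most 4 prefix characters). It is a sketch for illustration; the library DedupEndNote actually uses may differ in details, so the scores need not match those in the examples table exactly.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: matching characters within a sliding window."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, char in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == char:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order, counted in pairs.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity plus a bonus for a shared prefix of up to 4 characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)
```

For the normalized titles "case report duplication of the portal vein" and "duplication of the portal vein", the reversed strings share their whole beginning, so the reversed comparison scores higher than the forward one.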

Thresholds used for Jaro-Winkler Similarity:

Thresholds used for Jaro-Winkler Similarity

Field | Case | Threshold | Explanation
Authors | default (i.e. not a Reply) | 0.67 |
Authors | Reply and sufficient Start Pages | 0.75 | When a Reply the titles are not compared
Authors | Reply and insufficient Start Pages | 0.8 | When a Reply the titles are not compared
Journals * | default (i.e. not a Reply) | 0.9 |
Journals * | Reply | 0.93 |
Title | Sufficient Start Pages or DOIs | 0.89 |
Title | Insufficient Start Pages and DOIs | 0.94 |

*: Journals: if the JWS between Journal titles is below the threshold, two comparisons of the following types are tried:
  • "Ann Fr Anesth Reanim" and "Ann... Fr... Anesth... Reanim..."
  • "BMJ" and "B... M... J..."
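
The two fallback patterns for journals can be read as "each word of the abbreviation starts a word of the full title" and "each letter of an initialism starts a word of the full title". A rough sketch of that reading follows; the function name, the regular expression and the rule for recognizing an initialism (a single all-capitals word) are assumptions, not DedupEndNote's code.

```python
import re


def abbreviation_matches(abbreviation: str, full_title: str) -> bool:
    """Does the abbreviation match the start of successive title words?"""
    words = re.findall(r"[a-z0-9]+", abbreviation.lower())
    if not words:
        return False
    if len(words) == 1 and abbreviation.strip().isupper():
        # Initialism: "BMJ" is tried as "B... M... J...".
        words = list(words[0])
    # "Br J Surg" becomes the pattern ^br.*\bj.*\bsurg
    pattern = "^" + r".*\b".join(re.escape(word) for word in words)
    return re.search(pattern, full_title.lower()) is not None
```

Under this reading "Br J Surg" matches "British journal of surgery", while "Surgery" does not, because the pattern is anchored at the start of the full title.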

2.4. The order of comparisons

Records are put in year sets based on the publication year. Records without publication year are put in a special year set YEAR_0.

When deduplicating 1 file, records are compared in descending order of pairs of year sets. The records in YEAR_0 are added to each of these pairs except for the ones which are already marked as duplicates. For an EndNote RIS file with records from 1889 to 2020:

  • YEAR 2020 + YEAR 2019 + YEAR 0: all records of YEAR 2020 are compared to all records of YEAR 2020 and YEAR 2019 and YEAR 0
  • YEAR 2019 + YEAR 2018 + YEAR 0: all records of YEAR 2019 are compared to all records of YEAR 2019 and YEAR 2018 and YEAR 0
  • ...
  • YEAR 1889 + YEAR 0: all records of YEAR 1889 are compared to all records of YEAR 1889 and YEAR 0

Justification:

  • Pairs of year sets:
    • Bibliographic databases sometimes don't agree on the publication year, in most cases the difference is only 1 year
    • The publication year of Ahead of print publications and of their final form can be more than 1 year apart, but in the majority of cases they are at most 1 year apart.
    • Extending to groupings of more years would take longer and could cause more false positives
  • Descending order: Because the first record of a set of duplicate records is saved to the output file,
    • when encountering an ahead of print publication of e.g. 2018 and a corresponding final record of 2019, the 2019 record will be saved
    • when encountering a record without a publication year and a corresponding record with a publication year, the record with the publication year will be saved
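
The descending order of year-set pairs for a single file can be sketched as a plan of which year sets are compared with which. The names are hypothetical, and the key 0 stands in for the special set YEAR_0.

```python
YEAR_0 = 0  # stand-in key for the set of records without a publication year


def comparison_plan(years: set[int]) -> list[tuple[int, list[int]]]:
    """For each year (newest first), list the year sets it is compared to."""
    plan = []
    for year in sorted((y for y in years if y != YEAR_0), reverse=True):
        partners = [year]
        if year - 1 in years:
            partners.append(year - 1)  # the 1 year margin of comparison 1
        if YEAR_0 in years:
            partners.append(YEAR_0)    # records without a year join every pair
        plan.append((year, partners))
    return plan
```

For the year sets 2020, 2019, 1889 and YEAR_0 this yields 2020 against (2020, 2019, YEAR_0), then 2019 against (2019, YEAR_0) because no 2018 set exists, then 1889 against (1889, YEAR_0).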

When deduplicating 2 files, records of both files are compared in ascending order of pairs of year sets. The YEAR_0 is added to each of these pairs, but records which are marked as duplicates are first removed.
The records of the OLD file are read before the records of the NEW file; because the duplicate chosen in a set of duplicate records is the first one encountered, duplicate records from the OLD file will be chosen when present.
For an EndNote RIS file with records from 1889 to 2020:

  • YEAR 0 + YEAR 1889 + YEAR 1890: all records of YEAR 0 + YEAR 1889 are compared to all records of YEAR 0 and YEAR 1889 and YEAR 1890
  • YEAR 0 + YEAR 1890 + YEAR 1891: all records of YEAR 0 + YEAR 1890 are compared to all records of YEAR 0 and YEAR 1890 and YEAR 1891
  • ...
  • YEAR 0 + YEAR 2020: all records of YEAR 0 + YEAR 2020 are compared to all records of YEAR 0 and YEAR 2020

The output file only contains records from the NEW file which are not duplicates of records of the OLD file, and (if there are duplicates within the NEW file) are the first duplicate encountered within that duplicate set.

Justification:

  • Ascending order:
    • when encountering an ahead of print publication of e.g. 2018 in the OLD file and a corresponding final record of 2018 or 2019 in the NEW file, the NEW record will be seen as a duplicate of the OLD record and not be saved. Saving that record would create a duplicate
    • when encountering a record without a publication year in the OLD file and a corresponding record with a publication year in the NEW file, the record with the publication year will be seen as a duplicate of the OLD record and not be saved. Saving that record would create a duplicate

2.5. Effect of 'insufficient data from one of two records'

The 5th comparison (ISSN or Journal: Are they the same (ISSN) or similar (Journal)?) seems at first sight to apply only to journal articles (with the additional effect that no publication of another type could ever be a duplicate). This is not completely true:

  • ISBN is treated the same way as an ISSN
  • The EndNote fields T2, J2, OP and T3 can also be used with other publication types

However: the general rule treats 2 records / field sets as different if both records have insufficient data for that comparison. Two book records with the same authors, publication year and book title will be considered duplicates only if both have the same / a similar ISBN.

Relaxing this general rule (so that comparisons with insufficient data in one or both records are treated the same way: UNKNOWN, so go on to the next comparison) would result in a lot more False Positives.

3. Enriching the duplicate chosen

Justification:

  • (all cases): the enriched data are copied from existing duplicate records or occur in a lot of records (empty author, full pages form, omitted same ending page)
  • Author "Anonymous": There is no reason to use 2 forms (Author "Anonymous" and no Author) for this case
  • DOI - adding from other duplicates: Considered a useful addition
  • DOI - standardized form: Considered a useful addition (clickable in EndNote)
  • Publication year - missing added from other duplicates: Records without publication year are subpar
  • Starting page and Article number:
    • Some bibliographic databases (e.g. PubMed) treat them the same way, others (e.g. Scopus, Web of Science) don't
    • EndNote output formats (e.g. Vancouver style) can't handle Article numbers
  • Starting page - missing added from other duplicates: Records without starting page are subpar
  • Starting page - standardized to full form:
    • EndNote deduplication treats "492-5" and "492-495" as different values
    • EndNote output formats (e.g. Vancouver style) handle this full form gracefully (emit "492-5" for pages "492-495")
  • Starting page - omitting same end page:
    • Some bibliographic databases (esp. Web of Science) sometimes use this form (e.g. "211-211")
    • EndNote deduplication treats "211" and "211-211" as different values
    • EndNote output formats (e.g. Vancouver style) emit "211-" for pages "211-211"
  • Title - Reply / Erratum / Comment / Retraction: Bibliographic databases are inconsistent in how they record these titles. The longest title holds the most information

Performance

Data are from:

  • [SRA] Rathbone, J., Carter, M., Hoffmann, T. et al. Better duplicate detection for systematic reviewers: evaluation of Systematic Review Assistant-Deduplication Module. Syst Rev 4, 6 (2015). https://doi.org/10.1186/2046-4053-4-6
    The data sets are available at https://osf.io/dyvnj/
  • [McKeown] McKeown, S., Mir, Z.M. Considerations for conducting systematic reviews: evaluating the performance of different methods for de-duplicating references. Syst Rev 10, 38 (2021). https://doi.org/10.1186/s13643-021-01583-y
  • [BIG_SET] Own test database for DedupEndNote on portal vein thrombosis (52,828 records, with 5078 records validated)
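
Name | Tool | True pos | False neg | Sensitivity | True neg | False pos | Specificity | Accuracy
SRA: Cytology screening (1856 rec) | EndNote X9 | 885 | 518 | 63.1% | 452 | 1 | 99.8% | 72.0%
SRA: Cytology screening (1856 rec) | SRA-DM | 1265 | 139 | 90.1% | 452 | 0 | 100.0% | 92.5%
SRA: Cytology screening (1856 rec) | DedupEndNote | 1361 | 59 | 95.8% | 436 | 0 | 100.0% | 96.8%
SRA: Haematology (1415 rec) | EndNote | 159 | 87 | 64.6% | 1165 | 4 | 99.7% | 93.6%
SRA: Haematology (1415 rec) | SRA-DM | 208 | 38 | 84.6% | 1169 | 0 | 100.0% | 97.3%
SRA: Haematology (1415 rec) | DedupEndNote | 222 | 6 | 97.3% | 1186 | 1 | 99.9% | 99.5%
SRA: Respiratory (1988 rec) | EndNote X9 | 410 | 391 | 51.2% | 1185 | 2 | 99.8% | 80.2%
SRA: Respiratory (1988 rec) | SRA-DM | 674 | 125 | 84.4% | 1189 | 0 | 100.0% | 93.7%
SRA: Respiratory (1988 rec) | DedupEndNote | 768 | 18 | 97.7% | 1202 | 0 | 100.0% | 99.0%
SRA: Stroke (1292 rec) | EndNote X9 | 372 | 134 | 73.5% | 784 | 2 | 99.7% | 89.5%
SRA: Stroke (1292 rec) | SRA-DM | 426 | 81 | 84.0% | 785 | 0 | 100.0% | 93.7%
SRA: Stroke (1292 rec) | DedupEndNote | 497 | 8 | 98.4% | 787 | 0 | 100.0% | 99.4%
McKeown (3130 rec) | OVID | 1982 | 90 | 95.7% | 1058 | 0 | 100.0% | 97.1%
McKeown (3130 rec) | EndNote | 1541 | 531 | 74.4% | 850 | 208 | 80.3% | 76.4%
McKeown (3130 rec) | Mendeley | 1877 | 195 | 90.6% | 1041 | 17 | 98.4% | 93.2%
McKeown (3130 rec) | Zotero | 1473 | 599 | 71.1% | 1038 | 20 | 98.1% | 80.2%
McKeown (3130 rec) | Covidence | 1952 | 120 | 94.2% | 1056 | 2 | 99.8% | 96.1%
McKeown (3130 rec) | Rayyan | 2023 | 49 | 97.6% | 1006 | 52 | 95.1% | 96.8%
McKeown (3130 rec) | DedupEndNote | 2023 | 33 | 98.4% | 1074 | 0 | 100.0% | 98.9%
BIG_SET (5082 rec) | DedupEndNote | 3952 | 92 | 97.7% | 1030 | 8 | 99.2% | 98.0%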

See Test results - details for a description of this test database BIG_SET on portal vein thrombosis.
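
The percentages in the table above are consistent with the standard definitions: sensitivity = TP / (TP + FN), specificity = TN / (TN + FP), and accuracy = (TP + TN) / (TP + FN + TN + FP). A minimal sketch for recomputing them from the four counts follows; the class and method names are hypothetical and not taken from the DedupEndNote code.

```java
/** Hypothetical helper: recomputes the percentage columns from the four counts. */
final class ConfusionMetricsSketch {

    /** Share of true duplicates that were found, as a percentage. */
    static double sensitivity(int tp, int fn) {
        return 100.0 * tp / (tp + fn);
    }

    /** Share of non-duplicates that were correctly left alone, as a percentage. */
    static double specificity(int tn, int fp) {
        return 100.0 * tn / (tn + fp);
    }

    /** Share of all validated records that were classified correctly, as a percentage. */
    static double accuracy(int tp, int fn, int tn, int fp) {
        return 100.0 * (tp + tn) / (tp + fn + tn + fp);
    }
}
```

For the DedupEndNote row of the Stroke set (497, 8, 787, 0), sensitivity(497, 8) is about 98.4 and accuracy(497, 8, 787, 0) is about 99.4, matching the table.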

Performance table
Tool          Setting                                                   Duplicates found  Duplicates to delete  After deduplication  % kept
EndNote       Author + Year + Title + Reference Type (default setting)            32,891                19,959               32,869     62%
EndNote       Author + Year + Title                                               32,920                19,976               32,852     62%
EndNote       Author + Year + Title + Secondary Title (Journal)                   22,120                12,333               40,495     77%
DedupEndNote                                                                      38,617                24,357               28,471     54%
TODO: UPDATE examples

The False Positives are all conference abstracts:

  • wrong DOI in second of 2 EMBASE (OVID) records: same authors, year, title, DOI, different starting page
    • Segovia, M. C., et al. (2019). "Combined multivisceral and renal transplant in a patient with JAK-2 mutation." Transplantation 103(7 Supplement 2): S143. DOI: 10.1097/01.tp.0000576288.84252.91
    • Segovia, M. C., et al. (2019). "Combined multivisceral and renal transplant in a patient with JAK-2 mutation." Neurology 92(15 Supplement 1). DOI: 10.1097/01.tp.0000576288.84252.91
  • wrong DOI in second of 2 EMBASE (OVID) records: same authors, year, title, DOI, different starting page
    • Galvao, F. H., et al. (2019). "Intestinal and multivisceral transplantation at hospital dasclinicas da faculdade De medicina da universidade De Sao Paulo (HC-FMUSP)-Brazil." Transplantation 103(7 Supplement 2): S171. DOI: 10.1097/01.tp.0000576492.69414.80
    • Galvao, F. H., et al. (2019). "Intestinal and multivisceral transplantation at hospital dasclinicas da faculdade De medicina da universidade De Sao Paulo (HC-FMUSP)-Brazil." Neurology 92(15 Supplement 1). DOI: 10.1097/01.tp.0000576492.69414.80
  • reversed title 3 seen as similar to reversed title of 1 and 2: same authors, year, title, starting page
    • Cool, J., et al. (2018). "TRENDS IN THE PREVALENCE OF PORTAL VEIN THROMBOSIS AND ASSOCIATED MORTALITY IN CIRRHOSIS: ANALYSIS OF A NATIONALLY REPRESENTATIVE INPATIENT COHORT." Gastroenterology 154(6): S1178-S1178. [Web of Science]
    • Cool, J., et al. (2018). "TRENDS IN THE PREVALENCE OF PORTAL VEIN THROMBOSIS AND ASSOCIATED MORTALITY IN CIRRHOSIS: ANALYSIS OF A NATIONALLY REPRESENTATIVE INPATIENT COHORT." Gastroenterology 154(6 Supplement 1): S-1178. DOI: 10.1016/S0016-5085%2818%2933901-5 [Embase OVID]
    • Cool, J., et al. (2018). "THE ASSOCIATION BETWEEN PORTAL VEIN THROMBOSIS AND OTHER VENOUS THROMBOEMBOLISM IN CIRRHOSIS: ANALYSIS OF A NATIONALLY REPRESENTATIVE INPATIENT COHORT." Gastroenterology 154(6): S1178-S1179. [Web of Science]

Mark mode

If you want to manually merge the records which are duplicates according to DedupEndNote, you can use Mark mode.

Because Zotero does not work with / does not show record numbers, Mark Mode is not relevant for Zotero users.

In Mark mode the ID of the first record of a set of duplicate records is copied to the label field ("LB") of all duplicate records. The input file is copied to the output file with the addition of the Label field if it is not empty. The original content of the Label field is overwritten! All other fields are copied as is (so no enriching of the output: prescribed form of DOI, ...).
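
The labeling step can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: the class and method names are hypothetical, duplicate sets are given as lists of record IDs in input order, and the first record of a set is assumed to receive its own ID as label too. The "LB  - " prefix follows the usual RIS line layout (tag, two spaces, hyphen, space).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the Mark mode labeling step. */
final class MarkModeSketch {

    /** Maps every record ID in a duplicate set to the ID of the first record of that set. */
    static Map<Integer, Integer> labels(List<List<Integer>> duplicateSets) {
        Map<Integer, Integer> labelById = new LinkedHashMap<>();
        for (List<Integer> set : duplicateSets) {
            if (set.size() < 2) {
                continue; // a record without duplicates gets no label
            }
            Integer first = set.get(0);
            for (Integer id : set) {
                labelById.put(id, first);
            }
        }
        return labelById;
    }

    /** Formats the RIS line for the Label field. */
    static String labelLine(int label) {
        return "LB  - " + label;
    }
}
```

For the sets [12, 40, 77] and [5], records 12, 40 and 77 all get the label 12, and record 5 gets none.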

After importing the results file into a new EndNote database, make the Label field visible. The IDs in the Label field refer to the IDs of the original EndNote database.

  • To see the duplicates, search for "Label" "is greater than" 0, and sort on the label field.
  • To merge manually: in Preferences / Deduplicate, set the fields to deduplicate on to the Label field only, then deduplicate in EndNote and merge at will

Comparing the results of deduplication by EndNote itself and DedupEndNote:

  • Import the result file into a new EndNote database
  • Deduplicate by EndNote itself
  • Select all records in the "Duplicate References" set, and mark them as Read
  • Select the "All References" set, limit to the duplicates found by DedupEndNote (search for "Label" "is greater than" 0), and sort on the label field.
  • The records marked as Unread were identified as duplicates by DedupEndNote, but not by EndNote itself.
  • Select the "All References" set, limit to the duplicates not found by DedupEndNote (search for "Label" "is less than" 0).
  • The records marked as Read were identified as duplicates by EndNote itself, but not by DedupEndNote.

Special cases

1. ClinicalTrials.gov records

Records from ClinicalTrials.gov are also available within the Cochrane Library and EMBASE, but the format of the data is quite different. DedupEndNote changes the data of these records to a common format when it imports them so that deduplication can work. The deduplicated output is also standardized:
  • Reference Type: Journal Article
  • Authors: (empty)
  • Journal: https://clinicaltrials.gov
  • Pages: the ClinicalTrials.gov ID (e.g. NCT06923007)
  • URL: the first URL is for ClinicalTrials.gov (e.g. https://clinicaltrials.gov/study/NCT06923007)
  • the other fields are from the first record in a deduplication set
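
The standardized output fields listed above can be sketched as a small mapping from a ClinicalTrials.gov ID. This is a sketch only: the class and method names are hypothetical, the field names are just the labels used in the list above, and it assumes the ID has the form "NCT" followed by 8 digits and can be found somewhere in the raw record text.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of the standardized ClinicalTrials.gov output fields. */
final class ClinicalTrialsSketch {

    // Assumption: a ClinicalTrials.gov ID is "NCT" followed by 8 digits.
    private static final Pattern NCT_ID = Pattern.compile("NCT\\d{8}");

    /** Returns the standardized fields, or an empty map if the text contains no NCT ID. */
    static Map<String, String> standardize(String rawText) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher matcher = NCT_ID.matcher(rawText);
        if (!matcher.find()) {
            return fields;
        }
        String id = matcher.group();
        fields.put("Reference Type", "Journal Article");
        fields.put("Authors", "");
        fields.put("Journal", "https://clinicaltrials.gov");
        fields.put("Pages", id);
        fields.put("URL", "https://clinicaltrials.gov/study/" + id);
        return fields;
    }
}
```

Because Pages then holds the same ID regardless of which database supplied the record, the ID can play the role that the starting page plays in the comparison of ordinary records.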

Limitations

  • Input file size: The maximum size of the input file is limited to 150MB.
  • Input file format: The program only handles files in RIS format, not in XML or CSV format.
  • Input file encoding: The program assumes that the input file is encoded as UTF-8.
  • If authors AND (all) titles AND (all) journal names for a record use a non-Latin script, results for this record may be inaccurate.
  • (when deduplicating one file:) The input file must be an export from ONE EndNote database: the ID fields are used internally for identifying the records, so they have to be unique. However, if the RIS file does not have an ID field in the first publication, DedupEndNote assumes the whole file has no ID fields and gives every publication an ID (starting with 1). This has only been tested on Zotero files!
  • The program has been developed and tested for biomedical databases (PubMed, EMBASE, ...) and some general databases (Web of Science, Scopus). The data sets used were the results of biomedical queries. Deduplicating records from other databases is not guaranteed to work, and performance is often very poor, esp. for records that are not journal articles (see Justification. 2.5. Effect of insufficient data from one of two records).
  • The program uses a bibliographic point of view:
    • an article or conference abstract that has been published in more than one (issue of a) journal is not considered a duplicate publication.
    • Records for each publication year are compared to records from the same and the following year: a record from 2016 is compared to the records from 2015 (when treating the records from 2015) and from 2016 and 2017 (when treating the records from 2016). A PubMed ahead-of-print record from 2013 and a corresponding record from 2017 (when it was 'officially' published) will not be compared, and therefore cannot be deduplicated.
    • Bibliographic databases are not always very accurate in the starting page of a publication. Because starting page is part of the comparisons, DedupEndNote misses the duplicates when bibliographic databases don't agree on the starting page (and one or both records have no DOIs).
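
The effect of the publication year window described above can be expressed as a single check. A minimal sketch with hypothetical names; it only covers records that have a publication year and leaves out the special cases (records without a year, Cochrane Reviews).

```java
/** Hypothetical sketch of the 1-year comparison window. */
final class YearWindowSketch {

    /** True if two records with these publication years are compared at all. */
    static boolean areCompared(int yearA, int yearB) {
        return Math.abs(yearA - yearB) <= 1;
    }
}
```

areCompared(2016, 2015) and areCompared(2016, 2017) are true, while areCompared(2013, 2017) is false: the ahead-of-print record from 2013 never meets its final version from 2017.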

FAQ

  1. Why only EndNote / Zotero RIS export files?
  2. I don't use EndNote, but ...
  3. I use groups, comments, ... in EndNote and don't want to lose them
  4. I have deleted records in EndNote and don't want them to reappear when updating
  5. I prefer PubMed records above EMBASE records
  6. Is DedupEndNote perfect?

1. Why only EndNote / Zotero RIS export files?

EndNote (and Zotero?) is very good at importing export files from many bibliographic databases. The RIS export from an EndNote / Zotero database has a standard format, which makes reading and interpreting records for DedupEndNote a lot easier.
(This version of) DedupEndNote does not read EndNote XML exports, because reading and writing XML files is more complex than reading and writing a RIS file, without offering any benefit (?).
DedupEndNote was developed with EndNote databases and RIS files. Zotero and Zotero RIS files were only taken into consideration later. In 2025 the results from a query in PubMed, Embase and Web of Science were imported into both an EndNote and a Zotero database, exported as RIS files, and deduplicated with DedupEndNote. The results were identical.

2. I don't use EndNote, but ...

Sorry, the program has been mainly developed and tested with EndNote export files in RIS format.
However, the code is available on GitHub. For other databases the main changes will probably have to be made only in the functions for reading and writing.

3. I use groups, comments, ... in EndNote and don't want to lose them

EndNote export files in RIS format do not contain grouping information!

If you already had groupings, ... in your EndNote database before you started using DedupEndNote: use DedupEndNote in Mark mode, and then deduplicate within EndNote on the Label field only.

TODO: EXPAND

TODO: XML (formatting)

However, for new projects the following will work: using a MASTER Database and a WORKING Database
(The names of the EndNote databases and the RIS files are just examples)

  1. create a MASTER database
  2. import your first set of records from one or more bibliographic databases into the MASTER Database
  3. (optionally: clean up the MASTER Database) if the MASTER Database can contain duplicate records, export the MASTER Database as a RIS file, deduplicate the file with DedupEndNote, import the deduplicated results into a NEW_MASTER Database EndNote file, remove the old MASTER Database, and make a copy of the NEW_MASTER database as your MASTER database (File > Save a Copy ...)
  4. Make a copy of the MASTER Database as your WORKING Database EndNote file (File > Save a Copy ...)
  5. edit, comment, delete, group, ... records only in the WORKING Database!

When you have new export files from the same or other bibliographic databases:

  1. import these results into a new TEMP EndNote database, export that TEMP EndNote database into a RIS file NEW_RECORDS.txt (overwriting the file if it already exists), and remove the TEMP EndNote database
  2. export the MASTER Database into a RIS file OLD_RECORDS.txt (overwriting the file if it already exists)
  3. use DedupEndNote - deduplicate 2 files with these two files (results will be in NEW_RECORDS_deduplicated.txt)
  4. import NEW_RECORDS_deduplicated.txt into the MASTER Database
  5. import NEW_RECORDS_deduplicated.txt into the WORKING Database.
    And again: edit, comment, delete, group, ... records only in the WORKING Database!
    • because the deleted records from the WORKING Database are still present in the MASTER Database (and as a consequence also in OLD_RECORDS.txt), duplicates for these records in NEW_RECORDS.txt will be recognized by DedupEndNote and not reappear in NEW_RECORDS_deduplicated.txt
    • the existing groupings, edits, ... in the WORKING Database will not be overwritten.

    TODO: What about manual adds?

    TODO: What about updates of ahead-of-print publications?

4. I have deleted records in EndNote and don't want them to reappear when updating

See previous question

5. I prefer PubMed records above EMBASE records

DedupEndNote compares records in the order of the input file, after they are grouped on publication year. This makes it possible to influence the choice of duplicate within a duplicate set.

Suppose all records have the field Database Provider filled in with either "PubMed" or "EMBASE". If you order the original EndNote database on the field Database Provider in descending order before exporting to a RIS file, DedupEndNote will use a PubMed record if a duplicate set contains one or more PubMed records and one or more EMBASE records (except if one or more EMBASE records of this duplicate set have a publication year one year later than all PubMed records). TODO: Explain why "later"?

Your preference will probably not be the same as the alphabetical order of the names of database providers ("CINAHL", "Cochrane ...", "EMBASE", ...). You could however change this field to "1 - PubMed", "2 - EMBASE", "3 - Cochrane", ... and order your EndNote database on this field (EndNote will order numerically) before exporting to a RIS file, or use another field with similar content.

6. Is DedupEndNote perfect?

Of course not.

However, DedupEndNote tries at all costs to avoid wrongly identifying records as duplicates. Mark mode was developed specifically for this: it makes it feasible to compare the results of deduplication by EndNote and by DedupEndNote.

False positives

There was only 1 false positive found in the validation sets (11,474 validated records):

  • Cytology screening (from SRA-DM, 1856 records, all validated)
  • Haematology (from SRA-DM, 1415 records, all validated)
  • Respiratory (from SRA-DM, 1988 records, all validated)
  • Stroke (from SRA-DM, 1292 records, all validated)
  • BIG SET (our own test database of 52,828 records, with 4923 records validated)

The false positive pair:

  • Cool, J.; Rosenblatt, R.; Kumar, S.; Lucero, C.; Fortune, B.; Crawford, C. V.; Jesudian, A.: The Association Between Portal Vein Thrombosis and Other Venous Thromboembolism in Cirrhosis: Analysis of a Nationally Representative Inpatient Cohort, in: Gastroenterology 154 (6 Supplement 1), 2018, S-1179
  • Cool, J.; Rosenblatt, R.; Kumar, S.; Lucero, C.; Fortune, B.; Crawford, C. V.; Jesudian, A.: TRENDS IN THE PREVALENCE OF PORTAL VEIN THROMBOSIS AND ASSOCIATED MORTALITY IN CIRRHOSIS: ANALYSIS OF A NATIONALLY REPRESENTATIVE INPATIENT COHORT, in: Gastroenterology 154 (6 Supplement 1), 2018, S-1178

False negatives

DedupEndNote will miss some duplicate records, e.g.:

  1. Records have different starting pages and do not share the same DOI
  2. The publication years of duplicate records can be more than 1 year apart. The validated subset of BIG_SET (4923 records) has 60 examples of duplicate records with different publication years: differences are 1 year (48), 2 years (8) and 4 years (4).
  3. Journal titles are not identified as similar (e.g. "Jbr-btr" and "Journal Belge de Radiology" with different or no ISSNs, or "Am J Dig Dis" and "Aaierj.Dio.Dis." (from EMBASE, definitely a typo), or "J Can Assoc Radiol" and "Canadian Association of Radiologists Journal")
  4. Replies: Identifying replies by "response" as one of the title words is not accurate enough (e.g. "Regulation of hepatic blood flow: The hepatic arterial buffer response revisited") and could wrongly identify duplicates. As a consequence "Incidence and Natural Course of Portal Vein Thrombosis in Cirrhosis Response" (Web of Science) is not marked as a duplicate of "Response to Senzolo and García-Pagán" (PubMed, EMBASE and Scopus) (https://doi.org/10.1038/ajg.2013.298).

After deduplication by DedupEndNote, deduplication with EndNote itself (default setting: Author, Year, Title) could be worthwhile. The case of different starting pages can only be solved by looking up the publication itself, the case of the missed similar journals by your brain.

The following results are for version 1.0.0 and should be updated for the latest version:
After deduplicating our test database of 52,828 records with DedupEndNote and importing the result file into a new EndNote database, deduplication with EndNote itself (default setting: Author, Year, Title) found 260 duplicates. After manual inspection, we found 118 records which were true duplicates. Some of the "reasons":

  • different starting page: 25
  • different starting page and DOI: 1
  • different journals: 7, i.e. DedupEndNote couldn't be expected to recognize these journals as similar
  • error in starting page: 1
  • Database error in Journal name: 12
    Scopus sometimes uses the continuation title too early:
    • "Blood Vessels" (PubMed) vs "Journal of Vascular Research" (Scopus, continuation of "Blood Vessels")
    • "Acta Pathol Jpn" (PubMed, EMBASE) vs "Pathology International" (Scopus, continuation of "Acta Pathol Jpn")
    • "Acta Paediatr Jpn" (PubMed) and "Pediatrics International" (Scopus, continuation of "Acta Paediatr Jpn")
    • "American Journal of Pediatric Hematology/Oncology" (PubMed) and "Journal of pediatric hematology/oncology" (Cochrane, Scopus, continuation of "American Journal of Pediatric Hematology/Oncology")
    • "Z Kinderchir" (PubMed) and "European Journal of Pediatric Surgery" (Scopus, continuation of "Z Kinderchir")
    • "Cardiovasc Surg" (PubMed) and "Vascular" (Scopus, continuation of "Cardiovasc Surg")
    • "Int J Addict" (PubMed) and "Substance Use and Misuse" (Scopus, continuation of "Int J Addict")
    • "Aust Paediatr J" (PubMed) and "Journal of Paediatrics and Child Health" (Scopus, continuation of "Aust Paediatr J")

    2025-08-01: the change in version 1.0.2 (comparison 5 (ISSN and Journal names) is skipped when the DOIs are the same) solves some (a lot?) of these problems. But: the journal name used in the deduplicated file is the one from the first record in the duplicate set.

  • DedupEndNote error in Journal name: 3, i.e. DedupEndNote would be expected to recognize these journals as similar
    • ("Jbr-btr" vs "Journal Belge de Radiologie")
    • "Langenbecks Arch Chir Suppl Kongressbd" vs "Langenbecks archiv fur chirurgie. Supplement. Kongressband. Deutsche gesellschaft fur chirurgie. Kongress"
    • "J Vasc Surg Venous Lymphat Disord" vs "Journal of Vascular Surgery: Venous and Lymphatic Disorders"
  • DedupEndNote parsing error in Starting page: 1 ("S6-97-s6-99" vs "S697-s699", DedupEndNote used "6" in the first case instead of "697")

Some cases of duplicate records are "unsolvable" for programs? Take e.g. the following publication (https://www.nejm.org/doi/full/10.1056/NEJM199105303242207):

  1. Cabot RC, Scully RE, Mark EJ, McNeely WF, McNeely BU, Podolsky DK, Lewandrowski KB. Case 22-1991: A 15-Year-Old Boy with Fever of Unknown Origin, Severe Anemia, and Portal-Vein Thrombosis. New England Journal of Medicine. 1991;324(22):1575-84.
  2. Podolsky DK, Lewandrowski KB. Case records of the Massachusetts General Hospital. New England Journal of Medicine. 1991;324(22):1575-84.
  3. Podolsky DK, Ferrucci JT, Ellis DS, Mark EJ, Pasternack MS, Huang PL, Lewandrowski KB, Dowling WJ, Goldfinger SE. A 15-YEAR-OLD BOY WITH FEVER OF UNKNOWN ORIGIN, SEVERE ANEMIA, AND PORTAL-VEIN THROMBOSIS - APPENDICITIS WITH PERIAPPENDICITIS AND FOREIGN-BODY GIANT-CELL REACTION CONSISTENT WITH PRIOR RUPTURE - PYLEPHLEBITIS WITH THROMBOSIS, ACUTE AND CHRONIC TRIADITIS, AND CHOLANGITIS. New England Journal of Medicine. 1991;324(22):1575-84.
  4. Case records of the Massachusetts General Hospital. Weekly clinicopathological exercises. Case 22-1991. A 15-year-old boy with fever of unknown origin, severe anemia, and portal-vein thrombosis. N Engl J Med. 1991;324(22):1575-84.

"Cabot RC, Scully RE, Mark EJ, McNeely WF, McNeely BU" in (1) were the (associate) editors of the NEJM in 1991, "Podolsky DK, Lewandrowski KB" in (2) described the case, all authors in (3) discussed the case.

For developers

  1. Validation
  2. Logging and log levels

1. Validation

A relational database was used for validation.
The RDBMS used was MS Access (MS Office 2016):

  • Because MS Access uses 2 types of Text field (short text: max. 255 characters, searchable and sortable; long text (formerly known as Memo field): unlimited, not sortable, not searchable) there are 2 fields for both title and authors
  • In MS Access the format for all boolean fields was changed to "True/False" (design view, General tab for these fields)

The table for a validation set contains the fields:

Format of validation database table
Field              Type     Default  Content
id                 INTEGER           The original ID in the EndNote DB. PRIMARY KEY
dedupid            INTEGER  NULL     Content of the Label field in Mark mode, i.e. the ID of the first record in a duplicate set
correction         INTEGER  NULL     Manually set for the False Positive (FP) and False Negative (FN) results (see below)
validated          BOOLEAN  FALSE    Manually set to TRUE if the DedupEndNote result is validated
tp                 BOOLEAN  FALSE    Manually set to TRUE if record is indeed a duplicate of the record with DedupID
tn                 BOOLEAN  FALSE    Manually set to TRUE if record has no duplicates
fp                 BOOLEAN  FALSE    Manually set to TRUE if DedupEndNote has wrongly identified the record as a duplicate of record with DedupID.
                                     If the record has no duplicates, Correction contains the ID, otherwise the ID of the true duplicate.
fn                 BOOLEAN  FALSE    Manually set to TRUE if DedupEndNote has not identified the record as a duplicate.
                                     The ID of the missed duplicate is stored in Correction.
                                     If the record is a False Positive but also has duplicates, it is only marked as False Positive: otherwise TP + TN + FP + FN would be greater than the size of the validation set.
unsolvable         BOOLEAN  FALSE    ???
authors_truncated  TEXT              Authors joined with '; ', truncated at 254 characters
                                     In an MS Access DB: SHORT TEXT (i.e. max. 255 characters), to make the field sortable and searchable.
authors            TEXT              Authors joined with '; '
                                     In an MS Access DB: LONG TEXT (a.k.a. MEMO), not sortable or searchable.
publ_year          TEXT              Publication Year
title_truncated    TEXT              Title, truncated at 254 characters
                                     In an MS Access DB: SHORT TEXT (i.e. max. 255 characters), to make the field sortable and searchable.
title              TEXT              Title
                                     In an MS Access DB: LONG TEXT (a.k.a. MEMO), not sortable or searchable.
title2             TEXT              Journal Title / Book Title
volume             TEXT              Volume
issue              TEXT              Issue
pages              TEXT              Starting Page
article_number     TEXT              Article Number
dois               TEXT              DOIs joined with '; '
publ_type          TEXT              Type of publication. 'type' is a SQL reserved word
database           TEXT              Database Provider
number_authors     INTEGER           Number of authors

"ValidationTests.java" can write a tab-delimited file of a DedupEndNote run in Mark mode.
  • import the file into the RDBMS
    When importing this file into MS Access (tab delimited, no text delimiter), also click the Advanced button and change the encoding to UTF-8. Without the Advanced button, the "Long text" fields get truncated, some fields are considered unparseable, ... Is this caused by the encoding, or just by the fact that the Advanced button is used?
  • validate a number of records
  • select the validated records and export them as a tab delimited file. (In MS Access: select the whole set of validated records, copy, paste in a text editor, save)
"ValidationTests.java" has tests for comparing the results of a new version of DedupEndNote with the validated set (exported as a tab delimited file).

2. Logging and log levels

  • Reading the records
    java -Dlogging.level.edu.dedupendnote.services.IOService=DEBUG -jar DedupEndNote-0.9.7b-SNAPSHOT.jar
    If everything works, the log should end with "Publications read: ". If not, the log will end with a message with the Record ID and title of the last publication that was successfully read.
  • Converting the records
    java -Dlogging.level.edu.dedupendnote.services.IOService=DEBUG -Dlogging.level.edu.dedupendnote.domain.Publication=DEBUG -jar DedupEndNote-0.9.7b-SNAPSHOT.jar
  • Deduplicating the records
    java -Dlogging.level.edu.dedupendnote.services=DEBUG -jar DedupEndNote-0.9.7b-SNAPSHOT.jar

Do not set the level to TRACE. The log file will be flooded, and the program may come to a halt.
The test file MissedDuplicatesTests.java uses the level TRACE.