DedupEndNote: Developers

  1. Validation
  2. Logging and log levels

1. Validation

A relational database was used for validation.
The RDBMS used was MS Access (MS Office 2016):

  • MS Access has two types of text field: Short Text (max. 255 characters, searchable and sortable) and Long Text (formerly known as Memo; unlimited length, not sortable, not searchable). For that reason both title and authors are stored in two fields: a truncated Short Text field and a full Long Text field.
  • In MS Access the format for all boolean fields was changed to "True/False" (design view, General tab for these fields)
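
The pairing of a full field and a truncated field can be sketched in Java as follows. This is only an illustration, not the DedupEndNote code: the class and method names are hypothetical, and the 254-character cut-off and the '; ' separator are taken from the field descriptions in the table below.

```java
import java.util.List;

// Hypothetical helper, not part of DedupEndNote.
final class TextFieldSketch {
    // Cut-off used for the *_truncated fields (MS Access Short Text holds max. 255 characters).
    static final int MAX_SHORT_TEXT = 254;

    /** Joins the author names with "; ", the separator used in the validation table. */
    static String joinAuthors(List<String> authors) {
        return String.join("; ", authors);
    }

    /** Returns the value unchanged if it fits, otherwise its first 254 characters. */
    static String truncate(String value) {
        return value.length() <= MAX_SHORT_TEXT ? value : value.substring(0, MAX_SHORT_TEXT);
    }
}
```

The full value would go into the Long Text field (authors, title) and the result of truncate into the Short Text field (authors_truncated, title_truncated). Note that substring counts UTF-16 code units, so a character outside the Basic Multilingual Plane could be split at the cut-off.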

The table for a validation set contains the fields:

| Field | Type | Default | Content |
|---|---|---|---|
| id | INTEGER | | The original ID in the EndNote DB. PRIMARY KEY |
| dedupid | INTEGER | NULL | Content of the Label field in Mark mode, i.e. the ID of the first record in a duplicate set |
| correction | INTEGER | NULL | Manually set for the False Positive (FP) and False Negative (FN) results (see the fp and fn rows) |
| validated | BOOLEAN | FALSE | Manually set to TRUE if the DedupEndNote result is validated |
| tp | BOOLEAN | FALSE | Manually set to TRUE if the record is indeed a duplicate of the record with dedupid |
| tn | BOOLEAN | FALSE | Manually set to TRUE if the record has no duplicates |
| fp | BOOLEAN | FALSE | Manually set to TRUE if DedupEndNote has wrongly identified the record as a duplicate of the record with dedupid. If the record has no duplicates, correction contains the record's own ID, otherwise the ID of the true duplicate. |
| fn | BOOLEAN | FALSE | Manually set to TRUE if DedupEndNote has not identified the record as a duplicate. The ID of the missed duplicate is stored in correction. If the record is a False Positive but also has duplicates, it is only marked as False Positive: otherwise TP + TN + FP + FN would be greater than the size of the validation set. |
| unsolvable | BOOLEAN | FALSE | ??? |
| authors_truncated | TEXT | | Authors joined with '; ', truncated at 254 characters. In an MS Access DB: SHORT TEXT (i.e. max. 255 characters), to make the field sortable and searchable. |
| authors | TEXT | | Authors joined with '; '. In an MS Access DB: LONG TEXT (a.k.a. MEMO), not sortable or searchable. |
| publ_year | TEXT | | Publication Year |
| title_truncated | TEXT | | Title, truncated at 254 characters. In an MS Access DB: SHORT TEXT (i.e. max. 255 characters), to make the field sortable and searchable. |
| title | TEXT | | Title. In an MS Access DB: LONG TEXT (a.k.a. MEMO), not sortable or searchable. |
| title2 | TEXT | | Journal Title / Book Title |
| volume | TEXT | | Volume |
| issue | TEXT | | Issue |
| pages | TEXT | | Starting Page |
| article_number | TEXT | | Article Number |
| dois | TEXT | | DOIs joined with '; ' |
| publ_type | TEXT | | Type of publication. Named publ_type because 'type' is an SQL reserved word. |
| database | TEXT | | Database Provider |
| number_authors | INTEGER | | Number of authors |

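
The note on the fn field implies that every validated record carries exactly one of the flags tp, tn, fp and fn, so that TP + TN + FP + FN equals the size of the validation set. A minimal Java sketch of that consistency check follows. The class Outcome and both methods are hypothetical names introduced for illustration, and the "exactly one flag" rule is an assumption derived from that note; how records marked unsolvable should be counted is not covered.

```java
import java.util.List;

// Hypothetical sketch, not part of DedupEndNote.
final class ValidationCountsSketch {
    /** Holder for the four outcome flags of one validated record. */
    static final class Outcome {
        final boolean tp, tn, fp, fn;

        Outcome(boolean tp, boolean tn, boolean fp, boolean fn) {
            this.tp = tp; this.tn = tn; this.fp = fp; this.fn = fn;
        }

        /** Number of flags set to true; exactly 1 is assumed for a validated record. */
        int flagsSet() {
            return (tp ? 1 : 0) + (tn ? 1 : 0) + (fp ? 1 : 0) + (fn ? 1 : 0);
        }
    }

    /** True if every record has exactly one flag, so the four counts add up to the set size. */
    static boolean isConsistent(List<Outcome> outcomes) {
        for (Outcome o : outcomes) {
            if (o.flagsSet() != 1) return false;
        }
        return true;
    }

    /** Returns the counts in the order TP, TN, FP, FN. */
    static int[] counts(List<Outcome> outcomes) {
        int[] c = new int[4];
        for (Outcome o : outcomes) {
            if (o.tp) c[0]++;
            if (o.tn) c[1]++;
            if (o.fp) c[2]++;
            if (o.fn) c[3]++;
        }
        return c;
    }
}
```

A record flagged as both fp and fn would make isConsistent return false, which is the double counting the fn note warns about.
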
"ValidationTests.java" can write a tab-delimited file of a DedupEndNote run in Mark mode. This file is then processed as follows:
  • Import the file into the RDBMS.
    When importing this file into MS Access (tab delimited, no text delimiter), also click the Advanced button and change the encoding to UTF-8. Without the Advanced button, the Long Text fields get truncated, some fields are considered unparseable, ... It is not clear whether this is caused by the encoding or merely by the fact that the Advanced button is used.
  • Validate a number of records.
  • Select the validated records and export them as a tab-delimited file. (In MS Access: select the whole set of validated records, copy, paste into a text editor, save.)
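
A file exported this way can be read back with plain Java. The following is a minimal sketch, not the code in "ValidationTests.java": the class and method names are hypothetical, and it assumes a UTF-8 file without text delimiters in which no field contains a tab or a line break.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of DedupEndNote.
final class TabFileSketch {
    /** Splits one tab-delimited line; the limit -1 keeps trailing empty fields. */
    static String[] splitLine(String line) {
        return line.split("\t", -1);
    }

    /** Reads a UTF-8 tab-delimited file into rows of fields, skipping empty lines. */
    static List<String[]> readRows(Path file) throws IOException {
        List<String[]> rows = new ArrayList<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            if (!line.isEmpty()) {
                rows.add(splitLine(line));
            }
        }
        return rows;
    }
}
```

The limit of -1 matters because the last columns of a record may be empty: without it, String.split drops trailing empty fields and the rows end up with different lengths.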
"ValidationTests.java" has tests for comparing the results of a new version of DedupEndNote with the validated set (exported as a tab delimited file).

2. Logging and log levels

  • Reading the records
    java -Dlogging.level.edu.dedupendnote.services.IOService=DEBUG -jar DedupEndNote-0.9.5-SNAPSHOT.jar
    If everything works, the log should end with "Records read: ". If not, the log shows the last record that was read successfully.
  • Converting the records
    java -Dlogging.level.edu.dedupendnote.services.IOService=DEBUG -Dlogging.level.edu.dedupendnote.domain.Record=DEBUG -jar DedupEndNote-0.9.5-SNAPSHOT.jar
  • Deduplicating the records
    java -Dlogging.level.edu.dedupendnote.services=DEBUG -jar DedupEndNote-0.9.5-SNAPSHOT.jar