DedupEndNote: Developers

Validation
Logging and log levels

1. Validation

A relational database was used for validation.
The RDBMS used was MS Access (MS Office 2016):

Because MS Access has two types of text field (Short Text: max. 255 characters, searchable and sortable; Long Text, formerly known as Memo: unlimited length, not sortable, not searchable), there are two fields each for title and authors.
In MS Access, the format of all boolean fields was changed to "True/False" (Design view, General tab of each field).

The table for a validation set contains the fields:

Format of validation database table
Field	Type	Default	Content
id	INTEGER		The original ID in the EndNote DB. PRIMARY KEY
dedupid	INTEGER	NULL	Content of the Label field in Mark mode, i.e. the ID of the first record in a duplicate set
correction	INTEGER	NULL	Manually set for the False Positive (FP) and False Negative (FN) results (see below)
validated	BOOLEAN	FALSE	Manually set to TRUE if the DedupEndNote result is validated
tp	BOOLEAN	FALSE	Manually set to TRUE if record is indeed a duplicate of the record with DedupID
tn	BOOLEAN	FALSE	Manually set to TRUE if record has no duplicates
fp	BOOLEAN	FALSE	Manually set to TRUE if DedupEndNote has wrongly identified the record as a duplicate of record with DedupID. If the record has no duplicates, Correction contains the ID, otherwise the ID of the true duplicate.
fn	BOOLEAN	FALSE	Manually set to TRUE if DedupEndNote has not identified the record as a duplicate. The ID of the missed duplicate is stored in Correction. If the record is a False Positive but also has duplicates, it is only marked as False Positive: otherwise TP + TN + FP + FN would be greater than the size of the validation set.
unsolvable	BOOLEAN	FALSE	???
authors_truncated	TEXT		Authors joined with '; ', truncated at 254 characters. In an MS Access DB: SHORT TEXT (i.e. max. 255 characters), to make the field sortable and searchable.
authors	TEXT		Authors joined with '; '. In an MS Access DB: LONG TEXT (a.k.a. MEMO), not sortable or searchable.
publ_year	TEXT		Publication Year
title_truncated	TEXT		Title, truncated at 254 characters. In an MS Access DB: SHORT TEXT (i.e. max. 255 characters), to make the field sortable and searchable.
title	TEXT		Title. In an MS Access DB: LONG TEXT (a.k.a. MEMO), not sortable or searchable.
title2	TEXT		Journal Title / Book Title
volume	TEXT		Volume
issue	TEXT		Issue
pages	TEXT		Starting Page
article_number	TEXT		Article Number
dois	TEXT		DOIs joined with '; '
publ_type	TEXT		Type of publication. 'type' is a SQL reserved word
database	TEXT		Database Provider
number_authors	INTEGER		Number of authors
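
The field descriptions above imply a few simple rules: authors are joined with '; ', the `*_truncated` fields are cut at 254 characters, and a record carries at most one of the flags tp, tn, fp and fn (otherwise TP + TN + FP + FN would exceed the size of the validation set). The following is a minimal sketch of these rules in Java. The names `ValidationFlags` and `ValidationFieldSketch` are hypothetical and are not taken from the DedupEndNote code base.

```java
import java.util.List;

/** Hypothetical holder for the classification flags of one validation row. */
record ValidationFlags(int id, boolean tp, boolean tn, boolean fp, boolean fn) {

    /** Number of flags set among tp, tn, fp and fn. */
    int flagCount() {
        return (tp ? 1 : 0) + (tn ? 1 : 0) + (fp ? 1 : 0) + (fn ? 1 : 0);
    }
}

/** Hypothetical helpers mirroring the field descriptions in the table. */
final class ValidationFieldSketch {

    /** Length limit of authors_truncated and title_truncated. */
    static final int MAX_TRUNCATED_LENGTH = 254;

    /** Joins the authors with "; ", as described for the authors field. */
    static String joinAuthors(List<String> authors) {
        return String.join("; ", authors);
    }

    /** Cuts a value at 254 characters (Java chars); shorter values are returned unchanged. */
    static String truncate(String value) {
        return value.length() <= MAX_TRUNCATED_LENGTH
                ? value
                : value.substring(0, MAX_TRUNCATED_LENGTH);
    }

    /** True if no row has more than one of the flags tp, tn, fp and fn set. */
    static boolean atMostOneFlagEach(List<ValidationFlags> rows) {
        return rows.stream().allMatch(row -> row.flagCount() <= 1);
    }
}
```

A check such as `atMostOneFlagEach` could be run over the exported validation set before the counts are compared with the size of the set.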

"ValidationTests.java" can write a tab-delimited file of a DedupEndNote run in Mark mode.

Import the file into the RDBMS.

When importing this file into MS Access (tab delimited, no text delimiter), also open the Advanced dialog and change the encoding to UTF-8. Without the Advanced dialog, the Long Text fields get truncated, some fields are considered unparseable, and so on. It is not clear whether the encoding change or merely the use of the Advanced dialog makes the difference.
Validate a number of records.
Select the validated records and export them as a tab-delimited file. (In MS Access: select the whole set of validated records, copy, paste into a text editor, save.)
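
The exported file can be read back with a few lines of Java. The sketch below is only an illustration and not part of DedupEndNote: the class name `TabDelimitedExportSketch` is hypothetical, and it assumes that the first line of the file holds the field names, that the file was saved as UTF-8, and that no field contains a tab or a line break.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical reader for a tab-delimited export of the validation table. */
final class TabDelimitedExportSketch {

    /**
     * Parses tab-delimited lines into one map (field name to value) per record.
     * Assumption: the first line holds the field names and no text delimiter is used.
     */
    static List<Map<String, String>> parse(List<String> lines) {
        List<Map<String, String>> rows = new ArrayList<>();
        if (lines.isEmpty()) {
            return rows;
        }
        String[] header = lines.get(0).split("\t", -1);
        for (String line : lines.subList(1, lines.size())) {
            if (line.isEmpty()) {
                continue; // skip blank lines, e.g. one left at the end by a text editor
            }
            // The limit -1 keeps trailing empty fields, such as an empty last column.
            String[] values = line.split("\t", -1);
            Map<String, String> row = new LinkedHashMap<>();
            for (int i = 0; i < header.length; i++) {
                row.put(header[i], i < values.length ? values[i] : "");
            }
            rows.add(row);
        }
        return rows;
    }

    /** Reads the whole file as UTF-8 and parses it. */
    static List<Map<String, String>> read(Path file) throws IOException {
        return parse(Files.readAllLines(file, StandardCharsets.UTF_8));
    }
}
```

If the text editor writes a byte order mark, the first field name will start with that character and would have to be stripped before the map is used.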

"ValidationTests.java" has tests for comparing the results of a new version of DedupEndNote with the validated set (exported as a tab delimited file).
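
As an illustration of what such a comparison involves, the sketch below lists the validated records whose dedupid differs in a new run. It is not the logic of "ValidationTests.java": the class `DedupIdComparisonSketch` and the method `changedIds` are hypothetical, and both inputs are assumed to be maps from record id to the dedupid text, with an empty string where there is none.

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeSet;

/** Hypothetical comparison of a validated set with the result of a new run. */
final class DedupIdComparisonSketch {

    /**
     * Returns, in ascending order, the ids of validated records whose dedupid
     * in the new run differs from the validated value, or which are missing
     * from the new run altogether.
     */
    static List<Integer> changedIds(Map<Integer, String> validated, Map<Integer, String> newRun) {
        TreeSet<Integer> changed = new TreeSet<>();
        for (Map.Entry<Integer, String> entry : validated.entrySet()) {
            String newValue = newRun.get(entry.getKey()); // null if the record is absent
            if (!Objects.equals(entry.getValue(), newValue)) {
                changed.add(entry.getKey());
            }
        }
        return List.copyOf(changed);
    }
}
```

Records reported by `changedIds` are the ones that would need another look: a change may be a fixed False Positive or False Negative, or a new error.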

2. Logging and log levels

Reading the records
```
java -Dlogging.level.edu.dedupendnote.services.IOService=DEBUG -jar DedupEndNote-0.9.5-SNAPSHOT.jar
```
If everything works, the log should end with "Records read: ". If not, the log will show the last record that was read successfully.

Converting the records

```
java -Dlogging.level.edu.dedupendnote.services.IOService=DEBUG -Dlogging.level.edu.dedupendnote.domain.Record=DEBUG -jar DedupEndNote-0.9.5-SNAPSHOT.jar
```

Deduplicating the records

```
java -Dlogging.level.edu.dedupendnote.services=DEBUG -jar DedupEndNote-0.9.5-SNAPSHOT.jar
```