OpenArXiv
Relational DB Schema
Outlined below are the tables and columns that will be used in the first draft of the OpenArXiv relational database. Some data has been intentionally omitted, either because it is difficult to gather through an automated process or because it is irrelevant to the project goal. Wherever a table has no natural primary key, an arbitrary numeric ID has been added to assist with indexing.
The main goal of this database is to track the relationships between publications, authors, and the various categories ArXiv uses for classification.
publications
(pubid, archive, year, month, seq, title, comments (nulls ok), journal-ref (nulls ok))
Where pubid is the citation ID (hep-th/0506123, etc.), archive is the category it is entered under (hep-th), and year/month/seq are the components of the numerical part of the ID (05, 06, and 123 in this example). Title, comments, and journal-ref are text fields that capture the same data as their ArXiv citation counterparts.
A separate attribute for archive might be redundant, since the archive already appears in pubid, but for the time being it handles the odd cases in ArXiv where a hep-lat article is entered under hep-th instead of hep-lat. Whether it is ultimately necessary depends on how ArXiv's organization is handled in the end product.
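As a rough illustration of how the publications columns relate to the citation ID, the sketch below splits an old-style ID into archive, year, month, and seq and stores it in an SQLite table. The regular expression, the function names, and the column name journal_ref (used in place of journal-ref, whose hyphen would need quoting in SQL) are assumptions made for this example, not part of the actual OpenArXiv implementation.

```python
import re
import sqlite3

# Assumed pattern for old-style IDs such as hep-th/0506123:
# archive, a slash, then YY MM NNN packed into seven digits.
PUBID_RE = re.compile(r"^([A-Za-z.-]+)/(\d{2})(\d{2})(\d{3})$")


def split_pubid(pubid):
    """Return (archive, year, month, seq) for an old-style pubid."""
    match = PUBID_RE.match(pubid)
    if match is None:
        raise ValueError(f"unrecognized pubid: {pubid!r}")
    archive, year, month, seq = match.groups()
    return archive, int(year), int(month), int(seq)


def create_publications(conn):
    # comments and journal_ref are the only nullable columns.
    conn.execute(
        """
        CREATE TABLE publications (
            pubid       TEXT PRIMARY KEY,
            archive     TEXT NOT NULL,
            year        INTEGER NOT NULL,
            month       INTEGER NOT NULL,
            seq         INTEGER NOT NULL,
            title       TEXT NOT NULL,
            comments    TEXT,
            journal_ref TEXT
        )
        """
    )


def insert_publication(conn, pubid, title, comments=None, journal_ref=None):
    # The archive, year, month and seq columns are derived from the ID itself.
    archive, year, month, seq = split_pubid(pubid)
    conn.execute(
        "INSERT INTO publications VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (pubid, archive, year, month, seq, title, comments, journal_ref),
    )
```

Note that the year is kept as the two-digit value found in the ID; expanding it to four digits would be a separate design decision.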
persons
(ID (arbitrary numeric ID), pubid, abbr, name)
Pubid refers to the publications table; abbr and name are the abbreviated and full name of the person in question. Assigning a unique ID to each person would resolve the issue of duplicate names and is the ideal approach, but it is currently impossible given how ArXiv stores its data and the methods we've used to retrieve it.
referenced
(ID (arbitrary numeric ID), citing, cited)
Citing and cited use the same format as pubid in the publications table, but they are stored as plain varchar columns rather than as a separate data type of their own.
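The sketch below shows one way the referenced table could be created and queried in SQLite to find which articles cite a given article, and which articles a given article cites. The VARCHAR length, the function names, and the choice to sort the results are assumptions for illustration only.

```python
import sqlite3


def create_referenced(conn):
    # citing and cited hold pubid-formatted strings in plain varchar columns.
    conn.execute(
        """
        CREATE TABLE referenced (
            id     INTEGER PRIMARY KEY,
            citing VARCHAR(32) NOT NULL,
            cited  VARCHAR(32) NOT NULL
        )
        """
    )


def add_reference(conn, citing, cited):
    conn.execute(
        "INSERT INTO referenced (citing, cited) VALUES (?, ?)", (citing, cited)
    )


def cited_by(conn, pubid):
    """Return the pubids of the articles that cite the given article."""
    rows = conn.execute(
        "SELECT citing FROM referenced WHERE cited = ? ORDER BY citing", (pubid,)
    )
    return [row[0] for row in rows]


def references_of(conn, pubid):
    """Return the pubids of the articles that the given article cites."""
    rows = conn.execute(
        "SELECT cited FROM referenced WHERE citing = ? ORDER BY cited", (pubid,)
    )
    return [row[0] for row in rows]
```

Because both columns share the pubid format, either one can be joined against publications.pubid without any conversion.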
hierarchy
(ID (arbitrary numeric ID), topic, archive, sclasstype)
Where topic is the topmost category (physics, mathematics, etc.), archive is the abbreviated subtopic (hep-th), and sclasstype is the full name of the subtopic (High Energy Physics – Theory). All categories except physics form the archive name from an abbreviation of the topic and an abbreviation of the sclasstype (e.g. Mathematics + Number Theory becomes math.NT). Physics archives are either shortened versions of the sclasstype (hep-th) or the catch-all “physics” for sclasstypes that don’t really belong to any other archive.
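To make the three hierarchy columns concrete, the following sketch loads the two examples mentioned above into an SQLite table and looks them up by archive and by topic. The helper names are hypothetical, and the two sample rows are only the examples from the text, not a full category list.

```python
import sqlite3

# Sample rows taken from the examples in the text: (topic, archive, sclasstype).
SAMPLE_ROWS = [
    ("physics", "hep-th", "High Energy Physics - Theory"),
    ("mathematics", "math.NT", "Number Theory"),
]


def create_hierarchy(conn):
    conn.execute(
        """
        CREATE TABLE hierarchy (
            id         INTEGER PRIMARY KEY,
            topic      TEXT NOT NULL,
            archive    TEXT NOT NULL,
            sclasstype TEXT NOT NULL
        )
        """
    )
    conn.executemany(
        "INSERT INTO hierarchy (topic, archive, sclasstype) VALUES (?, ?, ?)",
        SAMPLE_ROWS,
    )


def full_name(conn, archive):
    """Return the sclasstype for an archive abbreviation, or None if unknown."""
    row = conn.execute(
        "SELECT sclasstype FROM hierarchy WHERE archive = ?", (archive,)
    ).fetchone()
    return row[0] if row else None


def archives_in_topic(conn, topic):
    """Return every archive abbreviation filed under a topmost category."""
    rows = conn.execute(
        "SELECT archive FROM hierarchy WHERE topic = ? ORDER BY archive", (topic,)
    )
    return [row[0] for row in rows]
```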
subjclass
(ID (arbitrary numeric ID), pubid, sclasstype)
Where sclasstype is a subj-class entry in the citation. Since an article can have any number of sclasstype values, it's not practical to make subj-class an attribute of publications.
acmclass
(ID (arbitrary numeric ID), pubid, aclasstype)
Where aclasstype is an ACM-class entry in the citation. Since an article can have any number of aclasstype values, it's not practical to make ACM-class an attribute of publications.
mscclass
(ID (arbitrary numeric ID), pubid, mclasstype)
Where mclasstype is an MSC-class entry in the citation. Since an article can have any number of mclasstype values, it's not practical to make MSC-class an attribute of publications.
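The subjclass, acmclass, and mscclass tables all follow the same one-row-per-value pattern, which could be sketched in SQLite as shown below. The CLASS_TABLES mapping and the helper functions are hypothetical names introduced for this example; only the table and column names come from the schema above.

```python
import sqlite3

# Table name -> name of its value column, as listed in the schema.
CLASS_TABLES = {
    "subjclass": "sclasstype",
    "acmclass": "aclasstype",
    "mscclass": "mclasstype",
}


def create_class_tables(conn):
    for table, column in CLASS_TABLES.items():
        conn.execute(
            f"CREATE TABLE {table} ("
            f"id INTEGER PRIMARY KEY, pubid TEXT NOT NULL, {column} TEXT NOT NULL)"
        )


def add_class(conn, table, pubid, value):
    # Looking the table up in CLASS_TABLES rejects unknown names with KeyError,
    # so only the three fixed identifiers are ever interpolated into SQL.
    column = CLASS_TABLES[table]
    conn.execute(
        f"INSERT INTO {table} (pubid, {column}) VALUES (?, ?)", (pubid, value)
    )


def classes_for(conn, table, pubid):
    """Return every class value recorded for one article, in insertion order."""
    column = CLASS_TABLES[table]
    rows = conn.execute(
        f"SELECT {column} FROM {table} WHERE pubid = ? ORDER BY id", (pubid,)
    )
    return [row[0] for row in rows]
```

An article with three MSC-class entries simply gets three rows in mscclass, and an article with none gets no rows at all, which is what a single column on publications could not express.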
Suggested Improvements
Separate the persons table into two tables: persons(perid, abbr, name) and authored(perid, pubid), where perid is a unique ID identifying a given person. This was the originally desired format for this data, but because of the way ArXiv stores its data, it was impossible to automate data retrieval in this format. As a result, duplicate names appear in the persons table, which can cause confusion and has created issues with the paging function.
Because of this, the persons search feature retrieves the top 50 distinct names and displays them on a single page. As long as the name entered is reasonably accurate, it should show up in those first 50 results.
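The sketch below illustrates both halves of this discussion in SQLite: a distinct-name search limited to 50 rows against the current persons table, and the proposed persons/authored split. The substring match, the alphabetical ordering used to decide which names count as the "top" 50, and all function names are assumptions, since the actual search query is not described here.

```python
import sqlite3


def create_current_persons(conn):
    # Current layout: one row per (person, publication) pair, so names repeat.
    conn.execute(
        """
        CREATE TABLE persons (
            id    INTEGER PRIMARY KEY,
            pubid TEXT NOT NULL,
            abbr  TEXT NOT NULL,
            name  TEXT NOT NULL
        )
        """
    )


def search_names(conn, fragment, limit=50):
    """Return up to `limit` distinct names containing `fragment`."""
    rows = conn.execute(
        "SELECT DISTINCT name FROM persons WHERE name LIKE ? "
        "ORDER BY name LIMIT ?",
        (f"%{fragment}%", limit),
    )
    return [row[0] for row in rows]


def create_proposed_tables(conn):
    # Proposed layout: each person stored once, linked to publications
    # through a separate authored table.
    conn.execute(
        """
        CREATE TABLE persons (
            perid INTEGER PRIMARY KEY,
            abbr  TEXT NOT NULL,
            name  TEXT NOT NULL
        )
        """
    )
    conn.execute(
        """
        CREATE TABLE authored (
            perid INTEGER NOT NULL REFERENCES persons(perid),
            pubid TEXT NOT NULL,
            PRIMARY KEY (perid, pubid)
        )
        """
    )
```

With the proposed layout, DISTINCT would no longer be needed for the name search, because each person would appear in persons exactly once.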