OpenArXiv Relational DB Schema

 

Outlined below are the tables and columns used in the first draft of the OpenArXiv relational database. Some data has been intentionally omitted, either because it is difficult to gather through an automated process or because it is irrelevant to the project goal. Wherever a table has no natural primary key, an arbitrary numeric ID has been added to assist with indexing.

The main goal of this database is to track the relationships between publications, authors, and the various categories ArXiv uses for classification.

publications

(pubid, archive, year, month, seq, title, comments (nulls ok), journal-ref (nulls ok))

 

Here pubid is the citation ID (e.g. hep-th/0506123), archive is the category the article is entered under (hep-th), and year, month, and seq are the components of the numerical part of the ID. For hep-th/0506123, the year is 05, the month is 06, and the seq is 123.
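The decomposition described above can be sketched in Python as follows. This is only an illustration: the function name parse_pubid, the PubId tuple, and the regular expression are assumptions, and the pattern covers only IDs shaped like the example (archive name, a slash, then a two-digit year, two-digit month, and three-digit sequence number). The year is kept as the two-digit value found in the ID.

```python
import re
from typing import NamedTuple


class PubId(NamedTuple):
    """Hypothetical container for the components of a citation ID."""
    archive: str
    year: int   # two-digit year exactly as it appears in the ID
    month: int
    seq: int


# Assumed shape: archive (optionally with a ".XX" suffix), "/", YYMMNNN.
_PUBID_RE = re.compile(r"^([A-Za-z-]+(?:\.[A-Za-z-]+)?)/(\d{2})(\d{2})(\d{3})$")


def parse_pubid(pubid: str) -> PubId:
    """Split a citation ID such as 'hep-th/0506123' into its parts."""
    match = _PUBID_RE.match(pubid)
    if match is None:
        raise ValueError(f"unrecognized pubid: {pubid!r}")
    archive, year, month, seq = match.groups()
    return PubId(archive, int(year), int(month), int(seq))
```

For example, parse_pubid("hep-th/0506123") yields archive "hep-th", year 5, month 6, and seq 123.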

 

Title, comments, and journal-ref are text fields that capture the same data as their ArXiv citation counterparts.

 

Having a separate attribute for archive might be redundant, but for the time being it addresses those strange cases in ArXiv where a hep-lat article is entered under hep-th instead of hep-lat. Whether it is ultimately necessary depends on how ArXiv's organization is handled in the end product.

 

persons

(ID (arbitrary numeric ID), pubid, abbr, name)

 

Pubid refers to the publications table; abbr and name are the abbreviated and full name of the person in question. Assigning a unique ID to each person would resolve the issue of duplicate names and is the ideal design, but it is currently impossible given how ArXiv stores its data and the methods we’ve used to retrieve it.

 

referenced

(ID (arbitrary numeric ID), citing, cited)

 

Citing and cited are both formatted identically to pubid from the publications table, but they are plain varchar columns rather than a separate data type of their own.
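As a rough illustration of how this table might be queried, the sketch below builds an in-memory SQLite copy of referenced and looks up which articles cite a given pubid. The choice of SQLite, the column sizes, the helper name cited_by, and the sample rows are all assumptions made for the example; none of them come from the actual OpenArXiv data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE referenced (
        id     INTEGER PRIMARY KEY,      -- arbitrary numeric ID
        citing VARCHAR(32) NOT NULL,     -- same format as publications.pubid
        cited  VARCHAR(32) NOT NULL      -- same format as publications.pubid
    )
    """
)

# Made-up citation pairs, for illustration only.
conn.executemany(
    "INSERT INTO referenced (citing, cited) VALUES (?, ?)",
    [
        ("hep-th/0506123", "hep-th/0401001"),
        ("hep-th/0507001", "hep-th/0401001"),
        ("hep-th/0507001", "hep-th/0506123"),
    ],
)


def cited_by(conn: sqlite3.Connection, pubid: str) -> list[str]:
    """Return the pubids of all articles that cite the given article."""
    rows = conn.execute(
        "SELECT citing FROM referenced WHERE cited = ? ORDER BY citing",
        (pubid,),
    )
    return [row[0] for row in rows]
```

Because both columns are ordinary strings, nothing in this sketch prevents a malformed or unknown pubid from being stored; any such check would have to happen before insertion.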

 

hierarchy

(ID (arbitrary numeric ID), topic, archive, sclasstype)

 

Here topic is the topmost category (physics, mathematics, etc.), archive is the abbreviated subtopic (hep-th), and sclasstype is the full name of the subtopic (High Energy Physics – Theory). All categories except physics form the archive name from an abbreviation of the topic and an abbreviation of the sclasstype (e.g. Mathematics + Number Theory becomes math.NT). Physics archives are either shortened versions of the sclasstype (hep-th) or the catch-all “physics” for class types that do not belong to any specific archive.

 

subjclass

(ID (arbitrary numeric ID), pubid, sclasstype)

 

Here sclasstype is a subj-class entry in the citation. Since an article can have any number of sclasstype values, it is not practical to make subj-class an attribute of publications.
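The one-to-many arrangement described above might look like the following sketch, which uses an in-memory SQLite database. The publications table is cut down to two columns for brevity, and the helper name subject_classes, the column types, and the sample rows are assumptions for illustration. The acmclass and mscclass tables would follow the same pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE publications (
        pubid VARCHAR(32) PRIMARY KEY,
        title TEXT NOT NULL
    );
    CREATE TABLE subjclass (
        id         INTEGER PRIMARY KEY,  -- arbitrary numeric ID
        pubid      VARCHAR(32) NOT NULL REFERENCES publications(pubid),
        sclasstype TEXT NOT NULL
    );
    """
)

# Made-up article with two subject classes, for illustration only.
conn.execute(
    "INSERT INTO publications (pubid, title) VALUES (?, ?)",
    ("math.NT/0506123", "Placeholder title"),
)
conn.executemany(
    "INSERT INTO subjclass (pubid, sclasstype) VALUES (?, ?)",
    [
        ("math.NT/0506123", "Number Theory"),
        ("math.NT/0506123", "Algebraic Geometry"),
    ],
)


def subject_classes(conn: sqlite3.Connection, pubid: str) -> list[str]:
    """Return every sclasstype recorded for one article, alphabetically."""
    rows = conn.execute(
        "SELECT sclasstype FROM subjclass WHERE pubid = ? ORDER BY sclasstype",
        (pubid,),
    )
    return [row[0] for row in rows]
```

Each extra subject class is simply another row, so the publications table never needs a variable number of columns.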

 

acmclass

(ID (arbitrary numeric ID), pubid, aclasstype)

 

Here aclasstype is an ACM-class entry in the citation. Since an article can have any number of aclasstype values, it is not practical to make ACM-class an attribute of publications.

 

mscclass

(ID (arbitrary numeric ID), pubid, mclasstype)

 

Here mclasstype is an MSC-class entry in the citation. Since an article can have any number of mclasstype values, it is not practical to make MSC-class an attribute of publications.

 

Suggested Improvements

 

Separate the persons table into two tables: persons(perid, abbr, name) and authored(perid, pubid), where perid is a unique ID identifying a given person. This was the originally desired format for this data, but because of ArXiv’s method of storing data, it was impossible to automate data retrieval in this format. As a result, duplicate names appear in the persons table, which can cause confusion and has created issues with the paging function.
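A minimal sketch of the proposed two-table layout is given below, again using an in-memory SQLite database. The column types, the composite key on authored, the helper name publications_of, and the sample people are assumptions for illustration; in particular, the perid values are assigned by hand here, since automating that assignment is the unsolved part.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE persons (
        perid INTEGER PRIMARY KEY,       -- unique ID for one person
        abbr  TEXT NOT NULL,
        name  TEXT NOT NULL
    );
    CREATE TABLE authored (
        perid INTEGER NOT NULL REFERENCES persons(perid),
        pubid VARCHAR(32) NOT NULL,
        PRIMARY KEY (perid, pubid)       -- one row per person/article pair
    );
    """
)

# Made-up data: two different people who happen to share a name.
conn.executemany(
    "INSERT INTO persons (perid, abbr, name) VALUES (?, ?, ?)",
    [(1, "J. Smith", "Jane Smith"), (2, "J. Smith", "Jane Smith")],
)
conn.executemany(
    "INSERT INTO authored (perid, pubid) VALUES (?, ?)",
    [(1, "hep-th/0506123"), (1, "hep-th/0507001"), (2, "math.NT/0506123")],
)


def publications_of(conn: sqlite3.Connection, perid: int) -> list[str]:
    """Return the pubids authored by one person, identified by perid."""
    rows = conn.execute(
        "SELECT pubid FROM authored WHERE perid = ? ORDER BY pubid",
        (perid,),
    )
    return [row[0] for row in rows]
```

With this layout each person appears once in persons regardless of how many articles they wrote, and two people with the same name remain distinguishable by perid.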

 

To work around this, the persons search feature retrieves the top 50 distinct names and displays them on a single page. As long as the name entered is reasonably accurate, it should appear among those 50 results.
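One way such a search could be written against the current persons table is sketched below. The substring match with LIKE, the alphabetical ordering, the function name search_names, and the sample rows are assumptions; the actual matching and ranking rules of the search feature are not described here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE persons (
        id    INTEGER PRIMARY KEY,       -- arbitrary numeric ID
        pubid VARCHAR(32) NOT NULL,
        abbr  TEXT NOT NULL,
        name  TEXT NOT NULL
    )
    """
)

# Made-up rows; the same name repeats once per article, as in the current schema.
conn.executemany(
    "INSERT INTO persons (pubid, abbr, name) VALUES (?, ?, ?)",
    [
        ("hep-th/0506123", "J. Smith", "Jane Smith"),
        ("hep-th/0507001", "J. Smith", "Jane Smith"),
        ("math.NT/0506123", "A. Smithson", "Alan Smithson"),
        ("hep-lat/0506001", "B. Jones", "Bob Jones"),
    ],
)


def search_names(conn: sqlite3.Connection, term: str, limit: int = 50) -> list[str]:
    """Return up to `limit` distinct names containing the search term."""
    rows = conn.execute(
        "SELECT DISTINCT name FROM persons "
        "WHERE name LIKE ? ORDER BY name LIMIT ?",
        (f"%{term}%", limit),
    )
    return [row[0] for row in rows]
```

DISTINCT collapses the repeated rows for the same name, and LIMIT caps the result at a single page of 50 entries.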