cheese-shop/report/report.tex

\documentclass[a4paper,12pt]{article}
\usepackage[english,vietnamese]{babel}
\usepackage{amsmath}
\usepackage{booktabs}
\usepackage{lmodern}
\usepackage[utf8]{inputenc}
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage{lmodern}
\usepackage[nottoc,numbib]{tocbibind}
\newcommand{\id}[1]{\underline{#1\_id}}
\renewcommand{\thefootnote}{\fnsymbol{footnote}}

\begin{document}
\setcounter{page}{0}
\thispagestyle{empty}
\vspace*{\stretch{1}}
\begin{flushright}
  \setlength{\baselineskip}{1.4\baselineskip}
\textbf{\Huge Python Package\\Metadata Management}
  \noindent\rule{\textwidth}{5pt}
  \emph{\Large Basic Databases}
  \vspace{\stretch{1}}

  \textbf{by Nguyễn Gia Phong, Nguyễn Quốc Thông,\\
          Nguyễn Văn Tùng and Trần Minh Vương\\}
  \selectlanguage{english}
  \today
\end{flushright}
\vspace*{\stretch{2}}
\pagebreak

\selectlanguage{english}
\tableofcontents
\pagebreak

\section{Introduction}
\subsection{Brief Description}
In traditional Unix-like operating systems like GNU/Linux distributions
and BSD-based OSes, package managers tries to synchronize the packages metadata
(such as available versions and dependencies) with that of central repositories.
While this proves to be reliable and efficient, language-specific
package managers do not usually have such synchronized databases,
since they focus on development libraries which have more flexible contraints.

Within the Python packaging ecosystem, this is the case, where package managers
like \verb|pip| needs to fetch metadata of each package to be installed
to find out dependencies and other information.  This turns out to have heavy
performance penalty on the dependency resolution process alone, which is
already a NP-hard problem.  This project explores ways to store these metadata
in an efficient in a database, to be used in practice either locally or in a
local/regional network, to avoid Python package managers from having to
fetch (and potentially build) entire packages just to find out if it conflicts
with others.

\selectlanguage{vietnamese}
\subsection{Authors and Credits}
The work has been undertaken by group number 8, whose members are listed
in the following table.
\begin{center}
  \begin{tabular}{c c}
    \toprule
    Full name & Student ID\\
    \midrule
    Nguyễn Gia Phong & BI9-184\\
    Nguyễn Quốc Thông & BI9-214\\
    Nguyễn Văn Tùng & BI9-229\\
    Trần Minh Vương & BI9-239\\
    \bottomrule
  \end{tabular}
\end{center}

This report is licensed under a CC BY-SA 4.0 license, while the source code is
available on GitHub\footnote{\url{https://github.com/McSinyx/cheese-shop}}
under AGPLv3+.

We would like to express our special thanks to Dr. Nguyễn Hoàng Hà,
whose lectures gave us basic understanding on the key principles of
relational databases.  In addition, we also recieved a lot of help from
the Python packaging community over \#pypa on Freenode on understanding
the structure of the metadata as well as finding a way to fetch these
data from package indices.

\selectlanguage{english}
\section{User Requirements}\label{req}
This project aims to provide a database for metadata queries and Python packages
exploration.  We try to replicate the PyPI's XML-RPC API~\cite{xmlrpc},
which supports queris similar to the following:
\begin{itemize}
  \item \verb|list_projects()|: Retrieve a list of registered project names.
  \item \verb|project_releases(project)|: Retrieve a list of releases for
    the given \verb|project|, ordered by version.
  \item \verb|project_release_latest()|: Retrieve the latest release
    of the given \verb|project|.
  \item \verb|belong_to(name)|: Retrieve a list of projects whose author
    is \verb|name|.
  \item \verb|browse(classifier)|: Retrieve a list of (\verb|project|,
    \verb|version|) of all releases classified by the given \verb|classifier|.
  \item \verb|release_data(project, version)|: Retrieve the following metadata
    matching the given release: project, version, homepage, author,
    author's email, summary, keywords, classifiers and dependencies.
  \item \verb|search_name(pattern)|: Retrieve a list of (\verb|project|,
    \verb|version|, \verb|summary|) where the project name matches the pattern.
  \item \verb|search_summary(pattern)|: Retrieve a list of (\verb|project|,
    \verb|version|, \verb|summary|) where the summary matches the pattern.
\end{itemize}

\section{Data Definition}
\subsection{Entity Relationship Diagram}
The entity relationship diagram represents the relationship between each of
its entity set of data extracted from projects:
\begin{itemize}
  \item Author(Releases-Contact: Many-One): Within each release, there could be
    one author, due to data extraction method doesn't support multi-author.
    Yet an author could have multiple releases under per name.
  \item Require(Releases-Dependencies: Many-Many): Every release would require
    a number of dependencies, and many dependencies can each be used by
    multiple releases.
  \item Classify(Releases-Trove: Many-Many): This relationship indicates the
    relationship between trove classifier and each releases, with many release
    could be classified under one trove classifier, and a release could be
    classified by many classifiers.
  \item Contain(Releases-Keyword: Many-Many): A release has many keywords,
    and also a keyword can also be in many different releases.
  \item Release(Releases-Distribution: One-Many): Within each releases,
    a number of distribution(s) would be released.  A distribution could
    relate to only one releases, but many distributions could be released
    in the same releases.
\end{itemize}
\includegraphics[width=\textwidth]{erd.png}

\newpage
\subsection{Database Schema}
Based on the entity relationship diagram, we worked out a schema complying
with the third normal form~\cite{3nf}.  Concrete definitions can be found in
\verb|sql/def.sql|.
\begin{center}
  \includegraphics[width=\textwidth]{schema.png}
\end{center}

\paragraph{contacts(\underline{email}, name)} Contact information of an author,
including per email as the primary key and per name.

\paragraph{releases(\underline{id}, project, version, summary, homepage, email)}
This relation represents each release of a project, including its name, version,
summary, homepage and the email of its author.  The ID of each release is
the primary key to represent each one of them.  This release ID is also
the foreign key of many primary key in other entity set.

\paragraph{troves(\underline{id}, classifier)} Valid trove classifiers,
identified by their ID.

\paragraph{classifiers(\id{release}, \id{trove})}
Release ID and corresponding trove classifiers ID the release is classified by.

\paragraph{keywords(\id{release}, \underline{term})} Keywords of a specific
release.  Both the ID of the release and the keyword are set as primary key.

\paragraph{dependencies(\id{release}, \underline{dependency})} This relation
represents the dependency list of each release, which is a pattern can be
matched by a release of another project.

\paragraph{distributions(\id{release}, \underline{filename}, size, url,
dist\_type, python\_version, requires\_python, sha256, md5)}
Each distribution (i.e. the file that the package manager can use to install)
and the corresponding url, checksums and other auxiliary information.

With the database defined, we then fetch the metadata of around 100 most
popular projects on PyPI.  The code used to do this is available in
\verb|tools/make-cheeses.py|.  More instructions on the setup can be found
in the project's README.

\section{Data Query}
In addition to primary keys and unique attributes which are
implicitly~\cite{idx}, to further optimize, we index \verb|contacts.name|
and \verb|releases.summary|.  Then, the tasks defined in section \ref{req}
can be carried out as follows.

\subsection{Project Listing}
To retrieve a list of project names, we can use the following SQL query
\begin{verbatim}
SELECT DISTINCT project
FROM releases;
\end{verbatim}

\subsection{Project Release Listing}
To retrieve a list of releases of project \verb|spam|, ordered by version:
\begin{verbatim}
SELECT version
FROM releases
WHERE project = 'spam'
ORDER BY version;
\end{verbatim}

\subsection{Project Latest Release}
To retrieve the latest version of the project \verb|spam|:
\begin{verbatim}
SELECT version
FROM releases
WHERE project = 'spam'
ORDER BY version DESC
LIMIT 1;
\end{verbatim}

\subsection{Project Listing by Author}
For convenience purposes, we create a view of \verb|authorship|:
\begin{verbatim}
CREATE VIEW authorships
AS SELECT name as author, project
FROM contacts NATURAL JOIN releases
GROUP BY author, project;
\end{verbatim}
A list of projects authored by Monty Python is then as easy to obtain as
\begin{verbatim}
SELECT project
FROM authorships
WHERE author='Monty Python';
\end{verbatim}

\subsection{Browse by Classifier}
Since the query is non-trivial, we define a procedure for convenient use:
\begin{verbatim}
DELIMITER //
CREATE PROCEDURE browse(class varchar(255))
BEGIN
  SELECT project, version
  FROM releases, classifiers
  WHERE id = release_id AND trove_id = (
    SELECT id
    FROM troves
    WHERE classifier = class);
END//
DELIMITER ;
\end{verbatim}
For instance, listing releases classified as
\verb|Programming Language :: C| will be as simple as
\verb|CALL browse('Programming Language :: C');|

\subsection{Release Metadata}
We also define a stored procedure for this task, which seems to be even
more complex.  The procedure can then be used in similar manner,
e.g. \verb|CALL release_data('udata', '1.1.1');|
\begin{verbatim}
DELIMITER //
CREATE PROCEDURE release_data(project varchar(32),
                              version varchar(32))
BEGIN
  DECLARE i smallint;
  SET i = (
    SELECT id
    FROM releases
    WHERE releases.project = project AND releases.version = version);
  SELECT project, version, homepage, name as author, email, summary
  FROM releases NATURAL JOIN contacts
  WHERE id = i;

  SELECT term as keyword
  FROM keywords
  WHERE release_id = i;

  SELECT classifier
  FROM classifiers, troves
  WHERE release_id = i AND trove_id = troves.id;

  SELECT dependency
  FROM dependencies
  WHERE release_id = i;
END//
DELIMITER ;
\end{verbatim}

\subsection{Project Search by Name}
To retrieve project by name matching the SQL pattern \verb|py%|
\begin{verbatim}
SELECT project, version, summary
FROM releases
WHERE project LIKE 'py%';
\end{verbatim}

\subsection{Project Search by Summary}
To retrieve project by summary matching the SQL pattern \verb|%num%|
\begin{verbatim}
SELECT project, version, summary
FROM releases
WHERE summary LIKE '%num%';
\end{verbatim}

\section{Conclusion}
Working on this project has been a great opportunity for us to and apply
our knowledge in relational databases in real-life tasks.
During the design and implementation process, we gained a better concept
about relational databases, as well as how to build from ground up and use
a database in SQL (MySQL in particular), along with abstractions
and optimizations.

We also had a chance to take a closer look at the Python Package Index
and at how their packages' metadata are used and distibuted.  Through
this project, we think we have reached closer to the final goal of
metadata mirroring and synchronization.

\begin{thebibliography}{69}
  \bibitem{xmlrpc} The Python Packaging Authority.
    \href{https://warehouse.readthedocs.io/api-reference/xml-rpc}
         {\emph{PyPI’s XML-RPC methods}}.
    Warehouse documentation.
  \bibitem{3nf} Edgar~F.~Codd.
    \emph{Further Normalization of the Data Base Relational Model}.
    IBM Research Report RJ909, August 31, 1971.
  \bibitem{idx} MySQL 8.0 Reference Manual.
    \emph{Oracle}.  Section 8.3.1:
    \href{https://dev.mysql.com/doc/refman/8.0/en/mysql-indexes.html}
         {\emph{How MySQL Uses Indexes}}.
\end{thebibliography}
\end{document}