315 lines
12 KiB
TeX
315 lines
12 KiB
TeX
\documentclass[a4paper,12pt]{article}
|
||
\usepackage[english,vietnamese]{babel}
|
||
\usepackage{amsmath}
|
||
\usepackage{booktabs}
|
||
\usepackage{lmodern}
|
||
\usepackage[utf8]{inputenc}
|
||
\usepackage{graphicx}
|
||
\usepackage{hyperref}
|
||
\usepackage{lmodern}
|
||
\usepackage[nottoc,numbib]{tocbibind}
|
||
\newcommand{\id}[1]{\underline{#1\_id}}
|
||
\renewcommand{\thefootnote}{\fnsymbol{footnote}}
|
||
|
||
\begin{document}
|
||
\setcounter{page}{0}
|
||
\thispagestyle{empty}
|
||
\vspace*{\stretch{1}}
|
||
\begin{flushright}
|
||
\setlength{\baselineskip}{1.4\baselineskip}
|
||
\textbf{\Huge Python Package\\Metadata Management}
|
||
\noindent\rule{\textwidth}{5pt}
|
||
\emph{\Large Basic Databases}
|
||
\vspace{\stretch{1}}
|
||
|
||
\textbf{by Nguyễn Gia Phong, Nguyễn Quốc Thông,\\
|
||
Nguyễn Văn Tùng and Trần Minh Vương\\}
|
||
\selectlanguage{english}
|
||
\today
|
||
\end{flushright}
|
||
\vspace*{\stretch{2}}
|
||
\pagebreak
|
||
|
||
\selectlanguage{english}
|
||
\tableofcontents
|
||
\pagebreak
|
||
|
||
\section{Introduction}
|
||
\subsection{Brief Description}
|
||
In traditional Unix-like operating systems like GNU/Linux distributions
|
||
and BSD-based OSes, package managers tries to synchronize the packages metadata
|
||
(such as available versions and dependencies) with that of central repositories.
|
||
While this proves to be reliable and efficient, language-specific
|
||
package managers do not usually have such synchronized databases,
|
||
since they focus on development libraries which have more flexible contraints.
|
||
|
||
Within the Python packaging ecosystem, this is the case, where package managers
|
||
like \verb|pip| needs to fetch metadata of each package to be installed
|
||
to find out dependencies and other information. This turns out to have heavy
|
||
performance penalty on the dependency resolution process alone, which is
|
||
already a NP-hard problem. This project explores ways to store these metadata
|
||
in an efficient in a database, to be used in practice either locally or in a
|
||
local/regional network, to avoid Python package managers from having to
|
||
fetch (and potentially build) entire packages just to find out if it conflicts
|
||
with others.
|
||
|
||
\selectlanguage{vietnamese}
|
||
\subsection{Authors and Credits}
|
||
The work has been undertaken by group number 8, whose members are listed
|
||
in the following table.
|
||
\begin{center}
|
||
\begin{tabular}{c c}
|
||
\toprule
|
||
Full name & Student ID\\
|
||
\midrule
|
||
Nguyễn Gia Phong & BI9-184\\
|
||
Nguyễn Quốc Thông & BI9-214\\
|
||
Nguyễn Văn Tùng & BI9-229\\
|
||
Trần Minh Vương & BI9-239\\
|
||
\bottomrule
|
||
\end{tabular}
|
||
\end{center}
|
||
|
||
This report is licensed under a CC BY-SA 4.0 license, while the source code is
|
||
available on GitHub\footnote{\url{https://github.com/McSinyx/cheese-shop}}
|
||
under AGPLv3+.
|
||
|
||
We would like to express our special thanks to Dr. Nguyễn Hoàng Hà,
|
||
whose lectures gave us basic understanding on the key principles of
|
||
relational databases. In addition, we also recieved a lot of help from
|
||
the Python packaging community over \#pypa on Freenode on understanding
|
||
the structure of the metadata as well as finding a way to fetch these
|
||
data from package indices.
|
||
|
||
\selectlanguage{english}
|
||
\section{User Requirements}\label{req}
|
||
This project aims to provide a database for metadata queries and Python packages
|
||
exploration. We try to replicate the PyPI's XML-RPC API~\cite{xmlrpc},
|
||
which supports queris similar to the following:
|
||
\begin{itemize}
|
||
\item \verb|list_projects()|: Retrieve a list of registered project names.
|
||
\item \verb|project_releases(project)|: Retrieve a list of releases for
|
||
the given \verb|project|, ordered by version.
|
||
\item \verb|project_release_latest()|: Retrieve the latest release
|
||
of the given \verb|project|.
|
||
\item \verb|belong_to(name)|: Retrieve a list of projects whose author
|
||
is \verb|name|.
|
||
\item \verb|browse(classifier)|: Retrieve a list of (\verb|project|,
|
||
\verb|version|) of all releases classified by the given \verb|classifier|.
|
||
\item \verb|release_data(project, version)|: Retrieve the following metadata
|
||
matching the given release: project, version, homepage, author,
|
||
author's email, summary, keywords, classifiers and dependencies.
|
||
\item \verb|search_name(pattern)|: Retrieve a list of (\verb|project|,
|
||
\verb|version|, \verb|summary|) where the project name matches the pattern.
|
||
\item \verb|search_summary(pattern)|: Retrieve a list of (\verb|project|,
|
||
\verb|version|, \verb|summary|) where the summary matches the pattern.
|
||
\end{itemize}
|
||
|
||
\section{Data Definition}
|
||
\subsection{Entity Relationship Diagram}
|
||
The entity relationship diagram represents the relationship between each of
|
||
its entity set of data extracted from projects:
|
||
\begin{itemize}
|
||
\item Author(Releases-Contact: Many-One): Within each release, there could be
|
||
one author, due to data extraction method doesn't support multi-author.
|
||
Yet an author could have multiple releases under per name.
|
||
\item Require(Releases-Dependencies: Many-Many): Every release would require
|
||
a number of dependencies, and many dependencies can each be used by
|
||
multiple releases.
|
||
\item Classify(Releases-Trove: Many-Many): This relationship indicates the
|
||
relationship between trove classifier and each releases, with many release
|
||
could be classified under one trove classifier, and a release could be
|
||
classified by many classifiers.
|
||
\item Contain(Releases-Keyword: Many-Many): A release has many keywords,
|
||
and also a keyword can also be in many different releases.
|
||
\item Release(Releases-Distribution: One-Many): Within each releases,
|
||
a number of distribution(s) would be released. A distribution could
|
||
relate to only one releases, but many distributions could be released
|
||
in the same releases.
|
||
\end{itemize}
|
||
\includegraphics[width=\textwidth]{erd.png}
|
||
|
||
\newpage
|
||
\subsection{Database Schema}
|
||
Based on the entity relationship diagram, we worked out a schema complying
|
||
with the third normal form~\cite{3nf}. Concrete definitions can be found in
|
||
\verb|sql/def.sql|.
|
||
\begin{center}
|
||
\includegraphics[width=\textwidth]{schema.png}
|
||
\end{center}
|
||
|
||
\paragraph{contacts(\underline{email}, name)} Contact information of an author,
|
||
including per email as the primary key and per name.
|
||
|
||
\paragraph{releases(\underline{id}, project, version, summary, homepage, email)}
|
||
This relation represents each release of a project, including its name, version,
|
||
summary, homepage and the email of its author. The ID of each release is
|
||
the primary key to represent each one of them. This release ID is also
|
||
the foreign key of many primary key in other entity set.
|
||
|
||
\paragraph{troves(\underline{id}, classifier)} Valid trove classifiers,
|
||
identified by their ID.
|
||
|
||
\paragraph{classifiers(\id{release}, \id{trove})}
|
||
Release ID and corresponding trove classifiers ID the release is classified by.
|
||
|
||
\paragraph{keywords(\id{release}, \underline{term})} Keywords of a specific
|
||
release. Both the ID of the release and the keyword are set as primary key.
|
||
|
||
\paragraph{dependencies(\id{release}, \underline{dependency})} This relation
|
||
represents the dependency list of each release, which is a pattern can be
|
||
matched by a release of another project.
|
||
|
||
\paragraph{distributions(\id{release}, \underline{filename}, size, url,
|
||
dist\_type, python\_version, requires\_python, sha256, md5)}
|
||
Each distribution (i.e. the file that the package manager can use to install)
|
||
and the corresponding url, checksums and other auxiliary information.
|
||
|
||
With the database defined, we then fetch the metadata of around 100 most
|
||
popular projects on PyPI. The code used to do this is available in
|
||
\verb|tools/make-cheeses.py|. More instructions on the setup can be found
|
||
in the project's README.
|
||
|
||
\section{Data Query}
|
||
In addition to primary keys and unique attributes which are
|
||
implicitly~\cite{idx}, to further optimize, we index \verb|contacts.name|
|
||
and \verb|releases.summary|. Then, the tasks defined in section \ref{req}
|
||
can be carried out as follows.
|
||
|
||
\subsection{Project Listing}
|
||
To retrieve a list of project names, we can use the following SQL query
|
||
\begin{verbatim}
|
||
SELECT DISTINCT project
|
||
FROM releases;
|
||
\end{verbatim}
|
||
|
||
\subsection{Project Release Listing}
|
||
To retrieve a list of releases of project \verb|spam|, ordered by version:
|
||
\begin{verbatim}
|
||
SELECT version
|
||
FROM releases
|
||
WHERE project = 'spam'
|
||
ORDER BY version;
|
||
\end{verbatim}
|
||
|
||
\subsection{Project Latest Release}
|
||
To retrieve the latest version of the project \verb|spam|:
|
||
\begin{verbatim}
|
||
SELECT version
|
||
FROM releases
|
||
WHERE project = 'spam'
|
||
ORDER BY version DESC
|
||
LIMIT 1;
|
||
\end{verbatim}
|
||
|
||
\subsection{Project Listing by Author}
|
||
For convenience purposes, we create a view of \verb|authorship|:
|
||
\begin{verbatim}
|
||
CREATE VIEW authorships
|
||
AS SELECT name as author, project
|
||
FROM contacts NATURAL JOIN releases
|
||
GROUP BY author, project;
|
||
\end{verbatim}
|
||
A list of projects authored by Monty Python is then as easy to obtain as
|
||
\begin{verbatim}
|
||
SELECT project
|
||
FROM authorships
|
||
WHERE author='Monty Python';
|
||
\end{verbatim}
|
||
|
||
\subsection{Browse by Classifier}
|
||
Since the query is non-trivial, we define a procedure for convenient use:
|
||
\begin{verbatim}
|
||
DELIMITER //
|
||
CREATE PROCEDURE browse(class varchar(255))
|
||
BEGIN
|
||
SELECT project, version
|
||
FROM releases, classifiers
|
||
WHERE id = release_id AND trove_id = (
|
||
SELECT id
|
||
FROM troves
|
||
WHERE classifier = class);
|
||
END//
|
||
DELIMITER ;
|
||
\end{verbatim}
|
||
For instance, listing releases classified as
|
||
\verb|Programming Language :: C| will be as simple as
|
||
\verb|CALL browse('Programming Language :: C');|
|
||
|
||
\subsection{Release Metadata}
|
||
We also define a stored procedure for this task, which seems to be even
|
||
more complex. The procedure can then be used in similar manner,
|
||
e.g. \verb|CALL release_data('udata', '1.1.1');|
|
||
\begin{verbatim}
|
||
DELIMITER //
|
||
CREATE PROCEDURE release_data(project varchar(32),
|
||
version varchar(32))
|
||
BEGIN
|
||
DECLARE i smallint;
|
||
SET i = (
|
||
SELECT id
|
||
FROM releases
|
||
WHERE releases.project = project AND releases.version = version);
|
||
SELECT project, version, homepage, name as author, email, summary
|
||
FROM releases NATURAL JOIN contacts
|
||
WHERE id = i;
|
||
|
||
SELECT term as keyword
|
||
FROM keywords
|
||
WHERE release_id = i;
|
||
|
||
SELECT classifier
|
||
FROM classifiers, troves
|
||
WHERE release_id = i AND trove_id = troves.id;
|
||
|
||
SELECT dependency
|
||
FROM dependencies
|
||
WHERE release_id = i;
|
||
END//
|
||
DELIMITER ;
|
||
\end{verbatim}
|
||
|
||
\subsection{Project Search by Name}
|
||
To retrieve project by name matching the SQL pattern \verb|py%|
|
||
\begin{verbatim}
|
||
SELECT project, version, summary
|
||
FROM releases
|
||
WHERE project LIKE 'py%';
|
||
\end{verbatim}
|
||
|
||
\subsection{Project Search by Summary}
|
||
To retrieve project by summary matching the SQL pattern \verb|%num%|
|
||
\begin{verbatim}
|
||
SELECT project, version, summary
|
||
FROM releases
|
||
WHERE summary LIKE '%num%';
|
||
\end{verbatim}
|
||
|
||
\section{Conclusion}
|
||
Working on this project has been a great opportunity for us to and apply
|
||
our knowledge in relational databases in real-life tasks.
|
||
During the design and implementation process, we gained a better concept
|
||
about relational databases, as well as how to build from ground up and use
|
||
a database in SQL (MySQL in particular), along with abstractions
|
||
and optimizations.
|
||
|
||
We also had a chance to take a closer look at the Python Package Index
|
||
and at how their packages' metadata are used and distibuted. Through
|
||
this project, we think we have reached closer to the final goal of
|
||
metadata mirroring and synchronization.
|
||
|
||
\begin{thebibliography}{69}
|
||
\bibitem{xmlrpc} The Python Packaging Authority.
|
||
\href{https://warehouse.readthedocs.io/api-reference/xml-rpc}
|
||
{\emph{PyPI’s XML-RPC methods}}.
|
||
Warehouse documentation.
|
||
\bibitem{3nf} Edgar~F.~Codd.
|
||
\emph{Further Normalization of the Data Base Relational Model}.
|
||
IBM Research Report RJ909, August 31, 1971.
|
||
\bibitem{idx} MySQL 8.0 Reference Manual.
|
||
\emph{Oracle}. Section 8.3.1:
|
||
\href{https://dev.mysql.com/doc/refman/8.0/en/mysql-indexes.html}
|
||
{\emph{How MySQL Uses Indexes}}.
|
||
\end{thebibliography}
|
||
\end{document}
|