\documentclass[a4paper,12pt]{article}
\usepackage[english,vietnamese]{babel}
\usepackage{amsmath}
\usepackage{booktabs}
\usepackage{lmodern}
\usepackage[utf8]{inputenc}
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage[nottoc,numbib]{tocbibind}
\newcommand{\id}[1]{\underline{#1\_id}}
\renewcommand{\thefootnote}{\fnsymbol{footnote}}

\begin{document}
\setcounter{page}{0}
\thispagestyle{empty}
\vspace*{\stretch{1}}
\begin{flushright}
\setlength{\baselineskip}{1.4\baselineskip}
\textbf{\Huge Python Package\\Metadata Management}
\noindent\rule{\textwidth}{5pt}
\emph{\Large Basic Databases}
\vspace{\stretch{1}}

\textbf{by Nguyễn Gia Phong, Nguyễn Quốc Thông,\\
Nguyễn Văn Tùng and Trần Minh Vương\\}
\selectlanguage{english}
\today
\end{flushright}
\vspace*{\stretch{2}}
\pagebreak

\selectlanguage{english}
\tableofcontents
\pagebreak

\section{Introduction}
\subsection{Brief Description}
In traditional Unix-like operating systems such as GNU/Linux distributions
and BSD-based OSes, package managers try to synchronize the package metadata
(such as available versions and dependencies) with that of central repositories.
While this proves to be reliable and efficient, language-specific
package managers do not usually have such synchronized databases,
since they focus on development libraries, which have more flexible constraints.

This is the case within the Python packaging ecosystem, where package managers
like \verb|pip| need to fetch the metadata of each package to be installed
to find out its dependencies and other information. This turns out to impose
a heavy performance penalty on the dependency resolution process, which is
already an NP-hard problem. This project explores ways to store these metadata
efficiently in a database, to be used in practice either locally or in a
local/regional network, to save Python package managers from having to
fetch (and potentially build) entire packages just to find out whether they
conflict with others.

\selectlanguage{vietnamese}
\subsection{Authors and Credits}
The work has been undertaken by group number 8, whose members are listed
in the following table.
\begin{center}
\begin{tabular}{c c}
\toprule
Full name & Student ID\\
\midrule
Nguyễn Gia Phong & BI9-184\\
Nguyễn Quốc Thông & BI9-214\\
Nguyễn Văn Tùng & BI9-229\\
Trần Minh Vương & BI9-239\\
\bottomrule
\end{tabular}
\end{center}

This report is licensed under a CC BY-SA 4.0 license, while the source code is
available on GitHub\footnote{\url{https://github.com/McSinyx/cheese-shop}}
under AGPLv3+.

We would like to express our special thanks to Dr. Nguyễn Hoàng Hà,
whose lectures gave us a basic understanding of the key principles of
relational databases. In addition, we received a lot of help from
the Python packaging community in \#pypa on Freenode in understanding
the structure of the metadata as well as in finding a way to fetch these
data from package indices.

\selectlanguage{english}
\section{User Requirements}
This project aims to provide a database for metadata queries and Python package
exploration. We try to replicate PyPI's XML-RPC API~\cite{xmlrpc},
which supports queries similar to the following:
\begin{itemize}
\item \verb|list_projects()|: Retrieve a list of registered project names.
\item \verb|project_releases(project)|: Retrieve a list of releases for
the given \verb|project|, ordered by version.
\item \verb|project_release_latest(project)|: Retrieve the latest release
of the given \verb|project|.
\item \verb|belong_to(name)|: Retrieve a list of projects whose author
is \verb|name|.
\item \verb|browse(classifier)|: Retrieve a list of (\verb|project|,
\verb|version|) of all releases classified with the given classifier.
\item \verb|release_data(project, version)|: Retrieve the following metadata
matching the given release: project, version, homepage, author,
author's email, summary, license, keywords, classifiers and dependencies.
\item \verb|search_name(pattern)|: Retrieve a list of (\verb|project|,
\verb|version|, \verb|summary|) where the project name matches the pattern.
\item \verb|search_summary(pattern)|: Retrieve a list of (\verb|project|,
\verb|version|, \verb|summary|) where the summary matches the pattern.
\end{itemize}

\section{Data Definition}
\subsection{Entity Relationship Diagram}
The entity relationship diagram shows the relationships between the entity
sets of data extracted from projects:
\begin{itemize}
\item Author (Releases-Contact: Many-One): Each release has a single author,
because the data extraction method does not support multiple authors.
An author, however, can have multiple releases under their name.
\item Require (Releases-Dependencies: Many-Many): Each release may require
a number of dependencies, and each dependency may be required by
multiple releases.
\item Classify (Releases-Trove: Many-Many): This relationship links
trove classifiers to releases. Many releases can be classified under
one trove classifier, and a release can be classified by many classifiers.
\item Contain (Releases-Keyword: Many-Many): A release can have many keywords,
and a keyword can appear in many different releases.
\item Release (Releases-Distribution: One-Many): Each release can publish
a number of distributions. A distribution belongs to exactly one release,
but many distributions can be published in the same release.
\end{itemize}
\includegraphics[width=\textwidth]{erd.jpg}

\subsection{Database Schema}
Based on the entity relationship diagram, we worked out a schema complying
with the third normal form~\cite{3nf}.
\begin{center}
\includegraphics[width=\textwidth]{schema.png}
\end{center}

\paragraph{contacts(\underline{email}, name)} Contact information of an author,
including their email as the primary key and their name.

\paragraph{releases(\underline{id}, project, version, summary, homepage, email)}
This relation represents each release of a project, including the project name,
version, summary, homepage and the email of its author. The ID of each release
is the primary key. It is also referenced as a foreign key by the other
relations that describe a release.

\paragraph{troves(\underline{id}, classifier)} Valid trove classifiers,
identified by their ID.

\paragraph{classifiers(\id{release}, \id{trove})}
Each row pairs a release ID with the ID of a trove classifier
that the release is classified by.

\paragraph{keywords(\id{release}, \underline{term})} Keywords of a specific
release. The release ID and the keyword together form the primary key.

\paragraph{dependencies(\id{release}, \underline{dependency})} This relation
represents the dependency list of each release. Each dependency is a pattern
that can be matched by a release of another project.

\paragraph{distributions(\id{release}, \underline{filename}, size, url,
dist\_type, python\_version, requires\_python, sha256, md5)}
Each distribution (i.e. the file that the package manager can use to install)
and the corresponding URL, checksums and other auxiliary information.

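
To make the keys described above concrete, the first five relations could be
declared roughly as follows. This is only a sketch: the relation and column
names come from the schema, but every column type, length and \verb|NOT NULL|
constraint is an assumption made for illustration. The remaining two relations
would follow the same pattern as \verb|keywords|.
\begin{verbatim}
CREATE TABLE contacts (
    email VARCHAR(255) PRIMARY KEY,
    name VARCHAR(255));

CREATE TABLE releases (
    id INTEGER PRIMARY KEY,
    project VARCHAR(255) NOT NULL,
    version VARCHAR(64) NOT NULL,
    summary TEXT,
    homepage TEXT,
    email VARCHAR(255),
    FOREIGN KEY (email) REFERENCES contacts (email));

CREATE TABLE troves (
    id INTEGER PRIMARY KEY,
    classifier VARCHAR(255) NOT NULL);

CREATE TABLE classifiers (
    release_id INTEGER,
    trove_id INTEGER,
    PRIMARY KEY (release_id, trove_id),
    FOREIGN KEY (release_id) REFERENCES releases (id),
    FOREIGN KEY (trove_id) REFERENCES troves (id));

CREATE TABLE keywords (
    release_id INTEGER,
    term VARCHAR(255),
    PRIMARY KEY (release_id, term),
    FOREIGN KEY (release_id) REFERENCES releases (id));
\end{verbatim}
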
\section{Data Query}
\subsection{Project Listing}
Retrieve a list of registered project names.
\begin{verbatim}
SELECT DISTINCT project FROM releases
\end{verbatim}

\subsection{Project Releases}
Retrieve a list of releases for the given project name, ordered by version.
\begin{verbatim}
SELECT * FROM releases
WHERE project = 'numpy'
ORDER BY version
\end{verbatim}

\subsection{Project Latest Release}
Retrieve the latest release of the given project.
\begin{verbatim}
SELECT *
FROM releases
WHERE project = 'numpy'
ORDER BY version DESC
LIMIT 1
\end{verbatim}
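
How \verb|ORDER BY version| behaves depends on the column type, which the
schema above does not specify. Assuming versions are stored as plain strings,
they are compared character by character rather than component by component.
The following comparison, on two version numbers made up for illustration,
is a minimal sketch of the effect:
\begin{verbatim}
SELECT '1.10.0' < '1.9.0'
\end{verbatim}
Under ordinary string comparison the condition holds, so release 1.10.0 would
be sorted before 1.9.0 even though it is the newer one.
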

\subsection{User's Project}
Retrieve a list of projects whose author has the given name.
\begin{verbatim}
SELECT DISTINCT project
FROM releases
INNER JOIN contacts
ON releases.email = contacts.email
WHERE contacts.name = 'Travis E. Oliphant et al.'
\end{verbatim}

\subsection{Classifiers}
Retrieve a list of (project, version) of all releases classified with
the given trove classifier.
\begin{verbatim}
SELECT releases.project, releases.version
FROM releases
INNER JOIN classifiers ON releases.id = classifiers.release_id
INNER JOIN troves ON classifiers.trove_id = troves.id
WHERE troves.classifier = 'Python'
\end{verbatim}
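
When several classifiers are given and a release must carry all of them,
one possible approach is to group the matching rows by release and count
the distinct classifiers found. The query below is a sketch under
the assumption that the join table is named \verb|classifiers| as in
the schema. The two classifier strings are merely examples, and the number
in the \verb|HAVING| clause must equal the number of classifiers listed.
\begin{verbatim}
SELECT releases.project, releases.version
FROM releases
INNER JOIN classifiers ON releases.id = classifiers.release_id
INNER JOIN troves ON classifiers.trove_id = troves.id
WHERE troves.classifier IN (
    'Programming Language :: Python :: 3',
    'License :: OSI Approved :: MIT License')
GROUP BY releases.id, releases.project, releases.version
HAVING COUNT(DISTINCT troves.classifier) = 2
\end{verbatim}
A release that has only one of the two classifiers yields a count of 1
and is therefore filtered out.
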

\subsection{Release Data}
Retrieve metadata describing a specific release.
\begin{verbatim}
SELECT rls.project, rls.version, rls.homepage, contacts.name,
       rls.email, rls.summary, keywords.term,
       troves.classifier,
       dependencies.dependency
FROM releases AS rls
INNER JOIN contacts ON rls.email = contacts.email
LEFT JOIN (classifiers
           INNER JOIN troves
           ON classifiers.trove_id = troves.id)
     ON rls.id = classifiers.release_id
LEFT JOIN keywords ON rls.id = keywords.release_id
LEFT JOIN dependencies ON rls.id = dependencies.release_id
WHERE rls.id = 1
\end{verbatim}
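
Since keywords, classifiers and dependencies are independent lists, joining
all three in a single statement returns one row for every combination of them.
As an alternative sketch, not necessarily what the project itself does,
the scalar fields and each list can be fetched by separate queries.
The release ID 1 is a placeholder and the table names are taken from
the schema.
\begin{verbatim}
SELECT rls.project, rls.version, rls.homepage, contacts.name,
       rls.email, rls.summary
FROM releases AS rls
LEFT JOIN contacts ON rls.email = contacts.email
WHERE rls.id = 1;

SELECT term FROM keywords WHERE release_id = 1;

SELECT troves.classifier
FROM classifiers
INNER JOIN troves ON classifiers.trove_id = troves.id
WHERE classifiers.release_id = 1;

SELECT dependency FROM dependencies WHERE release_id = 1;
\end{verbatim}
Each result set then has exactly one row per keyword, classifier or dependency,
which is easier to assemble into a single record on the application side.
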

\subsection{Search Project by Name}
Retrieve the projects whose name matches the given SQL pattern.
\begin{verbatim}
SELECT project, version, summary
FROM releases
WHERE project LIKE 'py%'
\end{verbatim}

\subsection{Search Project by Summary}
Retrieve the projects whose summary matches the given SQL pattern.
\begin{verbatim}
SELECT project, version, summary
FROM releases
WHERE summary LIKE '%num%'
\end{verbatim}

\section{Conclusion}

\begin{thebibliography}{69}
\bibitem{xmlrpc} The Python Packaging Authority.
\href{https://warehouse.readthedocs.io/api-reference/xml-rpc}
{\emph{PyPI’s XML-RPC methods}}.
Warehouse documentation.
\bibitem{3nf} Edgar~F.~Codd.
\emph{Further Normalization of the Data Base Relational Model}.
IBM Research Report RJ909, August 31, 1971.
\end{thebibliography}
\end{document}