\documentclass[a4paper,12pt]{article}
\usepackage[english,vietnamese]{babel}
\usepackage{amsmath}
\usepackage{booktabs}
\usepackage{lmodern}
\usepackage[utf8]{inputenc}
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage[nottoc,numbib]{tocbibind}
\newcommand{\id}[1]{\underline{#1\_id}}
\renewcommand{\thefootnote}{\fnsymbol{footnote}}

\begin{document}
\setcounter{page}{0}
\thispagestyle{empty}
\vspace*{\stretch{1}}
\begin{flushright}
\setlength{\baselineskip}{1.4\baselineskip}
\textbf{\Huge Python Package\\Metadata Management}
\noindent\rule{\textwidth}{5pt}
\emph{\Large Basic Databases}
\vspace{\stretch{1}}

\textbf{by Nguyễn Gia Phong, Nguyễn Quốc Thông,\\
Nguyễn Văn Tùng and Trần Minh Vương\\}
\selectlanguage{english}
\today
\end{flushright}
\vspace*{\stretch{2}}
\pagebreak

\selectlanguage{english}
\tableofcontents
\pagebreak

\section{Introduction}
\subsection{Brief Description}
In traditional Unix-like operating systems such as GNU/Linux distributions
and BSD-based OSes, package managers try to synchronize package metadata
(such as available versions and dependencies) with that of central repositories.
While this proves to be reliable and efficient, language-specific
package managers do not usually have such synchronized databases,
since they focus on development libraries, which have more flexible constraints.

This is the case within the Python packaging ecosystem, where package managers
like \verb|pip| need to fetch the metadata of each package to be installed
to find out its dependencies and other information. This imposes a heavy
performance penalty on the dependency resolution process alone, which is
already an NP-hard problem. This project explores ways to store these metadata
efficiently in a database, to be used in practice either locally or in a
local/regional network, so that Python package managers do not have to
fetch (and potentially build) entire packages just to find out whether
they conflict with others.

\selectlanguage{vietnamese}
\subsection{Authors and Credits}
The work has been undertaken by group number 8, whose members are listed
in the following table.
\begin{center}
\begin{tabular}{c c}
\toprule
Full name & Student ID\\
\midrule
Nguyễn Gia Phong & BI9-184\\
Nguyễn Quốc Thông & BI9-214\\
Nguyễn Văn Tùng & BI9-229\\
Trần Minh Vương & BI9-239\\
\bottomrule
\end{tabular}
\end{center}

This report is licensed under a CC BY-SA 4.0 license, while the source code is
available on GitHub\footnote{\url{https://github.com/McSinyx/cheese-shop}}
under AGPLv3+.

We would like to express our special thanks to Dr. Nguyễn Hoàng Hà,
whose lectures gave us a basic understanding of the key principles of
relational databases. In addition, we also received a lot of help from
the Python packaging community in \#pypa on Freenode in understanding
the structure of the metadata as well as in finding a way to fetch these
data from package indices.

\selectlanguage{english}
\section{User Requirements}
This project aims to provide a database for metadata queries and Python package
exploration. We try to replicate PyPI's XML-RPC API~\cite{xmlrpc},
which supports queries similar to the following:
\begin{itemize}
\item \verb|list_projects()|: Retrieve a list of registered project names.
\item \verb|project_releases(project)|: Retrieve a list of releases for
the given \verb|project|, ordered by version.
\item \verb|project_release_latest(project)|: Retrieve the latest release
of the given \verb|project|.
\item \verb|belong_to(name)|: Retrieve a list of projects whose author
is \verb|name|.
\item \verb|browse(classifier)|: Retrieve a list of (\verb|project|,
\verb|version|) of all releases classified with the given \verb|classifier|.
\item \verb|release_data(project, version)|: Retrieve the following metadata
matching the given release: project, version, homepage, author,
author's email, summary, license, keywords, classifiers and dependencies.
\item \verb|search_name(pattern)|: Retrieve a list of (\verb|project|,
\verb|version|, \verb|summary|) where the project name matches the pattern.
\item \verb|search_summary(pattern)|: Retrieve a list of (\verb|project|,
\verb|version|, \verb|summary|) where the summary matches the pattern.
\end{itemize}

\section{Data Definition}
\subsection{Entity Relationship Diagram}
The entity relationship diagram represents the relationships between
the entity sets of data extracted from projects:
\begin{itemize}
\item Author (Releases-Contact: Many-One): Each release has at most
one author, because the data extraction method does not support multiple
authors. An author, however, can have multiple releases under their name.
\item Require (Releases-Dependencies: Many-Many): Every release can require
a number of dependencies, and each dependency can be required by
multiple releases.
\item Classify (Releases-Trove: Many-Many): This relationship links
trove classifiers to releases. Many releases can be classified under
one trove classifier, and a release can be classified by
many classifiers.
\item Contain (Releases-Keyword: Many-Many): A release can have many keywords,
and a keyword can appear in many different releases.
\item Release (Releases-Distribution: One-Many): Each release
publishes a number of distributions. A distribution belongs
to exactly one release, but many distributions can be published
in the same release.
\end{itemize}
\includegraphics[width=\textwidth]{erd.jpg}

\subsection{Database Schema}
Based on the entity relationship diagram, we worked out a schema complying
with the third normal form~\cite{3nf}.
\begin{center}
\includegraphics[width=\textwidth]{schema.png}
\end{center}

\paragraph{contacts(\underline{email}, name)} Contact information of an author,
consisting of their email as the primary key and their name.

\paragraph{releases(\underline{id}, project, version, summary, homepage, email)}
This relation represents each release of a project, including the project name,
version, summary, homepage and the email of its author. The ID of each release
is the primary key. This release ID is also referenced as a foreign key
within the primary keys of several other relations.

\paragraph{troves(\underline{id}, classifier)} Valid trove classifiers,
identified by their ID.

\paragraph{classifiers(\id{release}, \id{trove})}
The ID of a release paired with the ID of a trove classifier
that the release is classified by.

\paragraph{keywords(\id{release}, \underline{term})} Keywords of a specific
release. The release ID and the keyword together form the primary key.

\paragraph{dependencies(\id{release}, \underline{dependency})} This relation
represents the dependency list of each release, where each dependency is
a pattern that can be matched by a release of another project.

\paragraph{distributions(\id{release}, \underline{filename}, size, url,
dist\_type, python\_version, requires\_python, sha256, md5)}
Each distribution (i.e. the file that the package manager can use to install)
and the corresponding URL, checksums and other auxiliary information.

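The relations above could be declared roughly as follows. This is only
a sketch: the relation and attribute names come from the schema, but every
column type, length and nullability constraint is our assumption, as is
the choice of plain SQL that most relational database systems should accept.
\begin{verbatim}
-- Column types and lengths below are assumptions, not part of the schema.
CREATE TABLE contacts (
    email VARCHAR(255) PRIMARY KEY,
    name VARCHAR(255));

CREATE TABLE releases (
    id INTEGER PRIMARY KEY,
    project VARCHAR(255) NOT NULL,
    version VARCHAR(64) NOT NULL,
    summary TEXT,
    homepage TEXT,
    email VARCHAR(255),
    FOREIGN KEY (email) REFERENCES contacts (email));

CREATE TABLE troves (
    id INTEGER PRIMARY KEY,
    classifier VARCHAR(255) NOT NULL);

-- Many-to-many link between releases and trove classifiers.
CREATE TABLE classifiers (
    release_id INTEGER,
    trove_id INTEGER,
    PRIMARY KEY (release_id, trove_id),
    FOREIGN KEY (release_id) REFERENCES releases (id),
    FOREIGN KEY (trove_id) REFERENCES troves (id));

CREATE TABLE keywords (
    release_id INTEGER,
    term VARCHAR(255),
    PRIMARY KEY (release_id, term),
    FOREIGN KEY (release_id) REFERENCES releases (id));

CREATE TABLE dependencies (
    release_id INTEGER,
    dependency VARCHAR(255),
    PRIMARY KEY (release_id, dependency),
    FOREIGN KEY (release_id) REFERENCES releases (id));

-- sha256 and md5 are assumed to hold hexadecimal digests.
CREATE TABLE distributions (
    release_id INTEGER,
    filename VARCHAR(255),
    size INTEGER,
    url TEXT,
    dist_type VARCHAR(32),
    python_version VARCHAR(32),
    requires_python VARCHAR(255),
    sha256 CHAR(64),
    md5 CHAR(32),
    PRIMARY KEY (release_id, filename),
    FOREIGN KEY (release_id) REFERENCES releases (id));
\end{verbatim}
Declaring the composite primary keys on the linking relations prevents
the same keyword, classifier or dependency from being recorded twice
for one release.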
\section{Data Query}
\subsection{Project Listing}
Retrieve a list of registered project names.
\begin{verbatim}
SELECT DISTINCT project FROM releases
\end{verbatim}

\subsection{Project Releases}
Retrieve a list of releases for the given project name, ordered by version.
\begin{verbatim}
SELECT * FROM releases
WHERE project = 'numpy'
ORDER BY version
\end{verbatim}

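\subsection{Project Latest Release}
Retrieve the latest release of the given project, i.e. the first one when
its releases are sorted by version in descending order. Note that if
\verb|version| is stored as text, this ordering is lexicographic,
so for instance 1.10 would sort before 1.9.
\begin{verbatim}
SELECT *
FROM releases
WHERE project = 'numpy'
ORDER BY version DESC
LIMIT 1
\end{verbatim}
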
\subsection{User's Project}
Retrieve a list of projects whose author has the given name.
\begin{verbatim}
SELECT DISTINCT project
FROM releases
LEFT JOIN contacts
ON releases.email = contacts.email
WHERE contacts.name = 'Travis E. Oliphant et al.'
\end{verbatim}

\subsection{Classifiers}
Retrieve the project name and version of all releases classified with
the given classifier, which must be a valid trove classifier string.
\begin{verbatim}
SELECT releases.project, releases.version, troves.classifier
FROM releases
INNER JOIN classifiers ON releases.id = classifiers.release_id
INNER JOIN troves ON classifiers.trove_id = troves.id
WHERE troves.classifier = 'Programming Language :: Python :: 3'
\end{verbatim}

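The query above filters on a single classifier. Should releases be required
to carry every classifier in a list, one possible approach is to count
the distinct matches per release, as sketched below. The two classifier strings
are merely examples, and the number in the \verb|HAVING| clause must equal
the length of the list.
\begin{verbatim}
-- Releases classified with all of the listed classifiers.
SELECT releases.project, releases.version
FROM releases
INNER JOIN classifiers ON releases.id = classifiers.release_id
INNER JOIN troves ON classifiers.trove_id = troves.id
WHERE troves.classifier IN ('Programming Language :: Python :: 3',
                            'Topic :: Scientific/Engineering')
GROUP BY releases.id, releases.project, releases.version
-- 2 is the number of classifiers listed above.
HAVING COUNT(DISTINCT troves.classifier) = 2
\end{verbatim}
A release matching only one of the two classifiers forms a group
with a count of 1 and is therefore filtered out.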
\subsection{Release Data}
Retrieve metadata describing a specific release.
\begin{verbatim}
SELECT rls.project, rls.version, rls.homepage, contacts.name,
       rls.email, rls.summary, keywords.term,
       troves.classifier,
       dependencies.dependency
FROM releases AS rls
INNER JOIN contacts ON rls.email = contacts.email
LEFT JOIN (classifiers
           INNER JOIN troves
           ON classifiers.trove_id = troves.id)
     ON rls.id = classifiers.release_id
LEFT JOIN keywords ON rls.id = keywords.release_id
LEFT JOIN dependencies ON rls.id = dependencies.release_id
WHERE rls.id = 1
\end{verbatim}

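Because keywords, classifiers and dependencies are each joined independently
on the release ID, a single joined query returns one row per combination:
a release with 3 keywords, 4 classifiers and 5 dependencies yields
$3 \times 4 \times 5 = 60$ rows. As an alternative sketch (not necessarily
what the accompanying source code does), the multi-valued attributes
can be fetched by separate queries:
\begin{verbatim}
-- Single-valued metadata: at most one row for the release.
SELECT rls.project, rls.version, rls.homepage, contacts.name,
       rls.email, rls.summary
FROM releases AS rls
LEFT JOIN contacts ON rls.email = contacts.email
WHERE rls.id = 1;

-- Multi-valued metadata: one row per value.
SELECT term FROM keywords WHERE release_id = 1;

SELECT troves.classifier
FROM classifiers
INNER JOIN troves ON classifiers.trove_id = troves.id
WHERE classifiers.release_id = 1;

SELECT dependency FROM dependencies WHERE release_id = 1;
\end{verbatim}
This trades one round trip for four, but the client no longer has to
deduplicate the repeated values.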
\subsection{Search Project by Name}
Retrieve the project, version and summary of releases whose project name
matches the given SQL pattern.
\begin{verbatim}
SELECT project, version, summary
FROM releases
WHERE project LIKE 'py%'
\end{verbatim}

\subsection{Search Project by Summary}
Retrieve the project, version and summary of releases whose summary
matches the given SQL pattern.
\begin{verbatim}
SELECT project, version, summary
FROM releases
WHERE summary LIKE '%num%'
\end{verbatim}

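In a \verb|LIKE| pattern, \verb|%| matches any sequence of characters and
\verb|_| matches any single character. Since project names may themselves
contain underscores, a literal match needs an escape character.
The sketch below assumes \verb|!| as the escape character and uses
a hypothetical name prefix purely for illustration.
\begin{verbatim}
-- '!' is declared as the escape character, so '!_' matches
-- a literal underscore instead of any single character.
SELECT project, version, summary
FROM releases
WHERE project LIKE 'typing!_%' ESCAPE '!'
\end{verbatim}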
\section{Conclusion}

\begin{thebibliography}{69}
\bibitem{xmlrpc} The Python Packaging Authority.
\href{https://warehouse.readthedocs.io/api-reference/xml-rpc}
{\emph{PyPI’s XML-RPC methods}}.
Warehouse documentation.
\bibitem{3nf} Edgar~F.~Codd.
\emph{Further Normalization of the Data Base Relational Model}.
IBM Research Report RJ909, August 31, 1971.
\end{thebibliography}
\end{document}