Commit graph

11 commits

Author SHA1 Message Date
triaxx
af3e830ee9 py-scrapy: Update to 2.4.1
Upstream changes:
------------------
A lot of changes listed at https://github.com/scrapy/scrapy/blob/master/docs/news.rst
2021-03-22 08:56:56 +00:00
adam
9cc23be607 py-scrapy: updated to 1.8.0
Scrapy 1.8.0:

Highlights:
* Dropped Python 3.4 support and updated minimum requirements; made Python 3.8
  support official
* New :meth:`Request.from_curl <scrapy.http.Request.from_curl>` class method
* New :setting:`ROBOTSTXT_PARSER` and :setting:`ROBOTSTXT_USER_AGENT` settings
* New :setting:`DOWNLOADER_CLIENT_TLS_CIPHERS` and
  :setting:`DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING` settings
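
The four new settings listed above could be set in a project's settings.py roughly as follows. This is a sketch and not part of the upstream notes: the bot name is hypothetical, and the parser path and cipher string are assumptions taken from the Scrapy settings documentation, so they should be checked against the installed version.

```python
# settings.py (sketch): the settings introduced in Scrapy 1.8.0.
# All values below are illustrative assumptions, not recommendations.

# Which robots.txt parser backend to use (dotted path to a parser class).
ROBOTSTXT_PARSER = "scrapy.robotstxt.PythonRobotParser"

# User agent string matched against robots.txt rules; "examplebot" is hypothetical.
ROBOTSTXT_USER_AGENT = "examplebot"

# OpenSSL cipher string for the downloader's TLS client.
DOWNLOADER_CLIENT_TLS_CIPHERS = "DEFAULT:!DH"

# Log extra details about each TLS connection once it is established.
DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING = True
```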
2020-01-29 22:06:30 +00:00
adam
d5f9a1b1d6 py-scrapy: updated to 1.7.3
Scrapy 1.7.3:
Enforce lxml 4.3.5 or lower for Python 3.4 (issue 3912, issue 3918).

Scrapy 1.7.2:
Fix Python 2 support (issue 3889, issue 3893, issue 3896).

Scrapy 1.7.1:
Re-packaging of Scrapy 1.7.0, which was missing some changes in PyPI.

Scrapy 1.7.0:
Highlights:
Improvements for crawls targeting multiple domains
A cleaner way to pass arguments to callbacks
A new class for JSON requests
Improvements for rule-based spiders
New features for feed exports


Backward-incompatible changes

429 is now part of the RETRY_HTTP_CODES setting by default
This change is backward incompatible. If you don’t want to retry 429, you must override RETRY_HTTP_CODES accordingly.
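
A minimal sketch of such an override in settings.py, assuming the default list shown here matches the installed Scrapy version (the list is an assumption and should be verified against scrapy.settings.default_settings):

```python
# settings.py (sketch): stop retrying HTTP 429 responses.
# ASSUMPTION: this mirrors the Scrapy 1.7 default list; check your version.
ASSUMED_DEFAULT_RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]

# Keep every assumed default except 429 (Too Many Requests).
RETRY_HTTP_CODES = [code for code in ASSUMED_DEFAULT_RETRY_HTTP_CODES if code != 429]
```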

Crawler, CrawlerRunner.crawl and CrawlerRunner.create_crawler no longer accept a Spider subclass instance; they now only accept a Spider subclass.
Spider subclass instances were never meant to work, and they were not working as one would expect: instead of using the passed Spider subclass instance, their from_crawler method was called to generate a new instance.

Non-default values for the SCHEDULER_PRIORITY_QUEUE setting may stop working. Scheduler priority queue classes now need to handle Request objects instead of arbitrary Python data structures.


New features

A new scheduler priority queue, scrapy.pqueues.DownloaderAwarePriorityQueue, may be enabled for a significant scheduling improvement on crawls targeting multiple web domains, at the cost of no CONCURRENT_REQUESTS_PER_IP support (issue 3520)
A new Request.cb_kwargs attribute provides a cleaner way to pass keyword arguments to callback methods (issue 1138, issue 3563)
A new JSONRequest class offers a more convenient way to build JSON requests (issue 3504, issue 3505)
A process_request callback passed to the Rule constructor now receives the Response object that originated the request as its second argument (issue 3682)
A new restrict_text parameter for the LinkExtractor constructor allows filtering links by linking text (issue 3622, issue 3635)
A new FEED_STORAGE_S3_ACL setting allows defining a custom ACL for feeds exported to Amazon S3 (issue 3607)
A new FEED_STORAGE_FTP_ACTIVE setting allows using FTP’s active connection mode for feeds exported to FTP servers (issue 3829)
A new METAREFRESH_IGNORE_TAGS setting allows overriding which HTML tags are ignored when searching a response for HTML meta tags that trigger a redirect (issue 1422, issue 3768)
A new redirect_reasons request meta key exposes the reason (status code, meta refresh) behind every followed redirect (issue 3581, issue 3687)
The SCRAPY_CHECK variable is now set to the true string during runs of the check command, which allows detecting contract check runs from code (issue 3704, issue 3739)
A new Item.deepcopy() method makes it easier to deep-copy items (issue 1493, issue 3671)
CoreStats also logs elapsed_time_seconds now (issue 3638)
Exceptions from ItemLoader input and output processors are now more verbose (issue 3836, issue 3840)
Crawler, CrawlerRunner.crawl and CrawlerRunner.create_crawler now fail gracefully if they receive a Spider subclass instance instead of the subclass itself (issue 2283, issue 3610, issue 3872)
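
As an illustration of the SCRAPY_CHECK item above, contract check runs can be detected from code by reading the environment. The helper name and its environ parameter are hypothetical and exist only to keep the sketch testable without running Scrapy:

```python
import os


def is_contract_check_run(environ=None):
    """Return True when running under `scrapy check`.

    Relies on the SCRAPY_CHECK variable being set to the string "true",
    as the note above describes. `environ` is a hypothetical parameter
    added here only to make the helper easy to test.
    """
    if environ is None:
        environ = os.environ
    return environ.get("SCRAPY_CHECK") == "true"


# Example: choose a cheaper setup path during contract checks.
if is_contract_check_run({"SCRAPY_CHECK": "true"}):
    setup_mode = "lightweight"
else:
    setup_mode = "full"
```

In a spider, the same check would typically sit in __init__ so that expensive setup can be skipped while contracts are being verified.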


Bug fixes

process_spider_exception() is now also invoked for generators (issue 220, issue 2061)
System exceptions like KeyboardInterrupt are no longer caught (issue 3726)
ItemLoader.load_item() no longer makes later calls to ItemLoader.get_output_value() or ItemLoader.load_item() return empty data (issue 3804, issue 3819)
The images pipeline (ImagesPipeline) no longer ignores these Amazon S3 settings: AWS_ENDPOINT_URL, AWS_REGION_NAME, AWS_USE_SSL, AWS_VERIFY (issue 3625)
Fixed a memory leak in MediaPipeline affecting, for example, non-200 responses and exceptions from custom middlewares (issue 3813)
Requests with private callbacks are now correctly unserialized from disk (issue 3790)
FormRequest.from_response() now handles invalid methods like major web browsers
2019-08-22 08:21:11 +00:00
adam
a3537d9682 py-scrapy: updated to 1.6.0
Scrapy 1.6.0:

Highlights:
* better Windows support;
* Python 3.7 compatibility;
* big documentation improvements, including a switch
  from .extract_first() + .extract() API to .get() + .getall()
  API;
* feed exports, FilePipeline and MediaPipeline improvements;
* better extensibility: :signal:item_error and
  :signal:request_reached_downloader signals; from_crawler support
  for feed exporters, feed storages and dupefilters.
* scrapy.contracts fixes and new features;
* telnet console security improvements, first released as a
  backport in :ref:release-1.5.2;
* clean-up of the deprecated code;
* various bug fixes, small new features and usability improvements across
  the codebase.
2019-01-31 09:07:46 +00:00
adam
1b7dade2d4 py-scrapy: updated to 1.5.2
Scrapy 1.5.2:

* *Security bugfix*: The telnet console extension can be easily exploited by rogue
  websites POSTing content to http://localhost:6023. We haven't found a way to
  exploit it from Scrapy, but it is very easy to trick a browser into doing so,
  which elevates the risk for local development environments.

  *The fix is backwards incompatible*: it enables telnet user-password
  authentication by default with a randomly generated password. If you can't
  upgrade right away, please consider setting :setting:TELNETCONSOLE_PORT
  to a non-default value.

  See :ref:telnet console <topics-telnetconsole> documentation for more info

* Backport CI build failure under GCE environment due to boto import error.
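
For installations that cannot upgrade immediately, the telnet console mitigation from the first item might look like this in settings.py. This is a sketch: the port numbers are arbitrary examples, and either option alone may be enough depending on whether the console is needed.

```python
# settings.py (sketch): mitigations for the telnet console issue.
# Port numbers below are arbitrary examples, not recommended values.

# Option 1: turn the telnet console extension off entirely.
TELNETCONSOLE_ENABLED = False

# Option 2: keep it, but move it away from the default port range.
TELNETCONSOLE_PORT = [7023, 7073]
```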
2019-01-24 14:11:48 +00:00
adam
2b8104adc7 py-scrapy: updated to 1.5.1
Scrapy 1.5.1:
This is a maintenance release with important bug fixes, but no new features:
* O(N^2) gzip decompression issue which affected Python 3 and PyPy
  is fixed
* skipping of TLS validation errors is improved
* Ctrl-C handling is fixed in Python 3.5+
* testing fixes
* documentation improvements
2018-08-14 06:56:39 +00:00
adam
33769cc03b py-scrapy: updated to 1.5.0
Scrapy 1.5.0:
This release brings small new features and improvements across the codebase.
Some highlights:

* Google Cloud Storage is supported in FilesPipeline and ImagesPipeline.
* Crawling with proxy servers becomes more efficient, as connections
  to proxies can be reused now.
* Warnings, exception and logging messages are improved to make debugging
  easier.
* scrapy parse command now allows setting custom request meta via
  the --meta argument.
* Compatibility with Python 3.6, PyPy and PyPy3 is improved;
  PyPy and PyPy3 are now supported officially, by running tests on CI.
* Better default handling of HTTP 308, 522 and 524 status codes.
* Documentation is improved, as usual.

Backwards Incompatible Changes
* Scrapy 1.5 drops support for Python 3.3.
* Default Scrapy User-Agent now uses https link to scrapy.org.
  **This is technically backwards-incompatible**; override
  :setting:USER_AGENT if you relied on old value.
* Logging of settings overridden by custom_settings is fixed;
  **this is technically backwards-incompatible** because the logger
  changes from [scrapy.utils.log] to [scrapy.crawler]. If you're
  parsing Scrapy logs, please update your log parsers.
* LinkExtractor now ignores the m4v extension by default; this is a change
  in behavior.
* 522 and 524 status codes are added to RETRY_HTTP_CODES
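
For anyone parsing Scrapy logs, the logger rename mentioned above can be absorbed by accepting both names. The sketch below assumes the default log line layout and an "Overridden settings:" message prefix; the sample lines are hypothetical, and a custom LOG_FORMAT would need a different pattern.

```python
import re

# Matches the assumed default log line layout:
#   "<date> <time> [<logger>] <LEVEL>: <message>"
# and accepts the logger name used both before and after Scrapy 1.5.
OVERRIDDEN_SETTINGS = re.compile(
    r"^\S+ \S+ \[(?P<logger>scrapy\.utils\.log|scrapy\.crawler)\] "
    r"INFO: Overridden settings: (?P<settings>.*)$"
)


def parse_overridden_settings(line):
    """Return the raw settings text from a matching log line, else None."""
    match = OVERRIDDEN_SETTINGS.match(line)
    return match.group("settings") if match else None


# Hypothetical sample lines, one per logger name.
old_line = "2018-01-04 21:31:41 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'demo'}"
new_line = "2018-01-04 21:31:41 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'demo'}"
```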

New features
- Support <link> tags in Response.follow
- Support for ptpython REPL
- Google Cloud Storage support for FilesPipeline and ImagesPipeline
- New --meta option of the "scrapy parse" command allows passing additional
  request.meta
- Populate spider variable when using shell.inspect_response
- Handle HTTP 308 Permanent Redirect
- Add 522 and 524 to RETRY_HTTP_CODES
- Log versions information at startup
- scrapy.mail.MailSender now works in Python 3 (it requires Twisted 17.9.0)
- Connections to proxy servers are reused
- Add template for a downloader middleware
- Explicit message for NotImplementedError when parse callback not defined
- CrawlerProcess got an option to disable installation of root log handler
- LinkExtractor now ignores m4v extension by default
- Better log messages for responses over :setting:DOWNLOAD_WARNSIZE and
  :setting:DOWNLOAD_MAXSIZE limits
- Show a warning when a URL is put in Spider.allowed_domains instead of
  a domain.

Bug fixes
- Fix logging of settings overridden by custom_settings;
  **this is technically backwards-incompatible** because the logger
  changes from [scrapy.utils.log] to [scrapy.crawler], so please
  update your log parsers if needed
- Default Scrapy User-Agent now uses https link to scrapy.org.
  **This is technically backwards-incompatible**; override
  :setting:USER_AGENT if you relied on old value.
- Fix PyPy and PyPy3 test failures, support them officially
- Fix DNS resolver when DNSCACHE_ENABLED=False
- Add cryptography for Debian Jessie tox test env
- Add verification to check if Request callback is callable
- Port extras/qpsclient.py to Python 3
- Use getfullargspec behind the scenes for Python 3 to stop DeprecationWarning
- Update deprecated test aliases
- Fix SitemapSpider support for alternate links
2018-01-04 21:31:41 +00:00
wiz
ff22ec594f Follow some redirects.
2017-09-04 18:08:18 +00:00
adam
b69bc771b9 Scrapy 1.4 does not bring that many breathtaking new features
but quite a few handy improvements nonetheless.

Scrapy now supports anonymous FTP sessions with customizable user and
password via the new :setting:`FTP_USER` and :setting:`FTP_PASSWORD` settings.
And if you're using Twisted version 17.1.0 or above, FTP is now available
with Python 3.

There's a new :meth:`response.follow <scrapy.http.TextResponse.follow>` method
for creating requests; **it is now a recommended way to create Requests
in Scrapy spiders**. This method makes it easier to write correct
spiders; ``response.follow`` has several advantages over creating
``scrapy.Request`` objects directly:

* it handles relative URLs;
* it works properly with non-ascii URLs on non-UTF8 pages;
* in addition to absolute and relative URLs it supports Selectors;
  for ``<a>`` elements it can also extract their href values.
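
To illustrate the first point, relative URL handling, here is a standalone sketch using only the standard library. It is not ``response.follow`` itself and does not import Scrapy; it shows the kind of resolution a spider would otherwise do by hand before building a ``scrapy.Request``. The URLs are hypothetical.

```python
from urllib.parse import urljoin

# Hypothetical page URL and hrefs as they might appear in <a> elements.
page_url = "https://example.com/catalog/page1.html"
hrefs = ["page2.html", "/about", "https://example.org/other"]

# Resolve each href against the page URL; absolute URLs pass through unchanged.
resolved = [urljoin(page_url, href) for href in hrefs]
```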
2017-05-20 06:25:36 +00:00
adam
992400803f Changes 1.3.3:
Bug fixes
- Make ``SpiderLoader`` raise ``ImportError`` again by default for missing
  dependencies and wrong :setting:`SPIDER_MODULES`.
  These exceptions were silenced as warnings since 1.3.0.
  A new setting is introduced to toggle between warning or exception if needed;
  see :setting:`SPIDER_LOADER_WARN_ONLY` for details.
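
A project that prefers the warning behavior could opt back in from settings.py. This is a minimal sketch based on the setting named above:

```python
# settings.py (sketch): report spider import problems as warnings
# instead of raising ImportError, as releases 1.3.0 to 1.3.2 did.
SPIDER_LOADER_WARN_ONLY = True
```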
2017-03-19 22:59:10 +00:00
adam
3e7be94dc6 Added www/py-scrapy version 1.3.2
Scrapy is a fast high-level web crawling and web scraping framework, used to
crawl websites and extract structured data from their pages. It can be used for
a wide range of purposes, from data mining to monitoring and automated testing.
2017-02-13 21:25:33 +00:00