A HTML parser, for extracting the text from a web pages, with CSS selectors.
Go to file
Alexander Kozachenko 0e0da953f2
Update Readme.md
2023-11-05 12:09:35 +02:00
WebScrap.Demo.CLI Release v1.0.0 (#82) 2023-11-05 11:56:25 +02:00
src Release v1.0.0 (#82) 2023-11-05 11:56:25 +02:00
.gitignore Release v1.0.0 (#82) 2023-11-05 11:56:25 +02:00
Changelog.md Release v1.0.0 (#82) 2023-11-05 11:56:25 +02:00
License Nuget (#54) 2023-10-27 06:05:36 +03:00
Readme.md Update Readme.md 2023-11-05 12:09:35 +02:00
run_all_tests.sh Release v0.5 (#47) 2023-10-20 18:25:28 +03:00

Readme.md

ProSol.WebScrap

A HTML parser, for extracting the text from a web pages, with CSS selectors.

Purpose

The purpose of this library is to get the essential data from a web-page for a user, in JSON format.

It could be further used for:

  1. Analyzing the essential data. Like a charts, diagramms, plain tables.
  2. Tracking the history of the essential data. Like prices for sales, currencies, user activity.
  3. Searching for specific essential data. Some word in multiple html resources, like movie title, or any other product, any mentioning.

Usage

Let's make a console demo and install the package:

dotnet new console -n WebScrap.Demo.CLI
cd WebScrap.Demo.CLI
dotnet add package ProSol.WebScrap

And try the following code:

using ProSol.WebScrap;

var request = "https://en.wikipedia.org/wiki/Food_energy";

// Download the html:
using var client = new HttpClient();
using var response = await client.GetAsync(request);
var html = await response.Content.ReadAsStringAsync();

// Run the WebScrapper:
var css = "#firstHeading";
var result = new WebScrapper(html)
    .Run(css)
    .AsJson()
    .ToJsonString();

// Get the results:
Console.WriteLine(result);
// OUTPUT:
// [{"key":"#firstHeading","values":[{"value":"Food energy"}]}]
Console.Read();

Known Issues

The project currently under active development, and there are some issues, some of the obvious, which are not the priority right now.

CSS

  • multiple css entries, comma-separated, are not supported.
  • attribute-based css are not supported.

HTML

  • object model returns tags in reverse order.
  • non-unicode text is not converted.

Goals

This project is for extracting text from html in a performant way.

Extract text

  • Plain text: This tool must extract a plain text from html.
  • User-defined result structure: The amount of text, and it's structure is defined by user, via multiple css selectors.

Performance

  • Parallel processing: All of css selectors should process the html in parallel.
  • Stream-based processing: The processed parts of html should be disposed from memory.

Footnote

  • The versioning is complied to the Semver 2.0.0. Please refer to semver.org for details.
  • Please refer to the Changelog for the progress.