Commit 0f4767d

Update README.md
1 parent 49e1628 commit 0f4767d

1 file changed

Lines changed: 221 additions & 8 deletions

README.md

@@ -4,7 +4,7 @@ Most of the PHP parsers I encountered in the past were either too complicated, *
44

55
# pQuery Web Scraper tutorial
66
## Getting started
7-
To start coding with pQuery simply include the main PHP file in this repository and initilize an object class like this:
7+
To start coding with pQuery simply include the main PHP file in this repository and initialize an object class like this:
88
```php
99
// include webscraper.php file
1010
include "path/webscraper.php";
@@ -409,7 +409,8 @@ These functions are built to test if an element has a specific class/attribute o
409409

410410
```html
411411
<p class="rice">Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas.</p>
412-
<p data-target="lord">Donec eu libero sit amet quam egestas semper. Aenean ultricies mi vitae est. Mauris placerat eleifend leo.</p>
412+
<p data-target="lord">Donec eu libero sit amet quam egestas semper. Aenean ultricies mi vitae est. Mauris
413+
placerat eleifend leo.</p>
413414
```
414415

415416
### Php
@@ -437,10 +438,10 @@ p[1] has class "rice": true
437438
p[2] has attribute "data-target" with value "lady": false
438439
```
439440

440-
## `delete`
441+
## `remove` and `empty`
441442

442-
The delete function is used to delete DOM nodes.
443-
It accepts a boolean as a parameter: `true` or `false`, `true` tells it to keep its inner content (be it text or html nodes), while `false` tells it the opposite. The default parameter is set to `true`.
443+
The remove function is used to delete DOM nodes.
444+
It accepts a boolean parameter: `true` tells it to keep the node's inner content (text or HTML nodes), while `false` removes the content along with the node. The default is `false`. The `empty` function, by contrast, clears out an element's inner HTML and keeps the element itself.
444445

445446
### `$html`
446447

@@ -471,7 +472,7 @@ include "path/webscraper.php";
471472
$doc = new WebScraper();
472473
$doc->loadHTML($html);
473474

474-
$doc->Q("style")->delete();
475+
$doc->Q("style")->remove();
475476

476477
$doc->echo();
477478
```
@@ -494,7 +495,7 @@ include "path/webscraper.php";
494495
$doc = new WebScraper();
495496
$doc->loadHTML($html);
496497

497-
$doc->Q("p")->delete(true);
498+
$doc->Q("p")->remove(true);
498499

499500
$doc->echo();
500501
```
@@ -511,9 +512,48 @@ Will output
511512

512513
</body>
513514
```
515+
An example with `empty`:
516+
```php
517+
include "path/webscraper.php";
518+
$doc = new WebScraper();
519+
$doc->loadHTML($html);
514520

521+
$doc->Q("p[1]")->empty();
515522

516-
## `iterate` and `replaceText`
523+
$doc->echo();
524+
```
525+
```html
526+
<head>
527+
<style>
528+
code {
529+
font-family: Consolas,"courier new";
530+
color: crimson;
531+
background-color: #f1f1f1;
532+
padding: 2px;
533+
font-size: 105%;
534+
}
535+
</style>
536+
</head>
537+
<body>
538+
539+
<p></p>
540+
<p>The CSS <code>background-color</code> property defines the background color of an element.</p>
541+
542+
</body>
543+
```
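
The difference between `remove()`, `remove(true)` and `empty()` can be pictured with PHP's built-in DOM extension. This is only a sketch of the three behaviours described above: `removeNodes` and `emptyNodes` are hypothetical helper names, and nothing here is taken from pQuery's own implementation.

```php
<?php
// Hypothetical helpers built on PHP's DOM extension; they are NOT pQuery's code.

function removeNodes(DOMDocument $dom, string $tag, bool $keepContent = false): void
{
    // Snapshot the live node list, since removing nodes changes it while iterating.
    $nodes = iterator_to_array($dom->getElementsByTagName($tag));
    foreach ($nodes as $node) {
        if ($keepContent) {
            // Move every child in front of the node so the content survives.
            while ($node->firstChild !== null) {
                $node->parentNode->insertBefore($node->firstChild, $node);
            }
        }
        $node->parentNode->removeChild($node);
    }
}

function emptyNodes(DOMDocument $dom, string $tag): void
{
    $nodes = iterator_to_array($dom->getElementsByTagName($tag));
    foreach ($nodes as $node) {
        // Drop all children but keep the element itself.
        while ($node->firstChild !== null) {
            $node->removeChild($node->firstChild);
        }
    }
}

$dom = new DOMDocument();
$dom->loadHTML('<body><p>The <code>color</code> property.</p></body>');
removeNodes($dom, 'p', true); // unwrap: the <p> goes, its content stays
echo $dom->saveHTML();
```

Note that `saveHTML()` adds a doctype and `<html>` wrapper around the fragment, which pQuery's `echo()` output above does not show.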
544+
## Tip
545+
546+
You can delete all attributes from every tag by using the `::attributes` selector with the `remove` function:
547+
```php
548+
include "path/webscraper.php";
549+
$doc = new WebScraper();
550+
$doc->loadHTML($html);
551+
552+
$doc->Q("::attributes")->remove();
553+
554+
$doc->echo();
555+
```
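
For comparison, here is roughly what stripping every attribute looks like with PHP's built-in DOM extension. It is a sketch under the assumption that `::attributes` targets all attributes of all elements; `stripAllAttributes` is a hypothetical name and not part of pQuery.

```php
<?php
// Hypothetical helper using PHP's DOM extension; NOT pQuery's implementation.
function stripAllAttributes(DOMDocument $dom): void
{
    foreach ($dom->getElementsByTagName('*') as $element) {
        // Collect the names first: removing entries while walking the
        // attribute map directly can skip some of them.
        $names = [];
        foreach ($element->attributes as $attr) {
            $names[] = $attr->nodeName;
        }
        foreach ($names as $name) {
            $element->removeAttribute($name);
        }
    }
}

$dom = new DOMDocument();
$dom->loadHTML('<body><p class="rice" data-target="lord">Donec eu libero.</p></body>');
stripAllAttributes($dom);
echo $dom->saveHTML();
```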
556+
## `hasAttr` and `hasClass`
517557

518558
These functions test whether an element has a specific class or attribute. If it does, they return `true`.
519559

@@ -551,3 +591,176 @@ p[1] has class "rice": true
551591

552592
p[2] has attribute "data-target" with value "lady": false
553593
```
594+
## `replaceText` and `replaceTextCallback`
595+
596+
Both work similarly to the native `preg_replace` and `preg_replace_callback` functions, respectively. There are only two differences: you can choose whether the replacement may inject HTML/XML, and they automatically iterate over all node texts, or over specific node texts inside a chosen tag, leaving the overall XML/HTML structure untouched.
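
To see why iterating over node texts matters, compare a plain `preg_replace` on raw markup with a replacement that skips the tags. The helper below, `replaceOutsideTags`, is a hypothetical name and a deliberately naive sketch (it splits on `<...>` with a regex rather than parsing the document), so it only illustrates the idea and is not how pQuery does it.

```php
<?php
// Hypothetical helper, NOT part of pQuery: applies a regex to text outside tags only.
function replaceOutsideTags(string $pattern, string $replacement, string $html): string
{
    // Split into alternating text and tag pieces; the captured tags land at odd indexes.
    $pieces = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($pieces as $i => $piece) {
        if ($i % 2 === 0) {
            // Even indexes hold text, so only those are rewritten.
            $pieces[$i] = preg_replace($pattern, $replacement, $piece);
        }
    }
    return implode('', $pieces);
}

$html = '<p class="a">banana</p>';

// Native call on the raw string: the tag and its attribute get mangled too.
echo preg_replace('/a/i', '$', $html), "\n";     // <p cl$ss="$">b$n$n$</p>

// Text-only replacement: the markup stays intact.
echo replaceOutsideTags('/a/i', '$', $html), "\n"; // <p class="a">b$n$n$</p>
```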
597+
598+
### `$html`
599+
```html
600+
601+
<p>Nam finibus, neque et placerat condimentum, eros ligula mattis libero, eget aliquet nisi dolor nec ex.
602+
Cras eleifend et nulla rutrum mattis. Etiam eu ipsum nisi. Sed non placerat ante. Aliquam urna tellus,
603+
faucibus a risus quis, porta eleifend mauris. Nullam sagittis consequat faucibus. Nunc metus tortor,
604+
blandit sit amet odio sit amet, iaculis pulvinar ipsum. Morbi in urna vel leo fringilla efficitur.
605+
Vivamus eget rutrum sem. Phasellus posuere nunc sem, vel ultricies metus rutrum nec.</p>
606+
607+
```
608+
### `replaceText` usage
609+
```php
610+
$doc->Q("something")->replaceText($pattern, $replace, true/false);
611+
// set true to allow HTML entity decoding
612+
// set false to disable HTML entity decoding
613+
// it is set to true by default
614+
```
615+
Usage example:
616+
### Php
617+
```php
618+
include "path/webscraper.php";
619+
$doc = new WebScraper();
620+
$doc->loadHTML($html);
621+
622+
$pattern = "/a/i";
623+
$replace = "$";
624+
$doc->Q("::text")->replaceText($pattern, $replace);
625+
626+
$doc->echo();
627+
```
628+
### Output
629+
```html
630+
<p>N$m finibus, neque et pl$cer$t condimentum, eros ligul$ m$ttis libero, eget $liquet nisi dolor nec ex.
631+
Cr$s eleifend et null$ rutrum m$ttis. Eti$m eu ipsum nisi. Sed non pl$cer$t $nte. $liqu$m urn$ tellus,
632+
f$ucibus $ risus quis, port$ eleifend m$uris. Null$m s$gittis consequ$t f$ucibus. Nunc metus tortor,
633+
bl$ndit sit $met odio sit $met, i$culis pulvin$r ipsum. Morbi in urn$ vel leo fringill$ efficitur.
634+
Viv$mus eget rutrum sem. Ph$sellus posuere nunc sem, vel ultricies metus rutrum nec.</p>
635+
```
636+
### `replaceTextCallback` usage
637+
```php
638+
$doc->Q("something")->replaceTextCallback($pattern, function($m){
639+
# code
640+
// $m is mandatory,
641+
// $m is an array containing the matches
642+
// captured in parentheses in the pattern
643+
}, true/false);
644+
```
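
Since the callback is described as working like the one passed to `preg_replace_callback`, the native function is a convenient way to inspect what `$m` holds. The snippet below uses only PHP's built-in PCRE functions; the pattern and sample text are made up for illustration.

```php
<?php
$text = 'Etiam eu ipsum nisi. Sed non placerat ante.';

// Group 1 captures the sentence body, group 2 the closing punctuation.
$pattern = '/([A-Z][^.!?]*)([.!?])/';

$result = preg_replace_callback($pattern, function (array $m): string {
    // $m[0] is the full match, $m[1] and $m[2] are the parenthesised groups.
    return '[' . $m[1] . ']' . $m[2];
}, $text);

echo $result, "\n"; // [Etiam eu ipsum nisi]. [Sed non placerat ante].
```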
645+
646+
Usage example:
647+
### Php
648+
```php
649+
include "path/webscraper.php";
650+
$doc = new WebScraper();
651+
$doc->loadHTML($html);
652+
653+
// simple Regex to match sentences
654+
$pattern = '/([A-Z][^\.!?]*[\.!?]*(<br>)*)/';
655+
656+
// First try: third parameter is set to true by default
657+
$doc->Q("::text")->replaceTextCallback($pattern, function($m){
658+
static $id = 0;
659+
$id++;
660+
return "<label id=\"$id\">".$m[1]."</label>";
661+
// it pretty much works the same way as a preg_replace_callback
662+
});
663+
664+
echo "\$html = true\n\n";
665+
$doc->echo(); echo "\n\n\n";
666+
667+
// Second try: third parameter was manually set to false
668+
$doc->Q("::text")->replaceTextCallback($pattern, function($m){
669+
static $id = 0;
670+
$id++;
671+
return "<label id=\"$id\">".$m[1]."</label>";
672+
}, false);
673+
674+
echo "\$html = false\n\n";
675+
$doc->echo();
676+
```
677+
### Output
678+
```html
679+
$html = true
680+
681+
<p><label id="1">Nam finibus, neque et placerat condimentum, eros ligula mattis libero, eget aliquet nisi
682+
dolor nec ex.</label> <label id="2">Cras eleifend et nulla rutrum mattis.</label> <label id="3">Etiam eu
683+
ipsum nisi.</label> <label id="4">Sed non placerat ante.</label> <label id="5">Aliquam urna tellus,
684+
faucibus a risus quis, porta eleifend mauris.</label> <label id="6">Nullam sagittis consequat faucibus.
685+
</label> <label id="7">Nunc metus tortor, blandit sit amet odio sit amet, iaculis pulvinar ipsum.</label>
686+
<label id="8">Morbi in urna vel leo fringilla efficitur.</label> <label id="9">Vivamus eget rutrum sem.
687+
</label> <label id="10">Phasellus posuere nunc sem, vel ultricies metus rutrum nec.</label></p>
688+
689+
690+
$html = false
691+
692+
<p>&lt;label id="1"&gt;Nam finibus, neque et placerat condimentum, eros ligula mattis libero, eget aliquet
693+
nisi dolor nec ex.&lt;/label&gt; &lt;label id="2"&gt;Cras eleifend et nulla rutrum mattis.&lt;/label&gt;
694+
&lt;label id="3"&gt;Etiam eu ipsum nisi.&lt;/label&gt; &lt;label id="4"&gt;Sed non placerat ante.
695+
&lt;/label&gt; &lt;label id="5"&gt;Aliquam urna tellus, faucibus a risus quis, porta eleifend mauris.
696+
&lt;/label&gt; &lt;label id="6"&gt;Nullam sagittis consequat faucibus.&lt;/label&gt; &lt;label id="7"&gt;
697+
Nunc metus tortor, blandit sit amet odio sit amet, iaculis pulvinar ipsum.&lt;/label&gt; &lt;label id="8"&gt;
698+
Morbi in urna vel leo fringilla efficitur.&lt;/label&gt; &lt;label id="9"&gt;Vivamus eget rutrum sem.
699+
&lt;/label&gt; &lt;label id="10"&gt;Phasellus posuere nunc sem, vel ultricies metus rutrum nec.
700+
&lt;/label&gt;</p>
701+
```
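
The second output above is what you would get if the replacement string were escaped before being written back as text. As a rough illustration, and assuming pQuery does something equivalent internally (the source does not say how), PHP's `htmlspecialchars` with `ENT_NOQUOTES` reproduces the same escaping: angle brackets become entities while the quotes stay as they are.

```php
<?php
// One replacement string as the callback above would return it.
$replacement = '<label id="1">Etiam eu ipsum nisi.</label>';

// ENT_NOQUOTES escapes &, < and > but leaves double quotes alone,
// matching the escaped sample output shown above.
$escaped = htmlspecialchars($replacement, ENT_NOQUOTES);

echo $escaped, "\n"; // &lt;label id="1"&gt;Etiam eu ipsum nisi.&lt;/label&gt;
```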
702+
703+
## `replaceWith`
704+
705+
Both replacement functions above work on node texts, while `replaceWith` replaces whole HTML/XML tags.
706+
707+
## `count`
708+
709+
It counts the occurrences of a tag.
710+
711+
### `$html`
712+
```html
713+
<h1>Didn't melt fairer keepsakes since Fellowship elsewhere.</h1>
714+
<p>Woodlands payment Osgiliath tightening. Barad-dur follow belly comforts tender tough bell? Many that live
715+
deserve death. Some that die deserve life. Outwitted teatime grasp defeated before stones reflection corset
716+
seen animals Saruman's call?</p>
717+
<h2>Tad survive ensnare joy mistake courtesy Bagshot Row.</h2>
718+
<p>Ligulas step drops both? You shall not pass! Tender respectable success Valar impressive unfriendly
719+
bloom scraped? Branch hey-diddle-diddle pony trouble'll sleeping during jump Narsil.</p>
720+
<h3>North valor overflowing sort Iáve mister kingly money?</h3>
721+
<p>Curse you and all the halflings! Deserted anytime Lake-town burned caves balls. Smoked lthilien forbids
722+
Thrain?</p>
723+
<ul>
724+
<li>Adamant.</li>
725+
<li>Southfarthing!</li>
726+
<li>Witch-king.</li>
727+
<li>Precious.</li>
728+
<li>Gaffer's!</li>
729+
</ul>
730+
<ul>
731+
<li>Excuse tightening yet survives two cover Undómiel city ablaze.</li>
732+
<li>Keepsakes deeper clouds Buckland position 21 lied bicker fountains ashamed.</li>
733+
<li>Women rippling cold steps rules Thengel finer.</li>
734+
<li>Portents close Havens endured irons hundreds handle refused sister?</li>
735+
<li>Harbor Grubbs fellas riddles afar!</li>
736+
</ul>
737+
<h3>Narsil enjoying shattered bigger leaderless retrieve dreamed dwarf.</h3>
738+
<p>Ravens wonder wanted runs me crawl gaining lots faster! Khazad-dum surprise baby season ranks.
739+
I bid you all a very fond farewell.</p>
740+
<ol>
741+
<li>Narsil.</li>
742+
<li>Elros.</li>
743+
<li>Arwen Evenstar.</li>
744+
<li>Maggot's?</li>
745+
<li>Bagginses?</li>
746+
</ol>
747+
<ol>
748+
<li>Concerning Hobbits l golf air fifth bell prolonging camp.</li>
749+
<li>Grond humble rods nearest mangler.</li>
750+
<li>Enormity Lórien merry gravy stayed move.</li>
751+
<li>Diversion almost notion furs between fierce laboring Nazgûl ceaselessly parent.</li>
752+
<li>Agree ruling um wasteland Bagshot Row expect sleep.</li>
753+
</ol>
754+
```
755+
### Php
756+
```php
757+
include "path/webscraper.php";
758+
$doc = new WebScraper();
759+
$doc->loadHTML($html);
760+
761+
echo $doc->Q("li")->count();
762+
```
763+
### Output
764+
```html
765+
20
766+
```
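
As a quick sanity check on a result like this, opening tags can be counted with PHP's native `preg_match_all`, which returns the number of full matches. This is a crude regex count on a shortened, made-up sample, not pQuery's approach.

```php
<?php
// Shortened sample: three list items across two lists.
$html = '<ul><li>Adamant.</li><li>Precious.</li></ul><ol><li>Narsil.</li></ol>';

// \b keeps the pattern from matching longer tag names such as <link>.
$count = preg_match_all('/<li\b[^>]*>/i', $html);

echo $count, "\n"; // 3
```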
