The following PHP snippet will help you obtain the source code of any URL as plain text.
This code is perfect for building, for example, PHP crawlers that track patterns in web content and then process them: you could save certain content such as images, links, or text.
The snippet consists of a PHP function ready for you to copy and use immediately.
function display_sourcecode($url) {
    $lineas = file($url);
    $output = "";
    // Loop through every HTML line returned by the page
    foreach ($lineas as $line_num => $linea) {
        $output .= "Line #<b>{$line_num}</b> : " . htmlspecialchars($linea) . "<br>\n";
    }
    return $output;
}
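For instance, a minimal usage sketch (example.com is just a placeholder, and file() needs allow_url_fopen enabled in php.ini to open remote URLs):

// Hypothetical usage: print the numbered, escaped source of a page
echo display_sourcecode("https://example.com");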
Content
- How do we read the source code of a page in PHP?
- Retrieve all the URLs of a web page
- Retrieve only links that appear in href
How do we read the source code of a page in PHP?
You may be asking yourself this question if you have read or tried the PHP code above. If so, keep reading: below I explain exactly what this source-fetching function does and suggest some improvements.
First, let's see exactly what the function does, step by step:
- Using the function file(), it retrieves an array with one entry per line of the file. Where does it get the file from? From the URL passed in as the path.
- Every line of the file content is scanned one by one and appended to a string variable. Each line is numbered in bold, and the HTML characters are escaped with the PHP function htmlspecialchars() so that they are rendered as text (see the short sketch after this list).
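As a quick illustration of those two building blocks (a sketch, with example.com as a placeholder URL):

// file() returns one array entry per line of the remote document
$lineas = file("https://example.com");
// htmlspecialchars() escapes markup, so e.g. "<html>" is output as &lt;html&gt;
// and the browser renders it as visible text instead of interpreting it
echo htmlspecialchars($lineas[0]);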
As you can see, the operation is simple: as it stands, this function does nothing special beyond assembling the content of the page indicated by the URL.
Next I propose a couple of improvements to this function: retrieving all the URLs found on a page, and retrieving only the links that appear in href attributes.
Retrieve all the URLs of a web page
In the next version of this snippet, I search for and retrieve all the URLs found at the indicated web address. This improvement could be the starting point for a PHP crawler or an automated bot.
The function processes all the URLs found on the page and returns an array containing all of them.
function display_sourcecode($url) {
    $lineas = file($url);
    $urls = [];
    // Loop through every HTML line returned by the page
    foreach ($lineas as $line_num => $linea) {
        preg_match_all('#\bhttps?://[^,\s()<>]+(?:\([\w\d]+\)|([^,[:punct:]\s]|/))#', $linea, $match);
        // Append the full matches (the URLs found) to our array of URLs;
        // array_merge() is needed here, since the "+" operator would keep
        // only the entries from the first line and ignore later matches
        $urls = array_merge($urls, $match[0]);
    }
    return $urls;
}
Although this function still fetches the source code of a page, it differs from the first one in that I use the PHP function preg_match_all() to extract, by means of a regular expression pattern, all the URLs present on each line. All of these URLs are accumulated in an array which, once the loop finishes, contains every complete URL (with domain, etc.) found in the HTML code of the requested page.
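A minimal usage sketch (again, example.com is only a placeholder):

// Hypothetical usage: collect every absolute URL found on a page
$urls = display_sourcecode("https://example.com");
print_r($urls); // prints something like Array ( [0] => https://... )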
Remember that this array only collects complete (absolute) URLs, so if the destination page uses relative URLs in its source code, they will not be captured. To pick those up as well, you would have to change the regular expression.
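As a sketch of that change (an assumed pattern, not the one from the original snippet), you could match any href value, relative or absolute, instead of only full http(s) URLs:

// Assumed alternative pattern: capture whatever appears inside href="..."
preg_match_all('#href=["\']([^"\']+)["\']#i', $linea, $match);
// The captured values then live in $match[1] rather than $match[0]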
Retrieve only links that appear in href
An alternative way to collect the links on a page, but only those that appear in an HTML anchor tag (<a>), is with a PHP object of type DOMDocument. It is also necessary to replace the line-by-line reading done by file() with a call that retrieves the entire source code as a single string.
Here is a proposal for you:
function display_sourcecode($url) {
    $html = file_get_contents($url);
    $dom = new DOMDocument;
    // Parse the HTML. The @ is used to suppress any parsing errors
    // that will be thrown if the $html string isn't valid XHTML.
    @$dom->loadHTML($html);
    // Get all links. You could also use any other tag name here,
    // like 'img' or 'table', to extract other tags.
    $links = $dom->getElementsByTagName('a');
    $urls = [];
    // Iterate over the extracted links and collect their URLs
    foreach ($links as $link) {
        // Extract the "href" attribute.
        $urls[] = $link->getAttribute('href');
    }
    return $urls;
}
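A minimal usage sketch of this version (example.com is a placeholder):

// Hypothetical usage: print each href found in the page's <a> tags
foreach (display_sourcecode("https://example.com") as $href) {
    echo $href . "\n";
}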