Building Large Web Projects

Any successful web project sooner or later runs into the problem of growth: the existing hardware and software can no longer cope with the increasing load. Unfortunately, there are no universal recipes, and a good programmer will approach each project differently. Nevertheless, in this article I will try to give some typical recommendations for building large web projects. As they are built and developed, such projects usually face two problems whose solutions pull in almost opposite directions: high speed and large volumes of data.

High speed


An ideal example of a site for which speed is vital is a banner network. Here, then, are some techniques for speeding up banner networks and other servers where speed is critical.

Writing modules


The point of this technique is to compile the most important functions into the server. The idea is very simple. If we look at how the time is divided among the various stages of handling a request, we see an interesting picture. For example, when the simplest Perl script is executed, the following happens in sequence:


1) The Apache server determines which Perl script to run, prepares it, and starts it;

2) Running the script actually begins with starting the Perl interpreter (an executable file about half a megabyte in size). Once started, the interpreter takes up about 2 megabytes of memory, and only then does it begin working on the user's script;

3) This work begins with compiling the program. Compilation is, as a rule, one of the longest stages of processing a program;

4) Only after this preliminary compilation (into bytecode) does the script begin to execute.


The statistics are depressing: the time spent starting the Perl interpreter and compiling a script is, as a rule, an order of magnitude greater than the time the script takes to run.

Every site has bottlenecks: programs that are called very often. A banner rotator is one example. As a rule, a single page view requires two or three banners, and therefore as many calls to the program. Clearly, if we get rid of the overhead (items 2 and 3), the server will work considerably faster. This can be done in two similar ways.

The first is to write an Apache module and compile it into the server. This is how the part of the system that served banners to users was implemented in the Flamingo-2 banner network (http://www.f2.ru), which I helped build. It was a module written in C that ran as part of the Apache server and therefore worked very quickly.

The second way is to use technologies that precompile programs. There are many of them; for Perl scripts, for example, there are FastCGI and mod_perl. I will say more about mod_perl. It is a Perl interpreter compiled into Apache (again, as a module). First, even for simple scripts (with appropriate configuration) it eliminates the second stage of execution. Beyond that, mod_perl lets you write handlers for particular stages of request processing. This is a very powerful technology, so we will look at it in more detail.

You can, for example, write a handler that is called when a particular URL is requested. It is done as follows. Add these lines to httpd.conf:

<Perl>
unshift(@INC, '/path/to/your/module');

@PerlModule = qw(MyHandler);

%Location = (
    '/myhandler' => {
        'PerlHandler'    => 'MyHandler::view',
        'SetHandler'     => 'perl-script',
        'PerlSendHeader' => 'on'
    },
);
</Perl>



This tells Apache and mod_perl that when a user requests the URL /myhandler, the module MyHandler should be run to process it, and within it the subroutine view. After changing httpd.conf, Apache must be restarted. Incidentally, all files named in the configuration are compiled when the server starts rather than on the first request, which makes the server several times faster.

The module MyHandler.pm might look like this, for example:

package MyHandler;
use strict;

# The view subroutine
sub view {
    print "<HTML>\n<BODY>\nHurray! This was done by our handler!</BODY>\n</HTML>\n";
}

# A module must return a true value
1;



The handler mechanism is very powerful. In effect, you can replace any stage of request processing. As an example, let us build our own password-checking mechanism:

package MyAuthorization;
use strict;
# Status constants returned by the handler (mod_perl 1.x)
use Apache::Constants qw(OK AUTH_REQUIRED);

# Handler that requests the password
sub handler {
    my $r = shift;

    return AUTH_REQUIRED unless $r;

    my (undef, $password) = $r->get_basic_auth_pw;
    my ($login) = $r->connection->user;

    return AUTH_REQUIRED unless $password;

    # Check that everything is in order.
    # The check can be anything: it could consult a database,
    # but here we assume that the password must be
    # the login spelled backwards.
    my $rev_login = reverse($login);

    # Check the password
    if ($rev_login ne $password) {
        return AUTH_REQUIRED;
    } else {
        return OK;
    }
}

1;



In the server configuration file httpd.conf, we must specify that the user is to be authenticated:

%Location = (
    '/myhandler' => {
        'PerlHandler'       => 'MyHandler::view',
        'SetHandler'        => 'perl-script',
        'PerlSendHeader'    => 'on',
        'require'           => 'valid-user',
        'Limit'             => {
            'METHODS' => 'GET POST'
        },
        'AuthType'          => 'Basic',
        'AuthName'          => 'PersonaUser',
        'PerlAuthenHandler' => 'MyAuthorization::handler'
    },
);



Access to /myhandler is now protected: the browser will show the user the standard password prompt.

You can learn more about mod_perl at http://perl.apache.org/

Using pipelines


Try not to do data processing in interactive scripts. Write the data to log files, and then aggregate and process it in a separate process. For example, a user's answer in an interactive poll may change ten different statistics (the distribution of answers, user activity, the total number of votes, and so on). Do not update them all on the spot. Instead, split the procedure into two parts. The first is the vote itself: recording the result and returning a response page to the user. The second is processing the vote, updating the statistics, and so on.

In general, try to minimize the number of interactive operations. In the ideal case, the script that registers a vote does nothing at all except write the information to a log file. A separate process, a daemon, can then be started to process the data from the log file.
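
The split described above can be sketched in Perl roughly as follows. This is only an illustration under stated assumptions: the log format (one tab-separated line per vote), the file path passed in, and the subroutine names record_vote and aggregate_votes are invented for the example and are not taken from any real system.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Interactive part: append one line to the log and return immediately.
# The record layout (time, user, answer) is hypothetical.
sub record_vote {
    my ($log_file, $user, $answer) = @_;
    open(my $fh, '>>', $log_file) or die "Cannot open $log_file: $!";
    flock($fh, LOCK_EX) or die "Cannot lock $log_file: $!";
    print {$fh} join("\t", time(), $user, $answer), "\n";
    close($fh) or die "Cannot close $log_file: $!";
    return;
}

# Background part: run later by a separate process (a cron job or daemon).
# Reads the whole log and computes all the statistics in one pass.
sub aggregate_votes {
    my ($log_file) = @_;
    my %stats = (total => 0, by_answer => {}, by_user => {});
    open(my $fh, '<', $log_file) or die "Cannot open $log_file: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my (undef, $user, $answer) = split /\t/, $line;
        next unless defined $answer;    # skip malformed lines
        $stats{total}++;
        $stats{by_answer}{$answer}++;
        $stats{by_user}{$user}++;
    }
    close($fh);
    return \%stats;
}
```

The interactive script would call only record_vote; everything expensive lives in aggregate_votes, which can run whenever the server is least busy.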

As an example, consider how statistics were processed in the Flamingo-2 banner network. It used a four-stage pipeline:


1) Information about every request was written to a full log. This information was very detailed, and it was written without any compression, which would have taken too much time. The log was very large: a single record occupied 250 bytes. Data was kept in this log for no more than a few hours.

2) Every 10 minutes, a program was run that processed the full log and wrote the information to database tables in compact form. At the same stage impressions were counted, and the temporary tables used for serving banners to users and for the later stages were updated.

3) The hourly daemon, which built hourly statistics, performed complex geographical calculations, and did much else, was run once an hour. It no longer had access to the full log and used only the information from the second stage.

4) The tasks of the last stage included daily file rotation, statistics, balance keeping, and sending e-mail warnings. This stage ran every day late at night, when the load on the server was minimal.


As you can see, the mechanism is fairly complex, and getting it to work correctly was not easy. The more stages there are, the more problems arise where they meet. Nevertheless, this system distributed the load quite effectively and ran briskly on a plain IDE disk (its design capacity was about 2-3 million requests a day, with a peak load of 200 requests per second). At the same time, the system kept a great many statistics.


To summarize: to speed up programs that interact with the user, we split their work into segments, and the interactive part should contain a minimum of calculations and write operations. All the necessary calculations can be done later, more efficiently and at a time that is more favorable in terms of load.

Databases


Use a good database. Which one should you choose? There is no single recipe; everything depends on the problem being solved. If it is fairly simple and you do not need to run complex SQL queries (nested ones, for example), then MySQL is probably the best choice.

MySQL is one of the simplest database servers. But even in this simple database there are ways to optimize and speed up queries. For example, it is no secret that INSERT is one of the slowest operations (computing the physical address for the insert, the insert itself, dealing with fragmentation, updating indexes and service tables). A good technique for speeding up a script that inserts data into a database is to replace INSERT with INSERT DELAYED (a deferred insert). The data will then be updated only when doing so does not slow the server down.

Another example: if you read the MySQL documentation carefully, you will find mention of tables that reside in memory (HEAP tables). Obviously, operations on such tables are much faster. HEAP tables can be used to solve some problems.

The database server has many startup parameters that tune the sort buffers, the computation buffers, the number of child processes, and other settings. As a rule, you know in advance what you will do with the database, and you can set the parameters accordingly to increase speed. Take a quite realistic task as an example: building some kind of catalog. Clearly, there will be one large table with many indexes. You know that you will use joins. Work with this table will consist of index lookups without sorting. Let us see how the database server can be tuned for such a task (the example is from MySQL 3.23.25):

- join_buffer_size: the buffer used for joins; the default is 131072 bytes;

- key_buffer_size: the buffer for working with keys and indexes; the default size is 1048540 bytes;

- sort_buffer: the buffer for sorting; the default is 2097116 bytes.


Most likely, increasing any one buffer will speed up the operations associated with it. Given our task, we will increase the key buffer (fetching values from the table will become faster) and reduce the sort buffer (sorting will become slower) and the join buffer (joins will become slower).

The command line for starting the MySQL daemon will look roughly like this (the specific values depend on the amount of memory in the system):

shell> safe_mysqld -O key_buffer=8M -O sort_buffer=1M -O join_buffer=16K



To summarize: when a database is used, a script can be sped up considerably by configuring the database server correctly. The MySQL manual has a special section devoted to optimization. For more detailed information, see these sites:

MySQL developers - http://www.mysql.com

PostgreSQL developers - http://www.PostgreSQL.org/

MySQL optimization - http://www.mysql.cz/information/presentations/presentation-oscon2000-20000719/index.html and http://support.ultrahost.ru/mysql_opt.php

Large volumes


Another problem of large sites is the sheer volume of information. Unless you resort to some tricks, maintaining even a simple HTML site will at some point demand too much time.

Object-oriented programming


I have already written about the benefits of the object-oriented approach, so I will repeat them only briefly. Anyone who has ever tried to build dynamic sites knows that it is in many respects a very monotonous task: a guest book, a forum, a feedback form, a subscription, registration. As a rule, these scripts are poorly integrated and, at best, share a common library of constants and common subroutines.


However, if we list the entities these scripts deal with, we get very interesting results:

- The "user" entity. It has a first name, a last name, a nickname, a password, an e-mail address... It is used in practically every script in different guises.

- The "message" entity. You may object that messages are different everywhere. Nothing of the sort! Only the forms in which messages are presented differ; the data, the structure of the fields, and the methods of processing are the same. An author, a subject, a body: it is so in every project.


These are, in fact, all the entities that most scripts on a site work with. A guest book (which, incidentally, can itself be an object in more complex projects) is a chain of objects of the class "message". A forum is the same messages organized hierarchically. Sending a letter to the site owner is a message. Sending out announcements means iterating over objects of the class "user" and sending each of them an object of the class "message".

It would be efficient to describe all these objects in one place and then build programs and scripts out of them, as if out of bricks, simply by inserting object calls into the code. Moreover, a single shared space of messages, users, and other objects considerably widens the scope for creativity.

This is the essence of the object approach. You create a set of objects, the building blocks of your future programs, and build your sites from them. You can also use such powerful OOP techniques as inheritance and polymorphism, without which building large projects is no longer possible.
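
As a rough illustration of this idea, here is a minimal Perl sketch. The class names User, Message, and GuestBook, their fields, and the announce helper are hypothetical; a real project would add storage, validation, and presentation on top of them.

```perl
use strict;
use warnings;

package User;
sub new {
    my ($class, %args) = @_;
    # Fields follow the "user" entity: name, nickname, e-mail.
    return bless { name => $args{name}, nick => $args{nick}, email => $args{email} }, $class;
}
sub nick  { $_[0]{nick} }
sub email { $_[0]{email} }

package Message;
sub new {
    my ($class, %args) = @_;
    # Fields follow the "message" entity: author, subject, body.
    return bless { author => $args{author}, subject => $args{subject}, body => $args{body} }, $class;
}
# One processing method shared by every script that handles messages.
sub as_text {
    my ($self) = @_;
    return sprintf("%s: %s\n%s\n", $self->{author}->nick, $self->{subject}, $self->{body});
}

package GuestBook;
# A guest book is simply a chain of Message objects.
sub new      { my ($class) = @_; return bless { messages => [] }, $class; }
sub add      { my ($self, $msg) = @_; push @{ $self->{messages} }, $msg; return $self; }
sub messages { @{ $_[0]{messages} } }

package main;

# Sending an announcement: iterate over users and pair each with the message.
# Returns a list of [address, text] pairs instead of actually sending mail.
sub announce {
    my ($message, @users) = @_;
    return map { [ $_->email, $message->as_text ] } @users;
}
```

A forum could reuse Message unchanged and add only a parent link, which is where inheritance would come in.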

Templating


I will cover this only briefly as well; an article in one of the coming issues of "Programmer" will probably be devoted to it. Let us return to the Flamingo system. How was the interface of this banner network organized? Did 400 kinds of statistics correspond to 400 pages? No. There was a single template-engine script, which was passed parameters: the number of the statistic and other data such as dates, restrictions, and so on.

Using the unique number of the statistic, the script read a description consisting of the name of a file with pseudo-HTML and the names of files with SQL queries. The description file looked like this:

2:data/html/2.htx, data/queries/info.sql
9:data/html/9.htx, data/queries/ban-list-one.sql, data/queries/get-banners-list.sql
12:data/html/12.htx, data/queries/ban-getinfo.sql
38:data/html/38.htx, data/queries/acc-hosts-hits.sql
44:data/html/44.htx, data/queries/acc-getsites-today.sql
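
Reading such a description file could be sketched in Perl as below. The subroutine name parse_descriptions and the shape of the returned structure are assumptions made for illustration; the original script is not shown in this article.

```perl
use strict;
use warnings;

# Parse description lines of the form
#   number:template, query1, query2, ...
# into a hash keyed by statistic number.
sub parse_descriptions {
    my (@lines) = @_;
    my %pages;
    for my $line (@lines) {
        # Skip blank or malformed lines.
        next unless $line =~ /^\s*(\d+)\s*:\s*(.+?)\s*$/;
        my ($number, $rest) = ($1, $2);
        # The first file is the template; the rest are SQL query files.
        my ($template, @queries) = split /\s*,\s*/, $rest;
        $pages{$number} = { template => $template, queries => \@queries };
    }
    return \%pages;
}
```

With the file above, entry 9 would hold the template data/html/9.htx and two query files, in the order in which they are to be executed.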



The general scheme is very simple: execute all the SQL queries, insert the results into the pseudo-HTML to produce a complete page, and send it to the user. For example, to display statistic number 2 (account information), the SQL query data/queries/info.sql had to be executed, its results inserted into data/html/2.htx, and the result displayed.

Here is how it worked in more detail. The first task is building the SQL query. The user's identifier and the other parameters passed to the script must be inserted into it. A typical SQL query (data/queries/info.sql):

select
    AccountName,
    OwnerName,
    OwnerEmail,
    MainSite,
    SiteName
from
    Accounts
where
    AccountId = <-AccountId->



When such a query was parsed, the parameter's value was substituted for the string <-ParameterName->. There were also special parameters, for example <-UserName->, the login name, and <-AccountId->, the account identifier computed from that name.
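
The substitution step might look like this in Perl. This is a sketch, not the original code: the name fill_query and the idea of passing a quoting callback are assumptions. Quoting the values is an addition that the article does not describe, but without it the technique would be open to SQL injection.

```perl
use strict;
use warnings;

# Replace every <-Name-> placeholder with the matching parameter value.
# Values go through the caller-supplied $quote callback (with DBI this
# could be sub { $dbh->quote($_[0]) }); unknown names are a fatal error.
sub fill_query {
    my ($sql, $params, $quote) = @_;
    $sql =~ s{<-(\w+)->}{
        exists $params->{$1}
            ? $quote->($params->{$1})
            : die "Unknown query parameter: $1\n"
    }ge;
    return $sql;
}
```

Failing loudly on an unknown placeholder is a design choice for the sketch: a query with a leftover <-Name-> would otherwise reach the database as invalid SQL.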

The result of running the resulting query was inserted into the HTML as follows. Each value received from the database was given a "name" that marked its place in the HTML template. The name had two parts: the first was the sequence number of the SQL query, and the second was the index of the value in the array of results.

Suppose the SQL query with sequence number 1 has been executed (in our example, data/queries/info.sql). The query returned an array of values. The value AccountName returned by the database had index 0 in this array, OwnerName had index 1, and so on. Accordingly, the place in the HTML template where OwnerName was to be inserted was marked <-1.1->.
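
Inserting the results into a template could be sketched as follows. The names fill_template and escape_html are hypothetical, and the HTML escaping is an added precaution that the article does not mention.

```perl
use strict;
use warnings;

# $results->[0] holds the row returned by query 1, $results->[1] the row
# of query 2, and so on; each row is an array reference of column values.
# A placeholder <-Q.I-> is replaced by value I (from 0) of query Q (from 1).
# Placeholders with no matching value are replaced by an empty string.
sub fill_template {
    my ($html, $results) = @_;
    $html =~ s{<-(\d+)\.(\d+)->}{
        my $value = $1 >= 1 ? $results->[$1 - 1][$2] : undef;
        defined $value ? escape_html($value) : ''
    }ge;
    return $html;
}

# Minimal HTML escaping so that database values cannot break the markup.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}
```

Because the placeholders carry only a query number and a column index, the template never needs to know column names, which is what lets new pages be added without touching the script.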


A fragment of the HTML template data/html/2.htx from our example:

<TABLE BORDER=0 WIDTH=460>
<TR>
<TD WIDTH="50%">
<FONT SIZE="-1">
First and last name of the contact person:
</FONT>
</TD> <TD>
<INPUT type="text" name="OwnerName" size=33 value="<-1.1->">
</TD>
</TR>

<TR>
<TD>
<FONT SIZE="-1">
E-mail address:
</FONT>
</TD> <TD>
<INPUT type="text" name="OwnerEmail" size=33 value="<-1.2->">
</TD>
</TR>



Despite its apparent complexity, the scheme has a number of advantages. With it we were able to build, in a short time, a system with more than 400 different kinds of statistics. Afterwards, adding a new statistic required only writing the SQL queries, drawing up an HTML template, and changing the configuration of the template-engine script. The new statistics page then appeared in the system automatically.

Conclusion


I would like to repeat once more: there are no solutions for all occasions. Each time, in each project, you will have to devise your own methods for optimizing speed and convenience. I hope that the techniques I have described will be useful to you.