Coding Problem 6 Now Defined in Scripting Languages/Bioinformatics WIKI

former_member181923
Active Participant
0 Kudos

In the "Scripting Languages and Bioinformatics" WIKI, I have completed a new page here:

https://wiki.sdn.sap.com/wiki/display/EmTech/Bio-InformaticBasicsPartII-AlignmentToolsforRelatingProteinGenestoProteinPrimaryStructures

and also defined Coding Problem 6 at the end of this page.

This problem involves scripting an interactive session with a web site and parsing what the site returns at various points during the session.

I am looking forward to a script that solves this problem, if anyone feels like spending the time to do it.

Accepted Solutions (0)

Answers (3)

former_member181923
Active Participant
0 Kudos

Anton - did you delete your php code from this thread ???

I can't find it anywhere.

former_member374
Active Contributor
0 Kudos

Hi Anton and David,

I really like this collaboration around a bioinformatics problem. Maybe one of these days I'll find a quiet minute and try out the solution.

Thanks to both of you for doing that, Mark.

former_member181923
Active Participant
0 Kudos

Mark -

Thanks much for the kind words.

As Anton has pointed out, what we'll eventually be doing is developing a framework for WDA->PHP calls that could be useful to traditional SAP customers in a variety of ways.

Best

djh

Former Member
0 Kudos

Hello.

What follows is a (fully functional) sketch of the solution, scripted in PHP. As I start to write this, I notice that there is a glitch in the script: it returns a DNA sequence, but apparently not the correct one.

Anyway, I will post it as it is, because the whole code is a nice example of how to take functionality designed for one user interface and tailor it to your needs. In the same way that we interact with this 'program', we could use a stock exchange website to automate some of our trades, or we could write a 'bot' to collect SDN points by analyzing questions, submitting them to other search engines and returning the results to the original poster.

To better understand how this works, I encourage the reader to download a network sniffer, e.g. [Wireshark|www.wireshark.org], and use it to analyze the network traffic during the regular procedure in the browser. Additionally, analyze the source code of the various screens of the website to find out what the site requires and what it returns. Comparing this with the rather trivial parts of the script where the data to be submitted is put together should make it easy to help me find the error, and to learn how to script against an arbitrary website. Apart from that, I will buy a beer at the next possible occasion for whoever finds the bug in the original script.

Okay, before we go into the details, I want to mention that this script can easily be reprogrammed in ABAP (on a more recent release where regexes are available) using the regex and http_client classes. The same of course works in Java. Maybe someone else feels encouraged to give it a try.

Have fun,

anton

to be continued ...

Former Member
0 Kudos

<?php
// sequence to search ---------------------------------------------
$bquery      = "MTQIIKIDPLNPEIDKIKIAADVIRNGGTVAFPTETVYGLGANAFDG";
$bquery     .= "NACLKIFQAKNRPVDNPLIVHIADFNQLFEVAKDIPDKVLEIAQIVW";
$bquery     .= "PGPLTFVLKKTERVPKEVTAGLDTVAVRMPAHPIALQLIRESGVPIA";
$bquery     .= "APSANLATRPSPTKAEDVIVDLNGRVDVIIDGGHTFFGVESTIINVT";
$bquery     .= "VEPPVLLRPGPFTIEELKKLFGEIVIPEFAQGKKEAEIALAPGMKYK";
$bquery     .= "HYAPNTRLLLVENRNIFKDVVSLLSKKYKVALLIPKELSKEFEGLQQ";
$bquery     .= "IILGSDENLYEVARNLFDSFRELDKLNVDLGIMIGFPERGIGFAIMN";
$bquery     .= "RARKASGFSIIKAISDVYKYVNI";

// url for form1 --------------------------------------------------
$url = "http://blast.ncbi.nlm.nih.gov:80/Blast.cgi";

// parameters for form1 -------------------------------------------
$bdb         = "protein";
$bgeneticcode= "1"; 
$bquery_from = "";
$bquery_to   = "";
$bjobtitle   = "ACW-" . time();
$bdatabase   = "nr";
$bblastprog  = "tblastn";
$bpagetype   = "BlastSearch";

$pf  = "";
$pf .= "QUERY=" . $bquery;
$pf .= "&db=" . $bdb;
$pf .= "&QUERY_FROM=" . $bquery_from;
$pf .= "&QUERY_TO=" . $bquery_to;
$pf .= "&JOB_TITLE=" . $bjobtitle;
$pf .= "&DATABASE=" . $bdatabase;
$pf .= "&BLAST_PROGRAMS=" . $bblastprog;
$pf .= "&PAGE_TYPE=" . $bpagetype;

$pf .= "&GENETIC_CODE=1";
$pf .= "&DBTYPE=";
$pf .= "&EQ_MENU=";
$pf .= "&EQ_TEXT=";
$pf .= "&NEWWIN=";
$pf .= "&MATCH_SCORES=";


$pf .= "&MAX_NUM_SEQ=100";
$pf .= "&EXPECT=10";
$pf .= "&WORD_SIZE=3";
$pf .= "&MATRIX_NAME=BLOSUM62";
$pf .= "&GAPCOSTS=11 1";

$pf .= "&COMPOSITION_BASED_STATISTICS=2";
$pf .= "&REPEATS=";
$pf .= "&FILTER=";
$pf .= "&LCASE_MASK=";
$pf .= "&TEMPLATE_LENGTH=";
$pf .= "&TEMPLATE_TYPE=";
$pf .= "&I_THRESH=0.005";

$pf .= "&CLIENT=web";
$pf .= "&SERVICE=plain";
$pf .= "&CMD=request";
$pf .= "&PAGE=Translations";
$pf .= "&PROGRAM=tblastn";
$pf .= "&MEGABLAST=";
$pf .= "&RUN_PSIBLAST=";

$pf .= "&SELECTED_PROG_TYPE=tblastn";
$pf .= "&SAVED_SEARCH=true";
$pf .= "&BLAST_SPEC=";
$pf .= "&QUERY_BELIEVE_DEFLINE=";
$pf .= "&DB_DIR_PREFIX=";

$pf .= "&SHOW_OVERVIEW=true";
$pf .= "&SHOW_LINKOUT=true";
$pf .= "&GET_SEQUENCE=true";
$pf .= "&FORMAT_OBJECT=Alignment";
$pf .= "&FORMAT_TYPE=HTML";
$pf .= "&ALIGNMENT_VIEW=Pairwise";
$pf .= "&MASK_CHAR=2";
$pf .= "&MASK_COLOR=1";
$pf .= "&DESCRIPTIONS=100";
$pf .= "&ALIGNMENTS=100";
$pf .= "&NEW_VIEW=true";
$pf .= "&OLD_BLAST=true";

$pf .= "&NCBI_GI=false";
$pf .= "&SHOW_CDS_FEATURE=false";
$pf .= "&NUM_OVERVIEW=100";

$pf .= "&FORMAT_EQ_TEXT=";
$pf .= "&FORMAT_ORGANISM=";
$pf .= "&EXPECT_LOW=";
$pf .= "&EXPECT_HIGH=true";
$pf .= "&QUERY_INDEX=0";

// define the HTTP connection -----------------------------
$ch = curl_init();
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_POST, 1);

curl_setopt($ch,CURLOPT_POSTFIELDS, $pf);
curl_setopt($ch,CURLOPT_COOKIEJAR,
dirname(__FILE__).'/cookie.txt');
curl_setopt($ch,CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch, CURLOPT_HEADER , 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);

// call it ------------------------------------------------
$my_result = curl_exec($ch);

// get the request ID; the query takes some time and this ID 
// allows to identify the original request 
$query = "/<tr><td>Request ID<\/td><td> <b>([0-9A-Z]*)<\/b><\/td><\/tr>/";
preg_match($query, $my_result, $request_id);

// query for the result page every 2 seconds...when the result
// is ready, we recognize it by its signature
$not_ready = true;
$result_url = "http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Get&VIEW_RESULTS=FromRes&RID=" . $request_id[1];
while($not_ready){
  $temp_page = null;
  curl_setopt($ch,CURLOPT_URL,$result_url);
  curl_setopt($ch, CURLOPT_POST, 0);
  $my_result = curl_exec($ch);
  $query = "/This page will be automatically updated in <b>(\d*)<\/b> seconds/";
  preg_match($query, $my_result, $temp_page);
  if(isset($temp_page[1])) {  sleep(2); }
  else { $not_ready = false; }
}

// parse the results page; split it into result fragments first --------
$pat = "/";
$pat .= "><input type=\"checkbox\" name=\"getSeqGi\"([^\n]*)\n";
$pat .= "Length=\d*\n\n";
$pat .= "[\w\s]*:\n";
$pat .= "(\s*<a[^>]*>[^<]*<\/a>\n)*\n";
$pat .= "([^\n]*\n){3}\n";
$pat .= "(([^\n]*\n){3}\n[^n])*";
$pat .= "/";
preg_match_all($pat, $my_result, $fragments);

// 'fine parse the result fragments and get some important parameters -----
foreach($fragments[0] as $fragment){
  $query = "/<a href=\"http:\/\/(www\.ncbi\.nlm\.nih\.gov\/entrez\/query\.fcgi\?";
  $query .= "[^>]*)>([^<]*)<\/a><a[^>]*>[^<]*<\/a>\s*<a[^>]*><img[^>]*><\/a>([^\n]*)/";
  preg_match($query, $fragment, $u);
  $query  = "/\s*Score\s*=\s*(\d*) bits\s*\(\d*\),\s*Expect\s*=\s*([^,]*),";
  $query .= "[^\n]*\n\s*Identities\s*=\s*\d*\/\d*\s*\((\d*%)\)[^\n]*\n\s*Frame\s*=\s*([^\n]*)\n/";
  preg_match($query, $fragment, $u1);
  
  preg_match_all("/Sbjct\s*(\d*)\s*[A-Z\-]*\s*(\d*)\n/", $fragment, $positions);
  
  $fr["url"]        = $u[1];
  $fr["shname"]     = $u[2];
  $fr["name"]       = $u[3];
  $fr["score"]      = $u1[1];
  $fr["expect"]     = $u1[2];
  $fr["identities"] = $u1[3];
  $fr["frame"]      = $u1[4];

// find out if we have to reverse the strand direction or not; adjust
// lower and upper limit
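  // index of the last Sbjct line; count($positions) would always be 3
  // (one entry per capture group), not the number of matched lines
  $last = count($positions[2]) - 1;
  if ($fr["frame"] >= 0) {
    $fr["high"] = $positions[2][$last]; $fr["low"] = $positions[1][0]; $fr["reverse"] = ""; }
  else {
    $fr["high"] = $positions[1][0]; $fr["low"] = $positions[2][$last]; $fr["reverse"] = true; }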
    
  $frx[] = $fr; 

}

// we call the final details form for the top ranked result
// here we could loop over several results
curl_setopt($ch,CURLOPT_URL,$frx[0]["url"]);
curl_setopt($ch, CURLOPT_POST, 0);
$my_result = curl_exec($ch);

// parse the final details form to get the necessary parameters to execute
// the final details form
preg_match("/<input name=\"WebEnv\" type=\"hidden\" value=\"([^\"]*)\"/",
           $my_result, $u5);
preg_match("/<input name=\"query_key\" type=\"hidden\" value=\"([^\"]*)\"/", 
           $my_result, $u6);
preg_match("/<input name=\"db\" type=\"hidden\" value=\"([^\"]*)\"/", 
           $my_result, $u7);
preg_match("/<input name=\"qty\" type=\"hidden\" value=\"([^\"]*)\"/", 
           $my_result, $u8);
preg_match("/<input name=\"c_start\" type=\"hidden\" value=\"([^\"]*)\"/", 
           $my_result, $u9);

preg_match(
  "/<select name=\"dopt\".*<option value=\"([^\"]*)\" selected=\"1\">/",
  $my_result, $u10);
preg_match(
  "/<select name=\"dispmax\".*<option value=\"([^\"]*)\" selected=\"1\">/",
  $my_result, $u11);
preg_match(
  "/<select name=\"sendto\".*<option value=\"([^\"]*)\" selected=\"1\">/", 
  $my_result, $u12);
preg_match(
  "/<input type=\"hidden\" id=\"([^\"]*)\" name=\"fmt_mask\".*value=\"([^\"]*)\"/",
  $my_result, $u13);
preg_match(
  "/<input type=\"checkbox\" id=\"truncate\" name=\"truncate\" value=\"([^\"]*)\"\/>/", 
  $my_result, $u14);

// fill in the necessary parameters parsed out now and earlier
$of  = "WebEnv=" . $u5[1];
$of .= "&query_key=" . $u6[1];
$of .= "&db=" . $u7[1];
$of .= "&qty=" . $u8[1];
$of .= "&c_start=" . $u9[1];
$of .= "&dopt=" . $u10[1];
$of .= "&dispmax=" . $u11[1];
$of .= "&sendto=" . $u12[1];
$of .= "&fmt_mask=" . $u13[1];
$of .= "&truncate=" . $u14[1];

$of .= "&less_feat=";

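// use the top-ranked fragment, matching the URL called above;
// after the loop, $fr holds the last fragment, not the first
$of .= "&from=" . $frx[0]["low"];
$of .= "&to=" . $frx[0]["high"];
$of .= "&strand=" . $frx[0]["reverse"];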

$of .= "&extrafeatpresent=1";
$of .= "&ef_SNP=1";
$of .= "&ef_CDD=8";
$of .= "&ef_MGC=16";
$of .= "&ef_HPRD=32";
$of .= "&ef_STS=64";
$of .= "&ef_tRNA=128";
$of .= "&ef_microRNA=256";
$of .= "&ef_Exon=512";
$of .= "&submit=Refresh";


$ourl = "http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?" . $of;

// execute the final details form
curl_setopt($ch,CURLOPT_URL,$ourl);
curl_setopt($ch, CURLOPT_POST, 0);
curl_setopt($ch,CURLOPT_POSTFIELDS, $of);
curl_setopt($ch,CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch, CURLOPT_HEADER , 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);

$my_result = curl_exec($ch);

// parse the result page and get the final ORIGIN snippet
preg_match("/ORIGIN\s*\n\s*<a name=\"[^\"]*\"><\/a>\s*([^\/]*)\//", $my_result, $u99);

// output the final result ----------------------------------
echo "\n" . preg_replace("/\s{2,}/", "\n", preg_replace("/\d+\s/", "", $u99[1]));
?>
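
One detail worth checking while hunting for the bug: the script above concatenates the form fields by hand, so values such as GAPCOSTS=11 1 are sent without URL encoding. Below is a minimal sketch of building the body with PHP's built-in http_build_query(), which encodes every value. The helper name build_blast_form is hypothetical, and only a handful of the fields from the script are shown.

```php
<?php
// Sketch only: build a form-encoded POST body from an array.
// build_blast_form() is a hypothetical helper, not part of the script above.
function build_blast_form(string $query, string $program, string $database): string
{
    $fields = [
        'QUERY'       => $query,
        'PROGRAM'     => $program,
        'DATABASE'    => $database,
        'GAPCOSTS'    => '11 1',      // the space is encoded as "+"
        'MATRIX_NAME' => 'BLOSUM62',
        'CMD'         => 'request',
    ];
    // http_build_query() URL-encodes each key and value and joins them with "&".
    return http_build_query($fields);
}

echo build_blast_form('MTQIIKIDPLNPEIDKIKIAADVIRNGG', 'tblastn', 'nr'), "\n";
```

The returned string can be handed to CURLOPT_POSTFIELDS in the same way as $pf above.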

Edited by: Anton Wenzelhuemer on Aug 6, 2008 8:08 PM (Re-Formatted)

Former Member
0 Kudos

just in case someone thinks 'I wish I could run this code and see if it really works...', nothing easier than that. Just go to the PHP Website's [Download Section|http://www.php.net/downloads.php], download the version of your choice and install it according to the instructions given.

save the coding of the post above to some file file_name and run the whole thing with

php.exe file_name

one unusual PHP configuration setting might be required here, max_execution_time = 600 or something similar, since the server needs quite some time to return an answer to the first query...

former_member181923
Active Participant
0 Kudos

Anton -

Wow! That may have been easy for you to do, but certainly not for me.

I have responses and questions at several levels.

First response (at the "political" level)

After I read your two posts, I immediately emailed Mark Finnern, Mark Yolton, and Amir Blich to alert them that a question may arise regarding SAP "intellectual property" rights - whether other organizations can freely use code that you develop here. The reason this question may arise is that the overall project that I have in mind (to "automate" production of cases at the StrucClues web site) may wind up getting some significant funding in the not too distant future. And if that happens, it would be a lot easier for that project just to steal your code here and any other code that you write between now and, say, December 2009 (assuming you continue to have the time and interest).

Second response (technical question):

The code below assumes that you have already gotten the "FASTA" primary structure sequence from the RCSB PDB web-site.

Could you put a tiny little front-end on it so it works like this:

1) accept a PDB identifier from the user, e.g. "2eqa"

2) at the PDB web-site:

http://www.rcsb.org/pdb/home/home.do

submit this ID and, when the site responds, tell the site to provide the "FASTA Sequence" (at the site itself, this option is on the left-hand side, right under "Download Files");

3) pass the sequence you get in step 2 to your existing code.

Third question: (also technical):

In the simple example I gave, the whole gene for the protein had just one exon, and the exon is reported relative to "Frame -2" of the DNA.

But remember I said at the beginning of the WIKI that a protein can have more than one exon?

Well in that case, here's what happens when you do the tblastn piece.

As you do now, you take the protein at the top of the list with ID "X" and scan the results lower-down for the section that belongs to ID "X".

But now, if the gene of the protein has more than one exon, you will get three result subsections within the result section belonging to ID "X":


Frame 0:  
                  20 
                                50
   
Frame -2: 
                 80
                                60

Frame +1:
                 90
                               120

So in this case you have to be able to make your code recognize when there is more than one result section, and recursively submit each one to the final part of the program, and then concatenate all three returns together to get the final result.

Is there any technical problem with doing this, or is it just a minor recursive nuisance involving an elaboration of the routine you already have?
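
A minimal sketch of the idea, assuming the text of one hit has already been isolated and that it follows the plain-text layout the script parses (a "Score =" line opens each subsection, followed by a "Frame =" line and "Sbjct" lines carrying start and end positions). The function name hsp_ranges and the array keys are hypothetical.

```php
<?php
// Sketch only: collect one coordinate range per alignment subsection
// inside the text of a single hit. The input layout is an assumption
// modelled on the plain-text tblastn report parsed by the script.
function hsp_ranges(string $hit_text): array
{
    $ranges = [];
    // Every subsection starts with a "Score =" line.
    $blocks = preg_split('/^\s*Score\s*=/m', $hit_text);
    array_shift($blocks);                      // drop the text before the first one
    foreach ($blocks as $block) {
        if (!preg_match('/Frame\s*=\s*([+-]?\d)/', $block, $f)) {
            continue;                          // no frame line, skip
        }
        if (!preg_match_all('/^Sbjct\s+(\d+)\s+\S+\s+(\d+)\s*$/m', $block, $p)) {
            continue;                          // no subject lines, skip
        }
        // Pool all start and end positions, then take the extremes.
        $coords = array_map('intval', array_merge($p[1], $p[2]));
        $ranges[] = [
            'frame'   => (int) $f[1],
            'low'     => min($coords),
            'high'    => max($coords),
            'reverse' => ((int) $f[1]) < 0,
        ];
    }
    return $ranges;
}
```

Each returned range could then be sent through the final retrieval step in turn, so a plain loop is enough and no real recursion is needed. How the returned pieces should be ordered before concatenation is left open here.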

Third response (technical question):

Please disregard this question - we cross-posted - thanks for the execution instructions !!!!!

I wasn't sure from your post if your code is working now, or if it will only work for someone who has found the bug and fixed it?

If it's not working as is, could you post back here what you actually get at the point where you were expecting a sequence?

Or would that be cheating ?????

Anyway, that's my last question for the moment.

I'm very grateful to you for doing what you've done.

If you continue to have interest and time, what you've already done will make it a lot easier to solve the next "real-life" problem, because it also involves identifying and manipulating multiple subsections within the results section provided by tblastn.

Thanks again - really.

Best

djh

Edited by: David Halitsky on Aug 6, 2008 9:23 PM

Former Member
0 Kudos

ad 2nd reply:

this is the code to get the FASTA sequence for the PDB identifier (well, without a UI)


<?php
$pdb_identifier = "2EQA";

$url = "http://www.rcsb.org/pdb/download/downloadFile.do?fileFormat=FASTA&compression=NO&structureId=" . $pdb_identifier;

// define the HTTP connection -----------------------------
$ch = curl_init();
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_POST, 0);

curl_setopt($ch,CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch, CURLOPT_HEADER , 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);

$my_result = curl_exec($ch);

preg_match("/.*SEQUENCE\n([A-Z\n]*)/", $my_result, $bquery);

echo $bquery[1];
?>
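
The regex above depends on the header line ending in SEQUENCE. A slightly more tolerant sketch, assuming only standard FASTA (each record starts with a ">" header line and the sequence may span several lines), is shown below. The function name parse_fasta is hypothetical. Lines before the first ">" line, such as HTTP headers, are skipped.

```php
<?php
// Sketch only: parse FASTA text into [header => sequence] pairs.
// parse_fasta() is a hypothetical helper; it assumes standard FASTA,
// where each record starts with a ">" header line.
function parse_fasta(string $text): array
{
    $records = [];
    $header  = null;
    foreach (preg_split('/\r\n|\r|\n/', $text) as $line) {
        $line = trim($line);
        if ($line === '') {
            continue;                       // ignore blank lines
        }
        if ($line[0] === '>') {
            $header = substr($line, 1);     // start a new record
            $records[$header] = '';
        } elseif ($header !== null) {
            $records[$header] .= $line;     // join wrapped sequence lines
        }
    }
    return $records;
}
```

A file with several chains yields one entry per header. Two records with identical headers would overwrite each other in this sketch.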

ad 3rd response:

of course this is possible. I don't see any logical objection.

cheers,

anton

former_member181923
Active Participant
0 Kudos

Anton -

Thanks very much for the "front-end" code. It's really important for the following reason.

Kim Henrick of the EBI (European Bioinformatics Institute) has recently sent me a list of the identifiers for the 2243 proteins of "unknown function" currently in the PDB.

Can your front-end be elaborated to read this list and return 2243 FASTA sequences in one file?

If so, the first few lines of Kim's list look like this:


HEADER STRUCTURAL GENOMICS, UNKNOWN FUNCTION
HEADER UNKNOWN FUNCTION 


1dcj  1di6  1di7  1dm5  1ehx  1ew4  1f89  1fl9  1fux  1g04
1g2r  1gh9  1h2h  1hqq  1hru  1htw  1hxl  1hxz  1hy2  1i17
1i36  1i60  1i6n  1i9h  1ihn  1iio  1ij8  1ilv  1in0  1iuj

So once your "front-end" code has been executed repeatedly over this list to generate a file with 2243 FASTA sequences, I would for now manually pass each of the 2243 sequences to your final chunk of code. But ideally, the "front-end" code that produces the 2243 sequences from Kim's list should feed each sequence to the final piece of code that does the tblastn and gets the "gene" from the results.

Finally, I will try to find an example of the way the tblastn results section looks when the protein has more than one exon and the results section therefore has more than one subsection (like I just showed you above).

Again, I cannot find a way to properly express my gratitude at the progress we are making here ....

Best regards

djh

Edited by: David Halitsky on Aug 6, 2008 10:20 PM

Former Member
0 Kudos

here we go:

1) reading and parsing the file


$in_file = file_get_contents("pdbs.txt");
preg_match_all("/([0-9a-z]{4})/", $in_file, $pdbs);
foreach($pdbs[0] as $pdb){
  echo $pdb . "\n";
}

2) combined with the FASTA query shown earlier this yields


<?php

$in_file = file_get_contents("pdbs.txt");
preg_match_all("/([0-9a-z]{4})/", $in_file, $pdbs);

$ch = curl_init();
curl_setopt($ch, CURLOPT_POST, 0);
  
curl_setopt($ch,CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch, CURLOPT_HEADER , 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
  
foreach($pdbs[0] as $pdb){
  $url = "http://www.rcsb.org/pdb/download/downloadFile.do?";
  $url.= "fileFormat=FASTA&compression=NO&structureId=" . $pdb;
  curl_setopt($ch,CURLOPT_URL,$url);
  $my_result = curl_exec($ch);
  
  preg_match("/.*SEQUENCE\n([A-Z\n]*)/", $my_result, $bquery);
  
  echo $bquery[1] . "\n-----------------------------------------\n";
}

?>

Executing this on the given file fragment takes 2 min 07 s and yields:



MTDLFSSPDHTLDALGLRCPEPVMMVRKTVRNMQPGETLLIIADDPATTRDIPGFCTFMEHELVAKETDGLPYRYLIRKG
G

-----------------------------------------
MATLRIGLVSISDRASSGVYQDKGIPALEEWLTSALTTPFELETRLIPDEQAIIEQTLCELVDEMSCHLVLTTGGTGPAR
RDVTPDATLAVADREMPGFGEQMRQISLHFVPTAILSRQVGVIRKQALILNLPGQPKSIKETLEGVKDAEGNVVVHGIFA
SVPYCIQLLEGPYVETAPEVVAAFRPKSARRDVSE

-----------------------------------------
MATLRIGLVSISDRASSGVYQDKGIPALEEWLTSALTTPFELETRLIPDEQAIIEQTLCELVDEMSCHLVLTTGGTGPAR
RDVTPDATLAVADREMPGFGEQMRQISLHFVPTAILSRQVGVIRKQALILNLPGQPKSIKETLEGVKDAEGNVVVHGIFA
SVPYCIQLLEGPYVETAPEVVAAFRPKSARRDVSE

-----------------------------------------
VVQGTVKPHASFNSREDAETLRKAMKGIGTDEKSITHILATRSNAQRQQIKTDYTTLFGKHLEDELKSELSGNYEAAALA
LLRKPDEFLAEQLHAAMKGLGTDKNALIDILCTQSNAQIHAIKAAFKLLYKEDLEKEIISETSGNFQRLLVSMLQGGRKE
DEPVNAAHAAEDAAAIYQAGEGQIGTDESRFNAVLATRSYPQLHQIFHEYSKISNKTILQAIENEFSGDIKNGLLAIVKS
VENRFAYFAERLHHAMKGLGTSDKTLIRILVSRSEIDLANIKETFQAMYGKSLYEFIADDCSGDYKDLLLQITGH

-----------------------------------------
MQDPTINPTSISAKAGSFADTKITLTPNGNTFNGISELQSSQYTKGTNEVTLLASYLNTLPENTTKTLTFDFGVGTKNPK
LTITVLPKDIPGLE

-----------------------------------------
MNDSEFHRLADQLWLTIEERLDDWDGDSDIDCEINGGVLTITFENGSKIIINRQEPLHQVWLATKQGGYHFDLKGDEWIC
DRSGETFWDLLEQAATQQAGETVSFR

-----------------------------------------
MSASKILSQKIKVALVQLSGSSPDKMANLQRAATFIERAMKEQPDTKLVVLPECFNSPYSTDQFRKYSEVINPKEPSTSV
QFLSNLANKFKIILVGGTIPELDPKTDKIYNTSIIFNEDGKLIDKHRKVHLFDVDIPNGISFHESETLSPGEKSTTIDTK
YGKFGVGICYDMRFPELAMLSARKGAFAMIYPSAFNTVTGPLHWHLLARSRAVDNQVYVMLCSPARNLQSSYHAYGHSIV
VDPRGKIVAEAGEGEEIIYAELDPEVIESFRQAVPLTKQRRFDVYSDVNAH

-----------------------------------------
GSHMESLTQYIPDEFSMLRFGKKFAEILLKLHTEKAIMVYLNGDLGAGKTTLTRGMLQGIGHQGNVKSPTYTLVEEYNIA
GKMIYHFDLYRLADPEELEFMGIRDYFNTDSICLIEWSEKGQGILPEADILVNIDYYDDARNIELIAQTNLGKNIISAFS
N

-----------------------------------------
AEFQVTSNEIKTGEQLTTSHVFSGFGCEGGNTSPSLTWSGVPEGTKSFAVTVYDPDAPTGSGWWHWTVVNIPATVTYLPV
DAGRRDGTKLPTGAVQGRNDFGYAGFGGACPPKGDKPHHYQFKVWALKTEKIPVDSNSSGALVGYMLNANKIATAEITPV
YEIKLE

-----------------------------------------
GNDYEDRYYRENMYRYPNQVYYRPVC

-----------------------------------------
GSHMKTRKIPLRKSVVSNEVIDKRDLLRIVKNKEGQVFIDPTGKANGRGAYIKLDNAEALEAKKKKVFNRSFSMEVEESF
YDELIAYVDHKVKRRELGLE

-----------------------------------------
MYIIFRCDCGRALYSREGAKTRKCVCGRTVNVKDRRIFGRADDFEEASELVRKLQEEKYGSCHFTNPSKRE

-----------------------------------------
MTVLIIGMGNIGKKLVELGNFEKIYAYDRISKDIPGVVRLDEFQVPSDVSTVVECASPEAVKEYSLQILKNPVNYIIIST
SAFADEVFRERFFSELKNSPARVFFPSGAIGGLDVLSSIKDFVKNVRIETIKPPKSLGLDLKGKTVVFEGSVEEASKLFP
RNINVASTIGLIVGFEKVKVTIVADPAMDHNIHIVRISSAIGNYEFKIENIPSPENPKTSMLTVYSILRTLRNLESKIIF
G

-----------------------------------------
RCCHPQCGAVEECR

-----------------------------------------
MNNNLQRDAIAAAIDVLNEERVIAYPTEAVFGVGCDPDSETAVMRLLELKQRPVDKGLILIAANYEQLKPYIDDTMLTDV
QRETIFSRWPGPVTFVFPAPATTPRWLTGRFDSLAVRVTDHPLVVALCQAYGKPLVSTSANLSGLPPCRTVDEVRAQFGA
AFPVVPGETGGRLNPSEIRDALTGELFR

-----------------------------------------
MESLTQYIPDEFSMLRFGKKFAEILLKLHTEKAIMVYLNGDLGAGKTTLTRGMLQGIGHQGNVKSPTYTLVEEYNIAGKM
IYHFDLYRLADPEELEFMGIRDYFNTDSICLIEWSEKGQGILPEADILVNIDYYDDARNIELIAQTNLGKNIISAFSN

-----------------------------------------
RCCHPQCGMAEECR

-----------------------------------------
RCCHPQCGMVEECR

-----------------------------------------
CCHPQCGAAYSC

-----------------------------------------
RVAENRPGAFIKQGRKLDIDFGAEGNRYYAANYWQFPDGIYYEGCSEANVTKEMLVTSCVNATQAANQAEFSREKQDSKL
HQRVLWRLIKEICSAKHCDFWLERGAA

-----------------------------------------
LRVGFIGFGEVAQTLASRLRSRGVEVVTSLEGRSPSTIERARTVGVTETSEEDVYSCPVVISAVTPGVALGAARRAGRHV
RGIYVDINNISPETVRMASSLIEKGGFVDAAIMGSVRRKGADIRIIASGRDAEEFMKLNRYGLNIEVRGREPGDASAIKM
LRSSYTKGVSALLWETLTAAHRLGLEEDVLEMLEYTEGNDFRESAISRLKSSCIHARRRYEEMKEVQDMLAEVIDPVMPT
CIIRIFDKLKDVKVSADARLQGCA

-----------------------------------------
MKLCFNEATTLENSNLKLDLELCEKHGYDYIEIRTMDKLPEYLKDHSLDDLAEYFQTHHIKPLALNALVFFNNRDEKGHN
EIITEFKGMMETCKTLGVKYVVAVPLVTEQKIVKEEIKKSSVDVLTELSDIAEPYGVKIALEFVGHPQCTVNTFEQAYEI
VNTVNRDNVGLVLDSFHFHAMGSNIESLKQADGKKIFIYHIDDTEDFPIGFLTDEDRVWPGQGAIDLDAHLSALKEIGFS
DVVSVELFRPEYYKLTAEEAIQTAKKTTVDVVSKYFSM

-----------------------------------------
MKLCFNEATTLENSNLKLDLELCEKHGYDYIEIRTMDKLPEYLKDHSLDDLAEYFQTHHIKPLALNALVFFNNRDEKGHN
EIITEFKGMMETCKTLGVKYVVAVPLVTEQKIVKEEIKKSSVDVLTELSDIAEPYGVKIALEFVGHPQCTVNTFEQAYEI
VNTVNRDNVGLVLDSFHFHAMGSNIESLKQADGKKIFIYHIDDTEDFPIGFLTDEDRVWPGQGAIDLDAHLSALKEIGFS
DVVSVELFRPEYYKLTAEEAIQTAKKTTVDVVSKYFSM

-----------------------------------------
DPSKDSKAQVSAAEAGITGTWYNQLGSTFIVTAGADGALTGTYESAVGNAESRYVLTGRYDSAPATDGSGTALGWTVAWK
NNYRNAHSATTWSGQYVGGAEARINTQWLLTSGTTEANAWKSTLVGHDTFTKVKPSAAS

-----------------------------------------
SHMFSDCRFGSVTYRGREYRSDIVVHVDGSVTPRRKEISRRKYGTSHVMAEEELEELLEEKPESIIIGSGVHGALETGFR
SDATVLPTCEAIKRYNEERSAGRRVAAIIHVTC

-----------------------------------------
GSHMKMGVKEDIRGQIIGALAGADFPINSPEELMAALPNGPDTTCKSGDVELKASDAGQVLTADDFPFKSAEEVADTIVN
KAGL

-----------------------------------------
ARKCSLTGKWTNDLGSNMTIGAVNSRGEFTGTYTTAVTATSNEIKESPLHGTENTINKRTQPTFGFTVNWKFSESTTVFT
GQCFIDRNGKEVLKTMWLLRSSVNDIGDDWKATRVGINIFTRLRTQKE

-----------------------------------------
MRILVTNDDGIQSKGIIVLAELLSEEHEVFVVAPDKERSATGHSITIHVPLWMKKVFISERVVAYSTTGTPADCVKLAYN
VVMDKRVDLIVSGVNRGPNMGMDILHSGTVSGAMEGAMMNIPSIAISSANYESPDFEGAARFLIDFLKEFDFSLLDPFTM
LNINVPAGEIKGWRFTRQSRRRWNDYFEERVSPFGEKYYWMMGEVIEDDDRDDVDYKAVREGYVSITPIHPFLTNEQCLK
KLREVYD

-----------------------------------------
MPSFDIVSEITLHEVRNAVENANRVLSTRYDFRGVEAVIELNEKNETIKITTESDFQLEQLIEILIGSCIKRGIEHSSLD
IPAESEHHGKLYSKEIKLKQGIETEMAKKITKLVKDSKIKVQTQIQGEQVRVTGKSRDDLQAVIQLVKSAELGQPFQFNN
FRD

-----------------------------------------
MFVTMNRIPVRPEYAEQFEEAFRQRARLVDRMPGFIRNLVLRPKNPGDPYVVMTLWESEEAFRAWTESPAFKEGHARSGT
LPKEAFLGPNRLEAFEVVLDSEGRDG

-----------------------------------------

note: there might be one problem with this powerful script: RCSB might not be amused at being queried a few thousand times by one consumer within a short amount of time. one would have to either 'slow down' the script a little or ask the site owners for permission
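
a minimal sketch of the 'slow down' variant, under a few assumptions: the function names extract_pdb_ids and fetch_all are hypothetical, the 2-second default pause is an arbitrary choice rather than anything RCSB has stated, and the identifier pattern (four characters, the first one a digit) is taken from the sample list in this thread. the download itself is passed in as a callable, so the loop can be tried without touching the network.

```php
<?php
// Sketch only: extract identifiers from the list file. The pattern (four
// characters, first one a digit) is an assumption based on the sample list.
function extract_pdb_ids(string $text): array
{
    preg_match_all('/\b[0-9][0-9a-z]{3}\b/', $text, $m);
    return array_values(array_unique($m[0]));   // drop duplicates
}

// Sketch only: fetch sequences one at a time with a pause in between.
// $fetch is any callable that returns the sequence for one identifier.
function fetch_all(array $ids, callable $fetch, int $pause_seconds = 2): array
{
    $out  = [];
    $last = count($ids) - 1;
    foreach (array_values($ids) as $i => $id) {
        // keep the identifier next to its sequence
        $out[] = ['id' => $id, 'sequence' => $fetch($id)];
        if ($i < $last && $pause_seconds > 0) {
            sleep($pause_seconds);               // be polite to the remote site
        }
    }
    return $out;
}
```

in the script above, the callable would wrap the curl_exec() and preg_match() calls for a single identifier.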

regards,

anton

former_member181923
Active Participant
0 Kudos

Anton -

I'm getting very tired of saying "wowwwww! thank you". So I'll just type "WTY" from now on (not to be confused with "WTF", unless I have no idea what your code is doing).

Yes - I'm aware of the scripting "courtesy" issue - I would never submit more than 2 or so at a time in real life.

Actually the code you've posted today will be used in 2 different ways in the overall WDA app.

In some cases, I want the WDA application to accept one PDB identifier from the user and pass it to the code you've just written.

In the other case, I want the WDA application to accept a file name and pass it to the code you've just written.

WTY (again)

djh

former_member181923
Active Participant
0 Kudos

Oh I forgot - in the code above that generates the FASTA sequences from a given input file, could you add the PDB identifier above each FASTA sequence, like:

1xyz

ABCDEFGH ....

2abc

QWERTYUIOIOP

etc.

It would make it easier ...

Former Member
0 Kudos

well,

not much attendance here, so i got to start the transfer of the scripts to ABAP myself.

so, without much ado here's a sketch of step one, the 'FASTA query'. no big deal to write a little WD proggy around to ask for the input PDB identifier and display the result or even read a file with several identifiers, loop over them and display the resulting sequences.


REPORT  ZTW_FASTA.

* data declarations

data: client         type ref to if_http_client,
      lt_fields      type tihttpnvp,
      errortext      type string, "used for error handling
      url            type string,
      protocol       type string value 'HTTP/1.0',
      subrc          type sysubrc,
      content        type string,
      target         type string,
      delimiter(1)   value '#',
      fasta_sequence type table of string,
      fasta_line     type string,

      pdb_ident      type string value '2eqa'.

url = 'http://www.rcsb.org/'.
concatenate url 'pdb/download/downloadFile.do?' into url.
concatenate url 'fileFormat=FASTA&compression=NO&structureId=' into url.
concatenate url pdb_ident into url.

delimiter = cl_abap_char_utilities=>newline.

call method cl_http_client=>create_by_url
  EXPORTING
    url = url
*    host               = host
*    service            = service
*    proxy_host         = proxy_host
*    proxy_service      = proxy_service
*    scheme             = 'HTTP'
  IMPORTING
    client             = client
  EXCEPTIONS
    argument_not_found = 1
    internal_error     = 2
    plugin_not_active  = 3
    others             = 4.

if sy-subrc <> 0.
  write: / 'Create failed, subrc = ', sy-subrc.
  exit.
endif.

* set http method GET
call method client->request->set_method(
  if_http_request=>co_request_method_get ).

* set protocol version
if protocol = 'HTTP/1.0'.
  client->request->set_version(
       if_http_request=>co_protocol_version_1_0 ).
else.
  client->request->set_version(
        if_http_request=>co_protocol_version_1_1 ).
endif.

call method client->send
  EXPORTING
    timeout                    = 90
  EXCEPTIONS
    http_communication_failure = 1
    http_invalid_state         = 2
    http_processing_failed     = 3
    others                     = 4.

if sy-subrc <> 0.

  call method client->get_last_error
    IMPORTING
      code    = subrc
      message = errortext.

  write: / 'communication_error( send )',
         / 'code: ', subrc, 'message: ', errortext.
  exit.
endif.

* receive
call method client->receive
  EXCEPTIONS
    http_communication_failure = 1
    http_invalid_state         = 2
    http_processing_failed     = 3
    others                     = 4.

if sy-subrc <> 0.

  call method client->get_last_error
    IMPORTING
      code    = subrc
      message = errortext.

  write: / 'communication_error( receive )',
         / 'code: ', subrc, 'message: ', errortext.
  exit.
endif.

*get data
content = client->response->get_cdata( ).

*parse response & put lines into a table
split content at delimiter into table fasta_sequence.
*delete the header data
delete fasta_sequence index 1.

*close the connection
call method client->close
  EXCEPTIONS
    http_invalid_state = 1
    others             = 2.

loop at fasta_sequence into fasta_line.
  write: / fasta_line.
endloop.

former_member181923
Active Participant
0 Kudos

Anton -

There is an expression in English "My cup runneth over" - I don't know the German or Dutch translation - it means "I am very grateful".

The reason I haven't posted new questions in the WIKI is because I am very busy completing a formal paper with my colleagues Fresco and Lesk - this paper is critical to the possibility of being considered for US DOE (Dept of Energy) funding next year. (In the draft of the paper, I am referencing this thread, by the way.)

However, as soon as I can, I will find you an example where tblastn returns several result segments corresponding to different DNA exons (see my question above.)

Once I give you this, maybe you can tune the tblastn part of the code so that it is prepared for such "multiple" returns and knows how to deal with them.

One final thing - in the example ABAP you just gave - does this print out the PDB identifier on a line right above the FASTA sequence? (I haven't looked carefully.) If not, could you add this to the ABAP version and also to the "on-line" PHP version you posted earlier? This is very important. Even if we do only 5 of the 2243 entries at a time, we'll need to keep track of which FASTA goes with which identifier (for entry into a transparent table.)

Thank you very very much again.

djh

Former Member
0 Kudos

DISCLAIMER:

the code presented here is meant to be a proof of concept or draft to show that the problems can be solved in one or another scripting language or in ABAP. the code fragments are not to be used in any kind of productive context because they obviously miss some proper error handling for example. moreover I am assuming some static responses to allow parsing them. actually, we are querying public sites and we do not have any kind of SLA with them nor any other arrangement. This means, they can always substantially change the structure of their responses without any notice making the shown parsers break. In fact, there is no need for the involved 'providers' to keep their interfaces constant since they are not meant to be evaluated automatically.

to state it short (and bluntly): I do not advise anyone who isn't for example able to add to the coding a little tag to identify certain parts of the result to use this code in any productive context.

apart from that, have fun with it

anton

former_member181923
Active Participant
0 Kudos

heh heh heh.

Point taken - I'll add the tag myself !!!!

And I will try to find the multi-section return from tblastn as soon as possible.

Otherwise, I'm afraid you'll tell me I have to write that loop also! (Just kidding, just kidding!)

Best

djh