Precision Genomics

Please excuse this rough draft. It's a work in progress.

Only Biocomputational (Biocomp) searches the entire 3 GB Human Genome Project (HGP) with 100% accuracy. The NIH claims "Exhaustive searching of all the data is currently considered ridiculously time consuming, if not impossible." Biocomp exhaustively examines every nucleotide and/or amino acid 11 times. Only Biocomputational reports every exact and partial match without misses or false positives [within the 600 match limit]. Only Biocomputational has the computational power to search the entire genome for amino acids in all 3 reading frames (frame shifts), reporting both amino acid and nucleotide matches [Perfect for biomarker discovery]. Other alignment systems search only about 11% of the genome, the part approved through science by committee. Only Biocomp allows you to search for the amino acid terminator (*) codon sets TAA, TAG, TGA. We will never substitute the magical "Optimal Alignment" scoring for a perfect mathematical alignment match. RepeatMasker handles repeats badly, masking up to 50% of the genome.
Biocomp can search any genome, from mice to rice, based on user demand.

Biocomputational is now accepting query searches of 25 to 56 nucleotides through resellers, who can also provide result verification and enhancement. Result verification and Enhancement Assistants (REA) work for a negotiated fee.
Searches in which nucleotides are translated to amino acids in the 3 reading frames, reporting both amino acid matches and nucleotide matches [Perfect for Biomarker Discovery], will require a negotiated fee.
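
The 3 frame translation described above can be sketched in a few lines. This is a minimal illustration, not Biocomp's program: it assumes the standard genetic code, forward strand only, and the function name translate_three_frames and the "X" placeholder for unrecognized codons are hypothetical. Python is used only for brevity. The terminator codons TAA, TAG, and TGA are reported as (*).

```python
from itertools import product

# Standard genetic code, listed in T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"   # codons starting with T
    "LLLLPPPPHHQQRRRR"   # codons starting with C
    "IIIMTTTTNNKKSSRR"   # codons starting with A
    "VVVVAAAADDEEGGGG"   # codons starting with G
)
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_three_frames(nucleotides):
    """Translate a nucleotide string in reading frames 0, 1, and 2.

    Returns a list of three amino-acid strings. Terminator codons appear
    as '*'. Codons with unexpected letters (an assumption of this sketch)
    are shown as 'X'. Incomplete trailing codons are ignored.
    """
    seq = nucleotides.upper()
    frames = []
    for offset in range(3):
        codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
        frames.append("".join(CODON_TABLE.get(codon, "X") for codon in codons))
    return frames


if __name__ == "__main__":
    for frame, amino_acids in enumerate(translate_three_frames("ATGTAATAGTGA")):
        print(frame, amino_acids)
```

Each of the three amino-acid strings can then be searched like any other text, which is why a single nucleotide query yields both amino acid and nucleotide matches.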

A reality check and disclaimer.
The raw probability of finding a specific 25 nucleotide sequence at a given position, assuming four equally likely bases, is 1 in 4^25, or about 1 in 1.126e15 (1 in 1,125,899,906,842,624). Remember that the higher the statistical significance, the higher the risk of false positives. The Biocomp programs use the same CPU power and time to report that a 56 nucleotide query does not exist as they do to report 100 exact and partial matches with the exact chromosome locations. Biocomp reports the first 600 matches along with a full count of all matches. This limit can be extended by special arrangement. The limit exists because short queries have returned over 100,000 matches, clogging our reporting system.
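
The raw odds quoted above are simple arithmetic and can be checked directly. The sketch below assumes four equally likely, independent bases, which real genomes do not obey. The function names and the round figure of 3,000,000,000 genome positions (taken from the 3 GB size) are illustrative assumptions.

```python
def raw_match_odds(query_length, alphabet_size=4):
    """Number of distinct sequences of this length: the '1 in N' odds."""
    return alphabet_size ** query_length


def expected_chance_hits(genome_positions, query_length):
    """Expected number of purely random exact hits under the uniform model."""
    return genome_positions / raw_match_odds(query_length)


if __name__ == "__main__":
    odds = raw_match_odds(25)
    print(f"1 in {odds:,}")          # exact integer value of 4^25
    print(f"about 1 in {odds:.3e}")  # scientific notation
    # Assumed round genome size of 3 billion positions.
    print(f"{expected_chance_hits(3_000_000_000, 25):.2e} chance hits expected")
```

Under this model a 25 nucleotide query is not expected to occur by chance even once in a 3 GB genome, so the many matches seen for short queries reflect repeats in the genome rather than random coincidence.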
Biocomputational makes no claims about, and offers no support for, the medical diagnosis or genomic meaning of a match. We are not here to help a researcher prove a hypothesis. This is the job of the researcher. We do not provide the statistical significance, inference, interpretation, assumptions, probability, or importance of any match. In fact, we do not utilize statistics or AI; we just search. Our software cannot find the cure for cancer. Using this software service will provide verifiable, transparent, fact-based scientific data, providing certainty for serious research. Researchers and graduate students will save the time [days, weeks, months] otherwise spent analyzing huge volumes of confusing and inaccurate results. Most importantly, it will not misdirect meaningful research.
We will not limit the use of these results, nor will we claim ownership of the sequences. We are not responsible when the same query is searched by several researchers, for example as a class assignment, which means the first to query has no ownership.
Your name, email address, and query sequence will not be shared.

Charles DeFilippo has been a programmer since Jan. 1965 and is a professor and expert search engineer turned bioinformatician. After surviving stage 4 throat cancer, Charles volunteered to help a cancer researcher with genomic questions that NCBI BLAST failed to answer accurately. BLAST went from a 5% to 10% error rate to up to 50% false positives in various parts of nucleotide searches, and it misses up to 90% of valid amino-acid matches by not searching the entire genome or the terminator (*) codon sets TAA, TAG, TGA.

Charles has since followed research on lung cancer, his next cancer.
Ledford, Heidi. “Lists of cancer mutations awash with false positives.” Nature, 2013. We discovered that only 11 of the 450 genes connected to lung cancer were valid, a 98% false-positive rate, when researched by 600 postgraduate fellows worldwide.
What are the costs of misdirected cancer research in time, funding, reputations and lives?

Biocomputational searches the chromosomes found at this address:
https://ftp.ncbi.nlm.nih.gov/genomes/genbank/vertebrate_mammalian/Homo_sapiens/latest_assembly_versions/GCA_000001405.28_GRCh38.p13/GCA_000001405.28_GRCh38.p13_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/
GRCh38.p13_assembly
chr01.fna chr02.fna chr03.fna chr04.fna chr05.fna chr06.fna chr07.fna chr08.fna
chr09.fna chr10.fna chr11.fna chr12.fna chr13.fna chr14.fna chr15.fna chr16.fna
chr17.fna chr18.fna chr19.fna chr20.fna chr21.fna chr22.fna chrxx.fna chryy.fna

Only Biocomputational offers an exhaustive genome search with 100% accuracy and the following competencies.
These services will be offered in order of user demand and enlistment of localized Result verification and Enhancement Assistants.
• Search for exact and partial nucleotide matches (25 to 56 nucleotides for now, and eventually hundreds).
• Training of local Result verification and Enhancement Assistants (REA).
• Partial nucleotide searches, with allowed substitution errors E = (L/W) - 1, where L is the query length, W is the word size of 11, and the division is rounded down (a 56 nucleotide query allows 4 errors).
• Searches of the entire HGP (3 GB), with results for the 3 reading-frame amino acids and all nucleotide mutations (for biomarker discovery).
• Any genome, mice to rice, based on demand, with all of the above services.
Result Customization Assistants will provide a variety of extra research services.
     • A genome – match in context viewer.
     • Reliance and frequency of rarest 11 nucleotide sub-sequences.
     • Brute force match verification.
     • 51% or more contiguous matches.
     • Match sequence counts only.
• Find all biomarkers that meet their consensus sequence coding.
• New candidate genes by searching both start (M) and terminating codons (*).
     The biomarkers and new genes include ones that overlap so none are missed.
The following are not ready because user feedback and suggestions are needed.
• Next-gen short sequence sorting with repeat count.
• Next-gen of sorted sequences with mixed transcriptome or referenced search.
• Search chemical formula or empirical formula in textual documents.
• Micro-array data analysis to find sequences of sample differences (Tumors - Normal).
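
The substitution error limit in the list above, E = (L/W) - 1 with a word size of 11, can be worked through as follows. This sketch assumes the division is rounded down, which is what makes a 56 nucleotide query come out at 4 errors. The function name and the clamp to zero for queries shorter than two words are illustrative assumptions. One reading of the formula, not stated in the list, is the pigeonhole argument: a query cut into L/W whole words that contains fewer substitutions than words must still hold at least one untouched word.

```python
def max_substitution_errors(query_length, word_size=11):
    """Allowed substitutions E = floor(L / W) - 1, never below zero.

    The clamp to zero is an assumption of this sketch for queries
    shorter than two whole words.
    """
    whole_words = query_length // word_size
    return max(whole_words - 1, 0)


if __name__ == "__main__":
    # Current query range: 25 to 56 nucleotides.
    for length in (25, 33, 44, 55, 56):
        print(length, "nucleotides ->", max_substitution_errors(length), "errors")
```

For the current query range this gives 1 allowed error at 25 nucleotides and 4 at 56.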

Our mission of precision genomics is not just a catchy phrase; it is intended to advance science, and maybe save some lives.
The saving lives part is the researcher's job. Our job is to provide the best research tools possible.

• BLAST, O'Reilly Books - An Essential Guide to the Basic Local Alignment Search Tool, By Korf, Yandell, and Bedell, July 2003.
“Chapter 4. Sequence Similarity - Karlin-Altschul statistics”
In 1990, Samuel Karlin and Stephen Altschul published a theory of local alignment statistics.
“Karlin-Altschul statistics make five central assumptions:
• A positive score must be possible.
• The expected score must be negative.
• The letters of the sequences are independent and identically distributed (IID).
• The sequences are infinitely long.
• Alignments don’t contain gaps.
The first two assumptions are true for any scoring matrix estimated from real data. The last three assumptions are problematic (wrong) because biological sequences have context dependencies, aren’t infinitely long, and are frequently aligned with gaps. Both alignment and sequence similarity assume independence, and this is a necessary convenience. ”
Karlin-Altschul statistics use the shortcut integral calculus needed by the Poisson equation and distribution.
“Chapter 5. The Five BLAST Programs” There are now many more.
“Chapter 8. 20 Tips to Improve Your BLAST Searches”
“8.1 Don't use the Default Parameters”
“8.2 Treat BLAST Searches as Scientific Experiments”
Use various versions of BLAST and manipulate parameters until you get what you want.
“8.3 Perform Controls, Especially in the Twilight Zone”
The Twilight Zone sequences are weak similarity matches that score at or below a Z-score percentile of 25 (bit-scores have no accurate conversion to Z-scores). Including them gives you more matches, so you do not miss anything important.
“8.7 Know When to Use Complexity Filters”
“8.8 Mask Repeats in Genomic DNA ”
Repeats are replaced (hard-masked) with N's for nucleotide sequences and X's for amino-acid sequences. See RepeatMasker.com.
“8.20 How to Lie with BLAST Statistics”
• BLAST does not search the actual genome data. Instead, it builds a hash table index database and searches that.
It is prone to hash collisions, and to remedies for collisions such as a Bloom filter, which may be responsible for all the false positives.

    Charles DeFilippo Academics
Jan. 1965 Learned to program in machine language and FORTRAN on one of the first IBM 360 mainframes, and studied Chemistry, at the University of Florida.
1968 Chemistry at SUNY Farmingdale, NY.
1970 Psychology at University of South Florida.
1971 MA Teaching Emotionally Disturbed Children at University of South Florida.
1971 Certified to teach emotionally disturbed children and science up to the junior college level, in Georgia, and Florida.
1972 taught for the Southwest Georgia Program for Exceptional Children, the one teacher for two counties.
Charles went on to teach public elementary, middle school, and high schools in Florida.
He also taught at a Summerhill model school, a public alternative school, psych wards, and juvenile detention centers.
2004 PhD Candidate in Educational Psychology at UNM, which required five 600 level statistics classes. Completed classes in analysis of variance (ANOVA), multiple regression, meta-study statistics, AI [5 types of AI], and web-site HTML. The candidacy ended with a second minor stroke and treatment for stage 4 throat cancer.
1975-1983 Studied COBOL, BASIC for Apple, IBM PC, and HP instrumentation, LOGO(for children), PILOT(for teachers).
1983 studied Pascal and Modula-2 with Niklaus Wirth, the inventor of both.
2002 Trained and certified in Java, C shell, and Solaris UNIX by the SUN Academic Initiative for university instructors.
1981 Supervisor of UNM Algebra Tutorial Program 15 tutors serving 800 students per month.
1981 – 2015 Professor at:
UNM main campus and Kirtland AFB,
Albuquerque College,
UNM Los Lunas campus and Los Lunas Medium Security Correctional Prison,
College of Santa Fe Albuquerque campus, Los Lunas Medium Security Correctional Prison, and Kirtland AFB,
National College (computer science department head),
SIPI Native American College for UNM.
Taught 3 levels of College Algebra (4000 students), Statistics, Database Design (SQL), Paradox, and Oracle SQL plus, spreadsheets, computer programming (2000 students) BASIC, COBOL, Pascal, Java, JavaScript, C, UNIX and Solaris.

   Charles DeFilippo science and programming:
1969 Employed as an assistant scientist chemist by American Cyanamid in Stamford, Connecticut.
1987 Spectra Research Contractor for Sandia and Los Alamos National Laboratories.
Studied AI neural networking and fuzzy logic at Spectra for use with Full Text Retrieval.
Evaluated version 1 of Microsoft Project for Sandia Labs to note real and potential problems.
Trained users and documented the Costpro estimating system (Nuclear power plant) for Los Alamos.
Safety analysis of the (rail based) MX-Missile, warhead, explosions, and propellant leaks for the US Air Force and Sandia Labs. Charles modeled the leak dispersion, when it proved to be the most dangerous problem.
Sandia expanded the MX-Missile propellant study to be a hazmat material dispersion modeling tool, assisted by a Fermi Laboratories FORTRAN algorithm that Charles adapted and converted to C.
Designed and populated a Trident Missile Warhead chemical interactions database for DOD and DOE with Sandia Labs.
Corrected and made functional a DOT hazmat physical and chemical property database.
Designed a research database for the UNM Hospital to collect trauma treatments and patient outcomes for analysis. The database needed more than 300 fields (255 fields being the normal maximum), covered 2 types of trauma (physical trauma such as stab, gunshot, or car crash, and burn trauma) in a single database, and offered all possible treatments (for instance, all antibiotics) in drop-down menus for consistent entries. The database is now used by several trauma centers.
1990 With expertise in database design and full text retrieval Charles asked the National Library of Medicine [in person] if it needed help with searching the existing genome data and was told the problem had been solved with BLAST.
Presented text retrieval solutions 2 times at the National Library of Medicine, the Pentagon and the Library of Congress 3 times.
Charles wrote an accounting report (over 800 pages, quarterly) indexing and searching program for auditing a small city.
Charles also wrote commercial applications in medical office management, dental (200,000 patient capability), chiropractor, and a video rental store.

TESTING - Comparing Biocomputational to BLAST results.
The test set was 20,000 randomly selected 25 nucleotide sequences (micro-array size). The tests used a subset of 135 sequences that had an exact match count of 10 to 20. Both brute force searches and the latest version of Biocomputational were used to verify the true counts. The majority of the 135 test sequences produced 30% to 50% false positives in BLAST (after 2010). For the worst sequence, CGAAATGCCAGCTGAGGCACATGCC, BLAST reported 38 matches when only 9 exist, even though the BLAST report shows only 14 matches with a perfect bit-score of 50.1. Only 1 of the 135 test sequences, CTGGGTGTGGTGACGGGTGTCTGTA, reported 10 matches where 10 existed, all 10 with a perfect bit-score of 50.1. BLAST over-reported (false positives) on every other sequence, which made identifying missed sequence characters impossible.
https://blast.ncbi.nlm.nih.gov/Blast.cgi, Human, Genome (GRCh38.p7, reference assembly top-level). Try out BLAST.
Columns: |#1| Reference Number |#2| Search Query Sequence |#3| Biocomp Count 38.p2 |#4| BLAST Count 38.p2 |#5| Biocomp Count 38.p7 |#6| BLAST Count 38.p7 |#7| BLAST Result Lines and Bytes of X queries

| #1  | #2                        | #3 | #4 | #5 | #6 | #7                    |
| X 1 | GCTTCCCAAAGTGCTGGGACTGACT | 13 | 23 | 11 | 23 | 7M lines, 200MB       |
| X 2 | GGATTACAGCCGTGAGCCACCACAC | 11 | 29 | 11 | 29 | 5.79M lines, 172MB    |
| X 3 | TTGAGACGAGTCTTGCTCTGTTGCC | 12 | 25 | 12 | 25 | 2M lines, 60MB        |
| X 4 | GCCTCAATCTTCTGGGCTCAAGTGA | 12 | 22 | 18 | 32 | .9M lines, 28MB       |
|   5 | CTGGGTGTGGTGACGGGTGTCTGTA | 10 | 10 | 10 | 10 |                       |
|   6 | TGGGTTCTGTGCCCACACTCTAGAT | 12 | 20 | 11 | 20 |                       |
|   7 | CGAAATGCCAGCTGAGGCACATGCC |  9 | 38 |  9 | 38 |                       |
|   8 | ACCAACATGGAGAAATCTCGTCTCC | 11 | 27 | 12 | 26 |                       |
|   9 | TTGGCACCAGGGACTAGTTTTGTGG | 16 | 27 | 16 | 27 | 70,000 lines, 2MB     |

    Nucleotide Sequence to Amino-Acids - finds Nucleotide Mutations
|  10 | GGAGTTTCACTGTTGTTGCCCAGGCGT | 1 | 1 | GVSLLLPRR Amino-Acids 100 |                                |
| X11 | TCTCCTGCCTCAGCCTCCCCGGTAGTT | 2 | 2 | SPASASPVV Amino-Acids 50  | BLAST does not fully download. |
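
The brute force verification mentioned in the testing notes can be sketched as a plain sliding-window comparison. This is only an illustration of the idea, not the Biocomp program: the function name find_matches is hypothetical, the genome string is a toy example rather than real chromosome data, and only substitutions are counted (no insertions or deletions).

```python
def find_matches(genome, query, max_mismatches=0):
    """Return every start position where query fits with at most
    max_mismatches substitutions. Every window is examined, so no
    position is skipped and nothing is estimated."""
    if not query:
        raise ValueError("query must not be empty")
    hits = []
    query_length = len(query)
    for start in range(len(genome) - query_length + 1):
        mismatches = 0
        for genome_base, query_base in zip(genome[start:start + query_length], query):
            if genome_base != query_base:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many substitutions, abandon this window
        else:
            hits.append(start)  # loop finished without exceeding the limit
    return hits


if __name__ == "__main__":
    toy_genome = "ACGTACGTTACGA"  # toy data, not from any real assembly
    print(find_matches(toy_genome, "ACGT"))                    # exact matches
    print(find_matches(toy_genome, "ACGT", max_mismatches=1))  # partial matches
```

A check like this is slow on a full genome, but its count is easy to verify by hand, which makes it a useful reference when two tools disagree.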

   Aspiring Bioinformaticians Start Here. Get a compiled, non-object-oriented [non-OOP] language like C, with a static type system that prevents many unintended operations. C is used in supercomputing (the US government uses C for mission-critical programs), and BLAST was more accurate in C. Python, which is written in C, is object-oriented and interpreted.
Exact String Matching Algorithms, code in C, animation in Java, Thierry Lecroq http://www-igm.univ-mlv.fr/~lecroq/string/
SEE string_searching_algorithm - Wikipedia
SEE Boyer-Moore, Boyer-Moore-Horspool, and Commentz-Walter
Google "NCBI C++ Toolkit - NIH" and/or the book to find the 11,000 classes used to program BLAST (formerly in C).
• A 45GB pseudo chromosome for bioinformaticians to test/debug any sequence-alignment method. We used it for debugging.
Biocomputational will provide classes in informatics for beginners (start where you need, Algebra, statistics, programming).

Biocomputational will not discriminate on the basis of race, color, religion, sex, national origin, disability, or age.
We will only service the USA and Canada because of difficulties with billing and currencies, but we wish to service scientists everywhere.
Biocomp will do a second, replacement search in the case of an error in processing or delivery, or if the user makes an error in entering a query.
Pricing is not yet established but will probably range between $35 and $60 for nucleotide searches.