Please excuse this rough draft. It's a work in progress.
Only Biocomputational (Biocomp) searches the entire 3 GB Human Genome Project (HGP) sequence with 100% accuracy. The NIH claims "Exhaustive searching of all the data is currently considered ridiculously time consuming, if not impossible." Biocomp exhaustively examines every nucleotide and/or amino acid 11 times. Only Biocomputational reports every exact and partial match without misses or false positives [within the 600 match limit]. Only Biocomputational has the computational power to search the entire genome for amino acids in all 3 frame shifts, reporting both amino acid and nucleotide matches [perfect for biomarker discovery]. Other alignment systems search about 11% of the genome, the portion approved by science by committee. Only Biocomp allows you to search for the amino acid terminator (*) codon set TAA, TAG, TGA. We will never substitute the magical "Optimal Alignment" scoring for a perfect mathematical alignment match. The repeat-masking tool RepeatMasker handles repeats badly, masking up to 50% of the genome.
Biocomp can search any genome, from mice to rice, based on user demand.
Biocomputational is now accepting query searches of 25 to 56 nucleotides through resellers who can also do result verification and enhancement. Result verification and Enhancement Assistants (REA) are available for a negotiated fee.
Searches that translate nucleotides into the 3 frame shift amino acids, reporting both amino acid matches and nucleotide matches [perfect for biomarker discovery], will require a negotiated fee.
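As an illustration of what translating a nucleotide query in the 3 frame shifts means, here is a minimal sketch. It assumes the standard genetic code, with the terminator codons TAA, TAG, and TGA shown as `*`. The function names `translate` and `three_frames` are hypothetical and are not taken from the Biocomp software.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order (TTT, TTC, TTA, TTG, TCT, ...); '*' marks a terminator.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq, frame=0):
    """Translate seq starting at offset frame (0, 1, or 2); unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(frame, len(seq) - 2, 3)
    )


def three_frames(seq):
    """Return the translations of seq in all 3 forward frame shifts."""
    return [translate(seq, frame) for frame in range(3)]


if __name__ == "__main__":
    for frame, protein in enumerate(three_frames("GGAGTTTCACTGTTGTTGCCCAGGCGT")):
        print(frame, protein)
```

Frame 0 of the 27 nucleotide example gives GVSLLLPRR. Only the forward strand is translated here; handling the reverse complement would be an extra step.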
A reality check and disclaimer.
The raw probability of finding a specific 25 nucleotide sequence at a given position, assuming four equally likely nucleotides, is 1 in 4^25, which is about 1.126e15, or 1 in 1,126,000,000,000,000.
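The arithmetic behind that figure can be checked with a short sketch. It assumes four equally likely, independent nucleotides and a genome of roughly 3 billion positions. Real genomes are not random, so the expected hit count below is only a baseline, not a prediction. The names `possible_sequences` and `expected_random_hits` are hypothetical.

```python
GENOME_SIZE = 3_000_000_000  # assumed round figure for a 3 GB genome


def possible_sequences(length):
    """Number of distinct nucleotide sequences of the given length (4 choices per position)."""
    return 4 ** length


def expected_random_hits(length, genome_size=GENOME_SIZE):
    """Expected exact matches of one fixed query in a uniformly random genome."""
    positions = genome_size - length + 1
    return positions / possible_sequences(length)


if __name__ == "__main__":
    print(possible_sequences(25))             # 1125899906842624
    print(f"{expected_random_hits(25):.3e}")  # a tiny fraction of one match
```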
Remember that the higher the statistical significance, the higher the risk of false positives. The Biocomp programs use the same CPU power and time to report that a 56 nucleotide query does not exist as they do to report 100 exact and partial matches with the exact chromosome locations. Biocomp reports the first 600 matches with a full count of all matches. This limit can be extended by special arrangement. The limit exists because some short queries produced over 100,000 matches, clogging our reporting system.
Biocomputational makes no claims about, and offers no support for, the medical diagnosis or genomic meaning of a match. We are not here to help a researcher prove a hypothesis. That is the job of the researcher.
We do not provide the statistical significance, inference, interpretation, assumptions, probability, or importance of any match. In fact we do not utilize statistics or AI, we just search. Our software cannot find the cure for cancer. Using this software service will provide verifiable, transparent, fact-based scientific results, providing certainty for serious research.
Researchers and graduate students will save the time [days, weeks, months] otherwise spent analyzing huge volumes of confusing and inaccurate results. Most importantly, it will not misdirect meaningful research.
We will not limit the use of these results, nor will we claim ownership of the sequences. We are not responsible for the same queries being searched by several researchers, for example in a class assignment, which means the first to query has no ownership.
Your name, email, and query sequence will not be shared.
Charles DeFilippo has been a programmer since Jan. 1965 and is a professor and expert search engineer turned bioinformatician. After surviving stage 4 throat cancer, Charles volunteered to help a cancer researcher with genomic questions that NCBI BLAST failed to answer accurately. BLAST went from a 5% to 10% error rate to up to 50% false positives in various parts of nucleotide searches, and it missed up to 90% of valid amino-acid matches by not searching the entire genome or the terminator (*) codon set TAA, TAG, TGA.
Charles has followed lung cancer research, his next cancer.
Ledford, Heidi, “Lists of cancer mutations awash with false positives.” NATURE, 2013. We discovered that only 11 of the 450 genes connected to lung cancer were valid, a 98% false positive rate, when researched by 600 postgraduate fellows worldwide.
What are the costs of misdirected cancer research in time, funding, reputations, and lives?
Biocomputational searches the chromosomes found at this address:
https://ftp.ncbi.nlm.nih.gov/genomes/genbank/vertebrate_mammalian/Homo_sapiens/latest_assembly_versions/GCA_000001405.28_GRCh38.p13/GCA_000001405.28_GRCh38.p13_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/
GRCh38.p13_assembly
chr01.fna chr02.fna chr03.fna chr04.fna chr05.fna chr06.fna
chr07.fna chr08.fna chr09.fna chr10.fna chr11.fna chr12.fna
chr13.fna chr14.fna chr15.fna chr16.fna chr17.fna chr18.fna
chr19.fna chr20.fna chr21.fna chr22.fna chrxx.fna chryy.fna
Only Biocomputational offers an exhaustive genome search with 100% accuracy, with the following competencies.
These services will be offered in order of user demand and enlistment of localized Result verification and Enhancement Assistants.
• Search exact and partial nucleotide matches. (25 to 56 nucleotides for now, and eventually hundreds)
• Training of local Result verification and Enhancement Assistants (REA).
• Partial nucleotide searches with substitution errors. Errors = (queryLength / WordSize) - 1 with a word size of 11 and the division rounded down, or E = (L/W) - 1 (a 56 nucleotide query allows 4 errors).
• Searches of the entire HGP (3 GB), with results for the 3 frame shift amino-acids and all nucleotide mutations (for biomarker discovery).
• Any genome, from mice to rice, based on demand, with all of the above services.
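The substitution error formula in the list above can be sketched as follows. The integer division is inferred from the stated example (56 nucleotides gives 4 errors). Splitting the query into non-overlapping 11 nucleotide words, so that at least one word stays error free, is one plausible reading of the formula and is an assumption, not a description of the Biocomp implementation. The function names are hypothetical.

```python
WORD_SIZE = 11  # word size stated in the text


def max_substitution_errors(query_length, word_size=WORD_SIZE):
    """E = (L // W) - 1, never below zero."""
    return max(query_length // word_size - 1, 0)


def split_into_words(query, word_size=WORD_SIZE):
    """Non-overlapping words; a leftover tail shorter than word_size is dropped."""
    return [
        query[i:i + word_size]
        for i in range(0, len(query) - word_size + 1, word_size)
    ]


if __name__ == "__main__":
    for length in (25, 44, 56):
        print(length, max_substitution_errors(length))
```

Under this reading, a 56 nucleotide query splits into 5 full words, so 4 substitutions cannot touch every word, and at least one exact 11 nucleotide word is left to anchor the match.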
Result Customization Assistants will provide a variety of extra research services.
• A genome – match in context viewer.
• Reliance and frequency of rarest 11 nucleotide sub-sequences.
• Brute force match verification.
• 51% or more contiguous matches.
• Match sequence counts only.
• Find all biomarkers that meet their consensus sequence coding.
• New candidate genes by searching both start (M) and terminating codons (*).
The biomarkers and new genes include ones that overlap so none are missed.
The following are not ready because user feedback and suggestions are needed.
• Next-gen short sequence sorting with repeat count.
• Next-gen of sorted sequences with mixed transcriptome or referenced search.
• Search chemical formula or empirical formula in textual documents.
• Micro-array data analysis to find sequences of sample differences, for example tumor versus normal.
Our mission of precision genomics is not just a catchy phrase. It is intended to advance science, and maybe save some lives.
The saving lives part is the researcher's job. Our job is to provide the best research tools possible.
• BLAST, O'Reilly Books - An Essential Guide to the Basic Local Alignment Search Tool, By Korf, Yandell, and Bedell, July 2003.
“Chapter 4. Sequence Similarity - Karlin-Altschul statistics”
In 1990, Samuel Karlin and Stephen Altschul published a theory of local alignment statistics.
“Karlin-Altschul statistics make five central assumptions:
• A positive score must be possible.
• The expected score must be negative.
• The letters of the sequences are independent and identically distributed (IID).
• The sequences are infinitely long.
• Alignments don’t contain gaps.
The
first two assumptions are true for any scoring matrix estimated from
real data. The last three assumptions are problematic (wrong) because
biological sequences have context dependencies, aren’t infinitely long,
and are frequently aligned with gaps. Both alignment and sequence
similarity assume independence, and this is a necessary convenience. ”
Karlin-Altschul uses short cut integral calculus needed by the Poisson equation, and distribution.
“Chapter 5. The Five BLAST Programs” There are now many more.
“Chapter 8. 20 Tips to Improve Your BLAST Searches”
“8.1 Don't use the Default Parameters”
“8.2 Treat BLAST Searches as Scientific Experiments”
Use various versions of BLAST and manipulate parameters until you get what you want.
“8.3 Perform Controls, Especially in the Twilight Zone”
The Twilight Zone sequences are weak similarity matches that score at or below a Z-score percentile of 25 (bit-scores to Z-scores have no accurate conversion), so you get more matches and do not miss anything important.
“8.7 Know When to Use Complexity Filters”
“8.8 Mask Repeats in Genomic DNA ”
Repeats are replaced (hard-masked) with N's for nucleotide sequences and X's for amino-acid sequences. See RepeatMasker.com.
“8.20 How to Lie with BLAST Statistics”
• BLAST does not search actual genome data. Instead it builds a hash table index database and searches that. It is prone to hash collisions, and the remedies for collisions, such as a Bloom filter, may be responsible for all the false positives.
Charles DeFilippo Academics
Jan. 1965 Learned to program in machine language and FORTRAN on one of the first IBM 360 mainframes, and studied Chemistry at the University of Florida.
1968 Chemistry at SUNY Farmingdale, NY.
1970 Psychology at University of South Florida.
1971 MA Teaching Emotionally Disturbed Children at University of South Florida.
1971 Certified to teach emotionally disturbed children and science up to the junior college level, in Georgia and Florida.
1972 Taught for the Southwest Georgia Program for Exceptional Children, as the one teacher for two counties.
Charles went on to teach in public elementary, middle, and high schools in Florida.
He also taught at a Summerhill model school, a public alternative school, psych wards, and juvenile detention centers.
2004 PhD Candidate in UNM Educational Psychology, which required five 600 level statistics classes, completing classes in analysis of variance (ANOVA), multiple regression, meta-study statistics, AI [5 types of AI], and web-site HTML, ending with a second minor stroke and treatment for stage 4 throat cancer.
1975-1983 Studied COBOL, BASIC for Apple, IBM PC, and HP instrumentation, LOGO (for children), PILOT (for teachers).
1983 Studied Pascal and Modula-2 with Niklaus Wirth, the inventor of both.
2002 Trained and certified in Java, C shell, and Solaris UNIX by the SUN Academic Initiative for university instructors.
1981 Supervisor of UNM Algebra Tutorial Program, with 15 tutors serving 800 students per month.
1981 – 2015 Professor at UNM main campus and Kirtland AFB, Albuquerque College, UNM Los Lunas campus and Los Lunas Medium Security Correctional Prison, College of Santa Fe Albuquerque campus and Los Lunas Medium Security Correctional Prison and Kirtland AFB, National College (computer science department head), and SIPI Native American College for UNM.
Taught 3 levels of College Algebra (4000 students), Statistics, Database Design (SQL), Paradox, Oracle SQL*Plus, spreadsheets, and computer programming (2000 students) in BASIC, COBOL, Pascal, Java, JavaScript, C, UNIX, and Solaris.
Charles DeFilippo science and programming:
1969 Employed as an assistant scientist chemist by American Cyanamid in Stamford, Connecticut.
1987 Spectra Research Contractor for Sandia and Los Alamos National Laboratories.
Studied AI neural networking and fuzzy logic at Spectra for use with Full Text Retrieval.
Evaluated version 1 of Microsoft Project for Sandia Labs to note real and potential problems.
Trained users and documented the Costpro estimating system (nuclear power plant) for Los Alamos.
Safety analysis of the (rail based) MX-Missile, warhead, explosions, and propellant leaks for the US Air Force and Sandia Labs. Charles modeled the leak dispersion when it proved to be the most dangerous problem.
Sandia expanded the MX-Missile propellant study into a hazmat material dispersion modeling tool, assisted by a Fermi Laboratories FORTRAN algorithm that Charles adapted and converted to C.
Designed and populated a Trident Missile Warhead chemical interactions database for DOD and DOE with Sandia Labs.
Corrected and made functional a DOT hazmat physical and chemical property database.
Designed a research database for the UNM Hospital to collect trauma treatments and patient outcomes for analysis. The database needed more than 300 fields (255 fields being the normal max), covered 2 types of trauma (physical trauma such as stab, gunshot, or car crash, and burn trauma) in a single database, and offered all possible treatments (for instance all antibiotics) in drop down menus for consistent entries. The database is now used by several trauma centers.
1990 With expertise in database design and full text retrieval, Charles asked the National Library of Medicine [in person] if it needed help with searching the existing genome data and was told the problem had been solved with BLAST.
Presented text retrieval solutions 2 times at the National Library of Medicine, the Pentagon and the Library of Congress 3 times.
Charles wrote an indexing and searching program for accounting reports (over 800 pages, quarterly), used for auditing a small city.
Charles also wrote commercial applications for medical office management, dental offices (200,000 patient capability), chiropractors, and a video rental store.
TESTING - Comparing Biocomputational to BLAST results.
The test set was 20,000 randomly selected 25 nucleotide sequences (micro-array size). From these, a subset of 135 sequences with an exact match count of 10 to 20 was used. Both brute force searches and the latest version of Biocomputational were used to verify the true counts. For the majority of the 135 test sequences, 30% to 50% of the matches BLAST reported were false positives (after 2010). For the worst sequence, CGAAATGCCAGCTGAGGCACATGCC, BLAST reported 38 matches when only 9 exist, even though the BLAST report shows only 14 matches with a perfect bit-score of 50.1. Only 1 of the 135 test sequences, CTGGGTGTGGTGACGGGTGTCTGTA, reported 10 matches where 10 existed, all 10 with a perfect bit-score of 50.1. BLAST over-reported (false positives) on every other sequence, making the identification of missed sequence characters impossible.
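A brute force verification of the kind mentioned above can be sketched in a few lines. This is an illustrative sketch only, not the code used for the test. It compares the query against every position of a sequence held in memory, counts substitutions only (no gaps), and looks at the forward strand only. The name `brute_force_matches` is hypothetical.

```python
def brute_force_matches(genome, query, max_mismatches=0):
    """Return (position, mismatch_count) for every window within max_mismatches of query."""
    hits = []
    m = len(query)
    for pos in range(len(genome) - m + 1):
        mismatches = 0
        for g, q in zip(genome[pos:pos + m], query):
            if g != q:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many substitutions, abandon this window
        else:
            # Loop finished without a break, so the window is a match.
            hits.append((pos, mismatches))
    return hits


if __name__ == "__main__":
    toy_genome = "ACGTACGTTACGA"
    print(brute_force_matches(toy_genome, "ACGT"))     # [(0, 0), (4, 0)]
    print(brute_force_matches(toy_genome, "ACGT", 1))  # [(0, 0), (4, 0), (9, 1)]
```

Because every position is examined, the run time is the same whether the query occurs zero times or many times, which makes this approach slow but easy to trust as a reference count.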
Try out BLAST: https://blast.ncbi.nlm.nih.gov/Blast.cgi, Human, Genome (GRCh38.p7, reference assembly top-level).
|Reference Number|Search Query Sequence|Biocomp Count 38.p2|BLAST Count 38.p2|Biocomp Count 38.p7|BLAST Count 38.p7|BLAST Result Lines and Bytes of X queries|
|X 1|GCTTCCCAAAGTGCTGGGACTGACT|13|23|11|23|7M lines, 200MB|
|X 2|GGATTACAGCCGTGAGCCACCACAC|11|29|11|29|5.79M lines, 172MB|
|X 3|TTGAGACGAGTCTTGCTCTGTTGCC|12|25|12|25|2M lines, 60MB|
|X 4|GCCTCAATCTTCTGGGCTCAAGTGA|12|22|18|32|0.9M lines, 28MB|
|5|CTGGGTGTGGTGACGGGTGTCTGTA|10|10|10|10||
|6|TGGGTTCTGTGCCCACACTCTAGAT|12|20|11|20||
|7|CGAAATGCCAGCTGAGGCACATGCC|9|38|9|38||
|8|ACCAACATGGAGAAATCTCGTCTCC|11|27|12|26||
|9|TTGGCACCAGGGACTAGTTTTGTGG|16|27|16|27|70,000 lines, 2MB|
Nucleotide Sequence to Amino-Acids - finds Nucleotide Mutations
|10|GGAGTTTCACTGTTGTTGCCCAGGCGT|1|1|GVSLLLPRR Amino-Acids 100|
|X11|TCTCCTGCCTCAGCCTCCCCGGTAGTT|2|2|SPASASPVV Amino-Acids 50|BLAST does not fully download.|
Aspiring Bioinformaticians Start Here. Get a compiled, non object oriented [OOP] language like C, with a static type system that prevents many unintended operations. C is used in supercomputing (the US government uses C for mission critical programs), and BLAST was more accurate in C. Python, which is written in C, is object-oriented and interpreted.
Exact String Matching Algorithms, code in C, animation in Java, Thierry Lecroq http://www-igm.univ-mlv.fr/~lecroq/string/
SEE string_searching_algorithm - Wikipedia
SEE Boyer-Moore and Boyer-Moore-Horspool Commentz-Walter
Google "NCBI C++ Toolkit - NIH" and/or the book to find the 11,000 classes used to program BLAST (formerly in C).
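For readers following the string matching references above, here is a minimal sketch of the Boyer-Moore-Horspool idea: compare a window against the pattern, then skip ahead by a distance looked up from the last character of the window. It is written in Python for brevity. Lecroq's site gives C versions of this and related algorithms. The function name `horspool_search` is hypothetical.

```python
def horspool_search(text, pattern):
    """Return the start positions of all exact occurrences of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Shift table: distance from a character's last occurrence (excluding the
    # final pattern position) to the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    hits = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        # Characters absent from the table allow a full jump of m positions.
        pos += shift.get(text[pos + m - 1], m)
    return hits


if __name__ == "__main__":
    print(horspool_search("ACGTACGTTACGA", "ACGT"))  # [0, 4]
```

With only four nucleotide letters the skips are shorter than in ordinary text, so the gain over a plain scan is smaller for DNA than for English.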
• A 45GB pseudo chromosome for bioinformaticians to test/debug any sequence-alignment method. We used it for debugging.
Biocomputational will provide classes in informatics for beginners (start where you need, Algebra, statistics, programming).
Biocomputational will not discriminate on the basis of race, color, religion, sex, national origin, disability, or age.
We will only service the USA and Canada because of difficulties in billing and currencies, but we wish to service scientists everywhere.
Biocomp will do a second, replacement search in the case of an error in processing or delivery, or if the user makes an error in entering a query.
Pricing is not yet established, but will probably range between $35 and $60 for nucleotide searches.