Hi Friends,

Even as I launch this today ( my 80th Birthday ) , I realize that there is yet so much to say and do. There is just no time to look back, no time to wonder, "Will anyone read these pages?"

With regards,
Hemen Parekh
27 June 2013

Now as I approach my 90th birthday ( 27 June 2023 ) , I invite you to visit my Digital Avatar ( www.hemenparekh.ai ) – and continue chatting with me , even when I am no longer here physically

Thursday 3 November 2016

Artificial Resume Deciphering Intelligent Software ( ARDIS )


Artificial  Resume  Generating   Intelligent  Software   (  ARGIS  )

-------------------------------------------------------------------------------------------------------

Note : Most of what I envisaged in my following notes , 20 years ago , must have materialized by now !

          But , it did lead to the launch of www.3pJobs.com on 14 Nov 1997 !



Date written  :   01  Dec  1996

Uploaded      :    03  Nov  2016

--------------------------------------------------------------------------------------------------------------------------------

What are these software packages ? What will they do ? How will they help us ? How will they help our clients / candidates ?

ARDIS :

This software will break up / dissect a Resume into its different constituents such as ,

#   Physical information ( data ) about a candidate ( Executive )

#   Academic information about a candidate

#   Employment Record ( Industry / Function / Products / Services - wise )

#   Salary

#   Achievements / Contributions

#   Attitudes  / Attributes  /  Skills  /  Knowledge

#   His preferences with respect to Industry / Function / Location



In fact , if every candidate were to fill in our EDS ( Executive Data Sheet ) , the info would automatically fall into " proper " slots / fields , since our EDS forces a candidate to " dissect " himself into various compartments
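
( Looking at this from today : a minimal sketch , in Python , of what those " proper slots / fields " amount to . The field names are my illustrative assumptions , not the actual EDS layout )

```python
from dataclasses import dataclass, field

# A purely illustrative sketch of the EDS "compartments" described above;
# the field names are assumptions, not the actual EDS layout.
@dataclass
class ExecutiveDataSheet:
    pen: str = ""                                           # Permanent Executive Number
    physical_data: dict = field(default_factory=dict)       # age, location, etc.
    academics: list = field(default_factory=list)
    employment_record: list = field(default_factory=list)   # industry / function / products
    salary: str = ""
    achievements: list = field(default_factory=list)
    skills: list = field(default_factory=list)               # attitudes / attributes / knowledge too
    preferences: dict = field(default_factory=dict)          # industry / function / location
```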

But ,

Getting every applicant / executive to fill in our standardized EDS is next to impossible - and it may not even be necessary


Executives ( who have already spent a lot of time and energy preparing / typing their bio-data ) are most reluctant to sit down once more and spend a lot of time again , furnishing us the SAME information / data in the neatly arranged blocks of our EDS . For them , this duplication is a WASTE of TIME !


EDS is designed for our ( information handling / processing / retrieving ) convenience , and that is the way he perceives it ! Even if he is vaguely conscious that this ( filling in of EDS ) would help him in the long run , he does NOT see any IMMEDIATE BENEFIT from filling it in - hence , he is reluctant to do so

We , too , have a problem - a " COST / TIME / EFFORT " problem


If we are receiving 100 bio-data each day ( this should happen soon ) , to whom should we send our EDS , and to whom NOT ?


This can be decided only by a SENIOR executive / consultant , who goes through each and every bio data , DAILY , and reaches a conclusion as to ,

*   which resumes are of " interest " & call for sending an EDS

*   which resumes are " marginal " or not of immediate interest , where we need not spend the time / money / energy of sending an EDS

We may not be able to employ a number of Senior / Competent Consultants who can scrutinize all incoming bio-data and take this decision on a DAILY basis ! That , in itself , would be a costly proposition

So ,

On the one hand  >  we have the time / cost / energy / effort of sending an EDS to everyone ,

On the other hand  >  we have the time / cost of several Senior Consultants to separate out the " chaff " from the " wheat "


NEITHER  IS   DESIRABLE  !

But ,

from each bio data received daily , we still need to DECIPHER , and drop into relevant slots / fields , RELEVANT DATA / INFORMATION , which would enable us to ,

#   Match a candidate's profile with " Client Requirement Profile " against specific requests

#   Match a candidate's profile against " Specific Vacancies " that any Corporation ( client or not ) , may post on
     our VACANCY  BULLETIN   BOARD  ( un-advertized vacancies )

#   Match a candidate's profile against " Most Likely Companies who are likely to hire / need such an executive " ,
     using our CORPORATE  DATABASE , which will contain info such as , PRODUCTS / SERVICES of each and every
     Company


#   Convert each bio data received into a RE-CONSTITUTED BIO DATA ( Converted Bio data ) , to enable us to send it out to any client / non-client organization , at the click of a mouse


#   Generate ( for commercial / profitable exploitation ) , such by-product services as ,

     *  Compensation Trends

     *  Organization Charts

     *  Job Descriptions....etc

#    Permit a candidate to log into our DATABASE and remotely modify / alter his bio data


#    Permit a client ( or a non-client ) , to log into our DATABASE and remotely conduct a SEARCH


ARDIS  is required on the assumption that , for a long time to come , " TYPED BIO DATA " would form a major source of our database


Other sources , such as ,
*   Duly filled in EDS ( hard copy )

*   EDS  on a floppy

*   Downloading EDS over the Internet ( or Dial-Up phone lines ) , and uploading after filling in ( like Intellimatch ) ,


will continue to play a minor role in the foreseeable future



HOW   WILL   ARDIS   WORK   ?


Step # 1 

Receive typed Bio Data



Step # 2

Scan bio data



Step # 3

Create BIT-MAP image



Step # 4

Using OCR , convert to ASCII ( using PageMaker )

Convert to English characters ( by comparison )



Step # 5

OWR / Optical Word Reader

Convert to English language WORDS , to create a Directory of Keywords ( using ISYS )

Compare with KEY-WORDS , stored in WORD DIRECTORY of " Most Frequently Used " WORDS in 3,500 converted bio-data ( ISYS analysis )
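
( A present-day sketch , in Python , of this keyword-directory step . ISYS was the actual tool in 1996 ; the folder name and the frequency cut-off here are my assumptions )

```python
import re
from collections import Counter
from pathlib import Path

def build_word_directory(folder, min_occurrences=10):
    """Build a directory of 'Most Frequently Used' words from a folder of
    converted bio-data text files (the role ISYS played in these notes)."""
    counts = Counter()
    for path in Path(folder).glob("*.txt"):
        text = path.read_text(encoding="utf-8", errors="ignore").lower()
        counts.update(re.findall(r"[a-z]+", text))
    # keep only words frequent enough to serve as KEY-WORDS
    return Counter({w: n for w, n in counts.items() if n >= min_occurrences})

# word_directory = build_word_directory("converted_biodata")
```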



Step # 6

OPR / Optical Phrase Reader

Pick out " Phrases " and create DIRECTORY of " Key Phrases "  ( ARDIS )

*  Detect " Pre-fixes " & " Suffixes " used with each KEY WORD that go to make up " Most Frequently Used
   PHRASES "

*  Calculate " Occurrence  Frequency "

*  Calculate " Probability " of each Occurrence

*  Create " Phrase Directories " for comparison



Step # 7

OSR / Optical Sentence Reader

Pick out " Sentences " & create , Directory of " KEY SENTENCES "

Most commonly used VERBS / ADVERBS / PREPOSITIONS  , with each " Key Phrase " to create Directory of KEY SENTENCES



TO   RECAPITULATE :


ARDIS will ,

*   Recognize " Characters "

*   Convert to " WORDS "

*   Compare with the 6,258 key words which we have found in 3,500 converted Bio Data ( using ISYS )

If a " Word " has not already appeared ( > 10 times ) in those 3,500 bio data , then its " chance " ( probability ) of occurring in the next bio data is very , very small indeed



But even then ,

ARDIS software will store in memory each " Occurrence " of each Word ( old or new / first time or a thousandth time ) ,

And ,

will continuously calculate its " Probability of Occurrence " as :


P  =  ( No of occurrences of the given word so far )  divided by  ( Total no of occurrences of ALL the words in the entire population so far )


So that ,

By the time we have SCANNED 10,000 bio data , we would have literally covered ALL the words that have even a small PROBABILITY of OCCURRENCE !

So , with each new bio data " scanned " , the " probability of occurrence " of each word is getting more and more accurate !
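
( A minimal streaming sketch , in Python , of this running calculation ; the class and method names are my own )

```python
from collections import Counter

class RunningWordProbability:
    """Streaming form of the formula above: after every scanned bio data,
    P(word) = occurrences of the word so far / occurrences of ALL words so far."""
    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def observe(self, words):
        self.counts.update(words)
        self.total += len(words)

    def probability(self, word):
        return self.counts[word] / self.total if self.total else 0.0
```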

The same logic will hold for ,

*  KEY  PHRASES

*  KEY  SENTENCES

The " Name of the Game " is : Probability of Occurrence

As someone once said :


If you allow 1000 monkeys to keep on hammering the keys of 1000 type-writers for 1000 years , you will , at the end , find that between them , they have " re-produced " the entire literary works of Shakespeare !


But today , if you store into a Super Computer ,

*   all the words appearing in the English language ( incl Verbs / Adverbs / Adjectives ..etc )

*   the " Logic " behind the construction of the English language ,

then ,

I am sure , the Super Computer could reproduce the entire works of Shakespeare , in 3 MONTHS !


And , as you would have noticed , ARDIS is a " SELF  LEARNING " type of software !

The more it reads ( scans ) , the more it learns ( memorizes words , phrases & even sentences )

Because of its SELF LEARNING / SELF CORRECTING / SELF IMPROVING capability , ARDIS gets better & better equipped to detect , in a scanned bio data ,


*   Spelling  Mistakes  (  wrong WORD )

*   Context  Mistakes  ( wrong Prefix or Suffix )

*   Preposition  Mistakes  ( wrong PHRASE )

*   Verb / Adverb  Mistakes ( wrong SENTENCE ),


With minor variations ,

-  ALL Thoughts , Words ( written ) , Speech ( spoken ) and Actions , keep on " repeating " again and again and again


It is this REPETITIVENESS of Words , Phrases , and Sentences in Resumes , that we plan to exploit


In fact ,

by examining & memorizing the several hundred ( or thousand ) " Sequences " in which the words appear , it should be possible to " Construct " the " Grammar " ie: the logic behind the sequences


I suppose , this is the manner in which the experts were able to unravel the " meaning " of hieroglyphic inscriptions on Egyptian tombs .

They learned a completely strange / obscure language by studying the " Repetitiveness " & " Sequential " occurrence of unknown characters

===============================================================

Added  on    11  JULY  2022 :

LaMDA: our breakthrough conversation technology


(  18  May  2021  )


Extract :

LaMDA’s conversational skills have been years in the making. Like many recent language models, including BERT and GPT-3, it’s built on Transformer, a neural network architecture that Google Research invented and open-sourced in 2017. That architecture produces a model that can be trained to read many words (a sentence or paragraph, for example), pay attention to how those words relate to one another and then predict what words it thinks will come next. 


But unlike most other language models, LaMDA was trained on dialogue. During its training, it picked up on several of the nuances that distinguish open-ended conversation from other forms of language. One of those nuances is sensibleness. Basically: Does the response to a given conversational context make sense? For instance, if someone says:


“I just started taking guitar lessons.”

You might expect another person to respond with something like: 

“How exciting! My mom has a vintage Martin that she loves to play.”


That response makes sense, given the initial statement. But sensibleness isn’t the only thing that makes a good response. After all, the phrase “that’s nice” is a sensible response to nearly any statement, much in the way “I don’t know” is a sensible response to most questions. Satisfying responses also tend to be specific, by relating clearly to the context of the conversation. In the example above, the response is sensible and specific.


LaMDA builds on earlier Google research, published in 2020, that showed Transformer-based language models trained on dialogue could learn to talk about virtually anything. Since then, we’ve also found that, once trained, LaMDA can be fine-tuned to significantly improve the sensibleness and specificity of its responses.



==============================================================



HOW  TO   BUILD  DIRECTORIES  OF  "  PHRASES  "  ?


From 6252 words , let us pick any word , say :  ACHIEVEMENT

Now we ask the software to scan the Directory containing the 3,500 converted Bio Data , with the instruction that every time the word " Achievement " is spotted , the software will immediately spot / record the " prefix " .

The software will record ALL the words that appeared before " Achievement " , as also the " Number of times " each of these prefixes appeared


Word  = ACHIEVEMENT


Prefix found................................. No of times found ( Occurrence ).......... Probability of Occurrence

--------------------------------------------------------------------------------------------------------------------------------
*   Major.................................... 10........................................ 10 / 55  =  0.182
*   Minor....................................  9.........................................  9 / 55  =  0.164
*   Significant..............................  8.........................................  8 / 55  =  0.145
*   Relevant.................................  7.........................................  7 / 55  =  0.127
*   True.....................................  6.........................................  6 / 55  =  0.109
*   Factual..................................  5.........................................  5 / 55  =  0.091
*   My.......................................  4.........................................  4 / 55  =  0.073
*   Typical..................................  3.........................................  3 / 55  =  0.055
*   Collective...............................  2.........................................  2 / 55  =  0.036
*   Approximate..............................  1.........................................  1 / 55  =  0.018
--------------------------------------------------------------------------------------------------------------------------------
    TOTAL NO OF OCCURRENCES.................. 55........................................ ( Total Probability )  1.000
--------------------------------------------------------------------------------------------------------------------------------
As more and more bio data are scanned ,

*   The Number of " Prefixes " will go on increasing

*   The Number of " Occurrences " of each prefix will also go on increasing

*   The overall " population size " will also go on increasing

*   The " Probability  of Occurrence "  of each prefix will go on getting more and more accurate ie; more and more
     representative

This process can go on and on and on ( as long as we keep on scanning bio data )

But " Accuracy Improvements " will decline / taper off , once a sufficiently large number of prefixes ( to the word , ACHIEVEMENT ), have been accumulated . Saturation will take place !


The whole process can be repeated with the WORDS that appear as " SUFFIXES " to the word " ACHIEVEMENT "


And the probability of occurrence of each " Suffix " can likewise be determined


Word = ACHIEVEMENT
--------------------------------------------------------------------------------------------------------------------------------

Suffix....................................... No of Times Found......................... Probability of Occurrence

--------------------------------------------------------------------------------------------------------------------------------
*   Attained................................. 20........................................ 20 / 54
*   Reached.................................. 15........................................ 15 / 54
*   Planned.................................. 10........................................ 10 / 54
*   Targeted.................................  5.........................................  5 / 54
*   Arrived..................................  3.........................................  3 / 54
*   Recorded.................................  1.........................................  1 / 54
--------------------------------------------------------------------------------------------------------------------------------
    TOTAL OF ALL OCCURRENCES................. 54 ( Population Size )..................... ( Total Probability )  1.000
--------------------------------------------------------------------------------------------------------------------------------

Having figured out the " Probabilities of Occurrence " of each of the prefixes and each of the suffixes ( to a given word - in this case , ACHIEVEMENT ) , we could next tackle the issue of " a given combination of prefix and suffix "


eg :

What is the probability of :

*   Prefix  =  " Major "  /   Word  =  ACHIEVEMENT   /  Suffix = " Attained "  ?
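
( If prefix and suffix were chosen independently of each other , the two tables above would give roughly ( 10 / 55 ) x ( 20 / 54 ) = 0.182 x 0.370 , ie: about 0.067 . A safer route - my assumption , not something spelt out in these notes - is to count the ( prefix , word , suffix ) triples directly , as in this Python sketch : )

```python
from collections import Counter

triples = Counter()    # (prefix, word, suffix) -> no of occurrences

def observe(prefix, word, suffix):
    triples[(prefix, word, suffix)] += 1

def triple_probability(prefix, word, suffix):
    """Joint probability of the combination, counted directly rather than
    by multiplying the two marginals (which assumes independence)."""
    total = sum(triples.values())
    return triples[(prefix, word, suffix)] / total if total else 0.0

# after scanning many bio data:
# triple_probability("major", "achievement", "attained")
```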

Why is all of this Statistical exercise required ?

If we wish to stop at merely " Deciphering " a resume , then I don't think we need to go through this

For mere " Deciphering " , all we need is to create a KNOWLEDGE  BASE of :

*   Skills

*   Knowledge

*   Attitudes

*   Attributes

*   Industries

*   Companies

*   Functions

*   Edu Qualifications

*   Products / Services

*   Names ...etc


Having created the " knowledge base " , simply scan a bio data , recognize " words " , compare with the words contained in the " knowledge base " , find CORRESPONDENCE / EQUIVALENCE , and allot / file each scanned word into respective " Fields " against each PEN ( Permanent Executive Number )


PRESTO  !


You have dissected and stored the MAN in appropriate boxes !

Our EDS has these " boxes " . The problem is manual data entry

The data entry operator ,

-  searches out the appropriate " word " from the appropriate " EDS Box " and transfers it to the appropriate screen


To eliminate this manual ( time consuming ) operation , we need ARDIS

We already have a DATA BASE of 6500 words

All we need to do is to write down against each word , whether it is a ,

*  Skill

*  Attribute

*  Knowledge

*  Edu

*  Product

*  Company

*  Location

*  Industry

*  Function   etc


The moment we do this , what was a mere " Data base " , becomes a " Knowledge Base " , ready to serve as a " COMPARATOR "

And as each NEW bio data is scanned , it will throw up words for which there is no " Clue "

Each such NEW word will have to be manually " Categorized " and added to the " Knowledge base "
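
( A minimal Python sketch of this " COMPARATOR " . The category names come from the lists above ; the sample entries themselves are invented for illustration )

```python
# Category names are from the lists above; the sample entries are invented.
KNOWLEDGE_BASE = {
    "fortran":     "Skill",
    "leadership":  "Attribute",
    "metallurgy":  "Knowledge",
    "b.e.":        "Edu",
    "compressors": "Product",
    "mumbai":      "Location",
}

def file_into_fields(scanned_words, pen):
    """Allot each recognized word to its field against the candidate's PEN;
    set aside unknown words for manual categorization, as noted above."""
    record = {"PEN": pen}
    unknown = []
    for w in scanned_words:
        category = KNOWLEDGE_BASE.get(w.lower())
        if category:
            record.setdefault(category, []).append(w)
        else:
            unknown.append(w)    # to be categorized by hand and added to the base
    return record, unknown
```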

Then what is the advantage of calculating for ,

*  each  WORD

*  each  SUFFIX

*  each  PREFIX

*   each  PHRASE

*  each  SENTENCE  ,

- its probability of occurrence ?



The  ADVANTAGES  are :

# 1

 Detect " unlikely " prefixes / suffixes

Suppose ARDIS detects " Manor Achievement "

ARDIS finds that the " probability " of ,

*  " Manor " as prefix to ACHIEVEMENT , is 0.00009 ( say , NIL )

hence , the CORRECT prefix has to be ,

*  " Major " ( and not " Manor " ) , for which , the probability is ( say ) ... 0.4056



# 2

ARDIS detects the words " Mr HANOVAR "

It recognizes this as a spelling mistake and corrects it automatically to " Mr HONAVAR "

OR,

it reads , place of birth as " KOLHAPURE "

It recognizes it as " KOLHAPUR " - or vice versa , if it says " My name is KOLHAPUR "



# 3

Today , while scanning ( using OCR ) , when a mistake is detected , it gets highlighted on the screen or an asterisk / underline starts blinking

This draws the attention of the operator , who manually corrects the " mistake " after consulting a dictionary or his own knowledge base

Once ARDIS has calculated the probabilities of lakhs of words and even the probabilities of their " Most likely sequence of occurrences " , then , hopefully the OCR can " self - correct " any word or phrase , without operator intervention


So the scanning accuracy of OCR should eventually become 100 % and not 75 % - 85 % as at present



# 4

Eventually , we want that ,

-  a bio data is scanned , and

-  it automatically re-constitutes itself into our converted BIO DATA FORMAT



This is the concept of ARGIS ( Artificial Resume Generating Intelligent Software )


Here again , the idea is to eliminate the manual data entry of the entire bio data - our Ultimate Goal

But ARGIS is not possible without first installing ARDIS ,

and that too with the calculation of the " Probability of Occurrence " as THE MAIN FEATURE of the software

By studying and memorizing and calculating the " Probability of Occurrence " of lakhs of words / phrases / sentences , ARDIS actually " learns " English grammar through " Frequency of Usage "


And it is this Knowledge Base which enables ARGIS to re-constitute a bio data ( in our format ) , in a GRAMMATICALLY CORRECT way
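
( Finally , a toy Python sketch of the ARGIS idea . The output template is invented for illustration ; the actual converted bio data format is not reproduced in these notes )

```python
# Once ARDIS has filed the words into fields, re-constitute them into a
# standard format. This template is invented; the real format is not shown here.
TEMPLATE = """PEN        : {PEN}
EDUCATION  : {Edu}
SKILLS     : {Skill}
KNOWLEDGE  : {Knowledge}
LOCATION   : {Location}"""

def reconstitute(record):
    filled = {k: ", ".join(v) if isinstance(v, list) else v
              for k, v in record.items()}
    for key in ("PEN", "Edu", "Skill", "Knowledge", "Location"):
        filled.setdefault(key, "______")    # visible gap for unfilled fields
    return TEMPLATE.format(**filled)
```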


 









 

